The Mystic H1B Decoded

Deep dive into H1B Visa Analysis.

What is H1B Visa?

The H-1B is a visa in the United States under the Immigration and Nationality Act, section 101(a)(15)(H) that allows U.S. employers to employ foreign workers in specialty occupations.
A specialty occupation requires the application of specialized knowledge and a bachelor’s degree or the equivalent of work experience.
The duration of stay is three years, extendable to six years, after which the visa holder may need to reapply.
Laws limit the number of H-1B visas that are issued each year: 206,002 new and initial H-1B visas were issued in 2022.
Employers must generally withhold Social Security and Medicare taxes from the wages paid to employees in H-1B status.

Quick facts about H1B Visa lottery:

The Immigration Act of 1990 established a limit of 65,000 foreign nationals who may be issued a visa each fiscal year.
An additional 20,000 H-1Bs are available to foreign nationals holding a master’s or higher degree from U.S. universities.
In addition, excluded from the ceiling are all H-1B non-immigrants who work at universities, non-profit research facilities associated with universities, and government research facilities.
Person in H-1B status must continue to be employed by their employer in order to stay in H-1B status. If the person’s employment ends for any reason, the person must leave the United States, unless the person is granted a change of status or finds another employer compatible with the H-1B status.
The United States Citizenship and Immigration Services allows grace period of up to 60 days to stay in the United States after the person’s end of employment.

Data Source

This dataset contains five year’s worth of H-1B petition data, with approximately 3 million records overall.
The columns in the dataset include case status, employer name, worksite coordinates, job title, prevailing wage, occupation code, and year filed.

Purpose of the Analysis

How the number of visa applications are growing over the years.
Companies offering H1B sponsorships.
Key roles that are being offered Sponsorships.
Salary distribution for the roles being offered.
How the Data related roles look like.
Acceptance and Rejection rate.
Which part of the United States records more applications.

Importing Necessary Libraries

#Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from os import path
from IPython import display

Reading CSV

df = pd.read_csv("../input/h-1b-visa/h1b_kaggle.csv")

Trim function

def trim(x):
    if x == 'TATA CONSULTANCY SERVICES LIMITED':
        return 'TCS'
    elif x == 'DELOITTE CONSULTING LLP':
        return 'DELOITTE'
    elif x == 'IBM INDIA PRIVATE LIMITED':
        return 'IBM LIMITED'
    elif x == 'COGNIZANT TECHNOLOGY SOLUTIONS U.S. CORPORATION':
        return 'COGNIZANT SOLUTIONS'
    else:
        return x
df['EMPLOYER_NAME'] = df['EMPLOYER_NAME'].apply(trim)

Visa applications over years

yoy_applications = df.groupby("YEAR").count()['CASE_STATUS'].values
year = [ i for i in range(2011,2017)]
plt.figure(figsize=(15,8))
sns.set_style('whitegrid')
sns.lineplot(x = year,y = yoy_applications,color='green')
plt.title("No of applications over years")
plt.plot();

Findings:

No of applications have almost tripled from 2011 to 2016.
There has been a steady or almost exponential increase in the number of applications over the years, While the number of slots available per year stayed constant at 65000 slots.
To put it in a probabilistic manner, as the number of applications increasing exponentially your probability of being picked up is reducing dramatically.

Chances of Picking

chance_of_picking = (65000/np.array(yoy_applications ))*100
plt.figure(figsize=(18,8))
sns.lineplot(x = year,y = chance_of_picking,color='red')
plt.title("Chance of Picking")
plt.ylabel("Picking Chance%")
plt.plot();

Findings:

If you have applied for a H1B Visa in 2011, Your chances of being picked is at approximately 18%.
Whereas 5 years down the line, your probability has dropped down to 10%.
However there can be external factors involved in deciding your chances(Such as applying more than one H1B applications per applicant) which increases your probability.
But to put it in the clean perspective as the number of applications are growing, your chance at the table is goind down.

Case status per year

col = ['green','red']
df1 = df[df['CASE_STATUS'].isin(['CERTIFIED','DENIED'])]
df2 = df1.groupby(['YEAR','CASE_STATUS'])['SOC_NAME'].count().reset_index()
plt.figure(figsize=(16,8))
sns.barplot(x = df2['YEAR'],y = df2['SOC_NAME'],hue=df2['CASE_STATUS'],palette=col)
plt.title("Case Status Per Year");

No of Full time & Part time roles offered from 2011 to 2016.

col = ['red','green']
count_position = df.groupby(["YEAR","FULL_TIME_POSITION"]).count()['CASE_STATUS'].reset_index()
plt.figure(figsize=(16,8))
sns.barplot(x = count_position['YEAR'],y = count_position['CASE_STATUS'].values,hue=count_position["FULL_TIME_POSITION"],palette=col)
plt.title("No of Full_time Vs Part-time roles offered")
plt.plot();

Findings:

Surprisingly More number of part time roles were applied for H1B sponsorship during 2016.

Which Companies offered Part time roles during 2016?

df1 =df[ df['YEAR']==2016]
df1 = df1[df1['FULL_TIME_POSITION']== 'N']
part_time = df1.groupby('EMPLOYER_NAME').count().sort_values(by='YEAR',ascending=False)
part_time = part_time[['YEAR']]
part_time.columns = ['No of Part_Time roles offered']
part_time[:20]

Major H1B Offering Companies

col = ['brown','brown','brown']
col.extend(['grey' for i in range(7)])
companies = df.groupby("EMPLOYER_NAME").count().sort_values(by = "CASE_STATUS",ascending=False)[:10]
no_of_applications = companies['CASE_STATUS'].values
plt.figure(figsize=(15,10))
sns.barplot(x = no_of_applications,y = companies.index,orient='h',palette=col)
plt.title("Top 10 Companies with highest number of H1B Applications")
plt.xlabel("No of Applications")
plt.ylabel("")
plt.plot();

Findings:

No Surprise that the top 3 companies offering H1B sponsorship are INFOSYS, TCS & WIPRO.
Some of the major consulting companies in India are the major applicants in H1B lottery.
Key reason behind are cheap labor, plethora of resources.

No of applications made by Top 5 Companies over Years

color = ['red','grey','grey','grey','grey','grey']
df1 = df[df['EMPLOYER_NAME'].isin(['INFOSYS LIMITED','TCS','WIPRO LIMITED','DELOITTE','IBM LIMITED'])]
df2 = df1.groupby(['YEAR','EMPLOYER_NAME'])['SOC_NAME'].count().reset_index()
plt.figure(figsize=(14,8))
sns.barplot(x = df2['YEAR'],y = df2['SOC_NAME'],hue=df2['EMPLOYER_NAME'],hue_order=['INFOSYS LIMITED','TCS','WIPRO LIMITED','DELOITTE','IBM LIMITED'],palette=color)
plt.title("No of Applications Per Year");

Findings:

During 2011, Infosys has the lowest number of H1B applications.
Fast forward to 2016, Infosys has become the biggest company in terms of applications.

What type of roles are being sponsored by the top companies?

Infosys = df[df['EMPLOYER_NAME'] == "INFOSYS LIMITED"]
Infosys_Roles = Infosys.groupby("JOB_TITLE").count().sort_values(by="CASE_STATUS",ascending=False)
Infosys_Roles = Infosys_Roles[['CASE_STATUS']][:10]
#Infosys_Roles = Infosys_Roles.style.set_caption('INFOSYS')

TCS = df[df['EMPLOYER_NAME'] == "TCS"]
TCS_Roles = TCS.groupby("JOB_TITLE").count().sort_values(by="CASE_STATUS",ascending=False)
TCS_Roles = TCS_Roles[['CASE_STATUS']][:10]

fig = plt.figure(figsize = (10, 12))
ax = fig.add_subplot(411)

ax.table(cellText = Infosys_Roles.values,
          rowLabels = Infosys_Roles.index,
          colLabels = ['No of Applications from 2011 to 2016'],
          loc = "center"
         )
ax.set_title("Infosys")

ax.axis("off");

ax1 = fig.add_subplot(412)
ax1.table(cellText = TCS_Roles.values,
          rowLabels = TCS_Roles.index,
          colLabels = ['No of Applications from 2011 to 2016'],
          loc = "center"
         )
ax1.set_title("TCS")

ax1.axis("off");


#Wipro
Wipro = df[df['EMPLOYER_NAME'] == "WIPRO LIMITED"]
Wipro_Roles = Wipro.groupby("JOB_TITLE").count().sort_values(by="CASE_STATUS",ascending=False)
Wipro_Roles = Wipro_Roles[['CASE_STATUS']][:10]

#Deloitte
Deloitte = df[df['EMPLOYER_NAME'] == "DELOITTE"]
Deloitte_Roles = Deloitte.groupby("JOB_TITLE").count().sort_values(by="CASE_STATUS",ascending=False)
Deloitte_Roles = Deloitte_Roles[['CASE_STATUS']][:10]

ax = fig.add_subplot(413)

ax.table(cellText = Wipro_Roles.values,
          rowLabels = Wipro_Roles.index,
          colLabels = ['No of Applications from 2011 to 2016'],
          loc = "center"
         )
ax.set_title("WIPRO")

ax.axis("off");

ax1 = fig.add_subplot(414)
ax1.table(cellText = Deloitte_Roles.values,
          rowLabels = Deloitte_Roles.index,
          colLabels =['No of Applications from 2011 to 2016'],
          loc = "center"
         )
ax1.set_title("TCS")

ax1.axis("off");

Findings:

If you are employed in any of the roles in the respective companies, Then you have good chances of being offered H1B sponsorship to undertake job in United States.
Infosys is offering H1B majorly to lead consultants, Whereas TCS is offering much of the H1B’s to the developers.
Most of the companies are offering H1B to majority of the key software roles.

Roles with highest no of applications from 2011 to 2016

Roles = df.groupby("JOB_TITLE").count().sort_values(by="CASE_STATUS",ascending=False)
plt.figure(figsize=(15,10))
sns.set_style('whitegrid')
sns.set_palette('colorblind')
sns.barplot(x = Roles[:10]['CASE_STATUS'],y = Roles.index[:10],orient='h',palette='viridis')
plt.title("Top 10 Roles")
plt.plot();

Findings:

Programmar Analyst has the highest number of H1B applications followed by Software Engineer and Computer Programmer.

Salary distribution for Top roles

PROGRAMMER ANALYST

plt.figure(figsize=(16,8))
plt.subplot(1,2,1)
sns.set_style('whitegrid')
P_A = df[df['JOB_TITLE'] == "PROGRAMMER ANALYST"].groupby('YEAR').count()['CASE_STATUS']
year = [ i for i in range(2011,2017)]
sns.lineplot(x = year,y = P_A,color='red')
plt.title("No of applications over years")
plt.plot();

plt.subplot(1,2,2)
plt.title("Salary Distribution")
PM = df[(df['JOB_TITLE'] == "PROGRAMMER ANALYST") & (df['PREVAILING_WAGE']<300000)]
sns.histplot(x = PM['PREVAILING_WAGE'],color='green');
plt.plot();

SOFTWARE ENGINEER

plt.figure(figsize=(16,8))
plt.subplot(1,2,1)
SOFTWARE_ENGINEER = df[df['JOB_TITLE'] == "SOFTWARE ENGINEER"].groupby('YEAR').count()['CASE_STATUS']
year = [ i for i in range(2011,2017)]
sns.lineplot(x = year,y = SOFTWARE_ENGINEER,color='red')
plt.title("No of applications over years")
plt.plot();

plt.subplot(1,2,2)
plt.title("Salary Distribution")
PM = df[(df['JOB_TITLE'] == "SOFTWARE ENGINEER") & (df['PREVAILING_WAGE']<300000)]
sns.histplot(PM['PREVAILING_WAGE'],color='green');
plt.plot();

COMPUTER PROGRAMMER

plt.figure(figsize=(16,8))
plt.subplot(1,2,1)
P_A = df[df['JOB_TITLE'] == "COMPUTER PROGRAMMER"].groupby('YEAR').count()['CASE_STATUS']
year = [ i for i in range(2011,2017)]
sns.lineplot(x = year,y = P_A,color='red')
plt.title("No of applications over years")
plt.plot();

plt.subplot(1,2,2)
plt.title("Salary Distribution")
PM = df[(df['JOB_TITLE'] == "COMPUTER PROGRAMMER") & (df['PREVAILING_WAGE']<300000)]
sns.histplot(PM['PREVAILING_WAGE'],color='green');
plt.plot();

SYSTEMS ANALYST

plt.figure(figsize=(16,8))
plt.subplot(1,2,1)
P_A = df[df['JOB_TITLE'] == "SYSTEMS ANALYST"].groupby('YEAR').count()['CASE_STATUS']
year = [ i for i in range(2011,2017)]
sns.lineplot(x = year,y = P_A,color='red')
plt.title("No of applications over years")
plt.plot();

plt.subplot(1,2,2)
plt.title("Salary Distribution")
PM = df[(df['JOB_TITLE'] == "SYSTEMS ANALYST") & (df['PREVAILING_WAGE']<300000)]
sns.histplot(PM['PREVAILING_WAGE'],color='green');
plt.plot();

SOFTWARE DEVELOPER

plt.figure(figsize=(16,8))
plt.subplot(1,2,1)
P_A = df[df['JOB_TITLE'] == "SOFTWARE DEVELOPER"].groupby('YEAR').count()['CASE_STATUS']
year = [ i for i in range(2011,2017)]
sns.lineplot(x = year,y = P_A,color='red')
plt.title("No of applications over years")
plt.plot();

plt.subplot(1,2,2)
plt.title("Salary Distribution")
PM = df[(df['JOB_TITLE'] == "SOFTWARE DEVELOPER") & (df['PREVAILING_WAGE']<300000)]
sns.histplot(PM['PREVAILING_WAGE'],color='green');
plt.plot();

Data Scientist

plt.figure(figsize=(16,8))
sns.set_style('whitegrid')
plt.subplot(1,2,1)
Data_Scientist = df[df['JOB_TITLE'] == "DATA SCIENTIST"].groupby('YEAR').count()['CASE_STATUS']
certified = df[df['CASE_STATUS'].isin(['CERTIFIED']) & df['JOB_TITLE'].isin(['DATA SCIENTIST'])].groupby('YEAR').count()['CASE_STATUS']
year = [ i for i in range(2011,2017)]
sns.lineplot(x = year,y = Data_Scientist,legend = 'brief',color='red', label='No of Applications')
sns.lineplot(x = year,y = certified.values,color='green', label = 'Certified')
plt.title("No of applications over years")
plt.plot();

plt.subplot(1,2,2)
plt.title("Salary Distribution")
sns.histplot(df[df['JOB_TITLE'] == "DATA SCIENTIST"]["PREVAILING_WAGE"],color='green')
plt.plot();

Data Engineer

plt.figure(figsize=(16,8))
plt.subplot(1,2,1)
Data_Scientist = df[df['JOB_TITLE'] == "DATA ENGINEER"].groupby('YEAR').count()['CASE_STATUS']
certified = df[df['CASE_STATUS'].isin(['CERTIFIED']) & df['JOB_TITLE'].isin(['DATA ENGINEER'])].groupby('YEAR').count()['CASE_STATUS']
year = [ i for i in range(2011,2017)]
sns.lineplot(x = year,y = Data_Scientist,color='red',label='No of Applications')
sns.lineplot(x = year,y = certified.values,color='green',label = 'Certified')
plt.title("No of applications over years")
plt.plot();

plt.subplot(1,2,2)
plt.title("Salary Distribution")
sns.histplot(df[df['JOB_TITLE'] == "DATA ENGINEER"]["PREVAILING_WAGE"],color='green')
plt.plot();

Data Analyst

plt.figure(figsize=(16,8))
plt.subplot(1,2,1)
Data_Scientist = df[df['JOB_TITLE'] == "DATA ANALYST"].groupby('YEAR').count()['CASE_STATUS']
certified = df[df['CASE_STATUS'].isin(['CERTIFIED']) & df['JOB_TITLE'].isin(['DATA ANALYST'])].groupby('YEAR').count()['CASE_STATUS']

year = [ i for i in range(2011,2017)]
sns.lineplot(x = year,y = Data_Scientist,color='red',label='No of Applications')
sns.lineplot(x = year,y = certified.values,color='green', label = 'Certified')
plt.title("No of applications over years")
plt.plot();

plt.subplot(1,2,2)
plt.title("Salary Distribution")
DA = df[(df['JOB_TITLE'] == "DATA ANALYST") & (df['PREVAILING_WAGE']<200000)]
sns.histplot(DA['PREVAILING_WAGE'],color='green')
plt.plot();

Findings:

Red line signifies the no of applications and green signifies the no of applications got certified.
Data roles are the most dominant roles in the 21st Century.
Less skilled people and high wages are clearly evident from the above graphs.
Getting a H1B is highly likely if you have a offer with Data Roles.

plt.figure(figsize=(15,10))
companies = df[df['JOB_TITLE'].isin(['DATA ANALYST','DATA ENGINEER','DATA SCIENTIST'])].groupby('EMPLOYER_NAME').count().sort_values(by='CASE_STATUS',ascending=False)[:10]['CASE_STATUS']
sns.set_style('whitegrid')
sns.barplot(x = companies.values,y = companies.index,orient='h',palette='viridis')
plt.title("Top 10 Companies")
plt.plot();

Rejection & Acceptance Rate

df3 = df.groupby(['EMPLOYER_NAME','CASE_STATUS']).count().reset_index()
df3 = df3[df3['CASE_STATUS'].isin(['CERTIFIED'])][['EMPLOYER_NAME','JOB_TITLE']]
df4 = df.groupby('EMPLOYER_NAME',).count().reset_index()[['EMPLOYER_NAME','JOB_TITLE']]
df4.columns = ['EMPLOYER_NAME','TOTAL']
df5 = df4.join(df3.set_index('EMPLOYER_NAME'),on='EMPLOYER_NAME')
df5['%Certified'] = round(df5['JOB_TITLE']/df5['TOTAL']*100,2)
df5.columns = ['EMPLOYER','TOTAL_APPLICATIONS','Approved','%']
df5 = df5[df5['TOTAL_APPLICATIONS']>1000]

def chipchop(x):
    if x == 'ADVENT GLOBAL SOLUTIONS INC.':
        return 'ADVENT GLOBAL'
    elif x == 'INDIANA UNIV. PURDUE UNIV. INDIANAPOLIS':
        return 'INDIANA UNIV'
    elif x == 'AMERICAN INFORMATION TECHNOLOGY CORPORATION':
        return 'US INFOTECH CORP'
    elif x == 'EXILANT TECHNOLOGIES PRIVATE LIMITED':
        return 'EXILANT TECH'
    else:
        return x

Top 15 Companies with lowest Certified %

low_prop = df5.sort_values(by='%')
low_prop['EMPLOYER'] = low_prop['EMPLOYER'].apply(chipchop)
low_prop[['EMPLOYER','%']][:15].reset_index().drop('index',axis=1)

Top 15 Companies with highest Certified %

low_prop = df5.sort_values(by='%',ascending=False)
low_prop['EMPLOYER'] = low_prop['EMPLOYER'].apply(chipchop)
low_prop[['EMPLOYER','%']][1:15].reset_index().drop('index',axis=1)

Applications all over United States

Findings:

Most of the applications are concentrated on the northern part of United states. Which is the home for major companies.
Second high dense cluster is found on the south part of the United States. i.e California.. etc.

Conclusion

No of applications are growing rapidly so a the no of applications being approved.
Infosys, TCS, Wipro offers the highest no of H1B applications as compared to other companies.
Some of the key roles attracting H1B are related to software developer, Analyst, Testing.
Even though the number of applications are higher for software related roles, The Data related roles seems to be yielding great returns.
Data related roles has very low rejection rate and high paying salaries.
Some of the Companies had as low as 45% Acceptance rate.
Most of the applications over the year are concentrated over the Northern and Southern part of the Country.

The Mystic H1B Decoded

Purpose of the Analysis

Importing Necessary Libraries

Reading CSV

Trim function

Visa applications over years

Chances of Picking

Case status per year

No of Full time & Part time roles offered from 2011 to 2016.

Which Companies offered Part time roles during 2016?

Major H1B Offering Companies

No of applications made by Top 5 Companies over Years

What type of roles are being sponsored by the top companies?

Roles with highest no of applications from 2011 to 2016

Salary distribution for Top roles

PROGRAMMER ANALYST

SOFTWARE ENGINEER

COMPUTER PROGRAMMER

SYSTEMS ANALYST

SOFTWARE DEVELOPER

Data Scientist

Data Engineer

Data Analyst

Rejection & Acceptance Rate

Top 15 Companies with lowest Certified %

Top 15 Companies with highest Certified %

Applications all over United States

Conclusion

Thank you.

Sukesh Perla

The Mystic H1B Decoded

Purpose of the Analysis

Importing Necessary Libraries

Reading CSV

Trim function

Visa applications over years

Chances of Picking

Case status per year

No of Full time & Part time roles offered from 2011 to 2016.

Which Companies offered Part time roles during 2016?

Major H1B Offering Companies

No of applications made by Top 5 Companies over Years

What type of roles are being sponsored by the top companies?

Roles with highest no of applications from 2011 to 2016

Salary distribution for Top roles

PROGRAMMER ANALYST

SOFTWARE ENGINEER

COMPUTER PROGRAMMER

SYSTEMS ANALYST

SOFTWARE DEVELOPER

How the Data related roles look like

Data Scientist

Data Engineer

Data Analyst

Companies offering H1B for Data related roles

Rejection & Acceptance Rate

Top 15 Companies with lowest Certified %

Top 15 Companies with highest Certified %

Applications all over United States

Conclusion

Thank you.

Sukesh Perla