Problem Statement
The increasing number of applicants every year calls for a machine-learning-based solution that can help shortlist the candidates with higher chances of visa approval. OFLC has hired the firm EasyVisa for data-driven solutions. As a data scientist at EasyVisa, you have to analyze the data provided and, with the help of a classification model:
Facilitate the process of visa approvals.
Recommend a suitable profile for the applicants for whom the visa should be certified or denied based on the drivers that significantly influence the case status.
Data Analysis
The data contains different attributes of the employee and the employer. The data dictionary includes case_id, continent, education_of_employee, has_job_experience, and others.
This dataset has 12 features.
Data Processing
Data Pre-Processing: raw data generally contains issues such as noise, missing values, and improper formatting, so it cannot be used directly by machine learning algorithms.
Pre-processing is the process of cleaning the data and making it suitable for an ML model, which improves the model's efficiency and accuracy.
Data pre-processing is a crucial step in machine learning that helps enhance the quality of the data.
Check for null values and, if any exist, fill them.
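The null-value check described above can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the author's code; the prevailing_wage values are invented, and filling with the median or mode is only one reasonable choice.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the visa data; all values are invented.
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", None, "Asia"],
    "prevailing_wage": [50000.0, np.nan, 72000.0, 61000.0],
})

# Count missing values per column.
null_counts = toy.isnull().sum()
print(null_counts)

# Assumption: fill numeric gaps with the median and categorical gaps with the mode.
filled = toy.copy()
filled["prevailing_wage"] = filled["prevailing_wage"].fillna(
    filled["prevailing_wage"].median()
)
filled["continent"] = filled["continent"].fillna(filled["continent"].mode()[0])
print(filled)
```

If the real dataset has no missing values, isnull().sum() returns zero for every column and the filling step can be skipped.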
Load Dataset
This is the initial step: the dataset is loaded before any analysis.
import pandas as pd

# Read the data
visa = pd.read_csv('/content/drive/My Drive/EasyVisa.csv')
# Copy the data to another variable so the original stays unchanged
data = visa.copy()
data.head()
Checking Duplicate Values
The code below checks the dataset for duplicate rows.
# Check for duplicate rows
data.duplicated()
The duplicated() function returns a boolean Series with one entry per row: False when the row has not appeared earlier in the dataset, and True when it repeats an earlier row.
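Because duplicated() returns one flag per row, it is usually summed to get a count. A small sketch on an invented toy frame (the case_id values are hypothetical):

```python
import pandas as pd

# Toy frame with one repeated row; values are invented for illustration.
toy = pd.DataFrame({
    "case_id": ["EZYV01", "EZYV02", "EZYV02"],
    "continent": ["Asia", "Africa", "Africa"],
})

dup_mask = toy.duplicated()          # True for rows that repeat an earlier row
n_duplicates = int(dup_mask.sum())   # number of duplicate rows
deduped = toy.drop_duplicates().reset_index(drop=True)

print(dup_mask.tolist())
print(n_duplicates)
```

A count of zero means the dataset has no duplicate rows and drop_duplicates() leaves it unchanged.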
Exploratory Data Analysis
In this step we visualize the features of the dataset.
Visualizations make it easier to understand each feature and its relationship with the other features.
The analysis is divided into two categories:
Univariate Analysis
Multivariate Analysis
Observations on number of employees
The histogram shows that most records fall between 0 and 100k employees.
The box plot shows many outliers, so they may need to be removed to get better results.
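One common way to flag the outliers seen in a box plot is the 1.5 times IQR rule, which is the same rule a box plot uses to draw its whiskers. The sketch below is an assumption about how the removal could be done, not the author's method; the column name no_of_employees and the values are hypothetical.

```python
import pandas as pd

# Toy data with one extreme value; the column name is an assumption.
toy = pd.DataFrame({"no_of_employees": [120, 800, 1500, 2300, 3100, 4000, 250000]})

q1 = toy["no_of_employees"].quantile(0.25)
q3 = toy["no_of_employees"].quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are treated as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (toy["no_of_employees"] < lower) | (toy["no_of_employees"] > upper)

trimmed = toy[~is_outlier]
print(int(is_outlier.sum()), "outlier(s) flagged")
```

Whether to drop, cap, or keep such rows depends on the model; tree-based models are generally less sensitive to extreme values than linear ones.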
Observations on prevailing wage
The distribution peaks at the low end and then falls as prevailing_wage increases.
It also has many outliers, which you can see in the box plot.
Observations on continent
Asia has the largest number of applicants compared to the other continents.
Oceania has the smallest number.
Observations on education of employee
At the bachelor's level, 42.2% of applicants apply for a visa, which is the highest share.
At the doctorate level the share is only 8.6%, which is low compared to the other levels.
So, it may be worth focusing on increasing the number of doctorate-level applicants.
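Percentages like the ones above can be computed directly with value_counts(normalize=True). This is a sketch on invented toy data, so the numbers it prints do not match the 42.2% and 8.6% reported for the real dataset.

```python
import pandas as pd

# Toy data; the category labels and their counts are invented.
toy = pd.DataFrame({
    "education_of_employee": [
        "Bachelor's", "Bachelor's", "Master's", "High School", "Doctorate",
    ]
})

# Share of each education level, as a percentage rounded to one decimal.
share = (
    toy["education_of_employee"]
    .value_counts(normalize=True)
    .mul(100)
    .round(1)
)
print(share)
```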
Observations on job experience
The plot above displays the number of applicants with and without job experience.
Observations on job training
The graph shows that only 2955 records require job training, while a much larger number, 22535, do not.
So, I recommend increasing the number of job-training cases relative to non-job-training cases.
Observations on region of employment
The Northeast region has more employment than the other regions (7195).
The Island region has the lowest employment.
I recommend focusing on employment in the Island region to address this.
Observations on unit of wage
Yearly is the most common unit of wage and monthly is the least common.
You can see this more clearly in the graph.
Correlation Plot
The plot covers three variables. Each one is perfectly correlated with itself, which you can see on the diagonal in blue.
Some of the values are negative, which here indicates low correlation between those variables.
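The matrix behind a correlation plot can be computed with DataFrame.corr(). The sketch below uses three numeric columns with invented values; the column names no_of_employees and yr_of_estab are assumptions, since the text does not name the three variables.

```python
import pandas as pd

# Toy numeric data; column names and values are hypothetical.
toy = pd.DataFrame({
    "no_of_employees": [100, 200, 300, 400],
    "yr_of_estab": [2010, 2005, 2001, 1990],
    "prevailing_wage": [40000, 52000, 47000, 61000],
})

# Pearson correlation between every pair of columns.
corr = toy.corr()
print(corr.round(2))
```

The diagonal is always 1 because each variable is compared with itself, so only the off-diagonal cells carry information about relationships between features.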
Those with higher education may want to travel abroad for a well-paid job. Let's find out if education has any impact on visa certification.
Here we see that denied cases are lowest at the highest education levels, while the high school level has the most denied cases.
We need to check why the high school level has the most denied cases.
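A comparison like this one can be tabulated with pd.crosstab, normalizing each row so the certified and denied shares are comparable across education levels. The data here is invented, and the column name case_status with the labels Certified and Denied is an assumption.

```python
import pandas as pd

# Toy data; values and the case_status column name are assumptions.
toy = pd.DataFrame({
    "education_of_employee": [
        "High School", "High School", "High School",
        "Doctorate", "Doctorate", "Doctorate",
    ],
    "case_status": [
        "Denied", "Denied", "Certified",
        "Certified", "Certified", "Denied",
    ],
})

# Row-normalized table: each row sums to 1.
rates = pd.crosstab(
    toy["education_of_employee"], toy["case_status"], normalize="index"
)
print(rates.round(2))
```

The same call with a region column in place of education would give the certification share per region.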
Different regions have different requirements of talent having diverse educational backgrounds. Let's analyze it further.
In this heat map we can see that the Island region has very low values at every education level.
We need to focus on the educational background of Island applicants to address this.
Let's have a look at the percentage of visa certifications across each region.
Here we see that the Island region has the lowest level of visa certifications and the Midwest has the highest.
So we need to focus on the Island region, along with the other regions where it is low.
The US government has established a prevailing wage to protect local talent and foreign workers. Let's analyze the data and see if the visa status changes with the prevailing wage.
Here we see that when we remove the outliers, the graphs for both certified and denied cases increase.
Data Preparation for modeling
We want to predict which visas will be certified.
Before we proceed to build a model, we have to encode the categorical features.
We then split the data into train and test sets so that the model built on the train data can be evaluated on the test data.
The first model built on this data is a decision tree.
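The preparation steps above can be sketched as follows. This is a minimal illustration on invented data, not the original script: one-hot encoding with pd.get_dummies, the case_status labels, the 75/25 split, and random_state=1 are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; values are invented and case_status labels are assumed.
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", "Africa", "Asia", "Europe", "Asia", "Africa"],
    "has_job_experience": ["Y", "N", "Y", "N", "Y", "Y", "N", "N"],
    "prevailing_wage": [50000, 42000, 61000, 38000, 72000, 55000, 40000, 36000],
    "case_status": ["Certified", "Denied", "Certified", "Denied",
                    "Certified", "Certified", "Denied", "Denied"],
})

# Target: 1 for certified, 0 for denied.
y = (toy["case_status"] == "Certified").astype(int)

# One-hot encode the categorical features, dropping one level per feature.
X = pd.get_dummies(toy.drop(columns="case_status"), drop_first=True)

# Stratify so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```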
Decision Tree Classifier
Scores on the training data:
Precision: 1
Recall: 1
F1-score: 1
Accuracy: 1
Scores on the test data:
Precision: 0.50
Recall: 0.49
F1-score: 0.50
Accuracy: 0.66
A perfect score on the training data together with much lower scores on the test data indicates that the decision tree is overfitting.
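The four scores can be computed for both splits with scikit-learn's metric functions. The sketch below uses a synthetic dataset from make_classification rather than the visa data, so its numbers will differ from those reported; score_model is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def score_model(model, X, y):
    """Return accuracy, recall, precision and F1 for a fitted binary classifier."""
    pred = model.predict(X)
    return {
        "accuracy": accuracy_score(y, pred),
        "recall": recall_score(y, pred),
        "precision": precision_score(y, pred),
        "f1": f1_score(y, pred),
    }


# Synthetic stand-in for the encoded visa data.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# An unrestricted tree grows until it fits the training data exactly.
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

train_scores = score_model(tree, X_train, y_train)
test_scores = score_model(tree, X_test, y_test)
print("train:", train_scores)
print("test: ", test_scores)
```

Limiting the tree with parameters such as max_depth or min_samples_leaf is a common way to narrow the gap between training and test scores.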
Bagging Classifier
Bagging is a type of ensemble machine learning approach that combines the outputs from many learners to improve performance.
It works by drawing random subsets of the training set (bootstrap samples), training a separate model on each subset, and aggregating their predictions.
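A bagging model along these lines can be sketched with scikit-learn's BaggingClassifier, whose default base learner is a decision tree. The data is synthetic, and n_estimators=50 and random_state=1 are assumptions rather than the settings used for the results below.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded visa data.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Each of the 50 trees is trained on its own bootstrap sample of the training set;
# predictions are combined by voting.
bag = BaggingClassifier(n_estimators=50, random_state=1)
bag.fit(X_train, y_train)

train_acc = bag.score(X_train, y_train)
test_acc = bag.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```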
Confusion Matrix:
Model Performance on training Set:
Accuracy: 0.98
Recall: 0.98
Precision: 0.99
F1-score: 0.98
Model Performance on test Set:
Accuracy: 0.70
Recall: 0.77
Precision: 0.77
F1-score: 0.77
Random Forest
Random forest is a supervised machine learning algorithm that is widely used in classification and regression problems. It builds decision trees on different samples of the data and takes their majority vote for classification, or their average for regression.
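A random forest can be fitted in the same way, and its feature_importances_ attribute gives a rough view of which features drive the prediction. This is a sketch on synthetic data; n_estimators=100, random_state=1, and the generic feature names are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded visa data, with placeholder feature names.
X, y = make_classification(n_samples=500, n_features=6, random_state=1)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Each tree sees a bootstrap sample and a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

# Importances are normalized to sum to 1; higher means more influence on the splits.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances.round(3))
print("test accuracy:", round(forest.score(X_test, y_test), 2))
```

On the real data, the top entries of this ranking would point to the drivers of case status mentioned in the problem statement.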
Model Performance on training Set:
Accuracy: 1.0
Recall: 1.0
Precision: 1.0
F1-score: 1.0
Model Performance on test Set:
Accuracy: 0.71
Recall: 0.83
Precision: 0.76
F1-score: 0.7