Background:
McCurr Consultancy is an MNC that has thousands of employees spread across the globe. The company believes in hiring the best talent available and retaining them for as long as possible. A huge amount of resources is spent on retaining existing employees through various initiatives. The Head of People Operations wants to bring down the cost of retaining employees. For this, he proposes limiting the incentives to only those employees who are at risk of attrition. As a recently hired Data Scientist in the People Operations Department, you have been asked to identify patterns in characteristics of employees who leave the organization. Also, you have to use this information to predict if an employee is at risk of attrition. This information will be used to target them with incentives.
Reference: Great Learning
Objective:
To identify the different factors that drive attrition.
To build a model that predicts attrition and determine which algorithm gives the best performance.
Dataset:
The data contains demographic details, work-related metrics and attrition flag.
EmployeeNumber - Employee Identifier
Attrition - Did the employee attrite?
Age - Age of the employee
BusinessTravel - Travel commitments for the job
DailyRate - Data description not available**
Department - Employee Department
DistanceFromHome - Distance from work to home (in km)
Education - 1-Below College, 2-College, 3-Bachelor, 4-Master,5-Doctor
EducationField - Field of Education
EmployeeCount - Employee Count in a row
EnvironmentSatisfaction - 1-Low, 2-Medium, 3-High, 4-Very High
Gender - Employee's gender
HourlyRate - Data description not available**
JobInvolvement - 1-Low, 2-Medium, 3-High, 4-Very High
JobLevel - Level of job (1 to 5)
JobRole - Job Roles
JobSatisfaction - 1-Low, 2-Medium, 3-High, 4-Very High
MaritalStatus - Marital Status
MonthlyIncome - Monthly Salary
MonthlyRate - Data description not available**
NumCompaniesWorked - Number of companies worked at
Over18 - Over 18 years of age?
OverTime - Overtime?
PercentSalaryHike - The percentage increase in salary last year
PerformanceRating - 1-Low, 2-Good, 3-Excellent, 4-Outstanding
RelationshipSatisfaction - 1-Low, 2-Medium, 3-High, 4-Very High
StandardHours - Standard Hours
StockOptionLevel - Stock Option Level
TotalWorkingYears - Total years worked
TrainingTimesLastYear - Number of trainings attended last year
WorkLifeBalance - 1-Low, 2-Good, 3-Excellent, 4-Outstanding
YearsAtCompany - Years at Company
YearsInCurrentRole - Years in the current role
YearsSinceLastPromotion - Years since the last promotion
YearsWithCurrManager - Years with the current manager
** In the real world, you will not find definitions for some of your variables. It is a part of the analysis to figure out what they might mean.
Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
import scipy.stats as stats
from sklearn import metrics
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings('ignore')
Read the dataset
hr=pd.read_csv("HR_Employee_Attrition-1.csv")
# copying data to another variable to avoid any changes to the original data
data=hr.copy()
View the first 5 rows of the dataset.
data.head()
output:
Understand the shape of the dataset.
data.shape
Check the data types of the columns for the dataset
data.info()
output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 EmployeeNumber 2940 non-null int64
1 Attrition 2940 non-null object
2 Age 2940 non-null int64
3 BusinessTravel 2940 non-null object
4 DailyRate 2940 non-null int64
5 Department 2940 non-null object
6 DistanceFromHome 2940 non-null int64
7 Education 2940 non-null int64
8 EducationField 2940 non-null object
9 EmployeeCount 2940 non-null int64
10 EnvironmentSatisfaction 2940 non-null int64
11 Gender 2940 non-null object
12 HourlyRate 2940 non-null int64
13 JobInvolvement 2940 non-null int64
14 JobLevel 2940 non-null int64
15 JobRole 2940 non-null object
16 JobSatisfaction 2940 non-null int64
17 MaritalStatus 2940 non-null object
18 MonthlyIncome 2940 non-null int64
19 MonthlyRate 2940 non-null int64
20 NumCompaniesWorked 2940 non-null int64
21 Over18 2940 non-null object
22 OverTime 2940 non-null object
23 PercentSalaryHike 2940 non-null int64
24 PerformanceRating 2940 non-null int64
25 RelationshipSatisfaction 2940 non-null int64
26 StandardHours 2940 non-null int64
27 StockOptionLevel 2940 non-null int64
28 TotalWorkingYears 2940 non-null int64
29 TrainingTimesLastYear 2940 non-null int64
30 WorkLifeBalance 2940 non-null int64
31 YearsAtCompany 2940 non-null int64
32 YearsInCurrentRole 2940 non-null int64
33 YearsSinceLastPromotion 2940 non-null int64
34 YearsWithCurrManager 2940 non-null int64
dtypes: int64(26), object(9)
memory usage: 804.0+ KB
Observations -
There are no null values in the dataset.
We can convert the object type columns to the category dtype; converting "object" to "category" reduces the memory required to store the dataframe.
Fixing the data types
cols = data.select_dtypes(['object'])
cols.columns
output:
Index(['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender',
'JobRole', 'MaritalStatus', 'Over18', 'OverTime'],
dtype='object')
for i in cols.columns:
    data[i] = data[i].astype('category')
data.info()
output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 EmployeeNumber 2940 non-null int64
1 Attrition 2940 non-null category
2 Age 2940 non-null int64
3 BusinessTravel 2940 non-null category
4 DailyRate 2940 non-null int64
5 Department 2940 non-null category
6 DistanceFromHome 2940 non-null int64
7 Education 2940 non-null int64
8 EducationField 2940 non-null category
9 EmployeeCount 2940 non-null int64
10 EnvironmentSatisfaction 2940 non-null int64
11 Gender 2940 non-null category
12 HourlyRate 2940 non-null int64
13 JobInvolvement 2940 non-null int64
14 JobLevel 2940 non-null int64
15 JobRole 2940 non-null category
16 JobSatisfaction 2940 non-null int64
17 MaritalStatus 2940 non-null category
18 MonthlyIncome 2940 non-null int64
19 MonthlyRate 2940 non-null int64
20 NumCompaniesWorked 2940 non-null int64
21 Over18 2940 non-null category
22 OverTime 2940 non-null category
23 PercentSalaryHike 2940 non-null int64
24 PerformanceRating 2940 non-null int64
25 RelationshipSatisfaction 2940 non-null int64
26 StandardHours 2940 non-null int64
27 StockOptionLevel 2940 non-null int64
28 TotalWorkingYears 2940 non-null int64
29 TrainingTimesLastYear 2940 non-null int64
30 WorkLifeBalance 2940 non-null int64
31 YearsAtCompany 2940 non-null int64
32 YearsInCurrentRole 2940 non-null int64
33 YearsSinceLastPromotion 2940 non-null int64
34 YearsWithCurrManager 2940 non-null int64
dtypes: category(9), int64(26)
memory usage: 624.6 KB
We can see that the memory usage has decreased from 804.0+ KB to 624.6 KB. This technique is especially useful for larger datasets.
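As an optional sanity check, deep memory usage makes the savings visible per column; note that deep=True measures actual string storage, so the totals will differ from the shallow figures reported by info():
# comparing deep memory usage of the original copy (hr) with the converted dataframe
print("before (object dtypes):", round(hr.memory_usage(deep=True).sum() / 1024, 1), "KB")
print("after (category dtypes):", round(data.memory_usage(deep=True).sum() / 1024, 1), "KB")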
Summary of the dataset.
data.describe().T
output:
EmployeeNumber is an ID variable and not useful for predictive modelling.
Age of the employees ranges from 18 to 60 years, and the average age is around 36 years.
EmployeeCount has only 1 as the value in all rows and can be dropped, as it adds no information to our analysis.
StandardHours has only 80 as the value in all rows and can be dropped, as it adds no information to our analysis.
HourlyRate has a huge range, but we do not know what this variable stands for yet. The same goes for DailyRate and MonthlyRate.
MonthlyIncome has a wide range, and the difference between the mean and the median indicates the presence of outliers.
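To make the outlier claim concrete, a quick sketch using the usual 1.5*IQR rule of thumb (an assumption, not part of the original notebook) counts how many MonthlyIncome values fall outside the boxplot whiskers:
# counting MonthlyIncome values outside the 1.5*IQR whiskers
q1, q3 = data['MonthlyIncome'].quantile([0.25, 0.75])
iqr = q3 - q1
n_out = ((data['MonthlyIncome'] < q1 - 1.5 * iqr) | (data['MonthlyIncome'] > q3 + 1.5 * iqr)).sum()
print(n_out, "potential outliers in MonthlyIncome")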
data.describe(include=['category']).T
output:
Attrition is our target variable; about 84% of the records are 'No', i.e., the employee did not attrite.
The majority of employees have low business travel requirements.
The majority of employees are from the Research & Development department.
All employees are over 18 years of age; we can drop this variable as it adds no information to our analysis.
There are more male employees than female employees.
Dropping columns which are not adding any information.
data.drop(['EmployeeNumber','EmployeeCount','StandardHours','Over18'],axis=1,inplace=True)
Let's look at the unique values of all the categories
cols_cat= data.select_dtypes(['category'])
for i in cols_cat.columns:
    print('Unique values in', i, 'are :')
    print(cols_cat[i].value_counts())
    print('*' * 50)
output:
EDA (Exploratory Data Analysis)
Univariate analysis
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for the histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)  # histogram
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add the mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add the median to the histogram
Observations on Age
histogram_boxplot(data,'Age')
output:
Bivariate Analysis
plt.figure(figsize=(20,10))
sns.heatmap(data.corr(numeric_only=True),annot=True,vmin=-1,vmax=1,fmt='.2f',cmap="Spectral")  # numeric_only=True skips the category columns (required in newer pandas)
plt.show()
output:
There are a few variables that are correlated with each other, but there are no surprises here.
Unsurprisingly, TotalWorkingYears is highly correlated with JobLevel (i.e., the longer you work, the higher the job level you reach).
HourlyRate, DailyRate, and MonthlyRate are completely uncorrelated with each other, which makes it harder to understand what these variables might represent.
MonthlyIncome is highly correlated with JobLevel.
Age is positively correlated with JobLevel and Education (i.e., older employees tend to be more educated and at a higher job level).
WorkLifeBalance is not correlated with any of the numeric variables.
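To verify the reading on the rate variables directly, one can pull out just their pairwise correlations:
# pairwise correlations of the three unexplained "rate" variables only
rate_cols = ['HourlyRate', 'DailyRate', 'MonthlyRate']
print(data[rate_cols].corr().round(2))  # values near 0 confirm they are mutually uncorrelated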
sns.pairplot(data,hue='Attrition')
plt.show()
output:
We can see that several variables show different distributions across the Attrition classes; we should investigate this further.
Attrition vs Earnings of employee
cols = data[['DailyRate','HourlyRate','MonthlyRate','MonthlyIncome','PercentSalaryHike']].columns.tolist()

plt.figure(figsize=(10, 10))
for i, variable in enumerate(cols):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(x=data["Attrition"], y=data[variable], palette="PuBu")  # keyword arguments are required in newer seaborn
    plt.title(variable)
    plt.tight_layout()
plt.show()
output:
Employees with a lower daily rate and lower monthly income are more likely to attrite.
The monthly rate and the hourly rate do not seem to have any effect on attrition.
A smaller salary hike also contributes to attrition.
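To put numbers behind the boxplot reading, a small sketch comparing group medians:
# median pay-related values by attrition group, quantifying the boxplots above
pay_cols = ['DailyRate', 'MonthlyIncome', 'PercentSalaryHike']
print(data.groupby('Attrition')[pay_cols].median())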
Model Building - Approach
Data preparation
Partition the data into train and test sets.
Build the model on the train data.
Tune the model if required.
Evaluate the model on the test set.
Split Data
When a classification problem exhibits a significant imbalance in the distribution of the target classes, it is good practice to use stratified sampling to ensure that relative class frequencies are approximately preserved in the train and test sets.
This is done using the stratify parameter in the train_test_split function.
X = data.drop(['Attrition'],axis=1)
X = pd.get_dummies(X,drop_first=True)  # one-hot encode the categorical columns, dropping the first level of each
y = data['Attrition'].apply(lambda x : 1 if x=='Yes' else 0)  # encode the target: Yes -> 1, No -> 0
# Splitting data into training and test set:
X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.3, random_state=1,stratify=y)
print(X_train.shape, X_test.shape)
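As a quick optional check that stratification worked, the class proportions should be nearly identical in both splits:
# verifying that stratified sampling preserved the ~16% attrition rate
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))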
Model evaluation criterion

The model can make two types of wrong predictions:
Predicting an employee will attrite when the employee doesn't attrite (false positive)
Predicting an employee will not attrite when the employee attrites (false negative)
Which case is more important?
Predicting that an employee will not attrite when they actually do, i.e., losing a valuable employee.
How do we reduce this loss, i.e., reduce false negatives?
The company wants recall to be maximized: the greater the recall, the fewer the false negatives. The focus should therefore be on increasing recall, i.e., correctly identifying the attriting employees (class 1), so that the company can target them, especially top performers, with incentives and optimize the overall cost of retaining the best talent.
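As a minimal illustration with hypothetical counts (not from this dataset), recall = TP / (TP + FN), so driving false negatives down directly drives recall up:
# hypothetical counts, for illustration only
tp, fn = 100, 25
print(tp / (tp + fn))  # recall = 0.8; fewer false negatives -> higher recall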
Let's define a function to compute metric scores (accuracy, recall, precision, F1, and ROC-AUC) on the train and test sets, and a function to show the confusion matrix, so that we do not have to repeat the same code while evaluating the models.
# defining a function to compute different metrics to check the performance of a
# classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    roc_auc = roc_auc_score(target, pred)  # to compute ROC-AUC (shown in the output tables below)

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
            "ROC-AUC": roc_auc,
        },
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
Build Decision Tree Model
We will build our model using the DecisionTreeClassifier function, with the default 'gini' criterion for splits.
If the frequency of class A is 10% and the frequency of class B is 90%, then class B becomes the dominant class and the decision tree becomes biased toward it.
In this case, we can pass a dictionary {0:0.17,1:0.83} to the model to specify the weight of each class, so the decision tree will give more weight to class 1.
class_weight is a hyperparameter of the decision tree classifier.
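As a sketch of where such weights can come from (optional, not part of the original notebook): the chosen weights roughly mirror the class frequencies, which is close in spirit to what sklearn's class_weight='balanced' derives from inverse class frequencies.
# class frequencies in the training target; with ~16% attrition, weighting
# class 1 by ~0.83 and class 0 by ~0.17 up-weights the minority class much
# like class_weight='balanced' would
print(y_train.value_counts(normalize=True))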
dtree = DecisionTreeClassifier(criterion='gini',class_weight={0:0.17,1:0.83},random_state=1)
dtree.fit(X_train, y_train)
confusion_matrix_sklearn(dtree, X_test, y_test)
output:
Confusion Matrix -
Employee left and the model predicted it correctly, i.e., the employee will attrite: True Positive (observed=1, predicted=1)
Employee didn't leave and the model predicted the employee will attrite: False Positive (observed=0, predicted=1)
Employee didn't leave and the model predicted the employee will not attrite: True Negative (observed=0, predicted=0)
Employee left and the model predicted the employee won't: False Negative (observed=1, predicted=0)
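A quick sketch (assuming the fitted dtree above) to extract these four counts programmatically; sklearn orders the flattened 2x2 matrix as TN, FP, FN, TP:
# extracting the quadrant counts; recall computed by hand matches recall_score
tn, fp, fn, tp = confusion_matrix(y_test, dtree.predict(X_test)).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
print("recall =", tp / (tp + fn))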
dtree_model_train_perf=model_performance_classification_sklearn(dtree, X_train, y_train)
print("Training performance \n",dtree_model_train_perf)
output:
Training performance
Accuracy Recall Precision F1 ROC-AUC
0 1.0 1.0 1.0 1.0 1.0
dtree_model_test_perf=model_performance_classification_sklearn(dtree, X_test, y_test)
print("Testing performance \n",dtree_model_test_perf)
output:
Testing performance
Accuracy Recall Precision F1 ROC-AUC
0 0.941043 0.84507 0.8 0.821918 0.902265
Comments
The decision tree achieves perfect scores on the training set but noticeably lower recall and precision on the test set, which indicates overfitting. The next step would be to prune the tree or tune its hyperparameters, and to compare it against ensemble methods such as bagging and random forests.
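A possible next step, sketched here with the already-imported GridSearchCV (the grid values are illustrative assumptions, not the notebook's choices): tune the tree's complexity with recall as the scoring metric, since false negatives are the costly error.
# illustrative hyperparameter grid; scoring='recall' targets false negatives
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_leaf': [1, 5, 10, 25],
}
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1),
    param_grid, scoring='recall', cv=5, n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(model_performance_classification_sklearn(grid.best_estimator_, X_test, y_test))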