Download the dataset from here.
First, import all the required libraries.
# First, read-in the data and check for null values
import numpy as np
import pandas as pd
import aif360
from aif360.algorithms.preprocessing import DisparateImpactRemover
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
pd.options.mode.chained_assignment = None # default='warn'; silences SettingWithCopyWarning
df = pd.read_csv('credit_risk_test.csv')
df
Output:
# See the different columns and check for null entries
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 981 entries, 0 to 980
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Loan_ID            981 non-null    object
 1   Gender             957 non-null    object
 2   Race               981 non-null    object
 3   Married            978 non-null    object
 4   Dependents         956 non-null    object
 5   Education          981 non-null    object
 6   Self_Employed      926 non-null    object
 7   ApplicantIncome    981 non-null    int64
 8   CoapplicantIncome  981 non-null    float64
 9   LoanAmount         954 non-null    float64
 10  Loan_Amount_Term   961 non-null    float64
 11  Credit_History     902 non-null    float64
 12  Property_Area      981 non-null    object
 13  Loan_Status        981 non-null    object
dtypes: float64(4), int64(1), object(9)
memory usage: 107.4+ KB
# Remove rows with null values
df = df.dropna(how='any', axis = 0)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 769 entries, 1 to 980
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Loan_ID            769 non-null    object
 1   Gender             769 non-null    object
 2   Race               769 non-null    object
 3   Married            769 non-null    object
 4   Dependents         769 non-null    object
 5   Education          769 non-null    object
 6   Self_Employed      769 non-null    object
 7   ApplicantIncome    769 non-null    int64
 8   CoapplicantIncome  769 non-null    float64
 9   LoanAmount         769 non-null    float64
 10  Loan_Amount_Term   769 non-null    float64
 11  Credit_History     769 non-null    float64
 12  Property_Area      769 non-null    object
 13  Loan_Status        769 non-null    object
dtypes: float64(4), int64(1), object(9)
memory usage: 90.1+ KB
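Before dropping rows, it can help to see how many values each column is missing. The following is a minimal sketch on a small made-up frame (the `toy` DataFrame and its values are illustrative, not taken from the credit risk file). It uses `isna().sum()` for per-column counts and compares the row count before and after `dropna`.

```python
import numpy as np
import pandas as pd

# Illustrative data only: a tiny stand-in for the loan dataset
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [120.0, 85.0, np.nan, 60.0],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# Number of missing entries in each column
null_counts = toy.isna().sum()
print(null_counts)

# Drop every row that has at least one missing value
clean = toy.dropna(how="any", axis=0)
print(f"Rows before: {len(toy)}, rows after: {len(clean)}")
```

On the tutorial's data, the same `dropna` call reduces the frame from 981 rows to 769.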
Next, check the breakdown of values for the outcome variable, Loan_Status.
target_counts = df['Loan_Status'].value_counts()
target_counts
Output:
Y    561
N    208
Name: Loan_Status, dtype: int64
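The raw counts are easier to judge as proportions. As a small sketch, the counts printed above can be turned into class shares. The Series is rebuilt by hand here so the snippet runs on its own; on the real frame, `df['Loan_Status'].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Counts reported above for Loan_Status after dropping null rows
target_counts = pd.Series({"Y": 561, "N": 208}, name="Loan_Status")

# Convert counts to proportions of the 769 remaining rows
target_share = target_counts / target_counts.sum()
print(target_share.round(3))
```

Roughly 73% of the remaining applications are approved, so the classes are imbalanced. That is worth keeping in mind when reading accuracy figures later on.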
# Drop unnecessary column
df = df.drop(['Loan_ID'], axis = 1)
Encode categorical variables
# Encode Male as 1, Female as 0
df.loc[df.Gender == 'Male', 'Gender'] = 1
df.loc[df.Gender == 'Female', 'Gender'] = 0
# Encode Y Loan_Status as 1, N Loan_Status as 0
df.loc[df.Loan_Status == 'Y', 'Loan_Status'] = 1
df.loc[df.Loan_Status == 'N', 'Loan_Status'] = 0
df
Output:
Now verify the Loan_Status column:
y = df['Loan_Status']
y
Output:
1      0
2      1
3      1
4      1
5      1
      ..
975    1
976    1
977    1
979    0
980    1
Name: Loan_Status, Length: 769, dtype: object
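Note that the column above still reports `dtype: object` even though it now holds 0s and 1s. One possible alternative, sketched here on made-up values, is to map the labels and cast to an integer dtype in one step. The `toy` frame is illustrative only.

```python
import pandas as pd

# Illustrative values only
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y"],
})

# Map each label to its numeric code, then cast to an integer dtype
toy["Gender"] = toy["Gender"].map({"Male": 1, "Female": 0}).astype("int64")
toy["Loan_Status"] = toy["Loan_Status"].map({"Y": 1, "N": 0}).astype("int64")

print(toy.dtypes)
```

With this approach, a label that is missing from the mapping becomes NaN and the cast to `int64` raises an error, so unexpected categories surface early instead of passing through unencoded.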
Convert all remaining non-numeric values to numeric using get_dummies
# List the categorical features that still hold string values
categoricalFeatures = ['Race', 'Property_Area', 'Married', 'Dependents', 'Education', 'Self_Employed']
# Iterate through the list of categorical features and one-hot encode each one
for feature in categoricalFeatures:
    onehot = pd.get_dummies(df[feature], prefix=feature)
    df = df.drop(feature, axis=1)
    df = df.join(onehot)
df
Output:
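The loop above encodes one column at a time. As a sketch of an equivalent shortcut, `pd.get_dummies` also accepts a `columns` argument and encodes several columns in a single call. The rows below are made up for illustration; only the column names follow the tutorial.

```python
import pandas as pd

# Illustrative rows only; column names follow the tutorial
toy = pd.DataFrame({
    "Married": ["Yes", "No", "Yes"],
    "Education": ["Graduate", "Graduate", "Not Graduate"],
    "ApplicantIncome": [5000, 3200, 4100],
})

# One-hot encode the listed columns in one call;
# each new column is named <original column>_<value>
encoded = pd.get_dummies(toy, columns=["Married", "Education"])
print(encoded.columns.tolist())
```

Columns that are not listed, such as `ApplicantIncome`, are passed through unchanged.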
Separate the dataset into features (x) and target (y)
from sklearn.model_selection import train_test_split
encoded_df = df.copy()
x = df.drop(['Loan_Status'], axis = 1)
Create Test and Train splits
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_std = scaler.fit_transform(x) # Note: data_std is not used below; the split and the model use the unscaled x
# We will follow an 80-20 split pattern for our training and test data, respectively
x_train,x_test,y_train,y_test = train_test_split(x, y, test_size=0.2, random_state = 0)
x_train=x_train.astype('int')
y_train=y_train.astype('int')
x_test=x_test.astype('int')
y_test=y_test.astype('int')
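In the code above, the scaler is fit on the full feature matrix and its result is not passed to `train_test_split`. If standardized features are wanted, a common pattern is to split first and fit the scaler on the training rows only. The sketch below uses synthetic data, and the capitalized variable names are hypothetical, chosen to avoid clashing with the tutorial's own.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training rows only,
# then apply the same transform to the test rows
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0).round(6))
print(X_train_std.std(axis=0).round(6))
```

Fitting on the training rows alone keeps test-set statistics out of the preprocessing step.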
Calculating actual disparate impact on testing values from original dataset
Disparate impact is defined as the rate of favorable outcomes for the unprivileged group divided by the rate of favorable outcomes for the privileged group. A value of 1 means both groups receive favorable outcomes at the same rate. The acceptable range is between 0.8 and 1.25: values below 0.8 indicate that outcomes favor the privileged group, and values above 1.25 indicate that they favor the unprivileged group.
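The definition can be written as a small helper. This is a sketch only: the function name `disparate_impact` and its parameters are hypothetical, and the `toy` frame is made up to keep the arithmetic easy to follow.

```python
import pandas as pd

def disparate_impact(df, group_col, outcome_col,
                     unprivileged=0, privileged=1, favorable=1):
    """Favorable-outcome rate of the unprivileged group
    divided by that of the privileged group."""
    unpriv = df[df[group_col] == unprivileged]
    priv = df[df[group_col] == privileged]
    unpriv_rate = (unpriv[outcome_col] == favorable).mean()
    priv_rate = (priv[outcome_col] == favorable).mean()
    return unpriv_rate / priv_rate

# Illustrative data: 5 unprivileged rows (2 favorable),
# 4 privileged rows (3 favorable)
toy = pd.DataFrame({
    "Gender":      [0, 0, 0, 0, 0, 1, 1, 1, 1],
    "Loan_Status": [1, 1, 0, 0, 0, 1, 1, 1, 0],
})

di = disparate_impact(toy, "Gender", "Loan_Status")
print(di)                  # 0.4 / 0.75, roughly 0.53
print(0.8 <= di <= 1.25)   # is the ratio inside the acceptable range?
```

In this toy case the ratio falls below 0.8, so it would be read as favoring the privileged group.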
actual_test = x_test.copy()
actual_test['Loan_Status_Actual'] = y_test
actual_test.shape
Output:
(154, 26)
# Privileged group: Males (1)
# Unprivileged group: Females (0)
male_df = actual_test[actual_test['Gender'] == 1]
num_of_priviliged = male_df.shape[0]
female_df = actual_test[actual_test['Gender'] == 0]
num_of_unpriviliged = female_df.shape[0]
unpriviliged_outcomes = female_df[female_df['Loan_Status_Actual'] == 1].shape[0]
unpriviliged_ratio = unpriviliged_outcomes/num_of_unpriviliged
unpriviliged_ratio
Output:
0.6
priviliged_outcomes = male_df[male_df['Loan_Status_Actual'] == 1].shape[0]
priviliged_ratio = priviliged_outcomes/num_of_priviliged
priviliged_ratio
Output:
0.7226890756302521
# Calculating disparate impact
disparate_impact = unpriviliged_ratio / priviliged_ratio
print("Disparate Impact, Sex vs. Actual Loan Status: " + str(disparate_impact))
Output:
Disparate Impact, Sex vs. Actual Loan Status: 0.8302325581395349
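As a sanity check, the group sizes behind the two printed rates can be recovered with exact fractions. This is an inference from the reported numbers rather than something stated in the original. It assumes the test set has 154 rows (from the shape printed earlier) and that Gender only takes the values 0 and 1.

```python
from fractions import Fraction

test_rows = 154  # from actual_test.shape above

# Recover the simplest fractions behind the two printed rates
priv_rate = Fraction(0.7226890756302521).limit_denominator(test_rows)
unpriv_rate = Fraction(0.6).limit_denominator(test_rows)

# The privileged rate is in lowest terms, and no larger multiple of its
# denominator fits in 154 rows, so the denominator is the group size.
# Assumption: Gender is only 0 or 1, so the rest are unprivileged rows.
num_priv = priv_rate.denominator
num_unpriv = test_rows - num_priv

di = unpriv_rate / priv_rate
print(priv_rate, unpriv_rate, num_priv, num_unpriv)
print(float(di))
```

Under these assumptions the unprivileged group has only 35 rows, so a handful of outcomes either way would move the ratio noticeably.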
Training a model on the original dataset
from sklearn.linear_model import LogisticRegression
# liblinear is a solver that works well for small datasets like ours
model = LogisticRegression(solver='liblinear', class_weight='balanced')
Fit the model:
model.fit(x_train, y_train)
Output:
LogisticRegression(class_weight='balanced', solver='liblinear')
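To make `class_weight='balanced'` more concrete, the sketch below computes the weights scikit-learn would assign for a given label distribution. It uses the label counts of the full cleaned dataset (208 rejected, 561 approved) as an approximation. The actual training split has slightly different counts, so the fitted model's weights will differ a little.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label counts from the full cleaned dataset (an approximation of
# the training split, which is not printed in the tutorial)
y_all = np.array([0] * 208 + [1] * 561)

# 'balanced' weights follow n_samples / (n_classes * count_of_class)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_all
)
print(weights.round(3))
```

The minority class (rejections) gets the larger weight, so misclassifying a rejected application costs more during training than misclassifying an approved one.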
Evaluating performance
# Let's see how well it predicted with a couple values
y_pred = pd.Series(model.predict(x_test))
y_test = y_test.reset_index(drop=True)
z = pd.concat([y_test, y_pred], axis=1)
z.columns = ['True', 'Prediction']
z.head()
# Predicts 4/5 correctly in this sample
Output:
   True  Prediction
0     1           0
1     1           1
2     0           0
3     0           0
4     0           0
import matplotlib.pyplot as plt
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))
Output:
Accuracy: 0.8311688311688312
Precision: 0.9263157894736842
Recall: 0.822429906542056
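The three scores can be related back to confusion-matrix counts. The counts below are not printed in the original. They are inferred from the reported metrics and the 154-row test set, and they are the only whole-number combination consistent with all three values. The sketch recomputes the metrics from them by hand.

```python
# Counts inferred from the reported metrics on the 154-row test set
# (an inference, not output shown in the tutorial)
tp, fp, fn, tn = 88, 7, 19, 40

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted approvals that are real approvals
recall = tp / (tp + fn)                      # share of real approvals that the model finds

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```

Read this way, the model wrongly approves 7 applications and wrongly rejects 19 that were actually approved. With the tutorial's own variables, `metrics.confusion_matrix(y_test, y_pred)` should report these counts directly, laid out as `[[tn, fp], [fn, tp]]`.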