Jul 25, 2021 · 3 min read

Loan Credit Default Risk Modeling Using IBM Toolkit AIX 360 and AIF 360

Dataset Download from here

First, import all the required libraries and read in the data

# First, read-in the data and check for null values
import numpy as np
import pandas as pd
import aif360
from aif360.algorithms.preprocessing import DisparateImpactRemover
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
pd.options.mode.chained_assignment = None  # default='warn', silencing Setting With Copy warning
df = pd.read_csv('credit_risk_test.csv')
df

Output:

# See the different columns and check for null entries
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 981 entries, 0 to 980
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Loan_ID            981 non-null    object
 1   Gender             957 non-null    object
 2   Race               981 non-null    object
 3   Married            978 non-null    object
 4   Dependents         956 non-null    object
 5   Education          981 non-null    object
 6   Self_Employed      926 non-null    object
 7   ApplicantIncome    981 non-null    int64
 8   CoapplicantIncome  981 non-null    float64
 9   LoanAmount         954 non-null    float64
 10  Loan_Amount_Term   961 non-null    float64
 11  Credit_History     902 non-null    float64
 12  Property_Area      981 non-null    object
 13  Loan_Status        981 non-null    object
dtypes: float64(4), int64(1), object(9)
memory usage: 107.4+ KB

# Remove rows with null values
df = df.dropna(how='any', axis = 0)
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 769 entries, 1 to 980
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Loan_ID            769 non-null    object
 1   Gender             769 non-null    object
 2   Race               769 non-null    object
 3   Married            769 non-null    object
 4   Dependents         769 non-null    object
 5   Education          769 non-null    object
 6   Self_Employed      769 non-null    object
 7   ApplicantIncome    769 non-null    int64
 8   CoapplicantIncome  769 non-null    float64
 9   LoanAmount         769 non-null    float64
 10  Loan_Amount_Term   769 non-null    float64
 11  Credit_History     769 non-null    float64
 12  Property_Area      769 non-null    object
 13  Loan_Status        769 non-null    object
dtypes: float64(4), int64(1), object(9)
memory usage: 90.1+ KB
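Before dropping rows, it can help to see how many values are missing in each column. The following is a minimal sketch on a small made-up frame: the column names mirror the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names come from the dataset
toy = pd.DataFrame({
    'Gender': ['Male', None, 'Female', 'Male'],
    'LoanAmount': [120.0, 100.0, np.nan, 150.0],
    'Credit_History': [1.0, 1.0, 0.0, np.nan],
})

# Count missing values per column
null_counts = toy.isnull().sum()
print(null_counts)

# Drop every row that has at least one missing value
clean = toy.dropna(how='any', axis=0)
print(len(toy), "rows before,", len(clean), "rows after")
```

On the real data, dropping incomplete rows takes the frame from 981 to 769 entries, so roughly 22% of the rows are discarded.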

Next, check the breakdown of values for the outcome variable, Loan_Status.

target_counts = df['Loan_Status'].value_counts()
target_counts

Output:

Y    561
N    208
Name: Loan_Status, dtype: int64
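To make the class imbalance easier to read, the counts can be turned into proportions. This small sketch rebuilds the counts reported above by hand rather than reading the CSV, so it runs on its own.

```python
import pandas as pd

# Class counts as reported in the output above
target_counts = pd.Series({'Y': 561, 'N': 208}, name='Loan_Status')

# Convert counts to proportions of the 769 remaining rows
proportions = target_counts / target_counts.sum()
print(proportions.round(3))
```

About 73% of the remaining applications are approved, which is useful context for the class_weight='balanced' setting used when the model is trained.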

# Drop unnecessary column
df = df.drop(['Loan_ID'], axis = 1)

Encode categorical variables


# Encode Male as 1, Female as 0
df.loc[df.Gender == 'Male', 'Gender'] = 1
df.loc[df.Gender == 'Female', 'Gender'] = 0
# Encode Y Loan_Status as 1, N Loan_Status as 0
df.loc[df.Loan_Status == 'Y', 'Loan_Status'] = 1
df.loc[df.Loan_Status == 'N', 'Loan_Status'] = 0
df

Output:

Now verify the Loan_Status column

y = df['Loan_Status']
y

Output:

1      0
2      1
3      1
4      1
5      1
      ..
975    1
976    1
977    1
979    0
980    1
Name: Loan_Status, Length: 769, dtype: object
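The output above still reports dtype: object, because assigning through df.loc into a string column leaves the column's dtype unchanged. One alternative, shown here as a sketch on invented rows, is to map the labels and cast to an integer dtype in one step.

```python
import pandas as pd

# Toy frame with invented rows; column names and labels follow the dataset
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'Loan_Status': ['Y', 'N', 'Y'],
})

# Map each label to its code, then cast to an integer dtype
toy['Gender'] = toy['Gender'].map({'Male': 1, 'Female': 0}).astype(int)
toy['Loan_Status'] = toy['Loan_Status'].map({'Y': 1, 'N': 0}).astype(int)

print(toy.dtypes)
```

The walkthrough gets the same effect later by calling astype('int') after the train/test split, so this is a matter of preference rather than a required change.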

Convert the remaining non-numeric columns to numeric using get_dummies

# The remaining categorical columns, which will be one-hot encoded below
categoricalFeatures = ['Race', 'Property_Area', 'Married', 'Dependents', 'Education', 'Self_Employed']

# Iterate through the list of categorical features and one hot encode them.
for feature in categoricalFeatures:
    onehot = pd.get_dummies(df[feature], prefix=feature)
    df = df.drop(feature, axis=1)
    df = df.join(onehot)
df

Output:
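The loop above can also be written as a single call, since pd.get_dummies accepts a columns argument. The sketch below compares both versions on a toy frame with invented rows; the variable names looped and single are only for illustration.

```python
import pandas as pd

# Toy frame with invented rows; column names follow the dataset
toy = pd.DataFrame({
    'ApplicantIncome': [5000, 3000, 4000],
    'Married': ['Yes', 'No', 'Yes'],
    'Education': ['Graduate', 'Not Graduate', 'Graduate'],
})
features = ['Married', 'Education']

# Loop version, as in the walkthrough
looped = toy.copy()
for feature in features:
    onehot = pd.get_dummies(looped[feature], prefix=feature)
    looped = looped.drop(feature, axis=1).join(onehot)

# Single-call version: encode all listed columns at once
single = pd.get_dummies(toy, columns=features)

print(list(single.columns))
print(list(looped.columns) == list(single.columns))
```

Both approaches drop the original column and append one indicator column per category, named with the original column as prefix.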

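Separate the dataset into x and y

from sklearn.model_selection import train_test_split
# Keep a copy of the fully encoded frame
encoded_df = df.copy()
# Features: everything except the target (y was defined above as df['Loan_Status'])
x = df.drop(['Loan_Status'], axis = 1)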

Create Test and Train splits

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Note: data_std is computed here but never used below;
# the split and the model both work on the unscaled x
data_std = scaler.fit_transform(x)
# We will follow an 80-20 split pattern for our training and test data, respectively

x_train,x_test,y_train,y_test = train_test_split(x, y, test_size=0.2, random_state = 0)
# Cast everything to int (this truncates fractional values in the float columns)
x_train=x_train.astype('int')
y_train=y_train.astype('int')
x_test=x_test.astype('int')
y_test=y_test.astype('int')
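In the code above, data_std is computed but the split is made on the unscaled x, so the model never sees standardized features. If scaling is wanted, one common pattern is to split first, fit the scaler on the training portion only, and reuse it on the test portion. The sketch below uses synthetic data and is not a drop-in change: applying it to the real data would alter the metrics reported later in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (invented): 100 rows, 3 numeric features
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

# Split first, then fit the scaler on the training portion only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)  # reuse the training mean and variance

print(X_train_std.shape, X_test_std.shape)
```

Fitting the scaler on the training rows alone keeps test-set statistics from leaking into the features the model is trained on.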

Calculating actual disparate impact on testing values from original dataset

Disparate impact is defined as the rate of favorable outcomes for the unprivileged group divided by the rate of favorable outcomes for the privileged group. A value of 1 means both groups receive favorable outcomes at the same rate. The acceptable range is between 0.8 and 1.25: values below 0.8 indicate bias in favor of the privileged group, and values above 1.25 indicate bias in favor of the unprivileged group.

actual_test = x_test.copy()
actual_test['Loan_Status_Actual'] = y_test
actual_test.shape

Output:

(154, 26)

# Privileged group: Males (1)
# Unprivileged group: Females (0)
male_df = actual_test[actual_test['Gender'] == 1]
num_of_priviliged = male_df.shape[0]
female_df = actual_test[actual_test['Gender'] == 0]
num_of_unpriviliged = female_df.shape[0]

unpriviliged_outcomes = female_df[female_df['Loan_Status_Actual'] == 1].shape[0]
unpriviliged_ratio = unpriviliged_outcomes/num_of_unpriviliged
unpriviliged_ratio

Output:

0.6

priviliged_outcomes = male_df[male_df['Loan_Status_Actual'] == 1].shape[0]
priviliged_ratio = priviliged_outcomes/num_of_priviliged
priviliged_ratio

Output:

0.7226890756302521

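# Calculating disparate impact
# Note: this uses the actual labels (Loan_Status_Actual), not model predictions,
# even though the printed label says "Predicted"
disparate_impact = unpriviliged_ratio / priviliged_ratio
print("Disparate Impact, Sex vs. Predicted Loan Status: " + str(disparate_impact))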

Output:

Disparate Impact, Sex vs. Predicted Loan Status: 0.8302325581395349
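The manual steps above can be wrapped in a small helper so the same check can be repeated on other columns or on model predictions. This is a sketch: disparate_impact_ratio is a hypothetical name introduced here, not a function from AIF 360, and the toy rows are invented.

```python
import pandas as pd

def disparate_impact_ratio(df, group_col, outcome_col,
                           unprivileged=0, privileged=1, favorable=1):
    """Favorable-outcome rate of the unprivileged group divided by
    the favorable-outcome rate of the privileged group."""
    unpriv = df[df[group_col] == unprivileged]
    priv = df[df[group_col] == privileged]
    unpriv_rate = (unpriv[outcome_col] == favorable).mean()
    priv_rate = (priv[outcome_col] == favorable).mean()
    return unpriv_rate / priv_rate

# Toy data (invented): 3 of 5 females and 4 of 5 males approved
toy = pd.DataFrame({
    'Gender': [0] * 5 + [1] * 5,
    'Loan_Status_Actual': [1, 1, 1, 0, 0, 1, 1, 1, 1, 0],
})
di = disparate_impact_ratio(toy, 'Gender', 'Loan_Status_Actual')
print("Disparate impact:", di)
print("Within the 0.8 to 1.25 range:", 0.8 <= di <= 1.25)
```

On the toy rows the ratio falls below 0.8, whereas the value of about 0.83 computed on the real test set sits just inside the acceptable range.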

Training a model on the original dataset

from sklearn.linear_model import LogisticRegression
# Liblinear is a solver that is very fast for small datasets, like ours
model = LogisticRegression(solver='liblinear', class_weight='balanced')

Fit Into Model:

model.fit(x_train, y_train)

Output:

LogisticRegression(class_weight='balanced', solver='liblinear')
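For context on class_weight='balanced': scikit-learn weights each class by n_samples / (n_classes * count of that class), so the rarer class counts for more during fitting. The sketch below illustrates this with the full-dataset counts reported earlier (208 rejected, 561 approved). The weights the model actually used come from the training split, whose counts are not shown here, so these numbers are only indicative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels rebuilt from the full-dataset counts reported earlier;
# the training split will differ slightly
y = np.array([0] * 208 + [1] * 561)

weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y)

# The same formula written out by hand
manual = len(y) / (2 * np.bincount(y))

print(dict(zip([0, 1], weights.round(3))))
print(np.allclose(weights, manual))
```

With these counts, rejected applications receive a weight well above 1 and approved applications a weight below 1, which offsets the roughly 73/27 imbalance.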

Evaluating performance

# Let's see how well it predicted with a couple values 
y_pred = pd.Series(model.predict(x_test))
y_test = y_test.reset_index(drop=True)
z = pd.concat([y_test, y_pred], axis=1)
z.columns = ['True', 'Prediction']
z.head()
# Predicts 4/5 correctly in this sample

Output:

   True  Prediction
0     1           0
1     1           1
2     0           0
3     0           0
4     0           0


import matplotlib.pyplot as plt
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))

Output:

Accuracy: 0.8311688311688312
Precision: 0.9263157894736842
Recall: 0.822429906542056
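The three scores can be read together through a confusion matrix. The counts below are not printed in the original output. They are inferred from the reported scores on the 154 test rows: precision 88/95, recall 88/107, accuracy 128/154. The sketch rebuilds label arrays with exactly those counts and confirms they reproduce the same numbers.

```python
import numpy as np
from sklearn import metrics

# Counts inferred from the printed scores (not shown in the original output)
tp, fp, fn, tn = 88, 7, 19, 40

# Rebuild label arrays that produce exactly these counts
y_true = np.array([1] * tp + [0] * fp + [1] * fn + [0] * tn)
y_hat = np.array([1] * tp + [1] * fp + [0] * fn + [0] * tn)

# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
print(metrics.confusion_matrix(y_true, y_hat))
print("Accuracy:", metrics.accuracy_score(y_true, y_hat))
print("Precision:", metrics.precision_score(y_true, y_hat))
print("Recall:", metrics.recall_score(y_true, y_hat))
```

Read this way, the model wrongly predicts approval for 7 applications that were actually rejected, and wrongly predicts rejection for 19 that were actually approved.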

RealCode4You
