Context
Banks incur significant losses due to loan defaults. This has led to tighter loan underwriting and higher loan rejection rates, and it has also raised the need for better credit risk scoring models.
The CNK bank has collected customer data for the past few years and wants to build a model to predict whether a customer applying for a loan is a good customer (will not default) or a bad customer (will default).
Reference: Great Learning
Data Dictionary
month - the month of purchase
credit_amount - the amount for which the loan is requested
credit_term - the duration for which the customer wants the loan
age - age of the customer
sex - gender of the customer
education - education level of customer
product_type - the type of product for which the customer needs the loan (0, 1, 2, 3, 4)
having_children_flg - whether the customer has children or not
region - customer region category (0, 1, 2)
income - income of the customer
family_status - family status of the customer (another, married, unmarried)
phone_operator - mobile operator category (0, 1, 2, 3)
is_client - whether the customer applying for the loan is already a client of the bank or not
target - 1: bad customer (will default), 0: good customer (will not default)
Import Libraries
# To help with reading and manipulation of data
import numpy as np
import pandas as pd
# To help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To split the data
from sklearn.model_selection import train_test_split
# To impute missing values
from sklearn.impute import SimpleImputer
# To build a Random forest classifier
from sklearn.ensemble import RandomForestClassifier
# To tune a model
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
# To get different performance metrics
import sklearn.metrics as metrics
from sklearn.metrics import (
classification_report,
confusion_matrix,
recall_score,
accuracy_score,
precision_score,
f1_score,
)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
Load and view dataset
df = pd.read_csv("Loanclients.csv")
data = df.copy()
data.head()
output:
data.info()
output:
# checking missing values in the data
data.isna().sum()
output:
The income variable has some missing values; we will impute them later.
data["region"] = data["region"].astype("category")
data["phone_operator"] = data["phone_operator"].astype("category")
data["product_type"] = data["product_type"].astype("category")
# checking the distribution of the target variable
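# value_counts(1) passes normalize=True positionally, so proportions are shown instead of raw counts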
data["target"].value_counts(1)
output:
Splitting the data into X and y
# separating the independent and dependent variables
X = data.drop(["target"], axis=1)
y = data["target"]
# creating dummy variables
X = pd.get_dummies(X, drop_first=True)
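As a quick illustration (a toy example, not part of the pipeline), drop_first=True keeps k-1 dummy columns per categorical variable, dropping one redundant level:
# toy frame: 'another' is the dropped baseline level, so only two dummy columns remain
toy = pd.DataFrame({"family_status": ["married", "unmarried", "another"]})
print(pd.get_dummies(toy, drop_first=True))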
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=5, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.2, random_state=5, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
output:
(640, 30) (160, 30) (200, 30)
# Let's impute the missing values
imp_median = SimpleImputer(missing_values=np.nan, strategy="median")
# fit the imputer on train data and transform the train data
X_train["income"] = imp_median.fit_transform(X_train[["income"]])
# transform the validation and test data using the imputer fit on train data
X_val["income"] = imp_median.transform(X_val[["income"]])
X_test["income"] = imp_median.transform(X_test[["income"]])
# Checking class balance for whole data, train set, validation set, and test set
print("Target value ratio in y")
print(y.value_counts(1))
print("*" * 80)
print("Target value ratio in y_train")
print(y_train.value_counts(1))
print("*" * 80)
print("Target value ratio in y_val")
print(y_val.value_counts(1))
print("*" * 80)
print("Target value ratio in y_test")
print(y_test.value_counts(1))
print("*" * 80)
output:
Model evaluation criterion
What does a bank want?
A bank wants to minimize its losses - it can face two types of losses here:
The bank lends money to a customer and the customer does not pay it back - a default loss.
The bank does not lend money to a customer, thinking the customer will default, when in reality the customer would have repaid - an opportunity loss.
Which loss is greater?
Lending to a customer who wouldn't be able to pay back.
Since we want to reduce loan defaults, we should use Recall as the metric of model evaluation instead of accuracy.
Recall - the ratio of true positives to actual positives; a high recall implies few false negatives, i.e., a low chance of predicting a bad customer as a good customer.
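As a small illustration (a toy example, not from the dataset, with 1 = bad customer as in the target column), recall measures how many of the actual bad customers the model catches:
# toy example: 3 actual bad customers, the model catches 2 of them, so recall = 2/3
y_true_toy = [1, 1, 1, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1]
print(recall_score(y_true_toy, y_pred_toy))  # ~0.67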
Hyperparameter Tuning
Let's first build a model with default parameters and see its performance
# model without hyperparameter tuning
rf = RandomForestClassifier(random_state=1)
rf.fit(X_train, y_train)
Let's check the model's performance
# Checking recall score on train and validation set
print("Recall on train and validation set")
print(recall_score(y_train, rf.predict(X_train)))
print(recall_score(y_val, rf.predict(X_val)))
print("")
# Checking Precision score on train and validation set
print("Precision on train and validation set")
print(precision_score(y_train, rf.predict(X_train)))
print(precision_score(y_val, rf.predict(X_val)))
print("")
# Checking Accuracy score on train and validation set
print("Accuracy on train and validation set")
print(accuracy_score(y_train, rf.predict(X_train)))
print(accuracy_score(y_val, rf.predict(X_val)))
output:
The model is performing well on the train data, but the performance on the validation data is very poor - the model is overfitting.
Let's see if we can improve it with hyperparameter tuning.
Grid Search CV
Hyperparameter tuning is also tricky in the sense that there is no direct way to calculate how a change in a hyperparameter value will affect the loss of the model, so we usually resort to experimentation, i.e., we'll use Grid Search.
Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.
It is an exhaustive search performed over the specified parameter values of a model.
The parameters of the estimator are optimized by cross-validated grid search over a parameter grid.
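For intuition (a small illustrative sketch with a hypothetical two-parameter grid, not the grid used later), the number of model fits in a grid search is the product of the grid sizes times the number of CV folds:
# hypothetical 3 x 5 grid with 5-fold CV -> 3 * 5 * 5 = 75 model fits
n_estimators_options = [150, 200, 250]
min_samples_leaf_options = np.arange(5, 10)
cv_folds = 5
print(len(n_estimators_options) * len(min_samples_leaf_options) * cv_folds)  # 75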
How to know the hyperparameters available for an algorithm?
RandomForestClassifier().get_params()
output:
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
We can see the names of the available hyperparameters and their default values.
We can choose which ones to tune.
# value ranges like these will be used in the parameter grid below
print(np.arange(0.2, 0.7, 0.1))
print(np.arange(5, 10))
output:
[0.2 0.3 0.4 0.5 0.6] [5 6 7 8 9]
Let's tune Random forest using Grid Search
%%time
# Choose the type of classifier.
rf1 = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {"n_estimators": [150,200,250],
"min_samples_leaf": np.arange(5, 10),
"max_features": np.arange(0.2, 0.7, 0.1),
"max_samples": np.arange(0.3, 0.7, 0.1),
"class_weight" : ['balanced', 'balanced_subsample'],
"max_depth":np.arange(3,4,5),
"min_impurity_decrease":[0.001, 0.002, 0.003]
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(rf1, parameters, scoring=acc_scorer, cv=5, n_jobs= -1, verbose = 2)
# verbose = 2 reports the number of fits, which gives an idea of how long the tuning will take
# n_jobs = -1 uses all CPU cores in parallel to speed up the search
grid_obj = grid_obj.fit(X_train, y_train)
# Print the best combination of parameters
grid_obj.best_params_
output:
Let's check the best CV score for the obtained parameters
grid_obj.best_score_
output:
0.7692307692307692
Let's build a model with the obtained best parameters
We are hard coding the hyperparameters separately so that we don't have to run the grid search again.
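Alternatively (a minimal sketch, assuming grid_obj from the search above is still in memory; rf1_tuned_alt is just an illustrative name), the best parameters can be unpacked directly:
# equivalent to the hard-coded cell below, as long as grid_obj is still available
rf1_tuned_alt = RandomForestClassifier(**grid_obj.best_params_, random_state=1)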
# Set the clf to the best combination of parameters
rf1_tuned = RandomForestClassifier(
class_weight="balanced",
max_features=0.2,
max_samples=0.6000000000000001,
min_samples_leaf=5,
n_estimators=150,
max_depth=3,
random_state=1,
min_impurity_decrease=0.001,
)
# Fit the best algorithm to the data.
rf1_tuned.fit(X_train, y_train)
output:
RandomForestClassifier(class_weight='balanced', max_depth=3, max_features=0.2, max_samples=0.6000000000000001, min_impurity_decrease=0.001, min_samples_leaf=5, n_estimators=150, random_state=1)
Let's check the model's performance
# Checking recall score on train and validation set
print("Recall on train and validation set")
print(recall_score(y_train, rf1_tuned.predict(X_train)))
print(recall_score(y_val, rf1_tuned.predict(X_val)))
print("")
# Checking precision score on train and validation set
print("Precision on train and validation set")
print(precision_score(y_train, rf1_tuned.predict(X_train)))
print(precision_score(y_val, rf1_tuned.predict(X_val)))
print("")
# Checking accuracy score on train and validation set
print("Accuracy on train and validation set")
print(accuracy_score(y_train, rf1_tuned.predict(X_train)))
print(accuracy_score(y_val, rf1_tuned.predict(X_val)))
output:
We can see an improvement in validation performance compared to the model without hyperparameter tuning.
Recall on both the training and validation sets is good, and is 88% on the validation set.
Randomized Search CV
Random search is a tuning technique that, unlike grid search, tries random combinations of hyperparameter values rather than exhaustively searching the entire grid.
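For context (a small illustrative sketch that just counts the grid defined in the tuning cell below), the full parameter grid contains far more combinations than the 30 that the randomized search will sample:
# size of the full search space for the grid defined below
n_combinations = (
    len([150, 200, 250])
    * len(np.arange(5, 10))
    * len(np.arange(0.2, 0.7, 0.1))
    * len(np.arange(0.3, 0.7, 0.1))
    * len(np.arange(3, 4, 5))
    * len(["balanced", "balanced_subsample"])
    * len([0.001, 0.002, 0.003])
)
print(n_combinations)  # far more than the 30 combinations sampled by the randomized search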
Let's tune Random forest using Randomized Search
%%time
# Choose the type of classifier.
rf2 = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {"n_estimators": [150,200,250],
"min_samples_leaf": np.arange(5, 10),
"max_features": np.arange(0.2, 0.7, 0.1),
"max_samples": np.arange(0.3, 0.7, 0.1),
"max_depth":np.arange(3,4,5),
"class_weight" : ['balanced', 'balanced_subsample'],
"min_impurity_decrease":[0.001, 0.002, 0.003]
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the random search
grid_obj = RandomizedSearchCV(rf2, parameters,n_iter=30, scoring=acc_scorer,cv=5, random_state = 1, n_jobs = -1, verbose = 2)
# using n_iter = 30, so randomized search will try 30 different combinations of hyperparameters
# by default, n_iter = 10
grid_obj = grid_obj.fit(X_train, y_train)
# Print the best combination of parameters
grid_obj.best_params_
output:
Let's check the best CV score for the obtained parameters
grid_obj.best_score_
output:
0.7538461538461538
Let's build a model with the obtained best parameters
# Set the clf to the best combination of parameters
rf2_tuned = RandomForestClassifier(
class_weight="balanced",
max_features=0.2,
max_samples=0.5,
min_samples_leaf=5,
n_estimators=150,
random_state=1,
max_depth=3,
min_impurity_decrease=0.003,
)
# Fit the best algorithm to the data.
rf2_tuned.fit(X_train, y_train)
output:
RandomForestClassifier(class_weight='balanced', max_depth=3, max_features=0.2, max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=5, n_estimators=150, random_state=1)
Different results from the grid and the random search
Randomized search tries only a random subset of the parameter grid, so it can land on a different best combination than grid search, and it might even give better results. In addition, cross-validation scores vary from fold to fold, which also contributes to the difference.
Let's check the model's performance
# Checking recall score on train and validation set
print("Recall on train and validation set")
print(recall_score(y_train, rf2_tuned.predict(X_train)))
print(recall_score(y_val, rf2_tuned.predict(X_val)))
print("")
print("Precision on train and validation set")
# Checking precision score on train and validation set
print(precision_score(y_train, rf2_tuned.predict(X_train)))
print(precision_score(y_val, rf2_tuned.predict(X_val)))
print("")
print("Accuracy on train and validation set")
# Checking accuracy score on train and validation set
print(accuracy_score(y_train, rf2_tuned.predict(X_train)))
print(accuracy_score(y_val, rf2_tuned.predict(X_val)))
output:
The model is performing better than the model with default parameters, and its performance is similar to that of the model obtained with grid search.
Choose the best model and check its performance on the test set
model = rf1_tuned
# Checking recall score on test set
print("Recall on test set")
print(recall_score(y_test, model.predict(X_test)))
print("")
# Checking precision score on test set
print("Precision on test set")
print(precision_score(y_test, model.predict(X_test)))
print("")
# Checking accuracy score on test set
print("Accuracy on test set")
print(accuracy_score(y_test, model.predict(X_test)))
output:
The performance is close to what we observed on the validation set, so the model is not overfitting.
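For a fuller picture (an optional sketch using the classification_report and confusion_matrix functions imported earlier), we can also look at the confusion matrix and per-class metrics on the test set:
# confusion matrix and per-class precision/recall/f1 on the test set
print(confusion_matrix(y_test, model.predict(X_test)))
print(classification_report(y_test, model.predict(X_test)))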