
Decision Trees Based Models In Machine Learning | Realcode4you



Introduction

During the last couple of weeks we learned about the typical ML model development process.

In this blog we will explore decision tree based models.


The lab can be run either on your own machine or on Google Colab (a GPU runtime is not required, since scikit-learn's decision trees run on the CPU).


Objective

  • Continue to familiarise yourself with Python and other ML packages.

  • Learn classification decision trees from both categorical and continuous numerical data.

  • Compare the performance of various trees after pruning.

  • Learn regression decision trees and compare these models to the regression models from previous labs.

Dataset

The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed to or not.


Input variables:

  • Bank client data:

    1. age (numeric)

    2. job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student", "blue-collar","self-employed","retired","technician","services")

    3. marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

    4. education (categorical: "unknown","secondary","primary","tertiary")

    5. default: has credit in default? (binary: "yes","no")

    6. balance: average yearly balance, in euros (numeric)

    7. housing: has housing loan? (binary: "yes","no")

    8. loan: has personal loan? (binary: "yes","no")

  • Related to the last contact of the current campaign:

    1. contact: contact communication type (categorical: "unknown","telephone","cellular")

    2. day: last contact day of the month (numeric)

    3. month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

    4. duration: last contact duration, in seconds (numeric)

  • Other attributes:

    1. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

    2. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

    3. previous: number of contacts performed before this campaign and for this client (numeric)

    4. poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")


Output variable (desired target):

y: has the client subscribed to a term deposit? (binary: "yes","no")

This dataset is publicly available for research. The details are described in Moro et al., 2011.

Moro et al., 2011: S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011.

Let's read the data first.


Read Data

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv('./bank-full.csv', delimiter=';')
data.head()

Output:


The dataset contains both categorical and numerical attributes. Let's convert the categorical columns to the categorical data type in pandas.


for col in data.columns:
    if data[col].dtype == object:
        data[col] = data[col].astype('category')

sklearn’s decision tree classifier doesn’t work with categorical attributes directly; it expects numeric input features. The target class, however, must be categorical. So the categorical attributes must be converted into a suitable numeric format. Helpfully, Pandas can do this.


First, split the data into the target class and attributes:

dataY = data['y']
dataX = data.drop(columns='y')

Then use Pandas to generate "numerical" versions of the attributes:

dataXExpand = pd.get_dummies(dataX)
dataXExpand.head()

Output:


As you can see, each category is expanded into its own boolean column (yes/no, that is, 1/0) that can be treated as a continuous numerical value. It’s not ideal, but it allows a correct decision tree to be learned.

☞ Question: Why is it necessary to convert the attributes into boolean representations, rather than just convert them into integer values? What problem would be caused by converting the attributes into integers?
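One way to explore the question above is to compare the two encodings side by side. The sketch below uses a small made-up frame (the `toy` data and its values are illustrative, not taken from the bank dataset) and contrasts `pd.get_dummies` with pandas' integer category codes.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (illustrative values).
toy = pd.DataFrame({
    "marital": pd.Categorical(["married", "single", "divorced", "single"]),
    "age": [41, 29, 55, 33],
})

# One-hot encoding: one 0/1 column per category, with no ordering implied.
one_hot = pd.get_dummies(toy, dtype=int)

# Integer coding: categories become 0, 1, 2. A threshold such as
# "marital <= 0.5" would then treat the categories as if they were ordered.
codes = toy["marital"].cat.codes

print(one_hot)
print(codes.tolist())
```

With integer codes, a tree can only separate categories that happen to be adjacent in the arbitrary (here alphabetical) ordering with a single split, whereas each one-hot column can be tested on its own.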

The target class also needs to be pre-processed. sklearn will treat the target as a category, and here we represent these categories as integers rather than strings. To convert the strings into numbers, the preprocessing.LabelEncoder class from sklearn can be used, as shown below. The two print statements show how to convert in both directions (strings to integers, and vice versa).


from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(dataY)
class_labels = le.inverse_transform([0,1])
dataY = le.transform(dataY)
print(dataY)
print(class_labels)

[0 0 0 ... 1 0 0]
[0 1]
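To see the round trip in isolation, here is a minimal sketch on a short hand-written list of labels (the `labels` list is illustrative, not the real target column). It assumes the encoder is fitted on the original string labels.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["no", "yes", "no", "no", "yes"]  # illustrative target values

encoder = LabelEncoder()
encoder.fit(labels)  # learns the sorted set of distinct labels

encoded = encoder.transform(labels)          # strings -> integers
decoded = encoder.inverse_transform([0, 1])  # integers -> strings

print(encoded)
print(decoded)
```

Note that the mapping back to strings only works as intended when the encoder was fitted on the string labels; fitting it again on already-encoded integers yields integer "class names".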


Setting up the performance (evaluation) metric

There are many performance metrics that apply to this problem such as accuracy_score, f1_score, etc. More information on performance metrics available in sklearn can be found at: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics


The insights gained from exploratory data analysis (EDA) become vital in determining the performance metric. Try to identify the characteristics that are important in making this decision from the EDA results. Use your judgment to pick the best performance measure, and discuss with the lab demonstrator whether the measure you came up with is appropriate.

In this task, I want to give equal importance to all classes. Therefore I will select the macro-averaged f1_score as my performance measure, and I aim to achieve a target value of 75%.


F1-score is NOT the only performance measure that can be used for this problem.
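To make the macro-averaged F1 concrete, the following sketch computes it by hand on a tiny imbalanced example. The helper names (`f1_for_class`, `macro_f1`) and the label vectors are hypothetical and only for illustration; in the lab itself sklearn's `f1_score(..., average='macro')` does this work.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Eight majority-class examples and two minority-class examples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                          # 0.8
print(f1_for_class(y_true, y_pred, 0))   # 0.875
print(f1_for_class(y_true, y_pred, 1))   # 0.5
print(macro_f1(y_true, y_pred))          # 0.6875
```

Because each class contributes equally to the average, weak performance on the minority class pulls the macro F1 well below the accuracy.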


Setup the experiment - data splits

Next, what data should we use to evaluate the performance?

We can generate "simulated" unseen data in several ways:

  1. Hold-out validation

  2. Cross-validation

Let's use hold-out validation for this experiment.
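For contrast with hold-out validation, here is a minimal sketch of how cross-validation partitions the data: every sample is used for validation exactly once across the folds. The `k_fold_indices` helper is hypothetical and does no shuffling; it is only meant to show the idea.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)
```

Hold-out validation corresponds to using just one such train/validation pair, which is cheaper but gives a single, noisier estimate.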


☞ Task: Use the knowledge from the last couple of weeks to split the data appropriately.


from sklearn.model_selection import train_test_split

# First hold out 20% of the data as the test set.
train_data_X_, test_data_X, train_data_y_, test_data_y = train_test_split(
    dataXExpand, dataY, test_size=0.2, shuffle=True, random_state=0)

# Then split the remainder 75/25 into training and validation sets.
train_data_X, val_data_X, train_data_y, val_data_y = train_test_split(
    train_data_X_, train_data_y_, test_size=0.25, shuffle=True, random_state=0)

print(train_data_X.shape, val_data_X.shape, test_data_X.shape)

Output:

(27126, 51) (9042, 51) (9043, 51)
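The printed shapes can be checked with a little arithmetic. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows, and takes the total row count from the sum of the three shapes above.

```python
import math

n_total = 27126 + 9042 + 9043        # rows implied by the printed shapes

n_test = math.ceil(0.2 * n_total)    # first split: 20% held out for testing
n_rest = n_total - n_test

n_val = math.ceil(0.25 * n_rest)     # second split: 25% of the remainder
n_train = n_rest - n_val

print(n_train, n_val, n_test)        # 27126 9042 9043
print(round(n_train / n_total, 2), round(n_val / n_total, 2), round(n_test / n_total, 2))
```

Taking 25% of the remaining 80% gives roughly a 60/20/20 train/validation/test split overall.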


train_X = train_data_X.to_numpy()
train_y = train_data_y

test_X = test_data_X.to_numpy()
test_y = test_data_y

val_X = val_data_X.to_numpy()
val_y = val_data_y

Let's set up a few functions to visualise the results.

(Ignore this section if on AWS) It is likely that you won’t have the graphviz package available, in which case you will need to install it. This can be done through the Anaconda Navigator interface (Environments tab):

  1. Change the drop-down to “All”

  2. Search for the package python-graphviz

  3. Select the python-graphviz package and install it (press “Apply”)

If you can’t install graphviz, don’t worry: you can still complete the lab. Graphviz is nice for seeing the trees that are being learned. However, once the trees become complex, visualising them isn’t practical.


import graphviz
from sklearn import tree  # export_graphviz is used in the function below

def get_tree_2_plot(clf):
    dot_data = tree.export_graphviz(clf, out_file=None, 
                      feature_names=dataXExpand.columns,  
                      class_names=class_labels,  
                      filled=True, rounded=True,  
                      special_characters=True)  
    graph = graphviz.Source(dot_data) 
    return graph

from sklearn.metrics import f1_score

def get_acc_scores(clf, train_X, train_y, val_X, val_y):
    # Despite the name, this returns macro-averaged F1 scores, not accuracy.
    train_pred = clf.predict(train_X)
    val_pred = clf.predict(val_X)
    
    train_acc = f1_score(train_y, train_pred, average='macro')
    val_acc = f1_score(val_y, val_pred, average='macro')
    
    return train_acc, val_acc

Simple decision tree training

Let's train a simple decision tree and visualise it.


from sklearn import tree

tree_max_depth = 2   #change this value and observe

clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=tree_max_depth, class_weight='balanced')
clf = clf.fit(train_X, train_y)
Dtree = get_tree_2_plot(clf)
Dtree
train_acc, val_acc = get_acc_scores(clf,train_X, train_y, val_X, val_y)
print("Train f1 score: {:.3f}".format(train_acc))
print("Validation f1 score: {:.3f}".format(val_acc))

Output:

Train f1 score: 0.552
Validation f1 score: 0.560
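The tree above is grown with criterion='entropy', meaning each split is chosen to reduce the entropy of the class labels as much as possible. The following is a minimal sketch of that calculation on hand-made label lists; the helper names are hypothetical, and it ignores the per-class weighting that class_weight='balanced' adds.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = [0, 0, 0, 0, 1, 1, 1, 1]   # perfectly mixed node: entropy of 1 bit
left = [0, 0, 0, 1]                 # candidate split, left child
right = [0, 1, 1, 1]                # candidate split, right child

print(entropy(parent))
print(information_gain(parent, left, right))
```

A split that produced two pure children would have an information gain equal to the parent's entropy, which is the best possible outcome for that node.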


Hyper parameter tuning

☞ Question: What are the hyper parameters of the DecisionTreeClassifier?

You may decide to tune the important hyper parameters of the decision tree classifier (identified in the above question) to get the best performance. As an example I have selected two hyper parameters: max_depth and min_samples_split.

In this exercise I will be using grid search to tune my parameters. Sklearn has a class called GridSearchCV that uses cross-validation to tune the hyper parameters. Let's use it.

This step may take several minutes depending on the performance of your computer.

from sklearn.model_selection import GridSearchCV

parameters = {'max_depth':np.arange(2,400, 50), 'min_samples_split':np.arange(2,50,5)}

dt_clf = tree.DecisionTreeClassifier(criterion='entropy', class_weight='balanced')
Gridclf = GridSearchCV(dt_clf, parameters, scoring='f1_macro')
Gridclf.fit(train_X, train_y)

Output:

GridSearchCV(estimator=DecisionTreeClassifier(class_weight='balanced', criterion='entropy'), param_grid={'max_depth': array([ 2, 52, 102, 152, 202, 252, 302, 352]), 'min_samples_split': array([ 2, 7, 12, 17, 22, 27, 32, 37, 42, 47])}, scoring='f1_macro')
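To get a feel for why the search takes a while, the grid size can be counted directly. This sketch rebuilds the two value ranges with plain Python `range` objects and assumes 5 cross-validation folds, which is the library default when `cv` is not passed.

```python
from itertools import product

max_depth_values = list(range(2, 400, 50))        # 2, 52, ..., 352
min_samples_split_values = list(range(2, 50, 5))  # 2, 7, ..., 47

combos = list(product(max_depth_values, min_samples_split_values))

n_folds = 5  # assumption: default number of folds when cv is not specified
n_fits = len(combos) * n_folds

print(len(max_depth_values), len(min_samples_split_values))  # 8 10
print(len(combos))                                           # 80
print(n_fits)                                                # 400
```

Each of the 80 parameter combinations is trained once per fold, so a coarser grid or fewer folds reduces the running time proportionally.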


pd.DataFrame(Gridclf.cv_results_)

Output:


Print Score

print(Gridclf.best_score_)
print(Gridclf.best_params_)
clf = Gridclf.best_estimator_

Output:

0.7112886400915267
{'max_depth': 52, 'min_samples_split': 47}



Training Score

train_acc, val_acc = get_acc_scores(clf,train_X, train_y, val_X, val_y)
print("Train f1 score: {:.3f}".format(train_acc))
print("Validation f1 score: {:.3f}".format(val_acc))

Output:

Train f1 score: 0.784
Validation f1 score: 0.724


