Classification and Regression
Decision Trees can be used for both.
Classification
Spam / not Spam
Admit to ICU / not
Lend money / deny
Intrusion detection
Regression
Predict stock returns
Pricing a house or a car
Weather predictions (temperature, rainfall, etc.)
Economic growth predictions
Predicting sports scores
Decision Trees
The general idea is that we will segment the space into a number of simple regions.
The segmentation can be illustrated as a tree
The end nodes can have a category (classification) or a continuous number (regression)
These methods, while quite simple, are very powerful.
Visualizing Classification as a Tree
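As a rough illustration (a minimal sketch assuming scikit-learn and its built-in Iris dataset, neither of which is specified in these notes), a small fitted classification tree can be printed as text, with each split marking the boundary of a simple region and each leaf holding a predicted class:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printout readable; each leaf ends in a predicted class.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Every split below is a boundary of a simple rectangular region of the feature space.
print(export_text(tree, feature_names=iris.feature_names))
```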
Metrics
Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items.
Different algorithms use different metrics for measuring "best"
These metrics measure how similar the data within a region or node is. They are said to measure the impurity of a region.
The larger the impurity metric, the greater the dissimilarity of the data in a node/region.
Examples: Gini impurity, Entropy, Variance
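As a quick illustration (a hand-rolled sketch in Python/NumPy with made-up toy data, not part of the original notes), each measure can be computed for a single node as follows:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum of p * log2(p) over the class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

node = ["spam", "spam", "spam", "not_spam"]   # toy classification node
print(gini(node))      # 0.375
print(entropy(node))   # about 0.811

values = [200.0, 210.0, 190.0, 400.0]         # toy regression node
print(np.var(values))  # variance serves as the impurity for regression
```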
Algorithms for building Decision Trees
Popular ones include
CART (Classification And Regression Tree)
CHAID (CHi-squared Automatic Interaction Detector)
C4.5
We will focus on CART, which uses the Gini impurity as its impurity measure.
Gini impurity
Used by the CART algorithm
It is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset.
For a node with class proportions p_1, ..., p_K, the Gini impurity is 1 - (p_1^2 + ... + p_K^2).
Splitting using Gini impurity
When splitting, the Gini impurities of the two resulting nodes are combined using a weighted average, with the weights being the fraction of the data in each node.
The CART algorithm simply chooses the right "split" by finding the split that maximizes the decrease in Gini impurity, also called the Gini Gain.
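A small sketch of this scoring (the toy labels and the candidate split are invented for illustration; this is not CART's full search over every feature and threshold):

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(parent, left, right):
    # Gini Gain: parent impurity minus the weighted average of the children's
    # impurities, weighted by the fraction of data in each child node.
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = [1, 1, 1, 1, 0, 0, 0, 0]     # labels reaching the node
left   = [1, 1, 1, 0]                 # one candidate split
right  = [1, 0, 0, 0]
print(gini_gain(parent, left, right))  # CART picks the split with the largest gain
```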
Decision trees are prone to ‘overfitting’
The Decision Tree is a powerful algorithm that can adapt well and capture various patterns in the data
If allowed to grow fully, trees become overly complex and tend to fit even the noise
Thus, a fully grown tree may not 'generalize' well to test data or new, unseen data
Pruning
Ideally, we would like a tree that does not overfit the given data
One popular and simple way to prune a decision tree is to limit its depth, which helps avoid overfitting.
For example, the tree on the right below is generated with a max depth of 2, while the tree on the left has no depth restriction (and hence overfits the data)
Pre-Pruning
Stop growing the tree before it grows too big
This can be achieved by bounding hyperparameters (a sketch follows the list below)
Hyperparameters in Decision Trees
- max_depth - The maximum depth of the tree. If set to 'None', nodes are expanded until all leaves are pure. The higher the value, the more complex the tree
- min_samples_split - The minimum number of samples required to split a node. Nodes with fewer samples than this are not split. The higher the value, the less complex the tree
- min_samples_leaf - The minimum number of samples required at a leaf node. All leaf nodes have at least this many data points. The higher the value, the less complex the tree
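A minimal sketch of pre-pruning with these hyperparameters, assuming scikit-learn's DecisionTreeClassifier and its built-in breast-cancer dataset (the specific values chosen here are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pruned = DecisionTreeClassifier(
    max_depth=3,            # never grow deeper than 3 levels
    min_samples_split=20,   # do not split nodes with fewer than 20 samples
    min_samples_leaf=10,    # every leaf must keep at least 10 samples
    random_state=0,
).fit(X_train, y_train)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The fully grown tree typically scores higher on the training data but can
# generalize worse than the pre-pruned tree on the held-out test data.
print("full  :", full.score(X_train, y_train), full.score(X_test, y_test))
print("pruned:", pruned.score(X_train, y_train), pruned.score(X_test, y_test))
```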
Hyperparameter Tuning using Grid Search
Grid Search is the process of searching for the best combination of hyperparameters from a predefined set of values
A parameter grid (Hyperparameters and corresponding values) is provided as an input to the Grid-search function
It tries all the combinations of the values passed and evaluates the model for each combination
It returns the combination of hyperparameter values that works best as per the metric provided for model evaluation
GridSearchCV() is an implementation of Grid Search with Cross Validation.
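A sketch of such a search, assuming scikit-learn's GridSearchCV and the same breast-cancer dataset as above; the parameter grid is an arbitrary example, not a recommendation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}

# Every combination in the grid is fit and evaluated with 5-fold cross validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="accuracy",   # the metric used to pick the winning combination
    cv=5,
)
search.fit(X, y)

print(search.best_params_)   # best combination of hyperparameter values found
print(search.best_score_)    # its mean cross-validation score
```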
Cross Validation
- Cross Validation is a common Machine Learning technique that splits the data into n non-overlapping groups and runs n experiments:
In each experiment, n-1 groups are used to train a model, and the model is tested on the left-out group.
The results are summarized over the n experiments.
- It gives a mechanism that allows us to test a model repeatedly on data that was not used to build the model.
- For Decision Trees, a very common approach is simply to choose the tree with minimum cross validation error.
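A small sketch of n-fold cross validation on its own (n = 5 here), again assuming scikit-learn:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 5 non-overlapping folds, 5 train/test experiments, each tested on the left-out fold.
scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                         X, y, cv=5)
print(scores)         # one accuracy score per left-out fold
print(scores.mean())  # results summarized over the 5 experiments
```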
Post-Pruning: Cost-complexity pruning
- Starting from the full tree, create a sequence of sequentially smaller (pruned) trees
- At each step the algorithm
tries removing each possible subtree,
finds the 'relative error decrease per node' for that subtree (the complexity parameter),
and removes the subtree with the minimum value
- With the resulting list of subtrees, one usually reverts to using cross-validation errors to find the best final pruned tree
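One way to sketch this in practice, assuming scikit-learn's cost_complexity_pruning_path and its ccp_alpha parameter (a close relative of, though not necessarily identical to, the procedure described above): the pruning path yields the sequence of subtrees, and cross validation picks the one that generalizes best.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Effective alphas at which subtrees are pruned away from the full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas[:-1]   # drop the last alpha, which prunes down to the root

# Cross-validate the tree obtained at each pruning level.
cv_scores = [
    cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
    for a in ccp_alphas
]

# Keep the pruned tree with the minimum cross-validation error (maximum CV score).
best_alpha = ccp_alphas[int(np.argmax(cv_scores))]
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(best_alpha, pruned_tree.get_n_leaves())
```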
Few Additional Thoughts
Other impurity measures
Regression Trees
Pros and Cons
Impurity Measures in Decision Trees
Regression Trees
Pros and Cons of Decision Trees
Pros -
Easy to understand and interpret
Useful in data exploration, as it splits on the most significant variables
Not influenced by outliers or null values, and hence requires less data cleaning; requires less time and effort during data pre-processing than other algorithms
Can handle both continuous and categorical variables
Does not require any underlying assumptions about the data. Works with both linearly and nonlinearly related variables.
Cons -
A small change in the dataset can result in a large change in the structure of the decision tree, causing instability in the model.
Large trees can be difficult to interpret.
Tends to overfit.
Comments