What is a decision tree?
A decision tree is a map of the possible outcomes of a series of related choices. It allows an individual or organization to weigh possible actions against one another based on their costs, probabilities, and benefits. Decision trees can be used either to drive informal discussion or to map out an algorithm that predicts the best choice mathematically.
A decision tree typically starts with a single node, which branches into possible outcomes. Each of those outcomes leads to additional nodes, which branch off into other possibilities. This gives it a treelike shape.
There are three different types of nodes: chance nodes, decision nodes, and end nodes. A chance node, represented by a circle, shows the probabilities of certain results. A decision node, represented by a square, shows a decision to be made, and an end node shows the final outcome of a decision path.
Decision trees can also be drawn with flowchart symbols, which some people find easier to read and understand.
(Figure: decision tree symbols)
Important Terminology related to Decision Trees
Root Node: It represents the entire population or sample, which further gets divided into two or more homogeneous sets.
Splitting: The process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
Pruning: Removing the sub-nodes of a decision node is called pruning. You can think of it as the opposite of splitting.
Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its children.
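To make the terminology concrete, here is a minimal Python sketch of a tree of nodes; the Node class and its field names are hypothetical and not part of any particular library.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A single node in a decision tree."""
    name: str
    children: List["Node"] = field(default_factory=list)  # empty list => leaf / terminal node

    def split(self, *child_names: str) -> List["Node"]:
        """Splitting: divide this node into two or more sub-nodes (it becomes a decision node)."""
        self.children = [Node(n) for n in child_names]
        return self.children

    def is_leaf(self) -> bool:
        return not self.children

# The root node represents the entire sample; splitting it creates child sub-nodes.
root = Node("all students")
played, did_not_play = root.split("played cricket", "did not play")
print(root.is_leaf(), played.is_leaf())   # False True

# Pruning: removing the sub-nodes of a decision node turns it back into a leaf.
root.children = []
print(root.is_leaf())                     # True
```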
Types of Decision Trees
The type of decision tree depends on the type of target variable. There are two types:
1. Categorical Variable Decision Tree: A decision tree with a categorical target variable is called a categorical variable decision tree. For example, a problem where the target variable is "the student will play cricket or not," i.e. YES or NO.
2. Continuous Variable Decision Tree: A decision tree with a continuous target variable is called a continuous variable decision tree. (A short code sketch contrasting the two types follows.)
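A minimal sketch of the two types using scikit-learn, assuming it is installed; the toy features and targets below are invented purely for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Categorical target (e.g. play cricket: yes/no) -> classification tree.
X = [[25, 0], [30, 1], [18, 0], [35, 1]]           # toy features: [age, gets_coaching]
y_class = ["yes", "no", "yes", "no"]
clf = DecisionTreeClassifier().fit(X, y_class)
print(clf.predict([[22, 0]]))                       # predicted class label

# Continuous target (e.g. a score or price) -> regression tree.
y_reg = [52.0, 31.5, 60.2, 28.9]
reg = DecisionTreeRegressor().fit(X, y_reg)
print(reg.predict([[22, 0]]))                       # predicted numeric value
```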
Assumptions while creating Decision Tree
Some of the assumptions made while building a decision tree:
At the beginning, the whole training set is considered the root.
Feature values are preferred to be categorical. If the values are continuous, they are discretized before the model is built (see the discretization sketch after this list).
Records are distributed recursively on the basis of attribute values.
The order in which attributes are placed as the root or as internal nodes of the tree is determined by a statistical approach.
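As a sketch of the discretization step, assuming scikit-learn is available; the bin count and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Continuous feature values (e.g. ages) discretized into categorical bins.
ages = np.array([[12.0], [23.0], [31.0], [47.0], [55.0], [68.0]])

disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
age_bins = disc.fit_transform(ages)
print(age_bins.ravel())        # bin index for each record, e.g. [0. 0. 1. 1. 2. 2.]
print(disc.bin_edges_[0])      # the learned cut points
```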
Advantages of Decision Tree:
Easy to understand: Decision tree output is very easy to understand, even for people from a non-analytical background. No statistical knowledge is required to read and interpret it, and its graphical representation is intuitive, so users can easily relate it to their hypotheses.
Useful in data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relationships between two or more variables. With the help of decision trees, we can create new variables/features that have better power to predict the target variable. For example, when working on a problem with information spread across hundreds of variables, a decision tree helps identify the most significant ones.
Decision trees implicitly perform variable screening or feature selection.
Decision trees require relatively little effort from users for data preparation.
Less data cleaning required: Decision trees require less data cleaning than some other modeling techniques and are, to a fair degree, not influenced by outliers and missing values.
Data type is not a constraint: Decision trees can handle both numerical and categorical variables, and can also handle multi-output problems.
Non-parametric method: A decision tree is considered a non-parametric method, meaning it makes no assumptions about the spatial distribution of the data or the structure of the classifier.
Disadvantages of Decision Tree:
Overfitting: Decision-tree learners can create over-complex trees that do not generalize the data well; this is called overfitting, one of the most practical difficulties for decision tree models. It can be addressed by setting constraints on model parameters and by pruning (see the sketch after this list).
Not fit for continuous variables: When working with continuous numerical variables, a decision tree loses information when it categorizes variables into different bins.
Decision trees can be unstable, because small variations in the data can result in a completely different tree being generated. This is called variance, which can be lowered by methods such as bagging and boosting.
Greedy algorithms cannot guarantee returning the globally optimal decision tree. This can be mitigated by training multiple trees in which the features and samples are randomly sampled with replacement.
Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the data set before fitting the decision tree.
Information gain in a decision tree with categorical variables gives a biased response for attributes with a greater number of categories.
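A brief sketch of the two remedies mentioned for overfitting (parameter constraints and pruning), using scikit-learn's DecisionTreeClassifier; the specific parameter values are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 1) Constrain the tree while it grows (pre-pruning).
constrained = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# 2) Grow the tree, then prune it back (post-pruning via cost-complexity pruning).
pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(constrained.get_depth(), pruned.get_depth())
```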
How to draw a decision tree
To draw a decision tree, first pick a medium. You can draw it by hand on paper or a whiteboard, or you can use special decision tree software. In either case, here are the steps to follow:
1. Start with the main decision. Draw a small box to represent this point, then draw a line from the box to the right for each possible solution or action. Label them accordingly.
2. Add chance and decision nodes to expand the tree as follows:
If another decision is necessary, draw another box.
If the outcome is uncertain, draw a circle (circles represent chance nodes).
If the problem is solved, leave it blank (for now).
From each decision node, draw possible solutions. From each chance node, draw lines representing possible outcomes. If you intend to analyze your options numerically, include the probability of each outcome and the cost of each action.
3. Continue to expand until every line reaches an endpoint, meaning that there are no more choices to be made or chance outcomes to consider. Then, assign a value to each possible outcome. It could be an abstract score or a financial value. Add triangles to signify endpoints.
With a complete decision tree, you’re now ready to begin analyzing the decision you face.
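Once probabilities and values are filled in, the tree can be analyzed by rolling back from the endpoints: each chance node takes the probability-weighted average of its branches, and each decision node takes the best branch. A small Python sketch with made-up options, probabilities, and payoffs:

```python
# Hypothetical decision: launch a product now vs. run more testing first.
# Each option leads to a chance node with (probability, payoff) outcomes.
options = {
    "launch now": [(0.6, 120_000), (0.4, -40_000)],
    "test first": [(0.8,  90_000), (0.2, -10_000)],
}

def expected_value(outcomes):
    """Roll back a chance node: probability-weighted average of its outcomes."""
    return sum(p * payoff for p, payoff in outcomes)

# Roll back the decision node: pick the option with the highest expected value.
best = max(options, key=lambda name: expected_value(options[name]))
for name, outcomes in options.items():
    print(f"{name}: expected value = {expected_value(outcomes):,.0f}")
print("best choice:", best)
```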
Decision trees in machine learning and data mining
A decision tree can also be used to help build automated predictive models, which have applications in machine learning, data mining, and statistics. Known as decision tree learning, this method takes into account observations about an item to predict that item’s value.
In these decision trees, nodes represent data rather than decisions. This type of tree is also known as a classification tree. Each branch contains a set of attributes, or classification rules, that are associated with a particular class label, which is found at the end of the branch.
These rules, also known as decision rules, can be expressed in an if-then clause, with each decision or data value forming a clause, such that, for instance, “if conditions 1, 2 and 3 are fulfilled, then outcome x will be the result with y certainty.”
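With scikit-learn, the decision rules of a fitted classification tree can be printed as nested if-then clauses via export_text; a short sketch on the built-in iris data (the max_depth value is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each path from the root to a leaf is one decision rule ending in a class label.
print(export_text(clf, feature_names=list(iris.feature_names)))
```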
Each additional piece of data helps the model more accurately predict which of a finite set of values the subject in question belongs to. That information can then be used as an input in a larger decision making model.
Sometimes the predicted variable will be a real number, such as a price. Decision trees with continuous, infinite possible outcomes are called regression trees.
For increased accuracy, multiple trees are sometimes used together in ensemble methods (a short code sketch follows this list):
Bagging creates multiple trees by re-sampling the source data, then has those trees vote to reach consensus.
A Random Forest classifier consists of multiple trees designed to increase the classification rate.
Boosted trees can be used for regression and classification.
The trees in a Rotation Forest are all trained by using PCA (principal component analysis) on a random portion of the data.
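A quick scikit-learn sketch of the first three ensemble ideas (Rotation Forest is not part of scikit-learn, so it is omitted; the estimator counts and the iris data are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "bagging":       BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    "random forest": RandomForestClassifier(n_estimators=50),
    "boosted trees": GradientBoostingClassifier(n_estimators=50),
}
for name, model in models.items():
    # Mean cross-validated accuracy of each ensemble of trees.
    print(name, cross_val_score(model, X, y, cv=5).mean())
```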
A decision tree is considered optimal when it represents the most data with the fewest number of levels or questions. Algorithms designed to create optimized decision trees include CART, ASSISTANT, CLS and ID3/4/5. A decision tree can also be created by building association rules, placing the target variable on the right.
Each method has to determine which is the best way to split the data at each level. Common methods for doing so include measuring the Gini impurity, information gain, and variance reduction.
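In scikit-learn, for example, these measures correspond to the criterion parameter; a minimal sketch (the criterion names shown are those used in recent scikit-learn versions):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

gini_tree    = DecisionTreeClassifier(criterion="gini")           # Gini impurity
entropy_tree = DecisionTreeClassifier(criterion="entropy")        # information gain
var_tree     = DecisionTreeRegressor(criterion="squared_error")   # variance reduction
```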
Using decision trees in machine learning has several advantages:
The cost of using the tree to predict data decreases with each additional data point.
Works for either categorical or numerical data.
Can model problems with multiple outputs.
Uses a white box model (making results easy to explain).
A tree's reliability can be tested and quantified.
Tends to be accurate even if some assumptions about the source data are violated.
But they also have a few disadvantages:
When dealing with categorical data with multiple levels, the information gain is biased in favor of the attributes with the most levels.
Calculations can become complex when dealing with uncertainty and lots of linked outcomes.
Conjunctions between nodes are limited to AND, whereas decision graphs allow for nodes linked by OR.
The choice of splitting algorithm also depends on the type of target variable. The most commonly used algorithms in decision trees are:
Gini Index
The Gini index says that if we select two items from a population at random, they must be of the same class, and the probability of this is 1 if the population is pure.
1. It works with a categorical target variable such as “Success” or “Failure”.
2. It performs only binary splits.
3. The higher the value of Gini, the higher the homogeneity.
4. CART (Classification and Regression Tree) uses the Gini method to create binary splits.
Steps to Calculate Gini for a split
1. Calculate Gini for each sub-node using the formula: the sum of the squared probabilities of success and failure, p² + q².
2. Calculate Gini for the split using the weighted Gini score of each node of that split.
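A small Python sketch of these two steps, implementing the p² + q² score described above; the node counts are made-up examples.

```python
def gini_score(successes, failures):
    """Gini for one sub-node: p^2 + q^2 (higher = more homogeneous)."""
    n = successes + failures
    p, q = successes / n, failures / n
    return p**2 + q**2

def gini_for_split(nodes):
    """Weighted Gini of a split: weight each sub-node's score by its share of records."""
    total = sum(s + f for s, f in nodes)
    return sum((s + f) / total * gini_score(s, f) for s, f in nodes)

# Example splits into two sub-nodes: (successes, failures) in each.
split_a = [(2, 8), (9, 1)]      # fairly pure sub-nodes
split_b = [(6, 4), (5, 5)]      # mixed sub-nodes
print(gini_for_split(split_a))  # ~0.75 -> preferred (higher homogeneity)
print(gini_for_split(split_b))  # ~0.51
```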
Chi-Square
1. It is an algorithm that finds the statistical significance of the differences between the sub-nodes and the parent node. It is measured as the sum of the squares of the standardized differences between the observed and expected frequencies of the target variable.
2. It works with a categorical target variable such as “Success” or “Failure”.
3. It can perform two or more splits.
4. The higher the value of Chi-square, the higher the statistical significance of the differences between the sub-node and the parent node.
5. The Chi-square of each node is calculated using the formula: Chi-square = √((Actual − Expected)² / Expected).
6. It generates a tree called CHAID (Chi-square Automatic Interaction Detector).
Steps to Calculate Chi-square for a split:
1. Calculate the Chi-square for each individual node by calculating the deviation for both Success and Failure.
2. Calculate the Chi-square of the split as the sum of all Chi-square values for Success and Failure across each node of the split.
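A Python sketch of these steps using the formula above; the counts are hypothetical, and the expected frequencies assume the parent node's 50/50 split carries over to each sub-node.

```python
import math

def node_chi_square(actual_success, actual_failure, parent_success_rate):
    """Chi-square of one sub-node: sqrt((actual - expected)^2 / expected),
    summed over the Success and Failure classes."""
    n = actual_success + actual_failure
    expected_success = n * parent_success_rate
    expected_failure = n * (1 - parent_success_rate)
    chi_s = math.sqrt((actual_success - expected_success) ** 2 / expected_success)
    chi_f = math.sqrt((actual_failure - expected_failure) ** 2 / expected_failure)
    return chi_s + chi_f

def split_chi_square(nodes, parent_success_rate=0.5):
    """Chi-square of the split: sum of the Success and Failure chi-squares of every sub-node."""
    return sum(node_chi_square(s, f, parent_success_rate) for s, f in nodes)

# Hypothetical splits: (successes, failures) in each sub-node, parent is 50/50.
print(split_chi_square([(2, 8), (9, 1)]))   # large deviation from 50/50 -> high chi-square
print(split_chi_square([(6, 4), (5, 5)]))   # close to 50/50 -> low chi-square
```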
Information Gain:
A less impure node requires less information to describe it, while a more impure node requires more. Information theory defines this degree of disorganization in a system as entropy. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided (50%-50%), the entropy is one.
Entropy can be calculated using the formula: Entropy = -p log₂(p) - q log₂(q)
Here p and q are the probabilities of success and failure, respectively, in that node. Entropy is also used with a categorical target variable. The algorithm chooses the split that has the lowest entropy compared to the parent node and the other candidate splits. The lower the entropy, the better.
Steps to calculate entropy for a split:
1. Calculate the entropy of the parent node.
2. Calculate the entropy of each individual node of the split, then compute the weighted average over all the sub-nodes in the split.
Information gain is then the entropy of the parent node minus this weighted average entropy of the sub-nodes; the split with the highest information gain is chosen.
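A Python sketch of these steps and of the resulting information gain; the counts are made up.

```python
import math

def entropy(successes, failures):
    """Entropy = -p*log2(p) - q*log2(q); 0 for a pure node, 1 for a 50/50 node."""
    n = successes + failures
    result = 0.0
    for count in (successes, failures):
        if count:                      # by convention, 0 * log2(0) = 0
            p = count / n
            result -= p * math.log2(p)
    return result

def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the sub-nodes."""
    total = sum(s + f for s, f in children)
    weighted = sum((s + f) / total * entropy(s, f) for s, f in children)
    return entropy(*parent) - weighted

# Parent node with 15 successes / 15 failures, split into two sub-nodes.
print(entropy(15, 15))                               # 1.0 (equally divided node)
print(information_gain((15, 15), [(2, 8), (13, 7)]))
```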