Table of contents:
1. What is a Decision Tree
2. Graphical Representation of Decision Tree
3. Terminology in Decision Tree
4. CART algorithm
5. Gini index
6. Entropy
7. Information Gain
8. Reduction in Variance
9. Example of Classification
10. Pruning
11. Decision Tree Regressor
12. How Decision Tree Regressor works
13. Example in Regression
14. Code
What is a Decision Tree:
A Decision Tree is a supervised machine learning algorithm, so it can be used for both classification and regression problems. As the name suggests, a "Decision Tree" is made up of a number of nodes, including a root node, internal nodes, branches, and leaf nodes. The decision tree is widely used in Machine Learning to solve different kinds of problems, and it is the main algorithm behind Ensemble techniques (Random Forest, Gradient Boosting, XGBoost, AdaBoost, …). That means if you learn this, you have covered a base of Machine Learning, because it is used in multiple algorithms. Furthermore, one of the best things about a decision tree is that it provides a graphical representation of its splits after the model is built.
Graphical Representation of Decision Tree:
Terminology in Decision Tree:
There is some important and useful terminology used in Decision Trees.
1. Root Node:
The Root is a node like any other, but it represents the whole tree. It is the starting node of the tree; in the tree above, "Am I Hungry?" is the Root node. It is the node present at the top of a decision tree, and from this node the population starts dividing according to the various features.
2. Leaf Node/ Terminal Node:
This node is the opposite of the Root node: it is the last node in the decision tree, where the tree ends. It has no further nodes and is not segregated any further. Hence, the nodes where further splitting is not possible are called leaf nodes or terminal nodes.
3. Splitting:
This is not a node, but it plays a vital role in building a decision tree because it determines under which condition, and in which direction, we must divide the data. Splitting is used to divide the root/sub-node into different parts under some condition.
4. Branch/ Sub Tree:
It is formed by splitting a tree/node. A branch, as the name suggests, is a sub-part of a node, which might be the Root node or another sub-tree. It may have further branches of its own, and it is itself a branch of a tree/node. Therefore, just as a small portion of a graph is called a sub-graph, a sub-section of a decision tree is called a sub-tree.
5. Pruning:
It is a crucial part of the Decision Tree because it is used to cut unwanted nodes from the tree that might be the reason for overfitting. Pruning has two types: Post-Pruning and Pre-Pruning.
6. Parent Node:
The parent node is just like a root node: it has some properties that are divided further into sub-nodes. Any node that is split into sub-nodes is a parent node, and the root node is the topmost parent of a Decision Tree.
7. Child Node:
The child node is a sub-node of the Parent node; it inherits some properties of its Parent Node and forms a new node. A child can itself become a Parent and have Child Nodes of its own.
CART algorithm:
Decision trees use the CART algorithm (Classification and Regression Tree) because it can handle both continuous and discrete datasets. For classification problems it builds the Decision Tree on the basis of Gini impurity or Entropy. Likewise, if the data is continuous and Regression is required, it uses Variance Reduction to choose the splits. It is a basic algorithm, but it is used widely in Machine Learning, for example in decision trees, Ensemble techniques, and more.
CART is an umbrella that has two different types of use cases:
1. Classification problem: When the target is categorical (binary as well as multi-class classification), the decision tree uses the Gini index (or Entropy) to solve this kind of problem.
2. Regression problem: When the target is continuous, CART uses Variance Reduction to split the tree and make a prediction. (A short scikit-learn sketch of both cases follows below.)
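As a rough illustration of these two use cases, here is a minimal scikit-learn sketch (the toy data below is made up purely for demonstration):

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[1], [2], [3], [4]]                      # one toy feature

# Classification: discrete target, splits scored with the Gini index (or entropy)
clf = DecisionTreeClassifier(criterion='gini')
clf.fit(X, [0, 0, 1, 1])

# Regression: continuous target, splits scored by variance reduction
# (scikit-learn exposes this as the 'squared_error' criterion)
reg = DecisionTreeRegressor(criterion='squared_error')
reg.fit(X, [1.2, 1.9, 3.1, 4.2])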
Gini Index:
The Gini index plays an imperative role in the CART algorithm and in Decision Tree splitting because it measures the impurity of the data while building a Decision Tree. The greater the value of the index, the higher the impurity (inequality) of the node. The Gini index is used to compare candidate splits: for each feature, the tree computes the weighted Gini impurity of the resulting child nodes, and the split that produces the purest children (the lowest weighted impurity) becomes the root node. The sub-nodes are then divided further using the same technique.
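As a small illustration, here is a minimal sketch of the Gini impurity calculation for a single node; the 9 "yes" / 5 "no" class counts come from the worked example later in the article:

import numpy as np

def gini_impurity(labels):
    # Gini = 1 - sum(p_k^2) over the class proportions p_k of the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(['yes'] * 9 + ['no'] * 5))   # ~0.459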
Entropy:
Entropy works the same way as the Gini index in the CART algorithm: it is also used to split the tree by measuring the impurity in the data. Entropy is an information-theory metric that measures the impurity or uncertainty in a group of observations, and it determines how a decision tree chooses to split data. The image below gives a better description of the purity of a set.
The formula for Entropy:
E = −∑ p_i log2(p_i), summed over the N classes i = 1 … N, where p_i is the proportion of class i in the node
or
E = −(p_r log2(p_r) + p_p log2(p_p) + p_y log2(p_y)), written out for a node with three classes r, p, y
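A minimal sketch of the same calculation in code; the 9 "yes" / 5 "no" counts again come from the worked example below:

import numpy as np

def entropy(labels):
    # E = -sum(p_i * log2(p_i)) over the class proportions p_i of the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(['yes'] * 9 + ['no'] * 5))   # ~0.94, matching E(S) in the example below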
Information Gain:
Information gain is the decrease in entropy after a dataset is split on the basis of an attribute. Constructing a Decision Tree is all about finding the attribute that returns the highest information gain. After finding the Entropy or Gini index, the next step is to compute the Information Gain of each attribute and choose the best node for the root and for the further splits in the Decision Tree.
Formula:
Gain = E(parent) − E(children), where E(children) is the weighted average entropy of the child nodes
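Here is a small sketch of how the gain is computed from a parent node and the children produced by a split; the example partition is the Outlook split (sunny/overcast/rainy) from the worked example below:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # Gain = E(parent) - weighted average of E(child), weighted by child size
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ['yes'] * 9 + ['no'] * 5
children = [['yes'] * 2 + ['no'] * 3,   # outlook = sunny
            ['yes'] * 4,                # outlook = overcast
            ['yes'] * 3 + ['no'] * 2]   # outlook = rainy
print(information_gain(parent, children))   # ~0.247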
Reduction in Variance:
It is the criterion used for continuous target variables (Regression problems). The candidate split whose children have the lowest weighted variance, i.e. the largest reduction in variance, is selected to split the population. The CART algorithm uses it to find the best node at which to split the data.
Formula:
First, we need to find the variance before finding the Variance Reduction:
Variance = (1/n) ∑ (y_i − ŷ)^2, summed over the n samples of the node, where ŷ is the mean of the node's target values
Now find the Variance Reduction:
Variance Reduction = Var(root) − ∑ w_i Var(child_i), where w_i is the fraction of samples that falls into child i
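A minimal sketch of this criterion in code (each node's variance is taken around that node's own mean):

import numpy as np

def variance(values):
    # variance of one node: (1/n) * sum((y_i - y_bar)^2), with y_bar the node's mean
    v = np.asarray(values, dtype=float)
    return np.mean((v - v.mean()) ** 2)

def variance_reduction(parent, children):
    # Var(root) - sum of w_i * Var(child_i), with w_i the fraction of samples in child i
    n = len(parent)
    return variance(parent) - sum(len(c) / n * variance(c) for c in children)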
Example:
In the table above, as you can see, there are 5 different parameters. Play is the dependent feature and the remaining 4 are independent; they determine whether the kid plays or not, based on their different values.
Step 1: In this step, we must choose the best Root Node and make the best split into child/sub-nodes. For this purpose, we need to find the Entropy to measure the impurity in the data. Let's apply the formula given below:
Entropy(S) = −p(yes) log2(p(yes)) − p(no) log2(p(no))
Here S is the total sample space, and p(yes) and p(no) are the probabilities of "yes" and "no" occurring.
Let's find the Entropy. There are 9 "yes" and 5 "no" out of 14 total values. So,
E(S) = −(9/14) log2(9/14) − (5/14) log2(5/14)
E(S) = 0.94
Now evaluate the Outlook feature:
Find the Entropy for each value of Outlook,
E(outlook = sunny) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971
E(outlook = overcast) = −(4/4) log2(4/4) = 0, because overcast contains only "Yes"
E(outlook = rainy) = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.971
Information from Outlook:
I(outlook)=5/14(0.971) +4/14(0)+5/14(0.971) = 0.693
Here we take 5/14 because Sunny covers 5 of the 14 rows in the dataset; we multiply that weight by the Entropy of Sunny, and do the same for the remaining values.
Information Gain for Outlook:
Gain(outlook) = E(S) − I(outlook)
= 0.94 − 0.693 = 0.247
Now find the information gain for the remaining features with the same process; the results look like the following:
Gain(Windy) = 0.048 (I(Windy) = 0.892)
Gain(Temperature) = 0.029 (I(Temperature) = 0.911)
Gain(Humidity) = 0.152 (I(Humidity) = 0.788)
So, the Outlook feature gets selected as the root node because it has the largest Information Gain (0.247), greater than that of any other feature.
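To double-check these numbers, here is a short script that recomputes each gain from the (yes, no) counts per feature value. The Outlook counts come from the calculation above; the Temperature, Humidity and Windy counts are assumed from the standard play-tennis table, and they reproduce the gains listed here:

import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

parent = [9, 5]                        # 9 "yes", 5 "no" overall
n = sum(parent)

# (yes, no) counts per feature value; the non-Outlook counts are assumed
features = {
    'Outlook':     [(2, 3), (4, 0), (3, 2)],   # sunny, overcast, rainy
    'Temperature': [(2, 2), (4, 2), (3, 1)],   # hot, mild, cool
    'Humidity':    [(3, 4), (6, 1)],           # high, normal
    'Windy':       [(3, 3), (6, 2)],           # true, false
}

for name, groups in features.items():
    info = sum((yes + no) / n * entropy([yes, no]) for yes, no in groups)
    print(name, round(entropy(parent) - info, 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048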
Therefore, the tree will look like the following:
Pruning
Pruning has two different types, defined below; a short scikit-learn sketch of both approaches follows after the definitions.
Post-Pruning:
The pruning process starts after the Decision Tree has been fully grown: branches that add little value are removed afterwards (for example with cost-complexity pruning). It prevents overfitting and can be used with smaller datasets.
Pre-Pruning:
Constraints such as max-depth and max-features are applied while the Decision Tree is being constructed, so the tree never grows too large. This is useful with large datasets.
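A rough sketch of both approaches in scikit-learn, using the built-in iris data for illustration (the parameter values are arbitrary):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: constrain the tree while it is being grown
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=2).fit(X, y)

# Post-pruning: grow the full tree, then cut it back with cost-complexity pruning
path = DecisionTreeClassifier().cost_complexity_pruning_path(X, y)
post_pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2]).fit(X, y)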
Decision Tree Regressor
How Decision Tree Regressor works:
Regression is all about continuous data: when the dependent feature of a dataset is continuous, the Decision Tree Regressor applies. The Decision Tree Regressor uses Variance Reduction from the CART algorithm to perform regression. It first finds the variance of a node with the following formula:
Variance(node) = (1/n) ∑ (y_i − ŷ)^2, where ŷ is the mean target value of that node
which is then used to calculate the Variance Reduction:
Variance Reduction = Variance(root) − ∑ w_i Variance(child_i)
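A minimal sketch of the regressor in scikit-learn. The salary values are taken from the example below; the years-of-experience values are assumed for illustration, since the original table is not reproduced here:

from sklearn.tree import DecisionTreeRegressor

X = [[2], [2.5], [3], [3.5], [4]]               # assumed years of experience
y = [40_000, 42_000, 52_000, 60_000, 56_000]    # salaries from the example below

reg = DecisionTreeRegressor(criterion='squared_error', max_depth=2)
reg.fit(X, y)
print(reg.predict([[2.2]]))                     # prediction for a new, unseen experience value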
Example:
Firstly, find the mean value of Salary, which is the dependent feature; it will be the ŷ in the formula: (40k + 42k + 52k + 60k + 56k)/5 = 50k. So, the mean value of Salary is 50k.
Now, find the root node using Variance Reduction. Check the candidate split <=2 first, then <=2.5, and so on; the split with the greater variance reduction is selected. Here I will check the first two candidates and pick the better one.
Variance (root) = 1/5 [(40k-50k)^2 + (42k-50k)^2 + (52k-50k)^2 + (60k-50k)^2 + (56k-50k)^2]
Here we use 1/5 because there are 5 samples in total, and we subtract the node's mean value (50k) from each value.
Variance (root) = 60.8k
Now find the variance of each child, using that child's own mean.
For the split <=2:
Variance (child 1 / <=2) = 0, because this child contains only the single value 40k
Variance (child 2 / >2) = 1/4 [(42k-52.5k)^2 + (52k-52.5k)^2 + (60k-52.5k)^2 + (56k-52.5k)^2] = 44.75k (the mean of this child is 52.5k)
Variance reduction (<=2) = 60.8k - (1/5 (0) + 4/5 (44.75k)) = 25k
Now for (<=2.5):
Variance (child 1 / <=2.5) = 1/2 [(40k-41k)^2 + (42k-41k)^2] = 1k (the mean of this child is 41k)
Variance (child 2 / >2.5) = 1/3 [(52k-56k)^2 + (60k-56k)^2 + (56k-56k)^2] = 10.67k (the mean of this child is 56k)
Variance reduction (<=2.5) = 60.8k - (2/5 (1k) + 3/5 (10.67k)) = 54k
VR(<=2) < VR(<=2.5)
This is why the root node, for now, splits at <=2.5, and the Decision Tree is built from that split.
From here, split the Decision Tree further by following the same steps.
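As a quick check of the two candidate splits above, using the salary values (in thousands) from the example:

import numpy as np

def variance(values):
    v = np.asarray(values, dtype=float)
    return np.mean((v - v.mean()) ** 2)

salaries = np.array([40, 42, 52, 60, 56], dtype=float)    # in thousands
root_var = variance(salaries)                             # 60.8, as computed above

def reduction(left, right):
    n = len(left) + len(right)
    return root_var - (len(left) / n * variance(left) + len(right) / n * variance(right))

print(reduction(salaries[:1], salaries[1:]))   # split <= 2   -> 25
print(reduction(salaries[:2], salaries[2:]))   # split <= 2.5 -> ~54, larger, so it is chosen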
Code:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('/content/Iris.csv')
df.drop(['Id'], axis=1, inplace=True)

#Split data and preprocessing:
x = df.iloc[:, 0:4]              # the four measurement columns
y = df.iloc[:, -1]               # the species column (target)

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)          # encode the species names as integers

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

#Model Building:
from sklearn import tree
clf2 = tree.DecisionTreeClassifier(criterion='gini', min_samples_leaf=2, max_depth=3)
clf2.fit(x_train, y_train)
print(clf2.score(x_test, y_test))   # accuracy on the held-out test set

#Visualize the fitted tree:
plt.figure(figsize=(15, 10))
tree.plot_tree(clf2, feature_names=list(x.columns), filled=True)
plt.show()