Decision Tree Classifier and Regressor with Example

Usman Ali
9 min read · Apr 19, 2023


Table of Contents:

1. What is a Decision Tree

2. Graphical Representation of Decision Tree

3. Terminology in Decision Tree

4. CART algorithm

5. Gini index

6. Entropy

7. Information Gain

8. Reduction in Variance

9. Example of Classification

10. Pruning

11. Decision Tree Regressor

12. How Decision Tree Regressor works

13. Example in Regression

14. Code

What is a Decision Tree:

A Decision Tree is a supervised machine learning algorithm that can be used for both classification and regression problems. As the name suggests, a Decision Tree is made up of a number of nodes, including a root node, internal nodes, branches, and leaf nodes. Decision trees are widely used in Machine Learning to solve different kinds of problems, and they are the base learners used in ensemble techniques (Random Forest, Gradient Boosting, XGBoost, AdaBoost, …). That means that if you learn this algorithm, you have covered a large part of the basics of Machine Learning, because it appears inside so many other algorithms. Furthermore, one of the best things about a decision tree is that it provides a graphical representation of its splits after the model has been built.

Graphical Representation of Decision Tree:

Graph of Decision Tree

Terminology in Decision Tree:

There is some important and useful terminology used in decision trees.

1. Root Node:

The root is a node like any other, but it represents the whole population at the top of the tree. It is the starting node of the tree; in the tree above, "Am I Hungry?" is the root node. So, the root is the node present at the beginning of a decision tree, and from this node the population starts dividing according to various features.

2. Leaf Node/ Terminal Node:

This node is the opposite of the root node because it is where the tree ends: it has no further nodes and is not split any further. Hence, the nodes where further splitting is not possible are called leaf nodes or terminal nodes.

3. Splitting:

Splitting is not a node, but it plays a vital role in building a decision tree, because it is what tells us under which condition, and in which direction, the data must be divided. Splitting divides the root node or a sub-node into different parts under some condition.

4. Branch/ Sub Tree:

A branch is formed by splitting a node. As the name suggests, it is a sub-part of a node, which might be the root node or any other node, together with everything below it. Just as a small portion of a graph is called a sub-graph, a sub-section of a decision tree is called a sub-tree.

5. Pruning:

Pruning is a crucial part of working with Decision Trees because it is used to cut unwanted nodes from the tree that might be the reason for overfitting. Pruning has two types: pre-pruning and post-pruning.

6. Parent Node:

A parent node is any node that is split further into sub-nodes. The root node is the topmost parent node of a Decision Tree.

7. Child Node:

A child node is a sub-node of a parent node: it receives a portion of the data of its parent node and forms a new node. A child node can itself become a parent and have child nodes of its own, as shown in the sketch after this list.
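To make these terms concrete, here is a minimal, illustrative sketch of how a decision tree node could be represented in Python. This is not scikit-learn's internal structure; the class and attribute names are only assumptions for illustration.

# Minimal, illustrative node structure (not scikit-learn's internal representation).
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # feature used for the split (internal/parent nodes)
        self.threshold = threshold  # split condition: go left if feature <= threshold
        self.left = left            # child node for samples satisfying the condition
        self.right = right          # child node for the remaining samples
        self.value = value          # prediction stored in a leaf/terminal node

    def is_leaf(self):
        return self.value is not None  # leaf nodes carry a prediction and no children

# The root is simply the topmost Node; a sub-tree is any Node together with everything below it.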

CART algorithm:

Decision trees use the CART algorithm (Classification and Regression Trees) because it can handle both discrete and continuous data. For classification, CART builds the Decision Tree on the basis of Gini impurity or entropy. Likewise, if the data is continuous and regression is required, it uses Variance Reduction to solve the problem. It is a basic algorithm, but its use in Machine Learning is extensive, for example in decision trees, ensemble techniques, and more.

CART is an umbrella that has two different types of use cases:

1. Classification problem: When the target is categorical (binary classification or multi-class classification), the decision tree uses the Gini index (or entropy) to solve this kind of problem.

2. Regression problem: When the target is continuous, CART uses Variance Reduction to split the tree and make a prediction, as sketched below.
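As a quick illustration of these two use cases, here is a minimal scikit-learn sketch. The tiny datasets are made up purely for demonstration; criterion='squared_error' is the variance-based splitting criterion (it was called 'mse' in older scikit-learn versions).

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[1], [2], [3], [4], [5], [6]]   # toy, made-up feature values

# Classification use case: discrete target, Gini index as the splitting criterion.
y_class = [0, 0, 0, 1, 1, 1]
clf = DecisionTreeClassifier(criterion='gini').fit(X, y_class)
print(clf.predict([[2.5]]))          # predicted class for x = 2.5

# Regression use case: continuous target, variance-based splitting.
y_reg = [40, 42, 52, 60, 56, 58]     # toy, made-up target values
reg = DecisionTreeRegressor(criterion='squared_error').fit(X, y_reg)
print(reg.predict([[2.5]]))          # predicted value for x = 2.5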

Gini Index:

The Gini index plays an imperative role in the CART algorithm and in Decision Tree splitting because it measures the impurity in the data while building a Decision Tree. The greater the value of the index, the higher the impurity (inequality) in the node. The Gini index is what drives the splitting: for every candidate split, the weighted Gini impurity of the resulting children is computed, and the split that produces the lowest impurity (the purest children) is chosen. The feature whose best split gives the lowest Gini impurity becomes the root node, and the same technique is then applied to each child to keep dividing the tree.
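A common way to compute the Gini impurity of a node is $1 - \sum_i p_i^2$, where $p_i$ is the proportion of class $i$ in the node. A minimal sketch of that calculation (the helper name is my own choice for illustration):

from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(['yes', 'yes', 'yes']))        # 0.0 -> a pure node
print(gini_impurity(['yes', 'no', 'yes', 'no']))   # 0.5 -> maximum impurity for two classes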

Entropy:

Entropy works the same way as the Gini index in the CART algorithm: it is also used to decide how to split the tree, by measuring the impurity in the data. Entropy is an information-theory metric that measures the impurity or uncertainty in a group of observations, and it determines how a decision tree chooses to split data. The image below gives a better description of the purity of a set.

Gini/Entropy impurity

The formula for Entropy:

$E = -\sum_{i=1}^{N} p_i \log_2 p_i$, where $N$ is the number of classes and $p_i$ is the proportion of class $i$ in the node,

or

$E = -\left(p_r \log_2 p_r + p_p \log_2 p_p + p_y \log_2 p_y\right)$, written out for three classes.
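As a minimal sketch, this formula translates directly into Python (the helper name entropy is my own choice for illustration):

import math
from collections import Counter

def entropy(labels):
    """Entropy in bits: -sum(p_i * log2(p_i)) over the classes present in the node."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

print(entropy(['yes'] * 9 + ['no'] * 5))   # about 0.94, matching the worked example below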

Information Gain:

Information gain is the decrease in entropy after a dataset is split on the basis of an attribute. Constructing a Decision Tree is all about finding the attribute that returns the highest information gain. So after computing the entropy (or Gini index), the next step is to compute the information gain in order to choose the best attribute for the root node and for every further split in the Decision Tree.

Formula:

$\text{Gain} = E_{\text{parent}} - E_{\text{children}}$

where $E_{\text{children}}$ is the weighted average entropy of the child nodes (written as $I(\text{feature})$ in the example below).
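A minimal sketch of this formula, assuming the entropy helper from the previous sketch is already in scope:

def information_gain(parent_labels, children_labels):
    """Gain = E(parent) - weighted average entropy of the children.

    Assumes the entropy() helper from the previous sketch is in scope.
    """
    n = len(parent_labels)
    weighted_children = sum(len(child) / n * entropy(child) for child in children_labels)
    return entropy(parent_labels) - weighted_children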

Reduction in Variance:

Reduction in Variance is the criterion used for continuous target variables (regression problems). For every candidate split, the variance of the resulting children is computed, and the split that reduces the variance the most (i.e. produces children with the lowest weighted variance) is selected to split the population. It is the part of the CART algorithm that finds the best split for a regression problem.

Formula:

First, we need to find the variance of a node before finding the Variance Reduction:

$\text{Variance} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2$

where $n$ is the number of samples in the node and $\bar{y}$ is the mean of the target values in that node. Now find the Variance Reduction:

$\text{Variance Reduction} = \text{Var(root)} - \sum_{i} w_i \, \text{Var(child}_i)$

where $w_i$ is the fraction of the root's samples that falls into child $i$ (the same formula applies at every parent node).
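A minimal sketch of these two formulas (the helper names are my own):

def variance(values):
    """Variance of the target values in a node, around that node's own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent_values, children_values):
    """Var(parent) minus the weighted average variance of the children."""
    n = len(parent_values)
    weighted = sum(len(child) / n * variance(child) for child in children_values)
    return variance(parent_values) - weighted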

Example:

Play or Not table

In the table above there are 5 different columns. Play is the dependent feature, and the remaining four (Outlook, Temperature, Humidity, Windy) are independent. The values of those four features tell us whether the kid will play or not.

Step 1: In this step, we must choose the best root node and make the best split into child/sub-nodes. For this purpose, we need to find the entropy, which measures the impurity in the data. Let's apply the formula given below:

$\text{Entropy}(S) = -p(\text{yes}) \log_2 p(\text{yes}) - p(\text{no}) \log_2 p(\text{no})$

Here $S$ is the total sample space, and $p(\text{yes})$ and $p(\text{no})$ are the probabilities of "yes" and "no" occurring.

Let's find the entropy. There are 9 "yes" and 5 "no" out of 14 total samples. So,

$E(S) = -\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 0.94$

Now evaluate the Outlook feature first:

Find the entropy for each value (sub-feature) of Outlook:

$E(\text{outlook} = \text{sunny}) = -\frac{2}{5} \log_2 \frac{2}{5} - \frac{3}{5} \log_2 \frac{3}{5} = 0.971$

$E(\text{outlook} = \text{overcast}) = -1 \cdot \log_2 1 = 0$, because overcast contains only "yes"

$E(\text{outlook} = \text{rainy}) = -\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5} = 0.971$

Information from Outlook:

$I(\text{outlook}) = \frac{5}{14}(0.971) + \frac{4}{14}(0) + \frac{5}{14}(0.971) = 0.693$

Here we weight by 5/14 because sunny covers 5 of the 14 samples in the dataset; we multiply that weight by the entropy of sunny and do the same for overcast and rainy.

Information Gain for Outlook:

$\text{Gain}(\text{outlook}) = E(S) - I(\text{outlook}) = 0.94 - 0.693 = 0.247$
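These numbers can be reproduced with the entropy and information_gain sketches from earlier (the per-value counts, e.g. 2 "yes" and 3 "no" for sunny, are read off the table above):

# Assumes entropy() and information_gain() from the earlier sketches are in scope.
parent = ['yes'] * 9 + ['no'] * 5
sunny = ['yes'] * 2 + ['no'] * 3
overcast = ['yes'] * 4
rainy = ['yes'] * 3 + ['no'] * 2

print(round(information_gain(parent, [sunny, overcast, rainy]), 3))   # about 0.247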

Now find the information gain for the remaining features with the same process, which gives the following:

Gain(windy) = 0.048, with I(windy) = 0.892

Gain(Temperature) = 0.029, with I(Temperature) = 0.911

Gain(Humidity) = 0.152, with I(Humidity) = 0.788

So, the Outlook feature gets selected as the root node because it has the biggest Information Gain (0.247), which is greater than that of any other feature.

Therefore, the tree will look like the following:

Play or Not Decision Tree

Pruning

Pruning Example

Pruning has two different types that are defined below,

Post-Pruning:

Here the pruning process starts after the Decision Tree has been created: the tree is grown fully first, and branches that add little value are then removed (in scikit-learn this is done with cost-complexity pruning via the ccp_alpha parameter). It prevents overfitting and can be used with smaller datasets.

Pre-Pruning:

Here growth is restricted while the tree is being constructed, by setting parameters of the Decision Tree model such as max_depth and max_features before training. This can be useful with large datasets.
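A minimal scikit-learn sketch of both ideas on the Iris data (the hyperparameter values here are arbitrary examples, not tuned settings):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: limit growth up front with constraints such as max_depth / max_features.
pre_pruned = DecisionTreeClassifier(max_depth=3, max_features=2).fit(X, y)

# Post-pruning: grow the full tree, then prune it back with cost-complexity pruning.
full_tree = DecisionTreeClassifier().fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)                 # candidate ccp_alpha values
post_pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2]).fit(X, y)  # a fairly aggressive alpha

print(full_tree.get_depth(), pre_pruned.get_depth(), post_pruned.get_depth())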

Decision Tree Regressor

How Decision Tree Regressor works:

Regression is all about continuous data: when a dataset has a continuous dependent feature, the Decision Tree Regressor applies. The Decision Tree Regressor uses Variance Reduction from the CART algorithm to perform regression. In Variance Reduction, the variance of a node is found first with the following formula:

$\text{Variance(node)} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2$

which is then used to calculate the Variance Reduction:

$\text{Variance Reduction} = \text{Var(root)} - \sum_{i} w_i \, \text{Var(child}_i)$

Example:

Monthly salary data for regression

Firstly, find the mean value of Salary, which is the dependent feature; this will be the $\bar{y}$ in the formula for the root node: (40k + 42k + 52k + 60k + 56k) / 5 = 50k. So, the mean value of Salary is 50,000.

Now, find the root node using Variance Reduction: first check the variance reduction for the split at <= 2, then at <= 2.5, and so on, and the split with the greater variance reduction gets selected. Here I will check the first two candidate splits and select the better one.

Decision tree of monthly salary

$\text{Var(root)} = \frac{1}{5}\left[(40 - 50)^2 + (42 - 50)^2 + (52 - 50)^2 + (60 - 50)^2 + (56 - 50)^2\right] = 60.8$

Here we use 1/5 because there are 5 samples in total, we work in units of thousands (k), and each value is compared with the root's mean of 50k.

Now find the variance of each child, remembering that each child uses its own mean.

For the split at <= 2, child 1 contains only {40k} and child 2 contains {42k, 52k, 60k, 56k}, whose mean is 52.5k:

$\text{Var(child 1)} = 0$, because a single sample has no spread around its own mean

$\text{Var(child 2)} = \frac{1}{4}\left[(42 - 52.5)^2 + (52 - 52.5)^2 + (60 - 52.5)^2 + (56 - 52.5)^2\right] = 44.75$

$\text{Variance reduction}(\le 2) = 60.8 - \left(\tfrac{1}{5}(0) + \tfrac{4}{5}(44.75)\right) = 25$

Now for the split at <= 2.5, child 1 contains {40k, 42k} with mean 41k and child 2 contains {52k, 60k, 56k} with mean 56k:

$\text{Var(child 1)} = \frac{1}{2}\left[(40 - 41)^2 + (42 - 41)^2\right] = 1$

$\text{Var(child 2)} = \frac{1}{3}\left[(52 - 56)^2 + (60 - 56)^2 + (56 - 56)^2\right] \approx 10.67$

$\text{Variance reduction}(\le 2.5) = 60.8 - \left(\tfrac{2}{5}(1) + \tfrac{3}{5}(10.67)\right) = 54$

VR(<= 2) < VR(<= 2.5)

This is why the root split, for now, is <= 2.5, and the Decision Tree is grown from the <= 2.5 split.

The root node for the monthly salary

Now from here split the Decision Tree further by just following the same steps.
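These worked numbers can be checked with the variance and variance_reduction sketches from earlier; the groups below are exactly the child nodes used in the example (salaries in thousands):

# Assumes variance() and variance_reduction() from the earlier sketch are in scope.
salaries = [40, 42, 52, 60, 56]                                   # target values, in thousands

print(variance(salaries))                                         # about 60.8 -> variance of the root
print(variance_reduction(salaries, [[40], [42, 52, 60, 56]]))     # about 25  -> split at <= 2
print(variance_reduction(salaries, [[40, 42], [52, 60, 56]]))     # about 54  -> split at <= 2.5 (chosen)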

Code:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('/content/Iris.csv')
df.drop(['Id'], axis=1, inplace=True)

# Split data and preprocessing:
x = df.iloc[:, 0:4]               # the four measurement columns
y = df.iloc[:, -1]                # the species column as a 1-D series
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)           # encode the class labels as integers
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# Model Building:
from sklearn import tree
clf2 = tree.DecisionTreeClassifier(criterion='gini', min_samples_leaf=2, max_depth=3)
clf2.fit(x_train, y_train)
print(clf2.score(x_test, y_test))  # accuracy on the test set

# Plot the fitted tree:
plt.figure(figsize=(15, 10))
tree.plot_tree(clf2, filled=True)
plt.show()
The graphical representation of Iris data

https://www.linkedin.com/in/usman-ali-06a6351b1/

https://twitter.com/Usmanal52128963
