Decision Tree in Machine Learning

Decision trees help data scientists and businesses make decisions by modeling choices and their outcomes in a tree-like structure. As more businesses handle large datasets, many have incorporated machine learning into their infrastructure to analyze that data and make better predictions. This article explains what a decision tree is, how it works, where it is applied, and its advantages and disadvantages.

What is a decision tree

A decision tree is a supervised machine learning algorithm used for regression and classification problems. It is built from a root node, internal decision nodes, and leaf nodes. It learns decision rules from training data and uses them to predict the value of a target variable from the input features.

It works well with boolean functions (true or false). Each internal node tests a feature, each link (branch) represents the outcome of that test, and each leaf holds the result. A decision tree contains several types of nodes:

  • Root node - it represents the whole sample and gets divided into two or more homogeneous sets.
  • Decision node - a sub-node that splits into further sub-nodes.
  • Terminal node - a node that does not split, also called a leaf.
  • Parent node - a node that splits into sub-nodes.
  • Child node - a sub-node that results from splitting a parent node.

Other terms used in the decision tree include:

  • Splitting - it involves dividing a node into several sub-nodes.
  • Pruning - it involves removing sub-nodes from a decision node, the opposite of splitting.
  • Branch/subtree - it's a small section of the entire tree.
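
The node types listed above can be pictured with a small data structure. The sketch below is only an illustration: the `TreeNode` class, its field names, and the age/income example are hypothetical and assume a binary tree with numeric thresholds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TreeNode:
    """One node of a binary decision tree (hypothetical structure for illustration)."""
    feature: Optional[str] = None        # feature tested at a root or decision node
    threshold: Optional[float] = None    # split point for that feature
    left: Optional["TreeNode"] = None    # child for feature value <= threshold
    right: Optional["TreeNode"] = None   # child for feature value > threshold
    prediction: Optional[str] = None     # set only on terminal (leaf) nodes

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def predict(node: TreeNode, sample: dict) -> str:
    """Walk from the root down to a terminal node and return its prediction."""
    while not node.is_leaf():
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.prediction


# The root node is a parent with one terminal child and one decision-node child.
root = TreeNode(
    feature="age", threshold=30,
    left=TreeNode(prediction="no"),
    right=TreeNode(
        feature="income", threshold=50_000,
        left=TreeNode(prediction="no"),
        right=TreeNode(prediction="yes"),
    ),
)

print(predict(root, {"age": 42, "income": 80_000}))  # yes
```

Here the right child of the root, together with its two leaves, forms a branch (subtree) of the whole tree.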

How do decision trees work

A decision tree is built by repeatedly splitting nodes into sub-nodes. At each step, an algorithm picks the variable and the split that best separate the data, and the choice of algorithm and splitting criterion affects how accurate the resulting tree is.

Common algorithms for building decision trees are:

  • ID3 algorithm.
  • C4.5 algorithm.
  • CART algorithm.
  • CHAID algorithm.
  • MARS algorithm.

ID3 algorithm

This algorithm begins with the original set S as the root node. On every iteration, it loops through all the unused attributes of S, calculates the entropy of each, and chooses the attribute with the smallest entropy (equivalently, the largest information gain).

It then partitions S by the selected attribute to produce subsets and recurses on each subset, considering only the attributes that have not been selected before. The recursion stops when all the elements of a subset belong to the same class or when no unused attributes remain.
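
The recursion described above can be sketched in a few lines of Python. This is a minimal illustration rather than a full ID3 implementation: the function name `id3`, the nested-dict tree format, the majority-class fallback, and the toy weather data are all assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def id3(rows, labels, attributes):
    """Return a nested dict tree; leaves are class labels."""
    # Stop: every element of the subset belongs to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no unused attributes remain, so fall back to the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    def remaining_entropy(attr):
        # Weighted entropy of the subsets produced by splitting on attr.
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    # Greedy choice: the attribute that leaves the least entropy behind.
    best = min(attributes, key=remaining_entropy)
    unused = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        tree[best][value] = id3(list(sub_rows), list(sub_labels), unused)
    return tree


rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
tree = id3(rows, labels, ["outlook", "windy"])
print(tree)  # {'outlook': {'rainy': 'stay', 'sunny': 'play'}}
```

In this toy dataset, splitting on `outlook` leaves no entropy in either subset, so it is chosen as the root and both children become leaves.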

A decision tree algorithm needs a measure for choosing which attribute to split on. Common attribute selection measures include:

  • Entropy
  • Gini index
  • Information gain
  • Reduction in variance
  • Chi-Square
  • Gain ratio

Entropy

Entropy measures the randomness, or impurity, of the data at a node. ID3 uses it as part of a greedy heuristic, taking the split that looks best at each step, and its accuracy can improve when the data is preprocessed. A coin toss is a simple example: when heads and tails are equally likely, the outcome is as uncertain as it can be, so the entropy is at its maximum.

The formula for calculating entropy is:

$$H(S) = -\sum_{x \in X} p(x) \log_2 p(x)$$

where S is the current dataset, X is the set of classes in S, and p(x) is the proportion of elements in S that belong to class x.

Entropy lets you quantify how much information you can expect to gain by splitting on a specific variable. When the entropy is zero, every element at the node belongs to the same class, so the distribution is fully known. The higher the entropy at a node, the more mixed its classes are, and the more further splitting is needed.
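
As a rough sketch, entropy can be computed directly from a list of class labels. The example assumes the standard Shannon definition with a base-2 logarithm; the function name `entropy` and the sample labels are chosen for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    probabilities = [count / total for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in probabilities)


print(entropy(["heads", "tails"]))            # 1.0, a fair coin is maximally uncertain
print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0, the node is pure
```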

Gini index

It is sometimes referred to as Gini impurity and is the measure used by the CART algorithm. It measures how often a randomly chosen element from a set would be labeled incorrectly if it were labeled randomly according to the distribution of labels in the subset.

To get the index, you subtract the sum of the squared class probabilities from one.

The Gini index can be expressed as:

$$Gini = 1 - \sum_{i=1}^{n} p_i^2$$

where p_i is the probability that an element belongs to class i, and n is the number of classes.
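
A short sketch of the calculation, assuming the usual definition of Gini impurity as one minus the sum of squared class proportions (the function name `gini_index` is chosen for this example):

```python
from collections import Counter


def gini_index(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


print(gini_index(["a", "a", "b", "b"]))  # 0.5, an evenly mixed two-class node
print(gini_index(["a", "a", "a", "a"]))  # 0.0, a pure node
```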

Information gain

It's a function that builds on entropy. It measures how well a given attribute separates the training examples according to their target classification.

At each node, the algorithm splits on the feature with the highest information gain.

To calculate the information gain of a descriptive feature, split the dataset by the feature's values, calculate the weighted entropy of the resulting subsets, and subtract it from the entropy of the original dataset. ID3 uses this measure directly, and C4.5 builds on it through the gain ratio.

The formula can be written as:

InformationGain(descriptive feature) = Entropy(original dataset) - Entropy(descriptive feature)

Here, Entropy(descriptive feature) is the weighted entropy that remains after splitting the dataset on that feature.
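
The calculation can be sketched as follows. The helper names and the toy labels are assumptions for this example, and the entropy after the split is taken as the size-weighted average over the subsets.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the full dataset minus the weighted entropy after the split."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remaining = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remaining


labels = ["yes", "yes", "no", "no"]
print(information_gain(["a", "a", "b", "b"], labels))  # 1.0, a perfect split
print(information_gain(["a", "b", "a", "b"], labels))  # 0.0, an uninformative split
```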

Reduction in variance

This measure is mostly used when the target variable is continuous, as in regression. It uses variance to decide which feature a node should be split on: the lower the variance, the more homogeneous the node, and a perfectly homogeneous node has zero variance.

To choose a split using variance:

  • Calculate the variance of each child node in every candidate split.
  • Calculate the variance of each split as the weighted average of the variances of its child nodes.
  • Choose the split that has the lowest variance.
  • Repeat the steps above until the nodes are homogeneous.

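To calculate variance, we use the following formula:

$$Variance = \frac{\sum (X - \bar{X})^2}{n}$$

where X is an individual value, \bar{X} is the mean of the values, and n is the number of values.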
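
The split-selection steps can be sketched like this. The example assumes the population variance and a size-weighted average over the child nodes; the function names, the target values, and the two candidate splits are made up for illustration.

```python
def variance(values):
    """Population variance: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_variance(left, right):
    """Weighted average variance of the two child nodes of a split."""
    n = len(left) + len(right)
    return len(left) / n * variance(left) + len(right) / n * variance(right)


targets = [10.0, 12.0, 30.0, 32.0]
candidates = {
    "split_a": ([10.0, 12.0], [30.0, 32.0]),  # groups similar values together
    "split_b": ([10.0, 30.0], [12.0, 32.0]),  # mixes low and high values
}
best = min(candidates, key=lambda name: split_variance(*candidates[name]))

print(variance(targets))  # 101.0, the variance of the parent node
print(best)               # split_a
```

The chosen split is the one whose child nodes are closest to homogeneous, which is what reduces the variance the most relative to the parent node.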


Chi-Square

This measure is associated with the acronym CHAID, meaning CHi-squared Automatic Interaction Detection. It's a statistical method that measures how significant the differences are between the sub-nodes and the parent node.

The algorithm is non-parametric, and it works well with large datasets.

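You calculate it as:

$$\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$$

First, for each node, you get the deviation between the observed and expected counts of both the success and failure classes and calculate the chi-square of each.

Second, you add up the chi-square values of success and failure for all the nodes of a split to get the chi-square of the split.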

CHAID can perform more than two splits at one node, which can produce more accurate results. It's mostly applied in direct marketing to win more clients.
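
A minimal sketch of the two steps follows. It assumes the standard Pearson chi-square statistic, with the expected counts of each child node taken from the parent node's success rate; the function name and the counts are hypothetical.

```python
def chi_square_of_split(children, parent_success_rate):
    """Sum the chi-square contributions of success and failure over all child nodes.

    children: list of (successes, failures) counts, one pair per child node.
    """
    total = 0.0
    for successes, failures in children:
        size = successes + failures
        # Expected counts if the child behaved exactly like the parent node.
        expected_success = size * parent_success_rate
        expected_failure = size * (1 - parent_success_rate)
        total += (successes - expected_success) ** 2 / expected_success
        total += (failures - expected_failure) ** 2 / expected_failure
    return total


# Parent node: 10 successes out of 20, so the expected success rate is 0.5.
print(round(chi_square_of_split([(8, 2), (2, 8)], 0.5), 2))  # 7.2
print(round(chi_square_of_split([(5, 5), (5, 5)], 0.5), 2))  # 0.0
```

A higher value means the child nodes differ more from the parent, so the first split would be preferred over the second.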

Gain ratio

According to Ross Quinlan, the gain ratio is the ratio between information gain and intrinsic (split) information. It reduces the bias of information gain towards multi-valued attributes: when choosing an attribute, it takes the number and size of the branches into account.

The attribute with the highest gain ratio is normally selected.

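The gain ratio can be calculated as follows.

First, we calculate the information gain:

$$IG(X, a) = H(X) - H(X \mid a)$$

where X is the variable, H(X) is its entropy, and H(X|a) is the conditional entropy of X given the attribute a.

Second, we calculate the split information as follows:

$$SplitInformation(X) = -\sum_{i=1}^{n} \frac{N(t_i)}{N(t)} \log_2 \frac{N(t_i)}{N(t)}$$

where X is the variable, t is the set of events, N(t_i) is the number of times event t_i appears, and N(t) is the total number of events.

The gain ratio is then calculated as:

$$GainRatio(X, a) = \frac{IG(X, a)}{SplitInformation(X)}$$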
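
The sketch below puts the steps together, assuming the gain ratio is the information gain divided by the split information and that the split information is the entropy of the attribute's own value distribution. The function names and the toy features are illustrative.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of a list of values."""
    total = len(values)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain divided by split information."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)  # entropy of the branch sizes
    return info_gain / split_info if split_info > 0 else 0.0


labels = ["yes", "yes", "no", "no"]
binary_feature = ["a", "a", "b", "b"]
id_like_feature = ["r1", "r2", "r3", "r4"]  # a unique value per row

print(gain_ratio(binary_feature, labels))   # 1.0
print(gain_ratio(id_like_feature, labels))  # 0.5
```

Both features have an information gain of 1.0 here, but the attribute with one value per row is penalized by its larger split information, which is how the gain ratio counters the bias towards multi-valued attributes.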

Types of decision trees

Decision trees are grouped by the type of target variable they predict. The two types share the same structure and differ in what they output.

  • Categorical variable decision tree (classification tree). The prediction is the discrete class the data belongs to.
  • Regression tree. The prediction is a real number, e.g., account balance or age.

Both types fall under CART (classification and regression tree) analysis.
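
The difference between the two types shows up in what a leaf returns. The sketch below assumes a common convention, where a classification leaf predicts the majority class of its training samples and a regression leaf predicts their mean; the function names and sample values are hypothetical.

```python
from collections import Counter
from statistics import mean


def classification_leaf(labels):
    """A classification tree leaf predicts the most common class."""
    return Counter(labels).most_common(1)[0][0]


def regression_leaf(values):
    """A regression tree leaf predicts the mean of its training targets."""
    return mean(values)


print(classification_leaf(["spam", "ham", "spam"]))  # spam
print(regression_leaf([1200.0, 1500.0, 1800.0]))     # 1500.0
```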

Applications of the decision tree in machine learning

There are several uses of decision trees in business and other fields. Some of the common applications include:


Marketing

Companies use historical and competitor data to analyze how their marketing campaigns affect sales of products and services. It helps them develop better campaigns that produce more sales and conversions than their competitors' campaigns.

Checking the level of energy consumption

Companies use decision trees to estimate the amount of energy each household needs. You can use variables like the number of household members and the appliances each household has, like a refrigerator, to predict consumption.

Fraud Detection

Fraud reduces tax collections for many countries and brings losses to many businesses. You can use a decision tree to monitor activity and flag suspicious behavior as potential fraud.

Find faults

Engineers use decision trees to find out whether their rotary machines have faults in the bearings. The evaluation uses vibration signals and other measured variables.

Disease and Ailments Diagnosis

Many companies use decision trees, with the help of doctors, to identify early symptoms of diseases and ailments. Examples include detecting breast cancer, childhood diseases, and diabetes. It helps in taking preventive measures at the early stages of a disease.

Retaining customers

Many companies use decision trees to study the purchasing behavior of their customers. Once they understand that behavior, they use the trees to recommend new products that match it, which improves the customer experience.

Advantages of using decision tree in machine learning

  • It takes less time. Decision trees need little data preparation: they typically don't require normalization or dummy variables, which makes them quick to use for data exploration. They also make it easier to spot the relationship between two variables and to find the variables that most affect the target variable.
  • It's easy to interpret. The result is well organized and readable, and its graphical representation lets users without a statistics or machine learning background understand it.
  • It can produce accurate results. Some implementations deal with missing values on their own, which makes the results more dependable.
  • It works with many types of data and variables. Decision trees work with categorical data, discrete and qualitative variables, and continuous data, and some can generate several outputs.
  • It helps in decision-making in companies. After the process, businesses use the visuals to see how they can improve on the factors raised.
  • It needs little data cleaning. Data cleaning still helps produce accurate results, but decision trees are less affected by outliers and missing values than many other techniques, because splitting tends to isolate the values that could distort the process.
  • It can handle missing values. For example, the CHi-squared Automatic Interaction Detection (CHAID) method treats missing values as a category of their own, which it can merge with another category or keep separate.
  • It is non-parametric. Decision trees make no assumptions about the probability distribution of the variables, so you can use collinear variables with little effect on the final output of the tree.

Disadvantages of the decision tree in machine learning

  • It's less suited to predicting continuous variables. When predicting a continuous variable, a decision tree loses information because it groups the values into a limited number of categories.
  • Overdependence on earlier nodes. Every node at a given level depends on the nodes of the previous levels, so a poor split near the top makes the nodes below it wrong as well.
  • It's challenging to analyze several independent variables at the same time. The algorithm evaluates splits sequentially and never revises a node's division once it has been made, which can lead to suboptimal choices further down the tree.
  • They are sometimes unstable. A slight change in the data can change the structure of the whole tree and give different results to the user.
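
The instability mentioned in the last point can be seen even in a tiny example. The sketch below picks the root feature by lowest weighted Gini impurity, then changes a single label and picks again; the feature names, the data, and the choice of Gini as the criterion are assumptions made for this illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(feature_values, labels):
    """Size-weighted Gini impurity of the subsets produced by a split."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(g) / len(labels) * gini(g) for g in groups.values())


def best_feature(features, labels):
    """Name of the feature whose split leaves the least impurity."""
    return min(features, key=lambda name: weighted_gini(features[name], labels))


features = {
    "f1": [0, 0, 0, 1, 1, 1],
    "f2": [0, 0, 1, 1, 1, 1],
}
labels = ["A", "A", "A", "B", "B", "B"]
print(best_feature(features, labels))  # f1

labels[2] = "B"  # change the label of a single training row
print(best_feature(features, labels))  # f2
```

Because the root split changes, every node below it would be rebuilt differently as well.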

Final Thoughts

Decision trees are significant in machine learning. They use different algorithms and splitting criteria to arrive at the most accurate prediction they can, and they work according to the rules learned from the data and the guidelines you provide.

Most programming languages can be used to create decision trees, and doing so takes only a few steps. Decision trees have many applications, but other methods are sometimes used instead of them when working with datasets.

Saritha Reddy
Research Analyst
A technical lead content writer at HKR Trainings with expertise in delivering content on in-demand technologies like Networking, Storage & Virtualization, Cyber Security & SIEM Tools, Server Administration, Operating System & Administration, IAM Tools, Cloud Computing, etc. She creates content for users and keeps up to date with the latest trends in the market. To learn more, connect with her on LinkedIn, Twitter, and Facebook.