Decision Trees Vs. Random Forests


I came across an article on popular data science interview questions that listed 23 different topics that frequently come up in the interview process. Among them were two questions: what is a Decision Tree and what is a Random Forest? Since decision making algorithms are often used in the machine learning world, I thought I'd explore the differences between the two and the ways in which they are used. (You can find the article mentioned above here!)

What is a Decision Tree?

A decision tree is a supervised learning algorithm used in ML for both classification and regression problems, particularly when the relationship between a set of predictor variables and a response variable is non-linear. The goal of a decision tree is to learn a set of decision rules over the predictor variables that predicts the value of the response variable.
For example, say you and your friends are going to the beach and you are in charge of the drinks. You have $15 to spend on a variety of sodas, and to decide you use a decision tree.
The algorithm checks which soda brands can be bought in bulk for $15 or under and then chooses the best seller among them. Your decision tree produced an answer and you purchase a 24 pack of 12 fl. oz. cans of Sprite for $12. Sweet deal!
This algorithm is efficient and can handle large datasets with ease. As the name suggests, a decision tree resembles a literal tree with branches and leaves; the data is split into nodes, starting from the root node, through child nodes, down to the leaf nodes, until a stopping threshold is reached. To move through, or traverse between, the nodes, a decision tree uses recursion, which is the process in which a function calls itself directly or indirectly. (1)
Decision trees answer sequential questions by setting "if this, then that" conditions until we are ultimately provided a specific result. 
Take the tree below:
Say you were considering going fishing but were on the fence because of the forecast. Since fishing generally requires certain conditions for guaranteeing a catch (or at least a pleasant time!), you choose to use a decision tree to help you decide. You lay out all of your reasons and then come to a decision.

This tree flows from the top downward. Here you have 3 options: sunny, cloudy, or rainy.

If the weather is sunny, then you move down to the next node: is it windy? If it isn't, then you should go fishing, but if it is, then you won't. If the weather is cloudy, then you should most definitely prep your gear, since your chances of catching a fish in overcast conditions are much higher. However, if it's raining outside, you have one of two options: stay in if it's thundering, and go fishing if it's not.

This is an example of a simple decision tree.
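To make the "if this, then that" structure concrete, here is a minimal sketch of the fishing tree above written as plain conditional logic in Python. The function name and the input values are hypothetical, chosen only to mirror the rules just described:

```python
def go_fishing(weather: str, windy: bool, thundering: bool) -> bool:
    """Walk the simple fishing decision tree described above."""
    if weather == "sunny":
        # Sunny: the deciding factor is the wind.
        return not windy
    if weather == "cloudy":
        # Overcast: chances of a catch are higher, so prep your gear.
        return True
    if weather == "rainy":
        # Rainy: stay in only if it's thundering.
        return not thundering
    raise ValueError(f"unexpected weather value: {weather!r}")

# Example: sunny and calm -> go fishing.
print(go_fishing("sunny", windy=False, thundering=False))  # True
```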

What is a Random Forest?

Consider the soda example above: as you were checking out your groceries, you overheard the customer ahead of you celebrating his soda purchase; he too had struck a great deal! Instead of using a decision tree, he chose to use a random forest algorithm to make his choice, and in doing so, made several decisions that led him to one major one.

The algorithm, like your decision tree, checked the best seller. It also checked the size of the cans and the largest quantity. He settled on a 36 pack of 10 fl. oz. cans of Coca-Cola for $15. He was ecstatic, and you were in awe!

This is what a random forest is: essentially a collection of decision trees. A random forest doesn't rely on just one decision; it makes many randomized decisions and then concludes the final decision based on the majority, and the more diversity among those decisions, the more reliable the result.

Random Forest, an ensemble method, is one of the most well-known and powerful ML algorithms and uses Bootstrap Aggregation, or bagging. In a nutshell, ensemble methods are techniques that combine the predictions from multiple algorithms to make more accurate predictions than a single model. For example, a random forest changes the way the sub-trees are learned so that the resulting predictions from all of the sub-trees have less correlation with one another. (2)
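As a rough illustration of that idea (not the setup from the cited article), the sketch below trains a single decision tree and a random forest of 100 bagged trees on the same data with scikit-learn; the dataset and parameter values are assumptions made only for the example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset chosen only for illustration.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# One tree vs. an ensemble of 100 bagged trees.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("Decision tree accuracy:", tree.score(X_test, y_test))
print("Random forest accuracy:", forest.score(X_test, y_test))
```

Typically the forest scores at least as well as the single tree, because averaging many less-correlated trees reduces variance.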


The Process

Decision Trees:

1. Splitting: the data is split into various categories under "branches" as it's fed to the tree. Branches are the arrows connecting the nodes that show the flow from one question to another.

2. Pruning: branches that contribute little to the classification are cut away, trimming the tree back until only the meaningful paths down to the leaf nodes remain.

3. Selection: the tree with the best of the following factors is then selected:

        1- Entropy: in simple terms, entropy is the measure of how disordered the data is. The ultimate goal in the decision tree is to group similar data points into the same classes; basically, to tidy the data. (4) If the entropy is 0, the data is homogeneous (meaning the same in structure or composition).

        2- Information gain: the reduction in entropy achieved by a split; the branches are split further wherever the entropy decreases the most, i.e. where the most information is gained. (A small sketch of both quantities follows this list.)
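Here is a minimal sketch of how those two quantities can be computed for a candidate split; the class labels ("fish" / "no fish") are made up purely for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels; 0 means the data is homogeneous."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Reduction in entropy achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# Hypothetical labels before and after a perfect split.
parent = ["fish", "no fish"] * 3
left, right = ["fish"] * 3, ["no fish"] * 3
print(entropy(parent))                        # 1.0 -> maximally mixed
print(information_gain(parent, left, right))  # 1.0 -> the split removes all disorder
```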

It's important to note that the depth of a tree is highly important: the depth determines the number of decisions that need to be made in order to reach a conclusion. Shallower decision trees generally tend to perform better than deep ones.

Random Forests:

1. Bagging: a decision tree is fit on a random portion of the training data, and the process is repeated for a defined number of iterations. The final decision is then taken from the majority.

2. Bootstrapping: this is a powerful statistical method for estimating a quantity from a data sample. (5) Samples are drawn at random (with replacement), the root node is calculated, the data is split, and the process repeats until a forest is formed.
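The following is a rough, hand-written sketch of bagging with bootstrapped samples, not taken from any particular library tutorial; the number of trees, the dataset, and the evaluation choice are all illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # toy dataset for illustration
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrapping: draw a sample of the same size, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Bagging: every tree votes, and the majority decides.
votes = np.stack([tree.predict(X) for tree in trees])
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("Agreement of the bagged majority vote with the labels:", (majority == y).mean())
```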

Which One Is Better?

The answer lies in the type of problem we are trying to solve; however, it helps to compare the two types of models to get a better understanding of their uses.

Advantages Vs. Disadvantages of Decision Trees:

Pros: They are incredibly fast, perform well on large datasets, are easy to interpret and understand, are versatile (they can handle both categorical and numerical data), and follow a transparent process that can easily be reproduced.

Cons: They are prone to overfitting, especially when the model is deep, because we look at specific samples and traverse based on those samples. By setting a max depth we can limit the error due to variance and the overfitting, but at the expense of error due to bias (a small sketch of this trade-off follows).
In addition, pruning is expensive and cannot guarantee that the model will be optimized: at each step the algorithm chooses the locally best split, which does not ensure we are traversing in the direction that will lead us to the best overall decision.
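As a small illustration of that depth trade-off (not a definitive benchmark), the sketch below cross-validates decision trees at a few max_depth settings; the dataset and the depth values are assumptions made only for the example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # toy dataset for illustration

# None lets the tree grow until every leaf is pure (deepest, most variance-prone).
for depth in (2, 4, 8, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()
    print(f"max_depth={depth}: mean cross-validated accuracy = {score:.3f}")
```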

Advantages Vs. Disadvantages of Random Forests:

Pros: They are very powerful and highly accurate, limit overfitting by training on different samples of the data or random subsets of the features (2), don't require normalization, can train their trees in parallel, and can handle several features at a time.

Cons: Since they are so robust, they run rather slowly, may have limitations with high-dimensional data when the number of predictors is much larger than the number of observations, cannot be used for linear methods, and can be biased towards certain features.

Conclusion

To compare, decision trees are far simpler and easier than random forests: a tree combines decisions, while a forest combines several trees (the reason why forests are slow). Decision trees operate efficiently on large datasets, so if you are time-restricted, then a decision tree may be the way to go. However, if time is not an issue, then a random forest may be optimal, since it undergoes more rigorous training and offers more reliable predictions.


References
