Decision Trees Vs. Random Forests
What is a Decision Tree?
Say you and your friends are going to the beach and you are in charge of the drinks. You have $15 to spend on a variety of sodas, and to decide which one to buy, you use a decision tree.
The algorithm checks which soda brands can be bought in bulk for $15 or less and then favors the one that sells the most. Your decision tree produces an answer, and you purchase a 24-pack of 12 fl. oz. cans of Sprite for $12. Sweet deal!
This algorithm is efficient and can handle large datasets with ease. As the name suggests, a decision tree resembles a literal tree with branches and leaves: the data is split into nodes, starting from the root node, through child nodes, down to the leaf nodes, where the splitting stops and a result is returned. To move between, or traverse, the nodes, a decision tree uses recursion, the process in which a function calls itself directly or indirectly. (1)
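To make that idea concrete, here is a minimal sketch of what walking such a tree recursively might look like; the `Node` class and its field names are purely illustrative and not taken from any particular library.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[str] = None      # the question asked at this node, e.g. "wind_speed"
    threshold: Optional[float] = None  # split point for that feature
    left: Optional["Node"] = None      # branch taken when the value is <= threshold
    right: Optional["Node"] = None     # branch taken when the value is > threshold
    prediction: Optional[str] = None   # set only on leaf nodes

def predict(node: Node, sample: dict) -> str:
    """Traverse from the root to a leaf; the function calls itself (recursion)."""
    if node.prediction is not None:            # leaf node reached: stop recursing
        return node.prediction
    if sample[node.feature] <= node.threshold:
        return predict(node.left, sample)      # recurse into the left child
    return predict(node.right, sample)         # recurse into the right child

# A tiny two-level tree: "is the wind speed 10 mph or less?"
root = Node(feature="wind_speed", threshold=10,
            left=Node(prediction="go fishing"),
            right=Node(prediction="stay home"))
print(predict(root, {"wind_speed": 7}))        # -> go fishing
```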
Decision trees answer sequential questions by setting "if this, then that" conditions until we are ultimately provided a specific result.
Take the tree below:
Say you were considering going fishing but were on the fence because of the forecast. Since fishing generally requires certain conditions for guaranteeing a catch (or at least a pleasant time!), you choose to use a decision tree to help you decide. You lay out all of your reasons and then come to a decision.
This tree flows from the top downward. Here you have 3 options: sunny, cloudy, or rainy.
If the weather is sunny, you move down to the next node: is it windy? If it isn't, you should go fishing, but if it is, you won't. If the weather is cloudy, you should most definitely prep your gear, since your chances of catching a fish in overcast conditions are much higher. However, if it's raining outside, you have one of two options: stay in if it's thundering, and go fishing if it's not.
This is an example of a simple decision tree.
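If we spelled that tree out in code, it would be nothing more than nested "if this, then that" checks; the function and argument names below are just illustrative.

```python
def should_go_fishing(weather: str, windy: bool, thundering: bool) -> bool:
    """The fishing tree above, written as plain if/else conditions."""
    if weather == "sunny":
        return not windy           # sunny and calm: go fishing; sunny but windy: stay home
    if weather == "cloudy":
        return True                # overcast skies: prep your gear
    if weather == "rainy":
        return not thundering      # light rain is fine, thunder means stay in
    return False                   # unknown forecast: play it safe

print(should_go_fishing("sunny", windy=False, thundering=False))   # True
print(should_go_fishing("rainy", windy=False, thundering=True))    # False
```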
What is a Random Forest?
Consider the soda example above: as you were checking out your groceries, you overheard the customer ahead of you talking happily about his soda purchase; he too had struck a great deal! Instead of a decision tree, he used a random forest algorithm to make his choice, and in doing so made several smaller decisions that led him to one major one.
The algorithm, like your decision tree, checked the best seller. It also checked the size of the cans and the largest quantity. He settled on a 36 pack of 10 fl. oz. cans of Coca-Cola for $15. He was ecstatic, and you were in awe!
This is what a random forest is: essentially a collection of decision trees. A random forest doesn't rely on just one decision; it builds many trees, each making its own decision, and then settles on a final answer by majority vote. The more diversity among the trees, the more reliable that final decision.
Random Forest, an ensemble method, is one of the best-known and most powerful ML algorithms, and it uses Bootstrap Aggregation, or bagging. In a nutshell, ensemble methods are techniques that combine the predictions from multiple algorithms to make more accurate predictions than a single model. For example, a random forest changes the way its sub-trees are learned so that the resulting predictions from all of the sub-trees are less correlated. (2)
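As a quick illustration of the idea (assuming scikit-learn is available), the sketch below fits a single decision tree and a random forest on a built-in toy dataset and compares their accuracy; the exact numbers will vary and are not a benchmark.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One tree on its own vs. an ensemble of 100 bagged trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("single decision tree:", accuracy_score(y_test, tree.predict(X_test)))
print("random forest       :", accuracy_score(y_test, forest.predict(X_test)))
```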
The Process
Decision Trees:
1. Splitting: the data is split into various categories under "branches" as it's fed to the tree. Branches are the arrows connecting the nodes that show the flow from one question to another.
2. Pruning: branches that contribute little to the final classification are trimmed away, shrinking the tree until only the splits that lead to meaningful leaf nodes remain.
3. Selection: the tree with the best of the following factors is then selected:
1- Entropy: in simple terms, entropy is a measure of how disordered the data is. The ultimate goal of a decision tree is to group similar data into the same classes; basically, to tidy the data. (4) If the entropy is 0, the data is homogeneous (the same in structure or composition).
2- Information gain: the reduction in entropy produced by a split. Branches are split further as long as entropy keeps decreasing and information keeps being gained.
It's important to note that the depth of a tree is highly important: the depth determines the number of decisions that need to be made in order to produce a conclusion. Shallower decision trees generally tend to perform better than deep ones.
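Below is a small sketch of how entropy and information gain could be computed for a candidate split, using the standard formulas described above; the tiny label arrays are made up purely for illustration.

```python
import numpy as np

def entropy(labels):
    """H = -sum(p * log2(p)) over the class proportions; 0 means perfectly homogeneous."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, left, right):
    """How much a split reduces entropy: H(parent) minus the weighted child entropies."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = ["fish", "fish", "fish", "no", "no", "no"]   # mixed labels: entropy = 1.0
left   = ["fish", "fish", "fish"]                     # e.g. the "cloudy" branch
right  = ["no", "no", "no"]                           # everything else
print(information_gain(parent, left, right))          # 1.0, a perfectly clean split
```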
Random Forests:
1. Bootstrapping: a powerful statistical method for estimating a quantity from a data sample. (5) Samples are drawn at random (with replacement), a tree is grown on each sample by calculating its root node and splitting, and the process repeats until a forest is formed.
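To show the idea, here is a hand-rolled sketch of bootstrapping and majority voting (assuming scikit-learn for the individual trees); note that a real random forest also samples a random subset of features at each split, which this sketch omits.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Bootstrapping: draw n rows *with replacement* and grow one tree per sample.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))           # a bootstrap sample of row indices
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregation: every tree votes and the majority class wins.
votes = np.array([t.predict(X) for t in trees])          # shape: (n_trees, n_samples)
majority = np.array([np.bincount(col).argmax() for col in votes.T])
print("accuracy of the bagged vote:", (majority == y).mean())
```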
Which One Is Better?
The answer lies in the type of problem we are trying to solve; however, it helps to compare the two types of models to get a better understanding of their uses.
Advantages Vs. Disadvantages of Decision Trees:
Advantages Vs. Disadvantages of Random Forests:
Conclusion
To compare, decision trees are far simpler than random forests: a tree combines decisions, while a forest combines several trees (which is why forests are slower). Decision trees operate efficiently on large datasets, so if you are time-restricted, a decision tree may be the way to go. However, if time is not an issue, a random forest may be optimal, since it undergoes more rigorous training and offers more reliable predictions.
References
1. Recursion - GeeksforGeeks
2. Bagging and Random Forest Ensemble Algorithms for Machine Learning (machinelearningmastery.com)
3. Random forests for high-dimensional longitudinal data | DeepAI
4. Decision Tree Algorithm | Explanation and Role of Entropy in Decision Tree (educba.com)
5. Bagging and Random Forest Ensemble Algorithms for Machine Learning (machinelearningmastery.com)