Ensemble learning techniques are quite popular in machine learning. These techniques work by training multiple models and combining their results to get the best possible outcome. In this article, we will learn about three popular ensemble learning methods: bagging, boosting, and stacking.
Each of these methods has its own benefits and limitations, but in practice ensembles often deliver the best results. Many top rankers in Kaggle competitions admit that combining multiple models is what gives them the winning edge.
The purpose of this article is to give readers an overview of ensemble learning methods and to answer questions such as: What is ensemble learning? Why do these methods work? What are their benefits and limitations? When should you apply which method?
I am assuming everyone has a fair understanding of decision tree-based machine learning algorithms. If not, I would recommend first getting comfortable with decision tree-based models and then learning about ensemble methods, since most of the ensembles you will encounter in practice are built from decision trees.
Without any further delay, let's get started. This post is divided into the following parts:
- What is ensemble learning?
- Why not a single model?
- Bagging
- Boosting
- Stacking
- Conclusions
1. What is ensemble learning?
Ensemble learning is a machine learning technique in which multiple models are trained on a given dataset to solve the same problem, and the final output (prediction) is a combination of the outputs of all these models. The basic idea is: "Multiple weakly performing machine learning models can be combined to generate more accurate predictions for a given machine learning problem (classification or regression)."
2. Why not a single model?
While solving a machine learning problem, be it a regression or classification task, model selection is a crucial step. Which model will work best for your task depends on a list of factors (e.g., training data size, number of features, class distribution, feature types, etc.).
Once the selected model is trained on the given training dataset, there are three possible outcomes:
- The model is over-fitting the dataset (low bias and high variance).
- The model is under-fitting the dataset (high bias and low variance).
- The model is robust (bias and variance are balanced).
The third scenario is the ideal one, but in practice almost every single model suffers from either over-fitting or under-fitting. Models with high bias (under-fitted) or high variance (over-fitted) often do not perform well on unseen data: either they are under-trained (the under-fitting case), or they learn the training data so well that they fail to generalize (the over-fitting case).
This tension is very common in machine learning and is known as the bias-variance tradeoff. Ensemble methods try to overcome it by combining the results of multiple weak learners. These weak learners often do not perform well individually, but when they are put together, they result in a very strong (robust) model.
3. Bagging
Bootstrap aggregation, also known as bagging, is an ensemble method that works by training multiple models independently and combining them later (using some deterministic averaging method) to produce a strong model. This final model is more robust than the individual weak learners.
Because the constituent models are independent, they can be trained in parallel to speed up the process. These models need not be of the same type (the same ML algorithm); when they are, the ensemble is said to be "homogeneous", otherwise it is called "heterogeneous". Bagging-based models are homogeneous most of the time.
The constituent models (base learners or weak learners) are trained on bootstrapped samples from the original training dataset. Bootstrapping resamples the training data with replacement (keeping the same cardinality) and aims to decrease the variance of the model and reduce overfitting. A model with low variance is desirable because it is not overly sensitive to the particular training data it sees.
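As a minimal sketch of what bootstrapping looks like in practice (assuming NumPy and a small synthetic dataset, purely for illustration), each resample draws the same number of rows as the original dataset, with replacement:

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_sample(X, y):
    """Draw a resample with the same cardinality as the original data, with replacement."""
    n = len(X)
    idx = rng.integers(0, n, size=n)  # some rows appear multiple times, others not at all
    return X[idx], y[idx]

# Hypothetical toy data: 100 rows, 4 features
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

X_boot, y_boot = bootstrap_sample(X, y)
print(X_boot.shape)  # (100, 4): same size as the original dataset
```

Each weak learner in a bagging ensemble is trained on a different such resample.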
Though ensemble learning methods perform really well in practice, they are hard to interpret. You might be interested in my article "Mining Interpretable Rules from Classification Models".
There are multiple ways to combine the results of the weak learners into the final output. Two of the most popular are hard voting and soft voting. In hard voting, class votes are collected from the individual weak learners and the class with the highest number of votes is selected as the final output. In soft voting, class probabilities are collected from the individual models and averaged, and the class with the highest average probability is chosen as the final output.
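If you are using scikit-learn, both modes are exposed through the VotingClassifier's voting parameter; here is a minimal sketch on a synthetic dataset (the particular base models are just illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("knn", KNeighborsClassifier()),
]

# Hard voting: each model casts one class vote, the majority class wins.
hard_vote = VotingClassifier(estimators=base_models, voting="hard").fit(X, y)

# Soft voting: class probabilities are averaged, the highest average wins.
soft_vote = VotingClassifier(estimators=base_models, voting="soft").fit(X, y)

print(hard_vote.predict(X[:5]), soft_vote.predict(X[:5]))
```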
Bagging-based ensembles aim to decrease the variance of the prediction, so if you need a model with low variance, go with bagging. Random Forest is an example of a tree-based ensemble that uses bagging and works really well in practice.
Example: Random Forest
Random forest is an ensemble learning technique that combines multiple decision trees using the bagging method, resulting in a robust model (classifier or regressor) with low variance. The random forest approach trains multiple independent deep decision trees. Deep trees typically overfit and have low bias and high variance, but when combined using the bagging method, they result in a low-variance, robust model.
The random forest algorithm uses one extra trick to keep the constituent trees less correlated: in addition to bootstrapping the training rows, it also samples the features, so only a random subset of the features is considered at each split. The advantages of using only a subset of the features are: 1. the trees in the forest (forest: a group of trees) are less correlated, and each of them learns different kinds of information from the training data; 2. the model is more robust to examples with a few missing feature values.
Therefore, the random forest approach combines the bagging method with random feature selection to produce a more efficient and robust machine learning model.
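A minimal scikit-learn sketch (on a synthetic dataset) showing where these two ideas appear as parameters: bootstrap=True resamples the rows for each tree, and max_features controls the random subset of features considered at each split. The parameter values are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of deep trees trained on bootstrapped samples
    max_features="sqrt",   # random feature subset per split keeps trees less correlated
    bootstrap=True,        # row resampling (bagging)
    random_state=0,
)

print(cross_val_score(forest, X, y, cv=5).mean())
```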
4. Boosting
The boosting method also trains multiple models (weak learners) to get the final output, but this time the constituent weak learners are not independent. They are trained sequentially, and each new model tries to correct the mistakes of the previously trained models. Because the models are dependent and trained sequentially, training is comparatively slow.
The constituent models are typically weak learners with high bias. The boosting approach combines these high-bias weak learners into a low-bias model with high predictive power. Unlike bagging, boosting-based ensembles are prone to overfitting, since they end up as low-bias, high-variance models. Several regularization techniques exist to deal with this overfitting, but they are out of the scope of this article.
Like bagging models, boosting models are homogeneous most of the time (the weak learners are usually the same algorithm). To get the final output, boosting-based ensembles use a weighted sum over the outputs of the weak learners, where each weight depends on the individual performance of the corresponding weak learner.
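AdaBoost is the classic example of this weighted-sum combination. Here is a minimal scikit-learn sketch on a synthetic dataset: by default the weak learners are depth-1 decision stumps, and the fitted ensemble exposes the weight each one carries in the final vote. The parameter values are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Weak learners (decision stumps by default) are trained sequentially; each new one
# pays more attention to the examples the previous ones got wrong.
booster = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0).fit(X, y)

# estimator_weights_ holds the weight of each weak learner in the final weighted vote.
print(booster.estimator_weights_[:5])
```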
Decision tree-based boosted ensembles work by training shallow decision trees sequentially, where each subsequent tree tries to correct the examples misclassified by the previous ones. There are several popular implementations of decision tree-based boosting ensembles:
- Gradient Boosting
- AdaBoost (Adaptive Boosting)
- XGBoost (Extreme Gradient Boosting)
We will talk briefly about the Gradient Boosting and XGBoost techniques.
Example: Gradient Boosting
Gradient boosting (GBM) is a machine learning algorithm that implements the boosting concept over decision trees. It combines weak learners (shallow trees) sequentially to achieve the final high-performance model. Gradient boosting can be used for both regression and classification problems.
Being a boosting algorithm, GBM can easily overfit the training data. Several regularization techniques are used to overcome this problem; the depth of the weak learners and the number of boosting iterations are two such parameters. They can be tuned by monitoring the model's performance on a held-out test dataset for different parameter values.
Stochastic Gradient Boosting is a modification of the gradient boosting algorithm that borrows from the bootstrap aggregation (bagging) technique and trains each constituent decision tree on a random sub-sample of the training dataset. This approach usually results in considerable improvements in model accuracy, and the fraction of training data each weak learner sees becomes another regularization parameter.
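A minimal scikit-learn sketch (on a synthetic dataset) tying these regularization knobs together: max_depth bounds the weak learners, n_estimators is the number of boosting iterations, and subsample < 1.0 turns on the stochastic (row sub-sampling) variant. The specific values here are just illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

gbm = GradientBoostingClassifier(
    max_depth=3,         # depth of each weak learner (shallow trees)
    n_estimators=300,    # number of boosting iterations
    learning_rate=0.05,  # shrinkage applied to each tree's contribution
    subsample=0.8,       # < 1.0 = stochastic gradient boosting
    random_state=0,
).fit(X_train, y_train)

# Monitor held-out performance while tuning these parameters.
print("test accuracy:", gbm.score(X_test, y_test))
```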
Example: XGBoost
XGBoost (Extreme Gradient Boosting) is an open-source implementation of the gradient boosting algorithm. It aims to provide a portable, scalable, and fast (distributed) gradient boosting library. XGBoost has become a popular choice for winning machine learning competitions, and it can run on a single machine as well as on distributed frameworks.
Though it implements the vanilla gradient boosting algorithm at its core, it has a few modifications that make it faster and more efficient. These features include clever penalization of the weak learners, better shrinkage of leaf nodes, and extra randomization.
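A minimal sketch using the xgboost package's scikit-learn-style wrapper (assuming xgboost is installed; the parameter values and dataset are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # pip install xgboost

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,         # row sub-sampling (extra randomization)
    colsample_bytree=0.8,  # feature sub-sampling per tree
    reg_lambda=1.0,        # L2 penalty on leaf weights
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```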
5. Stacking
The stacking method is slightly different from bagging and boosting. Unlike those techniques, it uses a separate model (a meta learner) to combine the results of the base models (weak learners or constituent models). The second difference is that stacking-based models are mostly heterogeneous, as they tend to train algorithmically different kinds of base models. The meta learner takes the outputs of the base models as input and produces the final prediction.
A stacking-based model can be visualized in levels and has at least two levels of models. The first level typically trains two or more base learners (which can be heterogeneous), and the second level is usually a single meta learner that takes the base models' predictions as input and gives the final result as output. A stacked model can have more than two such levels, but adding levels does not always guarantee better performance.
For classification tasks, logistic regression is often used as the meta learner, while linear regression is more suitable for regression tasks.
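A minimal scikit-learn sketch of a two-level stack on a synthetic dataset: three heterogeneous base learners at level one and a logistic regression meta learner at level two (the particular base models are just illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# Level 1: heterogeneous base learners.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svc", SVC(random_state=0)),
    ("knn", KNeighborsClassifier()),
]

# Level 2: a logistic regression meta learner trained on the base models' predictions
# (StackingClassifier generates those predictions with internal cross-validation).
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

print(cross_val_score(stack, X, y, cv=3).mean())
```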
6. Conclusions
After reading this article, you should be comfortable answering the following questions:
- What is ensemble learning, and why is it beneficial?
- What are the different ensemble learning approaches?
- How do the bagging, boosting, and stacking techniques work, and why do they perform better than single learners?
- What is the bias-variance tradeoff, and which part of it does each ensemble technique (bagging, boosting, stacking) address?
Ensemble learning techniques are also a good choice for imbalanced datasets; you might be interested in reading my article on that topic.
Thanks for reading! I hope you have enjoyed the article. Kindly share your feedback in the comments.
Further Reading
- https://en.wikipedia.org/wiki/Ensemble_learning (Ensemble learning)
- https://en.wikipedia.org/wiki/Random_forest (Random Forest)
- https://en.wikipedia.org/wiki/Gradient_boosting (GBM)
- https://en.wikipedia.org/wiki/XGBoost (XGBoost)