Introduction to Data Science Lectures

Here is the material for a course I will be giving in a Master of Data Science and AI program.


Ensemble methods

The main idea behind ensemble methods is simple: many models are more robust than a single one. Such techniques combine several base models in order to produce one stronger predictive model.

Bagging and boosting

These are the two main techniques for combining models.

Bagging

Bagging (short for bootstrap aggregating) is a technique whose statistical justification comes from the central limit theorem: averaging many independent estimates reduces their variance.

Bootstrap sampling is a statistical technique in which samples are drawn repeatedly, with replacement, from a data source in order to estimate a population parameter.

Bootstrap sampling is used in a machine learning ensemble algorithm called bootstrap aggregating (also called bagging). It helps avoid overfitting and improves the stability of machine learning algorithms.
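As a quick illustration, here is a minimal bootstrap sketch that estimates the standard error of a sample mean by resampling with replacement; the synthetic data and all numerical values are illustrative assumptions.

```python
# A minimal bootstrap sketch: estimating the uncertainty of a sample
# mean by resampling with replacement. The data here are made up.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=200)  # the observed sample

# Draw 1000 bootstrap samples (same size as the data, with replacement)
# and compute the mean of each one.
boot_means = [
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(1000)
]

print("bootstrap estimate of the mean:", np.mean(boot_means))
print("bootstrap standard error:", np.std(boot_means))
```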

In bagging, a number of equally sized subsets of the dataset are drawn with replacement. A machine learning algorithm is then fitted on each subset, and the individual outputs are aggregated into a single prediction, as sketched in the code example below.

This method is part of the wider class of averaging methods for ensemble learning, in which the combined model outperforms each of its components mainly because its variance is reduced.
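Below is a minimal sketch of the bagging procedure with decision trees as base models; the synthetic dataset, the number of models, and the majority-vote aggregation are illustrative assumptions, not the only possible choices.

```python
# A minimal bagging sketch: train decision trees on bootstrap samples
# and aggregate their predictions by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_models = 25
predictions = []
for _ in range(n_models):
    # Bootstrap sample: draw len(X_train) row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
    predictions.append(tree.predict(X_test))

# Aggregate by majority vote: the mean of the 0/1 votes is the
# fraction of trees predicting class 1.
votes = np.mean(predictions, axis=0)
y_pred = (votes >= 0.5).astype(int)
print("bagged accuracy:", np.mean(y_pred == y_test))
```

For regression, the same scheme applies with averaging of the predicted values in place of the majority vote.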

Boosting

The idea behind boosting is quite simple: we train a “chain” of models in which each new model focuses on the examples mis-predicted by the previous one.

In other words, boosting is an ensemble method that adapts the training data, typically by reweighting it, to focus attention on the examples that previously fitted models got wrong.

Such methods belong to the family of sequential ensemble techniques: base estimators are built one after another, and each new one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models into a powerful ensemble.
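Here is a minimal sketch of this idea, using scikit-learn's AdaBoostClassifier as one concrete boosting algorithm (the text above does not name a specific one); the synthetic dataset and hyper-parameter values are illustrative assumptions.

```python
# A minimal boosting sketch using scikit-learn's AdaBoostClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new weak learner (a shallow decision tree by default) is fitted
# on a reweighted dataset that emphasizes the examples the previous
# learners misclassified; the final prediction is a weighted vote.
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print("boosted accuracy:", clf.score(X_test, y_test))
```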

References

  1. A nice introduction to ensemble algorithms can be found here.
  2. A more complete treatment of the subject can be found in chapter 8 of An Introduction to Statistical Learning.
  3. A much deeper discussion of the topic is given in the book Ensemble Methods: Foundations and Algorithms.