Bagging, boosting, and stacking are three ensemble learning techniques used to improve model performance.

Bagging involves training multiple models independently on random subsets of data and then combining their predictions by averaging or by a majority vote.

Boosting focuses on correcting the errors made by previous weak models in a sequence to create a stronger model.

Stacking combines multiple models by training a meta-model, which takes model predictions as input and outputs the final prediction.

Keep reading to learn the differences between them in detail, as well as how and when to use each of them in your Python projects.

A Refresher Of Ensemble Methods In Machine Learning

Before we delve into the differences between bagging, boosting, and stacking, let’s first review what ensemble methods are and why they are useful in the realm of machine learning.

Feel free to skip this section if you are already familiar with the concept.

You might have heard the saying, “two heads are better than one.”

This statement captures the essence of ensemble methods in machine learning.

Ensembles are like teams of machine learning models, such as decision trees, logistic regression, or neural networks, working together to solve a problem.

Picture a group of friends trying to answer a trivia question.

Each friend has their own knowledge and skills, but when they brainstorm and work together, they could come up with the correct answer more often than if they were on their own. That’s precisely what ensemble methods do.

They combine multiple models (also called learners, predictors, or base classifiers) to enhance the overall performance, accuracy, and stability of the predictions.

In most cases, this combination yields better results than using a single model!

Intuition Behind Why Ensembles Work

Each individual model in machine learning tends to have its own limitations and biases.

Expanding on the team analogy, think of these models as players, each with their unique abilities and specialties.

One player may be particularly effective at capturing linear relationships, while another might excel at finding patterns in high-dimensional data.

When you put these models together in a team, something magical happens — they’re able to balance each other’s weaknesses and capitalize on their strengths.

The model skilled in capturing linear relationships can focus on that aspect, while the other model excels at processing complex patterns.

This teamwork allows the models to cover each other’s shortcomings and create a more well-rounded and effective overall approach.

Now, you might be wondering when ensembles are not the best idea.

If you’re working with a small dataset or a simple problem, using an ensemble could over-complicate things and actually make it harder to get a good answer.

Stick to using a single, simpler model for those situations.

There are a few different techniques you can use when it comes to ensemble learning, like bagging, boosting, and stacking.

Each one has its own approach to creating and combining the learners to get the best results.

Let’s dive into the details of these techniques now.

What Is Bagging?

Bagging, which stands for Bootstrap Aggregating, is a powerful ensemble method designed to improve the stability and accuracy of machine learning models.

The key idea behind bagging is to train several base models separately, each on a random subset of the training data, and then combine their predictions through a process called aggregation, usually by averaging (regression) or taking a majority vote (classification).

The subsets of the training data are generated using a technique called bootstrapping, which involves sampling with replacement.

To help illustrate bootstrapping, let’s consider a non-technical example using balls in urns.

Imagine you have an urn containing 6 balls, each ball representing a unique data point in your training dataset.

Each ball has a distinct number and color:

  1. Ball 1 - Red
  2. Ball 2 - Blue
  3. Ball 3 - Green
  4. Ball 4 - Yellow
  5. Ball 5 - Orange
  6. Ball 6 - Purple

Now, let’s say we want to create 3 bootstrapped subsets (sub-urns) using sampling with replacement.

For each sub-urn, we will draw 4 balls, putting each ball back into the urn before the next draw.

This procedure allows balls to appear multiple times in a given sub-urn or not be included at all.

Here’s one possible outcome after the bootstrapping process:

  • Sub-Urn 1: Ball 1 (Red), Ball 2 (Blue), Ball 2 (Blue), Ball 4 (Yellow)
  • Sub-Urn 2: Ball 3 (Green), Ball 1 (Red), Ball 6 (Purple), Ball 1 (Red)
  • Sub-Urn 3: Ball 5 (Orange), Ball 4 (Yellow), Ball 6 (Purple), Ball 3 (Green)

Notice that in Sub-Urns 1 and 2 a ball was drawn twice, and that every sub-urn leaves some balls out entirely. This randomness creates diversity among the subsets.
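The urn experiment is easy to reproduce in code. The snippet below is a minimal sketch using NumPy; the seed and variable names are arbitrary choices for illustration, so your draws will differ from the outcome listed above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility
balls = np.array(["Red", "Blue", "Green", "Yellow", "Orange", "Purple"])

# Draw 3 sub-urns of 4 balls each, sampling with replacement
sub_urns = [rng.choice(balls, size=4, replace=True) for _ in range(3)]

for i, urn in enumerate(sub_urns, start=1):
    drawn = urn.tolist()
    never_drawn = sorted(set(balls.tolist()) - set(drawn))
    print(f"Sub-Urn {i}: {drawn} | never drawn: {never_drawn}")
```

Because each ball goes back into the urn, the same color can show up more than once in a sub-urn, and at least two colors are always left out when drawing 4 balls from 6.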

Now, let’s relate this back to the machine learning context.

When creating bootstrapped subsets for bagging, we are drawing samples (with replacement) from the training data to form new subsets.

Each base learner in the ensemble is trained on a different subset, exposing it to slightly different perspectives of the data.

By doing this, we increase the diversity among base learners and ultimately improve the performance of the ensemble method as a whole.

Bagging is particularly effective with high-variance, low-stability models as base learners, such as decision trees.

By training multiple models on different subsets of the training data, bagging effectively “averages out” their individual errors, resulting in a more robust and stable final predictor.
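To make the two halves of the name concrete (bootstrap, then aggregate), here is a hand-rolled sketch of bagging with decision trees on the Iris dataset. It is only an illustration under my own assumptions (25 trees, majority vote, arbitrary seeds), not how any particular library implements it.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rng = np.random.default_rng(42)
n_models = 25  # arbitrary ensemble size for this sketch
trees = []
for _ in range(n_models):
    # Bootstrap: sample row indices with replacement, same size as the training set
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Aggregate: majority vote across the trees for each test row
all_preds = np.array([tree.predict(X_test) for tree in trees])  # shape (n_models, n_test)
votes = np.array([np.bincount(col, minlength=3).argmax() for col in all_preds.T])

accuracy = (votes == y_test).mean()
print(f"Hand-rolled bagging accuracy: {accuracy:.2f}")
```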

Bagging with high-bias, low-variance models, such as linear regression, is not as effective.

Bagging With Scikit-Learn

Scikit-learn makes it easy to implement bagging using a wide range of base models.

It provides a BaggingClassifier class that can be used for classification problems and a BaggingRegressor class for regression problems.

Here’s an example of how to use the BaggingClassifier with a decision tree base model:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a decision tree base model
base_model = DecisionTreeClassifier()

# Initialize the bagging classifier
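# The base model is passed positionally: the keyword is `estimator` in
# scikit-learn >= 1.2 and `base_estimator` in older releases
bagging_clf = BaggingClassifier(base_model, n_estimators=100)

# Train the bagging classifier
bagging_clf.fit(X_train, y_train)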

# Evaluate the performance
accuracy = bagging_clf.score(X_test, y_test)
print("Accuracy:", accuracy)

BaggingClassifier will sample 100 subsets of the training data (with replacement) and train a base_estimator on each subset.

Then, it will aggregate the predictions of all the base estimators. For classification, it averages the predicted probabilities (falling back to voting when the base model cannot output probabilities); for regression, BaggingRegressor averages the predicted values.

Difference Between Bagging and Random Forests

While both bagging and Random Forests rely on the concept of training multiple decision trees on random subsets of data and aggregating their predictions, Random Forests go one step further.

In bagging, each decision tree considers all the features when making a split, which can result in trees that are quite similar to each other, especially if certain features are strong predictors.

On the other hand, a Random Forest introduces an additional layer of randomness by only considering a random subset of features at each split during tree construction (an idea closely related to the Random Subspace Method).

This forces the individual trees to be more diverse and decorrelated, which can lead to even better performance and generalization when compared to bagging alone.

So we have two sampling techniques at play here: sampling with replacement to create bootstrapped subsets and sampling without replacement to select features at each split.
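In scikit-learn terms, the difference comes down to one setting. The sketch below compares bagged decision trees with a random forest on Iris; the ensemble size, seeds, and max_features="sqrt" are my own illustrative choices, and on such a small dataset the two scores may well be indistinguishable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Plain bagging: every split may look at all features
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=42)

# Random forest: each split only considers a random subset of the features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)

results = {}
for name, model in [("Bagged trees", bagged_trees), ("Random forest", forest)]:
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {results[name]:.3f}")
```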

Out-of-Bag Evaluation

Out-of-Bag (OOB) evaluation is a technique used to estimate the generalization performance of an ensemble learning method, specifically for bagging-based models.

Since bootstrapping involves sampling with replacement, some samples may not be selected at all for a given subset, while others may be chosen multiple times.

Typically, when each bootstrapped subset is as large as the original dataset, around 63% of the original data points end up in it, leaving approximately 37% of them unused during the training of each base model.

These unused samples are called “out-of-bag” samples.
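Where does the 63% come from? When you draw n samples with replacement from n data points, the chance that a given point is picked at least once is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632) as n grows. The short simulation below checks this; the dataset size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # arbitrary dataset size

# One bootstrapped subset: n row indices drawn with replacement
idx = rng.integers(0, n, size=n)

in_bag_fraction = np.unique(idx).size / n   # share of distinct points that were drawn
expected = 1 - (1 - 1 / n) ** n             # theoretical value for this n

print(f"Simulated in-bag fraction: {in_bag_fraction:.3f}")
print(f"Theoretical fraction:      {expected:.3f}")
print(f"Limit 1 - 1/e:             {1 - np.exp(-1):.3f}")
```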

OOB evaluation is essentially a built-in cross-validation process.

Here’s how it works:

  1. Train each base model in the bagging ensemble on its bootstrapped subset of the training data.
  2. For each base model, test its performance on the OOB samples — the samples that were not included in its training subset.
  3. Average the OOB performance metrics, like accuracy for classification or mean squared error for regression, across all the base models.

The aggregated OOB performance provides an estimate of the ensemble’s generalization performance on unseen data.
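If you want to see the OOB estimate for yourself, scikit-learn exposes it through the oob_score parameter and the oob_score_ attribute of BaggingClassifier. The sketch below prints it next to a 5-fold cross-validation score so you can compare the two on your own data. Note that scikit-learn's version, as far as I understand it, scores each training sample using only the estimators that did not see it, which is slightly different from averaging per-model scores as described in the steps above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True needs bootstrap sampling, which is the default
bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    oob_score=True,
    random_state=42,
)
bagging_clf.fit(X, y)
oob_accuracy = bagging_clf.oob_score_

# The same kind of model evaluated with 5-fold cross-validation, for comparison
cv_model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=42)
cv_accuracy = cross_val_score(cv_model, X, y, cv=5).mean()

print(f"OOB accuracy: {oob_accuracy:.3f}")
print(f"CV accuracy:  {cv_accuracy:.3f}")
```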

But, is this estimate good?

I included it here because you may see someone using it, but I have never seen it give a good estimate of the generalization performance in practice.

In my experience, OOB evaluation is not even close to traditional validation methods like cross-validation or hold-out validation.

It tends to give an overly optimistic estimate of the generalization performance.

So I would NOT recommend using it to evaluate your bagging-based models.

Bagging As Kaggle Slang

A lot of the time, when you read about bagging in winning Kaggle solutions, it doesn’t refer to bootstrapping at all.

Instead, it’s used as a general term for any ensemble learning method that combines multiple base models using an average of their predictions.

For example, it’s common to see winners use bagging to describe an average of multiple XGBoost models with different random seeds.
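Here is a rough sketch of that kind of seed averaging. To keep it dependency-free I use scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost, with row subsampling turned on so that the seed actually changes the model; the seeds and settings are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

seeds = [0, 1, 2, 3, 4]  # arbitrary seeds
probas = []
for seed in seeds:
    # Same model and data every time; only the random seed changes
    model = GradientBoostingClassifier(subsample=0.8, random_state=seed)
    model.fit(X_train, y_train)
    probas.append(model.predict_proba(X_test))

# "Bagging" in the Kaggle sense: average the predicted probabilities
avg_proba = np.mean(probas, axis=0)
y_pred = avg_proba.argmax(axis=1)  # Iris labels are 0, 1, 2, so the column index is the class
accuracy = (y_pred == y_test).mean()
print(f"Seed-averaged accuracy: {accuracy:.2f}")
```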

What Is Boosting?

Boosting is another effective ensemble learning technique that focuses on improving the performance of weak learners, typically decision trees with a small depth.

The main idea behind boosting is to train a series of base models sequentially, where each model aims to correct the errors made by its predecessor.

In other words, boosting focuses on instances that were misclassified by earlier models and assigns a higher weight to them, motivating subsequent models to pay more attention to these challenging cases.

To illustrate the boosting concept with a simple example, let’s consider a binary classification problem where we need to determine whether a student will pass or fail a test based on features like hours of study and number of practice problems solved.

Our dataset consists of 5 samples:

Sample No.   Hours of Study   Practice Problems   Pass (1) / Fail (0)
1            1.0              5                   0
2            3.0              7                   1
3            4.5              10                  1
4            2.0              4                   0
5            5.5              15                  1

We’ll use a boosting algorithm with weak learners; in this case, decision stumps (one-level decision trees). We will only train 3 weak learners to keep the example simple.

Here is a step-by-step breakdown of our illustration of boosting:

  1. Initialize sample weights: Give each sample an equal weight of 1/5 (0.2) as we have 5 samples.
  2. Train Weak Learner 1: Build a decision stump that minimizes total weighted error (e.g., based on hours of study: if hours of study <= 2.5, predict Fail, else predict Pass).
  3. Evaluate Weak Learner 1: Determine the misclassified samples, and calculate the overall weighted error.
  4. Compute Weak Learner 1 weight: Assign a weight to Weak Learner 1 based on its performance (higher if more accurate).
  5. Update sample weights: Increase the weight of misclassified instances and decrease the weight of correctly classified instances.
  6. Normalize sample weights: Ensure the sum of sample weights equals 1.
  7. Train Weak Learner 2: Build another decision stump, now prioritizing the misclassified instances from the previous step (e.g., based on the number of practice problems: if practice problems <= 6, predict Fail, else predict Pass).
  8. Evaluate Weak Learner 2: Determine the misclassified samples and calculate the overall weighted error.
  9. Compute Weak Learner 2 weight: Assign a weight to Weak Learner 2 based on its performance.
  10. Update sample weights: Increase the weight of misclassified instances by Weak Learner 2 and decrease the weight of correctly classified instances.
  11. Normalize sample weights: Ensure the sum of sample weights equals 1.
  12. Train Weak Learner 3: Build another decision stump, prioritizing the misclassified instances from the previous step (e.g., if hours of study <= 3.5, predict Fail, else predict Pass).
  13. Evaluate Weak Learner 3: Determine the misclassified samples and calculate the overall weighted error.
  14. Compute Weak Learner 3 weight: Assign a weight to Weak Learner 3 based on its performance.
  15. Combine weak learners: Form the final model by combining the predictions of Weak Learners 1, 2, and 3 with their respective weights.

Our boosting algorithm has now successfully combined three weak learners into a stronger ensemble model that can better classify students as pass or fail based on their study habits.
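The weighting steps are easier to follow when you can run them. Below is a minimal sketch of the classic AdaBoost update with three decision stumps. I use a hypothetical one-feature dataset (not the student table above) in which no single threshold separates the classes, encode the labels as +1 and -1, and use the textbook learner weight 0.5 * ln((1 - error) / error); all names are my own.

```python
import numpy as np

# Hypothetical 1-D toy data: no single threshold separates the two classes
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, -1, -1, 1, 1])  # labels encoded as +1 / -1

def stump_predict(X, threshold, polarity):
    """Predict `polarity` where X > threshold and `-polarity` elsewhere."""
    return np.where(X > threshold, polarity, -polarity)

def fit_stump(X, y, w):
    """Return (threshold, polarity, weighted_error) of the best decision stump."""
    xs = np.sort(X)
    # Candidate thresholds: one below the minimum, then midpoints between values
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2))
    best = None
    for t in thresholds:
        for polarity in (1, -1):
            err = w[stump_predict(X, t, polarity) != y].sum()
            if best is None or err < best[2]:
                best = (t, polarity, err)
    return best

w = np.full(len(X), 1 / len(X))  # equal initial sample weights
learners = []
for _ in range(3):
    t, pol, err = fit_stump(X, y, w)        # train and evaluate the weak learner
    alpha = 0.5 * np.log((1 - err) / err)   # learner weight: higher if more accurate
    pred = stump_predict(X, t, pol)
    w = w * np.exp(-alpha * y * pred)       # raise weights of mistakes, lower the rest
    w = w / w.sum()                         # normalize so the weights sum to 1
    learners.append((alpha, t, pol))

# Combine the weak learners with their weights
score = sum(a * stump_predict(X, t, p) for a, t, p in learners)
final_pred = np.sign(score).astype(int)
print("Final predictions:", final_pred)
print("True labels:      ", y)
```

On this toy data, each stump on its own misclassifies two of the six points, while the weighted combination of the three classifies all of them correctly.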

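Boosting mainly reduces bias, which makes it well suited to models that underfit. It can also reduce variance, but with too many rounds or overly complex weak learners it can still overfit, so keep an eye on validation performance.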

Can You Use Boosting And Bagging Together?

Yes, combining both boosting and bagging ensemble techniques to create an even better machine learning model is indeed possible.

To incorporate bagging into boosting, you can use the subsampling technique when building the weak learners during the boosting process.

In algorithms like XGBoost, subsampling is usually done without replacement, because it empirically seems to work better than sampling with replacement, but nothing stops you from trying both approaches and seeing which one works better for your problem.

During each iteration of the boosting process, instead of using the whole training dataset, you would randomly select a fraction of the dataset.

By doing this, weak learners will be trained on different subsets of the dataset.

The most popular implementations of boosting in Python have a subsample parameter that allows you to specify the fraction of the training dataset to be used for each iteration.

This introduces an element of randomness, which can help to reduce overfitting, combining the strength of both boosting and bagging techniques.
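As a small illustration, this is how row subsampling looks with scikit-learn's GradientBoostingClassifier, where subsample=0.8 means each tree is fit on a random 80% of the training rows. The value 0.8 is just an example, not a recommendation. xgboost's XGBClassifier accepts a subsample argument with the same meaning.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each boosting iteration trains its tree on a random 80% of the training rows
gb_clf = GradientBoostingClassifier(n_estimators=100, subsample=0.8, random_state=42)
gb_clf.fit(X_train, y_train)

accuracy = gb_clf.score(X_test, y_test)
print(f"Accuracy with subsample=0.8: {accuracy:.2f}")
```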

Gradient Boosting

Gradient boosting is a powerful and widely used ensemble technique that builds upon the concept of boosting by combining weak learners, typically decision trees, into an accurate model.

What sets gradient boosting apart is that it utilizes the concept of gradient descent optimization.

Gradient descent is an optimization algorithm used in machine learning to minimize some function, such as a loss function.

In gradient boosting, instead of focusing solely on the misclassified instances, each weak learner is built to predict the negative gradient of the loss function with respect to the current ensemble’s predictions. For squared-error loss, this negative gradient is simply the residual error.

The key steps in gradient boosting are as follows:

  1. Initialize the model: Estimate the initial model with a constant value or a simple base learner (e.g., shallow decision tree).
  2. Calculate the residuals: Compute the residual errors between the predictions of the current model and the actual target values.
  3. Train a weak learner: Fit a new weak learner (e.g., a decision tree) to model the relationships between the input features and the residuals.
  4. Update the model: Add the weak learner to the existing ensemble, applying a learning rate to control the contribution of each weak learner and avoid overfitting.
  5. Iterate: Repeat steps 2 to 4 a predefined number of times or until some criteria are met (e.g., no significant improvement in performance).
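The steps above can be sketched in a few lines for a regression problem with squared-error loss, where the negative gradient is just the residual. The synthetic data, tree depth, learning rate, and number of rounds below are all arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy sine wave
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Step 1: initialize the model with a constant (the mean of the target)
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    residuals = y - prediction                      # step 2: residuals of the current model
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                          # step 3: fit a weak learner to the residuals
    prediction += learning_rate * tree.predict(X)   # step 4: add its scaled contribution
    trees.append(tree)                              # step 5: repeat

final_mse = np.mean((y - prediction) ** 2)
print(f"Training MSE before boosting: {initial_mse:.4f}")
print(f"Training MSE after {n_rounds} rounds: {final_mse:.4f}")
```

To predict on new data, you would start from the same constant and add learning_rate times each tree's prediction, in order.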

One notable advantage of gradient boosting is that it adapts well to various types of problems and loss functions, making it a versatile and robust machine learning method.

Gradient Boosting With XGBoost

XGBoost (short for eXtreme Gradient Boosting) is a highly optimized, open-source implementation of the gradient boosting algorithm.

It has gained immense popularity in recent years due to its excellent performance, parallelization capabilities, and flexibility to handle a wide range of machine learning tasks.

To demonstrate the simplicity and power of XGBoost, let’s see a basic code example in Python.

First, you need to install the XGBoost library if you haven’t already:

pip install xgboost

Now, let’s create a simple example of using XGBoost for a classification problem.

Given the popular Iris dataset, we will classify the flowers into their species.

import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create an XGBoost classifier
xgb_clf = xgb.XGBClassifier()

# Train the classifier on the training data
xgb_clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = xgb_clf.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

In this example, we load the Iris dataset, divide it into training and testing sets, create an XGBoost classifier with the default values for hyperparameters such as learning_rate and max_depth, train it, and evaluate the model’s performance by calculating its accuracy.

Keep in mind that XGBoost offers many hyperparameters that you can fine-tune to achieve optimal results.

The simplicity and effectiveness of XGBoost make it a popular choice when working with gradient boosting algorithms.

Other options are LightGBM and CatBoost, which are also powerful and widely used gradient boosting libraries.

AdaBoost With Scikit-Learn

If you want to create a boosting ensemble with a scikit-learn base model, you can use the AdaBoost algorithm to achieve this.

AdaBoost, short for Adaptive Boosting, is a popular ensemble learning technique that combines weak classifiers to construct a robust and reliable model.

Typically, decision trees with a limited depth are used as the base models, although you have the flexibility to choose other scikit-learn models as well.

The key idea behind AdaBoost is to adaptively update the training data’s instance weights in each iteration.

The algorithm emphasizes the misclassified samples by assigning higher weights and thereby encouraging the next weak learner to focus more on these challenging cases.

This sequential learning process continues until a predefined number of weak learners are built, or the accuracy reaches a satisfactory level.

Scikit-learn provides an AdaBoostClassifier class for classification problems and an AdaBoostRegressor class for regression problems.

Implementing AdaBoost with scikit-learn is simple, as shown in the following example using the Iris dataset:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a decision tree base model
base_model = DecisionTreeClassifier(max_depth=1)

# Initialize the AdaBoost classifier
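# The base model is passed positionally: the keyword is `estimator` in
# scikit-learn >= 1.2 and `base_estimator` in older releases
ada_clf = AdaBoostClassifier(base_model)

# Train the AdaBoost classifier
ada_clf.fit(X_train, y_train)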

# Evaluate the performance
accuracy = ada_clf.score(X_test, y_test)
print("Accuracy:", accuracy)

In this code snippet, we use shallow decision trees (max_depth=1) as the base model for the AdaBoost classifier, train it on the Iris dataset, and evaluate its performance.

You can do some crazy stuff with it: in the Telstra Network Disruptions Kaggle competition, one of the models of my winning ensemble was AdaBoost with Random Forests as base estimators. A boosted ensemble of bagged trees.
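If you are curious what that looks like, here is a minimal sketch on Iris. This is not the competition model, only the general shape of the idea, and the forest size, depth, and number of boosting rounds are placeholders I picked to keep it fast.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# A small random forest (bagged trees) used as the weak learner
base_forest = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=42)

# AdaBoost on top: each round fits a new forest on re-weighted samples
boosted_forest = AdaBoostClassifier(base_forest, n_estimators=5, random_state=42)
boosted_forest.fit(X_train, y_train)

accuracy = boosted_forest.score(X_test, y_test)
print(f"Boosted forest accuracy: {accuracy:.2f}")
```

This works because RandomForestClassifier accepts sample weights in fit, which AdaBoost needs from its base model.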

What Is Stacking?

Stacking, short for stacked generalization, is another powerful ensemble learning technique that combines multiple base models to create a more accurate and robust predictive model.

Unlike boosting and bagging, stacking uses a diverse set of base models and then adds a new model, called the meta-model or the second-level model, to make final predictions based on the output of these base models.

Just as we use raw features as input to train a machine learning model, stacking uses the predictions of the base models as input features to train the meta-model.

The main idea behind stacking is to take advantage of the strengths of different base models, capturing various aspects of the data.

By letting the meta-model learn from these base model predictions, the stacking technique creates a more powerful ensemble that is capable of better generalization on unseen data.

The overall stacking process can be divided into two main steps:

  1. Training base models: Train multiple diverse base models on the given dataset. They can be of different types, such as decision trees, support vector machines, neural networks, etc.
  2. Training the meta-model: Use the predictions of the base models as input features to train the meta-model, often a simple linear model or a decision tree, to make the final prediction.

Stacking With Scikit-Learn

Scikit-learn provides an easy-to-use implementation of stacking with StackingClassifier and StackingRegressor.

With these tools, you can create stacked models without writing any custom code, leveraging the existing base models provided by scikit-learn.

Let’s see a simple example of using scikit-learn to create a stacking classifier:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base models
base_models = [
    ('svm', SVC()),
    ('rf', RandomForestClassifier())
]

# Create the stacking classifier with a logistic regression as the meta-model
stacking_clf = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression())

# Train the stacking classifier
stacking_clf.fit(X_train, y_train)

# Make predictions and evaluate the classifier's accuracy
y_pred = stacking_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

In this example, we use the Iris dataset and create a stacking classifier with an SVM and a random forest as base models, combined by a logistic regression meta-model.

After training the stacking classifier, we evaluate its accuracy on the test data.

Scikit-learn’s stacking implementation offers many more customization options and provides an efficient way to create powerful ensemble models that combine the strengths of diverse base models.

Preventing Data Leakage in Stacking Models

One of the critical concerns when using stacking models is data leakage, which occurs when information from the test data is unintentionally used during model training.

This can lead to overoptimistic performance estimates and a model that doesn’t generalize well to unseen data.

To prevent data leakage in stacking models, it’s essential to properly split the data when training the base models and the meta-model.

A common approach to avoid data leakage in stacking is to use out-of-fold (OOF) predictions.

This technique ensures that the meta-model is trained on predictions made by the base models on data they haven’t seen during their training.

Here’s a high-level outline of the process:

  1. Split the training data: Divide the training data into K-folds (e.g., 5 folds).

  2. Train base models with cross-validation: For each base model, run the K-fold cross-validation. In each fold, train the base model on K-1 parts of the data and make predictions on the remaining part. This will result in out-of-fold predictions for the entire dataset without data leakage.

  3. Prepare new training data for the meta-model: Combine the out-of-fold predictions of all base models to form a new dataset. Each row contains the predictions made by the base models for a single instance, and the target variable remains the same as in the original dataset.

  4. Train the meta-model: Train the meta-model using the newly created dataset of base model predictions.

  5. Make predictions with the stacking model: When predicting new instances, first obtain predictions from the base models, combine them to form meta-model inputs, and finally, use the meta-model to predict the target.

The base models are usually retrained on the entire training data before making predictions on the test data, but you can also make predictions on the test data using the base models trained during step 2 and average them to get the meta-model inputs.

By following this process with cross-validation or out-of-fold predictions, you can create stacking models that are less susceptible to data leakage and can generalize better to unseen data.
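The outline above can be sketched by hand with cross_val_predict, which returns out-of-fold predictions directly. This is a simplified illustration under my own assumptions (class probabilities as meta-features, 5 stratified folds, base models retrained on the full training set before predicting on the test set), not a reference implementation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

base_models = [
    SVC(probability=True, random_state=42),
    RandomForestClassifier(random_state=42),
]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Steps 1-3: out-of-fold class probabilities from each base model, side by side
oof_features = np.hstack([
    cross_val_predict(model, X_train, y_train, cv=cv, method="predict_proba")
    for model in base_models
])

# Step 4: train the meta-model on the out-of-fold predictions
meta_model = LogisticRegression(max_iter=1000).fit(oof_features, y_train)

# Step 5: retrain base models on all training data, then build test meta-features
test_features = np.hstack([
    model.fit(X_train, y_train).predict_proba(X_test) for model in base_models
])
accuracy = meta_model.score(test_features, y_test)
print(f"Stacked accuracy: {accuracy:.2f}")
```

Each row of oof_features holds six numbers: three class probabilities from the SVM and three from the random forest, all produced by models that never saw that row during training.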

Scikit-learn’s StackingClassifier and StackingRegressor implementations handle the data splitting and cross-validation process automatically, protecting against data leakage issues.

However, it’s still essential to understand the underlying concepts and be cautious when preparing your data, especially when designing custom ensemble methods or handling sensitive datasets.

Evaluating Stacked Models Using Nested Validation

When using stacked models, it is crucial to evaluate their performance accurately and avoid biased estimates that may arise from data leakage.

To properly assess stacked models, a technique called nested validation, also known as nested cross-validation, should be employed.

Nested validation involves performing cross-validation multiple times, with an outer loop for model evaluation (outer cross-validation) and an inner loop for model selection, hyperparameter tuning, or stacking (inner cross-validation).

Here’s a high-level overview of the nested validation process:

  1. Outer loop (model evaluation): Divide the data into K outer folds.

  2. Inner loop (model selection or stacking): For each outer fold, perform an inner cross-validation loop to train the base models and the meta-model. The inner loop consists of the following steps:

    • Generate out-of-fold predictions from the base models to prevent data leakage, as explained earlier.
    • Train the meta-model on the base models’ out-of-fold predictions.
    • Optionally perform model selection or hyperparameter tuning using a validation set or another cross-validation.

  3. Evaluate the model: With the base models and the meta-model trained in the inner loop, evaluate the stacked model on the outer fold. Repeat this process for all outer folds.

  4. Calculate performance metrics: By evaluating the stacked model on each outer fold, you obtain unbiased performance metrics that can be averaged to get a more reliable estimate of the model’s performance on unseen data.

This is especially important when the same dataset is used for model selection, hyperparameter tuning, or stacking.

Although a single train/test split is not as comprehensive as full cross-validation, the StackingClassifier example from the previous section gives a simplified illustration of how nested validation works.

We divided the data into training and testing sets using the train_test_split function (single outer loop).

Now let’s consider the inner loop:

  1. The StackingClassifier trains multiple diverse base models (SVM and RandomForestClassifier in the example) on the X_train (training set) using cross-validation. This cross-validation is the inner loop, creating out-of-fold predictions for each base model.

  2. The out-of-fold predictions of these base models serve as input features to the meta-model (LogisticRegression in the example).

  3. The meta-model (LogisticRegression) also trains on the same X_train (training set) using the out-of-fold predictions of the base models.

This process creates two levels of validation within the training set:

  • The first level is for training the base models using cross-validation (inner loop).
  • The second level is for training the meta-model on the out-of-fold predictions of the base models.

After training the StackingClassifier using this nested process, we evaluate its performance on the previously separated testing set (X_test).

The evaluation is based on the accuracy of the model’s predictions (y_pred) compared to the actual target values (y_test).

It is essential to perform this inner and outer loop mechanism to prevent data leakage and achieve a more accurate estimate of our model’s performance.

Keep in mind that a more robust evaluation can be achieved using cross-validation for both the outer and inner loops instead of using a single train_test_split.
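As a rough sketch of that more robust setup, you can pass a StackingClassifier (which runs its own inner cross-validation through the cv parameter) to cross_val_score, which provides the outer loop. The fold counts and seeds below are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: folds used to build out-of-fold predictions for the meta-model
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Outer loop: folds used only to evaluate the whole stacked model
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

stacking_clf = StackingClassifier(
    estimators=[("svm", SVC()), ("rf", RandomForestClassifier(random_state=42))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=inner_cv,
)

# Each outer fold refits the full stack (inner CV included) on the other folds
outer_scores = cross_val_score(stacking_clf, X, y, cv=outer_cv)
print(f"Nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```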