Are you looking to train a Random Forest using XGBoost for classification or regression tasks but aren’t sure where to start?

In this tutorial, I will first briefly explain the mechanisms behind XGBoost and Random Forest and highlight their differences.

Then, I’ll guide you through a step-by-step process of training an XGBoost Random Forest for both classification and regression tasks using a real-world dataset.

By the end of this tutorial, you’ll be well-equipped to tackle your own projects with confidence and expertise.

So, let’s dive in and start exploring!

Standard XGBoost vs Random Forest

Before we dive into the code, let’s first understand the mechanisms behind XGBoost and Random Forest.

XGBoost is an implementation of gradient boosting that uses decision trees as base learners.

Boosting, in a nutshell, is a sequential process: the models are built one after another, with each new model trained to correct the errors made by the combination of the previous ones.

This sequential nature requires careful tuning and a lot of computational resources.
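
To make this concrete, here's a minimal sketch of the boosting loop on a toy regression dataset, using plain scikit-learn decision trees. It only illustrates the concept; XGBoost adds gradients, regularization, and many optimizations on top of this basic loop:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy data just for illustration
X_toy, y_toy = make_regression(n_samples=500, n_features=5, random_state=42)

learning_rate = 0.1
prediction = np.zeros(len(y_toy))  # start from a constant (zero) prediction
trees = []

for _ in range(50):
    residuals = y_toy - prediction  # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=3).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)  # each new tree corrects them
    trees.append(tree)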

On the other hand, Random Forest is a bagging technique mixed with the random subspace method, also applied to decision trees.

Bagging, or bootstrap aggregating, involves training many models on different subsets of the original data (sampled with replacement) and then aggregating their predictions.

The Random Forest goes one step further and randomly selects a subset of features at each split point of the decision tree.

The models are built independently, which allows for parallel computation.

Random Forest is particularly effective when dealing with high variance in data, as the aggregation process tends to smooth out the noise.
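
As with boosting, here's a rough sketch of the idea, reusing the toy data (X_toy, y_toy) from the sketch above. It's an illustration of the concept, not the actual Random Forest implementation:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
trees = []

for i in range(100):
    # Bagging: each tree sees a bootstrap sample (rows drawn with replacement)
    rows = rng.integers(0, len(X_toy), size=len(X_toy))
    # Random subspace: each split only considers a random subset of the features
    tree = DecisionTreeRegressor(max_features="sqrt", random_state=i)
    tree.fit(X_toy[rows], y_toy[rows])
    trees.append(tree)

# The trees are independent, so their predictions are simply averaged
prediction = np.mean([tree.predict(X_toy) for tree in trees], axis=0)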

In summary, the main difference between XGBoost and Random Forest lies in how they build and combine the models: XGBoost builds one tree at a time, where each new tree helps to correct errors made by the combination of previously trained trees.

On the other hand, Random Forest builds many trees over randomly sampled subsets of the data and features, and then averages their predictions.

Training XGBoost Random Forest For Classification

Let’s now dive into training a Random Forest using XGBoost for a classification task.

We will use the XGBRFClassifier class from the xgboost library.

This is an interface that imitates the sklearn API, so it should be familiar to you if you’ve used scikit-learn before.

First, we need to import the necessary libraries and load our data.

We’ll use the Red Wine dataset, which we can load using the pandas library.

This dataset contains chemical features measured from red wines (alcohol, pH, citric acid, etc.) and a quality score between 3 and 8. Higher is better.

Our goal is to train a model that can predict the quality score of a wine given its chemical features.

This task can be treated as a classification or a regression, so we’ll train both types of models.

import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBRFClassifier

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url, sep=";")
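
Before going further, it doesn't hurt to take a quick look at what we just loaded:

# Optional: quick look at the shape of the data and the score distribution
print(data.shape)
print(data['quality'].value_counts().sort_index())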

Next, we split our data into features (X) and target (y), and then into training and test sets.

X = data.drop('quality', axis=1)
y = data['quality'] - data['quality'].min()

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Notice that I subtracted the minimum quality score from the target variable.

XGBoost expects the classes to be numbered from 0 to n_classes - 1, so we need to make sure that our target variable is in this format.
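
A quick check confirms that the shifted labels now run from 0 to 5 (the original scores go from 3 to 8):

# The shifted labels should run from 0 to n_classes - 1
print(y.min(), y.max())  # expected: 0 5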

Now, we can train our XGBoost Random Forest Classifier.

We’ll instantiate the XGBRFClassifier class and fit it to our training data.

# Instantiate the XGBRFClassifier
xgbrf_classifier = XGBRFClassifier(n_estimators=100, max_depth=100)

# Fit the classifier to the training data
xgbrf_classifier.fit(X_train, y_train)

You can make predictions with this model by calling the predict method on xgbrf_classifier with new data.

from sklearn.metrics import classification_report

p = xgbrf_classifier.predict(X_test)
print(classification_report(y_test, p))

XGBRFClassifier Classification Report

In the code above, we used the classification_report function from sklearn to evaluate the performance of our model.

This function shows us the precision, recall, and F1 score for each class in our dataset.

If you want the probabilities of each class instead of only the most likely class, you can use the predict_proba method instead.

p = xgbrf_classifier.predict_proba(X_test)
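
One thing to keep in mind: since we shifted the classes earlier, the predicted labels come out on a 0-5 scale. To get them back on the original 3-8 quality scale, just add the offset back:

# Map predicted classes back to the original quality scale
original_quality = xgbrf_classifier.predict(X_test) + data['quality'].min()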

Training XGBoost Random Forest For Regression

Now, let’s see how we can train an XGBoost Random Forest for a regression task.

The process is very similar to the classification task, but this time we’ll use the XGBRFRegressor class from the xgboost library.

To save time, I'll reuse the same dataset, but this time we'll predict the alcohol content from the other columns. Since this is a continuous target, there's no need to remap it like we did with the quality scores.

X = data.drop('alcohol', axis=1)
y = data['alcohol']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Now, we can train our XGBoost Random Forest Regressor.

We’ll instantiate the XGBRFRegressor class and fit it to our training data.

from xgboost import XGBRFRegressor

# Instantiate the XGBRFRegressor
xgbrf_regressor = XGBRFRegressor(n_estimators=100, max_depth=100)

# Fit the regressor to the training data
xgbrf_regressor.fit(X_train, y_train)

Like before, you can make predictions with this model by calling the predict method on xgbrf_regressor with new data.

from sklearn.metrics import mean_squared_error

p = xgbrf_regressor.predict(X_test)
mean_squared_error(y_test, p, squared=False)

The mean_squared_error function from sklearn with squared=False returns the root mean squared error (RMSE) of our model.
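
If you're on a recent scikit-learn release (1.4 or newer), the squared argument is deprecated and there's a dedicated RMSE function you can use instead:

# On scikit-learn 1.4+ use the dedicated RMSE function
from sklearn.metrics import root_mean_squared_error

print(root_mean_squared_error(y_test, p))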

Top Hyperparameters To Tune

I sometimes say that Random Forests are perfect models for “lazy days”.

They are easy to train and basically require a single hyperparameter: the number of trees.

In the XGBoost Random Forest implementation, n_estimators refers to the number of trees you want to build.

Higher values of n_estimators create more trees, which can lead to better performance.

However, adding more trees can slow down the training process.

Furthermore, after a certain point, adding more trees will not improve the performance.

In the plot below, we can see that the performance of the model improves as we add more trees, but it starts to plateau after about 1,000 trees.

Effect of number of trees on XGB RF performance

Still, this is why I say it’s a perfect model for lazy days: you can just set n_estimators to a high value and let the model do its thing.
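
If you want to reproduce a curve like the one above, a simple sketch (reusing the regression split from the previous section, with example tree counts) looks like this:

# Sketch: test RMSE as a function of the number of trees
from sklearn.metrics import mean_squared_error
from xgboost import XGBRFRegressor

scores = {}
for n_trees in [10, 50, 100, 500, 1000, 2000]:
    model = XGBRFRegressor(n_estimators=n_trees, max_depth=100)
    model.fit(X_train, y_train)
    scores[n_trees] = mean_squared_error(y_test, model.predict(X_test), squared=False)

print(scores)  # plot these values to see where the curve flattens out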

A secondary hyperparameter you can tune is max_depth.

max_depth is the maximum depth of each tree and can be seen as a regularization hyperparameter.

The deeper the tree, the more splits it has and the more information it captures about the data.

For gradient boosting, shallow trees are preferred, but for Random Forests, deeper trees are better.

This is why I set it to 100 in the code above: the default value of 6 comes from the gradient boosting setup, where shallow trees are preferred.

I usually prefer to regularize my Random Forests by increasing the number of trees instead of limiting their depth.

You usually don’t want to set a limit on the depth of the trees, but if you need to limit the number of trees, limiting their depth might help avoid overfitting.

The “ideal” Random Forest has infinite trees with infinite depth.

In practice, we obviously can't afford to train or run inference with infinitely many trees, so we have to make some trade-offs.