You’ve built a CatBoost model; now what?

Hyperparameter tuning is the key to unlocking your model’s full potential.

But if the thought of tackling this task feels daunting, you’re not alone.

Once you’ve mastered the tips and tricks presented in this tutorial, you’ll be equipped with the skills to fine-tune any CatBoost model effectively.

Let’s get started!

Installing CatBoost and Optuna

First, let’s install both libraries simply by running:

pip install catboost optuna

Or, if you’re using Anaconda, run:

conda install -c anaconda catboost
conda install -c conda-forge optuna

Optuna uses a smart technique called Bayesian optimization to find the best hyperparameters for your model.

Using Bayesian optimization is like having a seasoned navigator guiding you through unknown waters.

Instead of aimlessly drifting in random directions (random search) or painstakingly charting each point manually (grid search), the navigator steers the ship expertly by using past experiences and knowledge of currents to find the best route to your destination.
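Concretely, Optuna's default sampler is the Tree-structured Parzen Estimator (TPE), a Bayesian optimization method, so you get this behavior without any extra configuration. If you want to make it explicit (and seed it for reproducibility), a minimal sketch looks like this:

import optuna

# TPE is Optuna's default sampler; passing it explicitly just lets us
# fix a seed so repeated runs explore the same candidates.
study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=42))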

In this tutorial, I’ll use the Red Wine Quality dataset from UCI.

It’s a real dataset with chemical properties of red wines, and our goal is to predict the quality of the wine.

import pandas as pd
from sklearn.model_selection import train_test_split

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url, delimiter=";")

X = data.drop("quality", axis=1)
y = data["quality"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
A few sample rows from the feature matrix look like this:

fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density  pH    sulphates  alcohol
8.7            0.69              0.31         3               0.086      23                   81                    1.0002   3.48  0.74       11.6
6.1            0.21              0.4          1.4             0.066      40.5                 165                   0.9912   3.25  0.59       11.9
10.9           0.39              0.47         1.8             0.118      6                    14                    0.9982   3.3   0.75       9.8
8.8            0.685             0.26         1.6             0.088      16                   23                    0.99694  3.32  0.47       9.4
8.4            1.035             0.15         6               0.073      11                   54                    0.999    3.37  0.49       9.9

We can treat this problem as regression or classification, but in this tutorial, we’ll focus on regression.

Which CatBoost Hyperparameters Should I Tune?

In tuning CatBoost, there are six main hyperparameters to focus on:

Number of Trees (iterations)

Picture a sculptor gradually chipping away at a block of stone to create a detailed statue. Each stroke brings the sculpture closer to the final vision.

Similarly, the number of iterations in CatBoost is the number of boosting rounds: each round adds a decision tree that further refines what the model has learned from the data.

The best value for the number of iterations depends on your specific problem and dataset.

If your model will run in real time, you’ll want to keep the number of iterations low, but if you only need to make predictions once a week, you can afford to use more iterations.

I like to use a fixed number, like 100-200 for real-time applications and 1,000-2,000 for batch applications, as I don’t see the point of tuning both the number of iterations and the learning rate.
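As a sketch of that rule of thumb (the learning rates here are illustrative placeholders, not tuned values):

import catboost as cb

# Low-latency use case: a small, fixed tree budget.
realtime_model = cb.CatBoostRegressor(iterations=200, learning_rate=0.1)

# Weekly batch use case: a larger budget with a smaller learning rate.
batch_model = cb.CatBoostRegressor(iterations=2000, learning_rate=0.02)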

Learning Rate (learning_rate)

Imagine a choir where each singer adds their voice to create the perfect harmony.

However, some singers have louder voices than others, so the choir director instructs them to adjust their volume to maintain balance.

In CatBoost, the learning rate operates similarly—it scales the contribution of each decision tree to manage the overall balance and accuracy of the model.

A smaller learning rate signifies that each tree offers a smaller “voice,” or a smaller update to the model, resulting in gradual learning.

This can lead to higher accuracy, but it also increases the risk of underfitting and lengthens training time.

A larger learning rate, on the other hand, means each tree has a more significant impact on the model, speeding up the learning process.

However, a high learning rate can result in overfitting or model instability.

A range of 0.001 to 0.1 is a good starting point.

Tree Depth (depth)

You can think of the depth as the complexity or “height” of decision trees in your CatBoost model.

A higher depth can capture more intricate patterns in your data, leading to better performance.

But there’s a catch: the deeper the tree, the more time it takes to train, and the higher the risk of overfitting.

When tuning depth, it’s a good idea to try out values between 1 and 10. Keep in mind that CatBoost grows symmetric trees by default, so a tree of depth d has 2^d leaves, and the cost grows quickly past depth 10.

Subsample (subsample)

Subsampling is a technique used to randomly choose a fraction of the dataset when constructing each tree.

This promotes diversity among the trees and helps reduce overfitting.

For subsample, I recommend a search range from 0.05 to 1.

Lower values increase diversity but may result in underfitting.
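One CatBoost-specific detail worth knowing: per the CatBoost docs, subsample only applies with the Bernoulli, Poisson, or MVS bootstrap types. A minimal sketch that makes the bootstrap explicit (the 0.8 is just an illustrative value):

import catboost as cb

# Making bootstrap_type explicit avoids surprises on setups where the
# default is the Bayesian bootstrap, which does not accept subsample.
model = cb.CatBoostRegressor(bootstrap_type="Bernoulli", subsample=0.8)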

Feature Sampling by Level (colsample_bylevel)

colsample_bylevel is the fraction of features considered when searching for the best split at each level of the tree-building process.

The idea is the same as with subsample, but this time, we’re sampling features instead of rows.

I like to use values between 0.05 and 1.0 in the search space.

Minimum Data in Leaf (min_data_in_leaf)

min_data_in_leaf specifies the minimum number of samples required to create a leaf, effectively controlling the split creation process.

Think of it as the answer to: how many data points will each leaf use to estimate its prediction?

Higher values generate less complex trees, reducing overfitting risks, but might result in underfitting. Lower values lead to more complex trees that might overfit.

I like to consider values between 1 and 100. One caveat: the CatBoost docs state that min_data_in_leaf works only with the Depthwise and Lossguide growing policies, so depending on your CatBoost version you may need to set grow_policy for it to take effect.

Code for Tuning CatBoost Hyperparameters with Optuna

Now that you know the critical hyperparameters, let’s learn how to optimize them using Optuna.

First, we’ll define the objective function, which Optuna aims to optimize.

import catboost as cb
from sklearn.metrics import mean_squared_error
import optuna

def objective(trial):
    # Search space: iterations is held fixed, everything else is tuned.
    params = {
        "iterations": 1000,
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
        "depth": trial.suggest_int("depth", 1, 10),
        "subsample": trial.suggest_float("subsample", 0.05, 1.0),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.05, 1.0),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 100),
    }

    model = cb.CatBoostRegressor(**params, silent=True)
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)
    # squared=False returns the RMSE; on newer scikit-learn versions you
    # may need sklearn.metrics.root_mean_squared_error instead.
    rmse = mean_squared_error(y_val, predictions, squared=False)
    return rmse

Our primary objective is to minimize the root mean squared error (RMSE) on the validation set.

The params dictionary within the objective function holds the CatBoost hyperparameters to be fine-tuned by Optuna.

The trial.suggest_* methods define the search space for each hyperparameter according to the values I suggested above.

For example, learning_rate is searched within a logarithmic scale from 1e-3 to 0.1, and depth is searched within an integer range from 1 to 10.

Why logarithmic? Because plausible learning rates span orders of magnitude, and a log scale samples each order of magnitude evenly; smaller learning rates also tend to be more stable and effective when we have a large number of iterations.

Once the model is trained, it generates predictions on the validation set and calculates the RMSE.
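One optional tweak, shown below as a variant of the fit call: since iterations is fixed at 1000, every trial trains the full budget. Passing the validation set to fit with early stopping can shorten unpromising trials, with the caveat that reusing the same validation set for both stopping and scoring makes the reported RMSE slightly optimistic.

model = cb.CatBoostRegressor(**params, silent=True)
model.fit(
    X_train,
    y_train,
    eval_set=(X_val, y_val),
    early_stopping_rounds=50,  # 50 is an illustrative patience value
)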

To execute the optimization, we create a study object and pass the objective function to the optimize method.

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=30)

The direction parameter specifies whether we want to minimize or maximize the objective function.

Since a lower RMSE indicates a better model, our aim is to minimize it.

The n_trials parameter defines the number of times the model will be trained with different hyperparameter values.

In practice, about 30 trials are usually sufficient to find a solid set of hyperparameters.
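A convenient property of Optuna: if 30 trials turn out not to be enough, you don’t need to start over, because calling optimize again on the same study object simply adds more trials.

# Resume the same study with 20 extra trials; the sampler reuses the
# history from the first 30.
study.optimize(objective, n_trials=20)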

During the optimization, Optuna will display the best hyperparameters discovered thus far, along with the RMSE score.

[I 2023-04-19 12:06:08,041] Trial 0 finished with value: 0.5924479493712448 and parameters: {'learning_rate': 0.004223041167235877, 'depth': 7, 'subsample': 0.7417292559386053, 'colsample_bylevel': 0.8745350089011158, 'min_data_in_leaf': 13}. Best is trial 0 with value: 0.5924479493712448.
[I 2023-04-19 12:06:09,555] Trial 1 finished with value: 0.5614677547730352 and parameters: {'learning_rate': 0.02103994077207402, 'depth': 5, 'subsample': 0.22061724636372576, 'colsample_bylevel': 0.5059245643597533, 'min_data_in_leaf': 48}. Best is trial 1 with value: 0.5614677547730352.
[I 2023-04-19 12:06:10,126] Trial 2 finished with value: 0.6902916540317992 and parameters: {'learning_rate': 0.0041358521262659514, 'depth': 6, 'subsample': 0.6463635383371191, 'colsample_bylevel': 0.06872124261390113, 'min_data_in_leaf': 90}. Best is trial 1 with value: 0.5614677547730352.

Once the optimization is complete, we can display the best hyperparameters and the RMSE score.

print('Best hyperparameters:', study.best_params)
print('Best RMSE:', study.best_value)

Best hyperparameters: {'learning_rate': 0.044248358418971304, 'depth': 10, 'subsample': 0.860245768257485, 'colsample_bylevel': 0.2813359918917325, 'min_data_in_leaf': 23}
Best RMSE: 0.5248051919337918

Now you can use these values in your CatBoost model moving forward!
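For instance, a minimal refit with the tuned values might look like this (iterations must be supplied again because it was held fixed during the search and therefore isn’t in best_params):

# Refit on the training data with the tuned hyperparameters.
final_model = cb.CatBoostRegressor(iterations=1000, **study.best_params, silent=True)
final_model.fit(X_train, y_train)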

An additional tip: if most of the best trials utilize a specific hyperparameter near the minimum or maximum value, consider expanding the search space for that hyperparameter.

For example, if most of the best trials use learning_rate close to 0.001, you should probably restart the optimization with the search space trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True).

Using this method, you’ll obtain an excellent set of hyperparameters, allowing you to concentrate on other tasks with a greater impact on model performance, such as feature engineering.