As a Kaggle Grandmaster, I absolutely love working with LightGBM, a fantastic machine learning library that’s become one of my go-to tools.

I always focus on tuning the model’s hyperparameters before diving into feature engineering.

Think of it like cooking up the perfect dish.

You want to make sure you’ve got the right ingredients and their quantities before you start experimenting with new flavors.

By fine-tuning your hyperparameters first, you’ll squeeze every last drop of performance out of your model with the data you already have.

Once you’ve got the optimal hyperparameters, feel free to move on to feature engineering.

But I warn you against re-tuning the hyperparameters after that.

More often than not, it’s just not worth the extra effort.

The potential gains are usually small, and you might even risk overfitting your model. So, tune once, and then let your model shine!

In this tutorial I will teach you everything I know about tuning LightGBM hyperparameters using Optuna.

Installing LightGBM And Optuna

Installing LightGBM is easy, just run:

pip install lightgbm

If you run into any problems, check the official documentation.

Installing Optuna is also easy, just run:

pip install optuna

Optuna uses a smart technique called Bayesian optimization to find the best hyperparameters for your model.

Bayesian optimization is like a treasure hunter using an advanced metal detector to find hidden gold, instead of just digging random holes (random search) or going through the entire area with a shovel (grid search).

The best part? You only need a few lines of code to make it work!

Using Bayesian optimization is a no-brainer.

It’s usually just as good as, if not better than, random search when you have more than 2 or 3 hyperparameters to adjust.

I will use the Red Wine Quality dataset from UCI.

It’s a real dataset containing the chemical properties of red wines, and the goal is to predict the quality of each wine.

import pandas as pd
from sklearn.model_selection import train_test_split

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url, delimiter=";")

X = data.drop("quality", axis=1)
y = data["quality"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

Here are the first five rows of the dataset:

fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality
7.4 | 0.7 | 0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5
7.8 | 0.88 | 0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.2 | 0.68 | 9.8 | 5
7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.997 | 3.26 | 0.65 | 9.8 | 5
11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.998 | 3.16 | 0.58 | 9.8 | 6
7.4 | 0.7 | 0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5

It can be treated as a regression or a classification problem, but I will treat it as a regression.

The tuning process for LightGBM is the same for both cases.

Which LightGBM Hyperparameters Should I Tune?

There are only 6 hyperparameters you really need to worry about when tuning LightGBM.

The first thing to consider is the number of trees you’ll be training, also known as num_iterations (called n_estimators in the scikit-learn API).

The more trees you have, the more stable your predictions will be.

So, how many trees should you choose?

Well, it depends on your model’s use case.

If your model needs to deliver results with low latency (e.g., high-frequency trading, ad click prediction), you might want to limit the number of trees to around 200.

However, if your model runs once a week (e.g., sales forecasting) and has more time to make its predictions, you could consider using up to 5,000 trees.

As a general rule of thumb, start by fixing the number of trees and then focus on tuning the learning_rate.

This controls how much each tree contributes to the final prediction.

The more trees you have, the smaller the learning rate should be.

The recommended range for the learning rate is between 0.001 and 0.1.

Next up is num_leaves.

This determines the complexity of each tree in your model. You can think of it as a rough counterpart of the max_depth parameter in other tree-based models: LightGBM grows its trees leaf-wise, so it limits the number of leaves directly instead of the depth.

It refers to the maximum number of terminal nodes, or leaves, that can be present in each tree.

In a decision tree, a leaf represents a decision or an outcome.

By increasing num_leaves, you allow the tree to grow more complex, creating a higher number of distinct decision paths.

This can lead to a more flexible model that can capture intricate patterns in the data.

However, increasing the number of leaves may also cause the model to overfit the training data, as each leaf will be fit on fewer data points.

I like to tune it in powers of 2, starting from 2 and going up to 1024.
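
One way to do that with Optuna (inside the objective function we’ll define later) is to tune the exponent instead of the raw value. A minimal sketch, where the name log2_num_leaves is just an illustrative label:

# Sample the exponent, then convert it to the actual number of leaves.
# This draws num_leaves from {2, 4, 8, ..., 1024}.
num_leaves = 2 ** trial.suggest_int("log2_num_leaves", 1, 10)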

The subsample hyperparameter plays a role in controlling the amount of data used for building each tree in your model.

It is a fraction that ranges from 0 to 1, representing the proportion of the dataset to be randomly selected for training each tree.

By using only a subset of the data for each tree, the model can benefit from the diversity and reduce the correlation between the trees, which may help combat overfitting.

Remember to set bagging_freq to a positive value or LightGBM will ignore subsample.

bagging_freq is how often the data is resampled, measured in number of iterations (trees).

Setting it to 1 means resampling the data before every tree, which is the default behavior of other tree-based models.

The range I use for subsample is between 0.05 and 1.
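
To illustrate the gotcha above, here’s a minimal parameter sketch where subsample actually takes effect; without the bagging_freq line, LightGBM would silently train each tree on all the rows:

params = {
    "subsample": 0.8,   # use 80% of the rows for each tree
    "bagging_freq": 1,  # resample the rows before every tree; without this, subsample is ignored
}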

The colsample_bytree hyperparameter is another important aspect to consider when tuning your LightGBM model.

It determines the proportion of features to be used for each tree.

This value ranges from 0 to 1, where a value of 1 means that all features will be considered for every tree, and a lower value indicates that only a subset of features will be randomly chosen before building each tree.

This method is also known as Random Subspace.

The subsample and colsample_bytree hyperparameters in LightGBM make it similar to Random Forests in some ways, as the latter samples rows (with replacement) and columns for each tree.

Even though these hyperparameters make LightGBM and Random Forests seem similar, they are different algorithms.

LightGBM is a gradient boosting method, while Random Forests is a bagging method, which means they learn from the data in different ways.

Finally, the min_data_in_leaf hyperparameter sets the minimum number of data points that must be present in a leaf node in each tree.

This parameter helps control the complexity of the model and prevents overfitting.

Think about it, if you have a leaf node with only 1 data point, your label will be the value of that single data point.

If you have a leaf node with 30 data points, your label will be the average of those 30 data points.

It’s statistically safer to make decisions based on more data points.

This doesn’t mean that you should set min_data_in_leaf to a high value, as it will make your model less flexible and more prone to underfitting.

I like to keep it in the range of 1 to 100.

Code For Tuning LightGBM Hyperparameters With Optuna

Now that you know which hyperparameters to tune, let’s see how to do it with Optuna.

First we define the objective function, which is the function that Optuna will try to optimize.

import lightgbm as lgb
from sklearn.metrics import mean_squared_error
import optuna

def objective(trial):
    params = {
        # Fixed values: the number of trees stays constant across trials.
        "objective": "regression",
        "metric": "rmse",
        "n_estimators": 1000,
        "verbosity": -1,
        "bagging_freq": 1,
        # Search space: Optuna samples a new value for each of these in every trial.
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 2, 2**10),
        "subsample": trial.suggest_float("subsample", 0.05, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.05, 1.0),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 100),
    }

    model = lgb.LGBMRegressor(**params)
    model.fit(X_train, y_train)  # verbosity=-1 in params already silences the training output
    predictions = model.predict(X_val)
    # Take the square root of the MSE to get the RMSE
    # (avoids the squared=False argument, which newer scikit-learn versions removed).
    rmse = mean_squared_error(y_val, predictions) ** 0.5
    return rmse

The main goal is to minimize the root mean squared error (RMSE) on the validation set.

The params dictionary within the objective function contains the LightGBM hyperparameters that will be tuned by Optuna.

The trial.suggest_* methods are used to specify the search space for each hyperparameter.

For example, learning_rate is searched within a logarithmic scale from 1e-3 to 0.1, and num_leaves is searched within an integer range from 2 to 1024.

The log scale is used for the learning rate because it makes Optuna sample more values close to 0.001, as small learning rates combined with a large number of trees tend to be more stable.

Once the model is trained, it makes predictions on the validation set and calculates the RMSE.

Some people like to use early stopping, a technique that stops adding trees once the validation score stops improving.

I never had good results with early stopping, as it tends to overfit the validation set, so be careful with it.
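
If you still want to experiment with it, here’s a minimal sketch using LightGBM’s callback API (the choice of 50 stopping rounds is just an example):

model = lgb.LGBMRegressor(**params)
model.fit(
    X_train,
    y_train,
    eval_set=[(X_val, y_val)],  # the set used to decide when to stop
    callbacks=[lgb.early_stopping(stopping_rounds=50)],  # stop after 50 rounds without improvement
)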

To run the optimization, we create a study object and pass the objective function to the optimize method.

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=30)

The direction parameter specifies whether we want to minimize or maximize the objective function.

A lower RMSE means a better model, so we want to minimize it.

If you are tuning a classification model with accuracy or AUROC, you should set direction to maximize.
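
For example, a classification version of the objective might differ only in these spots. This is a sketch assuming a binary target and AUROC as the metric:

from sklearn.metrics import roc_auc_score

def objective(trial):
    params = {
        "objective": "binary",  # instead of "regression"
        "metric": "auc",
        "verbosity": -1,
        # ... keep the same search space as in the regression version ...
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
    }
    model = lgb.LGBMClassifier(**params)
    model.fit(X_train, y_train)
    predictions = model.predict_proba(X_val)[:, 1]  # probability of the positive class
    return roc_auc_score(y_val, predictions)

study = optuna.create_study(direction="maximize")  # higher AUROC is better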

The n_trials parameter specifies the number of times the model will be trained with different hyperparameter values.

In practice, about 30 trials are enough to find a pretty good set of hyperparameters.

During the optimization, Optuna will print the best hyperparameters found so far, along with the RMSE score.

[I 2023-04-06 16:16:34,661] A new study created in memory with name: no-name-80c68d30-4fbb-478e-9da3-0aa784570916
[LightGBM] [Warning] min_data_in_leaf is set=60, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=60
[I 2023-04-06 16:16:35,034] Trial 0 finished with value: 0.6393520948457281 and parameters: {'learning_rate': 0.0012764866815097534, 'num_leaves': 13, 'subsample': 0.8731849254391235, 'colsample_bytree': 0.9619948164716645, 'min_data_in_leaf': 60}. Best is trial 0 with value: 0.6393520948457281.
[I 2023-04-06 16:16:35,208] Trial 1 finished with value: 0.6012776822611954 and parameters: {'learning_rate': 0.005445217093720794, 'num_leaves': 6, 'subsample': 0.11943732668607995, 'colsample_bytree': 0.6656123402975522, 'min_data_in_leaf': 43}. Best is trial 1 with value: 0.6012776822611954.
[LightGBM] [Warning] min_data_in_leaf is set=43, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=43
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50

After the optimization is finished, we can print the best hyperparameters and the RMSE score.

print('Best hyperparameters:', study.best_params)
print('Best RMSE:', study.best_value)

Best hyperparameters: {'learning_rate': 0.015247440377194395, 'num_leaves': 13, 'subsample': 0.13740858380047208, 'colsample_bytree': 0.4167953910212117, 'min_data_in_leaf': 15}
Best RMSE: 0.5582819486587627

Now you can take these values and use them in your LightGBM model moving forward!
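
For instance, here’s a minimal sketch of training a final model with the values found by the study, merged with the fixed parameters we didn’t tune:

best_params = {
    "objective": "regression",
    "metric": "rmse",
    "n_estimators": 1000,
    "verbosity": -1,
    "bagging_freq": 1,
    **study.best_params,  # the tuned hyperparameters found by Optuna
}

final_model = lgb.LGBMRegressor(**best_params)
final_model.fit(X_train, y_train)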

An additional tip: if you see that most of the best trials are using a specific hyperparameter close to the minimum or maximum value, you should probably increase the search space for that hyperparameter.

For example, if most of the best trials are using learning_rate close to 0.001, you should probably reset the optimization with the search space trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True).

This method will already give you a really good set of hyperparameters, so you can focus on other tasks that tend to have a bigger impact on model performance, like feature engineering.

It has never failed me in my Kaggle competitions or in my day-to-day work!

To explore more advanced options, check the official LightGBM documentation.