Trying to find the right hyperparameters for XGBoost can feel like searching for a needle in a haystack.

Trust me, I’ve been there. XGBoost was crucial to winning at least two of the Kaggle competitions I participated in.

By the end of this tutorial, you’ll be equipped with the exact same techniques I used to optimize my models and achieve those top rankings.

Let’s get started!

## Installing XGBoost And Optuna

Installing XGBoost is easy. Just run:

```
pip install xgboost
```

Or, if you are using Anaconda, run:

```
conda install -c conda-forge py-xgboost
```

Installing Optuna is also easy. Just run:

```
pip install optuna
```

By default, Optuna uses the Tree-structured Parzen Estimator (TPE), a form of Bayesian optimization, to find the best hyperparameters for your model.

Bayesian optimization is like a treasure hunter using an advanced metal detector to find hidden gold, instead of just digging random holes (random search) or going through the entire area with a shovel (grid search).

The best part? You only need a few lines of code to make it work!

Using Bayesian optimization is a no-brainer.

It’s usually as good as, if not better than, random and grid search when you have more than 2 or 3 hyperparameters to adjust.

I will use the Red Wine Quality dataset from UCI.

It’s a real dataset with chemical properties of red wines where the goal is to predict the quality of the wine.

```
import pandas as pd
from sklearn.model_selection import train_test_split
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url, delimiter=";")
X = data.drop("quality", axis=1)
y = data["quality"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
```

fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|
7.4 | 0.7 | 0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
7.8 | 0.88 | 0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
7.4 | 0.7 | 0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |

It can be treated as a regression or a classification problem, but I will treat it as a regression.

The tuning process for XGBoost is the same for both cases.

## Which XGBoost Hyperparameters Should I Tune?

There are only 6 hyperparameters you really need to focus on when tuning XGBoost.

First up is the number of trees you’ll be training, also known as `n_estimators`.

More trees generally make your predictions more reliable, as long as the learning rate is small enough to keep the model from overfitting.

So, how many trees should you pick?

Well, it depends on your model’s purpose.

If your model needs to deliver results quickly (e.g., high-frequency trading, ad click prediction), you might want to limit the number of trees to around 200.

However, if your model runs once a week (e.g., sales forecasting) and has more time to make the predictions, you could consider using up to 5,000 trees.

As a general guideline, start by fixing the number of trees and then concentrate on tuning the `learning_rate`.

This regulates how much each tree contributes to the final prediction.

The more trees you have, the smaller the learning rate should be.

The recommended range for the learning rate is between 0.001 and 0.1.
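To see why the number of trees and the learning rate go hand in hand, here is a deliberately idealized sketch. It assumes every "tree" fits the current residual perfectly, which real trees do not, and the helper `remaining_error_fraction` is a name made up for this example. The point is only to show how much of the error is left after a given number of shrunken steps.

```python
def remaining_error_fraction(learning_rate, n_trees):
    """Fraction of the initial residual left after n_trees idealized boosting steps."""
    residual = 1.0  # start with one unit of error
    for _ in range(n_trees):
        tree_output = residual  # idealized tree: fits the residual exactly
        residual -= learning_rate * tree_output  # only a shrunken step is applied
    return residual


for lr, n in [(0.1, 50), (0.01, 50), (0.01, 500), (0.001, 5000)]:
    left = remaining_error_fraction(lr, n)
    print(f"learning_rate={lr}, n_trees={n}: residual left = {left:.4f}")
```

With a small learning rate and few trees, most of the error is still there, which is why a smaller learning rate calls for more trees.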

Next, let’s talk about `max_depth`.

This decides the complexity of each tree in your model. It refers to the maximum depth that a tree can grow to.

A deeper tree means more decision paths and potentially capturing more complex patterns in the data.

However, increasing the depth may also cause the model to overfit the training data, as it will make the tree more complex.

I like to tune it in the range of 1 to 10.

The `subsample` hyperparameter controls the amount of data used for building each tree in your model.

It is a fraction greater than 0 and at most 1, representing the proportion of the dataset to be randomly selected for training each tree.

By using only a portion of the data for each tree, the model can benefit from diversity and reduce the correlation between the trees, which may help combat overfitting.

The range I use for `subsample` is between 0.05 and 1.

The `colsample_bytree` hyperparameter is another one to consider when tuning your XGBoost model.

It determines the proportion of features to be considered for each tree.

This value ranges from 0 to 1, where a value of 1 means that all features will be considered for every tree, and a lower value indicates that only a subset of features will be randomly chosen before building each tree.

This method is also known as Random Subspace.
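As a rough picture of what row and column sampling look like, here is a conceptual sketch using only the standard library. It is not XGBoost's internal implementation; the helper `sample_rows_and_columns`, the row count of 1,000, and the rounding rule are assumptions for illustration.

```python
import random


def sample_rows_and_columns(n_rows, feature_names, subsample, colsample_bytree, seed=None):
    """Pick a random subset of row indices and feature names for one tree."""
    rng = random.Random(seed)
    n_rows_kept = max(1, int(n_rows * subsample))
    n_cols_kept = max(1, int(len(feature_names) * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_rows_kept))  # sampled without replacement
    cols = rng.sample(feature_names, n_cols_kept)
    return rows, cols


features = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]

# Each tree gets its own random view of the data.
for tree in range(3):
    rows, cols = sample_rows_and_columns(1000, features, subsample=0.5, colsample_bytree=0.3, seed=tree)
    print(f"tree {tree}: {len(rows)} rows, columns = {cols}")
```

Because every tree sees a different slice of rows and columns, the trees end up less correlated with each other.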

The `subsample` and `colsample_bytree` hyperparameters make XGBoost somewhat similar to Random Forests, which also sample rows (with replacement) and columns when building each tree.

Despite these hyperparameters making XGBoost and Random Forests seem similar, they are different algorithms.

XGBoost is a gradient boosting method, while Random Forests is a bagging method, which means they learn from the data in different ways.

Lastly, the `min_child_weight` hyperparameter sets the minimum sum of instance weights (the sum of the Hessians) that must be present in a child node in each tree.

For regression with squared error loss and no sample weights, this is simply the number of observations that must be present in each node.

This parameter helps control the complexity of the model and prevents overfitting.

This doesn’t mean that you should set `min_child_weight` to a high value, as it will make your model less flexible and more prone to underfitting.

I like to keep it in the range of 1 to 20.
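Here is a small sketch of the rule `min_child_weight` enforces. It assumes squared error loss without sample weights, where each observation contributes a Hessian of 1, so the sum is just a row count. The function `split_is_allowed` is a hypothetical helper, not part of the XGBoost API.

```python
def split_is_allowed(left_hessians, right_hessians, min_child_weight):
    """A split is kept only if both children carry enough total weight."""
    return (
        sum(left_hessians) >= min_child_weight
        and sum(right_hessians) >= min_child_weight
    )


# Squared error loss: every observation has a Hessian of 1.
left = [1.0] * 3    # 3 observations would land in the left child
right = [1.0] * 25  # 25 observations would land in the right child

print(split_is_allowed(left, right, min_child_weight=5))  # False: left child is too small
print(split_is_allowed(left, right, min_child_weight=1))  # True
```

Raising `min_child_weight` rejects splits that isolate only a handful of observations, which is how it limits complexity.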

## Code For Tuning XGBoost Hyperparameters With Optuna

Now that you’re familiar with the essential hyperparameters for tuning, let’s explore the code to optimize them using Optuna.

First, we’ll define the objective function, which Optuna will aim to optimize.

```
import optuna
import xgboost as xgb
from sklearn.metrics import mean_squared_error


def objective(trial):
    # Fixed settings plus the search space for the five tuned hyperparameters.
    params = {
        "objective": "reg:squarederror",
        "n_estimators": 1000,
        "verbosity": 0,
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
        "max_depth": trial.suggest_int("max_depth", 1, 10),
        "subsample": trial.suggest_float("subsample", 0.05, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.05, 1.0),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 20),
    }
    model = xgb.XGBRegressor(**params)
    model.fit(X_train, y_train, verbose=False)
    predictions = model.predict(X_val)
    # Take the square root manually, since `squared=False` is not accepted
    # by newer scikit-learn releases.
    rmse = mean_squared_error(y_val, predictions) ** 0.5
    return rmse
```

Our primary objective is to minimize the root mean squared error (RMSE) on the validation set.

The `params` dictionary within the objective function holds the XGBoost hyperparameters to be fine-tuned by Optuna.

The `trial.suggest_*` methods define the search space for each hyperparameter according to the values I suggested above.

For example, `learning_rate` is searched on a logarithmic scale from 1e-3 to 0.1, and `max_depth` is searched within an integer range from 1 to 10.

The log scale is applied to the learning rate to test more values closer to 0.001, as smaller learning rates paired with a high number of trees generally yield more stable models.
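To make the effect of the log scale concrete, here is a standard-library sketch comparing plain uniform sampling with log-uniform sampling over the same range. It only illustrates the scale itself and is not how Optuna's sampler picks values; the helper `share_below` is made up for this example.

```python
import math
import random

rng = random.Random(0)
low, high, n_samples = 1e-3, 0.1, 10_000

# Uniform on the raw scale versus uniform on the log scale.
linear = [rng.uniform(low, high) for _ in range(n_samples)]
log_uniform = [
    math.exp(rng.uniform(math.log(low), math.log(high))) for _ in range(n_samples)
]


def share_below(values, threshold):
    """Fraction of sampled values smaller than the threshold."""
    return sum(v < threshold for v in values) / len(values)


print(f"linear scale: {share_below(linear, 0.01):.1%} of samples below 0.01")
print(f"log scale:    {share_below(log_uniform, 0.01):.1%} of samples below 0.01")
```

On the linear scale only about one sample in eleven falls below 0.01, while on the log scale roughly half of them do, since 0.001 to 0.01 and 0.01 to 0.1 each span one decade.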

Once the model is trained, it generates predictions on the validation set and calculates the RMSE.

To execute the optimization, we create a study object and pass the objective function to the optimize method.

```
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=30)
```

The `direction` parameter specifies whether we want to minimize or maximize the objective function.

Since a lower RMSE indicates a better model, our aim is to minimize it.

If you’re tuning a classification model with accuracy or AUROC, set the `direction` to maximize.

The `n_trials` parameter defines the number of times the model will be trained with different hyperparameter values.

In practice, about 30 trials are usually sufficient to find a solid set of hyperparameters.

During the optimization, Optuna logs the hyperparameters and RMSE of each finished trial, along with the best trial found so far.

```
[I 2023-04-09 10:37:16,565] A new study created in memory with name: no-name-8560b24e-3a91-48de-8449-c32977dd4c2b
[I 2023-04-09 10:37:49,060] Trial 0 finished with value: 0.5532276610157627 and parameters: {'learning_rate': 0.010958651631216162, 'max_depth': 7, 'subsample': 0.5354179500853077, 'colsample_bytree': 0.7165765694026183, 'min_child_weight': 12}. Best is trial 0 with value: 0.5532276610157627.
[I 2023-04-09 10:37:53,790] Trial 1 finished with value: 0.6019606330122452 and parameters: {'learning_rate': 0.017141615266207846, 'max_depth': 2, 'subsample': 0.5707056227081352, 'colsample_bytree': 0.31105599394671785, 'min_child_weight': 6}. Best is trial 0 with value: 0.5532276610157627.
[I 2023-04-09 10:37:58,257] Trial 2 finished with value: 0.5745420558557753 and parameters: {'learning_rate': 0.05475472895519491, 'max_depth': 3, 'subsample': 0.8723964423668571, 'colsample_bytree': 0.5917021002963235, 'min_child_weight': 18}. Best is trial 0 with value: 0.5532276610157627.
```

Once the optimization is complete, we can display the best hyperparameters and the RMSE score.

```
print('Best hyperparameters:', study.best_params)
print('Best RMSE:', study.best_value)
# Output:
# Best hyperparameters: {'learning_rate': 0.01742253012219986, 'max_depth': 9, 'subsample': 0.6685330381926933, 'colsample_bytree': 0.669833311520857, 'min_child_weight': 14}
# Best RMSE: 0.5404657574376048
```

Now you can use these values in your XGBoost model moving forward!

An additional tip: if most of the best trials utilize a specific hyperparameter near the minimum or maximum value, consider expanding the search space for that hyperparameter.

For example, if most of the best trials use `learning_rate` close to 0.001, you should probably restart the optimization with the search space `trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True)`.
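One way to check for this is a small helper like the one below. The function `near_boundary`, its 10% tolerance, and the sample learning rates are all hypothetical choices for illustration, not part of Optuna.

```python
import math


def near_boundary(values, low, high, tol=0.1, log=False):
    """Return 'low' or 'high' if most values sit within tol of that bound, else None."""
    if log:
        # Compare distances on the log scale for log-scaled search spaces.
        values = [math.log(v) for v in values]
        low, high = math.log(low), math.log(high)
    span = high - low
    near_low = sum((v - low) / span <= tol for v in values)
    near_high = sum((high - v) / span <= tol for v in values)
    if near_low > len(values) / 2:
        return "low"
    if near_high > len(values) / 2:
        return "high"
    return None


# Hypothetical learning rates from the five best trials of a study.
top_learning_rates = [0.0011, 0.0013, 0.0012, 0.0040, 0.0010]
print(near_boundary(top_learning_rates, 1e-3, 0.1, log=True))  # low
```

With a real study, the values could be collected by sorting `study.trials` by each trial's `value` and reading `trial.params["learning_rate"]` from the top few.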

Using this method, you’ll obtain an excellent set of hyperparameters, allowing you to concentrate on other tasks with a greater impact on model performance, such as feature engineering.

This approach has consistently proven successful in my Kaggle competitions and daily work!