As a Python user aiming to predict a continuous target variable from a dataset with both numerical and categorical features, you’ve made a great choice in considering CatBoost.

This high-performance machine learning algorithm is particularly known for its ability to handle categorical variables effectively.

In this tutorial, I’ll guide you step-by-step on how to use CatBoost for regression tasks.

We’ll go from preparing your data, through training the CatBoost model, to finally evaluating its performance.

By the end of this tutorial, you’ll have a solid understanding of how to use CatBoost for regression tasks in Python.

Regression Objective Function In CatBoost

Unlike classification, in regression tasks we are trying to predict a continuous value, such as the price of a house or the number of sales of a product.

Because of that, the loss function we commonly use is the Mean Squared Error (MSE).

The MSE calculates the average squared difference between the actual and predicted values. It’s a popular choice because it punishes larger errors more than smaller ones.

That said, CatBoost can optimize a variety of regression loss functions, including MAE, Poisson, and Quantile loss.
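For example, once CatBoost is installed (we cover installation in the next section), you select the objective through the loss_function parameter of CatBoostRegressor. Here is a minimal sketch using loss names from CatBoost’s documentation:

from catboost import CatBoostRegressor

model_rmse = CatBoostRegressor(loss_function='RMSE')  # default regression objective (square root of MSE)
model_mae = CatBoostRegressor(loss_function='MAE')    # optimize mean absolute error instead
model_q90 = CatBoostRegressor(loss_function='Quantile:alpha=0.9')  # 90th-percentile quantile loss
model_pois = CatBoostRegressor(loss_function='Poisson')  # suited to count-like targets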

Installing CatBoost in Python

Installing CatBoost in your Python environment is straightforward and can be done using either pip or conda.

To install CatBoost using pip, you can use the following command in your terminal:

pip install catboost

If you prefer using conda, you can install CatBoost with the following command:

conda install -c conda-forge catboost

Remember to run these commands in your terminal, not in your Python script or notebook.

Once the installation is complete, you can import CatBoost into your Python script using:

import catboost

This will allow you to access all the functionalities of CatBoost in your script.
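In practice you’ll usually also import the specific classes you need, such as CatBoostRegressor. A quick sanity check after installing is to print the package version (catboost, like most packages, exposes a __version__ attribute):

import catboost
from catboost import CatBoostRegressor

print(catboost.__version__)  # confirms the package imports and shows the installed version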

Loading and Preprocessing Data

Before we can start training our model, we need to load and preprocess our data.

We’ll be using the Melbourne Housing Snapshot Dataset for this tutorial.

This dataset contains both numerical and categorical features of houses in Melbourne, Australia, and the goal is to predict the price of a house.

First, let’s load the dataset using Pandas. We’ll use the read_csv() function, which allows us to read a CSV file and load it into a DataFrame.

import pandas as pd

data = pd.read_csv('melb_data.csv')

After loading the data, it’s a good practice to look at the first few rows using the head() function. This will give you a quick overview of the data you’ll be working with.

data.head()
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount
0 Abbotsford 85 Turner St 2 h 1.48e+06 S Biggin 3/12/2016 2.5 3067 2 1 1 202 nan nan Yarra -37.7996 144.998 Northern Metropolitan 4019
1 Abbotsford 25 Bloomburg St 2 h 1.035e+06 S Biggin 4/02/2016 2.5 3067 2 1 0 156 79 1900 Yarra -37.8079 144.993 Northern Metropolitan 4019
2 Abbotsford 5 Charles St 3 h 1.465e+06 SP Biggin 4/03/2017 2.5 3067 3 2 0 134 150 1900 Yarra -37.8093 144.994 Northern Metropolitan 4019
3 Abbotsford 40 Federation La 3 h 850000 PI Biggin 4/03/2017 2.5 3067 3 2 1 94 nan nan Yarra -37.7969 144.997 Northern Metropolitan 4019
4 Abbotsford 55a Park St 4 h 1.6e+06 VB Nelson 4/06/2016 2.5 3067 3 1 2 120 142 2014 Yarra -37.8072 144.994 Northern Metropolitan 4019
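It’s also worth checking the column data types and how much data is missing, since both will matter when we handle the categorical features later. Plain pandas is enough for this:

# Column names, dtypes and non-null counts in one overview
data.info()

# Number of missing values in each column
data.isna().sum()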

Now that our data is ready, we can split it into a training set and a testing set.

The training set is used to train the model, while the testing set is used to evaluate its performance.

We’ll use the train_test_split() function from the sklearn.model_selection module to do this.

from sklearn.model_selection import train_test_split

X = data.drop('Price', axis=1)
y = data['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In this code, test_size=0.2 means that 20% of the data will be used for the test set, and the rest for the training set.

random_state=42 is used to ensure that the splits you generate are reproducible. If you run the code again, you’ll get the same train/test split.
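If you want to sanity-check the split, printing the shapes of the resulting sets is a quick way to confirm the 80/20 proportion:

print(X_train.shape, X_test.shape)  # about 80% / 20% of the rows
print(y_train.shape, y_test.shape)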

Next, we need to handle categorical variables.

CatBoost can handle categorical variables directly, but we need to specify the names of these features.

To do this, we can use the cat_features parameter and pass in a list of strings.

Let’s create this list by selecting all the columns with the object data type.

# Collect the names of all object (text) columns
cat_features = X_train.select_dtypes(include='object').columns.tolist()

# Cast them to strings; this turns any NaN into the literal string 'nan',
# which CatBoost accepts in categorical columns
for feature in cat_features:
    X_train[feature] = X_train[feature].astype(str)
    X_test[feature] = X_test[feature].astype(str)

I added the loop to cast everything to strings because CatBoost raises an error if a categorical column contains NaN values.

You can fill the missing values with any placeholder you like, even the literal string “NaN”, as long as the whole column ends up as strings.
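For example, if you’d rather use an explicit placeholder than rely on the 'nan' string that astype(str) produces, a sketch along these lines (the 'missing' label is an arbitrary choice) works just as well:

for feature in cat_features:
    # Replace NaN with an explicit placeholder before casting to string
    X_train[feature] = X_train[feature].fillna('missing').astype(str)
    X_test[feature] = X_test[feature].fillna('missing').astype(str)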

Under the hood, CatBoost one-hot encodes categorical features with few unique values (the threshold is controlled by the one_hot_max_size parameter, and its default depends on the training configuration); higher-cardinality features are encoded with more advanced, target-based statistics.

Training the CatBoost Regressor Model

Now that we have our data ready, let’s set up the CatBoost regressor model.

First, we import the necessary module from the CatBoost library.

Then, we create an instance of the CatBoostRegressor class.

When setting up the model, you can specify various hyperparameters. Here are a few key ones:

  • Learning Rate: This controls the step size at each iteration while moving toward a minimum of a loss function. A smaller learning rate requires more iterations, but can lead to a more accurate model. Conversely, a larger learning rate requires fewer iterations, but the model may be less accurate.

  • Number of Trees: This is the number of trees to be constructed in the boosting process.

  • Tree Depth: This is the depth of the trees, i.e., the maximum number of levels in each decision tree. A larger depth can make the model more complex and potentially lead to overfitting, while a smaller depth might result in underfitting. You can think of it as a regularization parameter.

Lastly, we pass the list with the names of the categorical features to the cat_features parameter.

from catboost import CatBoostRegressor

model = CatBoostRegressor(learning_rate=0.1, n_estimators=100, depth=7, cat_features=cat_features)

Once the model is set up, we can fit it to our training data using the fit() function.

model.fit(X_train, y_train)

During the training process, CatBoost displays the iteration number and the loss function value at each step.

This information can help you understand how your model is learning from the data.

If the loss function value is decreasing, it means that the model is improving.

If it’s increasing or fluctuating, it might mean that the model is struggling to learn from the data.

But be careful, because the loss function value is calculated on the training data, so it’s not a good indicator of how well your model will perform on unseen data.
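To monitor the loss on data the model hasn’t trained on, you can pass a validation set to fit() through its eval_set parameter. Here’s a minimal sketch; I’m reusing the test set purely for illustration, whereas for serious tuning you’d set aside a separate validation split:

model.fit(
    X_train, y_train,
    eval_set=(X_test, y_test),  # the loss is also reported on this held-out data
    verbose=20                  # print progress every 20 iterations
)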

Evaluating the CatBoost Regressor Model

After training our model, we need to evaluate its performance on unseen data.

We do this by making predictions on our test set and comparing these predictions to the actual values.

To make predictions with our trained model, we use the predict() function and pass in our test data.

predictions = model.predict(X_test)

The result is an array of predicted house prices, one for each row of the test set.
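A quick way to eyeball the result is to line up the first few predictions against the actual prices (a rough sketch; your exact numbers will depend on the train/test split):

comparison = pd.DataFrame({
    'actual': y_test.values[:5],
    'predicted': predictions[:5],
})
print(comparison)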

Now, we need to calculate some performance metrics to quantify how well our model is doing.

Two commonly used metrics for regression tasks are the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE).

  • RMSE: This is the square root of the average of the squared differences between the actual and predicted values. It’s useful because it punishes larger errors more than smaller ones.

  • MAE: This is the average of the absolute differences between the actual and predicted values. It’s less sensitive to outliers than the RMSE, as the small numerical example after this list shows.
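To make that difference concrete, here’s a tiny, self-contained example with made-up errors:

import numpy as np

errors = np.array([1.0, 1.0, 1.0, 10.0])  # three small errors and one large one

mae = np.mean(np.abs(errors))          # (1 + 1 + 1 + 10) / 4 = 3.25
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((1 + 1 + 1 + 100) / 4) ≈ 5.07

print(mae, rmse)  # the single large error inflates the RMSE far more than the MAE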

We can calculate these metrics using the mean_squared_error and mean_absolute_error functions from the sklearn.metrics module.

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# RMSE is the square root of the MSE
rmse = np.sqrt(mean_squared_error(y_test, predictions))
mae = mean_absolute_error(y_test, predictions)

In this code, we compute the MSE and take its square root to get the RMSE. Older scikit-learn versions accepted squared=False in mean_squared_error to return the RMSE directly, but that argument has been deprecated and removed in recent releases, which ship a dedicated root_mean_squared_error function instead; taking the square root yourself works across versions.

Remember, lower values for both RMSE and MAE indicate a better fit of the model.

These metrics provide a quantitative measure of how accurate our model’s predictions are, which can guide us in further tuning and improving it.