As a Python user aiming to predict a continuous target variable from a dataset with both numerical and categorical features, you’ve made a great choice in considering CatBoost.
This high-performance machine learning algorithm is particularly known for its ability to handle categorical variables effectively.
In this tutorial, I’ll guide you step-by-step on how to use CatBoost for regression tasks.
We’ll cover preparing your data, training the CatBoost model, and finally evaluating its performance.
By the end of this tutorial, you’ll have a solid understanding of how to use CatBoost for regression tasks in Python.
Regression Objective Function In CatBoost
Unlike classification, in regression tasks we are trying to predict a continuous value, such as the price of a house or the number of sales of a product.
Because of that, the loss function we commonly use is the Mean Squared Error (MSE).
The MSE is the average squared difference between the actual and predicted values. It’s a popular choice because it punishes larger errors more heavily than smaller ones.
CatBoost’s default objective for regression is RMSE, the square root of MSE, which yields the same optimal model. That said, CatBoost can optimize a variety of other loss functions, including MAE, Poisson, and Quantile.
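Here is a quick sketch of how you would select these objectives when creating the regressor, using the loss names from CatBoost’s documentation:
from catboost import CatBoostRegressor

# RMSE (the square root of MSE) is the default objective for regression
model_rmse = CatBoostRegressor(loss_function='RMSE')

# Other built-in regression objectives
model_mae = CatBoostRegressor(loss_function='MAE')
model_poisson = CatBoostRegressor(loss_function='Poisson')
model_quantile = CatBoostRegressor(loss_function='Quantile:alpha=0.9')  # 90th-percentile regression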
Installing CatBoost in Python
Installing CatBoost in your Python environment is straightforward and can be done using either pip or conda.
To install CatBoost using pip, you can use the following command in your terminal:
pip install catboost
If you prefer using conda, you can install CatBoost with the following command:
conda install -c conda-forge catboost
Remember to run these commands in your terminal, not in your Python script or notebook.
Once the installation is complete, you can import CatBoost into your Python script using:
import catboost
This will allow you to access all the functionalities of CatBoost in your script.
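A quick way to confirm that everything is in place is to print the installed version:
print(catboost.__version__)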
Loading and Preprocessing Data
Before we can start training our model, we need to load and preprocess our data.
We’ll be using the Melbourne Housing Snapshot Dataset for this tutorial.
This dataset contains both numerical and categorical features of houses in Melbourne, Australia, and the goal is to predict the price of a house.
First, let’s load the dataset using Pandas. We’ll use the read_csv() function, which reads a CSV file and loads it into a DataFrame.
import pandas as pd
data = pd.read_csv('melb_data.csv')
After loading the data, it’s good practice to look at the first few rows using the head() function. This gives you a quick overview of the data you’ll be working with.
data.head()
|   | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|--------|---------|-------|------|-------|--------|---------|------|----------|----------|----------|----------|-----|----------|--------------|-----------|-------------|-----------|------------|------------|---------------|
| 0 | Abbotsford | 85 Turner St | 2 | h | 1480000 | S | Biggin | 3/12/2016 | 2.5 | 3067 | 2 | 1 | 1 | 202 | NaN | NaN | Yarra | -37.7996 | 144.998 | Northern Metropolitan | 4019 |
| 1 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000 | S | Biggin | 4/02/2016 | 2.5 | 3067 | 2 | 1 | 0 | 156 | 79 | 1900 | Yarra | -37.8079 | 144.993 | Northern Metropolitan | 4019 |
| 2 | Abbotsford | 5 Charles St | 3 | h | 1465000 | SP | Biggin | 4/03/2017 | 2.5 | 3067 | 3 | 2 | 0 | 134 | 150 | 1900 | Yarra | -37.8093 | 144.994 | Northern Metropolitan | 4019 |
| 3 | Abbotsford | 40 Federation La | 3 | h | 850000 | PI | Biggin | 4/03/2017 | 2.5 | 3067 | 3 | 2 | 1 | 94 | NaN | NaN | Yarra | -37.7969 | 144.997 | Northern Metropolitan | 4019 |
| 4 | Abbotsford | 55a Park St | 4 | h | 1600000 | VB | Nelson | 4/06/2016 | 2.5 | 3067 | 3 | 1 | 2 | 120 | 142 | 2014 | Yarra | -37.8072 | 144.994 | Northern Metropolitan | 4019 |
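Since this dataset mixes numerical and categorical columns and has missing values in several of them, it’s also worth printing a summary of the column types and non-null counts:
data.info()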
Now that our data is ready, we can split it into a training set and a testing set.
The training set is used to train the model, while the testing set is used to evaluate its performance.
We’ll use the train_test_split() function from the sklearn.model_selection module to do this.
from sklearn.model_selection import train_test_split
X = data.drop('Price', axis=1)
y = data['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In this code, test_size=0.2 means that 20% of the data will be used for the test set, and the rest for the training set. random_state=42 ensures that the splits you generate are reproducible: if you run the code again, you’ll get the same train/test split.
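As a quick sanity check, you can print the shapes of the resulting sets; for the snapshot dataset, an 80/20 split gives roughly 10,864 training rows and 2,716 test rows:
print(X_train.shape, X_test.shape)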
Next, we need to handle the categorical variables.
CatBoost can handle categorical variables directly, but we need to tell it which features they are.
To do this, we use the cat_features parameter and pass in a list of column names.
Let’s create this list by selecting all the columns with the object data type.
cat_features = X_train.select_dtypes(include='object').columns.tolist()

# Cast every categorical column to string so CatBoost doesn't see raw NaNs
for feature in cat_features:
    X_train[feature] = X_train[feature].astype(str)
    X_test[feature] = X_test[feature].astype(str)
I added the loop to convert every categorical column to strings because otherwise CatBoost will complain about NaNs in those columns.
You can use any placeholder you want for the missing values, even a “NaN” string, as long as the whole column consists of strings.
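For example, an equivalent and slightly more explicit approach is to fill the gaps with a placeholder of your choice before casting (the 'missing' label here is arbitrary):
# Replace NaNs with an explicit placeholder, then cast to string
for feature in cat_features:
    X_train[feature] = X_train[feature].fillna('missing').astype(str)
    X_test[feature] = X_test[feature].fillna('missing').astype(str)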
By default, CatBoost one-hot encodes categorical features with few unique values and applies its more advanced target-statistics encoding to the rest. The threshold is controlled by the one_hot_max_size parameter, and its default depends on the training mode (it can be as high as 255 on GPU).
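If you want to control that threshold yourself, you can set it explicitly when creating the model; a sketch (the value 10 is just for illustration):
from catboost import CatBoostRegressor

# One-hot encode any categorical feature with 10 or fewer unique values
model_demo = CatBoostRegressor(one_hot_max_size=10, cat_features=cat_features)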
Training the CatBoost Regressor Model
Now that we have our data ready, let’s set up the CatBoost regressor model.
First, we import the necessary module from the CatBoost library.
Then, we create an instance of the CatBoostRegressor class.
When setting up the model, you can specify various hyperparameters. Here are a few key ones:
- Learning Rate: This controls the step size at each iteration while moving toward a minimum of the loss function. A smaller learning rate requires more iterations but can lead to a more accurate model; a larger learning rate requires fewer iterations, but the model may be less accurate.
- Number of Trees: This is the number of trees to be constructed in the boosting process.
- Tree Depth: This is the depth of the trees, i.e., the maximum number of levels in each decision tree. A larger depth makes the model more complex and can lead to overfitting, while a smaller depth might result in underfitting. You can think of it as a regularization parameter.
Lastly, we pass the list with the names of the categorical features to the cat_features parameter.
from catboost import CatBoostRegressor
model = CatBoostRegressor(learning_rate=0.1, n_estimators=100, depth=7, cat_features=cat_features)
Once the model is set up, we can fit it to our training data using the fit() function.
model.fit(X_train, y_train)
During the training process, CatBoost displays the iteration number and the loss function value at each step.
This information can help you understand how your model is learning from the data.
If the loss value is steadily decreasing, the model is learning; if it’s increasing or fluctuating, the model may be struggling to fit the data.
But be careful: this value is calculated on the training data, so it’s not a good indicator of how well your model will perform on unseen data.
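To also track performance on data the model isn’t trained on, fit() accepts an evaluation set. A minimal sketch, reusing our test set for illustration (a separate validation split would be cleaner in practice):
model.fit(
    X_train, y_train,
    eval_set=(X_test, y_test),  # validation loss is reported alongside training loss
    verbose=20,  # print the losses every 20 iterations instead of every iteration
)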
Evaluating the CatBoost Regressor Model
After training our model, we need to evaluate its performance on unseen data.
We do this by making predictions on our test set and comparing these predictions to the actual values.
To make predictions with our trained model, we use the predict() function and pass in our test data.
predictions = model.predict(X_test)
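Before computing any metrics, it can be reassuring to eyeball a few predictions against the actual prices:
# Show the first five predicted prices next to the true ones
for predicted, actual in zip(predictions[:5], y_test.iloc[:5]):
    print(f'predicted: {predicted:,.0f}  actual: {actual:,.0f}')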
Now, we need to calculate some performance metrics to quantify how well our model is doing.
Two commonly used metrics for regression tasks are the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE).
- RMSE: This is the square root of the average of the squared differences between the actual and predicted values. It’s useful because it punishes larger errors more than smaller ones.
- MAE: This is the average of the absolute differences between the actual and predicted values. It’s less sensitive to outliers than the RMSE.
We can calculate these metrics using the mean_squared_error and mean_absolute_error functions from the sklearn.metrics module.
from sklearn.metrics import mean_squared_error, mean_absolute_error
rmse = mean_squared_error(y_test, predictions, squared=False)
mae = mean_absolute_error(y_test, predictions)
In this code, squared=False is used to get the RMSE; with squared=True (the default), the function returns the Mean Squared Error instead. Note that recent scikit-learn releases deprecate the squared argument in favor of a dedicated root_mean_squared_error function.
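If you’re on scikit-learn 1.4 or newer, the dedicated function avoids the deprecated argument:
from sklearn.metrics import root_mean_squared_error

rmse = root_mean_squared_error(y_test, predictions)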
Remember, lower values for both RMSE and MAE indicate a better fit of the model.
These metrics provide a quantitative measure of how accurate our model’s predictions are, which can guide us in further tuning and improving it.
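One simple starting point for that improvement loop is to inspect which features the model leans on; a quick sketch using CatBoost’s built-in importance scores:
# Rank features by how much they contribute to the predictions
importances = model.get_feature_importance(prettified=True)
print(importances.head(10))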