Multi-output regression is a machine learning task where we need to predict multiple outputs from a single set of inputs.

Imagine you’re a financial analyst at an investment firm.

Your job is to predict the future performance of various stocks to guide investment decisions.

For each stock, you want to predict several outputs such as the expected return, the volatility (risk), and the correlation with other stocks or market indices.

These outputs are all interrelated and depend on a common set of inputs such as the company’s financial health, market conditions, economic indicators, and so on.

In this case, you could use a multi-output regression model.

The model would take in all the relevant financial data and output the expected return, volatility, and correlation for each stock.

In this tutorial, we’re going to explore how to use XGBoost, a powerful machine learning library, to tackle this modeling problem.

I’ll walk you through the process step by step, from installing XGBoost and loading your data, to training your model and evaluating its performance.

By the end of this tutorial, you’ll have a clear understanding of how to handle multi-output regression with XGBoost.

So, let’s dive in and start learning together!

Installing XGBoost In Python

Before we can start using XGBoost, we need to install it.

XGBoost can be installed using pip, which is a package manager for Python.

To install it, you can use the following command in your terminal:

pip install xgboost

If you are using a Jupyter Notebook, you can run this command in a code cell by prefixing it with an exclamation mark:

!pip install xgboost

You can also install it with conda or mamba:

conda install -c conda-forge xgboost
mamba install -c conda-forge xgboost

After running one of these commands, XGBoost should be installed and ready to use.

You can check if it’s installed correctly by importing it in your Python script:

import xgboost as xgb

If this import runs without any errors, congratulations! You have successfully installed XGBoost.
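It’s also worth printing the version, since the native multi-output support we’ll use later needs XGBoost 1.6.0 or newer:

import xgboost as xgb

# Native multi-output regression requires XGBoost >= 1.6.0
print(xgb.__version__)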

Loading The Data

Let’s import the necessary packages and load our dataset.

Specifically, we’re using pandas to load the data into a DataFrame and NumPy to work with arrays.

import pandas as pd
import numpy as np

# Load the sensor data into a DataFrame and peek at the first rows
data_path = "path_to_your_data"
data = pd.read_csv(data_path)
print(data.head())
      u_q  coolant  stator_winding        u_d  stator_tooth  motor_speed          i_d           i_q       pm  stator_yoke  ambient    torque  profile_id
-0.450682  18.8052         19.0867  -0.350055       18.2932   0.00286557   0.00441914   0.000328102  24.5542      18.3165  19.8507  0.187101          17
-0.325737  18.8186         19.0924  -0.305803       18.2948  0.000256782  0.000605872  -0.000785353  24.5381       18.315  19.8507  0.245417          17
-0.440864  18.8288         19.0894  -0.372503       18.2941   0.00235497   0.00128959   0.000386468  24.5447      18.3263  19.8507  0.176615          17
-0.327026  18.8356          19.083  -0.316199       18.2925   0.00610467  2.55843e-05    0.00204566   24.554      18.3308  19.8506  0.238303          17
 -0.47115   18.857         19.0825  -0.332272       18.2914   0.00313282   -0.0643168     0.0371838  24.5654      18.3267  19.8506  0.208197          17

Storing the location in data_path keeps the code readable, and pd.read_csv(data_path) then reads the CSV file at that location into a pandas DataFrame.

As an example, I’ll use a dataset of sensor readings collected from a Permanent Magnet Synchronous Motor (PMSM).

It’s a collection of measurements taken at a rate of 2 Hz during several testing sessions.

Each test session is identified by a unique “profile_id” and can last between one and six hours.

The dataset includes variables such as voltages in d/q-coordinates (“u_d” and “u_q”), currents in d/q-coordinates (“i_d” and “i_q”), motor speed, torque, and others.

Given all this information, we want to use machine learning to build a model that can predict the performance of the PMSM based on the provided sensor data.

Specifically, we want to predict the ‘pm’, ‘stator_yoke’, ‘stator_tooth’, ‘stator_winding’ values, as these represent important aspects of the motor’s performance.

Don’t get too hung up on the details of the dataset; it’s simply a clean dataset with multiple outputs that serves as a good example for multi-output regression.

Everything you learn here can be applied to your own datasets.
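If you’re following along with your own data instead, a quick structural check before modeling is cheap insurance. A minimal sketch, assuming your data is already loaded into the data DataFrame as above:

# Rows and columns, and missing values per column
print(data.shape)
print(data.isna().sum())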

Training XGBoost With Native Multi-Output Regression Support

Now that we have our data loaded, let’s move on to the exciting part - training our XGBoost model!

Since version 1.6.0, XGBoost has native support for multi-output regression and classification.

This is the recommended way to train multi-output models with it, so I will show you this approach first.

Before we can do that, we need to split our data into input (features) and output (labels).

# Define the features and the targets
features = data.drop(['pm', 'stator_yoke', 'stator_tooth', 'stator_winding'], axis=1)
targets = data[['pm', 'stator_yoke', 'stator_tooth', 'stator_winding']]

In the code above, we use the drop method to remove the columns we want to use as targets, leaving only the features.

Then, we create a new DataFrame containing just those four columns to use as our targets.
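As a quick sanity check, you can confirm the columns ended up where you expect; with this dataset, features should have 9 columns and targets 4:

# Sanity check: 9 feature columns and 4 target columns
print(features.shape, targets.shape)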

Next, we’ll split our data into training and testing sets.

This allows us to evaluate how well our model performs on unseen data (which is what we care about).

from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.2, random_state=42)

The train_test_split function shuffles our data and then splits it.

We’re using 80% of the data for training and 20% for testing.

Now, we’re ready to train our model.

For this, we’ll use XGBoost’s XGBRegressor class.

from xgboost import XGBRegressor

# Define and train the model
model = XGBRegressor(tree_method='hist')

model.fit(X_train, y_train)

Here, we’re creating an instance of XGBRegressor and then fitting it to our training data.

It will automatically detect that the target has multiple columns and handle the multi-output regression for us (by default, it grows one tree per target in each boosting round).

I used tree_method='hist' to speed up training, as it uses a histogram-based algorithm instead of the much slower exact greedy algorithm.

After running this code, our model is trained and ready to make predictions!
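As a side note, if you’re on XGBoost 2.0 or later, you can also ask for a single tree that predicts all four targets at once by setting the multi_strategy parameter. A minimal sketch (this strategy currently requires tree_method='hist'):

# XGBoost >= 2.0: each tree predicts all targets jointly, which can
# exploit correlations between the outputs
model_joint = XGBRegressor(tree_method='hist', multi_strategy='multi_output_tree')
model_joint.fit(X_train, y_train)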

Training XGBoost With Scikit-Learn’s MultiOutputRegressor

Although I don’t recommend it now that XGBoost has native support for multi-output regression, you can still use Scikit-learn’s MultiOutputRegressor to train a multi-output model with XGBoost.

This can be useful if you’re working with an XGBoost version older than 1.6.0 or if you need a specific Scikit-learn feature.

Here’s how you can do it:

First, import the MultiOutputRegressor from Scikit-learn.

from sklearn.multioutput import MultiOutputRegressor

Then, create an instance of XGBRegressor just like before.

xgb_regressor = XGBRegressor(tree_method='hist')

Next, wrap your XGBRegressor with MultiOutputRegressor.

multioutput_regressor = MultiOutputRegressor(xgb_regressor)

Now, you can fit this model to your data.

multioutput_regressor.fit(X_train, y_train)

Under the hood, MultiOutputRegressor trains one XGBoost regressor per target.

Just like before, after running this code, your model is trained and ready to make predictions!
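If you ever need to inspect the individual models, the fitted per-target regressors are exposed through the estimators_ attribute:

# One fitted XGBRegressor per target column, in the same order as y_train
print(len(multioutput_regressor.estimators_))  # 4
pm_model = multioutput_regressor.estimators_[0]  # the model for 'pm'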

Making Predictions

Making predictions works the same way with both approaches.

In either case, you call the predict() method to make predictions on new data.

y_pred = model.predict(X_test)

The result is a NumPy array with one row per test sample and one column per target.

From left to right, the columns hold the predicted values for ‘pm’, ‘stator_yoke’, ‘stator_tooth’, and ‘stator_winding’.
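Since predict() returns a plain array, it can be handy to wrap it back into a labeled DataFrame for inspection. A small convenience sketch:

# Attach the target names and the original test index to the predictions
pred_df = pd.DataFrame(y_pred, columns=targets.columns, index=X_test.index)
print(pred_df.head())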

Evaluating Model Performance With Scikit-Learn Metrics

Once you’ve made predictions with your model, the next step is to evaluate how well those predictions match the actual values.

For this, we’ll use Scikit-learn’s metrics.

There are many metrics we could use (the same ones from single-output regression apply here), but in this tutorial we’ll focus on Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

First, let’s import the necessary functions:

from sklearn.metrics import mean_absolute_error, root_mean_squared_error

Mean Absolute Error (MAE)

MAE is the average of the absolute differences between the predicted and actual values.
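In symbols, for $n$ samples with true values $y_i$ and predictions $\hat{y}_i$:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$

With multiple outputs, Scikit-learn computes this per target and averages the results by default.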

We can calculate MAE in Python using the mean_absolute_error function:

mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae}")

Root Mean Squared Error (RMSE)

RMSE is the square root of the average of the squared differences between the predicted and actual values.

Because the errors are squared before averaging, RMSE is more sensitive to large errors than MAE, punishing them more heavily.
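In symbols:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$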

We can calculate RMSE in Python using the root_mean_squared_error function (added in Scikit-learn 1.4; on older versions, use mean_squared_error with squared=False instead):

rmse = root_mean_squared_error(y_test, y_pred)
print(f"Root Mean Squared Error: {rmse}")

And that’s it!

You’ve now successfully trained a multi-output regression model with XGBoost in Python, made predictions with it, and evaluated its performance.

To improve your model from here, check out my XGBoost hyperparameter tuning tutorial.