Today, we’re going to dive into the world of LightGBM and multi-output tasks.
LightGBM is a powerful gradient boosting framework (like XGBoost) that’s widely used for regression, classification, and ranking tasks on tabular data.
But what if you want to predict multiple outputs at once?
That’s where multi-output regression and classification come in.
Unfortunately, LightGBM doesn’t support multi-output tasks directly, but we can use scikit-learn’s MultiOutputRegressor
to get around this limitation.
What Are Multi-Output Regression and Classification?
First, let’s break down what these terms mean.
In machine learning, we often want to predict an outcome based on some input data.
For example, you might want to predict the price of a house based on its size, location, and age. This is called regression.
Now, imagine you want to predict not just one thing, but several things at once.
For instance, you might want to predict both the price of a house and how long it will take to sell. This is what we call multi-output regression.
Similarly, in classification, we’re trying to sort data into categories. Like, is this email spam or not? But what if we want to sort data into multiple categories at once?
Is this email urgent or not? And is it about work or personal matters? This is what we call multi-output classification.
In healthcare, doctors might use multi-output regression to predict multiple health outcomes for a patient based on their medical history.
For example, they might want to predict a patient’s risk of heart disease, diabetes, and stroke all at once.
In finance, an analyst might use multi-output regression to predict several aspects of a company’s future performance, like its revenue, profit margin, and stock price.
For multi-output classification, think about a news website that wants to automatically categorize articles.
Each article could fall into multiple categories at the same time, like “politics”, “international”, and “breaking news”.
Or consider a music app that wants to recommend songs. Each song could be classified by multiple genres, like “rock”, “pop”, and “indie”.
So, that’s the gist of multi-output regression and classification.
They’re powerful tools that can help us make multiple predictions at once, saving us time and effort.
How to Train a Multi-Output Regression Model with LightGBM in Python (Code Example)
First things first, you need to set up your environment.
If you’re using Python, you’ll need to install LightGBM and scikit-learn.
Just open up your terminal and type:
pip install lightgbm scikit-learn
Or if you’re using Anaconda, type:
conda install -c conda-forge lightgbm scikit-learn
With the environment set up, we import the necessary packages and load our dataset.
Specifically, we’re using pandas (a data manipulation library) and numpy (a library for numerical computations).
import pandas as pd
import numpy as np
data_path = "path_to_your_data"
data = pd.read_csv(data_path)
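To sanity-check the load, you can preview the first few rows of the DataFrame, which gives you a table like the one below:
# Preview the first five rows of the DataFrame
print(data.head())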
| u_q | coolant | stator_winding | u_d | stator_tooth | motor_speed | i_d | i_q | pm | stator_yoke | ambient | torque | profile_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -0.450682 | 18.8052 | 19.0867 | -0.350055 | 18.2932 | 0.00286557 | 0.00441914 | 0.000328102 | 24.5542 | 18.3165 | 19.8507 | 0.187101 | 17 |
| -0.325737 | 18.8186 | 19.0924 | -0.305803 | 18.2948 | 0.000256782 | 0.000605872 | -0.000785353 | 24.5381 | 18.315 | 19.8507 | 0.245417 | 17 |
| -0.440864 | 18.8288 | 19.0894 | -0.372503 | 18.2941 | 0.00235497 | 0.00128959 | 0.000386468 | 24.5447 | 18.3263 | 19.8507 | 0.176615 | 17 |
| -0.327026 | 18.8356 | 19.083 | -0.316199 | 18.2925 | 0.00610467 | 2.55843e-05 | 0.00204566 | 24.554 | 18.3308 | 19.8506 | 0.238303 | 17 |
| -0.47115 | 18.857 | 19.0825 | -0.332272 | 18.2914 | 0.00313282 | -0.0643168 | 0.0371838 | 24.5654 | 18.3267 | 19.8506 | 0.208197 | 17 |
Assigning the file location to a variable like data_path keeps the code readable. Then, pd.read_csv(data_path) reads the CSV file at that location into a pandas DataFrame.
As an example, I’ll use a dataset that represents sensor data collected from a Permanent Magnet Synchronous Motor (PMSM).
It’s a collection of measurements taken at a rate of 2 Hz during several testing sessions.
Each test session is identified by a unique “profile_id” and can last between one and six hours.
The dataset includes variables such as voltages in d/q-coordinates (“u_d” and “u_q”), currents in d/q-coordinates (“i_d” and “i_q”), motor speed, torque, and others.
Given all this information, we want to use machine learning to build a model that can predict the performance of the PMSM based on the provided sensor data.
Specifically, we want to predict the ‘pm’, ‘stator_yoke’, ‘stator_tooth’, ‘stator_winding’ values, as these represent important aspects of the motor’s performance.
Don’t get too hung up on the details of the dataset; it’s just a clean dataset with multiple outputs that serves as a good example for multi-output regression.
Everything you learn here can be applied to your own datasets.
import lightgbm as lgb
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import train_test_split
Here we import the necessary machine learning tools. lightgbm is the gradient boosting framework we’re using to build our model. MultiOutputRegressor is a scikit-learn wrapper that performs multi-output regression by training one regressor per target; it’s the simplest way to train a multi-output model with LightGBM. train_test_split is a utility function that splits our data into training and test sets.
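To build intuition, here is a simplified sketch of the idea behind MultiOutputRegressor, not its actual implementation: it clones the base estimator once per target column and fits each clone independently (the real class also supports parallel fitting via its n_jobs parameter):
from sklearn.base import clone

# Simplified illustration only: fit one independent clone of the
# base estimator per target column, as MultiOutputRegressor does.
def fit_one_per_target(base_estimator, X, y):
    return [clone(base_estimator).fit(X, y.iloc[:, i]) for i in range(y.shape[1])]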
X = data.drop(['pm', 'stator_yoke', 'stator_tooth', 'stator_winding'], axis=1)
y = data[['pm', 'stator_yoke', 'stator_tooth', 'stator_winding']]
We’re defining our features (X) and outputs (y).
The features are every column in the data except ‘pm’, ‘stator_yoke’, ‘stator_tooth’, ‘stator_winding’. The outputs are exactly those four columns.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
This line splits our data into training and test sets: test_size=0.2 reserves 20% of the data for testing, and random_state=42 makes the split reproducible.
We create a simple LightGBM regressor with arbitrary hyperparameters; you should tune them for your specific problem. Training the model is then extremely simple: we wrap the regressor in MultiOutputRegressor and call its fit method.
lgb_model = lgb.LGBMRegressor(learning_rate=0.05, n_estimators=100)
model = MultiOutputRegressor(lgb_model)
model.fit(X_train, y_train)
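After fitting, the wrapper stores one trained LGBMRegressor per target column in its estimators_ attribute, which you can inspect:
# One fitted LGBMRegressor per target column
print(len(model.estimators_))  # 4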
After the model is trained, we can make predictions on the test data.
y_pred = model.predict(X_test)
This will return a numpy array with the predicted values for each output.
In this case, it will be a 2D array with 4 columns (one for each output) and as many rows as there are in the test set.
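From here, you’ll usually want to quantify how good those predictions are. Here’s a minimal sketch that scores each output column separately with scikit-learn’s regression metrics:
from sklearn.metrics import mean_absolute_error, r2_score

# Score each target column separately: y_test is a DataFrame,
# y_pred is a 2D numpy array with one column per target.
for i, target in enumerate(y.columns):
    mae = mean_absolute_error(y_test.iloc[:, i], y_pred[:, i])
    r2 = r2_score(y_test.iloc[:, i], y_pred[:, i])
    print(f"{target}: MAE={mae:.4f}, R^2={r2:.4f}")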
How to Adapt This Code to a Multi-Output Classification Problem
Adapting this code to a multi-output classification problem involves just a few changes.
You’ll need to use a classification algorithm instead of a regression one, and make sure your output variables are categorical (not continuous).
Here’s how you might adjust the code:
# Import necessary libraries
from sklearn.multioutput import MultiOutputClassifier
...
lgb_model = lgb.LGBMClassifier(learning_rate=0.05, n_estimators=100)
# Create the multioutput wrapper for LightGBM
model = MultiOutputClassifier(lgb_model)
...
Here, LGBMRegressor has been replaced with LGBMClassifier because we’re now dealing with a classification problem, and MultiOutputRegressor has been replaced with MultiOutputClassifier.
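To see the classification version end to end, here’s a self-contained sketch that runs on synthetic multilabel data (generated with scikit-learn’s make_multilabel_classification, so it doesn’t need your dataset):
import lightgbm as lgb
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

# Synthetic data with 20 features and 3 binary target columns
X, y = make_multilabel_classification(n_samples=1000, n_features=20,
                                      n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# One LGBMClassifier is trained per target column
model = MultiOutputClassifier(lgb.LGBMClassifier(learning_rate=0.05, n_estimators=100))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # 2D array, one column per label
If you need class probabilities instead of hard labels, predict_proba returns a list with one probability array per output.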