This tutorial is your roadmap to training a LightGBM model for ranking tasks in Python.
You’ll learn how to install LightGBM in your Python environment, prepare your data correctly, and train a model using LightGBM’s Ranker.
I’ll also cover how to evaluate your model’s performance using the industry-standard Normalized Discounted Cumulative Gain (NDCG) metric.
By the end, you’ll have a solid understanding of LTR with LightGBM and be ready to tackle real-world ranking problems.
Let’s get started!
Learning to Rank Quick Recap
Learning to Rank (LTR) is a subfield of machine learning that focuses on building models for ranking items.
In essence, it’s about learning the best order or sequence for a list of items based on specific criteria.
For example, consider a search engine.
When you type in a query, the search engine doesn’t just return a random list of websites.
Instead, it ranks the results based on relevance to your query, the site’s popularity, and other factors.
This is a ranking problem, and it’s where Learning to Rank shines.
LTR can also be used for recommendation systems.
For instance, a music streaming service might use it to rank songs in a playlist based on your listening habits, ensuring that the songs you’re most likely to enjoy are at the top.
You may have heard of LambdaMART, which is essentially LambdaRank implemented in a gradient boosting framework with decision trees.
We can think of LightGBM’s LTR implementation as an evolution of LambdaMART.
Installing LightGBM in Python
Before we dive into the main content of this tutorial, let’s first ensure that you have the LightGBM library installed in your Python environment.
You can install LightGBM either using conda or pip.
If you’re using an Anaconda distribution, you can install LightGBM by using the following command in your terminal:
conda install -c conda-forge lightgbm
If you prefer using pip, run this command:
pip install lightgbm
After running one of these commands, LightGBM should be installed and ready for use in your Python environment.
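You can verify the installation by importing the library and printing its version from a Python shell:
import lightgbm
print(lightgbm.__version__)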
Learning to Rank Objective Functions in LightGBM
For ranking tasks, LightGBM provides two objective functions: lambdarank and rank_xendcg.
We could use regular regression or classification objective functions for ranking tasks, in what’s called pointwise learning to rank, but here I will focus on the two objective functions specifically designed for ranking.
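For contrast, here’s what the pointwise approach would look like: it simply treats the relevance label of each document as a regression target and ignores query groups entirely. This is a minimal sketch; X_train and y_train stand for the features and graded relevance labels we prepare later in this tutorial.
from lightgbm import LGBMRegressor

# Pointwise LTR: predict each document's relevance label directly,
# with no notion of query groups
pointwise = LGBMRegressor()
pointwise.fit(X_train, y_train)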
lambdarank
The lambdarank objective function is based on the LambdaMART algorithm.
LambdaMART is a combination of MART (Multiple Additive Regression Trees) and LambdaRank.
In simple terms, lambdarank tries to optimize the order of items in a way that maximizes the NDCG (Normalized Discounted Cumulative Gain) metric.
A recent paper by Amazon Science showed that offline NDCG correlates well with online metrics for product ranking tasks.
rank_xendcg
The rank_xendcg objective function uses a different approach, which makes it less sensitive to outliers and more robust to noise in the data.
In practice, both lambdarank and rank_xendcg can be effective for ranking tasks.
The choice between them often depends on the specific characteristics of your data and problem, so treat it as another hyperparameter to tune.
Preparing the Data for LightGBM
Before we can train our LightGBM model, we first need to prepare our data.
This involves loading the data, preparing the features and labels, and setting up the group parameter.
Loading the Data With SVMLight
We’re using the MSLR-WEB10K dataset from Microsoft, which is a commonly used benchmark dataset in the Learning to Rank community.
It has pairs of queries and documents, where each pair is labeled with a relevance score from 0 to 4, from least to most relevant.
This is a real-world dataset that was used by Microsoft to train their Bing search engine.
The data uses the SVMLight format, a text-based format for storing sparse feature vectors that’s commonly used in machine learning.
We will use the load_svmlight_file function from scikit-learn to load it.
The data has already been split into training and test sets.
from pathlib import Path
from sklearn.datasets import load_svmlight_file
import numpy as np

# Adjust this to wherever you extracted the MSLR-WEB10K dataset
data_path = Path('MSLR-WEB10K/Fold1')

X_train, y_train, qid_train = load_svmlight_file(str(data_path / 'vali.txt'), query_id=True)
X_test, y_test, qid_test = load_svmlight_file(str(data_path / 'test.txt'), query_id=True)
Your data doesn’t need to be in the SVMLight format.
It only needs to follow the usual X and y format that you know from other machine learning tasks, plus an additional query_id array that indicates which rows belong to which query group.
For example, if you have 10 documents that belong to the same query, you would have 10 rows in your data with the same query_id value.
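For illustration, here’s what that layout might look like for a tiny made-up dataset with two features:
import numpy as np

# Three documents belong to query 1, two documents belong to query 2
X = np.array([[0.2, 1.0], [0.5, 0.3], [0.9, 0.7],   # query 1
              [0.1, 0.8], [0.4, 0.2]])              # query 2
y = np.array([2, 0, 3, 1, 0])       # graded relevance per document
qid = np.array([1, 1, 1, 2, 2])     # same query_id -> same query group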
I am using the ‘vali.txt’ and ‘test.txt’ files instead of the original ‘train.txt’ and ‘test.txt’ because they are smaller and easier to work with for this tutorial.
Preparing the Features and Labels
In our data, the labels represent the relevance of each item.
They must be integers, with higher values indicating higher relevance.
The features, meanwhile, are the characteristics of each item that we’ll use to predict their relevance.
For this dataset, the features mostly correspond to the similarity between the content of the web pages and the search query.
# Convert labels to integers
y_train = y_train.astype(int)
y_test = y_test.astype(int)
Preparing the group Parameter
The group parameter in LightGBM is an array that contains the number of rows in each query group.
A query group is a set of items that are related to the same query.
For example, in a search engine scenario, all the websites returned for a specific search term would be one query group.
This parameter is important because it tells LightGBM how to form the lists of items to optimize during training.
For instance, group = [3, 2] means the first three rows of your data form one query group and the next two rows form another.
_, group_train = np.unique(qid_train, return_counts=True)
In our case, the qid_train array contains the query ID for each row in the training set as an integer, and it’s sorted by query ID.
So we can just use the np.unique function to get the row count of each query, and then use those counts as the group parameter later.
We only need to do this for the training set.
This step can be confusing and very error-prone, so make sure you double-check that you’re calculating the group parameter correctly by inspecting some of the query groups manually.
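Here’s a quick sanity check you can run (a minimal sketch, assuming the arrays from above): the group sizes must add up to the total number of training rows, and each count should match the number of rows with that query ID.
# The group sizes must sum to the total number of training rows
assert group_train.sum() == X_train.shape[0]

# Spot-check the first query group (qid_train is sorted, and np.unique
# returns sorted IDs, so the counts line up)
assert group_train[0] == np.sum(qid_train == qid_train[0])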
Training the LightGBM Model
With our data prepared, we can now train our LightGBM model.
The only difference from training a regular LightGBM model is that we need to include the group parameter.
The object we’ll use for training is LGBMRanker, which is a subclass of LGBMModel that’s specifically designed for ranking tasks.
from lightgbm import LGBMRanker
gbm = LGBMRanker()
gbm.fit(X_train, y_train, group=group_train)
Calling fit trains the model on the training data, using the group parameter to form the query groups.
If you want to try a different objective function, you can specify it using the objective parameter when creating the LGBMRanker object.
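For example, this is how you could train with rank_xendcg instead (all other hyperparameters are left at their defaults here):
# Same training call, different ranking objective
gbm_xendcg = LGBMRanker(objective='rank_xendcg')
gbm_xendcg.fit(X_train, y_train, group=group_train)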
Evaluating the LightGBM Ranker
Once we’ve trained our LightGBM model, the next step is to evaluate its performance.
This involves making predictions on the test set and then measuring how accurate these predictions are.
To make predictions with our model, we’ll use the predict function.
We need to make predictions one query group at a time.
If you pass the entire test set to the predict function at once, it will return a single flat array of scores, as if all the items belonged to the same query group.
predictions = []
for group in np.unique(qid_test):
    preds = gbm.predict(X_test[qid_test == group])
    predictions.extend(preds)
We iterate over the unique query IDs in the test set, select the rows that belong to each query group, and make predictions for that group.
In a deployment scenario, you can have an API endpoint that receives a request with a list of items and their features, and then returns a list of predictions.
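As a minimal sketch of what that endpoint’s core logic could look like (the rank_items helper is hypothetical, not part of LightGBM):
def rank_items(model, X_query):
    """Score one query group and return its item indices, best first."""
    scores = model.predict(X_query)
    return np.argsort(scores)[::-1]

# Example: rank the documents of the first test query
first_qid = np.unique(qid_test)[0]
ranking = rank_items(gbm, X_test[qid_test == first_qid])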
Normalized Discounted Cumulative Gain
For Learning to Rank tasks, we typically use measures that take into account the relevance and the position of the items, like Normalized Discounted Cumulative Gain (NDCG).
As it’s one of the most popular offline metrics in the industry, I will use it to evaluate the performance of our model.
First, we borrow a function to calculate it.
A full explanation is beyond the scope of this tutorial, but it’s worth spending time on the Wikipedia page for NDCG to understand it.
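In short, DCG@k sums a graded gain for each of the top k documents, discounted by position, and NDCG@k normalizes that by the ideal (best possible) DCG@k. With the exponential gain used in the code below, for a document at position i with relevance rel_i:
\[
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}
\]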
This function takes three arguments: the predicted scores (y_score), the true relevance scores (y_true), and the number of top documents to consider (k).
def ndcg(y_score, y_true, k):
    # Sort documents by predicted score, best first, and keep the top k
    order = np.argsort(y_score)[::-1]
    y_true = np.take(y_true, order[:k])
    # Exponential gain: strongly rewards highly relevant documents
    gain = 2 ** y_true - 1
    # Positional discount: log2(rank + 1) for 1-indexed ranks
    discounts = np.log2(np.arange(len(y_true)) + 2)
    # Note: this returns DCG@k; we normalize by the ideal DCG below
    return np.sum(gain / discounts)
Next, we get the unique query IDs from our test set.
qids = np.unique(qid_test)
We then loop over each query id, make predictions for that query, and calculate the NDCG score for it. We append each score to a list.
ndcg_ = list()
for qid in qids:
    y = y_test[qid_test == qid]
    # Skip queries with no relevant documents: the ideal DCG would be zero
    if np.sum(y) == 0:
        continue
    p = gbm.predict(X_test[qid_test == qid])
    # Ideal DCG: the score of a perfect ordering of the true labels
    idcg = ndcg(y, y, k=10)
    ndcg_.append(ndcg(p, y, k=10) / idcg)
Finally, we calculate the mean NDCG score across all queries.
This gives us a single number that represents the overall performance of our model.
np.mean(ndcg_)
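If you want a second opinion, scikit-learn ships its own ndcg_score. Note that it expects 2-D arrays and uses linear gains rather than the exponential gains above, so the numbers won’t match exactly:
from sklearn.metrics import ndcg_score

# Cross-check a single query with scikit-learn's implementation
qid = qids[0]
y_true = y_test[qid_test == qid].reshape(1, -1)
y_score = gbm.predict(X_test[qid_test == qid]).reshape(1, -1)
ndcg_score(y_true, y_score, k=10)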