So, you’ve heard about the power of XGBoost for Learning to Rank (LTR) tasks and want to harness it, right?

You couldn’t have landed in a better place!

XGBoost is a go-to tool for many LTR applications, from predicting click-through rates and powering search engines to enhancing recommender systems.

I can vouch for its effectiveness, having used it to build models for ranking freelancers on Upwork.

In this tutorial, we’ll unlock the potential of XGBoost for your LTR tasks.

We’ll delve into various objective functions, walk you through the steps of data preparation, and illustrate how to train your model.

By the end of this guide, you’ll be fully equipped to construct your own LTR models using XGBoost.

So, without further ado, let’s dive right in!

Learning to Rank Objective Functions

Optimizing a ranking score can be as simple as treating each query-document pair as a classification problem or as complex as optimizing an objective function that takes into account all the candidates for a query.

There are several types of objective functions, each with its strengths and weaknesses.

Let’s take a look at some of them:

A note: I am using the term “document” to refer to the items that are being ranked, but it could be anything, from web pages to products.

A “query” could be a search query or the browsing history of a user, for example.


This approach treats each query-document pair independently.

You just need to label each query-document pair with a relevance score and train a model to predict this score.

For example, let’s say we have a dataset of query-document pairs where each pair is labeled with a relevance score from 1 to 5.

We can train a regression model to predict the relevance score for each pair.

Despite its simplicity, the pointwise approach is surprisingly effective and hard to beat, so it’s a good place to start.

In XGBoost, this approach can be implemented using any of the standard regression or classification objective functions.

Remember to make adjustments for any label imbalances in your dataset.


The pairwise approach considers pairs of documents and tries to minimize the number of incorrectly ordered pairs. It’s used in algorithms like RankNet.

For example, considering our previous dataset, this approach would take a query and two documents at a time and adjust the predicted relevance scores so that the most relevant document has a higher score than the least relevant one.

It’s different than the pointwise approach, as in the first we only consider the query and a single document at a time.

Here we are trying to explicitly model the relative order of documents for a query.

XGBoost provides a few objective functions for this approach:


Also known as LambdaMART, because it combines MART (Multiple Additive Regression Trees) and LambdaRank, this is the original version of the pairwise loss function (also known as RankNet).

More details on how it works can be found in the original documentation.


NDCG stands for Normalized Discounted Cumulative Gain.

It’s one of the most popular measures of ranking quality used in the industry, as it takes into account the relative order of documents for a query and the relevance score of each document.

This objective function uses a surrogate gradient derived from the NDCG metric to optimize the model.

It comes in handy if you are using NDCG as the evaluation metric, as it’s a direct optimization of it.


MAP stands for Mean Average Precision.

It’s a simpler measure of ranking quality that is used when the relevance scores are binary (0 or 1).

In general, if you are using MAP as the evaluation metric, you should use this objective function.


Although not implemented in XGBoost, the listwise approach is worth mentioning for completeness.

It considers the entire list of documents for a query and tries to optimize the order of the entire list at once.

It’s an evolution over the pointwise and pairwise approaches, and it can lead to better results because it takes into account the relative order of all documents.

An example of a listwise loss function is ListMLE.

Let’s see an example of how to use XGBoost for Learning to Rank in Python.

Preparing The Data

In this section, we’re preparing our data for the Learning to Rank task.

We’re using the MSLR-WEB10K dataset from Microsoft, which is a commonly used benchmark dataset in the Learning to Rank community.

It has pairs of queries and documents, where each pair is labeled with a relevance score from 0 to 4, from least to most relevant.

This is a real-world dataset that was used by Microsoft to train their Bing search engine.

First, we import the necessary libraries: pandas for data manipulation, NumPy for numerical operations, and load_svmlight_file from sklearn.datasets to load our dataset.

The SVMLight format is a text-based format for storing sparse datasets.

It’s commonly used when you have many features and most of them are zeros, which is the case for click-through rate prediction and other similar tasks.

import pandas as pd
import numpy as np
from sklearn.datasets import load_svmlight_file

Next, we load our training and validation datasets.

The load_svmlight_file function returns a tuple containing the feature matrix, the target vector, and the query IDs. The query IDs are used to group the instances by query.

Let’s say you searched for “xgboost learning to rank” on Google to find this tutorial.

All the pages that Google returns for this query would be grouped together.

train = load_svmlight_file(str(data_path / 'vali.txt'), query_id=True)
valid = load_svmlight_file(str(data_path / 'test.txt'), query_id=True)
0 qid:1 1:3 2:0 3:2 4:2 … 135:0 136:0
2 qid:1 1:3 2:3 3:0 4:0 … 135:0 136:0

Examples extracted from the original MSLR website.

If you pay attention to the loaded files, you’ll notice that I am using the original validation and test sets from a single fold of MSLR.

This is just because they have fewer rows and are easier to work with for a demo. All the files contain the same format, so you can use any of them.

We then unpack the tuples into separate variables for easier manipulation.

X_train, y_train, qid_train = train
X_valid, y_valid, qid_valid = valid

X_train and X_valid are the feature matrices, y_train and y_valid are the target (relevance) vectors, and qid_train and qid_valid are arrays indicating the query for each instance.

If we have 10 results for a query, there will be 10 instances with the same query id, but different feature vectors and relevance scores.

In the context of XGBoost, when you see the group attribute, think of it as the query IDs (qid_train and qid_valid). It’s used to indicate which instances belong to the same query.

Originally we needed to specify how many instances belong to each query, but now XGBoost has a more user-friendly interface where we can specify the queries for which each instance belongs directly as a NumPy array.

Training XGBRanker

Now that our data is prepared, we can move on to training our model.

We’ll be using the XGBRanker class from the XGBoost library, which is specifically designed for Learning to Rank tasks.

First, we import the XGBoost library.

import xgboost as xgb

Next, we create an instance of the XGBRanker class.

We specify the tree_method as "hist" and the objective as "rank:ndcg".

The "hist" method is a fast histogram-based method for constructing the decision trees, and "rank:ndcg" is the objective function we discussed earlier.

You can use tree_method="gpu_hist" if you have a GPU and want to train your model faster.

model = xgb.XGBRanker(tree_method="hist", objective="rank:ndcg")

Finally, we fit the model to our training data.

We pass in our feature matrix (X_train), our target vector (y_train), and our query IDs (qid_train).

The query IDs are used to group the instances by query, which is crucial for Learning to Rank tasks., y_train, qid=qid_train)

The only difference between this and a regular classification or regression task is that we need to pass in the query IDs to the fit method, as it will group instances to optimize the pairwise objective functions.

It’s worth noting that there is an experimental feature in XGBoost for handling position bias, called lambdarank_unbiased.

Position bias is the name given to the phenomenon where the first few results are more likely to be clicked than the rest even if they are not the most relevant ones.

I never used this experimental feature, but I am mentioning it here in case you want to try it out.

Predicting Ranking Scores

Once our model is trained, we can use it to make predictions.

In the context of Learning to Rank, we typically want to rank the documents for a single query at a time.

To do this, we can use the predict method of our model and pass in the feature matrix for a single query.

First, let’s say we want to make predictions for the first query in our validation set.

We can get the feature matrix for this query like this:

X_query = X_valid[qid_valid == qids[0]]

X_query will contain all the features for the rows where the query ID is equal to the first query ID in our validation set.

Then, we can use the predict method to get the predicted relevance scores:

y_pred = model.predict(X_query)

The predict method returns a NumPy array with the predictions.

The documents can then be ranked based on these scores. The higher the score, the higher the relevance.

Be careful! As you can see, there is no query id in the predict method, so if you pass in the entire validation set, it will return the predicted relevance scores for all documents at once thinking they belong to the same query.

Evaluating Ranking Performance

After making predictions, it’s important to evaluate the performance of our model.

For Learning to Rank tasks, we typically use measures that take into account the relevance and the position of the items, like Normalized Discounted Cumulative Gain (NDCG).

As it’s one of the most popular offline metrics in the industry, I will use NDCG to evaluate the performance of our model.

First, we borrow a function to calculate it.

It’s above the scope of this tutorial to explain how it works, but it’s worth spending time on the Wiki page to understand it.

This function takes three arguments: the predicted scores (y_score), the true relevance scores (y_true), and the number of top documents to consider (k).

def ndcg(y_score, y_true, k):
    order = np.argsort(y_score)[::-1]
    y_true = np.take(y_true, order[:k])

    gain = 2 ** y_true - 1

    discounts = np.log2(np.arange(len(y_true)) + 2)
    return np.sum(gain / discounts)

Next, we get the unique query IDs from our validation set.

qids = np.unique(qid_valid)

We then loop over each query id, make predictions for that query, and calculate the NDCG score for it. We append each score to a list.

ndcg_ = list()

for i, qid in enumerate(qids):
    y = y_valid[qid_valid == qid]
    if np.sum(y) == 0:
    p = model.predict(X_valid[qid_valid == qid])

    idcg = ndcg(y, y, k=10)
    ndcg_.append(ndcg(p, y, k=10) / idcg)

Finally, we calculate the mean NDCG score across all queries.

This gives us a single number that represents the overall performance of our model.