You’ve chosen XGBoost as your algorithm, and now you’re wondering: “How do I figure out which features are the most important in my model?”

That’s what ‘feature importance’ is all about. It’s a way of finding out which features in your data are doing the heavy lifting when it comes to your model’s predictions.

Understanding which features are important can help you interpret your model better.

Maybe you’ll find a feature you didn’t expect to be important. Or maybe you’ll see that a feature you thought was important isn’t really doing much at all.

In this tutorial, we’ll dive into how to implement this in your code and interpret the results.

How Is Feature Importance Calculated in XGBoost?

Now that you understand what feature importance is, let’s talk about the different types of feature importance you can get with XGBoost.

There are five main types: ‘weight’, ‘gain’, ‘cover’, ‘total_gain’, and ‘total_cover’.

Let’s break each one down:

  1. Weight: This is the simplest type of feature importance. It’s just the number of times a feature is used to split the data across all trees. If a feature is used a lot, it has a high weight.

  2. Gain: This is a bit more complex. Gain measures the average improvement in (training set) loss brought by a feature. In other words, it tells you how much a feature helps to make accurate predictions in your training data.

  3. Cover: Cover measures the number of data points a given feature affects. It’s computed by averaging the Hessian values of all the splits the feature is used in.

  4. Total_gain: This is similar to gain, but instead of the average, it gives you the total improvement brought by a feature.

  5. Total_cover: This is like cover, but it gives you the total number, rather than the average.

So, when should you use each type?

Unfortunately, there’s no one-size-fits-all answer.

I like to compute all of them and then use my best judgement to decide which one is most useful for my particular problem.

Implementing XGBoost Feature Importance in Python

We’ll be using the red wine portion of the UCI Wine Quality dataset for this example.

This dataset contains information about various attributes of red wines, such as ‘fixed acidity’, ‘volatile acidity’, ‘citric acid’, etc.

The task is to predict the quality of the wine based on these attributes.

First, you’ll need to import the necessary libraries.

We’ll need pandas for data manipulation, XGBoost for our model, and matplotlib for plotting our feature importances later.

import pandas as pd
from xgboost import XGBRegressor
import matplotlib.pyplot as plt

Next, let’s load our dataset. We’ll use pandas’ read_csv function for this.

df = pd.read_csv('winequality-red.csv', delimiter=';')

The first few rows look like this:

fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
7.4 0.7 0 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5
7.8 0.88 0 2.6 0.098 25 67 0.9968 3.2 0.68 9.8 5
7.8 0.76 0.04 2.3 0.092 15 54 0.997 3.26 0.65 9.8 5
11.2 0.28 0.56 1.9 0.075 17 60 0.998 3.16 0.58 9.8 6
7.4 0.7 0 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5

Now, we need to separate our features (X) from our target variable (y).

In this case, we’re trying to predict the ‘quality’ column.

So, we’ll drop that column from our dataframe and assign it to y.

Then, we’ll assign the rest of the dataframe to X.

X = df.drop('quality', axis=1)
y = df['quality']

To keep things brief, I’m not splitting the data into training and test sets here; assume that has already been done.

The native feature importances are calculated using the training set only, as it’s the data used to create the decision trees.
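For completeness, the split the tutorial assumes would look something like this (a sketch using scikit-learn; the tiny DataFrame here is a stand-in so the snippet runs on its own):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for the wine DataFrame loaded earlier
df = pd.DataFrame({
    'fixed acidity': [7.4, 7.8, 7.8, 11.2, 7.4, 7.9],
    'alcohol':       [9.4, 9.8, 9.8, 9.8, 9.4, 9.5],
    'quality':       [5, 5, 5, 6, 5, 5],
})
X = df.drop('quality', axis=1)
y = df['quality']

# Hold out a test set; the native importances below come from training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```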

Next, we’ll create our XGBoost model. We’ll use the XGBRegressor for this example.

Everything here applies to XGBClassifier as well.

Let’s use the default hyperparameters.

model = XGBRegressor()

Then, we fit our model to the data., y)

Now comes the fun part: getting our feature importances.

XGBoost’s scikit-learn API makes this easy with the feature_importances_ attribute.

To make it easier to understand, let’s put everything into a pandas Series where the index is the feature name and the values are the feature importances.

importances = model.feature_importances_
pd.Series(importances, index=X.columns).sort_values()
fixed acidity 0.0356799
density 0.051082
free sulfur dioxide 0.0526365
chlorides 0.0537856
residual sugar 0.055074
citric acid 0.0559636
pH 0.0683643
total sulfur dioxide 0.0783635
sulphates 0.0954391
volatile acidity 0.0955693
alcohol 0.358042

As you can see, the ‘alcohol’ feature is by far the most important, with the highest feature importance score of 0.358042.

By default, XGBoost will use the ‘gain’ type of feature importance.

Let’s try other types to see if we get different results.

You need to change it in the model initialization using the importance_type parameter.

model = XGBRegressor(importance_type='weight'), y)

importances = model.feature_importances_
pd.Series(importances, index=X.columns).sort_values()
free sulfur dioxide 0.0671536
alcohol 0.0680961
pH 0.0713949
citric acid 0.0798775
residual sugar 0.0803487
sulphates 0.0838831
density 0.0933082
total sulfur dioxide 0.0937795
chlorides 0.0951932
volatile acidity 0.111216
fixed acidity 0.155749

This time, the ‘fixed acidity’ feature is the most important, with a feature importance score of 0.155749.

This means it’s the one that’s used the most to split the data across all trees.

Be careful, as this can be misleading.

In the case of high cardinality categorical features, the ‘weight’ type of feature importance will be biased towards those features, as they naturally have more splits.

Random Forests suffer from the same problem.

Here is the same example with the ‘cover’ type of feature importance, for completeness.

model = XGBRegressor(importance_type='cover'), y)

importances = model.feature_importances_
pd.Series(importances, index=X.columns).sort_values()
fixed acidity 0.0465268
free sulfur dioxide 0.0672191
citric acid 0.0774046
volatile acidity 0.0809456
residual sugar 0.0822461
sulphates 0.0887277
total sulfur dioxide 0.0980716
chlorides 0.0999913
pH 0.104945
density 0.125591
alcohol 0.128332

Finally, here are the results for the three averaged types side by side (skipping ‘total_gain’ and ‘total_cover’ to keep it brief):

                          weight       gain      cover
fixed acidity           0.155749  0.0356799  0.0465268
volatile acidity        0.111216  0.0955693  0.0809456
citric acid            0.0798775  0.0559636  0.0774046
residual sugar         0.0803487   0.055074  0.0822461
chlorides              0.0951932  0.0537856  0.0999913
free sulfur dioxide    0.0671536  0.0526365  0.0672191
total sulfur dioxide   0.0937795  0.0783635  0.0980716
density                0.0933082   0.051082   0.125591
pH                     0.0713949  0.0683643   0.104945
sulphates              0.0838831  0.0954391  0.0887277
alcohol                0.0680961   0.358042   0.128332

Visualizing XGBoost Feature Importance

Finally, let’s plot our feature importances to make them easier to understand.

We’ll create a horizontal bar plot with our features on the y-axis and their importance on the x-axis.

As we have our feature importances organized in a pandas DataFrame (features as the index, one column per importance type; call it res), we can easily plot them using pandas’ integrated matplotlib functions.

res.plot.barh(figsize=(1280/96, 720/96), title='Feature importance')

[Figure: XGBoost feature importance plot]

Here I plotted all the types of feature importance for comparison, but you can plot only one if you want.

The biggest downside of these feature importances is that they don’t tell the direction of the relationship, only the strength of it.

So we don’t know if a feature is positively or negatively correlated with the target variable.

In any case, it’s still useful to get a general idea of which features are important.
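If you only need a rough sense of direction, one quick (and admittedly crude, since it ignores interactions and non-linearity) complement is to check the sign of each feature’s correlation with the target. A sketch, with a tiny stand-in for the wine DataFrame loaded earlier:

```python
import pandas as pd

# Tiny stand-in for the wine DataFrame; the real df would work the same way
df = pd.DataFrame({
    'volatile acidity': [0.70, 0.88, 0.76, 0.28, 0.66],
    'alcohol':          [9.4, 9.8, 9.8, 9.8, 9.4],
    'quality':          [5, 5, 5, 6, 5],
})

# Positive sign -> feature tends to rise with quality, negative -> fall
correlations = df.corr()['quality'].drop('quality')
print(correlations)
```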