Working with categorical data in machine learning can be a bit of a headache, especially when using algorithms like XGBoost.
XGBoost, despite being a powerful and efficient gradient boosting library, is designed to work with numeric data.
This means that you need to find a way to transform categorical data into a format that XGBoost can understand.
This can be a time-consuming and complex process, especially if you’re dealing with a large number of categorical variables or categories.
The problem becomes even more challenging when you consider the potential pitfalls of encoding categorical variables.
If not done correctly, encoding can introduce noise into your data, leading to poor model performance.
Furthermore, some encoding methods can significantly increase the dimensionality of your dataset, making the training process slower and more memory-intensive.
So, how do you handle categorical data in XGBoost without falling into these traps?
Find out in this tutorial!
This tutorial is valid for both regression and classification problems.
Native Encoding Using The XGBoost Scikit-learn Interface
The Scikit-learn interface of XGBoost provides a parameter called enable_categorical that allows XGBoost to handle categorical variables.
First, you need to ensure that your categorical columns are of type ‘category’ in your Pandas DataFrame.
This is important because XGBoost’s enable_categorical parameter can only recognize categorical columns that are of type ‘category’.
Let’s use the Adult dataset in our demo, as it has a mix of categorical and numeric columns.
The task here is to predict whether a person earns more than $50,000 per year based on their demographic information.
Here’s how you can convert your categorical columns to ‘category’ type:
import pandas as pd
# Load the dataset
df = pd.read_csv('adult.csv')
# List of categorical columns
cat_cols = ['marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country', 'workclass', 'education']
# Convert categorical columns to category type
for col in cat_cols:
    df[col] = df[col].astype('category')
Sometimes Pandas will automatically convert your categorical columns to ‘category’ type when you load your dataset, but it’s always a good idea to check.
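A quick way to verify is to inspect the dtypes:
# Confirm the conversion worked
print(df.dtypes)
# Or list only the columns Pandas treats as categorical
print(df.select_dtypes(include='category').columns.tolist())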
Once your categorical columns are of type ‘category’, you can pass your DataFrame to XGBoost and set enable_categorical=True. This tells XGBoost to handle the categorical variables natively.
Here’s how you can do it:
from xgboost import XGBClassifier
# Split into X and y, drop the target variable from X and convert y to binary
y = df['income'].map({'<=50K': 0, '>50K': 1})
X = df.drop('income', axis=1)
# Initialize XGBoost classifier
model = XGBClassifier(enable_categorical=True, tree_method='hist')
# Fit the model
model.fit(X, y)
I didn’t split into train and test sets to keep the code brief, but you should always do this in practice BEFORE doing any transformation, including categorical encoding.
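For reference, here’s a minimal sketch of that workflow with Scikit-learn’s train_test_split (the split parameters are just illustrative):
from sklearn.model_selection import train_test_split

# Split first so no information leaks from the test set into training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = XGBClassifier(enable_categorical=True, tree_method='hist')
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out test set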
The tree_method='hist' parameter tells XGBoost to use the histogram-based algorithm when building trees.
I had to use this parameter because I got the following error when I tried to fit the model without it:
ValueError: Experimental support for categorical data is not implemented for current tree method yet.
This method is straightforward and doesn’t require any manual encoding of the categorical variables.
However, it’s important to note that this feature is experimental and may not always provide the best results.
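One practical detail: any data you pass to predict must keep the same ‘category’ dtypes used during training. A quick sanity check on a few rows of X:
# New data must have the same 'category' dtypes as the training data
print(model.predict(X.head()))        # predicted class labels (0 or 1)
print(model.predict_proba(X.head()))  # predicted class probabilities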
Native Encoding Using XGBoost’s Native Interface
The native interface of XGBoost also provides a way to handle categorical variables.
However, it’s a bit different from the Scikit-learn interface.
In the native interface, you still need to convert your categorical variables to ‘category’ type, as described in the previous section.
However, here you need to enable categorical support with the enable_categorical parameter when you convert your DataFrame to a DMatrix; the categorical columns are still identified by their ‘category’ dtype.
import xgboost as xgb
y = df['income'].map({'<=50K': 0, '>50K': 1})
X = df.drop('income', axis=1)
# Convert DataFrame to DMatrix
data = xgb.DMatrix(X, label=y, enable_categorical=True)
# Specify parameters
params = {'max_depth': 3, 'eta': 1, 'objective': 'binary:logistic'}
# Train the model
model = xgb.train(params, data)
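To get predictions out of the trained Booster, wrap the features in a DMatrix again, just like you did for training:
# Predictions also require a DMatrix with categorical support enabled
preds = model.predict(xgb.DMatrix(X, enable_categorical=True))
# binary:logistic outputs probabilities; threshold them to get class labels
labels = (preds > 0.5).astype(int)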
Why would you use the native interface instead of the Scikit-learn interface?
The native interface is more flexible and usually has more capabilities than the Scikit-learn interface.
So if you want to use any advanced or new hyperparameters, you’ll need to use the native interface, but for most use cases, the Scikit-learn interface is sufficient.
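For example, the native xgb.train function accepts training-time options such as the number of boosting rounds and an evaluation watchlist. A minimal sketch (evaluating on the training data only to keep it short):
# Train for more rounds and print the evaluation metric each round
model = xgb.train(
    params,
    data,
    num_boost_round=100,       # default is 10
    evals=[(data, 'train')],   # watchlist; in practice, use a validation DMatrix
)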
I didn’t get the error from the previous section here, even though I didn’t specify the tree_method='hist' parameter. If you do, try adding it to the params dictionary and see if that resolves it.
Categorical Encoding Using One-Hot Encoding
One-Hot Encoding is a popular method for handling categorical variables.
It creates a binary column for each category level and returns a matrix with these binary representations.
This method is simple and effective, but it can lead to a high-dimensional dataset if your categorical variables have many unique categories.
Imagine doing it for all the ZIP codes in the US, for example.
Anyway, it can be a good option if your categorical variables have a small number of unique levels (low cardinality).
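A quick way to check is to count the unique levels in each categorical column before you commit to this approach:
# Count unique levels per categorical column (reusing cat_cols from earlier)
print(df[cat_cols].nunique().sort_values(ascending=False))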
To perform One-Hot Encoding, the easiest way is to use the OneHotEncoder class from the category_encoders library.
First, let’s install the library:
pip install category_encoders
Now, you can use the OneHotEncoder class to encode your categorical variables:
from category_encoders import OneHotEncoder
# Initialize OneHotEncoder
encoder = OneHotEncoder(cols=cat_cols)
# Fit and transform the DataFrame
df_encoded = encoder.fit_transform(df)
Again, I am not splitting into train and test sets to keep the code brief. Do it BEFORE doing any transformation.
In this case, you would use the fit_transform method with the training set and the transform method with the test set, just like with any other Scikit-learn transformer.
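Here’s what that looks like, assuming you already have X_train and X_test from an earlier split (the variable names are illustrative):
# Fit the encoder on the training set only, then reuse it on the test set
encoder = OneHotEncoder(cols=cat_cols)
X_train_encoded = encoder.fit_transform(X_train)  # learns the category levels
X_test_encoded = encoder.transform(X_test)        # applies the same levels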
The best thing about this library is that it returns a DataFrame with both the original numerical columns and the new encoded columns, avoiding an extra step of concatenating the two DataFrames.
| age | workclass_1 | workclass_2 | workclass_3 | workclass_4 |
|-----|-------------|-------------|-------------|-------------|
| 90  | 1           | 0           | 0           | 0           |
| 82  | 0           | 1           | 0           | 0           |
| 66  | 1           | 0           | 0           | 0           |
| 54  | 0           | 1           | 0           | 0           |
| 41  | 0           | 1           | 0           | 0           |
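To see how much the encoding increased the dimensionality, compare the shapes of the original and encoded DataFrames:
# The encoded DataFrame has one extra column per category level
print(df.shape, df_encoded.shape)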
Now, your DataFrame is encoded and ready to be passed to XGBoost:
y = df_encoded['income'].map({'<=50K': 0, '>50K': 1})
X = df_encoded.drop('income', axis=1)
# Initialize XGBoost classifier; enable_categorical is no longer needed
# because the encoded data is fully numeric
model = XGBClassifier()
# Fit the model
model.fit(X, y)