Want to use LightGBM for a binary classification task but feel stuck?

In this tutorial, you are going to see an example of how to do it in Python step-by-step.

I’ll also explain how to handle class imbalance, a common issue in binary classification tasks.

What Is LightGBM?

Leaf-wise tree growth. Source: LightGBM official docs

LightGBM, which stands for “Light Gradient Boosting Machine,” is an open-source, distributed, high-performance gradient boosting framework developed by Microsoft.

It is designed for efficient, scalable training on large datasets and is particularly well-suited to problems with many features (high-dimensional data).

Gradient boosting is an ensemble machine learning algorithm that builds a predictive model from a sequence of weak learners, typically decision trees (hence the name GBDT, gradient-boosted decision trees).

LightGBM, like other gradient boosting frameworks (e.g., XGBoost), builds trees sequentially, with each tree trying to correct the errors of the previous ones.

This is different from Random Forest, which builds its trees independently, sampling the rows with replacement and the features without replacement.

Installing LightGBM in Python

Before we dive into the main content of this tutorial, let’s first ensure that you have the LightGBM library installed in your Python environment.

You can install LightGBM either using conda or pip.

If you’re using an Anaconda distribution, you can install LightGBM by using the following command in your terminal:

conda install -c conda-forge lightgbm

If you prefer using pip, run this command:

pip install lightgbm

After running one of these commands, LightGBM should be installed and ready for use in your Python environment.
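
To double-check that everything works, you can print the installed version from Python:

import lightgbm

# If the import succeeds and a version number is printed, the installation is fine.
print(lightgbm.__version__)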

Preparing the Data

The first step in any machine learning project is to load the data.

We’ll be using the Adult (Census Income) dataset for this tutorial, where the goal is to predict whether a person’s income exceeds $50K a year.

You can load this dataset into a pandas DataFrame using the read_csv function.

Here’s how you can do it:

import pandas as pd

data = pd.read_csv('adult.csv')

Splitting the Data into Training and Testing Sets

Once you’ve loaded the data, the next step is to split it into a training set and a testing set.

This allows us to evaluate the performance of our model on unseen data.

We’ll use the train_test_split function from the sklearn.model_selection module to do this.

Here’s how:

from sklearn.model_selection import train_test_split

X = data.drop('income', axis=1)
y = data['income']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In this code, we first separate the features (X) from the target variable (y).

We then split both into training and testing sets, with 80% of the data going to the training set and 20% going to the test set.

Preprocessing the Data

After splitting the data, the next step is to preprocess it.

This involves transforming categorical variables into integers and handling missing values.

To transform categorical variables into integers, we’ll use the OrdinalEncoder from the category_encoders package.

I prefer using it instead of the LabelEncoder from sklearn.preprocessing because it’s a more robust and complete implementation (and LabelEncoder is really meant for encoding target labels, not features).

Here’s how:

import category_encoders as ce

encoder = ce.OrdinalEncoder(cols=['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country'])

X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

In the above code, we first initialize the OrdinalEncoder specifying the columns we want to transform.

We then fit this encoder to the training data and transform it. We also transform the test data using the same encoder.

It will not only transform the categorical variables into integers but also return a DataFrame in which the encoded columns sit alongside the untouched numerical columns.
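
If you want to confirm this, a quick way is to inspect the transformed training data:

# The encoded categorical columns now hold integers; the numerical columns are unchanged.
print(X_train.dtypes)
print(X_train.head())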

If you don’t have the category_encoders package installed, you can install it by running the following command in your terminal:

pip install category_encoders

or

conda install -c conda-forge category_encoders

To handle missing values, we’ll replace the ‘?’ placeholders that the Adult dataset uses for missing entries with NaN. Note that for this replacement to affect the categorical columns, it has to run before they are encoded (i.e., right after the train/test split); once those columns hold integer codes, there is no ‘?’ left to replace.

import numpy as np

X_train = X_train.replace('?', np.nan)
X_test = X_test.replace('?', np.nan)

In this code, we replace any ‘?’ in the data with np.nan using the replace function.

This is because LightGBM can handle missing values represented as np.nan by default.

Building the Model

When building a LightGBM model, there are several hyperparameters that you can tune.

However, for the purpose of this tutorial, we will focus on three of the most important ones:

  1. learning_rate: This is the learning rate or shrinkage rate. It controls how much each tree contributes to the overall prediction. Generally, a lower learning rate is better, but it also means that the model will take longer to train.
  2. n_estimators: This is the number of boosting rounds or trees to build. It controls the complexity of the model. I like to set as high a value as possible for this parameter, but it also means that the model will take longer to train.
  3. num_leaves: This is the main parameter to control the complexity of the tree model. It controls the number of leaves in each tree, which is a proxy for the depth. A higher value means a more complex, less regularized model, which can lead to overfitting.

To set the hyperparameters, we create an LGBMClassifier object and pass in the hyperparameters as arguments.

from lightgbm import LGBMClassifier

lgb = LGBMClassifier(learning_rate=0.05, n_estimators=100, num_leaves=31)

Binary vs Cross-Entropy Objective

LightGBM offers two loss functions for binary classification: binary and cross_entropy.

Most of the time you will use binary, which optimizes the log loss when your classes are labeled with only 0s and 1s.

If you want to optimize probabilities directly (as labels), you can use cross_entropy.

This loss function treats labels with any value between 0 and 1 as the probability of the positive class.

It’s rare, but one use case is when you have probabilities from another model and want to train a LightGBM model to match those probabilities instead of the binary labels.
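
Here’s a rough sketch of what that could look like. The soft_labels array is hypothetical (probabilities between 0 and 1 for the same rows as X_train, e.g. produced by another model), and we use LightGBM’s native training API because the scikit-learn LGBMClassifier is built around hard class labels:

import lightgbm

# soft_labels is a hypothetical array of values between 0 and 1,
# e.g. probabilities predicted by another model for the rows of X_train.
train_set = lightgbm.Dataset(X_train, label=soft_labels)

params = {
    'objective': 'cross_entropy',  # optimize against probability targets
    'learning_rate': 0.05,
    'num_leaves': 31,
}

booster = lightgbm.train(params, train_set, num_boost_round=100)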

Using is_unbalance To Handle Class Imbalance

If your dataset is imbalanced (has many more instances of one class than the other), you can use the is_unbalance parameter to handle it.

Setting it to True will multiply the weight of the positive class errors by the ratio of negative to positive instances in the dataset, making the model pay more attention to the under-represented class.

To use this parameter, you simply set it to True when initializing the LGBMClassifier object.

lgb = LGBMClassifier(learning_rate=0.05, n_estimators=100, num_leaves=31, is_unbalance=True)

Be warned that, although it can improve metrics like ROC AUC, it will hurt metrics like log loss, because it breaks the probability calibration of the predictions.

So, as a rule of thumb: if you only care about positive samples getting a higher predicted score than negative samples, you can use is_unbalance=True.

If you want your predictions to be closer to the real probabilities of the classes, then you should set is_unbalance=False.
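
For reference, is_unbalance=True is roughly equivalent to setting the scale_pos_weight parameter to the ratio of negative to positive instances yourself. Here’s a minimal sketch of doing that manually, assuming the income column uses the labels '<=50K' and '>50K', with '>50K' as the under-represented positive class (don’t combine scale_pos_weight with is_unbalance):

# Compute the negative-to-positive ratio from the training labels.
n_pos = (y_train == '>50K').sum()
n_neg = (y_train != '>50K').sum()

# Pass the ratio explicitly instead of setting is_unbalance=True.
lgb_weighted = LGBMClassifier(learning_rate=0.05, n_estimators=100, num_leaves=31,
                              scale_pos_weight=n_neg / n_pos)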

Training the Model

Once you’ve set your hyperparameters, you can train your LightGBM model using the fit method.

LightGBM handles raw integer-encoded categorical features reasonably well by default, simply treating them as ordinary numerical features.

However, you can set the categorical_feature parameter to the column names of the categorical features.

In practice, try both ways and see which one performs better.

Here’s how to do it:

lgb.fit(X_train, y_train, categorical_feature=['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country'])

In this code, we just pass the training data and the list of categorical features to the fit method.

Some people like to use early stopping when training LightGBM models, but in my experience it tends to overfit the validation set badly with gradient boosting models.

Making Predictions

Class Predictions

Once the model is trained, you can use it to make predictions.

If you want to directly predict the class for each instance in the test set, you can use the predict function of the model.

y_pred = lgb.predict(X_test)

The predict function applies a threshold of 0.5 to the predicted probabilities.

So each instance will be assigned a class of 1 if the predicted probability of the positive class is greater than 0.5, and a class of 0 otherwise.

Probability Predictions

If you’re interested in the predicted probabilities rather than the class predictions, you can use the predict_proba function of the model.

probabilities = lgb.predict_proba(X_test)[:, 1]

By default, this function returns an array with two columns, one for the probability of the negative class and one for the probability of the positive class.

This is why we take the second column using the [:, 1] syntax.
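
If you’d rather use a different decision threshold than 0.5, you can derive the class predictions from these probabilities yourself. A minimal sketch (predict_proba’s column order follows the model’s classes_ attribute, so classes_[1] is the positive class):

import numpy as np

# A lower threshold flags more instances as positive (higher recall, lower precision).
threshold = 0.3
custom_pred = np.where(probabilities >= threshold, lgb.classes_[1], lgb.classes_[0])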

Evaluating the Model

After making predictions with your model, the next step is to evaluate its performance.

There are several metrics you can use for this, including accuracy, log loss, ROC AUC, and the classification report.

Accuracy

Accuracy is the most straightforward metric.

It’s simply the proportion of predictions that the model got right.

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: ', accuracy)

However, there are many other metrics that are more informative than accuracy, especially when dealing with imbalanced datasets.

Log Loss

Log loss, also known as logistic loss or cross-entropy loss, is a performance metric for evaluating the predicted probabilities of membership to a given class.

from sklearn.metrics import log_loss

loss = log_loss(y_test, probabilities)
print('Log loss: ', loss)

ROC AUC

The ROC AUC score provides an aggregate measure of performance across all possible classification thresholds.

from sklearn.metrics import roc_auc_score

roc_auc = roc_auc_score(y_test, probabilities)
print('ROC AUC: ', roc_auc)

Classification Report

The classification report displays the precision, recall, F1, and support scores for the model.

from sklearn.metrics import classification_report

report = classification_report(y_test, y_pred)
print('Classification report: \n', report)

If you want more fine-grained information about which classes are being predicted correctly and which ones are not, you can use a confusion matrix.

Confusion Matrix

A confusion matrix is a table that summarizes the performance of a classification model on a set of test data for which the true values are known.

from sklearn.metrics import confusion_matrix

matrix = confusion_matrix(y_test, y_pred)
print('Confusion matrix: \n', matrix)

By evaluating your model using these metrics, you can get a good understanding of its performance across different aspects.

It’s worth noting that LightGBM can be used for regression and multi-class classification as well.
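
For reference, here’s a minimal sketch of those two setups with the same scikit-learn-style API (the data is assumed, not the Adult dataset used above):

from lightgbm import LGBMClassifier, LGBMRegressor

# Regression: predicts a continuous target.
reg = LGBMRegressor(learning_rate=0.05, n_estimators=100, num_leaves=31)

# Multi-class classification: LGBMClassifier switches to a multi-class
# objective automatically when y contains more than two classes.
multi_clf = LGBMClassifier(learning_rate=0.05, n_estimators=100, num_leaves=31)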