Why XGBoost Still Beats Deep Learning At The Tabular Data Game

Gradient boosted decision trees (GBDTs) are the current state of the art on tabular data. They are used in many Kaggle competitions and are the go-to model for many data scientists, as they tend to get better performance than neural networks while being easier and faster to train. Neural networks, on the other hand, are the state of the art in many other tasks, such as image classification, natural language processing, and speech recognition....

November 20, 2023 · 6 min · Mario Filho

PyTorch Implementation of Google's TW-BERT for Information Retrieval

Information retrieval (IR) systems are crucial for a wide range of applications, from web search engines to personal digital assistants. However, traditional IR systems can struggle with accurately understanding and ranking the relevance of documents based on user queries. They typically rely on simple term-matching methods, known as sparse retrieval methods, that may not fully capture the semantic meanings of the terms in the queries. While these methods are computationally efficient and easy to scale, they treat each term independently and fail to capture the contextual relationships between them....

October 9, 2023 · 10 min · Mario Filho

Unified Embeddings in PyTorch for Efficient Recommendation Systems

In machine learning, particularly in the fields of recommendation systems and natural language processing, we often deal with categorical features. These features can be anything from user IDs and product IDs to words in a text. One common practice for handling these categorical features is to represent them as embeddings, which are dense vector representations learned during the training process. However, when dealing with web-scale machine learning systems, the number of unique categorical features can be extremely large, leading to a massive number of embeddings....

October 9, 2023 · 9 min · Mario Filho

How To Train A Random Forest With XGBoost

Are you looking to train a Random Forest using XGBoost for classification or regression tasks but aren’t sure where to start? In this tutorial, I will first briefly explain the mechanisms behind XGBoost and Random Forest and highlight their differences. Then, I’ll guide you through a step-by-step process of training an XGBoost Random Forest for both classification and regression tasks using a real-world dataset. By the end of this tutorial, you’ll be well-equipped to tackle your own projects with confidence and expertise....

September 29, 2023 · 6 min · Mario Filho

How To Use LightGBM For Learning To Rank In Python

This tutorial is your roadmap to training a LightGBM model for ranking tasks in Python. You’ll learn how to install LightGBM in your Python environment, prepare your data correctly, and train a model using LightGBM’s Ranker. I’ll also cover how to evaluate your model’s performance using the industry-standard Normalized Discounted Cumulative Gain (NDCG) metric. By the end, you’ll have a solid understanding of LTR with LightGBM and be ready to tackle real-world ranking problems....

September 26, 2023 · 8 min · Mario Filho

How To Use LightGBM For Multiclass Classification in Python

Looking to use LightGBM for multiclass classification in Python but unsure of how to proceed? This tutorial is designed to get you up to speed. I’ll guide you through each step, from data preparation to model building, training, and evaluation. By the end of this tutorial, you will be ready to apply these steps to your own projects. So, let’s dive right in! Installing LightGBM in Python: Before we dive into the main content of this tutorial, let’s first ensure that you have the LightGBM library installed in your Python environment....

September 22, 2023 · 7 min · Mario Filho

How To Use LightGBM For Regression in Python

Are you trying to create a regression model using the LightGBM library in Python but finding it challenging? Perhaps you’re unsure about installing the library, setting up the model, preparing the data, or evaluating your model’s performance. You’re in the right place. This tutorial will guide you through each of these steps. We’ll install LightGBM, prepare a dataset, train a model, make predictions, and evaluate the results. By the end, you’ll have a functional LightGBM regression model and a solid understanding of the process....

September 22, 2023 · 8 min · Mario Filho

LightGBM For Binary Classification In Python

Want to use LightGBM for a binary classification task but feel stuck? In this tutorial, you are going to see a step-by-step example of how to do it in Python. I’ll also explain how to handle class imbalance, a common issue in binary classification tasks. What Is LightGBM? LightGBM, which stands for “Light Gradient Boosting Machine,” is an open-source, distributed, high-performance gradient boosting framework developed by Microsoft. It is designed for efficient and scalable training on large datasets and is particularly well-suited for problems involving large numbers of features or high-dimensional data....

November 14, 2023 · 8 min · Mario Filho

How To Use CatBoost For Regression In Python

As a Python user aiming to predict a continuous target variable from a dataset with both numerical and categorical features, you’ve made a great choice in considering CatBoost. This high-performance machine learning algorithm is particularly known for its ability to handle categorical variables effectively. In this tutorial, I’ll guide you step-by-step on how to use CatBoost for regression tasks. We’ll start by preparing your data, then train the CatBoost model, and finally evaluate its performance....

September 18, 2023 · 7 min · Mario Filho

How To Use CatBoost For Multiclass Classification In Python

Are you looking to tackle a multiclass classification problem using Python and stumbled upon CatBoost? Or perhaps you’ve heard about CatBoost’s impressive handling of categorical data and now you’re curious to see it in action with multiclass classification. Either way, you’ve come to the right place! In this tutorial, we’re going to explore how to use CatBoost, a powerful machine learning library, to conquer multiclass classification problems. I’ll start by giving you a quick primer on CatBoost and why it’s an excellent choice for multiclass classification....

September 15, 2023 · 7 min · Mario Filho