How To Deal With Categorical Variables in XGBoost

Working with categorical data in machine learning can be a bit of a headache, especially when using algorithms like XGBoost. XGBoost, despite being a powerful and efficient gradient boosting library, is designed to work with numeric data. This means you need to find a way to transform categorical data into a format XGBoost can understand. This can be a time-consuming and complex process, especially if you’re dealing with a large number of categorical variables or categories....
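One common workaround, sketched here with a tiny made-up dataset (not code from the post), is one-hot encoding with pandas, which turns each category into its own numeric column:

```python
import pandas as pd

# Toy dataset with one categorical column (illustrative values only)
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "price": [10.0, 12.5, 9.0, 11.0],
})

# One-hot encode the categorical column so every feature is numeric
encoded = pd.get_dummies(df, columns=["color"])
print(encoded.columns.tolist())
# ['price', 'color_blue', 'color_green', 'color_red']
```

Recent XGBoost releases can also consume pandas `category` dtype columns directly when `enable_categorical=True` is passed to the estimator.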

August 1, 2023 · 5 min · Mario Filho

How to Get Feature Importance in XGBoost in Python

You’ve chosen XGBoost as your algorithm, and now you’re wondering: “How do I figure out which features are the most important in my model?” That’s what ‘feature importance’ is all about. It’s a way of finding out which features in your data are doing the heavy lifting when it comes to your model’s predictions. Understanding which features are important can help you interpret your model better. Maybe you’ll find a feature you didn’t expect to be important....

July 18, 2023 · 6 min · Mario Filho

How To Use LightGBM For Multi-Output Regression And Classification In Python

Today, we’re going to dive into the world of LightGBM and multi-output tasks. LightGBM is a powerful gradient boosting framework (like XGBoost) that’s widely used for various tasks. But what if you want to predict multiple outputs at once? That’s where multi-output regression and classification come in. Unfortunately, LightGBM doesn’t support multi-output tasks directly, but we can use scikit-learn’s MultiOutputRegressor to get around this limitation. What Is Multi-Output Regression and Classification? First, let’s break down what these terms mean....
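As a quick sketch of the wrapper idea on synthetic data — using scikit-learn’s GradientBoostingRegressor as a stand-in for `lightgbm.LGBMRegressor`, which follows the same estimator API:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
# Two targets stacked as columns
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] - X[:, 3]])

# MultiOutputRegressor fits one independent copy of the estimator per target;
# swap in lightgbm.LGBMRegressor() in practice
model = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
model.fit(X, Y)

preds = model.predict(X)
print(preds.shape)  # (100, 2) — one prediction per target
```

The trade-off of this wrapper is that each target gets its own model, so correlations between outputs are not exploited.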

July 6, 2023 · 5 min · Mario Filho

How To Solve Logistic Regression Not Converging in Scikit-Learn

When using the Scikit-Learn library, you might encounter a situation where your logistic regression model does not converge. You may get a warning message similar to this: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. Increase the number of iterations (max_iter) or scale the data as shown in: https://scikit-learn.org/stable/modules/preprocessing.html Please also refer to the documentation for alternative solver options: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression n_iter_i = _check_optimize_result( This article aims to help you understand why this happens and how to resolve it....
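The two fixes the warning itself suggests — scaling the data and raising `max_iter` — can be combined in a pipeline; here is a minimal sketch on synthetic data with one deliberately badly scaled feature:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[:, 0] *= 1000  # a feature on a wildly different scale often triggers the warning

# Scale the inputs and give the solver more iterations
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

# n_iter_ shows how many iterations the solver actually needed
print(clf.named_steps["logisticregression"].n_iter_)
```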

July 1, 2023 · 6 min · Mario Filho

Does Random Forest Need Feature Scaling or Normalization?

If you are using Random Forest as your machine learning model, you don’t need to worry about scaling or normalizing your features. Random Forest is a tree-based model and hence does not require feature scaling. Tree-based models are invariant to the scale of the features, which makes them very user-friendly as this step can be skipped during preprocessing. Still, in practice you can see different results when you scale your features because of the way numerical values are represented in computers....
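A small sanity check on synthetic data (my example, not the post’s) shows the invariance in action — the same forest trained on raw and standardized features produces essentially identical predictions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

raw = RandomForestClassifier(random_state=0).fit(X, y).predict(X)
scaled = RandomForestClassifier(random_state=0).fit(X_scaled, y).predict(X_scaled)

# Tree splits depend only on the ordering of feature values, and scaling is
# monotonic, so the two forests agree (up to floating-point tie-breaking)
agreement = (raw == scaled).mean()
print(agreement)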

June 29, 2023 · 5 min · Mario Filho

StandardScaler vs MinMaxScaler: What's the Difference?

The main differences between StandardScaler and MinMaxScaler lie in the way they scale the data, the range of values they produce, and the specific applications they’re suited for. StandardScaler subtracts the mean from each data point and then divides the result by the standard deviation. This results in a dataset with a mean of 0 and a standard deviation of 1. MinMaxScaler, on the other hand, subtracts the minimum value from each data point and then divides the result by the difference between the maximum and minimum values....
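The two formulas are easy to verify numerically — a minimal check on a toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# (x - mean) / std  -> mean 0, standard deviation 1
std = StandardScaler().fit_transform(X)
# (x - min) / (max - min)  -> values in [0, 1]
mm = MinMaxScaler().fit_transform(X)

print(std.mean(), std.std())  # ~0.0 and 1.0
print(mm.min(), mm.max())     # 0.0 and 1.0
```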

June 23, 2023 · 7 min · Mario Filho

How To Train A Logistic Regression Using Scikit-Learn (Python)

Logistic regression is a type of predictive model used in machine learning and statistics. Its purpose is to determine the likelihood of an outcome based on one or more input variables, also known as features. For example, logistic regression can be used to predict the probability of a customer churning, given their past interactions and demographic information. Difference Between Linear And Logistic Regression? Before diving into logistic regression, it’s important to understand its sibling model, linear regression....
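As a bare-bones sketch of the training workflow on synthetic data (standing in for the churn example, which would use real customer features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba returns a probability per class — e.g. the likelihood of churn
proba = model.predict_proba(X_test)
print(proba.shape, model.score(X_test, y_test))
```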

June 21, 2023 · 17 min · Mario Filho

How To Get Feature Importance In LightGBM (Python Example)

LightGBM is a popular gradient boosting framework that uses tree-based learning algorithms. These algorithms are excellent for handling tabular data and are widely used in various machine learning applications. One of the key aspects of understanding your model’s behavior is knowing which features contribute the most to its predictions, and that’s where feature importance comes into play. By the end of this guide, you’ll have a better grasp on the importance of your features and how to visualize them, which will help you improve your model’s performance and interpretability....
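A quick taste of the attribute involved — `lightgbm.LGBMRegressor` exposes the same `feature_importances_` array as scikit-learn estimators, so this sketch uses sklearn’s GradientBoostingRegressor as a stand-in on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=5, n_informative=2,
                       random_state=0)

# Swap in lightgbm.LGBMRegressor() in practice; the attribute is the same
model = GradientBoostingRegressor(random_state=0).fit(X, y)

ranking = np.argsort(model.feature_importances_)[::-1]
print(ranking)  # feature indices, most important first
```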

September 19, 2023 · 11 min · Mario Filho

A Guide To Normalized Cross Entropy

What Is Normalized Cross Entropy? Normalized Cross Entropy is a modified version of Cross Entropy that takes into account a baseline solution or “base level.” It essentially measures the relative improvement of your model over the selected baseline solution. The very popular Practical Lessons from Predicting Clicks on Ads at Facebook paper popularized the concept of Normalized Cross Entropy (NCE) as a metric for binary classification problems. Although the metric was likely included in the paper because the authors didn’t want to disclose the actual cross entropy of their model, it can be a useful metric....
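Taking the baseline to be a model that always predicts the positive-class base rate (the common choice, following the paper), NCE is just the ratio of the two cross entropies — a minimal NumPy sketch with made-up predictions:

```python
import numpy as np

def cross_entropy(y_true, p):
    """Average binary cross entropy; p may be a scalar or an array."""
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def normalized_cross_entropy(y_true, y_pred):
    # Baseline: always predict the base rate (positive-class frequency)
    base_rate = y_true.mean()
    return cross_entropy(y_true, y_pred) / cross_entropy(y_true, base_rate)

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.2, 0.8, 0.7, 0.1])
nce = normalized_cross_entropy(y_true, y_pred)
print(nce)  # below 1.0 means the model beats the base-rate baseline
```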

June 9, 2023 · 6 min · Mario Filho

Multiple Time Series Forecasting With N-BEATS In Python

Imagine having a robust forecasting solution capable of handling multiple time series data without relying on complex feature engineering. That’s where N-BEATS comes in! In this tutorial, I’ll break down its inner workings, walk you through the process of installing and configuring NeuralForecast to train an N-BEATS model in Python, and show you how to effectively prepare and split your time series data. Furthermore, we’ll explore hyperparameter tuning with Optuna....

June 2, 2023 · 14 min · Mario Filho

Multiple Time Series Forecasting With GRU In Python

So, you’ve already explored the world of LSTMs and now you’re curious about their sibling GRUs (Gated Recurrent Units) and how they can enhance your time series forecasting projects… Great! As machine learning practitioners, we’re always looking for ways to expand our knowledge and improve our model choices. In this tutorial, we’ll take a deep dive into GRUs, covering their inner workings, and comparing them to LSTMs. By the end of this tutorial, you’ll have a solid understanding of GRUs and be well-equipped to use them effectively in Python....

May 25, 2023 · 14 min · Mario Filho

Sales Forecasting For Multiple Products Using Python (Complete Guide)

As a data scientist, tackling sales forecasting for multiple products is a tough job. You know it’s essential for businesses, but dealing with different models, metrics, and complexities can be overwhelming. You might be feeling the pressure to deliver accurate forecasts to drive better decision-making and wondering how to tackle this challenge effectively. Don’t worry! I’m here to make this process easier and guide you through it. In this tutorial, I’ll simplify sales forecasting by walking you through these key steps:...

May 18, 2023 · 25 min · Mario Filho

Bagging vs Boosting vs Stacking In Machine Learning

Bagging, boosting, and stacking are three ensemble learning techniques used to improve model performance. Bagging involves training multiple models independently on random subsets of data and then combining their predictions through a majority vote. Boosting focuses on correcting the errors made by previous weak models in a sequence to create a stronger model. Stacking combines multiple models by training a meta-model, which takes model predictions as input and outputs the final prediction....
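All three patterns map directly onto scikit-learn classes; a compact sketch on synthetic data (training-set scores shown only to confirm each ensemble fits):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Bagging: independent trees on bootstrap samples, combined by majority vote
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                            random_state=0)

# Boosting: trees added sequentially, each correcting the ensemble's errors
boosting = GradientBoostingClassifier(random_state=0)

# Stacking: a meta-model trained on the base models' predictions
stacking = StackingClassifier(
    estimators=[("bag", bagging), ("boost", boosting)],
    final_estimator=LogisticRegression(),
)

for name, model in [("bagging", bagging), ("boosting", boosting),
                    ("stacking", stacking)]:
    print(name, model.fit(X, y).score(X, y))
```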

May 2, 2023 · 22 min · Mario Filho

Ensemble Time Series Forecasting in Python Made Easy with AutoGluon

A fast, easy, and hands-off approach to creating ensemble models for time series forecasting is using AutoGluon. AutoGluon is an open-source AutoML library for deep learning. It’s a great tool for time series forecasting because it can automatically select the best models for your data and ensemble them together to create a more accurate model. It also handles missing values out of the box and scales to large datasets....

April 24, 2023 · 12 min · Mario Filho

CatBoost Hyperparameter Tuning Guide with Optuna

You’ve built a CatBoost model; now what? Hyperparameter tuning is the key to unlocking your model’s full potential. But if the thought of tackling this task feels daunting, you’re not alone. Once you’ve mastered the tips and tricks presented in this tutorial, you’ll be equipped with the skills to fine-tune any CatBoost model effectively. Let’s get started! Installing CatBoost and Optuna First, let’s install both libraries simply by running: pip install catboost optuna Or, if you’re using Anaconda, run:...

April 19, 2023 · 7 min · Mario Filho

5 Dynamic Time Warping (DTW) Libraries in Python With Examples

The world of time series analysis can be complex, and finding the right Python library for Dynamic Time Warping can be even more so. That’s where this tutorial comes in! My goal is to provide you with an easy-to-follow guide that will help you understand the various options available and make the right choice for your project. Whether you are a beginner or an expert, you will find valuable insights here....

April 13, 2023 · 6 min · Mario Filho

XGBoost Hyperparameter Tuning With Optuna (Kaggle Grandmaster Guide)

Trying to find the right hyperparameters for XGBoost can feel like searching for a needle in a haystack. Trust me, I’ve been there. XGBoost was a crucial model to win at least two of the Kaggle competitions I participated in. By the end of this tutorial, you’ll be equipped with the exact same techniques I used to optimize my models and achieve those top rankings. Let’s get started! Installing XGBoost And Optuna Installing XGBoost is easy, just run:...

April 10, 2023 · 8 min · Mario Filho

How To Use Optuna to Tune LightGBM Hyperparameters

As a Kaggle Grandmaster, I absolutely love working with LightGBM, a fantastic machine learning library that’s become one of my go-to tools. I always focus on tuning the model’s hyperparameters before diving into feature engineering. Think of it like cooking up the perfect dish. You want to make sure you’ve got the right ingredients and their quantities before you start experimenting with new flavors. By fine-tuning your hyperparameters first, you’ll squeeze every last drop of performance out of your model with the data you already have....

April 7, 2023 · 9 min · Mario Filho

Time Series Anomaly Detection in Python

Discovering outliers, unusual patterns or events in your time series data has never been easier! In this tutorial, I’ll walk you through a step-by-step guide on how to detect anomalies in time series data using Python. You won’t have to worry about missing sudden changes in your data or trying to keep up with patterns that change over time. I’ll use website impressions data from Google Search Console as an example, but the techniques I cover will work for any time series data....
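One of the simplest techniques in this family is a rolling z-score — flag any point that sits far from its recent rolling mean. A minimal sketch on synthetic data (standing in for the Search Console impressions the post uses):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
values = rng.normal(100, 5, 60)
values[45] = 200  # inject an obvious anomaly
s = pd.Series(values)

# Rolling z-score: distance of each point from its recent rolling mean,
# measured in rolling standard deviations
window = 14
z = (s - s.rolling(window).mean()) / s.rolling(window).std()

anomalies = s[z.abs() > 3].index.tolist()
print(anomalies)  # should include index 45
```

A threshold of 3 standard deviations is a common starting point; tightening or loosening it trades off missed anomalies against false alarms.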

September 28, 2023 · 10 min · Mario Filho

Do Neural Networks Need Feature Scaling Or Normalization?

In short, feature scaling or normalization is not strictly required for neural networks, but it is highly recommended. Scaling or normalizing the input features can be the difference between a neural network that converges in a few iterations and one that takes hundreds of iterations to converge or even fails to converge at all. The optimization process may become slower because the gradients in the direction of the larger-scale features will be significantly larger than the gradients in the direction of the smaller-scale features....
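The gradient imbalance is easy to see numerically — a toy linear model (my example, plain NumPy) where one input is three orders of magnitude larger than the other:

```python
import numpy as np

rng = np.random.RandomState(0)
n = 1000
# Two features on very different scales
x_small = rng.normal(0, 1, n)
x_large = rng.normal(0, 1000, n)
X = np.column_stack([x_small, x_large])
w_true = np.array([2.0, 0.003])
y = X @ w_true

# Gradient of the MSE loss at w = 0: the large-scale feature dominates,
# forcing a tiny learning rate and slow progress on the small-scale weight
w = np.zeros(2)
grad = -2 / n * X.T @ (y - X @ w)
print(np.abs(grad))
```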

April 4, 2023 · 8 min · Mario Filho