Overview Of My 3rd Place Solution To The Criteo ECML-PKDD 22 Challenge

Some posts ago, I wrote about a competition sponsored by Criteo that I decided to enter to learn more about the out-of-distribution robustness of machine learning models. It turns out I got 3rd place and a prize! In any competition, you need to write a report on your solution to claim the prize, so I decided to post it here too. Enjoy!

Table of Contents: Background, Preprocessing, Validation, XGBoost, LightGBM, CatBoost, Logistic Regression, Results Table With Best Model Scores In Bold, Final Ensemble

Background

The competition had a dataset with about 40 categorical features “aggregated from traces in computational advertising” and a binary target....

June 8, 2022 · 5 min · Mario Filho

Can We Solve Distribution Shift With Clever Training In Machine Learning?

One of the biggest problems we have when using machine learning in practice is distribution shift. A distribution shift occurs when the data the model sees in production starts to look different from the data used to train it. A simple example that broke a lot of models was COVID. The quarantine changed how people behaved, and the historical data became less representative. Another good example is credit card fraud....

May 27, 2022 · 12 min · Mario Filho

Are Kaggle Competitions Worth It? Ponderings of a Kaggle Grandmaster

I would not have a data science career without Kaggle. So if you are looking for a blog post bashing Kaggle, this is not the place. That said, I am not a radical who thinks Kaggle is the ultimate thing that everyone must do in order to become a data scientist. I want to give an honest opinion from the perspective of someone who competed heavily but decided to “retire” a few years ago....

May 25, 2022 · 15 min · Mario Filho