As a data scientist, tackling sales forecasting for multiple products is a tough job.
You know it’s essential for businesses, but dealing with different models, metrics, and complexities can be overwhelming.
You might be feeling the pressure to deliver accurate forecasts to drive better decisionmaking and wondering how to tackle this challenge effectively.
Don’t worry! I’m here to make this process easier and guide you through it.
In this tutorial, I’ll simplify sales forecasting by walking you through these key steps:
 Making a validation set for sales forecasting
 Creating and checking an initial baseline model
 Learning evaluation metrics like SMAPE, Pinball Loss, and Coverage Probability
 Trying techniques like ARIMA, HoltWinters, and DeepAR (LSTM) models
 Finetuning DeepAR hyperparameters for the best results
By the end of this tutorial, you’ll be skilled in handling various sales forecasting models and techniques.
This will help you develop and refine forecasts for different products, leading to smarter decisions for your business.
Eager to level up your sales forecasting abilities? Let’s dive in!
This method is bestsuited for products that have some sales history and sell almost every period.
Check our other tutorial for forecasting products with intermittent demand.
How To Transform Transactional Data For Sales Forecasting
To illustrate the importance of transforming transactional data into a sales forecastingfriendly format, let’s use an example from the retail industry.
Imagine a store selling a variety of products, like electronics, clothes, and toys.
Their raw transactional data would include details like product IDs, transaction timestamps, and quantities sold, but this information may be scattered across multiple tables or files.
By transforming this raw data into a more structured form, you can easily track the sales trends of different products over time.
Here we will work as if we were hired by a store that wants to forecast the sales performance of its top 100 most frequently sold products on a weekly basis.
You can expand this method to forecast the sales of more products or on a daily basis.
The final format we need the data to be in is the long format with each row containing: unique IDs for the products, a reference date and the quantity or dollar amount sold.
To show you how to do it, I found the Olist Brazilian ECommerce Public Dataset on Kaggle, which contains real transactional data for a Brazilian ecommerce company from 2016 to 2018.
First, the code reads in two separate data files, one that contains order items and the other with order details.
Then, it merges the two data sources together using the 'order_id'
as the common key to match the relevant information.
import os
import pandas as pd
order_items = pd.read_csv(os.path.join(path, 'olist_order_items_dataset.csv'))
orders = pd.read_csv(os.path.join(path, 'olist_orders_dataset.csv'))
orders = orders[['order_id', 'order_purchase_timestamp']]
data = pd.merge(order_items, orders, on='order_id')
This is what 'olist_order_items_dataset.csv'
looks like:
order_id  order_item_id  product_id  seller_id  shipping_limit_date  price  freight_value 

00010242fe8c5a6d1ba2dd792cb16214  1  4244733e06e7ecb4970a6e2683c13e61  48436dade18ac8b2bce089ec2a041202  20170919 09:45:35  58.9  13.29 
00018f77f2f0320c557190d7a144bdd3  1  e5f2d52b802189ee658865ca93d83a8f  dd7ddc04e1b6c2c614352b383efe2d36  20170503 11:05:13  239.9  19.93 
000229ec398224ef6ca0657da4fc703e  1  c777355d18b72b67abbeef9df44fd0fd  5b51032eddd242adc84c38acab88f23d  20180118 14:48:30  199  17.87 
00024acbcdf0a6daa1e931b038114c75  1  7634da152a4610f1595efa32f14722fc  9d7a1d34a5052409006425275ba1c2b4  20180815 10:10:18  12.99  12.79 
00042b26cf59d7ce69dfabb4e55b4fd9  1  ac6c3623068f30de03045865e4e10089  df560393f3a51e74553ab94004ba5c87  20170213 13:57:51  199.9  18.14 
And this is what selected columns of 'olist_orders_dataset.csv'
looks like:
order_id  order_purchase_timestamp 

e481f51cbdc54678b7cc49136f2d6af7  20171002 10:56:33 
53cdb2fc8bc7dce0b6741e2150273451  20180724 20:41:37 
47770eb9100c2d0c44946d9cf07ec65d  20180808 08:38:49 
949d5b44dbf5de918fe9c16f97b45f8a  20171118 19:28:06 
ad21c59c0840e6cb83a9ceb5573f8159  20180213 21:18:39 
After merging, this is the dataset we get:
order_id  order_item_id  product_id  seller_id  shipping_limit_date  price  freight_value  order_purchase_timestamp 

00010242fe8c5a6d1ba2dd792cb16214  1  4244733e06e7ecb4970a6e2683c13e61  48436dade18ac8b2bce089ec2a041202  20170919 09:45:35  58.9  13.29  20170913 08:59:02 
00018f77f2f0320c557190d7a144bdd3  1  e5f2d52b802189ee658865ca93d83a8f  dd7ddc04e1b6c2c614352b383efe2d36  20170503 11:05:13  239.9  19.93  20170426 10:53:06 
000229ec398224ef6ca0657da4fc703e  1  c777355d18b72b67abbeef9df44fd0fd  5b51032eddd242adc84c38acab88f23d  20180118 14:48:30  199  17.87  20180114 14:33:31 
00024acbcdf0a6daa1e931b038114c75  1  7634da152a4610f1595efa32f14722fc  9d7a1d34a5052409006425275ba1c2b4  20180815 10:10:18  12.99  12.79  20180808 10:00:35 
00042b26cf59d7ce69dfabb4e55b4fd9  1  ac6c3623068f30de03045865e4e10089  df560393f3a51e74553ab94004ba5c87  20170213 13:57:51  199.9  18.14  20170204 13:57:51 
Now we have information about when the order was made and we can count how much of each product was sold each period.
The data is already in the long format, where each row represents a single observation, and each column represents a different attribute of that observation.
data_long = data[['product_id', 'order_purchase_timestamp', 'order_item_id']].copy()
data_long['order_purchase_timestamp'] = pd.to_datetime(data_long['order_purchase_timestamp']).dt.normalize()
data_long = data_long[data_long['order_purchase_timestamp'] >= '20170101']
data_long['week_start_date'] = (data_long['order_purchase_timestamp'] + pd.Timedelta(days=1)).apply(lambda x: x  pd.offsets.Week(weekday=6))
In this case, each observation is a product sold, and the attributes are the product’s ID, the date it was sold, and the number of units sold.
The data is filtered to only include orders made on or after '20170101'
and we create a new column 'week_start_date'
that will be our reference period to aggregate the data.
The purpose of this particular code segment is to assign a consistent week_start_date
to every transaction in the dataset.
Be very careful when dealing with dates, because it’s very easy to get confused about the exact date a period starts or ends.
In this case, I had to experiment with multiple transformations until I found one that worked.
Here’s a breakdown of the line that creates the 'week_start_date'
column:
data_long['order_purchase_timestamp']
contains the purchase dates for each transaction.pd.Timedelta(days=1)
is used to add one day to the purchase date, which helps account for transactions occurring on the starting weekday (Sunday).(data_long['order_purchase_timestamp'] + pd.Timedelta(days=1)).apply(lambda x: x  pd.offsets.Week(weekday=6))
calculates theweek_start_date
for each transaction by rolling back the date to the most recent Sunday.
Next, the data is grouped by 'product_id'
and 'week_start_date'
, counting the number of units sold each week for each product.
This new aggregated table is then renamed from 'order_item_id'
to 'quantity_sold'
to more accurately reflect its contents.
To focus on the top 100 bestselling products, the data is further filtered for these specific items.
data_grouped = data_long.groupby(['product_id', 'week_start_date'])['order_item_id'].sum().reset_index()
data_grouped = data_grouped.rename(columns={'order_item_id': 'quantity_sold'})
top100 = data_grouped['product_id'].value_counts().head(100).index
data_grouped = data_grouped[data_grouped['product_id'].isin(top100)]
product_id  week_start_date  quantity_sold 

0152f69b6cf919bcdaf117aa8c43e5a2  20170219 00:00:00  1 
0152f69b6cf919bcdaf117aa8c43e5a2  20170305 00:00:00  4 
0152f69b6cf919bcdaf117aa8c43e5a2  20170312 00:00:00  6 
0152f69b6cf919bcdaf117aa8c43e5a2  20170319 00:00:00  6 
0152f69b6cf919bcdaf117aa8c43e5a2  20170326 00:00:00  4 
All the products are “stacked” on top of each other in the same DataFrame.
One problem we still have is that not every product has been sold every week, so there are missing values in the data.
We want each product to have a value for every week, even if it’s zero.
So we use the pivot
method to transform the data into wide format, where each row represents a single product, and each column represents a different week.
Then, after filling in the missing values with zeros, we use the stack
method to transform the data back into long format.
data_pivoted = data_grouped.pivot(index='product_id', columns='week_start_date', values='quantity_sold').fillna(0)
data_long = data_pivoted.stack().reset_index()
data_long = data_long.rename(columns={'level_1': 'week_start_date', 0: 'quantity_sold'})
The final format is the same as the last table above, but now we have a value for every product for every week.
If your data contains missing data because of the way it was collected instead of simple sparsity (like in this case), you should be careful about filling in the missing values.
We can verify that all products have the same number of observations by ensuring that the standard deviation of the row count per product is zero.
assert data_long.groupby('product_id').size().describe()['std'] == 0
Finally, we rename the columns to match the requirements of the libraries we’ll be testing for forecasting.
data_long = data_long.rename(columns={'week_start_date': 'ds', 'quantity_sold': 'y', 'product_id': 'unique_id'})
How To Create a Validation Set For Sales Forecasting
Now that we have transformed our transactional data into a forecastingfriendly format, the next crucial step is to create a validation set.
A validation set is used to evaluate the performance of the model before we use it for future sales predictions.
To create a validation set, we’ll split our data into a training dataset and a validation dataset.
We could have a test dataset as well but, in my experience, you can make a lot of mistakes that render your test set unreliable, so I take production data as the real test set.
Here’s the code to split the data:
train = data_long[data_long['ds'] < '20180101']
valid = data_long[(data_long['ds'] >= '20180101') & (data_long['ds'] < '20180301')]
h = valid['ds'].nunique()
print('h =', h)
print('train:', train)
print('valid:', valid)
We split the data into three parts:
 Training: This dataset includes all records before ‘20180101’. We will use it to train our forecasting models.
 Validation: This dataset includes records between ‘20180101’ and ‘20180301’. We will use this data to validate our model’s performance and finetune its parameters.
The variable h
is used to represent the forecasting horizon.
The forecasting horizon is the number of time periods into the future that we want to forecast.
In this case, h
represents the number of weeks in our validation set, which we will use to forecast sales.
Now, with the training and validation sets, we can proceed to fit our sales forecasting models, evaluate their performances, and make predictions.
How to Build an Initial Sales Forecasting Baseline Model
Establishing a baseline is an essential step in the sales forecasting process.
Creating a simple baseline model helps you understand the sales forecasting problem better and provides a base performance level that more complex models should aim to surpass.
In other words, deploying complex models isn’t worth it if they can’t outperform a simple baseline.
In this section, we will use a selection of basic models to create a foundation for our sales forecasting.
Importing Libraries and Creating Baseline Models
First, we’ll import the necessary libraries and establish a variety of basic models.
These include:
Naive
: A naive model that assumes sales will be the same as the most recent observation.SeasonalNaive
: A model that takes seasonality into account and repeats the sales pattern from the previous season.WindowAverage
: A moving average model that calculates the average sales over a specified window of previous observations.SeasonalWindowAverage
: A model that captures both seasonality and a moving average by combining the seasonal naive and window average approaches.
In our case, a season is (arbitrarily) defined as 4 weeks. We expect that the sales pattern for a given product will be similar every 4 weeks.
It makes sense if you think about salary payments and credit card bills due dates.
The right seasonal length will depend on the data and the problem you’re trying to solve.
Talk with domain experts to determine the right season length for your problem.
We will use the StatsForecast library to create these models.
You can easily install it using pip
:
pip install statsforecast
Or conda
:
conda install c nixtla statsforecast
Here’s the code to import the libraries and set up these baseline models:
from statsforecast import StatsForecast
from statsforecast.models import Naive, SeasonalNaive, WindowAverage, SeasonalWindowAverage
model = StatsForecast(models=[Naive(),
SeasonalNaive(season_length=4),
WindowAverage(window_size=4),
SeasonalWindowAverage(window_size=2, season_length=4)],
freq='W', n_jobs=1)
Fitting the Baseline Models
With the baseline models set up, we’ll now fit them using our training dataset:
model.fit(train)
Generating Forecasts and Evaluating Performance
After fitting the models, we can generate forecasts for the validation dataset.
Remember that, in this example, we use a forecasting horizon (h
) equal to the number of unique dates in the validation dataset.
I used 9 weeks, but you can change this value to any number of weeks you want.
p = model.forecast(h=h, level=[90])
level
is a parameter that controls the prediction interval.
Point forecasts provide a single estimate for the future, while prediction intervals show a range where the actual value might be.
Working with nontechnical stakeholders, these intervals help create realistic expectations about the model’s performance, which leads to improved planning and decisionmaking.
Again, you can change this value to any number that is appropriate for your problem.
This is what the p
dataframe looks like:
ds  Naive  Naivelo90  Naivehi90  SeasonalNaive  SeasonalNaivelo90  SeasonalNaivehi90  WindowAverage  SeasWA 

20180107 00:00:00  1  0.644854  2.64485  0  4.45428  4.45428  0.5  0 
20180114 00:00:00  1  1.32617  3.32617  1  3.45428  5.45428  0.5  0.5 
20180121 00:00:00  1  1.84897  3.84897  0  4.45428  4.45428  0.5  0 
20180128 00:00:00  1  2.28971  4.28971  1  3.45428  5.45428  0.5  0.5 
20180204 00:00:00  1  2.678  4.678  0  4.45428  4.45428  0.5  0 
It has a column for each model, including the point forecast and the prediction interval bounds we specified for the models that support it.
In our case, negative values make no sense, so we’ll replace them with zeros:
cols = ['Naive', 'Naivelo90', 'Naivehi90', 'SeasonalNaive',
'SeasonalNaivelo90', 'SeasonalNaivehi90', 'WindowAverage',
'SeasWA']
p.loc[:, cols] = p.loc[:, cols].clip(0)
Finally, we merge the forecasts with the actual sales from the validation dataset to evaluate the models’ performance.
p = p.reset_index().merge(valid, on=['ds', 'unique_id'], how='left')
unique_id  ds  Naive  Naivelo90  Naivehi90  SeasonalNaive  SeasonalNaivelo90  SeasonalNaivehi90  WindowAverage  SeasWA  y 

0152f69b6cf919bcdaf117aa8c43e5a2  20180107 00:00:00  1  0  2.64485  0  0  4.45428  0.5  0  2 
0152f69b6cf919bcdaf117aa8c43e5a2  20180114 00:00:00  1  0  3.32617  1  0  5.45428  0.5  0.5  1 
0152f69b6cf919bcdaf117aa8c43e5a2  20180121 00:00:00  1  0  3.84897  0  0  4.45428  0.5  0  1 
0152f69b6cf919bcdaf117aa8c43e5a2  20180128 00:00:00  1  0  4.28971  1  0  5.45428  0.5  0.5  0 
0152f69b6cf919bcdaf117aa8c43e5a2  20180204 00:00:00  1  0  4.678  0  0  4.45428  0.5  0  0 
Now that we have our baseline models in place, we can compare their performance with more complex models.
A wellperforming complex model should outperform these baseline models, justifying any additional computational complexity and efforts needed in model development and deployment.
How To Evaluate a Sales Forecasting Model
To assess the performance of a sales forecasting model, we need to use multiple evaluation metrics.
In this section, we will focus on three key metrics: the Symmetric Mean Absolute Percentage Error (SMAPE) for point forecasts, the pinball loss for quantile metrics, and the coverage probability.
Symmetric Mean Absolute Percentage Error (SMAPE)
SMAPE is a scaleindependent measure of accuracy that considers both underestimation and overestimation errors.
It’s particularly useful for sales forecasting because it allows for a fair comparison of forecasts even if sales values vary widely across products or time and can handle zero values in the true sales data.
Still, when evaluating sales forecasting models using SMAPE, it’s essential to be cautious about interpreting the percentage errors for products with low sales volumes.
For example, when comparing a 50% error in a product that sells 2 units versus a 50% error in a product that sells 100 units, the error in absolute terms is quite different.
In the first case, the 50% error translates to an error of only 1 unit, while in the second case, it corresponds to an error of 50 units.
In practical terms, the impact of the two errors on inventory management or revenue is very different.
Make sure to consider the practical implications of the errors in terms of units or revenue.
Here’s how we calculate SMAPE for the Naive model:
import numpy as np
def smape_loss(y_true, y_pred):
return 100 * np.mean(np.abs(y_true  y_pred) / (np.abs(y_true) + np.abs(y_pred)))
print(smape_loss(p['y'], p['Naive']))
Pinball Loss
Pinball loss is a specialized metric for evaluating the accuracy of quantile forecasts, which estimate specific percentile values of the future sales distribution rather than just a single point.
For example, when predicting the 95th percentile sales volume of a product, pinball loss assesses how well the model estimates that specific quantile.
An accurate 95th percentile forecast means you can better estimate the upper range of potential sales outcomes, which allows you to avoid overstocking and reduce the risk of stockouts.
A lower pinball loss indicates a more accurate prediction, while a higher value suggests less precision.
To calculate the pinball loss, we use the mean_pinball_loss
function from the sklearn.metrics
module:
from sklearn.metrics import mean_pinball_loss
print(mean_pinball_loss(p['y'], p['Naivelo90'], alpha=0.05), mean_pinball_loss(p['y'], p['Naivehi90'], alpha=0.95))
As level
is set to cover 90% of the distribution, the alpha
value for the lower quantile is 0.05, and the alpha
value for the upper quantile is 0.95.
It’s very easy to confuse the alpha
value with the level
.
Coverage Probability
Coverage probability measures how often the actual sales values lie within the prediction intervals.
A high coverage probability indicates that the true sales values often lie within the predicted range, giving you confidence in your model’s ability to provide reliable estimates of future sales.
When measuring 90% prediction intervals, you are aiming for a model that provides a range within which the true sales values should fall 90% of the time.
If your calculated coverage probability significantly deviates from the target 90%, it could indicate that your model’s prediction intervals may not be reliable.
For example, if your coverage probability is only 75%, it means that the true sales values fall within the prediction intervals only 75% of the time, which is lower than the desired 90%.
Then you need to go back and investigate why your model is not providing accurate prediction intervals.
Here’s how to calculate the coverage probability:
def coverage_prob(y_true, y_lo, y_hi):
return np.mean((y_lo <= y_true) & (y_true <= y_hi))
print(coverage_prob(p['y'], p['Naivelo90'], p['Naivehi90']))
Now that you have the three metrics and understand their definitions, you can use them to compare the performance of the baseline models with the more complex models considering multiple aspects.
Some models might improve all three metrics, while others might improve only one or two.
For example, the model with lowest SMAPE have accurate point forecasts, but low Coverage Probability may suggest you should put effort into improving the prediction intervals.
Now let’s move on to more advanced forecasting techniques using ARIMA and HoltWinters models.
How To Forecast Sales With ARIMA and HoltWinters
First, let’s import the necessary libraries and set up our models.
We’re going to use the StatsForecast
library again.
In this case, we’ll use two objects: AutoARIMA
and HoltWinters
.
ARIMA (Autoregressive Integrated Moving Average) is a linear model that combines three key components: autoregression (AR), differencing (I), and moving average (MA).
The AR component is based on the relationship between a current observation and a certain number of lagged observations (past values).
The I component helps make the time series stationary by subtracting the previous observation from the current one.
The MA component considers the relationship between a current observation and a residual error from a moving average model applied to lagged observations.
ARIMA models can capture complex patterns, like trends and seasonality, in the data.
They’re particularly useful for univariate time series data without external factors affecting the observations.
HoltWinters is a model specifically designed for time series data that exhibits both trend and seasonality.
It’s an extension of the Simple Exponential Smoothing (SES) model, which weighs past observations with a decreasing exponential scale.
HoltWinters improves upon SES by accounting for trends (using Holt’s linear trend method) and seasonal components (using a seasonal smoothing factor).
I chose to use ARIMA and HoltWinters in this tutorial because they’re wellestablished methods with a solid track record in the field of time series forecasting.
They can handle various data patterns and often produce reliable forecasts for multiple products, making them a good fit for our ecommerce sales prediction problem.
Additionally, both models can be easily implemented and finetuned using Python libraries, which makes them practical for realworld applications.
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, HoltWinters
model = StatsForecast(models=[AutoARIMA(season_length=4),
HoltWinters(season_length=4, error_type='A')],
freq='W', n_jobs=1)
StatsForecast
will apply an internal optimization process to find the best parameters for each model and product.
These models are local, not global, meaning that each product will have its own separate model, which is different than the global models we will see later.
Nonglobal models offer flexibility and easier interpretation at the cost of scalability and leveraging shared information across products.
There is no way to know if a global or local model will perform better before testing, so it’s best to try both and see which one works best for your data.
That said, these simpler, local models, tend to work better when we are doing longterm forecasts, as we usually have a smaller number of data points to train the model.
Now that we have our models set up, we need to fit them to our training data and get the forecasts.
As we are using the same library as for the naive baselines, the process is very similar:
model.fit(train)
p = model.forecast(h=h, level=[90])
cols = p.columns[1:]
p.loc[:, cols] = p.loc[:, cols].clip(0)
p = p.reset_index().merge(valid, on=['ds', 'unique_id'], how='left')
To evaluate these models, we’ll use the same metrics as before: SMAPE, Pinball Loss, and Coverage Probability.
print(smape_loss(p['y'], p['AutoARIMA']))
print(mean_pinball_loss(p['y'], p['AutoARIMAlo90'], alpha=0.1), mean_pinball_loss(p['y'], p['AutoARIMAhi90'], alpha=0.9))
print(coverage_prob(p['y'], p['AutoARIMAlo90'], p['AutoARIMAhi90']))
Now, how do you compare these metrics to the ones from the naive baselines?
In this dataset, the Naive model has about 63% SMAPE, while the ARIMA model has about 61% SMAPE.
Point for ARIMA.
But when it comes to Coverage Probability, the Naive model has 86%, while the ARIMA model has only 77%.
Point for Naive.
So if you had to pick between only these two, you would need to decide which metric is more important for your business or keep looking for a better model.
How To Forecast Sales With DeepAR (LSTM) And External Features
Now, let’s explore a more advanced technique for sales forecasting: DeepAR.
This is a global model that uses Long ShortTerm Memory (LSTM), a type of recurrent neural network, to predict future values based on the past values of multiple time series.
DeepAR can capture complex patterns and dependencies among different products, making it suitable for our ecommerce sales prediction problem.
Originally it was developed for univariate time series, but GluonTS, the library we’ll use to implement DeepAR, has extended it to multivariate time series.
First I will add a static feature to the dataset.
products = pd.read_csv(os.path.join(path, 'olist_products_dataset.csv')).rename(columns={'product_id': 'unique_id'})
data_long_w_external = data_long.merge(products[['unique_id', 'product_category_name']], on='unique_id', how='left')
data_long_w_external['product_category_name'] = data_long_w_external['product_category_name'].astype('category')
cat_cardinality = data_long_w_external['product_category_name'].nunique()
This is what 'olist_products_dataset.csv'
looks like:
unique_id  product_category_name  product_name_lenght  product_description_lenght  product_photos_qty  product_weight_g  product_length_cm  product_height_cm  product_width_cm 

1e9e8ef04dbcff4541ed26657ea517e5  perfumaria  40  287  1  225  16  10  14 
3aa071139cb16b67ca9e5dea641aaa2f  artes  44  276  1  1000  30  18  20 
96bd76ec8810374ed1b65e291975717f  esporte_lazer  46  250  1  154  18  9  15 
cef67bcfe19066a932b7673e239eb23d  bebes  27  261  1  371  26  4  26 
9dc1a7de274444849c219cff195d0b71  utilidades_domesticas  37  402  4  625  20  17  13 
In this code snippet, we’re preparing the dataset to include an external feature, product_category_name
.
We could use all others, but for simplicity, we’ll use only this one.
Adding external features to the dataset is optional, but it can improve the model’s performance by providing additional information about the time series.
By including the category of each product, we hope that the model will be able to learn that some categories have different sales patterns than others, like seasonality.
The products dataset is read from a CSV file, and the column containing product IDs is renamed to unique_id
for consistency.
The original dataset (data_long
) is merged with the products dataset, essentially adding a new column  product_category_name
 to the existing dataset.
This operation is performed using the unique_id
as the common key between both datasets.
The product_category_name
column is converted to a categorical data type.
This is an important step because categorical features are handled differently from numerical features in machine learning models.
The cardinality of the product_category_name
column is calculated, which is the number of unique product categories present in the dataset.
This information will be used later when setting up the DeepAR model.
As we are passing it as a static categorical feature, DeepAR will automatically add this to the series when we make forecasts.
It’s possible to add dynamic features as well, like the periods in the historical data and forecast horizon when the product was on sale, or the price.
The downside is that you must have this information for the entire forecast horizon at the time of making the forecast, which is not always possible.
If you don’t have GluonTS installed, you can install it by following the instructions here.
We will use the PyTorch backend.
Let’s import the necessary libraries and prepare our dataset.
from gluonts.dataset.pandas import PandasDataset
train_ds = PandasDataset.from_long_dataframe(train, target='y', item_id='unique_id',
timestamp='ds', freq='W', static_feature_columns=['product_category_name'])
Here, we’re creating a PandasDataset
object from our training data using the from_long_dataframe
method.
This method transforms the long format train
DataFrame into a format suitable for use with the gluonts
library.
The method takes the following attributes:
train
: The input training DataFrame.target
: The name of the target variable in the DataFrame (‘y’ in this case, which represents sales).item_id
: The column that identifies different time series within the DataFrame (‘unique_id’ in this case, which represents individual products).timestamp
: The column containing the time information (‘ds’ in this case, which represents the date).freq
: The frequency of the time series data (‘W’ in this case, denoting weekly data).static_feature_columns
: A list of column names that contain static (nontimevarying) features. In this case, we’re including the ‘product_category_name’ column as a static feature to improve our model’s performance.
Now, let’s set up the DeepAR model using the gluonts.torch.model.deepar
module.
There are many parameters we can tune, but for now we’ll specify the number of layers, the static features, and the training parameters for the optimizer.
from gluonts.torch.model.deepar import DeepAREstimator
estimator = DeepAREstimator(freq='W', prediction_length=h, num_layers=1, num_feat_static_cat=1,
cardinality=[cat_cardinality],
trainer_kwargs={'accelerator': 'gpu','max_epochs':1})
In this code snippet, we’re setting up the DeepAR model using the DeepAREstimator
class.
The model’s arguments are defined as follows:
freq
: The frequency of the time series data (‘W’ for weekly data).prediction_length
: The number of time steps to predict into the future (denoted ash
here).num_layers
: The number of layers in the LSTM neural network (set to 1 in this case).num_feat_static_cat
: The number of static categorical features we’re including in the model (1 in this case, representing the ‘product_category_name’ feature).cardinality
: A list containing the cardinality (number of distinct categories) of each static categorical feature. In our case, we provide[cat_cardinality]
, a list with a single element, which is the number of unique product categories in our dataset.trainer_kwargs
: A dictionary containing additional arguments for the model’s optimizer. In this case, we specify to use a GPU accelerator for faster training and set the maximum number of training epochs to 1. Training epochs refer to the number of times the model will see the entire training dataset during training.
Next, we’ll train the model using our prepared dataset with the train
method.
predictor = estimator.train(train_ds)
Once the model is trained, we can use the predict
method to generate forecasts for each product in our dataset.
The predict
method returns a generator object, which we can convert to a list and then to a DataFrame.
DeepAR estimates the parameters of a distribution for each time step in the forecast horizon, so each element in the list contains samples from the fitted distribution as forecasts.
def make_df_preds(predictor, dataset):
pred = list(predictor.predict(dataset))
# gluonts.model.forecast.SampleForecast(info=None, item_id='0152f69b6cf919bcdaf117aa8c43e5a2', samples=array([[ 1.53543979e01, 4.16583046e02, 2.44526476e01...]], dtype=float32), start_date=Period('20180101/20180107', 'WSUN'))
all_preds = list()
for item in pred:
unique_id = item.item_id
p = item.samples.mean(axis=0)
plo = np.percentile(item.samples, 5, axis=0)
phi = np.percentile(item.samples, 95, axis=0)
dates = pd.date_range(start=item.start_date.to_timestamp(), periods=len(p), freq='W')
category_pred = pd.DataFrame({'ds': dates, 'unique_id': unique_id, 'pred': p, 'plo': plo, 'phi': phi})
all_preds += [category_pred]
all_preds = pd.concat(all_preds, ignore_index=True)
return all_preds
all_preds = make_df_preds(predictor, train_ds)
all_preds = all_preds.merge(valid[['unique_id', 'ds', 'y']], on=['unique_id', 'ds'], how='left')
In this code snippet, we define a function called make_df_preds
that generates forecasts using the trained DeepAR model and formats the results into a DataFrame.
The function takes two arguments:
predictor
: The trained DeepAR model.dataset
: The dataset for which we want to generate forecasts.
Inside the function, we perform the following steps:

Call the
predict
method of thepredictor
to generate forecasts for each time series in the dataset. The results are stored in a list calledpred
. 
Initialize an empty list named
all_preds
to store the formatted forecasts for each time series. 
Iterate through the items in
pred
. For each item, we: Extract the unique ID (product ID) and store it in
unique_id
.  Compute the mean forecast across all generated samples and store it in
p
.  Calculate the 5th and 95th percentiles of the forecast samples, and store them in
plo
andphi
, respectively. These values represent the lower and upper bounds of the 90% prediction interval.  Create a date range starting from the first date in the forecast and continuing for the desired prediction length (
len(p)
), using a weekly frequency (‘W’).  Create a DataFrame called
category_pred
that contains the date, unique ID, mean forecast, and the 10th and 90th percentiles for each time step.  Append
category_pred
to theall_preds
list.
 Extract the unique ID (product ID) and store it in

Concatenate all the DataFrames stored in
all_preds
, and reset the index to create a unified DataFrame containing all the forecasts.
After defining the make_df_preds
function, we use it to generate forecasts for our training dataset (train_ds
) and store the results in the all_preds
DataFrame.
Why train_ds
instead of valid_ds
?
DeepAR will use this data as context and generate forecasts for the immediate future (the next h
weeks) from the last date in the training data.
Finally, we merge the all_preds
DataFrame with the validation data using common columns (unique_id
and ds
) to facilitate the comparison and evaluation of the forecasts.
Now, we can evaluate the DeepAR model using the same metrics as before: SMAPE, Pinball Loss, and Coverage Probability.
How To Tune DeepAR Hyperparameters
It’s unlikely that we’ll get the best possible results with the default hyperparameters, so let’s try to improve our model’s performance by tuning the hyperparameters with Optuna.
Optuna is a powerful hyperparameter optimization library that helps us find the best combination of hyperparameters for our model.
Installing it is very simple with pip:
pip install optuna
First, we need to define the objective function that Optuna will use to evaluate different sets of hyperparameters.
import optuna
def objective(trial):
num_layers = trial.suggest_int('num_layers', 1, 4)
context_length = trial.suggest_int('context_length', 4, 52)
hidden_size = trial.suggest_categorical('hidden_size', [8, 16, 32, 64, 128, 256])
dropout_rate = trial.suggest_float('dropout_rate', 0.0, 0.95)
lr = trial.suggest_float('lr', 1e5, 1e1, log=True)
batch_size = 256
estimator = DeepAREstimator(freq='W', prediction_length=h, num_feat_static_cat=1,
cardinality=[cat_cardinality], num_layers=num_layers,
context_length=context_length, hidden_size=hidden_size,
dropout_rate=dropout_rate, batch_size=batch_size, lr=lr,
trainer_kwargs={'accelerator': 'gpu','max_epochs':10})
predictor = estimator.train(train_ds)
df_preds = make_df_preds(predictor, train_ds)
df_preds = df_preds.merge(valid, on=['ds', 'unique_id'], how='left')
return coverage_prob(df_preds['y'], df_preds['plo'], df_preds['phi'])
Inside the objective function, we create a DeepAR model with the given hyperparameters, train it using our training dataset, and generate forecasts like before.
Then, the forecasts are evaluated using the Coverage Probability metric, which is returned as the optimization target.
Here is a brief explanation of each hyperparameter and the ranges we’re exploring with Optuna:

num_layers
: The number of layers in the LSTM neural network. The range we’re exploring is between 1 and 4 layers. More layers can capture complex patterns but may also add complexity and risk overfitting. 
context_length
: The number of past time steps the model should consider when making a prediction. We’re trying values between 4 and 52, which represent 1 month to 1 year of historical data. A larger context length enables the model to learn from longterm trends but may increase computational complexity. 
hidden_size
: The number of hidden units (neurons) in each LSTM layer. We’re testing a range of values: [8, 16, 32, 64, 128, 256]. Larger hidden sizes can make the model more expressive but may also increase computational cost and risk overfitting. GPUs tend to be faster with multiples of 2. 
dropout_rate
: The dropout rate applied to the LSTM layers to prevent overfitting. We’re looking for values between 0.0 and 0.95. Higher dropout rates increase the regularization strength, but if it’s too high, the model may underfit the data. 
lr
: The learning rate for the model’s optimizer. We’re trying values between 1e5 and 1e1 on a logarithmic scale. The learning rate controls how quickly the model updates its weights during training, and choosing an appropriate value is crucial for convergence and model performance. 
batch_size
: The number of samples used for each update of the model’s weights during training. In this case, we set it to 256. Larger batch sizes can lead to faster training but may also require more memory and may result in less accurate estimation of the gradients.
After defining the objective function, we create an Optuna study with the direction set to 'maximize'
since we want to maximize the Coverage Probability.
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)
We then call the optimize
method and pass the objective function and the number of trials (20 in this case) as arguments.
Optuna will perform 20 optimization trials, searching for the best hyperparameter values to maximize our selected metric.
In practice, 20 to 30 trials are usually enough to find a good set of hyperparameters.
We can save all models during the optimization process or get the best set of hyperparameters at the end of the trials.
Then we retrain it and make the final forecasts or save the model for later use.