What Is Naive Forecasting?

Whenever you start a time series forecasting project, you should start with a naive model.

A naive model is a very simple rule that you use to generate predictions for the future.

It’s easy to implement and it gives you a baseline to compare your more complex models against.

Here you will learn how to use the StatsForecast library, which provides the most popular naive models for time series forecasting in Python.

How To Install StatsForecast

StatsForecast is available on PyPI, so you can install it with pip:

pip install statsforecast

Or with conda:

conda install -c conda-forge statsforecast

What Are The Types Of Naive Forecasting Models?

We will try the following naive models:

Simple Naive Forecast

The simple naive model predicts every future value as the last observed value.

It’s that simple: just take the last value in your data and use it as the prediction for every future time step.

This works because many time series are dominated by recent behavior: the most recent values tend to be more predictive than older ones.
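
A minimal sketch of the rule itself, on a toy series with made-up numbers (we will switch to StatsForecast later):

import numpy as np

y = np.array([10, 12, 9, 14, 13])     # toy sales history
h = 3                                  # forecast horizon
naive_forecast = np.repeat(y[-1], h)   # array([13, 13, 13])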

Seasonal Naive Forecast

The seasonal naive model takes the last observed value from a similar period in the past.

For example, if we want to know the sales for next Friday, we can use the sales from the Friday of the previous week.

This way we still use a recent value, but one that also respects the seasonal pattern, which makes it slightly more sophisticated than the simple naive model.
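
A minimal sketch of the rule on a toy daily series with a weekly pattern (made-up numbers):

import numpy as np

y = np.array([10, 12, 9, 14, 13, 8, 11,    # week 1
              11, 13, 10, 15, 12, 9, 10])  # week 2
season_length = 7
# each future day reuses the value from the same weekday of the last observed week
seasonal_naive_forecast = y[-season_length:]  # array([11, 13, 10, 15, 12, 9, 10])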

Window Average Forecast

The window average model takes the average of the last window_size values of the series.

Then it uses this average as the prediction for any future time steps.

It works by smoothing out the noise in the series.
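
A minimal sketch of the rule (again with made-up numbers):

import numpy as np

y = np.array([10, 12, 9, 14, 13, 8, 11, 15])
window_size = 7
h = 3
# the mean of the last 7 values is repeated for every future step
window_average_forecast = np.repeat(y[-window_size:].mean(), h)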

Seasonal Window Average Forecast

The seasonal window average model takes the average of the last window_size values from a similar period in the past.

In our Friday example, we would take the average of the sales from the previous window_size Fridays.

If we have a noisy but seasonal series, this model will be able to smooth out the noise and still capture the seasonality.
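
A minimal sketch of the rule with window_size=2 and a weekly season (made-up numbers):

import numpy as np

y = np.array([10, 12, 9, 14, 13, 8, 11,    # week 1
              11, 13, 10, 15, 12, 9, 10,   # week 2
              12, 14, 8, 16, 13, 7, 12])   # week 3
window_size, season_length = 2, 7
last_weeks = y[-window_size * season_length:].reshape(window_size, season_length)
# average each weekday over the last 2 weeks
seasonal_window_average_forecast = last_weeks.mean(axis=0)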

How To Prepare The Data For StatsForecast

We will use real sales data from Favorita, a store chain from Ecuador.

We have sales data from 2013 to 2017 for multiple stores and product categories.

To measure the models’ performance, we will use WMAPE (Weighted Mean Absolute Percentage Error) with the absolute values of the actuals as weights.

This is an adapted version of MAPE (Mean Absolute Percentage Error) that solves the problem of division by zero when there are no sales for a specific day.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

def wmape(y_true, y_pred):
    # weighted MAPE: sum of absolute errors divided by sum of absolute actuals
    return np.abs(y_true - y_pred).sum() / np.abs(y_true).sum()
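
A quick sanity check of the function (note that the zero in y_true doesn’t cause a division by zero, unlike plain MAPE):

y_true = np.array([10, 0, 20])
y_pred = np.array([12, 1, 18])
print(wmape(y_true, y_pred))  # (2 + 1 + 2) / 30 ≈ 0.167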

For this tutorial I will use only the data from one store and two product categories.

You can use as many categories, SKUs, stores, etc. as you want.

path = 'train.csv'
data = pd.read_csv(path, index_col='id', parse_dates=['date'])

data2 = data.loc[(data['store_nbr'] == 1) & (data['family'].isin(['MEATS', 'PERSONAL CARE'])), ['date', 'family', 'sales']]

The columns are:

  • date: date of the record
  • family: product category
  • sales: sales amount

StatsForecast expects the columns to be named in the following format:

  • ds: date of the record
  • y: target variable (sales amount)
  • unique_id: unique identifier of the time series (product category)

So let’s rename them:

data2 = data2.rename(columns={'date': 'ds', 'sales': 'y', 'family': 'unique_id'})

unique_id should identify each time series you have.

If we had more than one store, we would have to add the store number along with the categories to unique_id.

An example would be unique_id = store_nbr + '_' + family.
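
As a sketch, if we had kept all the stores instead of filtering for store 1, we could build unique_id directly from the original columns:

# hypothetical: build one series ID per store and category before renaming the columns
data['unique_id'] = data['store_nbr'].astype(str) + '_' + data['family']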

This is the final version of our dataframe data2:

ds                   unique_id      y
2013-01-01 00:00:00  MEATS          0
2013-01-01 00:00:00  PERSONAL CARE  0
2013-01-02 00:00:00  MEATS          369.101
2013-01-02 00:00:00  PERSONAL CARE  194
2013-01-03 00:00:00  MEATS          272.319

There is a row for each record, containing the date, the time series ID (family in our example) and the target value.

Notice the time series records are stacked on top of each other.

Let’s split the data into train and validation sets.

How To Split Time Series Data For Validation

You should never use random or k-fold validation for time series.

That would cause data leakage, as you would be using future data to train your model.

In practice, you won’t have data from the future available when training your model, so you shouldn’t use it during validation either.

To avoid this issue, we will use a simple time series split between past and future.

A career tip: knowing how to do time series validation correctly is a skill that will set you apart from many data scientists (even experienced ones!).

Our training set will be all the data between 2013 and 2016 and our validation set will be the first 3 months of 2017.

train = data2.loc[data2['ds'] < '2017-01-01']
valid = data2.loc[(data2['ds'] >= '2017-01-01') & (data2['ds'] < '2017-04-01')]
h = valid['ds'].nunique()

h is the horizon, the number of periods we want to forecast.

Note About This Data

This data doesn’t contain a record for December 25, so I just copied the sales from December 18 to December 25.

Without this step, the model would have a hard time capturing seasonality as it looks for a pattern that repeats every season_length records in the series.

# for each year, copy the December 18 sales of each family to December 25
dec25 = list()
for year in range(2013, 2017):
    for family in ['MEATS', 'PERSONAL CARE']:
        y_dec18 = train.loc[(train['ds'] == f'{year}-12-18') & (train['unique_id'] == family), 'y'].values[0]
        dec25.append({'ds': pd.Timestamp(f'{year}-12-25'), 'unique_id': family, 'y': y_dec18})
train = pd.concat([train, pd.DataFrame(dec25)], ignore_index=True).sort_values('ds')

How To Build Naive Forecasting Models In Python

It’s very easy to build naive forecasting models using StatsForecast.

from statsforecast import StatsForecast
from statsforecast.models import Naive, SeasonalNaive, WindowAverage, SeasonalWindowAverage

model = StatsForecast(models=[Naive(), 
                              SeasonalNaive(season_length=7), 
                              WindowAverage(window_size=7), 
                              SeasonalWindowAverage(window_size=2, season_length=7)],
                      freq='D', n_jobs=-1)
model.fit(train)

We pass a list of models to the models argument of the StatsForecast class.

Here we are using the models described above: Naive, SeasonalNaive, WindowAverage and SeasonalWindowAverage.

For the SeasonalNaive and SeasonalWindowAverage models, we need to specify the season length, which in our case is weekly, so 7 periods.

For the WindowAverage and SeasonalWindowAverage models, we need to specify the window size, which I arbitrarily chose to be 7 and 2 periods.

This means that the WindowAverage model will use the average of the last 7 days of the training data to make the prediction and the SeasonalWindowAverage model will use the average of specific days in the last 2 weeks.

In practice you must try different window sizes and season lengths to find the ones that minimize the evaluation metric in your validation set.
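
For example, a minimal sketch of such a search for the SeasonalWindowAverage window size (the candidate values are arbitrary; this reuses train, valid, h and the wmape function from above):

for window_size in [1, 2, 4]:
    sf = StatsForecast(models=[SeasonalWindowAverage(window_size=window_size, season_length=7)],
                       freq='D', n_jobs=-1)
    sf.fit(train)
    p_ = sf.predict(h=h).reset_index().merge(valid, on=['ds', 'unique_id'], how='left')
    print(f"window_size={window_size} WMAPE: {wmape(p_['y'].values, p_['SeasWA'].values):.2%}")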

The freq argument is the frequency of the data, in our case daily ('D').

The n_jobs argument is the number of cores to use for parallelization (-1 means all available cores).

Now we can make predictions for the steps after the last date in the training set.

p = model.predict(h=h, level=[90])
p = p.reset_index().merge(valid, on=['ds', 'unique_id'], how='left')

In our case, the last date in the training set is 2016-12-31, so the first prediction will be for 2017-01-01.

I merged the target values y with the predictions to make it easier to calculate the WMAPE and plot.

The predictions are stored in a dataframe with the following format:

unique_id      ds                   Naive    Naive-lo-90  Naive-hi-90  SeasonalNaive  SeasonalNaive-lo-90  SeasonalNaive-hi-90  WindowAverage  SeasWA   y
MEATS          2017-01-01 00:00:00  187.434  -201.434     576.302      176.26         -435.263             787.783              251.709        176.26   0
PERSONAL CARE  2017-01-01 00:00:00  150      57.1832      242.817      101            -175.316             377.316              150.143        101      0
PERSONAL CARE  2017-01-02 00:00:00  150      18.7373      281.263      74             -202.316             350.316              150.143        119.5    81
MEATS          2017-01-02 00:00:00  187.434  -362.508     737.376      80.884         -530.639             692.407              251.709        161.769  116.724
MEATS          2017-01-03 00:00:00  187.434  -486.104     860.972      229.281        -382.242             840.804              251.709        249.757  344.583

It contains the stacked predictions for all the time series with columns for all the models.

Naive and SeasonalNaive have confidence intervals; the other models don’t.

The WMAPE for each model is:

  • Naive WMAPE: 44.00%
  • SeasonalNaive WMAPE: 29.78%
  • WindowAverage WMAPE: 36.48%
  • SeasWA WMAPE: 24.22%

Let’s inspect the predictions visually to see if they make sense.

for model_ in ['Naive', 'SeasonalNaive', 'WindowAverage', 'SeasWA']:
    fig,ax = plt.subplots(2,1, figsize=(1280/96, 720/96))
    for ax_, family in enumerate(['MEATS', 'PERSONAL CARE']):
        p.loc[p['unique_id'] == family].plot(x='ds', y='y', ax=ax[ax_], label='y', title=family, linewidth=2)
        p.loc[p['unique_id'] == family].plot(x='ds', y=model_, ax=ax[ax_], label=model_)
        ax[ax_].set_xlabel('Date')
        ax[ax_].set_ylabel('Sales')
        if model_ in ['Naive', 'SeasonalNaive']:
            ax[ax_].fill_between(p.loc[p['unique_id'] == family, 'ds'].values,
                                p.loc[p['unique_id'] == family, f'{model_}-lo-90'], 
                                p.loc[p['unique_id'] == family, f'{model_}-hi-90'], 
                                alpha=0.2,
                                color='orange')
            ax[ax_].set_title(f'{family} - Orange band: 90% confidence interval')
        ax[ax_].legend()
    fig.tight_layout()
    wmape_ = wmape(p['y'].values, p[model_].values)
    print(f'{model_} WMAPE: {wmape_:.2%}')

[Plot: forecasts from the Naive model]

[Plot: forecasts from the SeasonalNaive model]

[Plot: forecasts from the WindowAverage model]

[Plot: forecasts from the SeasonalWindowAverage model]

Now you can try increasingly complex models like ARIMA, Kalman filters, neural networks and even ensembles, and see if they can beat these naive WMAPE baselines.
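
For example, here is a sketch of how you could plug AutoARIMA (another model shipped with StatsForecast) into the same pipeline; season_length=7 is just a starting point, not a tuned value:

from statsforecast.models import AutoARIMA

sf = StatsForecast(models=[AutoARIMA(season_length=7)], freq='D', n_jobs=-1)
sf.fit(train)
p_arima = sf.predict(h=h).reset_index().merge(valid, on=['ds', 'unique_id'], how='left')
print(f"AutoARIMA WMAPE: {wmape(p_arima['y'].values, p_arima['AutoARIMA'].values):.2%}")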