In this article you will learn an easy, fast, step-by-step way to use Convolutional Neural Networks for multiple time series forecasting in Python.

We will use the NeuralForecast library, which implements the Temporal Convolutional Network (TCN) architecture.

Temporal Convolutional Network (TCN)

This architecture is a variant of the Convolutional Neural Network (CNN) architecture that is specially designed for time series forecasting.

It was first presented as WaveNet.

Figure: Temporal convolutional network architecture diagram. Source: WaveNet: A Generative Model for Raw Audio.

The main ingredient that makes it different from other convolutional networks is that it uses “causal and dilated convolutions.”

Causal convolutions force the model to learn the dependence between the steps without violating the natural order of time.

This is different from other convolutional networks that consider all available data in a sequence for modeling (before and after the current step).

The dilation technique helps it process an increasingly larger portion of the time series steps as it advances to the deeper layers.

In general, as you can see in the image, the dilation factor is doubled at each layer.

Unit 2 in the first hidden layer processes the information from steps 1 and 2.

Unit 4 in the second hidden layer processes steps 1, 2, 3, and 4 through the processing of hidden units 2 and 4 from the first hidden layer.

And so on.
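To make the diagram concrete, here is a minimal sketch in plain Python that traces which input steps each unit can see. It assumes a filter of size 2 and dilations of 1, 2, and 4; the function name `receptive_steps` is my own and not part of any library.

```python
def receptive_steps(num_steps, dilations, kernel_size=2):
    """Return, per layer, the set of input steps (1-indexed) each unit can see."""
    # Layer 0 is the input itself: unit t sees only step t.
    layers = [[{t} for t in range(1, num_steps + 1)]]
    for d in dilations:
        prev = layers[-1]
        current = []
        for t in range(num_steps):  # 0-indexed position inside the layer
            seen = set()
            # A causal filter reads position t and earlier positions spaced
            # d apart; positions before the start of the series are skipped.
            for k in range(kernel_size):
                src = t - k * d
                if src >= 0:
                    seen |= prev[src]
            current.append(seen)
        layers.append(current)
    return layers

layers = receptive_steps(num_steps=8, dilations=[1, 2, 4])
print(sorted(layers[1][1]))  # unit 2, first hidden layer  -> [1, 2]
print(sorted(layers[2][3]))  # unit 4, second hidden layer -> [1, 2, 3, 4]
print(sorted(layers[3][7]))  # unit 8, third hidden layer  -> [1, 2, 3, 4, 5, 6, 7, 8]
```

No unit ever sees a step that comes after its own position, which is the causal part. The window doubling from layer to layer is the dilated part.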

These networks are generally faster to train than recurrent networks.

Let’s learn how you can use it in your time series forecasting projects.

How to Install NeuralForecast With and Without GPU Support

As NeuralForecast uses deep learning methods, if you have a GPU, it is important to have CUDA installed so that the models run faster.

To check if you have a GPU installed and correctly configured with PyTorch (backend library), run the code below:

import torch

print(torch.cuda.is_available())

The torch.cuda.is_available() function returns True if you have a GPU installed and correctly configured, and False otherwise.

If you have a GPU but do not have PyTorch installed with it enabled, check the PyTorch official website for instructions on how to install the correct version.

I recommend that you install PyTorch first!!

The command I used to install PyTorch with GPU enabled was:

conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia

If you don’t have a GPU, don’t worry: the library still works fine, it just won’t be as fast.

Installing NeuralForecast is very simple. Just run the command below:

pip install neuralforecast

or if you use Anaconda:

conda install -c conda-forge neuralforecast

How To Prepare Time Series Data For The Temporal Convolutional Network

We will use real sales data from the Favorita store chain, from Ecuador.

We have sales data from 2013 to 2017 for multiple stores and product categories.

To measure the model’s performance, we will use WMAPE (Weighted Mean Absolute Percentage Error) with the absolute values of the actuals as weights.

import pandas as pd
import numpy as np

def wmape(y_true, y_pred):
    return np.abs(y_true - y_pred).sum() / np.abs(y_true).sum()

This is an adapted version of MAPE (Mean Absolute Percentage Error) that solves the problem of division by zero when there are no sales for a specific day.
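A small sketch of the difference, using made-up numbers where the first day has no sales. The two functions are plain-Python restatements written only for this illustration; `wmape_plain` follows the same formula as `wmape` above.

```python
def mape(y_true, y_pred):
    # Plain MAPE averages the per-day percentage errors, so a day with
    # zero sales puts a zero in the denominator.
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def wmape_plain(y_true, y_pred):
    # Sum of absolute errors divided by the sum of absolute actual values.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / sum(abs(t) for t in y_true)

actual = [0.0, 100.0, 200.0]     # made-up sales, the first day has no sales
forecast = [10.0, 90.0, 220.0]   # made-up forecasts

try:
    mape(actual, forecast)
    mape_failed = False
except ZeroDivisionError:
    mape_failed = True

print(mape_failed)                    # True: MAPE breaks on the zero-sales day
print(wmape_plain(actual, forecast))  # 40 / 300, about 0.133
```

With NumPy arrays the same MAPE computation would return inf instead of raising an error, which is just as useless for comparing models.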

For this tutorial I will use only the data from one store and two product categories.

You can use as many categories, SKUs, stores, etc. as you want.

path = 'train.csv'
data = pd.read_csv(path, index_col='id', parse_dates=['date'])

data2 = data.loc[(data['store_nbr'] == 1) & (data['family'].isin(['MEATS', 'PERSONAL CARE'])), ['date', 'family', 'sales', 'onpromotion']]

This data doesn’t contain a record for December 25, so I just copied the sales from December 18 to December 25 to keep the weekly pattern.

dec25 = list()
for year in range(2013,2017):
    for family in ['MEATS', 'PERSONAL CARE']:
        dec18 = data2.loc[(data2['date'] == f'{year}-12-18') & (data2['family'] == family)]
        dec25 += [{'date': pd.Timestamp(f'{year}-12-25'), 'family': family, 'sales': dec18['sales'].values[0], 'onpromotion': dec18['onpromotion'].values[0]}]
data2 = pd.concat([data2, pd.DataFrame(dec25)], ignore_index=True).sort_values('date')

The columns are:

  • date: date of the record
  • family: product category
  • sales: sales amount
  • onpromotion: how many products of that category were on promotion on that day

weekday = pd.get_dummies(data2['date'].dt.weekday, dtype=int)  # dtype=int keeps the dummies as 0/1 instead of True/False
weekday.columns = ['weekday_' + str(i) for i in range(weekday.shape[1])]

data2 = pd.concat([data2, weekday], axis=1)

Let’s use the weekday as an additional feature.

It can be encoded as an ordinal or a categorical variable, but here I will use the categorical approach (one dummy column per weekday), which is more common.

In general, using additional information that is relevant to the problem can improve the model’s performance.

Date components like weekday, month, day of the month are important to capture seasonal patterns.

There is a lot of additional information that we could add, like temperature, rain, holidays, etc.

data2 = data2.rename(columns={'date': 'ds', 'sales': 'y', 'family': 'unique_id'})

This library expects the columns to be named in the following format:

  • ds: date of the record
  • y: target variable (sales amount)
  • unique_id: unique identifier of the time series (product category)

unique_id should identify each time series you have.

If we had more than one store, we would have to add the store number along with the categories to unique_id.

An example would be unique_id = store_nbr + '_' + family.
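A minimal sketch of how that could look in pandas, using a few made-up rows. Since store_nbr is stored as a number, it has to be converted to a string before it can be joined with the category name.

```python
import pandas as pd

# Made-up rows, only to show the string handling.
df = pd.DataFrame({
    'store_nbr': [1, 1, 2],
    'family': ['MEATS', 'PERSONAL CARE', 'MEATS'],
})

# store_nbr is an integer column, so cast it to string before concatenating.
df['unique_id'] = df['store_nbr'].astype(str) + '_' + df['family']
print(df['unique_id'].tolist())  # ['1_MEATS', '1_PERSONAL CARE', '2_MEATS']
```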

This is the final version of our dataframe data2:

ds unique_id y onpromotion weekday_0 weekday_1 weekday_2 weekday_3 weekday_4 weekday_5 weekday_6
2013-01-01 00:00:00 MEATS 0 0 0 1 0 0 0 0 0
2013-01-01 00:00:00 PERSONAL CARE 0 0 0 1 0 0 0 0 0
2013-01-02 00:00:00 MEATS 369.101 0 0 0 1 0 0 0 0
2013-01-02 00:00:00 PERSONAL CARE 194 0 0 0 1 0 0 0 0
2013-01-03 00:00:00 MEATS 272.319 0 0 0 0 1 0 0 0

There is one row for each record, containing the date, the time series ID (family in our example), the target value, and columns for the external variables (onpromotion and the weekday dummies).

Notice the time series records are stacked on top of each other.

Let’s split the data into train and validation sets.

Time Series Validation Split

You should never use random or k-fold validation for time series.

That would cause data leakage, as you would be using future data to train your model.

In practice, you can’t take random samples from the future to train your model, so you can’t use them here.

To avoid this issue, we will use a simple time series split between past and future.

A career tip: knowing how to do time series validation correctly is a skill that will set you apart from many data scientists (even experienced ones!).

Our training set will be all the data between 2013 and 2016 and our validation set will be the first 3 months of 2017.

train = data2.loc[data2['ds'] < '2017-01-01']
valid = data2.loc[(data2['ds'] >= '2017-01-01') & (data2['ds'] < '2017-04-01')]
h = valid['ds'].nunique()
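As a quick sanity check, the horizon should match the number of calendar days in the validation period. This sketch assumes there is one record per day and uses only the standard library; the variable names are my own.

```python
from datetime import date

valid_start = date(2017, 1, 1)  # first day of the validation set
valid_end = date(2017, 4, 1)    # exclusive upper bound, as in the filter above

expected_h = (valid_end - valid_start).days
print(expected_h)  # 90
```

If valid['ds'].nunique() returns a different number, some dates are missing from the validation data.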

Temporal Convolutional Network Hyperparameters

The implementation of the TCN in this library uses an encoder-decoder architecture.

The goal is to use a TCN to learn an optimized numerical representation of past observations with the encoder and then send this representation to a simple feedforward neural network (decoder) to generate predictions.

There are several hyperparameters to tune and NeuralForecast gives us an object that will automatically search for the best combination based on an internal validation error.

Still, it is important to understand what they are and the default value ranges that the library uses to optimize them.

I recommend that you run the search using the default ranges, especially if you don’t have much experience with neural networks.

The numbers shown here as default options are not necessarily the only possible ones, but just intervals chosen by the library’s creator as sensible to optimize.


kernel_size

This is the size of each filter used in the TCN convolution layers.

A filter is simply a weight vector that slides over the time series to generate a new sequence of values.

We can think of this new sequence as a simple transformation of the time series.

If we have a time series with 10 observations and a filter with a size of 2, first we multiply its two weights by the first two values of the time series and add the result.

This gives us the value of the first element of the transformed sequence.

In the second step, we multiply the same two weights by values 2 and 3 of the time series and add the result to get the second element of the transformed sequence.

And so it goes until the last.
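The sliding steps above can be sketched in a few lines of plain Python. The series and the weights are made up, and `apply_filter` is a name I chose for this illustration. In a real network the weights are learned during training.

```python
def apply_filter(series, weights):
    """Slide a weight vector over the series and return the transformed sequence."""
    size = len(weights)
    return [
        sum(w * x for w, x in zip(weights, series[i:i + size]))
        for i in range(len(series) - size + 1)
    ]

series = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up series with 10 observations
weights = [0.5, 0.5]                       # made-up filter of size 2

transformed = apply_filter(series, weights)
print(transformed[:2])   # [1.5, 2.5]
print(len(transformed))  # 9
```

Without padding, 10 observations and a filter of size 2 give 9 transformed values.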

We apply several different filters to generate several transformed sequences.

The idea is that these transformations better represent the patterns that we need to predict the next observations.

It differs from a regular feedforward neural network, as the latter uses only one set of weights to transform the inputs in each layer.

This hyperparameter is not optimized during the automatic search and receives the default value of 2, like in the original TCN diagram.


dilations

This is the value of the dilation interval of the filters: how many time units they should skip when applying the transformation.

They are also fixed values, powers of 2, as per the diagram above.
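To get a feeling for how far back the network can look, here is a rough calculation of the number of past steps covered by the last layer. It assumes one filter of size 2 per layer and dilations of 1, 2, 4, 8, and 16, which I picked for illustration; `receptive_field` is a hypothetical helper.

```python
def receptive_field(kernel_size, dilations):
    # Each layer extends the window by (kernel_size - 1) * dilation steps.
    return 1 + (kernel_size - 1) * sum(dilations)

dilations = [1, 2, 4, 8, 16]  # assumed for illustration: doubling at each layer
print(receptive_field(kernel_size=2, dilations=dilations))  # 32
```

With only five layers, one unit in the last layer already depends on 32 consecutive steps of the series.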


input_size_multiplier

This is the first value optimized during the automatic search.

It defines the number of steps of the time series that will be used as input for the TCN (input features).

The possible values are -1, 4, 16, and 64.

But be careful, this value is multiplied by the horizon!

That is, a value equal to 4 means the input will be 4 times the horizon (4 * 90 = 360 days in our example).

If the value is -1, the network will use all past steps as input.
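The arithmetic for each candidate value, as a small sketch. `resolve_input_size` is a hypothetical helper written for this article, not a function from the library.

```python
def resolve_input_size(multiplier, horizon):
    """Hypothetical helper: turn a multiplier into a number of input steps."""
    if multiplier == -1:
        return -1  # -1 is kept as the marker for "use all past steps"
    return multiplier * horizon

h = 90  # forecast horizon used in this tutorial
for m in [-1, 4, 16, 64]:
    print(m, resolve_input_size(m, h))
```

For our 90-day horizon, the candidates are 360, 1440, and 5760 days of history, plus the option of using everything.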


encoder_hidden_size

This is the size of the encoded representation output by the TCN, which is also the number of filters applied.

The number of units in this representation directly affects the network’s ability to learn complex patterns.

The larger this value, the more complex the patterns the network can learn, but the higher the risk of overfitting.

This hyperparameter is optimized during the automatic search. Its possible values are 50, 100, 200 and 300.


context_size

After the TCN emits its outputs, they are transformed again to represent the overall context of the time series information.

We can think of this as a summary of the most important information that we need to make predictions for the next steps.

It is a vector of size defined by context_size.

Possible values for this hyperparameter are 5, 10, and 50.


decoder_hidden_size

This is the number of units in the hidden layers of the feedforward neural network that acts as the decoder.

It has two hidden layers by default, and this is the number of units in each of them, not the total between them.

Two values are tested during the search: 64 and 128.


learning_rate

One of the most impactful hyperparameters, it defines how much each optimization step will modify the network weights.

The lower the learning_rate, the slower the optimization, but also the more stable it is.

During tuning, a value is sampled from a log uniform distribution and can range from 0.0001 to 0.1.
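A log-uniform distribution gives every order of magnitude the same chance of being picked. Here is a sketch of the idea using only the standard library; it is not the sampler the library uses internally, and `sample_log_uniform` is my own name.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    # Sample uniformly in log space, then map back with exp.
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-4, 1e-1, rng) for _ in range(10_000)]

# Share of samples that land in each decade of the range.
for lo, hi in [(1e-4, 1e-3), (1e-3, 1e-2), (1e-2, 1e-1)]:
    share = sum(lo <= s < hi for s in samples) / len(samples)
    print(f'{lo:g} to {hi:g}: {share:.2f}')
```

Each decade receives roughly a third of the samples, so small learning rates are explored as often as large ones.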


max_steps

The maximum number of times the neural network will update its weights during training.

It is closely tied to the learning_rate: a lower learning rate usually needs more steps to converge.

Two values are tested: 500 and 1000.

Training a Temporal Convolutional Network In Python

It’s time to start the hyperparameter tuning search.

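from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoTCN

# WMAPE is the custom PyTorch loss described below; its definition is not shown in this article.
models = [AutoTCN(h=h,
                  num_samples=30,
                  loss=WMAPE())]

model = NeuralForecast(models=models, freq='D')
model.fit(train)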

First we create a list with a single AutoTCN object and pass it to NeuralForecast.

The library allows us to train several models in the same object, but because this is a tutorial, we will only use TCN.

AutoTCN takes the following arguments:

  • h: the forecast horizon (how many steps into the future we want to predict)
  • num_samples: the number of hyperparameter combinations that will be tested during the hyperparameter tuning. By default, the search is random.
  • loss: the loss function to optimize during training. I am using a custom PyTorch WMAPE loss.

In practice, testing 30 combinations finds a good solution in a reasonable amount of time.

In the NeuralForecast object, we pass the argument freq that informs the frequency of the time series. In our case, it’s daily.

Then we just call the fit method and pass the training data to start training the model.

p = model.predict().reset_index()
p = p.merge(valid[['ds','unique_id', 'y']], on=['ds', 'unique_id'], how='left')

Now that we have a trained model, we can use it to make predictions by calling the predict method.

I merged the predictions with the validation data to make it easier to calculate the error metrics.

This is what the predictions dataframe looks like:

unique_id ds AutoTCN y
MEATS 2017-01-01 00:00:00 122.66 0
PERSONAL CARE 2017-01-01 00:00:00 101.177 0
PERSONAL CARE 2017-01-02 00:00:00 150.795 81
MEATS 2017-01-02 00:00:00 242.997 116.724
MEATS 2017-01-03 00:00:00 247.094 344.583

Now let’s plot the predictions and the actual values to do a visual inspection.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 1, figsize = (1280/96, 720/96))
for ax_i, unique_id in enumerate(['MEATS', 'PERSONAL CARE']):
    plot_df = pd.concat([train.loc[train['unique_id'] == unique_id].tail(30), 
                         p.loc[p['unique_id'] == unique_id]]).set_index('ds') # Concatenate the train and forecast dataframes
    plot_df[['y', 'AutoTCN']].plot(ax=ax[ax_i], linewidth=2, title=unique_id)


print(wmape(p['y'], p['AutoTCN']))

Figure: TCN forecast without external variables

This model has a WMAPE of 20.49%.

We can see the combinations of hyperparameters that were tested during the search by calling the get_dataframe method.

results_df = models[0].results.get_dataframe().sort_values('loss')

loss config/encoder_hidden_size config/decoder_hidden_size config/max_steps
0.598952 50 128 500
0.600814 50 64 1000
0.600958 100 128 1000
0.601555 100 128 500
0.602398 100 128 1000

It returns a DataFrame with the loss value and the hyperparameters tested for each combination.

This is very useful to understand which hyperparameters are more important for the model and guide your next steps.

To get the best hyperparameters, we call the get_best_result method.

best_config = models[0].results.get_best_result().metrics['config']

{'h': 90,
 'encoder_hidden_size': 50,
 'context_size': 50,
 'decoder_hidden_size': 128,
 'learning_rate': 0.00026729929225440886,
 'max_steps': 500,
 'batch_size': 16,
 'loss': WMAPE(),
 'check_val_every_n_epoch': 100,
 'random_seed': 18,
 'input_size': 5760}

Let’s use these hyperparameters to train a new model using external variables and see if we can improve the results.

Training a Temporal Convolutional Network with External Variables in Python

Here is the full code to train a TCN model with external variables.

import matplotlib.pyplot as plt
from neuralforecast import NeuralForecast
from neuralforecast.models import TCN

# best_config comes from the AutoTCN search above
models = [TCN(scaler_type='standard',
              futr_exog_list=['onpromotion', 'weekday_0', 'weekday_1', 'weekday_2',
                              'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6'],
              **best_config)]

model = NeuralForecast(models=models, freq='D')
model.fit(train)

p = model.predict(futr_df=valid).reset_index()
p = p.merge(valid[['ds','unique_id', 'y']], on=['ds', 'unique_id'], how='left')

fig, ax = plt.subplots(2, 1, figsize = (1280/96, 720/96))
for ax_i, unique_id in enumerate(['MEATS', 'PERSONAL CARE']):
    plot_df = pd.concat([train.loc[train['unique_id'] == unique_id].tail(30), 
                         p.loc[p['unique_id'] == unique_id]]).set_index('ds')
    plot_df[['y', 'TCN']].plot(ax=ax[ax_i], linewidth=2, title=unique_id)

print(wmape(p['y'], p['TCN']))

We need to make a few changes:

  • Instead of AutoTCN, we use TCN to instantiate the model.
  • We pass the futr_exog_list argument to the TCN object. This argument is a list of the names of the columns with external variables that we want to use in the model.
  • We pass a scaler_type. Scaling the data usually improves model convergence, but this is not optimized by AutoTCN, so I tried it manually here.
  • We pass the best_config dictionary as named arguments to the TCN object.
  • In the predict method, we pass the futr_df argument with the values for the external variables in the future time steps.
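If passing a dictionary as named arguments is new to you, here is a small sketch of how the ** syntax works. `build_model` is a hypothetical stand-in that only echoes its arguments, and `config` is a shortened, made-up dictionary in the same shape as best_config.

```python
def build_model(h, input_size, learning_rate, max_steps,
                scaler_type=None, futr_exog_list=None):
    """Hypothetical stand-in for a model constructor; it only echoes its arguments."""
    return {
        'h': h, 'input_size': input_size, 'learning_rate': learning_rate,
        'max_steps': max_steps, 'scaler_type': scaler_type,
        'futr_exog_list': futr_exog_list,
    }

# A shortened, made-up config in the same shape as best_config.
config = {'h': 90, 'input_size': 5760, 'learning_rate': 0.001, 'max_steps': 500}

# ** expands the dictionary into keyword arguments, so this call is the same as
# build_model(scaler_type='standard', futr_exog_list=[...], h=90, input_size=5760, ...)
model_args = build_model(scaler_type='standard',
                         futr_exog_list=['onpromotion'],
                         **config)
print(model_args['input_size'], model_args['scaler_type'])  # 5760 standard
```

Keep in mind that Python raises a TypeError if the same argument is passed both explicitly and inside the dictionary.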

You have to consider whether the external variables will be available at the time of the forecast when you use this model in production.

If it’s a variable like temperature, you need to replace the true historical values with an estimate in the same way you will do when deployed.

Using data for the external variables that is available only in the historical data but not in production is a subtle mistake that will lead to overoptimistic results.

Figure: TCN forecast with external variables

I tried the two available scalers, as well as no scaler at all, and these were the results:

  • No scaler: 19.72%
  • Standard scaler: 19.46%
  • Robust scaler: 19.63%

The results were very similar: most of the improvement comes from using the external variables, not from the scaler.

As it costs us almost nothing to use the scaler, I would recommend using it.

Now that you have a CNN, try to improve the solution by building an ensemble of models.