In this article, we will explore the use of Gaussian Processes for time series forecasting in Python, specifically using the GluonTS library.

GluonTS is an open-source toolkit for building and evaluating state-of-the-art time series models.

One of the key benefits of using Gaussian Processes for time series forecasting is that they can provide probabilistic predictions.

Instead of just predicting a point estimate for the next value in the time series, GPs can provide a distribution over possible values, allowing us to quantify our uncertainty.

This is not exclusive to GPs, so we have to take into account that it takes more time to train a GP model than a traditional model like ARIMA.

Let’s dive in!

## Installing GluonTS#

Before installing GluonTS, we need to install MXNet. Check out the official installation guide for more details.

Then you can install GluonTS using pip:

``````pip install gluonts
``````

## Preparing The Data For Gaussian Processes In GluonTS#

We will use data from the City of Los Angeles website traffic dataset on Kaggle.

This dataset contains the number of user sessions for the City of Los Angeles website for each day from 2014 to 2019.

Sessions are periods of time when a user is active in a website.

Usually they are measured by looking at user activity and have a timeout after 30 minutes since the last user action.

For example, right now you are in a session on my website. If you leave the page and come back after 2 hours, you will be in a new session.

Let’s load the data and take a look at it:

``````import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import os

parse_dates=['Date']).loc[:, ['Date', 'Device Category', 'Sessions']]
data = data.groupby(['Date', 'Device Category'], as_index=False)['Sessions'].sum()

fixes = pd.DataFrame({'Date': pd.to_datetime(['2016-12-01', '2018-10-13']),
'Device Category': ['tablet', 'tablet'] ,
'Sessions': [0, 0]})
data = pd.concat([data, fixes], axis=0, ignore_index=True)
data = data.sort_values(['Device Category', 'Date']).groupby('Device Category').apply(lambda x: x.fillna(method='ffill'))
``````
Date Device Category Sessions
2014-01-01 00:00:00 desktop 1032805
2014-01-01 00:00:00 mobile 537573
2014-01-01 00:00:00 tablet 92474
2014-01-02 00:00:00 desktop 2359710
2014-01-02 00:00:00 mobile 607544

We load the data with Pandas and take only 3 columns: the date, the device category, and the number of sessions.

The original data has more detailed information about the sessions, like the browser and number of visitors, but we don’t need it for this tutorial.

We then group the data by date and device category and sum the number of sessions for each group.

Here we have 3 time series: one for desktop, one for mobile, and one for tablet. They are stacked in the long format, which means that each row represents a single observation.

This data has two observations missing for the `tablet` device category, so we add them manually as `np.nan`, then fill with the previous day value.

Now we split the data into training and validation sets with a simple time-based split:

``````train = data.loc[data['Date'] < '2019-01-01']
valid = data.loc[(data['Date'] >= '2019-01-01') & (data['Date'] < '2019-02-01')]
h = valid['Date'].nunique()
``````

`h` is the horizon, the number of time steps we want to forecast into the future. In this case, the number of days in the validation set.

Splitting the data is important because we want to make sure that the model doesn’t overfit to the data it has seen and can’t generalize to the future.

GluonTS requires us to use a specific object for time series data.

We can easily convert our Pandas DataFrame to a GluonTS dataset using the `from_long_dataframe` method:

``````from gluonts.dataset.pandas import PandasDataset
from gluonts.dataset.multivariate_grouper import MultivariateGrouper

unique_series = train['Device Category'].nunique()
train_ds = PandasDataset.from_long_dataframe(train, target='Sessions', item_id='Device Category', timestamp='Date', freq='D')
grouper_train = MultivariateGrouper(max_target_dim=unique_series)
train_ds = grouper_train(train_ds)
``````

This method takes in four arguments:

• `train` is a Pandas DataFrame containing the training data.
• `target` is the name of the target column (the value we want to forecast)
• `item_id` is the name of the column containing unique identifiers for each time series, in case we have multiple like here.
• `timestamp` is the name of the column containing the time stamps for each observation
• `freq` is the frequency of the time series data, in this case daily.

Next, we have the `grouper_train` variable, which is an instance of the `MultivariateGrouper` class.

This class is used to group together time series data for models that implement multivariate forecasting in GluonTS.

We set the `max_target_dim` parameter to 3, which is the number of time series we have.

Finally, we’re passing the `train_ds` variable to the `grouper_train()` method to transform the data based on the criteria we specified earlier.

## Training A Gaussian Process Model For Time Series Forecasting#

We will use a flavor of Gaussian Process called GP-Copula or GPVar.

This model is a mix of a Gaussian Process and a recurrent neural network (usually LSTM).

You can find all the details in GP-Copula’s original paper.

``````from gluonts.mx import Trainer, GPVAREstimator

estimator = GPVAREstimator(freq='D', prediction_length=h, target_dim=unique_series, context_length=30, trainer = Trainer(ctx='gpu', epochs=5, learning_rate=1e-3))
predictor = estimator.train(train_ds)
``````

We instantiate the `GPVAREstimator` class and pass in the following parameters:

• `freq`: Frequency of the time series data. In this example, it’s set to `D` to indicate that the data is daily.
• `prediction_length`: Number of time steps to be predicted by the model.
• `target_dim`: Dimension of the target variable. This is typically the number of series we have. In this case, it’s set to `unique_series`, which is the number of unique device categories in the training data.
• `context_length`: Length of the training context. This is the number of time steps the model uses as inputs.
• `trainer`: An instance of the `Trainer` class which defines how the model is trained. It specifies parameters such as the number of epochs and learning rate. In this example, it’s set to use 5 epochs, a learning rate of 1e-3, and to run on a GPU.

The `learning_rate` is a very important parameters that we will optimize later. It defines the step size of the optimzation algorithm used to train the model.

I found, through experience, that fixing a number of epochs and optimizing the learning rate is a good strategy to find the best model while avoiding overfitting.

If you don’t have a GPU, you can set `ctx='cpu'` to run the model on the CPU.

The predictor object is created by calling the `train` method of the `estimator` object, passing in the training dataset (`train_ds`) as an argument.

Notice that, different than the usual scikit-learn API, the `train` (or `fit`) method returns an object that can be used to generate forecasts.

## Generating Forecasts#

Let’s get the forecasts for the validation set:

``````pred = list(predictor.predict(train_ds))
``````

The `predict` method takes in a dataset and returns a generator object with the forecasts.

In this case it’s a 100 by 31 by 3 array, where 100 is the number of samples it generated from the posterior distribution, 31 is the number of time steps in the validation set (the prediction horizon), and 3 is the number of unique time series.

Let’s process this data into a Pandas DataFrame to evaluate the model:

``````all_preds = list()
item = pred
p = np.median(item.samples, axis=0).clip(0)
p10 = np.percentile(item.samples, 10, axis=0).clip(0)
p90 = np.percentile(item.samples, 90, axis=0).clip(0)
dates = pd.date_range(start=item.start_date.to_timestamp(), periods=len(p), freq='D')

for name, q in [('p50', p), ('p10', p10), ('p90', p90)]:
wide_p = pd.DataFrame({'Date': dates})
for i, device in enumerate(train_['Device Category'].unique()):
wide_p[device] = q[:, i]
wide_p = wide_p.melt(id_vars='Date', var_name='Device Category', value_name=name)
all_preds.append(wide_p.set_index(['Date', 'Device Category']))

all_preds = pd.concat(all_preds, axis=1).reset_index()
all_preds = all_preds.merge(valid, on=['Date', 'Device Category'], how='left')
`````` The code above computes the median, 10th percentile, and 90th percentile of the predicted values.

It uses `numpy` to compute these percentiles along the samples axis of the predicted values array.

The for loop then creates a Pandas dataframe with the date range and the predicted values for each device category as the columns.

The `melt` function is used to unpivot the dataframe, converting the device categories to a single column, and the values to a separate column. The resulting dataframe for each percentile is then appended to the `all_preds` list.

After computing the predicted values for each percentile, the code concatenates the dataframes in `all_preds` along the columns axis, resulting in a single dataframe with dates, columns for each percentile (`p50`, `p10`, `p90`) and device category.

Date Device Category p50 p10 p90 Sessions
2019-01-01 00:00:00 desktop 1.06014e+06 625134 1.43136e+06 315674
2019-01-02 00:00:00 desktop 1.06372e+06 643280 1.47342e+06 1209157
2019-01-03 00:00:00 desktop 986666 660598 1.32867e+06 1199978
2019-01-04 00:00:00 desktop 892633 734292 1.14378e+06 1339970
2019-01-05 00:00:00 desktop 445984 282950 591550 516879

We merge the dataframe with the validation data based on the date and device category columns so it’s straightforward to compute the pinball loss.

## Evaluating The Model#

The Pinball loss is a commonly used metric in time series forecasting that measures the deviation of the predicted quantiles from the actual values at a specific quantile level.

``````from sklearn.metrics import mean_pinball_loss
for q in [0.1, 0.5, 0.9]:
print(f'pinball loss at {q}: {mean_pinball_loss(all_preds["Sessions"], all_preds[f"p{int(q*100)}"], alpha=q)}')
``````

The code imports the `mean_pinball_loss` function from scikit-learn.

It then calculates the pinball loss for quantiles of 0.1, 0.5, and 0.9 using a for loop.

Inside the loop, the function is called with the following arguments: the actual values of sessions (`all_preds["Sessions"]`), the predicted values for the corresponding quantile (`all_preds[f"p{int(q*100)}"]`), and the value of `alpha` (which is the quantile).

The final numbers are:

• pinball loss at 0.1: 33952.08
• pinball loss at 0.5: 87465.95
• pinball loss at 0.9: 68380.55

Let’s see if we can improve the model by optimizing the hyperparameters.

## Optimizing The Hyperparameters Of The Gaussian Process With Optuna#

We will use Optuna to optimize the hyperparameters of the Gaussian Process model.

This is a very simple library that uses Tree-structured Parzen Estimator (TPE) to optimize the hyperparameters of the model.

We just need to encapsulate the code that trains the model in a function that takes a `trial` object as an argument.

The `trial` object is a single step of the optimization process.

The function returns a list of values, as we want to optimize the pinball loss at quantiles of 0.1, 0.5, and 0.9.

``````def objective(trial):
context_length = trial.suggest_int('context_length', 7, 400)
learning_rate = trial.suggest_loguniform('learning_rate', 1e-5, 1e-1)
num_layers = trial.suggest_int('num_layers', 1, 4)
num_cells = trial.suggest_int('num_cells', 8, 128)
cell_type = trial.suggest_categorical('cell_type', ['lstm', 'gru'])
dropout_rate = trial.suggest_uniform('dropout_rate', 0.1, 0.9)
rank = trial.suggest_int('rank', 1, 3)

estimator = GPVAREstimator(freq='D', prediction_length=h, target_dim=unique_series,
context_length=context_length, num_layers=num_layers,
num_cells=num_cells, cell_type=cell_type, dropout_rate=dropout_rate,
rank=rank,  trainer = Trainer(ctx='gpu', epochs=5, learning_rate=learning_rate))
predictor = estimator.train(train_ds)
pred = list(predictor.predict(train_ds))

all_preds = list()
item = pred
p = np.median(item.samples, axis=0).clip(0)
p10 = np.percentile(item.samples, 10, axis=0).clip(0)
p90 = np.percentile(item.samples, 90, axis=0).clip(0)
dates = pd.date_range(start=item.start_date.to_timestamp(), periods=len(p), freq='D')
for name, q in [('p50', p), ('p10', p10), ('p90', p90)]:
wide_p = pd.DataFrame({'Date': dates})
for i, device in enumerate(train_['Device Category'].unique()):
wide_p[device] = q[:, i]
wide_p = wide_p.melt(id_vars='Date', var_name='Device Category', value_name=name)
all_preds.append(wide_p.set_index(['Date', 'Device Category']))

all_preds = pd.concat(all_preds, axis=1).reset_index()
all_preds = all_preds.merge(valid, on=['Date', 'Device Category'], how='left')

return [mean_pinball_loss(all_preds["Sessions"], all_preds[f"p{int(q*100)}"], alpha=q) for q in [0.1, 0.5, 0.9]]
``````

We define the hyperparameters search space using the `suggest_*` methods of the `trial` object.

These methods will sample a value from the specified range or distribution.

Here is a breakdown of each parameter:

• `context_length`: the number of time steps in the past that the model should use to make a prediction about the future. I set a range from 7 to 400 so we can test very short and very long windows.
• `learning_rate`: the step size at which the model adjusts its weights during training. A small learning rate can lead to slower convergence, while a large learning rate can lead to overshooting the optimal weights. A good rule of thumb is to use a log-uniform distribution between 1e-5 and 1e-1.
• `num_layers`: the number of layers in the RNN model. A deeper model can capture more complex patterns in the data, but can also be prone to overfitting. Let’s try values between 1 and 4. More layers than that are usually not necessary for time series forecasting.
• `num_cells`: the number of units or neurons in each layer of the model. A number between 8 and 128 should be enough.
• `cell_type`: the type of recurrent neural network (RNN) cell to use in the model. This can be either Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU). These are both popular choices for time series forecasting.
• `dropout_rate`: the rate at which to randomly drop out units in the model during training to prevent overfitting. A good range is between 0.1 and 0.9.
• `rank`: the rank of the distribution that will be used to sample the predictions. Default is 2, but let’s try values between 1 and 3.

The rest of the code is similar to the previous example.

To run the optimization, we just need to call the `optimize` method of the `study` object.

``````import optuna
study = optuna.create_study(directions=['minimize', 'minimize', 'minimize'])
study.optimize(objective, n_trials=20, timeout=900)
``````

We set the `directions` of each returned value to `minimize` because we want to minimize the pinball losses at all quantiles.

`n_trials` is the number of steps to run. 20 is usually a good number to start with. 30 is enough for most cases.

The `timeout` argument is the maximum time in seconds to run the optimization. The attribute `study.best_trials` returns a list of the best trials, meaning the trials that returned the lowest values for any of the objectives.

After inspecting it, I found the second element of the returned list to be the minimum for 2 of the 3 objectives, so I picked it.

``````best_trial = study.best_trials
best_trial.params
{'context_length': 270,
'learning_rate': 0.013101512029143874,
'num_layers': 1,
'num_cells': 84,
'cell_type': 'lstm',
'dropout_rate': 0.16228795203947746,
'rank': 2}
``````

To get the corresponding values for the objectives, we can call the `values` attribute of the `best_trial` object.

``````best_trial.values
[31605.068581359206, 64329.63781081989, 69049.05124747986]
``````

The model improved, but I found it takes too long to reach a good solution compared to alternatives like CNNs, “pure” LSTMs or Kalman filters