In this blog post, we will explore how we can use Python to forecast volatility using three methods: Naive, the popular GARCH and machine learning with scikit-learn.

Volatility here is the standard deviation of the returns of a financial instrument.

I will teach you starting points to kickstart your own research.

Installing ARCH and mlforecast

First we need to install the required packages. You can do it with pip:

pip install arch
pip install mlforecast

or with conda:

conda install -c conda-forge arch-py
conda install -c conda-forge mlforecast

With Arch we will use the GARCH model and with mlforecast we will use a Scikit-learn model.

Preparing The Data For Volatility Forecasting

We will use real crypto data available at Kaggle.

This tutorial is applicable to any financial instrument, I just got crypto because it was easily available.

I took only 20 coins to run faster, but you can use all of them.

import pandas as pd
import glob

data = list()
for f in glob.glob(path)[:20]:
    fname_noext = f.split(os.sep)[-1].split(".")[0]
    data_ = pd.read_csv(f, parse_dates=['Date'])
    data_['coin'] = fname_noext
    data_['return'] = data_['Close'].pct_change()
    data_['target_realized_vol'] = data_['return'].rolling(7).std().shift(-7)
    data_ = data_[data_['Date'] >= '2020-01-01']
data = pd.concat(data, axis=0, ignore_index=True)

Each file contains the historical data of a coin.

Our target is the target_realized_vol column, which is the standard deviation of the returns of the next 7 days for each coin.

Some coins have a long history, others don’t, the cutoff date at 2020-01-01 is arbitrary.

We need to split the data into train and validation sets. I picked 2022-03-01 as the split date.

split = "2022-03-01"
train  = data[data['Date'] < split]
valid = data[data['Date'] >= split]
coins = valid['coin'].unique()

We need to be very careful with the training data here, as we created the target column before doing the split.

I will explain why in the next section.

Some people like to split the data by coin instead of by time, but I really don’t like it in a scenario with lots of highly correlated data like the one here.

Naive Volatility Forecasting

First up, let’s talk about the Naive method.

This method is simple and intuitive, making it a great starting point for understanding volatility forecasting.

The Naive method assumes that the volatility of a financial instrument over a given period is equal to the volatility observed during the previous period.

It is called “Naive” because it’s a simple rule that does not take into account any other factors that might influence volatility.

Here we will take the last target value and use it as the prediction for the next period.

from sklearn.metrics import mean_squared_error

p = train[train['Date'] == '2022-02-21'][['Date', 'coin','target_realized_vol']]
p = p.rename(columns={'target_realized_vol':'pred'})
p['Date'] = p['Date'] + pd.DateOffset(8)
p = p.merge(valid[['Date', 'coin','target_realized_vol']], on=['Date', 'coin'], how='left')

mean_squared_error(p['target_realized_vol'], p['pred'], squared=False)
Date coin pred target_realized_vol
2022-03-01 00:00:00 1inch 0.0546111 0.028911
2022-03-01 00:00:00 Aave 0.0692836 0.0379977
2022-03-01 00:00:00 Algorand 0.0566661 0.0249555
2022-03-01 00:00:00 ApeCoin 0.0539567 0.0368307
2022-03-01 00:00:00 Arweave 0.0569066 0.0657768

Pay attention to the Date I selected.

We are pretending to make a prediction on February 28, 2022.

Our training data goes until February 28, 2022, but in practice we don’t know the target for the last 7 days of our data, as these include days that are in the validation set.

Think about it, the target for February 26, 2022 is the standard deviation of the returns of the next 7 days, which ends on March 5, 2022.

Taking the target value from any date after February 21, 2022 would introduce data leakage.

With that in mind, I got the values for each coin and shifted the date in the rows to 8 days after (March 1, 2022), which is the day we want to predict.

Then I merged the predictions with the validation set and calculated the RMSE, which was 0.0244.

Naive baselines tend to be strong and hard to beat in financial time series forecasting, so let’s see if the more complex models can do better.

GARCH For Volatility Forecasting

GARCH, which stands for Generalized Autoregressive Conditional Heteroskedasticity, is a more advanced approach to volatility forecasting, and it takes into account the fact that volatility tends to cluster together over time.

This means that when we observe high volatility today, it is more likely that we will also observe high volatility tomorrow.

GARCH models use traditional statistical modeling to estimate how volatility changes over time, and they are widely used in finance.

Disclaimer: I am approaching this topic with the eyes of a machine learning engineer, not a statistician. Someone with more experience with GARCH may get different results.

We need a loop to run the model for each coin, as the library doesn’t support multiple time series at once.

from arch import arch_model

results = list()
for coin in coins:
    returns = train[train['coin'] == coin]['return'].dropna().clip(-1,1)

    am = arch_model(returns, vol="Garch", p=1, q=1)
    res =
    p = np.sqrt(res.forecast(reindex=False, horizon=7).variance).mean(axis=1)
    y = valid[valid['coin'] == coin]['target_realized_vol'].iloc[0]
    results.append({'coin': coin, 'predicted': p.values[0], 'realized': y})

df = pd.DataFrame(results).sort_values('realized', ascending=False)
print(mean_squared_error(df['realized'], df['predicted'], squared=False))
coin predicted realized
Celo 0.136376 0.0684473
Arweave 0.0901401 0.0657768
Basic Attention Token 0.0688667 0.041834
chainlink 0.0723431 0.0403548
Axie Infinity 0.0820315 0.0395899

For each coin we take the returns from the training set, clipped to be between -1 and 1.

The function arch_model creates the model. The first argument is the series of returns.

We can pass the full series because we are assuming a prediction on February 28, 2022 at the end of the day. So we have all the returns up to and including this date.

The second argument is the type of model we want to use to model the volatility.

The p and q arguments are the order of the autoregressive and moving average terms, respectively.

It basically means how many previous values we want to use to predict the next one.

I consider these as hyperparameters that you can tune for each coin or for all of them at once.

After fitting the model we get the volatility forecast by calling the forecast method and getting the variance attribute.

Here I put horizon=7 because we want to predict the volatility for the next 7 days.

As I am not completely sure about the time duration of GARCH forecasts, I just averaged the values.

Again, I am looking here for something that works, not for the rigorous statistical approach.

The variance is the square of the standard deviation, so I took the square root to get the standard deviation which is what our target is.

The RMSE was 0.0382, which is worse than the naive baseline.

Volatility Forecasting With Scikit-Learn

We can use any scikit-learn model or even XGBoost to predict volatility easily by using the mlforecast library.

First we need to prepare the data with the same care we did for the Naive model.

train_ml = train.groupby('coin', as_index=False).apply(lambda x: x.iloc[:-7]).reset_index(drop=True)
train_ml = train_ml.rename(columns={'Date': 'ds', 'target_realized_vol': 'y', 'coin': 'unique_id'})[['ds', 'unique_id', 'y']]
valid_ml = valid.rename(columns={'Date': 'ds', 'target_realized_vol': 'y', 'coin': 'unique_id'})[['ds', 'unique_id', 'y']]

Here we remove the last 7 days from the training set, as we don’t know the target for these days.

Then we rename the columns to match the mlforecast library requirements.

  • ds is the date column
  • unique_id is the column that identifies the time series (each coin)
  • y is the target column

Let’s create a Random Forest model and fit it to the data.

from mlforecast import MLForecast
from window_ops.rolling import rolling_mean, rolling_std
from sklearn.ensemble import RandomForestRegressor

model = MLForecast(models=[RandomForestRegressor(random_state=0, n_jobs=-1)],
                   lag_transforms={1: [(rolling_mean, 3), (rolling_mean, 7)]}), id_col='unique_id', time_col='ds', target_col='y', static_features=[])

We import the MLForecast class from the mlforecast library.

Then we create a list of models to use. In this case, we only have one model, but you can pass a list of models and the library will train all.

The freq argument is the frequency of the time series. In our case, it is daily.

The lags argument is a list of lags to use. In this case, we use 1, 3, 7 and 14 days. You can try different values and see how they affect the results.

The lag_transforms argument is a dictionary that maps the lag to a list of more complex transformations to apply to the lag.

For example, we will apply a rolling mean with a window of 3 and 7 days to the lag of 1 day.

This means the library will take the original series, shift it by 1 day, and then apply the rolling mean.

To do it for other lags you just have to add the number as a key and the list of transformations as the value.

Note that we pass a list of tuples, where the first element is the function and the second is the window.

If we had more arguments, we could just pass them as additional items in the tuple.

Here we use functions from the window_ops library, which is optimized. The developers of MLForecast recommend using it and numba if you want to use custom functions.

We need to specify the columns related to the date, the unique id and the target in the fit method.

Besides, we can pass a list of static features (e.g: the exchange where the data was taken from). I like to pass an empty list here because if we decided to add external features, they will be treated as static unless we specify which columns are really static.

The prediction is straightforward.

p = model.predict(horizon=8)
p = p.merge(valid_ml, on=['ds', 'unique_id'], how='left').dropna()
mean_squared_error(p['y'], p['RandomForestRegressor'], squared=False)

The predict method returns a dataframe with the predictions for each coin and each day in the horizon.

unique_id ds RandomForestRegressor y
1inch 2022-03-01 00:00:00 0.0406654 0.028911
Aave 2022-03-01 00:00:00 0.0415011 0.0379977
Algorand 2022-03-01 00:00:00 0.0425734 0.0249555
ApeCoin 2022-03-01 00:00:00 0.0455959 0.0368307
Arweave 2022-03-01 00:00:00 0.0401673 0.0657768

The horizon is the number of days to predict after the last date in the training set.

As we have a gap of 7 days between the training and validation sets, we need to predict 8 days.

Then we merge the predictions with the validation set and calculate the RMSE.

The RMSE was 0.0224, which is slightly better than the naive baseline.

This model can be improved by adding more features, tuning the hyperparameters.

It’s a good idea to extend the validation and calculate the RMSE for more days than just a single one like I did here to demonstrate.