When working with time series data, differencing is a common technique used to make the data stationary.

Stationary data is important because many statistical models assume that properties such as the mean and standard deviation stay constant over time. Meeting that assumption can improve the accuracy of our predictions.
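To make the idea concrete, here is a minimal sketch on a made-up series: a linear trend plus noise, where the slope, noise level, and seed are arbitrary assumptions. The mean of the raw series drifts between its first and second half, while the mean of the differenced series stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
t = np.arange(200)
ts = 0.5 * t + rng.normal(0, 1, size=200)  # hypothetical trending series

ts_diff = np.diff(ts)  # first-order difference

# Gap between the mean of the first half and the mean of the second half
raw_gap = abs(ts[:100].mean() - ts[100:].mean())
diff_gap = abs(ts_diff[:100].mean() - ts_diff[100:].mean())

print(raw_gap, diff_gap)
```

The first number should be large (the trend shifts the mean) and the second close to zero.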

Let’s see how we can easily perform differencing in Python using Pandas, Numpy, and Polars.

First-order Differencing

First-order differencing involves subtracting the previous value from each value in the time series.

Pandas

In Pandas, we can perform first-order differencing using the diff() method. Here’s an example:

import pandas as pd

ts = pd.Series([10, 20, 30, 40, 50])

ts_diff = ts.diff()

print(ts_diff)

original and first-order differenced series example

diff
0 nan
1 10
2 10
3 10
4 10

Note that the first value is NaN because there is no previous value to subtract.

Notice the y-axis values in the plot to see the difference.
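As a quick sanity check on what diff() computes, the sketch below builds the same result by hand with shift(). The variable name manual_diff is only for illustration.

```python
import pandas as pd

ts = pd.Series([10, 20, 30, 40, 50])

# shift(1) moves every value one row down, so each row
# lines up with the value that came before it
manual_diff = ts - ts.shift(1)

# Compare against the built-in method (NaNs in the same position count as equal)
print(manual_diff.equals(ts.diff()))
```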

Numpy

In Numpy, we can perform first-order differencing using the np.diff() function.

import numpy as np

ts = np.array([10, 20, 30, 40, 50])

ts_diff = np.diff(ts)

print(ts_diff)

# output
[10 10 10 10]

This time the first value is removed.
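If you need the result to keep the same length as the original array (for example, to store it next to the original values), one option is to pad the front with NaN. This is a minimal sketch; it assumes a float array, since NaN cannot be stored in an integer array.

```python
import numpy as np

ts = np.array([10, 20, 30, 40, 50], dtype=float)  # float so NaN can be stored

# Put a NaN in front so the output lines up with the input
ts_diff = np.concatenate([[np.nan], np.diff(ts)])

print(ts_diff)
```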

Polars

In Polars, we can perform first-order differencing using the diff() method.

import polars as pl

ts = pl.Series([10, 20, 30, 40, 50])

ts_diff = ts.diff()

print(ts_diff)

# output
shape: (5,)
Series: '' [i64]
[
	null
	10
	10
	10
	10
]

Again, the first value is null because there is no previous value to subtract.

Second-order Differencing

Second-order differencing involves differencing the time series twice: we take the first-order difference and then difference the result again.

This can be useful in some cases where a first-order difference is not enough to make the time series stationary.

Pandas

In Pandas, we can perform second-order differencing by calling the diff() method twice.

import pandas as pd

ts = pd.Series([10, 20, 30, 40, 50])

ts_diff = ts.diff().diff()

print(ts_diff)

original, first and second-order differenced series example

diff_diff
0 nan
1 nan
2 0
3 0
4 0

Now our first and second values are NaNs.

Notice the y-axis values in the plot to see the difference.

Numpy

In Numpy, we can perform second-order differencing by calling the np.diff() function twice.

import numpy as np

ts = np.array([10, 20, 30, 40, 50])

ts_diff = np.diff(np.diff(ts))

print(ts_diff)

# output
[0 0 0]
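np.diff() also accepts an n argument that repeats the differencing for us. The sketch below uses a made-up quadratic series, where the first difference still trends upward and only the second difference is constant.

```python
import numpy as np

t = np.arange(6)
ts = t ** 2  # hypothetical quadratic series: 0, 1, 4, 9, 16, 25

first = np.diff(ts)        # 1, 3, 5, 7, 9 (still increasing)
second = np.diff(ts, n=2)  # 2, 2, 2, 2 (constant)

print(first)
print(second)
```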

Polars

In Polars, we can perform second-order differencing by calling the diff() method twice.

import polars as pl

ts = pl.Series([10, 20, 30, 40, 50])

ts_diff = ts.diff().diff()

print(ts_diff)

# output
shape: (5,)
Series: '' [i64]
[
	null
	null
	0
	0
	0
]

Basically the same as Pandas.

Seasonal Differencing

Seasonal differencing involves subtracting the value at the same point in the previous season from the current value of the time series.

This can be useful in cases where the time series exhibits a seasonal pattern.

For example, if we have a time series of monthly sales data, we would subtract the value of the same month in the previous year from the current month.

If we have a time series of daily sales data, we could subtract the value of the same day in the previous week from the current day.

Pandas

In Pandas, we can perform seasonal differencing by calling the diff() method with the appropriate lag.

import pandas as pd

# Newer pandas versions (2.2+) deprecate 'M' in favor of 'ME' for month-end frequency
dates = pd.date_range(start='2021-12', end='2023-12', freq='M')
ts = pd.Series([10, 20, 30, 40, 50, 10, 20, 30, 40, 50, 10, 20] * 2, index=dates)

ts_diff = ts.diff(periods=12)

print(ts_diff)
ts ts_diff
2021-12-31 00:00:00 10 nan
2022-01-31 00:00:00 20 nan
2022-02-28 00:00:00 30 nan
2022-03-31 00:00:00 40 nan
2022-04-30 00:00:00 50 nan
2022-05-31 00:00:00 10 nan
2022-06-30 00:00:00 20 nan
2022-07-31 00:00:00 30 nan
2022-08-31 00:00:00 40 nan
2022-09-30 00:00:00 50 nan
2022-10-31 00:00:00 10 nan
2022-11-30 00:00:00 20 nan
2022-12-31 00:00:00 10 0
2023-01-31 00:00:00 20 0
2023-02-28 00:00:00 30 0
2023-03-31 00:00:00 40 0
2023-04-30 00:00:00 50 0
2023-05-31 00:00:00 10 0
2023-06-30 00:00:00 20 0
2023-07-31 00:00:00 30 0
2023-08-31 00:00:00 40 0
2023-09-30 00:00:00 50 0
2023-10-31 00:00:00 10 0
2023-11-30 00:00:00 20 0

periods=12 tells Pandas to subtract the value 12 rows before from the current row.
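The same call covers the daily case mentioned earlier: with daily data and a weekly pattern, the lag is 7. The values below are invented for illustration (a weekly shape that grows by 10 each week).

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='2023-01-02', periods=21, freq='D')

# Hypothetical daily sales: the same weekly shape, 10 units higher each week
weekly_pattern = np.tile([5, 6, 7, 8, 9, 20, 25], 3)
weekly_growth = np.repeat([0, 10, 20], 7)
ts = pd.Series(weekly_pattern + weekly_growth, index=dates)

# Subtract the value from the same weekday one week earlier
ts_diff = ts.diff(periods=7)

print(ts_diff)
```

The first seven values are NaN, and the rest show only the week-over-week change because the weekly shape cancels out.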

Numpy

In Numpy, we can perform seasonal differencing by subtracting the value of the time series at the appropriate lag.

import numpy as np

ts = np.array([10, 20, 30, 40, 50, 10, 20, 30, 40, 50, 10, 20] * 2)

ts_diff = ts[12:] - ts[:-12]

print(ts_diff)

# output
[0 0 0 0 0 0 0 0 0 0 0 0]

Polars

In Polars, we can perform seasonal differencing by also calling the diff() method with the desired lag.

import numpy as np
import polars as pl

ts = pl.Series(np.array([10, 20, 30, 40, 50, 10, 20, 30, 40, 50, 10, 20] * 2))

ts_diff = ts.diff(12)

print(ts_diff)

# output
shape: (24,)
Series: '' [i32]
[
	null
	null
	null
	null
	null
	null
	null
	null
	null
	null
	null
	null
	0
	0
	0
	0
	0
	0
	0
	0
	0
	0
	0
	0
]

Log Differencing

Log differencing involves taking the logarithm of a time series and then taking the first-order difference of the resulting sequence.

It’s a very common transformation in finance, where the log difference of a time series is often used to model the returns of a stock or other financial instrument.

Pandas

In Pandas, we can perform log differencing by calling the np.log() function on the Series and then taking the first-order difference.

import pandas as pd
import numpy as np

ts = pd.Series([1, 2, 3, 4, 5])

ts_diff = np.log(ts).diff()

print(ts_diff)

original and log differenced series example

log_diff
0 nan
1 0.693147
2 0.405465
3 0.287682
4 0.223144

The np.log() function uses the natural logarithm, which is the logarithm to the base e.

In case you have zeros in your time series, you can use the np.log1p() function instead.

This will add 1 to each value before taking the log, which will prevent it from returning infinite values.

import pandas as pd
import numpy as np

ts = pd.Series([1, 2, 3, 4, 5])

ts_diff = np.log1p(ts).diff()

print(ts_diff)
log1p_diff
0 nan
1 0.405465
2 0.287682
3 0.223144
4 0.182322
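To see why zeros matter, here is a small sketch with a made-up series that starts at zero. With np.log() the zero becomes -inf and the first difference becomes inf, while np.log1p() stays finite. Keep in mind that the two transformations do not give the same numbers, as the outputs above already show.

```python
import numpy as np
import pandas as pd

ts = pd.Series([0, 1, 3, 7])  # hypothetical series containing a zero

with np.errstate(divide='ignore'):  # silence the log(0) warning
    log_diff = np.log(ts).diff()    # second value is inf

log1p_diff = np.log1p(ts).diff()    # all values after the first are finite

print(log_diff)
print(log1p_diff)
```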

Numpy

In Numpy, we can perform log differencing by calling the np.log() function and then calling the np.diff() function.

import numpy as np

ts = np.array([1, 2, 3, 4, 5])

ts_diff = np.diff(np.log(ts))

print(ts_diff)

# output
[0.69314718 0.40546511 0.28768207 0.22314355]
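The link to financial returns can be sketched as follows, using invented prices. Each log difference is the log of the ratio between consecutive prices, and the log differences add up to the log of the total change over the period.

```python
import numpy as np

prices = np.array([100.0, 102.0, 99.0, 105.0])  # hypothetical prices

log_returns = np.diff(np.log(prices))

# Same thing, written as the log of consecutive price ratios
ratio_logs = np.log(prices[1:] / prices[:-1])

# Log returns are additive: their sum is the log of last price / first price
total = log_returns.sum()

print(log_returns)
print(total, np.log(prices[-1] / prices[0]))
```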

Polars

In Polars, we can perform log differencing by calling the log() method and then taking the first-order difference using the diff() method.

import polars as pl

ts = pl.Series([1, 2, 3, 4, 5])

ts_diff = ts.log().diff()

print(ts_diff)

# output
shape: (5,)
Series: '' [f64]
[
	null
	0.693147
	0.405465
	0.287682
	0.223144
]

To reproduce the log1p behavior, we can add 1 to the time series before taking the log.

import polars as pl

ts = pl.Series([1, 2, 3, 4, 5])

ts_diff = (ts+1).log().diff()

print(ts_diff)

# output
shape: (5,)
Series: '' [f64]
[
	null
	0.405465
	0.287682
	0.223144
	0.182322
]

original, log and log plus 1 differenced series example

Again, look at the y-axis to see the difference between the original and the log differenced series.

Grouped Time Series Differencing

When you have a DataFrame with multiple time series in the long format, you can take their differences by grouping them first.

This can be useful in cases where you have multiple time series, such as the sales of different products in a store.

Pandas

In Pandas, we can perform grouped time series differencing by calling the groupby() method and then calling diff().

import pandas as pd

ts = pd.DataFrame({'values': [10, 20, 30, 40, 50, 100, 200, 300, 400, 500],
                   'groups': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B']})

ts_diff = ts.groupby('groups').diff()
ts['diff'] = ts_diff['values']

print(ts)
values groups diff
10 A nan
20 A 10
30 A 10
40 A 10
50 A 10
100 B nan
200 B 100
300 B 100
400 B 100
500 B 100
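To see why the grouping step matters, the sketch below compares it with a plain diff() on the same data. Without grouping, the first row of group B is differenced against the last row of group A. The column names naive_diff and grouped_diff are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'values': [10, 20, 30, 40, 50, 100, 200, 300, 400, 500],
                   'groups': ['A'] * 5 + ['B'] * 5})

# Plain diff ignores the group boundary: row 5 becomes 100 - 50 = 50
df['naive_diff'] = df['values'].diff()

# Grouped diff restarts in each group: row 5 is NaN
df['grouped_diff'] = df.groupby('groups')['values'].diff()

print(df.iloc[4:7])
```

This assumes the rows are already sorted by time within each group.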

Polars

In Polars, we can perform grouped time series differencing by calling the diff() method on the pl.col('ts') expression and specifying the group column with the over() method.

import polars as pl

ts = pl.Series([10, 20, 30, 40, 50, 100, 200, 300, 400, 500])
groups = pl.Series(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])

ts = pl.DataFrame({'ts': ts, 'groups': groups})

ts_diff = ts.with_columns([pl.col('ts').diff().over(pl.col('groups')).alias('ts_diff')])

print(ts_diff)
ts groups ts_diff
i64 str i64
—– ——– ———
10 A null
20 A 10
30 A 10
40 A 10
50 A 10
100 B null
200 B 100
300 B 100
400 B 100
500 B 100

Fractional Differencing

Traditional n-order differencing makes the data stationary but, in the process, it tends to erase the dependence of the time series on its past values.

Fractional differencing was suggested by Marcos López de Prado, in the context of financial time series, as an alternative that differences a series without erasing its memory structure.

We can apply it using the library fracdiff.

import numpy as np
from fracdiff import fdiff

ts = np.array([10, 20, 30, 40, 50, 100, 200, 300, 400, 500])

ts_diff = fdiff(ts, n=0.5)

# output
array([ 10.        ,  15.        ,  18.75      ,  21.875     ,
        24.609375  ,  67.0703125 , 139.32617188, 181.42089844,
       214.63470459, 243.0519104 ])

original, first-order and fractional differenced series example
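To give some intuition for what happens under the hood, here is a minimal from-scratch sketch in NumPy. It is not the fracdiff implementation. It assumes the standard binomial-expansion weights, where w[0] = 1 and w[k] = -w[k-1] * (d - k + 1) / k, applied over all available past values. The function names frac_diff_weights and frac_diff are hypothetical.

```python
import numpy as np

def frac_diff_weights(d, size):
    """Weights of the fractional difference operator of order d."""
    w = np.empty(size)
    w[0] = 1.0
    for k in range(1, size):
        w[k] = -w[k - 1] * (d - k + 1) / k
    return w

def frac_diff(ts, d):
    """Weighted sum of the current and all past values at each step."""
    ts = np.asarray(ts, dtype=float)
    w = frac_diff_weights(d, len(ts))
    # ts[t::-1] is ts[t], ts[t-1], ..., ts[0], matching w[0], w[1], ..., w[t]
    return np.array([np.dot(w[:t + 1], ts[t::-1]) for t in range(len(ts))])

ts = [10, 20, 30, 40, 50]

print(frac_diff_weights(0.5, 5))
print(frac_diff(ts, 0.5))
```

With d=0.5 the weights decay slowly, so older values still contribute. With d=1 the weights become 1, -1, 0, 0, ..., which is ordinary first-order differencing (except that the first value is kept instead of being NaN).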

Tune the n parameter to find the coefficient that makes the data stationary while preserving as much as possible of the memory structure of the time series.

You can do this with statistical tests such as the Augmented Dickey-Fuller test or, if you plan to use the time series in a machine learning model, by tuning this value as another hyperparameter on the validation set.