When working with time series data, differencing is a common technique used to make the data stationary.

Stationary data is important because it allows us to apply statistical models that assume constant parameters (like the mean and standard deviation) over time, and this can improve the accuracy of our predictions.

Let’s see how we can easily perform differencing in Python using Pandas, Numpy, and Polars.

## First-order Differencing#

First-order differencing involves subtracting each value in the time series from its previous value.

### Pandas#

In Pandas, we can perform first-order differencing using the `diff()` method. Here’s an example:

``````import pandas as pd

ts = pd.Series([10, 20, 30, 40, 50])

ts_diff = ts.diff()

print(ts_diff)
``````

diff
0 nan
1 10
2 10
3 10
4 10

Note that the first value is NaN because there is no previous value to subtract from.

Notice the y-axis values in the plot to see the difference.

### Numpy#

In Numpy, we can perform first-order differencing using the `np.diff()` function.

``````import numpy as np

ts = np.array([10, 20, 30, 40, 50])

ts_diff = np.diff(ts)

print(ts_diff)

#output
[10 10 10 10]
``````

This time the first value is removed.

### Polars#

In Polars, we can perform first-order differencing using the `shift()` method.

``````import polars as pl

ts = pl.Series([10, 20, 30, 40, 50])

ts_diff = ts.diff()

print(ts_diff)

# output
shape: (5,)
Series: '' [i64]
[
null
10
10
10
10
]
``````

Again, the first value is null because there is no previous value to subtract from.

## Second-order Differencing#

Second-order differencing involves taking the simple difference of values on a time series twice.

This can be useful in some cases where a first-order difference is not enough to make the time series stationary.

### Pandas#

In Pandas, we can perform second-order differencing by calling the `diff()` method twice.

``````import pandas as pd

ts = pd.Series([10, 20, 30, 40, 50])

ts_diff = ts.diff().diff()

print(ts_diff)
``````

diff_diff
0 nan
1 nan
2 0
3 0
4 0

Now our first and second values are NaNs.

Notice the y-axis values in the plot to see the difference.

### Numpy#

In Numpy, we can perform second-order differencing by calling the `np.diff()` function twice.

``````import numpy as np

ts = np.array([10, 20, 30, 40, 50])

ts_diff = np.diff(np.diff(ts))

print(ts_diff)

# output
[0 0 0]
``````

### Polars#

In Polars, we can perform second-order differencing by calling the `shift()` method twice.

``````import polars as pl

ts = pl.Series([10, 20, 30, 40, 50])

ts_diff = ts.diff().diff()

print(ts_diff)

## output
shape: (5,)
Series: '' [i64]
[
null
null
0
0
0
]
``````

Basically the same as Pandas.

## Seasonal Differencing#

Seasonal differencing involves subtracting the value of a time series from the value at the same time in the previous season.

This can be useful in cases where the time series exhibits a seasonal pattern.

For example, if we have a time series of monthly sales data, we would subtract the value of the same month in the previous year from the current month.

If we have a time series of daily sales data, we could subtract the value of the same day in the previous week from the current day.

### Pandas#

In Pandas, we can perform seasonal differencing by calling the `diff()` method with the appropriate lag.

``````import pandas as pd

dates = pd.date_range(start='2021-12', end='2023-12', freq='M')
ts = pd.Series([10, 20, 30, 40, 50, 10, 20, 30, 40, 50, 10, 20] * 2, index=dates)

ts_diff = ts.diff(periods=12)

print(ts_diff)
``````
ts ts_diff
2021-12-31 00:00:00 10 nan
2022-01-31 00:00:00 20 nan
2022-02-28 00:00:00 30 nan
2022-03-31 00:00:00 40 nan
2022-04-30 00:00:00 50 nan
2022-05-31 00:00:00 10 nan
2022-06-30 00:00:00 20 nan
2022-07-31 00:00:00 30 nan
2022-08-31 00:00:00 40 nan
2022-09-30 00:00:00 50 nan
2022-10-31 00:00:00 10 nan
2022-11-30 00:00:00 20 nan
2022-12-31 00:00:00 10 0
2023-01-31 00:00:00 20 0
2023-02-28 00:00:00 30 0
2023-03-31 00:00:00 40 0
2023-04-30 00:00:00 50 0
2023-05-31 00:00:00 10 0
2023-06-30 00:00:00 20 0
2023-07-31 00:00:00 30 0
2023-08-31 00:00:00 40 0
2023-09-30 00:00:00 50 0
2023-10-31 00:00:00 10 0
2023-11-30 00:00:00 20 0

`periods=12` tells Pandas to subtract the value 12 rows before from the current row.

### Numpy#

In Numpy, we can perform seasonal differencing by subtracting the value of the time series at the appropriate lag.

``````import numpy as np

ts = np.array([10, 20, 30, 40, 50, 10, 20, 30, 40, 50, 10, 20] * 2)

ts_diff = ts[12:] - ts[:-12]

print(ts_diff)

# output
[0 0 0 0 0 0 0 0 0 0 0 0]
``````

### Polars#

In Polars, we can perform seasonal differencing by also calling the `diff()` method with the desired lag.

``````import polars as pl

ts = pl.Series(np.array([10, 20, 30, 40, 50, 10, 20, 30, 40, 50, 10, 20] * 2))

ts_diff = ts.diff(12)

print(ts_diff)

# output
shape: (24,)
Series: '' [i32]
[
null
null
null
null
null
null
null
null
null
null
null
null
0
0
0
0
0
0
0
0
0
0
0
0
]
``````

## Log Differencing#

Log differencing involves taking the logarithm of a time series and then taking the first-order difference of the resulting sequence.

It’s a very common transformation in finance, where the log difference of a time series is often used to model the returns of a stock or other financial instrument.

### Pandas#

In Pandas, we can perform log differencing by calling the `np.log()` method and then taking the first-order difference.

``````import pandas as pd
import numpy as np

ts = pd.Series([1, 2, 3, 4, 5])

ts_diff = np.log(ts).diff()

print(ts_diff)
``````

log_diff
0 nan
1 0.693147
2 0.405465
3 0.287682
4 0.223144

The `np.log()` function uses the natural logarithm, which is the logarithm to the base e.

In case you have zeros in your time series, you can use the `np.log1p()` function instead.

This will add 1 to each value before taking the log, which will prevent it from returning infinite values.

``````import pandas as pd

ts = pd.Series([1, 2, 3, 4, 5])

ts_diff = np.log1p(ts).diff()

print(ts_diff)
``````
log1p_diff
0 nan
1 0.405465
2 0.287682
3 0.223144
4 0.182322

### Numpy#

In Numpy, we can perform log differencing by calling the `np.log()` function and then calling the `np.diff()` function.

``````import numpy as np

ts = np.array([1, 2, 4, 8, 16])

ts_diff = np.diff(np.log(ts))

print(ts_diff)

# output
[0.69314718 0.40546511 0.28768207 0.22314355]
``````

### Polars#

In Polars, we can perform log differencing by calling the `log()` method and then taking the first-order difference using the `diff()` method.

``````import polars as pl

ts = pl.Series([1, 2, 3, 4, 5])

ts_diff = ts.log().diff()

print(ts_diff)

# output
shape: (5,)
Series: '' [f64]
[
null
0.693147
0.405465
0.287682
0.223144
]
``````

To reproduce the `log1p` behavior, we can add 1 to the time series before taking the log.

``````import polars as pl

ts = pl.Series([1, 2, 3, 4, 5])

ts_diff = (ts+1).log().diff()

print(ts_diff)

# output
shape: (5,)
Series: '' [f64]
[
null
0.405465
0.287682
0.223144
0.182322
]
``````

Again, look at the y-axis to see the difference between the original and the log differenced series.

## Grouped Time Series Differencing#

When you have a DataFrame with multiple time series in the long format, you can take their differences by grouping them first.

This can be useful in cases where you have multiple time series, such as the sales of different products in a store.

### Pandas#

In Pandas, we can perform grouped time series differencing by calling the `groupby()` method and then calling `diff()`.

``````import pandas as pd

ts = pd.DataFrame({'values': [10, 20, 30, 40, 50, 100, 200, 300, 400, 500],
'groups': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B']})

ts_diff = ts.groupby('groups').diff()
ts['diff'] = ts_diff['values']

print(ts)
``````
values groups diff
10 A nan
20 A 10
30 A 10
40 A 10
50 A 10
100 B nan
200 B 100
300 B 100
400 B 100
500 B 100

### Polars#

In Polars, we can perform grouped time series differencing by calling the `diff()` method over the `pl.col(ts)` expression specifying the group column with the `over()` method.

``````import polars as pl

ts = pl.Series([10, 20, 30, 40, 50, 100, 200, 300, 400, 500])
groups = pl.Series(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])

ts = pl.DataFrame({'ts': ts, 'groups': groups})

ts_diff = ts.with_columns([pl.col('ts').diff().over(pl.col('groups')).alias('ts_diff')])

print(ts_diff)
``````
ts groups ts_diff
i64 str i64
—– ——– ———
10 A null
20 A 10
30 A 10
40 A 10
50 A 10
100 B null
200 B 100
300 B 100
400 B 100
500 B 100

## Fractional Differencing#

Traditional n-order differencing makes the data stationary but, in the process, it tends to erase the dependence of the time series on its past values.

Fractional differencing was suggested by Marcos López de Prado, in the context of financial time series, as an alternative to differentiate them without erasing its memory structure.

We can apply it using the library fracdiff.

``````import numpy as np
from fracdiff import fdiff

ts = np.array([10, 20, 30, 40, 50, 100, 200, 300, 400, 500])

ts_diff = fdiff(ts, n=0.5)

#output
array([ 10.        ,  15.        ,  18.75      ,  21.875     ,
24.609375  ,  67.0703125 , 139.32617188, 181.42089844,
214.63470459, 243.0519104 ])
``````

Tune the `n` parameter to find the coefficient that makes the data stationary while preserving as most as possible the memory structure of the time series.

You can do it by using statistical tests such as the Augmented Dickey-Fuller test or, if you plan to use the time series in a machine learning model, just tune this value as another hyperparameter in the validation set.