When working with time series data, differencing is a common technique used to make the data stationary.
Stationary data is important because it allows us to apply statistical models that assume constant parameters (like the mean and standard deviation) over time, and this can improve the accuracy of our predictions.
Let’s see how we can easily perform differencing in Python using Pandas, Numpy, and Polars.
First-order Differencing
First-order differencing involves subtracting the previous value from each value in the time series.
Pandas
In Pandas, we can perform first-order differencing using the diff()
method. Here’s an example:
import pandas as pd
ts = pd.Series([10, 20, 30, 40, 50])
ts_diff = ts.diff()
print(ts_diff)
diff | |
---|---|
0 | nan |
1 | 10 |
2 | 10 |
3 | 10 |
4 | 10 |
Note that the first value is NaN because there is no previous value to subtract from it.
Notice the y-axis values in the plot to see the difference.
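To make the mechanics concrete, here is a minimal sketch showing that diff() gives the same result as subtracting a copy of the series shifted by one row. The name manual_diff is only for illustration.

```python
import pandas as pd

ts = pd.Series([10, 20, 30, 40, 50])

# shift(1) moves every value down one row, so row t holds the value from t-1.
# The first row has nothing before it and becomes NaN.
manual_diff = ts - ts.shift(1)

# Compare the manual version with the built-in method
print(manual_diff.equals(ts.diff()))
```

Writing it out this way also makes it clear why the result is a float series: the NaN in the first row forces a float dtype.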
Numpy
In Numpy, we can perform first-order differencing using the np.diff()
function.
import numpy as np
ts = np.array([10, 20, 30, 40, 50])
ts_diff = np.diff(ts)
print(ts_diff)
#output
[10 10 10 10]
This time the result has one fewer element than the input: np.diff() drops the first position instead of filling it with NaN.
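Because np.diff() returns a shorter array, the result no longer lines up with the original positions. If you need the same length as the input (for example, to store it next to the original values), one possible approach is to put a NaN in front. This is a sketch, and ts_diff_aligned is just an illustrative name.

```python
import numpy as np

ts = np.array([10, 20, 30, 40, 50])

# Put a NaN in front so index i of the result refers to index i of the input.
# NaN is a float, so the aligned result is a float array.
ts_diff_aligned = np.concatenate(([np.nan], np.diff(ts)))
print(ts_diff_aligned)
```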
Polars
In Polars, we can perform first-order differencing using the diff()
method.
import polars as pl
ts = pl.Series([10, 20, 30, 40, 50])
ts_diff = ts.diff()
print(ts_diff)
# output
shape: (5,)
Series: '' [i64]
[
null
10
10
10
10
]
Again, the first value is null because there is no previous value to subtract from it.
Second-order Differencing
Second-order differencing involves applying first-order differencing twice: we take the difference of the series, then take the difference of the result.
This can be useful in some cases where a first-order difference is not enough to make the time series stationary.
Pandas
In Pandas, we can perform second-order differencing by calling the diff()
method twice.
import pandas as pd
ts = pd.Series([10, 20, 30, 40, 50])
ts_diff = ts.diff().diff()
print(ts_diff)
diff_diff | |
---|---|
0 | nan |
1 | nan |
2 | 0 |
3 | 0 |
4 | 0 |
Now our first and second values are NaNs.
Notice the y-axis values in the plot to see the difference.
Numpy
In Numpy, we can perform second-order differencing by calling the np.diff()
function twice.
import numpy as np
ts = np.array([10, 20, 30, 40, 50])
ts_diff = np.diff(np.diff(ts))
print(ts_diff)
# output
[0 0 0]
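The example series above is a straight line, so even its first difference is already constant. Here is a small sketch with a made-up quadratic series, where one round of differencing is not enough. It also uses the n argument of np.diff(), which repeats the differencing for us, and checks the result against the formula written out by hand.

```python
import numpy as np

# A quadratic trend: 0, 1, 4, 9, 16, 25
ts = np.arange(6) ** 2

first = np.diff(ts)         # still trending upward
second = np.diff(ts, n=2)   # same as np.diff(np.diff(ts))

# Second-order differencing written out: y[t] - 2*y[t-1] + y[t-2]
manual = ts[2:] - 2 * ts[1:-1] + ts[:-2]

print(first)
print(second)
print(np.array_equal(second, manual))
```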
Polars
In Polars, we can perform second-order differencing by calling the diff()
method twice.
import polars as pl
ts = pl.Series([10, 20, 30, 40, 50])
ts_diff = ts.diff().diff()
print(ts_diff)
# output
shape: (5,)
Series: '' [i64]
[
null
null
0
0
0
]
Basically the same as Pandas.
Seasonal Differencing
Seasonal differencing involves subtracting the value at the same point in the previous season from the current value of the time series.
This can be useful in cases where the time series exhibits a seasonal pattern.
For example, if we have a time series of monthly sales data, we would subtract the value of the same month in the previous year from the current month.
If we have a time series of daily sales data, we could subtract the value of the same day in the previous week from the current day.
Pandas
In Pandas, we can perform seasonal differencing by calling the diff()
method with the appropriate lag.
import pandas as pd
dates = pd.date_range(start='2021-12', end='2023-12', freq='M')
ts = pd.Series([10, 20, 30, 40, 50, 10, 20, 30, 40, 50, 10, 20] * 2, index=dates)
ts_diff = ts.diff(periods=12)
print(ts_diff)
ts | ts_diff | |
---|---|---|
2021-12-31 00:00:00 | 10 | nan |
2022-01-31 00:00:00 | 20 | nan |
2022-02-28 00:00:00 | 30 | nan |
2022-03-31 00:00:00 | 40 | nan |
2022-04-30 00:00:00 | 50 | nan |
2022-05-31 00:00:00 | 10 | nan |
2022-06-30 00:00:00 | 20 | nan |
2022-07-31 00:00:00 | 30 | nan |
2022-08-31 00:00:00 | 40 | nan |
2022-09-30 00:00:00 | 50 | nan |
2022-10-31 00:00:00 | 10 | nan |
2022-11-30 00:00:00 | 20 | nan |
2022-12-31 00:00:00 | 10 | 0 |
2023-01-31 00:00:00 | 20 | 0 |
2023-02-28 00:00:00 | 30 | 0 |
2023-03-31 00:00:00 | 40 | 0 |
2023-04-30 00:00:00 | 50 | 0 |
2023-05-31 00:00:00 | 10 | 0 |
2023-06-30 00:00:00 | 20 | 0 |
2023-07-31 00:00:00 | 30 | 0 |
2023-08-31 00:00:00 | 40 | 0 |
2023-09-30 00:00:00 | 50 | 0 |
2023-10-31 00:00:00 | 10 | 0 |
2023-11-30 00:00:00 | 20 | 0 |
periods=12
tells Pandas to subtract the value 12 rows before from the current row.
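The daily case mentioned earlier works the same way with a lag of 7. Below is a minimal sketch with made-up daily data: a weekly pattern repeated three times, plus one extra unit each week. The names weekly_pattern and weekly_growth are only for illustration.

```python
import numpy as np
import pandas as pd

# Three weeks of made-up daily data starting on a Monday
dates = pd.date_range(start='2023-01-02', periods=21, freq='D')
weekly_pattern = np.tile([5, 6, 7, 8, 9, 15, 20], 3)  # same shape every week
weekly_growth = np.repeat([0, 1, 2], 7)               # +1 for each new week
ts = pd.Series(weekly_pattern + weekly_growth, index=dates)

# Subtract the value from the same weekday one week earlier
ts_diff = ts.diff(periods=7)

# The first 7 rows are NaN; the rest show only the week-over-week growth
print(ts_diff.dropna())
```

The weekly pattern cancels out and only the week-over-week change remains, which is the point of seasonal differencing.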
Numpy
In Numpy, np.diff() has no lag argument, so we perform seasonal differencing by slicing: we subtract the array without its last 12 values from the array without its first 12 values.
import numpy as np
ts = np.array([10, 20, 30, 40, 50, 10, 20, 30, 40, 50, 10, 20] * 2)
ts_diff = ts[12:] - ts[:-12]
print(ts_diff)
# output
[0 0 0 0 0 0 0 0 0 0 0 0]
Polars
In Polars, we can perform seasonal differencing by also calling the diff()
method with the desired lag.
import numpy as np
import polars as pl
ts = pl.Series(np.array([10, 20, 30, 40, 50, 10, 20, 30, 40, 50, 10, 20] * 2))
ts_diff = ts.diff(12)
print(ts_diff)
# output
shape: (24,)
Series: '' [i32]
[
null
null
null
null
null
null
null
null
null
null
null
null
0
0
0
0
0
0
0
0
0
0
0
0
]
Log Differencing
Log differencing involves taking the logarithm of a time series and then taking the first-order difference of the resulting sequence.
It’s a very common transformation in finance, where the log difference of a time series is often used to model the returns of a stock or other financial instrument.
Pandas
In Pandas, we can perform log differencing by applying the np.log()
function to the Series and then taking the first-order difference.
import pandas as pd
import numpy as np
ts = pd.Series([1, 2, 3, 4, 5])
ts_diff = np.log(ts).diff()
print(ts_diff)
log_diff | |
---|---|
0 | nan |
1 | 0.693147 |
2 | 0.405465 |
3 | 0.287682 |
4 | 0.223144 |
The np.log()
function uses the natural logarithm, which is the logarithm to the base e.
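To connect this with the returns mentioned above, here is a small sketch with made-up prices. It shows that the log difference equals the log of the ratio between consecutive prices, that it stays close to the simple percentage return when changes are small, and that the original series can be rebuilt from it. All variable names are illustrative.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 101.0, 99.0, 102.0])  # made-up prices

# Log difference: log(p[t]) - log(p[t-1])
log_returns = np.log(prices).diff()

# Same quantity written as the log of the price ratio
log_ratio = np.log(prices / prices.shift(1))

# Simple returns for comparison: p[t] / p[t-1] - 1
simple_returns = prices / prices.shift(1) - 1

# Undo the transformation: cumulative sum of log returns, then exponentiate
rebuilt = prices.iloc[0] * np.exp(log_returns.fillna(0).cumsum())

print(pd.DataFrame({'log': log_returns, 'simple': simple_returns}))
print(rebuilt)
```

Log returns add up over time, which is why a cumulative sum is enough to recover the original prices.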
In case you have zeros in your time series, you can use the np.log1p()
function instead.
This will add 1 to each value before taking the log, which will prevent it from returning infinite values.
import numpy as np
import pandas as pd
ts = pd.Series([1, 2, 3, 4, 5])
ts_diff = np.log1p(ts).diff()
print(ts_diff)
log1p_diff | |
---|---|
0 | nan |
1 | 0.405465 |
2 | 0.287682 |
3 | 0.223144 |
4 | 0.182322 |
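The example above has no zeros, so here is a sketch with a made-up series that starts at zero, to show what actually goes wrong with np.log() and what np.log1p() gives instead. The names plain and shifted are illustrative.

```python
import numpy as np
import pandas as pd

ts = pd.Series([0, 1, 3, 7])  # made-up series that starts at zero

# log(0) is -inf, so the first difference after it becomes +inf.
# errstate only silences the divide-by-zero warning; the result is unchanged.
with np.errstate(divide='ignore'):
    plain = np.log(ts).diff()

# log1p(x) = log(1 + x), so the zero maps to log(1) = 0 and everything stays finite
shifted = np.log1p(ts).diff()

print(plain)
print(shifted)
```

Keep in mind that the two transforms give different numbers even where both are finite, so log1p differences should not be read as log returns.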
Numpy
In Numpy, we can perform log differencing by calling the np.log()
function and then calling the np.diff()
function.
import numpy as np
ts = np.array([1, 2, 3, 4, 5])
ts_diff = np.diff(np.log(ts))
print(ts_diff)
# output
[0.69314718 0.40546511 0.28768207 0.22314355]
Polars
In Polars, we can perform log differencing by calling the log()
method and then taking the first-order difference using the diff()
method.
import polars as pl
ts = pl.Series([1, 2, 3, 4, 5])
ts_diff = ts.log().diff()
print(ts_diff)
# output
shape: (5,)
Series: '' [f64]
[
null
0.693147
0.405465
0.287682
0.223144
]
To reproduce the log1p
behavior, we can add 1 to the time series before taking the log.
import polars as pl
ts = pl.Series([1, 2, 3, 4, 5])
ts_diff = (ts+1).log().diff()
print(ts_diff)
# output
shape: (5,)
Series: '' [f64]
[
null
0.405465
0.287682
0.223144
0.182322
]
Again, look at the y-axis to see the difference between the original and the log differenced series.
Grouped Time Series Differencing
When you have a DataFrame with multiple time series in the long format, you can take their differences by grouping them first.
This can be useful in cases where you have multiple time series, such as the sales of different products in a store.
Pandas
In Pandas, we can perform grouped time series differencing by calling the groupby()
method and then calling diff()
.
import pandas as pd
ts = pd.DataFrame({'values': [10, 20, 30, 40, 50, 100, 200, 300, 400, 500],
'groups': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B']})
ts_diff = ts.groupby('groups').diff()
ts['diff'] = ts_diff['values']
print(ts)
values | groups | diff |
---|---|---|
10 | A | nan |
20 | A | 10 |
30 | A | 10 |
40 | A | 10 |
50 | A | 10 |
100 | B | nan |
200 | B | 100 |
300 | B | 100 |
400 | B | 100 |
500 | B | 100 |
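Grouped differencing works row by row inside each group, so it assumes the rows are already in time order. Here is a minimal sketch with a made-up, shuffled DataFrame that has a hypothetical date column; it sorts first and then differences only the values column.

```python
import pandas as pd

# Made-up long-format data with rows out of order
df = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-02', '2023-01-01', '2023-01-03',
                            '2023-01-01', '2023-01-03', '2023-01-02']),
    'groups': ['A', 'A', 'A', 'B', 'B', 'B'],
    'values': [20, 10, 30, 100, 300, 200],
})

# Sort by group and date so "previous row" means "previous point in time"
df = df.sort_values(['groups', 'date'])

# Selecting the column before diff() avoids differencing other columns
df['diff'] = df.groupby('groups')['values'].diff()
print(df)
```

Without the sort, the differences would be taken between rows in whatever order they happen to appear.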
Polars
In Polars, we can perform grouped time series differencing by calling the diff()
method over the pl.col('ts')
expression specifying the group column with the over()
method.
import polars as pl
ts = pl.Series([10, 20, 30, 40, 50, 100, 200, 300, 400, 500])
groups = pl.Series(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])
ts = pl.DataFrame({'ts': ts, 'groups': groups})
ts_diff = ts.with_columns([pl.col('ts').diff().over(pl.col('groups')).alias('ts_diff')])
print(ts_diff)
ts | groups | ts_diff |
---|---|---|
i64 | str | i64 |
10 | A | null |
20 | A | 10 |
30 | A | 10 |
40 | A | 10 |
50 | A | 10 |
100 | B | null |
200 | B | 100 |
300 | B | 100 |
400 | B | 100 |
500 | B | 100 |
Fractional Differencing
Traditional integer-order differencing makes the data stationary but, in the process, tends to erase the dependence of the time series on its past values.
Fractional differencing was popularized by Marcos López de Prado, in the context of financial time series, as an alternative that differences a series without erasing its memory structure.
We can apply it using the fracdiff library.
import numpy as np
from fracdiff import fdiff
ts = np.array([10, 20, 30, 40, 50, 100, 200, 300, 400, 500])
ts_diff = fdiff(ts, n=0.5)
#output
array([ 10. , 15. , 18.75 , 21.875 ,
24.609375 , 67.0703125 , 139.32617188, 181.42089844,
214.63470459, 243.0519104 ])
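To get a feel for where these numbers come from, here is a plain Numpy sketch of the textbook definition: each output is a weighted sum of the current and past values, with weights taken from the binomial expansion of (1 - B)^d, where B is the lag operator. This is not the fracdiff library's code, and the helper names frac_diff_weights and frac_diff are made up for illustration. It assumes the window covers the whole series, which is the case for this short example.

```python
import numpy as np

def frac_diff_weights(d, size):
    # Recursive form of the binomial weights: w[0] = 1, w[k] = -w[k-1] * (d - k + 1) / k
    weights = [1.0]
    for k in range(1, size):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return np.array(weights)

def frac_diff(x, d):
    # out[t] = sum over k of w[k] * x[t-k], using only values up to time t
    x = np.asarray(x, dtype=float)
    weights = frac_diff_weights(d, len(x))
    return np.convolve(x, weights)[:len(x)]

ts = np.array([10, 20, 30, 40, 50, 100, 200, 300, 400, 500])

print(frac_diff_weights(0.5, 5))
print(frac_diff(ts, 0.5))
```

With d=1 the weights collapse to 1 and -1 followed by zeros, which is ordinary first-order differencing. With d=0.5 the weights fade out slowly, so each output still carries information from older values.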
Tune the n
parameter to find the coefficient that makes the data stationary while preserving as much of the memory structure of the time series as possible.
You can do it by using statistical tests such as the Augmented Dickey-Fuller test or, if you plan to use the time series in a machine learning model, just tune this value as another hyperparameter on the validation set.