Analyzing correlations is a critical step in understanding complex data relationships.
It’s a fast way to find how similar two time series are.
Python offers a wide range of libraries that make calculating correlations between two time series a breeze.
In this tutorial, we’ll explore some of the most popular libraries for correlation analysis, including NumPy, Pandas, SciPy, Polars, CuPy, CuDF, Dask, and PyTorch.
Let’s get started!
Correlation Between Two Time Series Using NumPy
NumPy is the most popular Python library for numerical computing.
To compute the correlation between two time series, we can use the np.corrcoef function.
import numpy as np
x = np.random.randn(100)
y = np.random.randn(100)
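corr_matrix = np.corrcoef(x, y)  # 2x2 correlation matrix
corr_coef = corr_matrix[0, 1]    # off-diagonal entry: correlation between x and y
print("Correlation coefficient:", corr_coef)
This function calculates the Pearson correlation coefficient.
Just pass the two time series to np.corrcoef, and it will return a 2x2 correlation matrix; the off-diagonal entry [0, 1] is the correlation between x and y.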
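To make the definition concrete, the sketch below compares the value from np.corrcoef with the Pearson formula written out by hand (covariance divided by the product of the standard deviations). The seed and variable names are arbitrary choices for illustration.

```python
import numpy as np

# Seeded generator so the example is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)
x = rng.standard_normal(100)
y = rng.standard_normal(100)

corr_matrix = np.corrcoef(x, y)
r = corr_matrix[0, 1]  # correlation between x and y

# Pearson r by hand: covariance over the product of standard deviations
# (both use the same 1/n normalization, so the factor cancels)
manual_r = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())

print("Matrix shape:", corr_matrix.shape)
print("Values agree:", np.isclose(r, manual_r))
```

The diagonal of the matrix is always 1, since every series is perfectly correlated with itself.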
Correlation Between Two Time Series Using Pandas
Pandas is a popular Python library for data analysis built on top of NumPy.
To compute the correlation between two time series that are columns in a Pandas DataFrame, we can use the DataFrame.corr method.
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': np.random.randn(100), 'y': np.random.randn(100)})
corr_matrix = df.corr()
print("Correlation matrix:")
print(corr_matrix)
| | x | y |
|---|---|---|
| x | 1 | -0.0592785 |
| y | -0.0592785 | 1 |
This will calculate the correlation between all pairs of columns in the DataFrame.
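When only one pair matters, the coefficient can be read out of the matrix by label. A minimal sketch, with an arbitrary seed and the same column names as above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
df = pd.DataFrame({"x": rng.standard_normal(100), "y": rng.standard_normal(100)})

corr_matrix = df.corr()

# Select the x/y entry by row and column label
xy_corr = corr_matrix.loc["x", "y"]
print("Correlation between x and y:", xy_corr)
```

The matrix is symmetric, so corr_matrix.loc["y", "x"] gives the same value.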
If you have two Pandas Series, you can use the Series.corr method to calculate the correlation between them.
The two series should share the same index, because Pandas aligns the values on their index labels before computing the correlation and ignores labels that appear in only one of them.
import pandas as pd
import numpy as np
series1 = pd.Series(np.random.randn(100))
series2 = pd.Series(np.random.randn(100))
series1.corr(series2)
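The effect of index alignment is easiest to see with two series that hold the same values under shifted labels. The sketch below is an illustration with made-up indices, not something from a real dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
values = rng.standard_normal(10)

a = pd.Series(values, index=range(0, 10))
b = pd.Series(values, index=range(5, 15))  # same values, shifted labels

# Only labels 5..9 are shared, so values[5:10] from a are paired
# with values[0:5] from b before the correlation is computed
aligned_corr = a.corr(b)

# Dropping the index compares the values position by position instead
positional_corr = np.corrcoef(a.to_numpy(), b.to_numpy())[0, 1]

print("Aligned on index:", aligned_corr)
print("Position by position:", positional_corr)
```

If the positional comparison is the one you want, resetting the index on both series (or converting to NumPy arrays, as here) avoids the alignment step.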
If you want to calculate the correlation between a DataFrame and a Series, you can use the DataFrame.corrwith method.
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': np.random.randn(100), 'y': np.random.randn(100)})
series = pd.Series(np.random.randn(100))
df.corrwith(series)
| | pearson |
|---|---|
| x | -0.0256518 |
| y | 0.20236 |
By default, Pandas uses the Pearson correlation. To calculate the Spearman or Kendall correlation between two time series instead, pass the method argument to any of the methods above.
df.corrwith(series, method='spearman')
| | spearman |
|---|---|
| x | -0.0158176 |
| y | 0.188407 |
df.corrwith(series, method='kendall')
| | kendall |
|---|---|
| x | -0.0109091 |
| y | 0.129697 |
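To see why the choice of method matters, it may help to compare the three on a relationship that is monotonic but not linear. In this sketch the second series is simply the cube of the first (an assumption made for illustration), and it also shows that the Spearman coefficient matches the Pearson coefficient computed on ranks:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
s1 = pd.Series(rng.standard_normal(100))
s2 = s1 ** 3  # monotonic, but nonlinear, function of s1

pearson = s1.corr(s2)                      # measures linear association
spearman = s1.corr(s2, method="spearman")  # rank-based
kendall = s1.corr(s2, method="kendall")    # based on concordant pairs

# Spearman is the Pearson correlation of the ranks
rank_pearson = s1.rank().corr(s2.rank())

print("Pearson:", pearson)
print("Spearman:", spearman)
print("Kendall:", kendall)
print("Pearson on ranks:", rank_pearson)
```

Because cubing preserves the ordering of the values, the rank-based coefficients reach 1 while Pearson stays below it.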
Correlation Between Two Time Series Using SciPy
Another way to calculate the correlation between two time series is to use the scipy.stats module.
We can use the pearsonr function to calculate the Pearson correlation, the spearmanr function for the Spearman, and the kendalltau function to calculate the Kendall correlation coefficient.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau
x = np.random.randn(100)
y = np.random.randn(100)
pearson_coef, _ = pearsonr(x, y)
print("Pearson correlation coefficient:", pearson_coef)
spearman_coef, _ = spearmanr(x, y)
print("Spearman correlation coefficient:", spearman_coef)
kendall_coef, _ = kendalltau(x, y)
print("Kendall correlation coefficient:", kendall_coef)
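The second value returned by each of these functions, discarded above with an underscore, is a p-value. A minimal sketch of keeping it, using synthetic data where y depends on x by construction (the 0.5 coefficient and the seed are arbitrary):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)  # arbitrary seed
x = rng.standard_normal(100)
y = 0.5 * x + rng.standard_normal(100)  # y depends on x by construction

# pearsonr returns the coefficient and a two-sided p-value
r, p_value = pearsonr(x, y)
print(f"r = {r:.3f}, p-value = {p_value:.3g}")
```

Treat the p-value with some caution for time series: it is derived under the assumption of independent samples, which autocorrelated data usually violates.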
Correlation Between Two Time Series Using Polars
Polars is a new Python library built on top of Rust that is gaining popularity for data analysis for its speed and ease of use.
It covers much of the same functionality as Pandas, often with much faster performance.
import numpy as np
import polars as pl
df = pl.DataFrame({'x': pl.Series(np.random.randn(100)), 'y': pl.Series(np.random.randn(100))})
corr = df.select(pl.corr('x', 'y'))
print(corr)
| x |
|---|
| f64 |
| 0.171804 |
To get the Spearman correlation, you can use the method argument of the pl.corr function.
df.select(pl.corr('x', 'y', method='spearman'))
| x |
|---|
| f64 |
| 0.141122 |
Correlation Between Two Time Series Using CuPy
If you have a GPU, you can use CuPy to calculate the correlation between two time series.
It’s a library inspired by NumPy that uses the GPU to accelerate the calculations, so you can expect very similar function names.
When in doubt, try the NumPy function name with CuPy first; it often works unchanged.
Here we can use the cp.corrcoef function.
import cupy as cp
x = cp.random.randn(100)
y = cp.random.randn(100)
corr_coef = cp.corrcoef(x, y)[0, 1]
print("Correlation coefficient:", corr_coef)
Correlation Between Two Time Series Using CuDF
Just like you can think of CuPy as a GPU version of NumPy, you can think of CuDF as a GPU version of Pandas.
We can easily compute the correlation between two time series that are columns in a CuDF DataFrame with the DataFrame.corr method.
import cudf
import cupy as cp
df = cudf.DataFrame({'x': cp.random.randn(100), 'y': cp.random.randn(100)})
corr_matrix = df.corr()
print("Correlation matrix:")
print(corr_matrix)
Like in Pandas, this will calculate the correlation between all pairs of columns in the DataFrame.
If you have two CuDF Series, you can use the Series.corr method to calculate the correlation between them.
series1 = cudf.Series(cp.random.randn(100))
series2 = cudf.Series(cp.random.randn(100))
series1.corr(series2)
By default, CuDF uses the Pearson correlation, but it has the same method argument as Pandas to calculate the Spearman correlation.
df.corr(method='spearman')
Correlation Between Two Time Series Using Dask
Another library inspired by Pandas is Dask.
It’s a library that allows you to scale your Pandas code to work with datasets that don’t fit in memory.
To calculate the correlation between two time series, you can use the DataFrame.corr method of a Dask DataFrame. The result is lazy, so call compute() to get the actual values.
import dask.dataframe as dd
import pandas as pd
import numpy as np
pandas_df = pd.DataFrame({'x': np.random.randn(100), 'y': np.random.randn(100)})
df = dd.from_pandas(pandas_df, npartitions=2)
corr_matrix = df.corr()
print("Correlation matrix:")
print(corr_matrix.compute())
| | x | y |
|---|---|---|
| x | 1 | 0.101782 |
| y | 0.101782 | 1 |
Correlation Between Two Time Series Using PyTorch
PyTorch has a simple torch.corrcoef function that you can use to calculate the correlation between two time series.
import torch
x = torch.randn((100,2))
corr_coef = torch.corrcoef(x.T)
Like np.corrcoef, and unlike the DataFrame-based libraries above, this function treats each row as a variable and each column as an observation.
So if your series are in columns, you need to transpose the matrix before passing it to the function.
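The row-versus-column convention is easy to get wrong. NumPy's np.corrcoef follows the same rows-as-variables default, so the sketch below uses it as a stand-in (not torch.corrcoef itself) to show what happens to the output shape with and without the transpose:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
data = rng.standard_normal((100, 2))  # 100 observations, 2 series in columns

# Without a transpose, each of the 100 rows is treated as a variable
wrong = np.corrcoef(data)

# Transposing puts the 2 series in rows, giving the expected 2x2 matrix
right = np.corrcoef(data.T)

# NumPy also offers rowvar=False as an alternative to transposing
also_right = np.corrcoef(data, rowvar=False)

print("Without transpose:", wrong.shape)
print("With transpose:", right.shape)
```

A result whose shape matches the number of observations rather than the number of series is a quick sign that the transpose was missed.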