Analyzing correlations is a critical step in understanding complex data relationships.
It’s a fast way to find how similar two time series are.
Python offers a wide range of libraries that make calculating correlations between two time series a breeze.
In this tutorial, we’ll explore some of the most popular libraries for correlation analysis, including NumPy, Pandas, SciPy, Polars, CuPy, CuDF, Dask, and PyTorch.
Let’s get started!
Correlation Between Two Time Series Using NumPy
NumPy is the most popular Python library for numerical computing.
To compute the correlation between two time series, we can use the np.corrcoef function.
import numpy as np
x = np.random.randn(100)
y = np.random.randn(100)
corr_coef = np.corrcoef(x, y)[0, 1]
print("Correlation coefficient:", corr_coef)
This function calculates the Pearson correlation coefficient.
Just pass the two time series to the np.corrcoef function, and it will return the 2x2 correlation matrix. The off-diagonal element [0, 1] is the correlation between x and y.
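As a rough check of what np.corrcoef computes, the sketch below evaluates the Pearson formula by hand and compares it with the library result. The fixed seed and the names manual_r and numpy_r are illustrative choices, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable (illustrative choice)
x = rng.standard_normal(100)
y = rng.standard_normal(100)

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are the deviations of each series from its mean
dx = x - x.mean()
dy = y - y.mean()
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the x-y correlation
numpy_r = np.corrcoef(x, y)[0, 1]

print("Manual:", manual_r)
print("NumPy: ", numpy_r)
```

The two numbers should agree up to floating-point rounding.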
Correlation Between Two Time Series Using Pandas
Pandas is a popular Python library for data analysis built on top of NumPy.
To compute the correlation between two time series that are columns in a Pandas DataFrame, we can use the DataFrame.corr method.
import pandas as pd
df = pd.DataFrame({'x': np.random.randn(100), 'y': np.random.randn(100)})
corr_matrix = df.corr()
print("Correlation matrix:")
print(corr_matrix)
|   | x | y |
|---|---|---|
| x | 1 | -0.0592785 |
| y | -0.0592785 | 1 |
This will calculate the correlation between all pairs of columns in the DataFrame.
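If you have two Pandas Series, you can use the Series.corr method to calculate the correlation between them.
Pandas aligns the two series on their index before computing, so only labels present in both are used. Make sure the series share the same index, or the result may be based on fewer points than you expect.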
import pandas as pd
import numpy as np
series1 = pd.Series(np.random.randn(100))
series2 = pd.Series(np.random.randn(100))
series1.corr(series2)
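Because Series.corr pairs values by index label rather than by position, two series with only partly overlapping indexes are correlated on the overlap alone. The small sketch below uses made-up values and illustrative names (aligned_r, positional_r, overlap_r) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the two indexes overlap only on labels 2, 3 and 4
s1 = pd.Series([1, 2, 3, 4, 5], index=[0, 1, 2, 3, 4])
s2 = pd.Series([5, 1, 4, 2, 3], index=[2, 3, 4, 5, 6])

# Series.corr aligns on the index, so only the shared labels are used
aligned_r = s1.corr(s2)

# Ignoring the index and pairing values by position gives a different number
positional_r = np.corrcoef(s1.to_numpy(), s2.to_numpy())[0, 1]

# The aligned result matches a manual computation on the overlapping labels
shared = [2, 3, 4]
overlap_r = np.corrcoef(s1.loc[shared].to_numpy(), s2.loc[shared].to_numpy())[0, 1]

print("Aligned on index:", aligned_r)
print("By position:     ", positional_r)
print("Overlap only:    ", overlap_r)
```

If the positional pairing is what you actually want, resetting the index on both series first (for example with reset_index(drop=True)) is one way to get it.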
If you want to calculate the correlation between a DataFrame and a Series, you can use the DataFrame.corrwith method.
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': np.random.randn(100), 'y': np.random.randn(100)})
series = pd.Series(np.random.randn(100))
df.corrwith(series)
|   | pearson |
|---|---|
| x | -0.0256518 |
| y | 0.20236 |
By default, Pandas uses the Pearson correlation. To calculate the Spearman or Kendall correlation between two time series, you can use the method argument in any of the functions above.
df.corrwith(series, method='spearman')
|   | spearman |
|---|---|
| x | -0.0158176 |
| y | 0.188407 |
df.corrwith(series, method='kendall')
|   | kendall |
|---|---|
| x | -0.0109091 |
| y | 0.129697 |
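To give some intuition for what method='spearman' does, the sketch below checks that the Spearman correlation equals the Pearson correlation of the ranked data. The seed and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed (illustrative choice)
df = pd.DataFrame({'x': rng.standard_normal(100), 'y': rng.standard_normal(100)})

# Spearman correlation requested directly through the method argument
spearman_r = df.corr(method='spearman').loc['x', 'y']

# Pearson correlation (the default) applied to the ranks of each column
rank_pearson_r = df.rank().corr().loc['x', 'y']

print("Spearman:        ", spearman_r)
print("Pearson on ranks:", rank_pearson_r)
```

Since only the ordering of the values matters, Spearman is less sensitive to outliers and captures monotonic relationships that are not linear.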
Correlation Between Two Time Series Using SciPy
Another way to calculate the correlation between two time series is to use the scipy.stats module.
We can use the pearsonr function to calculate the Pearson correlation, the spearmanr function for the Spearman, and the kendalltau function to calculate the Kendall correlation coefficient.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau
x = np.random.randn(100)
y = np.random.randn(100)
pearson_coef, _ = pearsonr(x, y)
print("Pearson correlation coefficient:", pearson_coef)
spearman_coef, _ = spearmanr(x, y)
print("Spearman correlation coefficient:", spearman_coef)
kendall_coef, _ = kendalltau(x, y)
print("Kendall correlation coefficient:", kendall_coef)
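Each of these SciPy functions returns two values: the coefficient and a p-value, which the snippet above discards with `_`. The sketch below keeps the p-value for pearsonr. The synthetic data, where y depends on x by construction, and the seed are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)  # fixed seed (illustrative choice)
x = rng.standard_normal(200)
y = 0.5 * x + rng.standard_normal(200)  # y is linearly related to x plus noise

# The result unpacks into the coefficient and the p-value of the test
# that the true correlation is zero
r, p_value = pearsonr(x, y)

print(f"Pearson r = {r:.3f}, p-value = {p_value:.3g}")
```

Keep in mind that the p-value assumes independent observations, which autocorrelated time series often violate, so treat it as a rough indication rather than a strict test.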
Correlation Between Two Time Series Using Polars
Polars is a newer DataFrame library for Python, written in Rust, that is gaining popularity for data analysis thanks to its speed and ease of use.
It offers much of the same functionality as Pandas, often with considerably faster performance.
import polars as pl
df = pl.DataFrame({'x': pl.Series(np.random.randn(100)), 'y': pl.Series(np.random.randn(100))})
corr = df.select(pl.corr('x', 'y'))
print(corr)
| x |
|---|
| f64 |
| 0.171804 |
To get the Spearman correlation, you can use the method argument of the pl.corr function.
df.select(pl.corr('x', 'y', method='spearman'))
| x |
|---|
| f64 |
| 0.141122 |
Correlation Between Two Time Series Using CuPy
If you have a GPU, you can use CuPy to calculate the correlation between two time series.
It’s a library inspired by NumPy that uses the GPU to accelerate the calculations, so you can expect very similar function names.
It is worth trying the same NumPy function name in CuPy first to see if it works.
Here we can use the cp.corrcoef function.
import cupy as cp
x = cp.random.randn(100)
y = cp.random.randn(100)
corr_coef = cp.corrcoef(x, y)[0, 1]
print("Correlation coefficient:", corr_coef)
Correlation Between Two Time Series Using CuDF
Just like you can think of CuPy as a GPU version of NumPy, you can think of CuDF as a GPU version of Pandas.
We can easily compute the correlation between two time series that are columns in a CuDF DataFrame with the DataFrame.corr method.
import cudf
import cupy as cp
df = cudf.DataFrame({'x': cp.random.randn(100), 'y': cp.random.randn(100)})
corr_matrix = df.corr()
print("Correlation matrix:")
print(corr_matrix)
Like in Pandas, this will calculate the correlation between all pairs of columns in the DataFrame.
If you have two CuDF Series, you can use the Series.corr method to calculate the correlation between them.
series1 = cudf.Series(cp.random.randn(100))
series2 = cudf.Series(cp.random.randn(100))
series1.corr(series2)
By default, CuDF uses the Pearson correlation, but it has the same method argument as Pandas to calculate the Spearman correlation.
df.corr(method='spearman')
Correlation Between Two Time Series Using Dask
Another library inspired by Pandas is Dask.
It’s a library that allows you to scale your Pandas code to work with datasets that don’t fit in memory.
To calculate the correlation between two time series, you can use the corr method of a Dask DataFrame. The result is lazy, so you need to call compute() to get the actual values.
import dask.dataframe as dd
import pandas as pd
import numpy as np
pandas_df = pd.DataFrame({'x': np.random.randn(100), 'y': np.random.randn(100)})
df = dd.from_pandas(pandas_df, npartitions=2)
corr_matrix = df.corr()
print("Correlation matrix:")
print(corr_matrix.compute())
|   | x | y |
|---|---|---|
| x | 1 | 0.101782 |
| y | 0.101782 | 1 |
Correlation Between Two Time Series Using PyTorch
PyTorch has a simple torch.corrcoef function that you can use to calculate the correlation between two time series.
import torch
x = torch.randn((100,2))
corr_coef = torch.corrcoef(x.T)
Unlike the DataFrame-based libraries above, this function treats each row as a variable and each column as an observation, which is also what np.corrcoef does by default.
So if your series are in columns, you need to transpose the matrix before passing it to the function.
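The same orientation issue can be seen with np.corrcoef, which by default also treats each row as a variable. The sketch below is a NumPy illustration rather than PyTorch code, and the variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed (illustrative choice)
data = rng.standard_normal((100, 2))  # 100 observations (rows), 2 series (columns)

# Transposing puts each series in its own row, giving a 2x2 matrix
by_transpose = np.corrcoef(data.T)

# Alternatively, tell NumPy that the columns are the variables
by_rowvar = np.corrcoef(data, rowvar=False)

# Without either fix, each of the 100 rows is treated as a variable
wrong_shape = np.corrcoef(data).shape

print(by_transpose.shape, by_rowvar.shape, wrong_shape)
```

A quick look at the shape of the result is an easy way to catch a missing transpose.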