Clustering is an unsupervised learning technique that can help you uncover hidden patterns in your time series data.

Scikit-learn has a wide range of clustering algorithms, including K-means, DBSCAN, and Agglomerative Clustering.

In this tutorial, we’ll explore how to use K-means with different transformations to cluster time series data.

Using the right data transformations can help you get your desired results faster than just trying different clustering algorithms over the same data.

To use other clustering algorithms, you can follow the same steps, just change the algorithm object.

Let’s get started!

Loading The Data With Pandas

First things first, we need to load our data.

In this tutorial, we’ll be using a dataset containing website traffic data from lacity.org.

The dataset contains information like date, device category, browser, and the number of sessions.

Sessions are periods of time when a user is actively using the website.

Usually they are measured by looking at user activity, and a session expires after 30 minutes without any user action.

For example, right now you are in a session on my website. If you leave the page and come back after 2 hours, you will be in a new session.
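To make that definition concrete, here is a rough sketch of how raw events could be grouped into sessions with a 30-minute inactivity rule. The events DataFrame and its columns are hypothetical, not part of the dataset we’ll use:

import pandas as pd

# Hypothetical event log: one row per user action (not part of our dataset)
events = pd.DataFrame({
    'user_id': ['a', 'a', 'a', 'b'],
    'timestamp': pd.to_datetime(['2014-01-01 10:00', '2014-01-01 10:20',
                                 '2014-01-01 13:00', '2014-01-01 10:05']),
}).sort_values(['user_id', 'timestamp'])

# A new session starts whenever more than 30 minutes pass since the user's previous action
new_session = events.groupby('user_id')['timestamp'].diff() > pd.Timedelta(minutes=30)
events['session_id'] = new_session.groupby(events['user_id']).cumsum()
print(events)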

Let’s start by loading the data using pandas:

import pandas as pd
import os

path = 'path/to/your/dataset'
data = pd.read_csv(os.path.join(path, 'lacity.org-website-traffic.csv'),
                   parse_dates=['Date']).loc[:, ['Date', 'Device Category', 'Browser', 'Sessions']]
data.head()
                 Date Device Category            Browser  Sessions
0 2014-01-01 00:00:00         desktop             Chrome       934
1 2014-01-01 00:00:00         desktop            Firefox       761
2 2014-01-01 00:00:00         desktop  Internet Explorer      1107
3 2014-01-01 00:00:00         desktop              Opera        35
4 2014-01-01 00:00:00         desktop             Safari       554

Now, we need to reshape the data into a wide format, where each row represents a unique combination of device category and browser.

In other words, each of our rows needs to represent a single time series.

We want to cluster time series, so each of them becomes a data point.

We’ll create a new column unique_id for this purpose:

data['unique_id'] = data['Device Category'] + "_" + data['Browser']
wide_df = data.groupby(['Date', 'unique_id'])['Sessions'].sum().unstack().T.fillna(0)
wide_df.head()
unique_id 2014-01-01 00:00:00 2014-01-02 00:00:00 2014-01-03 00:00:00 2014-01-04 00:00:00 2014-01-05 00:00:00
desktop_"mozilla 0 0 0 0 0
desktop_'Mozilla 0 0 0 0 0
desktop_(not set) 0 0 0 0 0
desktop_AdobeAIR 0 0 0 0 0
desktop_Bluebeam Revu Browser - cef version: 57.0.0.0 0 0 0 0 0

Great! Now that we’ve got our data in the desired format, let’s move on to the clustering part.

Raw Time Series Clustering

We’ll use scikit-learn’s K-means clustering algorithm to group our time series data.

K-means clustering is a distance-based algorithm, which means it calculates the distance between data points to group them into clusters.

When working with time series data containing different scales or units, it’s important to preprocess the data to ensure that the algorithm captures patterns that are independent of the scale.

For example, consider data for people using Chrome on a desktop and people using Opera on a mobile device.

The number of sessions for the former will be much higher than for the latter, but our objective is to group similar patterns together regardless of their scale.

To achieve this, we’ll preprocess the data using MaxAbsScaler, which scales the data by its maximum absolute value.

The reason for choosing MaxAbsScaler over StandardScaler is that some series in our dataset have a lot of zeros, making the data sparse.

Using StandardScaler would break the sparse structure of the data, which is usually not desirable.

However, you can use whichever scaler you prefer or the one that works best for your data.
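To make the difference concrete, here is a tiny illustration with made-up numbers (not from our dataset): MaxAbsScaler keeps the zeros at zero, while StandardScaler shifts them.

import numpy as np
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# A toy series that is mostly zeros, like many of our low-traffic browsers
toy = np.array([[0.], [0.], [0.], [10.], [50.]])

print(MaxAbsScaler().fit_transform(toy).ravel())    # [0.  0.  0.  0.2 1. ] -- zeros stay zeros
print(StandardScaler().fit_transform(toy).ravel())  # zeros become non-zero negative values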

from sklearn.cluster import KMeans
from sklearn.preprocessing import MaxAbsScaler

prep = MaxAbsScaler()
kmeans = KMeans(n_clusters=4, random_state=0)

scaled_data = prep.fit_transform(wide_df)
kmeans.fit(scaled_data)

We import the objects we need to scale and cluster the data from sklearn.

Then we create an instance of the MaxAbsScaler class, which will be used to preprocess our data before fitting the clustering model.

Next, we create an instance of the KMeans class with n_clusters=4, specifying that we want to group the data into four clusters.

We also set the random_state parameter to 0, which ensures that the algorithm’s random initialization is consistent every time the code is run.

This helps in obtaining reproducible results.

I set the n_clusters parameter to 4, but you can choose any number of clusters you want.

I will talk more about choosing the right number of clusters in the next section.

The fit_transform method of the MaxAbsScaler first fits the scaler to the data (i.e., learns the maximum absolute values of each feature) and then transforms the data by scaling it according to the learned values.

The transformed data is returned as a new NumPy array.

The fit method of the KMeans class fits the K-means clustering model to the data, learning the optimal cluster centroids.
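Besides the labels we’ll use below, the fitted object exposes a couple of attributes worth a quick look (a small sketch using the kmeans object fitted above):

# Each row of cluster_centers_ is a centroid in the scaled feature space,
# so its shape is (n_clusters, number of time steps in wide_df)
print(kmeans.cluster_centers_.shape)

# inertia_ is the sum of squared distances from each series to its closest centroid
print(kmeans.inertia_)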

Selecting The Best Number Of Clusters

We can see which time series belong to which cluster by using the labels_ attribute of the kmeans object.

This attribute contains the cluster to which each time series (row) belongs:

for cluster in range(kmeans.n_clusters):
    print(f"Cluster {cluster}")
    cluster_data = wide_df[kmeans.labels_ == cluster]
    for i in range(min(5, cluster_data.shape[0])):
        print(cluster_data.index[i])
    print()

Cluster 0
desktop_"mozilla
desktop_'Mozilla
desktop_(not set)
desktop_AdobeAIR
desktop_Bluebeam Revu Browser - cef version: 57.0.0.0

Cluster 1
mobile_Chrome
mobile_Safari

Cluster 2
desktop_Chrome

Cluster 3
desktop_Firefox
desktop_Internet Explorer
desktop_Safari

This code prints the names of up to five series from each cluster.

I love to look at the data and see if the clusters make sense.

I don’t think any metric comes close to your experience and intuition when it comes to evaluating the quality of the clusters.

I recommend you try different values for n_clusters and see which one makes the most sense to you.
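A lightweight way to do that is to refit with a few candidate values and check how the series are distributed across clusters before inspecting them by name (a sketch reusing scaled_data from above):

import numpy as np

# Cluster sizes for a few candidate values of n_clusters
for k in (3, 4, 5, 6):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(scaled_data)
    print(k, np.bincount(labels))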

Based on the given clusters, we can interpret them as follows:

Cluster 0 contains less common or less frequently used desktop browsers and some special cases like desktop_"mozilla and desktop_'Mozilla.

These are likely malformed or misreported browser names, possibly due to user agent string manipulation or parsing issues.

Cluster 1 contains the most popular mobile browsers.

Cluster 2 has the most popular desktop browser, Chrome.

Cluster 3 contains the other popular desktop browsers: Firefox, Safari and Internet Explorer (at the time).

We can develop hypotheses about the clusters based on the data and then test them to see if they make sense.

For example, maybe Chrome is separated from the other browsers because the site is better optimized for it, leading to distinct usage patterns when compared to other desktop browsers.

If you want to evaluate your clusters numerically, you can use the silhouette_score metric from scikit-learn.

It measures how well each data point fits within its cluster and how well it is separated from other clusters.

A silhouette score ranges from -1 to 1, with values closer to 1 indicating better clustering quality.

from sklearn.metrics import silhouette_score
for n_cluster in range(2, 10):
    kmeans = KMeans(n_clusters=n_cluster, random_state=0)
    kmeans.fit(scaled_data)
    print(n_cluster, silhouette_score(scaled_data, kmeans.labels_))

2 0.9141625540842973
3 0.9067517905568305
4 0.8921278550540331
5 0.8846711594784025
6 0.8647099535858193
7 0.8612737950043723
8 0.7796911515851629
9 0.792996833125775

[Figure: silhouette score for time series clustering]

We pass the scaled data and the cluster labels to the silhouette_score function to get the silhouette score for each number of clusters.

Although it says the best number of clusters is 2, looking at the data I don’t think that’s the case.

So you can use it as a sanity check, but I wouldn’t rely on it too much.

Feature Extraction For Time Series Clustering In Python

Feature extraction can be a powerful technique to transform the raw data into a more manageable format, while capturing relevant information from the time series.

TSFresh is a Python library that automatically extracts up to 1,200 features from time series data.

Some of the features include basic statistics, Fourier and wavelet transforms, and autocorrelation.

To use TSFresh, first install it with pip:

pip install tsfresh

TSFresh can take our data in the original (long) format, where each row represents a single time point and the series are stacked.

from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters

df = data[['Date', 'unique_id', 'Sessions']]
extracted_features = extract_features(df, column_id='unique_id', column_sort='Date',
                                      default_fc_parameters=MinimalFCParameters())

Here, we import the extract_features function, which will be used to compute the features for each time series, and the MinimalFCParameters class, which provides a minimal set of feature extraction settings to reduce computation time.

We create a new DataFrame called df containing only the columns needed for feature extraction: 'Date', 'unique_id', and 'Sessions'.

The 'Date' column is the time index, 'unique_id' indicates which rows belong to each individual time series, and 'Sessions' contains the actual time series values.

The extract_features function takes the following arguments:

  • df: The input DataFrame containing the time series data.
  • column_id: The name of the column that uniquely identifies each time series. In our case, it’s the 'unique_id' column, which combines the device category and browser information.
  • column_sort: The name of the column that represents the time index. In our dataset, it’s the 'Date' column.
  • default_fc_parameters: The feature set we want to extract. We use MinimalFCParameters to compute a minimal set of features, which reduces the computational complexity and time required for feature extraction. This is a good starting point, but you can experiment with the other feature extraction settings provided by tsfresh, or create your own custom settings, to extract more or different features depending on your needs and the complexity of your data (see the sketch right after this list).
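As an example of what those alternatives can look like, here is a sketch of a larger built-in preset and a hand-picked settings dictionary (the feature names and parameters follow tsfresh’s settings-dictionary convention; expect the larger preset to take noticeably longer to compute):

from tsfresh.feature_extraction import EfficientFCParameters

# Alternative 1: a larger built-in preset (slower, but far more features)
richer_features = extract_features(df, column_id='unique_id', column_sort='Date',
                                   default_fc_parameters=EfficientFCParameters())

# Alternative 2: a hand-picked dictionary of features and their parameters
custom_settings = {
    'mean': None,                                 # features without parameters map to None
    'maximum': None,
    'autocorrelation': [{'lag': 1}, {'lag': 7}],  # parametrized features map to a list of dicts
}
custom_features = extract_features(df, column_id='unique_id', column_sort='Date',
                                   default_fc_parameters=custom_settings)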

The extract_features function returns a new DataFrame containing the computed features for each time series.

Each row in the DataFrame corresponds to a unique time series, and each column represents a different feature.

Sessions__sum_values Sessions__median Sessions__mean Sessions__length Sessions__standard_deviation Sessions__variance Sessions__root_mean_square Sessions__maximum Sessions__absolute_maximum Sessions__minimum
desktop_"mozilla 8456 38 37.9193 223 1.12562 1.26703 37.936 40 40 36
desktop_'Mozilla 138889 39 57.6303 2410 37.9793 1442.43 69.0194 159 159 35
desktop_(not set) 485471 37 41.1172 11807 11.5508 133.42 42.7088 80 80 35
desktop_AdobeAIR 11546 37 37.2452 310 1.46504 2.14635 37.274 40 40 35
desktop_Bluebeam Revu Browser - cef version: 57.0.0.0 15580 38 38.0929 409 1.00909 1.01826 38.1063 40 40 36
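One practical note: with MinimalFCParameters the output is usually clean, but richer feature sets can produce NaNs or infinite values that would break the scaler in the next step. tsfresh ships an impute utility for exactly that; a short sketch:

from tsfresh.utilities.dataframe_functions import impute

# Replaces NaN and +/-inf in the feature matrix (column-wise, in place)
impute(extracted_features)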

With the extracted features, you can now proceed with the scaling and clustering steps as we discussed earlier in the tutorial.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

prep = StandardScaler()
kmeans = KMeans(n_clusters=4, random_state=0)

scaled_data = prep.fit_transform(extracted_features)
kmeans.fit(scaled_data)

Now that we have dense features, we can use StandardScaler to scale the data.

The code is basically the same, but instead of using wide_df, which had the raw time series observations as features, we use extracted_features, which contains the extracted features.

I like this approach because it’s very easy to try different feature extraction methods and compare the results.
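If you want to compare two runs more formally, one option (a sketch, not part of the original workflow) is to measure how much their cluster assignments agree using the adjusted Rand index from scikit-learn:

from sklearn.metrics import adjusted_rand_score

# Refit K-means on both representations, aligning the rows so that both label
# vectors refer to the same series in the same order
raw_scaled = MaxAbsScaler().fit_transform(wide_df.loc[extracted_features.index])
feat_scaled = StandardScaler().fit_transform(extracted_features)

labels_raw = KMeans(n_clusters=4, random_state=0).fit_predict(raw_scaled)
labels_feat = KMeans(n_clusters=4, random_state=0).fit_predict(feat_scaled)

# 1.0 means identical groupings; values near 0 mean agreement no better than chance
print(adjusted_rand_score(labels_raw, labels_feat))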

[Figure: silhouette score for time series clustering with tsfresh features]

The silhouette score still suggests using only two clusters, but now our clusters changed a bit.

Cluster 0
desktop_"mozilla
desktop_'Mozilla
desktop_(not set)
desktop_AdobeAIR
desktop_Bluebeam Revu Browser - cef version: 57.0.0.0

Cluster 1
desktop_Chrome
mobile_Safari

Cluster 2
desktop_Firefox
desktop_Internet Explorer
desktop_Safari
mobile_Chrome
tablet_Safari

Cluster 3
desktop_Konqueror

Time Series Clustering With PCA Using Scikit-learn

Principal Component Analysis (PCA) is a dimensionality reduction technique that can help you transform your time series data into a lower-dimensional space while preserving the most important information.

It’s another way to do feature extraction.

While feature extraction methods can be domain-specific and interpretable, PCA transforms the data into a new coordinate system, in which the first axis (the first principal component) explains the largest amount of variance in the data, the second axis (the second principal component) explains the second-largest amount of variance, and so on.

It automatically finds the optimal linear projection of the data that preserves the maximum amount of variance.

In practice, I try both approaches and compare the results.

To use PCA for time series clustering, first import the required classes from scikit-learn:

from sklearn.preprocessing import MaxAbsScaler
from sklearn.decomposition import PCA

prep = MaxAbsScaler()
scaled_data = prep.fit_transform(wide_df)

pca = PCA(n_components=2)
pca_data = pca.fit_transform(scaled_data)

kmeans_pca = KMeans(n_clusters=4, random_state=0)
kmeans_pca.fit(pca_data)

We import the MaxAbsScaler class for preprocessing and the PCA class for dimensionality reduction.

As before, we are dealing with the raw time series data first, so we scale using MaxAbsScaler to keep the sparsity.

We create an instance of the PCA class with n_components=2, specifying that we want to reduce the dimensionality of the data to 2 principal components (2 features).

The fit_transform method fits the PCA model to the scaled data (i.e., learns the optimal linear projection) and then transforms the data by projecting it onto the first two principal components.

The transformed data is returned as a new NumPy array called pca_data.

The rest of the code is the same, but applied to the transformed data.
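The silhouette check from earlier also works the same way here, just with pca_data as the input (this is what the plot below summarizes):

from sklearn.metrics import silhouette_score

for n_cluster in range(2, 10):
    km = KMeans(n_clusters=n_cluster, random_state=0).fit(pca_data)
    print(n_cluster, silhouette_score(pca_data, km.labels_))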

The biggest disadvantage I see is that now we have another hyperparameter to tune: the number of components to keep.
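A common way to guide that choice is to look at how much of the variance the leading components explain (a sketch reusing scaled_data from the PCA preprocessing step above):

import numpy as np

# Fit PCA without limiting the number of components, then check the cumulative
# share of variance explained by the first few components
pca_full = PCA().fit(scaled_data)
print(np.cumsum(pca_full.explained_variance_ratio_)[:10])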

The silhouette score plot now suggests we use between 3 and 4 clusters, which makes more sense to me, looking at the data.

[Figure: silhouette score for time series clustering with PCA]

Because of this, and combining it with my experience and intuition from looking at the examples assigned to each cluster, I would consider this approach the best one for clustering this dataset.

Cluster 0
desktop_"mozilla
desktop_'Mozilla
desktop_(not set)
desktop_AdobeAIR
desktop_Bluebeam Revu Browser - cef version: 57.0.0.0

Cluster 1
mobile_Chrome
mobile_Safari

Cluster 2
desktop_Chrome

Cluster 3
desktop_Firefox
desktop_Internet Explorer
desktop_Safari

I love the fact that mobile_Chrome and mobile_Safari are now in the same cluster, which makes sense given that the US tends to have a 50/50 split between Android and iOS users.

The biggest advantage of using PCA with 2 components is that we can visualize the clusters in a 2D space.

import matplotlib.pyplot as plt

for cluster in range(kmeans_pca.n_clusters):
    cluster_points = pca_data[kmeans_pca.labels_ == cluster]
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], label=f"Cluster {cluster}")

# Plot the cluster centroids as black 'X' markers
plt.scatter(kmeans_pca.cluster_centers_[:, 0], kmeans_pca.cluster_centers_[:, 1],
            color='black', marker='x', label='Centroids')

plt.title("Clusters in PCA Space")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend()

# Display the plot
plt.show()

[Figure: time series clustering with PCA]

We have a clear cluster of low-activity series, but things get a bit more complicated for the other clusters.
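If you want to dig into that crowded region, one option is to label each point with its series name. A rough exploratory sketch (with many series the labels will overlap, so you may want to filter first):

fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(pca_data[:, 0], pca_data[:, 1], c=kmeans_pca.labels_)

# Annotate every point with its unique_id (wide_df rows and pca_data rows share the same order)
for name, (x, y) in zip(wide_df.index, pca_data):
    ax.annotate(name, (x, y), fontsize=7)

plt.show()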