To put it simply, yes, feature scaling is crucial for the KNN algorithm, as it helps in preventing features with larger magnitudes from dominating the distance calculations.
Feature scaling is an essential step in the data preprocessing pipeline, especially for distance-based algorithms like KNN.
In this tutorial, we will explore the impact of feature scaling on the algorithm’s performance using the Red Wine dataset as an example.
## Why Feature Scaling Is Important For KNN
Distance-based algorithms, such as KNN, calculate the distance between data points to determine their similarity.
Features with larger magnitudes can disproportionately influence the distance calculation, leading to biased results.
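To see why, consider the Euclidean distance, which KNN commonly uses (and which scikit-learn's `KNeighborsClassifier` uses by default). For two points $p$ and $q$ with $n$ features, it is:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

A feature measured in the thousands contributes squared differences in the millions to the sum, while a feature measured in single digits barely registers.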
Feature scaling addresses this issue by transforming the features to a comparable range or scale, ensuring that each feature contributes fairly to the final result.
Imagine you’re measuring the similarity between two houses based on their size (in square feet) and the number of rooms.
```python
import pandas as pd

# Two houses described by their size (square feet) and number of rooms
data = pd.DataFrame({'size': [2500, 4000],
                     'rooms': [3, 5]})
```
|   | size | rooms |
|---|------|-------|
| 0 | 2500 | 3     |
| 1 | 4000 | 5     |
Here, we create a DataFrame with two houses, one with a size of 2,500 square feet and 3 rooms, and the other with a size of 4,000 square feet and 5 rooms.
If you don’t scale the features, the difference in size would dominate the distance calculation, while the difference in the number of rooms would barely contribute.
Feature scaling helps to balance these two features, allowing for a more accurate comparison.
Two common feature scaling methods are Min-Max scaling and Standardization.
Min-Max scaling transforms the features by scaling their values to a specific range, typically [0, 1].
It is calculated using the formula:
$$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
We subtract the minimum value from each feature and divide the result by the difference between the maximum and minimum values.
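To make the formula concrete, here is the same transformation computed by hand with pandas, as a quick sanity check against `MinMaxScaler` (using the `data` DataFrame from above):

```python
# Min-Max scaling applied manually: (x - min) / (max - min)
data_min_max_manual = (data - data.min()) / (data.max() - data.min())
# size:  (2500 - 2500) / (4000 - 2500) = 0,  (4000 - 2500) / 1500 = 1
# rooms: (3 - 3) / (5 - 3) = 0,              (5 - 3) / 2 = 1
```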
This is our data after applying Min-Max scaling:
|   | size | rooms |
|---|------|-------|
| 0 | 0    | 0     |
| 1 | 1    | 1     |
Standardization transforms the features by centering their values around the mean (0) and scaling them based on the standard deviation.
It is calculated using the formula:
$$x_{scaled} = \frac{x - \mu}{\sigma}$$
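Again, a quick by-hand version of the formula (note that scikit-learn's `StandardScaler` uses the population standard deviation, so we pass `ddof=0` to match it):

```python
# Standardization applied manually: (x - mean) / std
# ddof=0 matches StandardScaler's population standard deviation
data_standard_manual = (data - data.mean()) / data.std(ddof=0)
# size:  mean = 3250, std = 750 -> (2500 - 3250) / 750 = -1,  (4000 - 3250) / 750 = 1
# rooms: mean = 4,    std = 1   -> (3 - 4) / 1 = -1,          (5 - 4) / 1 = 1
```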
This is our data after applying Standardization:
|   | size | rooms |
|---|------|-------|
| 0 | -1   | -1    |
| 1 | 1    | 1     |
In practice, you should try various methods and see which one works best for your dataset, but these two are a great place to start.
In my experience, if I don’t have time to experiment with different methods, I use Standardization, as it works well for most datasets.
```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Scale both features to the [0, 1] range
min_max_scaler = MinMaxScaler()
data_min_max = min_max_scaler.fit_transform(data)

# Center both features at 0 with unit variance
standard_scaler = StandardScaler()
data_standard = standard_scaler.fit_transform(data)
```
We can import the `MinMaxScaler` and `StandardScaler` classes from the `sklearn.preprocessing` module.
Let’s calculate the Euclidean distance between the two houses using the unscaled and scaled features.
```python
from scipy.spatial.distance import euclidean

# Distance between the two houses on the raw and scaled features
distance_original = euclidean(data.loc[0], data.loc[1])
distance_min_max = euclidean(data_min_max[0], data_min_max[1])
distance_standard = euclidean(data_standard[0], data_standard[1])
```
The Euclidean distance without feature scaling is roughly 1,500, driven almost entirely by the size difference, while the distance with Min-Max scaling is 1.41 and the distance with Standardization is 2.83, with both features contributing equally.
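You can verify these numbers yourself by printing the three variables from the snippet above:

```python
print(f"Unscaled:       {distance_original:.2f}")  # ~1500.00
print(f"Min-Max scaled: {distance_min_max:.2f}")   # 1.41
print(f"Standardized:   {distance_standard:.2f}")  # 2.83
```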
Now let’s see scaling applied to a real dataset and how it affects the KNN algorithm’s performance.
## Comparing KNN Performance With and Without Feature Scaling
Let’s use the Red Wine Quality dataset from the UCI Machine Learning Repository to compare the KNN algorithm’s performance with and without feature scaling.
The task in this dataset is to predict the quality of red wine based on its chemical properties like pH, alcohol content, and acidity.
```python
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_data = pd.read_csv(url, sep=";")

# Preview the first five rows (shown below)
wine_data.head()
```
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.4 | 0.7 | 0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.8 | 0.88 | 0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 7.4 | 0.7 | 0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
We split the dataset into features and labels, and then split the data into training and test sets.
```python
from sklearn.model_selection import train_test_split

# Separate the features from the target label
X = wine_data.drop('quality', axis=1)
y = wine_data['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
The `train_test_split` function from the `sklearn.model_selection` module with the `test_size` parameter set to 0.3 splits the data into 70% training and 30% test sets.
First, let’s train a KNN model without feature scaling and calculate its accuracy.
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Baseline: KNN (default k=5) on the unscaled features
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
```
The accuracy of the KNN model without feature scaling is 48.54%.
Let’s see if scaling the features improves the model’s performance. First, with Standardization.
```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only, then transform both sets
standard_scaler = StandardScaler()
X_train_standard = standard_scaler.fit_transform(X_train)
X_test_standard = standard_scaler.transform(X_test)

knn_standard = KNeighborsClassifier()
knn_standard.fit(X_train_standard, y_train)
y_pred_standard = knn_standard.predict(X_test_standard)
accuracy_standard = accuracy_score(y_test, y_pred_standard)
```
Remember to always fit the scaler on the training set and then use it to transform both the training and test sets.
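A convenient way to enforce this and avoid accidentally leaking test-set statistics is to bundle the scaler and the classifier in a scikit-learn `Pipeline`; here is a minimal sketch:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# The pipeline fits the scaler on the training data only and
# reapplies the same transformation whenever it predicts
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipeline.fit(X_train, y_train)
accuracy_pipeline = pipeline.score(X_test, y_test)
```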
The accuracy of the KNN model with Standardization is 57.08%.
Let’s see if Min-Max scaling can do better.
```python
# Same procedure with Min-Max scaling
min_max_scaler = MinMaxScaler()
X_train_min_max = min_max_scaler.fit_transform(X_train)
X_test_min_max = min_max_scaler.transform(X_test)

knn_min_max = KNeighborsClassifier()
knn_min_max.fit(X_train_min_max, y_train)
y_pred_min_max = knn_min_max.predict(X_test_min_max)
accuracy_min_max = accuracy_score(y_test, y_pred_min_max)
```
The accuracy of the KNN model with Min-Max scaling is 56.25%, which is better than KNN without feature scaling but worse than KNN with Standardization.
In this case, it’s an easy choice to go with Standardization.
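To put the three results side by side, you can print the accuracy variables computed above:

```python
print(f"No scaling:      {accuracy:.2%}")           # 48.54%
print(f"Standardization: {accuracy_standard:.2%}")  # 57.08%
print(f"Min-Max scaling: {accuracy_min_max:.2%}")   # 56.25%
```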
The most important thing I wanted to show you in this article is that feature scaling makes a huge difference in the performance of the KNN algorithm, even more so than for algorithms like logistic regression and SVMs.