In Support Vector Machines (SVMs), feature scaling and normalization are not strictly required, but they are highly recommended because they can significantly improve model performance and convergence speed.
SVM tries to find the optimal hyperplane that separates the data points of different classes with the maximum margin.
If the features are on different scales, the features with larger values dominate the distance computations that determine the hyperplane, potentially leading to suboptimal results.
In this tutorial, we will explore the impact of feature scaling and normalization on SVM’s performance using the Red Wine dataset as an example.
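To make the scale problem concrete, here is a minimal sketch using two wines from this dataset and only two of their features. It assumes the default SVC setup, whose RBF kernel depends on the squared Euclidean distance between samples; the variable names are illustrative.

```python
import numpy as np

# Two wines described by two features only:
# [volatile acidity, total sulfur dioxide]
wine_a = np.array([0.70, 34.0])
wine_b = np.array([0.88, 67.0])

# Per-feature contribution to the squared Euclidean distance
squared_diff = (wine_a - wine_b) ** 2
distance = np.sqrt(squared_diff.sum())
share = squared_diff / squared_diff.sum()

print("Euclidean distance:", distance)
print("Share of squared distance per feature:", share)
```

Almost all of the distance comes from total sulfur dioxide, simply because it is measured in larger numbers, not because it is necessarily more informative.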
Loading the Dataset
The Red Wine dataset is a popular dataset used to study classification and regression tasks in machine learning.
It contains data about chemical properties of red wine, such as acidity, pH, alcohol content, and the quality of the wine.
The goal is to use these characteristics to predict the quality of each wine as a classification problem, where the quality is discretized into classes.
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_data = pd.read_csv(url, sep=";")
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|
7.4 | 0.7 | 0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
7.8 | 0.88 | 0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
7.4 | 0.7 | 0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
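Even these five rows show how different the feature scales are. The sketch below rebuilds a small DataFrame from four of the columns above (the name `sample` is illustrative) and prints each column's minimum, maximum, and range.

```python
import pandas as pd

# The five rows shown above, restricted to four columns
sample = pd.DataFrame({
    "volatile acidity": [0.70, 0.88, 0.76, 0.28, 0.70],
    "total sulfur dioxide": [34, 67, 54, 60, 34],
    "density": [0.9978, 0.9968, 0.9970, 0.9980, 0.9978],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
})

# One row per feature, with its min, max and range
summary = sample.agg(["min", "max"]).T
summary["range"] = summary["max"] - summary["min"]
print(summary)
```

On the full dataset, `wine_data.describe()` gives a similar per-column overview.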
We split the dataset into features and labels, and then split the data into training and test sets.
from sklearn.model_selection import train_test_split
X = wine_data.drop('quality', axis=1)
y = wine_data['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
SVM Without Feature Scaling
First, we train an SVM model without feature scaling to use as a baseline.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
svm_clf = SVC(random_state=42)
svm_clf.fit(X_train, y_train)
y_pred_test_svm = svm_clf.predict(X_test)
accuracy_test_svm = accuracy_score(y_test, y_pred_test_svm)
print("SVM - without feature scaling")
print("Test accuracy:", accuracy_test_svm)
The code above first imports the necessary classes and functions, then creates an instance of the SVC class with a random state for reproducibility.
It fits the model to the training data and makes predictions for the test set.
Finally, it calculates and prints the accuracy score to evaluate the model’s performance.
The accuracy is 50.42%, which is better than random but not great.
A logistic regression model gets about 56.25% accuracy on this dataset, so I would expect an SVM model to perform better.
So, let’s try to improve the performance by scaling the features.
SVM With Feature Scaling
Let’s use the StandardScaler class from scikit-learn to scale the features.
This class standardizes the features by subtracting the mean and dividing by the standard deviation, so that each feature has a mean of 0 and a standard deviation of 1 on the data it was fitted on.
It’s one of the most widely used methods for feature scaling.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Notice that we fit only on the training data to avoid data leakage.
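As a sanity check on what the scaler does, the following sketch reproduces StandardScaler by hand on hypothetical synthetic data (not the wine data). It uses the training mean and standard deviation for both splits, which is exactly what fitting only on the training data means.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: two features on very different scales
train = rng.normal(loc=[10.0, 500.0], scale=[1.0, 100.0], size=(200, 2))
test = rng.normal(loc=[10.0, 500.0], scale=[1.0, 100.0], size=(50, 2))

scaler = StandardScaler().fit(train)

# Statistics come from the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

manual_train = (train - mean) / std
manual_test = (test - mean) / std  # test data reuses the training statistics

print(np.allclose(scaler.transform(train), manual_train))
print(np.allclose(scaler.transform(test), manual_test))
```

The scaled test set will generally not have a mean of exactly 0 or a standard deviation of exactly 1, and that is expected.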
Now let’s train the model again.
svm_clf_scaled = SVC(random_state=42)
svm_clf_scaled.fit(X_train_scaled, y_train)
y_pred_test_svm_scaled = svm_clf_scaled.predict(X_test_scaled)
accuracy_test_svm_scaled = accuracy_score(y_test, y_pred_test_svm_scaled)
print("SVM - with feature scaling")
print("Test accuracy:", accuracy_test_svm_scaled)
The code above creates another instance of the SVC class but now it fits the model to the scaled training data and makes predictions for the scaled test set.
Finally, it calculates and prints the accuracy score for the test set.
The accuracy is 60.63%, which is better than both the unscaled SVM and the logistic regression model.
Scikit-learn has other classes for feature scaling, such as MinMaxScaler and RobustScaler, so we may do even better by using one of those.
Let’s try MinMaxScaler first.
It works by subtracting each feature’s minimum value and dividing by its range, so that each feature has a minimum of 0 and a maximum of 1 on the data the scaler was fitted on.
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
X_train_minmax = minmax_scaler.fit_transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)
svm_clf_minmax = SVC(random_state=42)
svm_clf_minmax.fit(X_train_minmax, y_train)
y_pred_test_svm_minmax = svm_clf_minmax.predict(X_test_minmax)
accuracy_test_svm_minmax = accuracy_score(y_test, y_pred_test_svm_minmax)
print("SVM - with MinMaxScaler")
print("Test accuracy:", accuracy_test_svm_minmax)
The accuracy of this model is 58.33%, which is better than no scaling but worse than standardization.
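The min-max transformation can also be checked by hand. The sketch below uses a tiny hypothetical one-feature dataset and shows one side effect worth knowing: test values outside the training range land outside [0, 1], because the default MinMaxScaler does not clip.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature data
train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[0.0], [4.0], [6.0]])

scaler = MinMaxScaler().fit(train)

# Minimum and range come from the training split only
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

manual_train = (train - col_min) / col_range
manual_test = (test - col_min) / col_range

print(scaler.transform(train).ravel())  # values 0, 0.5, 1
print(scaler.transform(test).ravel())   # values -0.25, 0.75, 1.25
```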
Let’s see if RobustScaler does better.
It scales the features based on the median and interquartile range (IQR), making it robust to outliers.
from sklearn.preprocessing import RobustScaler
robust_scaler = RobustScaler()
X_train_robust = robust_scaler.fit_transform(X_train)
X_test_robust = robust_scaler.transform(X_test)
svm_clf_robust = SVC(random_state=42)
svm_clf_robust.fit(X_train_robust, y_train)
y_pred_test_svm_robust = svm_clf_robust.predict(X_test_robust)
accuracy_test_svm_robust = accuracy_score(y_test, y_pred_test_svm_robust)
print("SVM - with RobustScaler")
print("Test accuracy:", accuracy_test_svm_robust)
This method gets 59.79% accuracy, which is good, but still not as good as standardization.
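To see what "robust to outliers" means in practice, here is a small sketch on a hypothetical feature with one extreme value. It reproduces RobustScaler with its default settings by hand (median and the 25th to 75th percentile range) and compares the result with StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical feature with a single large outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual version: subtract the median, divide by the IQR
median = np.median(x, axis=0)
q25, q75 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q75 - q25)

robust = RobustScaler().fit_transform(x)
standard = StandardScaler().fit_transform(x)

print("RobustScaler:  ", robust.ravel())
print("StandardScaler:", standard.ravel())
```

With RobustScaler the four typical values keep a usable spread, while with StandardScaler the outlier inflates the standard deviation and squeezes them into a narrow band.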
In conclusion, SVM can benefit from feature scaling, and different scalers have different effects on the model’s performance.
It is essential to test various scaling techniques and choose the one that works best for your specific dataset and problem.
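One way to run such a comparison systematically is sketched below. This is not the procedure used above: it swaps the single train/test split for cross-validation, and the helper `compare_scalers` is a hypothetical name. To keep the example self-contained, it runs on scikit-learn's bundled `load_wine` dataset, which is a different dataset from Red Wine, so the numbers are not comparable to the ones in this tutorial.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.svm import SVC


def compare_scalers(X, y, cv=5):
    """Return the mean cross-validated accuracy of SVC for each scaling option."""
    candidates = {
        "no scaling": None,
        "StandardScaler": StandardScaler(),
        "MinMaxScaler": MinMaxScaler(),
        "RobustScaler": RobustScaler(),
    }
    scores = {}
    for name, scaler in candidates.items():
        # Inside a pipeline, the scaler is refit on each training fold only
        model = SVC() if scaler is None else make_pipeline(scaler, SVC())
        scores[name] = cross_val_score(model, X, y, cv=cv).mean()
    return scores


# Stand-in data: scikit-learn's bundled wine dataset (NOT the Red Wine dataset)
X_demo, y_demo = load_wine(return_X_y=True)
scores = compare_scalers(X_demo, y_demo)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Wrapping the scaler and the model in a pipeline keeps the "fit on training data only" rule intact within every fold. The same function can be called with the `X` and `y` defined earlier.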
SVM With Normalization
Normalization is a type of feature scaling that rescales each sample’s feature vector to have unit norm; with the default L2 norm, the sum of the squares of the feature values in each row equals 1.
It is often used with distance- or similarity-based algorithms, such as k-Nearest Neighbors, when the direction of a sample’s feature vector matters more than its magnitude, as with cosine similarity on text data.
It’s done for each row instead of each column, so normalizing a row requires no statistics computed from other rows, which can help avoid data leakage.
To apply normalization to the dataset, you can use the Normalizer class from scikit-learn:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
X_train_normalized = normalizer.fit_transform(X_train)
X_test_normalized = normalizer.transform(X_test)
svm_clf_normalized = SVC(random_state=42)
svm_clf_normalized.fit(X_train_normalized, y_train)
y_pred_test_svm_normalized = svm_clf_normalized.predict(X_test_normalized)
accuracy_test_svm_normalized = accuracy_score(y_test, y_pred_test_svm_normalized)
print("SVM - with normalization")
print("Test accuracy:", accuracy_test_svm_normalized)
The accuracy of this model is 49.38%, which is worse than our SVM without feature scaling.
Keep in mind that, despite it not being the best method for this dataset, normalization is still useful in some cases.
So it’s important to add it to your toolbox and test it the next time you work with SVM on a new dataset.
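For a concrete picture of what row-wise normalization does, here is a tiny sketch on hypothetical data. Two rows that point in the same direction but differ in magnitude end up identical after normalization, which shows the kind of information this method discards.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical samples: the second row is the first one doubled
X = np.array([[3.0, 4.0], [6.0, 8.0], [1.0, 0.0]])

# Normalizer is stateless: fit learns nothing; default norm is "l2"
X_norm = Normalizer().fit_transform(X)
row_norms = np.linalg.norm(X_norm, axis=1)

print(X_norm)     # rows: (0.6, 0.8), (0.6, 0.8), (1, 0)
print(row_norms)  # every row now has length 1
```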
Making a machine learning model perform better in practice sometimes resembles alchemy more than science.
Moving forward, what about XGBoost? Does it require feature scaling?