In linear regression, feature scaling is not strictly required but can be beneficial in certain situations.

When using gradient descent-based optimization algorithms, feature scaling can help speed up convergence and improve model performance.

However, when employing a closed-form solution like the normal equation, feature scaling is not necessary, as the algorithm naturally handles features with different scales.

In this tutorial, we will explore the impact of feature scaling on linear regression’s performance, using the Red Wine dataset as an example.

Loading the Dataset

The Red Wine dataset is a popular dataset used to study regression tasks in machine learning.

It contains data about chemical properties of red wine, such as acidity, pH, and alcohol content, and the quality of the wine.

The goal is to use these characteristics to predict the quality of each wine.

import pandas as pd

# Load the Red Wine dataset from the UCI repository and preview the first rows
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_data = pd.read_csv(url, sep=";")
print(wine_data.head())
fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density  pH    sulphates  alcohol  quality
7.4            0.7               0            1.9             0.076      11                   34                    0.9978   3.51  0.56       9.4      5
7.8            0.88              0            2.6             0.098      25                   67                    0.9968   3.2   0.68       9.8      5
7.8            0.76              0.04         2.3             0.092      15                   54                    0.997    3.26  0.65       9.8      5
11.2           0.28              0.56         1.9             0.075      17                   60                    0.998    3.16  0.58       9.8      6
7.4            0.7               0            1.9             0.076      11                   34                    0.9978   3.51  0.56       9.4      5

We split the dataset into features and labels, and then split the data into training and test sets.

from sklearn.model_selection import train_test_split

X = wine_data.drop('quality', axis=1)
y = wine_data['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Gradient-Based Linear Regression With And Without Feature Scaling

We will first implement a gradient-based method for linear regression using the Stochastic Gradient Descent (SGD) Regressor from scikit-learn.

Stochastic gradient descent is an optimization algorithm that iteratively adjusts the model’s weights to minimize the cost function.

When dealing with large datasets, the stochastic gradient descent method can converge faster than the closed-form solution, as it updates the model’s parameters using a random subset of the training data at each iteration.

This makes it computationally more efficient, as it doesn’t require loading the full dataset in memory, and allows for online learning, where the model can be updated with new data without retraining from scratch.
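
To make the update rule concrete, here is a minimal NumPy sketch of a single SGD epoch for linear regression with a squared-error loss. It is only an illustration; SGDRegressor adds learning-rate schedules, regularization, and stopping criteria on top of this basic idea.

import numpy as np

def sgd_epoch(X, y, w, b, lr=0.01):
    # One pass over the data, updating the parameters one sample at a time
    for i in np.random.permutation(len(X)):  # visit samples in random order
        error = X[i] @ w + b - y[i]          # prediction error for this sample
        w -= lr * error * X[i]               # gradient step for the weights
        b -= lr * error                      # gradient step for the intercept
    return w, b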

from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

sgd_regressor = SGDRegressor(random_state=42)
sgd_regressor.fit(X_train, y_train)

y_pred_test_sgd = sgd_regressor.predict(X_test)

rmse_test_sgd = mean_squared_error(y_test, y_pred_test_sgd, squared=False)

print("SGDRegressor - without feature scaling")
print("Test RMSE:", rmse_test_sgd)

The code above first imports the necessary classes and functions, then creates an instance of the SGDRegressor class with a random state for reproducibility.

It fits the model to the training data and makes predictions for the test set.

Finally, it calculates and prints the root mean squared error (RMSE) to evaluate the model’s performance.
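
If you want to see what that metric boils down to, here is an equivalent manual computation with NumPy, reusing the y_test and y_pred_test_sgd variables from above:

import numpy as np

# RMSE: square root of the mean squared difference between true and predicted values
rmse_manual = np.sqrt(np.mean((y_test.to_numpy() - y_pred_test_sgd) ** 2))
print("Manual RMSE:", rmse_manual)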

The RMSE is 780765114783.63, which indicates the model wasn’t able to converge to a good solution.

How do we know that?

Our target has a range of 3 to 8, so predicting the mean quality based on the training data would result in an RMSE of 0.7976, which is much better than the model’s performance.
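
As a sanity check, such a mean-predicting baseline can be built with scikit-learn’s DummyRegressor. Here is a quick sketch, assuming the same train/test split as above:

from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Baseline that always predicts the mean quality of the training set
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train, y_train)

y_pred_baseline = baseline.predict(X_test)
rmse_baseline = mean_squared_error(y_test, y_pred_baseline, squared=False)
print("Baseline RMSE:", rmse_baseline)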

So this absurdly large number is a clear sign that the optimization process diverged instead of converging.

Let’s try to fix this by scaling the features.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

The code above creates an instance of the StandardScaler class, then fits the scaler to the training data and transforms both the training and test sets.

Notice that we fit only on the training data to avoid data leakage.

Standardization is a common method for feature scaling, where we take each feature, subtract the mean and divide by the standard deviation.

This gives each feature a mean of 0 and a standard deviation of 1, putting them all on a similar scale.
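
If you are curious, here is a hand-rolled sketch of the same transformation with pandas. It is only an illustration of what StandardScaler computes, not a replacement for it:

# Manual standardization: subtract the training mean, divide by the training std.
# StandardScaler uses the population standard deviation, hence ddof=0.
train_mean = X_train.mean()
train_std = X_train.std(ddof=0)

X_train_scaled_manual = (X_train - train_mean) / train_std
X_test_scaled_manual = (X_test - train_mean) / train_std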

Now let’s train the model again.

sgd_regressor_scaled = SGDRegressor(random_state=42)
sgd_regressor_scaled.fit(X_train_scaled, y_train)

y_pred_test_sgd_scaled = sgd_regressor_scaled.predict(X_test_scaled)

rmse_test_sgd_scaled = mean_squared_error(y_test, y_pred_test_sgd_scaled, squared=False)

print("SGDRegressor - with feature scaling")
print("Test RMSE:", rmse_test_sgd_scaled)

The code above creates another instance of the SGDRegressor class, but this time fits the model to the scaled training data and makes predictions on the scaled test set.

Finally, it calculates and prints the root mean squared error (RMSE) for the test set.

The RMSE is 0.6414, which is much better than the previous model’s performance.

Now the model is able to converge to a good solution and beat the baseline.

Closed-Form Linear Regression With And Without Feature Scaling

The closed-form solution is another method for training a linear regression that finds the optimal model parameters by solving a system of linear equations.

It offers a more straightforward approach for finding the optimal parameters without the need for iterative updates like gradient-based methods.

For small to moderately-sized datasets, this solution can be computationally efficient and provide accurate results.

The LinearRegression class in scikit-learn uses a closed-form solution based on SVD (Singular Value Decomposition) to find the optimal model weights.
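
To give a rough idea of the math involved, here is a small NumPy sketch of a closed-form fit using the Moore-Penrose pseudo-inverse, which NumPy itself computes via SVD. This is an illustration only, not scikit-learn’s exact implementation:

import numpy as np

# Add a column of ones so the intercept is estimated along with the coefficients
X_b = np.c_[np.ones((len(X_train), 1)), X_train]

# Least-squares solution: theta = pinv(X_b) @ y
theta = np.linalg.pinv(X_b) @ y_train.to_numpy()

# Predict on the test set using the same augmentation
X_test_b = np.c_[np.ones((len(X_test), 1)), X_test]
y_pred_manual = X_test_b @ theta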

Let’s try it first without feature scaling.

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

y_pred_test_lin_reg = lin_reg.predict(X_test)

rmse_test_lin_reg = mean_squared_error(y_test, y_pred_test_lin_reg, squared=False)

print("LinearRegression - without feature scaling")
print("Test RMSE:", rmse_test_lin_reg)

This time the RMSE is 0.6413, even without feature scaling, which shows that the closed-form solution is not hurt by unscaled features.

Let’s see if feature scaling makes any difference.

lin_reg_scaled = LinearRegression()
lin_reg_scaled.fit(X_train_scaled, y_train)

y_pred_test_lin_reg_scaled = lin_reg_scaled.predict(X_test_scaled)

rmse_test_lin_reg_scaled = mean_squared_error(y_test, y_pred_test_lin_reg_scaled, squared=False)

print("LinearRegression - with feature scaling")
print("Test MSE:", rmse_test_lin_reg_scaled)

The RMSE is the same as before, 0.6413, so there was no difference in performance.