In short, feature scaling or normalization is not strictly required for neural networks, but it is highly recommended.

Scaling or normalizing the input features can be the difference between a neural network that converges in a few iterations and one that takes hundreds of iterations to converge or even fails to converge at all.

The optimization process may become slower because the gradients in the direction of the larger-scale features will be significantly larger than the gradients in the direction of the smaller-scale features.

This can result in oscillations in the training process, and the algorithm might take longer to converge to the optimal solution.
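
To see this concretely, here is a tiny synthetic sketch (separate from the wine example) showing how a feature that is roughly a thousand times larger than another produces a gradient that is roughly a thousand times larger as well:

import torch

# Two inputs with very different scales feeding the same linear map.
x = torch.tensor([[0.5, 500.0]])        # second feature is ~1000x larger
w = torch.zeros(2, requires_grad=True)  # weights for the two features

loss = (x @ w - 1.0).pow(2).mean()      # simple squared-error loss
loss.backward()

print(w.grad)  # the gradient for the large-scale feature dominates (~1000x larger)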

The use of unscaled features can lead to numerical instability in the training process.

For example, when using activation functions like the sigmoid or hyperbolic tangent (tanh), large input values can cause the function outputs to saturate at their extreme values.

This saturation can lead to vanishing gradients, which in turn can make the learning process slow or stall altogether.
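
To see the saturation effect, here is a minimal sketch (again separate from the wine example) that passes increasingly large inputs through a sigmoid and inspects the gradients:

import torch

# Large inputs push the sigmoid into its flat region, where gradients vanish.
x = torch.tensor([0.5, 5.0, 50.0], requires_grad=True)
torch.sigmoid(x).sum().backward()

print(torch.sigmoid(x))  # roughly 0.62, 0.99, 1.0 -- the last value is saturated
print(x.grad)            # roughly 0.23, 0.007, 0.0 -- the gradient vanishes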

To make it concrete, let’s explore the impact of feature scaling and normalization on neural network performance in practice using the Red Wine dataset.

Loading the Dataset

The Red Wine dataset is a popular dataset used to study classification and regression tasks in machine learning.

It contains data about the chemical properties of red wines, such as acidity, pH, and alcohol content, along with a quality score for each wine.

The goal is to use these characteristics to predict the quality of each wine as a classification problem, where the quality is discretized into classes.

import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_data = pd.read_csv(url, sep=";")

We split the dataset into features and labels, and then split the data into training and test sets.

from sklearn.model_selection import train_test_split

X = wine_data.drop('quality', axis=1)
y = wine_data['quality'] - wine_data['quality'].min()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Originally the quality label goes from 3 to 8, but we subtract the minimum value to make it start from 0.
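
As a quick sanity check, the shifted labels should run from 0 to 5, which matches the six output neurons we will use in the model below:

print(y.min(), y.max())  # 0 5
print(y.nunique())       # 6 classes in total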

Neural Networks Without Feature Scaling

First, let’s create a simple neural network model without feature scaling to use as a baseline.

We will use the popular deep learning library, PyTorch, to create the model.

First we need to import the necessary libraries and convert the data into PyTorch tensors.

import torch
import torch.nn as nn
import torch.optim as optim

X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)

torch has the core functionality, torch.nn has the classes to build the neural network layers, and torch.optim has the optimization algorithms.

Let’s create a simple neural network with one hidden layer.

model = nn.Sequential(
    nn.Linear(11, 128),
    nn.ReLU(),
    nn.Linear(128, 6),
    nn.Softmax(dim=1)
)
        
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

nn.Sequential is a container that holds the layers of the neural network.

A sequential model is a stack of layers, where the output of one layer is the input to the next.

The layers are passed to nn.Sequential as arguments, in the order in which they are applied.

The neural network has 11 input features, 128 neurons in the hidden layer, and 6 output neurons, with the usual ReLU activation function in the hidden layer.

We will treat the problem as a multi-class classification, so we will use the nn.Softmax activation function in the output layer.

I chose Adam as the optimizer because it tends to work well off the shelf, but you can use any optimizer you like.

The optimizer is responsible for updating the model’s parameters (weights and biases) during training to minimize the loss function.

nn.CrossEntropyLoss is the most popular loss function for classification problems.

Now let’s train the model.

num_epochs = 100
for epoch in range(num_epochs):
    model.train()
    
    optimizer.zero_grad()
    outputs = model(X_train_tensor)
    
    loss = criterion(outputs, y_train_tensor)
    loss.backward()
    optimizer.step()
    
    if (epoch+1) % 10 == 0:
    
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}")

        model.eval()
        with torch.no_grad():
            outputs = model(X_test_tensor)
            _, predicted = torch.max(outputs.data, 1)
            correct = (predicted == y_test_tensor).sum().item()
            print(f"Accuracy: {correct / len(y_test_tensor):.4f}")

This code snippet trains the neural network model for a specified number of epochs and evaluates its performance on the test dataset.

It’s the usual training loop for PyTorch models.

num_epochs is the number of times the model will see the entire training dataset during training.

We will evaluate the model’s performance on the test dataset every 10 epochs, although I will only consider the final performance for comparison.

The output of the model is:

Epoch 10/100, Loss: 1.5230
Accuracy: 0.4938
Epoch 20/100, Loss: 1.5180
Accuracy: 0.4979
Epoch 30/100, Loss: 1.5168
Accuracy: 0.4854
Epoch 40/100, Loss: 1.5152
Accuracy: 0.4958
Epoch 50/100, Loss: 1.5135
Accuracy: 0.4875
Epoch 60/100, Loss: 1.5115
Accuracy: 0.4896
Epoch 70/100, Loss: 1.5094
Accuracy: 0.4896
Epoch 80/100, Loss: 1.5072
Accuracy: 0.4917
Epoch 90/100, Loss: 1.5051
Accuracy: 0.5000
Epoch 100/100, Loss: 1.5028
Accuracy: 0.5021

The model gets about 50.21% accuracy on the test dataset, which is not good, as a simple logistic regression gets about 56% accuracy.
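
If you want to reproduce that logistic regression baseline, a minimal sketch with scikit-learn could look like the following; the exact accuracy will depend on the solver and preprocessing, so treat the ~56% figure as approximate:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaling also helps the logistic regression solver converge,
# so we wrap it in a pipeline with StandardScaler.
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)
print(logreg.score(X_test, y_test))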

Let’s see if scaling the features can improve the model’s performance.

Neural Networks With Feature Scaling

Let’s use the StandardScaler from scikit-learn to scale the features.

This scaler standardizes the features by subtracting the mean and dividing by the standard deviation, which makes the features have a mean of 0 and a standard deviation of 1.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_tensor_scaled = torch.FloatTensor(X_train_scaled)
X_test_tensor_scaled = torch.FloatTensor(X_test_scaled)

Notice that we fit only on the training data to avoid data leakage.
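
As a quick check, the scaled training features should now have a mean of approximately 0 and a standard deviation of approximately 1 for every column (the test set will be close, but not exact, because it was transformed with the training statistics):

import numpy as np

print(np.round(X_train_scaled.mean(axis=0), 3))  # ~0 for every feature
print(np.round(X_train_scaled.std(axis=0), 3))   # ~1 for every feature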

Now, let’s train the neural network again.

model_scaled = nn.Sequential(
    nn.Linear(11, 128),
    nn.ReLU(),
    nn.Linear(128, 6),
    nn.Softmax(dim=1)
)
        
optimizer = optim.Adam(model_scaled.parameters())
criterion = nn.CrossEntropyLoss()

num_epochs = 100
for epoch in range(num_epochs):
    model_scaled.train()
    
    optimizer.zero_grad()
    outputs = model_scaled(X_train_tensor_scaled)
    
    loss = criterion(outputs, y_train_tensor)
    loss.backward()
    optimizer.step()
    
    if (epoch+1) % 10 == 0:
    
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}")

        model_scaled.eval()
        with torch.no_grad():
            outputs = model_scaled(X_test_tensor_scaled)
            _, predicted = torch.max(outputs.data, 1)
            correct = (predicted == y_test_tensor).sum().item()
            print(f"Accuracy: {correct / len(y_test_tensor):.4f}")

The code outputs:

Epoch 10/100, Loss: 1.7280
Accuracy: 0.4792
Epoch 20/100, Loss: 1.6664
Accuracy: 0.5062
Epoch 30/100, Loss: 1.6117
Accuracy: 0.5208
Epoch 40/100, Loss: 1.5656
Accuracy: 0.5271
Epoch 50/100, Loss: 1.5296
Accuracy: 0.5437
Epoch 60/100, Loss: 1.5036
Accuracy: 0.5563
Epoch 70/100, Loss: 1.4858
Accuracy: 0.5521
Epoch 80/100, Loss: 1.4736
Accuracy: 0.5458
Epoch 90/100, Loss: 1.4650
Accuracy: 0.5417
Epoch 100/100, Loss: 1.4587
Accuracy: 0.5500

The model gets about 55% accuracy on the test dataset, which is an improvement over the unscaled model.

It can very likely be improved by tuning the hyperparameters, but my goal here is just to show you how scaling the features can improve the performance of neural networks.

For some datasets, the impact of feature scaling might be more significant, while for others, it might be less pronounced.

Regardless, it’s always worth trying!

And you can try other scalers from scikit-learn, such as the MinMaxScaler or the RobustScaler.
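
Swapping in a different scaler only changes a couple of lines. For example, here is a sketch of how the MinMaxScaler (which rescales every feature to the [0, 1] range) could be dropped in; the tensor names are just illustrative:

from sklearn.preprocessing import MinMaxScaler  # or RobustScaler

scaler = MinMaxScaler()
X_train_minmax = scaler.fit_transform(X_train)  # fit on training data only
X_test_minmax = scaler.transform(X_test)

X_train_tensor_minmax = torch.FloatTensor(X_train_minmax)
X_test_tensor_minmax = torch.FloatTensor(X_test_minmax)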

Neural Networks With Feature Normalization

Normalization is a type of feature scaling where the goal is to adjust each row of the feature matrix to have a unit norm.

Instead of doing a transformation on all the values of a column, we do a transformation on each row.

To apply normalization to the dataset, you can use the Normalizer class from scikit-learn:

from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
X_train_normalized = normalizer.fit_transform(X_train)
X_test_normalized = normalizer.transform(X_test)

X_train_tensor_normalized = torch.FloatTensor(X_train_normalized)
X_test_tensor_normalized = torch.FloatTensor(X_test_normalized)
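
You can verify the row-wise behavior directly: after the transform, every row of the training matrix should have an L2 norm of (approximately) 1:

import numpy as np

row_norms = np.linalg.norm(X_train_normalized, axis=1)
print(row_norms.min(), row_norms.max())  # both very close to 1.0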

Running the same neural network model on the normalized dataset gives an accuracy of 40.42% on the test dataset, which is much worse than the unscaled model.

This doesn’t mean that normalization is bad in general, just that it doesn’t work well for this dataset.

Neural Networks With Log-Transformed Features

Log transformation is a technique used to stabilize the variance, normalize the distribution, and reduce the impact of outliers in the data.

It can be particularly useful when working with data that has a skewed distribution.

Let’s apply log transformation to our dataset and see how it affects the performance of the neural network.

Before applying the log transformation, we need to ensure that all the data is positive since the log function is undefined for negative values.

In our dataset, all features are positive, so we can proceed with the log transformation.

import numpy as np

X_train_log = np.log1p(X_train)
X_test_log = np.log1p(X_test)

X_train_tensor_log = torch.FloatTensor(X_train_log.values)
X_test_tensor_log = torch.FloatTensor(X_test_log.values)

I like to use the log1p function from numpy because it’s more numerically stable than the log function.

It solves the problem of taking the log of zero values too.

It computes the natural logarithm of 1+x instead of x. To get the original value, you can use the expm1 function.
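
Here is a tiny illustration of that round trip:

import numpy as np

x = np.array([0.0, 0.5, 10.0])
transformed = np.log1p(x)          # natural log of (1 + x), safe at x = 0
recovered = np.expm1(transformed)  # inverse transform

print(recovered)  # recovers the original values (up to floating-point precision)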

Unfortunately, the log transformation doesn’t improve the performance of the neural network, which gets about 43% accuracy.

Anyway, it’s another tool that you will have in your toolbox.

Neural Networks With Batch Normalization In The Input Layer

Instead of scaling the features with an external scaler, we can use a technique called batch normalization to scale the features directly in each batch.

The biggest advantage of batch normalization is that it doesn’t require any additional preprocessing of the data.

It’s a technique that is used in many deep learning models, so it’s worth learning about it.

Batch normalization will first standardize the features, like the StandardScaler, but only for the current batch.

Then, it applies two learnable parameters, called scale and shift, to each feature.

These parameters allow the network to adjust the normalized data in a way that best suits its learning process.

We just need to modify the neural network model code:

model = nn.Sequential(
    nn.BatchNorm1d(11),
    nn.Linear(11, 128),
    nn.ReLU(),
    nn.Linear(128, 6),
    nn.Softmax(dim=1)
)

The nn.BatchNorm1d layer will normalize the features in each batch. We pass the number of features as an argument.
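
If you want to peek at the scale and shift parameters, you can inspect them directly on the layer; in PyTorch they are exposed as weight (the scale, often called gamma) and bias (the shift, often called beta), one value per feature, and they are learned during training:

bn = model[0]            # the nn.BatchNorm1d(11) layer in the Sequential model
print(bn.weight.shape)   # torch.Size([11]) -- learnable scale, initialized to 1
print(bn.bias.shape)     # torch.Size([11]) -- learnable shift, initialized to 0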

Running this model over the unscaled dataset gives a final accuracy of 54.79% on the test dataset.

It’s not as good as the model with StandardScaler, but the difference is so small that it’s probably due to noise.

I would take this over the standardized model just because it’s simpler to deploy and gets about the same performance.

What about decision tree-based models like Random Forests or linear models?