If you are using Random Forest as your machine learning model, you generally don’t need to worry about scaling or normalizing your features.
Random Forest is a tree-based model, and tree-based models are invariant to the scale of the features: each split compares a feature against a threshold, and any monotonic rescaling of a feature simply shifts that threshold without changing which samples fall on each side.
This makes tree-based models very user-friendly, since the scaling step can usually be skipped during preprocessing.
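As a quick illustration, here’s a sketch with synthetic data and a single decision tree (the same invariance carries over to a forest of them). Multiplying a feature by a constant rescales the learned split thresholds but shouldn’t change the predictions:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
# Synthetic data: 200 samples, 3 features, target driven by the first feature
rng = np.random.default_rng(0)
X_demo = rng.random((200, 3))
y_demo = X_demo[:, 0] + rng.normal(0, 0.1, 200)
# Fit one tree on the raw features and one on features multiplied by 2
# (a factor of 2 is exact in binary floating point, which isolates
# scale-invariance from round-off effects)
tree_raw = DecisionTreeRegressor(random_state=0)
tree_raw.fit(X_demo, y_demo)
tree_scaled = DecisionTreeRegressor(random_state=0)
tree_scaled.fit(X_demo * 2.0, y_demo)
# Should print True: the split thresholds doubled, the predictions didn't move
print(np.array_equal(tree_raw.predict(X_demo), tree_scaled.predict(X_demo * 2.0)))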
Still, in practice you can see slightly different results when you scale your features, because scaling changes the floating-point representation of the values, and those tiny round-off differences can occasionally nudge a split threshold one way or the other.
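Here’s a minimal sketch of that effect, with made-up values: standardizing and then inverting the transform doesn’t always reproduce the exact same bits, even though the values are indistinguishable for practical purposes.
import numpy as np
from sklearn.preprocessing import StandardScaler
# Made-up values: scale them and then undo the scaling
x = np.array([[0.1], [0.2], [0.3], [10.1]])
scaler = StandardScaler()
x_roundtrip = scaler.inverse_transform(scaler.fit_transform(x))
# The round trip may not be bit-for-bit identical...
print(np.array_equal(x, x_roundtrip))
# ...but the differences are far below anything that matters numerically
print(np.allclose(x, x_roundtrip))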
Let’s try a few scaling methods and see how they affect the performance of a Random Forest model.
Comparing Random Forest Results With and Without Scaling
In this part, we’re going to experiment with a Random Forest model under four different scenarios using the same dataset.
In the first scenario, we won’t perform any preprocessing and will feed the raw dataset to the model.
In the second scenario, we’ll standardize the features of our dataset before using it.
In the third, we’ll use MinMaxScaler to transform our features so they fall within a specific range. Finally, in the fourth, we’ll use Normalizer to rescale each individual sample to unit norm.
Random Forest Without Scaling
Let’s start by running a Random Forest on the red wine dataset without scaling the features.
The red wine dataset is a popular choice in machine learning and well suited to regression tasks. It’s a collection of physicochemical measurements of Portuguese “Vinho Verde” red wines.
The dataset has various chemical properties of the wine, like acidity, residual sugar, pH, and alcohol content, plus a quality rating, which is our target variable.
Here’s the code for our baseline:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url, sep=";")
# Split inputs and target
X = data.drop('quality', axis=1)
y = data['quality']
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize RandomForestRegressor and fit the training data
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
# Predict and measure RMSE
predictions = rfr.predict(X_test)
mse = mean_squared_error(y_test, predictions)
rmse = mse ** 0.5
print("RMSE without scaling: ", rmse)
Running this code gives us an RMSE of 0.5579, which is our baseline score. Note that RandomForestRegressor is itself randomized, and we haven’t fixed its random_state, so your exact number will vary slightly from run to run.
Random Forest with StandardScaler
Now, let’s scale our features using StandardScaler and see if there’s any difference in the model’s performance.
This scaler transforms each feature so that it has a mean of 0 and a standard deviation of 1, using z = (x − μ) / σ, where μ and σ are the mean and standard deviation of that feature in the training data.
from sklearn.preprocessing import StandardScaler
# Initialize the StandardScaler
scaler = StandardScaler()
# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)
# Transform test data
X_test_scaled = scaler.transform(X_test)
# Initialize RandomForestRegressor and fit the scaled training data
rfr_scaled = RandomForestRegressor()
rfr_scaled.fit(X_train_scaled, y_train)
# Predict and measure RMSE
predictions_scaled = rfr_scaled.predict(X_test_scaled)
mse_scaled = mean_squared_error(y_test, predictions_scaled)
rmse_scaled = mse_scaled ** 0.5
print("RMSE with StandardScaler: ", rmse_scaled)
We got 0.5605 as our RMSE score, which is slightly worse than the baseline score.
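As a quick sanity check, reusing the variables from the block above, you can confirm what StandardScaler did to the training features:
import numpy as np
# Each column of the standardized training set should now have a mean of
# roughly 0 and a standard deviation of roughly 1, up to floating-point error
print(np.allclose(X_train_scaled.mean(axis=0), 0))
print(np.allclose(X_train_scaled.std(axis=0), 1))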
Random Forest with MinMaxScaler
Next, we’ll use MinMaxScaler, another scaling method that transforms each feature so it fits within a specific range, typically between 0 and 1, using x′ = (x − min) / (max − min), where the minimum and maximum are taken per feature from the training data.
from sklearn.preprocessing import MinMaxScaler
# Initialize MinMaxScaler
minmax = MinMaxScaler()
# Fit and transform the training data
X_train_minmax = minmax.fit_transform(X_train)
# Transform test data
X_test_minmax = minmax.transform(X_test)
# Initialize RandomForestRegressor and fit the MinMax scaled training data
rfr_minmax = RandomForestRegressor()
rfr_minmax.fit(X_train_minmax, y_train)
# Predict and measure RMSE
predictions_minmax = rfr_minmax.predict(X_test_minmax)
mse_minmax = mean_squared_error(y_test, predictions_minmax)
rmse_minmax = mse_minmax ** 0.5
print("RMSE with MinMaxScaler: ", rmse_minmax)
This code gives us an RMSE of 0.5643, which is also slightly worse than the baseline score.
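A similar sanity check with the variables above shows that every column of the transformed training set now spans exactly [0, 1]. The test set can fall slightly outside that range, since its minima and maxima weren’t used to fit the scaler:
# Per-feature minimums should all be 0 and maximums should all be 1
print(X_train_minmax.min(axis=0))
print(X_train_minmax.max(axis=0))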
Random Forest with Normalizer
Normalization rescales individual samples to have unit norm: with the default L2 norm, each sample (row) x is replaced by x / ‖x‖.
This can be useful if you plan to use a quadratic form such as the dot product, or any other kernel, to quantify the similarity of a pair of samples.
Let’s see how a Random Forest model performs with data normalized using sklearn’s Normalizer.
from sklearn.preprocessing import Normalizer
# Initialize Normalizer
normalizer = Normalizer()
# Fit and transform the training data
X_train_normalized = normalizer.fit_transform(X_train)
# Transform the test data
X_test_normalized = normalizer.transform(X_test)
# Initialize RandomForestRegressor and fit normalized training data
rfr_normalized = RandomForestRegressor()
rfr_normalized.fit(X_train_normalized, y_train)
# Predict and measure RMSE
predictions_normalized = rfr_normalized.predict(X_test_normalized)
mse_normalized = mean_squared_error(y_test, predictions_normalized)
rmse_normalized = mse_normalized ** 0.5
print("RMSE with Normalizer: ", rmse_normalized)
The RMSE with Normalizer is 0.5548, which is slightly better than the baseline score.
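A last check, using the variables above, makes the difference from the two scalers concrete: Normalizer works row by row rather than column by column, so every sample ends up with unit L2 norm while the columns keep their original, arbitrary ranges:
import numpy as np
# Every row (sample) of the normalized training set should have length 1
row_norms = np.linalg.norm(X_train_normalized, axis=1)
print(np.allclose(row_norms, 1.0))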
Should You Worry About Scaling Features for Random Forest?
The experiment shows that scaling the features doesn’t necessarily improve the performance of a Random Forest model.
In fact, both StandardScaler and MinMaxScaler resulted in a slightly higher RMSE than the baseline, while Normalizer came out slightly lower. All of these differences are small enough to fall within the run-to-run variation of an unseeded Random Forest, so they shouldn’t be read as real effects of the scaling.
This reinforces the point that Random Forest, being a tree-based model, isn’t sensitive to the scale of the input features.
However, it’s worth remembering that even when scaling isn’t required, the floating-point round-off introduced by a transformation can still nudge your model’s output slightly.
When Does Feature Scaling Matter?
Scaling is important for many other types of machine learning models.
Distance-based models such as K-Nearest Neighbors, and models trained with gradient-descent optimization, such as neural networks or linear models fit with stochastic gradient descent, generally need scaled features.
In these models, features on larger scales can dominate the distance computations or the optimization process, leading to suboptimal performance.
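To see the contrast for yourself, here’s a minimal sketch, assuming it runs in the same session as the code above (so X_train, X_test, X_train_scaled, X_test_scaled, and the earlier imports are still available), that fits K-Nearest Neighbors with and without standardization:
from sklearn.neighbors import KNeighborsRegressor
# KNN on the raw features: wide-range features such as total sulfur dioxide
# dominate the Euclidean distances between samples
knn_raw = KNeighborsRegressor()
knn_raw.fit(X_train, y_train)
rmse_knn_raw = mean_squared_error(y_test, knn_raw.predict(X_test)) ** 0.5
# KNN on the standardized features: every feature contributes comparably
knn_scaled = KNeighborsRegressor()
knn_scaled.fit(X_train_scaled, y_train)
rmse_knn_scaled = mean_squared_error(y_test, knn_scaled.predict(X_test_scaled)) ** 0.5
print("KNN RMSE without scaling: ", rmse_knn_raw)
print("KNN RMSE with scaling: ", rmse_knn_scaled)
Unlike the Random Forest results above, the scaled version should come out clearly ahead here, because standardization stops the widest-ranging features from dominating the distance metric.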