Scaling and regularisation
It is recommended to scale the data before applying regularisation, and on this page I want to show why.
The reason is quite simple: the penalty terms most commonly added to the objective function to regularise a model look like this:
\(\sum_i{\beta_i^2}\) (the ridge penalty);
\(\sum_i{\left|\beta_i\right|}\) (the lasso penalty).
Every coefficient enters the penalty with the same weight, regardless of the scale of the feature it multiplies, so the optimisation algorithm naturally benefits most from reducing first the coefficients that are numerically largest, regardless of the economic or physical meaning of the variables.
Thus regularisation may compress large coefficients far too strongly without any intelligible reason for doing so. It is to counteract this that it is recommended to bring the data to a uniform scale by any available means.
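For concreteness, the objective minimised by ridge regression (the estimator used below) can be written, ignoring the unpenalised intercept, as \(\sum_j{\left(y_j - x_j^\top\beta\right)^2} + \alpha\sum_i{\beta_i^2}\), so the penalty depends only on the raw magnitudes of the \(\beta_i\) and knows nothing about the scale of the corresponding features.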
Below is a small experiment that confirms this idea.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
Let’s say you have a data frame with two features whose scales are dramatically different.
The following cell reproduces such a case. There are two features, and the values of the first are typically about 100 times smaller than the values of the second. Nevertheless, they have roughly the same overall effect on the target variable, because each unit of the first feature contributes 100 times more to the value of the explained variable.
np.random.seed(10)
sample_size = 200

# two features: the first is typically about 100 times smaller in scale than the second
X = np.concatenate(
    [
        np.random.normal(1, 0.5, (sample_size, 1)),
        np.random.normal(100, 50, (sample_size, 1))
    ],
    axis=1
)
# the true coefficients compensate for the scale gap: 500 for the small-scale feature, 5 for the large-scale one
y = np.dot(X, np.array([500, 5])) + np.random.normal(0, 10, sample_size)
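As a quick sanity check of the scale gap, it is worth comparing the per-feature means and standard deviations; the exact values depend on the random draw, but both differ by roughly a factor of 100:

# per-feature location and spread of the generated data
print(X.mean(axis=0))
print(X.std(axis=0))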
We will now gradually increase the regularisation strength \(\alpha\). Note that the larger coefficient, \(\beta_1\) (the one attached to the small-scale feature), decreases much faster than \(\beta_2\), even in relative terms:
\(\beta_1(\alpha=0) \approx 2\beta_1(\alpha=45)\) - the larger coefficient is roughly halved;
\(\beta_2(\alpha=0) \approx \beta_2(\alpha=45)\) - the smaller coefficient decreases only slightly.
# fit a ridge regression for each value of alpha and collect the raw coefficients
display_frame = pd.DataFrame(
    {alpha: Ridge(alpha=alpha).fit(X, y).coef_ for alpha in np.arange(0, 50, 5)}
).T
display_frame.index.name = "$\\alpha$"
display_frame.columns = ["$\\beta_1$", "$\\beta_2$"]
display(display_frame)
| $\alpha$ | $\beta_1$ | $\beta_2$ |
|---|---|---|
| 0 | 499.305569 | 5.029773 |
| 5 | 451.840445 | 5.012651 |
| 10 | 412.616182 | 4.998492 |
| 15 | 379.658049 | 4.986585 |
| 20 | 351.575574 | 4.976432 |
| 25 | 327.361354 | 4.967669 |
| 30 | 306.267632 | 4.960028 |
| 35 | 287.727714 | 4.953306 |
| 40 | 271.304290 | 4.947344 |
| 45 | 256.654507 | 4.942020 |
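To quantify the disparity, it is enough to compare the last row of this table with the first (a small check that reuses the display_frame computed above): \(\beta_1\) keeps only about half of its unregularised value, while \(\beta_2\) keeps about 98% of it.

# ratio of the coefficients at alpha=45 to the coefficients at alpha=0
print(display_frame.loc[45] / display_frame.loc[0])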
Now let's repeat the same operation, but with a standardised feature matrix. The coefficients to be applied to the standardised data and the coefficients rescaled for use with the original data are displayed in separate columns.
As a result:
the coefficients on the standardised data decrease at a uniform rate, even in absolute terms;
after rescaling the coefficients back to the original data, the difference in scale makes the absolute decrease much larger for the bigger coefficient, but in relative terms both coefficients shrink at a comparable rate of roughly 20%.
# standardise the features: zero mean and unit standard deviation
means = X.mean(axis=0)
std = X.std(axis=0)
X_stand = (X - means) / std

# fit a ridge regression on the standardised data for each value of alpha
display_frame = pd.DataFrame(
    {alpha: Ridge(alpha=alpha).fit(X_stand, y).coef_ for alpha in np.arange(0, 50, 5)}
).T
display_frame.index.name = "$\\alpha$"
display_frame.columns = ["$\\beta_1$", "$\\beta_2$"]

pd.concat(
    [
        display_frame,
        # slope coefficients rescaled for the original data: beta_original = beta_standardised / std
        # (the intercept changes as well, but only the slopes are shown here)
        display_frame / std
    ],
    keys=["Standardised data", "Original data"],
    axis=1
)
| $\alpha$ | Standardised $\beta_1$ | Standardised $\beta_2$ | Original $\beta_1$ | Original $\beta_2$ |
|---|---|---|---|---|
| 0 | 243.721886 | 232.585223 | 501.430354 | 7.222430 |
| 5 | 237.582185 | 226.708414 | 488.852137 | 7.095341 |
| 10 | 231.744246 | 221.121240 | 476.892129 | 6.974515 |
| 15 | 226.186353 | 215.802807 | 465.505843 | 6.859501 |
| 20 | 220.888825 | 210.734182 | 454.652961 | 6.749890 |
| 25 | 215.833785 | 205.898170 | 444.296855 | 6.645309 |
| 30 | 211.004954 | 201.279119 | 434.404177 | 6.545419 |
| 35 | 206.387481 | 196.862749 | 424.944500 | 6.449913 |
| 40 | 201.967783 | 192.636007 | 415.890002 | 6.358508 |
| 45 | 197.733422 | 188.586938 | 407.215194 | 6.270944 |
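In practice, the standardisation step is usually attached to the model itself rather than applied by hand, for example with sklearn's StandardScaler inside a pipeline. Below is a minimal sketch using the same X and y as above; the value alpha=5 is arbitrary and only for illustration.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# the scaler is fitted together with the ridge model, so the regularisation
# always acts on features brought to a comparable scale
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=5))
pipeline.fit(X, y)

# coefficients in the standardised feature space
print(pipeline.named_steps["ridge"].coef_)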