Scaling and regularisation

It is recommended to scale the data before applying regularisation. On this page I want to show why.

The reason is quite simple: the penalty terms that are typically added to the objective function to regularise the model look like this:

  • \(\sum{\beta_i^2}\);

  • \(\sum{\left|\beta_i\right|}\).
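For concreteness, the first penalty enters the ridge objective like this (this is the form minimised by sklearn's Ridge, with \(\alpha\) controlling the strength of the regularisation):

\[
\min_{\beta}\; \left\|y - X\beta\right\|_2^2 + \alpha\sum_i{\beta_i^2}
\]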

Each coefficient enters these penalties in exactly the same way, regardless of the scale of the feature it multiplies, so the optimisation algorithm benefits most from shrinking the largest coefficients first, regardless of the economic/physical sense of the variables.

A feature with a small scale needs a large coefficient to produce the same effect on the target, so it is precisely such coefficients that regularisation compresses the hardest, without any intelligible reason for doing so. It is to counteract this phenomenon that it is recommended to bring the data to a uniform scale by any available means.
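A minimal sketch of this effect (the numbers here are made up purely for illustration): rescaling a feature by a factor of 100 forces its coefficient to grow by the same factor, so an identical fit suddenly pays a much larger L2 penalty.

import numpy as np

np.random.seed(0)
x = np.random.normal(0, 1, 100)     # a feature on a "unit" scale
x_small = x / 100                   # the same feature, 100 times smaller

beta, beta_small = 2.0, 200.0       # coefficients that give identical predictions
print(np.allclose(x * beta, x_small * beta_small))  # True - the fits coincide

print(beta ** 2, beta_small ** 2)   # 4.0 vs 40000.0 - the penalties differ hugely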

Below is a small experiment that confirms this idea.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

Let’s say you have a data frame with two features whose scales are dramatically different.

So in the following cell I reproduce such a case. There are two features, and the values of the first feature are typically 100 times smaller than those of the second. Nevertheless, they have roughly the same overall effect on the target variable: each unit of the first feature contributes 100 times more to the value of the explained variable.

np.random.seed(10)
sample_size = 200

# first feature ~ N(1, 0.5), second feature ~ N(100, 50):
# the second feature has roughly 100 times the scale of the first
X = np.concatenate(
    [
        np.random.normal(1, 0.5, (sample_size, 1)),
        np.random.normal(100, 50, (sample_size, 1))
    ],
    axis=1
)

# the true coefficients compensate for the difference in scale: 500 and 5
y = np.dot(X, np.array([500, 5])) + np.random.normal(0, 10, sample_size)
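As a rough check of the "same overall effect" claim (this snippet is mine, not part of the original experiment): the typical contribution of each feature to the target, i.e. its standard deviation times its true coefficient, is about \(0.5 \times 500 = 250\) for the first feature and \(50 \times 5 = 250\) for the second.

# standard deviation of each feature times its true coefficient
print(X.std(axis=0) * np.array([500, 5]))  # both values are comparable, roughly 250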

We will now gradually increase the regularisation strength \(\alpha\). Note that the larger coefficient, which belongs to the feature with the smaller scale, decreases much faster than the other one. Even in relative terms:

  • \(\beta_1(\alpha=0) \approx 2\beta_1(\alpha=45)\): the coefficient of the feature with the smaller scale is roughly halved;

  • \(\beta_2(\alpha=0) \approx \beta_2(\alpha=45)\): the coefficient of the feature with the larger scale decreases only slightly (a quick check after the table below confirms this).

# fit a ridge regression for each alpha and collect the coefficients
display_frame = pd.DataFrame(
    {alpha: Ridge(alpha=alpha).fit(X, y).coef_ for alpha in np.arange(0, 50, 5)}
).T

display_frame.index.name = "$\\alpha$"
display_frame.columns = ["$\\beta_1$", "$\\beta_2$"]
display(display_frame)
$\alpha$    $\beta_1$     $\beta_2$
0           499.305569    5.029773
5           451.840445    5.012651
10          412.616182    4.998492
15          379.658049    4.986585
20          351.575574    4.976432
25          327.361354    4.967669
30          306.267632    4.960028
35          287.727714    4.953306
40          271.304290    4.947344
45          256.654507    4.942020
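A quick check of the bullets above (this snippet is mine and assumes display_frame from the previous cell is still in scope):

# ratio of each coefficient at alpha=45 to its value at alpha=0
print(display_frame.loc[45] / display_frame.loc[0])
# $\beta_1$ keeps only ~51% of its value, while $\beta_2$ keeps ~98%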

Now we want to do the same operation, but with a standardised feature matrix. The coefficients to be used with the standardised data and those transformed back for use with the original data are displayed in separate groups of columns.

As a result:

  • The coefficients on the standardised data decrease uniformly, even in absolute terms;

  • If you transform the coefficients so that they can be applied directly to the original data, the difference in scale makes the absolute decrease larger for the feature with the smaller scale, but in relative terms both coefficients decrease by roughly the same ~20% (a quick check of this is given after the table).

means = X.mean(axis=0)
std = X.std(axis=0)

# standardise the features: zero mean, unit standard deviation
X_stand = (X - means) / std

# fit a ridge regression for each alpha on the standardised data
display_frame = pd.DataFrame(
    {alpha: Ridge(alpha=alpha).fit(X_stand, y).coef_ for alpha in np.arange(0, 50, 5)}
).T

display_frame.index.name = "$\\alpha$"
display_frame.columns = ["$\\beta_1$", "$\\beta_2$"]
pd.concat(
    [
        display_frame,
        # slopes for the original data: dividing by the standard deviations undoes
        # the scaling; the subtracted means only affect the intercept
        display_frame / std
    ],
    keys=["Standardised data", "Original data"],
    axis=1
)
             Standardised data            Original data
$\alpha$     $\beta_1$      $\beta_2$     $\beta_1$      $\beta_2$
0            243.721886     232.585223    501.430354     7.222430
5            237.582185     226.708414    488.852137     7.095341
10           231.744246     221.121240    476.892129     6.974515
15           226.186353     215.802807    465.505843     6.859501
20           220.888825     210.734182    454.652961     6.749890
25           215.833785     205.898170    444.296855     6.645309
30           211.004954     201.279119    434.404177     6.545419
35           206.387481     196.862749    424.944500     6.449913
40           201.967783     192.636007    415.890002     6.358508
45           197.733422     188.586938    407.215194     6.270944
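As a final check (again, this snippet is mine and assumes the standardised display_frame from the previous cell is still in scope), the relative shrinkage between \(\alpha=45\) and \(\alpha=0\) is practically identical for both standardised coefficients:

# ratio of each standardised coefficient at alpha=45 to its value at alpha=0
print(display_frame.loc[45] / display_frame.loc[0])
# both ratios are about 0.81, i.e. both coefficients lost roughly 19%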