Scaling and regularisation
It is recommended to scale the data before applying regularisation, and on this page I want to show why.
The reason is quite simple: the penalty terms most commonly added to the objective function to regularise a model look like this:
\(\sum_i{\beta_i^2}\) (the ridge penalty);
\(\sum_i{\left|\beta_i\right|}\) (the lasso penalty).
Every coefficient enters the penalty with the same weight, regardless of the scale of the feature it multiplies, so the optimisation algorithm naturally benefits most from reducing first the coefficients that are numerically largest, regardless of the economic or physical meaning of the variables.
Thus regularisation may compress large coefficients far too strongly without any intelligible reason for doing so. It is to counteract this that it is recommended to bring the data to a uniform scale by any available means.
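For concreteness, the objective minimised by ridge regression (the estimator used below) can be written, ignoring the unpenalised intercept, as \(\sum_j{\left(y_j - x_j^\top\beta\right)^2} + \alpha\sum_i{\beta_i^2}\), so the penalty depends only on the raw magnitudes of the \(\beta_i\) and knows nothing about the scale of the corresponding features.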
Below is a small experiment that confirms this idea.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
Let’s say you have a data frame with two features whose scales are dramatically different.
The following cell reproduces such a case. There are two features, and the values of the first are typically about 100 times smaller than the values of the second. Nevertheless, they have roughly the same overall effect on the target variable, because each unit of the first feature contributes 100 times more to the value of the explained variable.
np.random.seed(10)
sample_size = 200

# two features: the first is typically about 100 times smaller in scale than the second
X = np.concatenate(
    [
        np.random.normal(1, 0.5, (sample_size, 1)),
        np.random.normal(100, 50, (sample_size, 1))
    ],
    axis=1
)
# the true coefficients compensate for the scale gap: 500 for the small-scale feature, 5 for the large-scale one
y = np.dot(X, np.array([500, 5])) + np.random.normal(0, 10, sample_size)
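As a quick sanity check of the scale gap, it is worth comparing the per-feature means and standard deviations; the exact values depend on the random draw, but both differ by roughly a factor of 100:

# per-feature location and spread of the generated data
print(X.mean(axis=0))
print(X.std(axis=0))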
We will now gradually increase the regularisation strength \(\alpha\). Note that the larger coefficient, \(\beta_1\) (the one attached to the small-scale feature), decreases much faster than \(\beta_2\), even in relative terms:
\(\beta_1(\alpha=0) \approx 2\beta_1(\alpha=45)\) - the larger coefficient is roughly halved;
\(\beta_2(\alpha=0) \approx \beta_2(\alpha=45)\) - the smaller coefficient decreases only slightly.
# fit a ridge regression for each value of alpha and collect the raw coefficients
display_frame = pd.DataFrame(
    {alpha: Ridge(alpha=alpha).fit(X, y).coef_ for alpha in np.arange(0, 50, 5)}
).T
display_frame.index.name = "$\\alpha$"
display_frame.columns = ["$\\beta_1$", "$\\beta_2$"]
display(display_frame)
| $\alpha$ | $\beta_1$ | $\beta_2$ |
|---|---|---|
| 0 | 499.305569 | 5.029773 |
| 5 | 451.840445 | 5.012651 |
| 10 | 412.616182 | 4.998492 |
| 15 | 379.658049 | 4.986585 |
| 20 | 351.575574 | 4.976432 |
| 25 | 327.361354 | 4.967669 |
| 30 | 306.267632 | 4.960028 |
| 35 | 287.727714 | 4.953306 |
| 40 | 271.304290 | 4.947344 |
| 45 | 256.654507 | 4.942020 |
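To quantify the disparity, it is enough to compare the last row of this table with the first (a small check that reuses the display_frame computed above): \(\beta_1\) keeps only about half of its unregularised value, while \(\beta_2\) keeps about 98% of it.

# ratio of the coefficients at alpha=45 to the coefficients at alpha=0
print(display_frame.loc[45] / display_frame.loc[0])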
Now let's repeat the same operation, but with a standardised feature matrix. The coefficients to be applied to the standardised data and the coefficients rescaled for use with the original data are displayed in separate columns.
As a result:
the coefficients on the standardised data decrease at a uniform rate, even in absolute terms;
after rescaling the coefficients back to the original data, the difference in scale makes the absolute decrease much larger for the bigger coefficient, but in relative terms both coefficients shrink at a comparable rate of roughly 20%.
# standardise the features: zero mean and unit standard deviation
means = X.mean(axis=0)
std = X.std(axis=0)
X_stand = (X - means) / std

# fit a ridge regression on the standardised data for each value of alpha
display_frame = pd.DataFrame(
    {alpha: Ridge(alpha=alpha).fit(X_stand, y).coef_ for alpha in np.arange(0, 50, 5)}
).T
display_frame.index.name = "$\\alpha$"
display_frame.columns = ["$\\beta_1$", "$\\beta_2$"]

pd.concat(
    [
        display_frame,
        # slope coefficients rescaled for the original data: beta_original = beta_standardised / std
        # (the intercept changes as well, but only the slopes are shown here)
        display_frame / std
    ],
    keys=["Standardised data", "Original data"],
    axis=1
)
| $\alpha$ | Standardised $\beta_1$ | Standardised $\beta_2$ | Original $\beta_1$ | Original $\beta_2$ |
|---|---|---|---|---|
| 0 | 243.721886 | 232.585223 | 501.430354 | 7.222430 |
| 5 | 237.582185 | 226.708414 | 488.852137 | 7.095341 |
| 10 | 231.744246 | 221.121240 | 476.892129 | 6.974515 |
| 15 | 226.186353 | 215.802807 | 465.505843 | 6.859501 |
| 20 | 220.888825 | 210.734182 | 454.652961 | 6.749890 |
| 25 | 215.833785 | 205.898170 | 444.296855 | 6.645309 |
| 30 | 211.004954 | 201.279119 | 434.404177 | 6.545419 |
| 35 | 206.387481 | 196.862749 | 424.944500 | 6.449913 |
| 40 | 201.967783 | 192.636007 | 415.890002 | 6.358508 |
| 45 | 197.733422 | 188.586938 | 407.215194 | 6.270944 |
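In practice, the standardisation step is usually attached to the model itself rather than applied by hand, for example with sklearn's StandardScaler inside a pipeline. Below is a minimal sketch using the same X and y as above; the value alpha=5 is arbitrary and only for illustration.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# the scaler is fitted together with the ridge model, so the regularisation
# always acts on features brought to a comparable scale
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=5))
pipeline.fit(X, y)

# coefficients in the standardised feature space
print(pipeline.named_steps["ridge"].coef_)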