Model selection#

This section covers approaches for selecting the best model. Typically, the best model is understood as the one that achieves the best results according to the chosen metrics while having the simplest possible form.

Train/test split#

A train/test split is an approach in which we train the model on one set of samples and evaluate it on another. It can be realized in a number of ways, each suited to different situations; all of them involve computing a number that describes the quality of the predictions on the test set. The following table summarizes the common strategies.

| Name | Description | Use Case | Stratification | Repeats | Group Support |
|------|-------------|----------|----------------|---------|---------------|
| Hold-Out | Split data into training and testing once | Quick evaluation on large datasets | Optional | No | No |
| K-Fold | Split into k folds, each used once as validation | Balanced use of data, general-purpose | Optional | No | No |
| Stratified K-Fold | K-Fold but preserves label proportions | Classification tasks with imbalanced classes | Yes | No | No |
| Repeated K-Fold | Repeats K-Fold CV multiple times with different splits | More robust estimate of performance | Optional | Yes | No |
| Repeated Stratified K-Fold | Stratified K-Fold with repetitions | Imbalanced classification + variance estimation | Yes | Yes | No |
| Leave-One-Out (LOO) | Each sample is a test set once | Small datasets; unbiased but high variance | No | No | No |
| Leave-P-Out | Leave p samples out for testing, train on the rest | Small datasets; computationally expensive | No | No | No |
| Leave-One-Group-Out | Like LOO, but groups (not samples) are left out | Grouped observations (e.g., subjects in medical data) | N/A | No | Yes |
| Leave-P-Groups-Out | Leave p groups out for testing | Group-based validation with more generality | N/A | No | Yes |
| Group K-Fold | Like K-Fold, but entire groups are kept together in folds | Prevents data leakage across groups | N/A | No | Yes |
| TimeSeriesSplit | K-Fold variant for time-ordered data, no shuffling | Time series forecasting and sequential data | No | No | No |
| Nested CV | Inner CV for model selection, outer CV for performance estimation | Model tuning + unbiased generalization estimate | Optional | Yes | Possible |
| Monte Carlo / ShuffleSplit | Random train/test splits repeated several times | Flexible alternative to K-Fold, useful for variance estimation | Optional | Yes | No |
| Predefined Split | Use a user-defined array to split data | When external split information (e.g., study design) is given | N/A | No | Possible |
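
As an illustration, here is a minimal sketch of how a couple of these strategies look in scikit-learn. The synthetic dataset and the logistic regression model are arbitrary placeholders, not part of the comparison above.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression()

# Hold-out: a single train/test split, scored once
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
holdout_score = model.fit(X_train, y_train).score(X_test, y_test)

# K-Fold: each of the 5 folds serves as the validation set exactly once
kfold_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)
)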

Regularization#

Regularization is a technique that modifies the loss function of a parametric model by adding a component that increases with the magnitude of the learnable parameters.

Suppose we have a parametric machine learning model with weights \(W = \left(w_1, w_2, \ldots, w_n\right)\) that is fitted by minimizing a loss function \(L\). In that case:

  • L1 regularization (lasso) uses the loss function \(L'(W) = L(W) + \lambda\sum_{i=1}^n |w_i|\).

  • L2 regularization (ridge) uses the loss function \(L''(W) = L(W) + \lambda\sum_{i=1}^n w_i^2\).

Here, \(\lambda \geq 0\) is a parameter that defines the strength of the regularization.
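
As a concrete illustration, the penalty terms can be computed directly. This is a minimal sketch; the weight vector and the value of \(\lambda\) are arbitrary.

import numpy as np

w = np.array([2.0, -3.0, 0.5])  # hypothetical weight vector
lam = 0.1                       # regularization strength

l1_penalty = lam * np.sum(np.abs(w))  # lambda * sum(|w_i|) = 0.55
l2_penalty = lam * np.sum(w**2)       # lambda * sum(w_i^2) = 1.325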


The following cell generates a two-dimensional linear regression task.

import numpy as np
import sklearn.linear_model

X = np.random.random((200, 2))
# True coefficients are [2, 3]; size=200 draws independent noise per sample
y = np.dot(X, np.array([2, 3])) + np.random.normal(0, 1, size=200)

The next code uses ordinary, unregularized linear regression to estimate the coefficients.

no_regul = sklearn.linear_model.LinearRegression(fit_intercept=False).fit(X, y)
no_regul.coef_
array([1.83020504, 2.82172752])

The same data, but fitted with relatively strong L1 regularization.

strong_regular = sklearn.linear_model.Lasso(alpha=1, fit_intercept=False).fit(X, y)
strong_regular.coef_
array([0.        , 1.14487664])

As a result, the regularized model produces smaller coefficients, and the L1 penalty drives one of them exactly to zero.
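
For comparison, here is a sketch of the same fit with L2 (ridge) regularization, defined by the second formula above. Unlike lasso, ridge shrinks the coefficients toward zero but typically does not set them exactly to zero.

# Ridge (L2) regularization on the same X, y as above
strong_ridge = sklearn.linear_model.Ridge(alpha=1, fit_intercept=False).fit(X, y)
strong_ridge.coef_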