# Model selection
This section considers approaches for selecting the best model. Typically, the best model is understood to be the one that achieves the best results according to the chosen metrics while also having the simplest possible form.
## Train/test split
A train/test split is an approach where the model is trained on one subset of the samples and evaluated on another. It can be realized in a number of ways, each suited to different cases; all of them involve computing a number that describes the quality of the predictions on the test set.
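For instance, the simplest variant, a single hold-out split, can be sketched as follows. This is a minimal illustration assuming scikit-learn; the dataset and estimator are arbitrary placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic placeholder data
X, y = make_classification(n_samples=500, random_state=0)

# Hold-out: a single split into training and testing parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# A single number describing the quality of predictions on the test set
print(model.score(X_test, y_test))
```

The table below surveys the common splitting strategies.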
| Name | Description | Use Case | Stratification | Repeats | Group Support |
|---|---|---|---|---|---|
| Hold-Out | Split data into training and testing once | Quick evaluation on large datasets | Optional | No | No |
| K-Fold | Split into k folds, each used once as validation | Balanced use of data, general-purpose | Optional | No | No |
| Stratified K-Fold | K-Fold but preserves label proportions | Classification tasks with imbalanced classes | Yes | No | No |
| Repeated K-Fold | Repeats K-Fold CV multiple times with different splits | More robust estimate of performance | Optional | Yes | No |
| Repeated Stratified K-Fold | Stratified K-Fold with repetitions | Imbalanced classification + variance estimation | Yes | Yes | No |
| Leave-One-Out (LOO) | Each sample is a test set once | Small datasets, unbiased but high variance | No | No | No |
| Leave-P-Out | Leave p samples out for testing, train on the rest | Small datasets; computationally expensive | No | No | No |
| Leave-One-Group-Out | Like LOO, but groups (not samples) are left out | Grouped observations (e.g., subjects in medical data) | N/A | No | Yes |
| Leave-P-Groups-Out | Leave p groups out for testing | Group-based validation with more generality | N/A | No | Yes |
| Group K-Fold | Like K-Fold, but entire groups are kept together in folds | Prevents data leakage across groups | N/A | No | Yes |
| TimeSeriesSplit | K-Fold variant for time-ordered data, no shuffling | Time series forecasting and sequential data | No | No | No |
| Nested CV | Inner CV for model selection, outer CV for performance estimation | Model tuning + unbiased generalization estimate | Optional | Yes | Possible |
| Monte Carlo / ShuffleSplit | Random train/test splits repeated several times | Flexible alternative to K-Fold, useful for variance estimation | Optional | Yes | No |
| Predefined Split | Use a user-defined array to split data | When external split information (e.g., study design) is given | N/A | No | Possible |
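To make the table concrete, here is a sketch of how two of these strategies are used with scikit-learn's `cross_val_score`; the estimator and data are arbitrary placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Imbalanced synthetic data (roughly 80/20 class proportions)
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
model = LogisticRegression(max_iter=1000)

# Plain K-Fold: five folds, each used once as the validation set
kf = KFold(n_splits=5, shuffle=True, random_state=0)
kf_scores = cross_val_score(model, X, y, cv=kf)

# Stratified K-Fold: preserves the label proportions in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
skf_scores = cross_val_score(model, X, y, cv=skf)

print(kf_scores.mean(), skf_scores.mean())
```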
## Regularization
Regularization is a technique that modifies the loss function of a parametric model by adding a component that increases with the magnitude of the learnable parameters.
Suppose we have a parametric machine learning model with weights \(W = \left(w_1, w_2, \ldots, w_n\right)\) that is fitted by minimizing a loss function \(L\). In that case:

- L1 regularization (lasso) uses the loss function \(L'(W) = L(W) + \lambda\sum_{i=1}^n |w_i|\).
- L2 regularization (ridge) uses the loss function \(L''(W) = L(W) + \lambda\sum_{i=1}^n w_i^2\).
Here, \(\lambda \geq 0\) is a parameter that defines the strength of the regularization.
The following cell generates a two-dimensional linear regression task.
```python
import numpy as np
import sklearn.linear_model

X = np.random.random((200, 2))
# True coefficients are [2, 3]; Gaussian noise is drawn per sample
y = np.dot(X, np.array([2, 3])) + np.random.normal(0, 1, size=200)
```
The next cell uses plain (unregularized) linear regression to estimate the coefficients.
```python
no_regul = sklearn.linear_model.LinearRegression(fit_intercept=False).fit(X, y)
no_regul.coef_
```

```
array([1.83020504, 2.82172752])
```
The same data, but fitted with relatively strong L1 (lasso) regularization.
```python
strong_regular = sklearn.linear_model.Lasso(alpha=1, fit_intercept=False).fit(X, y)
strong_regular.coef_
```

```
array([0.        , 1.14487664])
```
As a result, the regularized model produces smaller coefficients, and the lasso penalty drives the first coefficient exactly to zero.
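For comparison, an analogous fit with L2 regularization can be sketched with `Ridge`; `alpha=1` here is an arbitrary choice, and unlike lasso, ridge shrinks the coefficients without forcing them exactly to zero:

```python
strong_ridge = sklearn.linear_model.Ridge(alpha=1, fit_intercept=False).fit(X, y)
# Coefficients are shrunk relative to the unregularized fit, but typically nonzero
strong_ridge.coef_
```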