# Model selection
This section considers approaches for selecting the best model. Typically, the best model is understood to be the one that achieves the best results according to the chosen metrics while also having the simplest possible form.
## Train/test split
A train/test split is an approach where the model is trained on one subset of the samples and evaluated on another. It can be realized in a number of ways, each suited to different cases; all of them involve computing a number that describes the quality of the predictions on the test set.
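For instance, the simplest variant, a single hold-out split, can be sketched as follows. This is a minimal illustration assuming scikit-learn; the dataset and estimator are arbitrary placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic placeholder data
X, y = make_classification(n_samples=500, random_state=0)

# Hold-out: a single split into training and testing parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# A single number describing the quality of predictions on the test set
print(model.score(X_test, y_test))
```

The table below surveys the common splitting strategies.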
| Name | Description | Use Case | Stratification | Repeats | Group Support |
|---|---|---|---|---|---|
| Hold-Out | Split data into training and testing once | Quick evaluation on large datasets | Optional | No | No |
| K-Fold | Split into k folds, each used once as validation | Balanced use of data, general-purpose | Optional | No | No |
| Stratified K-Fold | K-Fold but preserves label proportions | Classification tasks with imbalanced classes | Yes | No | No |
| Repeated K-Fold | Repeats K-Fold CV multiple times with different splits | More robust estimate of performance | Optional | Yes | No |
| Repeated Stratified K-Fold | Stratified K-Fold with repetitions | Imbalanced classification + variance estimation | Yes | Yes | No |
| Leave-One-Out (LOO) | Each sample is a test set once | Small datasets, unbiased but high variance | No | No | No |
| Leave-P-Out | Leave p samples out for testing, train on the rest | Small datasets; computationally expensive | No | No | No |
| Leave-One-Group-Out | Like LOO, but groups (not samples) are left out | Grouped observations (e.g., subjects in medical data) | N/A | No | Yes |
| Leave-P-Groups-Out | Leave p groups out for testing | Group-based validation with more generality | N/A | No | Yes |
| Group K-Fold | Like K-Fold, but entire groups are kept together in folds | Prevents data leakage across groups | N/A | No | Yes |
| TimeSeriesSplit | K-Fold variant for time-ordered data, no shuffling | Time series forecasting and sequential data | No | No | No |
| Nested CV | Inner CV for model selection, outer CV for performance estimation | Model tuning + unbiased generalization estimate | Optional | Yes | Possible |
| Monte Carlo / ShuffleSplit | Random train/test splits repeated several times | Flexible alternative to K-Fold, useful for variance estimation | Optional | Yes | No |
| Predefined Split | Use a user-defined array to split data | When external split information (e.g., study design) is given | N/A | No | Possible |
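To make the table concrete, here is a sketch of how two of these strategies are used with scikit-learn's `cross_val_score`; the estimator and data are arbitrary placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Imbalanced synthetic data (roughly 80/20 class proportions)
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
model = LogisticRegression(max_iter=1000)

# Plain K-Fold: five folds, each used once as the validation set
kf = KFold(n_splits=5, shuffle=True, random_state=0)
kf_scores = cross_val_score(model, X, y, cv=kf)

# Stratified K-Fold: preserves the label proportions in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
skf_scores = cross_val_score(model, X, y, cv=skf)

print(kf_scores.mean(), skf_scores.mean())
```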
## Regularization
Regularization is a technique that modifies the loss function of a parametric model by adding a component that increases with the magnitude of the learnable parameters.
Suppose we have a parametric machine learning model with weights \(W = \left(w_1, w_2, \ldots, w_n\right)\) that is fitted by minimizing a loss function \(L\). In that case:

- L1 regularization (lasso) uses the loss function \(L'(W) = L(W) + \lambda\sum_{i=1}^n |w_i|\).
- L2 regularization (ridge) uses the loss function \(L''(W) = L(W) + \lambda\sum_{i=1}^n w_i^2\).
Here, \(\lambda \geq 0\) is a parameter that defines the strength of the regularization.
The following cell generates a two-dimensional linear regression task.
```python
import numpy as np
import sklearn.linear_model

X = np.random.random((200, 2))
# True coefficients are [2, 3]; Gaussian noise is drawn per sample
y = np.dot(X, np.array([2, 3])) + np.random.normal(0, 1, size=200)
```
The next cell uses plain (unregularized) linear regression to estimate the coefficients.
```python
no_regul = sklearn.linear_model.LinearRegression(fit_intercept=False).fit(X, y)
no_regul.coef_
```

```
array([1.83020504, 2.82172752])
```
The same data, but fitted with relatively strong L1 (lasso) regularization.
```python
strong_regular = sklearn.linear_model.Lasso(alpha=1, fit_intercept=False).fit(X, y)
strong_regular.coef_
```

```
array([0.        , 1.14487664])
```
As a result, the regularized model produces smaller coefficients, and the lasso penalty drives the first coefficient exactly to zero.
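For comparison, an analogous fit with L2 regularization can be sketched with `Ridge`; `alpha=1` here is an arbitrary choice, and unlike lasso, ridge shrinks the coefficients without forcing them exactly to zero:

```python
strong_ridge = sklearn.linear_model.Ridge(alpha=1, fit_intercept=False).fit(X, y)
# Coefficients are shrunk relative to the unregularized fit, but typically nonzero
strong_ridge.coef_
```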