Classic models#
There is a set of approaches for building algorithms that learn patterns from data that were widespread before deep neural networks. This page considers these approaches.
Gradient boosting#
Gradient boosting is an ensemble approach to fitting machine learning models: each subsequent weak learner (an individual model that forms part of the final ensemble) corrects the errors of the previous ones.
Consider the general boosting algorithm.
Let’s denote:
\(n\): the number of observations in the training set.
\(K\): the number of weak learners in the model.
\(f_k(x_i), k = \overline{1, K}\): the \(k\)-th weak learner.
\(\hat{y}_i^{(k)}, k = \overline{1, K}\): the prediction for the \(i\)-th observation produced by the ensemble of the first \(k\) weak learners.
The final learner is:
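\[
\hat{y}_i^{(K)} = \sum_{k=1}^{K} f_k(x_i), \quad i = \overline{1, n}.
\]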
The best results are achieved by a special modification of boosting: gradient boosting. Its main feature is the way it looks for the subsequent weak learners \(f_k, k=\overline{1, K}\).
Let’s consider the \(k\)-th step of the fitting algorithm. The idea behind gradient boosting is to fit \(f_k(x_i)\) to predict the negative partial derivatives of some loss function \(L[y, \hat{y}^{(k-1)}]\) with respect to \(\hat{y}^{(k-1)}_i\). More formally, the values that the weak learner \(f_k\) is supposed to predict are:

\[
s_i = -\frac{\partial L\left[y, \hat{y}^{(k-1)}\right]}{\partial \hat{y}^{(k-1)}_i}, \quad i = \overline{1, n}
\]
Where:
\(y=\left(y_1, y_2, \ldots, y_n \right)\): the vector of final targets.
\(\hat{y}^{(k-1)} = \left(\hat{y}^{(k-1)}_1, \hat{y}^{(k-1)}_2, \ldots, \hat{y}^{(k-1)}_n\right)\): the vector of predictions produced by the ensemble of the first \(k-1\) weak learners.
In other words, gradient boosting selects an \(f_k\) whose predictions are close to the values \(s_i\), which correspond to the direction that minimises the selected loss.
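As an illustration, the following is a minimal from-scratch sketch of this procedure. It assumes the squared loss \(L[y, \hat{y}] = \frac{1}{2}\sum_i \left(y_i - \hat{y}_i\right)^2\), for which the negative partial derivatives \(s_i\) are simply the residuals \(y_i - \hat{y}_i\), and uses shallow scikit-learn decision trees as weak learners; the toy data, the `ensemble_predict` helper, and the values of \(K\) and `max_depth` are illustrative assumptions rather than part of the description above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data, used only to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

K = 20  # number of weak learners

# For the squared loss, s_i = -dL/d(y_hat_i) = y_i - y_hat_i,
# so each weak learner f_k is fitted to the current residuals.
prediction = np.zeros_like(y)  # \hat{y}^{(0)}
learners = []

for k in range(K):
    s = y - prediction                     # negative partial derivatives s_i
    f_k = DecisionTreeRegressor(max_depth=2)
    f_k.fit(X, s)                          # f_k approximates the s_i
    prediction += f_k.predict(X)           # \hat{y}^{(k)} = \hat{y}^{(k-1)} + f_k(x)
    learners.append(f_k)

# The final ensemble prediction is the sum of the weak learners' outputs.
def ensemble_predict(X_new):
    return sum(f.predict(X_new) for f in learners)

print("train MSE:", float(np.mean((y - ensemble_predict(X)) ** 2)))
```

Library implementations such as `sklearn.ensemble.GradientBoostingRegressor` follow the same scheme but additionally scale each learner's contribution by a learning rate (shrinkage) and support other loss functions.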