Gradient descent

Gradient descent#

Gradient descent is a crucial method in machine learning. More generally, it’s a way of optimising: finding the argument values at which a function reaches its minimum (or maximum). Here I’ll implement a basic variation of gradient descent using only numpy, and I’ll explain each step to make everything clear.
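
As a warm-up, here is a minimal sketch of the same idea on a one-variable toy function, \(f(x)=x^2\) (the function, the starting point and the step size are my own arbitrary choices, not part of the task below): we repeatedly step against the derivative until we land near the minimum.

# toy example: minimise f(x) = x**2, whose derivative is f'(x) = 2*x
x = 5.0       # arbitrary starting point
step = 0.1    # step size (the "learning rate")
for _ in range(50):
    x = x - step*2*x  # move against the derivative
x  # ends up very close to 0, the minimiser of f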

import numpy as np
import plotly as pltly
import plotly.graph_objects as go
from plotly.subplots import make_subplots

Task#

Suppose we have a classic linear regression task: an array of points \((x_i, y_i)\), like the ones generated and visualised in the cell below:

sample_size = 200

real_alpha = 3; real_beta = 2;

X = np.random.uniform(-3 , 3, sample_size)
y = X*real_alpha + real_beta + np.random.normal(0, 3, sample_size)


def pltly_plot_line(alpha, beta, **kwargs):
    line_X = np.linspace(-3,3, 10)
    line_y = alpha*line_X + beta
    
    return go.Scatter(
        x=line_X, y=line_y, 
        mode='lines', **kwargs
    )

def get_my_scatter(X, y):
    return go.Scatter(
        x=X, y=y, mode='markers', 
        marker_size=10, name="Observations",
        marker_color = "blue"
    )

go.Figure(
    data = [
        get_my_scatter(X, y),
        pltly_plot_line(
            real_alpha, real_beta, name="Real line",
            line=dict(width=5)
        )
    ],
    layout={
        "xaxis_title" : "X",
        "yaxis_title" : "y",
        "width" : 1000,
        "height" : 500,
        "dragmode" : False
    }
)

We need to find a function \(y=x\alpha + \beta\) whose line is as close as possible to the array of points. More precisely, we need to find the \(\alpha\) and \(\beta\) that correspond to the closest line.

We will use \(MSE\) as a loss function:

\[L(\alpha, \beta)=\frac{\sum_{i=1}^n \left[(x_i\alpha + \beta) - y_i\right]^2}{n}\]

But the optimisation result will be the same with \(SSE\) (Sum of Squared Errors):

\[L(\alpha, \beta)=\sum_{i=1}^n \left[(x_i\alpha + \beta) - y_i\right]^2\]

So we will use \(SSE\) because it requires fewer operations.
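
Indeed, \(MSE = SSE/n\), and dividing a function by a positive constant does not move its minimiser, so both losses lead to the same \((\alpha, \beta)\). A quick numeric sanity check (the candidate line below is an arbitrary choice of mine):

preds = 2.5*X + 1.0  # an arbitrary candidate line
sse = ((y - preds)**2).sum()
mse = ((y - preds)**2).mean()
np.isclose(sse/sample_size, mse)  # True: SSE is just n times MSE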

The following code displays \(SSE\) values for different values of \(\alpha\) and \(\beta\). You can choose any combination of \((\alpha, \beta)\). The points on the \(L\) plot represent the selected \((\alpha, \beta)\) combinations, and on the scatter plot the lines correspond by colour to those combinations. As you can see, the best line is the green one, and it corresponds to the minimal value of \(L\).

def sse_fun(y_true, y_pred):
    return ((y_true - y_pred)**2).sum()

def pltly_get_sse_surface(alphas, betas):
    
    grid = np.array([
        [sse_fun(y, alpha*X + beta) for beta in betas] 
        for alpha in alphas
    ]).T

    return go.Surface(
        z=grid, x=alphas, y=betas,
        hovertemplate="alpha:%{x}<br>beta:%{y}<br>Loss:%{z}<extra></extra>",
        showscale=False
    )

loss_scene_layout = dict(
    title='Loss', autosize=False,
    width=600, height=800,
    margin=dict(l=65, r=50, b=65, t=90),
    scene=dict(
        xaxis_title='alpha',
        yaxis_title='beta',
        zaxis_title='L'
    )
)
def pltly_get_sse_fig(data):
    fig = go.Figure(data=data)
    fig.update_layout(**loss_scene_layout)
    return fig


fig = make_subplots(
    rows=1, cols=2,
    specs=[[{"type": "scene"}, {"type": "xy"}]]
)

alphas_grid = np.linspace(real_alpha-3, real_alpha+3, 100)
betas_grid = np.linspace(real_beta-3, real_beta+3, 100)

fig.add_trace(
    pltly_get_sse_surface(alphas_grid, betas_grid), 
    row=1, col=1
)
fig.add_trace(get_my_scatter(X, y), row=1, col=2)
loss_scatter_plot_layout = dict(
    width=1000, height=500,
    scene=dict(
        xaxis_title='alpha',
        yaxis_title='beta',
        zaxis_title='L'
    ),
    xaxis1_title = "x",
    yaxis1_title = "y",
    dragmode = False
)
fig.update_layout(**loss_scatter_plot_layout)

# visualisation of points and lines corresponding to them
points_to_visualise = [
    (real_alpha, real_beta, "green"),
    (real_alpha-2, real_beta+1, "red"),
    (real_alpha+2, real_beta-2, "orange")
]

for i, (alpha, beta, col) in enumerate(points_to_visualise):
    loss_val = sse_fun(alpha*X + beta, y)
    fig.add_trace(
        go.Scatter3d(
            x=[alpha], y=[beta], 
            z=[loss_val + loss_val/10], 
            marker_color=col,
            showlegend=False
        ),
        row=1, col=1
    )
    fig.add_trace(
        pltly_plot_line(
            alpha, beta, 
            line=dict(color=col, width=5),
            name=f"line{i+1}",
        ),
        row=1, col=2
    )
fig.show()

Gradient#

The idea is really simple - we pick a starting point and then move in the direction in which the function decreases fastest. Let’s say we start by picking a point \((\alpha_0, \beta_0)\), then move to a point \((\alpha_1, \beta_1)\) and so on. All we need at the current point \((\alpha_i, \beta_i)\) is to determine the direction of the fastest decrease of the \(L\) function - the antigradient of the function:

\[-\nabla L(\alpha, \beta)= -(\frac{\partial L}{\partial \alpha}, \frac{\partial L}{\partial \beta})\]

So we need some math with partial derivatives:

\[\frac{\partial L}{\partial \alpha}(\alpha, \beta)=2\sum_{i=1}^n (\alpha x_i+\beta - y_i)x_i;\]
\[\frac{\partial L}{\partial \beta}(\alpha, \beta)=2\sum_{i=1}^n (\alpha x_i +\beta - y_i).\]
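
Both expressions follow from applying the chain rule to each squared term, for example for \(\alpha\):

\[\frac{\partial}{\partial \alpha}\left[(x_i\alpha + \beta) - y_i\right]^2 = 2\left[(x_i\alpha + \beta) - y_i\right]x_i.\]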

So at each \((\alpha_i,\beta_i)\) we can substitute the values into \(\nabla L(\alpha, \beta)\) and get the direction in which we need to move. A good way to check that you have done everything right is to visualise the gradients at different points. The following cell defines a function that calculates the gradient at a point and uses it to visualise the gradient directions. It would be nice to make the cones tangential to the error surface, but plotly has issues with cone orientation when the axis scales differ a lot, so I didn’t get that right. Still, you can see that the gradient implementation works well and all the cones point in the direction of minimisation.

def sse_gradient(alpha, beta, X, y):
    # alpha and beta are 1d arrays of candidate coefficients;
    # the gradient is computed for every (alpha, beta) pair at once
    n = len(X)
    
    expand_X = X[np.newaxis, :]
    expand_y = y[np.newaxis, :]
    expand_alpha = alpha[:, np.newaxis]
    expand_beta = beta[:, np.newaxis]

    # note: the sums are divided by n (averaged over the observations);
    # this only rescales the gradient and is absorbed by the learning rate
    alpha_grad = 2*(
        (np.dot(expand_alpha, expand_X) + expand_beta - expand_y)*expand_X
    ).sum(axis=1)/n
    beta_grad = 2*(expand_alpha*expand_X + expand_beta - expand_y).sum(axis=1)/n
    
    return (alpha_grad, beta_grad)

cones_a = np.linspace(real_alpha - 2, real_alpha + 2, 10)
cones_b = np.linspace(real_beta - 2, real_beta + 2, 10)
cones_a, cones_b = np.meshgrid(cones_a, cones_b)
cones_a = cones_a.ravel(); cones_b = cones_b.ravel()
cones_preds = np.dot(
    cones_a[:, np.newaxis], X[np.newaxis, :]
) + cones_b[:, np.newaxis]
cones_sse = ((cones_preds - y[np.newaxis, :])**2).sum(axis=1)

grad_a, grad_b = sse_gradient(cones_a, cones_b, X, y)
grad_a = -grad_a; grad_b = -grad_b  # cones point along the antigradient
grad_L = np.zeros(grad_a.size)

pltly_get_sse_fig(
    data = [
        pltly_get_sse_surface(alphas_grid, betas_grid),
        go.Cone(
            x=cones_a, y=cones_b, z=cones_sse, 
            u=grad_a, v=grad_b, w=grad_L,
            showscale = False,
            colorscale=[
                'rgba(255, 0, 0, 0.2)', 
                'rgba(255, 0, 0, 0.2)'
            ]
        )
    ]
)
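
Besides the visual check, a quick numerical sanity check is to compare sse_gradient with a finite-difference approximation of the loss it actually differentiates (since the implementation divides by n, that loss is the mean of the squared errors; the test point and eps below are arbitrary choices of mine):

def mean_loss(a, b):
    # the loss that sse_gradient differentiates: the sum divided by n
    return ((a*X + b - y)**2).mean()

a0, b0, eps = 1.5, 0.5, 1e-6
num_grad_a = (mean_loss(a0 + eps, b0) - mean_loss(a0 - eps, b0))/(2*eps)
num_grad_b = (mean_loss(a0, b0 + eps) - mean_loss(a0, b0 - eps))/(2*eps)
an_grad_a, an_grad_b = sse_gradient(np.array([a0]), np.array([b0]), X, y)
print(np.isclose(num_grad_a, an_grad_a[0]), np.isclose(num_grad_b, an_grad_b[0]))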

Algorithm#

Now, finally, let’s implement the optimisation algorithm. We need to:

  1. Select the point \((\alpha_0, \beta_0)\) (set \(i=0\));

  2. Make an optimisation step: \((\alpha_{i+1}, \beta_{i+1}) = (\alpha_{i}, \beta_{i}) - lr\cdot\nabla L(\alpha_i, \beta_i)\);

  3. Check the stop criterion (a sketch of one possible criterion is shown after this list):

    • If it is satisfied, stop the optimisation and take \((\alpha_{i+1}, \beta_{i+1})\) as the result;

    • If it is not, set \(i:=i+1\) and return to step 2.

Here \(lr\) is the learning rate, a parameter that determines the size of each optimisation step.
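
The implementation below simply runs a fixed number of steps, but a common stop criterion is to halt once the gradient (and therefore the step) becomes very small. A minimal sketch, with a tolerance tol of my own choosing:

def should_stop(grad_a, grad_b, tol=1e-4):
    # stop once the gradient norm is tiny, i.e. we are near a stationary point
    return np.sqrt(grad_a**2 + grad_b**2) < tol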

Just look how easy it is to get rid of unnecessary things - the following cell shows a really simple implementation. Of course, solutions in libraries like sklearn are universal, much more optimised and have many extra features - that requires a lot of code. But the idea can be shown in just 10 lines of code.

The result is the sequence of coefficients obtained during the optimisation.

lr = 10e-2

# here is defining of initial point
coefs = [(real_alpha+2.5, real_beta-2.5)]

# we don't use any complex stop criteria
# just 10 steps
for i in range(10):
    grad_a, grad_b = sse_gradient(
        np.array([coefs[-1][0]]), 
        np.array([coefs[-1][1]]), X, y
    )

    coefs.append((
        coefs[-1][0] - lr*grad_a[0], 
        coefs[-1][1] - lr*grad_b[0]
    ))

coefs
[(5.5, -0.5),
 (3.9349832493396333, -0.057139502353940996),
 (3.4195523053914667, 0.2959285356499043),
 (3.2499591159509156, 0.5779810461599597),
 (3.194286908040883, 0.8034908101279732),
 (3.1761150891780643, 0.9838552095357639),
 (3.1702669194859174, 1.1281325591415075),
 (3.1684520655189967, 1.2435498785726589),
 (3.1679439508768503, 1.3358823189407831),
 (3.1678484920582104, 1.4097478750204548),
 (3.167874630554446, 1.4688402454478422)]
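
To double-check where the descent is heading, we can compare the last coefficients with the ordinary least-squares solution computed directly, here via np.polyfit (this comparison is just my own sanity check, not part of the algorithm):

ols_alpha, ols_beta = np.polyfit(X, y, deg=1)
print(ols_alpha, ols_beta)  # the coefficients the descent converges towards
print(coefs[-1])            # after only 10 steps, beta is still catching up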

Let’s visualise the results of the previous cell. On the left plot you can see how the optimisation looks in terms of the loss function, while on the right you can see how it works in terms of the resulting regression line.

fig = make_subplots(
    rows=1, cols=2,
    specs=[[{"type": "scene"}, {"type": "xy"}]]
)

colorscale = pltly.colors.get_colorscale('RdYlGn')
sample_colorscale = pltly.colors.sample_colorscale(
    colorscale, np.linspace(0, 1, len(coefs))
)

path_data = {
    "x" : [], "y" : [], "z" : [], 
    "marker_color" : []
}
lines = []
for (alpha, beta), col in zip(coefs, sample_colorscale):
    
    sse_val = sse_fun(X*alpha + beta, y)
    path_data["x"].append(alpha)
    path_data["y"].append(beta)
    path_data["z"].append(sse_val + sse_val/30)
    path_data["marker_color"].append(col)

    lines.append(pltly_plot_line(
        alpha, beta,
        line=dict(color=col),
        showlegend=False
    ))

fig.add_trace(
    pltly_get_sse_surface(alphas_grid, betas_grid),
    row=1, col=1
)
fig.add_trace(
    go.Scatter3d(
        **path_data,
        line=dict(width=4, color="yellow"),
        marker=dict(size=4),
        showlegend=False
    ),
    row=1, col=1
)

fig.add_trace(
    get_my_scatter(X, y), row=1, col=2
)
for l in lines:
    fig.add_trace(l, row=1, col=2)

fig.update_layout(**loss_scatter_plot_layout)