Recurrent#
This page explains the concept of a recurrent layer.
The key idea is to create a mechanism where each input affects the processing and outcome of subsequent inputs.
An RNN is essentially a single layer that is applied to the sequence step by step. At each step \(t\), it uses \(h_{t-1}\), a state vector produced at the previous step.
Classic#
The computation behind the classical RNN can be written using the following formula:

\[h_t = f\left(x_t W_1^T + b_1 + h_{t-1} W_2^T + b_2\right)\]
Where:
\(x_t\): input at the \(t\)-th step.
\(h_t\): vector that describes hidden state at the \(t\)-th step.
\(W_1\): weights associated with the input.
\(W_2\): weights associated with the state.
\(b_1\): bias associated with the input.
\(b_2\): bias associated with the state.
\(f\): activation function, typically a hyperbolic tangent.
A really popular way to visualize the RNN transformation is the following:
It is assumed that a linear transformation \(O\) can be applied to the hidden state \(h_t\) to bring it to the expected dimensionality if this element is to be used in the subsequent steps.
It is important to understand this approach to visualization because it is widely used to describe modifications of the RNN.
Realization in Python#
In this section, we will implement, step by step, the computations performed by a recurrent layer and compare them with torch.nn.RNN as the reference.
The following cell defines the parameters of the recurrent procedure that we will use as an example.
We have:
A sequence of \(10\) elements: \(\left\{ x_1, x_2, \dots, x_{10} \right\}\).
Each element is a vector of \(5\) elements: \(x_t \in \mathbb{R}^5\).
We are working with a sample (batch) containing \(15\) sequences.
The state vector is a \(3\)-element vector: \(h_t \in \mathbb{R}^3\).
The activation function \(f(x)\) is the hyperbolic tangent.
import torch

samples_size = 15    # number of sequences in the batch
element_size = 5     # size of each element of a sequence
sequence_size = 10   # number of elements in each sequence
state_size = 3       # size of the hidden state vector
activation = torch.nn.Tanh()

input_data = torch.rand(samples_size, sequence_size, element_size)
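As a quick sanity check (an addition, not part of the original cell), the shape of the generated batch confirms the layout described above: \(15\) samples, each a sequence of \(10\) elements, each element a vector of size \(5\).

input_data.shape

torch.Size([15, 10, 5])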
For the given inputs:
Input weights \(W_1\) should be a \(3 \times 5\) matrix.
Input bias \(b_1\) should be a vector with \(3\) elements.
State weights \(W_2\) should be a \(3 \times 3\) matrix.
State bias \(b_2\) should be a vector with \(3\) elements.
W_1 = torch.rand(state_size, element_size)
b_1 = torch.rand(state_size)
W_2 = torch.rand(state_size, state_size)
b_2 = torch.rand(state_size)
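As an extra check (not in the original notebook), we can assert that the sampled parameters have exactly the shapes listed above:

assert W_1.shape == (state_size, element_size)  # 3 x 5
assert b_1.shape == (state_size,)               # vector of 3 elements
assert W_2.shape == (state_size, state_size)    # 3 x 3
assert b_2.shape == (state_size,)               # vector of 3 elements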
The realization of \(x_1 W^T_1 + b_1 + h_{0} W^T_2 + b_2\), where the initial state is initialized as a zero vector, \(h_{0} = \mathbf{0}\), for every sample, takes the following form:
state = torch.zeros(samples_size, state_size)
(input_data[:, 0, :] @ W_1.T) + b_1 + (state @ W_2.T) + b_2
tensor([[1.0187, 0.5179, 2.1283],
[2.3602, 1.7332, 3.0126],
[1.7238, 1.3923, 2.8326],
[1.6351, 1.1317, 2.6059],
[1.5822, 1.4294, 3.1752],
[1.5521, 0.8518, 2.3209],
[2.0734, 1.5617, 2.4171],
[2.1795, 1.8786, 3.2781],
[1.9290, 1.4330, 3.1887],
[1.8527, 1.4238, 3.0310],
[1.9417, 1.5560, 2.9881],
[1.5617, 0.9667, 2.4146],
[2.4599, 2.0704, 3.4121],
[2.0325, 1.6429, 3.2050],
[1.4736, 0.9799, 2.1185]])
As a result, we get \(h_1\) (up to the activation function \(f\), which is applied in the full loop below) for each of the \(15\) samples.
The implementation of the full recurrent procedure for all \(10\) elements of the sequence is provided in the following cell:
states = [state]
for i in range(input_data.shape[1]):
    res = activation(
        (input_data[:, i, :] @ W_1.T) + b_1
        + (states[-1] @ W_2.T) + b_2
    )
    states.append(res)
As a result, we have \(11\) states (including the initial one). Each state is a matrix with \(15\) rows, one row per observation.
len(states)
11
states[3]
tensor([[0.9638, 0.9522, 0.9989],
[0.9718, 0.9704, 0.9994],
[0.9982, 0.9973, 0.9999],
[0.9974, 0.9940, 0.9999],
[0.9912, 0.9944, 0.9998],
[0.9909, 0.9858, 0.9996],
[0.9976, 0.9981, 0.9999],
[0.9956, 0.9934, 0.9998],
[0.9956, 0.9870, 0.9998],
[0.9964, 0.9967, 0.9999],
[0.9839, 0.9770, 0.9994],
[0.9890, 0.9855, 0.9996],
[0.9906, 0.9791, 0.9996],
[0.9985, 0.9980, 0.9999],
[0.9965, 0.9937, 0.9997]])
Then we take all the first rows as the states of the first sample, all the second rows as the states of the second sample, and so on. This can be realized by stacking along the \(-2\) dimension. Torch also returns the last states of all samples separately, as a second output.
my_ans = (
    torch.stack(states[1:], dim=-2),
    states[-1][None]
)
Now, we will do the same using the ready-made torch.nn.RNN
class. Before computing, we need to set the weights of the instance to match those used in the custom procedure:
rnn = torch.nn.RNN(element_size, state_size, batch_first=True)
with torch.no_grad():
    # Copying the parameters that were used earlier:
    rnn.weight_ih_l0.copy_(W_1)
    rnn.bias_ih_l0.copy_(b_1)
    rnn.weight_hh_l0.copy_(W_2)
    rnn.bias_hh_l0.copy_(b_2)
torch_ans = rnn(input_data)
The following cell verifies that both outputs are identical.
torch.testing.assert_close(torch_ans[0], my_ans[0])
torch.testing.assert_close(torch_ans[1], my_ans[1])
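For reference (an extra check, not part of the original comparison), the shapes of the two outputs show what they represent: the first contains the hidden state for every step of every sample, the second keeps only the last hidden state of each sample.

torch_ans[0].shape, torch_ans[1].shape

(torch.Size([15, 10, 3]), torch.Size([1, 15, 3]))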
LSTM#
Long Short-Term Memory (LSTM) is an RNN modification that addresses the classic RNN's problem of “forgetting” information associated with early sequence elements. LSTM introduces a long-term memory vector \(C_t\), which is selectively updated with each new element of the input sequence.
To update the information stored in the long-term memory, mechanisms known as “gates” are introduced:
Forget gate: clears useless information from \(C_t\).
Input gate: defines which new information must be added to \(C_t\).
There is also an output gate that prepares the information for the subsequent steps, much like the classic RNN transformation does.
Forget gate#
The forget gate is calculated as:

\[f_t = \sigma\left(x_t (W^f_1)^T + h_{t-1} (W^f_2)^T + b^f\right)\]
Here the parameters \(W^f_1\), \(W^f_2\), and \(b^f\) are learnable.
The idea is very similar to the classic RNN. However, it is crucial to use the sigmoid function as the activation here: since \(f_t \in [0,1]\), some elements can become very close to zero. The gate is multiplied element-wise by the long-term memory vector from the previous step, \(C_{t-1}\), so the elements that the gate “wants to delete” are removed from the memory.
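A minimal sketch of a single forget-gate computation, assuming the same \(5\)-dimensional elements and \(3\)-dimensional state as in the RNN example above (all names here are hypothetical, chosen only for illustration):

import torch

element_size, state_size = 5, 3

# Hypothetical learnable parameters of the forget gate
W_f_1 = torch.rand(state_size, element_size)
W_f_2 = torch.rand(state_size, state_size)
b_f = torch.rand(state_size)

x_t = torch.rand(element_size)   # current element of the sequence
h_prev = torch.rand(state_size)  # hidden state from the previous step
C_prev = torch.rand(state_size)  # long-term memory from the previous step

# The sigmoid keeps the gate values in [0, 1]
f_t = torch.sigmoid(x_t @ W_f_1.T + h_prev @ W_f_2.T + b_f)

# Memory elements that the gate "wants to delete" are scaled towards zero
C_prev * f_t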
Input gate#
The input gate consists of two elements:

\[i_t = \sigma\left(x_t (W^i_1)^T + h_{t-1} (W^i_2)^T + b^i\right)\]

\[\tilde{c}_t = \tanh\left(x_t (W^c_1)^T + h_{t-1} (W^c_2)^T + b^c\right)\]
Here \(W^i_1\), \(W^i_2\), \(b^i\), \(W^c_1\), \(W^c_2\), and \(b^c\) are learnable parameters.
The output of the gate is:

\[I_t = i_t \odot \tilde{c}_t\]
The intuition behind this is the following:
The vector \(i_t\) is produced by a sigmoid transformation, so its values lie in \([0, 1]\). This allows it to suppress “insignificant” information in the vector with which it is element-wise multiplied.
The vector \(\tilde{c}_t\) is produced by a hyperbolic tangent transformation, taking values from \(-1\) to \(1\). It encodes all the information extracted from the current element of the processed sequence.
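Continuing the sketch started in the forget-gate section (reusing the hypothetical x_t and h_prev), the input gate can be computed as follows:

# Hypothetical learnable parameters of the input gate
W_i_1 = torch.rand(state_size, element_size)
W_i_2 = torch.rand(state_size, state_size)
b_i = torch.rand(state_size)
W_c_1 = torch.rand(state_size, element_size)
W_c_2 = torch.rand(state_size, state_size)
b_c = torch.rand(state_size)

i_t = torch.sigmoid(x_t @ W_i_1.T + h_prev @ W_i_2.T + b_i)   # values in [0, 1]
c_tilde = torch.tanh(x_t @ W_c_1.T + h_prev @ W_c_2.T + b_c)  # values in [-1, 1]

# Output of the input gate: candidate information filtered by i_t
I_t = i_t * c_tilde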
Memory update#
According to the previous steps, the long-term memory vector is updated using the following formula:

\[C_t = C_{t-1} \odot f_t + I_t\]
The component \(C_{t-1} \odot f_t\) removes the useless information from the long-term memory, while \(I_t\) adds the important information extracted at this step.
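With f_t and I_t from the sketches above, the memory update itself is a single line:

# Forget part of the old memory, then add the freshly extracted information
C_t = C_prev * f_t + I_t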
Output gate#
It produces the output of the model, whether it is used by the next LSTM block or by something else. Mathematically, this can be written as follows:

\[o_t = \sigma\left(x_t (W^o_1)^T + h_{t-1} (W^o_2)^T + b^o\right)\]

\[h_t = o_t \odot \tanh(C_t)\]
Here the parameters \(W^o_1\), \(W^o_2\), and \(b^o\) are learnable.
Using the same logic as the other gates, this stage determines exactly which information from \(C_{t}\) must be included in the output of the \(t\)-th step.
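To complete the sketch, the output gate combines the current input, the previous hidden state, and the updated memory \(C_t\) into the new hidden state (parameter names are, again, hypothetical):

# Hypothetical learnable parameters of the output gate
W_o_1 = torch.rand(state_size, element_size)
W_o_2 = torch.rand(state_size, state_size)
b_o = torch.rand(state_size)

o_t = torch.sigmoid(x_t @ W_o_1.T + h_prev @ W_o_2.T + b_o)  # what to expose, in [0, 1]
h_t = o_t * torch.tanh(C_t)  # hidden state / output of the t-th step

In practice these computations are packaged by torch.nn.LSTM, which could be compared with a manual implementation in the same way torch.nn.RNN was above. Note, however, that it stores the parameters of all four gates stacked together (for example, weight_ih_l0 has shape (4 * hidden_size, input_size)) and keeps two bias vectors, so matching the weights requires a bit more bookkeeping.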