Recurrent layers#

This page considers the realisation of the recurrent layers in torch. Find out more on the corresponding page of the torch documentation.

import torch
from torch.nn import RNN

Equivalent realisation#

On the RNN page of the PyTorch documentation, you can find a function that implements a transformation equivalent to torch.nn.RNN. To explore different parameters of the recurrent layer, it's convenient to have a function that we can modify to gain a better understanding. This section considers how a basic version of this function can be used.
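
For reference, the transformation applied at each time step \(t\) is \(h_t = \tanh(x_t W_{ih}^T + b_{ih} + h_{t-1} W_{hh}^T + b_{hh})\), where \(x_t\) is the input at step \(t\) and \(h_t\) is the resulting hidden state.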


The following cell shows a modification of this function that takes the components of the transformation (weights, biases, and layer configuration) as arguments, which makes it more convenient to experiment with.

def forward(
    x: torch.Tensor,
    hidden_size: int,
    weight_ih: list[torch.Tensor],
    bias_ih: list[torch.Tensor],
    weight_hh: list[torch.Tensor],
    bias_hh: list[torch.Tensor],
    h_0: torch.Tensor | None = None,
    num_layers: int = 1,
    batch_first: bool = False
):
    if batch_first:
        # move to the (seq_len, batch, features) layout
        x = x.transpose(0, 1)
    seq_len, batch_size, _ = x.size()
    if h_0 is None:
        h_0 = torch.zeros(num_layers, batch_size, hidden_size)
    h_t_minus_1 = h_0
    # clone so a caller-provided h_0 is not modified in place
    h_t = h_0.clone()
    output = []
    for t in range(seq_len):
        for layer in range(num_layers):
            # the first layer reads the input sequence; deeper layers read
            # the hidden state of the previous layer at the same time step
            layer_input = x[t] if layer == 0 else h_t[layer - 1]
            h_t[layer] = torch.tanh(
                layer_input @ weight_ih[layer].T
                + bias_ih[layer]
                + h_t_minus_1[layer] @ weight_hh[layer].T
                + bias_hh[layer]
            )
        # the output at step t is the hidden state of the last layer
        output.append(h_t[-1].clone())
        h_t_minus_1 = h_t
    output = torch.stack(output)
    if batch_first:
        output = output.transpose(0, 1)
    return output, h_t

The following code creates an RNN layer and a typical input for it:

sequence_len = 4
batch_size = 5

rnn = RNN(input_size=2, hidden_size=3)
x = torch.randn(sequence_len, batch_size, rnn.input_size)

To use the custom forward, you must pass the weights as lists (torch.nn.RNN may contain more than one layer) along with all the other parameters of the layer.

function_out = forward(
    x=x,
    hidden_size=rnn.hidden_size,
    weight_ih=[rnn.weight_ih_l0],
    bias_ih=[rnn.bias_ih_l0],
    weight_hh=[rnn.weight_hh_l0],
    bias_hh=[rnn.bias_hh_l0]
)

layer_out = rnn(x)

The following cell checks that the results of the custom realisation and torch.nn.RNN are the same.

torch.testing.assert_close(
    actual=function_out[0],
    expected=layer_out[0]
)
torch.testing.assert_close(
    actual=function_out[1],
    expected=layer_out[1]
)
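
Since each deeper layer in the function above takes the hidden state of the layer below as its input, the same check can be repeated for a stacked RNN. Here is a minimal sketch that collects the per-layer parameters with getattr; the names rnn_2l and num_layers are illustrative:

num_layers = 2
rnn_2l = RNN(input_size=2, hidden_size=3, num_layers=num_layers)

function_out = forward(
    x=x,
    hidden_size=rnn_2l.hidden_size,
    weight_ih=[getattr(rnn_2l, f"weight_ih_l{i}") for i in range(num_layers)],
    bias_ih=[getattr(rnn_2l, f"bias_ih_l{i}") for i in range(num_layers)],
    weight_hh=[getattr(rnn_2l, f"weight_hh_l{i}") for i in range(num_layers)],
    bias_hh=[getattr(rnn_2l, f"bias_hh_l{i}") for i in range(num_layers)],
    num_layers=num_layers
)
layer_out = rnn_2l(x)

torch.testing.assert_close(actual=function_out[0], expected=layer_out[0])
torch.testing.assert_close(actual=function_out[1], expected=layer_out[1])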

Forward input#

The forward of torch.nn.RNN takes two inputs: the sequences to be processed and the initial hidden states. Another feature of the forward pass is that it can handle both batched and unbatched input. Find out more on the corresponding page.


Consider examples with parameters defined in the following cell:

input_size = 2
hidden_size = 3
sequence_len = 10

# recreate the layer so the example is self-contained
rnn = RNN(input_size=input_size, hidden_size=hidden_size)

Unbatched input assumes a sequence of vectors as input and a hidden_size-dimensional vector for each recurrent layer (in this case, just one layer).

input = torch.randn(sequence_len, input_size)
hidden = torch.randn(1, hidden_size)
output, hidden = rnn(input, hidden)
output.shape, hidden.shape
(torch.Size([10, 3]), torch.Size([1, 3]))

Batched input is essentially the same, but with an additional samples dimension, which by default is the second one:

samples_number = 5
input = torch.randn(sequence_len, samples_number, input_size)
hidden = torch.randn(1, samples_number, hidden_size)
output, hidden = rnn(input, hidden)
output.shape, hidden.shape
(torch.Size([10, 5, 3]), torch.Size([1, 5, 3]))

Batch dimension#

By default, torch.nn.RNN works on tensors with dimensionality \((L, N, H_{in})\), which can be thought of as a sequence of batches. The batch_first parameter makes the torch.nn.RNN layer work with dimensionality \((N, L, H_{in})\) instead, so the input can be thought of as a batch of sequences, which is more convenient in most cases.

Here:

  • \(L\): length of the sequence.

  • \(N\): batch size.

  • \(H_{in}\): dimensionality of an element of the sequence.


Consider the difference using the tensor generated in the next cell:

X = torch.empty(5, 7, 3)

By default, we get the last state for each of the 7 items in the batch.

rnn = RNN(input_size=3, hidden_size=10)
rnn(X)[1].shape
torch.Size([1, 7, 10])

But the same code, with the only difference being the batch_first=True argument, results in the last state for each of the 5 items in the batch.

rnn = RNN(input_size=3, hidden_size=10, batch_first=True)
rnn(X)[1].shape
torch.Size([1, 5, 10])
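
Note that the per-step output (the first element of the returned tuple) mirrors the input layout, so for this particular tensor its shape looks the same in both modes; it is the hidden state shape that reveals which dimension was treated as the batch. A quick sketch (the names rnn_seq_first and rnn_batch_first are illustrative):

rnn_seq_first = RNN(input_size=3, hidden_size=10)
rnn_batch_first = RNN(input_size=3, hidden_size=10, batch_first=True)
rnn_seq_first(X)[0].shape, rnn_batch_first(X)[0].shape
(torch.Size([5, 7, 10]), torch.Size([5, 7, 10]))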

Number of layers#

By specifying the num_layers parameter in torch.nn.RNN, you can define how many times the recurrent layer will be applied to the input data. This means the output of one recurrent layer will serve as the input to the next, sequentially. Find more details on the dedicated page.
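
The number of layers is also reflected in the returned hidden state: the output contains only the last layer's states, while the final hidden state holds one state per layer. A minimal sketch (the names and tensor sizes here are illustrative):

stacked_rnn = RNN(input_size=2, hidden_size=3, num_layers=3)
stacked_x = torch.randn(4, 5, 2)
output, h_n = stacked_rnn(stacked_x)
output.shape, h_n.shape
(torch.Size([4, 5, 3]), torch.Size([3, 5, 3]))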


Each recurrent layer has its own weights; the following cell shows the set of parameters of a torch.nn.RNN that has num_layers=3:

rnn = RNN(input_size=2, hidden_size=3, num_layers=3)

for name, parameters in rnn.named_parameters():
    print("="*80)
    print(name, parameters)
================================================================================
weight_ih_l0 Parameter containing:
tensor([[-0.1217, -0.2320],
        [-0.4093,  0.0111],
        [ 0.0011, -0.1270]], requires_grad=True)
================================================================================
weight_hh_l0 Parameter containing:
tensor([[-0.4822,  0.2226,  0.2734],
        [-0.2779, -0.3565,  0.0812],
        [ 0.2085,  0.4712,  0.3837]], requires_grad=True)
================================================================================
bias_ih_l0 Parameter containing:
tensor([ 0.3848, -0.3489, -0.3642], requires_grad=True)
================================================================================
bias_hh_l0 Parameter containing:
tensor([-0.1991,  0.4458,  0.1032], requires_grad=True)
================================================================================
weight_ih_l1 Parameter containing:
tensor([[-0.2957, -0.3953, -0.1050],
        [ 0.5394,  0.3613, -0.2077],
        [ 0.0979, -0.1050, -0.3406]], requires_grad=True)
================================================================================
weight_hh_l1 Parameter containing:
tensor([[-0.2834, -0.3222,  0.3811],
        [ 0.4013,  0.3117,  0.3779],
        [-0.1798, -0.5078,  0.2733]], requires_grad=True)
================================================================================
bias_ih_l1 Parameter containing:
tensor([ 0.1544, -0.5526,  0.1558], requires_grad=True)
================================================================================
bias_hh_l1 Parameter containing:
tensor([0.4194, 0.0069, 0.3011], requires_grad=True)
================================================================================
weight_ih_l2 Parameter containing:
tensor([[ 0.0569, -0.1900, -0.0903],
        [-0.2319,  0.1846, -0.0532],
        [-0.5023, -0.0299,  0.1895]], requires_grad=True)
================================================================================
weight_hh_l2 Parameter containing:
tensor([[ 0.0914, -0.4255, -0.3047],
        [ 0.5072,  0.4777,  0.3464],
        [-0.5395,  0.4865, -0.0562]], requires_grad=True)
================================================================================
bias_ih_l2 Parameter containing:
tensor([-0.0107,  0.2063, -0.0353], requires_grad=True)
================================================================================
bias_hh_l2 Parameter containing:
tensor([ 0.4801,  0.2667, -0.4451], requires_grad=True)

Each parameter name has a suffix like l1, indicating the layer to which those weights belong.

Note: weight_ih_l0 has dimensionality hidden_size \(\times\) input_size, whereas all subsequent weights, such as weight_ih_l1 and weight_ih_l2, have dimensionalities of hidden_size \(\times\) hidden_size. This is because all layers except the first take the hidden states of the previous layer as their input sequence, and those sequences have elements of size hidden_size.
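
A quick check of the shapes for the rnn created above confirms this:

rnn.weight_ih_l0.shape, rnn.weight_ih_l1.shape
(torch.Size([3, 2]), torch.Size([3, 3]))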