Number of layers#
By specifying the num_layers parameter of torch.nn.RNN, you define how many recurrent layers are stacked on top of each other: the first layer processes the input sequence, and each subsequent layer processes the sequence of hidden states produced by the layer below.
import torch
from torch.nn import RNN
This page uses the following common designations:
\(H_{out}\): hidden layer size.
\(L\): number of recurrent layers.
\(sl\): processed sequence length.
Idea description#
When the layer is defined with num_layers greater than one, the computation procedure changes. Consider the differences from the classical definition of the recurrent layer:
\(h_t^{l}\) is the hidden state of the \(t\)-th element of the sequence at the \(l\)-th layer. For the first layer (\(l=1\)) the previous layer's output is simply the input element: \(h_t^{l-1}=x_t\).
There is also a separate set of parameters for each layer, \(W_1^l, W_2^l\), \(l \in \overline{1,L}\), where \(L\) is the number of recurrent layers.
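Putting this together, a sketch of the per-layer update in the notation above, taking \(W_1^l\) as the input-to-hidden matrix and \(W_2^l\) as the hidden-to-hidden matrix (bias terms are omitted here; PyTorch's RNN additionally adds bias vectors to both transformations):
\[
h_t^{l} = \tanh\left(W_1^{l} h_t^{l-1} + W_2^{l} h_{t-1}^{l}\right), \qquad h_t^{0} = x_t, \quad l \in \overline{1,L}.
\]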
Output description#
The output of a stacked RNN layer in PyTorch, as with a standard single-layer RNN, is a tuple of two elements:
The first element, output, contains the hidden states from the last layer of the RNN at each time step \(t\). This tensor encapsulates the sequence of outputs after propagation through all RNN layers.
The second element, h_n, represents the final hidden state of each layer (and of each direction in the case of a bidirectional RNN), taken from the last time step of the input sequence. It serves as a summary of the processed input for each layer.
The following cell sends data through the stacked RNN layer.
sequence_len = 5
input_size = 3
hidden_size = 4

# Unbatched input: (sequence_len, input_size)
input = torch.randn(sequence_len, input_size)

# Stacked RNN with two recurrent layers
rnn = RNN(
    input_size=input_size,
    hidden_size=hidden_size,
    num_layers=2
)
out = rnn(input)
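For reference, the shapes of the two returned tensors can be inspected directly (using the out tuple just computed):
# Shapes of the output sequence and of the final hidden states
print(out[0].shape)
print(out[1].shape)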
Consider the elements of the output of the layer defined above.
The first element of the output has shape \((sl, H_{out})\).
out[0]
tensor([[ 0.5623, 0.1467, 0.1087, 0.6155],
[-0.1213, 0.6940, -0.0027, 0.5050],
[ 0.1429, 0.4335, 0.0797, 0.4975],
[-0.0405, 0.5954, 0.0013, 0.4789],
[ 0.2233, 0.4955, -0.2479, 0.7426]], grad_fn=<SqueezeBackward1>)
Each of these hidden states has passed through all the layers; there is one hidden state per element of the sequence.
The second element of the output has shape \((L, H_{out})\).
out[1]
tensor([[ 0.7996, -0.4562, 0.9697, 0.0581],
[ 0.2233, 0.4955, -0.2479, 0.7426]], grad_fn=<SqueezeBackward1>)
Each of these hidden states has passed through all the elements of the sequence; there is one hidden state per layer.
Note: the last vector of out[0] is the same as the last vector of out[1], as they are actually the same hidden state: the one produced by the last layer at the last step.
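This can be checked directly with the out tuple computed above; the comparison should print True:
# The last step of the output sequence equals the final hidden state of the last layer
print(torch.allclose(out[0][-1], out[1][-1]))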
Compare to a set of RNNs#
A stacked RNN with num_layers=2 behaves like two single-layer RNNs chained together: the hidden-state sequence of the first one is used as the input sequence of the second one. The following cell builds two single-layer RNNs that reuse the weights of the stacked layer defined above and reproduces its output.
# First single-layer RNN reuses the parameters of layer 0 of the stacked RNN
rnn1 = RNN(input_size=input_size, hidden_size=hidden_size)
rnn1.weight_ih_l0 = rnn.weight_ih_l0
rnn1.bias_ih_l0 = rnn.bias_ih_l0
rnn1.weight_hh_l0 = rnn.weight_hh_l0
rnn1.bias_hh_l0 = rnn.bias_hh_l0

# Second single-layer RNN reuses the parameters of layer 1;
# its input size is the hidden size of the previous layer
rnn2 = RNN(input_size=hidden_size, hidden_size=hidden_size)
rnn2.weight_ih_l0 = rnn.weight_ih_l1
rnn2.bias_ih_l0 = rnn.bias_ih_l1
rnn2.weight_hh_l0 = rnn.weight_hh_l1
rnn2.bias_hh_l0 = rnn.bias_hh_l1

# Feed the output sequence of the first RNN into the second one
print(rnn2(rnn1(input)[0])[0])
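The chained result should coincide with out[0] produced by the stacked layer; a quick check (relying on the rnn1, rnn2 and out objects defined above):
print(torch.allclose(rnn2(rnn1(input)[0])[0], out[0]))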
Self implementation#
The following function reproduces the computation of the stacked RNN step by step: for every element of the sequence it updates the hidden state of each layer, feeding the first layer with the input element and every deeper layer with the hidden state of the layer just below.
def forward(
    x: torch.Tensor,
    hidden_size: int,
    weight_ih: list[torch.Tensor],
    bias_ih: list[torch.Tensor],
    weight_hh: list[torch.Tensor],
    bias_hh: list[torch.Tensor],
    num_layers: int = 1
):
    # Unbatched input: x has shape (seq_len, input_size)
    seq_len = x.size(0)
    # Hidden states of every layer, initialised with zeros
    h_t_minus_1 = torch.zeros(num_layers, hidden_size)
    h_t = torch.zeros(num_layers, hidden_size)
    output = []
    for t in range(seq_len):
        for layer in range(num_layers):
            # The first layer consumes the input element, every deeper
            # layer consumes the hidden state of the layer just below it
            layer_input = x[t] if layer == 0 else h_t[layer - 1]
            h_t[layer] = torch.tanh(
                layer_input @ weight_ih[layer].T
                + bias_ih[layer]
                + h_t_minus_1[layer] @ weight_hh[layer].T
                + bias_hh[layer]
            )
        # The output collects the hidden state of the last layer at every step
        output.append(h_t[-1].clone())
        h_t_minus_1 = h_t.clone()
    output = torch.stack(output)
    return output, h_t
# Parameters of the stacked RNN collected per layer (num_layers=2, so only l0 and l1 exist)
weight_hh = [rnn.weight_hh_l0, rnn.weight_hh_l1]
weight_ih = [rnn.weight_ih_l0, rnn.weight_ih_l1]
bias_hh = [rnn.bias_hh_l0, rnn.bias_hh_l1]
bias_ih = [rnn.bias_ih_l0, rnn.bias_ih_l1]
my_output, my_h_n = forward(
    x=input,
    hidden_size=rnn.hidden_size,
    weight_ih=weight_ih,
    bias_ih=bias_ih,
    weight_hh=weight_hh,
    bias_hh=bias_hh,
    num_layers=rnn.num_layers
)
The function should return the same output sequence and the same final hidden states as the built-in rnn layer above, up to floating-point precision.
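A quick way to verify this (using the my_output and my_h_n tensors returned above together with the out tuple from the built-in layer):
# Both comparisons should print True
print(torch.allclose(my_output, out[0]))
print(torch.allclose(my_h_n, out[1]))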