Linear#

The torch.nn.Linear layer performs the following operation:

\[X_{n \times l} \cdot \left(\omega_{k \times l}\right)^T + b_k\]

Where:

  • \(l\): number of inputs

  • \(k\): number of outputs

  • \(n\): number of input samples

  • \(X_{n \times l}\): input tensor

  • \(\omega_{k \times l}\): weight matrix of the layer

  • \(b_k\): bias vector of the layer

import torch
from torch.nn import Linear
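As a quick sanity check of the formula above, the following sketch (the shapes are arbitrary, chosen only for illustration) compares the layer’s output with \(X \omega^T + b\) computed by hand:

layer = Linear(in_features=3, out_features=4)
X = torch.randn(5, 3)  # n = 5 samples, l = 3 inputs

# the layer output should match the explicit matrix expression
manual = X @ layer.weight.T + layer.bias
print(torch.allclose(layer(X), manual))  # expected: True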

Access to parameters#

This layer exposes its parameters through the weight and bias attributes.

Here’s an example that reads those attributes and overwrites them with custom values:

linear_layer = Linear(in_features=3, out_features=4)

default_weights = torch.ones_like(linear_layer.weight)
default_biases = torch.zeros_like(linear_layer.bias)

# overwrite the parameters in place without recording
# the copy operations in the autograd graph
with torch.no_grad():
    linear_layer.weight.copy_(default_weights)
    linear_layer.bias.copy_(default_biases)

After completing the process, we have the weight tensor initialized with ones and the bias tensor initialized with zeros:

print(linear_layer.weight)
print(linear_layer.bias)
Parameter containing:
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]], requires_grad=True)
Parameter containing:
tensor([0., 0., 0., 0.], requires_grad=True)
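Since weight and bias are torch.nn.Parameter objects, they also appear when iterating over the layer’s parameters. A small sketch of one common way to inspect them:

for name, parameter in linear_layer.named_parameters():
    print(name, tuple(parameter.shape))  # expected: weight (4, 3) and bias (4,)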

More dimensions#

Unlike classical matrix multiplication, a Linear layer can operate on tensors with higher dimensionality.

Suppose you have a tensor with dimensions \(\left(d_1, d_2, \dots, d_{m-2}, d_{m-1}, d_m\right)\).

A Linear layer designed for such input will have \(d_m\) input features and \(k\) output features, with a weight matrix \(\omega \in \mathbb{R}^{k \times d_m}\).

The output will have shape \(\left(d_1, d_2, \dots, d_{m-2}, d_{m-1}, k\right)\): each of the \(\prod_{i=1}^{m-2} d_i\) subtensors of size \(d_{m-1} \times d_m\) is independently multiplied by \(\omega^T\).


For example, consider an input of shape \(\left(3, 5, 4\right)\):

X = torch.randn(3, 5, 4)
X
tensor([[[ 0.6648,  0.0795, -0.3961,  0.0717],
         [ 2.2550,  1.3696, -1.4603,  0.9347],
         [ 0.2754, -0.6647, -0.0767,  0.2089],
         [ 0.7514,  0.3045, -1.1518, -0.4475],
         [-0.8777,  0.4888, -0.1978, -0.9798]],

        [[-1.7346,  0.5344, -1.8987,  0.5710],
         [ 0.5810, -0.0143,  0.7732, -0.3079],
         [-0.6366,  0.5068, -1.8391,  1.4452],
         [-1.1583,  0.9299,  0.6273, -1.8185],
         [ 0.7702, -1.7367, -0.8410, -0.3621]],

        [[ 0.2885, -0.1347,  0.8165, -0.4481],
         [-0.1231,  0.8926, -0.1328,  0.8820],
         [-0.9528,  1.1596, -0.3776, -0.5287],
         [ 0.2178, -0.4286,  1.1390,  1.9489],
         [-0.7107,  2.1834,  0.6254,  1.3248]]])

It is convenient to think of this as \(3\) matrices of size \(5 \times 4\).

The next cell creates a layer that can handle any number of such matrices; its number of input features matches the last dimension of \(X\).

linear = Linear(
    in_features=4, 
    out_features=2, 
    bias=False
)

Applying the layer directly to the data yields \(3\) matrices of size \(5 \times 2\).

linear(X)
tensor([[[-0.1042, -0.1680],
         [-0.8402, -0.3794],
         [ 0.2441, -0.3321],
         [-0.1327, -0.1373],
         [-0.0276,  0.4840]],

        [[ 0.1779, -0.0869],
         [-0.1385,  0.1409],
         [-0.0116, -0.4490],
         [-0.2179,  1.0413],
         [ 0.7147, -0.7942]],

        [[-0.0410,  0.1841],
         [-0.3835,  0.0896],
         [-0.3047,  0.5820],
         [-0.0099, -0.4006],
         [-0.9295,  0.6689]]], grad_fn=<UnsafeViewBackward0>)

The same result can be obtained by taking the input matrices one by one and multiplying each by the transposed weight matrix of the layer.

torch.stack([x @ linear.weight.data.T for x in X])
tensor([[[-0.1042, -0.1680],
         [-0.8402, -0.3794],
         [ 0.2441, -0.3321],
         [-0.1327, -0.1373],
         [-0.0276,  0.4840]],

        [[ 0.1779, -0.0869],
         [-0.1385,  0.1409],
         [-0.0116, -0.4490],
         [-0.2179,  1.0413],
         [ 0.7147, -0.7942]],

        [[-0.0410,  0.1841],
         [-0.3835,  0.0896],
         [-0.3047,  0.5820],
         [-0.0099, -0.4006],
         [-0.9295,  0.6689]]])
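
To confirm numerically that the two computations coincide, rather than comparing the printouts by eye, a check along these lines should return True:

manual = torch.stack([x @ linear.weight.data.T for x in X])
print(torch.allclose(linear(X), manual))  # expected: True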