# Transformer

This page discusses the transformer architecture. The following figure shows the classical schema used to explain the transformer.

![](transformer_files/schema.svg)

## Encoder/decoder

The transformer uses an encoder/decoder architecture. The idea behind this architecture is the following:

- In the encoder, positional encoding is applied to the input. Multi-head attention is then used to build a representation of the sequence.
- In the decoder, a transformation similar to the encoder's is applied to the incomplete output sequence. Multi-head attention is then used a second time to combine the output of the encoder with the encoded output sequence produced by the first part of the decoder. The result is used to generate the output.
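The positional encoding step can be sketched in plain Python. This page does not say which encoding is used, so the sketch assumes the sinusoidal scheme from the original transformer paper. The function names `sinusoidal_positional_encoding` and `add_positional_encoding` are illustrative, not taken from any library.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position codes.

    Assumption: even dimensions use sin, odd dimensions use cos, both with
    the frequency 1 / 10000^(i / d_model), where i is the even dimension index.
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def add_positional_encoding(embeddings):
    """Add position codes element-wise to token embeddings (seq_len x d_model)."""
    seq_len, d_model = len(embeddings), len(embeddings[0])
    pe = sinusoidal_positional_encoding(seq_len, d_model)
    return [
        [e + p for e, p in zip(emb_row, pe_row)]
        for emb_row, pe_row in zip(embeddings, pe)
    ]
```

The same function can be applied to the encoder input and to the incomplete output sequence fed to the decoder, since both are sequences of embeddings.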

## Masked attention

Masked attention prevents the model from using the keys of future elements when computing values for the current element of the sequence. A mask matrix $M$ is added to the matrix that contains all $q_i k_j$ combinations:

$$M_{ij} = \begin{cases}
0, & j \leq i \\
- \infty, & j > i
\end{cases}$$

Or, in a more visual representation:

$$
M = \left(\begin{array}{ccccc}
0 & -\infty & -\infty & \cdots & -\infty \\
0 & 0 & -\infty & \cdots & -\infty \\
0 & 0 & 0 & \cdots & -\infty \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 0
\end{array}\right)
$$
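As a minimal sketch, the mask in the matrix form above can be built as a nested list: zeros on and below the diagonal, $-\infty$ above it. The helper name `causal_mask` is hypothetical and uses 0-based indices.

```python
NEG_INF = float("-inf")


def causal_mask(n):
    """Return an n x n mask: 0 where column j <= row i, -inf where j > i."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]


mask = causal_mask(4)
# The first row keeps only its first element; the last row keeps all elements.
```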

So, before scaling, the expression under the softmax takes the form:

$$QK^T + M = 
\left(\begin{array}{ccccc}
q_1k_1 & -\infty & -\infty & \cdots & -\infty \\
q_2k_1 & q_2k_2 & -\infty & \cdots & -\infty \\
q_3k_1 & q_3k_2 & q_3k_3 & \cdots & -\infty \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
q_nk_1 & q_nk_2 & q_nk_3 & \cdots & q_nk_n
\end{array}\right)
$$

The softmax maps the $-\infty$ elements to zero, because $e^{-\infty} = 0$. That's why:

$$\operatorname{softmax}\left(\frac{QK^T + M}{\sqrt{d}}\right) =
\left(\begin{array}{ccccc}
s_{11} & 0 & 0 & \cdots & 0 \\
s_{21} & s_{22} & 0 & \cdots & 0 \\
s_{31} & s_{32} & s_{33} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
s_{n1} & s_{n2} & s_{n3} & \cdots & s_{nn}
\end{array}\right)
$$

Here, $s_{ij}$ is the result of the softmax transformation for the $j$-th element of the $i$-th row.
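The whole computation can be checked numerically with a small sketch in plain Python. It assumes `Q` and `K` are lists of equal-length row vectors, and the name `masked_attention_weights` is illustrative. The mask is applied inline rather than as a separate matrix, which gives the same result.

```python
import math


def masked_attention_weights(Q, K):
    """Row-wise softmax((Q K^T + M) / sqrt(d)) with a causal mask M."""
    n, d = len(Q), len(Q[0])
    weights = []
    for i in range(n):
        # Scaled scores for row i; positions j > i get -inf from the mask.
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
            if j <= i
            else float("-inf")
            for j in range(n)
        ]
        top = max(scores)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
S = masked_attention_weights(Q, K)
```

Every entry of `S` above the diagonal is zero, and each row still sums to one, so the element at position $i$ distributes its attention only over positions $1, \ldots, i$.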