Attention#

Attention is a transformation of a sequence that uses information contained in the entire sequence. The source of this information can be the sequence itself or another sequence.

Self-attention#

Define self-attention as \(SA\):

\[ SA: x_i \mapsto y_i, \quad x_i \in \mathbb{R}^s,\; y_i \in \mathbb{R}^{d_o},\; i \in \{1, \dots, n\}. \]

Where:

  • \(s\): the dimensionality of the vector describing a single element of the sequence.

  • \(n\): the length of the sequence to be transformed.

  • \(d_o\): the size of the embedding that represents each element of the sequence after the transformation.

The procedure involves calculating several vectors that represent each element. An important parameter of the entire procedure is \(d\), the dimensionality of the vectors calculated for each element of the processed sequence.

At the highest level, the idea is simple: \(y_i\) is a weighted sum over all the elements in the sequence:

\[y_i = \sum_{j=1}^{n} w_{ij} \, x_j W^\nu\]

Where:

  • \(W^\nu \in \mathbb{R}^{s \times d_o}\): learnable matrix.

  • \(w_i = (w_{i1}, w_{i2}, \ldots, w_{in}) \in \mathbb{R}^n\): a crucial element of the self-attention approach. It is a vector that shows how the other elements of the sequence influence the sense of the element under consideration.

Note: In some sources, the product \(x_i W^\nu\) is interpreted as the vector \(\nu_i \in \mathbb{R}^{d_o}\), which contains the general representation of a word regardless of context.

For each element, introduce two vectors:

\[q_i = x_i W^q, \quad k_i = x_i W^k.\]

Here \(W^q \in \mathbb{R}^{s \times d}\) and \(W^k \in \mathbb{R}^{s \times d}\) are learnable parameters. \(d\) is a selectable parameter that determines the dimensionality of \(q_i \in \mathbb{R}^d\) and \(k_i \in \mathbb{R}^d\), which influences the complexity of the contexts \(q_i\) and \(k_i\) can capture.

The idea behind the method is that these vectors are queries (\(q\)) and keys (\(k\)). The matrices that produce them (\(W^q, W^k\)) are learned in such a way that the keys of some elements match the queries of other elements.

In this context, “match” means that the scalar product of \(q_i\) and \(k_j\) is high.

For the \(i\)-th element of the processed sequence, the vector \((q_i k_1, q_i k_2, \ldots, q_i k_n)\) is computed. The components for which \(k_j\) better matches \(q_i\) will be larger.

For a given element \(x_i\), this vector represents the weights of the matches with all other elements. However, to give the vector the properties of real weights, it is normalized by dividing by \(\sqrt{d}\) and then processed by a softmax function. Note: \(d\) is the dimension of \(q_i\) and \(k_i\).

Finally, the weights take the following form:

\[w_i = softmax \left( \frac{(x_i W^q)(x_1 W^k)}{\sqrt{d}}, \frac{(x_i W^q)(x_2 W^k)}{\sqrt{d}}, \ldots, \frac{(x_i W^q)(x_n W^k)}{\sqrt{d}} \right)\]

The entire transformation will take the following form:

\[y_i = softmax \left( \frac{(x_i W^q)(x_1 W^k)}{\sqrt{d}}, \frac{(x_i W^q)(x_2 W^k)}{\sqrt{d}}, \ldots, \frac{(x_i W^q)(x_n W^k)}{\sqrt{d}} \right) \left(\begin{array}{c} x_1 W^\nu \\ x_2 W^\nu \\ \vdots \\ x_n W^\nu \end{array}\right)\]
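Below is a minimal NumPy sketch of this element-wise view, assuming toy dimensions and randomly initialized weight matrices (in a real model \(W^q\), \(W^k\), \(W^\nu\) are learned); the variable names `W_q`, `W_k`, `W_v` are just illustrative. It computes the weights \(w_i\) and the output \(y_i\) for a single position \(i\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: n elements of size s, internal dimension d, output dimension d_o.
n, s, d, d_o = 4, 8, 16, 10

X = rng.normal(size=(n, s))        # sequence elements x_1..x_n stacked as rows
W_q = rng.normal(size=(s, d))      # query projection W^q (random stand-in for a learned matrix)
W_k = rng.normal(size=(s, d))      # key projection W^k
W_v = rng.normal(size=(s, d_o))    # value projection W^nu

def softmax(z):
    z = z - z.max()                # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

i = 0                              # position under consideration
q_i = X[i] @ W_q                   # query of element i
K = X @ W_k                        # keys k_1..k_n stacked as rows
V = X @ W_v                        # value vectors nu_1..nu_n stacked as rows

w_i = softmax(q_i @ K.T / np.sqrt(d))   # attention weights of element i over the sequence
y_i = w_i @ V                           # weighted sum of the value vectors

print(w_i.shape, y_i.shape)        # (n,), (d_o,)
```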

Matrix representation#

The previous description provided more detail, explaining the role of each element in the processed sequence. In practice, however, the entire sequence is handled simultaneously. To reflect this, we will now move on to matrix transformations, which enable processing of the whole sequence with just a few matrix multiplications.

Vectors \(k_i\) and \(q_i\) must be computed for each element of the processed sequence. To do that, stack the sequence elements \(x_i \in \mathbb{R}^s\), \(i \in \overline{1,n}\), vertically into the matrix \(X \in \mathbb{R}^{n \times s}\):

\[\begin{split} X = \left(\begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_n \end{array}\right) \end{split}\]

The matrix multiplications \(Q=XW^q\) and \(K=XW^k\) produce exactly the \(q_i\) and \(k_i\), \(i=\overline{1,n}\), stacked vertically:

\[\begin{split} Q = XW^q = \left(\begin{array}{c} x_1 W^q\\ x_2 W^q\\ \vdots \\ x_n W^q \end{array}\right) = \left(\begin{array}{c} q_1\\ q_2\\ \vdots \\ q_n \end{array}\right); \end{split}\]
\[\begin{split} K = XW^k = \left(\begin{array}{c} x_1 W^k \\ x_2 W^k \\ \vdots \\ x_n W^k \end{array}\right) = \left(\begin{array}{c} k_1 \\ k_2 \\ \vdots \\ k_n \end{array}\right). \end{split}\]

To clarify the dimensions of the obtained matrices: \(Q \in \mathbb{R}^{n \times d}\), \(K \in \mathbb{R}^{n \times d}\).

Now, we need to find all the product combinations of \(q_i\) and \(k_j\) in matrix form:

\[\left(q_ik_j\right)_{n \times n}\]

This is achieved with a multiplication of the form:

\[\begin{split} QK^T = \left(\begin{array}{c} q_1 \\ q_2 \\ \vdots \\ q_n \end{array}\right) \left(\begin{array}{cccc}k_1^T & k_2^T & \ldots & k_n^T\end{array}\right) = \left(\begin{array}{cccc} q_1k_1 & q_1k_2 & \dots & q_1k_n \\ q_2k_1 & q_2k_2 & \dots & q_2k_n \\ \vdots & \vdots & \ddots & \vdots \\ q_nk_1 & q_nk_2 & \dots & q_nk_n \end{array}\right) \end{split}\]

Normalization by \(\sqrt{d}\) and a row-wise softmax function are applied to bring the values into the range \([0,1]\). The resulting matrix shows, in position \((i,j)\), the degree to which the \(j\)-th element of the sequence contributes to the meaning of the \(i\)-th element:

\[\begin{split}softmax\left(\frac{QK^T}{\sqrt{d}}\right) = \left(\begin{array}{c} softmax(\alpha_{11}, \alpha_{12}, \dots, \alpha_{1n}) \\ softmax(\alpha_{21}, \alpha_{22}, \dots, \alpha_{2n}) \\ \vdots \\ softmax(\alpha_{n1}, \alpha_{n2}, \dots, \alpha_{nn}) \end{array}\right) \end{split}\]

Where:

\[\alpha_{ij} = \frac{q_ik_j}{\sqrt{d}}\]
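A short NumPy sketch of this step, again with assumed toy sizes and random stand-ins for the learned matrices: it builds \(Q\) and \(K\), forms the scaled score matrix \(\alpha_{ij} = q_i k_j / \sqrt{d}\), and applies the row-wise softmax.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, d = 4, 8, 16                      # assumed toy sizes

X = rng.normal(size=(n, s))             # sequence stacked into X
W_q = rng.normal(size=(s, d))           # illustrative random W^q
W_k = rng.normal(size=(s, d))           # illustrative random W^k

Q = X @ W_q                             # all q_i stacked vertically, shape (n, d)
K = X @ W_k                             # all k_i stacked vertically, shape (n, d)

scores = Q @ K.T / np.sqrt(d)           # alpha_{ij} = q_i k_j / sqrt(d), shape (n, n)

# Row-wise softmax: row i holds the weights of element i over the whole sequence.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = weights / weights.sum(axis=1, keepdims=True)

print(weights.shape, weights.sum(axis=1))   # (n, n), each row sums to 1
```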

In parallel, using the same procedure as for the \(Q\) and \(K\) matrices, the matrix \(V\) is computed as the stacked \(\nu_i\) vectors:

\[\begin{split}V = XW^{\nu} = \left(\begin{array}{c} x_1 W^{\nu}\\ x_2 W^{\nu}\\ \vdots \\ x_n W^{\nu} \end{array}\right) = \left(\begin{array}{c} \nu_1\\ \nu_2\\ \vdots \\ \nu_n \end{array}\right) \end{split}\]

Here, \(W^{\nu} \in \mathbb{R}^{s \times d_o}\) and \(X \in \mathbb{R}^{n \times s}\), which gives \(V = XW^{\nu} \in \mathbb{R}^{n \times d_o}\). \(d_o\) is a selectable parameter; as will become clear in the following steps, it determines the dimensionality of the output embeddings.

Finally:

\[softmax\left(\frac{QK^T}{\sqrt{d}}\right)V \in \mathbb{R}^{n \times d_o}\]

represents the elements of the sequence that have been transformed by self-attention and stacked vertically.
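The whole matrix form fits in a few lines of NumPy. The following is a minimal sketch under the same assumptions as before (toy dimensions, random untrained projection matrices); the function name `self_attention` is just illustrative.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Compute softmax(Q K^T / sqrt(d)) V for the whole sequence at once."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                            # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # (n, d_o)

rng = np.random.default_rng(2)
n, s, d, d_o = 4, 8, 16, 10                 # assumed toy sizes
X = rng.normal(size=(n, s))
Y = self_attention(X,
                   rng.normal(size=(s, d)),    # W^q
                   rng.normal(size=(s, d)),    # W^k
                   rng.normal(size=(s, d_o)))  # W^nu
print(Y.shape)                              # (n, d_o)
```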

Multihead attention#

Multihead attention involves applying several attention transformations to the same sequence and then combining the results. This approach helps capture different contextual relationships.

More specifically: we create \(d_h\) attention transformations:

\[Q^{(i)}, K^{(i)}, V^{(i)}, i = \overline{1, d_{h}}.\]

For each of them, a transformed sequence called a “head” is produced:

\[H^{(i)}=softmax\left(\frac{Q^{(i)}\left[K^{(i)}\right]^T}{\sqrt{d}}\right)V^{(i)} \in \mathbb{R}^{n \times d}\]

These results are concatenated horizontally and transformed by the learnable matrix \(W_0 \in \mathbb{R}^{d d_h \times d_o}\). \(d_o\) here is the dimensionality of the output embedding.

\[H = \left(H^{(1)}, H^{(2)}, \ldots, H^{(d_h)}\right) \in \mathbb{R}^{n \times d d_h}\]

The result of the transformation will be as follows:

\[HW_0 \in \mathbb{R}^{n \times d_o}\]
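A minimal sketch of multihead attention under the same assumptions (toy sizes, random untrained matrices, illustrative names): each head is an independent self-attention with per-head value dimension \(d\), the heads are concatenated horizontally, and the result is projected with \(W_0\).

```python
import numpy as np

def softmax_rows(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multihead_attention(X, heads, W_0):
    """Concatenate d_h self-attention heads horizontally and project with W_0."""
    H_list = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        d = Q.shape[-1]
        H_list.append(softmax_rows(Q @ K.T / np.sqrt(d)) @ V)   # head H^(i), (n, d)
    H = np.concatenate(H_list, axis=1)                          # (n, d * d_h)
    return H @ W_0                                              # (n, d_o)

rng = np.random.default_rng(3)
n, s, d, d_h, d_o = 4, 8, 16, 3, 10          # assumed toy sizes
X = rng.normal(size=(n, s))
heads = [(rng.normal(size=(s, d)),           # W^q of head i
          rng.normal(size=(s, d)),           # W^k of head i
          rng.normal(size=(s, d)))           # W^nu of head i
         for _ in range(d_h)]
W_0 = rng.normal(size=(d * d_h, d_o))
print(multihead_attention(X, heads, W_0).shape)   # (n, d_o)
```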