DL mechanisms#
This page covers concepts that are traditionally associated with deep learning.
Recurrent#
Recurrence is an approach to processing units of data that typically form a sequence. The main idea is to use information about how the previous elements of the sequence were processed to process the following ones.
Mathematically it can be written as:
\[
h_t = f\left(W_1 x_t + b_1 + W_2 h_{t-1} + b_2\right)
\]
Where:
\(x_t\): input at the \(t\)-th step.
\(h_t\): vector that describes hidden state at the \(t\)-th step.
\(W_1\): weights associated with the input.
\(W_2\): weights associated with the state.
\(b_1\): bias associated with the input.
\(b_2\): bias associated with the state.
\(f\): activation function, typically a hyperbolic tangent.
Each \(h_t\) depends on \(x_t\) and \(h_{t-1}\); \(h_{t-1}\) in turn depends on \(x_{t-1}\) and \(h_{t-2}\), and so on recursively.
All these computations result in \(h_t\) for \(t = \overline{1,n}\), which can be used in the subsequent steps to describe the process we are interested in.
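Here is a minimal sketch of this recurrence, written with NumPy and a hyperbolic tangent as \(f\). The dimensions, names, and random parameter values are purely illustrative assumptions, not something defined on this page:

```python
import numpy as np

def rnn_forward(xs, W1, W2, b1, b2, h0):
    """Compute h_t = tanh(W1 @ x_t + b1 + W2 @ h_{t-1} + b2) for every step t."""
    h = h0
    hidden_states = []
    for x_t in xs:                        # walk over the sequence in order
        h = np.tanh(W1 @ x_t + b1 + W2 @ h + b2)
        hidden_states.append(h)
    return hidden_states                  # h_1, ..., h_n

# Illustrative sizes: 4-dimensional inputs, 3-dimensional hidden state.
rng = np.random.default_rng(0)
xs = [rng.normal(size=4) for _ in range(5)]   # a sequence of 5 inputs
W1 = rng.normal(size=(3, 4))                  # weights associated with the input
W2 = rng.normal(size=(3, 3))                  # weights associated with the state
b1, b2 = np.zeros(3), np.zeros(3)             # biases for the input and the state
h0 = np.zeros(3)                              # initial hidden state

print(rnn_forward(xs, W1, W2, b1, b2, h0)[-1])   # final hidden state h_n
```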
For a more detailed explanation, check out the Recurrent page.
Self-attention#
Self-attention is a mechanism that processes an array of input data. In the general case, each element of the array is a vector, \(x_i \in \mathbb{R}^k\). Unlike the RNN architecture, the order of the \(x_i\) is not central here, but self-attention is usually used to process ordered sequences as well.
The main idea is to build a transformation mechanism \(SA\):
\[
SA(x_1, x_2, \ldots, x_n) = (y_1, y_2, \ldots, y_n)
\]
For each \(x_i\), it considers all the members of the array, \(x_j\), and allows the most significant of them to influence the result \(y_i\).
At the highest level, the idea is simple: \(y_i\) is the weighted sum of all the elements in the sequence:
\[
y_i = \sum_{j=1}^{n} w_j \, x_j W^{\nu}
\]
Where:
\(W^\nu\): learnable matrix.
\(w_j\): the weight of the \(j\)-th element of the array in the context of processing \(y_i\). Finding these weights is the crucial part of the self-attention approach; the process is described below.
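As a sketch, the weighted sum above can be computed directly once the weights are known. The dimensions, the random stand-in for \(W^\nu\), and the placeholder weights \(w_j\) below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, a = 6, 4, 3                       # illustrative array length and dimensions
x = rng.normal(size=(n, k))             # the array x_1, ..., x_n
W_v = rng.normal(size=(k, a))           # stands in for the learnable matrix W^nu
w = rng.random(n)
w = w / w.sum()                         # placeholder weights w_j for one output y_i

# y_i = sum_j w_j * (x_j W^nu)
y_i = sum(w[j] * (x[j] @ W_v) for j in range(n))
print(np.allclose(y_i, w @ (x @ W_v)))  # the same sum in vectorized form
```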
For each element, introduce two vectors:
\[
k_i = x_i W^k, \qquad q_i = x_i W^q
\]
Here \(W^k \in \mathbb{R}^{k \times a}\) and \(W^q \in \mathbb{R}^{k \times a}\) are learnable parameters.
The idea behind the method is that these vectors are queries (\(q\)) and keys (\(k\)). The matrices that produce them (\(W^k\), \(W^q\)) are learned in such a way that the keys of some elements match the queries of the other elements.
In this context, “match” means a high value of the scalar product of \(q_i\) and \(k_j\).
For the element of the array that is being processed, \(x_i\), the vector \((q_i \cdot k_1, q_i \cdot k_2, \ldots, q_i \cdot k_n)\) is computed. The entries whose keys \(k_j\) better match the query \(q_i\) will be larger.
So for a chosen element \(x_i\), this vector can be considered the weights of its matches with all the elements. However, to give them the properties of proper weights (non-negative and summing to one), the vector is usually passed through a softmax function.
Finally, the weights take the following form:
\[
w_j = \frac{\exp(q_i \cdot k_j)}{\sum_{r=1}^{n} \exp(q_i \cdot k_r)}
\]
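Here is a sketch of this weight computation for one chosen element, with randomly initialized matrices standing in for the learned \(W^k\) and \(W^q\) (all names and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, a = 6, 4, 3
x = rng.normal(size=(n, k))             # input array
W_k = rng.normal(size=(k, a))           # placeholder for the learned key matrix W^k
W_q = rng.normal(size=(k, a))           # placeholder for the learned query matrix W^q

keys = x @ W_k                          # k_j = x_j W^k for every element
queries = x @ W_q                       # q_j = x_j W^q for every element

i = 2                                   # the element currently being processed
scores = keys @ queries[i]              # (q_i . k_1, ..., q_i . k_n)
w = np.exp(scores) / np.exp(scores).sum()   # softmax turns the matches into weights
print(w, w.sum())                       # non-negative weights that sum to 1
```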
The entire transformation will take the following form:
\[
y_i = \sum_{j=1}^{n} \frac{\exp(q_i \cdot k_j)}{\sum_{r=1}^{n} \exp(q_i \cdot k_r)} \; x_j W^{\nu}
\]
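Putting the pieces together, a minimal self-attention pass over the whole array might look as follows; the random matrices only stand in for the learned \(W^q\), \(W^k\) and \(W^\nu\):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)       # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Return y_1, ..., y_n, where y_i is the softmax-weighted sum of the x_j W^v."""
    queries = x @ W_q                   # q_i = x_i W^q
    keys = x @ W_k                      # k_i = x_i W^k
    values = x @ W_v                    # x_i W^v
    weights = softmax(queries @ keys.T, axis=-1)  # row i holds the weights used for y_i
    return weights @ values             # each row is the weighted sum y_i

rng = np.random.default_rng(0)
n, k, a = 6, 4, 3
x = rng.normal(size=(n, k))
W_q, W_k, W_v = (rng.normal(size=(k, a)) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)     # (6, 3): one y_i per input x_i
```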