DL mechanisms#
This page covers concepts that are traditionally associated with deep learning.
Deep learning is a subfield of machine learning that encompasses models known as neural networks, inspired by the human brain.
Recurrent#
Recurrence is an approach to processing data that is typically sequential. The main idea is to use information about how previous elements of the sequence were processed when processing the following ones.
Mathematically it can be written:

\(h_t = f(W_1 x_t + b_1 + W_2 h_{t-1} + b_2)\)
Where:
\(x_t\): input at the \(t\)-th step.
\(h_t\): vector that describes hidden state at the \(t\)-th step.
\(W_1\): weights associated with the input.
\(W_2\): weights associated with the state.
\(b_1\): bias associated with the input.
\(b_2\): bias associated with the state.
\(f\): activation function, typically a hyperbolic tangent.
Each \(h_t\) depends on \(x_t\) and \(h_{t-1}\). But \(h_{t-1}\) in turn depends on \(x_{t-1}\) and \(h_{t-2}\), and so on recursively.
The resulting states \(h_t\) for \(t = \overline{1,n}\) can then be used in subsequent steps to describe the process we are interested in.
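As an illustration, here is a minimal NumPy sketch of this recurrence; the function name, shapes, and random weights are assumptions made for the example, not something defined on this page.

```python
import numpy as np

def rnn_forward(x_seq, W1, W2, b1, b2, h0=None):
    """Run the recurrence h_t = tanh(W1 @ x_t + b1 + W2 @ h_{t-1} + b2)."""
    hidden_dim = W2.shape[0]
    h = np.zeros(hidden_dim) if h0 is None else h0
    states = []
    for x_t in x_seq:                                   # iterate over the sequence step by step
        h = np.tanh(W1 @ x_t + b1 + W2 @ h + b2)        # new state depends on the input and the previous state
        states.append(h)
    return np.stack(states)                             # h_t for t = 1..n

# Illustrative usage: 5 steps of 3-dimensional inputs, 4-dimensional hidden state.
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 3))
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
b1, b2 = np.zeros(4), np.zeros(4)
print(rnn_forward(x_seq, W1, W2, b1, b2).shape)         # (5, 4)
```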
For a more detailed explanation, check out the Recurrent page.
Attention#
Attention is a transformation of a sequence that uses information from the entire sequence. The source of that information can be the sequence itself or another sequence.
Attention is generally associated with the transformer architecture, but the idea can be used in other approaches as well. It comes in a wide range of modifications, which makes the attention topic potentially very large. That's why it is considered independently from the transformer here.
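As a rough illustration, the sketch below implements plain scaled dot-product attention in NumPy; the function name and the shapes are assumptions for the example. Passing the same sequence as queries, keys, and values corresponds to self-attention; taking queries from one sequence and keys/values from another corresponds to cross-attention.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted average of the rows of V,
    with weights derived from how well the query matches each key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the keys
    return weights @ V

# Self-attention: the sequence attends to itself.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                                # 6 tokens, 8-dimensional embeddings
print(scaled_dot_product_attention(X, X, X).shape)         # (6, 8)
```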
Check the Attention page for more details.
Transformer#
The transformer is a deep learning architecture that was introduced in the paper Attention Is All You Need.
The schema is shown in the following picture:
The following list outlines the most important concepts used in the transformer architecture; a minimal code sketch combining them follows the list:
Attention transformation: A transformation applied to a sequence, where each element is computed by incorporating information from other elements in the sequence, with different levels of attention paid to them. Several extensions of the attention mechanism are used in modern transformers:
Multi-Head Attention: The sequence is transformed by several attention mechanisms in parallel, each referred to as a head.
Masked Attention: A variant of attention where certain positions in the sequence are hidden (masked) to prevent an element from attending to future tokens.
Self-Attention: A process in which each element of a sequence attends to other elements within the same sequence.
Cross-Attention: A process in which each element of a sequence attends to elements of another sequence.
Encoder/Decoder: The encoder and decoder are the essential parts of the transformer. The encoder processes the input sequence using self-attention to produce contextual representations. The decoder generates the output sequence by attending to the encoder's representations and predicting the next token in the sequence.
Positional encoding: Since self-attention does not inherently capture the order of elements, the transformer adds to the input embeddings values computed from a function of each element's position in the sequence.
Feed Forward: Two linear transformations connected by an activation function. The first typically increases dimensionality, and the second reduces it back down.
Add & Norm: Add the skip-connection information to the output of the attention (or feed-forward) layer and normalize the result.
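As a minimal sketch of how these pieces fit together, the code below assembles one encoder-style block with sinusoidal positional encoding, single-head self-attention (standing in for the multi-head version), a feed-forward sublayer, and Add & Norm. All names, shapes, and the use of NumPy are assumptions for the example, not code from the paper.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal values added to the embeddings so positions become distinguishable."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean and unit variance."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention (the multi-head version runs several of these in parallel)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def encoder_block(X, params):
    """Attention -> Add & Norm -> Feed Forward -> Add & Norm."""
    attn = self_attention(X, params["Wq"], params["Wk"], params["Wv"])
    X = layer_norm(X + attn)                                   # Add & Norm (skip connection)
    hidden = np.maximum(0, X @ params["W1"] + params["b1"])    # feed-forward expands dimensionality
    ff = hidden @ params["W2"] + params["b2"]                  # and projects it back down
    return layer_norm(X + ff)                                  # Add & Norm again

# Illustrative usage: 6 tokens with d_model = 8 and a feed-forward width of 32.
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 6
X = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
params = {
    "Wq": rng.normal(size=(d_model, d_model)), "Wk": rng.normal(size=(d_model, d_model)),
    "Wv": rng.normal(size=(d_model, d_model)),
    "W1": rng.normal(size=(d_model, d_ff)), "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)), "b2": np.zeros(d_model),
}
print(encoder_block(X, params).shape)                          # (6, 8)
```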