
Embeddings#

An embedding is the representation of an object, such as a word or a sentence, as a vector of numbers.

Dense vs Sparse#

There are two types of embeddings: dense and sparse.

Sparse embeddings usually result in a high-dimensional vector whose size is close to the vocabulary size (30 000 is typical). Each position in the vector corresponds to a specific token in the vocabulary, which makes interpreting the results easier.

Dense embeddings typically have fewer dimensions (384, 768, or 1024 elements), and the position of an element is not directly related to a specific token.
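To make the contrast concrete, here is a minimal sketch (toy vocabulary and made-up sizes, purely for illustration) of the two representations:

```python
import numpy as np

# Toy vocabulary; real sparse embeddings use a vocabulary of ~30 000 tokens.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

# Sparse embedding of "the cat sat on the mat": one position per vocabulary
# token, so every value is directly interpretable.
sparse = np.zeros(len(vocab))
for token in "the cat sat on the mat".split():
    sparse[vocab[token]] += 1.0
print(sparse)  # [2. 1. 1. 1. 1.] -- position i is the weight of token i

# Dense embedding: far fewer dimensions (e.g. 384), and an individual
# position has no direct relation to any vocabulary token.
rng = np.random.default_rng(0)
dense = rng.normal(size=384)  # in practice produced by a trained model
print(dense.shape)            # (384,)
```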

Word2Vec#

Word2Vec (W2V) is an approach to building word embeddings based on context: words that appear in similar contexts get similar embeddings.

Each word \(i\) is associated with two vectors:

  • \(u_i \in \mathbb{R}^n\): center vector.

  • \(\nu_i \in \mathbb{R}^n\): context vector.

Now, let’s consider two words, \(i\) and \(j\). We define the probability of encountering word \(i\) in the context of word \(j\) as follows:

\[p_{ij} = \sigma(u_i^T \nu_j)\]
  • \(\sigma\): sigmoid function.

The optimization algorithm looks for \(u_i\) and \(\nu_i\) that maximize \(p_{ij}\) when word \(j\) appears in the context of word \(i\) and minimize it when it does not.
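Below is a minimal sketch of this optimization in plain NumPy (the corpus, hyperparameters, and negative-sampling scheme are made up for illustration); it follows the gradient of the log loss for \(p_{ij} = \sigma(u_i^T \nu_j)\):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

dim, lr, window = 16, 0.05, 2
U = rng.normal(scale=0.1, size=(len(vocab), dim))  # center vectors u_i
V = rng.normal(scale=0.1, size=(len(vocab), dim))  # context vectors v_i

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(200):
    for pos, word in enumerate(corpus):
        i = idx[word]
        for off in range(-window, window + 1):
            if off == 0 or not 0 <= pos + off < len(corpus):
                continue
            j = idx[corpus[pos + off]]     # real context word  -> label 1
            k = rng.integers(len(vocab))   # random "negative"  -> label 0
            for ctx, label in ((j, 1.0), (k, 0.0)):
                p = sigmoid(U[i] @ V[ctx])  # p_ij = sigma(u_i^T v_j)
                grad = p - label            # gradient of the log loss w.r.t. the score
                u_old = U[i].copy()
                U[i] -= lr * grad * V[ctx]
                V[ctx] -= lr * grad * u_old
```

With enough data, words that share contexts (here, for example, “cat” and “dog”) should end up with similar vectors.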

BERT#

BERT is a popular model for building embeddings. Developed by Google, it has many modifications for specific tasks.

Training#

BERT training included solving two tasks: Masked Token Prediction (MTP) and Next Sentence Prediction (NSP).

Masked Token Prediction (MTP): some tokens of the sentence were masked or replaced, and the goal of the model was to predict the correct tokens in those places.

Replacements usually follow empirical rules:

  • 15% of the tokens from the original data are selected to participate in the loss function calculation.

  • 80% of the selected tokens must be replaced with a mask (the original BERT uses the special [MASK] token).

  • 10% of the selected tokens must be replaced with random ones.

  • 10% of the selected tokens must be left unchanged.

This rule is important for avoiding model overfitting. The specific values were discovered through experimentation.

As an example, consider the following sentence:

  • Finetuning sparse embedding models involves several components: the model, datasets, loss functions, training arguments, evaluators, and the trainer class.

For simplicity, we will consider the case of whitespace tokenization.

  • Finetuning sparse embedding [MASK] involves several components: the model, datasets, loss dogs, training arguments, [MASK], and the trainer class.

We expect the model to predict something like this:

  • [-][-][-][models][-][-][components:][-][-][-][-][functions][-][-][evaluators][-][-][-]

Here [-] marks tokens that are unimportant to us: they are not among the 15% of “interesting” tokens and are not used in the loss calculation. The tokens that are among the 15% of “interesting” ones, however, must be predicted correctly to minimize the loss.
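A minimal sketch of how such training examples could be prepared, assuming whitespace tokenization as above (the helper name and sampling code are made up for illustration):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_p=0.15, rng=random.Random(0)):
    """Return (corrupted tokens, labels); label is None where no loss is taken."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_p:      # ~85%: not selected, no loss
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)              # selected: the model must recover this token
        r = rng.random()
        if r < 0.8:                       # 80% of selected -> [MASK]
            corrupted.append(MASK)
        elif r < 0.9:                     # 10% -> random token
            corrupted.append(rng.choice(vocab))
        else:                             # 10% -> left unchanged
            corrupted.append(token)
    return corrupted, labels

sentence = ("Finetuning sparse embedding models involves several components: "
            "the model, datasets, loss functions, training arguments, "
            "evaluators, and the trainer class.")
tokens = sentence.split()
corrupted, labels = mask_tokens(tokens, vocab=tokens)
print(corrupted)
print(labels)  # non-None entries are exactly the tokens used in the loss
```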

Next Sentence Prediction (NSP): the model is given a sentence whose two parts are separated by a special token. It must classify whether the second part matches the first.

The original approach by Google uses:

  • [CLS]: token that indicates the position from which the classification outcome is read.

  • [SEP]: token that separates the part of the sentence that is guaranteed to be correct from the part that may be incorrect.

Suppose the training dataset contains the following sentence:

  • Born in Nine Mile, Jamaica, Marley began his career in 1963, after forming the group Teenagers with Peter Tosh and Bunny Wailer, which became the Wailers.

For NSP, it can be transformed as follows:

  • [CLS] Born in Nine Mile, Jamaica, Marley began his career in 1963, after forming the [SEP] group Teenagers with Peter Tosh and Bunny Wailer, which became the Wailers.

In this case, the output sequence at the position corresponding to the [CLS] token must contain a signal that can be interpreted as “true”.

Or the sentence can be transformed as:

  • [CLS] Born in Nine Mile, Jamaica, Marley began his career in 1963, after forming the [SEP] mama of hand typing fast.

This makes no sense, so the model learns to predict a “false” signal at the position corresponding to the [CLS] token.
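A minimal sketch of how such NSP pairs could be built, again assuming whitespace tokenization (the helper name and splitting logic are made up for illustration):

```python
import random

CLS, SEP = "[CLS]", "[SEP]"

def make_nsp_example(sentence, distractors, rng=random.Random(0)):
    """Split a sentence at a random point; half the time, swap in a random continuation."""
    tokens = sentence.split()
    cut = rng.randrange(1, len(tokens))        # where [SEP] goes
    first, second = tokens[:cut], tokens[cut:]
    if rng.random() < 0.5:
        label = "true"                         # real continuation
    else:
        label = "false"                        # random continuation from elsewhere
        second = rng.choice(distractors).split()
    return [CLS] + first + [SEP] + second, label

distractors = ["mama of hand typing fast."]
sentence = ("Born in Nine Mile, Jamaica, Marley began his career in 1963, "
            "after forming the group Teenagers with Peter Tosh and Bunny Wailer, "
            "which became the Wailers.")
tokens, label = make_nsp_example(sentence, distractors)
print(" ".join(tokens))
print(label)  # the model must output this signal at the [CLS] position
```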