
Tokenization#

Tokenization is the process of identifying the segments that text processing algorithms will treat as atomic units.

Vocabulary: The set of tokens used by the tokenization approach.
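
As a minimal illustration, a plain whitespace split already fits this definition. The toy snippet below (not a trained tokenizer) shows the atomic units and the resulting vocabulary for a short sentence:

# A toy illustration: whitespace tokenization of a short sentence.
text = "this is his hat"
tokens = text.split()      # atomic units produced by this simple tokenizer
vocabulary = set(tokens)   # the vocabulary: the set of distinct tokens
tokens, vocabulary
# (['this', 'is', 'his', 'hat'], {'this', 'is', 'his', 'hat'})  # set order may vary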

Wordpiece#

This approach starts from all individual symbols that appear in the example text corpus; at the first step, these form the model's vocabulary. At each following step it adds to the vocabulary the pair of tokens from the current vocabulary that has the highest score:

\[\frac{N(i, j)}{N(i)N(j)}\]

Where:

  • \(N(i, j)\): frequency of the pair formed from \(i\)-th and \(j\)-th tokens.

  • \(N(i)\): frequency of the \(i\)-th token.

The idea behind this formula is as follows: if two tokens frequently appear together but rarely on their own, their pair should be added to the vocabulary. The numerator \(N(i,j)\) is high when the tokens often occur together, while the denominator \(N(i)N(j)\) stays relatively low when each token is rare on its own, so such pairs receive the highest scores.
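
A minimal sketch of this selection step, assuming token and pair frequencies have already been counted (the counts below are made up for illustration):

from collections import Counter

# Hypothetical counts for a toy corpus; N(i) and N(i, j) from the formula above.
token_counts = Counter({"h": 3, "u": 2, "g": 2, "s": 2})               # N(i)
pair_counts = Counter({("h", "u"): 2, ("u", "g"): 2, ("g", "s"): 1})   # N(i, j)

def pair_score(i, j):
    # score(i, j) = N(i, j) / (N(i) * N(j))
    return pair_counts[(i, j)] / (token_counts[i] * token_counts[j])

# The pair with the highest score would be merged into a single new token.
max(pair_counts, key=lambda pair: pair_score(*pair))
# returns ('u', 'g'): "u" and "g" almost never appear apart, so their score is highest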

Unigram#

Unigram tokenization begins with the largest possible vocabulary: ideally, all substrings of the words in the data up to a defined length. The probability of encountering each token is estimated from its frequency in the training data, and at each step the token with the lowest frequency is removed from the vocabulary until the specified vocabulary size is reached.


Consider the sentence “this is his hat”. All possible tokens are counted in the following cell:

from collections import Counter

train_data = ["this", "is", "his", "hat"]
counter = Counter()

# Count every substring of every word: these are the candidate tokens
# for the initial unigram vocabulary.
for s in train_data:
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            counter[s[i:j]] += 1
counter
Counter({'h': 3,
         'i': 3,
         'is': 3,
         's': 3,
         't': 2,
         'hi': 2,
         'his': 2,
         'th': 1,
         'thi': 1,
         'this': 1,
         'ha': 1,
         'hat': 1,
         'a': 1,
         'at': 1})

According to the unigram logic, at this step we could delete any token with a frequency of 1. In real data, single-symbol tokens have relatively large frequencies, so they are unlikely to be deleted as long as the vocabulary size is significantly larger than the number of letters in the language.
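
Continuing the toy example, here is a minimal sketch of one pruning step over the `counter` computed above. Note that real unigram implementations score tokens by their effect on the corpus likelihood rather than by raw frequency and always keep single-character tokens; the sketch only illustrates the idea of removing the rarest tokens:

# Drop the rarest tokens (frequency 1 here), but keep single characters
# so that every word can still be encoded.
min_count = min(counter.values())
pruned = Counter({tok: cnt for tok, cnt in counter.items()
                  if cnt > min_count or len(tok) == 1})
pruned
# Counter({'h': 3, 'i': 3, 'is': 3, 's': 3, 't': 2, 'hi': 2, 'his': 2, 'a': 1})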

Tokenization playground#

Try the tokenization playground launched by Hugging Face. It’s a service that allows you to experiment with various tokenizers used in modern machine learning models.