Tokenizers#

This page discusses the details of working with the tokenizers package.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

Continuing subword prefix#

continuing_subword_prefix is a parameter that defines the prefix added to tokens when a word is split into multiple subword tokens. It marks tokens that follow the initial token of a word.


The following cell builds and trains the tokenizer. Note that the trainer is created with continuing_subword_prefix="##".

# Train a tiny BPE model; tokens that continue a word get the "##" prefix
trainer = BpeTrainer(continuing_subword_prefix="##", vocab_size=30)
tokenizer = Tokenizer(BPE())
tokenizer.train_from_iterator(
    ["doing", "sleeping", "resting", "reading", "running"],
    trainer=trainer
)
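
As a quick check, the learned vocabulary should hold exactly the vocab_size=30 entries requested from the trainer:

tokenizer.get_vocab_size()
30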

The next cell shows the vocabulary that the tokenizer learned.

tokenizer.get_vocab()
{'t': 11,
 '##s': 18,
 'i': 4,
 'ru': 29,
 'do': 28,
 '##l': 21,
 '##n': 14,
 '##d': 24,
 'g': 3,
 '##g': 16,
 '##u': 13,
 '##ing': 26,
 're': 27,
 'p': 8,
 '##a': 23,
 'u': 12,
 'o': 7,
 '##o': 20,
 '##p': 22,
 '##t': 19,
 'r': 9,
 'd': 1,
 'l': 5,
 '##ng': 25,
 's': 10,
 '##i': 15,
 'e': 2,
 'a': 0,
 'n': 6,
 '##e': 17}

There are versions of the tokens with a ## prefix, which means that these tokens may only follow another token within a word. For example, ##ing is a common character sequence that ends a word: it never appears at the beginning of any word in the training corpus, so it is added to the vocabulary as a “following” token.
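
To see this split directly, the vocabulary can be filtered by the prefix. This is a small sketch built only on the get_vocab() output shown above:

vocab = tokenizer.get_vocab()
sorted(t for t in vocab if t.startswith("##"))
['##a', '##d', '##e', '##g', '##i', '##ing', '##l', '##n', '##ng', '##o', '##p', '##s', '##t', '##u']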

The following cell applies the tokenizer to two words. Note that encode treats the second argument as a pair sequence, so both words are tokenized in a single encoding.

tokenizer.encode("doing", "reading").tokens
['do', '##ing', 're', '##a', '##d', '##ing']

As a result:

  • doing: do, ##ing.

  • reading: re, ##a, ##d, ##ing.
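
Decoding has to undo this convention: without a decoder configured, decode() keeps the ## markers in its output. The sketch below attaches the WordPiece decoder from tokenizers.decoders, which strips a configurable prefix from continuation tokens. It is used here only because it understands the same ## convention, even though the model itself is BPE.

from tokenizers import decoders

# Strip the "##" prefix from continuation tokens when decoding
tokenizer.decoder = decoders.WordPiece(prefix="##")
tokenizer.decode(tokenizer.encode("doing").ids)
'doing'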