Tokenizers#
This page discusses the details of working with the tokenizers package.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
Continuing subword prefix#
continuing_subword_prefix is a parameter that defines the prefix added to tokens when a word is split into multiple subword tokens. It marks tokens that follow the initial token of a word.
The following cell builds the tokenizer. Note that the trainer is created with continuing_subword_prefix="##".
# Train a tiny BPE vocabulary; subword continuations get the "##" prefix
trainer = BpeTrainer(continuing_subword_prefix="##", vocab_size=30)
tokenizer = Tokenizer(BPE())
tokenizer.train_from_iterator(
    ["doing", "sleeping", "resting", "reading", "running"],
    trainer=trainer
)
The next cell shows the resulting vocabulary of the tokenizer.
tokenizer.get_vocab()
{'t': 11,
'##s': 18,
'i': 4,
'ru': 29,
'do': 28,
'##l': 21,
'##n': 14,
'##d': 24,
'g': 3,
'##g': 16,
'##u': 13,
'##ing': 26,
're': 27,
'p': 8,
'##a': 23,
'u': 12,
'o': 7,
'##o': 20,
'##p': 22,
'##t': 19,
'r': 9,
'd': 1,
'l': 5,
'##ng': 25,
's': 10,
'##i': 15,
'e': 2,
'a': 0,
'n': 6,
'##e': 17}
There are versions of the tokens with a ## prefix, which means these tokens are supposed to follow another token. For example, the ##ing token is a common sequence of symbols that marks the end of a word. It never appears at the beginning of any word in the training corpus, so it has to be added as a "following" token.
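As a quick check, the continuation tokens can be filtered straight out of the vocabulary. This is a minimal sketch that only assumes the tokenizer trained above:
# Collect only the tokens marked as word continuations
sorted(token for token in tokenizer.get_vocab() if token.startswith("##"))
Every entry it returns is a token that can only occur after another token inside the same word.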
The following cell applies the tokenizer to the words "doing" and "reading".
tokenizer.encode("doing", "reading").tokens
['do', '##ing', 're', '##a', '##d', '##ing']
As a result:

- doing: do, ##ing
- reading: re, ##a, ##d, ##ing
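To get readable text back from such a sequence, the ## prefix has to be merged away during decoding. One way to do that (a sketch, not part of the original example) is to attach a decoders.WordPiece decoder, which folds prefixed tokens into the preceding one:
from tokenizers import decoders

# Glue "##"-prefixed tokens onto the preceding token when decoding
tokenizer.decoder = decoders.WordPiece(prefix="##")
tokenizer.decode(tokenizer.encode("doing", "reading").ids)
With this decoder attached, the sequence above should be reassembled into "doing reading".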