Tokenizer#
The transformers library provides a group of objects that perform tokenization. This page considers their properties.
import transformers
from random import randint
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased")
Vocabulary#
To get the vocabulary of the tokenizer, use the get_vocab
method: it returns a dict that maps each available token to its ID.
The following cell shows a small subset of the vocabulary of the tokenizer under consideration.
vocab: dict[str, int] = tokenizer.get_vocab()
dict(list(vocab.items())[:10])
{'##úl': 26994,
'Michaels': 19108,
'Sculpture': 19477,
'notoriety': 26002,
'##kov': 7498,
'##grating': 21889,
'##¹': 28173,
'Manny': 17381,
'towers': 8873,
'##gles': 15657}
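Since get_vocab returns an ordinary dictionary, it can also be inverted to look tokens up by their IDs. A minimal sketch (the exact vocabulary size depends on the checkpoint):
# number of entries in the vocabulary of this checkpoint
print(len(vocab))

# invert the mapping to resolve IDs back to tokens
id_to_token = {token_id: token for token, token_id in vocab.items()}
print(id_to_token[8873])  # expected to be 'towers', as in the subset above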
Transformations#
There is a set of tools available in the Transformers Python package for performing common transformations in the NLP domain:
tokenize
: Converts a given str into a list of tokens.
convert_tokens_to_ids
: Converts a list[str] into a list of integers, where each integer represents the index of the corresponding token.
convert_ids_to_tokens
: Converts a list[int] of token indices into a list of strings, where each string represents a token.
decode
: Takes a list[int] of token indices and reconstructs a full sentence from them.
The following cells demonstrate the usage of all the tools mentioned.
Here is an example of decomposing text into tokens.
tokens_list = tokenizer.tokenize("typical tokinezation example")
tokens_list
['typical', 'to', '##kin', '##ez', '##ation', 'example']
The following cell shows that common words, which typically have dedicated tokens of their own, map directly to the indices of those tokens.
ids_list = tokenizer.convert_tokens_to_ids(["hello", "is", "the"])
ids_list
[19082, 1110, 1103]
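A string that is not present in the vocabulary falls back to the ID of the unknown token. A small sketch (the made-up word is an illustrative assumption):
# the second entry is not a vocabulary token, so it maps to the [UNK] id
tokenizer.convert_tokens_to_ids(["hello", "definitelynotatoken"])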
Now use convert_ids_to_tokens
to convert the list of numbers into a list of strings, each string being a token.
tokenizer.convert_ids_to_tokens([15657, 1834, 7321])
['##gles', 'needed', 'thereafter']
And a list of token indices can be transformed back into plain text using the decode
method.
tokenizer.decode([15657, 1834, 7321])
'##gles needed thereafter'
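Putting the tools together, a round trip from text to IDs and back looks roughly like this (a sketch; the intermediate IDs depend on the checkpoint):
text = "a short round trip example"
ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
# decode merges sub-word pieces back together, restoring the original text
tokenizer.decode(ids)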
Special tokens#
Typically, tokenizers have pad_token, eos_token, bos_token and unk_token fields that expose how the tokenizer in question handles special tokens.
The following cell shows the special tokens for the tokenizer we loaded earlier:
tokenizer.pad_token, tokenizer.eos_token, tokenizer.bos_token, tokenizer.unk_token
('[PAD]', None, None, '[UNK]')
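Because this tokenizer defines a pad_token but no eos_token or bos_token, padding is the main place where its special tokens show up. A minimal sketch of padding a batch (the example sentences are arbitrary):
# the shorter sequence is filled with the [PAD] token id so that
# both rows of input_ids have the same length
batch = tokenizer(["hello", "a noticeably longer example sentence"], padding=True)
batch["input_ids"]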