Tokenizer#

The transformers library provides a group of objects that handle tokenization. This page considers their properties.

import transformers
from random import randint
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased")

Vocabulary#

To get the vocabulary of the tokenizer, use the get_vocab method: it returns a dict that maps each available token to its id.


The following cell shows a small subset of the vocabulary of the tokenizer under consideration.

vocab: dict[str, int] = tokenizer.get_vocab()
dict(list(vocab.items())[:10])
{'##úl': 26994,
 'Michaels': 19108,
 'Sculpture': 19477,
 'notoriety': 26002,
 '##kov': 7498,
 '##grating': 21889,
 '##¹': 28173,
 'Manny': 17381,
 'towers': 8873,
 '##gles': 15657}
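
Since get_vocab returns a plain dict, it is easy to build the reverse mapping from ids back to tokens. The following cell is a small sketch based on the vocab dict obtained above.

id_to_token = {token_id: token for token, token_id in vocab.items()}
# The reverse mapping agrees with the subset shown above
id_to_token[15657]
'##gles'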

Transformations#

There is a set of tools available in the Transformers Python package for performing common transformations in the NLP domain:

  • tokenize: Converts a given str into a list of tokens.

  • convert_tokens_to_ids: Converts a list[str] into a list of integers, where each integer represents the index of the corresponding token.

  • convert_ids_to_tokens: Converts a list[int] of token indices into a list of strings, where each string represents a token.

  • decode: Takes a list[int] of token indices and reconstructs a full sentence from them.


The following cells demonstrate the usage of all the tools mentioned above.

Here is an example of decomposing text into tokens.

tokens_list = tokenizer.tokenize("typical tokinezation example")
tokens_list
['typical', 'to', '##kin', '##ez', '##ation', 'example']

The following cell shows that common words, which typically have their own dedicated tokens, map directly to the corresponding token indices.

ids_list = tokenizer.convert_tokens_to_ids(["hello", "is", "the"])
ids_list
[19082, 1110, 1103]

Now use convert_ids_to_tokens to convert the list of numbers into a list of strings, each string being a token.

tokenizer.convert_ids_to_tokens([15657, 1834, 7321])
['##gles', 'needed', 'thereafter']

And a list of token ids can be turned back into readable text using the decode method.

tokenizer.decode([15657, 1834, 7321])
'##gles needed thereafter'
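
The individual steps can be chained into a round trip from text to ids and back. The following cell is a sketch using the tokenizer loaded above; note that encode, unlike the manual pipeline, also inserts the model's special tokens (for BERT, [CLS] and [SEP]).

text = "typical tokinezation example"

# Manual pipeline: text -> tokens -> ids
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)

# encode() performs the same conversion but adds special tokens as well
ids_with_special = tokenizer.encode(text)

tokenizer.decode(ids), tokenizer.decode(ids_with_special)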

Special tokens#

Typically, tokenizers have pad_token, eos_token, bos_token and unk_token attributes that expose how the tokenizer in question represents its special tokens.


The following cell shows the special tokens for the tokenizer we loaded earlier:

tokenizer.pad_token, tokenizer.eos_token, tokenizer.bos_token, tokenizer.unk_token
('[PAD]', None, None, '[UNK]')
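
One place where special tokens matter is padding a batch of sequences to the same length. The following cell is a minimal sketch using the tokenizer loaded above: with padding=True the shorter sequence is filled with the id of pad_token.

batch = tokenizer(
    ["short text", "a somewhat longer piece of text"],
    padding=True,  # pad to the longest sequence in the batch
)

# The shorter sequence is extended with tokenizer.pad_token_id
tokenizer.pad_token_id, batch["input_ids"]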