Tokenizer#

The transformers library provides a group of objects that handle tokenization. This page considers their properties.

import transformers
from random import randint
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased")

Vocabulary#

To get the vocabulary of the tokenizer, use the get_vocab method: it returns a dict that maps each available token to its id.


The following cell shows a small subset of the vocabulary of the tokenizer under consideration.

vocab: dict[str, int] = tokenizer.get_vocab()
dict(list(vocab.items())[:10])
{'##úl': 26994,
 'Michaels': 19108,
 'Sculpture': 19477,
 'notoriety': 26002,
 '##kov': 7498,
 '##grating': 21889,
 '##¹': 28173,
 'Manny': 17381,
 'towers': 8873,
 '##gles': 15657}
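
Since get_vocab returns a plain dict, it is easy to build the reverse mapping from ids back to tokens. The following cell is a small sketch based on the vocab dict obtained above.

id_to_token = {token_id: token for token, token_id in vocab.items()}
# The reverse mapping agrees with the subset shown above
id_to_token[15657]
'##gles'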

Transformations#

There is a set of tools available in the Transformers Python package for performing common transformations in the NLP domain:

  • tokenize: Converts a given str into a list of tokens.

  • convert_tokens_to_ids: Converts a list[str] into a list of integers, where each integer represents the index of the corresponding token.

  • convert_ids_to_tokens: Converts a list[int] of token indices into a list of strings, where each string represents a token.

  • decode: Takes a list[int] of token indices and reconstructs a full sentence from them.


The following cells demonstrate the usage of all the tools mentioned above.

Here is an example of decomposing text into tokens.

tokens_list = tokenizer.tokenize("typical tokinezation example")
tokens_list
['typical', 'to', '##kin', '##ez', '##ation', 'example']

The following cell shows that common words, which typically have their own dedicated tokens, map directly to the corresponding token indices.

ids_list = tokenizer.convert_tokens_to_ids(["hello", "is", "the"])
ids_list
[19082, 1110, 1103]

Now use convert_ids_to_tokens to convert the list of numbers into a list of strings, each string being a token.

tokenizer.convert_ids_to_tokens([15657, 1834, 7321])
['##gles', 'needed', 'thereafter']

And a list of token ids can be turned back into readable text using the decode method.

tokenizer.decode([15657, 1834, 7321])
'##gles needed thereafter'
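
The individual steps can be chained into a round trip from text to ids and back. The following cell is a sketch using the tokenizer loaded above; note that encode, unlike the manual pipeline, also inserts the model's special tokens (for BERT, [CLS] and [SEP]).

text = "typical tokinezation example"

# Manual pipeline: text -> tokens -> ids
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)

# encode() performs the same conversion but adds special tokens as well
ids_with_special = tokenizer.encode(text)

tokenizer.decode(ids), tokenizer.decode(ids_with_special)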

Special tokens#

Typically, tokenizers have pad_token, eos_token, bos_token and unk_token attributes that expose how the tokenizer in question represents its special tokens.


The following cell shows the special tokens for the tokenizer we loaded earlier:

tokenizer.pad_token, tokenizer.eos_token, tokenizer.bos_token, tokenizer.unk_token
('[PAD]', None, None, '[UNK]')
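
One place where special tokens matter is padding a batch of sequences to the same length. The following cell is a minimal sketch using the tokenizer loaded above: with padding=True the shorter sequence is filled with the id of pad_token.

batch = tokenizer(
    ["short text", "a somewhat longer piece of text"],
    padding=True,  # pad to the longest sequence in the batch
)

# The shorter sequence is extended with tokenizer.pad_token_id
tokenizer.pad_token_id, batch["input_ids"]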