Transformers#

transformers is a Python library that gives you access to pre-trained machine learning models built on the transformer architecture. The following cell imports the packages used throughout this page:

import transformers
import numpy as np

Pipeline#

With pipeline you can easily load and use any model you like. Find out more:


The following cell shows a very simple example of loading a pipeline for text classification.

pipeline = transformers.pipeline(
    task="text-classification",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english"
)
print(
    pipeline("Wow this is awesome"),
    pipeline("The fish had really strange taste"),
    sep="\n"
)
Device set to use cpu
[{'label': 'POSITIVE', 'score': 0.9998487234115601}]
[{'label': 'NEGATIVE', 'score': 0.8964248299598694}]
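
A pipeline can also handle several inputs at once: passing a list of strings returns a list of result dicts. A minimal sketch reusing the pipeline object defined above:

# passing a list of texts returns one result dict per text
results = pipeline([
    "Wow this is awesome",
    "The fish had really strange taste"
])
for res in results:
    print(res["label"], round(res["score"], 4))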

In some cases, you may need to explicitly specify the task, model and tokenizer arguments. The following cell shows this approach for accessing the question-answering mode of the distilbert-base-cased-distilled-squad model.

qa_pipeline = transformers.pipeline(
    task="question-answering",
    model="distilbert-base-cased-distilled-squad", 
    tokenizer="distilbert-base-cased-distilled-squad"
)

qa_pipeline(
    question="What is the capital of France?",
    context="France's capital is Paris."
)
Device set to use cpu
{'score': 0.9831558465957642, 'start': 20, 'end': 25, 'answer': 'Paris'}
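
The start and end fields in the output are character offsets into the context string, so the answer can be recovered by ordinary slicing. A small check based on the values returned above:

context = "France's capital is Paris."
answer = qa_pipeline(
    question="What is the capital of France?",
    context=context
)
# start and end are character positions inside the context
print(context[answer["start"]:answer["end"]])  # expected to print "Paris"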

Tokenizer#

You can operate with the tokenizer separately by using the corresponding objects. Check out more features on the corresponding page.


The following cell loads the bert-base-cased tokenizer.

tokenizer = transformers.AutoTokenizer.from_pretrained('bert-base-cased')
tokenizer
BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

Here is an example of how the “Hello world!” phrase can be tokenized.

tokenizer.encode_plus(
    'Hello world!', 
    add_special_tokens=True, 
    return_token_type_ids=False, 
    return_tensors='pt'
)
{'input_ids': tensor([[ 101, 8667, 1362,  106,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
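
To see which token each id corresponds to, the ids can be mapped back with convert_ids_to_tokens, or turned back into a string with decode. A short illustration based on the ids produced above:

ids = [101, 8667, 1362, 106, 102]
# map every id back to its token string (including the special tokens)
print(tokenizer.convert_ids_to_tokens(ids))
# reassemble the text together with the special tokens
print(tokenizer.decode(ids))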

Model#

The model can be loaded separately, just like the tokenizer.


The following cell loads the BERT model.

model = transformers.BertModel.from_pretrained('bert-base-cased')
model
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
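
The printed structure is a fairly large network; its size can be estimated by simply counting the parameters. A quick sketch:

# total number of parameters in the loaded BERT model
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,}")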

The next cell shows how to use the model. The input must first be tokenized before it can be passed to the model.

tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-cased")
tokenized = tokenizer.encode_plus(
    'Hello!', 
    add_special_tokens=True, 
    return_token_type_ids=False, 
    return_tensors='pt'
)
out = model(**tokenized)
out[0], out[1][:, :50]
(tensor([[[ 0.6283,  0.2166,  0.5605,  ...,  0.0136,  0.6158, -0.1712],
          [ 0.6108, -0.2253,  0.9263,  ..., -0.3028,  0.4500, -0.0714],
          [ 0.8040,  0.1809,  0.7076,  ..., -0.0685,  0.4837, -0.0774],
          [ 1.3290,  0.2360,  0.4567,  ...,  0.1509,  0.9621, -0.4841]]],
        grad_fn=<NativeLayerNormBackward0>),
 tensor([[-0.7105,  0.4876,  0.9999, -0.9947,  0.9599,  0.9521,  0.9767, -0.9946,
          -0.9815, -0.6238,  0.9776,  0.9984, -0.9989, -0.9998,  0.8559, -0.9755,
           0.9895, -0.5281, -1.0000, -0.7414, -0.7056, -0.9999,  0.2901,  0.9786,
           0.9729,  0.0734,  0.9828,  1.0000,  0.8981, -0.1109,  0.2780, -0.9920,
           0.8693, -0.9985,  0.1461,  0.2067,  0.8092, -0.2430,  0.8580, -0.9585,
          -0.8130, -0.6138,  0.7961, -0.5727,  0.9737,  0.2362, -0.1194, -0.0789,
           0.0031,  0.9997]], grad_fn=<SliceBackward0>))
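
Indexing the output by position works, but the same tensors are also available as named attributes, which is usually more readable. A small sketch using the output obtained above:

# out[0] is the per-token hidden states, out[1] is the pooled [CLS] representation
print(out.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
print(out.pooler_output.shape)      # (batch, hidden_size)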

Apply to dataset#

This section shows how the BERT model can be applied to a dataset.

Dataset#

We will be working with the datasets library, which is usually used in conjunction with the transformers library. In the following cell we load the imdb dataset, which contains movie reviews. We use a small subset of the dataset to reduce the amount of computation.

from datasets import load_dataset
dataset = load_dataset("imdb", split="train")

np.random.seed(100)
idx = np.random.randint(len(dataset), size=200)
dataset = dataset.select(idx)

dataset[0]
{'text': "The Film must have been shot in a day,there are scenes where you can see the camera reflections and its red pointer,even the scenery's green light that blends with the actors!!!The plot and the lines are really awful without even the slightest inspiration(At least as a thriller genre movie).Everything that got to do with Poe in the movie,has a shallow and childish approach.The film is full of clise and no thrilling.If you want to watch a funny b-movie for a relaxing evening with friends then go for it you will enjoy it (As I Did) but there's no way to take this film seriously!",
 'label': 0}
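
Before tokenization it is worth checking how the labels are distributed in the sampled subset; since the rows were picked at random, both classes should appear. One possible check (np.bincount is just one way to count the two classes):

# count how many negative (0) and positive (1) reviews ended up in the subset
print(np.bincount(dataset["label"]))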

The following cell applies tokenization to each element of the dataset and sets the torch format for the columns the model needs:

def tokenization(example):
    return tokenizer.batch_encode_plus(
        example['text'],
        add_special_tokens=True, 
        return_token_type_ids=False, 
        truncation=True
    )
dataset = dataset.map(
    tokenization, batched=True
)
dataset.set_format(
    type="torch", 
    columns=["input_ids", "attention_mask"]
)

After that, each object has the extra keys input_ids and attention_mask, which are required by the BERT model.

display(HTML("<b>Keys:</b>"))
print(list(dataset[0].keys()))
display(HTML("<br><b>Tokens:</b>"))
print(dataset[0]['input_ids'])
display(HTML("<br><b>Mask:</b>"))
print(dataset[0]['attention_mask'])
Keys:
['input_ids', 'attention_mask']

Tokens:
tensor([  101,  1109,  2352,  1538,  1138,  1151,  2046,  1107,   170,  1285,
          117,  1175,  1132,  4429,  1187,  1128,  1169,  1267,  1103,  4504,
        26906,  1105,  1157,  1894,  1553,  1200,   117,  1256,  1103, 19335,
          112,   188,  2448,  1609,  1115, 13390,  1116,  1114,  1103,  5681,
          106,   106,   106,  1109,  4928,  1105,  1103,  2442,  1132,  1541,
         9684,  1443,  1256,  1103, 16960,  7670,   113,  1335,  1655,  1112,
          170, 11826,  6453,  2523,   114,   119,  5268,  1115,  1400,  1106,
         1202,  1114, 21377,  1107,  1103,  2523,   117,  1144,   170,  8327,
         1105,  2027,  2944,  3136,   119,  1109,  1273,  1110,  1554,  1104,
          172,  6137,  1162,  1105,  1185, 21401,  1158,   119,  1409,  1128,
         1328,  1106,  2824,   170,  6276,   171,   118,  2523,  1111,   170,
        22187,  3440,  1114,  2053,  1173,  1301,  1111,  1122,  1128,  1209,
         5548,  1122,   113,  1249,   146,  2966,   114,  1133,  1175,   112,
          188,  1185,  1236,  1106,  1321,  1142,  1273,  5536,   106,   102])

Mask:
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Loader#

Preparing the classical torch DataLoader:

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
loader = DataLoader(
    dataset, 
    batch_size=32, 
    collate_fn=data_collator, 
    pin_memory=True,
    shuffle=False
)

Now let’s look at what the first element of the loader will be: it’s a dict with everything needed to fit or run a forward pass through the BERT model:

item = next(iter(loader))
item.keys()
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
dict_keys(['input_ids', 'attention_mask'])
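
It is also useful to check the shapes: DataCollatorWithPadding pads every sequence in the batch to the length of the longest one, so input_ids and attention_mask come out as rectangular tensors of the same size. A quick look at the item obtained above:

# both tensors share the (batch_size, longest_sequence_in_batch) shape
print(item["input_ids"].shape)
print(item["attention_mask"].shape)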

Now you can pass item to the model to get predictions, but you need reasonably powerful hardware to run this:

model.eval()
model(**item)["pooler_output"]

Finally, we need to run the model on all the batches. In the following cell we extract embeddings for the whole imdb subset.

Note:

  • This cell should be run on powerful hardware;

  • Wrapping the function in torch.inference_mode is crucial; without it the model takes too much memory.

import torch
from tqdm import tqdm

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)

@torch.inference_mode()
def get_embeddings(model, loader, device):
    model.to(device)
    model.eval()
    
    total_embeddings = []
    
    for batch in tqdm(loader):

        batch = {
            key: batch[key].to(device) 
            for key in ['attention_mask', 'input_ids']
        }
        
        # take the hidden state of the [CLS] token as the sequence embedding
        embeddings = model(**batch)['last_hidden_state'][:, 0, :]
        total_embeddings.append(embeddings)

    return torch.cat(total_embeddings, dim=0)

embeddings = get_embeddings(model, loader, device)
cpu
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [01:37<00:00, 13.89s/it]

So for each row of the dataset we got an embedding from the model. Its representation and shape are shown below:

display(HTML("<b>Embeddings:</b>"))
display(embeddings)
display(HTML("<br><b>Shape:</b>"))
display(embeddings.shape)
Embeddings:
tensor([[ 0.6028,  0.1125, -0.2223,  ..., -0.1698,  0.1655,  0.0645],
        [ 0.6183,  0.0239, -0.2428,  ..., -0.1532,  0.1792,  0.1111],
        [ 0.3977,  0.1045, -0.1627,  ..., -0.0737,  0.2369,  0.1243],
        ...,
        [ 0.5241,  0.1857, -0.4136,  ..., -0.2503,  0.0959,  0.2350],
        [ 0.6121,  0.0789, -0.1658,  ..., -0.1467,  0.4310,  0.1626],
        [ 0.6343,  0.0062, -0.1479,  ..., -0.2116,  0.1337,  0.1664]])

Shape:
torch.Size([200, 768])
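
The embeddings can now be used as features for any downstream model. Below is a minimal sketch that fits a scikit-learn logistic regression on them; it assumes that with_format(None) restores access to the label column, which set_format above hid from indexing.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# recover the labels hidden by set_format
labels = dataset.with_format(None)["label"]
X = embeddings.cpu().numpy()

# simple linear classifier on top of the BERT [CLS] embeddings
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, labels, cv=5).mean())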