Text CNN#
Text CNN is a method for applying a convolutional architecture to text tasks such as classification.
import numpy as np
from gensim import downloader
from gensim.models import KeyedVectors
import torch
from torch import nn
from tqdm import tqdm
from collections.abc import Collection
from sklearn.datasets import fetch_20newsgroups
Data#
As an example, we'll use the 20newsgroups dataset from scikit-learn. To keep the computation light, only two categories will be used.
categories = ["talk.politics.guns", "rec.motorcycles"]
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
So we are dealing with a set of texts like the one shown in the following cell.
print(test["data"][80][500:1000])
escape
from Cheshire County jail last win-
ter. Santaw, 32 is scheduled to be
sentenced next week. The rape last
fall came six months after Santaw
was released from prision, where
he spent 15 years for a rape he commit-
ted when he was 16. (AP)
[end of article]
Any reactions? Did he do enough time? What should his penalty
be?
BTW, Walpole is a town in Massachusetts. Of course, New
hampshire is close by.
J. Case Kim
kim39@husc.harvard.edu
The following cell implements a simple approach that will be used to transform the data into embeddings.
wv = downloader.load("word2vec-google-news-300")
def seq_to_emb(
    sentences: Collection[str],
    wv: KeyedVectors,
    tokens_num: int,
    pad_token: str = "</s>"
) -> np.ndarray:
    '''
    Convert a set of sentences into embeddings. Each run of characters
    separated by whitespace is treated as a separate token. Tokens that are
    missing from the vocabulary of `wv` are skipped.

    Parameters
    ----------
    sentences: Collection[str]
        A collection of sentences that require transformation into embeddings.
    wv: KeyedVectors
        The word vectors used to transform tokens into embeddings.
    tokens_num: int
        The number of tokens to take from each sample. Extra tokens will be
        dropped, and if there are not enough tokens, padding will be added.
    pad_token: str = "</s>"
        The token that will be used for padding if there are not enough tokens
        in a sample.

    Returns
    -------
    out: np.ndarray
        Of size (<samples>, <embedding size>, <tokens_num>).
    '''
    rv = []
    pad_vector = wv.get_vector(pad_token)
    for sentence in sentences:
        sentence_embeddings = []
        for one_token in sentence.split():
            # Take embeddings only for the first `tokens_num` known tokens
            if len(sentence_embeddings) >= tokens_num:
                break
            if wv.has_index_for(one_token):
                sentence_embeddings.append(wv.get_vector(one_token))
        # Pad on the right so every sample has exactly `tokens_num` columns;
        # this also covers samples that contain no known token at all
        sentence_embeddings += (
            [pad_vector] * (tokens_num - len(sentence_embeddings))
        )
        rv.append(np.stack(sentence_embeddings, axis=1))
    return np.stack(rv)
Each sentence is transformed into a set of embeddings, with each token having its own embedding. Note: tokens without a vector in the word2vec vocabulary are skipped, and only the specified number of initial embeddings is taken.
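To make the truncation and padding concrete without downloading the word2vec model, here is a minimal sketch on a toy vocabulary. The `toy_vectors` table and the `toy_seq_to_emb` helper are hypothetical and exist only for this illustration; they mimic the idea of the function above for a single sentence, not its exact implementation.

```python
import numpy as np

# Hypothetical 4-dimensional embedding table (real word2vec vectors have 300 dimensions).
toy_vectors = {
    "bikes": np.array([1.0, 0.0, 0.0, 0.0]),
    "are": np.array([0.0, 1.0, 0.0, 0.0]),
    "fun": np.array([0.0, 0.0, 1.0, 0.0]),
    "</s>": np.array([0.0, 0.0, 0.0, 1.0]),  # padding token
}

def toy_seq_to_emb(sentence, vectors, tokens_num, pad_token="</s>"):
    """Return an array of shape (embedding size, tokens_num) for one sentence."""
    # Keep known tokens only, then truncate to the first tokens_num of them.
    known = [vectors[t] for t in sentence.split() if t in vectors][:tokens_num]
    # Fill the remaining positions with the padding vector.
    padded = known + [vectors[pad_token]] * (tokens_num - len(known))
    # Stack along axis 1 so that rows are channels and columns are positions.
    return np.stack(padded, axis=1)

# "really" is not in the toy vocabulary, so it is skipped.
emb = toy_seq_to_emb("bikes are really fun", toy_vectors, tokens_num=5)
print(emb.shape)  # (4, 5)
```

The first three columns hold the vectors for "bikes", "are" and "fun"; the last two columns are copies of the padding vector.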
The following cell demonstrates building embeddings for a small subset of data.
ans = seq_to_emb(
    sentences=train["data"][:10],
    wv=wv,
    tokens_num=20
)
ans.shape
(10, 300, 20)
The result can be interpreted as the first 20 tokens from the first 10 samples, each with 300 channels representing the meaning of each token.
The following code transforms all the data using this tool.
tokens_num = 100
X_train = torch.tensor(seq_to_emb(
    sentences=train["data"],
    wv=wv,
    tokens_num=tokens_num
))
X_test = torch.tensor(seq_to_emb(
    sentences=test["data"],
    wv=wv,
    tokens_num=tokens_num
))
y_train = torch.tensor(train["target"], dtype=torch.float)
y_test = torch.tensor(test["target"], dtype=torch.float)
Model#
TextCNN passes the data through convolutional layers whose kernels aggregate tokens in neighbouring positions. The output of each convolution is then reduced along the token axis, so that every output channel yields a single value.
The next cell shows how convolution can be applied to the data.
convolution = nn.Conv1d(in_channels=300, out_channels=10, kernel_size=5)
conv_transformed = convolution(X_train)
conv_transformed.shape
torch.Size([1144, 10, 96])
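The last dimension shrinks from 100 to 96 because a kernel of size 5 fits into 100 positions only 96 times. The sketch below is a NumPy illustration, not the PyTorch implementation: `conv1d_output_length` and `naive_conv1d` are hypothetical helpers, and the all-ones input and weights are chosen only so that the result is easy to check by hand.

```python
import numpy as np

def conv1d_output_length(length_in, kernel_size, stride=1, padding=0, dilation=1):
    """Number of positions produced by a 1-D convolution."""
    return (length_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

def naive_conv1d(x, weight, bias):
    """Single-sample 1-D convolution with stride 1 and no padding.

    x: (in_channels, tokens); weight: (out_channels, in_channels, kernel_size);
    bias: (out_channels,).
    """
    out_channels, _, kernel_size = weight.shape
    positions = conv1d_output_length(x.shape[1], kernel_size)
    out = np.empty((out_channels, positions))
    for p in range(positions):
        # The kernel sees kernel_size neighbouring tokens across all channels.
        window = x[:, p:p + kernel_size]
        out[:, p] = (weight * window).sum(axis=(1, 2)) + bias
    return out

x = np.ones((300, 100))         # one sample: 300 channels, 100 tokens
weight = np.ones((10, 300, 5))  # 10 output channels, kernel size 5
bias = np.zeros(10)
out = naive_conv1d(x, weight, bias)
print(out.shape)  # (10, 96)
print(out[0, 0])  # 1500.0
```

Each output value sums 300 channels times 5 positions, which is why the all-ones input gives 1500.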
Next, pooling over the token axis collapses each output channel of the convolution to a single value. The code below uses average pooling (nn.AdaptiveAvgPool1d); replacing it with nn.AdaptiveMaxPool1d would give max pooling.
pooling = nn.AdaptiveAvgPool1d(output_size=1)
pooling(conv_transformed).shape
torch.Size([1144, 10, 1])
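To see what pooling with an output size of 1 does, and how average and max pooling differ, here is a small NumPy sketch. The `conv_out` values are made up for illustration; the `mean` and `max` reductions over the last axis are meant as analogues of `nn.AdaptiveAvgPool1d(1)` and `nn.AdaptiveMaxPool1d(1)`.

```python
import numpy as np

# One sample with 2 channels and 4 positions: (samples, channels, positions).
conv_out = np.array([[[1.0, 5.0, 2.0, 0.0],
                      [-1.0, -3.0, -2.0, -2.0]]])

# Average over positions, keeping the reduced axis with size 1.
avg_pooled = conv_out.mean(axis=-1, keepdims=True)
# Maximum over positions, keeping the reduced axis with size 1.
max_pooled = conv_out.max(axis=-1, keepdims=True)

print(avg_pooled.shape)                # (1, 2, 1)
print(avg_pooled[0, :, 0].tolist())    # [2.0, -2.0]
print(max_pooled[0, :, 0].tolist())    # [5.0, -1.0]
```

Average pooling summarises the whole sequence, while max pooling keeps only the strongest response of each channel, wherever it occurs.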
These ideas are implemented in the complete network defined in the next cell:
class TextCNN(nn.Module):
    def __init__(
        self,
        kernel_sizes: list[int],
        in_channels: int,
        out_channels: int
    ):
        super().__init__()
        # One convolution + pooling branch per kernel size
        self.conv_transforms = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(
                    in_channels=in_channels,
                    out_channels=out_channels,
                    kernel_size=ks
                ),
                nn.AdaptiveAvgPool1d(output_size=1),
                nn.Flatten()
            )
            for ks in kernel_sizes
        ])
        # Linear head on the concatenated branch outputs, one probability per sample
        self.head = nn.Sequential(
            nn.Linear(
                in_features=len(kernel_sizes)*out_channels,
                out_features=1
            ),
            nn.Flatten(start_dim=0),
            nn.Sigmoid()
        )

    def forward(self, X: torch.Tensor):
        return self.head(torch.cat([ct(X) for ct in self.conv_transforms], dim=1))
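As a sanity check on the layer sizes, the number of trainable parameters can be worked out by hand. The helper below is a hypothetical pure-Python sketch, assuming each convolution has `out_channels * in_channels * kernel_size` weights plus `out_channels` biases and that the head is a single linear layer with one output.

```python
def text_cnn_param_count(kernel_sizes, in_channels, out_channels):
    """Return (head input features, total trainable parameters)."""
    # Each convolution: one weight per (out channel, in channel, kernel position) plus biases.
    conv_params = sum(
        out_channels * in_channels * ks + out_channels for ks in kernel_sizes
    )
    # Each branch contributes out_channels pooled features to the head.
    head_in = len(kernel_sizes) * out_channels
    # Linear head with a single output: head_in weights and one bias.
    head_params = head_in + 1
    return head_in, conv_params + head_params

print(text_cnn_param_count([2, 3], 300, 10))  # (20, 15041)
```

With kernel sizes 2 and 3, 300 input channels and 10 output channels, almost all parameters sit in the convolutions; the head has only 21.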
Next is a really basic training loop that fits the model, using the whole training set as a single batch.
torch.manual_seed(10)
text_cnn = TextCNN([2,3], 300, 10)
optimizer = torch.optim.Adam(text_cnn.parameters(), lr=1e-3)
for i in tqdm(range(20)):
    optimizer.zero_grad()
    predict = text_cnn(X_train)
    loss_value = nn.functional.binary_cross_entropy(
        input=predict, target=y_train)
    loss_value.backward()
    optimizer.step()
100%|██████████| 20/20 [00:04<00:00, 4.86it/s]
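The loss minimised in the loop is binary cross-entropy between the predicted probabilities and the 0/1 targets. The following pure-Python sketch shows the formula with mean reduction; the function is a hypothetical illustration written for this page, not the PyTorch implementation, and it assumes every prediction lies strictly between 0 and 1.

```python
import math

def binary_cross_entropy(predictions, targets):
    """Mean of -(t * log(p) + (1 - t) * log(1 - p)) over all samples."""
    losses = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for p, t in zip(predictions, targets)
    ]
    return sum(losses) / len(losses)

# Confident and correct predictions give a small loss.
print(round(binary_cross_entropy([0.9, 0.2], [1.0, 0.0]), 4))  # 0.1643
# Predicting 0.5 everywhere gives log(2), the loss of an uninformed model.
print(round(binary_cross_entropy([0.5, 0.5], [1.0, 0.0]), 4))  # 0.6931
```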
Results analysis#
Accuracy#
The following cell shows the accuracy of the model on the test set.
text_cnn.eval()
with torch.no_grad():
    test_pred = text_cnn(X_test)
ans = (
    (test_pred > 0.5).to(dtype=torch.float) == y_test
).to(dtype=torch.float).mean()
print(f"accuracy - {ans}")
accuracy - 0.8635170459747314
At least it seems to work.
Visual benchmark#
At the beginning, two categories were randomly chosen from the entire dataset.
train["target_names"]
['rec.motorcycles', 'talk.politics.guns']
So, in the target used to train the model, 0 corresponds to rec.motorcycles and 1 corresponds to talk.politics.guns.
ChatGPT has written some text on these topics:
motorcycle_safety_gear = """
When it comes to riding a motorcycle, safety should always be a top priority.
While some riders may prefer the freedom of a light jacket or no gear at all,
it's important to remember that protective equipment can save your life in the
event of an accident. Full-face helmets, armored jackets, gloves, and boots are
essential pieces of gear that should be worn on every ride. They not only protect
you from injury but also increase your visibility to other road users. It's always
better to be safe than sorry, especially when you're on the road with much larger
vehicles.
"""
choosing_right_motorcycle = """
Choosing the right motorcycle can be a daunting task, especially for first-time
riders. Factors like engine size, style, and comfort should be considered before
making a decision. Cruiser motorcycles are popular for long-distance touring,
while sportbikes offer agility and speed. For beginners, a smaller engine size
is recommended to allow for more control and ease of learning. It's also important
to test ride a few bikes to see which one feels most comfortable and fits your
riding style.
"""
gun_control_debate = """
The debate around gun control has become increasingly polarized in recent years.
Advocates for stricter laws argue that the rise in gun violence can be
attributed to easy access to firearms, particularly assault weapons. They
point to statistics showing higher gun-related deaths in countries with more
relaxed gun laws. On the other hand, opponents of gun control emphasize the
importance of the Second Amendment and the right of citizens to protect
themselves. They argue that the solution lies not in restricting access to
firearms, but in addressing underlying issues like mental health, crime rates,
and personal responsibility. The challenge lies in finding common ground on
this deeply contentious issue.
"""
self_defense_gun_ownership = """
In many parts of the country, people own guns primarily for self-defense.
While some argue that carrying a firearm provides peace of mind and a sense
of security, others worry about the risks associated with increased gun
ownership. Statistics suggest that more guns in circulation can lead to more
accidents and gun-related deaths, particularly in homes with children. However,
gun rights advocates maintain that an armed citizenry is a deterrent against
crime and tyranny, and that responsible ownership is key to mitigating risks.
"""
Let’s pass these texts through the model and check the results.
texts = [
    motorcycle_safety_gear,
    choosing_right_motorcycle,
    gun_control_debate,
    self_defense_gun_ownership,
]
inp = torch.tensor(seq_to_emb(texts, wv=wv, tokens_num=100))
with torch.no_grad():
    ans = text_cnn(inp)
ans
tensor([0.4796, 0.4586, 0.5786, 0.5647])
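All four texts were classified correctly by the model: both motorcycle texts score below the 0.5 threshold and both gun texts score above it, although the margins are small.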
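The mapping from these probabilities to category names can be spelled out with the same 0.5 threshold used for the accuracy. The sketch below copies the four values from the output above and the class order from `train["target_names"]`; the variable names are chosen for this illustration only.

```python
target_names = ["rec.motorcycles", "talk.politics.guns"]
probabilities = [0.4796, 0.4586, 0.5786, 0.5647]

# A probability above 0.5 means class 1, otherwise class 0.
labels = [target_names[int(p > 0.5)] for p in probabilities]
# Distance from the decision threshold, as a rough measure of confidence.
margins = [round(abs(p - 0.5), 4) for p in probabilities]

for label, margin in zip(labels, margins):
    print(f"{label}: margin {margin}")
```

All margins are below 0.08, so the model separates these texts correctly but not confidently.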