Text CNN#

Text CNN is a method for applying convolutional architectures to text tasks such as classification.

import numpy as np

from gensim import downloader
from gensim.models import KeyedVectors

import torch
from torch import nn

from tqdm import tqdm
from collections.abc import Collection
from sklearn.datasets import fetch_20newsgroups

Data#

As an example, we'll use the 20newsgroups dataset from scikit-learn. To keep the computation light, only a few categories will be used.

categories = ["talk.politics.guns", "rec.motorcycles"]
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
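
To get a sense of the dataset size, the following cell counts the documents in each split (the train split here contains 1144 documents, as the tensor shapes later confirm).

# number of documents in each split
print(len(train["data"]), len(test["data"]))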

So we are dealing with texts like the one shown in the following cell.

print(test["data"][80][500:1000])
escape
from Cheshire County jail last win-
ter.  Santaw, 32 is scheduled to be
sentenced next week.  The rape last
fall came six months after Santaw
was released from prision, where
he spent 15 years for a rape he commit-
ted when he was 16.  (AP)


 
[end of article]

Any reactions?  Did he do enough time?  What should his penalty
be?  

BTW, Walpole is a town in Massachusetts.  Of course, New
hampshire is close by.
J. Case Kim
kim39@husc.harvard.edu

The following cell implements a simple approach that will be used to transform the data into embeddings.

wv = downloader.load("word2vec-google-news-300")
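
As a quick sanity check, the following cell looks up the vector of a single word; every embedding in this model has 300 dimensions (the word "motorcycle" is just an illustrative choice).

# each word2vec-google-news-300 embedding is a 300-dimensional vector
print(wv.get_vector("motorcycle").shape)  # (300,)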

def seq_to_emb(
    sentences: Collection[str], 
    wv: KeyedVectors, 
    tokens_num: int, 
    pad_token: str = "</s>"
) -> np.ndarray:
    '''
    Convert a collection of sentences into embeddings. Each whitespace-separated
    group of characters is treated as a separate token.

    Parameters
    ----------
    sentences: Collection[str]
        A collection of sentences that require transformation into embeddings.
    wv: KeyedVectors
        This needs to be used to transform tokens into embeddings.
    tokens_num: int
        The number of tokens to take from each sample. Extra tokens will be 
        dropped, and if there are not enough tokens, padding will be added.
    pad_token: str = "</s.>"
        The token that will be used for padding if there are not enough tokens 
        in a sample.

    Returns
    -------
    out: np.ndarray
        Of size (<samples>, <embedding size>, <tokens_num>).
    '''

    rv = []
    pad_vector = wv.get_vector(pad_token)

    for sentence in sentences:

        sentence_embeddings = []
        got_emb = 0
        for one_token in sentence.split():
            if wv.has_index_for(one_token):
                sentence_embeddings.append(wv.get_vector(one_token))
                got_emb += 1
                # take embeddings only for the first tokens_num recognized words
                if got_emb >= tokens_num: break

        sentence_embeddings = np.stack(sentence_embeddings, axis=1)
        pad_array = np.tile(
            pad_vector[:, None], 
            reps=(1, tokens_num - sentence_embeddings.shape[1])
        )
        sentence_embeddings = np.hstack([sentence_embeddings, pad_array])
        rv.append(sentence_embeddings)

    return np.stack(rv)

Each sentence is transformed into a set of embeddings, one per token. Note that only the specified number of initial tokens is kept, and tokens missing from the word2vec vocabulary are skipped.

The following cell demonstrates building embeddings for a small subset of data.

ans = seq_to_emb(
    sentences=train["data"][:10],
    wv=wv,
    tokens_num=20
)
ans.shape
(10, 300, 20)

The result can be interpreted as the first 20 tokens from the first 10 samples, each with 300 channels representing the meaning of each token.

The following code transforms all the data using this tool.

tokens_num = 100

X_train = torch.tensor(seq_to_emb(
    sentences=train["data"],
    wv=wv,
    tokens_num=tokens_num
))
X_test = torch.tensor(seq_to_emb(
    sentences=test["data"],
    wv=wv,
    tokens_num=tokens_num
))

y_train = torch.tensor(train["target"], dtype=torch.float)
y_test = torch.tensor(test["target"], dtype=torch.float)
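
The resulting input tensors follow the (samples, embedding size, tokens) layout expected by Conv1d; their shapes can be checked as follows (the train split contains 1144 documents, which also shows up in the convolution output below).

# (samples, embedding size, tokens)
print(X_train.shape)  # torch.Size([1144, 300, 100])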

Model#

TextCNN passes the data through convolutional layers whose kernels aggregate tokens at nearby positions. A pooling operation then collapses the result along the token dimension.

The next cell shows how convolution can be applied to the data.

convolution = nn.Conv1d(in_channels=300, out_channels=10, kernel_size=5)
conv_transformed = convolution(X_train)
conv_transformed.shape
torch.Size([1144, 10, 96])

The token dimension shrinks from 100 to 96 because a kernel of size 5 fits into a sequence of 100 tokens in 100 - 5 + 1 = 96 positions. To collapse this dimension, we'll use adaptive average pooling, which reduces each output channel of the convolution to a single value, as shown in the following code.

pooling = nn.AdaptiveAvgPool1d(output_size=1)
pooling(conv_transformed).shape
torch.Size([1144, 10, 1])
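
Note that the original Text CNN paper uses max-over-time pooling rather than average pooling; a drop-in alternative with the same output shape would be the following sketch.

# max-over-time pooling: keeps the strongest response of each channel
max_pooling = nn.AdaptiveMaxPool1d(output_size=1)
max_pooling(conv_transformed).shape  # torch.Size([1144, 10, 1])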

These ideas are implemented in the complete network defined in the next cell:

class TextCNN(nn.Module):
    def __init__(
        self, 
        kernel_sizes: list[int], 
        in_channels: int, 
        out_channels: int
    ):
        
        super().__init__()

        self.conv_transforms = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(
                    in_channels=in_channels,
                    out_channels=out_channels,
                    kernel_size=ks
                ),
                nn.AdaptiveAvgPool1d(output_size=1),
                nn.Flatten()
            )
            for ks in kernel_sizes
        ])

        self.head = nn.Sequential(
            nn.Linear(
                in_features=len(kernel_sizes)*out_channels,
                out_features=1
            ),
            nn.Flatten(start_dim=0),
            nn.Sigmoid()
        )
    
    def forward(self, X: torch.Tensor):
        return self.head(torch.cat([ct(X) for ct in self.conv_transforms], dim=1))
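
As a quick smoke test, an untrained instance can be applied to a few samples; the head flattens the output to one probability per sample (the slice of 4 samples is an arbitrary choice).

# one sigmoid probability per input sample
TextCNN([2, 3], 300, 10)(X_train[:4]).shape  # torch.Size([4])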

Now, here's a really basic training loop to fit such a model.

torch.manual_seed(10)
text_cnn = TextCNN([2,3], 300, 10)
optimizer = torch.optim.Adam(text_cnn.parameters(), lr=1e-3)
for i in tqdm(range(20)):
    optimizer.zero_grad()
    predict = text_cnn(X_train)
    loss_value = nn.functional.binary_cross_entropy(
        input=predict, target=y_train)
    loss_value.backward()
    optimizer.step()
100%|██████████| 20/20 [00:04<00:00,  4.86it/s]
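
This loop performs full-batch gradient descent: every step uses the entire training set at once. For larger corpora, one would typically iterate over mini-batches instead; a minimal sketch, assuming an arbitrary batch size of 64:

from torch.utils.data import DataLoader, TensorDataset

# reuses the text_cnn and optimizer defined above
loader = DataLoader(TensorDataset(X_train, y_train), batch_size=64, shuffle=True)
for epoch in range(20):
    for X_batch, y_batch in loader:
        optimizer.zero_grad()
        loss = nn.functional.binary_cross_entropy(text_cnn(X_batch), y_batch)
        loss.backward()
        optimizer.step()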

Results analysis#

Accuracy#

The following cell shows the accuracy of the model on the test set.

text_cnn.eval()

with torch.no_grad():
    test_pred = text_cnn(X_test)

ans = (
    (test_pred > 0.5).to(dtype=torch.float) == y_test
).to(dtype=torch.float).mean()

print(f"accuracy - {ans}")
accuracy - 0.8635170459747314

At least it seems to work.

Visual benchmark#

At the beginning, two categories were picked from the entire dataset.

train["target_names"]
['rec.motorcycles', 'talk.politics.guns']

So, in the target used to train the model, 0 corresponds to rec.motorcycles and 1 corresponds to talk.politics.guns.

ChatGPT generated some texts on these topics:

motorcycle_safety_gear = """
When it comes to riding a motorcycle, safety should always be a top priority. 
While some riders may prefer the freedom of a light jacket or no gear at all, 
it's important to remember that protective equipment can save your life in the 
event of an accident. Full-face helmets, armored jackets, gloves, and boots are 
essential pieces of gear that should be worn on every ride. They not only protect 
you from injury but also increase your visibility to other road users. It's always 
better to be safe than sorry, especially when you're on the road with much larger 
vehicles.
"""

choosing_right_motorcycle = """
Choosing the right motorcycle can be a daunting task, especially for first-time 
riders. Factors like engine size, style, and comfort should be considered before 
making a decision. Cruiser motorcycles are popular for long-distance touring, 
while sportbikes offer agility and speed. For beginners, a smaller engine size 
is recommended to allow for more control and ease of learning. It's also important 
to test ride a few bikes to see which one feels most comfortable and fits your 
riding style.
"""

gun_control_debate = """
The debate around gun control has become increasingly polarized in recent years. 
Advocates for stricter laws argue that the rise in gun violence can be 
attributed to easy access to firearms, particularly assault weapons. They 
point to statistics showing higher gun-related deaths in countries with more 
relaxed gun laws. On the other hand, opponents of gun control emphasize the 
importance of the Second Amendment and the right of citizens to protect 
themselves. They argue that the solution lies not in restricting access to 
firearms, but in addressing underlying issues like mental health, crime rates, 
and personal responsibility. The challenge lies in finding common ground on 
this deeply contentious issue.
"""

self_defense_gun_ownership = """
In many parts of the country, people own guns primarily for self-defense. 
While some argue that carrying a firearm provides peace of mind and a sense 
of security, others worry about the risks associated with increased gun 
ownership. Statistics suggest that more guns in circulation can lead to more 
accidents and gun-related deaths, particularly in homes with children. However, 
gun rights advocates maintain that an armed citizenry is a deterrent against 
crime and tyranny, and that responsible ownership is key to mitigating risks.
"""

Let’s pass these texts through the model and check the results.

texts = [
    motorcycle_safety_gear, 
    choosing_right_motorcycle,
    gun_control_debate, 
    self_defense_gun_ownership, 
]

inp = torch.tensor(seq_to_emb(texts, wv=wv, tokens_num=100))
with torch.no_grad():
    ans = text_cnn(inp)
ans
tensor([0.4796, 0.4586, 0.5786, 0.5647])

These texts were classified correctly by the model: the first two scores fall below the 0.5 threshold (rec.motorcycles), and the last two fall above it (talk.politics.guns).
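
To make the mapping explicit, the probabilities can be converted back to category names (a small convenience snippet, not part of the pipeline above).

# 0 -> rec.motorcycles, 1 -> talk.politics.guns
for prob in ans:
    print(train["target_names"][int(prob.item() > 0.5)])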