Pre-Processing#

This notebook provides an overview of common techniques for processing text data.

import nltk
import pandas as pd

Tokenization#

Tokenization is the process of breaking a stream of textual data into words, terms, sentences, symbols, or some other meaningful elements called tokens. There are many ways to implement tokenization.

| Approach | Description | Examples / Tools | Pros | Cons | Used in / Applications |
|---|---|---|---|---|---|
| Whitespace & Rule-based | Split by spaces/punctuation with rules | NLTK word_tokenize, spaCy | Simple, fast, interpretable | Fails on non-space languages, brittle rules | Traditional NLP pipelines |
| Statistical (unsupervised) | Learns boundaries from co-occurrence statistics | Punkt (NLTK) | Data-driven, language-adaptive | Still word-level, OOV issues | Sentence segmentation |
| Subword: BPE | Iteratively merge frequent character/word pairs | SentencePiece (BPE), HuggingFace | Handles rare words, reduces vocab size | Fixed vocab, not morphologically aware | GPT-2, RoBERTa |
| Subword: WordPiece | Likelihood-based subword merges | HuggingFace, TensorFlow Tokenizer | Statistically grounded, efficient | Fixed vocab, similar issues as BPE | BERT, DistilBERT |
| Subword: Unigram LM | Probabilistic model over subwords | SentencePiece (Unigram mode) | Flexible, multiple segmentations possible | More complex training | XLNet, T5 |
| Byte-level (BPE variant) | Tokenizes directly at UTF-8 byte level | HuggingFace Tokenizers | Universal, works with emojis & all scripts | Tokens less human-readable | GPT-2, GPT-3, LLaMA |
| Character-level | Each character is a token | Custom preprocessing, Char-RNNs | No OOV, language-independent | Very long sequences, inefficient | Older RNNs, some Asian langs |
| Morpheme-based | Split into roots, prefixes, suffixes | Polyglot, Stanza | Captures true linguistic structure | Language-specific analyzers, slower | Finnish, Turkish, Korean NLP |
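
The Punkt model mentioned in the table can be tried directly through nltk.tokenize.sent_tokenize. The following cell is a small illustrative sketch of sentence segmentation on a made-up snippet; it assumes the Punkt model data has already been downloaded via nltk.download.

# Sentence segmentation with NLTK's Punkt-based sent_tokenize.
# Assumes the Punkt model data is available (downloaded via nltk.download).
text = "Dr. Smith arrived at 5 p.m. The meeting had already started!"
print(nltk.tokenize.sent_tokenize(text))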


We will explore differences among tokenizers using the sentence provided in the following cell.

inp = "I'm loving NLP—it's amazing!"

Simple whitespace tokenization splits the text on spaces.

inp.split()
["I'm", 'loving', "NLP—it's", 'amazing!']

Symbol (character-level) tokenization simply treats each symbol as a separate token:

print(list(inp))
['I', "'", 'm', ' ', 'l', 'o', 'v', 'i', 'n', 'g', ' ', 'N', 'L', 'P', '—', 'i', 't', "'", 's', ' ', 'a', 'm', 'a', 'z', 'i', 'n', 'g', '!']

The word tokenizer separates text into words. It differs from the whitespace tokenizer in that it assumes tokens can be separated by more than just whitespace. The following cell shows the transformation of the example sentence.

nltk.tokenize.word_tokenize(inp)
['I', "'m", 'loving', 'NLP—it', "'s", 'amazing', '!']
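
The subword and byte-level approaches from the table split text into units smaller than words. As a minimal sketch, the following cell runs the pretrained GPT-2 byte-level BPE tokenizer on the same sentence; it assumes the transformers package is installed and can download the pretrained "gpt2" tokenizer.

# Byte-level BPE tokenization with the pretrained GPT-2 vocabulary.
# Assumes the `transformers` package is installed.
from transformers import AutoTokenizer

bpe_tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(bpe_tokenizer.tokenize(inp))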

Stop words#

Many words in a text carry little meaning on their own. They are commonly called stop words and are usually discarded when processing text data.


The following cell shows the stop word list according to the nltk library.

print(nltk.corpus.stopwords.words("english"))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
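
As a small sketch, the following cell filters these stop words out of the word-tokenized example sentence using the tools shown above.

# Filter stop words out of the tokenized sentence.
stop_words = set(nltk.corpus.stopwords.words("english"))
tokens = nltk.tokenize.word_tokenize(inp)
print([token for token in tokens if token.lower() not in stop_words])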

Stemming#

Stemming is the process of reducing the words in a text to their base form (stem). It helps reduce the size of the vocabulary by removing not-so-important variations.


The following cell shows the transformation process for different forms of the word “invest”.

words = [
    "invest",
    "invests",
    "invested",
    "investing",
    "investment",
    "investments",
    "investor",
    "investors",
    "investiture",
    "investedness"
]

porter_stemmer = nltk.stem.PorterStemmer()
stems = [porter_stemmer.stem(word) for word in words]
for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
invest -> invest
invests -> invest
invested -> invest
investing -> invest
investment -> invest
investments -> invest
investor -> investor
investors -> investor
investiture -> investitur
investedness -> invested

Lemmatization#

There is a lot of confusion between the terms stemming and lemmatization. I haven't found an authoritative definition yet, but the following description matches the behaviour of the tools:

  • Stemming is a rule-based process that removes prefixes or suffixes from a word to reduce it to a base form. However, stemming does not guarantee that the resulting word will be a valid or meaningful word in the language. It may result in non-existent or strange forms.

  • Lemmatization, on the other hand, involves reducing a word to its lemma, or canonical form, based on its dictionary definition. The resulting word is always a valid word in the language, and lemmatization often takes into account the part of speech and other grammatical information.

Long story short: Stemming might produce words that don’t exist, while lemmatization guarantees a real word as the result.

These are usually treated as separate processes, but from my perspective lemmatization is a more precise version of stemming, so on this site they are treated together.


The next cell shows a set of words that stemming turns into non-existent forms, together with the lemmatization output for the same words.

porter_stemmer = nltk.stem.PorterStemmer()
lemmatizer = nltk.stem.WordNetLemmatizer()

words = [
    "geese",
    "happily",
    "generously",
    "studied"
]

pd.DataFrame({
    "input": words,
    "porter stemming": [porter_stemmer.stem(word) for word in words],
    "lemmatization": [lemmatizer.lemmatize(word) for word in words]
})
| | input | porter stemming | lemmatization |
|---|---|---|---|
| 0 | geese | gees | goose |
| 1 | happily | happili | happily |
| 2 | generously | gener | generously |
| 3 | studied | studi | studied |
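
The WordNet lemmatizer treats every word as a noun by default, which is why "studied" stays unchanged above. Passing the part of speech changes the result; a small sketch (with pos="v" the expected lemma is the verb "study"):

# Lemmatization with an explicit part of speech:
# "n" = noun (default), "v" = verb, "a" = adjective, "r" = adverb.
print(lemmatizer.lemmatize("studied", pos="v"))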

Embeddings#

To use tokenized text in any kind of algorithm, particularly in machine learning algorithms, you need to transform it into numerical data - typically a vector built according to some rules. Building embeddings is not exclusive to the NLP domain; here we consider how embeddings are applied specifically in NLP.

The following table summarizes the typical, widely known approaches:

| Approach | Description | Output Format | Pros | Cons |
|---|---|---|---|---|
| Bag of Words (BoW) | Counts word occurrences in a document. | Sparse vector | Simple, fast, interpretable | Ignores word order and semantics |
| TF-IDF | Weighted word counts based on term frequency and inverse document frequency. | Sparse vector | Reduces impact of common words, more informative than BoW | Still ignores word order and semantics |
| Word2Vec | Learns word embeddings using context (CBOW or Skip-gram). | Dense vector (per word) | Captures semantic similarity | Needs large corpus and training time |
| GloVe | Embeddings based on global word co-occurrence statistics. | Dense vector (per word) | Combines global and local information | Pretrained; less adaptable to specific domains |
| FastText | Extends Word2Vec using subword (character n-gram) information. | Dense vector (per word) | Handles out-of-vocabulary words well | Larger model size |
| BERT embeddings | Contextualized embeddings using Transformer encoder. | Dense vector (contextual) | Captures context, syntax, semantics | Computationally expensive |
| Sentence Transformers | Generates sentence-level embeddings using pretrained transformers. | Dense vector (sentence) | Effective for sentence similarity and retrieval | Heavier model; fine-tuning may be needed |

For a more detailed description, check out the Embeddings page.

For a more detailed explanation of the tf-idf approach, see tf-idf.
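
As a minimal sketch of the two sparse approaches from the table, the following cell builds Bag of Words and TF-IDF vectors for a few toy documents. It assumes scikit-learn is available; the library is not imported at the top of this notebook.

# Bag of Words and TF-IDF vectors with scikit-learn (assumed to be installed).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["I'm loving NLP", "NLP is amazing", "loving amazing things"]

bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())    # raw word counts per document
print(bow.get_feature_names_out())          # the learned vocabulary

tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray())  # counts reweighted by inverse document frequency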