Pre-Processing#
This notebook provides an overview of common techniques for processing text data.
import nltk
import pandas as pd
Tokenization#
Tokenization is the process of breaking a stream of textual data into words, terms, sentences, symbols, or some other meaningful elements called tokens. There are many ways to implement tokenization.
| Approach | Description | Examples / Tools | Pros | Cons | Used in / Applications |
|---|---|---|---|---|---|
| Whitespace & rule-based | Split by spaces/punctuation with rules | NLTK | Simple, fast, interpretable | Fails on non-space languages, brittle rules | Traditional NLP pipelines |
| Statistical (unsupervised) | Learns boundaries from co-occurrence statistics | Punkt (NLTK) | Data-driven, language-adaptive | Still word-level, OOV issues | Sentence segmentation |
| Subword: BPE | Iteratively merges frequent character/word pairs | SentencePiece (BPE), HuggingFace | Handles rare words, reduces vocab size | Fixed vocab, not morphologically aware | GPT-2, RoBERTa |
| Subword: WordPiece | Likelihood-based subword merges | HuggingFace, TensorFlow Tokenizer | Statistically grounded, efficient | Fixed vocab, similar issues as BPE | BERT, DistilBERT |
| Subword: Unigram LM | Probabilistic model over subwords | SentencePiece (Unigram mode) | Flexible, multiple segmentations possible | More complex training | XLNet, T5 |
| Byte-level (BPE variant) | Tokenizes directly at the UTF-8 byte level | HuggingFace Tokenizers | Universal, works with emojis & all scripts | Tokens less human-readable | GPT-2, GPT-3, LLaMA |
| Character-level | Each character is a token | Custom preprocessing, char-RNNs | No OOV, language-independent | Very long sequences, inefficient | Older RNNs, some Asian languages |
| Morpheme-based | Splits into roots, prefixes, suffixes | Polyglot, Stanza | Captures true linguistic structure | Language-specific analyzers, slower | Finnish, Turkish, Korean NLP |
We will explore differences among tokenizers using the sentence provided in the following cell.
inp = "I'm loving NLP—it's amazing!"
Simple whitespace tokenization splits the text by spaces.
inp.split()
["I'm", 'loving', "NLP—it's", 'amazing!']
Character (symbol) tokenization treats each symbol as a separate token:
print(list(inp))
['I', "'", 'm', ' ', 'l', 'o', 'v', 'i', 'n', 'g', ' ', 'N', 'L', 'P', '—', 'i', 't', "'", 's', ' ', 'a', 'm', 'a', 'z', 'i', 'n', 'g', '!']
The word tokenizer separates text into words. It differs from the whitespace tokenizer in that it assumes tokens can be separated by more than just whitespace, such as punctuation. The following cell shows the transformation of the example sentence.
nltk.tokenize.word_tokenize(inp)
['I', "'m", 'loving', 'NLP—it', "'s", 'amazing', '!']
Stop words#
Many words in a text carry little meaning on their own. They are commonly called stop words and are often removed when processing text data.
The following cell shows the stop word list provided by the nltk library.
print(nltk.corpus.stopwords.words("english"))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
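A minimal sketch of stop word removal, assuming the nltk stopwords and punkt resources have been downloaded: tokenize the text and keep only the tokens that do not appear in the list.
stop_words = set(nltk.corpus.stopwords.words("english"))
tokens = nltk.tokenize.word_tokenize("This is an example of a sentence with some stop words")

# Keep only the tokens that are not stop words (comparison is case-insensitive)
print([t for t in tokens if t.lower() not in stop_words])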
Stemming#
Stemming is the process of reducing words in the text to their base form (the stem). It helps shrink the vocabulary by removing not-so-important variations.
The following cell shows the transformation process for different forms of the word “invest”.
input = [
"invest",
"invests",
"invested",
"investing",
"investment",
"investments",
"investor",
"investors",
"investiture",
"investedness"
]
porter_stemmer = nltk.stem.PorterStemmer()
output = [porter_stemmer.stem(w) for w in input]
for inp, out in zip(input, output):
print(f"{inp} -> {out}")
invest -> invest
invests -> invest
invested -> invest
investing -> invest
investment -> invest
investments -> invest
investor -> investor
investors -> investor
investiture -> investitur
investedness -> invested
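The Porter stemmer is not the only option in nltk. The following cell is a small sketch that runs the Snowball ("Porter2") stemmer on the same word list; it may produce slightly different stems.
snowball_stemmer = nltk.stem.SnowballStemmer("english")

# Compare the Snowball stems with the Porter stems shown above
for w in input:
    print(f"{w} -> {snowball_stemmer.stem(w)}")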
Lemmatization#
There is a lot of confusion between the terms stemming and lemmatization. I haven't found an authoritative definition yet, but the following description matches the behaviour of the tools:
Stemming is a rule-based process that removes prefixes or suffixes from a word to reduce it to a base form. However, stemming does not guarantee that the resulting word will be a valid or meaningful word in the language. It may result in non-existent or strange forms.
Lemmatization, on the other hand, involves reducing a word to its lemma, or canonical form, based on its dictionary definition. The resulting word is always a valid word in the language, and lemmatization often takes into account the part of speech and other grammatical information.
Long story short: Stemming might produce words that don’t exist, while lemmatization guarantees a real word as the result.
These are usually considered separate processes, but from my perspective lemmatization is a refinement of stemming, so on this site they are treated together.
The next cell shows a set of words that stemming turns into non-existent forms, together with the lemmatization of the same words.
porter_stemmer = nltk.stem.PorterStemmer()
lemmatizer = nltk.stem.WordNetLemmatizer()
input = [
"geese",
"happily",
"generously",
"studied"
]
pd.DataFrame({
"input": input,
"porter stemming": [porter_stemmer.stem(w) for w in input],
"lemming": [lemmatizer.lemmatize(w) for w in input]
})
| | input | porter stemming | lemmatization |
|---|---|---|---|
| 0 | geese | gees | goose |
| 1 | happily | happili | happily |
| 2 | generously | gener | generously |
| 3 | studied | studi | studied |
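The WordNet lemmatizer treats every word as a noun unless a part of speech is passed, which explains some of the unchanged outputs above. The following is a small sketch using the verb form "loving":
lemmatizer = nltk.stem.WordNetLemmatizer()

# Without a part of speech the word is treated as a noun and stays unchanged
print(lemmatizer.lemmatize("loving"))

# Marking it as a verb ("v") reduces it to the dictionary form "love"
print(lemmatizer.lemmatize("loving", pos="v"))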
Embeddings#
To use tokenized text in any kind of algorithm, particularly in machine learning algorithms, you need to transform it into numerical data - generally, a vector built according to some rules. Building embeddings is not exclusive to the NLP domain; here we consider how embeddings are used specifically in NLP.
The following table lists the typical, widely known approaches:
| Approach | Description | Output Format | Pros | Cons |
|---|---|---|---|---|
| Bag of Words (BoW) | Counts word occurrences in a document. | Sparse vector | Simple, fast, interpretable | Ignores word order and semantics |
| TF-IDF | Weights word counts by term frequency and inverse document frequency. | Sparse vector | Reduces impact of common words, more informative than BoW | Still ignores word order and semantics |
| Word2Vec | Learns word embeddings from context (CBOW or Skip-gram). | Dense vector (per word) | Captures semantic similarity | Needs large corpus and training time |
| GloVe | Embeddings based on global word co-occurrence statistics. | Dense vector (per word) | Combines global and local information | Pretrained; less adaptable to specific domains |
| FastText | Extends Word2Vec with subword (character n-gram) information. | Dense vector (per word) | Handles out-of-vocabulary words well | Larger model size |
| BERT embeddings | Contextualized embeddings from a Transformer encoder. | Dense vector (contextual) | Captures context, syntax, semantics | Computationally expensive |
| Sentence Transformers | Sentence-level embeddings from pretrained transformers. | Dense vector (sentence) | Effective for sentence similarity and retrieval | Heavier model; fine-tuning may be needed |
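A minimal sketch of the two sparse approaches from the table, assuming scikit-learn is installed (it is not imported elsewhere in this notebook):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "I'm loving NLP",
    "NLP is amazing",
    "loving amazing NLP tools",
]

# Bag of Words: each column is a vocabulary term, each value a raw count
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

# TF-IDF: the same counts re-weighted to down-weight terms that appear in many documents
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))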
For a more detailed description check out the Embeddings page.
For a more detailed explanation of the tf-idf approach, see tf-idf.