
Preprocessing#

This notebook provides an overview of common techniques for processing text data.

import nltk
import pandas as pd
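Depending on the local installation, the tokenizers and corpora used below may require additional data packages. A minimal sketch of the downloads this notebook assumes (exact resource names can vary between nltk versions):

nltk.download("punkt")      # models used by the word and sentence tokenizers
nltk.download("stopwords")  # stop word lists
nltk.download("wordnet")    # lexical database used by the WordNet lemmatizer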

Tokenization#

Tokenization is the process of breaking a stream of textual data into words, terms, sentences, symbols, or other meaningful elements called tokens.


We will explore differences among tokenizers using the sentence provided in the following cell.

inp = "I'm loving NLP—it's amazing!"

Simple whitespace tokenization splits the text on spaces.

inp.split()
["I'm", 'loving', "NLP—it's", 'amazing!']

Symbol tokenization treats each symbol (character) as a separate token:

print(list(inp))
['I', "'", 'm', ' ', 'l', 'o', 'v', 'i', 'n', 'g', ' ', 'N', 'L', 'P', '—', 'i', 't', "'", 's', ' ', 'a', 'm', 'a', 'z', 'i', 'n', 'g', '!']

The word tokenizer separates text into words. It differs from the whitespace tokenizer in that it assumes tokens can be separated by more than just whitespace. The following cell shows the transformation of the example sentence.

nltk.tokenize.word_tokenize(inp)
['I', "'m", 'loving', 'NLP—it', "'s", 'amazing', '!']

Stop words#

Many words in a text carry little meaning on their own. These are commonly called stop words and are typically removed when processing text data.


The following cell shows the stop word list according to the nltk library.

print(nltk.corpus.stopwords.words("english"))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
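A minimal sketch of how this list is typically used: tokenize the example sentence and drop every token whose lowercased form appears in the stop word list (the list itself is lowercase).

stop_words = set(nltk.corpus.stopwords.words("english"))
tokens = nltk.tokenize.word_tokenize(inp)
# keep only the tokens that are not stop words
print([tok for tok in tokens if tok.lower() not in stop_words])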

Stemming#

Stemming is the process of reducing words in the text to their base form. It helps reduce the size of the vocabulary by removing not-so-important variations.


The following cell shows the transformation process for different forms of the word “invest”.

# different forms of the word "invest"
words = [
    "invest",
    "invests",
    "invested",
    "investing",
    "investment",
    "investments",
    "investor",
    "investors",
    "investiture",
    "investedness"
]

# apply the Porter stemmer to every form
porter_stemmer = nltk.stem.PorterStemmer()
output = [porter_stemmer.stem(w) for w in words]
for word, stem in zip(words, output):
    print(f"{word} -> {stem}")
invest -> invest
invests -> invest
invested -> invest
investing -> invest
investment -> invest
investments -> invest
investor -> investor
investors -> investor
investiture -> investitur
investedness -> invested

Lemmatization#

There is a lot of confusion between the terms stemming and lemmatization. I haven’t found an authoritative opinion yet, but the following description corresponds to the behaviour of the tools:

  • Stemming is a rule-based process that removes prefixes or suffixes from a word to reduce it to a base form. However, stemming does not guarantee that the resulting word will be a valid or meaningful word in the language. It may result in non-existent or strange forms.

  • Lemmatization, on the other hand, involves reducing a word to its lemma, or canonical form, based on its dictionary definition. The resulting word is always a valid word in the language, and lemmatization often takes into account the part of speech and other grammatical information.

Long story short: Stemming might produce words that don’t exist, while lemmatization guarantees a real word as the result.

Stemming and lemmatization are usually treated as separate processes, but from my perspective lemmatization is a more precise refinement of stemming, so on this site they are treated together.
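The part-of-speech point is easy to check directly: WordNetLemmatizer looks words up as nouns by default, so passing an explicit pos tag can change the result. A small sketch (the word choice here is only an illustration):

lemmatizer = nltk.stem.WordNetLemmatizer()

# the default noun lookup leaves "better" unchanged;
# with pos="a" it should be mapped to its adjective lemma "good"
print(lemmatizer.lemmatize("better"))
print(lemmatizer.lemmatize("better", pos="a"))

The same default explains the table below: words such as “happily” and “studied” come back unchanged because they are looked up as nouns.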


The next cell shows a set of words that stemming turns into non-existent forms, together with the lemmatization output for the same examples.

porter_stemmer = nltk.stem.PorterStemmer()
lemmatizer = nltk.stem.WordNetLemmatizer()

words = [
    "geese",
    "happily",
    "generously",
    "studied"
]

# compare the Porter stemmer with the WordNet lemmatizer
pd.DataFrame({
    "input": words,
    "porter stemming": [porter_stemmer.stem(w) for w in words],
    "lemmatization": [lemmatizer.lemmatize(w) for w in words]
})
        input porter stemming lemmatization
0       geese            gees         goose
1     happily         happili       happily
2  generously           gener    generously
3     studied           studi       studied
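Putting the pieces together, here is a minimal sketch of a preprocessing pipeline in the spirit of this notebook: word tokenization, stop word removal, and stemming (the helper name preprocess is just illustrative):

stop_words = set(nltk.corpus.stopwords.words("english"))
stemmer = nltk.stem.PorterStemmer()

def preprocess(text):
    # lowercase, tokenize, drop stop words and punctuation-only tokens, then stem
    tokens = nltk.tokenize.word_tokenize(text.lower())
    return [stemmer.stem(tok) for tok in tokens if tok.isalpha() and tok not in stop_words]

print(preprocess("I'm loving NLP—it's amazing!"))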