# NLP
Natural language processing (NLP) is a field focused on algorithms for processing ordinary human text.
## Pre-Processing
To work with text, you need to transform it into numbers, a representation better aligned with how computers process information. There are various techniques for achieving this.
The following table describes typical terms used in the field of “wrangling and preprocessing”.
| Term | Meaning |
|---|---|
| Conversion | Transform text into a standardized format. |
| Sanitization | Remove noise and unnecessary characters. |
| Tokenization | Split text into words or phrases. |
| Stemming | Reduce words to their root form. |
| Lemmatization | Normalize words based on their dictionary meaning. |
| Building embeddings | Transform tokens into vectors. |
Check the Pre-Processing page for more details.
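As a quick illustration, here is a minimal pre-processing sketch using NLTK (the `nltk` package and its WordNet resources are assumptions, not something prescribed above):

```python
# Minimal pre-processing sketch: sanitization, tokenization, stemming, lemmatization.
# Assumes the `nltk` package is installed; the WordNet resources are downloaded below.
import re

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # dictionary used by the lemmatizer
nltk.download("omw-1.4", quiet=True)

text = "The cats were running faster than the dogs!"

# Conversion + sanitization: lowercase the text and keep alphabetic tokens only.
tokens = re.findall(r"[a-z]+", text.lower())

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])          # crude root forms, e.g. "running" -> "run"
print([lemmatizer.lemmatize(t) for t in tokens])  # dictionary forms, e.g. "cats" -> "cat"
```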
## Language understanding
Language understanding is a group of NLP tasks aimed at understanding the structure and syntax of language. The following table describes typical tasks related to language understanding.
| Task | Description |
|---|---|
| Part-of-speech tagging | Identify nouns, verbs, adjectives, etc. |
| Chunking | Group related words together. |
| Dependency parsing | Analyze grammatical relationships in a sentence. |
| Constituency parsing | Break down a sentence into hierarchical sub-units. |
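The sketch below shows how these tasks look in practice with spaCy (the library and its `en_core_web_sm` model are assumptions; any pretrained pipeline would do):

```python
# POS tagging, chunking, and dependency parsing with a pretrained spaCy pipeline.
# Assumes spaCy and the small English model `en_core_web_sm` are installed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Part of speech and dependency relation for each token, plus its syntactic head.
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Chunking: spaCy groups related words into noun chunks.
print([chunk.text for chunk in doc.noun_chunks])
```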
## Processing and Functionality
NLP has numerous applications in business. The most typical are listed in the following table:
| Functionality | Description |
|---|---|
| Named Entity Recognition (NER) | Identify proper names (e.g., people, places). |
| N-gram Identification | Analyze word sequences to predict text. |
| Sentiment Analysis | Detect emotions and opinions in text. |
| Information Extraction | Identify key information in unstructured data. |
| Information Retrieval | Find relevant documents or data. |
| Question Answering | Process user queries to return precise answers. |
| Topic Modeling | Identify key themes in text data. |
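For example, Named Entity Recognition can be run with an off-the-shelf spaCy pipeline (again assuming the `en_core_web_sm` model; the input sentence is made up for illustration):

```python
# Named Entity Recognition with a pretrained spaCy pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ada Lovelace worked with Charles Babbage in London.")

# Each entity span carries the recognized text and its label (PERSON, GPE, ...).
for ent in doc.ents:
    print(ent.text, ent.label_)
```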
## Text generation
Text generation is an applied task in which models are trained to predict token sequences from given input.
GPT (Generative Pre-Training) is a model architecture that, given a sequence of tokens, learns to predict the next token. It is based on the transformer architecture, but uses only the decoder of the original transformer.
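A minimal sketch of this next-token prediction, using the Hugging Face `transformers` library and the public `gpt2` checkpoint (both are assumptions, not something prescribed above):

```python
# Greedy next-token prediction with a decoder-only (GPT-style) model.
# Assumes the `transformers` and `torch` packages and the public `gpt2` checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Natural language processing is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (batch, sequence_length, vocab_size)

next_token_id = logits[0, -1].argmax().item()  # greedy choice for the next token
print(tokenizer.decode(next_token_id))
```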
### Metrics
Estimating the quality of generated text is a complex task: generally, there is no ground-truth set against which generated texts can be compared. The existing approaches allow us to estimate:
Diversity: Ensures that the generated texts are not just the most popular combinations of tokens. A common way to measure diversity is to calculate the ratio of unique n-grams to the total number of n-grams:
\[\frac{|\mathrm{unique\ n\!-\!grams}|}{|\mathrm{all\ n\!-\!grams}|}\]
Memorization: The proportion of generated n-grams that match n-grams in the training corpus.
Perplexity: A measure of how well the model predicts a text from a given corpus, defined as the inverse of the probability the model assigns to the text, normalized by its length. The general formula for computing perplexity is as follows:
\[PPL(x) = p(x_1, x_2, \ldots, x_m)^{-1/m}\]
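Both the diversity ratio and the perplexity formula are easy to compute directly; the values below are toy inputs used purely for illustration:

```python
# Diversity (unique n-gram ratio) and perplexity, computed from toy inputs.
import math

def distinct_n(tokens, n):
    """Ratio of unique n-grams to all n-grams in a generated token sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

def perplexity(token_probs):
    """PPL(x) = p(x_1, ..., x_m) ** (-1/m), from per-token probabilities."""
    m = len(token_probs)
    log_p = sum(math.log(p) for p in token_probs)
    return math.exp(-log_p / m)

print(distinct_n(["the", "cat", "sat", "on", "the", "cat"], 2))  # 0.8
print(perplexity([0.25, 0.5, 0.1]))                              # about 4.31
```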
## Text CNN
Text CNN is a method for applying convolutional architecture concepts to NLP.
Check the Text CNN page for more details.
## LLMs
LLMs are models designed primarily to predict text. The most advanced LLMs can simulate a wide range of linguistic behaviors. With the right configuration, they can be applied to many problems that are difficult to solve with traditional programming. In this section on LLMs, we will cover:
- How LLMs work
- Using LLMs
- Prompt engineering: techniques for guiding LLMs to achieve specific goals.
- Agent systems: enabling LLMs to use tools and interact with other systems.
Check the LLMs page.
## Transfer learning
Transfer learning is an approach that improves the performance of machine learning models on specific tasks. Most models with open weights are trained on general datasets, so you can update some weights, or entire components of the model, to create an algorithm that suits your specific task. The task for which the model's weights are updated is usually called a downstream task.
There are several types of transfer learning:
- Linear probing: replace the output layer of the model and retrain only that layer (see the sketch after this list).
- Fine-tuning: update all weights of the model.
- Parameter-Efficient Fine-tuning: a modification of linear probing in which only a small subset of the other layers is also updated.
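A minimal linear-probing sketch with the Hugging Face `transformers` library (the `bert-base-uncased` checkpoint and the 3-label downstream task are assumptions made for illustration):

```python
# Linear probing: freeze the pretrained encoder and train only the new output head.
# Assumes the `transformers` and `torch` packages; checkpoint and label count are illustrative.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)

# Freeze every pretrained weight; the freshly initialized classification head stays trainable.
for param in model.base_model.parameters():
    param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

# For full fine-tuning, skip the freezing loop and optimize all parameters instead.
```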