# NLP
Natural language processing (NLP) is a field focused on algorithms for processing ordinary human text.
## Pre-Processing
To work with text, you need to transform it into numbers, a representation better aligned with how computers process information. There are various techniques for achieving this.
The following table describes typical terms used in the field of “wrangling and preprocessing”.
| Term | Meaning |
|---|---|
| Conversion | Transform text into a standardized format. |
| Sanitization | Remove noise and unnecessary characters. |
| Tokenization | Split text into words or phrases. |
| Stemming | Reduce words to their root form. |
| Lemmatization | Normalize words based on their dictionary meaning. |
| Building embeddings | Transform tokens into vectors. |
Check the Pre-Processing page for more details.
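As a quick illustration, here is a minimal pre-processing sketch using NLTK (the `nltk` package and its WordNet resources are assumptions, not something prescribed above):

```python
# Minimal pre-processing sketch: sanitization, tokenization, stemming, lemmatization.
# Assumes the `nltk` package is installed; the WordNet resources are downloaded below.
import re

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # dictionary used by the lemmatizer
nltk.download("omw-1.4", quiet=True)

text = "The cats were running faster than the dogs!"

# Conversion + sanitization: lowercase the text and keep alphabetic tokens only.
tokens = re.findall(r"[a-z]+", text.lower())

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])          # crude root forms, e.g. "running" -> "run"
print([lemmatizer.lemmatize(t) for t in tokens])  # dictionary forms, e.g. "cats" -> "cat"
```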
## Language understanding
Language understanding is a group of NLP tasks aimed at understanding the structure and syntax of language. The following table describes typical tasks related to language understanding.
| Task | Description |
|---|---|
| Part-of-speech tagging | Identify nouns, verbs, adjectives, etc. |
| Chunking | Group related words together. |
| Dependency parsing | Analyze grammatical relationships in a sentence. |
| Constituency parsing | Break down a sentence into hierarchical sub-units. |
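The sketch below shows how these tasks look in practice with spaCy (the library and its `en_core_web_sm` model are assumptions; any pretrained pipeline would do):

```python
# POS tagging, chunking, and dependency parsing with a pretrained spaCy pipeline.
# Assumes spaCy and the small English model `en_core_web_sm` are installed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Part of speech and dependency relation for each token, plus its syntactic head.
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Chunking: spaCy groups related words into noun chunks.
print([chunk.text for chunk in doc.noun_chunks])
```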
## Processing and Functionality
NLP has numerous applications in business. The most typical are listed in the following table:
| Functionality | Description |
|---|---|
| Named Entity Recognition (NER) | Identify proper names (e.g., people, places). |
| N-gram Identification | Analyze word sequences to predict text. |
| Sentiment Analysis | Detect emotions and opinions in text. |
| Information Extraction | Identify key information in unstructured data. |
| Information Retrieval | Find relevant documents or data. |
| Question Answering | Process user queries to return precise answers. |
| Topic Modeling | Identify key themes in text data. |
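For example, Named Entity Recognition can be run with an off-the-shelf spaCy pipeline (again assuming the `en_core_web_sm` model; the input sentence is made up for illustration):

```python
# Named Entity Recognition with a pretrained spaCy pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ada Lovelace worked with Charles Babbage in London.")

# Each entity span carries the recognized text and its label (PERSON, GPE, ...).
for ent in doc.ents:
    print(ent.text, ent.label_)
```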
## Text generation
Text generation is an applied task in which models are trained to predict token sequences from given input.
GPT (Generative Pre-Training) is a model architecture that, given a sequence of tokens, learns to predict the next token. It is based on the transformer architecture, but uses only the decoder of the original transformer.
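A minimal sketch of this next-token prediction, using the Hugging Face `transformers` library and the public `gpt2` checkpoint (both are assumptions, not something prescribed above):

```python
# Greedy next-token prediction with a decoder-only (GPT-style) model.
# Assumes the `transformers` and `torch` packages and the public `gpt2` checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Natural language processing is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (batch, sequence_length, vocab_size)

next_token_id = logits[0, -1].argmax().item()  # greedy choice for the next token
print(tokenizer.decode(next_token_id))
```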
### Metrics
Estimating the quality of generated text is a complex task: generally, there is no ground-truth set against which generated texts can be compared. The existing approaches allow us to estimate:
Diversity: Ensures that the generated texts are not just the most popular combinations of tokens. A common way to measure diversity is to calculate the ratio of unique n-grams to the total number of n-grams:
\[\frac{|\mathrm{unique\ n\!-\!grams}|}{|\mathrm{all\ n\!-\!grams}|}\]
Memorization: The proportion of generated n-grams that match n-grams in the training corpus.
Perplexity: A measure of how well the model predicts a text from a given corpus, defined as the inverse of the probability the model assigns to the text, normalized by its length. The general formula for computing perplexity is as follows:
\[PPL(x) = p(x_1, x_2, \ldots, x_m)^{-1/m}\]
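Both the diversity ratio and the perplexity formula are easy to compute directly; the values below are toy inputs used purely for illustration:

```python
# Diversity (unique n-gram ratio) and perplexity, computed from toy inputs.
import math

def distinct_n(tokens, n):
    """Ratio of unique n-grams to all n-grams in a generated token sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

def perplexity(token_probs):
    """PPL(x) = p(x_1, ..., x_m) ** (-1/m), from per-token probabilities."""
    m = len(token_probs)
    log_p = sum(math.log(p) for p in token_probs)
    return math.exp(-log_p / m)

print(distinct_n(["the", "cat", "sat", "on", "the", "cat"], 2))  # 0.8
print(perplexity([0.25, 0.5, 0.1]))                              # about 4.31
```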
## Text CNN
Text CNN is a method for applying convolutional architecture concepts to NLP.
Check the Text CNN page for more details.
## LLMs
LLMs are models designed primarily to predict text. The most advanced LLMs can simulate a wide range of linguistic behaviors. With the right configuration, they can be applied to many problems that are difficult to solve with traditional programming. In this section on LLMs, we will cover:
- How LLMs work
- Using LLMs
- Prompt engineering: techniques for guiding LLMs to achieve specific goals.
- Agent systems: enabling LLMs to use tools and interact with other systems.
Check the LLMs page.
## Transfer learning
Transfer learning is an approach that improves the performance of machine learning models on specific tasks. Most models with open weights are trained on general datasets, so you can update some weights, or entire components of the model, to create an algorithm that suits your specific task. The task for which the model's weights are updated is usually called a downstream task.
There are several types of transfer learning:
- Linear probing: replace the output layer of the model and retrain only that layer (see the sketch after this list).
- Fine-tuning: update all weights of the model.
- Parameter-Efficient Fine-tuning: a modification of linear probing in which only a small subset of the other layers is also updated.
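A minimal linear-probing sketch with the Hugging Face `transformers` library (the `bert-base-uncased` checkpoint and the 3-label downstream task are assumptions made for illustration):

```python
# Linear probing: freeze the pretrained encoder and train only the new output head.
# Assumes the `transformers` and `torch` packages; checkpoint and label count are illustrative.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)

# Freeze every pretrained weight; the freshly initialized classification head stays trainable.
for param in model.base_model.parameters():
    param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

# For full fine-tuning, skip the freezing loop and optimize all parameters instead.
```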