RAG#
Retrieval-Augmented Generation (RAG) is an approach that provides an LLM with context relevant to a specific request. The general idea is to build a knowledge base in the form of a vector database, where documents carrying the information to be added to the model's context are stored as embeddings. When the system needs information, it searches for the embeddings closest to the request, retrieves the corresponding texts, and adds them as context to the generation model.
Check out this guide for building a RAG system prototype. The guide uses movie reviews from the IMDB dataset as a knowledge base, Sentence Transformers for creating embeddings, Qdrant for vector search, and Qwen2-1.5B for generating responses.
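As a minimal illustration of the idea (not a substitute for the guide above), here is a sketch that embeds a few toy documents with Sentence Transformers and retrieves the closest ones by cosine similarity; the model name and documents are placeholders, and in practice the embeddings would live in a vector database such as Qdrant.

```python
# Minimal RAG retrieval sketch: embed documents, embed the query,
# and pick the most similar documents as context for the LLM prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

documents = [
    "The movie had great acting but a weak plot.",
    "Qdrant is a vector database for similarity search.",
    "The soundtrack was the best part of the film.",
]
doc_embeddings = model.encode(documents, normalize_embeddings=True)

query = "Which review mentions the music?"
query_embedding = model.encode([query], normalize_embeddings=True)[0]

# Cosine similarity reduces to a dot product on normalized vectors.
scores = doc_embeddings @ query_embedding
top_idx = np.argsort(scores)[::-1][:2]
context = "\n".join(documents[i] for i in top_idx)

prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to the generation model (e.g. Qwen2-1.5B)
```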
Chunking#
The documents are typically split into chunks, each of which is encoded as an embedding. These embeddings become the records in the vector database.
There are a few reasons to do that:
Reduce computational costs.
Fit within the context window of the generation model.
If each chunk contains a single focused idea, the model is less likely to be confused by irrelevant information.
Approaches for chunking:
Fixed length: each chunk contains a specified number of elements.
Structural: chunks follow natural language structures such as paragraphs, abstracts, or sentences.
Recursive: a pool of levels is defined, where each level represents a natural language structural element (paragraph, abstract, sentence, etc.). In the first step, text is split at the first level of the structure. Any chunk that does not fit the selected length is split again at the next level. The procedure repeats until all chunks fit the selected length.
Semantic: choose a unit of separation that yields relatively small pieces of text, each carrying a single idea (typically, sentences). Semantic similarity is then estimated for consecutive pieces, and neighbouring pieces are merged into larger chunks if they are close enough.
Note: chunks typically overlap; tokens at the end of one chunk are usually repeated at the beginning of the next chunk. This prevents important information from being split across chunk boundaries (see the fixed-length sketch below).
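A fixed-length splitter with overlap is the simplest of the approaches above. The sketch below splits text into windows of whitespace tokens; the chunk size and overlap values are arbitrary choices, not recommendations.

```python
# Fixed-length chunking with overlapping windows, measured in whitespace tokens.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    # Each window starts `step` tokens after the previous one, so the last
    # `overlap` tokens of a chunk are repeated at the start of the next chunk.
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
    return chunks

# Usage: each returned chunk is then embedded and stored in the vector database.
# chunks = chunk_text(document_text)
```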
Retrieval#
Retrieval is the phase in which the prepared vector database is searched for relevant information. A naive approach that simply computes the similarity between the request and every stored record is not ideal: it can increase the system latency and return irrelevant results. Several approaches improve the quality of the retrieval process:
Locality-Sensitive Hashing (LSH): separates the documents into groups. It first selects the most relevant group and then selects the most relevant chunks within that group.
Domain-specific metainformation: metadata is attached to each document in the vector database, for example a marker extracted according to the specifics of the problem being solved.
If we determine that the problem comes from an insufficiently clear formulation of the user's request, we can ask the LLM to rewrite the query in several forms and perform a search with each one.
If we find that the model does not receive enough information from a single chunk, but we cannot make chunks bigger for some reason, the chunk can be supplemented with surrounding text from the source document it came from.
Applying a reranker: the retrieved documents are reordered using a special reranking language model that estimates the similarity of two texts (see the cross-encoder sketch below).
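As one concrete example, a cross-encoder from the sentence-transformers library can rescore the candidates returned by the vector search; the model name below is just one common choice, and `candidates` is assumed to be the list of retrieved chunks.

```python
# Rerank retrieved chunks with a cross-encoder that scores (query, text) pairs jointly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # Higher score means the text is judged more relevant to the query.
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_k]]
```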
Another important principle concerns how the retrieved context is provided to the LLM: it typically works better if the most relevant information is placed at the beginning and end of the context.
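A simple way to follow that principle is to interleave the ranked chunks so the strongest ones land at the edges of the prompt; this is only a sketch of one possible ordering, not the only way to do it.

```python
# Place the highest-ranked chunks at the beginning and end of the context,
# pushing the weakest ones toward the middle.
def order_for_context(ranked_chunks: list[str]) -> list[str]:
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):  # ranked best -> worst
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# order_for_context(["best", "2nd", "3rd", "4th", "5th"])
# -> ["best", "3rd", "5th", "4th", "2nd"]
```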
The retrieval phase is usually implemented with third-party tools. The following table shows the most popular elements of the RAG stack today:
| Purpose | Tool name |
|---|---|
| Vector databases | Qdrant |
| | Chroma |
| | Pinecone |
| | Milvus |
| | pgvector |
| Text search | Elasticsearch |
| Closest vectors search | FAISS |
| | Annoy |
Quality estimation#
Quality estimation for a RAG system is a complex question, and different sources sometimes provide slightly different information, but here is my take.
Check the Evaluators section of the LangSmith tutorial for evaluating RAG systems.
Retrieval relevance#
The estimation of the retrieval part generally follows the principles used for ranking systems. However, since we are working with texts for which there is no suitable proxy metric to mark the relevance of the elements, data labeling is the main issue.
The typical metrics for ranking are precision@k, recall@k, mean Average Precision (mAP), and mean reciprocal rank (MRR); there is a separate section that covers metrics for the ranking task in detail. To label your data, you can ask experts to estimate the relevance of documents for some set of requests and create a ground-truth dataset. You can also ask a powerful LLM to estimate the relevance of documents for a given set of requests and use those judgments as labels.
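Given ground-truth relevance labels (from experts or an LLM annotator), the ranking metrics are straightforward to compute. The sketch below assumes that for each query you have the set of relevant document ids and the ranked list returned by the retriever.

```python
# Basic ranking metrics for the retrieval phase.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Share of the top-k retrieved documents that are relevant.
    return sum(doc in relevant for doc in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Share of all relevant documents that appear in the top-k results.
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # 1 / rank of the first relevant document, 0 if none is retrieved.
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# MRR is the mean of reciprocal_rank over all labeled queries:
# mrr = sum(reciprocal_rank(run, truth) for run, truth in labeled_queries) / len(labeled_queries)
```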
Generation quality#
Several characteristics describe the quality of the generation phase:
Correctness: how closely does the system's response match that of a human annotator? The typical way to express correctness numerically is faithfulness: the ratio of correct statements to the total number of statements.
Relevance (answer relevance): how closely does the system's response match the input request? The typical way to express relevance numerically is the ratio of relevant statements to the total number of statements.
Groundedness: how well does the response fit the retrieved documents? It compares the information retrieved from the knowledge base to the system's final answer and estimates whether the LLM used the information provided by the retrieval component of the system.
As you can see, these qualities cannot be estimated by pure comparison with labels, so a model-as-a-judge approach comes in. Typically, we ask another relatively powerful LLM to annotate the statements or provide a score that expresses how well the texts correspond (a sketch is given below).
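A model-as-a-judge check can be as simple as asking a strong LLM whether each statement of the answer is supported by the retrieved context and reporting the supported ratio. Everything below is a hypothetical sketch: `call_judge` stands for whatever LLM client you plug in, and the prompt wording is only an example.

```python
# Hypothetical LLM-as-a-judge sketch: ask a judge model whether each statement
# of the answer is supported by the retrieved context, then report the ratio
# of supported statements (a faithfulness-style score).
def judge_faithfulness(answer_statements: list[str], context: str, call_judge) -> float:
    supported = 0
    for statement in answer_statements:
        prompt = (
            "Context:\n" + context + "\n\n"
            "Statement:\n" + statement + "\n\n"
            "Is the statement supported by the context? Answer yes or no."
        )
        verdict = call_judge(prompt)  # call_judge: any function that sends a prompt to the judge LLM
        supported += verdict.strip().lower().startswith("yes")
    return supported / len(answer_statements)
```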