# IMDB example

This page walks through the process of building a retrieval-augmented generation (RAG) system. In more detail:

- Reviews from the IMDb dataset serve as the knowledge base.
- The `sentence_transformers` package prepares the embeddings.
- `qdrant` serves as the vector database.
- `qwen2` serves as the generation model.

The resulting RAG system isn't very useful in itself, but the page can guide you through the major steps of building your own.

In [1]:
import os
import uuid

import numpy as np

import ollama
from pprint import pprint
from transformers import pipeline
from datasets import load_dataset
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

## Building embeddings

The following cell loads a dataset and shows one record from it.

In [2]:
data = list(load_dataset("stanfordnlp/imdb", split="train")['text'])
pprint(data[20])

('If the crew behind "Zombie Chronicles" ever read this, here\'s some advice '
 'guys: <br /><br />1. In a "Twist Ending"-type movie, it\'s not a good idea '
 'to insert close-ups of EVERY DEATH IN THE MOVIE in the opening credits. That '
 "tends to spoil the twists, y'know...? <br /><br />2. I know you produced "
 'this on a shoestring and - to be fair - you worked miracles with your budget '
 'but please, hire people who can actually act. Or at least, walk, talk and '
 "gesture at the same time. Joe Haggerty, I'm looking at you...<br /><br />3. "
 "If you're going to set a part of your movie in the past, only do this if you "
 'have the props and costumes of the time.<br /><br />4. Twist endings are '
 "supposed to be a surprise. Sure, we don't want twists that make no sense, "
 'but signposting the "reveal" as soon as you introduce a character? That\'s '
 'not a great idea.<br /><br />Kudos to the guys for trying, but in all '
 "honesty, I'd rather they hadn't...<br /><br />Only for

The texts are generally short and each expresses a more or less consistent idea. Therefore, we will not apply any extra chunking; we will simply transform the complete texts into embeddings.
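
For longer documents, a chunking step could be added before encoding. The following is a minimal sketch of one possible approach, a fixed word window with overlap. The `chunk_words` function and its `max_words` and `overlap` parameters are hypothetical and not part of this notebook.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping windows of at most max_words words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    if not words:
        return []
    # Short texts are returned unchanged as a single chunk.
    if len(words) <= max_words:
        return [text]
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk would then be encoded and stored as its own point, with the source review kept in the payload.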

In [3]:
embedding_model = SentenceTransformer(
    "paraphrase-MiniLM-L3-v2",
    model_kwargs={'dtype': 'float16'}
)

The following code uses the cached embeddings if they exist. Otherwise, it re-encodes the texts, which can take some time.

In [4]:
if os.path.exists("imdb_example_files/embeddings.npy"):
    embeddings = np.load("imdb_example_files/embeddings.npy")
else:
    embeddings = embedding_model.encode(data, normalize_embeddings=True)
    if not os.path.exists("imdb_example_files"):
        os.mkdir("imdb_example_files")
    np.save("imdb_example_files/embeddings", embeddings)
    with open("imdb_example_files/.gitignore", "w") as f:
        f.write("embeddings.npy\n")

## Vector database

This section demonstrates how to set up the vector database and upload information to it.
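
Conceptually, a vector database answers nearest-neighbour queries over stored embeddings. The sketch below shows a brute-force version of that search in NumPy, using cosine similarity. It illustrates the idea only and is not how Qdrant is implemented. The `top_k_cosine` function and the toy vectors are hypothetical, and all vectors are assumed to be non-zero.

```python
import numpy as np


def top_k_cosine(
    query: np.ndarray, vectors: np.ndarray, k: int = 5
) -> list[tuple[int, float]]:
    """Return (row index, cosine similarity) for the k rows closest to query."""
    # After normalization, a plain dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    k = min(k, len(scores))
    # Sort by descending similarity and keep the first k indices.
    best = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in best]


toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(top_k_cosine(np.array([2.0, 0.1]), toy_vectors, k=2))
```

A full scan like this costs time proportional to the number of stored vectors per query, which is one reason to use a dedicated database with indexing for larger collections.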

In [5]:
client = QdrantClient(":memory:")
embedding_size = embeddings.shape[1]

client.create_collection(
    collection_name="imdb",
    on_disk_payload=True,
    vectors_config=models.VectorParams(
        size=embedding_size,
        distance=models.Distance.COSINE,
        on_disk=True
    )
)

True

The next cell converts the raw embeddings into the point format that Qdrant expects as input, then uploads them.

In [6]:
points = [
    models.PointStruct(
        id=str(uuid.uuid4()),
        vector=embeddings[i],
        payload={"text": data[i]}
    )
    for i in range(len(embeddings))
]
client.upsert(collection_name="imdb", points=points)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
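
The cell above sends all points in a single `upsert` call. For larger collections it may be preferable to upload in smaller batches. The following is a minimal sketch of a batching helper. The `batched` function and the batch size of 1000 are assumptions for illustration, not something the notebook uses.

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical usage with the objects defined in the cells above:
# for batch in batched(points, 1000):
#     client.upsert(collection_name="imdb", points=list(batch))
```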

The next cell defines the `load_relevant_reviews` function, which queries Qdrant for the reviews closest to the request and returns them as a list of strings.

In [7]:
def load_relevant_reviews(query: str) -> list[str]:
    embedding = embedding_model.encode(
        [query], normalize_embeddings=True
    )
    relevant_info = client.query_points(
        collection_name="imdb",
        query=embedding[0],
        limit=5,
        with_payload=True
    )
    return [res.payload['text'] for res in relevant_info.points]

The following code illustrates which texts are loaded from the database for a specific request.

In [8]:
ans = load_relevant_reviews("What is the typical plot for a horror movie?")

for res in ans:
    pprint(res)
    print('\n')

("How can you tell that a horror movie is terrible? when you can't stop "
 'laughing about it of course! The plot has been well covered by other '
 "reviewers, so I'll just add a few things on the hilarity of it all.<br /><br "
 '/>Some reviews have placed the location in South America, others in Africa, '
 'I thought it was in some random island in the Pacific. Where exactly does '
 'this take place, seems to be a mystery. The cannibal tribe is conformed by a '
 'couple of black women some black men, and a man who looks like a young Frank '
 'Zappa banging the drums... the Devil God is a large black man with a '
 'terrible case of pink eyes.<br /><br />One of the "freakiest" moments in the '
 'film is when, "Pablito" find his partner hanging from a tree covered in what '
 'seems to be an orange substance that I assume is blood, starts screaming for '
 "minutes on and on (that's actually funny), and then the head of his partner "
 'falls in the ground and "Pablito" kicks it a bit for w

## Generation part

The prompt for the model must include the information retrieved from the vector database. The next cell defines the `generate_system_prompt` function, which incorporates the retrieved reviews into the prompt.

In [9]:
system_template = """
You are a movie expert. You are provided with reviews from the IMDb dataset that are relevant to the user's request.

Reviews:

{reviews}
""".strip()

def generate_system_prompt(reviews: list[str]) -> str:
    return system_template.format(reviews="\n\n".join(reviews))
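
The retrieved reviews contain `<br />` tags and can be long, while a small model has a limited context window. One option is to clean and trim the reviews before formatting the prompt. The sketch below assumes a simple character budget. `fit_reviews` and `max_chars` are hypothetical names, and a token-based budget would be more precise.

```python
def fit_reviews(reviews: list[str], max_chars: int = 4000) -> list[str]:
    """Clean reviews and keep them, in ranked order, within a character budget."""
    kept: list[str] = []
    remaining = max_chars
    for review in reviews:
        if remaining <= 0:
            break
        # Replace the HTML line breaks found in the raw IMDb texts.
        cleaned = review.replace("<br />", " ")
        # The last review that only partly fits is truncated.
        piece = cleaned[:remaining]
        kept.append(piece)
        remaining -= len(piece)
    return kept


# Hypothetical usage with the functions defined in this notebook:
# generate_system_prompt(fit_reviews(load_relevant_reviews(request)))
```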

The next cell prints the system prompt, including the retrieved reviews, that the model receives for a sample request.

In [10]:
print(
    generate_system_prompt(
        load_relevant_reviews("what is the typical plot for a horror movie?")
    )
)

You are a movie expert. You are provided with reviews from the IMDb dataset that are relevant to the user's request.

Reviews:

How can you tell that a horror movie is terrible? when you can't stop laughing about it of course! The plot has been well covered by other reviewers, so I'll just add a few things on the hilarity of it all.<br /><br />Some reviews have placed the location in South America, others in Africa, I thought it was in some random island in the Pacific. Where exactly does this take place, seems to be a mystery. The cannibal tribe is conformed by a couple of black women some black men, and a man who looks like a young Frank Zappa banging the drums... the Devil God is a large black man with a terrible case of pink eyes.<br /><br />One of the "freakiest" moments in the film is when, "Pablito" find his partner hanging from a tree covered in what seems to be an orange substance that I assume is blood, starts screaming for minutes on and on (that's actually funny), and then 

The following cell builds the `model_interface` function, which provides access to the generation model.

The `ollama` inference server significantly boosts the performance of the generation model. It will be used if the `OLLAMA_AVAILABLE` flag is set to `True`. Otherwise, the `pipeline` provided by the `transformers` package will be used.

In [11]:
OLLAMA_AVAILABLE = True

if OLLAMA_AVAILABLE:
    def model_interface(messages: list[dict[str, str]]) -> str:

        ans = ollama.chat(
            model='qwen2:1.5b',
            messages=messages
        )
        if ans.message.content is None:
            raise ValueError("No response from the model.")
        return ans.message.content
else:
    generation_pipeline = pipeline(
        "text-generation",
        model="Qwen/Qwen2-1.5B-Instruct"
    )

    def model_interface(messages: list[dict[str, str]]) -> str:
        ans = generation_pipeline(
            messages, max_new_tokens=512, temperature=0.1, top_p=0.7
        )
        return ans[0]["generated_text"][-1]["content"]
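
Instead of hard-coding the `OLLAMA_AVAILABLE` flag, one could probe whether a server is listening. The sketch below uses only the standard library and assumes Ollama's default address `http://localhost:11434`. The `ollama_reachable` function is hypothetical. It only checks that some HTTP server answers at that address, not that the `qwen2:1.5b` model has been pulled.

```python
import urllib.error
import urllib.request


def ollama_reachable(
    base_url: str = "http://localhost:11434", timeout: float = 1.0
) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error or malformed URL.
        return False


# Hypothetical usage: OLLAMA_AVAILABLE = ollama_reachable()
```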

The following cell wraps the whole procedure (retrieval, prompt construction, and generation) in a single function that returns only the system's response.

In [12]:
def generate(request: str) -> str:
    system_prompt = generate_system_prompt(
        load_relevant_reviews(request)
    )
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": request}
    ]
    return model_interface(messages)

The next cells show example responses from the system.

In [13]:
pprint(generate("What is the typical plot for a horror movie?"))

('The typical plot for a horror movie often involves a protagonist who becomes '
 'suspicious about an event, such as a death in their family or witnessing '
 'something disturbing, and then seeks out answers through unusual means. This '
 'leads them into situations that often involve dark secrets, supernatural '
 'occurrences, or unexpected adversaries. The suspense is built with '
 'increasing tension as the audience learns more about the mystery at hand.\n'
 '\n'
 'The plot may include elements of fear, dread, and psychological horror. '
 'Characters facing personal tragedies are often revealed to have been '
 'harboring some kind of secret, which leads them down a path of discovery, '
 'where they must confront their own fears and confront what has been kept '
 'hidden from them. The genre typically uses jump scares, suspenseful moments, '
 'and supernatural elements such as ghosts, vengeful spirits, or other '
 'entities that haunt people for reasons unknown.\n'
 '\n'
 'A horror 

It's difficult to tell what role the retrieved reviews played here, but the description is reasonable.
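
One way to get a rough sense of the contribution of retrieval is to ask the same question with and without the retrieved reviews and compare the answers by eye. The following is a minimal sketch. `compare_with_and_without_context` is a hypothetical helper that takes the model, retrieval and prompt functions as arguments so it can be tried with stubs.

```python
from collections.abc import Callable

Messages = list[dict[str, str]]


def compare_with_and_without_context(
    request: str,
    model: Callable[[Messages], str],
    retrieve: Callable[[str], list[str]],
    build_system_prompt: Callable[[list[str]], str],
) -> dict[str, str]:
    """Ask the same question twice: once with retrieved reviews, once without."""
    user_message = {"role": "user", "content": request}
    with_context = model([
        {"role": "system", "content": build_system_prompt(retrieve(request))},
        user_message,
    ])
    # The second call sends only the user message, with no system prompt.
    without_context = model([user_message])
    return {"with_context": with_context, "without_context": without_context}


# Hypothetical usage with the functions defined in this notebook:
# compare_with_and_without_context(
#     "What is the typical plot for a horror movie?",
#     model_interface, load_relevant_reviews, generate_system_prompt,
# )
```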

The next request is more specific.

In [14]:
pprint(generate("What are the best roles of Robert De Niro?"))

('Based on the reviews provided, some of the best roles that Robert De Niro '
 'has played include:\n'
 '\n'
 '1. Taxi Driver (1976): This role was groundbreaking and marked a career '
 'shift for De Niro as he transitioned from a comedy star to a serious drama '
 'actor.\n'
 '\n'
 '2. The King of Comedy (1982): Another iconic film, this role showcased De '
 "Niro's ability to embody a range of characters with his acting skills.\n"
 '\n'
 '3. Cape Fear (1991): De Niro delivered one of the greatest performances in '
 'cinematic history, portraying the title character as a man willing to use '
 'any means necessary to achieve what he wants.\n'
 '\n'
 "4. Taxi Driver: The film showcased De Niro's ability to deliver emotional "
 'depth and humor through his acting skills.\n'
 '\n'
 '5. The Untouchables (1987): De Niro won an Oscar for this role, showing off '
 'his range as a character actor in both comedy and serious drama roles.\n'
 '\n'
 "6. New York, New York: Although not the best fil

Robert De Niro played a role in most of the suggested titles.