LLMs#
LLMs are models designed primarily to predict text. The most advanced LLMs can simulate a wide range of linguistic behaviors. With the right configuration, they can be applied to many problems that are difficult to solve with traditional programming.
Chat templates#
Since LLMs are designed only to predict the next token based on the previous tokens, they are not able to behave in a chat-like pattern by default. To achieve this, models are fine-tuned to follow chat templates. A model that has gone through supervised fine-tuning, which trains it to follow the "query-response" pattern, usually carries an instruct prefix or suffix somewhere in its name/identifier.
A chat template is a set of rules for structuring model inputs: separating messages and specifying the role of the speaker for each message.
Special tokens usually mark the beginning and end of each message, and the role is specified using a defined syntax.
The chat templates of some popular LLMs are:
OpenAI ChatML

Common delimiters:

- `<|im_start|>` precedes each message; `<|im_end|>` closes it.
- Roles are explicitly labeled (e.g., `<|system|>`, `<|user|>`, `<|assistant|>`).
LLaMA-2 “INST” Format

- Prompts often use `[INST] ... [/INST]` markers, with optional system-specific wrappers like `<<SYS>> ... <</SYS>>`.
- These markers aren’t necessarily single tokens but are recognized by tokenizer logic.
LLaMA-3 / SentencePiece Chat Templates

- Conversations begin with `<|begin_of_text|>`.
- Role headers are wrapped using `<|start_header_id|>role<|end_header_id|>`.
- Each message ends with a special end-of-turn token: `<|eot_id|>`.
Hugging Face provides a playground where you can see how chat templates are applied for the different models available on the platform.
The tokenizers that come with the models in the `transformers` package apply chat templates to lists of messages using the `apply_chat_template` method. The following cell shows the result of applying the chat format to the SmolLM model:
from transformers import AutoTokenizer

messages = [
    {"role": "system", "content": "You are an AI assistant with access to various tools."},
    {"role": "user", "content": "Hi !"},
    {"role": "assistant", "content": "Hi human, what can help you with ?"},
]

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
rendered_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(rendered_prompt)
<|im_start|>system
You are an AI assistant with access to various tools.<|im_end|>
<|im_start|>user
Hi !<|im_end|>
<|im_start|>assistant
Hi human, what can help you with ?<|im_end|>
<|im_start|>assistant
Prompt engineering#
Prompt engineering is a set of approaches used to configure a text generation model to produce the exact results you’re interested in.
Check the Prompt Engineering guide provided by Google.
Generally, there are two concepts you need in order to make the model produce relevant output:
Configure the model by changing its parameters. Models implemented by different organizations have different configuration options, but most expose output length and sampling controls (a sketch of such controls follows this list).
Building a prompt. For this, there are the following techniques:
General prompting / zero shot.
One & few shot prompting.
System, contextual and role prompting.
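The following cell is a minimal sketch of how such configuration parameters are typically exposed, using the `generate` method of the `transformers` package and reusing the SmolLM2 model from the chat-template example above; the prompt and parameter values are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("The capital of GB is", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=30,   # output length control
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.7,     # randomness of token selection
    top_k=50,            # keep only the 50 most probable tokens
    top_p=0.9,           # nucleus sampling threshold
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```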
Token sampling#
At each step of the generation process, the model predicts the next token as a classification problem conditioned on the previous tokens. Thus, at each step, the model's predictions form a probability distribution over the next token:
\(\left(p_1, p_2, \ldots, p_n\right), \sum_{i=1}^n p_i =1\)
where \(n\) is the size of the model's vocabulary.
The temperature regulates the randomness of the selected tokens. A value of 0 results in deterministic model outputs; the higher the value, the more creative the model's output will be.
Top-k restricts the model to selecting the next token from among the \(k\) tokens with the highest probability.
Top-p considers the smallest possible set of tokens whose cumulative probability exceeds a predefined threshold, \(p\).
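The following cell sketches how these sampling rules could be applied to a raw next-token distribution; it is an illustration with a toy vocabulary, not the exact logic of any particular library.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    # Temperature: rescale logits; lower values sharpen the distribution.
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    # Top-p: keep the smallest set of tokens whose cumulative
    # probability reaches the threshold p.
    if top_p is not None:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        kept = np.zeros_like(probs)
        kept[keep] = probs[keep]
        probs = kept

    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

# Toy distribution over a vocabulary of 5 tokens.
print(sample_next_token([2.0, 1.0, 0.5, 0.1, -1.0], temperature=0.7, top_k=3, top_p=0.9))
```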
Prompting techniques#
There are different approaches to giving the model information about the structure of the required output:
Zero shot: a general prompting technique; simply query the model without providing any additional information.
One shot & few shot: provide one or several examples to show the model the structure of the output you expect from it (see the sketch below).
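For example, a few-shot prompt can be expressed as a message list in which a couple of worked examples show the model the expected output format; the task and labels below are made up for illustration.

```python
# A few-shot prompt: worked examples teach the model the expected output
# structure before the real query is asked.
few_shot_messages = [
    {"role": "system", "content": "Classify the sentiment of the review as POSITIVE or NEGATIVE."},
    {"role": "user", "content": "Review: The food was great!"},
    {"role": "assistant", "content": "POSITIVE"},
    {"role": "user", "content": "Review: The service was terribly slow."},
    {"role": "assistant", "content": "NEGATIVE"},
    {"role": "user", "content": "Review: I would definitely come back."},
]
```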
There are the following options for specifying the general patterns of the model's behaviour:
System prompting sets the overall context and purpose for the language model.
Contextual prompting provides specific details or background information relevant to the current conversation or task.
Role prompting assigns a specific character or identity to the language model.
Step-back prompting: Two prompts are provided: a specific prompt and a more general prompt. The more general prompt usually asks about typical approaches to the issue described in the specific prompt. The second prompt then includes the model's answer to the general prompt as context and the specific prompt as the task (see the sketch below).
Chain of Thought (CoT): A technique in which the model is asked to solve a task step by step. In the most basic implementation, the prompt literally asks the model to solve the problem “step by step”.
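The following cell is a rough sketch of the step-back pattern; `generate` is a hypothetical helper standing in for a call to any text-generation model.

```python
def step_back_prompting(generate, specific_prompt, general_prompt):
    # Step 1: ask the more general, "stepped back" question.
    general_answer = generate(general_prompt)

    # Step 2: reuse that answer as context for the specific task.
    combined_prompt = (
        f"Context:\n{general_answer}\n\n"
        f"Task:\n{specific_prompt}"
    )
    return generate(combined_prompt)
```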
OpenAI compatibility#
The principles behind OpenAI's API have become the industry standard. If an API is OpenAI compatible, it means you can communicate with it using the same format used by the OpenAI API.
For example, the following cell sends a request to an Ollama server launched on port 11434 with the pulled `llama3.2:1b` model.
import requests, json
ans = requests.post(
"http://localhost:11434/v1/chat/completions",
json={
"model": "llama3.2:1b",
"messages": [
{"role": "system", "content": "You're an assistant"},
{"role": "user", "content": "What is the capital of GB?"}
]
}
)
json.loads(ans.content)
{'id': 'chatcmpl-616',
'object': 'chat.completion',
'created': 1758182631,
'model': 'llama3.2:1b',
'system_fingerprint': 'fp_ollama',
'choices': [{'index': 0,
'message': {'role': 'assistant',
'content': 'The capital of the United Kingdom, which includes England, Scotland, Wales, and Northern Ireland, is London.'},
'finish_reason': 'stop'}],
'usage': {'prompt_tokens': 36, 'completion_tokens': 23, 'total_tokens': 59}}
The following cell shows a request sent to a model deployed with `llama.cpp`:
ans = requests.post(
"http://localhost:5893/v1/chat/completions",
json={
"messages": [
{"role": "system", "content": "You're an assistant"},
{"role": "user", "content": "What is the capital of GB?"}
]
}
)
json.loads(ans.content)
{'choices': [{'finish_reason': 'stop',
'index': 0,
'message': {'role': 'assistant',
'content': 'The capital of GB is London. 😊'}}],
'created': 1758182635,
'model': 'gpt-3.5-turbo',
'system_fingerprint': 'b6503-62c3b645',
'object': 'chat.completion',
'usage': {'completion_tokens': 9, 'prompt_tokens': 22, 'total_tokens': 31},
'id': 'chatcmpl-PGsByExDI96hgvHJ9SIGaAOWVmgmjaDe',
'timings': {'cache_n': 21,
'prompt_n': 1,
'prompt_ms': 35.968,
'prompt_per_token_ms': 35.968,
'prompt_per_second': 27.802491103202843,
'predicted_n': 9,
'predicted_ms': 232.494,
'predicted_per_token_ms': 25.832666666666668,
'predicted_per_second': 38.71067640455237}}
Note: The model is not specified for llama.cpp because the endpoint created for the example only serves one model.
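OpenAI compatibility also means the official `openai` Python client can talk to such servers by pointing `base_url` at them. A sketch against the same local Ollama endpoint used above:

```python
from openai import OpenAI

# Any non-empty API key works for most local servers; they usually ignore it.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[
        {"role": "system", "content": "You're an assistant"},
        {"role": "user", "content": "What is the capital of GB?"},
    ],
)
print(response.choices[0].message.content)
```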
RAG#
Retrieval-Augmented Generation (RAG) is an approach that provides an LLM with context relevant to a specific query. The general idea is to build a knowledge base in the form of a vector database, where the documents carrying the information to be added to the model's context are encoded as embeddings. When the system needs information, it searches for the embeddings closest to the query, retrieves the corresponding documents, and adds them as context for the model.
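A minimal retrieval sketch, assuming the `sentence-transformers` package with the `all-MiniLM-L6-v2` encoder (any embedding model would work) and a toy knowledge base:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Toy knowledge base already split into chunks.
chunks = [
    "The office is open from 9:00 to 18:00 on weekdays.",
    "Support tickets are answered within 24 hours.",
    "The cafeteria serves lunch between 12:00 and 14:00.",
]
chunk_embeddings = encoder.encode(chunks, normalize_embeddings=True)

query = "When can I get lunch?"
query_embedding = encoder.encode([query], normalize_embeddings=True)[0]

# Cosine similarity; embeddings are normalized, so a dot product is enough.
scores = chunk_embeddings @ query_embedding
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

# The retrieved chunks become additional context for the LLM.
prompt = "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {query}"
print(prompt)
```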
There are several topics related to RAG systems that need to be discussed:
Chunking: The process of splitting the documents of the knowledge base into chunks that can be used to prepare the embeddings.
Retrieval: A set of approaches and tools for collecting the relevant information contained in the chunks.
Quality estimation: Since a RAG system consists of several components, estimating its quality is a complex process.
Check more in the corresponding RAG page.
Agents#
AI agents are programs in which an AI model controls the workflow.
There is some typical terminology in the field of agentic frameworks:
Tools: provide the agent with the ability to execute actions a text-generation model cannot perform natively, such as making coffee or generating images.
Actions are the concrete steps an AI agent takes to interact with its environment.
Observations: The outputs of the tools that are used as context for the model.
There are different ways in which AI outputs can influence the workflow. These approaches are listed in the following table:
| Name | Description | Example code |
|---|---|---|
| Router | LLM output controls an if/else switch | |
| Tool call | LLM output controls function execution | |
| Multi-step Agent | LLM output controls iteration and program continuation | |
| Multi-Agent | One agentic workflow can start another workflow | |
| Code Agents | LLM acts in code, can define its own tools / start other agents | |
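As a rough illustration of the "Tool call" pattern from the table, the following sketch lets the model's output select which Python function to execute; the tools and the JSON output convention are made up for the example.

```python
import json

def get_weather(city: str) -> str:
    return f"It is sunny in {city}."  # stub tool

def get_time(city: str) -> str:
    return f"It is 12:00 in {city}."  # stub tool

TOOLS = {"get_weather": get_weather, "get_time": get_time}

# Suppose the model was prompted to reply with a JSON tool call and answered:
llm_output = '{"tool": "get_weather", "arguments": {"city": "London"}}'

call = json.loads(llm_output)
observation = TOOLS[call["tool"]](**call["arguments"])
print(observation)  # the observation is fed back to the model as context
```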
Check more details in the Agents page.
MCP#
MCP (Model Context Protocol) allows LLMs to connect with external data sources, tools, and environments. This enables the seamless transfer of information and capabilities between AI systems and other digital tools. MCP standardizes the way of interacting with LLMs, enabling tool builders to create a tool once and have it work with any LLM.
Understanding the architecture of MCP, where each component has a clearly defined role, makes it easier to build LLM-based applications. There are 3 components in a system that uses MCP:
Host: the application that is supposed to interact with the LLM.
Client: the component of the host that establishes a connection with the MCP server.
Server: an external process that exposes capabilities to the LLM through the MCP protocol.
With MCP, you can build whatever integration you want, but there are typical use cases, which are called capabilities:
Tools: Executable functions that the model can use to perform actions.
Resources: Read-only sources of information.
Prompts: Pre-defined prompt templates that define the way the host provides the integration with the LLM.
Sampling: Server-initiated requests for the Client/Host to perform LLM interactions, enabling recursive actions where the LLM can review generated content and make further decisions.
For messaging, MCP uses JSON-RPC.
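As an illustration, a JSON-RPC 2.0 request that a client could send to invoke a tool on an MCP server might look roughly like this; the tool name and arguments are made up.

```python
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",   # MCP method for invoking a server-side tool
    "params": {
        "name": "get_weather",
        "arguments": {"city": "London"},
    },
}
print(json.dumps(request, indent=2))
```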
For more details check:
Model Context Protocol website.
MCP course on Hugging Face.
MCP page.