LLMs#
LLMs are models designed primarily to predict text. The most advanced LLMs can simulate a wide range of linguistic behaviors. With the right configuration, they can be applied to many problems that are difficult to solve with traditional programming.
Chat templates#
Since LLMs are designed only to predict the next token based on the previous tokens, they are not able to behave in a chat-like pattern by default. To achieve this, models are fine-tuned to follow chat templates. A model that has gone through supervised fine-tuning, which trains it to follow the "query-response" pattern, usually carries an instruct prefix or suffix somewhere in its name/identifier.
A chat template is a set of rules for structuring model inputs: separating messages and specifying the role of the speaker for each message.
Special tokens usually mark the beginning and end of each message, and the role is specified using a defined syntax.
The chat templates of some popular LLMs are:
OpenAI ChatML

Common delimiters:

- `<|im_start|>` precedes each message; `<|im_end|>` closes it.
- Roles are explicitly labeled (e.g., `<|system|>`, `<|user|>`, `<|assistant|>`).
LLaMA-2 “INST” Format

- Prompts often use `[INST] ... [/INST]` markers, with optional system-specific wrappers like `<<SYS>> ... <</SYS>>`.
- These markers aren’t necessarily single tokens but are recognized by tokenizer logic.
LLaMA-3 / SentencePiece Chat Templates

- Conversations begin with `<|begin_of_text|>`.
- Role headers are wrapped using `<|start_header_id|>role<|end_header_id|>`.
- Each message ends with a special end-of-turn token: `<|eot_id|>`.
Hugging Face provides a playground where you can see how chat templates are applied for the different models available on the platform.
The tokenizers that come with the models in the `transformers` package apply chat templates to lists of messages using the `apply_chat_template` method. The following cell shows the result of applying the chat format to the SmolLM model:
from transformers import AutoTokenizer

messages = [
    {"role": "system", "content": "You are an AI assistant with access to various tools."},
    {"role": "user", "content": "Hi !"},
    {"role": "assistant", "content": "Hi human, what can help you with ?"},
]

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
rendered_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(rendered_prompt)
<|im_start|>system
You are an AI assistant with access to various tools.<|im_end|>
<|im_start|>user
Hi !<|im_end|>
<|im_start|>assistant
Hi human, what can help you with ?<|im_end|>
<|im_start|>assistant
Prompt engineering#
Prompt engineering is a set of approaches used to configure a text generation model to produce the exact results you’re interested in.
Check the Prompt Engineering guide provided by Google.
Generally, there are two concepts you need in order to make the model produce relevant output:
Configure the model by changing its parameters. Models implemented by different organizations have different configuration options, but most expose output length and sampling controls (a sketch of such controls follows this list).
Building a prompt. For this, there are the following techniques:
General prompting / zero shot.
One & few shot prompting.
System, contextual and role prompting.
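The following cell is a minimal sketch of how such configuration parameters are typically exposed, using the `generate` method of the `transformers` package and reusing the SmolLM2 model from the chat-template example above; the prompt and parameter values are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("The capital of GB is", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=30,   # output length control
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.7,     # randomness of token selection
    top_k=50,            # keep only the 50 most probable tokens
    top_p=0.9,           # nucleus sampling threshold
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```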
Token sampling#
At each step of the generation process, the model predicts the next token as a classification problem conditioned on the previous tokens. Thus, at each step, the model's predictions form a probability distribution over the next token:
\(\left(p_1, p_2, \ldots, p_n\right), \sum_{i=1}^n p_i =1\)
where \(n\) is the size of the model's vocabulary.
The temperature regulates the randomness of the selected tokens. A value of 0 results in deterministic model outputs; the higher the value, the more creative the model's output will be.
Top-k restricts the model to selecting the next token from among the \(k\) tokens with the highest probability.
Top-p considers the smallest possible set of tokens whose cumulative probability exceeds a predefined threshold, \(p\).
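The following cell sketches how these sampling rules could be applied to a raw next-token distribution; it is an illustration with a toy vocabulary, not the exact logic of any particular library.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    # Temperature: rescale logits; lower values sharpen the distribution.
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    # Top-p: keep the smallest set of tokens whose cumulative
    # probability reaches the threshold p.
    if top_p is not None:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        kept = np.zeros_like(probs)
        kept[keep] = probs[keep]
        probs = kept

    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

# Toy distribution over a vocabulary of 5 tokens.
print(sample_next_token([2.0, 1.0, 0.5, 0.1, -1.0], temperature=0.7, top_k=3, top_p=0.9))
```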
Prompting techniques#
There are different approaches to giving the model information about the structure of the required output:
Zero shot: a general prompting technique; simply query the model without providing any additional information.
One shot & few shot: provide one or several examples to show the model the structure of the output you expect from it (see the sketch below).
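For example, a few-shot prompt can be expressed as a message list in which a couple of worked examples show the model the expected output format; the task and labels below are made up for illustration.

```python
# A few-shot prompt: worked examples teach the model the expected output
# structure before the real query is asked.
few_shot_messages = [
    {"role": "system", "content": "Classify the sentiment of the review as POSITIVE or NEGATIVE."},
    {"role": "user", "content": "Review: The food was great!"},
    {"role": "assistant", "content": "POSITIVE"},
    {"role": "user", "content": "Review: The service was terribly slow."},
    {"role": "assistant", "content": "NEGATIVE"},
    {"role": "user", "content": "Review: I would definitely come back."},
]
```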
There are the following options for specifying the general patterns of the model's behaviour:
System prompting sets the overall context and purpose for the language model.
Contextual prompting provides specific details or background information relevant to the current conversation or task.
Role prompting assigns a specific character or identity to the language model.
Step-back prompting: Two prompts are provided: a specific prompt and a more general prompt. The more general prompt usually asks about typical approaches to the issue described in the specific prompt. The second prompt then includes the model's answer to the general prompt as context and the specific prompt as the task (see the sketch below).
Chain of Thought (CoT): A technique in which the model is asked to solve a task step by step. In the most basic implementation, the prompt literally asks the model to solve the problem “step by step”.
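The following cell is a rough sketch of the step-back pattern; `generate` is a hypothetical helper standing in for a call to any text-generation model.

```python
def step_back_prompting(generate, specific_prompt, general_prompt):
    # Step 1: ask the more general, "stepped back" question.
    general_answer = generate(general_prompt)

    # Step 2: reuse that answer as context for the specific task.
    combined_prompt = (
        f"Context:\n{general_answer}\n\n"
        f"Task:\n{specific_prompt}"
    )
    return generate(combined_prompt)
```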
OpenAI compatibility#
The principles behind OpenAI's API have become the industry standard. If an API is OpenAI compatible, it means you can communicate with it using the same format used by the OpenAI API.
For example, the following cell sends a request to an Ollama server launched on port 11434 with the pulled `llama3.2:1b` model.
import requests, json
ans = requests.post(
"http://localhost:11434/v1/chat/completions",
json={
"model": "llama3.2:1b",
"messages": [
{"role": "system", "content": "You're an assistant"},
{"role": "user", "content": "What is the capital of GB?"}
]
}
)
json.loads(ans.content)
{'id': 'chatcmpl-616',
'object': 'chat.completion',
'created': 1758182631,
'model': 'llama3.2:1b',
'system_fingerprint': 'fp_ollama',
'choices': [{'index': 0,
'message': {'role': 'assistant',
'content': 'The capital of the United Kingdom, which includes England, Scotland, Wales, and Northern Ireland, is London.'},
'finish_reason': 'stop'}],
'usage': {'prompt_tokens': 36, 'completion_tokens': 23, 'total_tokens': 59}}
The following cell shows a request sent to a model deployed with `llama.cpp`:
ans = requests.post(
"http://localhost:5893/v1/chat/completions",
json={
"messages": [
{"role": "system", "content": "You're an assistant"},
{"role": "user", "content": "What is the capital of GB?"}
]
}
)
json.loads(ans.content)
{'choices': [{'finish_reason': 'stop',
'index': 0,
'message': {'role': 'assistant',
'content': 'The capital of GB is London. 😊'}}],
'created': 1758182635,
'model': 'gpt-3.5-turbo',
'system_fingerprint': 'b6503-62c3b645',
'object': 'chat.completion',
'usage': {'completion_tokens': 9, 'prompt_tokens': 22, 'total_tokens': 31},
'id': 'chatcmpl-PGsByExDI96hgvHJ9SIGaAOWVmgmjaDe',
'timings': {'cache_n': 21,
'prompt_n': 1,
'prompt_ms': 35.968,
'prompt_per_token_ms': 35.968,
'prompt_per_second': 27.802491103202843,
'predicted_n': 9,
'predicted_ms': 232.494,
'predicted_per_token_ms': 25.832666666666668,
'predicted_per_second': 38.71067640455237}}
Note: The model is not specified for llama.cpp because the endpoint created for the example only serves one model.
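OpenAI compatibility also means the official `openai` Python client can talk to such servers by pointing `base_url` at them. A sketch against the same local Ollama endpoint used above:

```python
from openai import OpenAI

# Any non-empty API key works for most local servers; they usually ignore it.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[
        {"role": "system", "content": "You're an assistant"},
        {"role": "user", "content": "What is the capital of GB?"},
    ],
)
print(response.choices[0].message.content)
```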
RAG#
Retrieval-Augmented Generation (RAG) is an approach that provides an LLM with context relevant to a specific query. The general idea is to build a knowledge base in the form of a vector database, where the documents carrying the information to be added to the model's context are encoded as embeddings. When the system needs information, it searches for the embeddings closest to the query, retrieves the corresponding documents, and adds them as context for the model.
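A minimal retrieval sketch, assuming the `sentence-transformers` package with the `all-MiniLM-L6-v2` encoder (any embedding model would work) and a toy knowledge base:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Toy knowledge base already split into chunks.
chunks = [
    "The office is open from 9:00 to 18:00 on weekdays.",
    "Support tickets are answered within 24 hours.",
    "The cafeteria serves lunch between 12:00 and 14:00.",
]
chunk_embeddings = encoder.encode(chunks, normalize_embeddings=True)

query = "When can I get lunch?"
query_embedding = encoder.encode([query], normalize_embeddings=True)[0]

# Cosine similarity; embeddings are normalized, so a dot product is enough.
scores = chunk_embeddings @ query_embedding
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

# The retrieved chunks become additional context for the LLM.
prompt = "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {query}"
print(prompt)
```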
There are several topics related to RAG systems that need to be discussed:
Chunking: The process of splitting the documents of the knowledge base into chunks that can be used to prepare the embeddings.
Retrieval: A set of approaches and tools for collecting the relevant information contained in the chunks.
Quality estimation: Since a RAG system consists of several components, estimating its quality is a complex process.
Check more in the corresponding RAG page.
Agents#
AI agents are programs in which an AI model controls the workflow.
There is some typical terminology in the field of agentic frameworks:
Tools: provide the agent with the ability to execute actions a text-generation model cannot perform natively, such as making coffee or generating images.
Actions are the concrete steps an AI agent takes to interact with its environment.
Observations: The outputs of the tools that are used as context for the model.
There are different ways in which AI outputs can influence the workflow. These approaches are listed in the following table:
| Name | Description | Example code |
|---|---|---|
| Router | LLM output controls an if/else switch | |
| Tool call | LLM output controls function execution | |
| Multi-step Agent | LLM output controls iteration and program continuation | |
| Multi-Agent | One agentic workflow can start another workflow | |
| Code Agents | LLM acts in code, can define its own tools / start other agents | |
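As a rough illustration of the "Tool call" pattern from the table, the following sketch lets the model's output select which Python function to execute; the tools and the JSON output convention are made up for the example.

```python
import json

def get_weather(city: str) -> str:
    return f"It is sunny in {city}."  # stub tool

def get_time(city: str) -> str:
    return f"It is 12:00 in {city}."  # stub tool

TOOLS = {"get_weather": get_weather, "get_time": get_time}

# Suppose the model was prompted to reply with a JSON tool call and answered:
llm_output = '{"tool": "get_weather", "arguments": {"city": "London"}}'

call = json.loads(llm_output)
observation = TOOLS[call["tool"]](**call["arguments"])
print(observation)  # the observation is fed back to the model as context
```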
Check more details in the Agents page.
MCP#
MCP (Model Context Protocol) allows LLMs to connect with external data sources, tools, and environments. This enables the seamless transfer of information and capabilities between AI systems and other digital tools. MCP standardizes the way of interacting with LLMs, enabling tool builders to create a tool once and have it work with any LLM.
Understanding the architecture of MCP, where each component has a clearly defined role, makes it easier to build LLM-based applications. There are 3 components in a system that uses MCP:
Host: the application that is supposed to interact with the LLM.
Client: the component of the host that establishes a connection with the MCP server.
Server: an external process that exposes capabilities to the LLM through the MCP protocol.
With MCP, you can build whatever integration you want, but there are typical use cases, which are called capabilities:
Tools: Executable functions that the model can use to perform actions.
Resources: Read-only sources of information.
Prompts: Pre-defined prompt templates that define the way the host provides the integration with the LLM.
Sampling: Server-initiated requests for the Client/Host to perform LLM interactions, enabling recursive actions where the LLM can review generated content and make further decisions.
For messaging, MCP uses JSON-RPC.
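As an illustration, a JSON-RPC 2.0 request that a client could send to invoke a tool on an MCP server might look roughly like this; the tool name and arguments are made up.

```python
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",   # MCP method for invoking a server-side tool
    "params": {
        "name": "get_weather",
        "arguments": {"city": "London"},
    },
}
print(json.dumps(request, indent=2))
```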
For more details check:
Model Context Protocol website.
MCP course on Hugging Face.
MCP page.