Hugging Face#
This page discusses various aspects of the Hugging Face infrastructure.
Hub#
The huggingface_hub
package allows you to interact with the Hugging Face infrastructure, which offers tools for versioning and running inference on machine learning models.
Functionality | Usage Reference | Notes
---|---|---
Authentication | login() | Log in with your token.
 | whoami() | Check current user.
Repo Management | create_repo() | Create repo for models/datasets/spaces.
 | delete_repo() | Delete repo.
File Upload / Download | upload_file() | Upload a single file.
 | hf_hub_download() | Download a single file.
Snapshot Download | snapshot_download() | Download entire repo (cached locally).
Commits & Revisions | create_commit() | Commit files with git-like semantics.
 | CommitOperationAdd | Add file in a commit.
Search / Listing | list_models() | Search models.
 | list_datasets() | Search datasets.
Model / Dataset Info | model_info() | Get model metadata.
 | dataset_info() | Get dataset details.
Inference API | InferenceClient | Run inference via HF servers.
Spaces Management | restart_space() | Restart a Space.
 | request_space_hardware() | Request hardware upgrade.
Utilities | hf_hub_url() | Get raw file URL.
 | scan_cache_dir() | Inspect local HF cache.
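As a quick illustration, the following cell is a minimal sketch using two of these calls, hf_hub_download and list_models; the repository ID and the search query are just examples of public resources.

from huggingface_hub import hf_hub_download, list_models

# Download a single file from a public model repo; the result is a local cached path.
config_path = hf_hub_download(repo_id="bert-base-uncased", filename="config.json")
print(config_path)

# List a few models matching a search query.
for model in list_models(search="bert", limit=3):
    print(model.id)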
Check more:
- The Hub Python library page.
- The Hub page for a more detailed description.
Transformers#
transformers
is a Python package that allows you to use pre-trained machine learning models based on the transformer architecture.
Component | Description
---|---
Models | Pretrained architectures for tasks like classification, generation, or embeddings.
Tokenizers | Convert text into numerical input for models; handle batching, padding, truncation.
Pipelines | High-level API combining tokenizer + model for a specific task (e.g., pipeline("sentiment-analysis")).
Configurations | Define model hyperparameters and architecture settings (e.g., BertConfig).
Trainer | High-level training API handling loops, evaluation, logging, and checkpointing.
Schedulers & Optimizers | Learning rate schedulers and optimizer integrations for training models.
Data Utilities | Helpers for preprocessing and batching (e.g., DataCollatorWithPadding).
Hub Integration | Download/upload pretrained models from the Hugging Face Hub (from_pretrained, push_to_hub).
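The following cell is a minimal sketch of the pipeline component described above; the default model for the task is chosen and downloaded by the library, so exact scores may differ.

from transformers import pipeline

# A pipeline bundles a tokenizer and a model behind one call for a given task.
classifier = pipeline("sentiment-analysis")

# The result is a list of dicts with a predicted label and a confidence score.
print(classifier("Hugging Face makes sharing models easy!"))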
Check more details on the transformers page.
Datasets#
The datasets package from the Hugging Face infrastructure manages data and implements tools for loading and processing data, regardless of its modality.
You can define your own datasets, as well as load ready-made datasets from the datasets section of the Hugging Face Hub.
The core elements of the package are:
- Primitives to keep data:
  - Dataset is a map of features, each of which is an array of elements.
  - DatasetDict is a map of datasets.
- load_dataset: method for loading datasets.
- Modalities: the package is built to process all kinds of data, so it implements subpackages:
  - Audio: for working with audio data.
  - Image: for working with image data.
  - Text: for working with text data.
  - Video: for working with video data.
- Processing methods (see the sketch after this list):
  - map: method for applying a function to each element of a dataset.
  - filter: method for filtering a dataset.
  - train_test_split: method for splitting a dataset into training and testing sets.
  - sort: method for sorting a dataset.
  - shuffle: method for shuffling a dataset.
- Other utilities:
  - arrow_writer: class for writing data to an Arrow file.
  - Csv: class for working with CSV files.
  - Json: class for working with JSON files.
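The following cell is a minimal sketch of the processing methods on a toy in-memory dataset; the data here is made up purely for illustration.

from datasets import Dataset

# A tiny in-memory dataset, just to illustrate the processing methods.
ds = Dataset.from_dict({"text": ["good", "bad", "great", "awful"], "label": [1, 0, 1, 0]})

# map: apply a function to each element, here adding a new feature.
ds = ds.map(lambda example: {"text_upper": example["text"].upper()})

# filter: keep only the positive examples.
positives = ds.filter(lambda example: example["label"] == 1)

# train_test_split: split the dataset into train and test subsets.
splits = ds.train_test_split(test_size=0.5, seed=0)
print(splits)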
Consider the type of object you encounter when you start using the datasets
package.
from datasets import load_dataset
The following cell loads the lhoestq/demo1
dataset.
demo_dataset = load_dataset("lhoestq/demo1")
demo_dataset
DatasetDict({
train: Dataset({
features: ['id', 'package_name', 'review', 'date', 'star', 'version_id'],
num_rows: 5
})
test: Dataset({
features: ['id', 'package_name', 'review', 'date', 'star', 'version_id'],
num_rows: 5
})
})
At the top level, the data is split into train/test sections using the DatasetDict
abstraction. Access a specific section of the data by using the corresponding key.
demo_dataset['train']
Dataset({
features: ['id', 'package_name', 'review', 'date', 'star', 'version_id'],
num_rows: 5
})
Consider the following example with an audio-domain dataset:
dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
dataset
Dataset({
features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
num_rows: 563
})
It isn't separated into test/train; load_dataset
returns a Dataset directly. The following cell shows the features available in the dataset under consideration.
dataset.features
{'path': Value('string'),
'audio': Audio(sampling_rate=8000, decode=True, stream_index=None),
'transcription': Value('string'),
'english_transcription': Value('string'),
'intent_class': ClassLabel(names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan', 'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill']),
'lang_id': ClassLabel(names=['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN'])}
The audio
feature has the Audio
datatype due to the specifics of the dataset.
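As a small sketch of how typed features behave when you index the dataset (exact values depend on the dataset): each example is a dict keyed by the feature names listed above, and ClassLabel provides the int2str helper that maps a stored integer back to its label name.

# Each example is a dict keyed by the feature names listed above.
example = dataset[0]
print(example["transcription"])

# intent_class is stored as an integer; ClassLabel.int2str maps it back to the label name.
print(dataset.features["intent_class"].int2str(example["intent_class"]))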
Tokenizers#
A package that implements different tokenization approaches and related tools.
Component | Description
---|---
PreTokenizers | Split text into initial units (words, punctuation, subwords) before encoding.
Models | Define the algorithm for tokenization (BPE, WordPiece, SentencePiece, Unigram).
Normalizers | Clean and standardize text (lowercasing, accent stripping, punctuation handling).
Trainers | Learn tokenization vocabulary from a dataset.
Decoders | Convert token IDs back to readable text.
Processors | Post-process tokenized output (e.g., adding special tokens like [CLS] and [SEP]).
Batch Encoding | Handle batch tokenization with padding, truncation, and attention masks.
Check:
- The documentation of the package.
Consider the most essential components that are typically used with the tokenizers
package.
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from tokenizers.models import BPE
from tokenizers import Tokenizer
Pre-tokenization is the initial separation of the text into smaller units; the final tokens never cross the boundaries of these units, so pre-tokenization sets an upper bound on how large a token can be. The main tokenization algorithm is usually statistical, so its output is not restricted by any strict rule; a pre-tokenizer, in contrast, uses a deterministic algorithm, so its results are predictable. The following cell shows the application of the simplest whitespace pre-tokenizer to a short example text.
pretokenizer = Whitespace()
pretokenizer.pre_tokenize_str("Some test text")
[('Some', (0, 4)), ('test', (5, 9)), ('text', (10, 14))]
The tokenizers.Tokenizer
class is the tool for interacting with a tokenizer. It takes a model that defines the exact tokenization approach.
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pretokenizer
The trainer class is another component of the system; it defines the training parameters, such as the vocabulary size. The following cell shows the training of the tokenizer defined earlier.
trainer = BpeTrainer(vocab_size=20)
tokenizer.train_from_iterator(
[
"some super check",
"super some check"
],
trainer
)
The following cell shows the vocabulary of the final tokenizer. Each token has an ID that will be used after tokenization.
tokenizer.get_vocab()
{'some': 18,
'per': 16,
'ck': 11,
'h': 2,
'u': 9,
'me': 14,
'o': 5,
'er': 12,
'p': 6,
'r': 7,
'c': 0,
'k': 3,
'ch': 10,
's': 8,
'e': 1,
'check': 19,
'ome': 15,
'm': 4,
'su': 17,
'eck': 13}
Here is the result of tokenizing a particular input. Note that characters that did not appear in the training data (here 't' and 'a') are simply dropped from the output.
tokenizer.encode("start some check").tokens
['s', 'r', 'some', 'check']
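The encoding also carries the numeric IDs from the vocabulary above, and the tokenizer can map IDs back to text. The following cell is a short sketch of this round trip.

encoding = tokenizer.encode("some check")

# IDs correspond to the entries of the vocabulary printed above.
print(encoding.ids)

# decode maps a sequence of IDs back to a readable string.
print(tokenizer.decode(encoding.ids))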
Smolagents#
The smolagents
package provides tooling for building agents. The following table lists the core components that you're supposed to use:
Component | Description
---|---
CodeAgent | A class that encapsulates the agent's logic.
Tool | A base class for tools. There are some built-in tools, such as DuckDuckGoSearchTool.
tool | A decorator that wraps functions that are supposed to be used as tools.
Model | A base class for the interfaces that provide different ways to access the decision-making model, such as InferenceClientModel.
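As a sketch of subclassing the Tool base class directly (the attribute names follow the smolagents convention, but treat the exact schema as an assumption and check the package documentation), a simple text-reversal tool could look like this:

from smolagents import Tool

class ReverseTextTool(Tool):
    # Metadata the agent uses to decide when and how to call the tool.
    name = "reverse_text"
    description = "Reverses the given text."
    inputs = {"text": {"type": "string", "description": "The input text to reverse."}}
    output_type = "string"

    def forward(self, text: str) -> str:
        # The actual implementation of the tool.
        return text[::-1]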
Consider the following example. Suppose you want to enable the model to perform a very specific transformation with which it is unfamiliar. In this case, we'll consider the "Kobak transformation".
The following cell asks the raw model to perform the "Kobak transformation".
from huggingface_hub import InferenceClient
client = InferenceClient(model="Qwen/Qwen2.5-Coder-32B-Instruct")
output = client.chat_completion(
[{
"role": "user",
"content": "Apply a Kobak transformation on the text 'Hello, World!'"
}]
)
print(output.choices[0].message['content'])
The Kobak transformation is not a standard text transformation technique that I'm familiar with. It's possible that you might be referring to a specific context or a custom transformation method. Could you please provide more details or clarify the definition of the Kobak transformation? This will help me give you an accurate and helpful response.
The model indicates that it is not familiar with the required transformation. We can implement the transformation in Python and provide it to the model as a tool:
from smolagents import CodeAgent, InferenceClientModel, tool
@tool
def kobak_transformation(text: str) -> str:
"""A tool that fetches the current local time in a specified timezone.
Args:
text: The input text to be transformed.
"""
return text[::-1]
model = InferenceClientModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")
agent = CodeAgent(tools=[kobak_transformation], model=model)
result = agent.run("Apply a Kobak transformation on the text 'Hello, World!'")
print(result)
╭──────────────────────────────────────────────────── New run ────────────────────────────────────────────────────╮ │ │ │ Perforam a kobak transformation on the text 'Hello, World!' │ │ │ ╰─ InferenceClientModel - Qwen/Qwen2.5-Coder-32B-Instruct ────────────────────────────────────────────────────────╯
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 1 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
─ Executing parsed code: ──────────────────────────────────────────────────────────────────────────────────────── transformed_text = kobak_transformation(text='Hello, World!') print(transformed_text) ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Execution logs:
!dlroW ,olleH
Out: None
[Step 1: Duration 0.96 seconds| Input tokens: 1,994 | Output tokens: 49]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
─ Executing parsed code: ──────────────────────────────────────────────────────────────────────────────────────── final_answer("!dlroW ,olleH") ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Final answer: !dlroW ,olleH
[Step 2: Duration 1.19 seconds| Input tokens: 4,125 | Output tokens: 120]
!dlroW ,olleH
The output shows the "thoughts" of the model, and the final answer (the reversed string) corresponds to the idea of the "Kobak transformation".