Hugging Face#
This page discusses various aspects of the Hugging Face infrastructure.
Hub#
The huggingface_hub
package allows you to interact with the Hugging Face infrastructure, which offers tools for versioning and running inference on machine learning models.
Functionality | Usage Reference | Notes
---|---|---
Authentication | login() | Log in with your token.
 | whoami() | Check current user.
Repo Management | create_repo() | Create repo for models/datasets/spaces.
 | delete_repo() | Delete repo.
File Upload / Download | upload_file() | Upload a single file.
 | hf_hub_download() | Download a single file.
Snapshot Download | snapshot_download() | Download entire repo (cached locally).
Commits & Revisions | create_commit() | Commit files with git-like semantics.
 | CommitOperationAdd | Add file in a commit.
Search / Listing | list_models() | Search models.
 | list_datasets() | Search datasets.
Model / Dataset Info | model_info() | Get model metadata.
 | dataset_info() | Get dataset details.
Inference API | InferenceClient | Run inference via HF servers.
Spaces Management | restart_space() | Restart a Space.
 | request_space_hardware() | Request hardware upgrade.
Utilities | hf_hub_url() | Get raw file URL.
 | scan_cache_dir() | Inspect local HF cache.
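As a quick illustration, the following cell is a minimal sketch using two of these calls, hf_hub_download and list_models; the repository ID and the search query are just examples of public resources.

from huggingface_hub import hf_hub_download, list_models

# Download a single file from a public model repo; the result is a local cached path.
config_path = hf_hub_download(repo_id="bert-base-uncased", filename="config.json")
print(config_path)

# List a few models matching a search query.
for model in list_models(search="bert", limit=3):
    print(model.id)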
Check more:
- The Hub Python library page.
- The Hub page for a more detailed description.
Transformers#
transformers
is a Python package that allows you to use pre-trained machine learning models based on the transformer architecture.
Component | Description
---|---
Models | Pretrained architectures for tasks like classification, generation, or embeddings.
Tokenizers | Convert text into numerical input for models; handle batching, padding, truncation.
Pipelines | High-level API combining tokenizer + model for a specific task (e.g., pipeline("sentiment-analysis")).
Configurations | Define model hyperparameters and architecture settings (e.g., BertConfig).
Trainer | High-level training API handling loops, evaluation, logging, and checkpointing.
Schedulers & Optimizers | Learning rate schedulers and optimizer integrations for training models.
Data Utilities | Helpers for preprocessing and batching (e.g., DataCollatorWithPadding).
Hub Integration | Download/upload pretrained models from the Hugging Face Hub (from_pretrained, push_to_hub).
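The following cell is a minimal sketch of the pipeline component described above; the default model for the task is chosen and downloaded by the library, so exact scores may differ.

from transformers import pipeline

# A pipeline bundles a tokenizer and a model behind one call for a given task.
classifier = pipeline("sentiment-analysis")

# The result is a list of dicts with a predicted label and a confidence score.
print(classifier("Hugging Face makes sharing models easy!"))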
Check more details on the transformers page.
Datasets#
The datasets package from the Hugging Face infrastructure manages data and implements tools for loading and processing data, regardless of its modality.
You can define your own datasets, as well as load ready-made datasets from the datasets section of the Hugging Face Hub.
The core elements of the package are:
- Primitives to keep data:
  - Dataset is a map of features, each of which is an array of elements.
  - DatasetDict is a map of datasets.
- load_dataset: method for loading datasets.
- Modalities: the package is built to process all kinds of data, so it implements subpackages:
  - Audio: for working with audio data.
  - Image: for working with image data.
  - Text: for working with text data.
  - Video: for working with video data.
- Processing methods (see the sketch after this list):
  - map: method for applying a function to each element of a dataset.
  - filter: method for filtering a dataset.
  - train_test_split: method for splitting a dataset into training and testing sets.
  - sort: method for sorting a dataset.
  - shuffle: method for shuffling a dataset.
- Other utilities:
  - arrow_writer: class for writing data to an Arrow file.
  - Csv: class for working with CSV files.
  - Json: class for working with JSON files.
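The following cell is a minimal sketch of the processing methods on a toy in-memory dataset; the data here is made up purely for illustration.

from datasets import Dataset

# A tiny in-memory dataset, just to illustrate the processing methods.
ds = Dataset.from_dict({"text": ["good", "bad", "great", "awful"], "label": [1, 0, 1, 0]})

# map: apply a function to each element, here adding a new feature.
ds = ds.map(lambda example: {"text_upper": example["text"].upper()})

# filter: keep only the positive examples.
positives = ds.filter(lambda example: example["label"] == 1)

# train_test_split: split the dataset into train and test subsets.
splits = ds.train_test_split(test_size=0.5, seed=0)
print(splits)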
Consider the type of object you encounter when you start using the datasets
package.
from datasets import load_dataset
The following cell loads the lhoestq/demo1
dataset.
demo_dataset = load_dataset("lhoestq/demo1")
demo_dataset
DatasetDict({
train: Dataset({
features: ['id', 'package_name', 'review', 'date', 'star', 'version_id'],
num_rows: 5
})
test: Dataset({
features: ['id', 'package_name', 'review', 'date', 'star', 'version_id'],
num_rows: 5
})
})
At the top level, the data is split into train/test sections using the DatasetDict
abstraction. Access a specific section of the data by using the corresponding key.
demo_dataset['train']
Dataset({
features: ['id', 'package_name', 'review', 'date', 'star', 'version_id'],
num_rows: 5
})
Consider the following example with an audio-domain dataset:
dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
dataset
Dataset({
features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
num_rows: 563
})
It isn't separated into test/train; load_dataset
returns a Dataset directly. The following cell shows the features available in the dataset under consideration.
dataset.features
{'path': Value('string'),
'audio': Audio(sampling_rate=8000, decode=True, stream_index=None),
'transcription': Value('string'),
'english_transcription': Value('string'),
'intent_class': ClassLabel(names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan', 'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill']),
'lang_id': ClassLabel(names=['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN'])}
The audio
feature has the Audio
datatype due to the specifics of the dataset.
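As a small sketch of how typed features behave when you index the dataset (exact values depend on the dataset): each example is a dict keyed by the feature names listed above, and ClassLabel provides the int2str helper that maps a stored integer back to its label name.

# Each example is a dict keyed by the feature names listed above.
example = dataset[0]
print(example["transcription"])

# intent_class is stored as an integer; ClassLabel.int2str maps it back to the label name.
print(dataset.features["intent_class"].int2str(example["intent_class"]))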
Tokenizers#
A package that implements different tokenization approaches and related tools.
Component | Description
---|---
PreTokenizers | Split text into initial units (words, punctuation, subwords) before encoding.
Models | Define the algorithm for tokenization (BPE, WordPiece, SentencePiece, Unigram).
Normalizers | Clean and standardize text (lowercasing, accent stripping, punctuation handling).
Trainers | Learn tokenization vocabulary from a dataset.
Decoders | Convert token IDs back to readable text.
Processors | Post-process tokenized output (e.g., adding special tokens like [CLS] and [SEP]).
Batch Encoding | Handle batch tokenization with padding, truncation, and attention masks.
Check:
- The documentation of the package.
Consider the most essential components that are typically used with the tokenizers
package.
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from tokenizers.models import BPE
from tokenizers import Tokenizer
Pre-tokenization is the initial separation of the text into smaller units; the final tokens never cross the boundaries of these units, so pre-tokenization sets an upper bound on how large a token can be. The main tokenization algorithm is usually statistical, so its output is not restricted by any strict rule; a pre-tokenizer, in contrast, uses a deterministic algorithm, so its results are predictable. The following cell shows the application of the simplest whitespace pre-tokenizer to a short example text.
pretokenizer = Whitespace()
pretokenizer.pre_tokenize_str("Some test text")
[('Some', (0, 4)), ('test', (5, 9)), ('text', (10, 14))]
The tokenizers.Tokenizer
class is the tool for interacting with a tokenizer. It takes a model that defines the exact tokenization approach.
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pretokenizer
The trainer class is another component of the system; it defines the training parameters, such as the vocabulary size. The following cell shows the training of the tokenizer defined earlier.
trainer = BpeTrainer(vocab_size=20)
tokenizer.train_from_iterator(
[
"some super check",
"super some check"
],
trainer
)
The following cell shows the vocabulary of the final tokenizer. Each token has an ID that will be used after tokenization.
tokenizer.get_vocab()
{'some': 18,
'per': 16,
'ck': 11,
'h': 2,
'u': 9,
'me': 14,
'o': 5,
'er': 12,
'p': 6,
'r': 7,
'c': 0,
'k': 3,
'ch': 10,
's': 8,
'e': 1,
'check': 19,
'ome': 15,
'm': 4,
'su': 17,
'eck': 13}
Here is the result of tokenizing a particular input. Note that characters that did not appear in the training data (here 't' and 'a') are simply dropped from the output.
tokenizer.encode("start some check").tokens
['s', 'r', 'some', 'check']
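The encoding also carries the numeric IDs from the vocabulary above, and the tokenizer can map IDs back to text. The following cell is a short sketch of this round trip.

encoding = tokenizer.encode("some check")

# IDs correspond to the entries of the vocabulary printed above.
print(encoding.ids)

# decode maps a sequence of IDs back to a readable string.
print(tokenizer.decode(encoding.ids))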
Smolagents#
The smolagents
package provides tooling for building agents. The following table lists the core components that you're supposed to use:
Component | Description
---|---
CodeAgent | A class that encapsulates the agent's logic.
Tool | A base class for tools. There are some built-in tools, such as DuckDuckGoSearchTool.
tool | A decorator that wraps functions that are supposed to be used as tools.
Model | A base class for the interfaces that provide different ways to access the decision-making model, such as InferenceClientModel.
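As a sketch of subclassing the Tool base class directly (the attribute names follow the smolagents convention, but treat the exact schema as an assumption and check the package documentation), a simple text-reversal tool could look like this:

from smolagents import Tool

class ReverseTextTool(Tool):
    # Metadata the agent uses to decide when and how to call the tool.
    name = "reverse_text"
    description = "Reverses the given text."
    inputs = {"text": {"type": "string", "description": "The input text to reverse."}}
    output_type = "string"

    def forward(self, text: str) -> str:
        # The actual implementation of the tool.
        return text[::-1]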
Consider the following example. Suppose you want to enable the model to perform a very specific transformation with which it is unfamiliar. In this case, we'll consider the "Kobak transformation".
The following cell asks the raw model to perform the "Kobak transformation".
from huggingface_hub import InferenceClient
client = InferenceClient(model="Qwen/Qwen2.5-Coder-32B-Instruct")
output = client.chat_completion(
[{
"role": "user",
"content": "Apply a Kobak transformation on the text 'Hello, World!'"
}]
)
print(output.choices[0].message['content'])
The Kobak transformation is not a standard text transformation technique that I'm familiar with. It's possible that you might be referring to a specific context or a custom transformation method. Could you please provide more details or clarify the definition of the Kobak transformation? This will help me give you an accurate and helpful response.
The model indicates that it is not familiar with the required transformation. We can implement the transformation in Python and provide it to the model as a tool:
from smolagents import CodeAgent, InferenceClientModel, tool
@tool
def kobak_transformation(text: str) -> str:
"""A tool that fetches the current local time in a specified timezone.
Args:
text: The input text to be transformed.
"""
return text[::-1]
model = InferenceClientModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")
agent = CodeAgent(tools=[kobak_transformation], model=model)
result = agent.run("Apply a Kobak transformation on the text 'Hello, World!'")
print(result)
╭──────────────────────────────────────────────────── New run ────────────────────────────────────────────────────╮ │ │ │ Perforam a kobak transformation on the text 'Hello, World!' │ │ │ ╰─ InferenceClientModel - Qwen/Qwen2.5-Coder-32B-Instruct ────────────────────────────────────────────────────────╯
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 1 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
─ Executing parsed code: ──────────────────────────────────────────────────────────────────────────────────────── transformed_text = kobak_transformation(text='Hello, World!') print(transformed_text) ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Execution logs:
!dlroW ,olleH
Out: None
[Step 1: Duration 0.96 seconds| Input tokens: 1,994 | Output tokens: 49]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
─ Executing parsed code: ──────────────────────────────────────────────────────────────────────────────────────── final_answer("!dlroW ,olleH") ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Final answer: !dlroW ,olleH
[Step 2: Duration 1.19 seconds| Input tokens: 4,125 | Output tokens: 120]
!dlroW ,olleH
The output shows the "thoughts" of the model, and the final answer (the reversed string) corresponds to the idea of the "Kobak transformation".