Hugging Face#

This page discusses various aspects of the Hugging Face infrastructure.

Hub#

The huggingface_hub package allows you to interact with the Hugging Face infrastructure, which offers tools for versioning and running inference on machine learning models.

| Functionality | Usage Reference | Notes |
|---|---|---|
| Authentication | huggingface_hub.login | Log in with your token. |
| | huggingface_hub.whoami | Check the current user. |
| Repo Management | huggingface_hub.HfApi.create_repo | Create a repo for models/datasets/spaces. |
| | huggingface_hub.HfApi.delete_repo | Delete a repo. |
| File Upload / Download | huggingface_hub.upload_file | Upload a single file. |
| | huggingface_hub.hf_hub_download | Download a single file. |
| Snapshot Download | huggingface_hub.snapshot_download | Download an entire repo (cached locally). |
| Commits & Revisions | huggingface_hub.HfApi.create_commit | Commit files with git-like semantics. |
| | huggingface_hub.CommitOperationAdd | Add a file in a commit. |
| Search / Listing | huggingface_hub.HfApi.list_models | Search models. |
| | huggingface_hub.HfApi.list_datasets | Search datasets. |
| Model / Dataset Info | huggingface_hub.HfApi.model_info | Get model metadata. |
| | huggingface_hub.HfApi.dataset_info | Get dataset details. |
| Inference API | huggingface_hub.InferenceClient | Run inference via HF servers. |
| Spaces Management | huggingface_hub.HfApi.restart_space | Restart a Space. |
| | huggingface_hub.HfApi.request_space_hardware | Request a hardware upgrade. |
| Utilities | huggingface_hub.hf_hub_url | Get the raw URL of a file. |
| | huggingface_hub.scan_cache_dir | Inspect the local HF cache. |
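
A minimal sketch combining a couple of these calls; it assumes network access to the Hub, and bert-base-uncased is used purely as an example repo:

from huggingface_hub import HfApi, hf_hub_download

api = HfApi()

# Search the Hub for models matching a query ("bert" is an arbitrary example)
for model in api.list_models(search="bert", limit=3):
    print(model.id)

# Download a single file from a repo into the local cache and get its path
config_path = hf_hub_download(repo_id="bert-base-uncased", filename="config.json")
print(config_path)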


Transformers#

transformers is a Python package that allows you to use pre-trained machine learning models built on the transformer architecture.

| Component | Description |
|---|---|
| Models | Pretrained architectures for tasks like classification, generation, or embeddings. |
| Tokenizers | Convert text into numerical input for models; handle batching, padding, truncation. |
| Pipelines | High-level API combining tokenizer + model for a specific task (e.g., summarization). |
| Configurations | Define model hyperparameters and architecture settings (e.g., BertConfig). |
| Trainer | High-level training API handling loops, evaluation, logging, and checkpointing. |
| Schedulers & Optimizers | Learning rate schedulers and optimizer integrations for training models. |
| Data Utilities | Helpers for preprocessing and batching (e.g., DataCollator, BatchEncoding). |
| Hub Integration | Download/upload pretrained models from the Hugging Face Hub (from_pretrained, push_to_hub). |
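
For instance, the pipeline component wraps a tokenizer and a model behind a single call. A minimal sketch (the task name triggers a download of a default model, so it assumes network access):

from transformers import pipeline

# Build a pipeline for one task; a default tokenizer and model are picked automatically
classifier = pipeline("sentiment-analysis")

# The call handles tokenization, inference, and decoding of the result
print(classifier("Hugging Face tools are easy to use"))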

Check more details in the transformers page.

Datasets#

The datasets package is part of Hugging Face’s infrastructure; it manages data and implements tools for loading and processing it, regardless of its modality.

You can define your own datasets, as well as load ready-made datasets from the datasets section of the Hugging Face Hub.

The core elements of the package are:

  • Primitives for storing data:

    • Dataset is a map of features, each of which is an array of elements.

    • DatasetDict is a map of datasets.

  • load_dataset: method for loading datasets.

  • Modalities: the package is built to process all kinds of data, so it implements subpackages:

    • Audio: for working with audio data.

    • Image: for working with image data.

    • Text: for working with text data.

    • Video: for working with video data.

  • Processing methods (a minimal sketch follows this list):

    • map: method for applying a function to each element of a dataset.

    • filter: method for filtering a dataset.

    • train_test_split: method for splitting a dataset into training and testing sets.

    • sort: method for sorting a dataset.

    • shuffle: method for shuffling a dataset.

  • Other utilities:

    • arrow_writer: class for writing data to an Arrow file.

    • Csv: class for working with CSV files.

    • Json: class for working with JSON files.
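
A minimal sketch of these primitives and processing methods on a small hand-made dataset (the column names and values below are invented for illustration):

from datasets import Dataset

# Dataset.from_dict builds a dataset from a plain dict of columns
ds = Dataset.from_dict({
    "text": ["good movie", "bad movie", "great plot", "boring plot"],
    "label": [1, 0, 1, 0],
})

# map applies a function to each element, here adding a new feature
ds = ds.map(lambda example: {"length": len(example["text"])})

# filter keeps only the elements for which the function returns True
positive = ds.filter(lambda example: example["label"] == 1)

# train_test_split returns a DatasetDict with "train" and "test" sections
splits = ds.train_test_split(test_size=0.5)
print(splits)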


Consider the types of objects you encounter when you start using the datasets package.

from datasets import load_dataset

The following cell loads the lhoestq/demo1 dataset.

demo_dataset = load_dataset("lhoestq/demo1")
demo_dataset
DatasetDict({
    train: Dataset({
        features: ['id', 'package_name', 'review', 'date', 'star', 'version_id'],
        num_rows: 5
    })
    test: Dataset({
        features: ['id', 'package_name', 'review', 'date', 'star', 'version_id'],
        num_rows: 5
    })
})

At the top level, the data is separated into train/test sections using the DatasetDict abstraction. Access a specific section of the data by using the corresponding key.

demo_dataset['train']
Dataset({
    features: ['id', 'package_name', 'review', 'date', 'star', 'version_id'],
    num_rows: 5
})

Consider the following example with an audio-domain dataset:

dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
dataset
Dataset({
    features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
    num_rows: 563
})

It isn’t separated into test/train: because a specific split was requested (split="train"), load_dataset returns a Dataset directly rather than a DatasetDict. The following cell shows the features available in the dataset under consideration.

dataset.features
{'path': Value('string'),
 'audio': Audio(sampling_rate=8000, decode=True, stream_index=None),
 'transcription': Value('string'),
 'english_transcription': Value('string'),
 'intent_class': ClassLabel(names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan', 'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill']),
 'lang_id': ClassLabel(names=['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN'])}

The audio feature has an Audio datatype due to the specifics of the dataset.
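
Feature objects also carry behavior of their own; for example, a ClassLabel can translate integer labels back into class names. A small sketch, continuing with the dataset loaded above:

# Column access returns the raw integer labels (and avoids decoding the audio column)
label_id = dataset["intent_class"][0]

# ClassLabel maps the integer back to the corresponding class name
print(dataset.features["intent_class"].int2str(label_id))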

Tokenizers#

The tokenizers package implements different tokenization approaches and related tools.

| Component | Description |
|---|---|
| PreTokenizers | Split text into initial units (words, punctuation, subwords) before encoding. |
| Models | Define the tokenization algorithm (BPE, WordPiece, SentencePiece, Unigram). |
| Normalizers | Clean and standardize text (lowercasing, accent stripping, punctuation handling). |
| Trainers | Learn a tokenization vocabulary from a dataset. |
| Decoders | Convert token IDs back to readable text. |
| Processors | Post-process tokenized output (e.g., adding special tokens like [CLS]). |
| Batch Encoding | Handle batch tokenization with padding, truncation, and attention masks. |



Consider the most essential components that are typically used with the tokenizers package.

from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from tokenizers.models import BPE
from tokenizers import Tokenizer

Pre-tokenization is the initial separation of the text into smaller units before the main tokenization algorithm runs; final tokens never cross pre-token boundaries, so the pre-tokens set an upper bound on how large a token can be. The main algorithm is often statistical, so its output is not limited by any strict rule; a pre-tokenizer, in contrast, uses a deterministic algorithm, so its results are predictable. The following cell applies the simplest whitespace pre-tokenizer to a short sample text.

pretokenizer = Whitespace()
pretokenizer.pre_tokenize_str("Some test text")
[('Some', (0, 4)), ('test', (5, 9)), ('text', (10, 14))]

The tokenizers.Tokenizer class is the tool for interacting with a tokenizer. It takes a model that defines the exact approach to tokenization.

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pretokenizer

The trainer class is another component of the system; it defines the training parameters, such as the vocabulary size. The following cell shows the training of the tokenizer defined earlier.

trainer = BpeTrainer(vocab_size=20)

tokenizer.train_from_iterator(
    [
        "some super check",
        "super some check"
    ],
    trainer
)

The following cell shows the vocabulary of the final tokenizer. Each token has an ID that will be used after tokenization.

tokenizer.get_vocab()
{'some': 18,
 'per': 16,
 'ck': 11,
 'h': 2,
 'u': 9,
 'me': 14,
 'o': 5,
 'er': 12,
 'p': 6,
 'r': 7,
 'c': 0,
 'k': 3,
 'ch': 10,
 's': 8,
 'e': 1,
 'check': 19,
 'ome': 15,
 'm': 4,
 'su': 17,
 'eck': 13}

Here is the result of tokenizing a particular input.

tokenizer.encode("start some check").tokens
['s', 'r', 'some', 'check']
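
Besides the token strings, the encoding also carries the corresponding IDs from the vocabulary, and the tokenizer can turn IDs back into text. A small sketch with the tokenizer trained above:

encoding = tokenizer.encode("start some check")

# IDs correspond to the vocabulary shown earlier
print(encoding.ids)

# Decoding maps the IDs back to a string (tokens are simply joined, since no decoder is set)
print(tokenizer.decode(encoding.ids))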

Smolagents#

The smolagents package is tooling for building agents. The following table lists the core components that you’re supposed to use:

| Component | Description |
|---|---|
| CodeAgent | A class that encapsulates the agent’s logic. |
| Tool | A base class for tools. There are some built-in tools, such as ApiWebSearch, VisitWebpageTool, PythonInterpreterTool, etc. |
| @tool | A decorator that wraps functions that are supposed to be used as tools. |
| Model | A base class for the interfaces that provide different ways to access the decision-making model: InferenceClientModel, TransformersModel, etc. |


Consider the following example. Suppose you want to enable the model to perform a very specific transformation with which it is unfamiliar. In this case, we’ll consider the “Kobak transformation”.

The following cell asks the raw model to perform the “Kobak transformation”.

from huggingface_hub import InferenceClient
client = InferenceClient(model="Qwen/Qwen2.5-Coder-32B-Instruct")
output = client.chat_completion(
    [{
        "role": "user",
        "content": "Apply a Kobak transformation on the text 'Hello, World!'"
    }]
)

print(output.choices[0].message['content'])
The Kobak transformation is not a standard text transformation technique that I'm familiar with. It's possible that you might be referring to a specific context or a custom transformation method. Could you please provide more details or clarify the definition of the Kobak transformation? This will help me give you an accurate and helpful response.

The model indicates that it is not familiar with the required transformation. We can implement the transformation in Python code and provide it to the model as a tool:

from smolagents import CodeAgent, InferenceClientModel, tool

@tool
def kobak_transformation(text: str) -> str:
    """A tool that fetches the current local time in a specified timezone.
    Args:
        text: The input text to be transformed.
    """
    return text[::-1]

model = InferenceClientModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")
agent = CodeAgent(tools=[kobak_transformation], model=model)

result = agent.run("Apply a Kobak transformation on the text 'Hello, World!'")
print(result)
╭──────────────────────────────────────────────────── New run ────────────────────────────────────────────────────╮
                                                                                                                 
 Apply a Kobak transformation on the text 'Hello, World!'
                                                                                                                 
╰─ InferenceClientModel - Qwen/Qwen2.5-Coder-32B-Instruct ────────────────────────────────────────────────────────╯
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 1 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Executing parsed code: ──────────────────────────────────────────────────────────────────────────────────────── 
  transformed_text = kobak_transformation(text='Hello, World!')                                                    
  print(transformed_text)                                                                                          
 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────── 
Execution logs:
!dlroW ,olleH

Out: None
[Step 1: Duration 0.96 seconds| Input tokens: 1,994 | Output tokens: 49]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Executing parsed code: ──────────────────────────────────────────────────────────────────────────────────────── 
  final_answer("!dlroW ,olleH")                                                                                    
 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────── 

Final answer: !dlroW ,olleH
[Step 2: Duration 1.19 seconds| Input tokens: 4,125 | Output tokens: 120]
!dlroW ,olleH

The outputs show the “thoughts” of the model, and the final answer corresponds to the idea of the “Kobak transformation”.