Hub#

Hugging Face provides the huggingface_hub Python package, which offers tools for storing and managing your own models as well as for using third-party models. It allows you to download models from the Hub and send requests to inference APIs deployed on Hugging Face's infrastructure.

from huggingface_hub import InferenceClient
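
For example, a single file can be downloaded from a Hub repository with hf_hub_download. The following is a minimal sketch; the repository and filename are only illustrative.

from huggingface_hub import hf_hub_download

# Downloads the file into the local cache and returns its path
path = hf_hub_download(repo_id="gpt2", filename="config.json")
print(path)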

Inference client#

The huggingface_hub package provides the InferenceClient class, a unified interface for accessing models served on the Hub infrastructure. This section discusses how it can be used.

The following cell uses huggingface_hub.InferenceClient to request inference from a model served by the Hub.

Note: your Hugging Face token should be available in the $HF_TOKEN environment variable, and it must have permission to make inference requests.
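
A minimal sketch of such a request, assuming the model named below is available through the Hub's inference providers (the model name and prompt are only illustrative):

import os
from huggingface_hub import InferenceClient

# The token is read from the HF_TOKEN environment variable
client = InferenceClient(token=os.environ["HF_TOKEN"])

response = client.chat_completion(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "The capital of France is"}],
    max_tokens=20,
)
print(response.choices[0].message.content)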

Ollama#

To use an Ollama inference server, specify the URL of the Ollama instance when initializing the InferenceClient. The model that Ollama should run is passed in the model argument of the chat_completion method.


The following cell initializes the client.

Note: by default, Ollama is accessible on port 11434.

client = InferenceClient(
    model="http://localhost:11434"
)

The following cell shows how to request a completion from the model served by Ollama.

ans = client.chat_completion(
    model="llama3.2:1b",
    messages=[
        {"role": "user", "content": "The capital of china is"}
    ]
)
ans.choices[0].message.content
'Beijing is the capital of China.'

Chat completion#

The InferenceClient.chat_completion method completes a chat: the chat history is passed in the messages argument, and the method returns the next response.

Each entry in messages specifies the content of a message and the role that produced it, according to the chat that is to be completed. The available roles depend on the model you're using, but the common options (illustrated in the example after the list) are:

  • user: the client that asks the model to do something.

  • assistant: the model's answer.

  • system: a system prompt that regulates the model's general behaviour.
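
Combining these roles, a chat history that could be passed to chat_completion might look like the following sketch (the content is only illustrative):

messages = [
    # General behaviour of the model for the whole conversation
    {"role": "system", "content": "You are a concise geography assistant."},
    # Previous turns of the conversation
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
    # The message the model is asked to respond to
    {"role": "user", "content": "And of Germany?"},
]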


The following cell shows how it is supposed to be used.

client = InferenceClient(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")

output = client.chat_completion(
    messages=[
        {"role": "user", "content": "The capital of Australia is"},
    ],
    stream=False,
    max_tokens=20,
)
print(output.choices[0].message.content)
The best answer is Canberra

Note: in some tutorials client.chat.completions.create is used instead. It mirrors the OpenAI client interface and produces the same output.

output = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "The capital of Australia is"},
    ],
    stream=False,
    max_tokens=20,
)

print(output.choices[0].message.content)
The best answer is Canberra