Intro#

The DS/ML section covers the Python packages and frameworks specialised for building database systems and machine learning solutions.

Hugging Face#

Hugging Face is an ecosystem of packages that covers all aspects of working with deep learning artifacts such as models and datasets.

The first thing you need to do is log in:

huggingface-cli login --token <your HF token>

The following table shows the structure of the ecosystem:

| Package | Purpose |
| --- | --- |
| 🤗 Hub (huggingface_hub) | Central repository for models, datasets, and Spaces. Lets you push/pull models and datasets. |
| Transformers (transformers) | High-level library with pretrained NLP, vision, and multimodal models. Handles training, inference, and tokenization (via wrappers). |
| Tokenizers (tokenizers) | Fast, low-level text tokenization library (written in Rust). Often used inside transformers. |
| Datasets (datasets) | Efficient dataset loading, processing, and streaming. Optimized for large ML datasets. |
| Evaluate (evaluate) | Standardized evaluation metrics library. Works well with datasets and transformers. |
| Diffusers (diffusers) | Library for diffusion models (e.g., Stable Diffusion) for images, audio, video. |
| Accelerate (accelerate) | Utility for running training on any hardware setup (CPU, GPU, multi-GPU, TPU) with minimal code changes. |
| PEFT (peft) | Parameter-Efficient Fine-Tuning library (LoRA, adapters, etc.) for large models. |
| Optimum (optimum) | Optimizations for transformers (ONNX, quantization, hardware-specific acceleration). |
| smolagents (smolagents) | Library for building agentic systems. |
| Gradio (gradio) (partnered) | Simple UI framework to demo models in the browser. |

Find out more:

Spark#

Spark is a framework for processing large amounts of data. This section covers its Python SDK.

For more details, check the Spark page.


The following cell demonstrates how to create a Spark session, define a data frame, and display its contents.

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("Temp").getOrCreate()
spark_session.createDataFrame(
    data=[
        ("Fedor", 500),
        ("Alice", 700),
        ("Bob", 1400)
    ],
    schema=("Name", "Salary")
).show()
+-----+------+
| Name|Salary|
+-----+------+
|Fedor|   500|
|Alice|   700|
|  Bob|  1400|
+-----+------+

Sentence transformer#

The sentence_transformers package implements models that build embeddings from sets of texts. Check the SBERT page for more details.


Consider a basic example of using the sentence_transformers package.

The following cell loads the model and displays its type. It is a special object built to provide the interfaces associated with building embeddings.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
type(model)
sentence_transformers.SentenceTransformer.SentenceTransformer

The resulting object has an encode method that takes a sequence of texts and returns a numpy.ndarray of embeddings, one row per text.

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)
embeddings
(3, 384)
array([[ 0.01919573,  0.12008536,  0.15959828, ..., -0.0053629 ,
        -0.08109505,  0.05021338],
       [-0.01869039,  0.04151868,  0.07431544, ...,  0.00486597,
        -0.06190442,  0.03187514],
       [ 0.136502  ,  0.08227322, -0.02526165, ...,  0.08762047,
         0.03045845, -0.01075752]], shape=(3, 384), dtype=float32)

The following cell uses the similarity method to create a matrix of the embeddings’ similarities.

similarities = model.similarity(embeddings, embeddings)
similarities
tensor([[1.0000, 0.6660, 0.1046],
        [0.6660, 1.0000, 0.1411],
        [0.1046, 0.1411, 1.0000]])
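
To make the similarity matrix less of a black box, here is a minimal sketch of what such a matrix contains, assuming the model uses cosine similarity (to my knowledge the default in recent sentence_transformers versions, but configurable). It is plain Python on made-up three-dimensional vectors, not the library's implementation; the helper names cosine_similarity and similarity_matrix are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(vectors):
    """Pairwise cosine similarities, one row per input vector."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Made-up vectors standing in for real 384-dimensional embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(value, 4) for value in row])
```

As in the output above, the diagonal is 1 (every vector is identical to itself) and the matrix is symmetric; vectors pointing in unrelated directions score close to 0.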

LangChain#

LangChain is a core library for developing modern, agent-based solutions. The following table lists and describes the central components of the langchain package.

| Component | Analogy | Description |
| --- | --- | --- |
| Models | The brains | These are the core language models (LLMs) that handle the actual work, like generating text, holding conversations, or creating embeddings. |
| Prompts | The instructions | These are the templates used to provide specific instructions and context to the models. They ensure the model responds in a consistent and desired format. |
| Chains | The workflow | A way to link multiple components together into a single, automated sequence. This allows you to perform multi-step tasks, like combining a prompt with a model call. |
| Agents | The reasoning engine | A more advanced chain that uses an LLM to decide which external Tools to use to achieve a goal. It can think, act, and observe, repeating the process until the task is complete. |
| Tools | The external capabilities | These are functionalities an agent can use to interact with the world. Examples include a search engine, a calculator, or a database lookup. |

Check the LangChain page for more details.
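
The roles of these components can be illustrated without the library itself. The following is a conceptual sketch in plain Python, not the LangChain API: every name in it (PROMPT_TEMPLATE, fake_model, calculator, chain, agent) is hypothetical, and the "model" is a hard-coded stand-in for a real LLM. A real agent would also repeat the think/act/observe cycle, while this sketch performs a single step.

```python
from typing import Callable, Dict

# Prompt: a template that turns user input into model instructions.
PROMPT_TEMPLATE = "Answer briefly. Question: {question}"

def fake_model(prompt: str) -> str:
    """Stand-in for an LLM: requests a tool when it sees arithmetic."""
    question = prompt.split("Question: ", 1)[1]
    if "+" in question or "*" in question:
        return f"USE_TOOL calculator {question}"
    return f"FINAL No tool needed for: {question}"

def calculator(expression: str) -> str:
    """Tool: evaluates 'a + b' or 'a * b' for integers."""
    left, op, right = expression.split()
    result = int(left) + int(right) if op == "+" else int(left) * int(right)
    return str(result)

TOOLS: Dict[str, Callable[[str], str]] = {"calculator": calculator}

def chain(question: str) -> str:
    """Chain: prompt formatting followed by a single model call."""
    return fake_model(PROMPT_TEMPLATE.format(question=question))

def agent(question: str) -> str:
    """Agent: runs the chain, then calls a tool if the model asks for one."""
    decision = chain(question)
    if decision.startswith("USE_TOOL"):
        _, tool_name, tool_input = decision.split(" ", 2)
        observation = TOOLS[tool_name](tool_input)
        return f"The answer is {observation}."
    return decision[len("FINAL "):]

print(agent("2 + 3"))
print(agent("What is the capital of France?"))
```

The split mirrors the table: the template is the prompt, fake_model is the model, chain links the two, and agent adds the decision about whether to call a tool.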

MCP SDK#

There is an MCP SDK for Python. It is provided by the mcp[cli] package.

Create a server object using the mcp.server.fastmcp.FastMCP class. Use the tool, resource, and prompt decorators to wrap the functions that implement the corresponding facilities. Sampling is not a decorator; a handler requests it through the context object.


The following cell writes a minimal server with a single tool to a file.

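%%writefile intro_files/mcp_server.py
from mcp.server.fastmcp import FastMCP

# Server object; the name is reported to connected clients.
mcp = FastMCP("Some service")

@mcp.tool()
def some_tool(inp: str) -> str:
    """Return a placeholder response for the given input."""
    return f"Output of some tool for {inp}."

# Start the server only when the file is executed directly.
if __name__ == "__main__":
    mcp.run()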
Overwriting intro_files/mcp_server.py

Run the server with the command mcp dev intro_files/mcp_server.py. The following cell launches it from Python through os.system, in the background, to show the expected output.

import os
os.system("mcp dev intro_files/mcp_server.py &")
0
Starting MCP inspector...
⚙️ Proxy server listening on localhost:6277
🔑 Session token: 4c21942ece36a04554ee01562067c5b129c3b03eaa945ead9d0b8964d9334fe8
   Use this token to authenticate requests or set DANGEROUSLY_OMIT_AUTH=true to disable auth

🚀 MCP Inspector is up and running at:
   http://localhost:6274/?MCP_PROXY_AUTH_TOKEN=4c21942ece36a04554ee01562067c5b129c3b03eaa945ead9d0b8964d9334fe8

🌐 Opening browser...

Note: The inspector tool requires npm to be installed on your system.

MLFlow#

MLFlow is a tool for organizing the lifecycle of machine learning models. It includes four components:

  • Tracking: Record and query experiments: code, data, config, results.

  • Projects: Packaging format for reproducible runs on any platform.

  • Models: General model format that supports diverse deployment tools.

  • Model Registry: Centralized and collaborative model lifecycle management.

For more information, check out the MLFlow page.

Databricks SDK#

Databricks is a platform for developing data applications. It provides a Python SDK.

Because the SDK interacts with a cloud-based platform, you have to set up authentication, typically in the ~/.databrickscfg file. See more on configuring authentication in:

There is a special package for interacting with the Databricks Feature Store: databricks-feature-engineering. After installing it, the module databricks.feature_engineering will be available in the environment.

Check the Databricks SDK for Python documentation.
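
As a sketch of the authentication setup mentioned above, assuming personal access token authentication and the default profile in ~/.databrickscfg, the configuration file could look like the fragment below. Both values are placeholders to be replaced with your own workspace URL and token; other authentication methods use different keys.

```ini
; ~/.databrickscfg (placeholder values, not real credentials)
[DEFAULT]
host  = https://<your-workspace-url>
token = <your personal access token>
```

With such a profile in place, WorkspaceClient() called without arguments should pick up the credentials on its own.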


If the configuration is set up correctly, you should be able to run the following cell without any errors.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
type(w)
databricks.sdk.WorkspaceClient

OpenAI client#

The method serving_endpoints.get_open_ai_client of the workspace client returns an openai.OpenAI client, which you can use to send requests to served models.


The following cell creates the open_ai_client and shows that it really is an OpenAI client.

open_ai_client = w.serving_endpoints.get_open_ai_client()
type(open_ai_client)
openai.OpenAI

The following cell illustrates the invocation of the embedding model.

embedding = open_ai_client.embeddings.create(
   model="databricks-gte-large-en",
   input="hello"
)
type(embedding)
openai.types.create_embedding_response.CreateEmbeddingResponse

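The result is an OpenAI embedding response object; the vector itself is stored in data[0].embedding. The following cell shows its first 20 components.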

embedding.data[0].embedding[:20]
[-0.9521484375,
 -0.7998046875,
 -0.79931640625,
 -0.138427734375,
 -0.79150390625,
 -0.31787109375,
 -0.55810546875,
 0.392333984375,
 -0.36767578125,
 0.4013671875,
 -0.0791015625,
 -0.78515625,
 -0.4599609375,
 0.4189453125,
 0.418212890625,
 -0.36767578125,
 -0.587890625,
 -0.466796875,
 0.159423828125,
 -0.359130859375]

Optuna#

Optuna is a package for hyperparameter optimization.

For more information, check out the Optuna page on this website.


Consider the function:

\[f(x) = (x-2)^2\]

Because f is a convex parabola, its minimum is the solution of the equation:

\[\begin{split}\frac{df}{dx} = 0 \\ 2x-4 = 0 \\ x = 2 \end{split}\]

The following cell performs the same task, but uses Optuna to do so numerically.

import optuna

optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial: optuna.trial.Trial):
    x = trial.suggest_float('x', -10, 10)
    return (x - 2) ** 2

study = optuna.create_study()
study.optimize(objective, n_trials=100)
study.best_params
{'x': 2.0017855120396995}
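
For intuition about what a study does, the following sketch minimizes the same function with plain random search over the same interval, using only the standard library. This is a deliberately simpler technique than Optuna's samplers, which use the history of earlier trials to choose new candidates; the function random_search and its parameters are hypothetical names introduced for this illustration.

```python
import random

def objective(x: float) -> float:
    """Same function as above: f(x) = (x - 2)^2."""
    return (x - 2) ** 2

def random_search(n_trials: int, low: float, high: float, seed: int = 0):
    """Sample x uniformly n_trials times and keep the best candidate."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    best_x, best_value = None, float("inf")
    for _ in range(n_trials):
        x = rng.uniform(low, high)
        value = objective(x)
        if value < best_value:
            best_x, best_value = x, value
    return best_x, best_value

best_x, best_value = random_search(n_trials=100, low=-10, high=10)
print({"x": best_x}, best_value)
```

With the same budget of 100 trials, uniform sampling usually lands near x = 2 but less precisely than a sampler that concentrates later trials around promising regions.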