Databricks#

Databricks is a platform for working with data and data-related processes such as analytics and ML.

Data#

Consider how Databricks organizes data (illustrated by the sketch below). There are:

  • Catalogs: the top-level containers; each catalog holds schemas.

  • Schemas (also called databases): contain data objects.

  • Data objects: volumes, tables, views, functions, or models.
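
Together these levels form a three-level namespace: catalog.schema.object. Here is a minimal sketch of addressing it from a notebook, where the spark session is predefined; the main catalog and the examples/cities names are placeholders:

# `spark` is predefined in Databricks notebooks; the catalog, schema, and
# table names below are placeholders.
spark.sql("CREATE SCHEMA IF NOT EXISTS main.examples")
spark.sql("CREATE TABLE IF NOT EXISTS main.examples.cities (id INT, name STRING)")

# Refer to the table by its fully qualified name: catalog.schema.table.
df = spark.table("main.examples.cities")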

Foundation models#

Foundation models are LLMs provided by Databricks; they include popular models from major vendors. You can find the foundation models available to you in the “Serving” section of your Databricks deployment.


The following cell calls the Llama 3.3 70B Instruct model, which is available in my Databricks account.

env=$(databricks auth env | jq ".env")
host=$(echo $env | jq -r ".DATABRICKS_HOST")
token=$(echo $env | jq -r ".DATABRICKS_TOKEN")

ans=$(curl $host/serving-endpoints/databricks-meta-llama-3-3-70b-instruct/invocations -s \
  -H "Authorization: Bearer $token" \
  -d '{
      "messages": [
        {"role": "user", "content": "What is the capital of France?"}
      ]
    }'
)

The API response is shown in the following cell.

echo $ans | jq "."
{
  "id": "chatcmpl_5bad7112-c429-4370-9c73-1ef6d3b64fa7",
  "object": "chat.completion",
  "created": 1763045738,
  "model": "meta-llama-3.3-70b-instruct-121024",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "completion_tokens": 8,
    "total_tokens": 25
  }
}

Or, more specifically, the response of the model:

echo $ans | jq ".choices[0].message"
{
  "role": "assistant",
  "content": "The capital of France is Paris."
}
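
The same serving endpoint can also be called from Python through its OpenAI-compatible API. A minimal sketch, assuming the openai package is installed and that the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables hold the same values the shell example extracts with databricks auth env:

import os
from openai import OpenAI

# Point an OpenAI-compatible client at the workspace's serving endpoints.
client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=os.environ["DATABRICKS_HOST"] + "/serving-endpoints",
)

response = client.chat.completions.create(
    model="databricks-meta-llama-3-3-70b-instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)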

Feature store#

You can manipulate the feature store from Python through the databricks.feature_engineering module. It does not ship with the Databricks Python SDK out of the box; install the separate databricks-feature-engineering package from PyPI.

Create a feature table with the following code:

from databricks.feature_engineering import FeatureEngineeringClient
fe = FeatureEngineeringClient()

# `data` is a Spark DataFrame that contains the primary key columns together
# with the feature columns.
fe.create_table(
    name="<catalog>.<schema>.<table name>",
    primary_keys=["<primary key 1>", "<primary key 2>"],
    df=data,
    description="This is some sort of description",
    tags={"source": "bronze", "format": "delta"}
)
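
Once the table exists, the same client can read it back or upsert new rows into it. A sketch under the assumption that fe is the client created above and new_data is a Spark DataFrame with the same schema:

# Read the feature table back as a Spark DataFrame.
features = fe.read_table(name="<catalog>.<schema>.<table name>")

# Upsert new rows, matching on the primary keys.
fe.write_table(
    name="<catalog>.<schema>.<table name>",
    df=new_data,
    mode="merge"
)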

Jobs & Workflows#

Jobs and workflows allow you to orchestrate tasks, which are pieces of code that perform actions on the platform, and to build relationships between them.

The following table lists the ways you can define Databricks tasks.

| Task Type | Description | Primary Use Case |
|---|---|---|
| Notebook Task | Runs a Databricks notebook written in Python, Scala, SQL, or R. | Executing interactive code, ETL logic, or ML training pipelines. |
| Pipeline Task | Runs a specified Delta Live Tables (DLT) pipeline. | Orchestrating end-to-end declarative ETL/streaming data pipelines. |
| SQL File Task | Executes a SQL script file stored in the workspace or a Git repository. | Running complex SQL transformations, DDL, or DML statements. |
| Python Script Task | Executes a Python file on the cluster using spark-submit. | Running standard Python code, often with Spark (PySpark) libraries. |
| Python Wheel Task | Runs a Python function packaged within a Python Wheel (.whl) file. | Running production-grade, modular, and version-controlled Python code. |
| JAR Task | Executes a compiled Java or Scala application packaged as a JAR file. | Running compiled, production-ready code, typically for complex logic. |
| Spark-Submit Task | Allows submission of a generic Spark application via the spark-submit command. | Running custom or highly specialized Spark applications. |
| dbt Task | Runs one or more dbt (data build tool) commands. | Orchestrating and running dbt projects for data transformation. |
| Run Job Task | Executes another Databricks Job as a task. | Creating nested, modular, or reusable workflows (parent-child jobs). |
| If/Else Condition Task | Evaluates a condition and controls the execution flow of subsequent tasks. | Adding conditional logic (branching) to a workflow. |
| For Each Task | Iterates over a collection of input values and runs a nested task for each value. | Parallel processing or batch operations over a list of items. |
| Dashboard Task | Updates a Databricks SQL dashboard. | Automating the refresh of business intelligence dashboards. |
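
To illustrate how tasks and the relationships between them fit together, the following sketch creates a two-task job through the Databricks Python SDK (databricks-sdk); the job name, notebook paths, and cluster id are placeholders:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Two notebook tasks: "transform" runs only after "ingest" succeeds.
created = w.jobs.create(
    name="example-orchestration",
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/Users/<user>/ingest"),
            existing_cluster_id="<cluster id>",
        ),
        jobs.Task(
            task_key="transform",
            depends_on=[jobs.TaskDependency(task_key="ingest")],
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/Users/<user>/transform"),
            existing_cluster_id="<cluster id>",
        ),
    ],
)
print(created.job_id)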

Task communication#

To communicate between tasks you can set and read “task values”.

In Python code, use dbutils.jobs.taskValues for that (see the sketch after this list):

  • dbutils.jobs.taskValues.set(key="<key>", value="<value>") for setting a value.

  • dbutils.jobs.taskValues.get(taskKey="<name of the previous task>", key="<key>") for reading the value.
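
A minimal sketch, assuming two notebook tasks in the same job where the upstream task has the task key prepare; the key name row_count and its value are placeholders:

# Upstream task (task key "prepare"): publish a value for downstream tasks.
dbutils.jobs.taskValues.set(key="row_count", value=42)

# Downstream task: read it back, with a default in case it was never set.
row_count = dbutils.jobs.taskValues.get(taskKey="prepare", key="row_count", default=0)
print(row_count)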

CLI#

The Databricks CLI allows you to manage your Databricks workspace/account from your machine's command line. The following table describes the corresponding command groups:

| Command group | Description / purpose |
|---|---|
| fs | Manage files in DBFS / file system (list, copy, delete, read) |
| git-credentials | Manage personal access tokens for Databricks to perform operations on behalf of the user |
| repos | Manage Git repos within Databricks (import, sync, permissions) |
| secrets | Manage secrets, scopes, and access control for secrets |
| workspace | Handle workspace contents (notebooks, folders) and permissions |
| cluster-policies | Control rules and policies for cluster configurations |
| clusters | Manage cluster lifecycle and settings |
| api | Call any Databricks REST API directly (for advanced or unsupported endpoints) |
| completion | Generate shell autocompletion scripts |
| configure | Set up and configure the Databricks CLI (e.g. host, profile) |
| help | Display summary and help information for commands |
| bundle | Manage Databricks Asset Bundles (CI/CD-style deployments) |
| labs | Work with experimental Labs applications and features in Databricks |
| auth | Manage authentication, login, profiles, and tokens |
| current-user | Show information about the currently authenticated user or service principal |
| model-registry | Manage the workspace’s MLflow Model Registry: models, versions, transitions, metadata, and webhooks |

If you have the Databricks CLI installed on your system, you should be able to run the following command:

databricks --help | head -n 20
Databricks CLI

Usage:
  databricks [command]

Databricks Workspace
  fs                                     Filesystem related commands
  git-credentials                        Registers personal access token for Databricks to do operations on behalf of the user.
  repos                                  The Repos API allows users to manage their git repos.
  secrets                                The Secrets API allows you to manage secrets, secret scopes, and access permissions.
  workspace                              The Workspace API allows you to list, import, export, and delete notebooks and folders.

Compute
  cluster-policies                       You can use cluster policies to control users' ability to configure clusters based on a set of rules.
  clusters                               The Clusters API allows you to create, start, edit, list, terminate, and delete clusters.
  global-init-scripts                    The Global Init Scripts API enables Workspace administrators to configure global initialization scripts for their workspace.
  instance-pools                         Instance Pools API are used to create, edit, delete and list instance pools by using ready-to-use cloud instances which reduces a cluster start and auto-scaling times.
  instance-profiles                      The Instance Profiles API allows admins to add, list, and remove instance profiles that users can launch clusters with.
  libraries                              The Libraries API allows you to install and uninstall libraries and get the status of libraries on a cluster.
  policy-compliance-for-clusters         The policy compliance APIs allow you to view and manage the policy compliance status of clusters in your workspace.

Asset bundles#

An asset bundle is a YAML-based description of a Databricks project and of how it should be deployed and managed.

There are two important concepts in Databricks Asset Bundles:

  • The databricks.yml file and its configuration, which specify the bundle.

  • The databricks bundle subcommand of the Databricks CLI, which lets you manipulate the bundle.

Consider the process of creating the simplest possible asset bundle.

Create the folder and the databricks.yml file within it:

bundle:
  name: knowledge

resources:
  jobs:
    hello-job:
      name: hello-job
      tasks:
        - task_key: hello-task
          notebook_task:
            notebook_path: ./hello.ipynb

targets:
  dev:
    default: true

Create hello.ipynb in the same folder, since the bundle defines a task based on this notebook.

Use the databricks bundle deploy command to push your bundle to the Databricks environment.

After these steps, a corresponding folder appears under the .bundle folder of your Databricks workspace, and hello-job shows up in the jobs list.

To delete the bundle (only on the Databricks side; local files are kept) use databricks bundle destroy.

AI & ML#

Databricks provides a range of tools for building and deploying machine learning solutions. Check the AI and machine learning on Databricks page for more information.

The most useful services are: