Databricks#
Databricks is a platform for manipulating data and data-related processes: analytics and ML.
Data#
Consider how Databricks stores data. The hierarchy is:
- Catalogs: top-level containers that hold schemas.
- Schemas (also called databases): contain data objects.
- Data objects can be: Volume, Table, View, Function, or Model.
Check the Unity Catalog documentation for more details.
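The following sketch shows this hierarchy in practice with Spark SQL from a notebook. It assumes a Unity Catalog-enabled workspace with an active spark session; my_catalog, my_schema, and my_table are hypothetical names used only for illustration.
# Create each level of the hierarchy: catalog -> schema -> table
spark.sql("CREATE CATALOG IF NOT EXISTS my_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS my_catalog.my_schema")
spark.sql(
    "CREATE TABLE IF NOT EXISTS my_catalog.my_schema.my_table (id INT, name STRING)"
)

# List the data objects registered in the schema
spark.sql("SHOW TABLES IN my_catalog.my_schema").show()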
Foundation models#
Foundation models are LLMs provided by Databricks. They include popular models from major vendors. You can find the foundation models available to you in the “Serving” section of your Databricks deployment.
The following cell queries the Llama 3.3 model, which is available on my Databricks account.
# Take the workspace host and token from the Databricks CLI auth environment
env=$(databricks auth env | jq ".env")
host=$(echo $env | jq -r ".DATABRICKS_HOST")
token=$(echo $env | jq -r ".DATABRICKS_TOKEN")

# Query the serving endpoint of the Llama 3.3 70B instruct model
ans=$(curl $host/serving-endpoints/databricks-meta-llama-3-3-70b-instruct/invocations -s \
    -H "Authorization: Bearer $token" \
    -d '{
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }'
)
The API response is represented in the following cell.
echo $ans | jq "."
{
"id": "chatcmpl_5bad7112-c429-4370-9c73-1ef6d3b64fa7",
"object": "chat.completion",
"created": 1763045738,
"model": "meta-llama-3.3-70b-instruct-121024",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
},
"finish_reason": "stop",
"logprobs": null
}
],
"usage": {
"prompt_tokens": 17,
"completion_tokens": 8,
"total_tokens": 25
}
}
Or, more specifically, the response of the model:
echo $ans | jq ".choices[0].message"
{
"role": "assistant",
"content": "The capital of France is Paris."
}
Feature store#
You can manipulate the feature store from Python using the databricks.feature_engineering module. It is not provided with the Databricks Python SDK out of the box: install the separately published PyPI package databricks-feature-engineering.
Create a feature table with the following code:
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

# `data` is a Spark DataFrame that contains the primary key columns
fe.create_table(
    name="<catalog>.<schema>.<table name>",
    primary_keys=["<primary key 1>", "<primary key 2>"],
    df=data,
    description="This is some sort of description",
    tags={"source": "bronze", "format": "delta"}
)
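To check that the table was registered, you can read it back into a Spark DataFrame. This is a minimal sketch that assumes the same placeholder table name as above.
# Read the feature table that was just created back into a Spark DataFrame
features_df = fe.read_table(name="<catalog>.<schema>.<table name>")
features_df.show()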
Jobs & Workflows#
Jobs and workflows allow you to orchestrate tasks, which are pieces of code that perform actions on the platform, and to build relationships between them.
The following table lists the ways you can define Databricks tasks.
| Task Type | Description | Primary Use Case |
|---|---|---|
| Notebook Task | Runs a Databricks notebook written in Python, Scala, SQL, or R. | Executing interactive code, ETL logic, or ML training pipelines. |
| Pipeline Task | Runs a specified Delta Live Tables (DLT) pipeline. | Orchestrating end-to-end declarative ETL/streaming data pipelines. |
| SQL File Task | Executes a SQL script file stored in the workspace or a Git repository. | Running complex SQL transformations, DDL, or DML statements. |
| Python Script Task | Executes a Python script file on the cluster. | Running standard Python code, often with Spark (PySpark) libraries. |
| Python Wheel Task | Runs a Python function packaged within a Python Wheel (.whl). | Running production-grade, modular, and version-controlled Python code. |
| JAR Task | Executes a compiled Java or Scala application packaged as a JAR file. | Running compiled, production-ready code, typically for complex logic. |
| Spark-Submit Task | Allows submission of a generic Spark application via the spark-submit script. | Running custom or highly specialized Spark applications. |
| dbt Task | Runs one or more dbt commands. | Orchestrating and running dbt projects for data transformation. |
| Run Job Task | Executes another Databricks Job as a task. | Creating nested, modular, or reusable workflows (parent-child jobs). |
| If/Else Condition Task | Evaluates a condition and controls the execution flow of subsequent tasks. | Adding conditional logic (branching) to a workflow. |
| For Each Task | Iterates over a collection of input values and runs a nested task for each value. | Parallel processing or batch operations over a list of items. |
| Dashboard Task | Updates a Databricks SQL Dashboard. | Automating the refresh of business intelligence dashboards. |
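Tasks like these are usually defined through the UI or asset bundles, but they can also be created programmatically. Below is a minimal sketch using the databricks-sdk Python package; the job name, notebook path, and cluster id are hypothetical placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Define a job with a single notebook task running on an existing cluster
created_job = w.jobs.create(
    name="hello-notebook-job",
    tasks=[
        jobs.Task(
            task_key="hello-task",
            notebook_task=jobs.NotebookTask(notebook_path="/Users/<user>/hello"),
            existing_cluster_id="<cluster id>",
        )
    ],
)
print(created_job.job_id)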
Task communication#
To communicate between tasks you can set and read “task values”.
In Python code, use dbutils.jobs.taskValues for that:
- dbutils.jobs.taskValues.set(key="<key>", value="<value>") for setting a value.
- dbutils.jobs.taskValues.get(taskKey="<name of the previous task>", key="<key>") for reading the value.
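Here is a minimal sketch of how this could look across two notebook tasks; the task name producer and the key row_count are hypothetical.
# In the task named "producer": publish a value for downstream tasks
dbutils.jobs.taskValues.set(key="row_count", value=42)

# In a downstream task: read the value published by the "producer" task.
# `default` is returned if the key was not set; `debugValue` is used when
# the code runs interactively outside of a job.
row_count = dbutils.jobs.taskValues.get(
    taskKey="producer", key="row_count", default=0, debugValue=0
)
print(row_count)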
CLI#
The Databricks CLI allows you to manipulate your Databricks workspace/account from your machine's command line. The following table shows the corresponding subcommands:
| Command group | Description / purpose |
|---|---|
| fs | Manage files in DBFS / file system (list, copy, delete, read) |
| git-credentials | Manage personal access tokens for Databricks to perform operations on behalf of the user |
| repos | Manage Git repos within Databricks (import, sync, permissions) |
| secrets | Manage secrets, scopes, and access control for secrets |
| workspace | Handle workspace contents (notebooks, folders) and permissions |
| cluster-policies | Control rules and policies for cluster configurations |
| clusters | Manage cluster lifecycle and settings |
| api | Call any Databricks REST API directly (for advanced or unsupported endpoints) |
| completion | Generate shell autocompletion scripts |
| configure | Set up and configure the Databricks CLI (e.g. host, profile) |
| help | Display summary and help information for commands |
| bundle | Manage Databricks Asset Bundles (CI/CD-style deployments) |
| labs | Work with experimental Labs applications and features in Databricks |
| auth | Manage authentication, login, profiles, and tokens |
| current-user | Show information about the currently authenticated user or service principal |
| model-registry | Manage the workspace’s MLflow Model Registry: models, versions, transitions, metadata, and webhooks |
Check the Databricks CLI documentation for more.
If you have the Databricks CLI installed on your system, you should be able to run the following command:
databricks --help | head -n 20
Databricks CLI
Usage:
databricks [command]
Databricks Workspace
fs Filesystem related commands
git-credentials Registers personal access token for Databricks to do operations on behalf of the user.
repos The Repos API allows users to manage their git repos.
secrets The Secrets API allows you to manage secrets, secret scopes, and access permissions.
workspace The Workspace API allows you to list, import, export, and delete notebooks and folders.
Compute
cluster-policies You can use cluster policies to control users' ability to configure clusters based on a set of rules.
clusters The Clusters API allows you to create, start, edit, list, terminate, and delete clusters.
global-init-scripts The Global Init Scripts API enables Workspace administrators to configure global initialization scripts for their workspace.
instance-pools Instance Pools API are used to create, edit, delete and list instance pools by using ready-to-use cloud instances which reduces a cluster start and auto-scaling times.
instance-profiles The Instance Profiles API allows admins to add, list, and remove instance profiles that users can launch clusters with.
libraries The Libraries API allows you to install and uninstall libraries and get the status of libraries on a cluster.
policy-compliance-for-clusters The policy compliance APIs allow you to view and manage the policy compliance status of clusters in your workspace.
Asset bundles#
An asset bundle is a set of instructions in a YAML file for managing a Databricks project.
There are two important concepts in Databricks asset bundles:
- The databricks.yml file and its configuration allow you to specify the bundle.
- The databricks bundle subcommand of the Databricks CLI allows you to manipulate the bundle.
For more details check:
- Develop Databricks Asset Bundles: will guide you through the process of creating and deploying a bundle.
- The bundle command group: describes the parts of the Databricks CLI that are responsible for managing bundles.
- The asset bundles page on this site.
Consider the process of creating the simplest possible asset bundle.
Create a folder and a databricks.yml file within it:
bundle:
  name: knowledge

resources:
  jobs:
    hello-job:
      name: hello-job
      tasks:
        - task_key: hello-task
          notebook_task:
            notebook_path: ./hello.ipynb

targets:
  dev:
    default: true
Create hello.ipynb, since the bundle defines a task based on it.
Use the databricks bundle deploy command to push your bundle to the Databricks environment.
After these manipulations, you should have a corresponding folder under the .bundle folder of your Databricks workspace, and hello-job will be listed in the jobs list.
To delete the bundle (only in the Databricks environment), use databricks bundle destroy.
AI & ML#
Databricks provides a range of tools for building and deploying machine learning solutions. Check the AI and machine learning on Databricks page for more information.
The most useful services are:
- Databricks provides OpenAI-compatible model endpoints, so you can access some models using only your Databricks credentials. For more, check Get started querying LLMs on Databricks; a sketch of querying such an endpoint is shown after this list.
- Mosaic AI Vector Search for embedding retrieval.
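Since the endpoints are OpenAI-compatible, you can query them with the openai Python client. A minimal sketch, assuming the host and token are exported as environment variables (for example taken from databricks auth env, as in the curl example above) and reusing the databricks-meta-llama-3-3-70b-instruct endpoint:
import os
from openai import OpenAI

# Point the OpenAI client at the workspace's serving endpoints
client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=os.environ["DATABRICKS_HOST"] + "/serving-endpoints",
)

response = client.chat.completions.create(
    model="databricks-meta-llama-3-3-70b-instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)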