Databricks#

Databricks is a platform for working with data and data-related processes such as analytics and ML.

Data#

Consider how Databricks organizes data (illustrated by the sketch below). There are:

  • Catalogs: the top-level containers; each catalog holds schemas.

  • Schemas (also called databases): contain data objects.

  • Data objects: volumes, tables, views, functions, or models.
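
Together these levels form a three-level namespace: catalog.schema.object. Here is a minimal sketch of addressing it from a notebook, where the spark session is predefined; the main catalog and the examples/cities names are placeholders:

# `spark` is predefined in Databricks notebooks; the catalog, schema, and
# table names below are placeholders.
spark.sql("CREATE SCHEMA IF NOT EXISTS main.examples")
spark.sql("CREATE TABLE IF NOT EXISTS main.examples.cities (id INT, name STRING)")

# Refer to the table by its fully qualified name: catalog.schema.table.
df = spark.table("main.examples.cities")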

Foundation models#

Foundation models are LLMs provided by Databricks; they include popular models from major vendors. You can find the foundation models available to you in the “Serving” section of your Databricks deployment.


The following cell calls the Llama 3.3 70B Instruct model, which is available in my Databricks account.

env=$(databricks auth env | jq ".env")
host=$(echo $env | jq -r ".DATABRICKS_HOST")
token=$(echo $env | jq -r ".DATABRICKS_TOKEN")

ans=$(curl $host/serving-endpoints/databricks-meta-llama-3-3-70b-instruct/invocations -s \
  -H "Authorization: Bearer $token" \
  -d '{
      "messages": [
        {"role": "user", "content": "What is the capital of France?"}
      ]
    }'
)

The API response is shown in the following cell.

echo $ans | jq "."
{
  "id": "chatcmpl_5bad7112-c429-4370-9c73-1ef6d3b64fa7",
  "object": "chat.completion",
  "created": 1763045738,
  "model": "meta-llama-3.3-70b-instruct-121024",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "completion_tokens": 8,
    "total_tokens": 25
  }
}

Or, more specifically, the response of the model:

echo $ans | jq ".choices[0].message"
{
  "role": "assistant",
  "content": "The capital of France is Paris."
}
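
The same serving endpoint can also be called from Python through its OpenAI-compatible API. A minimal sketch, assuming the openai package is installed and that the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables hold the same values the shell example extracts with databricks auth env:

import os
from openai import OpenAI

# Point an OpenAI-compatible client at the workspace's serving endpoints.
client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=os.environ["DATABRICKS_HOST"] + "/serving-endpoints",
)

response = client.chat.completions.create(
    model="databricks-meta-llama-3-3-70b-instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)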

Feature store#

You can manipulate the feature store from Python through the databricks.feature_engineering module. It does not ship with the Databricks Python SDK out of the box; install the separate databricks-feature-engineering package from PyPI.

Create a feature table with the following code:

from databricks.feature_engineering import FeatureEngineeringClient
fe = FeatureEngineeringClient()

# `data` is a Spark DataFrame that contains the primary key columns together
# with the feature columns.
fe.create_table(
    name="<catalog>.<schema>.<table name>",
    primary_keys=["<primary key 1>", "<primary key 2>"],
    df=data,
    description="This is some sort of description",
    tags={"source": "bronze", "format": "delta"}
)
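
Once the table exists, the same client can read it back or upsert new rows into it. A sketch under the assumption that fe is the client created above and new_data is a Spark DataFrame with the same schema:

# Read the feature table back as a Spark DataFrame.
features = fe.read_table(name="<catalog>.<schema>.<table name>")

# Upsert new rows, matching on the primary keys.
fe.write_table(
    name="<catalog>.<schema>.<table name>",
    df=new_data,
    mode="merge"
)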

Jobs & Workflows#

Jobs and workflows allow you to orchestrate tasks, which are pieces of code that perform actions on the platform, and to build relationships between them.

The following table lists the ways you can define Databricks tasks.

| Task Type | Description | Primary Use Case |
|---|---|---|
| Notebook Task | Runs a Databricks notebook written in Python, Scala, SQL, or R. | Executing interactive code, ETL logic, or ML training pipelines. |
| Pipeline Task | Runs a specified Delta Live Tables (DLT) pipeline. | Orchestrating end-to-end declarative ETL/streaming data pipelines. |
| SQL File Task | Executes a SQL script file stored in the workspace or a Git repository. | Running complex SQL transformations, DDL, or DML statements. |
| Python Script Task | Executes a Python file on the cluster using spark-submit. | Running standard Python code, often with Spark (PySpark) libraries. |
| Python Wheel Task | Runs a Python function packaged within a Python Wheel (.whl) file. | Running production-grade, modular, and version-controlled Python code. |
| JAR Task | Executes a compiled Java or Scala application packaged as a JAR file. | Running compiled, production-ready code, typically for complex logic. |
| Spark-Submit Task | Allows submission of a generic Spark application via the spark-submit command. | Running custom or highly specialized Spark applications. |
| dbt Task | Runs one or more dbt (data build tool) commands. | Orchestrating and running dbt projects for data transformation. |
| Run Job Task | Executes another Databricks Job as a task. | Creating nested, modular, or reusable workflows (parent-child jobs). |
| If/Else Condition Task | Evaluates a condition and controls the execution flow of subsequent tasks. | Adding conditional logic (branching) to a workflow. |
| For Each Task | Iterates over a collection of input values and runs a nested task for each value. | Parallel processing or batch operations over a list of items. |
| Dashboard Task | Updates a Databricks SQL dashboard. | Automating the refresh of business intelligence dashboards. |
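
To illustrate how tasks and the relationships between them fit together, the following sketch creates a two-task job through the Databricks Python SDK (databricks-sdk); the job name, notebook paths, and cluster id are placeholders:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Two notebook tasks: "transform" runs only after "ingest" succeeds.
created = w.jobs.create(
    name="example-orchestration",
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/Users/<user>/ingest"),
            existing_cluster_id="<cluster id>",
        ),
        jobs.Task(
            task_key="transform",
            depends_on=[jobs.TaskDependency(task_key="ingest")],
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/Users/<user>/transform"),
            existing_cluster_id="<cluster id>",
        ),
    ],
)
print(created.job_id)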

Task communication#

To communicate between tasks you can set and read “task values”.

In Python code, use dbutils.jobs.taskValues for that (see the sketch after this list):

  • dbutils.jobs.taskValues.set(key="<key>", value="<value>") for setting a value.

  • dbutils.jobs.taskValues.get(taskKey="<name of the previous task>", key="<key>") for reading the value.
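
A minimal sketch, assuming two notebook tasks in the same job where the upstream task has the task key prepare; the key name row_count and its value are placeholders:

# Upstream task (task key "prepare"): publish a value for downstream tasks.
dbutils.jobs.taskValues.set(key="row_count", value=42)

# Downstream task: read it back, with a default in case it was never set.
row_count = dbutils.jobs.taskValues.get(taskKey="prepare", key="row_count", default=0)
print(row_count)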

CLI#

The Databricks CLI allows you to manage your Databricks workspace/account from your machine's command line. The following table describes the corresponding command groups:

| Command group | Description / purpose |
|---|---|
| fs | Manage files in DBFS / file system (list, copy, delete, read) |
| git-credentials | Manage personal access tokens for Databricks to perform operations on behalf of the user |
| repos | Manage Git repos within Databricks (import, sync, permissions) |
| secrets | Manage secrets, scopes, and access control for secrets |
| workspace | Handle workspace contents (notebooks, folders) and permissions |
| cluster-policies | Control rules and policies for cluster configurations |
| clusters | Manage cluster lifecycle and settings |
| api | Call any Databricks REST API directly (for advanced or unsupported endpoints) |
| completion | Generate shell autocompletion scripts |
| configure | Set up and configure the Databricks CLI (e.g. host, profile) |
| help | Display summary and help information for commands |
| bundle | Manage Databricks Asset Bundles (CI/CD-style deployments) |
| labs | Work with experimental Labs applications and features in Databricks |
| auth | Manage authentication, login, profiles, and tokens |
| current-user | Show information about the currently authenticated user or service principal |
| model-registry | Manage the workspace’s MLflow Model Registry: models, versions, transitions, metadata, and webhooks |

If you have the Databricks CLI installed on your system, you should be able to run the following command:

databricks --help | head -n 20
Databricks CLI

Usage:
  databricks [command]

Databricks Workspace
  fs                                     Filesystem related commands
  git-credentials                        Registers personal access token for Databricks to do operations on behalf of the user.
  repos                                  The Repos API allows users to manage their git repos.
  secrets                                The Secrets API allows you to manage secrets, secret scopes, and access permissions.
  workspace                              The Workspace API allows you to list, import, export, and delete notebooks and folders.

Compute
  cluster-policies                       You can use cluster policies to control users' ability to configure clusters based on a set of rules.
  clusters                               The Clusters API allows you to create, start, edit, list, terminate, and delete clusters.
  global-init-scripts                    The Global Init Scripts API enables Workspace administrators to configure global initialization scripts for their workspace.
  instance-pools                         Instance Pools API are used to create, edit, delete and list instance pools by using ready-to-use cloud instances which reduces a cluster start and auto-scaling times.
  instance-profiles                      The Instance Profiles API allows admins to add, list, and remove instance profiles that users can launch clusters with.
  libraries                              The Libraries API allows you to install and uninstall libraries and get the status of libraries on a cluster.
  policy-compliance-for-clusters         The policy compliance APIs allow you to view and manage the policy compliance status of clusters in your workspace.

Asset bundles#

An asset bundle is a YAML-based description of a Databricks project and of how it should be deployed and managed.

There are two important concepts in Databricks Asset Bundles:

  • The databricks.yml file and its configuration, which specify the bundle.

  • The databricks bundle subcommand of the Databricks CLI, which lets you manipulate the bundle.

Consider the process of creating the simplest possible asset bundle.

Create the folder and the databricks.yml file within it:

bundle:
  name: knowledge

resources:
  jobs:
    hello-job:
      name: hello-job
      tasks:
        - task_key: hello-task
          notebook_task:
            notebook_path: ./hello.ipynb

targets:
  dev:
    default: true

Create hello.ipynb in the same folder, since the bundle defines a task based on this notebook.

Use the databricks bundle deploy command to push your bundle to the Databricks environment.

After these steps, a corresponding folder appears under the .bundle folder of your Databricks workspace, and hello-job shows up in the jobs list.

To delete the bundle (only on the Databricks side; local files are kept) use databricks bundle destroy.

AI & ML#

Databricks provides a range of tools for building and deploying machine learning solutions. Check the AI and machine learning on Databricks page for more information.

The most useful services are: