# Databricks

Databricks is a platform for working with data and data-related processes: analytics and machine learning (ML).

## Data

Consider how Databricks organizes data. There are three levels:

- **Catalogs**: the top-level containers, which hold schemas.
- **Schemas** (also called databases): contain data objects.
- **Data objects**: a **Volume**, **Table**, **View**, **Function** or **Model**.
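
Objects in this hierarchy are typically addressed by a three-part name, `<catalog>.<schema>.<object>`. The following is a minimal sketch of splitting such a name. The `ObjectName` class, the `parse_object_name` function and the example name `main.sales.orders` are hypothetical, and the parser deliberately ignores backtick-quoted identifiers.

```python
from typing import NamedTuple


class ObjectName(NamedTuple):
    """Hypothetical helper: the three levels of a fully qualified name."""

    catalog: str
    schema: str
    obj: str


def parse_object_name(full_name: str) -> ObjectName:
    """Split '<catalog>.<schema>.<object>' into its three parts.

    Simplification: names quoted with backticks (which may contain dots)
    are not handled.
    """
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <catalog>.<schema>.<object>, got {full_name!r}")
    return ObjectName(*parts)


name = parse_object_name("main.sales.orders")  # hypothetical table name
print(name.catalog, name.schema, name.obj)
```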

For more details, check:

- [Database objects in Databricks](https://docs.databricks.com/aws/en/database-objects/).

## Foundation models

Foundation models are LLMs hosted by Databricks. They include popular models from major vendors. You can find the foundation models available to you in the "Serving" section of your Databricks deployment.

---

The following cell uses the Llama 3.3 model, which is available on my Databricks account.

In [48]:
env=$(databricks auth env | jq ".env")
host=$(echo $env | jq -r ".DATABRICKS_HOST")
token=$(echo $env | jq -r ".DATABRICKS_TOKEN")

ans=$(curl $host/serving-endpoints/databricks-meta-llama-3-3-70b-instruct/invocations -s \
  -H "Authorization: Bearer $token" \
  -d '{
      "messages": [
        {"role": "user", "content": "What is the capital of France?"}
      ]
    }'
)
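
The same request can be assembled in Python with the standard library only. This is a sketch under assumptions: `build_chat_request` is a hypothetical helper, the host and token are placeholders, and the actual network call is left commented out because it needs valid credentials.

```python
import json
import urllib.request


def build_chat_request(host: str, token: str, endpoint: str, prompt: str) -> urllib.request.Request:
    """Build the same POST request that the curl command above sends."""
    url = f"{host.rstrip('/')}/serving-endpoints/{endpoint}/invocations"
    body = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: substitute your own workspace URL and token.
request = build_chat_request(
    host="https://example.cloud.databricks.com",
    token="<token>",
    endpoint="databricks-meta-llama-3-3-70b-instruct",
    prompt="What is the capital of France?",
)
print(request.full_url)

# Sending the request needs network access and valid credentials:
# with urllib.request.urlopen(request) as response:
#     answer = json.load(response)
```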

The API response is represented in the following cell.

In [47]:
echo $ans | jq "."

{
  "id": "chatcmpl_5bad7112-c429-4370-9c73-1ef6d3b64fa7",
  "object": "chat.completion",
  "created": 1763045738,
  "model": "meta-llama-3.3-70b-instruct-121024",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens":

Or, more specifically, the response of the model:

In [46]:
echo $ans | jq ".choices[0].message"

{
  "role": "assistant",
  "content": "The capital of France is Paris."
}

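The same extraction can be done in Python once the response is parsed as JSON. The sketch below works on an abridged copy of the response shown above; `first_reply` is a hypothetical helper name.

```python
import json

# Abridged copy of the response shown above.
raw = """
{
  "object": "chat.completion",
  "model": "meta-llama-3.3-70b-instruct-121024",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "The capital of France is Paris."},
      "finish_reason": "stop"
    }
  ]
}
"""


def first_reply(response: dict) -> str:
    """Return the text of the first choice in a chat completion response."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


print(first_reply(json.loads(raw)))
```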

## Feature store

You can manipulate the feature store with the `databricks.feature_engineering` Python module. It is not included in the Databricks Python SDK out of the box: install the separate [PyPI package](https://pypi.org/project/databricks-feature-engineering/).

Create a feature table with code:

```python
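from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

# `data` is assumed to be an existing Spark DataFrame that contains
# the primary key columns and the feature columns.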
fe.create_table(
    name="<catalog>.<schema>.<table name>",
    primary_keys=["<primary key 1>", "<primary key2>"],
    df=data,
    description="This is some sort of description",
    tags={"source": "bronze", "format": "delta"}
)
```
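
Before calling `create_table`, it can help to check the primary key columns locally. The sketch below assumes that every primary key column must be present and non-null, and that each key combination should identify a single row. The `check_primary_keys` function and the sample rows are hypothetical, and the check runs on plain dictionaries rather than a Spark DataFrame.

```python
from collections import Counter


def check_primary_keys(rows: list[dict], primary_keys: list[str]) -> list[str]:
    """Return a list of human-readable problems found in the key columns."""
    problems = []
    seen = Counter()
    for i, row in enumerate(rows):
        key = tuple(row.get(column) for column in primary_keys)
        if any(value is None for value in key):
            problems.append(f"row {i}: missing or null primary key in {key}")
            continue
        seen[key] += 1
    # Report every key combination that occurs more than once.
    problems += [f"duplicate primary key {key}" for key, n in seen.items() if n > 1]
    return problems


rows = [
    {"customer_id": 1, "day": "2024-01-01", "purchases": 3},
    {"customer_id": 1, "day": "2024-01-01", "purchases": 5},  # duplicate key
    {"customer_id": 2, "day": None, "purchases": 1},          # null key
]
for problem in check_primary_keys(rows, ["customer_id", "day"]):
    print(problem)
```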

## Jobs&Workflows

Jobs and workflows let you orchestrate tasks, which are pieces of code that perform actions on the platform, and define dependencies between them.

The following table lists the ways you can define Databricks tasks.

| Task Type | Description | Primary Use Case |
| :--- | :--- | :--- |
| **Notebook Task** | Runs a Databricks notebook written in Python, Scala, SQL, or R. | Executing interactive code, ETL logic, or ML training pipelines. |
| **Pipeline Task** | Runs a specified Delta Live Tables (DLT) pipeline. | Orchestrating end-to-end declarative ETL/streaming data pipelines. |
| **SQL File Task** | Executes a SQL script file stored in the workspace or a Git repository. | Running complex SQL transformations, DDL, or DML statements. |
| **Python Script Task** | Executes a Python file on the cluster. | Running standard Python code, often with Spark (PySpark) libraries. |
| **Python Wheel Task** | Runs a Python function packaged within a Python Wheel (`.whl`) file. | Running production-grade, modular, and version-controlled Python code. |
| **JAR Task** | Executes a compiled Java or Scala application packaged as a JAR file. | Running compiled, production-ready code, typically for complex logic. |
| **Spark-Submit Task** | Allows submission of a generic Spark application via the `spark-submit` command. | Running custom or highly specialized Spark applications. |
| **dbt Task** | Runs one or more `dbt` (data build tool) commands. | Orchestrating and running dbt projects for data transformation. |
| **Run Job Task** | Executes another Databricks Job as a task. | Creating nested, modular, or reusable workflows (Parent-Child jobs). |
| **If/Else Condition Task**| Evaluates a condition and controls the execution flow of subsequent tasks. | Adding conditional logic (branching) to a workflow. |
| **For Each Task** | Iterates over a collection of input values and runs a nested task for each value. | Parallel processing or batch operations over a list of items. |
| **Dashboard Task** | Updates a Databricks SQL Dashboard. | Automating the refresh of business intelligence dashboards. |
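
Relationships between tasks are expressed as dependencies. The sketch below mirrors the general shape of a job definition (`task_key`, `depends_on`, `notebook_task`); the job name, task names and notebook paths are hypothetical. It uses the standard library to compute one valid run order, which illustrates the idea and is not a description of how Databricks schedules tasks internally.

```python
from graphlib import TopologicalSorter

# Hypothetical job with three notebook tasks chained by dependencies.
job = {
    "name": "example-job",
    "tasks": [
        {"task_key": "ingest", "notebook_task": {"notebook_path": "/Workspace/etl/ingest"}},
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "notebook_task": {"notebook_path": "/Workspace/etl/transform"},
        },
        {
            "task_key": "report",
            "depends_on": [{"task_key": "transform"}],
            "notebook_task": {"notebook_path": "/Workspace/etl/report"},
        },
    ],
}


def run_order(job_definition: dict) -> list[str]:
    """Return the task keys in an order that respects every dependency."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in job_definition["tasks"]
    }
    return list(TopologicalSorter(graph).static_order())


print(run_order(job))
```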


### Task communication

To pass data between tasks, you can set and read "task values".

In Python code, use [dbutils.jobs.taskValues](https://docs.databricks.com/aws/en/dev-tools/databricks-utils#taskvalues-subutility-dbutilsjobstaskvalues) for that:

- `dbutils.jobs.taskValues.set(key="<key>", value="<value>")` sets a value.
- `dbutils.jobs.taskValues.get(taskKey="<name of the task that set the value>", key="<key>")` reads the value.
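
`dbutils` exists only inside a Databricks runtime, so the snippet below uses a hypothetical in-memory stand-in (`FakeTaskValues`) to illustrate the set/get flow between two tasks. It assumes that values must be JSON-serializable and that `get` accepts a `default` for missing keys; check the linked reference for the exact behavior of the real utility.

```python
import json


class FakeTaskValues:
    """Hypothetical local stand-in for dbutils.jobs.taskValues."""

    def __init__(self):
        self._store = {}          # (task name, key) -> JSON string
        self.current_task = None  # name of the task that is "running"

    def set(self, key, value):
        # Assumption: values are stored as JSON, so they must be serializable.
        self._store[(self.current_task, key)] = json.dumps(value)

    def get(self, taskKey, key, default=None):
        try:
            return json.loads(self._store[(taskKey, key)])
        except KeyError:
            if default is None:
                raise ValueError(f"no task value {key!r} set by task {taskKey!r}")
            return default


task_values = FakeTaskValues()

# The upstream task "count_rows" publishes a value.
task_values.current_task = "count_rows"
task_values.set(key="row_count", value=42)

# The downstream task reads it by naming the upstream task.
task_values.current_task = "report"
print(task_values.get(taskKey="count_rows", key="row_count"))
print(task_values.get(taskKey="count_rows", key="missing", default=0))
```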

## CLI

The Databricks CLI allows you to manage your Databricks workspace or account from your machine's command line. The following table shows the main command groups:

| Command group        | Description / purpose                                                                               |
| -------------------- | --------------------------------------------------------------------------------------------------- |
| **fs**               | Manage files in DBFS / file system (list, copy, delete, read)                                       |
| **git-credentials**  | Manage personal access tokens for Databricks to perform operations on behalf of user                |
| **repos**            | Manage Git repos within Databricks (import, sync, permissions)                                      |
| **secrets**          | Manage secrets, scopes, and access control for secrets                                              |
| **workspace**        | Handle workspace contents (notebooks, folders) and permissions                                      |
| **cluster-policies** | Control rules and policies for cluster configurations                                               |
| **clusters**         | Manage cluster lifecycle and settings                                                               |
| **api**              | Call any Databricks REST API directly (for advanced or unsupported endpoints)                       |
| **completion**       | Generate shell autocompletion scripts                                                               |
| **configure**        | Set up and configure the Databricks CLI (e.g. host, profile)                                        |
| **help**             | Display summary and help information for commands                                                   |
| **bundle**           | Manage Databricks Asset Bundles (CI/CD-style deployments)                                           |
| **labs**             | Work with experimental Labs applications and features in Databricks                                 |
| **auth**             | Manage authentication, login, profiles, and tokens                                                  |
| **current-user**     | Show information about the currently authenticated user or service principal                        |
| **model-registry**   | Manage the workspace's MLflow Model Registry: models, versions, transitions, metadata, and webhooks |

Check more in:
- [What is the Databricks CLI](https://docs.databricks.com/aws/en/dev-tools/cli/).
- [Installation guide](https://docs.databricks.com/aws/en/dev-tools/cli/install).

---

If the Databricks CLI is installed on your system, you should be able to run the following command:

In [1]:
databricks --help | head -n 20

Databricks CLI

Usage:
  databricks [command]

Databricks Workspace
  fs                                     Filesystem related commands
  git-credentials                        Registers personal access token for Databricks to do operations on behalf of the user.
  repos                                  The Repos API allows users to manage their git repos.
  secrets                                The Secrets API allows you to manage secrets, secret scopes, and access permissions.
  workspace                              The Workspace API allows you to list, import, export, and delete notebooks and folders.

Compute
  cluster-policies                       You can use cluster policies to control users' ability to configure clusters based on a set of rules.
  clusters                               The Clusters API allows you to create, start, edit, list, terminate, and delete clusters.
  global-init-scripts                    The Global Init Scripts API enables Workspace administrators 
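
The same check can be scripted from Python. This is a minimal sketch using only the standard library: `databricks_argv` is a hypothetical helper that assembles the argument list (optionally with the CLI's `--profile` flag), and the command is executed only when a `databricks` executable is found on the `PATH`.

```python
import shutil
import subprocess


def databricks_argv(*args: str, profile=None) -> list[str]:
    """Assemble the argument list for a Databricks CLI call."""
    argv = ["databricks", *args]
    if profile is not None:
        argv += ["--profile", profile]
    return argv


argv = databricks_argv("--help")
print(argv)

if shutil.which("databricks"):
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    # Show only the first lines, like `head -n 20` in the cell above.
    print("\n".join(result.stdout.splitlines()[:20]))
else:
    print("Databricks CLI not found on PATH")
```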

## Asset bundles

A Databricks Asset Bundle is a YAML-based description of a Databricks project that lets you manage and deploy the project as a unit.

There are two important concepts in Databricks Asset Bundles:

- The `databricks.yml` file, whose configuration defines the bundle.
- The `databricks bundle` command group of the Databricks CLI, which lets you manage the bundle.

For more details check:

- [What is Databricks Asset Bundles](https://docs.databricks.com/aws/en/dev-tools/bundles/).
- [Develop Databricks Asset Bundles](https://docs.databricks.com/aws/en/dev-tools/bundles/work-tasks): guides you through the process of creating and deploying a bundle.
- The [`bundle` command group](https://docs.databricks.com/aws/en/dev-tools/cli/bundle-commands): describes the Databricks CLI commands that manage bundles.
- The [asset bundles](databricks/assets_bundles.ipynb) page on this site.

---

Consider the process of creating the simplest asset bundle.

Create a folder and a `databricks.yml` file within it:

```yaml
bundle:
  name: knowledge

resources:
  jobs:
    hello-job:
      name: hello-job
      tasks:
        - task_key: hello-task
          notebook_task:
            notebook_path: ./hello.ipynb

targets:
  dev:
    default: true
```

Create `hello.ipynb` in the same folder, because the job defines a task based on it.
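
One way to produce a minimal `hello.ipynb` is to write the notebook JSON directly. This sketch assumes the standard Jupyter notebook format (nbformat 4) with a single code cell. The `write_hello_notebook` helper and the cell content are illustrative only, and the example writes into a temporary folder; pass your bundle folder instead.

```python
import json
import tempfile
from pathlib import Path


def write_hello_notebook(folder: Path) -> Path:
    """Write a one-cell notebook named hello.ipynb into `folder`."""
    notebook = {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": ["print('Hello from the bundle')"],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }
    path = folder / "hello.ipynb"
    path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
    return path


# Demonstration in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    path = write_hello_notebook(Path(tmp))
    written = json.loads(path.read_text(encoding="utf-8"))
    print(path.name, written["cells"][0]["source"])
```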

Use the `databricks bundle deploy` command to push your bundle to the Databricks environment.

After these steps, you should see the corresponding folder in the `.bundle` folder of your Databricks workspace, and `hello-job` will be listed in the jobs list.

To delete the bundle (only in the Databricks environment), use `databricks bundle destroy`.

## AI&ML

Databricks provides a range of tools for building and deploying machine learning solutions. Check the [AI and machine learning on Databricks](https://docs.databricks.com/aws/en/machine-learning/) page for more information.

The most useful services are:

- Databricks provides **OpenAI-compatible model** endpoints, so you can access some models using only your Databricks credentials. For more, check [Get started querying LLMs on Databricks](https://docs.databricks.com/aws/en/large-language-models/llm-serving-intro).
- [Mosaic AI Vector Search](https://docs.databricks.com/aws/en/vector-search/vector-search) for embeddings retrieval.