Asset bundles#
Asset bundles are a way to define a Databricks project as code. You can develop your project locally following the typical Databricks project patterns and deploy it to the platform using your CI/CD pipeline.
Configuration#
The bundle's configuration is defined in the databricks.yml file. Consider the main configuration options for the bundle; check the official descriptions in the Databricks Asset Bundle configuration reference.
- bundle: specifies the Databricks environment and the bundle's basic properties.
- include: allows you to specify other configuration files. When the configuration is relatively complex, it is convenient to keep some of it in separate files.
- scripts: defines scripts to be run in the local environment, but with the configuration of the Databricks environment that corresponds to the bundle applied. You run them with a command like databricks bundle run <name specified for the script>.
- sync: specifies which files are pushed to the Databricks environment during databricks bundle deploy.
- artifacts: if your project produces output files during the build (a Python whl, a Java jar, etc.), you have to specify this using the artifacts attribute. The most important detail is that the script that generates the artifact is defined here; it is executed during databricks bundle deploy.
- variables: here you can define variables that can be used in substitutions.
- resources: specifies the Databricks resources, i.e. the Databricks features used by the project: jobs, dashboards, clusters, etc.
- targets: sometimes you need several setups for the same project, most commonly dev and production; targets allows you to specify exactly this.
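To make this structure concrete, below is a minimal skeleton of a databricks.yml that combines several of these top-level keys. It is only a sketch: the bundle name, the included path, the variable, and the job are placeholder examples, not values required by Databricks.

bundle:
  name: my_bundle

# Keep additional resource definitions in separate files
include:
  - resources/*.yml

variables:
  catalog:
    default: main

targets:
  dev:
    mode: development
    default: true
  prod:
    mode: production

resources:
  jobs:
    example_job:
      name: example_job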
Artifacts#
Consider the simplest possible artifact usage. The following code specifies an artifact whose build command creates the result file.
bundle:
  name: knowledge

artifacts:
  default:
    build: echo "this is new configuration" > result
Running the databricks bundle deploy command builds the artifact, creating the result file, which is then published to the Databricks environment.
Substitutions#
The substitution mechanism lets you retrieve values and insert them into the configuration during bundle deployment or run. Substitutions are written in the ${<variable name>} format. Check the Substitutions page of the official documentation for more details.
For example, consider the following pattern in the configuration:
artifacts:
  default:
    build: echo "This is ${bundle.name} bundle" > ${bundle.target}
It will create a file named after the bundle's target and write a string containing the bundle name into it.
Variables#
Variables can be specified using the following syntax:
variables:
  <var_name1>:
    ...
  <var_name2>:
    ...
To pass a value to a variable:
- Use an environment variable that follows the pattern BUNDLE_VAR_<name of variable>; Databricks CLI commands executed from the corresponding environment will pick up this value automatically.
- Use the --var="<var_name1>=<var_value1>,<var_name2>=<var_value2>" option of databricks bundle deploy.
For more check the Custom variables section of the official documentation.
As an example, consider the following configuration for variables:
variables:
  var1:
    default: value1
  var2:
    default: value2

artifacts:
  default:
    build: echo "${var.var1} and ${var.var2}" > result
The variables var1 and var2 are defined in the bundle and then used in the command that creates the file.
After deploying the bundle, the result file contains the default values of the variables.
$ databricks bundle deploy
Building default...
Uploading bundle files to /Workspace/Users/fedor.kobak@innowise.com/.bundle/python_default/dev/files...
Deploying resources...
Updating deployment state...
Deployment complete!
$ cat result
value1 and value2
The following cell shows how the default values of the variables can be replaced:
- The var1 value is specified through BUNDLE_VAR_var1="hello".
- The var2 value is specified through --var="var2=world".
$ BUNDLE_VAR_var1="hello" databricks bundle deploy --var="var2=world"
Building default...
Uploading bundle files to /Workspace/Users/fedor.kobak@innowise.com/.bundle/knowledge/dev/files...
Deploying resources...
Deployment complete!
$ cat result
hello and world
Execute scripts#
To execute scripts with the bundle's configuration and credentials, use databricks bundle run <reference to the script>. The script can be passed inline or defined in databricks.yml.
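As a sketch of the declarative form, a script can be described under the scripts mapping and then invoked by its name; the script name is a placeholder here, and the content key is assumed to hold the command to execute.

scripts:
  print_host:
    # Command runs locally, but with the bundle's Databricks configuration applied
    content: python3 -c 'import os; print(os.environ["DATABRICKS_HOST"])'

With such a definition, databricks bundle run print_host would behave like the inline form shown below.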
For example, if you try to access DATABRICKS_HOST from the plain local Python environment, you will receive an error:
$ python3 -c 'import os; print(os.environ["DATABRICKS_HOST"])'
Traceback (most recent call last):
File "<string>", line 1, in <module>
import os; print(os.environ["DATABRICKS_HOST"])
~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
File "<frozen os>", line 717, in __getitem__
KeyError: 'DATABRICKS_HOST'
But the same command works fine when run through the Databricks CLI.
$ databricks bundle run -- python3 -c 'import os; print(os.environ["DATABRICKS_HOST"][:20])'
https://dbc-da0651ae
Targets#
Targets allow you to define multiple configurations for the same bundle. For example, you may need separate development and production environments.
The target definition may have the following syntax:
targets:
  <target1 name>:
    <configuration>
  <target2 name>:
    <configuration>
The typical way to select a target is the -t <target name> option.
As an example, consider the following configuration:
artifacts:
  default:
    build: echo "My target is ${bundle.target}" > result

targets:
  dev:
    mode: development
    default: true
  prod:
    mode: development
Information about the selected target is forwarded through the artifact build command.
Note: here, the mode for the prod target is set to development just to keep things simple, since production mode requires additional configuration.
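For reference, a more realistic prod target would typically switch to production mode and pin the workspace explicitly. The following is only a sketch of what that additional configuration might look like; the host URL and root path are placeholders.

targets:
  prod:
    mode: production
    workspace:
      # Deploy to an explicit workspace and a shared root path rather than a user folder
      host: https://<your-workspace>.cloud.databricks.com
      root_path: /Workspace/Shared/.bundle/${bundle.name}/${bundle.target}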
With a typical run (no target specified), the system uses dev as the value of bundle.target.
$ databricks bundle deploy
Building default...
Uploading bundle files to /Workspace/Users/fedor.kobak@innowise.com/.bundle/python_default/dev/files...
Deploying resources...
Updating deployment state...
Deployment complete!
$ cat result
My target is dev
The following cell runs databricks bundle deploy -t prod, which forces the CLI to use the prod target.
$ databricks bundle deploy -t prod
Building default...
Uploading bundle files to /Workspace/Users/fedor.kobak@innowise.com/.bundle/python_default/prod/files...
Deploying resources...
Updating deployment state...
Deployment complete!
$ cat result
My target is prod
The result file reflects the selected target.
Resources#
In this section, we will discuss how to define resources in Databricks. A resource is a feature of Databricks that can be used by our application.
They are defined using the following syntax:
resources:
  jobs:
    <list of the jobs>
  apps:
    <list of the apps>
  ...
Check the Supported resources.
Jobs#
Jobs are probably the most popular resource type in Databricks. A job is an automated workflow defined in Databricks.
The most important aspects of the job configuration are:
- name: defines the name of the job.
- tasks: lists the tasks that belong to the job.
- schedule: sets up the rules for when the job has to be executed.
Check more in:
Job description for the keys of asset bundles.
Job configuration in bundle configuration examples.
The most minimal definition of a job may take the following form:
resources:
  jobs:
    job1:
      name: job1
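A slightly fuller sketch adds the schedule and tasks keys mentioned above; the job name, cron expression, and notebook path are placeholder examples.

resources:
  jobs:
    nightly_job:
      name: nightly_job
      # Run every day at 02:00 UTC (Quartz cron syntax)
      schedule:
        quartz_cron_expression: "0 0 2 * * ?"
        timezone_id: UTC
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./notebook.py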
Registered model#
A registered model is a model registered in MLflow inside Databricks.
Check more in registered_model reference in the Databricks documentation.
For example, the following configuration creates a model named example in the knowledge catalog and the knowledge schema.
bundle:
  name: knowledge

resources:
  registered_models:
    model:
      name: example
      catalog_name: knowledge
      schema_name: knowledge
      comment: Registered model in Unity Catalog for ${bundle.target} deployment target
Dependencies#
There are several ways to manage third-party dependencies in asset bundles.
- The libraries option allows you to define the packages that will be installed on the cluster during deployment. This option won't work for serverless compute.
- The environments option allows you to specify the environments that will be created when the job starts, so a fresh environment is created each time the job starts.
Libraries#
Typically, your project will depend on third-party libraries. You specify them under the libraries key, either for the entire cluster or for a particular task.
Note: libraries are not supported by serverless compute. The best solution in that case is probably to create a Databricks environment where you can set up the packages you need; see Manage serverless base environments.
For more, check the Databricks Asset Bundles library dependencies page.
The following code illustrates how to specify libraries for a particular task.
resources:
  jobs:
    my_job:
      tasks:
        - task_key: my_task
          libraries:
            - pypi:
                package: cowsay==6.1
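If the bundle's artifacts section builds a Python wheel, the same libraries list can install that wheel on the task's compute; the path below is an assumption about where the build step places the wheel.

resources:
  jobs:
    my_job:
      tasks:
        - task_key: my_task
          libraries:
            # Install the wheel produced by this bundle's build step
            - whl: ./dist/*.whl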
Environments#
Environments are defined at the job level and referenced by tasks. There is an environments key; each element of this list is a mapping that describes a single environment. The keys available in the environment mapping are described in the following table:
| Key | Description |
|---|---|
| environment_key | The environment identifier that is used by tasks to reference this environment. |
| spec | The specification of the environment. |
| spec.environment_version | The version of the serverless environment to use. |
| spec.dependencies | List where each entry complements the environment with an additional package dependency. |
For each task, you must set environment_key to point to the described environment. Note: this approach cannot be used with notebook tasks; the documentation notes that you have to use the %pip install magic command in the notebook code instead.
The following code defines the default_python environment and uses it as the environment for the my_task task:
resources:
  jobs:
    my_job:
      name: cow_say_job
      environments:
        - environment_key: default_python
          spec:
            environment_version: '4'
            dependencies:
              - cowsay==6.1
      tasks:
        - task_key: my_task
          spark_python_task:
            python_file: ./my_file.py
          environment_key: default_python
Tasks#
A task is a stage in a job. The following important details are associated with task definitions:
- The configuration of tasks is located at the following YAML path: resources.jobs.{job_name}.tasks.
- Each task configuration begins with a "- task_key: task_identifier" list element.
- A set of keys in the task configuration determines the type of the task; under such a key lives the configuration specific to that task type: notebook_task, sql_task, pipeline_task, spark_python_task, and so on.
- Another set of keys describes the task in general: the other task settings.
The configuration of a task might look like this:
resources:
  jobs:
    my_job:
      name: cow_say_job
      tasks:
        - task_key: task1
          notebook_task:
            notebook_path: ./my_file.py
        - task_key: task2
          sql_task:
            file:
              path: ./my_file.sql
Check the task settings page, which lists the different types of tasks and their configuration.
Notebook task#
The notebook task allows you to set up a task that executes a notebook. The general form of the definition is:
- task_key: some_task
  notebook_task:
    notebook_path: my_file.ipynb
.py as notebook#
You can use a regular .py file as a notebook by adding the line "# Databricks notebook source" at the beginning of the file. After deployment, Databricks will treat it as a notebook.
Consider the following resources configuration:
resources:
  jobs:
    job1:
      name: job1
      tasks:
        - task_key: task1
          notebook_task:
            notebook_path: file.py
Where file.py is:
print("hello world")
An attempt to deploy the bundle fails:
$ databricks bundle deploy
Error: expected a notebook for "resources.jobs.job1.tasks[0].notebook_task.notebook_path" but got a file: file at /tmp/databricks_experiments/file.py is not a notebook
However, if the file is defined slightly differently:
# Databricks notebook source
print("hello world")
The deployment goes fine:
$ databricks bundle deploy
Uploading bundle files to /Workspace/Users/fedor.kobak@innowise.com/.bundle/my_bundle/dev/files...
Deploying resources...
Updating deployment state...
Deployment complete!