Asset bundles#
Asset bundles are a way to define a Databricks project as code. You can develop your project locally following the typical Databricks project patterns and deploy it to the platform using your CI/CD pipeline.
Configuration#
The bundle's configuration is defined in the databricks.yml file. Consider the main configuration options for the bundle; check the official descriptions in the Databricks Asset Bundle configuration reference.
- bundle: specifies the Databricks environment and the bundle's basic properties.
- include: allows you to specify other configuration files. When the configuration is relatively complex, it is convenient to keep some of it in separate files.
- scripts: defines scripts to be run in the local environment, but with the configuration of the Databricks environment that corresponds to the bundle applied. You run them with a command like databricks bundle run <name specified for the script>.
- sync: specifies which files are pushed to the Databricks environment during databricks bundle deploy.
- artifacts: if your project produces output files during the build (a Python whl, a Java jar, etc.), you have to specify this using the artifacts attribute. The most important detail is that the script that generates the artifact is defined here; it is executed during databricks bundle deploy.
- variables: here you can define variables that can be used in substitutions.
- resources: specifies the Databricks resources, i.e. the Databricks features used by the project: jobs, dashboards, clusters, etc.
- targets: sometimes you need several setups for the same project, most commonly dev and production; targets allows you to specify exactly this.
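To make this structure concrete, below is a minimal skeleton of a databricks.yml that combines several of these top-level keys. It is only a sketch: the bundle name, the included path, the variable, and the job are placeholder examples, not values required by Databricks.

bundle:
  name: my_bundle

# Keep additional resource definitions in separate files
include:
  - resources/*.yml

variables:
  catalog:
    default: main

targets:
  dev:
    mode: development
    default: true
  prod:
    mode: production

resources:
  jobs:
    example_job:
      name: example_job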
Artifacts#
Consider the simplest possible artifact usage. The following code specifies an artifact whose build command creates the result file.
bundle:
  name: knowledge

artifacts:
  default:
    build: echo "this is new configuration" > result
Running the databricks bundle deploy command builds the artifact, creating the result file, which is then published to the Databricks environment.
Substitutions#
The substitution mechanism lets you retrieve values and insert them into the configuration during bundle deployment or run. Substitutions are written in the ${<variable name>} format. Check the Substitutions page of the official documentation for more details.
For example, consider the following pattern in the configuration:
artifacts:
  default:
    build: echo "This is ${bundle.name} bundle" > ${bundle.target}
It will create a file named after the bundle's target and write a string containing the bundle name into it.
Variables#
Variables can be specified using the following syntax:
variables:
  <var_name1>:
    ...
  <var_name2>:
    ...
To pass a value to a variable:
- Use an environment variable that follows the pattern BUNDLE_VAR_<name of variable>; Databricks CLI commands executed from the corresponding environment will pick up this value automatically.
- Use the --var="<var_name1>=<var_value1>,<var_name2>=<var_value2>" option of databricks bundle deploy.
For more check the Custom variables section of the official documentation.
As an example, consider the following configuration for variables:
variables:
  var1:
    default: value1
  var2:
    default: value2

artifacts:
  default:
    build: echo "${var.var1} and ${var.var2}" > result
The variables var1 and var2 are defined in the bundle and then used in the command that creates the file.
After deploying the bundle, the result file contains the default values of the variables.
$ databricks bundle deploy
Building default...
Uploading bundle files to /Workspace/Users/fedor.kobak@innowise.com/.bundle/python_default/dev/files...
Deploying resources...
Updating deployment state...
Deployment complete!
$ cat result
value1 and value2
The following cell shows how the default values of the variables can be replaced:
- The var1 value is specified through BUNDLE_VAR_var1="hello".
- The var2 value is specified through --var="var2=world".
$ BUNDLE_VAR_var1="hello" databricks bundle deploy --var="var2=world"
Building default...
Uploading bundle files to /Workspace/Users/fedor.kobak@innowise.com/.bundle/knowledge/dev/files...
Deploying resources...
Deployment complete!
$ cat result
hello and world
Execute scripts#
To execute scripts with the bundle's configuration and credentials, use databricks bundle run <reference to the script>. The script can be passed inline or defined in databricks.yml.
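As a sketch of the declarative form, a script can be described under the scripts mapping and then invoked by its name; the script name is a placeholder here, and the content key is assumed to hold the command to execute.

scripts:
  print_host:
    # Command runs locally, but with the bundle's Databricks configuration applied
    content: python3 -c 'import os; print(os.environ["DATABRICKS_HOST"])'

With such a definition, databricks bundle run print_host would behave like the inline form shown below.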
For example, if you try to access DATABRICKS_HOST from the plain local Python environment, you will receive an error:
$ python3 -c 'import os; print(os.environ["DATABRICKS_HOST"])'
Traceback (most recent call last):
File "<string>", line 1, in <module>
import os; print(os.environ["DATABRICKS_HOST"])
~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
File "<frozen os>", line 717, in __getitem__
KeyError: 'DATABRICKS_HOST'
But the same command works fine when run through the Databricks CLI.
$ databricks bundle run -- python3 -c 'import os; print(os.environ["DATABRICKS_HOST"][:20])'
https://dbc-da0651ae
Targets#
Targets allow you to define multiple configurations for the same bundle. For example, you may need separate development and production environments.
The target definition may have the following syntax:
targets:
  <target1 name>:
    <configuration>
  <target2 name>:
    <configuration>
The typical way to select a target is the -t <target name> option.
As an example, consider the following configuration:
artifacts:
  default:
    build: echo "My target is ${bundle.target}" > result

targets:
  dev:
    mode: development
    default: true
  prod:
    mode: development
Information about the selected target is forwarded through the artifact build command.
Note: here, the mode for the prod target is set to development just to keep things simple, since production mode requires additional configuration.
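For reference, a more realistic prod target would typically switch to production mode and pin the workspace explicitly. The following is only a sketch of what that additional configuration might look like; the host URL and root path are placeholders.

targets:
  prod:
    mode: production
    workspace:
      # Deploy to an explicit workspace and a shared root path rather than a user folder
      host: https://<your-workspace>.cloud.databricks.com
      root_path: /Workspace/Shared/.bundle/${bundle.name}/${bundle.target}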
With a typical run (no target specified), the system uses dev as the value of bundle.target.
$ databricks bundle deploy
Building default...
Uploading bundle files to /Workspace/Users/fedor.kobak@innowise.com/.bundle/python_default/dev/files...
Deploying resources...
Updating deployment state...
Deployment complete!
$ cat result
My target is dev
The following cell runs databricks bundle deploy -t prod, which forces the CLI to use the prod target.
$ databricks bundle deploy -t prod
Building default...
Uploading bundle files to /Workspace/Users/fedor.kobak@innowise.com/.bundle/python_default/prod/files...
Deploying resources...
Updating deployment state...
Deployment complete!
$ cat result
My target is prod
The result file reflects the selected target.
Resources#
In this section, we will discuss how to define resources in Databricks. A resource is a feature of Databricks that can be used by our application.
They are defined using the following syntax:
resources:
  jobs:
    <list of the jobs>
  apps:
    <list of the apps>
  ...
Check the Supported resources.
Jobs#
Jobs are probably the most popular resource type in Databricks. A job is an automated workflow defined in Databricks.
The most important aspects of the job configuration are:
- name: defines the name of the job.
- tasks: lists the tasks that belong to the job.
- schedule: sets up the rules for when the job has to be executed.
Check more in:
Job description for the keys of asset bundles.
Job configuration in bundle configuration examples.
The most minimal definition of a job may take the following form:
resources:
  jobs:
    job1:
      name: job1
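A slightly fuller sketch adds the schedule and tasks keys mentioned above; the job name, cron expression, and notebook path are placeholder examples.

resources:
  jobs:
    nightly_job:
      name: nightly_job
      # Run every day at 02:00 UTC (Quartz cron syntax)
      schedule:
        quartz_cron_expression: "0 0 2 * * ?"
        timezone_id: UTC
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./notebook.py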
Registered model#
A registered model is a model registered in MLflow inside Databricks.
Check more in registered_model reference in the Databricks documentation.
For example, the following configuration creates a model named example in the knowledge catalog and the knowledge schema.
bundle:
  name: knowledge

resources:
  registered_models:
    model:
      name: example
      catalog_name: knowledge
      schema_name: knowledge
      comment: Registered model in Unity Catalog for ${bundle.target} deployment target
Dependencies#
There are several ways to manage third-party dependencies in asset bundles.
- The libraries option allows you to define the packages that will be installed on the cluster during deployment. This option won't work for serverless compute.
- The environments option allows you to specify the environments that will be created when the job starts, so a fresh environment is created each time the job starts.
Libraries#
Typically, your project will depend on third-party libraries. You specify them under the libraries key, either for the entire cluster or for a particular task.
Note: libraries are not supported by serverless compute. The best solution in that case is probably to create a Databricks environment where you can set up the packages you need; see Manage serverless base environments.
For more, check the Databricks Asset Bundles library dependencies page.
The following code illustrates how to specify libraries for a particular task.
resources:
  jobs:
    my_job:
      tasks:
        - task_key: my_task
          libraries:
            - pypi:
                package: cowsay==6.1
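If the bundle's artifacts section builds a Python wheel, the same libraries list can install that wheel on the task's compute; the path below is an assumption about where the build step places the wheel.

resources:
  jobs:
    my_job:
      tasks:
        - task_key: my_task
          libraries:
            # Install the wheel produced by this bundle's build step
            - whl: ./dist/*.whl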
Environments#
Environments are defined at the job level and referenced by tasks. There is an environments key; each element of this list is a mapping that describes a single environment. The keys available in the environment mapping are described in the following table:
| Key | Description |
|---|---|
| environment_key | The environment identifier that is used by tasks to reference this environment. |
| spec | The specification of the environment. |
| spec.environment_version | The version of the serverless environment to use. |
| spec.dependencies | List where each entry complements the environment with an additional package dependency. |
For each task, you must set environment_key to point to the described environment. Note: this approach cannot be used with notebook tasks; the documentation notes that you have to use the %pip install magic command in the notebook code instead.
The following code defines the default_python environment and uses it as the environment for the my_task task:
resources:
  jobs:
    my_job:
      name: cow_say_job
      environments:
        - environment_key: default_python
          spec:
            environment_version: '4'
            dependencies:
              - cowsay==6.1
      tasks:
        - task_key: my_task
          spark_python_task:
            python_file: ./my_file.py
          environment_key: default_python
Tasks#
A task is a stage in a job. The following important details are associated with task definitions:
- The configuration of tasks is located at the following YAML path: resources.jobs.{job_name}.tasks.
- Each task configuration begins with a "- task_key: task_identifier" list element.
- A set of keys in the task configuration determines the type of the task; under such a key lives the configuration specific to that task type: notebook_task, sql_task, pipeline_task, spark_python_task, and so on.
- Another set of keys describes the task in general: the other task settings.
The configuration of a task might look like this:
resources:
  jobs:
    my_job:
      name: cow_say_job
      tasks:
        - task_key: task1
          notebook_task:
            notebook_path: ./my_file.py
        - task_key: task2
          sql_task:
            file:
              path: ./my_file.sql
Check the task settings page, which lists the different types of tasks and their configuration.
Notebook task#
The notebook task allows you to set up a task that executes a notebook. The general form of the definition is:
- task_key: some_task
  notebook_task:
    notebook_path: my_file.ipynb
.py as notebook#
You can use a regular .py file as a notebook by adding the line "# Databricks notebook source" at the beginning of the file. After deployment, Databricks will treat it as a notebook.
Consider the following resources configuration:
resources:
  jobs:
    job1:
      name: job1
      tasks:
        - task_key: task1
          notebook_task:
            notebook_path: file.py
Where file.py is:
print("hello world")
An attempt to deploy the bundle fails:
$ databricks bundle deploy
Error: expected a notebook for "resources.jobs.job1.tasks[0].notebook_task.notebook_path" but got a file: file at /tmp/databricks_experiments/file.py is not a notebook
However, if the file is defined slightly differently:
# Databricks notebook source
print("hello world")
The deployment goes fine:
$ databricks bundle deploy
Uploading bundle files to /Workspace/Users/fedor.kobak@innowise.com/.bundle/my_bundle/dev/files...
Deploying resources...
Updating deployment state...
Deployment complete!