
Feature engineering#

This page discusses the capabilities of the feature engineering module. In Databricks, feature engineering provides a convenient way to organize data for model fitting and deployment while ensuring that the data is stored correctly.

Setup#

Databricks provides a special module, databricks.feature_engineering, that allows you to work with the feature store. The package that provides this module is published on PyPI.

Officially, it only works in the Databricks environment. The only way to use this package locally is through the Databricks VSCode extension.

Note. You may be confused by the databricks.feature_store package, which has the same purpose. It is a legacy package.

Note. You won’t be able to create a FeatureEngineeringClient if your environment has its own spark installation, so use a separate environment.

Note. When you’re using Databricks environments, even through the VSCode extension, the spark object is available without assigning it yourself.

Note. Databricks uses MLflow under the hood, so it is recommended to set the MLflow tracking and registry URIs to control where its files are stored.


The following cell creates a client for feature engineering. It will only run if everything is configured correctly.

import mlflow
mlflow.set_registry_uri("file:///tmp/databricks")
mlflow.set_tracking_uri("file:///tmp/databricks")

from databricks.feature_engineering import FeatureEngineeringClient
fe = FeatureEngineeringClient()
/home/user/.virtualenvironments/databricks_connect/lib/python3.13/site-packages/databricks/ml_features/utils/request_context.py:8: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources

The warning output looks scary, but the client works fine.

The following cell creates the schema that will be used in the examples.

ans = spark.sql("CREATE SCHEMA IF NOT EXISTS knowledge")

Create table#

To create a feature table, use the FeatureEngineeringClient.create_table method.

The following code creates a feature table.

df = spark.createDataFrame(
    data=[(i, i*2) for i in range(10)],
    schema=["col1", "col2"]
)

fe.create_table(
    name="knowledge.name_of_table",
    df=df,
    primary_keys=["col1"]
)
2025/09/30 13:38:02 INFO databricks.ml_features._compute_client._compute_client: Setting columns ['col1'] of table 'workspace.knowledge.name_of_table' to NOT NULL.
2025/09/30 13:38:03 INFO databricks.ml_features._compute_client._compute_client: Setting Primary Keys constraint ['col1'] on table 'workspace.knowledge.name_of_table'.
2025/09/30 13:38:10 INFO databricks.ml_features._compute_client._compute_client: Created feature table 'workspace.knowledge.name_of_table'.
<FeatureTable: name='workspace.knowledge.name_of_table', table_id='a154e89c-d19b-4953-9219-83dde59fd85f', description='', primary_keys=['col1'], partition_columns=[], features=['col1', 'col2'], creation_timestamp=1759232282260, online_stores=[], notebook_producers=[], job_producers=[], table_data_sources=[], path_data_sources=[], custom_data_sources=[], timestamp_keys=[], tags={}>

The table appears among the regular tables and will be listed in the output of the SHOW TABLES command.

spark.sql("SHOW TABLES FROM knowledge;")
database tableName isTemporary
0 knowledge name_of_table False
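
The table can also be read back through the feature engineering client. A minimal sketch, assuming the fe client and the knowledge.name_of_table table created above are still available:

# Load the whole feature table as a Spark DataFrame
features_df = fe.read_table(name="knowledge.name_of_table")
features_df.show()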

The table can also be dropped like a regular table.

ans = spark.sql("DROP TABLE IF EXISTS knowledge.name_of_table;")

Feature lookup#

A feature lookup specifies how features are retrieved from the feature store. Create a feature lookup with the following code:

from databricks.feature_engineering import FeatureLookup

feature_lookup = FeatureLookup(
    table_name="load_from",
    lookup_key="key",
    feature_names=["feature1", "feature2"]
)
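
On its own, a feature lookup does nothing; it is typically passed to FeatureEngineeringClient.create_training_set, which joins the looked-up features onto a DataFrame of labels. The following sketch is hypothetical: it assumes the knowledge.name_of_table table created above and an invented labels DataFrame whose col1 column matches the table's primary key.

from databricks.feature_engineering import FeatureLookup

# Hypothetical labels: `col1` matches the primary key of the feature table,
# `label` is the target column. `spark` and `fe` come from the cells above.
labels_df = spark.createDataFrame(
    data=[(i, i % 2) for i in range(10)],
    schema=["col1", "label"]
)

lookup = FeatureLookup(
    table_name="knowledge.name_of_table",
    lookup_key="col1",
    feature_names=["col2"]
)

# Join the looked-up features onto the labels.
training_set = fe.create_training_set(
    df=labels_df,
    feature_lookups=[lookup],
    label="label"
)
training_df = training_set.load_df()

The resulting training_df contains the lookup key, the looked-up features, and the label column, and can be passed to a model fitting routine.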