Serving#
Databricks provides a number of tools for serving ML models, as well as ready-deployed endpoints (such as the foundation model APIs). This page discusses how to use them through the Python SDK.
OpenAI client#
The serving_endpoints.get_open_ai_client method returns an openai.OpenAI client, which you can use to query some of the served models.
The following cell creates the open_ai_client and shows that it really is an openai.OpenAI client.
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
open_ai_client = w.serving_endpoints.get_open_ai_client()
type(open_ai_client)
openai.OpenAI
The following cell illustrates an invocation of an embedding model.
embedding = open_ai_client.embeddings.create(
model="databricks-gte-large-en",
input="hello"
)
type(embedding)
openai.types.create_embedding_response.CreateEmbeddingResponse
The result is an OpenAI embedding response object.
embedding.data[0].embedding[:20]
[-0.9521484375,
-0.7998046875,
-0.79931640625,
-0.138427734375,
-0.79150390625,
-0.31787109375,
-0.55810546875,
0.392333984375,
-0.36767578125,
0.4013671875,
-0.0791015625,
-0.78515625,
-0.4599609375,
0.4189453125,
0.418212890625,
-0.36767578125,
-0.587890625,
-0.466796875,
0.159423828125,
-0.359130859375]
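The same client can also be used to query chat models served in the workspace. The following is a minimal sketch; the model name databricks-meta-llama-3-3-70b-instruct is an assumption, so replace it with a chat endpoint that is actually available in your workspace.
# The endpoint name below is hypothetical: use any chat model served in your workspace
completion = open_ai_client.chat.completions.create(
    model="databricks-meta-llama-3-3-70b-instruct",
    messages=[{"role": "user", "content": "Say hello in one word."}]
)
completion.choices[0].message.content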
Serving endpoint#
With Databricks, you can launch a serving endpoint for a registered model. You can do this through the Databricks UI, but here we show how to do it with the Python SDK.
The following cell logs a simple function as an ML model in MLflow and registers it.
import mlflow

mlflow.set_tracking_uri("databricks")
mlflow.set_registry_uri("databricks-uc")

# Create the experiment if it does not exist yet, otherwise reuse it
experiment_name = "/Users/fedor.kobak@innowise.com/serving_tests"
experiment = mlflow.get_experiment_by_name(experiment_name)
if experiment is None:
    experiment_id = mlflow.create_experiment(experiment_name)
else:
    experiment_id = experiment.experiment_id
mlflow.set_experiment(experiment_id=experiment_id)

# A trivial model: doubles each element of the input
@mlflow.pyfunc.utils.pyfunc
def model(model_input: list[float]) -> list[float]:
    return [x * 2 for x in model_input]

model_name = "workspace.knowledge.serving_example"

# Log the function as a pyfunc model and register it under model_name
with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(
        name="model",
        python_model=model,
        pip_requirements=[],
        registered_model_name=model_name
    )
Registered model 'workspace.knowledge.serving_example' already exists. Creating a new version of this model...
Created version '2' of model 'workspace.knowledge.serving_example'.
🏃 View run whimsical-smelt-614 at: https://dbc-6bc9e7c2-e867.cloud.databricks.com/ml/experiments/2555847948754149/runs/f2ecb60bae784c7f8fea0e9bf1c6c456
🧪 View experiment at: https://dbc-6bc9e7c2-e867.cloud.databricks.com/ml/experiments/2555847948754149
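Before deploying, it may be worth sanity-checking the registered model by loading it back through MLflow. A minimal sketch, assuming version 1 of the model registered above:
# Load version 1 of the registered model from Unity Catalog and invoke it locally
loaded = mlflow.pyfunc.load_model(f"models:/{model_name}/1")
loaded.predict([5.0, 10.0])  # the function doubles its inputs, so this should return [10.0, 20.0]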
The following cell defines the endpoint configuration and endpoint name.
from databricks.sdk.service.serving import EndpointCoreConfigInput
config = EndpointCoreConfigInput.from_dict({
"served_models": [
{
"model_name": model_name,
"model_version": 1,
"scale_to_zero_enabled": True,
"workload_size": "Small"
}
]
})
endpoint_name = "serving-example"
EndpointCoreConfigInput(auto_capture_config=None, name=None, served_entities=[], served_models=[ServedModelInput(scale_to_zero_enabled=True, model_name='workspace.knowledge.serving_example', model_version=1, environment_vars=None, instance_profile_arn=None, max_provisioned_concurrency=None, max_provisioned_throughput=None, min_provisioned_concurrency=None, min_provisioned_throughput=None, name=None, provisioned_model_units=None, workload_size='Small', workload_type=None)], traffic_config=None)
Use the WorkspaceClient.serving_endpoints.create_and_wait method to create the endpoint, as shown in the following cell.
Note: this cell may take some time to execute (~10 minutes).
w = WorkspaceClient()
w.serving_endpoints.create_and_wait(
name=endpoint_name,
config=config
)
ServingEndpointDetailed(ai_gateway=None, budget_policy_id=None, config=EndpointCoreConfigOutput(auto_capture_config=None, config_version=1, served_entities=[ServedEntityOutput(creation_timestamp=1759322561000, creator='fedor.kobak@innowise.com', entity_name='workspace.knowledge.serving_example', entity_version='1', environment_vars=None, external_model=None, foundation_model=None, instance_profile_arn=None, max_provisioned_concurrency=None, max_provisioned_throughput=None, min_provisioned_concurrency=None, min_provisioned_throughput=None, name='serving_example-1', provisioned_model_units=None, scale_to_zero_enabled=True, state=ServedModelState(deployment=<ServedModelStateDeployment.DEPLOYMENT_READY: 'DEPLOYMENT_READY'>, deployment_state_message=''), workload_size='Small', workload_type=<ServingModelWorkloadType.CPU: 'CPU'>)], served_models=[ServedModelOutput(creation_timestamp=1759322561000, creator='fedor.kobak@innowise.com', environment_vars=None, instance_profile_arn=None, max_provisioned_concurrency=None, min_provisioned_concurrency=None, model_name='workspace.knowledge.serving_example', model_version='1', name='serving_example-1', provisioned_model_units=None, scale_to_zero_enabled=True, state=ServedModelState(deployment=<ServedModelStateDeployment.DEPLOYMENT_READY: 'DEPLOYMENT_READY'>, deployment_state_message=''), workload_size='Small', workload_type=<ServingModelWorkloadType.CPU: 'CPU'>)], traffic_config=TrafficConfig(routes=[Route(traffic_percentage=100, served_entity_name='serving_example-1', served_model_name='serving_example-1')])), creation_timestamp=1759322561000, creator='fedor.kobak@innowise.com', data_plane_info=None, description='', email_notifications=None, endpoint_url=None, id='a4b755e5064c430dba1ef294b40e5010', last_updated_timestamp=1759322561000, name='serving-example', pending_config=None, permission_level=<ServingEndpointDetailedPermissionLevel.CAN_MANAGE: 'CAN_MANAGE'>, route_optimized=False, state=EndpointState(config_update=<EndpointStateConfigUpdate.NOT_UPDATING: 'NOT_UPDATING'>, ready=<EndpointStateReady.READY: 'READY'>), tags=[], task=None)
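Since deployment takes a while, it is convenient to check the endpoint's readiness at any time; a short sketch using the serving_endpoints.get method:
# Fetch the endpoint details and inspect its state
endpoint = w.serving_endpoints.get(name=endpoint_name)
endpoint.state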
After that, your endpoint is available over the internet. The following cell sends a curl request to it.
To run it, you must set the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables.
%%bash
curl -s \
    -u token:$DATABRICKS_TOKEN \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{"inputs": [5.0, 10.0]}' \
    $DATABRICKS_HOST/serving-endpoints/serving-example/invocations
{"predictions": [10.0, 20.0]}
The outputs, just as specified in the model function, are the inputs doubled.
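The same can be done from Python without curl: the SDK exposes a query method on serving endpoints. A minimal sketch, which also deletes the endpoint afterwards so it does not keep consuming resources:
# Query the endpoint through the SDK rather than raw HTTP
response = w.serving_endpoints.query(name=endpoint_name, inputs=[5.0, 10.0])
print(response.predictions)

# Remove the endpoint once it is no longer needed
w.serving_endpoints.delete(name=endpoint_name)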