Serving

MLflow can serve registered models. This page looks at the relevant tools.

import mlflow
from multiprocessing import Process

# Store tracking data and the model registry in a temporary local directory
mlflow_path = "/tmp/mlflow_serving"

!rm -rf $mlflow_path
mlflow.set_tracking_uri("file://" + mlflow_path)
mlflow.set_registry_uri("file://" + mlflow_path)

CLI

The mlflow command-line interface can start an HTTP server that serves a specified model. The following table shows the main parameters of the mlflow models serve command.

| Option | Description |
|---|---|
| `-m, --model-uri <URI>` | Path or URI of the model to serve (local path, S3, GCS, DBFS, registry URI). |
| `-p, --port <PORT>` | Port to serve the model on (default: 5000). |
| `-h, --host <HOST>` | Host address to bind (default: 127.0.0.1). Use 0.0.0.0 to make the server accessible externally. |
| `--no-conda` | Prevents creation of a new conda environment; runs in the current environment. |
| `--env-manager` | Controls how the serving environment is created (default: conda). |
| `--enable-mlserver` | Use the MLServer backend instead of the default gunicorn/waitress server (for better scaling). |
| `--workers <N>` | Number of worker processes to handle requests (only for gunicorn on Unix). |
| `--install-mlflow` | Reinstalls MLflow in the serving environment (useful if it's missing). |
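For example, the following command (not executed in this notebook) would serve version 1 of a registered model named model on port 1234 in the current Python environment; the model name and port are only illustrative.

!mlflow models serve -m "models:/model/1" -p 1234 --env-manager local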

Docker

Use the mlflow models build-docker command to package a model as a Docker image. The following table shows the important arguments:

| Option | Description |
|---|---|
| `-m, --model-uri <URI>` | Path or URI of the model to include in the Docker image. |
| `-n, --name <IMAGE_NAME>` | Name of the resulting Docker image. |
| `-b, --build <flavor>` | Choose which model flavor to build (python_function, crate, etc.). |
| `--enable-mlserver` | Use MLServer as the serving backend instead of the default. |
| `--install-mlflow` | Ensures MLflow is installed in the image (sometimes required for compatibility). |
| `--env-manager` | Specifies how dependencies should be managed inside the image. |
| `--platform <PLATFORM>` | Target platform for multi-arch builds (e.g., linux/amd64, linux/arm64). |
| `--no-cache` | Do not use Docker's build cache. |
| `--build-arg KEY=VALUE` | Pass custom build arguments to docker build. |
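As an illustration (not run here), the commands below build an image for a registered model and run it locally. The image name and host port are assumptions for this example; MLflow-built images expose the scoring server on port 8080 inside the container by default.

!mlflow models build-docker -m "models:/model/1" -n mlflow-serving-example --env-manager local
!docker run -p 5001:8080 mlflow-serving-example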

Python

To serve a model from Python, use mlflow.models.flavor_backend_registry.get_flavor_backend, which returns a backend object that can start a server via its serve method.

Note. The serve call blocks the Python process it runs in, so to keep the notebook responsive, start the server in a child process.


The following cell registers a simple model in MLflow.

@mlflow.pyfunc.utils.pyfunc
def model(model_input: list[float]) -> list[float]:
    # Trivial model: double every input value
    return [x * 2 for x in model_input]

with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(
        name="model",
        python_model=model,
        registered_model_name="model",
        pip_requirements=[]
    )
Successfully registered model 'model'.
Created version '1' of model 'model'.
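As an optional sanity check before serving (a minimal sketch, assuming the type-hint-based pyfunc accepts a plain Python list), the registered version can be loaded back with mlflow.pyfunc.load_model and called directly:

loaded = mlflow.pyfunc.load_model("models:/model/1")
# Expected output for the doubling model: [2.0, 4.0, 6.0]
print(loaded.predict([1.0, 2.0, 3.0]))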

The next cell shows how to start the server in a separate Python process.

model_uri = "models:/model/1"

def run_model_serve():
    # Resolve the serving backend for the model's flavor (python_function here)
    backend = mlflow.models.flavor_backend_registry.get_flavor_backend(
        model_uri=model_uri,
        env_manager="local"
    )

    # Start the scoring server; this call blocks until the server is stopped
    backend.serve(
        model_uri=model_uri,
        port=1234,
        host="localhost",
        timeout=60,
        enable_mlserver=False
    )

process = Process(target=run_model_serve)
process.start()
2025/10/24 16:15:56 INFO mlflow.models.flavor_backend_registry: Selected backend for flavor 'python_function'
2025/10/24 16:15:56 INFO mlflow.pyfunc.backend: === Running command 'exec uvicorn --host localhost --port 1234 --workers 1 mlflow.pyfunc.scoring_server.app:app'
INFO:     Started server process [362307]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:1234 (Press CTRL+C to quit)
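Because the child process starts asynchronously, a request sent immediately after process.start() can fail with a connection error. The sketch below, which assumes the scoring server exposes the standard /ping health endpoint, polls it until the server is ready:

import time
import requests

def wait_until_ready(url: str = "http://127.0.0.1:1234/ping", timeout: float = 30.0) -> None:
    # Poll the health endpoint until it answers or the timeout expires
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url).status_code == 200:
                return
        except requests.exceptions.ConnectionError:
            pass
        time.sleep(0.5)
    raise TimeoutError("Scoring server did not become ready in time")

wait_until_ready()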

Invocation of the server:

import requests

url = "http://127.0.0.1:1234/invocations"
headers = {'Content-Type': 'application/json'}
data = {"inputs": [1.0, 2.0, 3.0]}

response = requests.post(url, headers=headers, json=data)

print(response.text)
INFO:     127.0.0.1:50100 - "POST /invocations HTTP/1.1" 200 OK
{"predictions": [2.0, 4.0, 6.0]}

As expected, the API returns the inputs multiplied by 2.

process.terminate()
process.join()
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [362307]