Serving

MLflow can serve registered models. This page looks at the relevant tools.

import mlflow
from multiprocessing import Process

# Store tracking data and the model registry in a temporary local directory
mlflow_path = "/tmp/mlflow_serving"

!rm -rf $mlflow_path
mlflow.set_tracking_uri("file://" + mlflow_path)
mlflow.set_registry_uri("file://" + mlflow_path)

CLI

The mlflow command-line interface can start an HTTP server that serves a specified model. The following table shows the main parameters of the mlflow models serve command.

| Option | Description |
|---|---|
| `-m, --model-uri <URI>` | Path or URI of the model to serve (local path, S3, GCS, DBFS, registry URI). |
| `-p, --port <PORT>` | Port to serve the model on (default: 5000). |
| `-h, --host <HOST>` | Host address to bind (default: 127.0.0.1). Use 0.0.0.0 to make the server accessible externally. |
| `--no-conda` | Prevents creation of a new conda environment; runs in the current environment. |
| `--env-manager` | Controls how the serving environment is created (default: conda). |
| `--enable-mlserver` | Use the MLServer backend instead of the default gunicorn/waitress server (for better scaling). |
| `--workers <N>` | Number of worker processes to handle requests (only for gunicorn on Unix). |
| `--install-mlflow` | Reinstalls MLflow in the serving environment (useful if it's missing). |
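For example, the following command (not executed in this notebook) would serve version 1 of a registered model named model on port 1234 in the current Python environment; the model name and port are only illustrative.

!mlflow models serve -m "models:/model/1" -p 1234 --env-manager local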

Docker

Use the mlflow models build-docker command to package a model as a Docker image. The following table shows the important arguments:

| Option | Description |
|---|---|
| `-m, --model-uri <URI>` | Path or URI of the model to include in the Docker image. |
| `-n, --name <IMAGE_NAME>` | Name of the resulting Docker image. |
| `-b, --build <flavor>` | Choose which model flavor to build (python_function, crate, etc.). |
| `--enable-mlserver` | Use MLServer as the serving backend instead of the default. |
| `--install-mlflow` | Ensures MLflow is installed in the image (sometimes required for compatibility). |
| `--env-manager` | Specifies how dependencies should be managed inside the image. |
| `--platform <PLATFORM>` | Target platform for multi-arch builds (e.g., linux/amd64, linux/arm64). |
| `--no-cache` | Do not use Docker's build cache. |
| `--build-arg KEY=VALUE` | Pass custom build arguments to docker build. |
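As an illustration (not run here), the commands below build an image for a registered model and run it locally. The image name and host port are assumptions for this example; MLflow-built images expose the scoring server on port 8080 inside the container by default.

!mlflow models build-docker -m "models:/model/1" -n mlflow-serving-example --env-manager local
!docker run -p 5001:8080 mlflow-serving-example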

Python

To serve a model from Python, use mlflow.models.flavor_backend_registry.get_flavor_backend, which returns a backend object that can start a server via its serve method.

Note. The serve call blocks the Python process it runs in, so to keep the notebook responsive, start the server in a child process.


The following cell registers a simple model in MLflow.

@mlflow.pyfunc.utils.pyfunc
def model(model_input: list[float]) -> list[float]:
    # Trivial model: double every input value
    return [x * 2 for x in model_input]

with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(
        name="model",
        python_model=model,
        registered_model_name="model",
        pip_requirements=[]
    )
Successfully registered model 'model'.
Created version '1' of model 'model'.
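As an optional sanity check before serving (a minimal sketch, assuming the type-hint-based pyfunc accepts a plain Python list), the registered version can be loaded back with mlflow.pyfunc.load_model and called directly:

loaded = mlflow.pyfunc.load_model("models:/model/1")
# Expected output for the doubling model: [2.0, 4.0, 6.0]
print(loaded.predict([1.0, 2.0, 3.0]))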

The next cell shows how to start the server in a separate Python process.

model_uri = "models:/model/1"

def run_model_serve():
    # Resolve the serving backend for the model's flavor (python_function here)
    backend = mlflow.models.flavor_backend_registry.get_flavor_backend(
        model_uri=model_uri,
        env_manager="local"
    )

    # Start the scoring server; this call blocks until the server is stopped
    backend.serve(
        model_uri=model_uri,
        port=1234,
        host="localhost",
        timeout=60,
        enable_mlserver=False
    )

process = Process(target=run_model_serve)
process.start()
2025/10/24 16:15:56 INFO mlflow.models.flavor_backend_registry: Selected backend for flavor 'python_function'
2025/10/24 16:15:56 INFO mlflow.pyfunc.backend: === Running command 'exec uvicorn --host localhost --port 1234 --workers 1 mlflow.pyfunc.scoring_server.app:app'
INFO:     Started server process [362307]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:1234 (Press CTRL+C to quit)
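Because the child process starts asynchronously, a request sent immediately after process.start() can fail with a connection error. The sketch below, which assumes the scoring server exposes the standard /ping health endpoint, polls it until the server is ready:

import time
import requests

def wait_until_ready(url: str = "http://127.0.0.1:1234/ping", timeout: float = 30.0) -> None:
    # Poll the health endpoint until it answers or the timeout expires
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url).status_code == 200:
                return
        except requests.exceptions.ConnectionError:
            pass
        time.sleep(0.5)
    raise TimeoutError("Scoring server did not become ready in time")

wait_until_ready()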

Invocation of the server:

import requests

url = "http://127.0.0.1:1234/invocations"
headers = {'Content-Type': 'application/json'}
data = {"inputs": [1.0, 2.0, 3.0]}

response = requests.post(url, headers=headers, json=data)

print(response.text)
INFO:     127.0.0.1:50100 - "POST /invocations HTTP/1.1" 200 OK
{"predictions": [2.0, 4.0, 6.0]}

As expected, the API returns the inputs multiplied by 2.

process.terminate()
process.join()
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [362307]