Databricks Model Serving: Real-Time ML Inference (2026)

Training a machine learning model creates no value if production applications cannot use it. Databricks Model Serving is a serverless platform that exposes custom models logged with MLflow, Foundation Models like DBRX and Llama, and external models like OpenAI and Anthropic through a unified REST API endpoint. It requires no infrastructure management and supports autoscaling, GPUs, and scale-to-zero, while also covering inference result logging and A/B testing.

Model Serving Overview

Model Serving is a fully managed real-time inference platform from Databricks. When you deploy a model, Databricks automatically handles containerization, load balancing, and autoscaling, exposing the model as an HTTPS REST API that external applications can call.

There are four key characteristics.

Serverless: No VM, container, or Kubernetes management required. Databricks handles operations after deployment
REST API: Inference can be invoked with standard HTTP requests, independent of language or framework
Autoscaling: Instance count is automatically adjusted based on request volume, and can scale down to 0 when there is no traffic
GPU support: GPU-optimized instances can be selected for high-speed inference of LLMs and deep learning models
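
As a rough illustration of the REST API point above, the sketch below builds a scoring request using only the Python standard library. The workspace URL, token, and feature columns are placeholders invented for this example. The request is constructed but not sent, so the snippet runs without a workspace.

```python
import json
import urllib.request


def build_invocation_request(host, endpoint_name, token, records):
    """Build (but do not send) a scoring request for a serving endpoint."""
    url = f"{host.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    # One JSON object per input row, sent under the "dataframe_records" key
    body = json.dumps({"dataframe_records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = build_invocation_request(
    host="https://example.cloud.databricks.com",  # placeholder workspace URL
    endpoint_name="fraud-detection-endpoint",
    token="<personal-access-token>",  # placeholder credential
    records=[{"amount": 120.5, "merchant_id": 42}],  # hypothetical columns
)
# To actually call the endpoint: urllib.request.urlopen(req)
```

Because the call is plain HTTPS with a JSON body, the same request can be issued from any language with an HTTP client.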

Three Endpoint Types

Model Serving offers three endpoint types you choose between based on use case. Exams test your ability to tell them apart and pick the right one for each scenario.

Type	Target Models	Hosting Location	Typical Use Case
Custom Models	Any model logged with MLflow (scikit-learn, PyTorch, Transformers, etc.)	Inside Databricks	Demand forecasting, fraud detection, and recommendation models trained on your own data
Foundation Model APIs	OSS models such as DBRX, Llama 3, Mixtral, BGE	Inside Databricks (pay-per-token)	Text generation, summarization, embedding generation (no data sent externally)
External Models	OpenAI GPT-4, Anthropic Claude, Cohere, etc.	External providers (via proxy)	Unified governance, rate limiting, and log aggregation for external LLMs

Custom Models are used to deploy your own models. You create an endpoint by specifying a version of a model registered in the MLflow Model Registry. Foundation Model APIs let you use OSS models pre-hosted by Databricks on a per-token pay-as-you-go basis, with no need to provision your own GPU cluster. External Models act as a proxy for external APIs, used to apply Unity Catalog access control and payload logging uniformly to external model calls.

How to Create an Endpoint

Endpoints can be created in three ways: the UI, the REST API, and the Python SDK. For production use, the API or SDK is the common choice because it slots into CI/CD pipelines.

Creating via the UI

From the Databricks workspace left menu, go to "Serving" → "Create serving endpoint" and specify the model name, version, instance size, and scale settings. This is suitable for prototyping and configuration checks.

Deploying an MLflow Model with the Python SDK

Here is an example of exposing a model logged with MLflow as an endpoint using the Python SDK.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
)

w = WorkspaceClient()

w.serving_endpoints.create_and_wait(
    name="fraud-detection-endpoint",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="catalog.schema.fraud_detector",
                entity_version="3",
                workload_size="Small",
                scale_to_zero_enabled=True,
            )
        ]
    ),
)

For entity_name, specify the Unity Catalog registered model path (e.g. catalog.schema.model_name) or the model name in the MLflow Model Registry. workload_size is chosen from Small / Medium / Large, and scale_to_zero_enabled=True reduces cost while there is no traffic.

Creating via the REST API

POST /api/2.0/serving-endpoints

{
  "name": "fraud-detection-endpoint",
  "config": {
    "served_entities": [
      {
        "entity_name": "catalog.schema.fraud_detector",
        "entity_version": "3",
        "workload_size": "Small",
        "scale_to_zero_enabled": true
      }
    ]
  }
}
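
For CI/CD scripts that cannot install the SDK, the same call can be sketched with the Python standard library. This is a minimal sketch: the host and token are placeholders, and the helper name build_create_endpoint_request is made up for this example. The request is built but not sent.

```python
import json
import urllib.request


def build_create_endpoint_request(host, token, name, entity_name,
                                  entity_version, workload_size="Small",
                                  scale_to_zero=True):
    """Build (but do not send) the POST request that creates an endpoint."""
    payload = {
        "name": name,
        "config": {
            "served_entities": [
                {
                    "entity_name": entity_name,
                    "entity_version": entity_version,
                    "workload_size": workload_size,
                    "scale_to_zero_enabled": scale_to_zero,
                }
            ]
        },
    }
    return urllib.request.Request(
        f"{host.rstrip('/')}/api/2.0/serving-endpoints",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


create_req = build_create_endpoint_request(
    host="https://example.cloud.databricks.com/",  # placeholder workspace URL
    token="<personal-access-token>",  # placeholder credential
    name="fraud-detection-endpoint",
    entity_name="catalog.schema.fraud_detector",
    entity_version="3",
)
# To actually create the endpoint: urllib.request.urlopen(create_req)
```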

Integration with Feature Serving

One of Model Serving's biggest strengths is its integration with Feature Serving. At inference time, the client only sends a primary key (such as a customer ID), and the endpoint automatically fetches the latest features from a Feature Table (a Delta table managed by Unity Catalog) and feeds them into the model.

This removes the need to reimplement feature computation on the client side and structurally prevents feature mismatches between training and inference (Training-Serving Skew).

# Example configuration for Feature Serving integration
from databricks.sdk.service.serving import ServedEntityInput

ServedEntityInput(
    entity_name="catalog.schema.fraud_detector",
    entity_version="3",
    workload_size="Small",
    scale_to_zero_enabled=True,
    # No feature settings are needed on the endpoint itself.
    # The feature lookup definition (Feature Spec) is packaged with the model
    # when it is logged, and the endpoint reads it to fetch features.
)

To use Feature Serving, when you log the model to MLflow, use the Feature Engineering API's log_model method to define the linkage (Feature Spec) with the feature tables. The endpoint automatically reads this Spec and looks up the Feature Table at inference time.
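
The lookup flow can be pictured with a toy simulation. This is a conceptual sketch in plain Python, not Databricks code: the in-memory dictionary stands in for the Feature Table, and the column names and scoring rule are invented for illustration.

```python
# Conceptual simulation only: plain Python stand-ins, not Databricks APIs.

# Stand-in for a feature table keyed by user_id (hypothetical columns).
FEATURE_TABLE = {
    101: {"txn_count_7d": 14, "avg_amount_7d": 52.0},
    102: {"txn_count_7d": 2, "avg_amount_7d": 910.0},
}


def toy_model(features):
    """Hypothetical scoring rule standing in for the trained model."""
    return 1 if features["avg_amount_7d"] > 500 else 0


def serve(request_row):
    """Mimic the endpoint: look up features by primary key, then score."""
    features = FEATURE_TABLE[request_row["user_id"]]
    return toy_model(features)


# The client sends only the primary key; it never computes features itself.
predictions = [serve({"user_id": 101}), serve({"user_id": 102})]
```

Because the server-side lookup is the only path features take to the model, the client has no opportunity to compute them differently from training.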

A/B Testing and Traffic Splitting

With Model Serving, you can register multiple model versions on a single endpoint as served_entities and split traffic between them by percentage. This is used for A/B testing performance between old and new models, and for gradual model rollouts (canary deployments).

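from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    Route,
    ServedEntityInput,
    TrafficConfig,
)

w = WorkspaceClient()

w.serving_endpoints.create_and_wait(
    name="recommendation-endpoint",
    config=EndpointCoreConfigInput(
        served_entities=[
            # Current version, receives most of the traffic
            ServedEntityInput(
                name="model-v2",
                entity_name="catalog.schema.recommender",
                entity_version="2",
                workload_size="Small",
                scale_to_zero_enabled=False,
            ),
            # Candidate version under evaluation
            ServedEntityInput(
                name="model-v3",
                entity_name="catalog.schema.recommender",
                entity_version="3",
                workload_size="Small",
                scale_to_zero_enabled=False,
            ),
        ],
        # Each route refers to a served entity by its name
        traffic_config=TrafficConfig(
            routes=[
                Route(served_model_name="model-v2", traffic_percentage=80),
                Route(served_model_name="model-v3", traffic_percentage=20),
            ]
        ),
    ),
)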

This example allocates 80% of traffic to v2 and 20% to v3. By analyzing the responses logged in the inference table, you can confirm that v3's accuracy is sufficient and then complete the rollout safely by shifting 100% of traffic to v3. The traffic_percentage values across all routes must always sum to 100.
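
A small pre-deployment check can catch a split that does not add up before the API rejects it. The sketch below is illustrative only: validate_routes and pick_route are hypothetical helpers, and the weighted random choice merely approximates how a percentage split behaves over many requests. It is not a description of the platform's actual router.

```python
import random


def validate_routes(routes):
    """Raise ValueError unless the traffic percentages sum to exactly 100."""
    total = sum(r["traffic_percentage"] for r in routes)
    if total != 100:
        raise ValueError(
            f"traffic_percentage values sum to {total}, expected 100"
        )
    return routes


def pick_route(routes, rng):
    """Pick a served model name with probability proportional to its share."""
    names = [r["served_model_name"] for r in routes]
    weights = [r["traffic_percentage"] for r in routes]
    return rng.choices(names, weights=weights, k=1)[0]


routes = validate_routes([
    {"served_model_name": "model-v2", "traffic_percentage": 80},
    {"served_model_name": "model-v3", "traffic_percentage": 20},
])

rng = random.Random(0)  # fixed seed so the simulation is repeatable
sample = [pick_route(routes, rng) for _ in range(10_000)]
share_v3 = sample.count("model-v3") / len(sample)  # close to 0.20
```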

Payload Logging and Inference Tables

Model Serving endpoints have an "Inference Table" feature that automatically logs every request and response to a Delta table. When you enable auto_capture_config at endpoint creation time, inference logs accumulate in the specified Unity Catalog schema.

{
  "auto_capture_config": {
    "catalog_name": "ml_prod",
    "schema_name": "inference_logs",
    "table_name_prefix": "fraud_detection",
    "enabled": true
  }
}

Inference tables record the request timestamp, input data, model output, latency, and status code. You can use this data for the following operational tasks.

Model monitoring: Integrate with Lakehouse Monitoring to detect drift in the prediction distribution
A/B test analysis: Compare prediction accuracy and latency across model versions
Debugging: Identify requests with anomalous inference results and inspect the input data
Compliance: Retain as an audit log of inference activity
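
To make the A/B analysis use case concrete, here is a minimal sketch that groups logged requests by model version and compares error rate and latency. It works on plain dictionaries, and the field names (model_version, status_code, execution_time_ms) are assumptions for illustration. Check the actual inference table schema in your workspace before adapting it.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_version(rows):
    """Compute request count, error rate, and mean latency per version."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["model_version"]].append(row)

    summary = {}
    for version, items in grouped.items():
        ok = [r for r in items if r["status_code"] == 200]
        summary[version] = {
            "requests": len(items),
            "error_rate": 1 - len(ok) / len(items),
            # Latency is averaged over successful requests only
            "mean_latency_ms": (
                mean(r["execution_time_ms"] for r in ok) if ok else None
            ),
        }
    return summary


# Hypothetical log rows standing in for inference table records
rows = [
    {"model_version": "2", "status_code": 200, "execution_time_ms": 40},
    {"model_version": "2", "status_code": 200, "execution_time_ms": 50},
    {"model_version": "2", "status_code": 200, "execution_time_ms": 60},
    {"model_version": "3", "status_code": 200, "execution_time_ms": 30},
    {"model_version": "3", "status_code": 500, "execution_time_ms": 10},
]
summary = summarize_by_version(rows)
```

In practice the same aggregation would more likely be written as a SQL or DataFrame query over the Delta table. The plain-Python version only shows the shape of the comparison.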

GPU Serving

For models that need GPU inference, such as LLMs and deep learning models, you can specify a GPU-optimized instance (GPU workload type). Setting workload_type to GPU_SMALL, GPU_MEDIUM, or GPU_LARGE allocates the corresponding GPU instance.

For large models provided by Foundation Model APIs, like Llama 3 and Mixtral, Databricks manages the GPU infrastructure, so users do not need to think about GPU instances. You only need to specify a GPU workload type when deploying your own LLMs or PyTorch models via Custom Models.
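
As a sketch of what a GPU-backed Custom Model entry might look like in the request payload, the helper below builds the served entity dictionary and rejects workload types other than the three named above. The helper name and the model path are hypothetical. The platform may accept further workload types that this check does not list.

```python
# GPU workload types named in the text above; others may exist.
GPU_WORKLOAD_TYPES = {"GPU_SMALL", "GPU_MEDIUM", "GPU_LARGE"}


def gpu_served_entity(entity_name, entity_version, workload_type,
                      workload_size="Small"):
    """Build a served entity payload that requests a GPU workload type."""
    if workload_type not in GPU_WORKLOAD_TYPES:
        raise ValueError(f"unsupported GPU workload type: {workload_type}")
    return {
        "entity_name": entity_name,
        "entity_version": entity_version,
        "workload_type": workload_type,
        "workload_size": workload_size,
        # Scale to zero left off here to avoid GPU cold starts
        "scale_to_zero_enabled": False,
    }


gpu_entity = gpu_served_entity(
    entity_name="catalog.schema.custom_llm",  # hypothetical model path
    entity_version="1",
    workload_type="GPU_MEDIUM",
)
```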

Pricing Model

Model Serving is billed as pay-as-you-go based on DBUs (Databricks Units). Each endpoint type has a different billing model.

Type	Billing Unit	Scale to zero	Cost Characteristics
Custom Models	Provisioned time x DBU	Supported	DBUs are charged the entire time instances are running. Zero charge during scale to zero
Foundation Model APIs	Input/output tokens x DBU	Always starts from zero	Pay only for what you use. No GPU infrastructure management required
External Models	On Databricks: request count x DBU + external API fees	No standing capacity since it is a proxy	A dual cost structure: Databricks charges plus external provider charges

The keys to cost optimization are setting scale_to_zero_enabled=True in development environments and, in production, keeping at least one instance running (scale_to_zero_enabled=False) to avoid cold-start latency. Foundation Model APIs are often the most cost-efficient choice during an LLM PoC because they bill per token and require no GPU cluster management.
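
The trade-off can be put into rough numbers. The arithmetic below is a simplified sketch: the DBU consumption and price per DBU are made-up figures, not Databricks list prices. It also ignores the idle period before an endpoint actually scales down.

```python
def monthly_cost(active_hours_per_day, dbu_per_hour, usd_per_dbu,
                 scale_to_zero, days=30):
    """Estimate monthly cost for a Custom Model endpoint.

    With scale to zero, only active hours are billed; without it,
    the endpoint is billed around the clock.
    """
    billed_hours_per_day = active_hours_per_day if scale_to_zero else 24
    return billed_hours_per_day * days * dbu_per_hour * usd_per_dbu


# Hypothetical rates chosen only to make the comparison concrete
DBU_PER_HOUR = 4.0
USD_PER_DBU = 0.07

dev_cost = monthly_cost(2, DBU_PER_HOUR, USD_PER_DBU, scale_to_zero=True)
prod_cost = monthly_cost(2, DBU_PER_HOUR, USD_PER_DBU, scale_to_zero=False)
```

With two active hours per day, the always-on endpoint is billed for twelve times as many hours as the scale-to-zero one. That gap is the price paid for avoiding cold starts.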

What the Exams Test

Model Serving appears on three certification exams: ML Associate, ML Professional, and GenAI Engineer. The depth of questions differs by exam level.

ML Associate: How to create endpoints, calling inference via the REST API, and the basic concept of Scale to zero
ML Professional: A/B test traffic split configuration, preventing Training-Serving Skew via Feature Serving integration, and model monitoring with inference tables
GenAI Engineer: Choosing between Foundation Model APIs and External Models, and using embedding endpoints in RAG patterns

On the ML Professional exam in particular, three topics come up frequently: the differences between using and not using Feature Serving, the procedure for verifying new model performance via A/B testing, and drift detection using inference table data. The differences between the three types (Custom Models / Foundation Model APIs / External Models) are asked across all exams.

Check with a Sample Question

ML Professional

Question 1

An ML engineer wants to deploy a fraud detection model trained with scikit-learn to production. At inference time, the user_id contained in the request must be used as a key to fetch the latest user behavior features from a Feature Table and feed them into the model. Which is the most appropriate configuration?

A. Create a Custom Model endpoint, define a Feature Spec when calling log_model with the Feature Engineering API, and register the model. The endpoint automatically looks up the Feature Table at inference time
B. Create a Foundation Model APIs endpoint and have it generate features by including the user_id in the prompt
C. Create a Custom Model endpoint, query the Feature Table directly from the client, and include all features in the request
D. Create an External Models endpoint and implement the feature retrieval logic in the OpenAI API

Correct answer: A

A Custom Model endpoint integrated with Feature Serving is correct. By defining a Feature Spec at model registration time, the endpoint only receives the primary key (user_id) at inference time and automatically fetches the latest features from the Feature Table. This prevents Training-Serving Skew and simplifies the client implementation. B is for LLMs and cannot be used to deploy a model you trained in-house. C requires reimplementing feature computation on the client side and carries a Skew risk. D is an external API proxy and cannot host your own models.

Frequently Asked Questions

What is Scale to zero in Model Serving?

Scale to zero scales compute resources down to 0 when an endpoint has no requests for a certain period. When a new request arrives, the endpoint cold-starts to come back up. This is effective for cost savings in development and staging environments, but in production it can introduce tens of seconds to several minutes of latency, so the common practice is to set the minimum instance count to 1 or higher.

How do I choose between Foundation Model APIs and External Models?

Foundation Model APIs use OSS models hosted on Databricks (DBRX, Llama, Mixtral, and so on). Because data stays inside the Databricks environment, this option suits strict data governance and compliance requirements. External Models proxy APIs from external providers like OpenAI and Anthropic through Databricks endpoints, letting you apply Unity Catalog access control and payload logging uniformly across external model calls.

Which certification exams cover Model Serving?

The ML Associate exam covers the basics of Model Serving (creating endpoints, calling inference via the REST API). The ML Professional exam asks more practical design questions: A/B testing with traffic splitting, integration with Feature Serving, and choosing between Foundation Model APIs and other options. The GenAI Engineer exam covers RAG pattern implementations that use Foundation Model APIs and External Models.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
