Training a machine learning model creates no value if production applications cannot use it. Databricks Model Serving is a serverless platform that exposes custom models logged with MLflow, Foundation Models like DBRX and Llama, and external models like OpenAI and Anthropic through a unified REST API endpoint. It requires no infrastructure management and supports autoscaling, GPUs, and scale-to-zero, while also covering inference result logging and A/B testing.
Model Serving is a fully managed real-time inference platform from Databricks. When you deploy a model, Databricks automatically handles containerization, load balancing, and autoscaling, exposing the model as an HTTPS REST API that external applications can call.
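As a rough illustration of what calling such an endpoint looks like from an external application, the sketch below builds (but does not send) a scoring request using only the Python standard library. The workspace URL, token, and input column names are hypothetical placeholders; the `/serving-endpoints/<name>/invocations` path and the `dataframe_records` payload key follow the documented scoring request format for custom models.

```python
import json
import urllib.request


def build_invocation_request(host, endpoint_name, token, records):
    """Build (but do not send) a scoring request for a serving endpoint."""
    url = f"{host.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    body = json.dumps({"dataframe_records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical workspace URL, token, and feature names, for illustration only.
req = build_invocation_request(
    host="https://example.cloud.databricks.com",
    endpoint_name="fraud-detection-endpoint",
    token="dapi-EXAMPLE",
    records=[{"amount": 120.5, "merchant_category": "electronics"}],
)
print(req.full_url)
# To actually call the endpoint: urllib.request.urlopen(req, timeout=30)
```

In a real application the token would come from a secret store rather than a string literal.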
There are four key characteristics.
Model Serving offers three endpoint types you choose between based on use case. Exams test your ability to tell them apart and pick the right one for each scenario.
| Type | Target Models | Hosting Location | Typical Use Case |
|---|---|---|---|
| Custom Models | Any model logged with MLflow (scikit-learn, PyTorch, Transformers, etc.) | Inside Databricks | Demand forecasting, fraud detection, and recommendation models trained on your own data |
| Foundation Model APIs | OSS models such as DBRX, Llama 3, Mixtral, BGE | Inside Databricks (pay-per-token) | Text generation, summarization, embedding generation (no data sent externally) |
| External Models | OpenAI GPT-4, Anthropic Claude, Cohere, etc. | External providers (via proxy) | Unified governance, rate limiting, and log aggregation for external LLMs |
Custom Models are used to deploy your own models. You create an endpoint by specifying a version of a model registered in the MLflow Model Registry. Foundation Model APIs let you use OSS models pre-hosted by Databricks on a per-token pay-as-you-go basis, with no need to provision your own GPU cluster. External Models act as a proxy for external APIs, used to apply Unity Catalog access control and payload logging uniformly to external model calls.
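For exam scenarios, the choice between the three types usually comes down to two questions: did you train the model yourself, and is the model hosted by an external provider? The helper below is only a study aid that encodes that rule of thumb; the function name and its parameters are made up for this sketch.

```python
def choose_endpoint_type(trained_in_house: bool, hosted_by_external_provider: bool) -> str:
    """Return the endpoint type that matches a simplified scenario."""
    if trained_in_house:
        # Your own MLflow-logged model must be hosted as a Custom Model.
        return "Custom Models"
    if hosted_by_external_provider:
        # OpenAI, Anthropic, Cohere, etc. are reached through the proxy.
        return "External Models"
    # OSS models pre-hosted by Databricks, billed per token.
    return "Foundation Model APIs"


print(choose_endpoint_type(trained_in_house=True, hosted_by_external_provider=False))
print(choose_endpoint_type(trained_in_house=False, hosted_by_external_provider=True))
print(choose_endpoint_type(trained_in_house=False, hosted_by_external_provider=False))
```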
Endpoints can be created in three ways: the UI, the REST API, and the Python SDK. For production use, the API or SDK is the common choice because it slots into CI/CD pipelines.
From the Databricks workspace left menu, go to "Serving" → "Create serving endpoint" and specify the model name, version, instance size, and scale settings. This is suitable for prototyping and configuration checks.
Here is an example of exposing a model logged with MLflow as an endpoint using the Python SDK.

    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.serving import (
        EndpointCoreConfigInput,
        ServedEntityInput,
    )

    # Credentials are picked up from the environment or a configured profile.
    w = WorkspaceClient()

    # Create the endpoint and block until it is ready to serve traffic.
    w.serving_endpoints.create_and_wait(
        name="fraud-detection-endpoint",
        config=EndpointCoreConfigInput(
            served_entities=[
                ServedEntityInput(
                    # Unity Catalog path of the registered model
                    entity_name="catalog.schema.fraud_detector",
                    entity_version="3",
                    workload_size="Small",
                    scale_to_zero_enabled=True,
                )
            ]
        ),
    )

For entity_name, specify the Unity Catalog registered model path (e.g. catalog.schema.model_name) or the model name in the MLflow Model Registry. workload_size is one of Small, Medium, or Large, and scale_to_zero_enabled=True reduces cost while there is no traffic.
The equivalent request through the REST API:

    POST /api/2.0/serving-endpoints
    {
      "name": "fraud-detection-endpoint",
      "config": {
        "served_entities": [
          {
            "entity_name": "catalog.schema.fraud_detector",
            "entity_version": "3",
            "workload_size": "Small",
            "scale_to_zero_enabled": true
          }
        ]
      }
    }

One of Model Serving's biggest strengths is its integration with Feature Serving. At inference time, the client only sends a primary key (such as a customer ID), and the endpoint automatically fetches the latest features from a Feature Table (a Delta table managed by Unity Catalog) and feeds them into the model.
This removes the need to reimplement feature computation on the client side and structurally prevents feature mismatches between training and inference (Training-Serving Skew).
Example configuration for Feature Serving integration:

    from databricks.sdk.service.serving import ServedEntityInput

    served_entity = ServedEntityInput(
        entity_name="catalog.schema.fraud_detector",
        entity_version="3",
        workload_size="Small",
        scale_to_zero_enabled=True,
        # No feature settings are needed here. The Feature Spec is associated
        # with the model when it is registered, and features are fetched
        # automatically from the Feature Table that the Spec defines.
    )

To use Feature Serving, log the model to MLflow with the Feature Engineering API's log_model method, which records the linkage (Feature Spec) between the model and its feature tables. The endpoint reads this Spec automatically and looks up the Feature Table at inference time.
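The lookup behavior described above can be pictured with a toy, in-memory stand-in. This is not the Databricks implementation: `FEATURE_TABLE`, `toy_model`, and the feature names are all hypothetical, and a plain dictionary plays the role of the Feature Table. The point is only that the client sends the primary key and the serving side joins in the features before scoring.

```python
# Toy stand-in for a feature table, keyed by the primary key (hypothetical data).
FEATURE_TABLE = {
    "u-001": {"txn_count_7d": 14, "avg_amount_30d": 52.3},
    "u-002": {"txn_count_7d": 2, "avg_amount_30d": 410.0},
}


def toy_model(features):
    """Hypothetical scoring rule standing in for the real model."""
    suspicious = features["avg_amount_30d"] > 300 and features["txn_count_7d"] < 5
    return 1 if suspicious else 0


def score(request_record):
    """Look up features by primary key, then score them."""
    features = FEATURE_TABLE[request_record["user_id"]]
    return toy_model(features)


# The client sends only the key; it never computes features itself.
print(score({"user_id": "u-001"}))
print(score({"user_id": "u-002"}))
```

Because the same table feeds both training and serving, the client cannot drift out of sync with the feature definitions.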
With Model Serving, you can register multiple model versions on a single endpoint as served_entities and split traffic between them by percentage. This is used for A/B testing performance between old and new models, and for gradual model rollouts (canary deployments).

    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.serving import (
        EndpointCoreConfigInput,
        Route,
        ServedEntityInput,
        TrafficConfig,
    )

    w = WorkspaceClient()

    w.serving_endpoints.create_and_wait(
        name="recommendation-endpoint",
        config=EndpointCoreConfigInput(
            served_entities=[
                ServedEntityInput(
                    name="model-v2",
                    entity_name="catalog.schema.recommender",
                    entity_version="2",
                    workload_size="Small",
                    scale_to_zero_enabled=False,
                ),
                ServedEntityInput(
                    name="model-v3",
                    entity_name="catalog.schema.recommender",
                    entity_version="3",
                    workload_size="Small",
                    scale_to_zero_enabled=False,
                ),
            ],
            # Route names must match the served entity names above.
            traffic_config=TrafficConfig(
                routes=[
                    Route(served_model_name="model-v2", traffic_percentage=80),
                    Route(served_model_name="model-v3", traffic_percentage=20),
                ]
            ),
        ),
    )

This example allocates 80% of traffic to v2 and 20% to v3. Once the responses logged in the inference table confirm that v3 performs well enough, you can complete the rollout safely by shifting 100% of traffic to v3. The traffic_percentage values must always sum to 100.
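The sum-to-100 rule is easy to check before submitting a configuration. The sketch below validates a list of routes and then simulates how requests would be distributed under those weights. It is a local illustration only: `validate_routes` and `simulate_split` are hypothetical helpers, and the weighted random draw is an assumption about how a percentage split behaves, not a description of the platform's actual router.

```python
import random
from collections import Counter


def validate_routes(routes):
    """Raise if the traffic percentages do not sum to exactly 100."""
    total = sum(r["traffic_percentage"] for r in routes)
    if total != 100:
        raise ValueError(f"traffic_percentage values sum to {total}, expected 100")
    return routes


def simulate_split(routes, n_requests, seed=0):
    """Assign each simulated request to a served model by weighted random draw."""
    rng = random.Random(seed)
    names = [r["served_model_name"] for r in routes]
    weights = [r["traffic_percentage"] for r in routes]
    return Counter(rng.choices(names, weights=weights, k=n_requests))


routes = validate_routes([
    {"served_model_name": "model-v2", "traffic_percentage": 80},
    {"served_model_name": "model-v3", "traffic_percentage": 20},
])
counts = simulate_split(routes, n_requests=10_000)
print(counts)
```

With only 20% of traffic, the new version collects evidence more slowly, which is worth keeping in mind when deciding how long to run a canary before shifting more traffic.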
Model Serving endpoints have an "Inference Table" feature that automatically logs every request and response to a Delta table. When you enable auto_capture_config at endpoint creation time, inference logs accumulate in the specified Unity Catalog schema.

    {
      "auto_capture_config": {
        "catalog_name": "ml_prod",
        "schema_name": "inference_logs",
        "table_name_prefix": "fraud_detection",
        "enabled": true
      }
    }

Inference tables record the request timestamp, input data, model output, latency, and status code. You can use this data for the following operational tasks.
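One such task, comparing error rate and tail latency per served model, can be sketched on a handful of in-memory rows. The rows and field names below (`served_model`, `status_code`, `latency_ms`) are hypothetical and simplified; a real inference table has its own schema and would normally be queried with SQL or Spark rather than plain Python.

```python
import math
from collections import defaultdict

# Hypothetical rows shaped loosely like inference-table records.
rows = [
    {"served_model": "model-v2", "status_code": 200, "latency_ms": 40},
    {"served_model": "model-v2", "status_code": 200, "latency_ms": 55},
    {"served_model": "model-v2", "status_code": 500, "latency_ms": 900},
    {"served_model": "model-v3", "status_code": 200, "latency_ms": 35},
    {"served_model": "model-v3", "status_code": 200, "latency_ms": 45},
]


def summarize(rows):
    """Compute request count, error rate, and p95 latency per served model."""
    by_model = defaultdict(list)
    for row in rows:
        by_model[row["served_model"]].append(row)

    summary = {}
    for model, items in by_model.items():
        errors = sum(1 for r in items if r["status_code"] >= 400)
        latencies = sorted(r["latency_ms"] for r in items)
        # Nearest-rank 95th percentile.
        p95 = latencies[math.ceil(0.95 * len(latencies)) - 1]
        summary[model] = {
            "requests": len(items),
            "error_rate": errors / len(items),
            "p95_latency_ms": p95,
        }
    return summary


summary = summarize(rows)
for model, stats in summary.items():
    print(model, stats)
```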
For models that need GPU inference, such as LLMs and deep learning models, you can specify a GPU-optimized instance (GPU workload type). Setting workload_type to GPU_SMALL, GPU_MEDIUM, or GPU_LARGE allocates the corresponding GPU instance.
For large models provided by Foundation Model APIs, like Llama 3 and Mixtral, Databricks manages the GPU infrastructure, so users do not need to think about GPU instances. You only need to specify a GPU workload type when deploying your own LLMs or PyTorch models via Custom Models.
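For a custom model that does need a GPU, the served entity carries a workload_type alongside the usual fields. The sketch below builds such a served-entity payload as a plain dictionary and rejects values other than the three GPU types named above. The helper name and the model path `catalog.schema.llm_summarizer` are hypothetical.

```python
import json

# The GPU workload types named in the text above.
GPU_WORKLOAD_TYPES = ("GPU_SMALL", "GPU_MEDIUM", "GPU_LARGE")


def gpu_served_entity(entity_name, entity_version, workload_type, workload_size="Small"):
    """Build a served-entity payload for a GPU-backed custom model."""
    if workload_type not in GPU_WORKLOAD_TYPES:
        raise ValueError(f"workload_type must be one of {GPU_WORKLOAD_TYPES}")
    return {
        "entity_name": entity_name,
        "entity_version": entity_version,
        "workload_type": workload_type,
        "workload_size": workload_size,
        "scale_to_zero_enabled": False,
    }


# Hypothetical model path, for illustration only.
entity = gpu_served_entity("catalog.schema.llm_summarizer", "1", "GPU_MEDIUM")
print(json.dumps(entity, indent=2))
```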
Model Serving is billed as pay-as-you-go based on DBUs (Databricks Units). Each endpoint type has a different billing model.
| Type | Billing Unit | Scale to zero | Cost Characteristics |
|---|---|---|---|
| Custom Models | Provisioned time x DBU | Supported | DBUs are charged the entire time instances are running. Zero charge during scale to zero |
| Foundation Model APIs | Input/output tokens x DBU | Always starts from zero | Pay only for what you use. No GPU infrastructure management required |
| External Models | On Databricks: request count x DBU + external API fees | No standing capacity since it is a proxy | A dual cost structure: Databricks charges plus external provider charges |
The keys to cost optimization are setting scale_to_zero_enabled=True in development environments, and in production setting the minimum instance count to 1 or higher to avoid cold-start latency. Foundation Model APIs are the most cost-efficient choice during LLM PoC because they require no GPU cluster management.
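A back-of-the-envelope calculation shows why scale to zero matters for lightly used endpoints. All rates below are hypothetical placeholders, not published prices, and the per-million-token formula is an assumed simplification of token-based billing; substitute the figures from your own pricing page.

```python
def provisioned_cost(hours_running, dbu_per_hour, price_per_dbu):
    """Custom Models: DBUs accrue for every hour instances are running."""
    return hours_running * dbu_per_hour * price_per_dbu


def token_cost(total_tokens, dbu_per_million_tokens, price_per_dbu):
    """Foundation Model APIs: DBUs accrue per token processed (assumed formula)."""
    return total_tokens / 1_000_000 * dbu_per_million_tokens * price_per_dbu


# Hypothetical rates, for illustration only.
PRICE_PER_DBU = 0.07
DBU_PER_HOUR = 4.0

# A dev endpoint that is busy about 2 hours a day over a 30-day month.
always_on = provisioned_cost(24 * 30, DBU_PER_HOUR, PRICE_PER_DBU)
scale_to_zero = provisioned_cost(2 * 30, DBU_PER_HOUR, PRICE_PER_DBU)
print(f"always on: {always_on:.2f}, scale to zero: {scale_to_zero:.2f}")
```

Under these assumed numbers the always-on endpoint costs twelve times as much, which is the ratio of running hours (720 versus 60) and does not depend on the specific rates chosen.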
Model Serving appears on three certification exams: ML Associate, ML Professional, and GenAI Engineer. The depth of questions differs by exam level.
On the ML Professional exam in particular, three topics come up frequently: the differences between using and not using Feature Serving, the procedure for verifying new model performance via A/B testing, and drift detection using inference table data. The differences between the three types (Custom Models / Foundation Model APIs / External Models) are asked across all exams.
ML Professional
Question 1
An ML engineer wants to deploy a fraud detection model trained with scikit-learn to production. At inference time, the user_id contained in the request must be used as a key to fetch the latest user behavior features from a Feature Table and feed them into the model. Which is the most appropriate configuration?
Correct answer: A
A Custom Model endpoint integrated with Feature Serving is correct. By defining a Feature Spec at model registration time, the endpoint only receives the primary key (user_id) at inference time and automatically fetches the latest features from the Feature Table. This prevents Training-Serving Skew and simplifies the client implementation. B is for LLMs and cannot be used to deploy a model you trained in-house. C requires reimplementing feature computation on the client side and carries a Skew risk. D is an external API proxy and cannot host your own models.
What is Scale to zero in Model Serving?
Scale to zero scales compute resources down to 0 when an endpoint has no requests for a certain period. When a new request arrives, the endpoint cold-starts to come back up. This is effective for cost savings in development and staging environments, but in production it can introduce tens of seconds to several minutes of latency, so the common practice is to set the minimum instance count to 1 or higher.
How do I choose between Foundation Model APIs and External Models?
Foundation Model APIs use OSS models hosted on Databricks (DBRX, Llama, Mixtral, and so on). Because data stays inside the Databricks environment, this option suits strict data governance and compliance requirements. External Models proxy APIs from external providers like OpenAI and Anthropic through Databricks endpoints, letting you apply Unity Catalog access control and payload logging uniformly across external model calls.
Which certification exams cover Model Serving?
The ML Associate exam covers the basics of Model Serving (creating endpoints, calling inference via the REST API). The ML Professional exam asks more practical design questions: A/B testing with traffic splitting, integration with Feature Serving, and choosing between Foundation Model APIs and other options. The GenAI Engineer exam covers RAG pattern implementations that use Foundation Model APIs and External Models.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.