Data quality in production tables degrades over time. Even when the schema stays the same, NULL rates rise, value distributions skew, and upstream system changes silently corrupt data. Lakehouse Monitoring is Databricks' native feature that periodically computes statistical profiles on Unity Catalog-registered Delta tables and automatically detects quality changes and drift.
This article walks through how to choose between the 3 profile types, concrete SQL/API for creating monitors, the structure of the metrics tables, auto-generated dashboards, alerting setup, and the points tested on the Databricks certification exams (MLA/MLP).
When you create a monitor on a Unity Catalog managed or external table, Lakehouse Monitoring computes statistics on the schedule you define and writes the results to two metrics tables (profile_metrics and drift_metrics) as Delta tables. The computation runs on Serverless Compute, so you do not need to provision a cluster yourself.
You pick one "profile type" when creating the monitor. The profile type determines which statistics are computed and how drift is detected. Choose the type that matches your table's nature: static master data, time-series data, or inference logs.
The three profile types compare as follows.
| Profile Type | Target Table | What's Computed | Drift Detection Basis |
|---|---|---|---|
| Snapshot | Static master and dimension tables | Table-wide statistics at run time (mean, variance, NULL rate, cardinality, etc.) | Diff vs. the previous snapshot |
| TimeSeries | Fact tables and log data with a timestamp column | Statistics per time window, tracking change across windows | KL divergence and Jensen-Shannon distance between adjacent time windows |
| InferenceLog | ML inference result tables (predictions plus input features) | Prediction distribution, input feature distribution, and per-model-version statistics | Comparison vs. a baseline table plus comparison across time windows |
Snapshot is the simplest: it periodically records the "current state" of the table. TimeSeries takes a timestamp_col and computes statistics over aggregation windows defined by granularities (daily, weekly, etc.). InferenceLog is an extension of TimeSeries that adds prediction_col, model_id_col, and (optionally) label_col, automatically computing ML-specific metrics like prediction drift, input drift, and cross-model comparisons.
You create monitors via SQL (CREATE MONITOR) or the Databricks SDK (Python API). The following SQL example creates a monitor with the TimeSeries profile.
```sql
-- Create a monitor with the TimeSeries profile
CREATE MONITOR catalog.schema.sales_daily
USING TIMESERIES (
  TIMESTAMP_COL order_date,
  GRANULARITIES ('1 day', '1 week')
)
WITH (
  LINKED_ENTITIES (catalog.schema.sales_daily),
  SCHEDULE CRON '0 8 * * *',
  BASELINE_TABLE catalog.schema.sales_daily_baseline
);
```

The equivalent operation in the Python API looks like this.
```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import (
    MonitorCronSchedule,
    MonitorTimeSeries,
)

# Picks up credentials from the environment or the notebook context
w = WorkspaceClient()

w.quality_monitors.create(
    table_name="catalog.schema.sales_daily",
    assets_dir="/Shared/monitors/sales_daily",  # where the generated dashboard is stored
    output_schema_name="catalog.schema",  # where the metrics tables are written
    time_series=MonitorTimeSeries(
        timestamp_col="order_date",
        granularities=["1 day", "1 week"],
    ),
    schedule=MonitorCronSchedule(
        quartz_cron_expression="0 0 8 * * ?",  # Quartz syntax: every day at 08:00
        timezone_id="Asia/Tokyo",
    ),
    baseline_table_name="catalog.schema.sales_daily_baseline",
)
```

For InferenceLog, replace time_series with the inference_log parameter (MonitorInferenceLog) and set prediction_col, model_id_col, and problem_type (classification or regression) in addition to timestamp_col and granularities. For Snapshot, pass snapshot=MonitorSnapshot() in place of time_series or inference_log.
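As a rough illustration of the InferenceLog variant, the sketch below assembles the settings as a plain dictionary so it runs without a workspace. The field names mirror the SDK parameters mentioned above. The helper `build_inference_log_config`, the table `catalog.schema.churn_inference`, and the column names are hypothetical, and the `PROBLEM_TYPE_*` strings are assumed to match the enum names used by the monitor API.

```python
from typing import List, Optional

# Assumed mapping from a friendly name to the API's problem-type enum string
VALID_PROBLEM_TYPES = {
    "classification": "PROBLEM_TYPE_CLASSIFICATION",
    "regression": "PROBLEM_TYPE_REGRESSION",
}


def build_inference_log_config(
    timestamp_col: str,
    prediction_col: str,
    model_id_col: str,
    problem_type: str,
    granularities: List[str],
    label_col: Optional[str] = None,
) -> dict:
    """Assemble the inference_log section of a monitor definition (hypothetical helper)."""
    if problem_type not in VALID_PROBLEM_TYPES:
        raise ValueError(f"problem_type must be one of {sorted(VALID_PROBLEM_TYPES)}")
    if not granularities:
        raise ValueError("at least one granularity is required")
    config = {
        "timestamp_col": timestamp_col,
        "granularities": list(granularities),
        "prediction_col": prediction_col,
        "model_id_col": model_id_col,
        "problem_type": VALID_PROBLEM_TYPES[problem_type],
    }
    # label_col is optional: include it only when ground-truth labels are logged
    if label_col is not None:
        config["label_col"] = label_col
    return config


# Hypothetical inference table and column names
monitor_settings = {
    "table_name": "catalog.schema.churn_inference",
    "assets_dir": "/Shared/monitors/churn_inference",
    "output_schema_name": "catalog.schema",
    "inference_log": build_inference_log_config(
        timestamp_col="scored_at",
        prediction_col="prediction",
        model_id_col="model_version",
        problem_type="classification",
        granularities=["1 day"],
        label_col="churned",
    ),
}
print(monitor_settings["inference_log"])
```

Keeping the settings as data makes it easy to review them before passing the same values to `w.quality_monitors.create(...)`.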
When a monitor runs, two Delta tables are automatically created in the schema you specified as output_schema_name.
| Table | Contents | Key Columns |
|---|---|---|
| {table_name}_profile_metrics | Per-column statistics (mean, standard deviation, NULL rate, min/max, histograms, etc.) | column_name, data_type, mean, stddev, null_count, distinct_count, quantiles, window |
| {table_name}_drift_metrics | Statistical differences between windows or against the baseline | column_name, drift_type, chi_square_statistic, ks_statistic, js_distance, wasserstein_distance, window |
Each monitor run appends rows to profile_metrics, letting you track statistic trends over time. drift_metrics records statistical test results: chi-square for categorical variables, and KS test, JS distance, and Wasserstein distance for numerical variables. Because these are regular Delta tables, you can query them directly from Databricks SQL or notebooks to build custom analyses and alerts.
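To give a feel for what a value in the js_distance column represents, here is a minimal pure-Python sketch of the Jensen-Shannon distance between two histograms. It assumes base-2 logarithms, which bound the distance to the range 0 to 1. The binning and normalization that Lakehouse Monitoring applies internally may differ, and the sample counts are made up for illustration.

```python
import math
from typing import Sequence


def js_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon distance (base-2 logs) between two histograms over the same bins."""
    if len(p) != len(q):
        raise ValueError("both histograms must use the same bins")

    def normalize(counts: Sequence[float]) -> list:
        total = sum(counts)
        if total <= 0:
            raise ValueError("histogram must have a positive total")
        return [c / total for c in counts]

    p_norm, q_norm = normalize(p), normalize(q)
    # Mixture distribution: the midpoint of the two inputs
    mix = [(a + b) / 2 for a, b in zip(p_norm, q_norm)]

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Bins where a is zero contribute nothing; b is never zero where a is positive
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p_norm, mix) + 0.5 * kl(q_norm, mix)
    # The distance is the square root of the divergence
    return math.sqrt(max(divergence, 0.0))


# Made-up category counts for a baseline window and a current window
baseline_counts = [50, 30, 20]
current_counts = [20, 30, 50]
print(round(js_distance(baseline_counts, current_counts), 3))
```

Identical histograms yield 0 and histograms with no overlapping bins yield 1, which is why a threshold on this metric reads as a fraction of the maximum possible shift.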
When you specify assets_dir at monitor creation, a Databricks SQL dashboard is generated automatically. The dashboard includes the following visualizations.
Dashboards are generated as Databricks SQL Lakeview Dashboards and stored in assets_dir. You can manually customize the dashboard after generation, but recreating the monitor also regenerates the dashboard. For substantial customization, we recommend copying the dashboard to a separate location and working from there.
Lakehouse Monitoring alone has no built-in alert notification feature. To get notified on anomalies, pair it with Databricks SQL Alerts running against the drift_metrics table.
```sql
-- Alert query: find columns whose drift score exceeds the threshold
SELECT
  column_name,
  js_distance,
  window.start AS window_start,
  window.end AS window_end
FROM catalog.schema.sales_daily_drift_metrics
WHERE js_distance > 0.3
  AND window.end = (
    SELECT MAX(window.end)
    FROM catalog.schema.sales_daily_drift_metrics
  );
```

Register this query as a Databricks SQL Alert and have it send notifications via email, Slack, or webhook whenever the scheduled run returns one or more rows. The threshold depends on your data's characteristics, but 0.1 to 0.3 for JS distance is a reasonable starting point. In production, tune the threshold incrementally to reduce the false-positive rate.
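One way to approach threshold tuning is to replay past drift scores against a few candidate thresholds and see how often each would have fired. The sketch below does this in plain Python. The `history` values are invented for illustration, and `alert_rate` is a hypothetical helper; in practice the scores would come from the js_distance column of the drift_metrics table.

```python
from typing import Sequence


def alert_rate(scores: Sequence[float], threshold: float) -> float:
    """Fraction of past runs whose drift score would have exceeded the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s > threshold) / len(scores)


# Invented js_distance values for one column across ten past monitor runs
history = [0.04, 0.07, 0.12, 0.05, 0.09, 0.31, 0.06, 0.15, 0.08, 0.05]

for threshold in (0.1, 0.2, 0.3):
    rate = alert_rate(history, threshold)
    print(f"threshold={threshold:.1f}: would alert on {rate:.0%} of runs")
```

If the runs that exceed a low threshold turn out to be benign, raising the threshold one step at a time trades sensitivity for fewer false positives.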
Databricks ships several mechanisms for guaranteeing data quality. Each runs at a different point and serves a different purpose, so it's important not to confuse them.
| Feature | When It Runs | Target | Detection Method | Primary Use |
|---|---|---|---|---|
| DLT Expectations | During ETL pipeline execution (real time) | Records flowing through a DLT pipeline | Per-row rule-based checks (EXPECT, EXPECT OR DROP, EXPECT OR FAIL) | Filtering bad records or halting the pipeline |
| Delta Constraints | On write to the table | Any Delta table | CHECK and NOT NULL constraints; writes fail on violation | Schema-level integrity guarantees |
| Lakehouse Monitoring | After write to the table (batch, on a schedule) | Unity Catalog-registered tables | Drift detection via statistical tests (KS test, chi-square, etc.) | Continuous table quality monitoring and data drift detection |
A typical production setup is a three-layer design: DLT Expectations filter out "obviously bad records" at the pipeline stage, Delta Constraints enforce "table-level invariants" (primary keys are NOT NULL, etc.), and Lakehouse Monitoring watches "whether the quality of already-written data is degrading over time."
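The division of labor described above can be illustrated with a plain-Python analogy. It does not use the DLT or Lakehouse Monitoring APIs; the rule, the column name `amount`, and the sample rows are all hypothetical. The point is that every row can pass a per-record rule while a table-wide statistic still shifts.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def row_rule(row: Row) -> bool:
    """Per-record rule in the spirit of an expectation: amount, when present, is non-negative."""
    return row["amount"] is None or row["amount"] >= 0


def null_rate(rows: List[Row], column: str) -> float:
    """Table-wide statistic in the spirit of a profile metric: share of NULLs in a column."""
    if not rows:
        raise ValueError("rows must not be empty")
    return sum(1 for r in rows if r[column] is None) / len(rows)


# Hypothetical data: the NULL share jumps from 2% to 40% after an upstream change
last_week: List[Row] = [{"amount": 10.0}] * 98 + [{"amount": None}] * 2
this_week: List[Row] = [{"amount": 10.0}] * 60 + [{"amount": None}] * 40

all_rows_pass = all(row_rule(r) for r in this_week)
increase = null_rate(this_week, "amount") - null_rate(last_week, "amount")

print(f"every row passes the per-record rule: {all_rows_pass}")
print(f"NULL rate increase between windows: {increase:.2f}")
```

Because the column is nullable, neither a per-record rule nor a table constraint rejects these rows; only a comparison of statistics across windows surfaces the change.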
Lakehouse Monitoring appears on the Databricks Certified ML Associate (MLA) and ML Professional (MLP) exams in the context of "model monitoring" and "data drift." The main question patterns are below.
On the exam, you need to instantly map "drift detection = Lakehouse Monitoring," "per-record validation = DLT Expectations," and "schema constraints = Delta Constraints."
ML Associate / ML Professional
Question 1
An ML engineer wants to monitor inference quality on a production model serving endpoint. Inference logs are stored in a Unity Catalog Delta table that contains prediction, timestamp, and model ID columns. Which approach best tracks input feature drift and changes in the prediction distribution over time?
Correct answer: A
The InferenceLog profile is purpose-built for inference log tables; specifying prediction_col, timestamp_col, and model_id_col automatically tracks input feature drift and changes in the prediction distribution. DLT Expectations are per-record validations inside the pipeline and are unsuited to after-the-fact statistical monitoring. Delta Constraints are schema-level rules with no drift detection. MLflow is for experiment and model lifecycle management and has no built-in data drift detection.
When should I use Lakehouse Monitoring vs. DLT Expectations?
DLT Expectations apply per-record quality rules (NOT NULL, range checks, etc.) inside a Delta Live Tables pipeline in real time. Lakehouse Monitoring, in contrast, periodically computes table-wide statistics and detects drift and distribution shifts over time. The rule of thumb is: use Expectations for in-flight ETL validation, and Lakehouse Monitoring for continuous post-ETL table quality monitoring.
Are there license or cost implications for using Lakehouse Monitoring?
Lakehouse Monitoring is available on any Unity Catalog-enabled workspace. Creating a monitor is free, but profile computation runs on Serverless Compute, so Serverless DBUs are consumed at execution time. Metrics tables are stored as Delta tables, so you also pay for storage. For large tables, you can optimize costs by limiting the compute frequency and the number of columns monitored.
When should I use the InferenceLog profile?
Use InferenceLog on tables that log ML model inference results. When you specify the prediction, timestamp, and model ID columns, it automatically tracks shifts in the prediction distribution, drift in input features, and accuracy differences between model versions. Combined with Model Serving payload logs, it gives you early detection of quality degradation in production models.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.