Lakehouse Monitoring: Data Quality & Drift Detection (2026)

Data quality in production tables degrades over time. Even when the schema stays the same, rising NULL rates, skewed value distributions, and silent data corruption from upstream system changes happen constantly. Lakehouse Monitoring is Databricks' native feature that periodically computes statistical profiles on Unity Catalog-registered Delta tables and automatically detects quality changes and drift.

This article walks through how to choose between the 3 profile types, concrete SQL/API for creating monitors, the structure of the metrics tables, auto-generated dashboards, alerting setup, and the points tested on the Databricks certification exams (MLA/MLP).

How Lakehouse Monitoring Works

When you create a monitor on a Unity Catalog managed or external table, Lakehouse Monitoring computes statistics on the schedule you define and writes the results to two metrics tables (profile_metrics and drift_metrics) as Delta tables. The computation runs on Serverless Compute, so you do not need to provision a cluster yourself.

You pick one "profile type" when creating the monitor. The profile type determines which statistics are computed and how drift is detected. Choose the type that matches your table's nature: static master data, time-series data, or inference logs.
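To make the idea of a "statistical profile" concrete, here is a minimal plain-Python sketch of what one scheduled run conceptually does: compute per-column statistics and append them to a metrics log. The function names (profile_column, run_monitor_once), the output field names, and the sample rows are hypothetical illustrations. This is not how Lakehouse Monitoring is implemented internally.

```python
import math
from datetime import datetime, timezone


def profile_column(rows, column):
    """Compute a small profile for one column of an in-memory table.

    `rows` is a list of dicts; a missing key or None counts as NULL.
    """
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]

    profile = {
        "column_name": column,
        "count": len(values),
        "null_rate": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "distinct_count": len(set(non_null)),
        "mean": None,
        "stddev": None,
    }
    # Mean and standard deviation only make sense for fully numeric columns.
    if numeric and len(numeric) == len(non_null):
        mean = sum(numeric) / len(numeric)
        profile["mean"] = mean
        if len(numeric) > 1:
            variance = sum((v - mean) ** 2 for v in numeric) / (len(numeric) - 1)
            profile["stddev"] = math.sqrt(variance)
    return profile


def run_monitor_once(rows, columns, metrics_log):
    """Append one profile row per column, as a scheduled run would."""
    run_time = datetime.now(timezone.utc).isoformat()
    for column in columns:
        metrics_log.append({"run_time": run_time, **profile_column(rows, column)})
    return metrics_log


# Synthetic sample data, for illustration only.
orders = [
    {"amount": 10.0, "region": "east"},
    {"amount": 30.0, "region": None},
    {"amount": None, "region": "west"},
    {"amount": 20.0, "region": "east"},
]
metrics = run_monitor_once(orders, ["amount", "region"], [])
```

Because every run appends rather than overwrites, the log accumulates a history of each statistic, which is what makes trend tracking and drift comparison possible.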

Comparing the 3 Profile Types

Lakehouse Monitoring has 3 profile types, and you choose one based on the data characteristics of the target table.

Profile Type	Target Table	What's Computed	Drift Detection Basis
Snapshot	Static master and dimension tables	Table-wide statistics at run time (mean, variance, NULL rate, cardinality, etc.)	Diff vs. the previous snapshot
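TimeSeries	Fact tables and log data with a timestamp column	Statistics per time window, tracking change across windows	Statistical tests and distances (KS test, chi-square, Jensen-Shannon distance, etc.) between consecutive time windows, plus comparison vs. a baseline table when one is set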
InferenceLog	ML inference result tables (predictions plus input features)	Prediction distribution, input feature distribution, and per-model-version statistics	Comparison vs. a baseline table plus comparison across time windows

Snapshot is the simplest: it periodically records the "current state" of the table. TimeSeries takes a timestamp_col and computes statistics over aggregation windows defined by granularities (daily, weekly, etc.). InferenceLog is an extension of TimeSeries that adds prediction_col, model_id_col, and (optionally) label_col, automatically computing ML-specific metrics like prediction drift, input drift, and cross-model comparisons.
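The selection rule described above can be written down as a small decision function. This is a sketch only: choose_profile_type and its two boolean inputs are hypothetical names, and real tables may need more judgment (for example, a dimension table with an updated_at column is usually still a Snapshot candidate).

```python
def choose_profile_type(has_event_timestamp_col, is_inference_log):
    """Pick a profile type from two coarse table characteristics.

    has_event_timestamp_col: the table has a timestamp column that
        orders its rows in time (not just an audit column).
    is_inference_log: the table stores model predictions plus inputs.
    """
    if is_inference_log:
        # InferenceLog builds on TimeSeries, so it also needs a timestamp.
        if not has_event_timestamp_col:
            raise ValueError("InferenceLog needs a timestamp column")
        return "InferenceLog"
    if has_event_timestamp_col:
        return "TimeSeries"
    return "Snapshot"


profile_for_dimension_table = choose_profile_type(False, False)
profile_for_sales_facts = choose_profile_type(True, False)
profile_for_predictions = choose_profile_type(True, True)
```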

Creating Monitors via SQL and the API

You create monitors via SQL (CREATE MONITOR) or the Databricks SDK (Python API). The following SQL example creates a monitor with the TimeSeries profile.

-- Create a monitor with the TimeSeries profile
CREATE MONITOR catalog.schema.sales_daily
USING TIMESERIES (
  TIMESTAMP_COL order_date,
  GRANULARITIES ('1 day', '1 week')
)
WITH (
  LINKED_ENTITIES (catalog.schema.sales_daily),
  SCHEDULE CRON '0 8 * * *',
  BASELINE_TABLE catalog.schema.sales_daily_baseline
);

The equivalent operation in the Python API looks like this.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import (
    MonitorTimeSeries, MonitorCronSchedule
)

w = WorkspaceClient()

w.quality_monitors.create(
    table_name="catalog.schema.sales_daily",
    assets_dir="/Shared/monitors/sales_daily",
    output_schema_name="catalog.schema",
    time_series=MonitorTimeSeries(
        timestamp_col="order_date",
        granularities=["1 day", "1 week"]
    ),
    schedule=MonitorCronSchedule(
        quartz_cron_expression="0 0 8 * * ?",
        timezone_id="Asia/Tokyo"
    ),
    baseline_table_name="catalog.schema.sales_daily_baseline"
)

For InferenceLog, replace time_series with the inference_log parameter (a MonitorInferenceLog) and set timestamp_col, granularities, prediction_col, model_id_col, and problem_type (classification or regression); label_col is optional. For Snapshot, pass snapshot=MonitorSnapshot() in place of time_series or inference_log so that the profile type is set explicitly.
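As a rough illustration of how the three profile types differ at creation time, the sketch below builds the request body as a plain dictionary and checks that the profile-specific fields are present. The helper build_monitor_payload, the REQUIRED_FIELDS table, and the example table and path names are hypothetical; the field names and the PROBLEM_TYPE_CLASSIFICATION value reflect our understanding of the monitor REST API and should be verified against the current API reference. The sketch always sets one profile key explicitly, including an empty snapshot object.

```python
import json

# Fields assumed to be required per profile type (verify against the API docs).
REQUIRED_FIELDS = {
    "snapshot": [],
    "time_series": ["timestamp_col", "granularities"],
    "inference_log": [
        "timestamp_col",
        "granularities",
        "prediction_col",
        "model_id_col",
        "problem_type",
    ],
}


def build_monitor_payload(assets_dir, output_schema_name, profile, profile_config=None):
    """Return a request body containing exactly one profile key."""
    if profile not in REQUIRED_FIELDS:
        raise ValueError(f"unknown profile type: {profile}")
    config = dict(profile_config or {})
    missing = [field for field in REQUIRED_FIELDS[profile] if field not in config]
    if missing:
        raise ValueError(f"{profile} is missing fields: {missing}")
    return {
        "assets_dir": assets_dir,
        "output_schema_name": output_schema_name,
        profile: config,
    }


# Hypothetical inference log table with an optional label column.
inference_payload = build_monitor_payload(
    assets_dir="/Shared/monitors/churn_inference_log",
    output_schema_name="catalog.schema",
    profile="inference_log",
    profile_config={
        "timestamp_col": "scored_at",
        "granularities": ["1 day"],
        "prediction_col": "prediction",
        "model_id_col": "model_version",
        "problem_type": "PROBLEM_TYPE_CLASSIFICATION",
        "label_col": "churned",
    },
)

snapshot_payload = build_monitor_payload(
    assets_dir="/Shared/monitors/dim_customer",
    output_schema_name="catalog.schema",
    profile="snapshot",
)

print(json.dumps(snapshot_payload, indent=2))
```

Validating the configuration before sending it surfaces a missing prediction_col or model_id_col locally rather than as an API error.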

Structure of the Metrics Tables

When a monitor runs, two Delta tables are automatically created in the output_schema you specified.

Table	Contents	Key Columns
{table_name}_profile_metrics	Per-column statistics (mean, standard deviation, NULL rate, min/max, histograms, etc.)	window, granularity, column_name, data_type, count, num_nulls, avg, stddev, min, max, distinct_count, quantiles
{table_name}_drift_metrics	Statistical differences between windows or against the baseline	window, window_cmp, column_name, drift_type, chi_squared_test, ks_test, js_distance, wasserstein_distance

Each monitor run appends rows to profile_metrics, letting you track statistic trends over time. drift_metrics records statistical test results: chi-square for categorical variables, and KS test, JS distance, and Wasserstein distance for numerical variables. Because these are regular Delta tables, you can query them directly from Databricks SQL or notebooks to build custom analyses and alerts.
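To give a feel for what a js_distance value means, here is a minimal plain-Python sketch of the Jensen-Shannon distance between two histograms that share the same bins. It uses base-2 logarithms, so the result lies between 0 (identical distributions) and 1 (no overlap). This is an illustration of the metric itself under those assumptions, not Databricks' implementation, and the binning that Lakehouse Monitoring applies may differ.

```python
import math


def js_distance(p_counts, q_counts):
    """Jensen-Shannon distance between two histograms over the same bins."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must use the same bins")
    p_total, q_total = sum(p_counts), sum(q_counts)
    if p_total <= 0 or q_total <= 0:
        raise ValueError("histograms must contain at least one observation")

    # Normalize counts to probabilities and build the mixture distribution.
    p = [count / p_total for count in p_counts]
    q = [count / q_total for count in q_counts]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl_divergence(a, b):
        # Terms with a == 0 contribute nothing; b > 0 whenever a > 0 here.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))


# Synthetic histograms: last week's vs. this week's order amounts in 4 bins.
baseline_hist = [40, 30, 20, 10]
current_hist = [10, 20, 30, 40]
drift = js_distance(baseline_hist, current_hist)
```

Unlike KL divergence, this distance is symmetric and always finite, which is one reason it is convenient as a thresholded drift score.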

Auto-Generated Dashboards

When you specify assets_dir at monitor creation, a Databricks SQL dashboard is generated automatically. The dashboard includes the following visualizations.

Per-column statistic trends (time-series charts of NULL rate, mean, and cardinality)
Drift score heatmap showing which columns changed at which point in time
Distribution histograms for numerical columns, comparing baseline vs. current
For InferenceLog: changes in the prediction distribution and comparisons across model versions

Dashboards are generated as Lakeview dashboards (since renamed AI/BI dashboards) and stored in assets_dir. You can customize the dashboard manually after generation, but recreating the monitor also regenerates the dashboard. For substantial customization, we recommend copying the dashboard to a separate location and working from the copy.

Setting Up Alerts

Lakehouse Monitoring alone has no built-in alert notification feature. To get notified on anomalies, pair it with Databricks SQL Alerts running against the drift_metrics table.

-- Alert query: find columns whose drift score exceeds the threshold
SELECT
  column_name,
  js_distance,
  window.start AS window_start,
  window.end AS window_end
FROM catalog.schema.sales_daily_drift_metrics
WHERE js_distance > 0.3
  AND window.end = (
    SELECT MAX(window.end)
    FROM catalog.schema.sales_daily_drift_metrics
  );

Register this query as a Databricks SQL Alert and have it send notifications via email, Slack, or webhook whenever the scheduled run returns one or more rows. The threshold depends on your data's characteristics, but 0.1 to 0.3 for JS distance is a reasonable starting point. In production, tune the threshold incrementally to reduce the false-positive rate.
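One way to approach threshold tuning is to replay candidate thresholds against drift history before committing to one. The sketch below mirrors the alert query's logic in plain Python and counts how many alerts each candidate would have produced. The function names, the flattened window_end field, and the sample rows are hypothetical; in practice you would load rows from the drift_metrics table.

```python
def columns_over_threshold(drift_rows, threshold):
    """Columns whose JS distance exceeds the threshold in the latest window."""
    if not drift_rows:
        return []
    latest_end = max(row["window_end"] for row in drift_rows)
    return sorted(
        row["column_name"]
        for row in drift_rows
        if row["window_end"] == latest_end
        and row["js_distance"] is not None
        and row["js_distance"] > threshold
    )


def alerts_per_threshold(drift_rows, thresholds):
    """Count historical (column, window) rows that each threshold would flag."""
    return {
        threshold: sum(
            1
            for row in drift_rows
            if row["js_distance"] is not None and row["js_distance"] > threshold
        )
        for threshold in thresholds
    }


# Synthetic drift history; ISO date strings compare correctly as text.
history = [
    {"column_name": "amount", "window_end": "2026-01-01", "js_distance": 0.05},
    {"column_name": "region", "window_end": "2026-01-01", "js_distance": 0.12},
    {"column_name": "amount", "window_end": "2026-01-02", "js_distance": 0.35},
    {"column_name": "region", "window_end": "2026-01-02", "js_distance": 0.08},
]

flagged_now = columns_over_threshold(history, 0.3)
alert_counts = alerts_per_threshold(history, [0.1, 0.2, 0.3])
```

If a candidate threshold would have fired on most historical windows that you know were healthy, it is probably too low for that table.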

DLT Expectations vs. Delta Constraints vs. Lakehouse Monitoring

Databricks ships several mechanisms for guaranteeing data quality. Each runs at a different point and serves a different purpose, so it's important not to confuse them.

Feature	When It Runs	Target	Detection Method	Primary Use
DLT Expectations	During ETL pipeline execution (real time)	Records flowing through a DLT pipeline	Per-row rule-based checks (expect, expect_or_drop, expect_or_fail in Python; EXPECT ... ON VIOLATION DROP ROW / FAIL UPDATE in SQL)	Filtering bad records or halting the pipeline
Delta Constraints	On write to the table	Any Delta table	CHECK and NOT NULL constraints; writes fail on violation	Schema-level integrity guarantees
Lakehouse Monitoring	After write to the table (batch, on a schedule)	Unity Catalog-registered tables	Drift detection via statistical tests (KS test, chi-square, etc.)	Continuous table quality monitoring and data drift detection

A typical production setup is a three-layer design: DLT Expectations filter out "obviously bad records" at the pipeline stage, Delta Constraints enforce "table-level invariants" (primary keys are NOT NULL, etc.), and Lakehouse Monitoring watches "whether the quality of already-written data is degrading over time."
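The division of labor between the three layers can be sketched as a plain-Python analogy. Nothing here calls a Databricks API: expect_or_drop, enforce_constraint, and null_rate are hypothetical stand-ins for a DLT expectation, a Delta constraint, and a monitored statistic, and the rows are synthetic.

```python
def expect_or_drop(rows, predicate):
    """Layer 1 (pipeline): silently drop records that fail a per-row rule."""
    return [row for row in rows if predicate(row)]


def enforce_constraint(rows, predicate, name):
    """Layer 2 (write time): reject the whole write if any row violates."""
    for row in rows:
        if not predicate(row):
            raise ValueError(f"constraint {name} violated by {row}")
    return rows


def null_rate(rows, column):
    """Layer 3 (after write): an aggregate statistic tracked over time."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


incoming = [
    {"order_id": 1, "amount": 10.0, "coupon": None},
    {"order_id": 2, "amount": -5.0, "coupon": "A"},   # obviously bad record
    {"order_id": 3, "amount": 7.5, "coupon": None},
]

clean = expect_or_drop(incoming, lambda row: row["amount"] >= 0)
table = enforce_constraint(
    clean, lambda row: row["order_id"] is not None, "order_id_not_null"
)
coupon_null_rate = null_rate(table, "coupon")
```

Every surviving row passes both per-row checks, yet the coupon column is entirely NULL. That kind of aggregate degradation is invisible to the first two layers and is what the monitoring layer exists to catch.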

What's Tested on the MLA/MLP Exams

Lakehouse Monitoring appears on Databricks Certified ML Associate (MLA) and ML Professional (MLP) in the context of "model monitoring" and "data drift." The main question patterns are below.

"I want to monitor whether the input distribution of a production ML model is changing." → InferenceLog profile. Attach the monitor to the Model Serving payload log table and detect input feature drift.
"Which statistical methods are used for data drift detection?" → KS test (Kolmogorov-Smirnov) for numerical variables, chi-square for categorical variables. Also remember Jensen-Shannon distance.
"Where are monitor results stored?" → Two Delta tables: profile_metrics and drift_metrics. The destination is the output_schema you specify at monitor creation.
"How do Lakehouse Monitoring and MLflow relate?" → Lakehouse Monitoring handles data and prediction quality monitoring; MLflow handles model lifecycle management (experiment tracking, registry, deployment). They are complementary.
"What's the difference between DLT Expectations, Delta Constraints, and Lakehouse Monitoring?" → Know precisely how their timing and purpose differ.

On the exam, you need to instantly map "drift detection = Lakehouse Monitoring," "per-record validation = DLT Expectations," and "schema constraints = Delta Constraints."
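For exam preparation it can help to see why the KS test suits numerical columns and chi-square suits categorical ones. The plain-Python sketch below computes the two test statistics only (no p-values): the KS statistic is the largest gap between two empirical CDFs, and the chi-square statistic compares category counts in a 2 x k contingency table. The function names and sample data are hypothetical, and this is not Databricks' implementation.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def chi_square_statistic(baseline_counts, current_counts):
    """Pearson chi-square statistic for two sets of category counts."""
    categories = sorted(set(baseline_counts) | set(current_counts))
    observed = [
        [baseline_counts.get(c, 0) for c in categories],
        [current_counts.get(c, 0) for c in categories],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    if min(row_totals) == 0:
        raise ValueError("both count tables must be non-empty")

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # categories with zero total count are skipped
                statistic += (count - expected) ** 2 / expected
    return statistic


# Synthetic examples: a shifted numerical column and a skewed categorical one.
ks_shifted = ks_statistic([1, 2, 3], [4, 5, 6])
chi_skewed = chi_square_statistic({"a": 10, "b": 10}, {"a": 20, "b": 0})
```

The KS statistic depends on ordering values along a number line, which categories do not have; chi-square only needs counts per category, which is why the two tests split along data type.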

Check Your Understanding

ML Associate / ML Professional

Question 1

An ML engineer wants to monitor inference quality on a production model serving endpoint. Inference logs are stored in a Unity Catalog Delta table that contains prediction, timestamp, and model ID columns. Which approach best tracks input feature drift and changes in the prediction distribution over time?

A. Create a Lakehouse Monitoring monitor with the InferenceLog profile, specifying prediction_col, timestamp_col, and model_id_col
B. Define a DLT Expectations rule that range-checks prediction values and validate inside the pipeline
C. Add a Delta Constraint CHECK to enforce that the prediction value is NOT NULL
D. Manually compare per-version metrics in the MLflow Model Registry

Correct answer: A

The InferenceLog profile is purpose-built for inference log tables; specifying prediction_col, timestamp_col, and model_id_col automatically tracks input feature drift and changes in the prediction distribution. DLT Expectations are per-record validations inside the pipeline and are unsuited to after-the-fact statistical monitoring. Delta Constraints are schema-level rules with no drift detection. MLflow is for experiment and model lifecycle management and has no built-in data drift detection.

Frequently Asked Questions

When should I use Lakehouse Monitoring vs. DLT Expectations?

DLT Expectations apply per-record quality rules (NOT NULL, range checks, etc.) inside a Delta Live Tables pipeline in real time. Lakehouse Monitoring, in contrast, periodically computes table-wide statistics and detects drift and distribution shifts over time. The rule of thumb is: use Expectations for in-flight ETL validation, and Lakehouse Monitoring for continuous post-ETL table quality monitoring.

Are there license or cost implications for using Lakehouse Monitoring?

Lakehouse Monitoring is available on any Unity Catalog-enabled workspace. Creating a monitor is free, but profile computation runs on Serverless Compute, so Serverless DBUs are consumed at execution time. Metrics tables are stored as Delta tables, so you also pay for storage. For large tables, you can optimize costs by limiting the compute frequency and the number of columns monitored.

When should I use the InferenceLog profile?

Use InferenceLog on tables that log ML model inference results. When you specify the prediction, timestamp, and model ID columns, it automatically tracks shifts in the prediction distribution, drift in input features, and differences between model versions; accuracy metrics additionally require a label column (label_col) holding ground truth. Combined with Model Serving payload logs, it gives you early detection of quality degradation in production models.


Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
