Databricks

Liquid Clustering Complete Guide: Delta Lake's Next-Gen Clustering

2026-03-26
更新: 2026-03-27
NicheeLab Editorial Team

As Delta Lake tables grow, controlling how much data each query scans becomes the dominant performance factor. Traditionally, teams used partitioning or Z-ORDER to shape data layout, but changing partition keys required rebuilding the table, and Z-ORDER re-sorted the entire table every time it ran.

Liquid Clustering is Delta Lake's next-generation data layout optimization, introduced in DBR 13.3 LTS (GA in late 2023). With three defining features — no partitions, dynamic clustering key changes, and incremental re-clustering — it fundamentally removes the constraints of the older approaches.

Three Defining Features of Liquid Clustering

Liquid Clustering fundamentally differs from older approaches in these three ways.

  • No partitions required: You no longer need PARTITIONED BY to split physical directories. You also do not need to worry about partition column cardinality (the number of distinct values), and data skipping works well even with high-cardinality columns such as user_id or timestamp.
  • Instant key changes: A single ALTER TABLE changes the clustering keys. No table rebuild and no full data copy — only a metadata update. The next OPTIMIZE incrementally re-clusters the table with the new keys.
  • Incremental optimization: OPTIMIZE only processes data that has not yet been clustered (files added by recent INSERTs). Unlike Z-ORDER, which re-sorts the entire table every time, the work scales with new data volume, so even very large tables finish quickly.

CLUSTER BY Syntax

Liquid Clustering is configured with the CLUSTER BY clause. You can specify it at table creation time or add it to an existing table afterwards.

Specifying at table creation time

-- 新規テーブルをLiquid Clusteringで作成
CREATE TABLE catalog.schema.sales (
  sale_id     BIGINT,
  customer_id BIGINT,
  region      STRING,
  sale_date   DATE,
  amount      DECIMAL(18,2)
)
CLUSTER BY (region, sale_date);

-- CTASでLiquid Clusteringテーブルを作成
CREATE TABLE catalog.schema.sales_optimized
CLUSTER BY (customer_id, sale_date)
AS SELECT * FROM catalog.schema.raw_sales;

Changing clustering keys

-- クラスタリングキーを変更(メタデータ更新のみ、即時完了)
ALTER TABLE catalog.schema.sales
CLUSTER BY (customer_id, region);

-- クラスタリングを無効化
ALTER TABLE catalog.schema.sales
CLUSTER BY NONE;

CLUSTER BY NONE disables clustering. The existing data layout is preserved, but subsequent OPTIMIZE runs will no longer perform clustering.

Z-ORDER vs. Partitioning vs. Liquid Clustering

Here is a side-by-side comparison of the three data layout optimization techniques. Liquid Clustering combines the strengths of both predecessors while eliminating their weaknesses.

AspectPartitioningZ-ORDERLiquid Clustering
CardinalityLow cardinality recommended (tens to hundreds). High cardinality triggers the small files problem.No constraintsNo constraints
Key changesRequires rebuilding the table (directory structure change)Can change at OPTIMIZE time, but re-sorts the whole tableInstant change with a single ALTER TABLE (metadata-only update)
Incremental optimizationPossible at partition granularityNot supported (sorts the whole table every time)Processes only unclustered data
Need for OPTIMIZEOnly for file compactionRequired (run OPTIMIZE ZORDER BY)Required, but fast because it is incremental
Data skippingPartition pruning (directory granularity)File-level min/max statisticsHigh-precision file-level skipping via Hilbert Curve
Recommended use caseLegacy tables; low-cardinality columns such as date or regionRetrofitted optimization on existing tables (environments without Liquid Clustering support)All new tables; tables where keys may change; high-cardinality filters

Databricks officially recommends Liquid Clustering for all new tables. Partitioning and Z-ORDER remain available primarily for backward compatibility with existing tables.

Internals: Hilbert Curve and Incremental Re-clustering

Internally, Liquid Clustering projects multi-dimensional data onto one dimension using a Hilbert Curve, a space-filling curve. Compared with the Z-curve (Morton curve) used by Z-ORDER, the Hilbert Curve keeps spatially adjacent data points closer together along the 1D projection, which improves data skipping precision.

┌──────────────────────────────────────────────────────┐
│  Z-Curve (Morton曲線)           Hilbert Curve         │
│                                                       │
│  0 ─ 1   4 ─ 5                0 ─ 1   14─ 15        │
│  │   │   │   │                │   │    │   │         │
│  3 ─ 2   7 ─ 6                3 ─ 2   13─ 12        │
│                                    │        │         │
│  12─ 13  8 ─ 9                4 ─ 5   10─ 11        │
│  │   │   │   │                │   │    │   │         │
│  15─ 14  11─ 10               7 ─ 6    9 ─ 8        │
│                                                       │
│  → 空間的に近いデータが       → 空間的に近いデータが  │
│    1次元で離れるケースあり       常に1次元でも近接     │
└──────────────────────────────────────────────────────┘

Hilbert Curveの利点:
  - 空間的局所性がZ-Curveより高い
  - WHERE region = 'APAC' AND sale_date = '2026-03-01'
    のような多次元フィルタでスキッピング精度が向上
  - 統計情報(min/max)の範囲が狭くなり、
    不要ファイルの除外率が高まる

How incremental re-clustering works

Liquid Clustering attaches metadata to each data file indicating whether it is "clustered" or "unclustered". When OPTIMIZE runs, only the unclustered files (those added by recent INSERT, UPDATE, or MERGE) are eligible for clustering work.

-- 増分再クラスタリングの流れ

-- Step 1: テーブル作成
CREATE TABLE events CLUSTER BY (user_id, event_date) ...;

-- Step 2: データ追加(未クラスタリング状態でファイルが追加)
INSERT INTO events SELECT * FROM raw_events;
-- → 新規ファイルは is_clustered = false

-- Step 3: OPTIMIZE実行
OPTIMIZE events;
-- → is_clustered = false のファイルだけを読み込み
-- → Hilbert Curveで再配置し、新しいファイルとして書き出し
-- → 新ファイルは is_clustered = true
-- → 古いファイルはトランザクションログで無効化

-- Step 4: 追加データのみ再クラスタリング
INSERT INTO events SELECT * FROM new_events;
OPTIMIZE events;
-- → Step 4のINSERTで追加されたファイルだけが対象
-- → Step 3でクラスタリング済みのファイルは処理しない

Relationship with OPTIMIZE

On tables with Liquid Clustering configured, the OPTIMIZE command behaves differently from a regular Delta table.

BehaviorRegular Delta tableLiquid Clustering table
OPTIMIZESmall file compaction onlyCompaction plus Hilbert Curve re-clustering
OPTIMIZE ... ZORDER BYReorganizes data with Z-ORDERError (cannot be combined with CLUSTER BY)
Processed scopeEntire table (with Z-ORDER)Only unclustered files (incremental)
-- Liquid Clusteringテーブルでの正しいOPTIMIZE
OPTIMIZE catalog.schema.sales;
-- → 未クラスタリングデータだけを増分処理

-- エラーになる例
OPTIMIZE catalog.schema.sales ZORDER BY (region);
-- → Error: CLUSTER BYが設定されたテーブルにZORDER BYは使用できません

Interaction with auto-OPTIMIZE

When Databricks Predictive Optimization is enabled, OPTIMIZE is automatically scheduled for Unity Catalog managed tables. On Liquid Clustering tables, this auto-OPTIMIZE also runs incremental re-clustering at the same time. You no longer need to run OPTIMIZE by hand, dramatically reducing operational overhead.

-- Predictive Optimizationの有効化(Unity Catalogスキーマ単位)
ALTER SCHEMA catalog.schema
SET DBPROPERTIES ('delta.enablePredictiveOptimization' = 'true');

-- テーブル単位での設定
ALTER TABLE catalog.schema.sales
SET TBLPROPERTIES ('delta.enablePredictiveOptimization' = 'true');

-- 有効な場合のLiquid Clusteringテーブルの運用:
-- 1. OPTIMIZE が自動スケジュールされる
-- 2. 増分再クラスタリングが自動実行される
-- 3. VACUUMも自動スケジュールされる
-- → 手動でのメンテナンスコマンド実行が不要に

Migration Patterns from Existing Tables

The migration steps from partitioned tables or Z-ORDER tables to Liquid Clustering differ depending on the table type.

Pattern 1: Migrating from a non-partitioned table that uses Z-ORDER

-- Step 1: Liquid Clusteringを有効化
ALTER TABLE catalog.schema.events
CLUSTER BY (user_id, event_date);

-- Step 2: 増分再クラスタリング実行
OPTIMIZE catalog.schema.events;

-- これ以降、OPTIMIZE ZORDER BY は不要(使用するとエラー)
-- 以降のOPTIMIZEで自動的にHilbert Curveベースのクラスタリングが実行

Pattern 2: Migrating from a partitioned table

You cannot enable Liquid Clustering directly on a partitioned table. You need to create a new table using CTAS (CREATE TABLE AS SELECT) or DEEP CLONE.

-- 方法A: CTAS で新テーブル作成
CREATE TABLE catalog.schema.events_v2
CLUSTER BY (region, event_date)
AS SELECT * FROM catalog.schema.events_partitioned;

-- 方法B: DEEP CLONE + ALTER TABLE
CREATE TABLE catalog.schema.events_v2
DEEP CLONE catalog.schema.events_partitioned;

ALTER TABLE catalog.schema.events_v2
CLUSTER BY (region, event_date);

OPTIMIZE catalog.schema.events_v2;

-- 移行後の検証
SELECT COUNT(*) FROM catalog.schema.events_v2;
DESCRIBE DETAIL catalog.schema.events_v2;
-- clusteringColumns に [region, event_date] が表示されることを確認

Pattern 3: Changing clustering keys

-- 分析パターンの変化に応じてキーを変更
-- Before: region と sale_date でフィルタが多かった
-- After:  customer_id と product_category でフィルタが増えた

ALTER TABLE catalog.schema.sales
CLUSTER BY (customer_id, product_category);

-- 次回のOPTIMIZEで新キーによる増分再クラスタリングが開始
OPTIMIZE catalog.schema.sales;

Constraints and Requirements

  • DBR 13.3 LTS or later is required. Older runtimes do not recognize the CLUSTER BY syntax.
  • Photon recommended: Clustering work is optimized on Photon Engine. It also runs without Photon, but OPTIMIZE will be noticeably slower.
  • Cannot be combined with partitioning: You cannot configure PARTITIONED BY and CLUSTER BY on the same table. Migrating from a partitioned table requires CTAS or DEEP CLONE.
  • Cannot be combined with Z-ORDER: Running OPTIMIZE ... ZORDER BY on a table that has CLUSTER BY configured raises an error.
  • Number of clustering key columns: Up to 4 columns are allowed. Adding more columns dilutes the clustering effect, so 2-3 of your most frequently filtered columns is the recommended sweet spot.
  • Supported column types: Numeric, date, timestamp, and string types are supported. Struct (STRUCT) and map (MAP) types cannot be used as clustering keys.
  • Delta Lake only: Cannot be used with Parquet, CSV, or JSON tables. A Delta format table (managed or external Delta table) is required.

What to Watch for on the Exam

Liquid Clustering is an important topic on both the Data Engineer Associate (DEA) and Spark Developer exams. The differences versus Z-ORDER come up especially often, and you are typically asked to decide which one to use.

  • The decisive difference vs. Z-ORDER: Z-ORDER re-sorts the full dataset every time. Liquid Clustering processes only unclustered data incrementally. When asked which is more efficient for periodic OPTIMIZE on large tables, the answer is Liquid Clustering.
  • Co-existence constraint with partitioning: PARTITIONED BY and CLUSTER BY cannot be configured on the same table. Migrating from a partitioned table requires CTAS. Questions test this constraint directly.
  • Changing keys with ALTER TABLE: You can change keys with ALTER TABLE t CLUSTER BY (new_col), with no need to rebuild the table. This shows up as 'the lowest-cost response when query patterns change'.
  • CLUSTER BY NONE: The syntax for disabling clustering. The existing data layout is preserved, but subsequent OPTIMIZE runs no longer perform clustering.
  • Predictive Optimization: Auto-OPTIMIZE is available for Unity Catalog managed tables. Combined with Liquid Clustering, you get fully automated data layout optimization.
  • Why Hilbert Curve wins: Hilbert Curve has stronger spatial locality than Z-ORDER's Z-curve (Morton curve), improving skipping precision on multi-dimensional filters. Internalize this as the 'why' behind Liquid Clustering's advantage.

Sample Question

Data Engineer Associate

問題 1

A team runs OPTIMIZE ... ZORDER BY (user_id) on the Delta Lake table events every day. The table has grown past 10 TB and OPTIMIZE now takes more than 8 hours. Most queries filter on user_id and event_date. Which is the best way to shorten OPTIMIZE while maintaining or improving query data-skipping?

  1. Reduce the OPTIMIZE ... ZORDER BY columns to just user_id, dropping event_date.
  2. Run ALTER TABLE events CLUSTER BY (user_id, event_date) and drop ZORDER BY from subsequent OPTIMIZEs.
  3. Rebuild the table with PARTITIONED BY (event_date) and run OPTIMIZE per partition.
  4. Increase spark.databricks.delta.optimize.maxFileSize to grow file sizes and reduce the number of files OPTIMIZE processes.

正解: B

Migrating to Liquid Clustering (CLUSTER BY) makes OPTIMIZE incrementally process only the unclustered data, dramatically shortening runtime compared with Z-ORDER, which re-sorts the entire 10 TB every time. Hilbert Curve-based layout matches or exceeds Z-ORDER's skipping precision on the user_id and event_date multi-dimensional filters. Option A reduces the column count but does not fix Z-ORDER's fundamental whole-table re-sort. Option C triggers the small files problem when event_date has high cardinality, and the rebuild cost is substantial. Option D only tweaks file size — it does not reduce the work itself, so it is not a real fix.

Frequently Asked Questions

Can Liquid Clustering and Z-ORDER be used together on the same table?

No. Running OPTIMIZE ... ZORDER BY on a table that has Liquid Clustering (CLUSTER BY) configured raises an error. Liquid Clustering is designed as a superset of Z-ORDER, combining Hilbert Curve-based multi-dimensional data layout with file-level data skipping. If your existing tables use Z-ORDER, migrate them with ALTER TABLE ... CLUSTER BY (col1, col2). From that point on, plain OPTIMIZE (without ZORDER BY) performs incremental re-clustering.

What happens to existing data when I change the Liquid Clustering keys?

When you change the clustering keys with ALTER TABLE ... CLUSTER BY (new_col), existing data is not immediately re-clustered. The next OPTIMIZE incrementally re-clusters any unclustered data (data that is not aligned with the new keys). The key change itself is only a metadata update, with no table rewrite. This is the biggest difference from partitioning, where changing keys requires modifying the physical directory structure and rewriting the entire table.

Is Photon Engine required to use Liquid Clustering?

Not required, but strongly recommended. Liquid Clustering's OPTIMIZE is designed to run most efficiently on Photon, and for large tables you can see 2-5x performance differences versus running without Photon. DBR 13.3 LTS or later is required, and we recommend running on Serverless SQL Warehouses or Photon-enabled clusters. Liquid Clustering still works in environments without Photon, but OPTIMIZE may take significantly longer.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる
Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


Related articles
Databricks

Databricks Certifications: All 7 Exams, Difficulty & Study Plan (2026)

Complete guide to all 7 Databricks certifications — Data Eng...

Databricks

Databricks Exam Difficulty Ranking: All 7 Certs Compared (2026)

Every Databricks certification ranked by difficulty, with st...

Databricks

Databricks Study Guide: Fastest Pass Route & Time Estimates (2026)

How to pass Databricks certifications efficiently. Official ...

Databricks

Databricks Data Engineer Associate: Complete Guide (2026)

Domain-by-domain breakdown of the Databricks Certified Data ...

Databricks

Databricks Data Engineer Professional: Complete Guide (2026)

Tactics for the Databricks Certified Data Engineer Professio...

Browse all Databricks articles (110)
© 2026 NicheeLab All rights reserved.