Delta Lake Command Reference: OPTIMIZE, VACUUM, MERGE (2026)

The SQL commands you use to operate on Delta Lake are the single most heavily tested topic across the Databricks certifications. On the Data Engineer Associate exam, roughly 30-40% of the questions involve Delta Lake, and your understanding of the exact syntax and behavior of OPTIMIZE, VACUUM, MERGE INTO, DESCRIBE HISTORY, RESTORE, and DESCRIBE DETAIL often decides pass vs. fail. This article walks through each command's syntax, behavior, gotchas, and exam patterns in full detail.

Delta Lake Core Commands at a Glance

Command	Purpose	Data change	Metadata change
`OPTIMIZE`	Compacts small files	Physical layout only	Creates a new version
`VACUUM`	Physically deletes unreferenced files	None	None (not recorded in the log)
`MERGE INTO`	Conditional upsert / CDC	INSERT/UPDATE/DELETE	Creates a new version
`DESCRIBE HISTORY`	Inspects change history	None	None
`RESTORE TABLE`	Restores to a previous version	Logical restore	Creates a new version
`DESCRIBE DETAIL`	Inspects physical table metadata	None	None

VACUUM is not recorded in the transaction log, so it never shows up in DESCRIBE HISTORY output. This is a classic exam gotcha.

OPTIMIZE

OPTIMIZE is the compaction command that merges accumulated small Parquet files into larger ones. When Structured Streaming micro-batches or frequent INSERTs generate large numbers of small files, metadata overhead balloons and query performance degrades, so periodic OPTIMIZE runs are recommended.

-- 基本構文：テーブル全体を最適化
OPTIMIZE my_catalog.my_schema.sales

-- WHERE句でパーティションを絞り込み
OPTIMIZE my_catalog.my_schema.sales
  WHERE region = 'APAC' AND order_date >= '2026-01-01'

-- Z-ORDERと組み合わせ（最大4カラム、推奨2〜3カラム）
OPTIMIZE my_catalog.my_schema.sales
  ZORDER BY (customer_id, product_id)

-- Liquid Clustering対応テーブルではZORDER不要
-- OPTIMIZE実行時にCLUSTER BY設定に基づき自動配置
OPTIMIZE my_catalog.my_schema.events

Z-ORDER vs Liquid Clustering

Z-ORDER lays out data along a space-filling curve based on the specified columns to maximize data skipping. But Z-ORDER is only applied when OPTIMIZE runs, so its effect fades with subsequent INSERTs. Liquid Clustering is a newer approach available in Databricks Runtime 13.3+ that automatically places data on write based on the columns specified in CLUSTER BY.

-- Liquid Clusteringテーブルの作成
CREATE TABLE events (
  event_id BIGINT,
  user_id BIGINT,
  event_type STRING,
  event_date DATE
)
CLUSTER BY (user_id, event_date)

-- 既存テーブルにLiquid Clusteringを設定
ALTER TABLE events CLUSTER BY (user_id, event_date)

-- Liquid Clusteringを解除
ALTER TABLE events CLUSTER BY NONE

VACUUM

VACUUM physically deletes old Parquet files that are no longer referenced by the current table version. It cuts storage costs, but be aware that any Time Travel or RESTORE targeting versions tied to the deleted files will fail.

-- デフォルト保持期間（168時間 = 7日間）で実行
VACUUM my_catalog.my_schema.sales

-- 保持期間を30日に指定
VACUUM my_catalog.my_schema.sales RETAIN 720 HOURS

-- DRY RUN：削除対象のファイル一覧を確認（実際には削除しない）
VACUUM my_catalog.my_schema.sales DRY RUN

The default retention period is 168 hours (7 days). Specifying a shorter period would return an error because in-flight readers might still be holding references to those files. You can force a shorter window with the setting below, but it is not recommended in production:

-- 保持期間チェックの無効化（非推奨）
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM my_catalog.my_schema.sales RETAIN 0 HOURS

MERGE INTO

MERGE INTO joins source data with a target Delta Table on a matching condition and performs UPDATE, INSERT, and DELETE as a single atomic operation based on match/no-match outcomes. It is the most important command at the heart of CDC pipelines and upsert patterns.

MERGE INTO gold.customers AS target
USING silver.customers_updates AS source
ON target.customer_id = source.customer_id

WHEN MATCHED AND source.is_deleted = true THEN
  DELETE

WHEN MATCHED THEN
  UPDATE SET
    target.name       = source.name,
    target.email      = source.email,
    target.updated_at = source.updated_at

WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, created_at, updated_at)
  VALUES (source.customer_id, source.name, source.email,
          source.created_at, source.updated_at)

WHEN NOT MATCHED BY SOURCE THEN
  UPDATE SET target.is_active = false

Behavior of Each Clause

Clause	Rows targeted	Allowed actions
WHEN MATCHED	Exists in both source and target	UPDATE / DELETE
WHEN NOT MATCHED	Exists in source, missing in target	INSERT only
WHEN NOT MATCHED BY SOURCE	Exists in target, missing in source	UPDATE / DELETE

DESCRIBE HISTORY

DESCRIBE HISTORY shows the change history recorded in a Delta Table's transaction log. You can inspect each version's operation, user, and parameters — useful for debugging and auditing.

-- 全履歴を表示
DESCRIBE HISTORY my_catalog.my_schema.sales

-- 直近10件の履歴を表示
DESCRIBE HISTORY my_catalog.my_schema.sales LIMIT 10

Output Columns

Column	Type	Description
version	BIGINT	Version number (0-indexed)
timestamp	TIMESTAMP	Commit timestamp
operation	STRING	WRITE / MERGE / DELETE / OPTIMIZE / RESTORE, etc.
operationParameters	MAP	Operation parameters such as mode, predicate, partitionBy
readVersion	BIGINT	Snapshot version that was read
isolationLevel	STRING	Transaction isolation level
operationMetrics	MAP	Stats such as numOutputRows and numOutputBytes
userIdentity	STRING	User who executed the operation

RESTORE TABLE

RESTORE TABLE rolls a Delta Table back to a previous version or timestamp. Use it to undo mistaken DELETEs or UPDATEs. RESTORE creates a new version — it does not rewrite history; instead it appends the restored state as a new version.

-- バージョン番号で復元
RESTORE TABLE my_catalog.my_schema.sales TO VERSION AS OF 5

-- タイムスタンプで復元
RESTORE TABLE my_catalog.my_schema.sales
  TO TIMESTAMP AS OF '2026-03-20 10:00:00'

If the target version needs Parquet files that VACUUM has already deleted, RESTORE fails with FileNotFoundException. For important tables, set a longer VACUUM retention period.

DESCRIBE DETAIL

DESCRIBE DETAIL shows a Delta Table's physical metadata (file location, size, partition info, etc.). Use it to diagnose table health.

DESCRIBE DETAIL my_catalog.my_schema.sales

Output Columns

Column	Description
format	Table format (always "delta")
id	Unique table identifier (UUID)
name	Table name in Unity Catalog
location	Path where data files are stored
numFiles	Number of files in the current version
sizeInBytes	Total size of the current version
partitionColumns	List of partition columns
clusteringColumns	List of Liquid Clustering columns
minReaderVersion	Minimum required reader version
minWriterVersion	Minimum required writer version

numFiles being unusually high suggests you should run OPTIMIZE, while sizeInBytes being larger than expected may mean VACUUM has not been run. DESCRIBE DETAIL is a handy command for checking table health.

Time Travel Queries

Time Travel lets you read past versions of a Delta Table. Unlike RESTORE, it does not change the table's current state.

-- VERSION AS OF 構文
SELECT * FROM my_catalog.my_schema.sales VERSION AS OF 3

-- TIMESTAMP AS OF 構文
SELECT * FROM my_catalog.my_schema.sales
  TIMESTAMP AS OF '2026-03-20 10:00:00'

-- @ 構文（簡略表記）
SELECT * FROM my_catalog.my_schema.sales@v3

Sample Question

Delta Lake - VACUUM / DESCRIBE HISTORY

問題 1

After running OPTIMIZE on a Delta Table, you checked the version history with DESCRIBE HISTORY. You then ran VACUUM RETAIN 168 HOURS. When you run DESCRIBE HISTORY again, how does the VACUUM operation appear?

A new version with operation = 'VACUUM' is appended
VACUUM statistics are attached to the latest version's operationMetrics
VACUUM is not recorded in the transaction log, so DESCRIBE HISTORY output is unchanged
VACUUM advances the table to a new version, but the operation column shows 'WRITE'

正解: C

VACUUM is an operation that is not recorded in the Delta Lake transaction log. It doesn't change the table's data content — it only deletes unreferenced physical files — so no new version is created. As a result, VACUUM never shows up in DESCRIBE HISTORY output. OPTIMIZE and MERGE INTO, on the other hand, advance the version and are recorded in DESCRIBE HISTORY.

Frequently Asked Questions

Should I run OPTIMIZE or VACUUM first?

Run OPTIMIZE first. Once OPTIMIZE compacts small files, the original pre-compaction files become unreferenced. Running VACUUM afterward removes all the now-unneeded files in one pass. If you do it in the opposite order, the new unreferenced files produced by OPTIMIZE will still be sitting around. The general best-practice sequence is OPTIMIZE then VACUUM then ANALYZE TABLE (to refresh statistics).

What happens if I use Time Travel after running VACUUM?

Any Time Travel query that needs Parquet files VACUUM has already removed will fail with FileNotFoundException. For example, running VACUUM RETAIN 168 HOURS (the default) deletes data files for versions older than 7 days, making SELECT ... VERSION AS OF or RESTORE TABLE against those versions impossible. Always set the VACUUM retention period to match how far back you actually need Time Travel.

When should I use the WHEN NOT MATCHED BY SOURCE clause in MERGE INTO?

Use WHEN NOT MATCHED BY SOURCE to DELETE or UPDATE rows that exist in the target table but no longer exist in the source. Typical cases include SCD Type 2 scenarios where the source provides a full snapshot and you want to soft-delete rows missing from the source (set is_deleted = true), and full-refresh MERGE patterns where you physically delete records that are no longer in the source. Available in Spark 3.4 / Databricks Runtime 13.x and later.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.

Delta Lake Commands Reference: OPTIMIZE, VACUUM, MERGE Complete Guide

Delta Lake Core Commands at a Glance

OPTIMIZE

Z-ORDER vs Liquid Clustering

VACUUM

MERGE INTO

Behavior of Each Clause

DESCRIBE HISTORY

Output Columns

RESTORE TABLE

DESCRIBE DETAIL

Output Columns

Time Travel Queries

Sample Question

Frequently Asked Questions

Check what you learned with practice questions

Author

Related articles

Browse all Databricks articles (110)