Databricks

Delta Lake Commands Reference: OPTIMIZE, VACUUM, MERGE Complete Guide

2026-03-21
更新: 2026-03-27
NicheeLab Editorial Team

The SQL commands you use to operate on Delta Lake are the single most heavily tested topic across the Databricks certifications. On the Data Engineer Associate exam, roughly 30-40% of the questions involve Delta Lake, and your understanding of the exact syntax and behavior of OPTIMIZE, VACUUM, MERGE INTO, DESCRIBE HISTORY, RESTORE, and DESCRIBE DETAIL often decides pass vs. fail. This article walks through each command's syntax, behavior, gotchas, and exam patterns in full detail.

Delta Lake Core Commands at a Glance

CommandPurposeData changeMetadata change
OPTIMIZECompacts small filesPhysical layout onlyCreates a new version
VACUUMPhysically deletes unreferenced filesNoneNone (not recorded in the log)
MERGE INTOConditional upsert / CDCINSERT/UPDATE/DELETECreates a new version
DESCRIBE HISTORYInspects change historyNoneNone
RESTORE TABLERestores to a previous versionLogical restoreCreates a new version
DESCRIBE DETAILInspects physical table metadataNoneNone

VACUUM is not recorded in the transaction log, so it never shows up in DESCRIBE HISTORY output. This is a classic exam gotcha.

OPTIMIZE

OPTIMIZE is the compaction command that merges accumulated small Parquet files into larger ones. When Structured Streaming micro-batches or frequent INSERTs generate large numbers of small files, metadata overhead balloons and query performance degrades, so periodic OPTIMIZE runs are recommended.

-- 基本構文:テーブル全体を最適化
OPTIMIZE my_catalog.my_schema.sales

-- WHERE句でパーティションを絞り込み
OPTIMIZE my_catalog.my_schema.sales
  WHERE region = 'APAC' AND order_date >= '2026-01-01'

-- Z-ORDERと組み合わせ(最大4カラム、推奨2〜3カラム)
OPTIMIZE my_catalog.my_schema.sales
  ZORDER BY (customer_id, product_id)

-- Liquid Clustering対応テーブルではZORDER不要
-- OPTIMIZE実行時にCLUSTER BY設定に基づき自動配置
OPTIMIZE my_catalog.my_schema.events

Z-ORDER vs Liquid Clustering

Z-ORDER lays out data along a space-filling curve based on the specified columns to maximize data skipping. But Z-ORDER is only applied when OPTIMIZE runs, so its effect fades with subsequent INSERTs. Liquid Clustering is a newer approach available in Databricks Runtime 13.3+ that automatically places data on write based on the columns specified in CLUSTER BY.

-- Liquid Clusteringテーブルの作成
CREATE TABLE events (
  event_id BIGINT,
  user_id BIGINT,
  event_type STRING,
  event_date DATE
)
CLUSTER BY (user_id, event_date)

-- 既存テーブルにLiquid Clusteringを設定
ALTER TABLE events CLUSTER BY (user_id, event_date)

-- Liquid Clusteringを解除
ALTER TABLE events CLUSTER BY NONE

VACUUM

VACUUM physically deletes old Parquet files that are no longer referenced by the current table version. It cuts storage costs, but be aware that any Time Travel or RESTORE targeting versions tied to the deleted files will fail.

-- デフォルト保持期間(168時間 = 7日間)で実行
VACUUM my_catalog.my_schema.sales

-- 保持期間を30日に指定
VACUUM my_catalog.my_schema.sales RETAIN 720 HOURS

-- DRY RUN:削除対象のファイル一覧を確認(実際には削除しない)
VACUUM my_catalog.my_schema.sales DRY RUN

The default retention period is 168 hours (7 days). Specifying a shorter period would return an error because in-flight readers might still be holding references to those files. You can force a shorter window with the setting below, but it is not recommended in production:

-- 保持期間チェックの無効化(非推奨)
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM my_catalog.my_schema.sales RETAIN 0 HOURS

MERGE INTO

MERGE INTO joins source data with a target Delta Table on a matching condition and performs UPDATE, INSERT, and DELETE as a single atomic operation based on match/no-match outcomes. It is the most important command at the heart of CDC pipelines and upsert patterns.

MERGE INTO gold.customers AS target
USING silver.customers_updates AS source
ON target.customer_id = source.customer_id

WHEN MATCHED AND source.is_deleted = true THEN
  DELETE

WHEN MATCHED THEN
  UPDATE SET
    target.name       = source.name,
    target.email      = source.email,
    target.updated_at = source.updated_at

WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, created_at, updated_at)
  VALUES (source.customer_id, source.name, source.email,
          source.created_at, source.updated_at)

WHEN NOT MATCHED BY SOURCE THEN
  UPDATE SET target.is_active = false

Behavior of Each Clause

ClauseRows targetedAllowed actions
WHEN MATCHEDExists in both source and targetUPDATE / DELETE
WHEN NOT MATCHEDExists in source, missing in targetINSERT only
WHEN NOT MATCHED BY SOURCEExists in target, missing in sourceUPDATE / DELETE

DESCRIBE HISTORY

DESCRIBE HISTORY shows the change history recorded in a Delta Table's transaction log. You can inspect each version's operation, user, and parameters — useful for debugging and auditing.

-- 全履歴を表示
DESCRIBE HISTORY my_catalog.my_schema.sales

-- 直近10件の履歴を表示
DESCRIBE HISTORY my_catalog.my_schema.sales LIMIT 10

Output Columns

ColumnTypeDescription
versionBIGINTVersion number (0-indexed)
timestampTIMESTAMPCommit timestamp
operationSTRINGWRITE / MERGE / DELETE / OPTIMIZE / RESTORE, etc.
operationParametersMAPOperation parameters such as mode, predicate, partitionBy
readVersionBIGINTSnapshot version that was read
isolationLevelSTRINGTransaction isolation level
operationMetricsMAPStats such as numOutputRows and numOutputBytes
userIdentitySTRINGUser who executed the operation

RESTORE TABLE

RESTORE TABLE rolls a Delta Table back to a previous version or timestamp. Use it to undo mistaken DELETEs or UPDATEs. RESTORE creates a new version — it does not rewrite history; instead it appends the restored state as a new version.

-- バージョン番号で復元
RESTORE TABLE my_catalog.my_schema.sales TO VERSION AS OF 5

-- タイムスタンプで復元
RESTORE TABLE my_catalog.my_schema.sales
  TO TIMESTAMP AS OF '2026-03-20 10:00:00'

If the target version needs Parquet files that VACUUM has already deleted, RESTORE fails with FileNotFoundException. For important tables, set a longer VACUUM retention period.

DESCRIBE DETAIL

DESCRIBE DETAIL shows a Delta Table's physical metadata (file location, size, partition info, etc.). Use it to diagnose table health.

DESCRIBE DETAIL my_catalog.my_schema.sales

Output Columns

ColumnDescription
formatTable format (always "delta")
idUnique table identifier (UUID)
nameTable name in Unity Catalog
locationPath where data files are stored
numFilesNumber of files in the current version
sizeInBytesTotal size of the current version
partitionColumnsList of partition columns
clusteringColumnsList of Liquid Clustering columns
minReaderVersionMinimum required reader version
minWriterVersionMinimum required writer version

numFiles being unusually high suggests you should run OPTIMIZE, while sizeInBytes being larger than expected may mean VACUUM has not been run. DESCRIBE DETAIL is a handy command for checking table health.

Time Travel Queries

Time Travel lets you read past versions of a Delta Table. Unlike RESTORE, it does not change the table's current state.

-- VERSION AS OF 構文
SELECT * FROM my_catalog.my_schema.sales VERSION AS OF 3

-- TIMESTAMP AS OF 構文
SELECT * FROM my_catalog.my_schema.sales
  TIMESTAMP AS OF '2026-03-20 10:00:00'

-- @ 構文(簡略表記)
SELECT * FROM my_catalog.my_schema.sales@v3

Sample Question

Delta Lake - VACUUM / DESCRIBE HISTORY

問題 1

After running OPTIMIZE on a Delta Table, you checked the version history with DESCRIBE HISTORY. You then ran VACUUM RETAIN 168 HOURS. When you run DESCRIBE HISTORY again, how does the VACUUM operation appear?

  1. A new version with operation = 'VACUUM' is appended
  2. VACUUM statistics are attached to the latest version's operationMetrics
  3. VACUUM is not recorded in the transaction log, so DESCRIBE HISTORY output is unchanged
  4. VACUUM advances the table to a new version, but the operation column shows 'WRITE'

正解: C

VACUUM is an operation that is not recorded in the Delta Lake transaction log. It doesn't change the table's data content — it only deletes unreferenced physical files — so no new version is created. As a result, VACUUM never shows up in DESCRIBE HISTORY output. OPTIMIZE and MERGE INTO, on the other hand, advance the version and are recorded in DESCRIBE HISTORY.

Frequently Asked Questions

Should I run OPTIMIZE or VACUUM first?

Run OPTIMIZE first. Once OPTIMIZE compacts small files, the original pre-compaction files become unreferenced. Running VACUUM afterward removes all the now-unneeded files in one pass. If you do it in the opposite order, the new unreferenced files produced by OPTIMIZE will still be sitting around. The general best-practice sequence is OPTIMIZE then VACUUM then ANALYZE TABLE (to refresh statistics).

What happens if I use Time Travel after running VACUUM?

Any Time Travel query that needs Parquet files VACUUM has already removed will fail with FileNotFoundException. For example, running VACUUM RETAIN 168 HOURS (the default) deletes data files for versions older than 7 days, making SELECT ... VERSION AS OF or RESTORE TABLE against those versions impossible. Always set the VACUUM retention period to match how far back you actually need Time Travel.

When should I use the WHEN NOT MATCHED BY SOURCE clause in MERGE INTO?

Use WHEN NOT MATCHED BY SOURCE to DELETE or UPDATE rows that exist in the target table but no longer exist in the source. Typical cases include SCD Type 2 scenarios where the source provides a full snapshot and you want to soft-delete rows missing from the source (set is_deleted = true), and full-refresh MERGE patterns where you physically delete records that are no longer in the source. Available in Spark 3.4 / Databricks Runtime 13.x and later.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる
Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


Related articles
Databricks

Databricks Certifications: All 7 Exams, Difficulty & Study Plan (2026)

Complete guide to all 7 Databricks certifications — Data Eng...

Databricks

Databricks Exam Difficulty Ranking: All 7 Certs Compared (2026)

Every Databricks certification ranked by difficulty, with st...

Databricks

Databricks Study Guide: Fastest Pass Route & Time Estimates (2026)

How to pass Databricks certifications efficiently. Official ...

Databricks

Databricks Data Engineer Associate: Complete Guide (2026)

Domain-by-domain breakdown of the Databricks Certified Data ...

Databricks

Databricks Data Engineer Professional: Complete Guide (2026)

Tactics for the Databricks Certified Data Engineer Professio...

Browse all Databricks articles (110)
© 2026 NicheeLab All rights reserved.