The SQL commands you use to operate on Delta Lake are the single most heavily tested topic across the Databricks certifications. On the Data Engineer Associate exam, roughly 30-40% of the questions involve Delta Lake, and your understanding of the exact syntax and behavior of OPTIMIZE, VACUUM, MERGE INTO, DESCRIBE HISTORY, RESTORE, and DESCRIBE DETAIL often decides pass vs. fail. This article walks through each command's syntax, behavior, gotchas, and exam patterns in full detail.
| Command | Purpose | Data change | Metadata change |
|---|---|---|---|
OPTIMIZE | Compacts small files | Physical layout only | Creates a new version |
VACUUM | Physically deletes unreferenced files | None | None (not recorded in the log) |
MERGE INTO | Conditional upsert / CDC | INSERT/UPDATE/DELETE | Creates a new version |
DESCRIBE HISTORY | Inspects change history | None | None |
RESTORE TABLE | Restores to a previous version | Logical restore | Creates a new version |
DESCRIBE DETAIL | Inspects physical table metadata | None | None |
VACUUM is not recorded in the transaction log, so it never shows up in DESCRIBE HISTORY output. This is a classic exam gotcha.
OPTIMIZE is the compaction command that merges accumulated small Parquet files into larger ones. When Structured Streaming micro-batches or frequent INSERTs generate large numbers of small files, metadata overhead balloons and query performance degrades, so periodic OPTIMIZE runs are recommended.
-- 基本構文:テーブル全体を最適化
OPTIMIZE my_catalog.my_schema.sales
-- WHERE句でパーティションを絞り込み
OPTIMIZE my_catalog.my_schema.sales
WHERE region = 'APAC' AND order_date >= '2026-01-01'
-- Z-ORDERと組み合わせ(最大4カラム、推奨2〜3カラム)
OPTIMIZE my_catalog.my_schema.sales
ZORDER BY (customer_id, product_id)
-- Liquid Clustering対応テーブルではZORDER不要
-- OPTIMIZE実行時にCLUSTER BY設定に基づき自動配置
OPTIMIZE my_catalog.my_schema.eventsZ-ORDER lays out data along a space-filling curve based on the specified columns to maximize data skipping. But Z-ORDER is only applied when OPTIMIZE runs, so its effect fades with subsequent INSERTs. Liquid Clustering is a newer approach available in Databricks Runtime 13.3+ that automatically places data on write based on the columns specified in CLUSTER BY.
-- Liquid Clusteringテーブルの作成
CREATE TABLE events (
event_id BIGINT,
user_id BIGINT,
event_type STRING,
event_date DATE
)
CLUSTER BY (user_id, event_date)
-- 既存テーブルにLiquid Clusteringを設定
ALTER TABLE events CLUSTER BY (user_id, event_date)
-- Liquid Clusteringを解除
ALTER TABLE events CLUSTER BY NONEVACUUM physically deletes old Parquet files that are no longer referenced by the current table version. It cuts storage costs, but be aware that any Time Travel or RESTORE targeting versions tied to the deleted files will fail.
-- デフォルト保持期間(168時間 = 7日間)で実行
VACUUM my_catalog.my_schema.sales
-- 保持期間を30日に指定
VACUUM my_catalog.my_schema.sales RETAIN 720 HOURS
-- DRY RUN:削除対象のファイル一覧を確認(実際には削除しない)
VACUUM my_catalog.my_schema.sales DRY RUNThe default retention period is 168 hours (7 days). Specifying a shorter period would return an error because in-flight readers might still be holding references to those files. You can force a shorter window with the setting below, but it is not recommended in production:
-- 保持期間チェックの無効化(非推奨)
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM my_catalog.my_schema.sales RETAIN 0 HOURSMERGE INTO joins source data with a target Delta Table on a matching condition and performs UPDATE, INSERT, and DELETE as a single atomic operation based on match/no-match outcomes. It is the most important command at the heart of CDC pipelines and upsert patterns.
MERGE INTO gold.customers AS target
USING silver.customers_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.is_deleted = true THEN
DELETE
WHEN MATCHED THEN
UPDATE SET
target.name = source.name,
target.email = source.email,
target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, created_at, updated_at)
VALUES (source.customer_id, source.name, source.email,
source.created_at, source.updated_at)
WHEN NOT MATCHED BY SOURCE THEN
UPDATE SET target.is_active = false| Clause | Rows targeted | Allowed actions |
|---|---|---|
| WHEN MATCHED | Exists in both source and target | UPDATE / DELETE |
| WHEN NOT MATCHED | Exists in source, missing in target | INSERT only |
| WHEN NOT MATCHED BY SOURCE | Exists in target, missing in source | UPDATE / DELETE |
DESCRIBE HISTORY shows the change history recorded in a Delta Table's transaction log. You can inspect each version's operation, user, and parameters — useful for debugging and auditing.
-- 全履歴を表示
DESCRIBE HISTORY my_catalog.my_schema.sales
-- 直近10件の履歴を表示
DESCRIBE HISTORY my_catalog.my_schema.sales LIMIT 10| Column | Type | Description |
|---|---|---|
| version | BIGINT | Version number (0-indexed) |
| timestamp | TIMESTAMP | Commit timestamp |
| operation | STRING | WRITE / MERGE / DELETE / OPTIMIZE / RESTORE, etc. |
| operationParameters | MAP | Operation parameters such as mode, predicate, partitionBy |
| readVersion | BIGINT | Snapshot version that was read |
| isolationLevel | STRING | Transaction isolation level |
| operationMetrics | MAP | Stats such as numOutputRows and numOutputBytes |
| userIdentity | STRING | User who executed the operation |
RESTORE TABLE rolls a Delta Table back to a previous version or timestamp. Use it to undo mistaken DELETEs or UPDATEs. RESTORE creates a new version — it does not rewrite history; instead it appends the restored state as a new version.
-- バージョン番号で復元
RESTORE TABLE my_catalog.my_schema.sales TO VERSION AS OF 5
-- タイムスタンプで復元
RESTORE TABLE my_catalog.my_schema.sales
TO TIMESTAMP AS OF '2026-03-20 10:00:00'If the target version needs Parquet files that VACUUM has already deleted, RESTORE fails with FileNotFoundException. For important tables, set a longer VACUUM retention period.
DESCRIBE DETAIL shows a Delta Table's physical metadata (file location, size, partition info, etc.). Use it to diagnose table health.
DESCRIBE DETAIL my_catalog.my_schema.sales| Column | Description |
|---|---|
| format | Table format (always "delta") |
| id | Unique table identifier (UUID) |
| name | Table name in Unity Catalog |
| location | Path where data files are stored |
| numFiles | Number of files in the current version |
| sizeInBytes | Total size of the current version |
| partitionColumns | List of partition columns |
| clusteringColumns | List of Liquid Clustering columns |
| minReaderVersion | Minimum required reader version |
| minWriterVersion | Minimum required writer version |
numFiles being unusually high suggests you should run OPTIMIZE, while sizeInBytes being larger than expected may mean VACUUM has not been run. DESCRIBE DETAIL is a handy command for checking table health.
Time Travel lets you read past versions of a Delta Table. Unlike RESTORE, it does not change the table's current state.
-- VERSION AS OF 構文
SELECT * FROM my_catalog.my_schema.sales VERSION AS OF 3
-- TIMESTAMP AS OF 構文
SELECT * FROM my_catalog.my_schema.sales
TIMESTAMP AS OF '2026-03-20 10:00:00'
-- @ 構文(簡略表記)
SELECT * FROM my_catalog.my_schema.sales@v3Delta Lake - VACUUM / DESCRIBE HISTORY
問題 1
After running OPTIMIZE on a Delta Table, you checked the version history with DESCRIBE HISTORY. You then ran VACUUM RETAIN 168 HOURS. When you run DESCRIBE HISTORY again, how does the VACUUM operation appear?
正解: C
VACUUM is an operation that is not recorded in the Delta Lake transaction log. It doesn't change the table's data content — it only deletes unreferenced physical files — so no new version is created. As a result, VACUUM never shows up in DESCRIBE HISTORY output. OPTIMIZE and MERGE INTO, on the other hand, advance the version and are recorded in DESCRIBE HISTORY.
Should I run OPTIMIZE or VACUUM first?
Run OPTIMIZE first. Once OPTIMIZE compacts small files, the original pre-compaction files become unreferenced. Running VACUUM afterward removes all the now-unneeded files in one pass. If you do it in the opposite order, the new unreferenced files produced by OPTIMIZE will still be sitting around. The general best-practice sequence is OPTIMIZE then VACUUM then ANALYZE TABLE (to refresh statistics).
What happens if I use Time Travel after running VACUUM?
Any Time Travel query that needs Parquet files VACUUM has already removed will fail with FileNotFoundException. For example, running VACUUM RETAIN 168 HOURS (the default) deletes data files for versions older than 7 days, making SELECT ... VERSION AS OF or RESTORE TABLE against those versions impossible. Always set the VACUUM retention period to match how far back you actually need Time Travel.
When should I use the WHEN NOT MATCHED BY SOURCE clause in MERGE INTO?
Use WHEN NOT MATCHED BY SOURCE to DELETE or UPDATE rows that exist in the target table but no longer exist in the source. Typical cases include SCD Type 2 scenarios where the source provides a full snapshot and you want to soft-delete rows missing from the source (set is_deleted = true), and full-refresh MERGE patterns where you physically delete records that are no longer in the source. Available in Spark 3.4 / Databricks Runtime 13.x and later.
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
Databricks Certifications: All 7 Exams, Difficulty & Study Plan (2026)
Complete guide to all 7 Databricks certifications — Data Eng...
Databricks Exam Difficulty Ranking: All 7 Certs Compared (2026)
Every Databricks certification ranked by difficulty, with st...
Databricks Study Guide: Fastest Pass Route & Time Estimates (2026)
How to pass Databricks certifications efficiently. Official ...
Databricks Data Engineer Associate: Complete Guide (2026)
Domain-by-domain breakdown of the Databricks Certified Data ...
Databricks Data Engineer Professional: Complete Guide (2026)
Tactics for the Databricks Certified Data Engineer Professio...