On the Databricks certification exams (DEA, DEP, Spark Developer), precise knowledge of Spark SQL syntax is what separates passing from failing. In particular, the Data Engineer Associate exam's ELT with Spark SQL & Python (29%) domain tests your ability to read and write SQL queries directly. This article systematically organizes the syntax that shows up on the exams — CREATE TABLE, MERGE INTO, window functions, CTEs, Delta Lake operations, higher-order functions, PIVOT/UNPIVOT — with runnable SQL code.
The table below summarizes how often each Spark SQL construct appears on the Databricks certification exams. Use it to prioritize your study.
| Syntax | Frequency | Primary Exams | Question Patterns |
|---|---|---|---|
| MERGE INTO | ★★★★★ | DEA / DEP / Spark | Upsert syntax, SCD Type 2, conditional branching |
| Window functions | ★★★★★ | DEA / Spark / DAA | ROW_NUMBER deduplication, RANK comparison, LAG/LEAD diffs |
| OPTIMIZE / VACUUM | ★★★★☆ | DEA / DEP | Functional differences, ZORDER, RETAIN period |
| CTEs (WITH clause) | ★★★★☆ | DEA / Spark / DAA | Combining with window functions, chaining multiple CTEs |
| CREATE TABLE / CTAS | ★★★☆☆ | DEA / DEP | Managed vs external, schema inference |
| COPY INTO | ★★★☆☆ | DEA / DEP | Comparison with Auto Loader, idempotency |
| DESCRIBE HISTORY / RESTORE | ★★★☆☆ | DEA / DEP | Relationship to Time Travel, restore procedure |
| Higher-order functions | ★★☆☆☆ | DEA / Spark | Array manipulation with TRANSFORM/FILTER |
| PIVOT / UNPIVOT | ★★☆☆☆ | DAA / Spark | Understanding row/column transformation syntax |
There are three main CREATE patterns for Delta Lake tables. Be sure you understand the difference between managed and external tables, and whether DROP TABLE removes the underlying data.
A CREATE TABLE statement without LOCATION creates a managed table. Unity Catalog manages where the data lives, and DROP TABLE deletes both the metadata and the data files.
-- マネージドテーブル: スキーマ定義で作成
CREATE TABLE catalog_name.schema_name.orders (
order_id BIGINT,
customer_id BIGINT,
order_date DATE,
amount DECIMAL(10, 2),
status STRING
) USING DELTA
COMMENT 'Customer order transactions'
TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
);Providing an external storage path via the LOCATION clause creates an external table. DROP TABLE only removes the metadata; the data files remain in external storage.
-- 外部テーブル: データは指定パスに格納
CREATE TABLE catalog_name.schema_name.events (
event_id BIGINT,
event_type STRING,
event_ts TIMESTAMP,
payload STRING
) USING DELTA
LOCATION 's3://my-bucket/bronze/events';CREATE OR REPLACE TABLE (CORT) recreates an existing table by overwriting it. CTAS (CREATE TABLE AS SELECT) infers the schema from the SELECT result, so you do not need to declare column definitions.
-- CTAS: SELECTの結果をテーブル化(スキーマ自動推論)
CREATE OR REPLACE TABLE silver.daily_summary AS
SELECT
order_date,
COUNT(*) AS order_count,
SUM(amount) AS total_amount,
AVG(amount) AS avg_amount
FROM bronze.orders
WHERE status = 'completed'
GROUP BY order_date;-- セッションスコープ(セッション終了で消滅)
CREATE OR REPLACE TEMP VIEW v_active_users AS
SELECT * FROM users WHERE is_active = true;
-- クラスタースコープ(global_temp.で参照)
CREATE OR REPLACE GLOBAL TEMP VIEW v_summary AS
SELECT region, COUNT(*) AS cnt FROM sales GROUP BY region;
SELECT * FROM global_temp.v_summary;INSERT INTO appends to existing data. INSERT OVERWRITE deletes the existing data in the table (or partition) before writing. Note that in Delta Lake, behavior changes based on whether you specify a partition.
-- 追記: 既存データは保持される
INSERT INTO silver.orders
SELECT * FROM bronze.orders_raw
WHERE process_date = current_date();
-- 上書き: テーブル全体を置換
INSERT OVERWRITE silver.orders
SELECT * FROM bronze.orders_raw;
-- パーティション単位の上書き(動的パーティション上書き)
INSERT OVERWRITE silver.orders
PARTITION (order_date)
SELECT order_id, customer_id, amount, status, order_date
FROM bronze.orders_raw;COPY INTO loads data from external files (CSV/JSON/Parquet, etc.) into Delta Lake tables. It automatically tracks which files have already been processed, giving it idempotency that prevents the same file from being ingested twice. It is best suited for small batches of up to a few thousand files; for large-scale or continuous ingestion, Auto Loader is the recommended choice.
COPY INTO silver.transactions
FROM 's3://data-lake/landing/transactions/'
FILEFORMAT = JSON
FORMAT_OPTIONS (
'inferSchema' = 'true',
'mergeSchema' = 'true',
'dateFormat' = 'yyyy-MM-dd'
)
COPY_OPTIONS (
'mergeSchema' = 'true',
'force' = 'false' -- true にすると重複取り込み防止を無効化
);MERGE INTO matches a source and a target table on an ON condition, then runs INSERT, UPDATE, and DELETE in a single statement through three branches: WHEN MATCHED, WHEN NOT MATCHED, and WHEN NOT MATCHED BY SOURCE. It is the single most frequently tested SQL construct across the Databricks exams.
MERGE INTO gold.customers AS target
USING staging.customer_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
UPDATE SET
target.name = source.name,
target.email = source.email,
target.updated_at = current_timestamp()
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, created_at, updated_at)
VALUES (source.customer_id, source.name, source.email,
current_timestamp(), current_timestamp());MERGE INTO gold.products AS t
USING staging.product_feed AS s
ON t.product_id = s.product_id
-- 条件付きUPDATE: 更新日が新しい場合のみ
WHEN MATCHED AND s.updated_at > t.updated_at THEN
UPDATE SET *
-- ソースに存在しない行を論理削除
WHEN NOT MATCHED BY SOURCE THEN
UPDATE SET t.is_active = false, t.deleted_at = current_timestamp()
-- 新規行はINSERT
WHEN NOT MATCHED BY TARGET THEN
INSERT *;SCD Type 2 is a dimension-management pattern that preserves change history. You expire the existing record by setting its end_date, then INSERT a new version row.
MERGE INTO dim.customers AS t
USING (
SELECT s.*, true AS is_new
FROM staging.customers s
JOIN dim.customers d
ON s.customer_id = d.customer_id
AND d.is_current = true
AND (s.name <> d.name OR s.email <> d.email)
UNION ALL
SELECT s.*, true AS is_new
FROM staging.customers s
WHERE NOT EXISTS (
SELECT 1 FROM dim.customers d
WHERE d.customer_id = s.customer_id
)
) AS source
ON t.customer_id = source.customer_id AND t.is_current = true
WHEN MATCHED THEN
UPDATE SET t.is_current = false, t.end_date = current_date()
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, start_date, end_date, is_current)
VALUES (source.customer_id, source.name, source.email,
current_date(), NULL, true);OPTIMIZE is a compaction command that merges small files to improve read performance. Combined with ZORDER BY, it colocates data by the specified column values (placing rows with the same column values physically close together), which improves data-skipping efficiency on filter queries.
-- 基本Compaction
OPTIMIZE gold.sales;
-- ZORDERでデータをコロケーション
OPTIMIZE gold.sales ZORDER BY (region, product_id);
-- WHERE句で対象パーティションを限定
OPTIMIZE gold.sales
WHERE order_date >= '2026-01-01'
ZORDER BY (customer_id);Physically removes old data files that are no longer referenced by the current table version, reducing storage costs. Time Travel to the deleted versions becomes unavailable.
-- デフォルト: 168時間(7日)より古い不要ファイルを削除
VACUUM gold.sales;
-- 保持期間を明示的に指定
VACUUM gold.sales RETAIN 720 HOURS; -- 30日間
-- DRY RUN: 削除対象ファイルの一覧のみ表示(削除しない)
VACUUM gold.sales DRY RUN;| Comparison | OPTIMIZE | VACUUM |
|---|---|---|
| Purpose | Small-file compaction | Physical removal of unused files |
| Effect | Improved query read performance | Lower storage costs |
| Effect on data | No logical change (read results identical) | Time Travel to old versions becomes unavailable |
| Recommended timing | Run regularly after write pipelines | Run after OPTIMIZE |
-- トランザクション履歴(バージョン・操作・タイムスタンプ)
DESCRIBE HISTORY gold.sales;
DESCRIBE HISTORY gold.sales LIMIT 5;
-- テーブル詳細情報(ファイル数・サイズ・パーティション等)
DESCRIBE DETAIL gold.sales;-- 特定バージョンに復元(実際にデータを戻す)
RESTORE TABLE gold.sales TO VERSION AS OF 12;
-- タイムスタンプ指定で復元
RESTORE TABLE gold.sales TO TIMESTAMP AS OF '2026-03-20 10:00:00';
-- Time Travel: 過去バージョンを参照のみ(復元しない)
SELECT * FROM gold.sales VERSION AS OF 12;
SELECT * FROM gold.sales TIMESTAMP AS OF '2026-03-20';
-- Time Travel構文でのテーブル指定(@記法)
SELECT * FROM gold.sales@v12;Window functions perform aggregations, ranking, and lead/lag references without collapsing rows like GROUP BY does. The OVER clause specifies the partition and ordering.
The table below shows how each function's output differs on the same dataset (salary descending within department).
| name | dept | salary | ROW_NUMBER | RANK | DENSE_RANK |
|---|---|---|---|---|---|
| Alice | Sales | 90000 | 1 | 1 | 1 |
| Bob | Sales | 85000 | 2 | 2 | 2 |
| Carol | Sales | 85000 | 3 | 2 | 2 |
| Dave | Sales | 78000 | 4 | 4 | 3 |
When Bob and Carol both earn 85000: ROW_NUMBER assigns 2 and 3 in arbitrary order. RANK gives both 2 and the next value is 4 (skipping 3). DENSE_RANK gives both 2 and the next value is 3 (no skip).
-- 各顧客の最新注文のみ取得(CTE + ROW_NUMBERパターン)
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date DESC
) AS rn
FROM silver.orders
)
SELECT order_id, customer_id, order_date, amount
FROM ranked
WHERE rn = 1;-- 前月比の売上差分を計算
SELECT
month,
product_id,
revenue,
LAG(revenue, 1) OVER (PARTITION BY product_id ORDER BY month) AS prev_month,
LEAD(revenue, 1) OVER (PARTITION BY product_id ORDER BY month) AS next_month,
revenue - LAG(revenue, 1) OVER (PARTITION BY product_id ORDER BY month) AS mom_diff
FROM gold.monthly_sales;-- 累積売上と移動平均
SELECT
order_date,
daily_revenue,
SUM(daily_revenue) OVER (ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_revenue,
AVG(daily_revenue) OVER (ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg_7d
FROM gold.daily_sales;A CTE defines a named result set in a WITH clause that subsequent SELECTs can reference. Unlike subqueries, the same CTE can be referenced from multiple locations, which significantly improves query readability.
WITH active_customers AS (
SELECT customer_id, name, region
FROM gold.customers
WHERE is_active = true
),
recent_orders AS (
SELECT customer_id, SUM(amount) AS total_spent
FROM silver.orders
WHERE order_date >= add_months(current_date(), -3)
GROUP BY customer_id
),
customer_summary AS (
SELECT
ac.customer_id,
ac.name,
ac.region,
COALESCE(ro.total_spent, 0) AS total_spent_3m
FROM active_customers ac
LEFT JOIN recent_orders ro ON ac.customer_id = ro.customer_id
)
SELECT region, COUNT(*) AS customer_count, SUM(total_spent_3m) AS region_revenue
FROM customer_summary
GROUP BY region
ORDER BY region_revenue DESC;You cannot filter on a window-function alias directly in a WHERE clause (SQL evaluates WHERE before SELECT). The standard pattern is to compute the window function in a CTE and then filter in the outer query.
-- 各部門のトップ3給与者を取得
WITH salary_rank AS (
SELECT
employee_id, name, department,
salary,
DENSE_RANK() OVER (
PARTITION BY department ORDER BY salary DESC
) AS dr
FROM hr.employees
)
SELECT employee_id, name, department, salary
FROM salary_rank
WHERE dr <= 3
ORDER BY department, dr;Spark SQL's higher-order functions apply lambda expressions element-wise to array-typed columns. They let you express EXPLODE → aggregation → COLLECT_LIST chains in a single line, which is powerful for nested-structure ETL.
Applies a function to each element of an array, returning a new array.
-- 各要素を2倍に変換
SELECT TRANSFORM(ARRAY(1, 2, 3, 4), x -> x * 2);
-- 結果: [2, 4, 6, 8]
-- 注文の各商品に税率10%を加算
SELECT
order_id,
TRANSFORM(item_prices, price -> price * 1.10) AS prices_with_tax
FROM orders;Returns a new array containing only the elements that match the condition.
-- 偶数のみ抽出
SELECT FILTER(ARRAY(1, 2, 3, 4, 5, 6), x -> x % 2 = 0);
-- 結果: [2, 4, 6]
-- 1000円以上の商品のみ残す
SELECT
order_id,
FILTER(item_prices, p -> p >= 1000) AS high_value_items
FROM orders;Folds the array elements with an accumulator, collapsing them into a single value.
-- 配列要素の合計を計算
SELECT REDUCE(ARRAY(1, 2, 3, 4), 0, (acc, x) -> acc + x);
-- 結果: 10
-- 最大値を取得
SELECT REDUCE(
ARRAY(35, 12, 88, 47), 0,
(acc, x) -> CASE WHEN x > acc THEN x ELSE acc END
);
-- 結果: 88PIVOT turns rows into columns (row → column); UNPIVOT turns columns into rows (column → row). These are used for cross-tabulation reports and ELT data reshaping.
-- 四半期別の売上を列方向に展開
SELECT *
FROM (
SELECT region, quarter, revenue
FROM gold.quarterly_sales
)
PIVOT (
SUM(revenue) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
);
-- 結果: region | Q1 | Q2 | Q3 | Q4-- 列方向のデータを行方向に変換
SELECT *
FROM gold.quarterly_sales_wide
UNPIVOT (
revenue FOR quarter IN (Q1, Q2, Q3, Q4)
);
-- 結果: region | quarter | revenue(各四半期が1行ずつ)MERGE INTO - Conditional Branching
問題 1
Which of the following correctly describes this MERGE INTO statement? MERGE INTO gold.inventory AS t USING staging.stock_updates AS s ON t.sku = s.sku WHEN MATCHED AND s.quantity = 0 THEN DELETE WHEN MATCHED THEN UPDATE SET t.quantity = s.quantity, t.updated_at = current_timestamp() WHEN NOT MATCHED THEN INSERT (sku, quantity, created_at, updated_at) VALUES (s.sku, s.quantity, current_timestamp(), current_timestamp())
正解: B
MERGE INTO allows multiple WHEN MATCHED clauses, with the conditional clause (AND s.quantity = 0) coming first and the unconditional clause second. Matched rows are evaluated against the first condition (quantity = 0); rows that match it are DELETEd. Rows that do not match fall through to the next WHEN MATCHED clause and are UPDATEd. Unmatched source rows are INSERTed by WHEN NOT MATCHED. Option A is wrong — only quantity=0 rows are deleted, not all matches. Option C is wrong — Delta Lake supports multiple WHEN MATCHED clauses. Option D is wrong — WHEN NOT MATCHED is a standard syntax element.
Try Spark SQL questions right now
Practice questions on MERGE INTO, window functions, and Delta Lake commands
Try free questions →Can you write multiple WHEN MATCHED clauses in a MERGE INTO statement?
Yes. Databricks (Delta Lake) lets you write up to two WHEN MATCHED clauses. Typically the first is conditional (e.g. WHEN MATCHED AND source.updated_at > target.updated_at THEN UPDATE SET *) and the second is unconditional (e.g. WHEN MATCHED THEN DELETE). The unconditional clause must come last and must appear only once, so the conditional clauses have to be written first. Standard SQL only allows one WHEN MATCHED clause, but Delta Lake extends the syntax to allow multiple.
How do ROW_NUMBER, RANK, and DENSE_RANK differ in window functions?
ROW_NUMBER always assigns a unique sequential number even for ties, making it ideal for deduplication (e.g. getting the latest record per group). RANK assigns the same rank to ties but skips the next rank (1, 2, 2, 4) — like a sports leaderboard where ties cause gaps. DENSE_RANK also gives ties the same rank but does not skip (1, 2, 2, 3), which is useful for getting the top-N distinct values. The exam frequently asks you to distinguish the three outputs, so memorize how each behaves on a dataset with ties.
What happens to Time Travel when you run VACUUM?
Versions that point to data files deleted by VACUUM can no longer be accessed via Time Travel (VERSION AS OF / TIMESTAMP AS OF). The default RETAIN period is 168 hours (7 days), configurable through the delta.deletedFileRetentionDuration property. RETAIN 0 HOURS removes all old files immediately, but you must set spark.databricks.delta.retentionDurationCheck.enabled to false to avoid an error. In production, keep at least a 7-day retention window. The exam often tests the fact that Time Travel is unavailable after VACUUM.
Related Spark SQL Articles
Data Engineer Associate: Complete Guide
The exam where Spark SQL is the largest domain (29%)
Delta Lake Complete Guide
How ACID, Time Travel, and schema evolution work
Delta Lake Command Reference
Detailed coverage of OPTIMIZE, VACUUM, and MERGE
PySpark Complete Guide
When to use the DataFrame API vs Spark SQL
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
Databricks Certifications: All 7 Exams, Difficulty & Study Plan (2026)
Complete guide to all 7 Databricks certifications — Data Eng...
Databricks Exam Difficulty Ranking: All 7 Certs Compared (2026)
Every Databricks certification ranked by difficulty, with st...
Databricks Study Guide: Fastest Pass Route & Time Estimates (2026)
How to pass Databricks certifications efficiently. Official ...
Databricks Data Engineer Associate: Complete Guide (2026)
Domain-by-domain breakdown of the Databricks Certified Data ...
Databricks Data Engineer Professional: Complete Guide (2026)
Tactics for the Databricks Certified Data Engineer Professio...