Spark SQL Guide: Must-Know Syntax for Databricks Certs (2026)

On the Databricks certification exams (DEA, DEP, Spark Developer), precise knowledge of Spark SQL syntax is what separates passing from failing. In particular, the Data Engineer Associate exam's ELT with Spark SQL & Python (29%) domain tests your ability to read and write SQL queries directly. This article systematically organizes the syntax that shows up on the exams — CREATE TABLE, MERGE INTO, window functions, CTEs, Delta Lake operations, higher-order functions, PIVOT/UNPIVOT — with runnable SQL code.

Exam Frequency Map

The table below summarizes how often each Spark SQL construct appears on the Databricks certification exams. Use it to prioritize your study.

Syntax	Frequency	Primary Exams	Question Patterns
MERGE INTO	★★★★★	DEA / DEP / Spark	Upsert syntax, SCD Type 2, conditional branching
Window functions	★★★★★	DEA / Spark / DAA	ROW_NUMBER deduplication, RANK comparison, LAG/LEAD diffs
OPTIMIZE / VACUUM	★★★★☆	DEA / DEP	Functional differences, ZORDER, RETAIN period
CTEs (WITH clause)	★★★★☆	DEA / Spark / DAA	Combining with window functions, chaining multiple CTEs
CREATE TABLE / CTAS	★★★☆☆	DEA / DEP	Managed vs external, schema inference
COPY INTO	★★★☆☆	DEA / DEP	Comparison with Auto Loader, idempotency
DESCRIBE HISTORY / RESTORE	★★★☆☆	DEA / DEP	Relationship to Time Travel, restore procedure
Higher-order functions	★★☆☆☆	DEA / Spark	Array manipulation with TRANSFORM/FILTER
PIVOT / UNPIVOT	★★☆☆☆	DAA / Spark	Understanding row/column transformation syntax

Table Operations

There are three main CREATE patterns for Delta Lake tables. Be sure you understand the difference between managed and external tables, and whether DROP TABLE removes the underlying data.

CREATE TABLE USING DELTA (Managed Table)

A CREATE TABLE statement without LOCATION creates a managed table. Unity Catalog manages where the data lives, and DROP TABLE deletes both the metadata and the data files.

-- マネージドテーブル: スキーマ定義で作成
CREATE TABLE catalog_name.schema_name.orders (
  order_id   BIGINT,
  customer_id BIGINT,
  order_date DATE,
  amount     DECIMAL(10, 2),
  status     STRING
) USING DELTA
COMMENT 'Customer order transactions'
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact'   = 'true'
);

CREATE TABLE ... LOCATION (External Table)

Providing an external storage path via the LOCATION clause creates an external table. DROP TABLE only removes the metadata; the data files remain in external storage.

-- 外部テーブル: データは指定パスに格納
CREATE TABLE catalog_name.schema_name.events (
  event_id   BIGINT,
  event_type STRING,
  event_ts   TIMESTAMP,
  payload    STRING
) USING DELTA
LOCATION 's3://my-bucket/bronze/events';

CREATE OR REPLACE TABLE & CTAS

CREATE OR REPLACE TABLE (CORT) recreates an existing table by overwriting it. CTAS (CREATE TABLE AS SELECT) infers the schema from the SELECT result, so you do not need to declare column definitions.

-- CTAS: SELECTの結果をテーブル化（スキーマ自動推論）
CREATE OR REPLACE TABLE silver.daily_summary AS
SELECT
  order_date,
  COUNT(*)          AS order_count,
  SUM(amount)       AS total_amount,
  AVG(amount)       AS avg_amount
FROM bronze.orders
WHERE status = 'completed'
GROUP BY order_date;

Temp View vs Global Temp View

-- セッションスコープ（セッション終了で消滅）
CREATE OR REPLACE TEMP VIEW v_active_users AS
SELECT * FROM users WHERE is_active = true;

-- クラスタースコープ（global_temp.で参照）
CREATE OR REPLACE GLOBAL TEMP VIEW v_summary AS
SELECT region, COUNT(*) AS cnt FROM sales GROUP BY region;

SELECT * FROM global_temp.v_summary;

Data Manipulation (INSERT / COPY INTO)

INSERT INTO vs INSERT OVERWRITE

INSERT INTO appends to existing data. INSERT OVERWRITE deletes the existing data in the table (or partition) before writing. Note that in Delta Lake, behavior changes based on whether you specify a partition.

-- 追記: 既存データは保持される
INSERT INTO silver.orders
SELECT * FROM bronze.orders_raw
WHERE process_date = current_date();

-- 上書き: テーブル全体を置換
INSERT OVERWRITE silver.orders
SELECT * FROM bronze.orders_raw;

-- パーティション単位の上書き（動的パーティション上書き）
INSERT OVERWRITE silver.orders
PARTITION (order_date)
SELECT order_id, customer_id, amount, status, order_date
FROM bronze.orders_raw;

COPY INTO

COPY INTO loads data from external files (CSV/JSON/Parquet, etc.) into Delta Lake tables. It automatically tracks which files have already been processed, giving it idempotency that prevents the same file from being ingested twice. It is best suited for small batches of up to a few thousand files; for large-scale or continuous ingestion, Auto Loader is the recommended choice.

COPY INTO silver.transactions
FROM 's3://data-lake/landing/transactions/'
FILEFORMAT = JSON
FORMAT_OPTIONS (
  'inferSchema'  = 'true',
  'mergeSchema'  = 'true',
  'dateFormat'   = 'yyyy-MM-dd'
)
COPY_OPTIONS (
  'mergeSchema' = 'true',
  'force'       = 'false'   -- true にすると重複取り込み防止を無効化
);

MERGE INTO (Most Frequently Tested)

MERGE INTO matches a source and a target table on an ON condition, then runs INSERT, UPDATE, and DELETE in a single statement through three branches: WHEN MATCHED, WHEN NOT MATCHED, and WHEN NOT MATCHED BY SOURCE. It is the single most frequently tested SQL construct across the Databricks exams.

Basic Upsert Pattern

MERGE INTO gold.customers AS target
USING staging.customer_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET
    target.name       = source.name,
    target.email      = source.email,
    target.updated_at = current_timestamp()
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, created_at, updated_at)
  VALUES (source.customer_id, source.name, source.email,
          current_timestamp(), current_timestamp());

Conditional MERGE & DELETE Branches

MERGE INTO gold.products AS t
USING staging.product_feed AS s
ON t.product_id = s.product_id
-- 条件付きUPDATE: 更新日が新しい場合のみ
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET *
-- ソースに存在しない行を論理削除
WHEN NOT MATCHED BY SOURCE THEN
  UPDATE SET t.is_active = false, t.deleted_at = current_timestamp()
-- 新規行はINSERT
WHEN NOT MATCHED BY TARGET THEN
  INSERT *;

Implementing SCD Type 2

SCD Type 2 is a dimension-management pattern that preserves change history. You expire the existing record by setting its end_date, then INSERT a new version row.

MERGE INTO dim.customers AS t
USING (
  SELECT s.*, true AS is_new
  FROM staging.customers s
  JOIN dim.customers d
    ON s.customer_id = d.customer_id
   AND d.is_current = true
   AND (s.name <> d.name OR s.email <> d.email)
  UNION ALL
  SELECT s.*, true AS is_new
  FROM staging.customers s
  WHERE NOT EXISTS (
    SELECT 1 FROM dim.customers d
    WHERE d.customer_id = s.customer_id
  )
) AS source
ON t.customer_id = source.customer_id AND t.is_current = true
WHEN MATCHED THEN
  UPDATE SET t.is_current = false, t.end_date = current_date()
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, start_date, end_date, is_current)
  VALUES (source.customer_id, source.name, source.email,
          current_date(), NULL, true);

Delta Lake Operational Commands

OPTIMIZE & ZORDER

OPTIMIZE is a compaction command that merges small files to improve read performance. Combined with ZORDER BY, it colocates data by the specified column values (placing rows with the same column values physically close together), which improves data-skipping efficiency on filter queries.

-- 基本Compaction
OPTIMIZE gold.sales;

-- ZORDERでデータをコロケーション
OPTIMIZE gold.sales ZORDER BY (region, product_id);

-- WHERE句で対象パーティションを限定
OPTIMIZE gold.sales
WHERE order_date >= '2026-01-01'
ZORDER BY (customer_id);

VACUUM

Physically removes old data files that are no longer referenced by the current table version, reducing storage costs. Time Travel to the deleted versions becomes unavailable.

-- デフォルト: 168時間（7日）より古い不要ファイルを削除
VACUUM gold.sales;

-- 保持期間を明示的に指定
VACUUM gold.sales RETAIN 720 HOURS;  -- 30日間

-- DRY RUN: 削除対象ファイルの一覧のみ表示（削除しない）
VACUUM gold.sales DRY RUN;

Comparison	OPTIMIZE	VACUUM
Purpose	Small-file compaction	Physical removal of unused files
Effect	Improved query read performance	Lower storage costs
Effect on data	No logical change (read results identical)	Time Travel to old versions becomes unavailable
Recommended timing	Run regularly after write pipelines	Run after OPTIMIZE

DESCRIBE HISTORY & DESCRIBE DETAIL

-- トランザクション履歴（バージョン・操作・タイムスタンプ）
DESCRIBE HISTORY gold.sales;
DESCRIBE HISTORY gold.sales LIMIT 5;

-- テーブル詳細情報（ファイル数・サイズ・パーティション等）
DESCRIBE DETAIL gold.sales;

RESTORE TABLE & Time Travel

-- 特定バージョンに復元（実際にデータを戻す）
RESTORE TABLE gold.sales TO VERSION AS OF 12;

-- タイムスタンプ指定で復元
RESTORE TABLE gold.sales TO TIMESTAMP AS OF '2026-03-20 10:00:00';

-- Time Travel: 過去バージョンを参照のみ（復元しない）
SELECT * FROM gold.sales VERSION AS OF 12;
SELECT * FROM gold.sales TIMESTAMP AS OF '2026-03-20';

-- Time Travel構文でのテーブル指定（@記法）
SELECT * FROM gold.sales@v12;

Window Functions

Window functions perform aggregations, ranking, and lead/lag references without collapsing rows like GROUP BY does. The OVER clause specifies the partition and ordering.

Comparing Ranking Functions

The table below shows how each function's output differs on the same dataset (salary descending within department).

name	dept	salary	ROW_NUMBER	RANK	DENSE_RANK
Alice	Sales	90000	1	1	1
Bob	Sales	85000	2	2	2
Carol	Sales	85000	3	2	2
Dave	Sales	78000	4	4	3

When Bob and Carol both earn 85000: ROW_NUMBER assigns 2 and 3 in arbitrary order. RANK gives both 2 and the next value is 4 (skipping 3). DENSE_RANK gives both 2 and the next value is 3 (no skip).

Deduplication with ROW_NUMBER

-- 各顧客の最新注文のみ取得（CTE + ROW_NUMBERパターン）
WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY order_date DESC
    ) AS rn
  FROM silver.orders
)
SELECT order_id, customer_id, order_date, amount
FROM ranked
WHERE rn = 1;

LAG / LEAD (Previous/Next Row References)

-- 前月比の売上差分を計算
SELECT
  month,
  product_id,
  revenue,
  LAG(revenue, 1)  OVER (PARTITION BY product_id ORDER BY month) AS prev_month,
  LEAD(revenue, 1) OVER (PARTITION BY product_id ORDER BY month) AS next_month,
  revenue - LAG(revenue, 1) OVER (PARTITION BY product_id ORDER BY month) AS mom_diff
FROM gold.monthly_sales;

Running Aggregates (SUM / AVG OVER)

-- 累積売上と移動平均
SELECT
  order_date,
  daily_revenue,
  SUM(daily_revenue) OVER (ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_revenue,
  AVG(daily_revenue) OVER (ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg_7d
FROM gold.daily_sales;

CTE (Common Table Expressions)

A CTE defines a named result set in a WITH clause that subsequent SELECTs can reference. Unlike subqueries, the same CTE can be referenced from multiple locations, which significantly improves query readability.

Chaining Multiple CTEs

WITH active_customers AS (
  SELECT customer_id, name, region
  FROM gold.customers
  WHERE is_active = true
),
recent_orders AS (
  SELECT customer_id, SUM(amount) AS total_spent
  FROM silver.orders
  WHERE order_date >= add_months(current_date(), -3)
  GROUP BY customer_id
),
customer_summary AS (
  SELECT
    ac.customer_id,
    ac.name,
    ac.region,
    COALESCE(ro.total_spent, 0) AS total_spent_3m
  FROM active_customers ac
  LEFT JOIN recent_orders ro ON ac.customer_id = ro.customer_id
)
SELECT region, COUNT(*) AS customer_count, SUM(total_spent_3m) AS region_revenue
FROM customer_summary
GROUP BY region
ORDER BY region_revenue DESC;

Window Function + CTE: A Common Exam Pattern

You cannot filter on a window-function alias directly in a WHERE clause (SQL evaluates WHERE before SELECT). The standard pattern is to compute the window function in a CTE and then filter in the outer query.

-- 各部門のトップ3給与者を取得
WITH salary_rank AS (
  SELECT
    employee_id, name, department,
    salary,
    DENSE_RANK() OVER (
      PARTITION BY department ORDER BY salary DESC
    ) AS dr
  FROM hr.employees
)
SELECT employee_id, name, department, salary
FROM salary_rank
WHERE dr <= 3
ORDER BY department, dr;

Higher-Order Functions (TRANSFORM / FILTER / REDUCE)

Spark SQL's higher-order functions apply lambda expressions element-wise to array-typed columns. They let you express EXPLODE → aggregation → COLLECT_LIST chains in a single line, which is powerful for nested-structure ETL.

TRANSFORM

Applies a function to each element of an array, returning a new array.

-- 各要素を2倍に変換
SELECT TRANSFORM(ARRAY(1, 2, 3, 4), x -> x * 2);
-- 結果: [2, 4, 6, 8]

-- 注文の各商品に税率10%を加算
SELECT
  order_id,
  TRANSFORM(item_prices, price -> price * 1.10) AS prices_with_tax
FROM orders;

FILTER

Returns a new array containing only the elements that match the condition.

-- 偶数のみ抽出
SELECT FILTER(ARRAY(1, 2, 3, 4, 5, 6), x -> x % 2 = 0);
-- 結果: [2, 4, 6]

-- 1000円以上の商品のみ残す
SELECT
  order_id,
  FILTER(item_prices, p -> p >= 1000) AS high_value_items
FROM orders;

REDUCE

Folds the array elements with an accumulator, collapsing them into a single value.

-- 配列要素の合計を計算
SELECT REDUCE(ARRAY(1, 2, 3, 4), 0, (acc, x) -> acc + x);
-- 結果: 10

-- 最大値を取得
SELECT REDUCE(
  ARRAY(35, 12, 88, 47), 0,
  (acc, x) -> CASE WHEN x > acc THEN x ELSE acc END
);
-- 結果: 88

PIVOT / UNPIVOT

PIVOT turns rows into columns (row → column); UNPIVOT turns columns into rows (column → row). These are used for cross-tabulation reports and ELT data reshaping.

PIVOT

-- 四半期別の売上を列方向に展開
SELECT *
FROM (
  SELECT region, quarter, revenue
  FROM gold.quarterly_sales
)
PIVOT (
  SUM(revenue) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
);
-- 結果: region | Q1 | Q2 | Q3 | Q4

UNPIVOT

-- 列方向のデータを行方向に変換
SELECT *
FROM gold.quarterly_sales_wide
UNPIVOT (
  revenue FOR quarter IN (Q1, Q2, Q3, Q4)
);
-- 結果: region | quarter | revenue（各四半期が1行ずつ）

Practical Exam Tips

MERGE INTO syntax: Memorize the three branches — WHEN MATCHED, WHEN NOT MATCHED, and WHEN NOT MATCHED BY SOURCE. The UPDATE SET * and INSERT * wildcard forms are also frequently tested.
The window-function WHERE pitfall: "You need a CTE to filter on a ROW_NUMBER() alias" is asked again and again. Make sure you understand why you can't write WHERE rn = 1 directly (SQL evaluation order).
OPTIMIZE vs VACUUM: OPTIMIZE = "merge files → improve read performance"; VACUUM = "delete unused files → save storage." The side effect of VACUUM is that Time Travel to old versions stops working.
INSERT OVERWRITE vs INSERT INTO: OVERWRITE replaces existing data; INTO appends. The scope changes depending on whether you specify a partition.
COPY INTO idempotency: Running the same file twice does not re-ingest it. Watch out for force = true, which disables this idempotency — a frequently tested gotcha.
RESTORE vs Time Travel: RESTORE actually writes the table back to a previous version, while Time Travel (VERSION AS OF) only reads from a previous version.

Sample Question

MERGE INTO - Conditional Branching

問題 1

Which of the following correctly describes this MERGE INTO statement? MERGE INTO gold.inventory AS t USING staging.stock_updates AS s ON t.sku = s.sku WHEN MATCHED AND s.quantity = 0 THEN DELETE WHEN MATCHED THEN UPDATE SET t.quantity = s.quantity, t.updated_at = current_timestamp() WHEN NOT MATCHED THEN INSERT (sku, quantity, created_at, updated_at) VALUES (s.sku, s.quantity, current_timestamp(), current_timestamp())

All matching rows are DELETEd first, then UPDATE runs
Rows where sku matches and source quantity = 0 are DELETEd, other matching rows are UPDATEd, and unmatched rows are INSERTed
Only one WHEN MATCHED clause is allowed, so this is a syntax error
WHEN NOT MATCHED clauses are not permitted in a MERGE INTO statement

正解: B

MERGE INTO allows multiple WHEN MATCHED clauses, with the conditional clause (AND s.quantity = 0) coming first and the unconditional clause second. Matched rows are evaluated against the first condition (quantity = 0); rows that match it are DELETEd. Rows that do not match fall through to the next WHEN MATCHED clause and are UPDATEd. Unmatched source rows are INSERTed by WHEN NOT MATCHED. Option A is wrong — only quantity=0 rows are deleted, not all matches. Option C is wrong — Delta Lake supports multiple WHEN MATCHED clauses. Option D is wrong — WHEN NOT MATCHED is a standard syntax element.

Try Spark SQL questions right now

Practice questions on MERGE INTO, window functions, and Delta Lake commands

Try free questions →

Frequently Asked Questions

Can you write multiple WHEN MATCHED clauses in a MERGE INTO statement?

Yes. Databricks (Delta Lake) lets you write up to two WHEN MATCHED clauses. Typically the first is conditional (e.g. WHEN MATCHED AND source.updated_at > target.updated_at THEN UPDATE SET *) and the second is unconditional (e.g. WHEN MATCHED THEN DELETE). The unconditional clause must come last and must appear only once, so the conditional clauses have to be written first. Standard SQL only allows one WHEN MATCHED clause, but Delta Lake extends the syntax to allow multiple.

How do ROW_NUMBER, RANK, and DENSE_RANK differ in window functions?

ROW_NUMBER always assigns a unique sequential number even for ties, making it ideal for deduplication (e.g. getting the latest record per group). RANK assigns the same rank to ties but skips the next rank (1, 2, 2, 4) — like a sports leaderboard where ties cause gaps. DENSE_RANK also gives ties the same rank but does not skip (1, 2, 2, 3), which is useful for getting the top-N distinct values. The exam frequently asks you to distinguish the three outputs, so memorize how each behaves on a dataset with ties.

What happens to Time Travel when you run VACUUM?

Versions that point to data files deleted by VACUUM can no longer be accessed via Time Travel (VERSION AS OF / TIMESTAMP AS OF). The default RETAIN period is 168 hours (7 days), configurable through the delta.deletedFileRetentionDuration property. RETAIN 0 HOURS removes all old files immediately, but you must set spark.databricks.delta.retentionDurationCheck.enabled to false to avoid an error. In production, keep at least a 7-day retention window. The exam often tests the fact that Time Travel is unavailable after VACUUM.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.

Spark SQL Complete Guide: Essential SQL Syntax for the Databricks Exams