Databricks Data Engineer Question Bank: Practice by Domain with Explanations

2026-03-21

Updated: 2026-03-27

NicheeLab Editorial Team

The Databricks Data Engineer exams (Associate / Professional) test a broad set of skills: Delta Lake, Spark SQL, ETL pipelines, and data governance. This article shows the weight of each exam domain, then walks through representative practice questions for the most common patterns, explaining the level of knowledge expected and where to focus when answering.

DEA (Data Engineer Associate) Exam Domains and Weights

DEA is a 45-question, 120-minute exam covering the 5 domains below. The March 2026 revision increased the weight of Unity Catalog topics and removed the legacy Hive Metastore questions from scope.

Domain	Weight	Key Topics
ETL with Spark SQL & Python	~30%	DataFrame read/write, Spark SQL syntax, data transformation, UDFs
Delta Lake	~25%	MERGE, VACUUM, Time Travel, Liquid Clustering, CDC
Incremental Data Processing	~20%	Auto Loader, Structured Streaming, COPY INTO
Production Pipelines	~15%	DLT Expectations, LakeFlow Jobs, multi-task Workflows
Data Governance	~10%	Unity Catalog GRANT, 3-level namespace, lineage

DEP (Data Engineer Professional) Exam Domains and Weights

DEP is the senior-level 60-question, 120-minute exam, dominated by long scenario questions. It tests practical skill in design judgment, troubleshooting, and performance optimization.

Domain	Weight	Key Topics
Data Processing	~30%	Complex ETL design, SCD Type 2, schema evolution, error handling
Data Modeling	~20%	Medallion architecture design, star schema, denormalization decisions
Security / Governance	~20%	Dynamic views, row-level security, column masking, external locations
Monitoring / Logging	~15%	System Tables, DLT event logs, reading Spark UI
Testing / Deployment	~15%	CI/CD pipelines, Databricks Asset Bundles, environment separation

Why Domain-by-Domain Study Works

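Data Engineer exam questions are not evenly distributed across domains. On DEA, the top two domains (ETL and Delta Lake) together account for roughly 55% of the score; on DEP, the top two (data processing and data modeling) account for about 50%. Studying by domain produces three concrete benefits.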

Surfacing weak areas: Measuring accuracy per domain makes it instantly clear which areas you understand shallowly. For example, if you are strong on Delta Lake but weak on incremental data processing, you can concentrate effort on Auto Loader and Structured Streaming.
Time allocation based on weight: Spending equal study time on the 30%-weight ETL domain and the 10%-weight governance domain is inefficient. Using the weight table as a guide and locking down high-weight domains first is the key to maximizing your score.
Stronger pattern recognition: Each domain has its own recurring patterns. For Delta Lake it is conditional MERGE branches; for Auto Loader, schema evolution modes; for DLT, Expectation action types. Recognizing these patterns reduces hesitation on the real exam.

Representative Question Patterns by Domain

Delta Lake MERGE — Conditional Branches and Evaluation Order

MERGE INTO can contain multiple WHEN MATCHED / WHEN NOT MATCHED clauses. They are evaluated top to bottom, and for each row only the first clause whose condition holds is applied. It is among the most frequently tested syntax on both DEA and DEP, and the exam checks whether you can accurately trace combinations of conditional DELETE, UPDATE SET *, and INSERT *. On DEP, expect MERGE patterns for both SCD Type 1 (overwrite with the latest value) and SCD Type 2 (history preserved).
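To make the difference between the two SCD types concrete, here is a minimal pure-Python sketch. It does not use Delta Lake or MERGE; it only models what each pattern leaves behind in the target. The function names and the columns `value`, `start_date`, `end_date`, and `is_current` are hypothetical choices for illustration, not names taken from the exam or from Databricks.

```python
from datetime import date


def apply_scd1(current, change):
    """SCD Type 1: overwrite the stored value, keeping no history.

    `current` maps id -> value (a hypothetical, simplified target table).
    """
    updated = dict(current)
    updated[change["id"]] = change["value"]
    return updated


def apply_scd2(history, change, effective):
    """SCD Type 2: close the current version and append a new one.

    `history` is a list of row dicts; the input list is not modified.
    """
    result = []
    for row in history:
        if row["id"] == change["id"] and row["is_current"]:
            # Close the previously current version instead of overwriting it.
            row = {**row, "end_date": effective, "is_current": False}
        result.append(row)
    # Append the new version as the current row.
    result.append({
        "id": change["id"],
        "value": change["value"],
        "start_date": effective,
        "end_date": None,
        "is_current": True,
    })
    return result


history = [{"id": 1, "value": "Tokyo", "start_date": date(2025, 1, 1),
            "end_date": None, "is_current": True}]
for row in apply_scd2(history, {"id": 1, "value": "Osaka"}, date(2026, 3, 1)):
    print(row)
```

In a MERGE-based implementation, the Type 1 branch is a plain UPDATE on match, while Type 2 needs both an UPDATE (to close the old row) and an INSERT (for the new version) for the same key. That is why Type 2 questions tend to be harder to trace.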

Auto Loader — Behavior of schemaEvolutionMode

Auto Loader (the cloudFiles format) handles newly arriving columns differently depending on the cloudFiles.schemaEvolutionMode option (addNewColumns / failOnNewColumns / rescue / none). The exam frequently asks what happens when JSON with a new column arrives. Remember that in rescue mode the schema is not changed and unknown columns are stored in _rescued_data.

Unity Catalog GRANT — Least Privilege and Hierarchical Propagation

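Unity Catalog privileges are managed at three levels: CATALOG → SCHEMA → TABLE/VIEW. GRANT USAGE grants only the right to access that level, not the right to read the data beneath it (in the current Unity Catalog privilege model these privileges are named USE CATALOG and USE SCHEMA; USAGE is the older name). The exam often tests that USAGE on a CATALOG alone is not enough to SELECT from underlying tables, and that GRANT SELECT ON SCHEMA ... TO ... grants SELECT on every table beneath the schema.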

DLT Expectations — WARN / DROP ROW / FAIL

Delta Live Tables data quality constraints (Expectations) support three violation actions: warn (the default when no ON VIOLATION clause is specified), ON VIOLATION DROP ROW, and ON VIOLATION FAIL UPDATE. Respectively, they keep the invalid row in the output and only record the violation in metrics, drop the row and record it in metrics, or stop the pipeline update. The exam can go as deep as asking where dropped row counts are recorded in the event log (flow_progress.data_quality.dropped_records).
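The three actions can be compared with a small pure-Python simulation. This is a sketch of the semantics only, not the DLT API: `apply_expectation`, `ExpectationResult`, and the action labels "warn", "drop", and "fail" are hypothetical names, and the counters only loosely mirror the metrics the event log reports.

```python
from dataclasses import dataclass, field


@dataclass
class ExpectationResult:
    output_rows: list = field(default_factory=list)
    passed_records: int = 0
    failed_records: int = 0
    dropped_records: int = 0


def apply_expectation(rows, predicate, action="warn"):
    """Apply one expectation to `rows` using the given violation action."""
    if action not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown action: {action}")
    result = ExpectationResult()
    for row in rows:
        if predicate(row):
            result.passed_records += 1
            result.output_rows.append(row)
            continue
        result.failed_records += 1
        if action == "warn":
            # Invalid row is kept in the output; only the metric changes.
            result.output_rows.append(row)
        elif action == "drop":
            # Invalid row is removed from the output and counted.
            result.dropped_records += 1
        else:
            # "fail": the update stops at the first invalid row.
            raise RuntimeError(f"expectation violated by {row!r}")
    return result


rows = [{"id": 1}, {"id": None}, {"id": 3}]
valid_id = lambda r: r["id"] is not None

print(apply_expectation(rows, valid_id, "warn"))
print(apply_expectation(rows, valid_id, "drop"))
```

The point to carry into the exam is that warn and drop both let the update finish and differ only in whether the invalid row reaches the target, while fail produces no output for that update.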

Workflows — Dependencies and Retry Policies

Multi-task Workflows are tested on task dependencies (depends_on), retry policies (max_retries / retry_on_timeout), and notification settings. Common DEP patterns include the behavior where task B is skipped when its dependency task A fails, and how to configure conditional branch (if/else) tasks.
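The dependency and retry behavior described above can be traced with a short simulation. This is a simplified model rather than the Jobs API: `run_workflow` and `attempt_results` are hypothetical, only `depends_on` and `max_retries` are borrowed from the text, and it assumes every task runs only when all of its upstream tasks succeeded.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, attempt_results):
    """Return a final status per task.

    tasks: name -> {"depends_on": [...], "max_retries": int}
    attempt_results: name -> list of booleans, one per attempt
                     (a stand-in for real task executions).
    """
    graph = {name: spec.get("depends_on", []) for name, spec in tasks.items()}
    status = {}
    # static_order() yields each task after all of its upstream tasks.
    for name in TopologicalSorter(graph).static_order():
        if any(status[up] != "SUCCESS" for up in graph[name]):
            # An upstream task failed or was skipped, so this one never runs.
            status[name] = "SKIPPED"
            continue
        allowed_attempts = 1 + tasks[name].get("max_retries", 0)
        results = attempt_results.get(name, [True])
        status[name] = "SUCCESS" if any(results[:allowed_attempts]) else "FAILED"
    return status


tasks = {
    "A": {"max_retries": 1},
    "B": {"depends_on": ["A"]},
    "C": {"depends_on": ["B"]},
    "D": {},
}
# A fails on both the first attempt and its single retry.
print(run_workflow(tasks, {"A": [False, False]}))
# A fails once, then succeeds on the retry.
print(run_workflow(tasks, {"A": [False, True]}))
```

In the first run B and C are never attempted while the independent task D still succeeds. In the second run the retry rescues A and the whole chain completes.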

Practice Questions by Domain (3 Questions)

Below is one representative question each from the Delta Lake, incremental data processing, and data governance domains. Try answering on your own before reading the explanation.

Delta Lake

Question 1

What is the correct behavior when the following MERGE statement runs?

    MERGE INTO silver USING bronze
    ON silver.id = bronze.id
    WHEN MATCHED AND bronze.op = 'DELETE' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED AND bronze.op != 'DELETE' THEN INSERT *

A. All matched rows, including those with bronze.op='DELETE', are deleted, and all unmatched rows are inserted
B. Matched rows where bronze.op='DELETE' are deleted from silver; matched rows that are not 'DELETE' are updated; only unmatched rows where op is not 'DELETE' are inserted
C. AND conditions cannot be written on a WHEN NOT MATCHED clause, so a syntax error occurs
D. Rows with bronze.op='DELETE' are inserted into silver, while other matched rows are deleted

Correct answer: B

Delta Lake MERGE evaluates WHEN MATCHED / WHEN NOT MATCHED clauses top to bottom, and each row is handled by the first clause whose condition holds. The first WHEN MATCHED deletes matched rows where bronze.op='DELETE'; the second WHEN MATCHED updates the remaining matched rows; the WHEN NOT MATCHED clause carries AND bronze.op != 'DELETE', so it inserts only unmatched rows whose op is not 'DELETE'. An unmatched row with op='DELETE' satisfies no clause and is simply ignored. AND conditions are allowed on WHEN NOT MATCHED, so no syntax error occurs (C is wrong). This pattern is the canonical way to apply CDC (Change Data Capture) events to the silver layer and is heavily tested on both DEA and DEP.
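The clause-by-clause trace can be reproduced in plain Python. The sketch below is a simulation of the statement in Question 1, not Delta Lake itself: `merge_cdc` is a hypothetical helper, tables are modeled as dictionaries keyed by `id`, and it assumes at most one bronze row per `id`. The "SKIP" label is an illustration aid for rows that satisfy no clause.

```python
def merge_cdc(silver, bronze):
    """Simulate the three-clause MERGE from Question 1.

    silver: dict mapping id -> row dict (the target table)
    bronze: list of row dicts with at least "id" and "op" (the source)
    Returns (new_silver, actions) without modifying the inputs.
    """
    result = dict(silver)
    actions = []
    for src in bronze:
        key = src["id"]
        if key in silver:
            # WHEN MATCHED clauses, checked in order; the first hit wins.
            if src["op"] == "DELETE":
                result.pop(key, None)
                actions.append((key, "DELETE"))
            else:
                result[key] = dict(src)  # UPDATE SET * copies every column
                actions.append((key, "UPDATE"))
        elif src["op"] != "DELETE":
            # WHEN NOT MATCHED AND bronze.op != 'DELETE'
            result[key] = dict(src)      # INSERT * copies every column
            actions.append((key, "INSERT"))
        else:
            # Unmatched DELETE event: no clause applies.
            actions.append((key, "SKIP"))
    return result, actions


silver = {
    1: {"id": 1, "op": "INSERT", "name": "old-1"},
    2: {"id": 2, "op": "INSERT", "name": "old-2"},
}
bronze = [
    {"id": 1, "op": "DELETE", "name": "old-1"},
    {"id": 2, "op": "UPDATE", "name": "new-2"},
    {"id": 3, "op": "INSERT", "name": "new-3"},
    {"id": 4, "op": "DELETE", "name": "ghost"},
]
new_silver, actions = merge_cdc(silver, bronze)
print(actions)
print(sorted(new_silver))
```

Row 1 is deleted, row 2 is updated, row 3 is inserted, and row 4 never reaches silver, which matches option B.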

Incremental Data Processing

Question 2

When schemaEvolutionMode='rescue' is set on Auto Loader (cloudFiles) and a JSON file arrives with a column not in the existing schema, what is the correct behavior?

A. Values for columns not in the existing schema are stored in the _rescued_data column as a JSON string
B. The new column is automatically added to the schema, and past data is backfilled with null
C. A schema mismatch error occurs and the stream stops
D. Columns not in the existing schema are ignored and not stored at all

Correct answer: A

Under schemaEvolutionMode='rescue', the schema is not auto-extended; values for columns or data types that do not match the existing schema are stashed in the _rescued_data column as a JSON string. The advantage is that the schema stays stable without losing data. Behavior B is closest to addNewColumns mode, where the stream first stops when the new column is detected and the column is added to the schema on restart; behavior C corresponds to failOnNewColumns mode; behavior D corresponds to none mode. Rescue mode is suited to safely handling unexpected schema changes in production pipelines, and questions distinguishing Auto Loader modes appear frequently on DEA.
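A side-by-side view of the four modes helps when eliminating wrong options. The following is a pure-Python sketch of the documented behavior for one incoming record, not Auto Loader code: `ingest_record` and its return shape are hypothetical, type mismatches are not modeled, and the real _rescued_data value carries extra metadata that is omitted here.

```python
import json

MODES = ("addNewColumns", "failOnNewColumns", "rescue", "none")


def ingest_record(schema, record, mode):
    """Return (row, schema_after, stream_failed) for one JSON record.

    schema: list of known column names
    record: dict parsed from the incoming JSON
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    unknown = {k: v for k, v in record.items() if k not in schema}
    row = {col: record.get(col) for col in schema}

    if unknown and mode == "addNewColumns":
        # The stream stops; the evolved schema applies after a restart.
        return None, list(schema) + list(unknown), True
    if unknown and mode == "failOnNewColumns":
        # The stream stops and the schema is left unchanged.
        return None, list(schema), True
    if mode == "rescue":
        # The schema is unchanged; unknown fields are kept as a JSON string.
        row["_rescued_data"] = json.dumps(unknown) if unknown else None
    # In "none" mode unknown fields are dropped without any error.
    return row, list(schema), False


schema = ["id", "amount"]
record = {"id": 1, "amount": 9.5, "coupon": "X1"}
for mode in MODES:
    print(mode, ingest_record(schema, record, mode))
```

Only rescue keeps both the stream running and the unexpected value, which is why it maps to option A in Question 2.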

Data Governance

Question 3

After executing the following GRANT statements in Unity Catalog, what can analyst_group do?

    GRANT USAGE ON CATALOG prod_catalog TO analyst_group;
    GRANT USAGE ON SCHEMA prod_catalog.sales TO analyst_group;
    GRANT SELECT ON SCHEMA prod_catalog.sales TO analyst_group;

A. SELECT on every table in every schema under prod_catalog
B. SELECT on every table under the prod_catalog.sales schema, with no access to other schemas
C. Read schema metadata for prod_catalog.sales, but cannot read table data
D. An error occurs because the GRANT order is invalid, so no privileges are granted

Correct answer: B

Unity Catalog privileges are hierarchical: you cannot access a lower level without USAGE at the higher level. This example grants 3 privileges: CATALOG-level USAGE (access to prod_catalog), SCHEMA-level USAGE (access to the sales schema), and SCHEMA-level SELECT (SELECT on every table under the sales schema). GRANT SELECT ON SCHEMA grants SELECT in bulk to all current and future tables under the schema. However, other schemas under prod_catalog (e.g., prod_catalog.marketing) have no USAGE grant, so they are inaccessible. There is no constraint on the order of GRANT statements, so D is incorrect.
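The access rule in this explanation can be written as a small check. This is a deliberately reduced model of the three grants in Question 3, not the Unity Catalog implementation: `can_select` and the tuple layout are hypothetical, and ownership, group membership, ALL PRIVILEGES, and inheritance of USAGE from the catalog are left out. The table names `orders` and `campaigns` are invented for the example.

```python
def can_select(grants, principal, table):
    """Return True if `principal` may SELECT from `catalog.schema.table`.

    grants: set of (principal, privilege, securable_type, securable_name)
    """
    catalog, schema, _ = table.split(".")
    schema_name = f"{catalog}.{schema}"

    def has(privilege, securable_type, name):
        return (principal, privilege, securable_type, name) in grants

    # Access to each level above the table is required first.
    if not has("USAGE", "CATALOG", catalog):
        return False
    if not has("USAGE", "SCHEMA", schema_name):
        return False
    # SELECT may be granted on the table itself or inherited from above.
    return (
        has("SELECT", "TABLE", table)
        or has("SELECT", "SCHEMA", schema_name)
        or has("SELECT", "CATALOG", catalog)
    )


grants = {
    ("analyst_group", "USAGE", "CATALOG", "prod_catalog"),
    ("analyst_group", "USAGE", "SCHEMA", "prod_catalog.sales"),
    ("analyst_group", "SELECT", "SCHEMA", "prod_catalog.sales"),
}
print(can_select(grants, "analyst_group", "prod_catalog.sales.orders"))
print(can_select(grants, "analyst_group", "prod_catalog.marketing.campaigns"))
```

The first call succeeds because all three conditions hold for the sales schema; the second fails at the schema-level USAGE check, which is the reasoning behind option B.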

How to Approach Domain-by-Domain Study

Step 1: Set priorities based on domain weight

Start from the weight table and allocate at least 50% of your study time to the high-weight domains (ETL and Delta Lake). These two domains determine the majority of your score, so answering them consistently puts the passing line within reach. Split the remaining time across incremental data processing, pipelines, and governance.
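One simple way to turn the weights into a schedule is a proportional split. The sketch below uses the approximate DEA weights from the table earlier in this article; the 40-hour budget and the `allocate_hours` helper are hypothetical, so substitute your own numbers.

```python
# Approximate DEA domain weights (percent), as listed in the weight table.
DEA_WEIGHTS = {
    "ETL with Spark SQL & Python": 30,
    "Delta Lake": 25,
    "Incremental Data Processing": 20,
    "Production Pipelines": 15,
    "Data Governance": 10,
}


def allocate_hours(weights, total_hours):
    """Split `total_hours` across domains in proportion to their weights."""
    total_weight = sum(weights.values())
    return {
        domain: round(total_hours * weight / total_weight, 1)
        for domain, weight in weights.items()
    }


plan = allocate_hours(DEA_WEIGHTS, total_hours=40)
for domain, hours in plan.items():
    print(f"{domain}: {hours} h")
```

With a 40-hour budget, ETL and Delta Lake receive 12 and 10 hours, or 22 of the 40 hours, consistent with the "at least 50%" guideline. A strictly proportional split is only a starting point; shift hours toward whichever domains your accuracy tracking shows to be weak.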

Step 2: Measure accuracy by domain

Work through the question bank one domain at a time and record accuracy. Any domain below 70% is a weak area. For weak domains, re-read the relevant section of the official documentation, run code in Community Edition to verify behavior, then redo the questions.
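Tracking accuracy needs nothing more than a tally per domain. Here is a minimal sketch, assuming you record (correct, attempted) counts yourself; `accuracy_by_domain`, `weak_domains`, and the sample numbers are hypothetical, and the 70% threshold comes from the text above.

```python
def accuracy_by_domain(results):
    """Map each domain to its accuracy as a fraction between 0 and 1.

    results: domain -> (correct, attempted); domains with no attempts are skipped.
    """
    return {
        domain: correct / attempted
        for domain, (correct, attempted) in results.items()
        if attempted > 0
    }


def weak_domains(results, threshold=0.70):
    """Return domains below `threshold`, weakest first."""
    accuracy = accuracy_by_domain(results)
    below = [domain for domain, value in accuracy.items() if value < threshold]
    return sorted(below, key=accuracy.get)


results = {
    "Delta Lake": (18, 20),
    "Incremental Data Processing": (13, 20),
    "Data Governance": (6, 10),
}
print(weak_domains(results))
```

Sorting weakest first gives a review order directly: here Data Governance (60%) comes before Incremental Data Processing (65%), and Delta Lake (90%) is left off the list.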

Step 3: Train yourself to predict code execution results

Both DEA and DEP feature questions where you read a code snippet and infer the behavior. The ability to instantly judge the outcome of MERGE statements, Auto Loader settings, DLT Expectation definitions, etc., is the key to passing. Practice repeatedly in Community Edition by running the code and comparing the actual result against your prediction.

Frequently Asked Questions

How do the exam domain weights differ between DEA (Data Engineer Associate) and DEP (Data Engineer Professional)?

DEA covers 5 domains: Delta Lake operations, ETL with Spark SQL, incremental data processing, production pipelines, and data governance, with ETL with Spark SQL carrying the largest weight (~30%). DEP covers 5 domains: data processing, data modeling, security/governance, monitoring/logging, and testing/deployment, where data processing and modeling together account for about 50%. DEP centers on long scenario questions that require cross-domain design judgment.

If studying by domain, which domain should you tackle first?

The most efficient starting point is ETL / Data Processing (Spark SQL, PySpark), which has the highest weight and serves as the foundation for other domains. After mastering Spark read/write operations, move to Delta Lake (MERGE, VACUUM, Time Travel), then data governance (Unity Catalog GRANT, lineage), then pipelines (DLT Expectations, Workflows). This order builds knowledge cumulatively and surfaces cross-domain relationships.

Should you do domain-by-domain practice and mock exams in parallel?

The most effective approach is to first solidify each domain through targeted practice, identify and shore up weak areas, then move to mock exams. Once you exceed 80% accuracy on domain practice, shift to full mock exams under the real 120-minute time limit. If a mock reveals new weaknesses, return to domain practice — this loop is the fastest path to passing.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
