The Databricks Data Engineer exams (Associate / Professional) test a broad set of skills: Delta Lake, Spark SQL, ETL pipelines, and data governance. This article shows the weight of each exam domain, then walks through representative practice questions for the most common patterns, explaining the level of knowledge expected and where to focus when answering.
DEA is a 45-question, 120-minute exam covering the 5 domains below. The March 2026 revision increased the weight of Unity Catalog topics and removed the legacy Hive Metastore questions from scope.
| Domain | Weight | Key Topics |
|---|---|---|
| ETL with Spark SQL & Python | ~30% | DataFrame read/write, Spark SQL syntax, data transformation, UDFs |
| Delta Lake | ~25% | MERGE, VACUUM, Time Travel, Liquid Clustering, CDC |
| Incremental Data Processing | ~20% | Auto Loader, Structured Streaming, COPY INTO |
| Production Pipelines | ~15% | DLT Expectations, LakeFlow Jobs, multi-task Workflows |
| Data Governance | ~10% | Unity Catalog GRANT, 3-level namespace, lineage |
DEP is the senior-level 60-question, 120-minute exam, dominated by long scenario questions. It tests practical skill in design judgment, troubleshooting, and performance optimization.
| Domain | Weight | Key Topics |
|---|---|---|
| Data Processing | ~30% | Complex ETL design, SCD Type 2, schema evolution, error handling |
| Data Modeling | ~20% | Medallion architecture design, star schema, denormalization decisions |
| Security / Governance | ~20% | Dynamic views, row-level security, column masking, external locations |
| Monitoring / Logging | ~15% | System Tables, DLT event logs, reading Spark UI |
| Testing / Deployment | ~15% | CI/CD pipelines, Databricks Asset Bundles, environment separation |
Data Engineer exam questions are not evenly distributed across domains — the top 2 (ETL + Delta Lake) alone account for 50-55% of the score. Studying by domain produces 3 concrete benefits.
MERGE INTO can contain multiple WHEN MATCHED / WHEN NOT MATCHED clauses, evaluated top to bottom. It is the most frequently tested syntax on both DEA and DEP, and the exam checks whether you can accurately trace combinations of conditional DELETE, UPDATE SET *, and INSERT *. SCD Type 1 (overwrite with latest) and Type 2 (history preserved) MERGE patterns are guaranteed to appear on DEP.
Auto Loader using the cloudFiles format behaves differently when new columns arrive depending on schemaEvolutionMode (addNewColumns / failOnNewColumns / rescue / none). The exam frequently asks what happens when JSON with a new column arrives. Remember that in rescue mode, unknown columns are stored in _rescued_data.
Unity Catalog privileges are managed at 3 levels: CATALOG → SCHEMA → TABLE/VIEW. GRANT USAGE (named USE CATALOG and USE SCHEMA in the current Unity Catalog privilege model) only grants the right to access that level. The exam often tests that USAGE on a CATALOG alone is not enough to SELECT from underlying tables, and that GRANT SELECT ON SCHEMA ... TO ... grants SELECT on every table beneath the schema.
Delta Live Tables data quality constraints (Expectations) support 3 violation behaviors: warn (the default when no ON VIOLATION clause is given), ON VIOLATION DROP ROW, and ON VIOLATION FAIL UPDATE. These respectively keep the row in the output and record the violation in metrics, drop the row and record it in metrics, or stop the pipeline update. The exam can go as deep as asking where dropped row counts are recorded in the event log (flow_progress.data_quality.dropped_records).
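The three behaviors can be traced with a small pure-Python model. This is a simplified sketch, not the DLT API: `apply_expectation`, the action names `"warn"`, `"drop"`, `"fail"`, and the metric keys are hypothetical names chosen for illustration.

```python
def apply_expectation(rows, predicate, action="warn"):
    """Model one expectation over a batch of rows (illustrative only).

    Returns (output_rows, metrics). Raises RuntimeError for action="fail"
    when at least one row violates the predicate.
    """
    passed = [r for r in rows if predicate(r)]
    failed_count = len(rows) - len(passed)

    if action == "fail":
        if failed_count:
            # Models the pipeline update stopping on the first violation.
            raise RuntimeError("expectation violated: update stops")
        return passed, {"failed_records": 0, "dropped_records": 0}
    if action == "drop":
        # Violating rows are removed and counted as dropped.
        return passed, {"failed_records": failed_count,
                        "dropped_records": failed_count}
    if action == "warn":
        # Violating rows stay in the output; only the metric records them.
        return list(rows), {"failed_records": failed_count,
                            "dropped_records": 0}
    raise ValueError(f"unknown action: {action}")


rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}]
is_valid = lambda r: r["amount"] >= 0

kept, metrics = apply_expectation(rows, is_valid, "drop")
print(len(kept), metrics)
```

When tracing an exam snippet, the useful question is the same one the model answers: does the violating row reach the target table, and which counter changes.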
Multi-task Workflows are tested on task dependencies (depends_on), retry policies (max_retries / retry_on_timeout), and notification settings. Common DEP patterns include the behavior where task B is skipped when its dependency task A fails, and how to configure conditional branch (if/else) tasks.
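The skip-on-upstream-failure behavior can be sketched as a walk over the dependency graph. This is a minimal model under stated assumptions: every task uses the default "all upstream tasks succeeded" run condition, there are no retries, and the graph is acyclic. The function name `run_statuses` and the status labels are hypothetical, not values returned by the Jobs API.

```python
def run_statuses(depends_on, failing):
    """Compute a status per task.

    depends_on: {task: [upstream task names]}
    failing:    set of tasks whose own execution fails
    """
    statuses = {}

    def resolve(task):
        if task in statuses:
            return statuses[task]
        upstream = [resolve(u) for u in depends_on.get(task, [])]
        if any(s != "SUCCESS" for s in upstream):
            # The task never starts because a dependency did not succeed.
            statuses[task] = "SKIPPED"
        elif task in failing:
            statuses[task] = "FAILED"
        else:
            statuses[task] = "SUCCESS"
        return statuses[task]

    for task in depends_on:
        resolve(task)
    return statuses


graph = {"A": [], "B": ["A"], "C": ["B"], "D": []}
print(run_statuses(graph, failing={"A"}))
```

Note that the skip propagates transitively: C depends only on B, but it is still not run because B never started.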
Below is one representative question each from the Delta Lake, incremental data processing, and data governance domains. Try answering on your own before reading the explanation.
Delta Lake
Question 1
What is the correct behavior when the following MERGE statement runs?

    MERGE INTO silver USING bronze
    ON silver.id = bronze.id
    WHEN MATCHED AND bronze.op = 'DELETE' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED AND bronze.op != 'DELETE' THEN INSERT *
Correct answer: B
Delta Lake MERGE evaluates WHEN MATCHED / WHEN NOT MATCHED clauses top to bottom. The first WHEN MATCHED conditionally deletes matched rows where bronze.op='DELETE'; the second WHEN MATCHED updates the remaining matched rows; the WHEN NOT MATCHED clause carries AND bronze.op != 'DELETE', so it inserts only unmatched rows whose op is not 'DELETE'. AND conditions are allowed on WHEN NOT MATCHED, so no syntax error occurs (C is wrong). This pattern is the canonical implementation for reflecting CDC (Change Data Capture) events into the silver layer and is heavily tested on both DEA and DEP.
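The clause-by-clause evaluation described above can be traced with a pure-Python model of this specific statement. It is a sketch, not Delta Lake code: tables are plain dicts, `merge_cdc` is a hypothetical name, silver and bronze are assumed to share the same columns (so that `UPDATE SET *` and `INSERT *` copy the whole source row), and bronze is assumed to hold at most one row per id.

```python
def merge_cdc(silver, bronze):
    """Model the question's MERGE.

    silver: {id: row dict}          (target table)
    bronze: list of row dicts with "id" and "op" keys (source table)
    """
    result = dict(silver)
    for src in bronze:
        key = src["id"]
        if key in silver:                  # ON silver.id = bronze.id
            if src["op"] == "DELETE":      # WHEN MATCHED AND op = 'DELETE'
                del result[key]            #   THEN DELETE
            else:                          # WHEN MATCHED (remaining rows)
                result[key] = dict(src)    #   THEN UPDATE SET *
        elif src["op"] != "DELETE":        # WHEN NOT MATCHED AND op != 'DELETE'
            result[key] = dict(src)        #   THEN INSERT *
        # An unmatched DELETE event satisfies no clause and is ignored.
    return result


silver = {
    1: {"id": 1, "name": "a", "op": "INSERT"},
    2: {"id": 2, "name": "b", "op": "INSERT"},
}
bronze = [
    {"id": 1, "name": "a2", "op": "UPDATE"},  # matched     -> updated
    {"id": 2, "name": "b", "op": "DELETE"},   # matched     -> deleted
    {"id": 3, "name": "c", "op": "INSERT"},   # not matched -> inserted
    {"id": 4, "name": "d", "op": "DELETE"},   # not matched -> ignored
]
merged = merge_cdc(silver, bronze)
print(sorted(merged))
```

The fourth source row is the case exam options tend to get wrong: a DELETE event for an id that does not exist in silver is neither inserted nor an error.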
Incremental Data Processing
Question 2
When schemaEvolutionMode='rescue' is set on Auto Loader (cloudFiles) and a JSON file arrives with a column not in the existing schema, what is the correct behavior?
Correct answer: A
Under schemaEvolutionMode='rescue', the schema is not auto-extended; values for columns or data types that do not match the existing schema are stashed in the _rescued_data column as a JSON string. The advantage is keeping schema stability without losing data. Behavior B corresponds to addNewColumns mode; behavior C corresponds to failOnNewColumns mode. Rescue mode is suited for safely handling unexpected schema changes in production pipelines, and questions distinguishing Auto Loader modes appear frequently on DEA.
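To compare the modes side by side, the following pure-Python sketch models what happens to a single incoming record. It is a simplification, not Auto Loader itself: `handle_record` and its return shape are hypothetical, only new columns (not type mismatches) are modeled, and the real `_rescued_data` value carries additional metadata that is omitted here.

```python
import json


def handle_record(schema, record, mode):
    """Return (row, schema_after, stream_fails) for one JSON record.

    schema: list of known column names
    record: dict parsed from the incoming JSON
    mode:   value of schemaEvolutionMode
    """
    known = {c: record.get(c) for c in schema}
    extra = {c: v for c, v in record.items() if c not in schema}
    if not extra:
        return known, list(schema), False

    if mode == "addNewColumns":
        # The stream stops; the evolved schema is used after a restart.
        return None, list(schema) + list(extra), True
    if mode == "failOnNewColumns":
        # The stream stops and the schema stays as it was.
        return None, list(schema), True
    if mode == "rescue":
        # Schema is unchanged; unknown columns are kept as a JSON string.
        known["_rescued_data"] = json.dumps(extra)
        return known, list(schema), False
    if mode == "none":
        # Schema is unchanged and unknown columns are ignored.
        return known, list(schema), False
    raise ValueError(f"unknown mode: {mode}")


record = {"id": 1, "name": "a", "email": "a@example.com"}
row, schema_after, fails = handle_record(["id", "name"], record, "rescue")
print(row["_rescued_data"], schema_after, fails)
```

Only the two modes that stop the stream differ in whether the schema changes, which is the distinction the answer options usually hinge on.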
Data Governance
Question 3
After executing the following GRANT statements in Unity Catalog, what can analyst_group do?

    GRANT USAGE ON CATALOG prod_catalog TO analyst_group;
    GRANT USAGE ON SCHEMA prod_catalog.sales TO analyst_group;
    GRANT SELECT ON SCHEMA prod_catalog.sales TO analyst_group;
Correct answer: B
Unity Catalog privileges are hierarchical: you cannot access a lower level without USAGE at the higher level. This example grants 3 privileges: CATALOG-level USAGE (access to prod_catalog), SCHEMA-level USAGE (access to the sales schema), and SCHEMA-level SELECT (SELECT on every table under the sales schema). GRANT SELECT ON SCHEMA grants SELECT in bulk to all current and future tables under the schema. However, other schemas under prod_catalog (e.g., prod_catalog.marketing) have no USAGE grant, so they are inaccessible. There is no constraint on the order of GRANT statements, so D is incorrect.
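The hierarchy check can be written as a short predicate. This is a minimal model of the rule described above, not a Unity Catalog API: `can_select` is a hypothetical helper, grants are plain tuples, and the table names `orders` and `campaigns` are invented for the example.

```python
def can_select(grants, principal, table):
    """Decide whether principal can SELECT from 'catalog.schema.table'.

    grants: set of (privilege, securable_name, principal) tuples
    """
    catalog, schema, _ = table.split(".")
    schema_fqn = f"{catalog}.{schema}"

    def has(privilege, securable):
        return (privilege, securable, principal) in grants

    # USAGE is required at both the catalog and the schema level.
    usage_ok = has("USAGE", catalog) and has("USAGE", schema_fqn)
    # SELECT may be granted on the table itself or inherited from above.
    select_ok = any(has("SELECT", s) for s in (catalog, schema_fqn, table))
    return usage_ok and select_ok


grants = {
    ("USAGE", "prod_catalog", "analyst_group"),
    ("USAGE", "prod_catalog.sales", "analyst_group"),
    ("SELECT", "prod_catalog.sales", "analyst_group"),
}
print(can_select(grants, "analyst_group", "prod_catalog.sales.orders"))
print(can_select(grants, "analyst_group", "prod_catalog.marketing.campaigns"))
```

Because grants are stored as a set, the model also reflects that the order of the GRANT statements has no effect on the outcome.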
Start by referencing the weight table and allocate 50%+ of study time to the high-weight domains (ETL and Delta Lake). These 2 domains determine the majority of your score, so answering them consistently puts the passing line within reach. Split the remaining 50% across incremental data processing, pipelines, and governance.
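As a worked example of the allocation, the sketch below splits a study budget in proportion to the approximate DEA weights from the table. The 40-hour budget and the `allocate_hours` helper are assumptions for illustration only.

```python
# Approximate DEA domain weights (percent), taken from the weight table.
DEA_WEIGHTS = {
    "ETL with Spark SQL & Python": 30,
    "Delta Lake": 25,
    "Incremental Data Processing": 20,
    "Production Pipelines": 15,
    "Data Governance": 10,
}


def allocate_hours(total_hours, weights=DEA_WEIGHTS):
    """Split total_hours across domains in proportion to their weights."""
    total_weight = sum(weights.values())
    return {d: total_hours * w / total_weight for d, w in weights.items()}


plan = allocate_hours(40)
for domain, hours in plan.items():
    print(f"{domain}: {hours:.1f} h")
```

With a 40-hour budget, ETL and Delta Lake receive 12 and 10 hours respectively, which is 55% of the total and matches the "50%+" guideline.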
Work through the question bank one domain at a time and record accuracy. Any domain below 70% is a weak area. For weak domains, re-read the relevant section of the official documentation, run code in Community Edition to verify behavior, then redo the questions.
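The accuracy tracking can be as simple as the following sketch, which flags every domain below the 70% threshold. The `weak_domains` helper and the sample scores are hypothetical.

```python
def weak_domains(results, threshold=0.70):
    """Return domains whose accuracy is below threshold.

    results: {domain: (correct_answers, attempted_questions)}
    Domains with no attempts are skipped rather than flagged.
    """
    return sorted(
        domain
        for domain, (correct, attempted) in results.items()
        if attempted and correct / attempted < threshold
    )


results = {
    "Delta Lake": (18, 20),             # 90%
    "Data Governance": (6, 10),         # 60%
    "Production Pipelines": (15, 20),   # 75%
}
print(weak_domains(results))
```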
Both DEA and DEP feature questions where you read a code snippet and infer the behavior. The ability to instantly judge the outcome of MERGE statements, Auto Loader settings, DLT Expectation definitions, etc., is the key to passing. Practice repeatedly in Community Edition by running the code and comparing the actual result against your prediction.
How do the exam domain weights differ between DEA (Data Engineer Associate) and DEP (Data Engineer Professional)?
DEA covers 5 domains: Delta Lake operations, ETL with Spark SQL, incremental data processing, production pipelines, and data governance, with ETL with Spark SQL carrying the largest weight (~30%). DEP covers 5 domains: data processing, data modeling, security/governance, monitoring/logging, and testing/deployment, where data processing and modeling together account for about 50%. DEP centers on long scenario questions that require cross-domain design judgment.
If studying by domain, which domain should you tackle first?
The most efficient starting point is ETL / Data Processing (Spark SQL, PySpark), which has the highest weight and serves as the foundation for other domains. After mastering Spark read/write operations, move to Delta Lake (MERGE, VACUUM, Time Travel), then data governance (Unity Catalog GRANT, lineage), then pipelines (DLT Expectations, Workflows). This order builds knowledge cumulatively and surfaces cross-domain relationships.
Should you do domain-by-domain practice and mock exams in parallel?
The most effective approach is to first solidify each domain through targeted practice, identify and shore up weak areas, then move to mock exams. Once you exceed 80% accuracy on domain practice, shift to full mock exams under the real 120-minute time limit. If a mock reveals new weaknesses, return to domain practice — this loop is the fastest path to passing.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.