Databricks Certified Data Engineer Professional (DEP) is Databricks' advanced data engineering certification. While the Associate exam (DEA) asks "do you understand each feature?", DEP asks "can you design, implement, and operate pipelines that meet complex requirements?". Nearly every question is grounded in a real-world scenario, requiring deep understanding of Delta Lake, Structured Streaming, Unity Catalog, and Jobs Orchestration, plus production troubleshooting, security design, and CI/CD.
This article systematically covers the DEP exam specification, the 6 exam domains in detail, differences from DEA, advanced topics per domain, a study roadmap for engineers coming from DEA, and strategies for tackling long-scenario questions.
The DEP exam specification is as follows. Like DEA, there are 60 questions, but the time limit is expanded to 120 minutes, which works out to 2 minutes per question. The design accounts for the time needed to parse long scenarios.
| Item | Details |
|---|---|
| Number of questions | 60 questions |
| Time limit | 120 minutes |
| Passing score | 70% (42+ correct) |
| Exam fee | $200 (excl. tax) |
| Language | English and Japanese |
| Question format | Single choice / multiple choice (scenario-based) |
| Prerequisites | None (DEA cert not required, but the knowledge is) |
| Validity period | 2 years |
| Delivery format | Online proctored / test center |
DEA and DEP belong to the same "Data Engineer" track, but the required skill level is fundamentally different. DEA is at the "you know what each feature does" level; DEP is at the "you can combine multiple features to make design decisions that satisfy requirements" level.
| Comparison | DEA (Associate) | DEP (Professional) |
|---|---|---|
| Number of questions | 45 questions | 60 questions |
| Time limit | 90 minutes | 120 minutes |
| Passing score | 70% | 70% |
| Exam fee | $200 | $200 |
| Difficulty | Beginner to intermediate | Intermediate to advanced |
| Question style | Mostly short questions directly testing feature understanding | Mostly long-form questions asking for design decisions based on real-world scenarios |
| Expected experience | 6+ months of Databricks usage | Databricks + 2+ years of data engineering experience |
| Code reading | Read basic PySpark/SQL syntax | Read implementation code (foreachBatch, DLT pipelines, Jobs API, etc.) and judge correctness |
DEP draws from 6 domains. Data Processing is the largest at 25%, followed by Databricks Tooling and Data Modeling at 20% each. Security and Governance and Monitoring and Logging are each 10%, but because the question count is small, losing a single question hits your pass/fail margin directly — you can't afford to skip them.
| Domain | Weight | Approx. Questions |
|---|---|---|
| 1. Databricks Tooling | 20% | ~12 questions |
| 2. Data Processing | 25% | ~15 questions |
| 3. Data Modeling | 20% | ~12 questions |
| 4. Security and Governance | 10% | ~6 questions |
| 5. Monitoring and Logging | 10% | ~6 questions |
| 6. Testing and Deployment | 15% | ~9 questions |
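The approximate question counts in the table follow directly from the weights and the 60-question total. The short sketch below reproduces that arithmetic along with the 42-question pass line; the variable and function names are illustrative only.

```python
import math

TOTAL_QUESTIONS = 60
PASSING_PERCENT = 70

# Domain weights (percent) as listed in the table above
DOMAIN_WEIGHTS = {
    "Databricks Tooling": 20,
    "Data Processing": 25,
    "Data Modeling": 20,
    "Security and Governance": 10,
    "Monitoring and Logging": 10,
    "Testing and Deployment": 15,
}


def approx_questions(weight_percent, total=TOTAL_QUESTIONS):
    """Approximate number of questions for a domain with the given weight."""
    return round(total * weight_percent / 100)


questions_by_domain = {
    domain: approx_questions(weight) for domain, weight in DOMAIN_WEIGHTS.items()
}

# Minimum number of correct answers needed to reach the passing score
passing_count = math.ceil(TOTAL_QUESTIONS * PASSING_PERCENT / 100)

for domain, count in questions_by_domain.items():
    print(f"{domain}: ~{count} questions")
print(f"Correct answers needed to pass: {passing_count}")
```

With only about 6 questions in each 10% domain, every question there is worth roughly 1.7% of the total score.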
This domain tests deep understanding of Databricks' development and operations toolchain. DEA tests basic concepts of Workflows and Delta Live Tables; DEP requires implementation-level knowledge.
Databricks Asset Bundles (DABs) let you define jobs, pipelines, notebooks, and config files as YAML templates and deploy them to a workspace via the databricks bundle deploy command. On the exam, this comes up in scenarios like "the most efficient way to deploy a pipeline across dev, staging, and prod environments". You need to understand how to override cluster settings and catalog names per target environment within a bundle definition.
Creating, running, and monitoring jobs via the REST API is a staple DEP topic. In particular, expect questions on defining dependencies across tasks in a Multi-Task Workflow, failure retry policies (max_retries, retry_on_timeout), and passing parameters between tasks (task values). Be sure you know how to use dbutils.jobs.taskValues.set() and dbutils.jobs.taskValues.get().
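As a rough illustration of these pieces, the sketch below shows a two-task job definition in the general shape the Jobs API accepts, plus two small helpers around task values. The job name, task keys, notebook paths, and the row_count key are hypothetical, cluster settings are omitted, and dbutils is passed in explicitly because it only exists inside a Databricks notebook or job.

```python
# Hypothetical job definition for the Jobs API create endpoint.
# Names and paths are placeholders; compute settings are omitted for brevity.
job_spec = {
    "name": "daily_sales_pipeline",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
            # Retry policy: up to 2 retries, 60 s apart, also after a timeout
            "max_retries": 2,
            "min_retry_interval_millis": 60_000,
            "retry_on_timeout": True,
        },
        {
            "task_key": "transform",
            # Runs only after the 'ingest' task has succeeded
            "depends_on": [{"task_key": "ingest"}],
            "notebook_task": {"notebook_path": "/Repos/etl/transform"},
        },
    ],
}


def publish_row_count(dbutils, row_count):
    """Call from the upstream 'ingest' task to share a value with later tasks."""
    dbutils.jobs.taskValues.set(key="row_count", value=row_count)


def read_row_count(dbutils):
    """Call from the downstream 'transform' task to read the upstream value."""
    return dbutils.jobs.taskValues.get(
        taskKey="ingest",   # task that set the value
        key="row_count",
        default=0,          # used when the key was never set
        debugValue=0,       # used when running the notebook outside a job
    )
```

Exam questions in this area often hinge on the taskKey argument: the downstream task must name the task that set the value, not itself.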
Databricks Repos clones Git repositories into your workspace and enables branch management and pull-request workflows. On DEP, you'll see operational design questions like "deployment flow into production workspaces" or "how do you prevent developers from pushing directly to main?". Automating deployment from a CI/CD pipeline (such as GitHub Actions) via the Repos API is also in scope.
The highest-weighted domain, focused on advanced use of Structured Streaming, Delta Lake, and Auto Loader. It goes a step beyond DEA's "read with readStream, write with writeStream" — you need to design pipelines that include complex transformations and error handling.
For upserting streaming data into Delta tables, the standard pattern is to use foreachBatch to run MERGE INTO per micro-batch. The exam often shows a code snippet of this pattern and asks "what's wrong with this code?" or "what should be added to guarantee idempotency?". You'll need to understand deduplication using batch IDs and the precise semantics of the WHEN MATCHED / WHEN NOT MATCHED conditions in MERGE.
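A minimal sketch of the foreachBatch plus MERGE pattern might look like the following. The table name silver.customers, the key column customer_id, and the view name batch_updates are assumptions made for illustration, and the function expects a real PySpark micro-batch DataFrame at runtime.

```python
TARGET_TABLE = "silver.customers"  # hypothetical target Delta table
KEY_COLUMN = "customer_id"         # hypothetical business key
STAGING_VIEW = "batch_updates"     # temp view holding the current micro-batch


def build_merge_sql(target, source_view, key):
    """Build a key-based upsert statement for one micro-batch."""
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source_view} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def upsert_batch(batch_df, batch_id):
    """foreachBatch handler: dedupe the micro-batch, then MERGE it."""
    # MERGE fails when several source rows match the same target row,
    # so keep one row per key (dropDuplicates keeps an arbitrary one).
    deduped = batch_df.dropDuplicates([KEY_COLUMN])
    deduped.createOrReplaceTempView(STAGING_VIEW)
    # Run the MERGE in the micro-batch's own session so the view is visible.
    deduped.sparkSession.sql(
        build_merge_sql(TARGET_TABLE, STAGING_VIEW, KEY_COLUMN)
    )


# Typical wiring (commented out because it needs a running stream):
# stream_df.writeStream.foreachBatch(upsert_batch) \
#     .option("checkpointLocation", "<checkpoint path>").start()
```

Because the MERGE is keyed, replaying the same micro-batch after a failure rewrites the same rows instead of appending duplicates, which is the idempotency property exam questions tend to probe. If the latest record per key matters, the dedup step would need an explicit ordering (for example a window over an event timestamp) rather than dropDuplicates.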
Auto Loader has two file-detection modes. Directory Listing periodically lists the cloud storage directory to detect new files. File Notification uses cloud event services (AWS SNS and SQS / Azure Event Grid with Queue Storage / Google Cloud Pub/Sub) to receive file-arrival events. DEP asks questions like "Auto Loader performance is degrading in a directory with millions of accumulated files: what's the fix?", expecting you to switch to File Notification mode. You need to understand the cloudFiles.useNotifications = true setting plus the required cloud permissions (such as the right to create SQS queues).
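The mode switch comes down to a single reader option. The sketch below builds the option set for either mode; the helper names and the paths passed in are hypothetical, and the reader function needs a real SparkSession on Databricks to run.

```python
def auto_loader_options(file_format, schema_location, use_notifications):
    """Build Auto Loader reader options for either file-detection mode."""
    return {
        "cloudFiles.format": file_format,
        # Where Auto Loader tracks the inferred schema (path is caller-supplied)
        "cloudFiles.schemaLocation": schema_location,
        # "false" = Directory Listing, "true" = File Notification
        "cloudFiles.useNotifications": "true" if use_notifications else "false",
    }


def read_with_auto_loader(spark, source_path, options):
    """Start an Auto Loader streaming read with the given options."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
    )


# Example option set for File Notification mode on CSV input
notification_options = auto_loader_options(
    "csv", "/tmp/schemas/orders", use_notifications=True
)
```

Note that enabling notifications is not only a code change: the identity running the stream also needs cloud-side permissions to create or use the event and queue resources.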
Enabling Change Data Feed on a Delta table lets downstream consumers read the INSERT/UPDATE/DELETE history. You retrieve change data with the table_changes() SQL function or with readStream.option("readChangeFeed", "true") on a streaming read, then apply it downstream. Memorize precisely that CDF records include _change_type (insert / update_preimage / update_postimage / delete), _commit_version, and _commit_timestamp.
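The following sketch collects the three touchpoints: enabling the feed, reading it in batch and streaming form, and deciding which change rows to propagate. Table names are supplied by the caller, and rows_to_apply is a hypothetical pure-Python helper that only illustrates how _change_type is typically interpreted.

```python
# Statement template for turning on Change Data Feed for an existing table
ENABLE_CDF_SQL = (
    "ALTER TABLE {table} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
)


def batch_changes_sql(table, start_version):
    """SQL that reads all changes from start_version onward."""
    return f"SELECT * FROM table_changes('{table}', {int(start_version)})"


def stream_changes(spark, table, start_version):
    """Streaming read of the change feed (needs a real SparkSession)."""
    return (
        spark.readStream.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", start_version)
        .table(table)
    )


def rows_to_apply(changes):
    """Keep the rows that describe the new state of the table.

    update_preimage rows hold the values before an update, so a downstream
    upsert usually ignores them and applies insert, update_postimage,
    and delete rows only.
    """
    return [row for row in changes if row["_change_type"] != "update_preimage"]
```

For example, given one update captured as a preimage and postimage pair, rows_to_apply keeps only the postimage row.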
This domain covers data modeling approaches on the Lakehouse. DEA focuses on conceptual understanding of Medallion Architecture (Bronze/Silver/Gold), but DEP goes further — implementing SCD Type 2, choosing between Star Schema designs, and justifying specific modeling choices.
SCD Type 2 is a pattern used in dimension tables such as customer or product masters, where you want to retain historical state. "Implementing SCD Type 2 in Delta Lake" is a classic DEP scenario, typically shown as a two-step MERGE that updates the end_date of an existing record and INSERTs a new one. Make sure you can accurately read conditional clauses such as WHEN MATCHED AND s.value <> t.value THEN UPDATE SET t.end_date = s.effective_date, t.is_current = false.
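One way to write the two-step version is sketched below. The table dim_customer, the source staged_updates, and the columns customer_id, value, start_date, end_date, and is_current are all assumptions chosen to match the clause quoted above; real dimensions usually track several attributes.

```python
# Step 1: close out the current row when the tracked attribute has changed.
CLOSE_CURRENT_SQL = """
MERGE INTO dim_customer AS t
USING staged_updates AS s
ON t.customer_id = s.customer_id AND t.is_current = true
WHEN MATCHED AND s.value <> t.value THEN
  UPDATE SET t.end_date = s.effective_date, t.is_current = false
"""

# Step 2: insert a new current row for every key without one. After step 1
# that covers both brand-new keys and keys whose old row was just closed.
INSERT_NEW_SQL = """
MERGE INTO dim_customer AS t
USING staged_updates AS s
ON t.customer_id = s.customer_id AND t.is_current = true
WHEN NOT MATCHED THEN
  INSERT (customer_id, value, start_date, end_date, is_current)
  VALUES (s.customer_id, s.value, s.effective_date, NULL, true)
"""


def apply_scd2(spark):
    """Run both steps in order; needs a real SparkSession at runtime."""
    spark.sql(CLOSE_CURRENT_SQL)
    spark.sql(INSERT_NEW_SQL)
```

Two details are worth checking when reading such code on the exam: the is_current = true predicate in the ON clause (without it, historical rows would also match), and the fact that <> does not treat NULL as a changed value. Because the two statements are separate commits, a failure between them leaves a key temporarily without a current row.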
Star Schema (fact + dimensions) is well-suited to accelerating analytical queries and is commonly adopted in the Gold layer. Data Vault uses a three-layer Hub-Link-Satellite structure that is resilient to source-system changes but increases query complexity. DEP asks you to decide which modeling approach fits a given requirement. Keep these decision criteria in mind: "Data Vault when source systems change frequently" and "Star Schema when BI-layer aggregation performance is the top priority".
This domain centers on data governance features in Unity Catalog. The weighting is 10%, but it's a zero-tolerance area where misconfiguration is unacceptable in production, and the complexity of the questions is dramatically higher than on DEA.
Row Filters are applied via ALTER TABLE ... SET ROW FILTER with a specified function, controlling row visibility based on user attributes. Column Masks are applied via ALTER TABLE ... ALTER COLUMN ... SET MASK to mask column values. DEP presents combined scenarios such as "the sales team should only see data from their region, and PII columns must be masked". The evaluation order when Row Filter and Column Mask are both applied (Row Filter first) is also tested.
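For the combined scenario, the statements could look roughly like this. The table sales, the columns region and email, the function names, and the groups admin and pii_readers are hypothetical, and the helper simply runs each statement through a SparkSession.

```python
# Row filter: members of 'admin' see every row, everyone else only US rows.
ROW_FILTER_STATEMENTS = [
    "CREATE OR REPLACE FUNCTION us_only(region STRING) "
    "RETURN IF(is_account_group_member('admin'), true, region = 'US')",
    "ALTER TABLE sales SET ROW FILTER us_only ON (region)",
]

# Column mask: only members of 'pii_readers' see the real email value.
COLUMN_MASK_STATEMENTS = [
    "CREATE OR REPLACE FUNCTION mask_email(email STRING) "
    "RETURN CASE WHEN is_account_group_member('pii_readers') "
    "THEN email ELSE '***' END",
    "ALTER TABLE sales ALTER COLUMN email SET MASK mask_email",
]


def apply_statements(spark, statements):
    """Execute each SQL statement in order (needs a real SparkSession)."""
    for statement in statements:
        spark.sql(statement)
```

The filter function returns a boolean per row, while the mask function returns a value of the same type as the column it protects.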
Dynamic Views are a technique that uses views with conditions like current_user() or is_account_group_member() to dynamically control data access. They predate Unity Catalog's Row Filter, but DEP can ask about scenarios for "migrating existing Dynamic View-based security to Row Filter / Column Mask".
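A dynamic view that enforces both row-level and column-level rules might be defined as in the sketch below. The view name sales_secure, the base table sales, its columns, and the group names are hypothetical.

```python
# Dynamic view: the access logic lives in the view definition itself,
# so users are granted access to the view rather than the base table.
DYNAMIC_VIEW_SQL = """
CREATE OR REPLACE VIEW sales_secure AS
SELECT
  order_id,
  region,
  CASE
    WHEN is_account_group_member('pii_readers') THEN email
    ELSE 'REDACTED'
  END AS email
FROM sales
WHERE is_account_group_member('admin') OR region = 'US'
"""


def create_dynamic_view(spark):
    """Create the view (needs a real SparkSession at runtime)."""
    spark.sql(DYNAMIC_VIEW_SQL)
```

In a migration scenario, the WHERE predicate is the part that would move into a row filter function and the CASE expression into a column mask, after which the rules apply to the table itself instead of to a separate view.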
This domain covers pipeline monitoring and log analysis. It centers on leveraging System Tables and audit logs.
Databricks system tables such as system.billing.usage, system.access.audit, and system.compute.clusters are used for cost analysis, security audits, and performance analysis. DEP asks you to read SQL queries against System Tables, such as "a query that identifies the jobs with the highest DBU consumption over the past 30 days" or "a query that audits a specific user's data access history".
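A query of the first kind could be sketched as follows. The column names (usage_date, usage_quantity, usage_unit, usage_metadata.job_id) reflect the billing usage schema as commonly documented and should be treated as assumptions to verify in your own workspace; the function name is illustrative.

```python
def top_jobs_by_dbu_sql(days=30, limit=10):
    """SQL listing the jobs with the highest DBU consumption in a time window."""
    return f"""
SELECT
  usage_metadata.job_id AS job_id,
  SUM(usage_quantity)   AS total_dbus
FROM system.billing.usage
WHERE usage_date >= date_sub(current_date(), {int(days)})
  AND usage_unit = 'DBU'
  AND usage_metadata.job_id IS NOT NULL   -- keep job-attributed usage only
GROUP BY usage_metadata.job_id
ORDER BY total_dbus DESC
LIMIT {int(limit)}
"""


# Run in a notebook with: spark.sql(top_jobs_by_dbu_sql()).show()
print(top_jobs_by_dbu_sql())
```

When reading such a query on the exam, check the three usual failure points: the date filter, the grouping key, and whether non-job usage was excluded.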
Unity Catalog audit logs are recorded in the system.access.audit table. You're expected to combine action type (action_name), target resource (request_params), and executing user (user_identity) to query "who accessed which table when". Detecting anomalies in streaming jobs (latency spikes, throughput drops) via System Tables is also in scope.
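A "who did what, and when" query over the audit table might be built like this. The columns action_name, request_params, and user_identity come from the text above; event_time, event_date, service_name, and the 'unityCatalog' service value are assumptions about the table's schema, and the function name is hypothetical.

```python
def user_access_history_sql(user_email, days=7):
    """SQL listing a user's recent Unity Catalog audit events, newest first."""
    # Escape single quotes so the email can be embedded as a SQL literal
    safe_email = user_email.replace("'", "''")
    return f"""
SELECT
  event_time,
  user_identity.email AS user_email,
  action_name,
  request_params
FROM system.access.audit
WHERE event_date >= date_sub(current_date(), {int(days)})
  AND service_name = 'unityCatalog'
  AND user_identity.email = '{safe_email}'
ORDER BY event_time DESC
"""


print(user_access_history_sql("analyst@example.com"))
```

To narrow the result to a specific table, you would add a predicate on action_name and on the relevant key inside request_params, which varies by action.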
This domain covers quality assurance and deployment strategy for production pipelines. DEA barely touches it, but DEP gives it 15% — and it's where differences in real-world experience show up most.
Implementing unit and integration tests inside Databricks notebooks is tested. A typical pattern: import helper functions with %run, run transformation logic against test-only temp views, and verify the results. Using Databricks Connect to attach a local IDE to a remote cluster and verifying with test frameworks like pytest is also in scope.
Integrating Databricks with CI/CD tools such as GitHub Actions, Azure DevOps, or GitLab CI is tested. The canonical flow is: "PR merge → CI runs unit tests → Asset Bundle deploys to staging → integration tests → production deploy". DEP asks operations-oriented design questions like "how do you update a streaming pipeline in production without downtime?" or "what's the procedure if you need to roll back?".
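The canonical flow can be expressed as an ordered list of commands that a CI job executes. The sketch below assumes bundle targets named staging and prod, a unit test directory at tests/unit, and an integration test job key passed in by the caller; all of these are hypothetical, and the script only works where the Databricks CLI is installed and authenticated.

```python
import subprocess


def bundle_command(action, target, *extra):
    """Build a 'databricks bundle' CLI invocation for one target."""
    return ["databricks", "bundle", action, "--target", target, *extra]


def release_steps(integration_job_key):
    """Ordered commands: unit tests, staging deploy, integration tests, prod."""
    return [
        ["python", "-m", "pytest", "tests/unit"],              # CI unit tests
        bundle_command("validate", "staging"),                 # config check
        bundle_command("deploy", "staging"),                   # deploy to staging
        bundle_command("run", "staging", integration_job_key), # integration tests
        bundle_command("deploy", "prod"),                      # production deploy
    ]


def run_release(integration_job_key):
    """Run every step; check=True stops the release at the first failure."""
    for command in release_steps(integration_job_key):
        subprocess.run(command, check=True)
```

Stopping at the first failed step is what keeps a broken build from reaching production. A rollback in this model is typically a redeploy of the last known-good commit through the same steps.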
For engineers who already hold DEA and are targeting DEP, we recommend a structured 3-4 month plan. The approach builds on DEA knowledge while layering on the deep implementation knowledge unique to DEP.
The defining feature of DEP is long question text. Each question describes a scenario in 200-300 words of English (the Japanese version has comparable length), and you have to extract "what's actually being asked" from within it. The following 4-step approach answers them efficiently.
Read the "Which of the following ..." or "What should the engineer do ..." line at the end of the scenario first. Knowing what's being asked before you read the scenario lets you skip unnecessary information and cuts reading time by 30-40%.
Keywords in the scenario like "must", "should not", "minimum cost", "without downtime", and "exactly-once" are decisive for choosing the right answer. Write them down on the scratch paper (a whiteboard/paper is provided during the exam). In many cases, the correct answer and the runner-up are separated by a single point: "does this satisfy the constraint or not?".
Of the 4 options, typically 2 are clearly wrong (non-existent APIs, opposite behavior, etc.). For the remaining 2, judge by matching against the constraints. DEP often comes down to "both would work, but which most efficiently satisfies the requirements?", so train yourself to evaluate along three axes: cost, performance, and operability.
60 questions in 120 minutes means 2 minutes per question. If you've spent 2 and a half minutes and still aren't sure, flag the question and move on. The DEP exam system lets you review only the flagged questions after answering all of them. Concentrating your remaining time on flagged questions is an effective strategy for reaching the 70% pass line.
Data Processing
Question 1
A data engineer operates a streaming pipeline that uses Auto Loader to ingest CSV files arriving in cloud storage in real time. The file volume is about 500,000 per day and growing, and recently a multi-minute delay has appeared at Auto Loader startup. The team wants to minimize ingestion delay while preserving the existing checkpoint. Which is the most appropriate action?
Correct answer: A
When file volume is large (500,000/day and growing), Directory Listing mode is slow to enumerate files in storage, causing startup delays. Switching to File Notification mode delivers file-arrival events via a cloud event service (AWS SQS / Azure Event Grid, etc.), eliminating the need to list the entire directory and resolving the delay. The checkpoint is preserved across the mode switch. B violates the "real-time ingestion" requirement. C only caps the file processing count and doesn't address the root cause (file-detection latency). D violates the checkpoint-preservation requirement and incurs unnecessary cost and downtime by reprocessing every file.
Can I take DEP (Professional) without first passing DEA (Associate)?
Yes. Databricks does not require Associate certification as a prerequisite for the Professional exam. However, DEP is a higher-level exam that builds on DEA knowledge. Without a solid grasp of Delta Lake, Structured Streaming, and Unity Catalog fundamentals, you'll burn time just parsing the question text. In practice, treat DEA-level knowledge as mandatory.
Are all DEP questions long-scenario format, or are there short knowledge questions too?
Most are long-scenario format, but not all. Roughly 70-80% start with something like "You are a data engineer building a pipeline that must satisfy the following requirements...", while the remainder are shorter questions that directly test API arguments or SQL syntax. Scenario questions are text-heavy, so reading speed and the ability to organize requirements often decide pass/fail.
Do I need a real Databricks environment to study for DEP? Is Community Edition enough?
Community Edition is not enough. Frequently tested features such as Jobs Orchestration, Multi-Task Workflow, Asset Bundles, Unity Catalog security features, and System Tables are only available in paid workspaces. We strongly recommend using the 14-day free trial or your company's workspace to learn hands-on.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.