Databricks Data Engineer Professional: Complete Guide (2026)

Databricks Certified Data Engineer Professional (DEP) is Databricks' advanced data engineering certification. While the Associate exam (DEA) asks "do you understand each feature?", DEP asks "can you design, implement, and operate pipelines that meet complex requirements?". Nearly every question is grounded in a real-world scenario, requiring deep understanding of Delta Lake, Structured Streaming, Unity Catalog, and Jobs Orchestration, plus production troubleshooting, security design, and CI/CD.

This article systematically covers the DEP exam specification, the 6 exam domains and their advanced topics, the differences from DEA, a study roadmap for engineers coming from DEA, and strategies for tackling long-scenario questions.

Exam Overview

The DEP exam specification is as follows. Like DEA, there are 60 questions, but the time limit is expanded to 120 minutes, which works out to 2 minutes per question. The design accounts for the time needed to parse long scenarios.

Item	Details
Number of questions	60 questions
Time limit	120 minutes
Passing score	70% (42+ correct)
Exam fee	$200 (excl. tax)
Language	English and Japanese
Question format	Single choice / multiple choice (scenario-based)
Prerequisites	None (DEA cert not required, but the knowledge is)
Validity period	2 years
Delivery format	Online proctored / test center

Differences from DEA (Associate)

DEA and DEP belong to the same "Data Engineer" track, but the required skill level is fundamentally different. DEA is at the "you know what each feature does" level; DEP is at the "you can combine multiple features to make design decisions that satisfy requirements" level.

Comparison	DEA (Associate)	DEP (Professional)
Number of questions	45 questions	60 questions
Time limit	90 minutes	120 minutes
Passing score	70%	70%
Exam fee	$200	$200
Difficulty	Beginner to intermediate	Intermediate to advanced
Question style	Mostly short questions directly testing feature understanding	Mostly long-form questions asking for design decisions based on real-world scenarios
Expected experience	6+ months of Databricks usage	Databricks + 2+ years of data engineering experience
Code reading	Read basic PySpark/SQL syntax	Read implementation code (foreachBatch, DLT pipelines, Jobs API, etc.) and judge correctness

Exam Domains and Weightings

DEP draws from 6 domains. Data Processing is the largest at 25%, followed by Databricks Tooling and Data Modeling at 20% each and Testing and Deployment at 15%. Security and Governance and Monitoring and Logging are each 10%, roughly 6 questions apiece. Because the passing line is 42 of 60, giving up either of these small domains uses up a third of the 18 questions you can afford to miss, so you can't skip them.

Domain	Weight	Approx. Questions
1. Databricks Tooling	20%	~12 questions
2. Data Processing	25%	~15 questions
3. Data Modeling	20%	~12 questions
4. Security and Governance	10%	~6 questions
5. Monitoring and Logging	10%	~6 questions
6. Testing and Deployment	15%	~9 questions
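
The approximate question counts in the table follow directly from the weights and the 60-question total. The short sketch below reproduces that arithmetic along with the 70% passing threshold; the variable and function names are illustrative only.

```python
# Domain weights as listed in the table above (percent of the exam).
DOMAIN_WEIGHTS = {
    "Databricks Tooling": 20,
    "Data Processing": 25,
    "Data Modeling": 20,
    "Security and Governance": 10,
    "Monitoring and Logging": 10,
    "Testing and Deployment": 15,
}
TOTAL_QUESTIONS = 60
PASSING_PERCENT = 70


def approx_questions(weight_percent: int, total: int = TOTAL_QUESTIONS) -> int:
    """Approximate number of questions for a domain with the given weight."""
    return round(total * weight_percent / 100)


questions_by_domain = {
    name: approx_questions(weight) for name, weight in DOMAIN_WEIGHTS.items()
}

# Smallest whole number of correct answers that reaches the passing percentage.
passing_correct = (TOTAL_QUESTIONS * PASSING_PERCENT + 99) // 100

for name, count in questions_by_domain.items():
    print(f"{name}: ~{count} questions")
print(f"Correct answers needed to pass: {passing_correct}")
```

The counts are approximations: the actual number of questions per domain on a given exam form may differ slightly from the weighted estimate.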

Domain 1: Databricks Tooling (20%)

This domain tests deep understanding of Databricks' development and operations toolchain. DEA tests basic concepts of Workflows and Delta Live Tables; DEP requires implementation-level knowledge.

Asset Bundles (DABs)

Databricks Asset Bundles (DABs) let you define jobs, pipelines, notebooks, and config files as YAML templates and deploy them to a workspace via the databricks bundle deploy command. On the exam, this comes up in scenarios like "the most efficient way to deploy a pipeline across dev, staging, and prod environments". You need to understand how to override cluster settings and catalog names per target environment within a bundle definition.
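
To make the idea of per-target overrides concrete, here is a toy model in Python. It is not how the Databricks CLI resolves a bundle: the real configuration lives in a YAML file and has a richer schema, and the bundle name, variable names, and values below are all hypothetical. The sketch only illustrates the principle that a target starts from shared defaults and replaces the settings it overrides.

```python
from copy import deepcopy

# Hypothetical bundle settings expressed as a Python dict. A real bundle keeps
# comparable information in its YAML definition; this shape is simplified.
BUNDLE = {
    "bundle": {"name": "orders_pipeline"},
    "variables": {"catalog": "dev_catalog", "num_workers": 1},
    "targets": {
        "dev": {"mode": "development"},
        "staging": {"variables": {"catalog": "staging_catalog", "num_workers": 2}},
        "prod": {
            "mode": "production",
            "variables": {"catalog": "prod_catalog", "num_workers": 8},
        },
    },
}


def resolve_variables(bundle: dict, target: str) -> dict:
    """Return the shared variable defaults with the target's overrides applied."""
    resolved = deepcopy(bundle["variables"])
    resolved.update(bundle["targets"][target].get("variables", {}))
    return resolved


for target_name in BUNDLE["targets"]:
    print(target_name, resolve_variables(BUNDLE, target_name))
```

In an exam scenario, the point to look for is the same: one shared definition, with only the catalog name and cluster sizing varying by target.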

Jobs API and Multi-Task Workflow

Creating, running, and monitoring jobs via the REST API is a staple DEP topic. In particular, expect questions on defining dependencies across tasks in a Multi-Task Workflow, failure retry policies (max_retries, retry_on_timeout), and passing parameters between tasks (task values). Be sure you know how to use dbutils.jobs.taskValues.set() and dbutils.jobs.taskValues.get().
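
A minimal sketch of these pieces follows, assuming a hypothetical two-task job (the task keys, notebook paths, and the row_count key are invented for illustration). The dictionary mirrors the general shape of a Jobs API job definition with depends_on, max_retries, and retry_on_timeout. Because dbutils exists only inside Databricks, the two helper functions take it as an argument, and a small in-memory stand-in is included so the example runs locally.

```python
from types import SimpleNamespace

# Hypothetical job definition: 'transform' runs only after 'ingest' succeeds.
JOB_SPEC = {
    "name": "orders_pipeline",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Pipelines/ingest"},
            "max_retries": 2,
            "retry_on_timeout": True,
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "notebook_task": {"notebook_path": "/Pipelines/transform"},
        },
    ],
}


def publish_row_count(dbutils, row_count: int) -> None:
    """Run inside the 'ingest' task: expose a value to downstream tasks."""
    dbutils.jobs.taskValues.set(key="row_count", value=row_count)


def read_row_count(dbutils) -> int:
    """Run inside the 'transform' task: read the value set by 'ingest'."""
    return dbutils.jobs.taskValues.get(
        taskKey="ingest", key="row_count", default=0, debugValue=0
    )


def make_fake_dbutils():
    """Local stand-in for dbutils (an assumption for this sketch only).

    It stores values in a dict and ignores taskKey and debugValue.
    """
    store = {}

    def set_value(key, value):
        store[key] = value

    def get_value(taskKey, key, default=None, debugValue=None):
        return store.get(key, default)

    task_values = SimpleNamespace(set=set_value, get=get_value)
    return SimpleNamespace(jobs=SimpleNamespace(taskValues=task_values))


fake = make_fake_dbutils()
publish_row_count(fake, 1250)
print(read_row_count(fake))
```

In a real notebook task you would pass the built-in dbutils object instead of the stand-in. The debugValue argument is what get() returns when the notebook is run interactively outside a job.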

Repos (Git integration) and CI/CD

Databricks Repos clones Git repositories into your workspace and enables branch management and pull-request workflows. On DEP, you'll see operational design questions like "deployment flow into production workspaces" or "how do you prevent developers from pushing directly to main?". Automating deployment from a CI/CD pipeline (such as GitHub Actions) via the Repos API is also in scope.

Domain 2: Data Processing (25%)

The highest-weighted domain, focused on advanced use of Structured Streaming, Delta Lake, and Auto Loader. It goes a step beyond DEA's "read with readStream, write with writeStream" — you need to design pipelines that include complex transformations and error handling.

foreachBatch + MERGE Pattern

For upserting streaming data into Delta tables, the standard pattern is to use foreachBatch to run MERGE INTO per micro-batch. The exam often shows a code snippet of this pattern and asks "what's wrong with this code?" or "what should be added to guarantee idempotency?". You'll need to understand deduplication using batch IDs and the precise semantics of the WHEN MATCHED / WHEN NOT MATCHED conditions in MERGE.
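
The pattern can be sketched roughly as follows. The table name main.sales.orders, the key column order_id, and the view name updates are hypothetical, and the functions only define the pipeline; nothing runs until you call start_upsert_stream with a real streaming DataFrame on a cluster.

```python
def build_merge_sql(target: str, source_view: str, key: str) -> str:
    """Build a MERGE that updates rows with a matching key and inserts the rest."""
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source_view} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def upsert_batch(micro_batch_df, batch_id: int) -> None:
    """foreachBatch handler: deduplicate the micro-batch, then MERGE it."""
    # MERGE fails when several source rows match the same target row,
    # so keep a single row per key. dropDuplicates keeps an arbitrary row;
    # a production pipeline would usually pick the latest one explicitly.
    deduped = micro_batch_df.dropDuplicates(["order_id"])
    deduped.createOrReplaceTempView("updates")
    # Use the micro-batch's own session so the temp view is visible to the SQL.
    deduped.sparkSession.sql(
        build_merge_sql("main.sales.orders", "updates", "order_id")
    )


def start_upsert_stream(source_df, checkpoint_path: str):
    """Attach the handler to a streaming DataFrame and start the query."""
    return (
        source_df.writeStream
        .foreachBatch(upsert_batch)
        .option("checkpointLocation", checkpoint_path)
        .start()
    )


print(build_merge_sql("main.sales.orders", "updates", "order_id"))
```

Because the MERGE is keyed on order_id, replaying the same micro-batch after a failure leaves the target in the same state, which is the property idempotency questions usually probe. This sketch does not use batch_id; designs that append rather than merge typically need it to detect replays.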

Auto Loader: Directory Listing and File Notification

Auto Loader has two file-detection modes. Directory Listing periodically lists the cloud storage directory to detect new files. File Notification uses cloud event services (AWS SQS / Azure Event Grid / GCS Notifications) to receive file-arrival events. DEP asks questions like "Auto Loader performance is degrading in a directory with millions of accumulated files — what's the fix?", expecting you to switch to File Notification mode. You need to understand the cloudFiles.useNotifications = true setting plus the required cloud permissions (such as the right to create SQS queues).
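
As a rough sketch, switching modes comes down to one option on the cloudFiles source. The helper names, the CSV format choice, and the paths are assumptions for illustration, and the cloud-side permissions still have to be configured separately.

```python
def auto_loader_options(schema_location: str, use_notifications: bool) -> dict:
    """Build Auto Loader options for the chosen file-detection mode."""
    return {
        "cloudFiles.format": "csv",
        "cloudFiles.schemaLocation": schema_location,
        # "false" = Directory Listing, "true" = File Notification.
        "cloudFiles.useNotifications": "true" if use_notifications else "false",
    }


def read_with_auto_loader(spark, source_path: str, options: dict):
    """Return a streaming DataFrame that ingests new files from source_path."""
    return (
        spark.readStream
        .format("cloudFiles")
        .options(**options)
        .load(source_path)
    )


listing = auto_loader_options("/tmp/schemas/orders", use_notifications=False)
notification = auto_loader_options("/tmp/schemas/orders", use_notifications=True)
print(listing["cloudFiles.useNotifications"], notification["cloudFiles.useNotifications"])
```

Only auto_loader_options executes here; read_with_auto_loader needs an active Spark session on Databricks.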

Change Data Feed (CDF)

Enabling Change Data Feed on a Delta Table lets downstream consumers read the INSERT/UPDATE/DELETE history. You retrieve change data with the table_changes() SQL function or by setting readStream.option("readChangeFeed", "true") on a streaming read of the table, then apply it downstream. Memorize precisely that CDF records include _change_type (insert / update_preimage / update_postimage / delete), _commit_version, and _commit_timestamp.
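
The following sketch puts the pieces side by side, assuming a hypothetical table main.sales.orders. The SQL strings and the streaming reader need a Databricks session to run; the last function is plain Python and shows a common downstream step, dropping update_preimage rows, on hand-written sample records.

```python
# Enable CDF on an existing table (table name is hypothetical).
ENABLE_CDF_SQL = (
    "ALTER TABLE main.sales.orders "
    "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
)


def table_changes_sql(table: str, start_version: int) -> str:
    """Batch read of changes from start_version onward."""
    return f"SELECT * FROM table_changes('{table}', {start_version})"


def read_change_stream(spark, table: str, start_version: int):
    """Streaming read of the change feed (requires a Spark session)."""
    return (
        spark.readStream
        .option("readChangeFeed", "true")
        .option("startingVersion", start_version)
        .table(table)
    )


def drop_preimages(changes: list) -> list:
    """Keep inserts, post-update values, and deletes; drop pre-update values."""
    return [row for row in changes if row["_change_type"] != "update_preimage"]


# Hand-written sample of what CDF rows look like for one key.
sample_changes = [
    {"order_id": 1, "status": "new", "_change_type": "insert", "_commit_version": 1},
    {"order_id": 1, "status": "new", "_change_type": "update_preimage", "_commit_version": 2},
    {"order_id": 1, "status": "paid", "_change_type": "update_postimage", "_commit_version": 2},
    {"order_id": 1, "status": "paid", "_change_type": "delete", "_commit_version": 3},
]

for row in drop_preimages(sample_changes):
    print(row["_commit_version"], row["_change_type"], row["status"])
```

Note that an update produces two records in the same commit version, which is why exam snippets that count change rows without filtering on _change_type are often the "wrong" option.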

Domain 3: Data Modeling (20%)

This domain covers data modeling approaches on the Lakehouse. DEA focuses on conceptual understanding of Medallion Architecture (Bronze/Silver/Gold), but DEP goes further — implementing SCD Type 2, choosing between Star Schema designs, and justifying specific modeling choices.

SCD Type 2 (Slowly Changing Dimensions)

SCD Type 2 is a pattern used in dimension tables, such as customer masters or product masters, where you want to retain historical state. "Implementing SCD Type 2 in Delta Lake" is a classic DEP scenario, typically shown as a two-step MERGE that updates the end_date of an existing record and INSERTs a new one. Make sure you can accurately read conditional clauses such as WHEN MATCHED AND s.value <> t.value THEN UPDATE SET t.end_date = s.effective_date, t.is_current = false.
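
To trace which branch fires for which row, it can help to mirror the two steps in plain Python. The sketch below is a simulation on lists of dictionaries, not Delta Lake code; the customer_id and start_date columns and the sample data are assumptions added for illustration, while value, end_date, is_current, and effective_date follow the clause quoted above.

```python
from datetime import date


def apply_scd2(dimension: list, updates: list) -> list:
    """Return a new dimension after applying updates with SCD Type 2 rules."""
    result = [dict(row) for row in dimension]  # leave the input untouched
    current = {row["customer_id"]: row for row in result if row["is_current"]}

    for s in updates:
        t = current.get(s["customer_id"])
        if t is not None and t["value"] == s["value"]:
            continue  # unchanged value: no branch fires

        if t is not None:
            # Step 1, WHEN MATCHED AND s.value <> t.value: close the current row.
            t["end_date"] = s["effective_date"]
            t["is_current"] = False

        # Step 2: insert the new version (also covers brand-new keys).
        new_row = {
            "customer_id": s["customer_id"],
            "value": s["value"],
            "start_date": s["effective_date"],
            "end_date": None,
            "is_current": True,
        }
        result.append(new_row)
        current[s["customer_id"]] = new_row

    return result


dimension = [
    {"customer_id": 1, "value": "Tokyo", "start_date": date(2024, 1, 1),
     "end_date": None, "is_current": True},
]
updates = [
    {"customer_id": 1, "value": "Osaka", "effective_date": date(2025, 6, 1)},
    {"customer_id": 2, "value": "Nagoya", "effective_date": date(2025, 6, 1)},
]

for row in apply_scd2(dimension, updates):
    print(row)
```

Customer 1 ends up with a closed historical row and a new current row, while customer 2 gets a single current row. The same outcome is what a correct two-step MERGE should produce.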

Star Schema vs Data Vault

Star Schema (fact + dimensions) is well-suited to accelerating analytical queries and is commonly adopted in the Gold layer. Data Vault uses a three-layer Hub-Link-Satellite structure that is resilient to source-system changes but increases query complexity. DEP asks you to decide which modeling approach fits a given requirement. Keep these decision criteria in mind: "Data Vault when source systems change frequently" and "Star Schema when BI-layer aggregation performance is the top priority".

Domain 4: Security and Governance (10%)

This domain centers on data governance features in Unity Catalog. The weighting is 10%, but it's a zero-tolerance area where misconfiguration is unacceptable in production, and the complexity of the questions is dramatically higher than on DEA.

Row-Level Security and Column-Level Security

Row Filters are applied via ALTER TABLE ... SET ROW FILTER with a specified function, controlling row visibility based on user attributes. Column Masks are applied via ALTER TABLE ... ALTER COLUMN ... SET MASK to mask column values. DEP presents combined scenarios such as "the sales team should only see data from their region, and PII columns must be masked". The evaluation order when Row Filter and Column Mask are both applied (Row Filter first) is also tested.
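
A sketch of the combined scenario, assuming a hypothetical table main.sales.orders with region and email columns and hypothetical groups admins and pii_readers. The statements are held as strings so the example can be read and checked locally; applying them requires a Unity Catalog-enabled workspace and sufficient privileges.

```python
GOVERNANCE_SQL = [
    # Row filter function: admins see every row, everyone else only 'APAC' rows.
    """CREATE OR REPLACE FUNCTION main.sales.region_filter(region STRING)
RETURN IF(is_account_group_member('admins'), true, region = 'APAC')""",
    # Attach the filter to the table, passing the region column to the function.
    "ALTER TABLE main.sales.orders SET ROW FILTER main.sales.region_filter ON (region)",
    # Mask function: only members of pii_readers see the real email address.
    """CREATE OR REPLACE FUNCTION main.sales.mask_email(email STRING)
RETURN CASE WHEN is_account_group_member('pii_readers') THEN email ELSE '***' END""",
    # Attach the mask to the email column.
    "ALTER TABLE main.sales.orders ALTER COLUMN email SET MASK main.sales.mask_email",
]


def apply_governance(spark) -> None:
    """Run each statement in order (requires a Databricks Spark session)."""
    for statement in GOVERNANCE_SQL:
        spark.sql(statement)


for statement in GOVERNANCE_SQL:
    print(statement.splitlines()[0])
```

A per-user variant would typically compare a column against current_user() or look the user up in a mapping table inside the filter function.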

Dynamic Views

Dynamic Views are a technique that uses views with conditions like current_user() or is_account_group_member() to dynamically control data access. The technique predates Unity Catalog's Row Filter, but DEP can ask about scenarios for "migrating existing Dynamic View-based security to Row Filter / Column Mask".
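
For comparison, a dynamic view puts the same kind of conditions inside the view definition itself. The sketch below assumes a hypothetical source table main.sales.orders and hypothetical groups admins and pii_readers; the statement is kept as a string and only executes when passed to a Spark session.

```python
DYNAMIC_VIEW_SQL = """
CREATE OR REPLACE VIEW main.sales.orders_secure AS
SELECT
  order_id,
  region,
  CASE WHEN is_account_group_member('pii_readers') THEN email ELSE '***' END AS email
FROM main.sales.orders
WHERE is_account_group_member('admins') OR region = 'APAC'
""".strip()


def create_dynamic_view(spark) -> None:
    """Create the secured view (requires a Databricks Spark session)."""
    spark.sql(DYNAMIC_VIEW_SQL)


print(DYNAMIC_VIEW_SQL)
```

The design difference worth noting for migration questions: with a dynamic view, users must query the view rather than the base table, whereas filters and masks are attached to the table itself.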

Domain 5: Monitoring and Logging (10%)

This domain covers pipeline monitoring and log analysis. It centers on leveraging System Tables and audit logs.

System Tables

Databricks system tables such as system.billing.usage, system.access.audit, and system.compute.clusters are used for cost analysis, security audits, and performance analysis. DEP asks you to read SQL queries against System Tables, such as "a query that identifies the jobs with the highest DBU consumption over the past 30 days" or "a query that audits a specific user's data access history".
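
A query of the first kind might look like the sketch below. It assumes the usage_metadata.job_id, usage_quantity, usage_unit, and usage_date columns of system.billing.usage; check the column list in your workspace before relying on it, since system table schemas evolve. The helper name and defaults are illustrative.

```python
def top_jobs_by_dbu_sql(days: int = 30, limit: int = 10) -> str:
    """Build a query ranking jobs by DBU consumption over the last `days` days."""
    return f"""
SELECT
  usage_metadata.job_id AS job_id,
  SUM(usage_quantity) AS total_dbus
FROM system.billing.usage
WHERE usage_metadata.job_id IS NOT NULL
  AND usage_unit = 'DBU'
  AND usage_date >= date_sub(current_date(), {days})
GROUP BY usage_metadata.job_id
ORDER BY total_dbus DESC
LIMIT {limit}
""".strip()


print(top_jobs_by_dbu_sql())
```

When reading such queries on the exam, check three things: the filter that restricts rows to job usage, the date window, and whether the aggregation is grouped by the same expression that is selected.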

Audit Log Analysis

Unity Catalog audit logs are recorded in the system.access.audit table. You're expected to combine action type (action_name), target resource (request_params), and executing user (user_identity) to query "who accessed which table when". Detecting anomalies in streaming jobs (latency spikes, throughput drops) via System Tables is also in scope.
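
A "who did what, when" query can be sketched as follows. It assumes the event_time, event_date, service_name, action_name, request_params, and user_identity.email columns of system.access.audit, and it uses named parameter markers, which need a reasonably recent Spark runtime. The function name and the 30-day default are illustrative.

```python
AUDIT_SQL = """
SELECT
  event_time,
  user_identity.email AS user_email,
  action_name,
  request_params
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND user_identity.email = :user_email
  AND event_date >= date_sub(current_date(), :days)
ORDER BY event_time DESC
""".strip()


def audit_user_access(spark, user_email: str, days: int = 30):
    """Return a DataFrame of one user's Unity Catalog events (needs a Spark session)."""
    return spark.sql(AUDIT_SQL, args={"user_email": user_email, "days": days})


print(AUDIT_SQL)
```

The target table or other resource is found inside request_params, and the relevant key depends on the action, so inspect a few rows for the action_name you care about before filtering on it.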

Domain 6: Testing and Deployment (15%)

This domain covers quality assurance and deployment strategy for production pipelines. DEA barely touches it, but DEP gives it 15% — and it's where differences in real-world experience show up most.

Notebook Testing

Implementing unit and integration tests inside Databricks notebooks is tested. A typical pattern: import helper functions with %run, run transformation logic against test-only temp views, and verify the results. Using Databricks Connect to attach a local IDE to a remote cluster and verifying with test frameworks like pytest is also in scope.

CI/CD Pipelines

Integrating Databricks with CI/CD tools such as GitHub Actions, Azure DevOps, or GitLab CI is tested. The canonical flow is: "PR merge → CI runs unit tests → Asset Bundle deploys to staging → integration tests → production deploy". DEP asks operations-oriented design questions like "how do you update a streaming pipeline in production without downtime?" or "what's the procedure if you need to roll back?".

Study Roadmap from Post-DEA to DEP

For engineers who already hold DEA and are targeting DEP, we recommend a structured 3-4 month plan. The approach builds on DEA knowledge while layering on the deep implementation knowledge unique to DEP.

Month 1: Build a Foundation in Tooling & Processing

Read the official Databricks Exam Guide (DEP version) carefully and lock in the scope of all 6 domains
Build a real Multi-Task Workflow and experience task dependencies, retries, and parameter passing first-hand
Actually run Auto Loader in both Directory Listing mode and File Notification mode
Implement streaming upserts with the foreachBatch + MERGE pattern

Month 2: Deep Dive into Modeling & Security

Implement SCD Type 2 in Delta Lake (to the point where you can write the MERGE branching by hand)
Compare Star Schema vs Data Vault and summarize which requirement patterns suit each
Set up Row Filter and Column Mask in Unity Catalog and verify dynamic control via current_user()
Create Dynamic Views and understand how they differ from Row Filter, along with the migration patterns

Month 3: Monitoring & Testing + Cross-Domain Practice

Write analytical queries against System Tables (billing.usage, access.audit, compute.clusters)
Practice deploying to dev/staging/prod environments with Asset Bundles (DABs)
Build a CI/CD pipeline flow (PR → test → deploy) with GitHub Actions
Start solving mock and practice questions to identify your weak domains

Month 4 (Finishing): Shore Up Weak Spots + Time-Management Training

Take full-length mock exams under the 120-minute limit to internalize your pacing
Classify wrong answers by domain and concentrate your study on the weak domains
Repeatedly practice reading patterns for long-scenario questions (see the strategy section below)
Check the official documentation release notes for newly added or changed features

How to Tackle Long-Scenario Questions

The defining feature of DEP is long question text. Each question describes a scenario in 200-300 words of English (the Japanese version has comparable length), and you have to extract "what's actually being asked" from within it. The following 4-step approach answers them efficiently.

Step 1: Read the Final Question First

Read the "Which of the following ..." or "What should the engineer do ..." line at the end of the scenario first. Knowing what's being asked before you read the scenario lets you skip unnecessary information and cuts reading time by 30-40%.

Step 2: Mark the Constraints

Keywords in the scenario like "must", "should not", "minimum cost", "without downtime", and "exactly-once" are decisive for choosing the right answer. Write them down on the scratch paper (a whiteboard/paper is provided during the exam). In many cases, the correct answer and the runner-up are separated by a single point: "does this satisfy the constraint or not?".

Step 3: Narrow Down by Elimination

Of the 4 options, typically 2 are clearly wrong (non-existent APIs, opposite behavior, etc.). For the remaining 2, judge by matching against the constraints. DEP often comes down to "both would work, but which most efficiently satisfies the requirements?", so train yourself to evaluate along three axes: cost, performance, and operability.

Step 4: When in Doubt, Flag and Move On

60 questions in 120 minutes means 2 minutes per question. If you've spent 2 and a half minutes and still aren't sure, flag the question and move on. The DEP exam system lets you review only the flagged questions after answering all of them. Concentrating your remaining time on flagged questions is an effective strategy for reaching the 70% pass line.

Try a Sample Question

Data Processing

Question 1

A data engineer operates a streaming pipeline that uses Auto Loader to ingest CSV files arriving in cloud storage in real time. The file volume is about 500,000 per day and growing, and recently a multi-minute delay has appeared at Auto Loader startup. The team wants to minimize ingestion delay while preserving the existing checkpoint. Which is the most appropriate action?

A. Switch Auto Loader's file detection mode from Directory Listing to File Notification (cloudFiles.useNotifications = true) and configure a cloud event service
B. Stop Auto Loader and switch to batch ingestion using the COPY INTO command
C. Set Auto Loader's maxFilesPerTrigger to 1 to limit the number of files processed per trigger
D. Delete the checkpoint, start a new stream, and reprocess all files

Correct answer: A

When file volume is large (500,000/day and growing), Directory Listing mode is slow to enumerate files in storage, causing startup delays. Switching to File Notification mode delivers file-arrival events via a cloud event service (AWS SQS / Azure Event Grid, etc.), eliminating the need to list the entire directory and resolving the delay. The checkpoint is preserved across the mode switch. B violates the "real-time ingestion" requirement. C only caps the file processing count and doesn't address the root cause (file-detection latency). D violates the checkpoint-preservation requirement and incurs unnecessary cost and downtime by reprocessing every file.

Frequently Asked Questions

Can I take DEP (Professional) without first passing DEA (Associate)?

Yes. Databricks does not require Associate certification as a prerequisite for the Professional exam. However, DEP is a higher-level exam that builds on DEA knowledge. Without a solid grasp of Delta Lake, Structured Streaming, and Unity Catalog fundamentals, you'll burn time just parsing the question text. In practice, treat DEA-level knowledge as mandatory.

Are all DEP questions long-scenario format, or are there short knowledge questions too?

Most are long-scenario format, but not all. Roughly 70-80% start with something like "You are a data engineer building a pipeline that must satisfy the following requirements...", while the remainder are shorter questions that directly test API arguments or SQL syntax. Scenario questions are text-heavy, so reading speed and the ability to organize requirements often decide pass/fail.

Do I need a real Databricks environment to study for DEP? Is Community Edition enough?

Community Edition is not enough. Frequently tested features such as Jobs Orchestration, Multi-Task Workflow, Asset Bundles, Unity Catalog security features, and System Tables are only available in paid workspaces. We strongly recommend using the 14-day free trial or your company's workspace to learn hands-on.


Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
