The Databricks Apache Spark Developer Associate exam tests practical Apache Spark coding skills, spanning Spark's core architecture, the DataFrame API, Spark SQL, and Structured Streaming. This article lays out the 7 exam domains and their weighting, explains the characteristics of code questions and how to attack them, and includes 3 practice questions from representative domains.
The Spark Developer Associate exam is 45 questions in 120 minutes, drawn from the 7 domains below. The fact that code-reading questions make up 30-35% of the exam is the biggest difference from other Databricks exams.
| Domain | Weight | Key Topics |
|---|---|---|
| DataFrame API & Spark SQL | ~25% | select, filter, groupBy, join, withColumn, window functions |
| Spark Architecture | ~15% | Driver/Executor, jobs/stages/tasks, Narrow/Wide Transformations |
| Catalyst & Tungsten | ~12% | Logical plan, physical plan, predicate pushdown, column pruning, AQE |
| Data Read/Write | ~13% | DataFrameReader/Writer, Parquet/CSV/JSON, schema inference, partitioning |
| Structured Streaming | ~13% | readStream/writeStream, output modes, watermarks, triggers |
| Performance Tuning | ~12% | Caching, broadcast joins, partition counts, data skew mitigation |
| Delta Lake Fundamentals | ~10% | ACID transactions, Time Travel, schema evolution, basic DML |
On the Spark Developer exam, 14-16 of the 45 questions are code questions. That is a higher share than on other Databricks exams such as Data Engineer Associate (DEA) and Machine Learning Associate (MLA), where code questions make up about 15-20%, so precise API knowledge and the ability to visualize execution results decide pass or fail.
There are 3 keys to reliably getting code questions right.
The most common pattern is reading a transformation chain like select → filter → groupBy → agg → orderBy and predicting the final DataFrame's schema and row count. Pay particular attention to the row count after groupBy (the number of unique keys) and the result column names produced by the aggregation functions (count, sum, avg, max, min) passed to agg.
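To make the row-count rule concrete, here is a plain-Python model (not Spark code) of a filter → groupBy → agg chain. The sample rows and the `group_and_aggregate` helper are made up for illustration. The output keys mimic the default names PySpark gives unaliased aggregates, such as `sum(salary)` and `avg(salary)`.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a DataFrame.
rows = [
    {"name": "Alice", "dept": "Sales", "salary": 5000},
    {"name": "Bob", "dept": "Sales", "salary": 6000},
    {"name": "Carol", "dept": "Eng", "salary": 7000},
    {"name": "Dave", "dept": "Eng", "salary": 7000},
    {"name": "Erin", "dept": "Ops", "salary": 3000},
]

def group_and_aggregate(rows, key, value):
    """Model of df.groupBy(key).agg(sum(value), avg(value))."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    # One output row per unique key; unaliased aggregates are named
    # after the function call, e.g. "sum(salary)".
    return [
        {key: k, f"sum({value})": sum(v), f"avg({value})": sum(v) / len(v)}
        for k, v in groups.items()
    ]

# filter(salary >= 5000) drops Erin, so the "Ops" key disappears as well.
filtered = [r for r in rows if r["salary"] >= 5000]
result = sorted(group_and_aggregate(filtered, "dept", "salary"),
                key=lambda r: r["dept"])

print(len(result))              # row count = unique dept values left after the filter
print(list(result[0].keys()))   # output schema: grouping key, then one column per aggregate
```

The point to carry into the exam is that a filter placed before groupBy can change the number of groups, while the input row count does not determine the output row count.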
Questions ask about the representative optimization rules Catalyst applies. The 4 big ones are predicate pushdown (apply filter conditions at data-source read time), column pruning (skip reading unneeded columns), constant folding (evaluate constant expressions once during query optimization instead of once per row), and join optimization (reordering joins and choosing a join strategy, such as broadcasting a small table).
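Constant folding is the easiest of these rules to picture in code. The following is a toy illustration in plain Python, not Catalyst's implementation. The tuple-based expression tree and the `fold_constants` function are hypothetical names chosen for this sketch.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold_constants(expr):
    """Replace sub-expressions made only of literals with their computed value.

    An expression is a numeric literal, a column name (str), or a tuple
    of the form (op, left, right).
    """
    if not isinstance(expr, tuple):
        return expr  # literal or column reference: nothing to fold
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)  # both sides are literals: evaluate now
    return (op, left, right)

# salary * (1 + 2): the literal sub-expression is evaluated once up front,
# so the per-row work becomes salary * 3.
expr = ("*", "salary", ("+", 1, 2))
folded = fold_constants(expr)
print(folded)
```

In a real plan, the same effect shows up as a simplified expression in the optimized logical plan compared with the analyzed one.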
The differences between the 3 output modes (Append / Complete / Update) and the purpose of watermarks (defining the late-data tolerance window and managing state memory) come up often. The constraint that Append mode on an aggregation query requires a watermark is also tested.
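The interaction between Append mode and watermarks can be sketched with a simplified plain-Python model. This is not how Spark is implemented, and the exact batch in which Spark emits a finalized window differs from this sketch. The function name `run_append_mode` and the integer event times are assumptions for illustration.

```python
def run_append_mode(batches, window_size, delay):
    """Simplified model of a windowed count in Append mode with a watermark.

    batches: list of micro-batches, each a list of integer event times.
    Returns (emitted, dropped): finalized (window_start, count) pairs
    and the late events that were discarded.
    """
    state = {}                 # window_start -> running count held in memory
    emitted, dropped = [], []
    max_event_time = None
    watermark = None           # undefined until a batch with data completes

    for batch in batches:
        for t in batch:
            start = (t // window_size) * window_size
            # Too late: the window this event belongs to is already finalized.
            if watermark is not None and start + window_size <= watermark:
                dropped.append(t)
                continue
            state[start] = state.get(start, 0) + 1
        if batch:
            newest = max(batch)
            max_event_time = newest if max_event_time is None else max(max_event_time, newest)
            # Watermark = newest event time seen so far minus the allowed delay.
            watermark = max_event_time - delay
        if watermark is not None:
            # Append mode: a window is output once, when the watermark passes
            # its end, and its state is then released.
            for start in sorted(state):
                if start + window_size <= watermark:
                    emitted.append((start, state.pop(start)))
    return emitted, dropped

emitted, dropped = run_append_mode([[1, 4, 12], [3, 21], [8, 30]],
                                   window_size=10, delay=5)
print(emitted, dropped)
```

Without the watermark, the model would never know when a window is final, which is the intuition behind the rule that Append mode on an aggregation requires one.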
Below is one question each from DataFrame API, Catalyst Optimizer, and Structured Streaming. Read the code, picture the execution result, and then pick your answer.
DataFrame API
Question 1
Which is the correct execution result of the following PySpark code?

```python
from pyspark.sql.functions import col, rank
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [("Alice", "Sales", 5000), ("Bob", "Sales", 6000),
     ("Carol", "Eng", 7000), ("Dave", "Eng", 7000)],
    ["name", "dept", "salary"])

# Rank employees within each department, highest salary first.
w = Window.partitionBy("dept").orderBy(col("salary").desc())
result = df.withColumn("rnk", rank().over(w)).filter(col("rnk") == 1)
result.select("name", "dept").show()
```

Correct answer: C
The rank() window function partitions by 'dept' and ranks within each partition by salary descending. In Sales, Bob (6000) gets rank=1 and Alice (5000) gets rank=2. In Eng, Carol and Dave tie at 7000, so both get rank=1. Because rank() assigns the same rank to ties, Eng produces 2 rows with rank=1. After filtering rnk==1, the result is 3 rows: Bob, Carol, and Dave. dense_rank() would give the same result, but row_number() would break ties with a unique ordering and only return 2 rows. This distinction is a frequent target of window-function questions on the Spark Developer exam.
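The difference between the three ranking functions can be modeled in a few lines of plain Python. This is a sketch of the semantics for a single partition ordered by salary descending, not Spark code, and `window_rankings` is a hypothetical helper name.

```python
def window_rankings(salaries):
    """Return (rank, dense_rank, row_number) lists for values sorted descending."""
    ordered = sorted(salaries, reverse=True)
    rank, dense, row_num = [], [], []
    for i, value in enumerate(ordered):
        if i > 0 and value == ordered[i - 1]:
            rank.append(rank[-1])     # ties share the same rank
            dense.append(dense[-1])   # and the same dense_rank
        else:
            rank.append(i + 1)        # rank leaves a gap after ties
            dense.append(dense[-1] + 1 if dense else 1)  # dense_rank never skips
        row_num.append(i + 1)         # row_number is always unique
    return rank, dense, row_num

rank, dense, row_num = window_rankings([7000, 7000, 6500])
print(rank, dense, row_num)
```

Filtering on a value of 1 keeps two rows for rank and dense_rank but only one for row_number. In Spark, which of the tied rows receives row_number 1 is not guaranteed unless the orderBy includes a tie-breaking column.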
Catalyst Optimizer
Question 2
Which optimization does the Catalyst Optimizer apply to the following Spark SQL query?

```sql
SELECT name, salary
FROM employees
WHERE dept = 'Engineering' AND salary > 100000
```

Correct answer: A
Catalyst applies multiple optimization rules to the same query. Assuming the employees table is stored as Parquet, two matter here: (1) Column pruning: only 3 columns are read from the Parquet files, namely name and salary (used in SELECT) plus dept (used in WHERE). Every other column is skipped. (2) Predicate pushdown: the dept='Engineering' and salary>100000 conditions are matched against the Parquet Row Group statistics, and any Row Group that cannot satisfy them is skipped. Together these two optimizations dramatically reduce wasted I/O. Column pruning is especially effective because Parquet is a columnar format. On Delta Lake tables, Data Skipping (file-level skipping via statistics) is layered on top of these as well.
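Row Group skipping comes down to a min/max comparison. The sketch below models it in plain Python with invented statistics. The `row_groups` values and the `groups_to_read` helper are hypothetical and do not reflect an actual Parquet reader API.

```python
# Hypothetical per-row-group statistics for the salary column,
# similar in spirit to the min/max values stored in a Parquet footer.
row_groups = [
    {"id": 0, "salary_min": 40000, "salary_max": 90000},
    {"id": 1, "salary_min": 85000, "salary_max": 150000},
    {"id": 2, "salary_min": 120000, "salary_max": 200000},
]

def groups_to_read(row_groups, threshold):
    """Keep only row groups that could contain a row with salary > threshold."""
    return [g["id"] for g in row_groups if g["salary_max"] > threshold]

selected = groups_to_read(row_groups, 100000)
print(selected)  # row group 0 is skipped: its maximum salary is below the threshold
```

Row group 1 is still read even though some of its rows fail the predicate, so the filter is applied again row by row after the read. Statistics can only rule groups out, never confirm that every row matches.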
Structured Streaming
Question 3
In the following Structured Streaming code, what is the correct behavior when using trigger(availableNow=True)?

```python
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("/data/events/")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/events")
    .trigger(availableNow=True)
    .toTable("bronze.events"))
```

Correct answer: B
trigger(availableNow=True) is a trigger mode that processes all unprocessed data since the last checkpoint and then stops the stream automatically. It is similar to trigger(once=True), but availableNow splits the data into multiple micro-batches, which avoids out-of-memory failures when a large backlog of unprocessed files exists. trigger(once=True) processes all data in a single micro-batch, which is what option C describes. availableNow is especially recommended in combination with cloudFiles (Auto Loader) and is the ideal fit for batch-style scheduled execution (e.g., a Workflow that runs once every morning) that still benefits from streaming-style incremental processing. This distinction comes up frequently on the Spark Developer exam.
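The batching difference can be pictured with a small plain-Python model. This is only a sketch: `plan_micro_batches` is a hypothetical helper, and the batch size of 4 stands in for a source rate limit such as a maximum number of files per trigger.

```python
def plan_micro_batches(pending_files, max_files_per_batch=None):
    """Split a backlog into micro-batches; None means one batch for everything."""
    if max_files_per_batch is None:
        return [pending_files] if pending_files else []
    return [pending_files[i:i + max_files_per_batch]
            for i in range(0, len(pending_files), max_files_per_batch)]

# Hypothetical backlog of 10 unprocessed files.
backlog = [f"file_{i}.json" for i in range(10)]

once_style = plan_micro_batches(backlog)              # models trigger(once=True)
available_now_style = plan_micro_batches(backlog, 4)  # models availableNow with a rate limit

print(len(once_style), [len(b) for b in available_now_style])
```

Both styles process the same 10 files and then stop. The difference is how much data a single micro-batch has to hold at once.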
The most effective way to lift your code-question accuracy is to write and run code yourself. Spin up small sample datasets on Community Edition and actually write select, filter, groupBy, join, withColumn, when/otherwise, and window-function code to verify the results. Window functions in particular (rank, dense_rank, row_number, lag, lead) produce different results depending on the partitionBy and orderBy combination, so hands-on repetition is the only reliable way to build intuition for them.
To internalize Catalyst Optimizer optimizations, get in the habit of running df.explain(True) to inspect the logical plans (Parsed / Analyzed / Optimized) and the Physical Plan. Once you can read an execution plan and tell whether predicate pushdown was applied or whether a broadcast join was chosen, your accuracy on optimization questions improves dramatically.
Understand how to read the Jobs, Stages, and SQL/DataFrame tabs in the Spark UI. The exam asks about diagnosing performance problems using signals like the task duration distribution (for detecting data skew), shuffle read/write sizes, cache hit rate, and the DAG between stages.
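One way to reason about the task duration distribution is to compare the slowest task with the median. The snippet below is a rough heuristic in plain Python, not a Spark UI feature. The durations and the `skew_ratio` helper are invented for illustration.

```python
from statistics import median

def skew_ratio(task_durations):
    """Ratio of the slowest task to the median task in one stage."""
    return max(task_durations) / median(task_durations)

# Hypothetical task durations (seconds) for a single stage.
durations = [12, 11, 13, 12, 95, 12]

ratio = skew_ratio(durations)
print(round(ratio, 2))  # one task takes several times longer than the typical task
```

A stage where most tasks finish quickly while one or two run far longer is the classic signature of data skew, usually caused by a few heavily populated keys landing in the same partition.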
Are Spark Developer exam code questions in Python or Scala?
The Spark Developer Associate exam lets you choose between the Python (PySpark) version and the Scala version at registration. The question structure, difficulty, and passing score are identical — only the code language differs. PySpark has more test takers and richer learning resources, so unless your job requires Scala specifically, the Python version is recommended. Knowledge of both languages is never tested in the same sitting.
What format do Spark Developer code questions take?
Code questions come in three main patterns: (1) Output prediction — predict the result of a completed code snippet (e.g., the row count after groupBy → agg → show), (2) Fill-in-the-blank — part of the code is replaced with ___ and you pick the correct method name, argument, or option, and (3) Error spotting — choose which snippet raises a runtime error. All three require not just memorizing the API spec but actually being able to visualize the execution result. Hands-on practice on Community Edition is the most effective preparation.
Are Delta Lake questions included in the Spark Developer exam?
Delta Lake fundamentals (ACID transactions, Time Travel, schema evolution) and reading/writing Delta format via DataFrameReader are in scope. However, the depth is shallower than on the Data Engineer exams: complex MERGE branching, DLT, LakeFlow, and Unity Catalog permission management are out of scope for Spark Developer. Delta Lake topics account for roughly 10% of the Spark Developer exam, in line with the domain weighting in the table above.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.