The Databricks Certified Associate Developer for Apache Spark (commonly called the Spark Developer exam) measures your ability to implement DataFrame API and Spark SQL workloads in PySpark or Scala. Delta Lake and Unity Catalog are out of scope — what's tested is deep Apache Spark API knowledge. Since code questions account for 30-35% of the exam, reading the API reference and hands-on practice are the keys to passing.
| Item | Details |
|---|---|
| Exam name | Databricks Certified Associate Developer for Apache Spark |
| Questions | 45 |
| Duration | 90 minutes (about 2 minutes per question) |
| Passing score | 70% (32+ correct) |
| Exam fee | $200 (excl. tax) |
| Language | English only |
| Code language | Choose Python (PySpark) or Scala at registration |
| Validity | 2 years |
| Code question share | ~30-35% (14-16 questions) |
Because the exam is English-only, every question, answer choice, and code snippet appears in English. Learning the technical terms in English directly will boost your reading speed.
| Domain | Weight | Approx. questions |
|---|---|---|
| Spark Architecture | 17% | 7-8 |
| DataFrame API | 17% | 7-8 |
| Spark SQL | 13% | 5-6 |
| Data Sources | 13% | 5-6 |
| Higher-order Functions | 13% | 5-6 |
| Structured Streaming | 13% | 5-6 |
| Testing & Performance | 14% | 6-7 |
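The approximate question counts follow directly from the domain weights and the 45-question total. A small sketch of that arithmetic in plain Python (the function names are illustrative and not part of any exam tooling):

```python
import math

TOTAL_QUESTIONS = 45
PASSING_PCT = 70

# Domain weights in percent, copied from the table above.
DOMAIN_WEIGHTS = {
    "Spark Architecture": 17,
    "DataFrame API": 17,
    "Spark SQL": 13,
    "Data Sources": 13,
    "Higher-order Functions": 13,
    "Structured Streaming": 13,
    "Testing & Performance": 14,
}


def expected_questions(weight_pct, total=TOTAL_QUESTIONS):
    """Expected (fractional) number of questions for a domain weight."""
    return total * weight_pct / 100


def min_correct_to_pass(total=TOTAL_QUESTIONS, pct=PASSING_PCT):
    """Smallest whole number of correct answers that reaches the passing rate."""
    return math.ceil(total * pct / 100)


for domain, weight in DOMAIN_WEIGHTS.items():
    print(f"{domain}: {expected_questions(weight):.2f} questions")

print("Weights sum to", sum(DOMAIN_WEIGHTS.values()), "percent")
print("Minimum correct answers to pass:", min_correct_to_pass())
```

A 17% domain works out to 7.65 questions, which is why the table shows a 7-8 range, and 70% of 45 is 31.5, which rounds up to the 32 correct answers needed to pass.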
Tests whether you understand Spark internals. Conceptual multiple-choice questions dominate this domain — not code.
One of the highest-weighted domains. Tests whether you precisely understand the arguments, return values, and behavior of the main PySpark DataFrame API methods.
- `df.select("col1", "col2")` selects columns; `df.withColumn("new_col", expr)` adds a column, or replaces it when the name already exists. The difference: `select` returns only the listed columns, while `withColumn` returns all original columns plus the new one.
- `df.filter(col("age") > 30)` and `df.where("age > 30")` are equivalent. To combine Column conditions, use `&` (AND) and `|` (OR), wrapping each condition in parentheses.
- `df.groupBy("dept").agg(count("*"), avg("salary"))` applies multiple aggregations at once. Know how it differs from the agg-less `groupBy().count()`, which returns only the grouping columns and a `count` column.
- In `df1.join(df2, "key", "inner")`, the third argument specifies the join type. The 7 types are inner / left / right / outer / cross / semi / anti. A semi join returns only the left-table rows that have a match, with no columns from the right table; an anti join returns only the left-table rows that have no match.
- `Window.partitionBy("dept").orderBy("salary")` defines a window, over which you apply `row_number()`, `rank()`, `dense_rank()`, `lag()`, and `lead()`. `row_number` assigns sequential numbers even on ties; `rank` assigns the same number to ties and leaves gaps; `dense_rank` assigns the same number to ties with no gaps.
- `spark.sql("SELECT ...")` returns a DataFrame. Use `createOrReplaceTempView("view_name")` to expose a DataFrame to SQL.
- Know `COALESCE`, `NULLIF`, `CASE WHEN`, `CAST`, and the date functions (`date_add`, `datediff`, `date_format`).
- `spark.read.format("csv").option("header", "true").option("inferSchema", "true")` configures a CSV read. Know how to read and write JSON, Parquet, ORC, and Avro, and the characteristics of each format.
- `StructType` and `StructField` let you specify a schema explicitly to avoid the overhead of `inferSchema` (which scans the entire dataset).
- `df.write.partitionBy("year", "month")` partitions data into a directory structure. Watch out for the small-files problem caused by over-partitioning (e.g., partitioning by high-cardinality columns).
- Save modes: `append` (add to existing data), `overwrite` (replace existing data), `error`/`errorifexists` (the default; errors if the table exists), and `ignore` (does nothing if the table exists). Four modes total.

Tests lambda operations on arrays (ArrayType) and maps (MapType). Added to the DataFrame API in Spark 3.0+, this is a common blind spot for learners compared to the other domains.
- `transform(array_col, x -> x * 2)` doubles every element.
- `filter(array_col, x -> x > 0)` keeps only positive values.
- `exists(array_col, x -> x == "target")` returns true if any element equals "target".
- `spark.readStream.format("rate")` or `format("socket")` creates a test stream.
- Output modes: `append` (emits only new rows; for non-aggregating streams), `complete` (emits the full result every batch; for aggregating streams), and `update` (emits only changed rows). Three modes total.
- `withWatermark("timestamp", "10 minutes")` sets the tolerance for late data. Data arriving later than the watermark may be dropped.
- Triggers: `trigger(processingTime="5 seconds")` sets the micro-batch interval, and `trigger(availableNow=True)` processes all available data and then stops.
- `broadcast_var = sc.broadcast(lookup_dict)` distributes read-only data to every Executor. Used for dictionaries and lookup tables that are referenced by many tasks and fit in each Executor's memory.
- `acc = sc.accumulator(0)` creates a distributed counter. If you update an accumulator inside a Transformation, the value can be incremented multiple times when an Action re-runs the Transformation.
- `df.cache()` is shorthand for `df.persist()` with the default storage level, which for DataFrames is MEMORY_AND_DISK (MEMORY_ONLY is the default for `RDD.cache()`), while `df.persist(StorageLevel.MEMORY_AND_DISK)` names the level explicitly. Caching pays off when the same DataFrame is referenced multiple times.

| Aspect | PySpark | Scala |
|---|---|---|
| Learning resources | Abundant (official docs, Qiita, YouTube) | Limited |
| Real-world usage | Used by 80%+ of Databricks users | Under 20% |
| Type safety | Dynamically typed (runtime errors) | Statically typed (compile-time errors) |
| Performance | DataFrame API matches Scala | Faster than PySpark for RDD operations |
| Exam-prep difficulty | Lower (many samples available) | Higher (few samples available) |
Question 1
When inner-joining a 100GB DataFrame (df_large) with a 50MB DataFrame (df_small) in PySpark, which approach gives the best performance?
Correct answer: B
The 50MB DataFrame is small enough that explicitly hinting a broadcast hash join with broadcast() yields the best performance. A broadcast hash join copies df_small to every Executor and joins without a shuffle, sharply reducing network traffic. Option A might let Catalyst/AQE choose a broadcast join automatically, but the default spark.sql.autoBroadcastJoinThreshold is 10MB, so a 50MB table is outside the auto-broadcast range — an explicit broadcast hint is the reliable choice. Option C's repartition triggers a full shuffle on both tables, moving 100GB of data, which is inefficient. Option D's cache doesn't change the join strategy, so the performance impact is limited.
Should I take the Spark Developer exam in PySpark or Scala?
You select Python or Scala when registering, and code questions are shown in your chosen language. PySpark is the mainstream choice in today's Databricks ecosystem, and official docs, Databricks Academy, and community resources are overwhelmingly Python-based — we recommend PySpark unless you have a specific reason to pick Scala. Scala offers type safety and performance benefits, but it has a steeper learning curve and fewer exam-prep samples available.
What share of the exam is code questions, and how should I prepare for them?
Code questions make up 30-35% of the exam (roughly 14-16 questions). Expect DataFrame API operations (filter/select/groupBy/join/withColumn), window functions, higher-order functions (transform/filter/exists), and Structured Streaming readStream/writeStream. Most questions ask about the execution result of a snippet, so you need to know the return types and behavior of PySpark APIs precisely. Studying by actually running code in Community Edition is the most effective approach.
How does the Spark Developer exam compare in difficulty to Data Engineer Associate?
Data Engineer Associate (DEA) covers the Databricks platform broadly but shallowly, while Spark Developer focuses on deep Spark / PySpark API knowledge. Delta Lake, Unity Catalog, and Workflows are not in scope; only pure Spark API knowledge is tested. Code questions account for 30-35% of the Spark Developer exam (vs. 10-15% on DEA), so reading the API reference and hands-on practice are essential. The English-only format also raises the difficulty bar.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.