Databricks

Databricks Spark Developer Complete Guide: DataFrame API & Spark SQL Prep

2026-03-21
Updated: 2026-03-27
NicheeLab Editorial Team

The Databricks Certified Associate Developer for Apache Spark (commonly called the Spark Developer exam) measures your ability to implement DataFrame API and Spark SQL workloads in PySpark or Scala. Delta Lake and Unity Catalog are out of scope — what's tested is deep Apache Spark API knowledge. Since code questions account for 30-35% of the exam, reading the API reference and hands-on practice are the keys to passing.

Exam Overview

| Item | Details |
|---|---|
| Exam name | Databricks Certified Associate Developer for Apache Spark |
| Questions | 45 |
| Duration | 90 minutes (about 2 minutes per question) |
| Passing score | 70% (32+ correct) |
| Exam fee | $200 (excl. tax) |
| Language | English only |
| Code language | Choose Python (PySpark) or Scala at registration |
| Validity | 2 years |
| Code question share | ~30-35% (14-16 questions) |

Because the exam is English-only, every question, answer choice, and code snippet appears in English. Learning the technical terms in English directly will boost your reading speed.

The 7 Exam Domains and Their Weights

| Domain | Weight | Approx. questions |
|---|---|---|
| Spark Architecture | 17% | 7-8 |
| DataFrame API | 17% | 7-8 |
| Spark SQL | 13% | 5-6 |
| Data Sources | 13% | 5-6 |
| Higher-order Functions | 13% | 5-6 |
| Structured Streaming | 13% | 5-6 |
| Testing & Performance | 14% | 6-7 |
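
The question counts in the overview and in the domain table follow from simple arithmetic on the 45-question, 90-minute, 70% figures. The sketch below reproduces them; the variable and function names are illustrative, not part of any official material.

```python
import math

TOTAL_QUESTIONS = 45
DURATION_MINUTES = 90
PASSING_PERCENT = 70

# Domain weights in percent, copied from the table above.
DOMAIN_WEIGHTS = {
    "Spark Architecture": 17,
    "DataFrame API": 17,
    "Spark SQL": 13,
    "Data Sources": 13,
    "Higher-order Functions": 13,
    "Structured Streaming": 13,
    "Testing & Performance": 14,
}


def questions_needed_to_pass(total=TOTAL_QUESTIONS, percent=PASSING_PERCENT):
    """Smallest whole number of correct answers that reaches the passing percent."""
    return math.ceil(total * percent / 100)


def expected_questions(weight_percent, total=TOTAL_QUESTIONS):
    """Expected (fractional) number of questions for a domain weight."""
    return total * weight_percent / 100


minutes_per_question = DURATION_MINUTES / TOTAL_QUESTIONS
pass_mark = questions_needed_to_pass()
per_domain = {name: expected_questions(w) for name, w in DOMAIN_WEIGHTS.items()}
```

Rounding each fractional value down and up gives the ranges in the table, for example 7.65 becomes 7-8 for a 17% domain.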

Domain 1: Spark Architecture (17%)

Tests whether you understand Spark internals. Conceptual multiple-choice questions dominate this domain — not code.

  • Driver / Executor model: The Driver holds the SparkSession and plans jobs, while Executors run tasks in parallel. Know how to handle Executor OOM (increasing partition count, tuning memory settings).
  • Lazy Evaluation: Transformations (select, filter, groupBy, etc.) are not executed until an Action is called. Know how to inspect logical and physical plans with explain().
  • Narrow vs Wide Transformation: Distinguish Narrow (map, filter — completes within a partition, no shuffle) from Wide (groupBy, join, repartition — triggers a shuffle).
  • Catalyst Optimizer: The 4 phases: analysis (resolving the logical plan) → logical optimization → physical planning → code generation. Frequently tested optimization rules include Predicate Pushdown (applying filters early), Column Pruning (dropping unused columns), and Constant Folding (pre-computing constant expressions).
  • Adaptive Query Execution (AQE): Dynamically switches join strategies, coalesces partitions, and splits skewed data using runtime statistics. Enabled by default on Databricks.
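
The narrow versus wide distinction is easier to remember with a toy model. The following is a plain-Python sketch, not Spark code: partitions are modeled as lists, the sample rows are made up, and `zlib.crc32` stands in for Spark's hash partitioner as an assumption for illustration.

```python
import zlib
from collections import defaultdict

# Three input partitions of (dept, salary) rows; the values are made up.
partitions = [
    [("eng", 100), ("ops", 70)],
    [("eng", 120), ("hr", 60)],
    [("ops", 80), ("hr", 65)],
]


def narrow_filter(parts, predicate):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in parts]


def shuffle_by_key(parts, num_partitions):
    """Wide: rows are redistributed so equal keys land in the same partition."""
    out = [[] for _ in range(num_partitions)]
    for part in parts:
        for key, value in part:
            target = zlib.crc32(key.encode("utf-8")) % num_partitions
            out[target].append((key, value))
    return out


def group_sum(parts):
    """After the shuffle, every partition can aggregate its own keys locally."""
    result = {}
    for part in parts:
        totals = defaultdict(int)
        for key, value in part:
            totals[key] += value
        result.update(totals)
    return result


filtered = narrow_filter(partitions, lambda row: row[1] >= 65)
shuffled = shuffle_by_key(filtered, num_partitions=2)
totals = group_sum(shuffled)
```

The filter never looks outside its own partition, while the grouping step has to move rows between partitions first. That movement is the shuffle that marks a stage boundary.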

Domain 2: DataFrame API (17%)

One of the highest-weighted domains. Tests whether you precisely understand the arguments, return values, and behavior of the main PySpark DataFrame API methods.

Frequently Tested APIs and Question Patterns

  • select / withColumn: df.select("col1", "col2") selects columns; df.withColumn("new_col", expr) adds or transforms a column. The difference: select returns only the listed columns, while withColumn returns all original columns plus the new one.
  • filter / where: df.filter(col("age") > 30) and df.where("age > 30") are equivalent. To combine conditions, use & (AND) and | (OR), wrapping each condition in parentheses.
  • groupBy / agg: df.groupBy("dept").agg(count("*"), avg("salary")) applies multiple aggregations at once. Know the difference vs. agg-less groupBy().count().
  • join: In df1.join(df2, "key", "inner"), the third argument specifies the join type. The 7 types are inner / left / right / outer / cross / semi / anti. A semi join returns only the left-table rows that have a match and no columns from the right table; an anti join returns only the left-table rows that have no match.
  • Window functions: Window.partitionBy("dept").orderBy("salary") defines a window, then apply row_number(), rank(), dense_rank(), lag(), lead(). row_number assigns sequential numbers even on ties; rank assigns the same number to ties and leaves gaps; dense_rank assigns the same number to ties with no gaps.
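
The tie-handling rules for row_number, rank, and dense_rank can be checked with a small plain-Python model of a single window partition ordered by ascending salary. This is a sketch of the semantics only, not Spark code; `rank_rows` and the sample salaries are hypothetical.

```python
def rank_rows(salaries):
    """Return (salary, row_number, rank, dense_rank) tuples in ascending salary order."""
    ordered = sorted(salaries)
    out = []
    rank = 0
    dense = 0
    previous = None
    for position, salary in enumerate(ordered, start=1):
        if salary != previous:
            rank = position   # rank jumps to the current position, which leaves gaps
            dense += 1        # dense_rank advances by exactly one per distinct value
            previous = salary
        out.append((salary, position, rank, dense))
    return out


rows = rank_rows([200, 100, 300, 200])
for salary, row_number, rank, dense_rank in rows:
    print(salary, row_number, rank, dense_rank)
```

With the two tied 200 values, row_number yields 2 and 3, rank yields 2 and 2 followed by 4, and dense_rank yields 2 and 2 followed by 3. In Spark the order of row_number among tied rows is not guaranteed unless the orderBy columns break the tie.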

Domain 3: Spark SQL (13%)

  • spark.sql("SELECT ...") returns a DataFrame. Use createOrReplaceTempView("view_name") to expose a DataFrame to SQL.
  • Built-in SQL functions: COALESCE, NULLIF, CASE WHEN, CAST, and date functions (date_add, datediff, date_format).
  • Subqueries: the difference between correlated subqueries (EXISTS) and non-correlated subqueries (IN), and the effect on the execution plan.

Domain 4: Data Sources (13%)

  • spark.read.format("csv").option("header", "true").option("inferSchema", "true") reads CSV. Know how to read and write JSON, Parquet, ORC, and Avro, and the characteristics of each format.
  • Schema definition: StructType and StructField let you specify a schema explicitly to avoid the overhead of inferSchema (which scans the entire dataset).
  • Partitioned writes: df.write.partitionBy("year", "month") partitions data into a directory structure. Watch out for the small-files problem caused by over-partitioning (e.g., partitioning by high-cardinality columns).
  • Write modes: append (add to existing data), overwrite (replace existing data), error/errorifexists (default — errors if the table exists), and ignore (does nothing if the table exists). Four modes total.
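
To see why partitioning by a high-cardinality column produces many small files, it helps to model the directory layout that partitionBy creates. The sketch below is plain Python rather than Spark; it only builds the Hive-style `column=value` directory names, and the sample rows and the `partition_dir` helper are made up for illustration.

```python
from collections import Counter


def partition_dir(row, partition_cols):
    """Build the Hive-style relative directory for one row, e.g. year=2025/month=1."""
    return "/".join(f"{col}={row[col]}" for col in partition_cols)


rows = [
    {"year": 2025, "month": 1, "user_id": 1},
    {"year": 2025, "month": 1, "user_id": 2},
    {"year": 2025, "month": 2, "user_id": 3},
    {"year": 2026, "month": 1, "user_id": 4},
]

# Low-cardinality columns: several rows share a directory.
by_date = Counter(partition_dir(r, ["year", "month"]) for r in rows)

# High-cardinality column: every row gets its own directory.
by_user = Counter(partition_dir(r, ["user_id"]) for r in rows)
```

Here the date layout needs 3 directories for 4 rows, while the user_id layout needs one directory per row, which is the over-partitioning pattern to avoid.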

Domain 5: Higher-order Functions (13%)

Tests lambda operations on arrays (ArrayType) and maps (MapType). These functions have been available in Spark SQL since 2.4 and as PySpark DataFrame functions since 3.1, and they are a common blind spot for learners compared to the other domains.

  • transform: applies a lambda to each array element. transform(array_col, x -> x * 2) doubles every element.
  • filter: keeps only the array elements matching a predicate. filter(array_col, x -> x > 0) keeps only positive values.
  • exists: returns a Boolean indicating whether any element in the array satisfies the predicate. exists(array_col, x -> x == "target")
  • aggregate: reduces array elements with a fold operation. You provide an initial value and a lambda to compute things like the sum of an array or a concatenated string.
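
Each of the four functions has a close plain-Python analog, which can help when reading the lambda syntax for the first time. The sketch below runs on an ordinary list rather than a Spark array column, so it ignores Spark's null handling; the sample values are made up.

```python
from functools import reduce

values = [3, -1, 4, 0, -5]

# transform(values, x -> x * 2): apply a function to every element
doubled = [x * 2 for x in values]

# filter(values, x -> x > 0): keep the elements that satisfy the predicate
positives = [x for x in values if x > 0]

# exists(values, x -> x == 4): True if at least one element matches
has_four = any(x == 4 for x in values)

# aggregate(values, 0, (acc, x) -> acc + x): fold from an initial value
total = reduce(lambda acc, x: acc + x, values, 0)
```

Note the argument order in the fold: the accumulator comes first and the current element second, matching the lambda passed to aggregate.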

Domain 6: Structured Streaming (13%)

  • Test sources: use spark.readStream.format("rate") or format("socket") to create test streams.
  • Output modes: append (emits only new rows; for non-aggregating streams), complete (emits the full result every batch; for aggregating streams), and update (emits only changed rows). Three modes total.
  • Watermark: withWatermark("timestamp", "10 minutes") sets the tolerance for late data. Data arriving past the watermark is dropped.
  • Trigger settings: trigger(processingTime="5 seconds") (sets the micro-batch interval) and trigger(availableNow=True) (processes all available data and then stops).
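
How a watermark decides which late events to drop can be sketched with a simplified plain-Python model. It assumes that the watermark applied to a micro-batch is the maximum event time seen in earlier batches minus the delay, and it ignores windowing and state cleanup. The function name and the event times (in minutes) are hypothetical.

```python
def run_with_watermark(batches, delay):
    """Process micro-batches of event times and separate accepted from dropped events."""
    max_event_time = None
    accepted, dropped = [], []
    for batch in batches:
        # The watermark for this batch is based on what earlier batches contained.
        watermark = None if max_event_time is None else max_event_time - delay
        for event_time in batch:
            if watermark is not None and event_time < watermark:
                dropped.append(event_time)
            else:
                accepted.append(event_time)
        # Advance the maximum event time only after the batch has been processed.
        for event_time in batch:
            if max_event_time is None or event_time > max_event_time:
                max_event_time = event_time
    return accepted, dropped


accepted, dropped = run_with_watermark(
    [[100, 105], [120, 98], [125, 109, 111]],
    delay=10,
)
```

In the third batch the watermark is 120 - 10 = 110, so the event at 109 is dropped while the event at 111 is kept. The event at 98 in the second batch survives because the watermark at that point is only 95.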

Domain 7: Testing & Performance (14%)

  • Broadcast variables: broadcast_var = sc.broadcast(lookup_dict) distributes read-only data to every Executor. Used for lookup dictionaries and tables that are small enough to fit in each Executor's memory.
  • Accumulators: acc = sc.accumulator(0) creates a distributed counter. If you update an accumulator inside a Transformation, the value can be incremented multiple times when an Action re-runs the Transformation.
  • repartition vs coalesce: repartition does a full shuffle and can either increase or decrease the partition count. coalesce only decreases partitions and avoids a shuffle. Use coalesce to control the file count before writing.
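  • Caching strategy: df.cache() is shorthand for df.persist() with the default storage level, which for DataFrames is MEMORY_AND_DISK (memory only is the default for RDD.cache(), not for DataFrames). Use df.persist(StorageLevel.MEMORY_ONLY) or another level to choose explicitly. Caching pays off when the same DataFrame is referenced multiple times.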
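
The accumulator double-counting pitfall and the benefit of caching both come from the same cause: an uncached lineage is recomputed for every action. The toy class below is a plain-Python stand-in, not Spark code; `LazyDataset`, `add_one`, and the `calls` counter (which plays the role of an accumulator) are hypothetical names.

```python
class LazyDataset:
    """Toy stand-in for a DataFrame: map() is lazy, collect() is the action."""

    def __init__(self, source, funcs=()):
        self._source = list(source)
        self._funcs = list(funcs)
        self._use_cache = False
        self._cached = None

    def map(self, func):
        # Transformation: only records the function; nothing runs yet.
        return LazyDataset(self._source, self._funcs + [func])

    def cache(self):
        # Marks the dataset for caching; rows are stored by the first action.
        self._use_cache = True
        return self

    def collect(self):
        if self._cached is not None:
            return list(self._cached)
        rows = list(self._source)
        for func in self._funcs:
            rows = [func(row) for row in rows]
        if self._use_cache:
            self._cached = rows
        return list(rows)


calls = {"count": 0}  # plays the role of an accumulator


def add_one(x):
    calls["count"] += 1
    return x + 1


uncached = LazyDataset([1, 2, 3]).map(add_one)
count_before_action = calls["count"]  # map() alone did not run add_one
uncached.collect()
uncached.collect()
count_uncached = calls["count"]       # add_one ran again for the second action

calls["count"] = 0
cached = LazyDataset([1, 2, 3]).map(add_one).cache()
cached.collect()
cached.collect()
count_cached = calls["count"]         # the second action reused the stored rows
```

Without caching, the counter ends at 6 for 3 rows because both actions re-ran the transformation; with caching it ends at 3.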

PySpark vs. Scala: How to Pick a Language

| Aspect | PySpark | Scala |
|---|---|---|
| Learning resources | Abundant (official docs, Qiita, YouTube) | Limited |
| Real-world usage | Used by 80%+ of Databricks users | Under 20% |
| Type safety | Dynamically typed (runtime errors) | Statically typed (compile-time errors) |
| Performance | DataFrame API matches Scala | Faster than PySpark for RDD operations |
| Exam-prep difficulty | Lower (many samples available) | Higher (few samples available) |

Check Your Understanding with a Sample Question

Question 1

When inner-joining a 100GB DataFrame (df_large) with a 50MB DataFrame (df_small) in PySpark, which approach gives the best performance?

  A. Run df_large.join(df_small, 'key', 'inner') as-is and let the Catalyst Optimizer handle optimization
  B. Use `from pyspark.sql.functions import broadcast` and run df_large.join(broadcast(df_small), 'key', 'inner') to explicitly request a broadcast hash join
  C. Run df_large.repartition('key').join(df_small.repartition('key'), 'key', 'inner') to partition both DataFrames on the key column before the join
  D. Call df_large.cache() and df_small.cache(), then run df_large.join(df_small, 'key', 'inner')

Correct answer: B

The 50MB DataFrame is small enough that explicitly hinting a broadcast hash join with broadcast() yields the best performance. A broadcast hash join copies df_small to every Executor and joins without a shuffle, sharply reducing network traffic. Option A might let Catalyst/AQE choose a broadcast join automatically, but the default spark.sql.autoBroadcastJoinThreshold is 10MB, so a 50MB table is outside the auto-broadcast range — an explicit broadcast hint is the reliable choice. Option C's repartition triggers a full shuffle on both tables, moving 100GB of data, which is inefficient. Option D's cache doesn't change the join strategy, so the performance impact is limited.
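
The reason a broadcast hash join avoids a shuffle can be seen in a minimal plain-Python model of the algorithm. This is a sketch of the idea rather than Spark's implementation: partitions are lists of (key, value) rows, and the function name and sample data are made up.

```python
def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join (key, value) rows; the small side is copied to every partition."""
    # Build phase: hash the small table once. This dict is what gets "broadcast".
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)

    joined = []
    for partition in large_partitions:
        # Probe phase: each partition joins locally, so large rows never move.
        out = []
        for key, value in partition:
            for small_value in lookup.get(key, []):
                out.append((key, value, small_value))
        joined.append(out)
    return joined


large = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
small = [(1, "x"), (2, "y")]
result = broadcast_hash_join(large, small)
```

The output keeps the partitioning of the large side, and the row with key 3 disappears because the inner join finds no match in the lookup.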

Frequently Asked Questions

Should I take the Spark Developer exam in PySpark or Scala?

You select Python or Scala when registering, and code questions are shown in your chosen language. PySpark is the mainstream choice in today's Databricks ecosystem, and official docs, Databricks Academy, and community resources are overwhelmingly Python-based — we recommend PySpark unless you have a specific reason to pick Scala. Scala offers type safety and performance benefits, but it has a steeper learning curve and fewer exam-prep samples available.

What share of the exam is code questions, and how should I prepare for them?

Code questions make up 30-35% of the exam (roughly 14-16 questions). Expect DataFrame API operations (filter/select/groupBy/join/withColumn), window functions, higher-order functions (transform/filter/exists), and Structured Streaming readStream/writeStream. Most questions ask about the execution result of a snippet, so you need to know the return types and behavior of PySpark APIs precisely. Studying by actually running code in Community Edition is the most effective approach.

How does the Spark Developer exam compare in difficulty to Data Engineer Associate?

Data Engineer Associate (DEA) covers the Databricks platform broadly but shallowly, while Spark Developer focuses on deep Spark / PySpark API knowledge. Delta Lake, Unity Catalog, and Workflows are not in scope; only pure Spark API knowledge is tested. Code questions account for 30-35% (vs. 10-15% on DEA), so reading the API reference and hands-on practice are essential. The English-only format also raises the difficulty bar.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


© 2026 NicheeLab All rights reserved.