Databricks Spark Practice Questions: DataFrame API & Architecture Drills

2026-03-21
Updated: 2026-03-27
NicheeLab Editorial Team

The Databricks Apache Spark Developer Associate exam tests practical Apache Spark coding skills, spanning Spark's core architecture, the DataFrame API, Spark SQL, and Structured Streaming. This article lays out the 7 exam domains and their weighting, explains the characteristics of code questions and how to attack them, and includes 3 practice questions from representative domains.

The 7 Spark Developer Exam Domains and Their Weights

The Spark Developer Associate exam is 45 questions in 120 minutes, drawn from the 7 domains below. The fact that code-reading questions make up 30-35% of the exam is the biggest difference from other Databricks exams.

Domain | Weight | Key Topics
DataFrame API & Spark SQL | ~25% | select, filter, groupBy, join, withColumn, window functions
Spark Architecture | ~15% | Driver/Executor, jobs/stages/tasks, Narrow/Wide Transformations
Catalyst & Tungsten | ~12% | Logical plan, physical plan, predicate pushdown, column pruning, AQE
Data Read/Write | ~13% | DataFrameReader/Writer, Parquet/CSV/JSON, schema inference, partitioning
Structured Streaming | ~13% | readStream/writeStream, output modes, watermarks, triggers
Performance Tuning | ~12% | Caching, broadcast joins, partition counts, data skew mitigation
Delta Lake Fundamentals | ~10% | ACID transactions, Time Travel, schema evolution, basic DML

Code Question Characteristics and Attack Strategy

On the Spark Developer exam, 14-16 of the 45 questions are code questions. That's a higher share than on other Databricks exams (DEA, MLA, etc., where code questions are about 15-20%), so precise API knowledge and the ability to visualize execution results are what decide pass or fail.

The 3 Patterns of Code Questions

  • Output Prediction: Pick the execution result (row count, column names, values) of a completed snippet. You're tested on whether you can accurately trace the final result of a transformation chain like groupBy → agg → show.
  • Fill-in-the-Blank: Part of the code is blanked out (___) and you pick the correct method name, argument, or option. Telling withColumn / when / otherwise / col / lit apart comes up frequently.
  • Error Spotting: Choose which of 4 code snippets raises a runtime error. Common themes are type mismatches, references to nonexistent columns, and misuse of the immutable DataFrame.

Keys to Cracking Code Questions

There are 3 keys to reliably getting code questions right.

  • Understand lazy evaluation: Transformations don't execute until an action is called. select/filter/groupBy on their own run nothing; computation only fires when show/collect/write is invoked.
  • Distinguish Narrow vs. Wide Transformations: You need to judge instantly which operations trigger a shuffle. Narrow (no shuffle): select, filter, map, withColumn. Wide (shuffle): groupBy, join, distinct, repartition, orderBy.
  • DataFrame immutability: df.withColumn() returns a new DataFrame and does NOT mutate the original df. Forgetting to assign the result of df.withColumn(...) to a variable so the transformation never sticks is a frequent error-spotting trap.
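The first and third keys can be modelled in a few lines of plain Python. The sketch below is not Spark: `LazyFrame`, `with_column`, and `collect` are hypothetical names invented for illustration. It only mimics two behaviours described above: transformations are recorded without running, and each transformation returns a new object instead of changing the original.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyFrame:
    """Hypothetical stand-in for a DataFrame: source rows plus a recorded plan."""
    rows: tuple
    plan: tuple = ()  # transformations recorded so far, not yet executed

    def filter(self, predicate):
        # Transformation: records the step and returns a NEW object.
        return LazyFrame(self.rows, self.plan + (("filter", predicate),))

    def with_column(self, name, fn):
        # Transformation: same idea, nothing is computed here.
        return LazyFrame(self.rows, self.plan + (("with_column", (name, fn)),))

    def collect(self):
        # Action: only now is the recorded plan actually executed.
        out = [dict(r) for r in self.rows]
        for op, arg in self.plan:
            if op == "filter":
                out = [r for r in out if arg(r)]
            else:
                name, fn = arg
                out = [{**r, name: fn(r)} for r in out]
        return out


df = LazyFrame(({"name": "Alice", "salary": 5000}, {"name": "Bob", "salary": 6000}))

# Result is discarded, so df is unchanged (the classic error-spotting trap).
df.with_column("bonus", lambda r: r["salary"] // 10)

# Assigning the result is what makes the new column "stick".
df2 = df.with_column("bonus", lambda r: r["salary"] // 10)

print(df.collect())   # no "bonus" key
print(df2.collect())  # each row has a "bonus" key
```

When tracing an exam snippet, the same two checks apply: find the action that triggers execution, and confirm that every transformation result is assigned to the variable used later.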

Representative Question Patterns by Domain

DataFrame API — Predicting Transformation Chain Results

The most common pattern is reading a transformation chain like select → filter → groupBy → agg → orderBy and predicting the final DataFrame's schema and row count. Pay particular attention to the row count after groupBy (the number of unique keys) and the result column names produced by the aggregation functions (count, sum, avg, max, min) passed to agg.
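As a rough illustration of the row-count rule, the following plain-Python sketch groups a small made-up dataset by department. The data and the `group_by_dept` helper are assumptions for this example only. The output keys imitate the `function(column)` naming that Spark gives aggregate columns when no alias is supplied.

```python
from collections import defaultdict

# Made-up sample rows: (name, dept, salary)
rows = [
    ("Alice", "Sales", 5000),
    ("Bob", "Sales", 6000),
    ("Carol", "Eng", 7000),
    ("Dave", "Eng", 7000),
    ("Eve", "Ops", 4000),
]


def group_by_dept(rows):
    """Model of groupBy("dept").agg(sum("salary"), count("name")).orderBy("dept")."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for _name, dept, salary in rows:
        totals[dept] += salary
        counts[dept] += 1
    # One output row per unique key, sorted to model the trailing orderBy.
    return [
        {"dept": d, "sum(salary)": totals[d], "count(name)": counts[d]}
        for d in sorted(totals)
    ]


result = group_by_dept(rows)
for row in result:
    print(row)
print("row count:", len(result))
```

Five input rows collapse to three output rows because there are three distinct `dept` values, which is the count an output-prediction question expects you to read off the data.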

Catalyst Optimizer — Execution Plan Optimization Rules

Questions ask about the representative optimization rules Catalyst applies. The 4 big ones are predicate pushdown (apply filter conditions at data-source read time), column pruning (skip reading unneeded columns), constant folding (evaluate constant expressions once during optimization instead of once per row), and join optimization (reordering joins and choosing a broadcast join when one side is small).
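Constant folding is the easiest of these rules to picture in code. The sketch below is a toy model, not Catalyst itself: expressions are nested tuples of the form `(operator, left, right)`, strings stand for column names, and numbers stand for literals. The `fold` function and this encoding are assumptions made for illustration.

```python
import operator

# Operators supported by this toy expression tree.
OPS = {"+": operator.add, "*": operator.mul, ">": operator.gt}


def fold(expr):
    """Recursively replace operator nodes whose operands are both literals."""
    if not isinstance(expr, tuple):
        return expr  # a literal number or a column name
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)  # computed once, at "optimization time"
    return (op, left, right)


# Models the predicate: salary > 1000 * 100
expr = (">", "salary", ("*", 1000, 100))
folded = fold(expr)
print(folded)
```

The multiplication disappears from the tree, so the comparison against `salary` is the only work left to do per row.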

Structured Streaming — Output Modes and Watermarks

The differences between the 3 output modes (Append / Complete / Update) and the purpose of watermarks (defining the late-data tolerance window and managing state memory) come up often. The constraint that Append mode on an aggregation query requires a watermark is also tested.
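The watermark rule can be sketched as simple arithmetic: the watermark is the largest event time seen so far minus the allowed delay, and it advances between micro-batches. The plain-Python model below uses integer minutes and a hypothetical `process_batches` helper. It always drops late rows, whereas Spark only guarantees that rows within the threshold are kept, so treat it as a simplification.

```python
def process_batches(batches, delay):
    """Model watermark handling; each batch is a list of (event_time, payload)."""
    max_event_time = None
    watermark = None
    accepted, dropped = [], []
    for batch in batches:
        for event_time, payload in batch:
            # Rows older than the current watermark are treated as too late.
            if watermark is not None and event_time < watermark:
                dropped.append(payload)
            else:
                accepted.append(payload)
        if batch:
            batch_max = max(t for t, _ in batch)
            if max_event_time is None or batch_max > max_event_time:
                max_event_time = batch_max
            # The watermark advances only after the batch has been processed.
            watermark = max_event_time - delay
    return accepted, dropped, watermark


batches = [
    [(10, "a"), (12, "b")],
    [(20, "c"), (5, "late-but-ok")],  # watermark is 2 here, so 5 is kept
    [(9, "too-late"), (15, "d")],     # watermark is 10 here, so 9 is dropped
]
accepted, dropped, watermark = process_batches(batches, delay=10)
print(accepted, dropped, watermark)
```

Because state older than the watermark will never be updated again, the engine can discard it, which is why a watermark is what makes Append mode possible on an aggregation.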

Practice Questions by Domain (3 Questions)

One question each from DataFrame API, Catalyst Optimizer, and Structured Streaming. Read the code, picture the execution result, and then pick your answer.

DataFrame API

Question 1

Which is the correct execution result of the following PySpark code?

    from pyspark.sql.functions import col, rank
    from pyspark.sql.window import Window

    df = spark.createDataFrame(
        [("Alice", "Sales", 5000), ("Bob", "Sales", 6000),
         ("Carol", "Eng", 7000), ("Dave", "Eng", 7000)],
        ["name", "dept", "salary"])

    # Rank employees within each department, highest salary first
    w = Window.partitionBy("dept").orderBy(col("salary").desc())
    result = df.withColumn("rnk", rank().over(w)).filter(col("rnk") == 1)
    result.select("name", "dept").show()

  A. Only Bob is output (highest salary in Sales)
  B. 2 rows are output: Bob and Carol (one top-salary person per department)
  C. 3 rows are output: Bob, Carol, and Dave (Eng has 2 rows with rank=1)
  D. An error is raised (rank() cannot be used without partitionBy)

Correct answer: C

The rank() window function partitions by 'dept' and ranks within each partition by salary descending. In Sales, Bob (6000) gets rank=1 and Alice (5000) gets rank=2. In Eng, Carol and Dave tie at 7000, so both get rank=1. Because rank() assigns the same rank to ties, Eng produces 2 rows with rank=1. After filtering rnk==1, the result is 3 rows: Bob, Carol, and Dave. dense_rank() would give the same result, but row_number() would break ties with a unique ordering and only return 2 rows. This distinction is a frequent target of window-function questions on the Spark Developer exam.
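To check the tie-handling argument without a cluster, the three ranking functions can be modelled in plain Python on the same four rows. This is a sketch of the semantics only: the `ranked` helper is hypothetical, and which of Carol or Dave receives row_number 1 here is an artifact of input order, whereas Spark gives no such guarantee for ties.

```python
from itertools import groupby

# Same rows as the question: (name, dept, salary)
rows = [
    ("Alice", "Sales", 5000),
    ("Bob", "Sales", 6000),
    ("Carol", "Eng", 7000),
    ("Dave", "Eng", 7000),
]


def ranked(rows):
    """Model rank, dense_rank and row_number per dept, salary descending."""
    out = []
    by_dept = sorted(rows, key=lambda r: r[1])
    for dept, group in groupby(by_dept, key=lambda r: r[1]):
        ordered = sorted(group, key=lambda r: -r[2])
        rank = dense = 0
        previous_salary = None
        for position, (name, _dept, salary) in enumerate(ordered, start=1):
            if salary != previous_salary:
                rank = position  # rank jumps to the row position after a tie
                dense += 1       # dense_rank increases by exactly one
                previous_salary = salary
            out.append({"name": name, "dept": dept, "rank": rank,
                        "dense_rank": dense, "row_number": position})
    return out


result = ranked(rows)
top_by_rank = sorted(r["name"] for r in result if r["rank"] == 1)
top_by_row_number = [r["name"] for r in result if r["row_number"] == 1]
print(top_by_rank)             # three names: the Eng tie keeps both rows
print(len(top_by_row_number))  # two rows: exactly one per department
```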

Catalyst Optimizer

Question 2

Which optimization does the Catalyst Optimizer apply to the following Spark SQL query, assuming employees is stored as Parquet?

    SELECT name, salary
    FROM employees
    WHERE dept = 'Engineering' AND salary > 100000

  A. Both predicate pushdown and column pruning are applied: only dept, name, and salary are read from the Parquet file, and the dept and salary filters are applied at file scan time
  B. Only predicate pushdown is applied: all columns are read first, then the filter is applied
  C. Only column pruning is applied: only the name and salary columns are read, but the filter is applied after a full scan
  D. No optimization is applied: all columns and all rows are scanned, then the filter and column selection happen

Correct answer: A

The Catalyst Optimizer applies multiple optimization rules to the same query. For this query: (1) Column pruning: only 3 columns are read from the Parquet file, namely name and salary (used in SELECT) plus dept (used in WHERE). Every other column is skipped. (2) Predicate pushdown: the dept='Engineering' and salary>100000 conditions are matched against the Parquet Row Group statistics, and any Row Group that cannot satisfy them is skipped. Together these two optimizations dramatically reduce wasted I/O. Column pruning is especially effective because Parquet is a columnar format. On Delta Lake tables, Data Skipping (file-level skipping via statistics) is layered on top of these as well.
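Both effects can be approximated with a small plain-Python model. The row-group statistics below are invented numbers, and `groups_to_read` is a hypothetical helper. Real Parquet readers keep min/max statistics per column chunk, and this sketch only mirrors that idea for the salary predicate.

```python
# Column pruning: the scan needs the SELECT columns plus the WHERE columns.
select_columns = {"name", "salary"}
where_columns = {"dept", "salary"}
columns_to_read = sorted(select_columns | where_columns)

# Invented min/max salary statistics for three row groups.
row_groups = [
    {"id": 0, "salary_min": 40000, "salary_max": 90000},
    {"id": 1, "salary_min": 85000, "salary_max": 150000},
    {"id": 2, "salary_min": 120000, "salary_max": 300000},
]


def groups_to_read(row_groups, threshold):
    """Keep only row groups that could contain a row with salary > threshold."""
    return [g["id"] for g in row_groups if g["salary_max"] > threshold]


needed = groups_to_read(row_groups, threshold=100000)
print(columns_to_read)
print(needed)
```

Row group 0 is skipped without being read because its maximum salary is below the threshold. Row group 1 still has to be read and filtered row by row, since its range straddles the threshold.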

Structured Streaming

Question 3

In the following Structured Streaming code, what is the correct behavior when using trigger(availableNow=True)?

    (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/data/events/")
        .writeStream
        .format("delta")
        .option("checkpointLocation", "/checkpoints/events")
        .trigger(availableNow=True)
        .toTable("bronze.events"))

  A. The stream stays running continuously and a micro-batch fires every time a new file arrives
  B. All currently available unprocessed files are processed across multiple micro-batches, and the stream stops automatically when they are all done
  C. All available unprocessed files are processed in a single micro-batch and the stream stops automatically
  D. An error is raised (availableNow cannot be used with the cloudFiles format)

Correct answer: B

trigger(availableNow=True) is a trigger mode that processes all unprocessed data since the last checkpoint and then stops the stream automatically. It is similar to trigger(once=True), but availableNow splits the data into multiple micro-batches, which avoids out-of-memory failures when a large backlog of unprocessed files exists. trigger(once=True) processes all data in a single micro-batch, which is what option C describes. availableNow is especially recommended in combination with cloudFiles (Auto Loader) and is the ideal fit for batch-style scheduled execution (e.g., a Workflow that runs once every morning) that still benefits from streaming-style incremental processing. This distinction comes up frequently on the Spark Developer exam.
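The difference between the two triggers comes down to how the backlog is cut into micro-batches. The following plain-Python sketch models only that planning step. `plan_batches` and `max_files_per_batch` are hypothetical names, with the latter standing in for a rate-limit option such as a per-trigger file cap.

```python
def plan_batches(pending_files, mode, max_files_per_batch):
    """Return the micro-batches a run would execute before stopping."""
    if mode == "once":
        # One batch containing the whole backlog; the rate limit is ignored.
        return [list(pending_files)] if pending_files else []
    if mode == "availableNow":
        # The backlog is split into rate-limited batches, then the run stops.
        return [
            pending_files[i:i + max_files_per_batch]
            for i in range(0, len(pending_files), max_files_per_batch)
        ]
    raise ValueError(f"unknown trigger mode: {mode}")


backlog = [f"file_{i}.json" for i in range(5)]
once_plan = plan_batches(backlog, "once", max_files_per_batch=2)
available_now_plan = plan_batches(backlog, "availableNow", max_files_per_batch=2)

print(len(once_plan), [len(b) for b in once_plan])
print(len(available_now_plan), [len(b) for b in available_now_plan])
```

In both modes every pending file is processed exactly once and the run then ends. Only the batch sizes differ, which is why availableNow copes better with a large backlog.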

Study Strategy for the Spark Developer Exam

Learn the DataFrame API by writing it by hand

The most effective way to lift your code-question accuracy is to write and run code yourself. Spin up small sample datasets on Community Edition and actually write select, filter, groupBy, join, withColumn, when/otherwise, and window-function code to verify the results. Window functions in particular (rank, dense_rank, row_number, lag, lead) produce different results depending on the partitionBy and orderBy combination, so there is no substitute for running them yourself and comparing the output with your prediction.

Build a habit of checking execution plans with explain()

To internalize Catalyst Optimizer optimizations, get in the habit of running df.explain(True) to inspect the logical plans (Parsed / Analyzed / Optimized) and the Physical Plan. Once you can read an execution plan and tell whether predicate pushdown was applied or whether a broadcast join was chosen, your accuracy on optimization questions improves dramatically.

Practice reading execution metrics in the Spark UI

Understand how to read the Jobs, Stages, and SQL/DataFrame tabs in the Spark UI. The exam asks about diagnosing performance problems using signals like the task duration distribution (for detecting data skew), shuffle read/write sizes, cache hit rate, and the DAG between stages.
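One way to practice the skew signal is to compare the slowest task in a stage against the median task. The sketch below assumes you have copied task durations (in seconds) out of the Stages tab by hand. The numbers, the `skew_ratio` helper, and the threshold of 3 are illustrative assumptions, not values defined by Spark.

```python
from statistics import median


def skew_ratio(task_seconds):
    """Ratio of the slowest task to the median task in one stage."""
    return max(task_seconds) / median(task_seconds)


SKEW_THRESHOLD = 3  # arbitrary cut-off chosen for this example

skewed_stage = [12, 11, 13, 12, 95, 12, 11, 13]  # one straggler task
balanced_stage = [12, 11, 13, 12]

for label, durations in (("skewed", skewed_stage), ("balanced", balanced_stage)):
    ratio = skew_ratio(durations)
    flag = "possible skew" if ratio > SKEW_THRESHOLD else "looks even"
    print(f"{label}: ratio={ratio:.2f} ({flag})")
```

A single task that runs many times longer than the median usually means one partition holds far more data than the others, which is the pattern skew questions describe.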

Frequently Asked Questions

Are Spark Developer exam code questions in Python or Scala?

The Spark Developer Associate exam lets you choose between the Python (PySpark) version and the Scala version at registration. The question structure, difficulty, and passing score are identical — only the code language differs. PySpark has more test takers and richer learning resources, so unless your job requires Scala specifically, the Python version is recommended. Knowledge of both languages is never tested in the same sitting.

What format do Spark Developer code questions take?

Code questions come in three main patterns: (1) Output prediction — predict the result of a completed code snippet (e.g., the row count after groupBy → agg → show), (2) Fill-in-the-blank — part of the code is replaced with ___ and you pick the correct method name, argument, or option, and (3) Error spotting — choose which snippet raises a runtime error. All three require not just memorizing the API spec but actually being able to visualize the execution result. Hands-on practice on Community Edition is the most effective preparation.

Are Delta Lake questions included in the Spark Developer exam?

Delta Lake fundamentals (ACID transactions, Time Travel, schema evolution) and reading/writing Delta format via DataFrameReader and DataFrameWriter are in scope. However, the depth is shallower than on the Data Engineer exams: complex MERGE branching, DLT, LakeFlow, and Unity Catalog permission management are out of scope for Spark Developer. Delta Lake topics account for roughly 10% of the Spark Developer exam overall, matching the domain weighting listed earlier.
