pandas API on Spark (pyspark.pandas) is an API that lets you run familiar Pandas syntax directly on Apache Spark's distributed processing engine. You can process large datasets that single-node Pandas cannot handle, with minimal code rewriting.
This article covers the history of the Koalas integration, a comparison with Pandas, code examples for converting between DataFrame types, distributed-environment constraints and unsupported APIs, the differences from Pandas UDF, and the points most likely to appear on Databricks certification exams.
pandas API on Spark was officially merged into Apache Spark in version 3.2 and ships as the pyspark.pandas module. Its predecessor was the open-source project "Koalas" developed by Databricks, designed so that Pandas users could transparently leverage Spark's distributed processing.
The key feature is that it provides the same API as Pandas DataFrame and Series while internally running as a Spark DataFrame in distributed mode. This lets you process hundreds of GBs to TBs of data with intuitive Pandas code.
Databricks released Koalas as open source in 2019. It implemented a Pandas API compatibility layer on top of Spark to answer the demand "I want to run Pandas code on Spark as-is." It was merged into Apache Spark in Spark 3.2 (released October 2021) and became a standard module as pyspark.pandas. Today, pip install koalas is no longer needed: PySpark 3.2 or later includes the module with no extra installation.
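The version requirement can be checked before importing the module. The following is a minimal sketch: the helper name supports_pandas_api_on_spark is hypothetical (it is not part of PySpark), and it assumes the version string begins with numeric major and minor components.

```python
def supports_pandas_api_on_spark(version: str) -> bool:
    """Return True when a PySpark version string is 3.2 or later."""
    # Only the first two components matter, e.g. "3.5.0" -> (3, 5)
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (3, 2)


print(supports_pandas_api_on_spark("3.1.3"))
print(supports_pandas_api_on_spark("3.5.0"))
```

In a real session you would pass pyspark.__version__ to the helper instead of a literal string.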
Importing pandas API on Spark takes just one line. The conventional alias is ps.
import pyspark.pandas as ps
# Read a CSV file (same syntax as Pandas)
df = ps.read_csv("/mnt/data/sales.csv")
# Read a Parquet file
df = ps.read_parquet("/mnt/data/sales.parquet")
# Read a Delta table (requires Delta Lake, which Databricks bundles)
df = ps.read_delta("/mnt/data/delta_table")
# Basic operations
df.head(10)
df.describe()
df.shape
df.dtypes
The table below summarizes the main differences between native Pandas and pandas API on Spark. The exam tests whether you understand these distinctions accurately.
| Comparison Item | Pandas | pandas API on Spark |
|---|---|---|
| Execution engine | Single node (CPython) | Apache Spark distributed processing |
| Data size limit | Whatever fits in memory | Cluster storage capacity |
| API compatibility | Baseline (100%) | Covers about 80-85% |
| Lazy evaluation | Eager evaluation | Lazy + partially eager |
| Row order guarantee | Guaranteed | Not guaranteed (distributed) |
| Index | Fully supported | Partial limits (iloc inefficient) |
| Import | import pandas as pd | import pyspark.pandas as ps |
| Execution environment | Local Python | On a Spark cluster |
Converting between the three DataFrame types — Pandas DataFrame, Spark DataFrame, and pyspark.pandas DataFrame — is a frequent exam topic. Memorize the differences between each conversion method accurately.
import pandas as pd
import pyspark.pandas as ps
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# ---- 1. Pandas -> pyspark.pandas ----
pandas_df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})
ps_df = ps.from_pandas(pandas_df)
# ---- 2. pyspark.pandas -> Pandas ----
# Note: collects all data to the driver
pandas_df2 = ps_df.to_pandas()
# ---- 3. Spark DataFrame -> pyspark.pandas ----
spark_df = spark.createDataFrame(
    [("Alice", 30), ("Bob", 25)], ["name", "age"]
)
# to_pandas_on_spark() is the Spark 3.2 name; newer releases deprecate it in favor of pandas_api()
ps_df2 = spark_df.to_pandas_on_spark()
# Alternative
ps_df2 = ps.DataFrame(spark_df)
# ---- 4. pyspark.pandas -> Spark DataFrame ----
spark_df2 = ps_df.to_spark()
# ---- 5. Spark DataFrame -> Pandas (reference) ----
# Note: collects all data to the driver
pandas_df3 = spark_df.toPandas()
# ---- 6. Pandas -> Spark DataFrame (reference) ----
spark_df3 = spark.createDataFrame(pandas_df)
Important: toPandas() and to_pandas() collect the entire dataset into the driver node's memory. For large datasets there is a risk of OutOfMemoryError, so it is safer to use to_spark() to convert to a Spark DataFrame and process from there.
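One way to reduce the risk of an accidental full collection is to check the row count before converting. The sketch below is illustrative only: collect_if_small and its default max_rows threshold are hypothetical, and row count is only a rough proxy for memory use because wide rows cost more. It relies on len() and to_pandas(), both of which a pyspark.pandas DataFrame provides.

```python
def collect_if_small(psdf, max_rows=100_000):
    """Convert to native pandas only when the row count is within max_rows."""
    # On a pyspark.pandas DataFrame, len() runs a distributed count
    row_count = len(psdf)
    if row_count > max_rows:
        raise ValueError(
            f"{row_count} rows exceeds the limit of {max_rows}; "
            "keep the data distributed (for example via to_spark())"
        )
    # Safe enough under the stated assumption: collect to the driver
    return psdf.to_pandas()
```

A call such as collect_if_small(ps_df) would then replace a bare ps_df.to_pandas() at the reporting stage.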
pandas API on Spark covers about 80-85% of the Pandas API, but due to the nature of the distributed environment, the following APIs and behaviors are restricted.
| Category | Restricted API / Behavior | Reason |
|---|---|---|
| Index operations | iloc (positional access) | Triggers shuffle, inefficient |
| Window operations | ewm (exponentially weighted moving average) | Depends on row order, unfit for distribution |
| String operations | str.extractall (all regex matches) | Requires complex state management |
| Join | DataFrame.append (deprecated) | Deprecated in Pandas itself too |
| Sort | Implicit row-order guarantee | Not guaranteed across distributed partitions |
| Plot | plot()-family methods | Requires data collection, inefficient |
| MultiIndex | Some MultiIndex operations | Complex to implement in distributed environments |
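The row-order and iloc rows in the table suggest a pattern: transform first, sort as the last step, and take leading rows with head() rather than positional indexing. The sketch below runs on native Pandas so it can be tried locally; top_n is a hypothetical helper, and the assumption that it behaves the same on a pyspark.pandas DataFrame (which also offers sort_values() and head()) should be verified on your cluster.

```python
import pandas as pd


def top_n(df, by, n):
    """Sort as the final step and take the first n rows with head(), not iloc."""
    return df.sort_values(by, ascending=False).head(n)


sales = pd.DataFrame(
    {"name": ["a", "b", "c", "d"], "amount": [10, 40, 20, 30]}
)
filtered = sales[sales["amount"] > 10]   # transform first
top_two = top_n(filtered, "amount", 2)   # sort right before the output
print(top_two["name"].tolist())          # ['b', 'd']
```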
- Use to_pandas() at the final visualization or reporting stage.
- [...] apply().
- Convert with to_spark() and process there.
- Row order from sort_values() may be broken by downstream processing, so sort right before the final output.

"Pandas API on Spark" and "Pandas UDF" have similar names and are often confused, but their purposes and usage are different.
| Comparison Item | pandas API on Spark | Pandas UDF (pandas_udf) |
|---|---|---|
| Purpose | Write distributed processing in Pandas syntax | Apply Pandas functions to Spark DataFrame |
| Input | pyspark.pandas DataFrame | Spark DataFrame columns/groups |
| Output | pyspark.pandas DataFrame | Spark DataFrame columns |
| Internal processing | Converted to a Spark plan | Batch transfer via Apache Arrow |
| Usage | All operations from ps.read_csv() etc. | Define function with @pandas_udf decorator |
| Speed | Comparable to Spark API | Up to about 100x faster than row-at-a-time Python UDFs, depending on the workload |
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
import pandas as pd
# Scalar Pandas UDF: apply a Pandas function per column
@pandas_udf(DoubleType())
def multiply_by_two(s: pd.Series) -> pd.Series:
    return s * 2
df = spark.table("sales")
result = df.withColumn("doubled", multiply_by_two(df["amount"]))
# Grouped Map: receive a Pandas DataFrame per group and process it
def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["normalized"] = (
        (pdf["value"] - pdf["value"].mean()) / pdf["value"].std()
    )
    return pdf
# output_schema is not defined in this snippet; it must describe every column
# that normalize returns (the input columns plus normalized as a double)
df.groupby("department").applyInPandas(normalize, schema=output_schema)
pandas API on Spark appears on both the Spark Developer Associate and Data Engineer Associate (DEA) exams. The following three points are especially important.
import pyspark.pandas as ps is the correct import statement. import koalas as ks (old Koalas) and from pyspark import pandas appear as distractor answers on the exam.
The difference between to_pandas_on_spark() (Spark → pyspark.pandas) and toPandas() (Spark → native Pandas) is a frequent topic. The former preserves distributed state, while the latter collects data to the driver.
True/false questions test constraints like "row order is not guaranteed," "iloc is inefficient," and "some Pandas functions are unsupported." You need to accurately understand the behavioral differences from native Pandas.
Spark Developer / DEA
Question 1
Which of the following correctly describes the behavior of the code below?
import pyspark.pandas as ps
ps_df = ps.read_csv("/data/large_dataset.csv")
result = ps_df.groupby("region").agg({"sales": "sum", "quantity": "mean"})
final = result.to_pandas()
Correct answer: B
pyspark.pandas's read_csv uses Spark's data source API internally to load data in a distributed manner (A is wrong). groupby and agg are also converted into Spark DataFrame operations and executed in distributed mode (C is wrong). to_pandas() converts a pyspark.pandas DataFrame into a native Pandas DataFrame, collecting the data to the driver node in the process. In this example only the post-aggregation result (one row per region) is collected, so it is usually safe to run. D describes to_pandas_on_spark(); to_pandas() converts to native Pandas, so distributed state is not preserved.
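The aggregation in the question can be tried on a small native Pandas frame to see why the collected result is small. This is a sketch under assumptions: summarize_by_region and the sample data are hypothetical, and the demo runs on Pandas only. Because groupby().agg() with a dict is available in both libraries, the same function should also accept a pyspark.pandas DataFrame, but that path is not exercised here.

```python
import pandas as pd


def summarize_by_region(df):
    """Aggregate sales and quantity per region (one output row per region)."""
    return df.groupby("region").agg({"sales": "sum", "quantity": "mean"})


local_df = pd.DataFrame(
    {
        "region": ["east", "east", "west"],
        "sales": [100.0, 50.0, 70.0],
        "quantity": [2, 4, 6],
    }
)
local_summary = summarize_by_region(local_df)
print(local_summary)
```

Three input rows reduce to two output rows, one per region. On a cluster the input may be very large, but only this per-region result would be passed to to_pandas().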
What is the difference between Pandas API on Spark and native Pandas?
The syntax is nearly identical, but the execution engines are fundamentally different. Native Pandas processes data in memory on a single node, while Pandas API on Spark runs on Apache Spark's distributed processing engine. This lets you process hundreds of GBs to TBs of data using Pandas syntax. However, there are distributed-environment constraints: row order is not guaranteed, positional access via iloc is inefficient, and some Pandas functions (str.extractall, parts of ewm, etc.) are unimplemented or only partially supported.
What is the relationship between Koalas and Pandas API on Spark?
Koalas is the predecessor project of Pandas API on Spark. Databricks developed it as open source, and it was merged into Apache Spark 3.2. It is now shipped as the standard pyspark.pandas module, so there is no need to pip install Koalas. The Koalas API has been almost entirely migrated to pyspark.pandas. A migration guide is available in the official Spark documentation.
What is the difference between Pandas UDF and Pandas API on Spark?
The names are similar, but the purposes and usage differ. Pandas API on Spark (pyspark.pandas) is an overall API for writing Spark distributed processing in Pandas syntax: you can create, transform, and aggregate DataFrames in a Pandas-like style. Pandas UDF (pandas_udf), on the other hand, is a mechanism for applying Pandas functions to columns or groups of a Spark DataFrame. It uses vectorized data transfer via Apache Arrow and can run far faster than row-at-a-time Python UDFs (up to roughly 100x, depending on the workload). Exams sometimes ask about this distinction.
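Because a grouped-map function is plain Pandas code, its logic can be checked locally before it is handed to applyInPandas. The sketch below imitates the per-group calls with a native Pandas groupby. The sample data is hypothetical, and this version of normalize copies its input instead of modifying it in place, which is a small deviation chosen for the local test.

```python
import pandas as pd


def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    """Add a z-score column computed within one group."""
    pdf = pdf.copy()  # avoid mutating the caller's frame
    pdf["normalized"] = (
        (pdf["value"] - pdf["value"].mean()) / pdf["value"].std()
    )
    return pdf


sample = pd.DataFrame(
    {
        "department": ["a", "a", "a", "b", "b"],
        "value": [1.0, 2.0, 3.0, 10.0, 20.0],
    }
)
# Imitate grouped map: call the function once per group, then concatenate
parts = [normalize(group) for _, group in sample.groupby("department")]
local_result = pd.concat(parts, ignore_index=True)
print(local_result)
```

Note that std() returns NaN for a group with a single row, so such groups would need separate handling before the function is used on real data.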
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.