Pandas API on Spark: pandas-on-Spark for Big Data (2026)

pandas API on Spark (pyspark.pandas) is an API that lets you run familiar Pandas syntax directly on Apache Spark's distributed processing engine. You can process large datasets that single-node Pandas cannot handle, with minimal code rewriting.

This article covers the history of the Koalas integration, a comparison with Pandas, code examples for converting between DataFrame types, distributed-environment constraints and unsupported APIs, differences from Pandas UDF, and focus points for the Databricks certification exams.

What is pandas API on Spark

pandas API on Spark is an API officially merged into Spark in Apache Spark 3.2, shipped as the pyspark.pandas module. Its predecessor was the open-source project "Koalas" developed by Databricks, designed so that Pandas users could transparently leverage Spark's distributed processing.

The key feature is that it provides the same API as Pandas DataFrame and Series while internally running as a Spark DataFrame in distributed mode. This lets you process hundreds of GBs to TBs of data with intuitive Pandas code.

The Koalas Integration History

Databricks released Koalas as open source in 2019. It implemented a Pandas API compatibility layer on top of Spark to answer the demand "I want to run Pandas code on Spark as-is." It was merged into Apache Spark in Spark 3.2 (released October 2021) and became a standard module as pyspark.pandas. Today, pip install koalas is no longer needed: PySpark 3.2 or later includes the module with no extra installation.

How to Import

Importing pandas API on Spark takes just one line. The conventional alias is ps.

import pyspark.pandas as ps

# Read a CSV file (same syntax as Pandas)
df = ps.read_csv("/mnt/data/sales.csv")

# Read a Parquet file
df = ps.read_parquet("/mnt/data/sales.parquet")

# Read a Delta table (requires Delta Lake; available out of the box on Databricks)
df = ps.read_delta("/mnt/data/delta_table")

# Basic operations
df.head(10)
df.describe()
df.shape
df.dtypes

Pandas vs pandas API on Spark Comparison

The table below summarizes the main differences between native Pandas and pandas API on Spark. The exam tests whether you understand these distinctions accurately.

Comparison Item	Pandas	pandas API on Spark
Execution engine	Single node (CPython)	Apache Spark distributed processing
Data size limit	Whatever fits in memory	Cluster storage capacity
API compatibility	Baseline (100%)	Covers about 80-85%
Lazy evaluation	Eager evaluation	Lazy + partially eager
Row order guarantee	Guaranteed	Not guaranteed (distributed)
Index	Fully supported	Partial limits (iloc inefficient)
Import	import pandas as pd	import pyspark.pandas as ps
Execution environment	Local Python	On a Spark cluster

DataFrame Interconversion Code Examples

Converting between the three DataFrame types — Pandas DataFrame, Spark DataFrame, and pyspark.pandas DataFrame — is a frequent exam topic. Memorize the differences between each conversion method accurately.

import pandas as pd
import pyspark.pandas as ps
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# ---- 1. Pandas -> pyspark.pandas ----
pandas_df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})
ps_df = ps.from_pandas(pandas_df)

# ---- 2. pyspark.pandas -> Pandas ----
# Note: collects all data to the driver
pandas_df2 = ps_df.to_pandas()

# ---- 3. Spark DataFrame -> pyspark.pandas ----
spark_df = spark.createDataFrame(
    [("Alice", 30), ("Bob", 25)], ["name", "age"]
)
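ps_df2 = spark_df.pandas_api()
# Older name for the same conversion (deprecated in later Spark releases):
# ps_df2 = spark_df.to_pandas_on_spark()
# Alternative
ps_df2 = ps.DataFrame(spark_df)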

# ---- 4. pyspark.pandas -> Spark DataFrame ----
spark_df2 = ps_df.to_spark()

# ---- 5. Spark DataFrame -> Pandas (reference) ----
# Note: collects all data to the driver
pandas_df3 = spark_df.toPandas()

# ---- 6. Pandas -> Spark DataFrame (reference) ----
spark_df3 = spark.createDataFrame(pandas_df)

Important: toPandas() and to_pandas() collect the entire dataset into the driver node's memory. For large datasets there is a risk of OutOfMemoryError, so it is safer to use to_spark() to convert to a Spark DataFrame and process from there.

Constraints and Unsupported APIs

pandas API on Spark covers about 80-85% of the Pandas API, but due to the nature of the distributed environment, the following APIs and behaviors are restricted.

Major Unsupported or Restricted APIs

Category	Restricted API / Behavior	Reason
Index operations	iloc (positional access)	Triggers shuffle, inefficient
Window operations	ewm (exponentially weighted moving average)	Depends on row order; limited support at best, depending on the Spark version
String operations	str.extractall (all regex matches)	Requires complex state management
Concatenation	DataFrame.append (deprecated)	Deprecated in Pandas itself and removed in Pandas 2.0; use concat instead
Sort	Implicit row-order guarantee	Not guaranteed across distributed partitions
Plot	plot()-family methods	Requires data collection, inefficient
MultiIndex	Some MultiIndex operations	Complex to implement in distributed environments

Best Practices

Use pyspark.pandas for preprocessing large data, and only collect locally with to_pandas() at the final visualization or reporting stage.
Prefer built-in aggregation functions (sum, mean, count) over apply().
For JOINs, window functions, and similar operations, the Spark DataFrame API is often faster and gives you more control, so convert via to_spark() and process there.
The result of sort_values() may have its order broken by downstream processing, so sort right before the final output.

Differences from Pandas UDF

"Pandas API on Spark" and "Pandas UDF" have similar names and are often confused, but their purposes and usage are different.

Comparison Item	pandas API on Spark	Pandas UDF (pandas_udf)
Purpose	Write distributed processing in Pandas syntax	Apply Pandas functions to Spark DataFrame
Input	pyspark.pandas DataFrame	Spark DataFrame columns/groups
Output	pyspark.pandas DataFrame	Spark DataFrame columns
Internal processing	Converted to a Spark plan	Batch transfer via Apache Arrow
Usage	All operations from ps.read_csv() etc.	Define function with @pandas_udf decorator
Speed	Comparable to Spark API	Up to about 100x faster than row-at-a-time Python UDFs, depending on the workload

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
import pandas as pd

# Scalar Pandas UDF: apply a Pandas function per column
@pandas_udf(DoubleType())
def multiply_by_two(s: pd.Series) -> pd.Series:
    return s * 2

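# Assumes a "sales" table with double "amount" and "value" columns
# and a string "department" column
df = spark.table("sales")
result = df.withColumn("doubled", multiply_by_two(df["amount"]))

# Grouped Map: receive a Pandas DataFrame per group and process it
def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["normalized"] = (
        (pdf["value"] - pdf["value"].mean()) / pdf["value"].std()
    )
    return pdf

# The output schema must list every column the function returns
output_schema = "department string, value double, normalized double"
normalized = (
    df.select("department", "value")
    .groupby("department")
    .applyInPandas(normalize, schema=output_schema)
)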

Exam Focus Points

pandas API on Spark appears on both the Spark Developer Associate and Data Engineer Associate (DEA) exams. The following three points are especially important.

1. Exact import statement

import pyspark.pandas as ps is the correct import statement. import koalas as ks (old Koalas style) and from pyspark import pandas appear as distractor answers on the exam.

2. Choosing the right DataFrame conversion method

The difference between to_pandas_on_spark() (Spark → pyspark.pandas; newer Spark releases name this method pandas_api()) and toPandas() (Spark → native Pandas) is a frequent topic. The former preserves distributed state, while the latter collects data to the driver.

3. Understanding distributed-environment constraints

True/false questions test constraints like "row order is not guaranteed," "iloc is inefficient," and "some Pandas functions are unsupported." You need to accurately understand the behavioral differences from native Pandas.

Check with a Sample Question

Spark Developer / DEA

Question 1

Which of the following correctly describes the behavior of the code below?

import pyspark.pandas as ps

ps_df = ps.read_csv("/data/large_dataset.csv")
result = ps_df.groupby("region").agg({"sales": "sum", "quantity": "mean"})
final = result.to_pandas()

A. read_csv loads the entire dataset into the driver node's memory at call time
B. groupby and agg run distributed on Spark, and only the aggregated result is collected to the driver at to_pandas()
C. pyspark.pandas's groupby runs as single-node processing just like native Pandas
D. to_pandas() provides a Pandas-compatible API while preserving distributed state

Correct answer: B

pyspark.pandas's read_csv uses Spark's data source API internally to load data in a distributed manner (A is wrong). groupby and agg are also converted into Spark DataFrame operations and executed in distributed mode (C is wrong). to_pandas() converts a pyspark.pandas DataFrame into a native Pandas DataFrame, collecting the data to the driver node in the process. In this example only the post-aggregation result (one row per region) is collected, so it is usually safe to run. D describes to_pandas_on_spark(); to_pandas() converts to native Pandas, so distributed state is not preserved.

Frequently Asked Questions

What is the difference between Pandas API on Spark and native Pandas?

The syntax is nearly identical, but the execution engines are fundamentally different. Native Pandas processes data in memory on a single node, while Pandas API on Spark runs on Apache Spark's distributed processing engine. This lets you process hundreds of GBs to TBs of data using Pandas syntax. However, there are distributed-environment constraints: row order is not guaranteed, positional access via iloc is inefficient, and some Pandas functions are missing or only partly implemented (for example, str.extractall).

What is the relationship between Koalas and Pandas API on Spark?

Koalas is the predecessor project of Pandas API on Spark. Databricks developed it as open source, and it was merged into Apache Spark 3.2. It is now shipped as the standard pyspark.pandas module, so there is no need to pip install Koalas. The Koalas API has been almost entirely migrated to pyspark.pandas. A migration guide is available in the official Spark documentation.

What is the difference between Pandas UDF and Pandas API on Spark?

The names are similar, but the purposes and usage differ. Pandas API on Spark (pyspark.pandas) is an overall API for writing Spark distributed processing in Pandas syntax: you can create, transform, and aggregate DataFrames in a Pandas-like style. Pandas UDF (pandas_udf), on the other hand, is a mechanism for applying Pandas functions to columns or groups of a Spark DataFrame. It uses vectorized data transfer via Apache Arrow and can run up to about 100x faster than row-at-a-time Python UDFs, depending on the workload. Exams sometimes ask about this distinction.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
