
pandas API on Spark Complete Guide

2026-03-21
Updated: 2026-03-27
NicheeLab Editorial Team

pandas API on Spark (pyspark.pandas) is an API that lets you run familiar Pandas syntax directly on Apache Spark's distributed processing engine. You can process large datasets that single-node Pandas cannot handle, with minimal code rewriting.

This article covers the Koalas integration history, a comparison with Pandas, DataFrame interconversion code examples, distributed-environment constraints and unsupported APIs, differences from Pandas UDF, and focus points for the Databricks certification exams.

What is pandas API on Spark

pandas API on Spark is an API officially merged into Spark in Apache Spark 3.2, shipped as the pyspark.pandas module. Its predecessor was the open-source project "Koalas" developed by Databricks, designed so that Pandas users could transparently leverage Spark's distributed processing.

The key feature is that it provides the same API as Pandas DataFrame and Series while internally running as a Spark DataFrame in distributed mode. This lets you process hundreds of GBs to TBs of data with intuitive Pandas code.

The Koalas Integration History

Databricks released Koalas as open source in 2019. It implemented a Pandas API compatibility layer on top of Spark to answer the demand "I want to run Pandas code on Spark as-is." It was merged into Apache Spark in Spark 3.2 (released October 2021) and became a standard module as pyspark.pandas. Today, pip install koalas is no longer needed: PySpark 3.2 or later includes the module with no extra installation.

How to Import

Importing pandas API on Spark takes just one line. The conventional alias is ps.

import pyspark.pandas as ps

# Read a CSV file (same syntax as Pandas)
df = ps.read_csv("/mnt/data/sales.csv")

# Read a Parquet file
df = ps.read_parquet("/mnt/data/sales.parquet")

# Read a Delta table (requires Delta Lake, which Databricks includes by default)
df = ps.read_delta("/mnt/data/delta_table")

# Basic operations
df.head(10)
df.describe()
df.shape
df.dtypes

Pandas vs pandas API on Spark Comparison

The table below summarizes the main differences between native Pandas and pandas API on Spark. The exam tests whether you understand these distinctions accurately.

| Comparison Item | Pandas | pandas API on Spark |
|---|---|---|
| Execution engine | Single node (CPython) | Apache Spark distributed processing |
| Data size limit | Whatever fits in memory | Cluster storage capacity |
| API compatibility | Baseline (100%) | Covers about 80-85% |
| Lazy evaluation | Eager evaluation | Lazy + partially eager |
| Row order guarantee | Guaranteed | Not guaranteed (distributed) |
| Index | Fully supported | Partial limits (iloc inefficient) |
| Import | import pandas as pd | import pyspark.pandas as ps |
| Execution environment | Local Python | On a Spark cluster |

DataFrame Interconversion Code Examples

Converting between the three DataFrame types — Pandas DataFrame, Spark DataFrame, and pyspark.pandas DataFrame — is a frequent exam topic. Memorize the differences between each conversion method accurately.

import pandas as pd
import pyspark.pandas as ps
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# ---- 1. Pandas -> pyspark.pandas ----
pandas_df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})
ps_df = ps.from_pandas(pandas_df)

# ---- 2. pyspark.pandas -> Pandas ----
# Note: collects all data to the driver
pandas_df2 = ps_df.to_pandas()

# ---- 3. Spark DataFrame -> pyspark.pandas ----
spark_df = spark.createDataFrame(
    [("Alice", 30), ("Bob", 25)], ["name", "age"]
)
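# to_pandas_on_spark() is the Spark 3.2 name; later releases deprecate it
# in favor of the equivalent spark_df.pandas_api()
ps_df2 = spark_df.to_pandas_on_spark()
# Alternative
ps_df2 = ps.DataFrame(spark_df)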

# ---- 4. pyspark.pandas -> Spark DataFrame ----
spark_df2 = ps_df.to_spark()

# ---- 5. Spark DataFrame -> Pandas (reference) ----
# Note: collects all data to the driver
pandas_df3 = spark_df.toPandas()

# ---- 6. Pandas -> Spark DataFrame (reference) ----
spark_df3 = spark.createDataFrame(pandas_df)

Important: toPandas() and to_pandas() collect the entire dataset into the driver node's memory. For large datasets there is a risk of OutOfMemoryError, so it is safer to use to_spark() to convert to a Spark DataFrame and process from there.

Constraints and Unsupported APIs

pandas API on Spark covers about 80-85% of the Pandas API, but due to the nature of the distributed environment, the following APIs and behaviors are restricted.

Major Unsupported or Restricted APIs

| Category | Restricted API / Behavior | Reason |
|---|---|---|
| Index operations | iloc (positional access) | Triggers shuffle, inefficient |
| Window operations | ewm (exponentially weighted moving average) | Depends on row order, unfit for distribution |
| String operations | str.extractall (all regex matches) | Requires complex state management |
| Join | DataFrame.append (deprecated) | Deprecated in Pandas itself too |
| Sort | Implicit row-order guarantee | Not guaranteed across distributed partitions |
| Plot | plot()-family methods | Requires data collection, inefficient |
| MultiIndex | Some MultiIndex operations | Complex to implement in distributed environments |

Best Practices

  • Use pyspark.pandas for preprocessing large data, and only collect locally with to_pandas() at the final visualization or reporting stage.
  • Prefer built-in aggregation functions (sum, mean, count) over apply().
  • For JOINs, window functions, and similar operations, the Spark DataFrame API is faster — convert via to_spark() and process there.
  • The result of sort_values() may have its order broken by downstream processing, so sort right before the final output.

Differences from Pandas UDF

"Pandas API on Spark" and "Pandas UDF" have similar names and are often confused, but their purposes and usage are different.

| Comparison Item | pandas API on Spark | Pandas UDF (pandas_udf) |
|---|---|---|
| Purpose | Write distributed processing in Pandas syntax | Apply Pandas functions to a Spark DataFrame |
| Input | pyspark.pandas DataFrame | Spark DataFrame columns/groups |
| Output | pyspark.pandas DataFrame | Spark DataFrame columns |
| Internal processing | Converted to a Spark plan | Batch transfer via Apache Arrow |
| Usage | All operations from ps.read_csv() etc. | Define a function with the @pandas_udf decorator |
| Speed | Comparable to the Spark API | Often much faster than a row-at-a-time Python UDF (gains depend on the workload) |

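from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Scalar Pandas UDF: receives a column as batches of pd.Series
@pandas_udf(DoubleType())
def multiply_by_two(s: pd.Series) -> pd.Series:
    return s * 2

# Assumes a "sales" table with a double "amount" column
df = spark.table("sales")
result = df.withColumn("doubled", multiply_by_two(df["amount"]))

# Grouped Map: receive a Pandas DataFrame per group and process it
def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["normalized"] = (
        (pdf["value"] - pdf["value"].mean()) / pdf["value"].std()
    )
    return pdf

# The schema must list every column that normalize() returns.
# Assumes the table also has "department" (string) and "value" (double) columns.
output_schema = "department string, value double, normalized double"
normalized = (
    df.select("department", "value")
    .groupby("department")
    .applyInPandas(normalize, schema=output_schema)
)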
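
Since the function passed to applyInPandas is plain Pandas code, its logic can be checked locally on a small frame before it is run on a cluster. The sketch below repeats the same z-score logic as the grouped-map function above, written with assign so the input is not modified; the sample group is hypothetical and no Spark session is needed.

```python
import pandas as pd


def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    """Z-score the 'value' column within one group."""
    return pdf.assign(
        normalized=(pdf["value"] - pdf["value"].mean()) / pdf["value"].std()
    )


# One hypothetical group, shaped like what Spark passes in per department.
group = pd.DataFrame({"department": ["a", "a", "a"], "value": [1.0, 2.0, 3.0]})
out = normalize(group)
print(out)
```

The returned columns (department, value, normalized) are exactly what the schema argument of applyInPandas has to describe.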

Exam Focus Points

pandas API on Spark appears on both the Spark Developer Associate and Data Engineer Associate (DEA) exams. The following three points are especially important.

1. Exact import statement

import pyspark.pandas as ps is the correct import statement. import koalas as ks (old Koalas) and from pyspark import pandas appear as distractor answers on the exam.

2. Choosing the right DataFrame conversion method

The difference between to_pandas_on_spark() (Spark → pyspark.pandas) and toPandas() (Spark → native Pandas) is a frequent topic. The former preserves distributed state, while the latter collects data to the driver.

3. Understanding distributed-environment constraints

True/false questions test constraints like "row order is not guaranteed," "iloc is inefficient," and "some Pandas functions are unsupported." You need to accurately understand the behavioral differences from native Pandas.

Check with a Sample Question

Spark Developer / DEA

Question 1

Which of the following correctly describes the behavior of the code below?

import pyspark.pandas as ps

ps_df = ps.read_csv("/data/large_dataset.csv")
result = ps_df.groupby("region").agg({"sales": "sum", "quantity": "mean"})
final = result.to_pandas()

  A. read_csv loads the entire dataset into the driver node's memory at call time
  B. groupby and agg run distributed on Spark, and only the aggregated result is collected to the driver at to_pandas()
  C. pyspark.pandas's groupby runs as single-node processing just like native Pandas
  D. to_pandas() provides a Pandas-compatible API while preserving distributed state

Correct answer: B

pyspark.pandas's read_csv uses Spark's data source API internally to load data in a distributed manner (A is wrong). groupby and agg are also converted into Spark DataFrame operations and executed in distributed mode (C is wrong). to_pandas() converts a pyspark.pandas DataFrame into a native Pandas DataFrame, collecting the data to the driver node in the process. In this example only the post-aggregation result (one row per region) is collected, so it is usually safe to run. D describes to_pandas_on_spark(); to_pandas() converts to native Pandas, so distributed state is not preserved.


Frequently Asked Questions

What is the difference between Pandas API on Spark and native Pandas?

The syntax is nearly identical, but the execution engines are fundamentally different. Native Pandas processes data in memory on a single node, while Pandas API on Spark runs on Apache Spark's distributed processing engine. This lets you process hundreds of GBs to TBs of data using Pandas syntax. However, there are distributed-environment constraints: row order is not guaranteed, positional access via iloc is inefficient, and some Pandas functions (str.extractall, some MultiIndex operations, etc.) are unsupported or limited.

What is the relationship between Koalas and Pandas API on Spark?

Koalas is the predecessor project of Pandas API on Spark. Databricks developed it as open source, and it was merged into Apache Spark 3.2. It is now shipped as the standard pyspark.pandas module, so there is no need to pip install Koalas. The Koalas API has been almost entirely migrated to pyspark.pandas. A migration guide is available in the official Spark documentation.

What is the difference between Pandas UDF and Pandas API on Spark?

The names are similar, but the purposes and usage differ. Pandas API on Spark (pyspark.pandas) is an overall API for writing Spark distributed processing in Pandas syntax — you can create, transform, and aggregate DataFrames in a Pandas-like style. Pandas UDF (pandas_udf), on the other hand, is a mechanism for applying Pandas functions to columns or groups of a Spark DataFrame, using vectorized data transfer via Apache Arrow to run tens of times faster than regular Python UDFs. Exams sometimes ask about this distinction.
