Spark Connect: Decoupled Spark Client/Server (2026)

Spark Connect is the decoupled client/server architecture introduced in Apache Spark 3.4. In a classic Spark application, the client code and the Spark driver run in the same JVM. With Spark Connect, the client attaches to a remote SparkSession over gRPC. Databricks certification exams test the mechanics, benefits, and limitations of this architecture.

Spark Connect Architecture

Spark Connect builds a DataFrame API logical plan on the client side and ships it to the server over gRPC. The server then optimizes the plan, converts it into a physical plan, runs it on the Spark cluster, and streams the results back over gRPC.

┌─────────────────────────┐     gRPC     ┌──────────────────────────────┐
│   Client (IDE, etc.)    │ ───────────► │   Spark Connect server       │
│                         │              │                              │
│  PySpark / Scala Client │ ◄─────────── │  SparkSession                │
│  DataFrame API          │   results    │    ├─ Catalyst Optimizer     │
│  Spark SQL API          │              │    ├─ Physical Planner       │
│                         │              │    └─ Spark Executors        │
└─────────────────────────┘              └──────────────────────────────┘
        Local machine                           Remote cluster

In the classic architecture, the client process acted as the driver and talked directly to executors over RPC. With Spark Connect, the client is fully decoupled from the driver, so a crash on the client side no longer takes the driver down with it.
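The plan-shipping flow described in this section can be illustrated with a toy model in plain Python. This is a sketch only: `ToyDataFrame` and `toy_server_execute` are hypothetical names, JSON stands in for the real protobuf messages, and nothing here is Spark Connect's actual implementation. The point is that the client only records operations, and the server is the only side that touches data.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyDataFrame:
    """Client-side handle that only records operations; it holds no data."""
    plan: tuple = field(default_factory=tuple)

    def filter(self, column, value):
        # Append a step to the plan instead of filtering anything locally.
        return ToyDataFrame(self.plan + (("filter", column, value),))

    def select(self, *columns):
        return ToyDataFrame(self.plan + (("select", list(columns)),))

    def to_wire(self):
        # Stand-in for the serialized logical plan sent to the server.
        return json.dumps(self.plan)


def toy_server_execute(wire_plan, table):
    """Server side: decode the plan and run each step against the data."""
    rows = table
    for op, *args in json.loads(wire_plan):
        if op == "filter":
            column, value = args
            rows = [r for r in rows if r[column] == value]
        elif op == "select":
            (columns,) = args
            rows = [{c: r[c] for c in columns} for r in rows]
        else:
            raise ValueError(f"unsupported operation: {op}")
    return rows


# The data lives only on the "server"; the client ships a description of the work.
table = [
    {"name": "Aiko", "region": "APAC"},
    {"name": "Ben", "region": "EMEA"},
]
df = ToyDataFrame().filter("region", "APAC").select("name")
result = toy_server_execute(df.to_wire(), table)
print(result)
```

Because the client holds nothing but the plan, losing the client loses no data or server state, which is the property the decoupling argument rests on.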

Benefits of Spark Connect

Improved stability: OOMs or crashes on the client no longer propagate to the driver, which improves stability for long-running jobs.
Upgrade flexibility: Client and server Spark versions can be upgraded independently, with backward compatibility guaranteed at the gRPC protocol level.
Remote development experience: Attach a local IDE to a remote Spark cluster with full code completion, debugging, and step execution.
Multi-language support: Because the protocol is gRPC-based, clients can be implemented in languages outside the officially supported list.

Spark Connect vs Databricks Connect

Item	Spark Connect	Databricks Connect v2
Provider	Apache Spark OSS	Databricks
Protocol	gRPC	gRPC (built on Spark Connect)
Connection target	Any Spark Connect server	Databricks clusters / Serverless
Authentication	Custom implementation required	PAT / OAuth / Azure AD integration
Unity Catalog	Not supported	Supported (including table ACLs)
Serverless connection	Not supported	Supported (DBR 15.x+)
Install	`pyspark[connect]`	`databricks-connect`
Minimum runtime	Spark 3.4	DBR 13.0

Setup

Spark Connect (OSS)

# Start the Spark Connect server
./sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.5.0

# Connect from a Python client
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .remote("sc://spark-server-host:15002") \
    .getOrCreate()

df = spark.sql("SELECT 1 AS test")
df.show()
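The `sc://` address above can also carry parameters. The following is a minimal sketch of a helper that assembles such a connection string, assuming the `sc://host:port/;key=value` form and the `use_ssl` and `token` parameter names from the Spark Connect connection string documentation. The function names `build_remote_url` and `connect` are hypothetical.

```python
def build_remote_url(host, port=15002, token=None, use_ssl=False):
    """Build a Spark Connect connection string of the form sc://host:port/;k=v."""
    params = []
    if use_ssl:
        params.append("use_ssl=true")
    if token is not None:
        params.append(f"token={token}")
    base = f"sc://{host}:{port}"
    # Parameters, when present, follow "/;" and are separated by ";".
    return base if not params else base + "/;" + ";".join(params)


def connect(url):
    # Imported lazily so build_remote_url can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(url).getOrCreate()


print(build_remote_url("spark-server-host"))
```

With the OSS server, a token is only useful if something in front of the server actually validates it, which is what "custom implementation required" in the comparison table refers to.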

Databricks Connect v2

# Install (quoted so the shell does not expand the wildcard)
pip install "databricks-connect==15.4.*"

# Connect from a Python client
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder \
    .host("https://adb-xxxx.azuredatabricks.net") \
    .token("dapi...") \
    .clusterId("0123-456789-abcdefgh") \
    .getOrCreate()

# Access a Unity Catalog table
df = spark.table("my_catalog.my_schema.customers")
df.filter(df.region == "APAC").show()

# Convert the DataFrame to a local pandas DataFrame
pdf = df.toPandas()

Limitations (Frequently Tested)

Because Spark Connect ships only logical plans between client and server over gRPC, the following operations are not supported.

Limitation	Reason	Workaround
Direct use of the RDD API	RDDs cannot be expressed as Catalyst logical plans	Rewrite using the DataFrame API
Custom Accumulators	Client-side JVM objects cannot be serialized over the wire	Use DataFrame aggregation functions instead
Direct access to SparkContext	SparkContext only exists on the server side	Use the SparkSession API instead
Custom Partitioner	Depends on RDD-level operations	Use repartition / coalesce instead
Broadcasting arbitrary JVM functions	The client JVM is decoupled from the server	Register a UDF instead
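When migrating existing code, a quick static check can locate places that rely on the first and third rows of the table. The sketch below uses Python's `ast` module to flag attribute accesses named `sparkContext` or `rdd`. The attribute set is an illustrative assumption, not an official list, and `find_unsupported_calls` is a hypothetical helper.

```python
import ast

# Attribute names treated as unsupported here; an illustrative set, not an official list.
UNSUPPORTED_ATTRIBUTES = {"sparkContext", "rdd"}


def find_unsupported_calls(source):
    """Return sorted (line, attribute) pairs for accesses that rely on RDDs or SparkContext."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute) and node.attr in UNSUPPORTED_ATTRIBUTES:
            hits.append((node.lineno, node.attr))
    return sorted(hits)


snippet = (
    "df = spark.table('t')\n"
    "rows = spark.sparkContext.parallelize([1, 2, 3]).collect()\n"
    "parts = df.rdd.getNumPartitions()\n"
)
print(find_unsupported_calls(snippet))
```

The check is purely syntactic, so it can report false positives (an unrelated object with an `rdd` attribute) and will miss dynamic access such as `getattr(spark, "sparkContext")`.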

How This Topic Shows Up on the Exam

Architecture understanding: "Which protocol do the Spark Connect client and server use to communicate?" → gRPC
Identifying limitations: "Which of the following operations cannot run on Spark Connect?" → Direct RDD operations
Differences from Databricks Connect: "What is the internal protocol of Databricks Connect v2?" → Spark Connect (gRPC)
Selecting the right benefit: "What is the benefit of the Spark Connect client decoupling?" → A client crash no longer affects the running job

Sample Question

Spark Connect - Limitations

Question 1

A developer connects to a remote cluster with Spark Connect and tries to run the following operations. Which operation is NOT supported by Spark Connect?

A. spark.sql('SELECT * FROM catalog.schema.table').filter(col('status') == 'active').show()
B. df.groupBy('region').agg(sum('revenue').alias('revenue')).orderBy(desc('revenue'))
C. spark.sparkContext.parallelize([1, 2, 3]).map(lambda x: x * 2).collect()
D. spark.table('catalog.schema.events').write.mode('append').saveAsTable('catalog.schema.events_copy')

Correct answer: C

Spark Connect cannot use sparkContext.parallelize() or RDD operations such as map, flatMap, and reduce. The Spark Connect client is designed to ship logical plans over gRPC, and RDDs cannot be expressed in the Catalyst logical plan representation. Options A, B, and D are all DataFrame API / Spark SQL API operations and are fully supported by Spark Connect.
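For comparison, the logic in the unsupported option can be expressed with DataFrame operations only. This is a minimal sketch, assuming `spark` is an already connected session. The function name `double_values` and the column names `x` and `doubled` are illustrative.

```python
def double_values(spark, values):
    """DataFrame-only equivalent of parallelize(values).map(lambda x: x * 2).collect()."""
    # Build a single-column DataFrame from local data instead of an RDD.
    df = spark.createDataFrame([(v,) for v in values], "x INT")
    # The doubling becomes a SQL expression, which fits into a logical plan.
    doubled = df.selectExpr("x * 2 AS doubled")
    return [row["doubled"] for row in doubled.collect()]
```

Calling `double_values(spark, [1, 2, 3])` sends a plan to the server rather than a Python lambda, which is why this version works where the RDD version does not.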

Frequently Asked Questions

Can I use the RDD API with Spark Connect?

No. The Spark Connect client cannot use the RDD API directly. It only supports the DataFrame API and the Spark SQL API, because RDD operations cannot be serialized over the gRPC protocol by design. If you need RDD operations, fall back to the classic driver-attached connection model or wrap the logic inside a UDF. This constraint is a frequently tested topic on the exam.

What is the difference between Spark Connect and Databricks Connect?

Spark Connect is the open-source Apache Spark client/server decoupled architecture. Databricks Connect builds on top of the Spark Connect protocol and layers on Databricks-specific features such as workspace authentication, Unity Catalog integration, and Serverless Compute connectivity. Databricks Connect v2 (DBR 13.0+) uses the Spark Connect protocol under the hood.

Which languages support the Spark Connect client?

The Spark Connect client officially supports Python (PySpark), Scala, and Java. Because the protocol is gRPC-based, in principle any language with a gRPC client implementation can connect, but PySpark is by far the most widely used in production. Databricks Connect additionally has third-party clients for R and Go.


Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
