Spark Connect is the decoupled client/server architecture introduced in Apache Spark 3.4. In a classic Spark application, the client code and the Spark driver run in the same JVM. With Spark Connect, the client instead attaches to a remote SparkSession over gRPC. Databricks certification exams test the mechanics, benefits, and limitations of this architecture.
Spark Connect builds a DataFrame API logical plan on the client side and ships it to the server over gRPC. The server then optimizes the plan, converts it into a physical plan, runs it on the Spark cluster, and streams the results back over gRPC.
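To make this split concrete, here is a toy model in plain Python that needs no Spark installation. It is only a sketch: `Plan`, `to_wire`, and `run_on_server` are hypothetical names, and JSON stands in for the protobuf messages that Spark Connect actually sends over gRPC. The point is that the client only describes what to compute, while the server holds the data and does the work.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Toy client-side logical plan: a table name plus pending operations."""
    table: str
    ops: tuple = ()

    def filter(self, column, equals):
        # Nothing runs here; the operation is only recorded in the plan.
        op = {"op": "filter", "column": column, "equals": equals}
        return Plan(self.table, self.ops + (op,))

    def select(self, *columns):
        op = {"op": "select", "columns": list(columns)}
        return Plan(self.table, self.ops + (op,))

    def to_wire(self):
        # Stand-in for protobuf over gRPC: the plan is plain data, so it serializes.
        return json.dumps({"table": self.table, "ops": list(self.ops)}).encode("utf-8")


def run_on_server(payload, tables):
    """Toy server: decode the plan, run it against server-side data, return rows."""
    plan = json.loads(payload.decode("utf-8"))
    rows = tables[plan["table"]]
    for op in plan["ops"]:
        if op["op"] == "filter":
            rows = [r for r in rows if r[op["column"]] == op["equals"]]
        elif op["op"] == "select":
            rows = [{c: r[c] for c in op["columns"]} for r in rows]
    return rows


# The data lives only on the "server" side.
SERVER_TABLES = {
    "customers": [
        {"name": "Aiko", "region": "APAC"},
        {"name": "Bram", "region": "EMEA"},
    ]
}

plan = Plan("customers").filter("region", "APAC").select("name")  # client: lazy
result = run_on_server(plan.to_wire(), SERVER_TABLES)             # server: executes
print(result)  # [{'name': 'Aiko'}]
```

Note that the toy server skips the optimization and physical planning steps entirely; in real Spark those happen on the server between decoding the plan and executing it.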
    ┌─────────────────────────┐    gRPC     ┌──────────────────────────────┐
    │ Client (IDE, etc.)      │ ──────────► │ Spark Connect server         │
    │                         │             │                              │
    │ PySpark / Scala Client  │ ◄────────── │ SparkSession                 │
    │ DataFrame API           │   results   │  ├─ Catalyst Optimizer       │
    │ Spark SQL API           │             │  ├─ Physical Planner         │
    │                         │             │  └─ Spark Executors          │
    └─────────────────────────┘             └──────────────────────────────┘
             Local PC                                Remote cluster

In the classic architecture, the client JVM acted as the driver and talked directly to executors over RPC. With Spark Connect, the client is fully decoupled from the driver, so a crash on the client side no longer takes the Spark job down with it.
| Item | Spark Connect | Databricks Connect v2 |
|---|---|---|
| Provider | Apache Spark OSS | Databricks |
| Protocol | gRPC | gRPC (built on Spark Connect) |
| Connection target | Any Spark Connect server | Databricks clusters / Serverless |
| Authentication | Custom implementation required | PAT / OAuth / Azure AD integration |
| Unity Catalog | Not supported | Supported (including table ACLs) |
| Serverless connection | Not supported | Supported (DBR 15.x+) |
| Install | pyspark[connect] | databricks-connect |
| Minimum runtime | Spark 3.4 | DBR 13.0 |
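The "Authentication" row means that with OSS Spark Connect you supply any credentials yourself, typically as parameters on the connection string. As a rough sketch, the helper below assembles a string in the `sc://host:port/;key=value` form that the Spark Connect client accepts. `build_remote_url` is a hypothetical helper, and the host names and token are placeholders. Whether a token is actually validated depends on how your server is deployed (for example, behind an authenticating proxy).

```python
def build_remote_url(host, port=15002, **params):
    """Build a Spark Connect connection string: sc://host:port/;key=value;..."""
    url = f"sc://{host}:{port}"
    if params:
        # Parameters follow "/;" and are separated by ";". Booleans are lowercased.
        parts = []
        for key, value in params.items():
            if isinstance(value, bool):
                value = "true" if value else "false"
            parts.append(f"{key}={value}")
        url += "/;" + ";".join(parts)
    return url


# Plain connection on the default Spark Connect port.
print(build_remote_url("spark-server-host"))
# -> sc://spark-server-host:15002

# TLS endpoint with a bearer token (placeholder values).
print(build_remote_url("spark.example.com", 443, use_ssl=True, token="<your-token>"))
# -> sc://spark.example.com:443/;use_ssl=true;token=<your-token>
```

The resulting string is what you would pass to `SparkSession.builder.remote(...)`. Databricks Connect hides this step behind its own builder and the workspace authentication options listed in the table.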
Start the Spark Connect server:

    ./sbin/start-connect-server.sh \
      --packages org.apache.spark:spark-connect_2.12:3.5.0

Connect from a Python client:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .remote("sc://spark-server-host:15002")
        .getOrCreate()
    )

    df = spark.sql("SELECT 1 AS test")
    df.show()

Install Databricks Connect:

    pip install "databricks-connect==15.4.*"

Connect from a Python client with Databricks Connect:

    from databricks.connect import DatabricksSession

    spark = (
        DatabricksSession.builder
        .host("https://adb-xxxx.azuredatabricks.net")
        .token("dapi...")
        .clusterId("0123-456789-abcdefgh")
        .getOrCreate()
    )

    # Access a Unity Catalog table
    df = spark.table("my_catalog.my_schema.customers")
    df.filter(df.region == "APAC").show()

    # Convert the DataFrame to a local pandas DataFrame
    pdf = df.toPandas()

Because Spark Connect only ships logical plans over gRPC between client and server, the following operations are not supported.
| Limitation | Reason | Workaround |
|---|---|---|
| Direct use of the RDD API | RDDs cannot be expressed as Catalyst logical plans | Rewrite using the DataFrame API |
| Custom Accumulators | Client-side JVM objects cannot be serialized over the wire | Use DataFrame aggregation functions instead |
| Direct access to SparkContext | SparkContext only exists on the server side | Use the SparkSession API instead |
| Custom Partitioner | Depends on RDD-level operations | Use repartition / coalesce instead |
| Broadcasting arbitrary JVM functions | The client JVM is decoupled from the server | Register a UDF instead |
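When migrating existing code, a quick first pass is to search it for the APIs in the table above. The sketch below does this with Python's standard `ast` module, so it runs without Spark. `find_unsupported_calls` and the `UNSUPPORTED_ATTRS` list are assumptions made for illustration: the check is a name-based heuristic, not an official compatibility tool, and it can miss or over-report usages.

```python
import ast

# Attribute names that usually signal RDD / SparkContext usage.
# This list is a heuristic chosen for illustration, not an official one.
UNSUPPORTED_ATTRS = {"sparkContext", "rdd", "parallelize", "accumulator"}


def find_unsupported_calls(source):
    """Return sorted (line, attribute) pairs for suspicious attribute accesses."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute) and node.attr in UNSUPPORTED_ATTRS:
            hits.add((node.lineno, node.attr))
    return sorted(hits)


legacy_code = """
rdd = spark.sparkContext.parallelize([1, 2, 3])
total = rdd.map(lambda x: x * 2).reduce(lambda a, b: a + b)
df = spark.range(3).selectExpr("id * 2 AS doubled")
"""

for line, attr in find_unsupported_calls(legacy_code):
    print(f"line {line}: uses .{attr}")
# line 2: uses .parallelize
# line 2: uses .sparkContext
```

The last line of `legacy_code` is not flagged: it expresses the same doubling with the DataFrame API, which is the rewrite the table recommends.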
Spark Connect - Limitations
Question 1
A developer connects to a remote cluster with Spark Connect and tries to run the following operations. Which operation is NOT supported by Spark Connect?
Correct answer: C
Spark Connect cannot use sparkContext.parallelize() or RDD operations such as map, flatMap, and reduce. The Spark Connect client is designed to ship logical plans over gRPC, and RDDs cannot be expressed in the Catalyst logical plan representation. Options A, B, and D are all DataFrame API / Spark SQL API operations and are fully supported by Spark Connect.
Can I use the RDD API with Spark Connect?
No. The Spark Connect client cannot use the RDD API directly; it supports only the DataFrame API and the Spark SQL API. By design, the gRPC protocol carries logical plans, and RDD operations have no representation in that plan format. If you need RDD operations, fall back to the classic connection model, where the client runs in the driver, or wrap the logic inside a UDF. This constraint is a frequently tested topic on the exam.
What is the difference between Spark Connect and Databricks Connect?
Spark Connect is the open-source Apache Spark client/server decoupled architecture. Databricks Connect builds on top of the Spark Connect protocol and layers on Databricks-specific features such as workspace authentication, Unity Catalog integration, and Serverless Compute connectivity. Databricks Connect v2 (DBR 13.0+) uses the Spark Connect protocol under the hood.
Which languages support the Spark Connect client?
The Spark Connect client officially supports Python (PySpark), Scala, and Java. Because the protocol is gRPC-based, in principle any language with a gRPC client implementation can connect, but PySpark is by far the most widely used in production. Beyond these, R users can connect through sparklyr, and the Apache Spark project hosts a Go client (spark-connect-go).
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.