What Is Spark Connect? Remote Spark Architecture and Exam Prep

2026-03-21
Updated: 2026-03-27
NicheeLab Editorial Team

Spark Connect is the client/server decoupled architecture introduced in Apache Spark 3.4. In classic Spark applications, the client code and the Spark driver ran in the same JVM. With Spark Connect, you can attach to a remote SparkSession over the gRPC protocol. Databricks certification exams test the mechanics, benefits, and limitations of this architecture.

Spark Connect Architecture

Spark Connect builds a DataFrame API logical plan on the client side and ships it to the server over gRPC. The server then optimizes the plan, converts it into a physical plan, runs it on the Spark cluster, and streams the results back over gRPC.

┌─────────────────────────┐     gRPC     ┌──────────────────────────────┐
│  Client (IDE, etc.)     │ ───────────► │  Spark Connect server        │
│                         │              │                              │
│  PySpark / Scala Client │ ◄─────────── │  SparkSession                │
│  DataFrame API          │   results    │    ├─ Catalyst Optimizer     │
│  Spark SQL API          │              │    ├─ Physical Planner       │
│                         │              │    └─ Spark Executors        │
└─────────────────────────┘              └──────────────────────────────┘
         Local PC                                 Remote cluster

In the classic architecture, the client application ran inside the driver process and talked directly to executors over RPC. With Spark Connect, the client process is decoupled from the driver, so a crash on the client side does not take down the driver or the Spark jobs it is running.
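The plan-shipping flow described in this section can be mimicked with a toy model in plain Python. This is only a sketch of the idea, not the real protocol: Spark Connect encodes plans as protobuf messages and the server runs Catalyst, whereas this example uses nested dicts, JSON, and a tiny interpreter. The names `ToyDataFrame`, `toy_server_execute`, and `SERVER_TABLES` are invented for illustration.

```python
import json


class ToyDataFrame:
    """Client-side handle that only records a plan; it holds no data."""

    def __init__(self, plan):
        self.plan = plan

    def filter(self, column, value):
        # Wrap the current plan in a new node instead of touching any rows.
        return ToyDataFrame({"op": "filter", "column": column,
                             "value": value, "child": self.plan})

    def select(self, *columns):
        return ToyDataFrame({"op": "select", "columns": list(columns),
                             "child": self.plan})

    def collect(self, send):
        # Only an action serializes the plan and sends it to the server.
        request = json.dumps(self.plan).encode("utf-8")
        response = send(request)
        return json.loads(response.decode("utf-8"))


# Stand-in for tables that live on the cluster, never on the client.
SERVER_TABLES = {
    "customers": [
        {"name": "Aiko", "region": "APAC"},
        {"name": "Bram", "region": "EMEA"},
        {"name": "Chen", "region": "APAC"},
    ]
}


def toy_server_execute(request):
    """Server side: decode the plan, evaluate it, and return encoded rows."""

    def evaluate(node):
        if node["op"] == "read":
            return SERVER_TABLES[node["table"]]
        rows = evaluate(node["child"])
        if node["op"] == "filter":
            return [r for r in rows if r[node["column"]] == node["value"]]
        if node["op"] == "select":
            return [{c: r[c] for c in node["columns"]} for r in rows]
        raise ValueError(f"unknown plan node: {node['op']}")

    plan = json.loads(request.decode("utf-8"))
    return json.dumps(evaluate(plan)).encode("utf-8")


df = ToyDataFrame({"op": "read", "table": "customers"})
query = df.filter("region", "APAC").select("name")  # no data moved yet
result = query.collect(send=toy_server_execute)     # plan out, rows back
print(result)
```

The point to take away is that `filter` and `select` never touch rows on the client. They only grow the plan, and the single `collect` call is what crosses the wire.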

Benefits of Spark Connect

  • Improved stability: OOMs or crashes on the client no longer propagate to the driver, which improves stability for long-running jobs.
  • Upgrade flexibility: Client and server Spark versions can be upgraded independently, because the gRPC protocol between them is designed to stay backward compatible.
  • Remote development experience: Attach a local IDE to a remote Spark cluster with full code completion, debugging, and step execution.
  • Multi-language support: Because the protocol is gRPC-based, clients can be implemented in languages outside the officially supported list.

Spark Connect vs Databricks Connect

| Item | Spark Connect | Databricks Connect v2 |
| --- | --- | --- |
| Provider | Apache Spark OSS | Databricks |
| Protocol | gRPC | gRPC (built on Spark Connect) |
| Connection target | Any Spark Connect server | Databricks clusters / Serverless |
| Authentication | Custom implementation required | PAT / OAuth / Azure AD integration |
| Unity Catalog | Not supported | Supported (including table ACLs) |
| Serverless connection | Not supported | Supported (DBR 15.x+) |
| Install | pyspark[connect] | databricks-connect |
| Minimum runtime | Spark 3.4 | DBR 13.0 |

Setup

Spark Connect (OSS)

# Start the Spark Connect server (shell)
./sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.5.0

# Connect from a Python client
from pyspark.sql import SparkSession

# 15002 is the default Spark Connect port
spark = SparkSession.builder \
    .remote("sc://spark-server-host:15002") \
    .getOrCreate()

# Run a trivial query to confirm the round trip works
df = spark.sql("SELECT 1 AS test")
df.show()
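The `sc://` address passed to `.remote()` can also carry parameters after a `/`, separated by `;`. The sketch below assembles such a string. The helper names `build_connect_url` and `connect` are hypothetical, and `use_ssl` and `token` are used on the assumption that your server (or a proxy in front of it) accepts them. Check the connection-string documentation for the Spark version you run.

```python
def build_connect_url(host, port=15002, **params):
    """Return an sc:// connection string such as sc://host:15002/;use_ssl=true."""
    url = f"sc://{host}:{port}"
    if params:
        # Parameters follow a "/" and are separated by ";".
        pairs = ";".join(f"{key}={value}" for key, value in params.items())
        url = f"{url}/;{pairs}"
    return url


def connect(url):
    """Open a Spark Connect session (needs pyspark[connect] and a running server)."""
    # Imported lazily so build_connect_url works without PySpark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(url).getOrCreate()


url = build_connect_url("spark-server-host", use_ssl="true", token="<your-token>")
print(url)
# spark = connect(url)  # uncomment once a server is reachable
```

Setting the `SPARK_REMOTE` environment variable to the same string is another option. In that case a plain `SparkSession.builder.getOrCreate()` should pick it up without any `.remote()` call in the code.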

Databricks Connect v2

# Install (quote the spec so the shell does not expand the *)
pip install "databricks-connect==15.4.*"

# Connect from a Python client
from databricks.connect import DatabricksSession

# Placeholder values; avoid committing a real token to source control
spark = DatabricksSession.builder \
    .host("https://adb-xxxx.azuredatabricks.net") \
    .token("dapi...") \
    .clusterId("0123-456789-abcdefgh") \
    .getOrCreate()

# Access a Unity Catalog table
df = spark.table("my_catalog.my_schema.customers")
df.filter(df.region == "APAC").show()

# Convert the DataFrame to a local pandas DataFrame
pdf = df.toPandas()
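Hard-coding the host, token, and cluster ID is fine for a first test, but in practice the values usually come from the environment. Here is a minimal sketch of that pattern. The helper names `load_settings` and `connect` are hypothetical, and the variable names `DATABRICKS_HOST`, `DATABRICKS_TOKEN`, and `DATABRICKS_CLUSTER_ID` are an assumption modeled on the names Databricks tooling commonly reads.

```python
import os


def load_settings(env=None):
    """Collect connection settings from environment variables, failing early if any is missing."""
    env = os.environ if env is None else env
    names = ("DATABRICKS_HOST", "DATABRICKS_TOKEN", "DATABRICKS_CLUSTER_ID")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


def connect(settings):
    """Build a DatabricksSession from explicit settings (needs databricks-connect)."""
    # Imported lazily so load_settings can be used and tested on its own.
    from databricks.connect import DatabricksSession

    return (
        DatabricksSession.builder
        .host(settings["DATABRICKS_HOST"])
        .token(settings["DATABRICKS_TOKEN"])
        .clusterId(settings["DATABRICKS_CLUSTER_ID"])
        .getOrCreate()
    )


# Placeholder values for illustration; call load_settings() with no argument
# to read the real process environment instead.
example = load_settings({
    "DATABRICKS_HOST": "https://adb-xxxx.azuredatabricks.net",
    "DATABRICKS_TOKEN": "dapi...",
    "DATABRICKS_CLUSTER_ID": "0123-456789-abcdefgh",
})
# spark = connect(example)
```

Failing early with a clear message tends to be easier to debug than a gRPC error raised later by the first query.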

Limitations (Frequently Tested)

Because the Spark Connect client only sends logical plans to the server over gRPC, operations that need direct access to driver-side objects are not supported. The most commonly tested cases are listed below.

| Limitation | Reason | Workaround |
| --- | --- | --- |
| Direct use of the RDD API | RDDs cannot be expressed as Catalyst logical plans | Rewrite using the DataFrame API |
| Custom Accumulators | Client-side JVM objects cannot be serialized over the wire | Use DataFrame aggregation functions instead |
| Direct access to SparkContext | SparkContext only exists on the server side | Use the SparkSession API instead |
| Custom Partitioner | Depends on RDD-level operations | Use repartition / coalesce instead |
| Broadcasting arbitrary JVM functions | The client JVM is decoupled from the server | Register a UDF instead |
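When migrating existing code, a quick static scan can point out where these limitations will bite. The sketch below uses Python's `ast` module to flag attribute accesses such as `.sparkContext` and `.rdd`. It is a rough heuristic: the function name `find_unsupported_calls` and the attribute list are illustrative choices, and the list is not exhaustive.

```python
import ast

# Attribute names that reach into RDD or SparkContext territory.
# This list is illustrative, not exhaustive.
UNSUPPORTED_ATTRIBUTES = {"sparkContext", "rdd", "accumulator"}


def find_unsupported_calls(source):
    """Return sorted (line, attribute) pairs for accesses Spark Connect is likely to reject."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute) and node.attr in UNSUPPORTED_ATTRIBUTES:
            findings.append((node.lineno, node.attr))
    return sorted(findings)


sample = '''
df = spark.table("catalog.schema.events")
total = spark.sparkContext.accumulator(0)
pairs = df.rdd.map(lambda row: (row.region, 1))
clean = df.filter(df.region == "APAC")
'''

for line, attr in find_unsupported_calls(sample):
    print(f"line {line}: .{attr} is not available through Spark Connect")
```

A scan like this only matches names, so it can produce false positives on unrelated objects that happen to have an `rdd` attribute. Treat its output as a list of places to review, not as a verdict.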

How This Topic Shows Up on the Exam

  • Architecture understanding: "Which protocol do the Spark Connect client and server use to communicate?" → gRPC
  • Identifying limitations: "Which of the following operations cannot run on Spark Connect?" → Direct RDD operations
  • Differences from Databricks Connect: "What is the internal protocol of Databricks Connect v2?" → Spark Connect (gRPC)
  • Selecting the right benefit: "What is the benefit of the Spark Connect client decoupling?" → A client crash no longer affects the running job

Sample Question

Spark Connect - Limitations

Question 1

A developer connects to a remote cluster with Spark Connect and tries to run the following operations. Which operation is NOT supported by Spark Connect?

  A. spark.sql('SELECT * FROM catalog.schema.table').filter(col('status') == 'active').show()
  B. df.groupBy('region').agg(sum('revenue').alias('revenue')).orderBy(desc('revenue'))
  C. spark.sparkContext.parallelize([1, 2, 3]).map(lambda x: x * 2).collect()
  D. spark.table('catalog.schema.events').write.mode('append').saveAsTable('catalog.schema.events_copy')

Correct answer: C

Spark Connect cannot use sparkContext.parallelize() or RDD operations such as map, flatMap, and reduce. The Spark Connect client is designed to ship logical plans over gRPC, and RDDs cannot be expressed in the Catalyst logical plan representation. Options A, B, and D are all DataFrame API / Spark SQL API operations and are fully supported by Spark Connect.
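For comparison, the unsupported RDD operation from the question can usually be rewritten with the DataFrame API. The function below is one possible rewrite, not the only one. The name `double_values` and the column names `x` and `doubled` are made up for this sketch, and `spark` is assumed to be an already connected Spark Connect session.

```python
def double_values(spark, values):
    """DataFrame stand-in for sparkContext.parallelize(values).map(lambda x: x * 2).collect()."""
    # createDataFrame sends the local values to the server as a relation.
    df = spark.createDataFrame([(v,) for v in values], "x long")
    # selectExpr becomes part of the logical plan and is evaluated on the cluster.
    doubled = df.selectExpr("x * 2 AS doubled")
    # collect() is the action that ships the plan and brings the rows back.
    return [row.doubled for row in doubled.collect()]


# Usage, assuming a reachable Spark Connect server:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.remote("sc://spark-server-host:15002").getOrCreate()
# print(double_values(spark, [1, 2, 3]))
```

Everything here goes through `SparkSession` and `DataFrame` methods, so the work can be expressed as a logical plan, which is exactly what the RDD version cannot do.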

Frequently Asked Questions

Can I use the RDD API with Spark Connect?

No. The Spark Connect client cannot use the RDD API directly. It only supports the DataFrame API and the Spark SQL API, because RDD operations cannot be serialized over the gRPC protocol by design. If you need RDD operations, fall back to the classic driver-attached connection model or wrap the logic inside a UDF. This constraint is a frequently tested topic on the exam.

What is the difference between Spark Connect and Databricks Connect?

Spark Connect is the open-source Apache Spark client/server decoupled architecture. Databricks Connect builds on top of the Spark Connect protocol and layers on Databricks-specific features such as workspace authentication, Unity Catalog integration, and Serverless Compute connectivity. Databricks Connect v2 (DBR 13.0+) uses the Spark Connect protocol under the hood.

Which languages support the Spark Connect client?

The Spark Connect client officially supports Python (PySpark), Scala, and Java. Because the protocol is gRPC-based, in principle any language with a gRPC client implementation can connect, but PySpark is by far the most widely used in production. Databricks Connect additionally has third-party clients for R and Go.
