Confluent Cloud vs Self-Managed Kafka (2026)

When adopting Kafka, the choice between Confluent Cloud and designing and operating your own cluster directly drives TCO and risk profile. This article organizes the perspectives you need to make that decision, based on the general behavior described in the official documentation.

For CCAAK (Confluent Certified Administrator for Apache Kafka) exam preparation, fundamentals of cluster design, security, availability, and scaling come up often, so be ready to explain the conceptual differences between Cloud and self-managed deployments.

Architecture Overview: Managed vs Self-Managed

Confluent Cloud is a SaaS that abstracts away most of the work around Kafka brokers, the metadata layer (ZooKeeper/KRaft), scaling, patching, and monitoring. Users can focus on application-side responsibilities: API keys, network connectivity, topic design, schema design, and ACL/RBAC.

Self-managed operation puts the entire cluster lifecycle on your organization: broker count and instance type selection, storage and network design, rolling upgrades, monitoring and alerting, incident response, and backup/disaster recovery.

Management boundary: Cloud operates within the SaaS SLO/SLA. Self-managed owns everything down to OS, middleware, and network.
Network: Cloud offers private connectivity options such as private link and VPC peering. Self-managed builds VPC design and certificate distribution from scratch.
Scale: Cloud scales via plans and APIs. Self-managed requires adding brokers and partition reassignment.

Connectivity and operational boundary (conceptual diagram)

Minimal setup example (Cloud and OSS)

# Confluent Cloud (example CLI usage; plan names and options vary by account/region)
confluent kafka cluster create my-cluster \
  --cloud aws \
  --region ap-northeast-1 \
  --type basic

# Create a topic (specify retention and partition count)
confluent kafka topic create orders \
  --partitions 6 \
  --config retention.ms=604800000

# Topic creation example for self-managed (OSS Kafka)
kafka-topics.sh --create \
  --topic orders \
  --partitions 6 \
  --replication-factor 3 \
  --bootstrap-server broker1:9092

Cost Model Comparison

Confluent Cloud is broadly a combination of usage-based pricing (data in/out, storage, throughput capacity units, etc.) and a plan fee. This fits PoCs with hard-to-predict usage and incremental scale-out well, while limiting the risk of excess capacity.

Self-managed stacks infrastructure costs (VMs/bare metal, storage, network) on top of staffing, monitoring/backup platforms, and redundancy needed for availability. With high utilization and stable traffic the unit cost is easier to drive down, but you tend to over-provision against peaks.

For the exam, it is important to articulate the cause and effect: retention period and replication factor dominate storage cost, while partition design and batching/compression drive network and CPU efficiency.

Retention directly drives cost: keep unnecessary data short, and apply log compaction selectively based on the use case.
Network billing is mainly egress: reduce it with aggregation, filtering, and compression.
Throughput scales with partition count and parallelism, but over-partitioning increases metadata overhead.
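
The partition/throughput trade-off in the last point can be sketched with a common rule of thumb: divide the target throughput by what one partition sustains on the producer side and on the consumer side, then take the larger result. The per-partition figures below are hypothetical placeholders, not measured values; benchmark your own cluster before relying on them.

```python
import math


def estimate_partitions(target_mb_s: float,
                        producer_mb_s_per_partition: float,
                        consumer_mb_s_per_partition: float) -> int:
    """Rule-of-thumb partition count: enough partitions for the slower side."""
    if min(target_mb_s, producer_mb_s_per_partition, consumer_mb_s_per_partition) <= 0:
        raise ValueError("all throughput figures must be positive")
    needed_for_producers = target_mb_s / producer_mb_s_per_partition
    needed_for_consumers = target_mb_s / consumer_mb_s_per_partition
    return math.ceil(max(needed_for_producers, needed_for_consumers))


# Hypothetical inputs: 100 MB/s target, 10 MB/s produce and 20 MB/s consume per partition.
print(estimate_partitions(100, 10, 20))
```

Treat the result as a floor for throughput rather than a target: adding headroom is reasonable, but every extra partition also adds metadata and open file handles.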

Aspect	Confluent Cloud	Self-Managed	Exam Note
Initial cost	Low (usage/subscription, instant)	High (design, build, procurement required)	Be able to break down TCO into its components
Variable cost	Tied to in/out, storage, and capacity units	VMs, storage, network, plus staffing	Retention and replication drive storage cost
Scaling	Fast via API/CLI; capacity is abstracted	Requires adding brokers and reassignment	Understand the partition-throughput relationship
Upgrades	Managed, aimed at zero or low downtime	Requires planning, validation, and rolling restarts	Know compatibility and protocol evolution
Network cost	Account for egress and private link pricing	Cross-AZ/cross-DC transfer and LB costs	Localizing traffic is the key

Back-of-the-envelope estimate notes (storage/network)

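# Daily ingest (GB) = (average event size (bytes) * events per day) / 1024^3
# Retained storage (GB) ≈ daily ingest (GB) * retention days * replication factor
# Egress (GB) ≈ ingest (GB) * average number of subscribers * filter factor
# With compression enabled, multiply by the compression ratio (e.g. 0.3 to 0.7) for a more realistic estimate
# Example: 1 KB events, 2 billion per day, 7-day retention, RF=3
# Ingest ≈ (1024 * 2e9) / 1024^3 ≈ 1907 GB/day → retained ≈ 1907 * 7 * 3 ≈ just over 40 TB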

Operational Responsibility and SLA

On Confluent Cloud, node replacement on failure, disk resync on failure, cluster scaling, patching, and the monitoring stack are built into the service. Users can focus on operating topics, schemas, ACL/RBAC, and connectivity.

With self-managed, incident response runbooks, monitoring threshold tuning, rolling upgrades, capacity planning, partition reassignment, certificate rotation, and backup/DR drills are all on you. You define, measure, and improve SLA/SLO yourself.

Cloud: a high level of automation for failure detection, failover, and recovery.
Self-managed: securing maintenance windows and building observability are essential.
Exam: you are expected to understand the procedures for rolling restarts, rebalancing, and failover.

Representative ops commands (compared)

# Cloud: increase the partition count
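# (assumes a CLI version that accepts num.partitions via --config; partition counts can only be increased)
confluent kafka topic update orders --config num.partitions=12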

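# Self-managed: partition reassignment (example)
# 1) Write the reassignment plan
echo '{"version":1,"partitions":[{"topic":"orders","partition":0,"replicas":[1,2,3]}]}' > reassignment.json
# 2) Apply the plan (re-run with --verify instead of --execute to check progress)
kafka-reassign-partitions.sh --bootstrap-server broker1:9092 --reassignment-json-file reassignment.json --execute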

# Self-managed: rolling broker restart (example)
# Stop and start brokers one at a time. In real environments, use an orchestrator or automation.
kafka-server-stop.sh && sleep 10 && kafka-server-start.sh -daemon /etc/kafka/server.properties
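
The one-broker-at-a-time loop that an orchestrator runs for a rolling restart can be sketched as follows. `restart_broker` and `cluster_is_healthy` are hypothetical hooks the caller supplies (for example, a systemd restart over SSH and a "no under-replicated partitions" check); nothing here calls a real Kafka API.

```python
import time
from typing import Callable, Iterable


def rolling_restart(broker_ids: Iterable[int],
                    restart_broker: Callable[[int], None],
                    cluster_is_healthy: Callable[[], bool],
                    timeout_s: float = 600.0,
                    poll_s: float = 5.0) -> list[int]:
    """Restart brokers one at a time, waiting for recovery before moving on.

    Both callables are hypothetical hooks provided by the caller.
    Returns the broker ids restarted so far, in order.
    """
    restarted = []
    for broker_id in broker_ids:
        restart_broker(broker_id)
        deadline = time.monotonic() + timeout_s
        # Do not touch the next broker until replicas have caught up again.
        while not cluster_is_healthy():
            if time.monotonic() > deadline:
                raise TimeoutError(
                    f"cluster did not recover after restarting broker {broker_id}")
            time.sleep(poll_s)
        restarted.append(broker_id)
    return restarted


# Dry run with stand-in hooks: the restart only prints, and health is always OK.
print(rolling_restart([1, 2, 3],
                      restart_broker=lambda b: print(f"restarting broker {b}"),
                      cluster_is_healthy=lambda: True))
```

Stopping at the first timeout instead of continuing is deliberate: restarting a second broker while the first has not rejoined the ISR is how a routine restart turns into unavailable partitions.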

Security and Network

Cloud provides in-transit encryption (TLS), API key/secret authentication, RBAC and ACLs, and audit logs. Using a private connectivity option minimizes network exposure.

Self-managed means you design and operate TLS certificate issuance/deployment/rotation, choice of SASL mechanism (e.g. SCRAM or OAUTHBEARER), ACL design, and network isolation (subnets, SGs, NACLs).

Exam topics include the unit of application for ACL/RBAC, SASL/TLS configuration, correctness of client configuration, and optimization of the network path (NAT, proxy, bandwidth control).

Principle of least privilege: separate producer and consumer permissions.
Distribute and rotate certificates and secrets securely (automation matters especially for self-managed).
When using private connectivity, plan DNS resolution, CIDR ranges, and bandwidth in advance.
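
For the CIDR planning in the last point, overlapping ranges are the usual blocker for peering and private connectivity. A minimal sketch of an overlap check using Python's standard `ipaddress` module is shown below; the network names and ranges are hypothetical.

```python
import ipaddress


def overlapping_cidrs(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named networks whose address ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    names = sorted(networks)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if networks[a].overlaps(networks[b])]


# Hypothetical address plan; replace with your own ranges.
plan = {
    "app-vpc": "10.0.0.0/16",
    "kafka-peering": "10.0.128.0/20",
    "analytics-vpc": "10.1.0.0/16",
}
print(overlapping_cidrs(plan))
```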

Client configuration example (SASL_SSL)

# Confluent Cloud example (PLAIN over TLS)
bootstrap.servers=xxxx.gcp.confluent.cloud:9092
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="<API_KEY>" password="<API_SECRET>";
# Configure a CA certificate bundle if required

# Self-managed (SCRAM-SHA-512 example)
bootstrap.servers=broker1:9093,broker2:9093
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="app-producer" password="********";
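
A frequent client-side mistake is a mismatch between `sasl.mechanism` and the login module named in `sasl.jaas.config`. The sketch below is a simplified sanity check over a properties dictionary; `check_client_config` is a made-up helper, and it assumes the JAAS configuration is given inline rather than through a separate JAAS file.

```python
# Login module expected for each SASL mechanism used in the examples above.
LOGIN_MODULES = {
    "PLAIN": "PlainLoginModule",
    "SCRAM-SHA-256": "ScramLoginModule",
    "SCRAM-SHA-512": "ScramLoginModule",
}


def check_client_config(props: dict[str, str]) -> list[str]:
    """Return human-readable problems found in a client config (empty list = none found)."""
    problems = []
    if not props.get("bootstrap.servers"):
        problems.append("bootstrap.servers is missing")
    protocol = props.get("security.protocol", "PLAINTEXT")  # Kafka clients default to PLAINTEXT
    if protocol not in ("SASL_SSL", "SASL_PLAINTEXT"):
        return problems
    mechanism = props.get("sasl.mechanism")
    jaas = props.get("sasl.jaas.config", "")
    if not mechanism:
        problems.append("sasl.mechanism is not set (Java clients then default to GSSAPI)")
    if not jaas:
        problems.append("sasl.jaas.config is missing")
    expected = LOGIN_MODULES.get(mechanism or "")
    if expected and jaas and expected not in jaas:
        problems.append(f"{mechanism} expects {expected} in sasl.jaas.config")
    return problems


# Deliberately broken example: SCRAM mechanism paired with the PLAIN login module.
broken = {
    "bootstrap.servers": "broker1:9093,broker2:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "SCRAM-SHA-512",
    "sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required "
                        'username="app-producer" password="********";',
}
print(check_client_config(broken))
```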

Feature Differences and Ecosystem Integration

Confluent Cloud offers managed connectors, Schema Registry, ksqlDB, governance features, and Cluster Linking. You get a fully managed event-driven experience while keeping infrastructure ops to a minimum.

Self-managed can assemble equivalent capabilities from OSS or commercial components, but compatibility validation and operational complexity go up. Replication using MirrorMaker 2 and similar tools also needs to be designed in-house.

For the exam, key topics are schema compatibility, log compaction policy, stream processing fundamentals, and connector reliability (exactly-once and retry strategies).

Cloud: managed connectors, Schema Registry, ksqlDB, and governance are integrated.
Self-managed: high freedom in component selection and operation, but standardization is the challenge.
Either way: manage schema evolution and compatibility rules systematically.
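
To make the compatibility rule in the last point concrete, here is a deliberately simplified sketch of a BACKWARD check (the new schema must be able to read data written with the old one) for flat, Avro-style record schemas. It ignores type promotions, unions, aliases, and nested records, so it illustrates the idea only and is not a substitute for the checks Schema Registry performs.

```python
def backward_compatible(old_schema: dict, new_schema: dict) -> list[str]:
    """Return reasons the new (reader) schema cannot read data written with the old one."""
    old_fields = {field["name"]: field for field in old_schema["fields"]}
    problems = []
    for field in new_schema["fields"]:
        previous = old_fields.get(field["name"])
        if previous is None:
            # Old records lack this field, so the reader needs a default to fill in.
            if "default" not in field:
                problems.append(f"new field '{field['name']}' has no default")
        elif previous["type"] != field["type"]:
            # Simplification: any type change is treated as incompatible.
            problems.append(f"field '{field['name']}' changed type "
                            f"from {previous['type']} to {field['type']}")
    # Fields removed from the new schema are fine: the reader simply ignores them.
    return problems


old = {"fields": [{"name": "user_id", "type": "string"},
                  {"name": "url", "type": "string"}]}
with_default = {"fields": old["fields"] + [{"name": "referrer", "type": "string", "default": ""}]}
without_default = {"fields": old["fields"] + [{"name": "referrer", "type": "string"}]}

print(backward_compatible(old, with_default))
print(backward_compatible(old, without_default))
```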

A simple ksqlDB sample (cleansing)

CREATE STREAM pageviews_raw (user_id VARCHAR, url VARCHAR, ts BIGINT)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

CREATE STREAM pageviews_clean AS
  SELECT user_id, url, ts
  FROM pageviews_raw
  WHERE url IS NOT NULL EMIT CHANGES;

Exam Prep Essentials and Practical Tips

CCAAK frequently covers cluster availability design (RF, ISR, acks, min.insync.replicas), throughput planning (partitions, batching, compression), retention strategy (retention.ms/bytes, log compaction), and security (SASL/TLS, ACL/RBAC).

When contrasting Cloud and self-managed, it helps to articulate who manages what, which metrics are observable where, and how scaling and incident response procedures differ.

Retention strategy: choose between retention.ms / retention.bytes / log.cleanup.policy (delete | compact) per use case.
Availability: the relationship between acks=all and min.insync.replicas, and behavior when a replica falls out of ISR.
Throughput: understand the effect of compression (producer/topic/broker), linger.ms, and batch.size.
Security: granularity of ACL principal/resource/operation/pattern, and the scope of RBAC roles.
Operations: basic procedures for rebalancing and rolling upgrades.
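
The availability point above is easier to remember as a small truth table. The sketch below models only the acceptance rule: with `acks=all`, the leader rejects a write when the ISR is smaller than `min.insync.replicas`, while with `acks=0` or `acks=1` that setting is not enforced. The function names are made up for illustration.

```python
def write_accepted(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Whether a produce request is accepted, given the current ISR size."""
    if isr_size < 1:
        return False  # no in-sync leader to write to
    if acks in ("all", "-1"):
        # The leader refuses the write if too few replicas are in sync.
        return isr_size >= min_insync_replicas
    return True  # acks=0 / acks=1 only need the leader


def tolerated_broker_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Replica losses a partition can absorb while still accepting acks=all writes."""
    return max(replication_factor - min_insync_replicas, 0)


# RF=3 with min.insync.replicas=2: one replica may drop out of the ISR.
for isr in (3, 2, 1):
    print(f"ISR={isr} acks=all -> {write_accepted('all', isr, 2)}, "
          f"acks=1 -> {write_accepted('1', isr, 2)}")
print("tolerated failures:", tolerated_broker_failures(3, 2))
```

The `acks=1` column shows why lowering durability settings "works" during an outage: writes keep flowing to a lone leader, and are lost if that leader then fails.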

ACL grant examples (Cloud and self-managed)

# Confluent Cloud (CLI; flag names vary by CLI version)
confluent kafka acl create --allow \
  --operation WRITE --topic orders --principal User:svc-producer
confluent kafka acl create --allow \
  --operation READ --topic orders --principal User:svc-consumer --consumer-group cg-orders

# Self-managed (kafka-acls.sh; admin.properties is a hypothetical file holding the admin client's SASL_SSL settings)
kafka-acls.sh --bootstrap-server broker1:9093 --command-config admin.properties \
  --add --allow-principal User:svc-producer --operation Write --topic orders
kafka-acls.sh --bootstrap-server broker1:9093 --command-config admin.properties \
  --add --allow-principal User:svc-consumer --operation Read --topic orders --group cg-orders

Check Your Understanding

CCAAK

Question 1

On a topic that handles large volumes of ephemeral events, data older than 7 days is unnecessary. You want to reduce storage and network costs without affecting consumer behavior. Which configuration is most appropriate?

A. Set retention.ms=604800000 on the topic, compress on the producer, and keep the topic at compression.type=producer so the broker retains the producer's codec
B. Increase the partition count to the maximum to improve throughput
C. Set the producer's acks to 0 to remove synchronization
D. Lower min.insync.replicas to 1 to relax replica requirements

Answer: A

Limiting retention to 7 days and using compression to improve network and storage efficiency matches the goal. B carries the risk of over-partitioning and does not fit the goal. C and D both reduce fault tolerance and create data loss risk, so they are inappropriate.

Frequently Asked Questions

Which is ultimately cheaper?

It depends on workload and organizational capability. Confluent Cloud has the edge when usage is unpredictable, you are starting small, or you lack a 24x7 operations team. If you have steady long-term traffic plus high operational maturity and automation assets, self-managed unit costs can be driven lower. Optimizing retention, replication factor, and egress reduces TCO on either path.

Is ZooKeeper required for self-managed Kafka?

Not for current releases. KRaft mode, which removes the ZooKeeper dependency, has been production-ready since Kafka 3.3, and Kafka 4.0 dropped ZooKeeper support entirely. Clusters still running ZooKeeper on 3.x need a migration plan, because design and operational procedures change; align the timing with your standards and toolchain. For the exam, be ready to articulate the conceptual differences from ZooKeeper-based setups (metadata management, rolling upgrade procedures, and so on).

How should I think about private connectivity?

On Cloud, you have options such as VPC peering and private link; plan for CIDR overlap, DNS, bandwidth, and cost up front. For self-managed, you design VPC routing, firewalls, and certificate distribution yourself. In both cases, the core principle is to localize the data egress point and avoid unnecessary region or network traversal.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
