Stretched Kafka Cluster: Multi-AZ Synchronous Replication (2026)

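A stretched cluster distributes brokers and quorum nodes across multiple AZs within a single region to tolerate zone-level failures. Because synchronous replication and quorum votes cross AZ boundaries, inter-AZ round-trip time (RTT), in-sync replica (ISR) behavior, and producer acks settings directly determine latency and durability, so you need to understand all three.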

This article covers the essential multi-AZ settings, common failure modes, and comparisons with alternative architectures. It also highlights keywords likely to appear on the CCAAK exam.

Stretched Cluster: Premises and Overview

A stretched cluster places brokers and quorum nodes (ZooKeeper or KRaft controllers) across multiple AZs within a single region, aiming to keep both writes and reads available even if a single AZ is lost. The key levers are replication factor, min.insync.replicas, acks, unclean leader election, and preserving a quorum majority.

In practice, you should restrict this to a single region where inter-AZ RTT is on the order of a few milliseconds (roughly 1-2 ms). Stretching across regions adds latency, reduces throughput, lengthens re-election times, and increases timeouts. Multi-region scenarios should generally use asynchronous replication (Cluster Linking or MirrorMaker 2).

Quorum nodes must be deployed as an odd number to guarantee a majority. With ZooKeeper, use 3 or more nodes; with KRaft, use 3 or more controllers. Distribute them evenly across AZs. The bar is that even losing a single AZ must leave a majority intact.

Use a replication factor (RF) of 3 as the baseline, with each replica on a different AZ
Set min.insync.replicas (minISR) to 2, acks=all, and unclean.leader.election.enable=false
Set broker.rack (or its equivalent) to the AZ name and enforce rack-aware placement
Run the quorum (ZooKeeper or KRaft controllers) as an odd number across AZs to preserve a majority
Limit to multi-AZ within a single region; use asynchronous replication for multi-region
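
The majority rule for quorum nodes can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of any Kafka tooling: the function name and the example placements are hypothetical, and it only models voter counts per AZ (it applies equally to ZooKeeper nodes or KRaft controllers).

```python
from collections import Counter


def survives_single_az_loss(voter_azs):
    """Return True if a majority of quorum voters remains after losing any one AZ.

    voter_azs: list with one AZ name per quorum voter (hypothetical input format).
    """
    total = len(voter_azs)
    majority = total // 2 + 1
    per_az = Counter(voter_azs)
    # The worst case is losing the AZ that hosts the most voters.
    worst_loss = max(per_az.values())
    return total - worst_loss >= majority


# Hypothetical placements for illustration.
print(survives_single_az_loss(["az-a", "az-b", "az-c"]))          # True: 2 of 3 remain
print(survives_single_az_loss(["az-a", "az-a", "az-b"]))          # False: skewed placement
print(survives_single_az_loss(["az-a", "az-a", "az-b", "az-c"]))  # False: 4 voters need 3
```

The last case shows why an even voter count does not help: adding a fourth voter raises the majority to 3 without adding tolerance for another failure.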

Architecture	Availability (zone failure)	RPO/RTO	Latency / Throughput
Single-AZ Kafka	Low (stops when AZ is lost)	RPO: unknown (halted), RTO: long	Best (low latency, high throughput)
Stretched cluster (intra-region, across AZs)	Medium to high (survives 1 AZ loss if majority holds)	RPO: 0 (assuming acks=all and minISR are satisfied), RTO: seconds to minutes	Medium (cross-AZ adds latency and reduces throughput)
Multi-cluster with async (Cluster Linking / MM2)	High (the other side keeps running when one is down)	RPO: >0 (asynchronous), RTO: short	Each cluster enjoys local low latency

Kafka stretched cluster spanning multiple AZs (conceptual diagram)

Minimal configuration example: rack-aware settings and topic creation

# Broker settings (per AZ)
# Brokers in AZ-A
broker.rack=az-a
num.network.threads=3
num.io.threads=8

# Brokers in AZ-B
broker.rack=az-b

# Brokers in AZ-C
broker.rack=az-c

# Safe-side settings common to the whole cluster
unclean.leader.election.enable=false
min.insync.replicas=2

# With KRaft: 3 controllers (C1, C2, C3), one per AZ
# With ZooKeeper: likewise spread 3 ZK nodes across the AZs

# Topic creation: RF=3, minISR=2
# (localhost:9092 is a placeholder; replace it with your own bootstrap address)
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2

# Producer (safe-side settings)
acks=all
retries=2147483647
enable.idempotence=true
delivery.timeout.ms=120000
linger.ms=20
batch.size=131072

# Consumer (optional: when using Closest Replica reads)
# On the broker: replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
# On the client: client.rack=az-a (set to the client's own AZ)

Replication and ACK/ISR Design Essentials

For a stretched cluster, RF=3, min.insync.replicas=2, and acks=all are the baseline. With these, losing one AZ still leaves 2 replicas in the ISR, so writes can continue with consistency after leader failover.

Always set unclean.leader.election.enable to false. Electing a stale, out-of-ISR replica as leader can cause data loss. Combine this with enable.idempotence=true on the producer so that retries do not introduce duplicates or out-of-order writes.

RF=3, min.insync.replicas=2, acks=all (a configuration that targets RPO=0)
unclean.leader.election.enable=false (the safe choice)
Producer: enable idempotence and tune delivery.timeout.ms with cross-AZ latency in mind
Tune replica.lag.time.max.ms based on real-world latency (overly strict values cause excess ISR drops)
During recovery, throttle replica re-sync with replication quotas (leader.replication.throttled.rate / follower.replication.throttled.rate) to protect production traffic
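
To make the interaction between RF, min.insync.replicas, and replica placement concrete, here is a small sketch. It is a simplified model under stated assumptions: the function name is hypothetical, the producer uses acks=all, and every replica was in the ISR before the failure.

```python
def writes_available_after_az_loss(replica_azs, min_isr, lost_az):
    """Return True if a partition can still accept acks=all writes after an AZ loss.

    replica_azs: AZ name of each replica of one partition (len == replication factor).
    Assumes all replicas were in sync before the failure (simplification).
    """
    surviving = [az for az in replica_azs if az != lost_az]
    # With acks=all, the broker rejects writes once ISR size drops below min.insync.replicas.
    return len(surviving) >= min_isr


# RF=3, one replica per AZ, minISR=2: two replicas remain, writes continue.
print(writes_available_after_az_loss(["az-a", "az-b", "az-c"], 2, "az-a"))  # True
# RF=2, minISR=2: one replica remains, writes are blocked.
print(writes_available_after_az_loss(["az-a", "az-b"], 2, "az-a"))          # False
# RF=3 but two replicas share az-a: one replica remains, writes are blocked.
print(writes_available_after_az_loss(["az-a", "az-a", "az-b"], 2, "az-a"))  # False
```

The third case is why rack-aware placement matters as much as the RF and minISR numbers themselves.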

Rack-Aware Placement and Partition Leader Balance

Set broker.rack to each AZ name so that topic creation spreads replicas across AZs. If this breaks down, an AZ failure can take out several replicas at once, drop you below minISR, and stop writes.

Leader skew increases cross-AZ traffic and latency. Run Preferred Leader Election periodically to rebalance, and avoid hot partitions caused by skewed producer keys.

Explicitly set broker.rack=az-a/az-b/az-c and enable rack awareness
Verify the replica assignment at topic creation time (assign manually if necessary)
Monitor leader skew (leader count per AZ and outbound network throughput)
Rebalance periodically with Preferred Leader Election
(Optional) For Closest Replica reads, set both replica.selector.class on the broker and client.rack on the client
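
Verifying replica placement and leader balance can be scripted once you have the assignment in hand. The sketch below is illustrative only: the broker IDs, the rack map, and the assignment are made-up sample data (in practice you would derive them from the output of a topic describe command), and the first replica in each list is treated as the preferred leader.

```python
from collections import Counter

# Hypothetical broker-to-AZ map and partition assignment (partition -> replica broker IDs).
broker_rack = {1: "az-a", 2: "az-b", 3: "az-c", 4: "az-a", 5: "az-b", 6: "az-c"}
assignment = {
    0: [1, 2, 3],
    1: [2, 3, 4],
    2: [3, 1, 5],
    3: [4, 1, 6],  # brokers 4 and 1 are both in az-a
}


def partitions_with_shared_rack(assignment, broker_rack):
    """Return partitions that place two or more replicas in the same AZ."""
    bad = []
    for partition, replicas in sorted(assignment.items()):
        racks = [broker_rack[b] for b in replicas]
        if len(set(racks)) < len(racks):
            bad.append(partition)
    return bad


def preferred_leaders_per_rack(assignment, broker_rack):
    """Count preferred leaders (first replica in each list) per AZ."""
    return Counter(broker_rack[replicas[0]] for replicas in assignment.values())


print(partitions_with_shared_rack(assignment, broker_rack))       # [3]
print(dict(preferred_leaders_per_rack(assignment, broker_rack)))  # {'az-a': 2, 'az-b': 1, 'az-c': 1}
```

Partition 3 would lose two of its three replicas if az-a went down, which is exactly the case manual reassignment should correct.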

Failure Scenarios and Recovery Operations

When an AZ is lost, the cluster keeps running as long as the quorum majority survives. If the lost AZ hosted many leaders, failover causes a temporary drop in throughput and a spike in latency.

After the AZ recovers, large amounts of re-sync traffic kick off. Apply a replication throttle as part of your runbook to soften the impact during business hours. Once fully recovered, rebalance with Preferred Leader Election as needed.

During zone failure: keep unclean leader election disabled and rely on automatic election from within the ISR to preserve consistency
Quorum (ZK/KRaft) majority must hold. Even numbers or skewed placement are unacceptable
During recovery: throttle replication bandwidth and monitor fetch/replica threads
After re-sync completes, run Preferred Leader Election to even out leadership
Document RTO/RPO requirements, thresholds, and rollback conditions in your operational runbook
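
When choosing a throttle for the runbook, a rough estimate of how long re-sync will take at that rate is useful. This is a back-of-the-envelope sketch with hypothetical numbers: it assumes the throttle is the only bottleneck and ignores new produce traffic that the recovering replicas must also catch up on.

```python
def resync_seconds(bytes_to_copy, throttle_bytes_per_sec):
    """Estimate re-sync duration, assuming the replication throttle is the bottleneck."""
    if throttle_bytes_per_sec <= 0:
        raise ValueError("throttle must be positive")
    return bytes_to_copy / throttle_bytes_per_sec


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical example: 500 GiB to re-sync on one broker, throttled to 50 MiB/s.
seconds = resync_seconds(500 * GIB, 50 * MIB)
print(seconds)                    # 10240.0
print(round(seconds / 3600, 2))   # 2.84 (hours)
```

If the estimate exceeds the window your RTO allows, either raise the throttle outside business hours or accept a longer period of reduced redundancy, and record that decision in the runbook.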

Network, Storage, and Tuning Considerations

Cross-AZ traffic is a cost factor for both latency and cloud billing. Use compression (lz4 or zstd) and proper batching to cut round trips, and balance leader placement to avoid skewed inter-AZ traffic.

Storage is generally fine with JBOD, leaning on replication for durability. Balance segment size, page cache, and network bandwidth, and monitor re-sync duration during recovery so that it does not grow beyond what your RTO allows.

Producer: batch with linger.ms and batch.size, and compress with lz4 or zstd
Broker: tune socket send/receive buffers (socket.send.buffer.bytes / socket.receive.buffer.bytes) and num.network.threads / num.io.threads
Adjust replica.fetch.max.bytes and fetch.max.bytes based on throughput and latency
When TLS is on, review cipher suites and monitor NIC/CPU bottlenecks
Surface cross-AZ billing and bandwidth, and continuously evaluate SLO vs. cost
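
To surface the cross-AZ volume, a simple model helps before real metrics are available. The sketch below is an approximation under explicit assumptions, none of which come from Kafka itself: producers, consumers, and partition leaders are spread evenly across AZs, each replica sits in a different AZ, compression is ignored, and Closest Replica reads are only modeled for the case where every AZ holds a replica (RF equals the number of AZs).

```python
def cross_az_bytes(produced_bytes, num_azs=3, rf=3, consumer_groups=1,
                   fetch_from_closest=False):
    """Estimate bytes crossing AZ boundaries for a given produced volume.

    Assumptions (simplified model): clients and leaders are evenly spread across
    AZs, and every replica of a partition lives in a different AZ.
    """
    # A client reaches a leader in another AZ in (num_azs - 1) out of num_azs cases.
    produce = produced_bytes * (num_azs - 1) / num_azs
    # Every follower sits in a different AZ from the leader, so all copies cross AZs.
    replication = produced_bytes * (rf - 1)
    if fetch_from_closest:
        consume = 0.0  # each consumer group reads from the replica in its own AZ
    else:
        consume = produced_bytes * consumer_groups * (num_azs - 1) / num_azs
    return {"produce": produce, "replication": replication, "consume": consume}


# Hypothetical volume: 900 GB produced, one consumer group.
print(cross_az_bytes(900))
print(cross_az_bytes(900, fetch_from_closest=True))
```

In this model replication dominates the cross-AZ volume, so Closest Replica reads and compression reduce the bill but cannot remove the cost of synchronous replication itself.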

CCAAK Prep: High-Yield Points and Pitfalls

The exam frequently tests safe-side configuration values, rack awareness, quorum majority, and the boundary between multi-AZ and multi-region. Even when a question seems to have a single obvious answer, premises (RTT, RPO requirements, cost) are often implicit. Do not miss the assumptions in the question.

RF=3, minISR=2, acks=all, unclean.leader.election.enable=false
Declare the AZ via broker.rack and spread replicas across different AZs
Quorum: odd number of nodes spread across AZs to preserve a majority (the principle is the same for both ZK and KRaft)
Stretched clusters are for a single region; use Cluster Linking or MM2 for multi-region
Closest Replica requires both broker-side configuration and client.rack (the default is to read from the leader)
If you claim RPO=0, explicitly state the assumptions that acks=all and minISR are satisfied

Check Your Understanding

CCAAK

Question 1

In a 3-AZ stretched cluster within a single region, you want writes to continue and data loss to be avoided even when Zone A is completely lost. Which combination is appropriate?

A. Topic RF=3, min.insync.replicas=2, acks=all, unclean.leader.election.enable=false
B. Topic RF=2, min.insync.replicas=1, acks=1, unclean.leader.election.enable=true
C. Topic RF=3, min.insync.replicas=1, acks=all, unclean.leader.election.enable=true
D. Topic RF=2, min.insync.replicas=2, acks=all, unclean.leader.election.enable=false

Correct answer: A

To target RPO=0 during a zone loss, the standard pattern is RF=3 with each replica on a different AZ, minISR=2 with acks=all so that a write is acknowledged only after at least two replicas have it, and unclean leader election disabled. B and C are unsafe because they enable unclean leader election and set minISR to 1 (B also uses acks=1). D uses RF=2 with minISR=2, so losing the AZ that holds either replica drops the ISR below minISR and blocks writes.

Frequently Asked Questions

Can a stretched cluster span multiple regions?

Not recommended. Inter-region RTT is too high, making leader election and replication unstable. Multi-region scenarios belong to asynchronous solutions like Cluster Linking or MirrorMaker 2.

How can consumers read from a local-AZ replica using Closest Replica?

Set replica.selector.class to RackAwareReplicaSelector on the broker, and set client.rack to the consumer's AZ on the client side. By default, consumers read from the leader only.

How many quorum nodes (ZooKeeper/KRaft) do I need?

Use an odd number (typically 3) spread across AZs. Place them so that losing a single AZ still leaves a majority.


Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
