Kafka Stretched Cluster: Multi-AZ Design and Pitfalls

2026-04-19
NicheeLab Editorial Team

A stretched cluster distributes brokers and quorum nodes across multiple AZs within a single region to tolerate zone-level failures. Because of how Kafka's replication and quorum behave, you must understand RTT, ISR, and ACK settings.

This article covers the essential multi-AZ settings, common failure modes, and comparisons with alternative architectures. It also highlights keywords likely to appear on the CCAAK exam.

Stretched Cluster: Premises and Overview

A stretched cluster places brokers and quorum nodes (ZooKeeper or KRaft controllers) across multiple AZs within a single region, aiming to keep both writes and reads available even if a single AZ is lost. The key levers are replication factor, min.insync.replicas, acks, unclean leader election, and preserving a quorum majority.

In practice, restrict this to a single region where inter-AZ RTT is a few milliseconds at most (roughly 1-2 ms). Stretching across regions adds latency, reduces throughput, lengthens leader re-election, and makes request timeouts more likely. Multi-region scenarios should generally use asynchronous replication (Cluster Linking or MirrorMaker 2).

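Deploy quorum nodes in an odd number so that a majority is well defined: with ZooKeeper, use 3 or more nodes; with KRaft, use 3 or more controllers. Distribute them evenly across AZs. The bar is that losing any single AZ must still leave a majority. For example, 3 nodes across 3 AZs leave 2 of 3 after one AZ fails, whereas placing 2 of the 3 in the same AZ loses the quorum when that AZ fails.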
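
The majority rule can be checked with a few lines of arithmetic. The sketch below is a simplified model, not Kafka code: the function names and AZ labels are illustrative assumptions, and it only counts voters per AZ.

```python
from collections import Counter


def majority(total_voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return total_voters // 2 + 1


def survives_single_az_loss(voter_azs: list[str]) -> bool:
    """True if losing any one AZ still leaves a majority of quorum voters."""
    if not voter_azs:
        return False
    needed = majority(len(voter_azs))
    voters_per_az = Counter(voter_azs)
    return all(len(voter_azs) - lost >= needed for lost in voters_per_az.values())


# Hypothetical placements; the AZ names are illustrative.
print(survives_single_az_loss(["az-a", "az-b", "az-c"]))          # one voter per AZ
print(survives_single_az_loss(["az-a", "az-a", "az-b"]))          # two voters share an AZ
print(survives_single_az_loss(["az-a", "az-a", "az-b", "az-b"]))  # even count over two AZs
```

Only the first placement passes. The other two show why skewed placement and an even node count spread over two AZs both fail the single-AZ-loss bar.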

  • Use a replication factor (RF) of 3 as the baseline, with each replica on a different AZ
  • Set min.insync.replicas (minISR) to 2, acks=all, and unclean.leader.election.enable=false
  • Set broker.rack (or its equivalent) to the AZ name and enforce rack-aware placement
  • Run the quorum (ZooKeeper or KRaft controllers) as an odd number across AZs to preserve a majority
  • Limit to multi-AZ within a single region; use asynchronous replication for multi-region

Architecture | Availability (zone failure) | RPO/RTO | Latency / Throughput
Single-AZ Kafka | Low (stops when AZ is lost) | RPO: unknown (halted), RTO: long | Best (low latency, high throughput)
Stretched cluster (intra-region, across AZs) | Medium to high (survives 1 AZ loss if majority holds) | RPO: 0 (assuming acks=all and minISR are satisfied), RTO: seconds to minutes | Medium (cross-AZ adds latency and reduces throughput)
Multi-cluster with async (Cluster Linking / MM2) | High (the other side keeps running when one is down) | RPO: >0 (asynchronous), RTO: short | Each cluster enjoys local low latency

Kafka stretched cluster spanning multiple AZs (conceptual diagram)

[Diagram: three AZs with replication links between them]
AZ-A: Brokers B1, B2 / Controller C1
AZ-B: Brokers B3, B4 / Controller C2
AZ-C: Brokers B5, B6 / Controller C3
Partition P0 (in diagram order): Leader, Follower, Follower
Partition P1 (in diagram order): Follower, Leader, Follower
Each partition's replicas are placed in different AZs.

Minimal configuration example: rack-aware settings and topic creation

# Broker settings (per AZ)
# Brokers in AZ-A
broker.rack=az-a
num.network.threads=3
num.io.threads=8

# Brokers in AZ-B
broker.rack=az-b

# Brokers in AZ-C
broker.rack=az-c

# Safe-side settings shared by the whole cluster
unclean.leader.election.enable=false
min.insync.replicas=2

# With KRaft: 3 controllers (C1, C2, C3), one per AZ
# With ZooKeeper: likewise spread 3 ZooKeeper nodes across the AZs

# Topic creation: RF=3, minISR=2
# (localhost:9092 is a placeholder bootstrap address)
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2

# Producer (safe side)
acks=all
retries=2147483647
enable.idempotence=true
delivery.timeout.ms=120000
linger.ms=20
batch.size=131072

# Consumer (optional: when using Closest Replica reads)
# On the broker: replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
# On the client: client.rack=az-a (set to the AZ the client runs in)

Replication and ACK/ISR Design Essentials

For a stretched cluster, RF=3, min.insync.replicas=2, and acks=all are the baseline. With these, losing one AZ still leaves 2 replicas in the ISR, so writes can continue with consistency after leader failover.

Always set unclean.leader.election.enable to false. Electing a stale, out-of-ISR replica as leader can cause data loss. Combine this with enable.idempotence=true on the producer so that retries do not introduce duplicates or reorder writes within a partition.

  • RF=3, min.insync.replicas=2, acks=all (a configuration that targets RPO=0)
  • unclean.leader.election.enable=false (the safe choice)
  • Producer: enable idempotence and tune delivery.timeout.ms with cross-AZ latency in mind
  • Tune replica.lag.time.max.ms based on real-world latency (overly strict values cause excess ISR drops)
  • During recovery, throttle replica re-sync with the replication throttle settings (leader.replication.throttled.rate and follower.replication.throttled.rate) to protect production traffic
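
To see how RF, min.insync.replicas, and acks interact when replicas are lost, the following sketch models the write path in a deliberately simplified way. It assumes every surviving replica is in the ISR; the class and function names are hypothetical and are not part of any Kafka client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicSettings:
    replication_factor: int
    min_insync_replicas: int
    acks: str  # "all", "1", or "0"


def writes_accepted(settings: TopicSettings, replicas_lost: int) -> bool:
    """Simplified model: can a producer still write after losing replicas?"""
    surviving = settings.replication_factor - replicas_lost
    if surviving <= 0:
        return False  # no replica is left to act as leader
    if settings.acks == "all":
        # With acks=all the leader rejects writes when the ISR is smaller
        # than min.insync.replicas.
        return surviving >= settings.min_insync_replicas
    # acks=0 or acks=1 only need a live leader; min.insync.replicas is not enforced.
    return True


baseline = TopicSettings(replication_factor=3, min_insync_replicas=2, acks="all")
print(writes_accepted(baseline, replicas_lost=1))  # one AZ lost, two replicas remain
print(writes_accepted(baseline, replicas_lost=2))  # two AZs lost, ISR below minISR
```

With rack-aware placement, losing one AZ costs each partition exactly one replica, which is why the baseline keeps accepting writes in the first case and blocks them in the second.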

Rack-Aware Placement and Partition Leader Balance

Set broker.rack to each AZ name so that topic creation spreads replicas across AZs. If this breaks down, an AZ failure can take out several replicas at once, drop you below minISR, and stop writes.

Leader skew increases cross-AZ traffic and latency. Run Preferred Leader Election periodically to rebalance, and avoid producer hot-partitioning.

  • Explicitly set broker.rack=az-a/az-b/az-c and enable rack awareness
  • Verify the replica assignment at topic creation time (assign manually if necessary)
  • Monitor leader skew (leader count per AZ and outbound network throughput)
  • Rebalance periodically with Preferred Leader Election
  • (Optional) For Closest Replica reads, set both replica.selector.class on the broker and client.rack on the client
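
Verifying an assignment and watching leader skew can both be done from the replica lists alone. The sketch below assumes a hypothetical broker-to-AZ map that mirrors broker.rack and a hand-written assignment; in a real cluster these would come from the topic description output.

```python
from collections import Counter

# Hypothetical broker-to-AZ map mirroring each broker's broker.rack value.
BROKER_RACK = {1: "az-a", 2: "az-a", 3: "az-b", 4: "az-b", 5: "az-c", 6: "az-c"}


def partitions_sharing_an_az(assignment: dict[int, list[int]]) -> list[int]:
    """Return partitions whose replicas do not all sit in distinct AZs."""
    offenders = []
    for partition, replicas in sorted(assignment.items()):
        racks = [BROKER_RACK[broker] for broker in replicas]
        if len(set(racks)) < len(racks):
            offenders.append(partition)
    return offenders


def preferred_leaders_per_az(assignment: dict[int, list[int]]) -> Counter:
    """Count preferred leaders (the first replica in each list) per AZ."""
    return Counter(BROKER_RACK[replicas[0]] for replicas in assignment.values())


# Illustrative assignment: partition -> ordered replica list.
assignment = {0: [1, 3, 5], 1: [3, 5, 2], 2: [1, 2, 4]}
print(partitions_sharing_an_az(assignment))   # partition 2 has brokers 1 and 2, both in az-a
print(preferred_leaders_per_az(assignment))
```

Any partition flagged by the first function would lose two replicas in a single AZ failure, and a lopsided leader count in the second is the signal to run Preferred Leader Election or reassign.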

Failure Scenarios and Recovery Operations

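When an AZ is lost, the cluster keeps running as long as the quorum majority survives, and writes continue for every partition that still has at least min.insync.replicas replicas in the ISR. If the lost AZ hosted many leaders, failover causes a temporary drop in throughput and a spike in latency.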

After the AZ recovers, large amounts of re-sync traffic kick off. Apply a replication throttle as part of your runbook to soften the impact during business hours. Once fully recovered, rebalance with Preferred Leader Election as needed.

  • During zone failure: keep unclean leader election disabled and rely on automatic election from within the ISR to preserve consistency
  • Quorum (ZK/KRaft) majority must hold. An even node count or placement skewed toward one AZ is unacceptable
  • During recovery: throttle replication bandwidth and monitor fetch/replica threads
  • After re-sync completes, run Preferred Leader Election to even out leadership
  • Document RTO/RPO requirements, thresholds, and rollback conditions in your operational runbook
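
When writing the runbook, it helps to estimate how long a throttled re-sync will take. The sketch below is back-of-the-envelope arithmetic only. It assumes the throttle is the sole bottleneck and that no new data arrives during catch-up, and the figures are hypothetical.

```python
def resync_seconds(bytes_to_copy: int, throttle_bytes_per_sec: int) -> float:
    """Lower bound on catch-up time when the throttle is the only bottleneck."""
    if throttle_bytes_per_sec <= 0:
        raise ValueError("throttle must be positive")
    return bytes_to_copy / throttle_bytes_per_sec


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical numbers: 500 GiB to re-replicate at a 50 MiB/s throttle.
seconds = resync_seconds(500 * GIB, 50 * MIB)
print(f"{seconds / 3600:.1f} hours")  # prints "2.8 hours"
```

Because producers keep writing during recovery, the real window is longer than this bound, so the throttle needs to sit comfortably above the ongoing inbound rate or the replicas will never catch up.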

Network, Storage, and Tuning Considerations

Cross-AZ traffic is a cost factor for both latency and cloud billing. Use compression (lz4 or zstd) and proper batching to cut round trips, and balance leader placement to avoid skewed inter-AZ traffic.

Storage is generally fine with JBOD, leaning on replication for durability. Balance segment size, page cache, and network bandwidth, and monitor so that re-sync windows during recovery do not blow out.

  • Producer: batch with linger.ms and batch.size, and compress with lz4 or zstd
  • Broker: tune socket send/receive buffers and num.network.threads / num.io.threads
  • Adjust replica.fetch.max.bytes and fetch.max.bytes based on throughput and latency
  • When TLS is on, review cipher suites and monitor NIC/CPU bottlenecks
  • Surface cross-AZ billing and bandwidth, and continuously evaluate SLO vs. cost
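
To surface cross-AZ bandwidth, a rough traffic model is often enough to start with. The sketch below rests on several assumptions that are not from any Kafka API: producers, consumers, and leaders are spread evenly across AZs, each replica lives in a different AZ, rates are measured after compression, and Closest Replica reads are fully local because every AZ holds a replica.

```python
def cross_az_bytes_per_sec(
    produce_rate: float,
    az_count: int = 3,
    replication_factor: int = 3,
    consumer_fanout: int = 1,
    closest_replica: bool = False,
) -> dict[str, float]:
    """Rough estimate of cross-AZ traffic for a rack-aware stretched cluster."""
    # Producers reach a leader in another AZ for (az_count - 1) / az_count of writes.
    producer = produce_rate * (az_count - 1) / az_count
    # Every follower sits in a different AZ from the leader, so each copy crosses AZs.
    replication = produce_rate * (replication_factor - 1)
    # Leader-only reads cross AZs at the same ratio as writes, once per consumer group.
    if closest_replica:
        consumer = 0.0
    else:
        consumer = produce_rate * consumer_fanout * (az_count - 1) / az_count
    return {
        "producer": producer,
        "replication": replication,
        "consumer": consumer,
        "total": producer + replication + consumer,
    }


# Hypothetical workload: 90 MB/s produced, read by two consumer groups.
print(cross_az_bytes_per_sec(90_000_000, consumer_fanout=2))
print(cross_az_bytes_per_sec(90_000_000, consumer_fanout=2, closest_replica=True))
```

In this model replication dominates the bill and cannot be avoided at RF=3, while consumer traffic is the part that Closest Replica reads and compression can reduce.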

CCAAK Prep: High-Yield Points and Pitfalls

The exam frequently tests safe-side configuration values, rack awareness, quorum majority, and the boundary between multi-AZ and multi-region. Even when a question seems to have a single obvious answer, premises (RTT, RPO requirements, cost) are often implicit. Do not miss the assumptions in the question.

  • RF=3, minISR=2, acks=all, unclean.leader.election.enable=false
  • Declare the AZ via broker.rack and spread replicas across different AZs
  • Quorum: odd number of nodes spread across AZs to preserve a majority (the principle is the same for both ZK and KRaft)
  • Stretched clusters are for a single region; use Cluster Linking or MM2 for multi-region
  • Closest Replica requires both broker-side configuration and client.rack (the default is to read from the leader)
  • If you claim RPO=0, explicitly state the assumptions that acks=all and minISR are satisfied

Check Your Understanding

CCAAK

Question 1

In a 3-AZ stretched cluster within a single region, you want writes to continue and data loss to be avoided even when Zone A is completely lost. Which combination is appropriate?

  A. Topic RF=3, min.insync.replicas=2, acks=all, unclean.leader.election.enable=false
  B. Topic RF=2, min.insync.replicas=1, acks=1, unclean.leader.election.enable=true
  C. Topic RF=3, min.insync.replicas=1, acks=all, unclean.leader.election.enable=true
  D. Topic RF=2, min.insync.replicas=2, acks=all, unclean.leader.election.enable=false

Correct answer: A

To target RPO=0 during a zone loss, the standard pattern is RF=3 with each replica on a different AZ, minISR=2 and acks=all to require replication to a majority before commit, and unclean leader election disabled. B and C are unsafe because they enable unclean leader election and use an insufficient minISR of 1. D uses RF=2, so losing one AZ leaves minISR unsatisfied and blocks writes.

Frequently Asked Questions

Can a stretched cluster span multiple regions?

Not recommended. Inter-region RTT is too high, making leader election and replication unstable. Multi-region scenarios belong to asynchronous solutions like Cluster Linking or MirrorMaker 2.

How can consumers read from a local-AZ replica using Closest Replica?

Set replica.selector.class to RackAwareReplicaSelector on the broker, and set client.rack to the consumer's AZ on the client side. By default, consumers read from the leader only.
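
The selection logic can be pictured with a small model. This is a simplified sketch of rack-aware selection, not the RackAwareReplicaSelector source: the Replica type and function name are hypothetical, and it assumes the in-sync list always contains the leader.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Replica:
    broker_id: int
    rack: str
    log_end_offset: int
    is_leader: bool = False


def select_read_replica(in_sync: list[Replica], client_rack: Optional[str]) -> Replica:
    """Pick a replica in the client's rack if one exists, else fall back to the leader."""
    leader = next(r for r in in_sync if r.is_leader)
    if not client_rack:
        return leader  # no client.rack set: default leader-only reads
    same_rack = [r for r in in_sync if r.rack == client_rack]
    if not same_rack:
        return leader  # no in-sync replica in the client's AZ
    if leader in same_rack:
        return leader
    # Prefer the most caught-up replica among those in the client's rack.
    return max(same_rack, key=lambda r: r.log_end_offset)


replicas = [
    Replica(1, "az-a", 1000, is_leader=True),
    Replica(3, "az-b", 998),
    Replica(5, "az-c", 1000),
]
print(select_read_replica(replicas, "az-b").broker_id)  # follower in the client's AZ
print(select_read_replica(replicas, None).broker_id)    # leader, since client.rack is unset
```

The fallback branches explain why both settings are needed: without client.rack the broker has nothing to match against, and the consumer keeps reading from the leader.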

How many quorum nodes (ZooKeeper/KRaft) do I need?

Use an odd number (typically 3) spread across AZs. Place them so that losing a single AZ still leaves a majority.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.

