Kafka's in-cluster replication handles single-broker and single-AZ failures well, but a full region outage demands deliberate design. This article lays out a DR strategy starting from RPO/RTO targets and the practical patterns for cross-region replication with MirrorMaker 2 or Cluster Linking.
Confluent CCAAK heavily tests durability parameters, replication mechanisms, client behavior during failover, and offset synchronization. This guide is written for both real operations and exam day — designs and procedures you can defend either way.
Every DR design starts by fixing the failure scope and SLOs (RPO/RTO). A region outage covers network partitions and power loss; cross-cluster replication is asynchronous, so RPO will never be zero. The smaller you push RPO, the sharper the trade-off becomes between link bandwidth, latency, and cost.
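The bandwidth side of that trade-off can be made concrete with simple arithmetic. The sketch below is illustrative only: the function names and the example figures are assumptions, not measurements from any real cluster. It estimates how long a replication backlog takes to drain given the link capacity and the ongoing produce rate.

```python
def drain_seconds(backlog_bytes: float,
                  link_bytes_per_s: float,
                  produce_bytes_per_s: float) -> float:
    """Seconds until the cross-region backlog reaches zero.

    Returns infinity when the link cannot outpace the producers,
    which means lag (and therefore RPO) grows without bound.
    """
    headroom = link_bytes_per_s - produce_bytes_per_s
    if headroom <= 0:
        return float("inf")
    return backlog_bytes / headroom


def meets_rpo(lag_seconds: float, rpo_seconds: float) -> bool:
    """True when the observed replication lag is within the RPO target."""
    return lag_seconds <= rpo_seconds


if __name__ == "__main__":
    # Hypothetical figures: 6 GB backlog, 1 Gbps link (~125 MB/s), 75 MB/s produce rate.
    print(drain_seconds(6e9, 125e6, 75e6))  # 120.0
    print(meets_rpo(lag_seconds=45, rpo_seconds=60))  # True
```

With only 50 MB/s of headroom, a 6 GB backlog takes two minutes to clear, so a 60-second RPO would already be violated during catch-up.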
Treat in-cluster durability and cross-cluster durability as separate problems. The first is controlled by min.insync.replicas, acks, and unclean.leader.election.enable. The second is controlled by MirrorMaker 2 or Cluster Linking lag, checkpoint cadence, and the promotion procedure during failure.
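How min.insync.replicas interacts with acks is easy to misremember, so here is a small model of the broker-side check. It is a simplified sketch (the function name is made up for illustration), but it reflects the documented rule that min.insync.replicas is enforced only for acks=all.

```python
def broker_accepts_write(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Simplified model of whether a produce request is accepted.

    acks=all (or -1): rejected when the ISR is smaller than min.insync.replicas.
    acks=0 or acks=1: min.insync.replicas is not checked; a live leader is enough.
    """
    if acks in ("all", "-1"):
        return isr_size >= min_insync_replicas
    return isr_size >= 1


if __name__ == "__main__":
    # RF=3, min.insync.replicas=2
    print(broker_accepts_write("all", isr_size=3, min_insync_replicas=2))  # True
    print(broker_accepts_write("all", isr_size=2, min_insync_replicas=2))  # True (one broker down)
    print(broker_accepts_write("all", isr_size=1, min_insync_replicas=2))  # False (two brokers down)
    print(broker_accepts_write("1", isr_size=1, min_insync_replicas=2))    # True, but less durable
```

The last case is the one to watch: with acks=1 the write succeeds even though only the leader holds it, so a leader failure at that moment loses data.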
Core durability settings (broker and producer)
# server.properties (in-cluster durability)
min.insync.replicas=2
unclean.leader.election.enable=false
replica.lag.time.max.ms=30000
# Match the zone layout within the region
broker.rack=az-a
# Producer settings (write durability)
acks=all
enable.idempotence=true
max.in.flight.requests.per.connection=1
retries=1000000
request.timeout.ms=30000
delivery.timeout.ms=120000
Kafka does not provide native synchronous replication across regions. The standard options are MirrorMaker 2 (built on Apache Kafka Connect) and Confluent's Cluster Linking. Both are asynchronous by design, so the achievable RPO is roughly the replication lag at the moment of failure: network latency plus processing lag.
MirrorMaker 2 is open source, supports flexible topologies, and uses checkpoints to translate consumer group offsets. Cluster Linking is a broker-level link with low mirror-topic lag and simpler operations, but it requires Confluent Platform or Confluent Cloud.
| Approach | Implementation Scope | Offset Sync | Typical Lag |
|---|---|---|---|
| MirrorMaker 2 | Connectors running on Kafka Connect | Translatable via checkpoints | Seconds to tens of seconds (load-dependent) |
| Cluster Linking | Broker-native (Confluent) | Offsets preserved on mirror topics; consumer offsets can be synced over the link | Low latency (per topic) |
| Storage/snapshot copy | External mechanism (not recommended) | Not possible | Large (minutes to hours) |
Big picture of cross-region replication
Minimal MirrorMaker 2 and Cluster Linking configuration
# MirrorMaker 2 (properties for connect-mirror-maker)
clusters = A, B
A.bootstrap.servers=a1:9092,a2:9092
B.bootstrap.servers=b1:9092,b2:9092
A->B.enabled=true
A->B.topics=orders.*,inventory.*
A->B.emit.checkpoints.enabled=true
replication.policy.class=org.apache.kafka.connect.mirror.IdentityReplicationPolicy
sync.topic.configs.enabled=true
sync.topic.acls.enabled=false
# Cluster Linking (Confluent CLI example; actual commands depend on your environment)
confluent kafka link create dr-link \
--cluster B \
--source-cluster A \
--source-bootstrap-server a1:9092 \
--link-mode READ_ONLY
# Create the mirror topic
confluent kafka mirror create --link dr-link --topic orders
In-region high availability comes from replication.factor and rack awareness. The baseline is RF=3 with min.insync.replicas=2, placing each partition's replicas in distinct AZs. Set broker.rack correctly so partition assignment naturally spreads across zones.
When data preservation is the priority, lock down unclean.leader.election.enable=false. Because you may switch over, keep topic names, schema compatibility, and cleanup policy (delete/compact) consistent across regions. Remember that Cluster Linking mirror topics are read-only until you promote them.
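To see why a correct broker.rack matters, the sketch below spreads replicas across zones by interleaving brokers rack by rack. It is a deliberately simplified illustration, not Kafka's actual assignment algorithm. The function names and the six-broker layout are assumptions.

```python
from __future__ import annotations

from itertools import zip_longest


def interleave_by_rack(broker_racks: dict[int, str]) -> list[int]:
    """Order brokers so that neighbours in the list sit in different racks."""
    by_rack: dict[str, list[int]] = {}
    for broker_id in sorted(broker_racks):
        by_rack.setdefault(broker_racks[broker_id], []).append(broker_id)
    columns = zip_longest(*(by_rack[rack] for rack in sorted(by_rack)))
    return [b for column in columns for b in column if b is not None]


def assign_replicas(broker_racks: dict[int, str],
                    partitions: int,
                    replication_factor: int) -> dict[int, list[int]]:
    """Assign each partition to consecutive brokers on the interleaved ring."""
    ring = interleave_by_rack(broker_racks)
    return {
        p: [ring[(p + i) % len(ring)] for i in range(replication_factor)]
        for p in range(partitions)
    }


if __name__ == "__main__":
    racks = {1: "az-a", 2: "az-a", 3: "az-b", 4: "az-b", 5: "az-c", 6: "az-c"}
    for partition, replicas in assign_replicas(racks, 6, 3).items():
        print(partition, replicas, [racks[b] for b in replicas])
```

With two brokers in each of three zones and RF=3, every partition ends up with one replica per zone, so losing a whole AZ still leaves two in-sync replicas and satisfies min.insync.replicas=2.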
Typical topic creation command
kafka-topics \
--bootstrap-server a1:9092 \
--create \
--topic orders \
--partitions 12 \
--replication-factor 3 \
--config min.insync.replicas=2 \
--config cleanup.policy=delete
For a planned failover, first stop writes on the primary, then confirm that cross-region replication lag and checkpoints have caught up before promoting the DR side. For unplanned events, translate consumer offsets using the most recent checkpoints and rely on a duplicate-tolerant design (idempotent producers, transactions, and downstream idempotency) to absorb the overlap.
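The idea behind checkpoint-based offset translation can be sketched in a few lines. This is a conceptual model only: the function name and the sample checkpoints are hypothetical, and MirrorMaker 2's own translation logic is more involved. It picks the newest checkpoint at or before the group's committed offset on the source cluster and resumes from the matching offset on the DR cluster.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def translate_offset(checkpoints: List[Tuple[int, int]],
                     committed_upstream: int) -> Optional[int]:
    """Map a committed source offset to a safe resume offset on the DR cluster.

    checkpoints: (upstream_offset, downstream_offset) pairs sorted by upstream_offset.
    Returns None when no checkpoint covers the committed offset yet.
    """
    upstream_offsets = [upstream for upstream, _ in checkpoints]
    idx = bisect_right(upstream_offsets, committed_upstream) - 1
    if idx < 0:
        return None
    return checkpoints[idx][1]


if __name__ == "__main__":
    # Hypothetical checkpoints for one partition of "orders".
    cps = [(100, 40), (250, 190), (400, 340)]
    print(translate_offset(cps, 300))  # 190
    print(translate_offset(cps, 50))   # None
```

Rounding down to the previous checkpoint is what produces the overlap: in the first call the group had committed 300 on the source but resumes from the position matching 250, so those 50 records are read again rather than skipped.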
Make bootstrap endpoints region-redundant on the client side, then build switchover hooks around DNS, TLS SNI, and load balancers. Producers should use enable.idempotence with a sensible delivery.timeout.ms; consumers should use static membership (group.instance.id) to suppress mass rebalances.
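What counts as a sensible delivery.timeout.ms can be checked mechanically. The sketch below is a hypothetical pre-deployment lint, not part of any Kafka client. It encodes two constraints the Java producer enforces: delivery.timeout.ms must be at least linger.ms plus request.timeout.ms, and idempotence requires acks=all, retries greater than zero, and at most five in-flight requests. The fallback defaults in the code are assumptions and vary by client version.

```python
from typing import Dict, List


def producer_config_problems(cfg: Dict[str, str]) -> List[str]:
    """Return human-readable problems found in a producer config (empty if none)."""
    problems: List[str] = []
    linger = int(cfg.get("linger.ms", 0))                       # assumed default
    request_timeout = int(cfg.get("request.timeout.ms", 30000))
    delivery_timeout = int(cfg.get("delivery.timeout.ms", 120000))
    if delivery_timeout < linger + request_timeout:
        problems.append("delivery.timeout.ms < linger.ms + request.timeout.ms")

    if str(cfg.get("enable.idempotence", "true")).lower() == "true":
        if str(cfg.get("acks", "all")) not in ("all", "-1"):
            problems.append("idempotence requires acks=all")
        if int(cfg.get("max.in.flight.requests.per.connection", 5)) > 5:
            problems.append("idempotence requires max.in.flight <= 5")
        if int(cfg.get("retries", 2147483647)) <= 0:
            problems.append("idempotence requires retries > 0")
    return problems


if __name__ == "__main__":
    article_config = {
        "acks": "all",
        "enable.idempotence": "true",
        "max.in.flight.requests.per.connection": "1",
        "retries": "1000000",
        "request.timeout.ms": "30000",
        "delivery.timeout.ms": "120000",
    }
    print(producer_config_problems(article_config))  # []
```

Running a check like this in CI for both regions' client configs helps catch drift before a failover exposes it.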
Sample failover runbook (planned, MM2 / Cluster Linking)
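# 1) Stop writes on the primary
# 2) Check cross-region lag (e.g., 60 seconds or less)
# - MM2: replication-latency-ms, checkpoint lag
# - CL: mirror topic lag metrics
# 3) Promote the DR side
# Cluster Linking (example; commands vary by environment)
# Note: for a planned switchover, Cluster Linking's promote operation (which waits
# for mirror lag to reach zero) is usually the better fit; failover stops mirroring immediately
confluent kafka mirror failover --link dr-link --topics 'orders.*'
# 4) Switch client connections (DNS / config distribution)
# 5) Resume writes
# Unplanned case (conceptual example of MM2 offset translation)
# Assumes MirrorCheckpointConnector was enabled beforehand
# Adjust the consumer group based on the translated offsets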
kafka-consumer-groups \
--bootstrap-server b1:9092 \
--group app-g \
--reset-offsets --topic orders --to-offset <translated-offset> --execute
After running on DR, returning to the primary means treating the DR cluster as the source of truth, re-establishing replication in the reverse direction, and waiting until the delta has drained before flipping traffic back. Active-passive only needs a one-way switchover, which is easy to automate.
Active-active opens the door to concurrent updates on the same key and ordering skew, so you need a dedup-key design and downstream upsert/reconciliation logic. Remember that transactions do not cross cluster boundaries — fall back to unique keys and idempotent processing where needed.
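One common way to reconcile concurrent updates to the same key is a deterministic last-writer-wins upsert. The sketch below is an assumption-laden illustration: the `Update` type, its fields, and the region tie-breaker are all hypothetical, and other conflict-resolution strategies may suit your data better. Its point is that both regions converge on the same value regardless of arrival order.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    key: str
    value: str
    event_time_ms: int
    origin_region: str


def apply_update(store: dict[str, Update], incoming: Update) -> bool:
    """Upsert only if the incoming update wins; return True when the store changed.

    Ordering is (event time, origin region), so ties are broken deterministically
    and replaying the same update twice is a no-op.
    """
    current = store.get(incoming.key)
    incoming_rank = (incoming.event_time_ms, incoming.origin_region)
    if current is None or incoming_rank > (current.event_time_ms, current.origin_region):
        store[incoming.key] = incoming
        return True
    return False


if __name__ == "__main__":
    from_a = Update("sku-1", "qty=5", 1000, "A")
    from_b = Update("sku-1", "qty=7", 1000, "B")
    store: dict[str, Update] = {}
    apply_update(store, from_a)
    apply_update(store, from_b)
    print(store["sku-1"].value)  # qty=7
```

Last-writer-wins depends on reasonably synchronized clocks and silently discards the losing write, so it fits state that can be overwritten better than counters or ledgers.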
Recreating links for failback
# Cluster Linking (create a reverse link, DR -> Primary)
confluent kafka link create backfill \
--cluster A \
--source-cluster B \
--source-bootstrap-server b1:9092 \
--link-mode READ_ONLY
# MirrorMaker 2 (enable both directions)
clusters = A, B
A->B.enabled=true
B->A.enabled=true
# During failback, restrict the B->A topics to the ones you intend to copy back
The exam tests the boundary between in-cluster durability and cross-region DR, the meaning and side effects of each setting, and offset synchronization plus client behavior during switchover. Lock in three facts: replication is asynchronous so RPO is never zero, unclean leader election should stay off, and you must know how min.insync.replicas interacts with acks.
In production, continuously monitor UnderReplicatedPartitions, ActiveControllerCount, mirror latency, and link health, then wire alerts directly into the automation runbook.
Prometheus-style alert examples (excerpt)
- alert: KafkaUnderReplicatedPartitions
  expr: kafka_server_replicamanager_underreplicatedpartitions > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Under-replicated partitions detected
- alert: ActiveControllerCountNotOne
  expr: kafka_controller_kafkacontroller_activecontrollercount != 1
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: Active controller count is not 1
- alert: MirrorLagHigh
  expr: mm2_replication_latency_ms > 60000
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: MirrorMaker 2 replication latency exceeds 60s
CCAAK
Question 1
You run a two-region setup with MirrorMaker 2 asynchronously replicating A→B. Your RPO is 60 seconds and RTO is 15 minutes. During a planned failover, you want to minimize consumer duplicates while resuming from the correct position. Which procedure is most appropriate?
Answer: A
The canonical planned-switchover sequence is: stop writes → confirm replication lag and checkpoint catch-up → translate offsets on B → redirect connections. B and D carry serious risks of data loss or ordering corruption, and C actively worsens your RPO.
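The "confirm before promoting" step in that sequence works well as an explicit gate in automation. The sketch below is hypothetical: the `ReplicationStatus` fields and the thresholds are assumptions that you would feed from your own metrics. It returns the reasons a planned promotion should not proceed yet.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicationStatus:
    writes_stopped: bool       # producers on the primary are paused
    replication_lag_ms: int    # current cross-region lag
    checkpoint_age_ms: int     # time since the last offset checkpoint


def promotion_blockers(status: ReplicationStatus,
                       max_lag_ms: int = 0,
                       max_checkpoint_age_ms: int = 60_000) -> List[str]:
    """Return reasons a planned promotion must wait; an empty list means go."""
    blockers: List[str] = []
    if not status.writes_stopped:
        blockers.append("writes are still flowing to the primary")
    if status.replication_lag_ms > max_lag_ms:
        blockers.append(f"replication lag {status.replication_lag_ms} ms > {max_lag_ms} ms")
    if status.checkpoint_age_ms > max_checkpoint_age_ms:
        blockers.append(f"last checkpoint is {status.checkpoint_age_ms} ms old")
    return blockers


if __name__ == "__main__":
    print(promotion_blockers(ReplicationStatus(True, 0, 5_000)))        # []
    print(promotion_blockers(ReplicationStatus(False, 1_500, 90_000)))  # three blockers
```

Returning reasons instead of a bare boolean makes the runbook output self-explanatory when the gate refuses to proceed.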
Why not use synchronous replication across regions?
Kafka is designed for asynchronous replication across regions. WAN latency and partitions make synchronous replication a non-starter — throughput and availability would both collapse. To shrink RPO, invest in bandwidth, low-latency links, and tight monitoring; for planned switchovers, drain lag close to zero before promoting the DR cluster.
Are exactly-once semantics guaranteed across regions?
No. Kafka transactions and idempotent producers are scoped to a single cluster. Cross-region replication is asynchronous, so DR designs must absorb duplicates via unique keys and idempotent downstream processing.
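Absorbing duplicates downstream usually comes down to remembering which event IDs have already been applied. The sketch below is a minimal in-memory illustration. The `process_once` helper and the `(event_id, payload)` record shape are assumptions, and a real system would keep the seen-ID set in durable storage, updated atomically with the side effect.

```python
from typing import Callable, Iterable, Set, Tuple


def process_once(records: Iterable[Tuple[str, str]],
                 seen_ids: Set[str],
                 handler: Callable[[str], None]) -> int:
    """Apply handler to each record whose event ID has not been seen before.

    Returns the number of records actually processed.
    """
    processed = 0
    for event_id, payload in records:
        if event_id in seen_ids:
            continue  # duplicate from replay after failover; skip it
        handler(payload)
        seen_ids.add(event_id)
        processed += 1
    return processed


if __name__ == "__main__":
    applied = []
    seen: Set[str] = set()
    # Batch read before failover, then an overlapping batch read on the DR cluster.
    process_once([("e1", "a"), ("e2", "b")], seen, applied.append)
    process_once([("e2", "b"), ("e3", "c")], seen, applied.append)
    print(applied)  # ['a', 'b', 'c']
```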
Does migrating from ZooKeeper to KRaft affect DR design?
Intra-cluster metadata management changes, but the cross-region DR fundamentals — asynchronous replication via MM2 or Cluster Linking, RPO/RTO planning, and failover procedures — remain the same. Watch for renamed metrics and slightly different operational commands during the migration.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.