KRaft is Apache Kafka's native metadata management mechanism, eliminating the need for an external ZooKeeper. Because ZooKeeper mode was deprecated in Kafka 3.5 and is removed in Kafka 4.0, a planned migration is required.
Prioritizing safety and availability, this article focuses on a blue/green approach that gradually switches from an existing ZK cluster to a new KRaft cluster, and explains concrete techniques for minimizing downtime.
In KRaft, Kafka itself handles metadata replication and consensus (a Raft derivative). This simplifies the operational footprint: there is no ZooKeeper ensemble to operate, monitor, or tune.
Key items to confirm before migrating are version compatibility, client compatibility, replication requirements for internal topics, reapplying security settings (SASL/TLS), and switching over monitoring and observability. KRaft was declared production-ready for new clusters in Kafka 3.3 and has seen growing production adoption from Kafka 3.5 onward, but you should plan with a clear understanding of feature gaps and operational model differences.
| Aspect | ZooKeeper Mode | KRaft Mode |
|---|---|---|
| Metadata management | External ZooKeeper | Built-in Raft (KRaft) |
| Availability essentials | Maintaining ZK quorum + Broker | Maintaining Controller quorum + Broker |
| Rolling upgrade complexity | Must consider both ZK and Broker | Only Broker/Controller (clear role separation) |
| Operational targets | ZK nodes + Broker | Controller nodes + Broker (no ZK needed) |
| Recommended migration approach | — | Mirror to a new KRaft cluster, then cut over (phased migration) |
Technically the migration is a swap of old and new clusters, but in practice it requires end-to-end optimization including clients, operations, and monitoring. Use the checklist below to prevent gaps.
In particular, the target KRaft cluster must be correctly formatted and bootstrapped from the start. Mistakes in formatting (generating and distributing the cluster.id) are hard to recover from, so prepare standardized runbooks and automation.
In KRaft, the Controller quorum agrees on metadata and brokers follow it. Availability depends on a majority of Controllers staying up and on Broker health. Production should use dedicated Controller nodes, separate from Brokers: 3 for medium scale or 5 for large scale.
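The majority rule behind the 3-node and 5-node recommendation can be worked out with a little arithmetic. The following is a minimal sketch; the function names `quorum_majority` and `quorum_fault_tolerance` are hypothetical helpers, not part of any Kafka tool.

```shell
# Majority needed for a quorum of N voters: floor(N/2) + 1.
quorum_majority() {
  echo $(( $1 / 2 + 1 ))
}

# Number of Controller failures the quorum can absorb while keeping a majority.
quorum_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Compare the two quorum sizes recommended above.
for n in 3 5; do
  echo "voters=$n majority=$(quorum_majority "$n") tolerated_failures=$(quorum_fault_tolerance "$n")"
done
```

Three voters tolerate one failed Controller and five tolerate two. An even count such as four still tolerates only one, which is why odd quorum sizes are the usual choice.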
For networking and listeners, separate the client-facing listener (e.g., PLAINTEXT/TLS) and the Controller listener (CONTROLLER), and explicitly set inter.broker.listener.name.
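A small lint over a Broker's properties file can catch listener mistakes before a node is started. This is only a sketch: `prop` and `check_listeners` are hypothetical helper names, the checks cover just the points mentioned above, and it assumes `controller.listener.names` holds a single listener name.

```shell
# prop FILE KEY: print the value of KEY from a properties file (last one wins).
prop() {
  awk -F= -v key="$2" '$1 == key { v = $0; sub(/^[^=]*=/, "", v) } END { print v }' "$1"
}

# check_listeners FILE: return 0 if the listener layout looks sane, else print why.
check_listeners() (
  file=$1
  rc=0
  inter=$(prop "$file" inter.broker.listener.name)
  ctrl=$(prop "$file" controller.listener.names)
  map=$(prop "$file" listener.security.protocol.map)

  if [ -z "$inter" ]; then
    echo "inter.broker.listener.name is not set explicitly"; rc=1
  fi
  if [ -z "$ctrl" ]; then
    echo "controller.listener.names is not set"; rc=1
  elif [ "$ctrl" = "$inter" ]; then
    echo "broker and controller traffic share the listener $ctrl"; rc=1
  else
    # The controller listener needs a security protocol mapping on brokers too.
    case ",$map" in
      *",$ctrl:"*) ;;
      *) echo "listener.security.protocol.map has no entry for $ctrl"; rc=1 ;;
    esac
  fi
  exit "$rc"
)
```

For example, `check_listeners /etc/kafka/server.properties` on a Broker node exits non-zero and prints a reason when one of these settings is missing.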
The safe and common approach is a blue/green strategy: run a new KRaft cluster in parallel, sync data and offsets with MirrorMaker 2 (OSS) or Cluster Linking (Confluent Platform), and after the final sync switch the client bootstrap target.
An in-place migration on the same cluster is operationally complex, hard to recover from if it fails, and carries significant long-term risk. From a downtime-minimization standpoint as well, most teams choose to build new and cut over in phases.
Blue/green migration (synced with MirrorMaker 2)
The following is a representative procedure. In production, integrate it into automation (IaC / configuration management) and your change management process, and always rehearse in a staging environment first.
On the KRaft side, the format step must use the same cluster ID across all nodes. Start the Controller first, then start the Brokers. Enable checkpoints and offset sync in MM2, and verify zero lag before the final cut-over.
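Because a mismatched cluster ID is hard to recover from, it can help to verify the IDs after formatting and before starting any node. Formatting writes a `meta.properties` file containing `cluster.id` into each `log.dirs` directory. The sketch below compares copies of those files; `cluster_id_of` and `same_cluster_id` are hypothetical helper names, and how the files are collected from each node is left to your tooling.

```shell
# cluster_id_of FILE: print the cluster.id recorded in a meta.properties file.
cluster_id_of() {
  awk -F= '$1 == "cluster.id" { print $2 }' "$1"
}

# same_cluster_id FILE...: succeed only if every file carries the same cluster.id.
same_cluster_id() (
  first=""
  for f in "$@"; do
    id=$(cluster_id_of "$f")
    if [ -z "$id" ]; then
      echo "no cluster.id found in $f" >&2; exit 2
    fi
    if [ -z "$first" ]; then
      first=$id
    elif [ "$id" != "$first" ]; then
      echo "mismatch: $f has $id, expected $first" >&2; exit 1
    fi
  done
  exit 0
)
```

A run such as `same_cluster_id c1.meta.properties c2.meta.properties b1.meta.properties` (file names are illustrative) exits 0 only when all nodes were formatted with the same ID.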
Configuration and example commands (KRaft format/start, MirrorMaker 2)
# server.properties for Controller node (e.g., c1) (excerpt)
process.roles=controller
node.id=1
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
listeners=CONTROLLER://0.0.0.0:9093
controller.quorum.voters=1@c1:9093,2@c2:9093,3@c3:9093
log.dirs=/var/lib/kafka/data
# server.properties for Broker node (e.g., b1) (excerpt)
process.roles=broker
node.id=11
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://b1:9092
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
inter.broker.listener.name=PLAINTEXT
controller.listener.names=CONTROLLER
controller.quorum.voters=1@c1:9093,2@c2:9093,3@c3:9093
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=2
unclean.leader.election.enable=false
log.dirs=/var/lib/kafka/data
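# 1) Generate cluster ID (run once on one node, then distribute the value to all nodes)
CLUSTER_ID=$(kafka-storage.sh random-uuid)
echo "$CLUSTER_ID"
# 2) Format each node (set CLUSTER_ID to the value generated in step 1 on every node)
#    If Kafka runs under a dedicated account, format as that account so log.dirs stays writable by it.
sudo kafka-storage.sh format -t "$CLUSTER_ID" -c /etc/kafka/server.properties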
# 3) Start Controller → start Broker (ensure ordering via service management)
# systemctl start kafka-controller; systemctl start kafka-broker ... etc.
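# 4) MirrorMaker 2 configuration (MM2 connectors running on Connect distributed workers)
# mm2.properties (excerpt): MirrorSourceConnector settings
name=mm2
connector.class=org.apache.kafka.connect.mirror.MirrorSourceConnector
tasks.max=4
# Source (ZK cluster) and destination (KRaft)
source.cluster.alias=src
target.cluster.alias=dst
source.cluster.bootstrap.servers=zk-b1:9092,zk-b2:9092
source.cluster.security.protocol=PLAINTEXT
target.cluster.bootstrap.servers=kr-b1:9092,kr-b2:9092
target.cluster.security.protocol=PLAINTEXT
# Mirror targets
topics=.*
replication.factor=3
sync.topic.acls.enabled=true
# The default replication policy prefixes mirrored topics with the source alias (src.<topic>).
# To keep topic names unchanged on the KRaft side, consider:
# replication.policy.class=org.apache.kafka.connect.mirror.IdentityReplicationPolicy
# Checkpoint and group-offset sync are handled by a separate MirrorCheckpointConnector
# (connector.class=org.apache.kafka.connect.mirror.MirrorCheckpointConnector, same cluster settings):
groups=.*
emit.checkpoints.enabled=true
sync.group.offsets.enabled=true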
# 5) Start the Connect worker and submit the MM2 connector
# connect-distributed.sh /etc/kafka/connect-distributed.properties &
# curl -X POST localhost:8083/connectors -H 'Content-Type: application/json' -d @mm2.json
#    (mm2.json carries the step 4 settings in the REST API's {"name": ..., "config": {...}} form)
# 6) Check lag (e.g., ConsumerGroups, MM2 metrics/JMX)
# 7) When lag is zero, switch Producer/Consumer bootstrap to the KRaft side
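One way to turn the lag check in step 6 into a scriptable gate is to sum the LAG column of `kafka-consumer-groups.sh --describe` output. The sketch below assumes the usual column layout with a header row starting with `GROUP` and a `LAG` column; `total_lag` is a hypothetical helper name, and MM2's own metrics should still be checked separately.

```shell
# total_lag: read `--describe` output on stdin and print the summed LAG column.
total_lag() {
  awk '
    $1 == "GROUP" {                      # header row: find the LAG column
      for (i = 1; i <= NF; i++) if ($i == "LAG") col = i
      next
    }
    col && $col ~ /^[0-9]+$/ { sum += $col }   # skip "-" (no committed offset)
    END { print sum + 0 }
  '
}
```

For example, `kafka-consumer-groups.sh --bootstrap-server zk-b1:9092 --describe --all-groups | total_lag` prints a single number that a cut-over script can compare against zero.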
Cut-over is not merely a DNS change — it involves consistency validation. Observe end-to-end latency, error rates, partition leader distribution, and offset consistency. If problems arise, roll back within the predetermined time window.
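Partition leader distribution can be spot-checked from `kafka-topics.sh --describe` output by counting how many partitions each Broker leads. This is a rough sketch that assumes each partition line contains a `Leader: <id>` pair; `leaders_per_broker` is a hypothetical helper name.

```shell
# leaders_per_broker: read `kafka-topics.sh --describe` output on stdin and
# print "<broker-id> <number of partitions led>" per broker, sorted by id.
leaders_per_broker() {
  awk '
    { for (i = 1; i < NF; i++) if ($i == "Leader:") count[$(i + 1)]++ }
    END { for (b in count) print b, count[b] }
  ' | sort -n
}
```

Piping `kafka-topics.sh --bootstrap-server kr-b1:9092 --describe` into `leaders_per_broker` shows at a glance whether one Broker is carrying a disproportionate share of leaders after the switch.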
After the migration is complete, stop unnecessary mirroring and gradually retire the old cluster. Review monitoring and alert thresholds, update backup/DR, and refresh the operational runbook.
CCAAK
Question 1
You want to migrate from a ZK-based Kafka cluster to a KRaft cluster while minimizing downtime. Which combination of steps is appropriate?
Correct answer: A
The safest and most common approach is to run a new KRaft cluster in parallel, sync it in advance with MirrorMaker 2 (or Cluster Linking), and switch at zero lag. Reusing existing brokers in place or sharing broker.id is dangerous, and switching DNS first risks data loss or duplication.
Do clients need to change when migrating to KRaft?
Usually only the bootstrap address needs to change. Because the API and wire protocol stay compatible, producer/consumer code does not need to be modified. That said, double-check that security settings (SASL/TLS), DNS names, and connection parameters like retries and timeouts do not differ between the old and new clusters.
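To double-check that nothing other than the bootstrap address differs, the old and new client property files can be compared key by key. The following is a minimal sketch: `client_settings_diff` is a hypothetical helper, and it assumes plain `key=value` properties files.

```shell
# client_settings_diff OLD NEW: print every key whose value differs between two
# client properties files, ignoring bootstrap.servers (which is expected to change).
client_settings_diff() {
  awk -F= '
    /^[[:space:]]*(#|$)/ { next }              # skip comments and blank lines
    $1 == "bootstrap.servers" { next }          # expected to differ
    {
      val = $0; sub(/^[^=]*=/, "", val)
      if (FILENAME == ARGV[1]) oldv[$1] = val; else newv[$1] = val
      keys[$1] = 1
    }
    END {
      for (k in keys) {
        o = (k in oldv) ? oldv[k] : "<unset>"
        n = (k in newv) ? newv[k] : "<unset>"
        if (o != n) printf "%s old=%s new=%s\n", k, o, n
      }
    }
  ' "$1" "$2" | sort
}
```

Empty output means the two files agree on everything except the bootstrap address; any printed line is a setting to review before cut-over.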
Can downtime be reduced to zero?
In theory it can be made arbitrarily small, but truly zero downtime requires controlling every switch-over effect on the network and application side. In practice, taking a short maintenance window and switching Producers and then Consumers after confirming zero lag keeps the actual interruption to between tens of seconds and a few minutes.
What should you watch out for after retiring ZooKeeper?
Preserve snapshots of the old cluster's ACLs, quotas, and configuration, and retain them for the period required by your audit requirements. Update your monitoring, backup, and DR procedures to assume KRaft, and add Controller quorum health (leader change time, maintaining voting majority) to your top-priority monitoring targets.
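Controller quorum health can be polled with `kafka-metadata-quorum.sh ... describe --status`, which reports fields such as `LeaderId` and `MaxFollowerLag`. The sketch below turns that into a pass/fail check. It assumes the `Name: value` output layout, and `quorum_follower_lag_ok` and its threshold are hypothetical choices to adapt to your alerting.

```shell
# quorum_follower_lag_ok [MAX]: read `describe --status` output on stdin and
# succeed only if MaxFollowerLag is present and at most MAX (default 0).
quorum_follower_lag_ok() (
  max_allowed=${1:-0}
  lag=$(awk -F: '$1 == "MaxFollowerLag" { gsub(/[[:space:]]/, "", $2); print $2 }')
  [ -n "$lag" ] && [ "$lag" -le "$max_allowed" ]
)
```

For example, `kafka-metadata-quorum.sh --bootstrap-server kr-b1:9092 describe --status | quorum_follower_lag_ok 100` exits non-zero when a follower trails the leader by more than 100 records or when the field is missing.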
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.