Migrating from ZooKeeper to KRaft: Procedures, Prerequisites & Minimizing Downtime

2026-04-19
NicheeLab Editorial Team

KRaft is Apache Kafka's native metadata management mechanism, which eliminates the need for an external ZooKeeper ensemble. ZooKeeper mode was deprecated in Kafka 3.5 and removed entirely in Kafka 4.0, so clusters still running on ZooKeeper need a planned migration.

Prioritizing safety and availability, this article focuses on a blue/green approach that gradually switches from an existing ZK cluster to a new KRaft cluster, and explains concrete techniques for minimizing downtime.

Why KRaft Now: Prerequisites and Constraints

In KRaft, Kafka itself handles metadata replication and consensus (a Raft derivative). The operational footprint is simplified, eliminating the need to operate, monitor, and tune ZooKeeper.

Key items to confirm before migrating are version compatibility, client compatibility, replication requirements for internal topics, reapplying security settings (SASL/TLS), and switching over monitoring and observability. KRaft has seen growing production adoption from Kafka 3.5 onward, but you should plan with a clear understanding of feature gaps and operational model differences.

  • Client compatibility: Standard Kafka clients are wire-protocol compatible, so they work by just switching the bootstrap target.
  • Controller design: Production setups should use dedicated controllers (3 or 5 nodes). Combining controller and broker roles is for small-scale or validation use.
  • Internal topics: Keep RF=3 for __consumer_offsets and __transaction_state, and set the minimum ISR appropriately (e.g., transaction.state.log.min.isr=2) to preserve availability.
  • Replacing ZooKeeper-dependent features: Review topic/ACL/quota management and monitoring tooling for KRaft.
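
To make the RF and minimum ISR pairing above concrete, here is a minimal Python sketch. The function name is illustrative and not part of any Kafka API. It assumes producers use acks=all, in which case a partition keeps accepting writes only while at least the configured minimum number of replicas is in sync.

```python
def tolerated_broker_failures(replication_factor: int, min_isr: int) -> int:
    """Replicas that can be offline while acks=all writes still succeed."""
    if not 1 <= min_isr <= replication_factor:
        raise ValueError("min_isr must be between 1 and replication_factor")
    return replication_factor - min_isr


# RF=3 with a minimum ISR of 2 survives the loss of one replica.
print(tolerated_broker_failures(3, 2))  # 1
```

With RF=3 and a minimum ISR of 3, the same calculation gives 0, which is why the minimum ISR is usually set one below the replication factor.
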

Aspect | ZooKeeper Mode | KRaft Mode
Metadata management | External ZooKeeper | Built-in Raft (KRaft)
Availability essentials | Maintaining ZK quorum + Broker | Maintaining Controller quorum + Broker
Rolling upgrade complexity | Must consider both ZK and Broker | Only Broker/Controller (clear role separation)
Operational targets | ZK nodes + Broker | Controller nodes + Broker (no ZK needed)
Recommended migration approach | Mirror to a new KRaft cluster, then cut over (phased migration)

Pre-Migration Checklist (Compatibility, Capacity, Security)

Technically the migration is a swap of old and new clusters, but in practice it requires end-to-end optimization including clients, operations, and monitoring. Use the checklist below to prevent gaps.

In particular, the target KRaft cluster must be correctly formatted and bootstrapped from the start. Mistakes in formatting (generating and distributing the cluster.id) are hard to recover from, so prepare standardized runbooks and automation.
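
As one possible runbook check, the sketch below compares the cluster.id recorded in each node's meta.properties file, which the format step writes into log.dirs. It works on file contents passed in as strings. How the files are collected from the nodes, the node names, and the sample ID are assumptions made for illustration.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_cluster_ids(meta_by_node: dict[str, str]) -> str:
    """Return the shared cluster.id, or raise if any node is missing it or differs."""
    ids = {}
    for node, text in meta_by_node.items():
        cluster_id = parse_properties(text).get("cluster.id")
        if not cluster_id:
            raise ValueError(f"{node}: no cluster.id found")
        ids[node] = cluster_id
    if len(set(ids.values())) != 1:
        raise ValueError(f"cluster.id mismatch: {ids}")
    return next(iter(ids.values()))


# Hypothetical meta.properties contents gathered from two nodes.
controller_meta = "#\ncluster.id=MkU3OEVBNTcwNTJENDM2Qk\nnode.id=1\nversion=1\n"
broker_meta = "#\ncluster.id=MkU3OEVBNTcwNTJENDM2Qk\nnode.id=11\nversion=1\n"
print(check_cluster_ids({"c1": controller_meta, "b1": broker_meta}))
```

Running a check like this before starting any process catches a node that was formatted with a different ID while the mistake is still cheap to fix.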

  • Version: Use a stable KRaft version (generally 3.5+) as the target. Keep the source ZK side on a MirrorMaker 2-compatible version.
  • Capacity/Performance: Estimate broker count, disk, and network based on peak throughput and retention period. Align monitoring stacks whether self-hosted or managed.
  • Security: Inventory SASL/TLS certificates, JAAS, ACLs, and quotas, and provision them on the KRaft side. Prepare a switch-over plan for the bootstrap DNS names.
  • Internal topics: Set offsets.topic.replication.factor, transaction.state.log.replication.factor, etc., to 3. Disable auto.create.topics.enable as needed.
  • Monitoring/Operations: Rebuild metrics (JMX/Exporter), logs, alerts, and observability dashboards for KRaft. Update backup and DR procedures.
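
For the capacity item, a back-of-the-envelope disk estimate can be sketched as follows. This is an upper-bound approximation that assumes peak throughput is sustained for the whole retention period and that data is spread evenly across brokers. The function name, the 30% headroom default, and the sample numbers are all assumptions.

```python
import math


def estimate_disk_per_broker_gb(
    peak_mb_per_s: float,
    retention_hours: float,
    replication_factor: int,
    broker_count: int,
    headroom: float = 0.3,
) -> int:
    """Rough per-broker disk need in GB, keeping `headroom` of the disk free."""
    if broker_count < 1 or not 0 <= headroom < 1:
        raise ValueError("broker_count must be >= 1 and headroom in [0, 1)")
    total_mb = peak_mb_per_s * retention_hours * 3600 * replication_factor
    per_broker_gb = total_mb / broker_count / 1024
    return math.ceil(per_broker_gb / (1 - headroom))


# 50 MB/s peak ingest, 72 h retention, RF=3, 6 brokers.
print(estimate_disk_per_broker_gb(50, 72, 3, 6))  # 9041
```

Real sizing should also account for compression, compaction, and uneven partition distribution, none of which this sketch models.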

KRaft Cluster Design Essentials (Controller / Quorum / Listeners)

In KRaft, the Controller quorum agrees on metadata and brokers follow it. The cluster stays available as long as a majority of Controllers is up and the Brokers are healthy. Production should use dedicated Controller nodes, separate from Brokers: 3 for medium scale or 5 for large scale.
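
The majority rule behind the 3-or-5 recommendation can be worked through with a few lines of Python. The function names are illustrative only.

```python
def quorum_majority(voters: int) -> int:
    """Smallest number of voters that forms a majority."""
    return voters // 2 + 1


def tolerated_controller_failures(voters: int) -> int:
    """Controllers that can fail while a majority remains."""
    return voters - quorum_majority(voters)


for n in (3, 4, 5):
    print(n, quorum_majority(n), tolerated_controller_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

A fourth controller raises the majority size without tolerating any additional failure, which is why odd quorum sizes are the norm.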

For networking and listeners, separate the client-facing listener (e.g., PLAINTEXT or SSL) and the Controller listener (CONTROLLER), and explicitly set inter.broker.listener.name.

  • Process roles: process.roles=controller (dedicated) or broker,controller (combined; for validation/small scale).
  • Quorum: controller.quorum.voters is comma-separated id@host:port (e.g., 1@c1:9093,2@c2:9093,3@c3:9093).
  • Node ID: node.id must be unique. The legacy broker.id is consolidated into node.id in KRaft.
  • Listeners: controller.listener.names=CONTROLLER; define the CONTROLLER protocol via listener.security.protocol.map.
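
Because a malformed controller.quorum.voters value or a duplicated node ID is easy to introduce when templating configuration, a small pre-flight check can help. The sketch below parses the id@host:port format described above. The Voter type and function name are hypothetical helpers, not Kafka APIs.

```python
from typing import NamedTuple


class Voter(NamedTuple):
    node_id: int
    host: str
    port: int


def parse_quorum_voters(value: str) -> list[Voter]:
    """Parse 'id@host:port,...' and reject malformed entries or duplicate IDs."""
    voters = []
    for entry in value.split(","):
        node_id, at, endpoint = entry.strip().partition("@")
        host, colon, port = endpoint.rpartition(":")
        if not (at and colon and host and node_id.isdigit() and port.isdigit()):
            raise ValueError(f"malformed voter entry: {entry!r}")
        voters.append(Voter(int(node_id), host, int(port)))
    ids = [v.node_id for v in voters]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate node IDs in controller.quorum.voters")
    return voters


voters = parse_quorum_voters("1@c1:9093,2@c2:9093,3@c3:9093")
print([v.node_id for v in voters])  # [1, 2, 3]
```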

Migration Patterns to Minimize Downtime

The safe and common approach is a blue/green strategy: run a new KRaft cluster in parallel, sync data and offsets with MirrorMaker 2 (OSS) or Cluster Linking (Confluent Platform), and after the final sync switch the client bootstrap target.

Apache Kafka 3.x does provide an official in-place migration procedure for an existing cluster, but it is operationally complex and hard to roll back once it goes wrong. From a downtime-minimization standpoint as well, most teams choose to build new and cut over in phases.

  • Blue/green (recommended): Sync in parallel, verify zero lag, then cut over in one go. If it fails, you can roll back by reverting DNS and configuration.
  • Phased cut-over: Migrate sequentially by topic or consumer group. Move the most critical ones last.
  • Client switch: Decide whether Producers or Consumers switch first based on workload characteristics.
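
A phased cut-over plan can be drafted mechanically once each consumer group has a criticality score. The sketch below orders groups from least to most critical and splits them into waves. The group names, scores, and wave size are made-up examples.

```python
def plan_waves(criticality: dict[str, int], wave_size: int) -> list[list[str]]:
    """Order groups from least to most critical, then split them into waves."""
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    ordered = sorted(criticality, key=lambda name: (criticality[name], name))
    return [ordered[i:i + wave_size] for i in range(0, len(ordered), wave_size)]


# Higher score = more critical, so it migrates later.
groups = {"billing": 3, "audit-export": 1, "clickstream": 1, "search-index": 2}
print(plan_waves(groups, wave_size=2))
# [['audit-export', 'clickstream'], ['search-index', 'billing']]
```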

Blue/green migration (synced with MirrorMaker 2)

[Figure: Producers and Consumers move in phases from the existing ZK cluster (brokers B1..Bn, ZK ensemble) to the new KRaft cluster (controllers C1..C3/5, brokers B1'..Bn'), with MirrorMaker 2 / Cluster Linking replicating between them.]
Cut-over: 1) Start KRaft 2) Start mirror 3) Check lag 4) Switch Producer 5) Switch Consumer 6) Stop old cluster

Example Procedure: ZK to a New KRaft Cluster (using MirrorMaker 2)

The following is a representative procedure. In production, integrate it into automation (IaC / configuration management) and your change management process, and always rehearse in a staging environment first.

On the KRaft side, the format step must use the same cluster ID across all nodes. Start the Controller first, then start the Brokers. Enable checkpoints and offset sync in MM2, and verify zero lag before the final cut-over.

  1. Prepare KRaft configuration files (separate Controller and Broker recommended)
  2. Generate the cluster ID and format each node
  3. Start the Controller, then start the Brokers
  4. Sync topics and groups with MirrorMaker 2 (or Cluster Linking)
  5. Verify zero lag, then switch the bootstrap target in the order Producer → Consumer
  6. Observe monitoring, error rate, and latency to confirm stabilization, then gradually shut down the old cluster

Configuration and example commands (KRaft format/start, MirrorMaker 2)

# server.properties for Controller node (e.g., c1) (excerpt)
process.roles=controller
node.id=1
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
listeners=CONTROLLER://0.0.0.0:9093
controller.quorum.voters=1@c1:9093,2@c2:9093,3@c3:9093
log.dirs=/var/lib/kafka/data

# server.properties for Broker node (e.g., b1) (excerpt)
process.roles=broker
node.id=11
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://b1:9092
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
inter.broker.listener.name=PLAINTEXT
controller.listener.names=CONTROLLER
controller.quorum.voters=1@c1:9093,2@c2:9093,3@c3:9093
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=2
unclean.leader.election.enable=false
log.dirs=/var/lib/kafka/data

# 1) Generate cluster ID (run once on a controller, then distribute to all nodes)
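CLUSTER_ID="$(kafka-storage.sh random-uuid)"
echo "$CLUSTER_ID"

# 2) Format each node (using the same CLUSTER_ID)
# Run as the account that runs Kafka so the formatted log.dirs stay writable by it
# (the "kafka" user name is an assumption; adjust to your installation)
sudo -u kafka kafka-storage.sh format -t "$CLUSTER_ID" -c /etc/kafka/server.properties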

# 3) Start Controller → start Broker (ensure ordering via service management)
# systemctl start kafka-controller; systemctl start kafka-broker ... etc.

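# 4) MirrorMaker 2 configuration (connectors for a Connect distributed worker)
# In Connect mode, topic replication and offset checkpoints are separate connectors.

# mm2-source.properties (excerpt): replicates topic data and ACLs
name=mm2-source
connector.class=org.apache.kafka.connect.mirror.MirrorSourceConnector
tasks.max=4

# Source (ZK cluster) and destination (KRaft)
source.cluster.alias=src
target.cluster.alias=dst
source.cluster.bootstrap.servers=zk-b1:9092,zk-b2:9092
source.cluster.security.protocol=PLAINTEXT
target.cluster.bootstrap.servers=kr-b1:9092,kr-b2:9092
target.cluster.security.protocol=PLAINTEXT

# Mirror targets
topics=.*
replication.factor=3
sync.topic.acls.enabled=true
# Keep topic names unchanged (Kafka 3.0+). The default policy prefixes them with the
# source alias (src.<topic>), so clients could not simply switch bootstrap servers.
replication.policy.class=org.apache.kafka.connect.mirror.IdentityReplicationPolicy

# mm2-checkpoint.properties (excerpt): emits checkpoints and syncs group offsets
name=mm2-checkpoint
connector.class=org.apache.kafka.connect.mirror.MirrorCheckpointConnector
# Repeat the source.cluster.* / target.cluster.* settings and replication.policy.class from above
groups=.*
emit.checkpoints.enabled=true
sync.group.offsets.enabled=true

# 5) Start the Connect worker, then submit each connector (JSON form of the settings above)
# connect-distributed.sh /etc/kafka/connect-distributed.properties &
# curl -X POST localhost:8083/connectors -H 'Content-Type: application/json' -d @mm2-source.json
# curl -X POST localhost:8083/connectors -H 'Content-Type: application/json' -d @mm2-checkpoint.json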

# 6) Check lag (e.g., ConsumerGroups, MM2 metrics/JMX)
# 7) When lag is zero, switch Producer/Consumer bootstrap to the KRaft side

Validation, Cut-over, Rollback & Post-Migration Tasks

Cut-over is more than a DNS change: it also requires consistency validation. Observe end-to-end latency, error rates, partition leader distribution, and offset consistency. If problems arise, roll back within the predetermined time window.

After the migration is complete, stop unnecessary mirroring and gradually retire the old cluster. Review monitoring and alert thresholds, update backup/DR, and refresh the operational runbook.

  • Validation: Message-count parity for representative topics, latency, throughput, consumer lag, and presence of duplicates or losses.
  • Cut-over order: Producer → Consumer. Reserve a short maintenance window if needed.
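  • Rollback: Revert the bootstrap target to the old cluster and restore from pre-cut-over snapshots/configuration. The MM2 flow used for migration is one-way (old → new), so records written to the KRaft cluster after cut-over are not on the old cluster unless you mirror them back or replay them. Plan for this before switching Producers.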
  • Decommissioning: Preserve ACLs/quotas/snapshots, retain audit logs, and follow an approval process for releasing resources.
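
For the message-count parity check, one rough approach is to compare end offset minus beginning offset, summed over partitions, on both clusters. The sketch below assumes the offsets have already been fetched, and that the topics are non-compacted and non-transactional. Otherwise offset ranges do not equal record counts. Function names and sample values are illustrative.

```python
OffsetRange = tuple[int, int]  # (beginning offset, end offset)


def approx_record_count(partitions: dict[int, OffsetRange]) -> int:
    """Sum of (end - beginning) over all partitions of one topic."""
    return sum(end - begin for begin, end in partitions.values())


def parity_report(
    source: dict[str, dict[int, OffsetRange]],
    target: dict[str, dict[int, OffsetRange]],
    tolerance: int = 0,
) -> dict[str, tuple[int, int, bool]]:
    """Per topic: (source count, target count, within tolerance?)."""
    report = {}
    for topic, partitions in source.items():
        src = approx_record_count(partitions)
        dst = approx_record_count(target.get(topic, {}))
        report[topic] = (src, dst, abs(src - dst) <= tolerance)
    return report


# The offsets differ between clusters, but the record counts match.
source = {"orders": {0: (100, 600), 1: (50, 550)}}
target = {"orders": {0: (0, 500), 1: (0, 500)}}
print(parity_report(source, target))  # {'orders': (1000, 1000, True)}
```

A mismatch here is a prompt to investigate and not proof of loss, because retention can delete records on the two clusters at different times.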

Test Your Knowledge

CCAAK

Question 1

You want to migrate from a ZK-based Kafka cluster to a KRaft cluster while minimizing downtime. Which combination of steps is appropriate?

  1. Build a new KRaft cluster, sync topics and consumer groups via MirrorMaker 2. Confirm zero lag, then switch the bootstrap target in the order Producer → Consumer.
  2. Restart some brokers of the existing cluster as KRaft, stop ZooKeeper, then switch the clients.
  3. Configure the same broker.id on both ZK and KRaft, run them simultaneously, and gradually remove ZooKeeper.
  4. Start the KRaft cluster, switch DNS immediately, and use MirrorMaker 2 afterward to fill the gap.

Correct answer: 1

The safest and most common approach is to run a new KRaft cluster in parallel, sync it in advance with MirrorMaker 2 (or Cluster Linking), and switch at zero lag. Reusing existing brokers in place or sharing broker.id is dangerous, and switching DNS first risks data loss or duplication.

Frequently Asked Questions

Do clients need to change when migrating to KRaft?

Usually only the bootstrap address needs to change. Because the API and wire protocol stay compatible, producer/consumer code does not need to be modified. That said, double-check that security settings (SASL/TLS), DNS names, and connection parameters like retries and timeouts do not differ between the old and new clusters.
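
As a small illustration of how little changes on the client side, the sketch below derives a new client configuration from an existing one and confirms that only bootstrap.servers differs. The configuration is represented as a plain dictionary, and the host names are placeholders.

```python
def switch_bootstrap(client_config: dict[str, str], new_bootstrap: str) -> dict[str, str]:
    """Return a copy of the config that points at the new cluster."""
    updated = dict(client_config)
    updated["bootstrap.servers"] = new_bootstrap
    return updated


def changed_keys(old: dict[str, str], new: dict[str, str]) -> set[str]:
    """Keys whose values differ between two configurations."""
    return {key for key in old.keys() | new.keys() if old.get(key) != new.get(key)}


old = {
    "bootstrap.servers": "zk-b1:9092,zk-b2:9092",
    "acks": "all",
    "security.protocol": "PLAINTEXT",
}
new = switch_bootstrap(old, "kr-b1:9092,kr-b2:9092")
print(changed_keys(old, new))  # {'bootstrap.servers'}
```

Running the same diff against the configuration actually deployed to the new environment is a quick way to spot unintended differences in security or timeout settings.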

Can downtime be reduced to zero?

In theory, downtime can be made arbitrarily short, but truly zero downtime requires controlling every switch-over effect on the network and application side. In practice, taking a short maintenance window and switching Producers and then Consumers after confirming zero lag keeps the actual interruption to somewhere between tens of seconds and a few minutes.

What should you watch out for after retiring ZooKeeper?

Preserve snapshots of the old cluster's ACLs, quotas, and configuration, and retain them for the period required by your audit requirements. Update your monitoring, backup, and DR procedures to assume KRaft, and add Controller quorum health (leader change time, maintaining voting majority) to your top-priority monitoring targets.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.

