Practical Kafka Cluster Upgrade Guide: Rolling Updates and Compatibility Essentials

2026-04-19
NicheeLab Editorial Team

With the right plan, Kafka upgrades can be performed with zero downtime. The keys are understanding compatibility and following a phased rolling update procedure. This article distills the official documentation into a workflow that is useful both for CCAAK exam prep and for production operations.

We turn the following into concrete steps and checklists: pinning inter.broker.protocol.version (IBPV), pinning the message format when needed, the correct order for controllers and clients, the differences between KRaft and ZooKeeper, and the essentials of validation and rollback.

Compatibility Fundamentals at a Glance

Kafka compatibility breaks down into three layers: the wire protocol (broker-to-broker and client-to-broker), the record message format, and the metadata feature level (for KRaft). In a rolling update, you first pin IBPV and the message format to the current versions so that existing communication and record formats are not broken, then upgrade the binaries.

Clients can generally talk to newer brokers using older API versions, so upgrading brokers first is the rule. In ZooKeeper mode the controller role is embedded in the brokers, but in KRaft mode you must roll the controller quorum and the broker fleet separately, in that order.

  • Wire compatibility: pin the inter-broker protocol to the current version via IBPV before upgrading.
  • Message format: upgrade with the old format in place, then switch only after every broker is upgraded. Existing segments are not rewritten.
  • Client compatibility: upgrade brokers first, clients later. Transactions and the idempotent producer need brokers on 0.11 or later, so they can fail while older brokers are still in the mix.
  • KRaft metadata features: KRaft has a separate finalization step for the feature level (e.g., via the kafka-features tool).
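As a small illustration of the pinning idea, the sketch below derives a pin value from a full Kafka version string. The function name ibpv_pin is hypothetical, and the rule it applies (major.minor, or the first three components for 0.x releases) is an assumption to check against the valid values documented for your release.

```shell
# Derive the IBPV pin value from a full Kafka version string.
# Assumption: the pin is major.minor (2.8.1 -> 2.8), or the first three
# components for 0.x releases (0.11.0.3 -> 0.11.0).
ibpv_pin() {
  case "$1" in
    0.*) printf '%s\n' "$1" | cut -d. -f1-3 ;;
    *)   printf '%s\n' "$1" | cut -d. -f1-2 ;;
  esac
}

ibpv_pin 2.8.1   # prints 2.8
```

The output is the value you would place in inter.broker.protocol.version on every broker before touching any binaries.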

Standard Rolling Update Flow (Zero-Downtime)

The flow below is for ZooKeeper mode. KRaft follows the same one-node-at-a-time pattern, but adds an ordering requirement (controllers first, then brokers) and ends with a feature-level finalization step, which takes the place of the IBPV bump.

What matters is the order: pin compatibility first, upgrade one broker at a time while checking health, and only unpin (move to the latest) once every broker is upgraded.

  • Before starting: confirm cluster health (URP=0, no offline partitions, stable controller). Pin IBPV and the message format to the current values as needed.
  • One at a time: reduce load and leadership on the target broker as much as possible → stop → upgrade binaries → start → verify health.
  • After all brokers are upgraded: update IBPV and the message format to the latest version. Apply the config change via a rolling restart if needed.
  • KRaft: roll the controller quorum first, then the brokers, and finalize the metadata feature level at the end.
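The ordering above can be sketched as a dry-run plan printer. Nothing here touches a cluster: upgrade_plan is a hypothetical helper, and the node names (b1, c1, ...) are placeholders.

```shell
# Print the node order for a rolling upgrade as a dry run; nothing is executed.
# Usage: upgrade_plan MODE "BROKERS" ["CONTROLLERS"]
# MODE is "zk" or "kraft"; node names are hypothetical and space-separated.
upgrade_plan() {
  mode=$1 brokers=$2 controllers=${3:-}
  if [ "$mode" = "kraft" ]; then
    for c in $controllers; do echo "roll controller $c"; done
  else
    echo "pin IBPV on all brokers"
  fi
  for b in $brokers; do echo "roll broker $b"; done
  if [ "$mode" = "kraft" ]; then
    echo "finalize metadata feature level"
  else
    echo "unpin IBPV, then rolling restart"
  fi
}

upgrade_plan kraft "b1 b2 b3" "c1 c2 c3"
```

Printing the plan first makes it easy to review the order with the team before anyone stops a node.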

Conceptual diagram of a rolling update (B1 → B2 → B3, one broker at a time)

Clients -> LB / Bootstrap -> brokers

Step 0 (initial):                          B1(old)  B2(old)  B3(old)
Step 1: stop/update/start B1 -> (new):     B1(new)  B2(old)  B3(old)
Step 2: B2 -> (new) / Step 3: B3 -> (new): B1(new)  B2(new)  B3(new)
Finalize: unpin protocol/message -> optional rolling restart

Example runbook (assumes ZooKeeper mode; see the KRaft notes below)

# 0) Pre-check (cluster health)
$ kafka-topics.sh --bootstrap-server broker:9092 --describe --under-replicated-partitions
$ kafka-topics.sh --bootstrap-server broker:9092 --describe --unavailable-partitions
$ kafka-broker-api-versions.sh --bootstrap-server broker:9092 | head

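# 1) Pin compatibility (apply the same server.properties on every broker)
# Example: if the current production version is 2.8
# Keep comments on their own lines: a properties file treats a trailing "# ..." as part of the value.
inter.broker.protocol.version=2.8
# Set the next line only if the property exists in your version; skip it otherwise
log.message.format.version=2.8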

# 2) Upgrade brokers one at a time
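# (Optional) leadership handling around the stop
# A clean stop moves leadership off the broker when controlled.shutdown.enable=true (the default).
# A preferred leader election does not drain a broker; it restores the preferred leaders,
# so run it after the broker is back and in sync (kafka-preferred-replica-election.sh was removed in 3.0):
# $ kafka-leader-election.sh --bootstrap-server broker:9092 --election-type preferred --all-topic-partitions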

# Stop -> upgrade package -> start broker B1
$ sudo systemctl stop kafka
# Run the package upgrade in the way your distro/artifact requires
$ sudo tar -xf kafka_2.13-3.6.0.tgz -C /opt/kafka --strip-components=1
$ sudo systemctl start kafka

# Health check (URP=0, no offline partitions, no client errors)
$ kafka-topics.sh --bootstrap-server broker:9092 --describe --under-replicated-partitions

# Repeat for B2, B3 ...

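# 3) After all brokers are upgraded, unpin (move to the latest)
# Example for a 3.6 target:
inter.broker.protocol.version=3.6
# log.message.format.version is deprecated since 3.0 and ignored once IBPV is 3.0 or higher,
# so on a 3.x target remove the old pin instead of setting a new value.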

# 4) Final rolling restart to apply the new config
$ sudo systemctl restart kafka  # one broker at a time

# 5) For KRaft, finalize the metadata feature level (actual options depend on your version)
$ kafka-features.sh --bootstrap-server broker:9092 describe
$ kafka-features.sh --bootstrap-server broker:9092 upgrade --metadata 3.6
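
The per-broker health check in the runbook can be turned into a polling gate. This is a minimal sketch: count_urp and wait_for_urp_zero are hypothetical helper names, and the parsing assumes each under-replicated partition appears as one line containing "Partition:" in the describe output.

```shell
# Count under-replicated partitions from `kafka-topics.sh --describe
# --under-replicated-partitions` output on stdin (one "Partition:" line each).
count_urp() {
  grep -c 'Partition:' || true   # grep -c prints 0 but exits 1 on no match
}

# Poll until the check command reports URP=0, or give up after MAX_TRIES.
# Usage: wait_for_urp_zero MAX_TRIES SLEEP_SECS CMD [ARGS...]
# Returns 0 when healthy, 1 on timeout, 2 if the check command itself fails
# (a failed command must not be mistaken for an empty, healthy result).
wait_for_urp_zero() {
  max_tries=$1 sleep_secs=$2; shift 2
  try=1
  while [ "$try" -le "$max_tries" ]; do
    out=$("$@") || { echo "check command failed" >&2; return 2; }
    urp=$(printf '%s\n' "$out" | count_urp)
    if [ "$urp" -eq 0 ]; then return 0; fi
    echo "URP=$urp (attempt $try/$max_tries)" >&2
    sleep "$sleep_secs"
    try=$((try + 1))
  done
  return 1
}

# Example against a real cluster (broker:9092 as in the runbook):
# wait_for_urp_zero 30 10 kafka-topics.sh --bootstrap-server broker:9092 \
#   --describe --under-replicated-partitions
```

Passing the check command as arguments keeps the loop testable without a cluster and lets you swap in a different health probe later.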

Key Settings and Pitfalls

inter.broker.protocol.version pins the inter-broker wire protocol to the existing version. This lets old and new brokers communicate using the same protocol during a rolling update, preserving compatibility. Apply the pin uniformly on every broker, and only bump it after every broker has been upgraded.

When log.message.format.version is available, it pins the record encoding (magic byte, header layout, etc.) to the old format. Unpinning it does not rewrite existing log segments; only newly written records use the new format. As a result, the switch typically does not trigger a large I/O burst.

  • Inconsistent pins: if IBPV differs on some brokers, the cluster destabilizes. Always keep it identical across all brokers.
  • Switch order: stick to pin -> binary upgrade -> all brokers done -> unpin (to the latest).
  • Client order: brokers go first. Roll out clients that use new features only after the server side is finalized (IBPV / message format / feature level).
  • Transactions / idempotence: feature gaps are common while old brokers are still mixed in. Consider enabling them only after unpinning.
  • KRaft feature finalization: the cluster stays at the old feature level until you finalize. Run finalization only after all brokers are upgraded to preserve compatibility.
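
To catch inconsistent pins before they destabilize the cluster, you could compare the IBPV value across copies of each broker's config. The sketch below assumes you have already collected those files locally; get_prop and check_ibpv_consistent are hypothetical helpers, not Kafka tools.

```shell
# Print the value of a property from a properties file.
# Skips comment lines and trims whitespace around key and value.
get_prop() {
  # $1 = property name, $2 = file
  awk -F= -v key="$1" '
    /^[ \t]*#/ { next }
    { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }
    k == key {
      v = substr($0, index($0, "=") + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", v)
      print v
      exit
    }
  ' "$2"
}

# Succeed only if every given file carries the same IBPV pin.
check_ibpv_consistent() {
  first=""
  for f in "$@"; do
    v=$(get_prop inter.broker.protocol.version "$f")
    [ -n "$v" ] || { echo "$f: IBPV not set" >&2; return 1; }
    if [ -z "$first" ]; then
      first=$v
    elif [ "$v" != "$first" ]; then
      echo "$f: IBPV=$v differs from $first" >&2
      return 1
    fi
  done
  echo "IBPV=$first on all $# file(s)"
}

# Example (paths are placeholders for wherever you gathered the configs):
# check_ibpv_consistent ./configs/b1.properties ./configs/b2.properties
```

Treating a missing pin as a failure is a deliberate choice here: a broker without the pin falls back to its own default, which is exactly the mismatch the check is meant to catch.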

Quick commands for checking your settings

# Are IBPV / message format what you expect? (static config file example)
$ grep -E "^(inter.broker.protocol.version|log.message.format.version)" /etc/kafka/server.properties

# API version mapping (client vs broker)
$ kafka-broker-api-versions.sh --bootstrap-server broker:9092 | sed -n '1,20p'

# Cluster health (representative metrics)
# Read these via JMX or whatever metrics pipeline you use, and watch indicators like:
# kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions == 0
# kafka.controller:type=KafkaController,name=ActiveControllerCount == 1 in total across the cluster (ZK)
# kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec staying nominal

Validation and Rollback Strategy

After restarting each broker, confirm that partition replication has caught up, leadership is stable, and clients are not seeing rising timeouts. In practice, running smoke tests that mirror real production traffic is more useful than synthetic benchmarks.

If something goes wrong, the safe move is to roll back just that broker's binary and leave the pinned settings (IBPV / message format) untouched. As long as the pins stay in place, rolling back remains easy until every broker has been upgraded.

  • Health signals: URP=0, OfflinePartitions=0, ISR converged, client error rate and latency nominal.
  • Log inspection: check server.log for runs of protocol errors or timeouts.
  • Rollback: roll the most recently upgraded broker back to the old version and re-validate. Do not touch the pins.
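
The health signals above can be folded into a simple go / wait / rollback decision. This is only a sketch: post_restart_decision is a hypothetical helper, the three inputs are counts you would pull from your own monitoring, and the zero-tolerance thresholds are illustrative assumptions rather than recommended values.

```shell
# Decide what to do after restarting one broker, from three observed signals.
# Thresholds are illustrative assumptions; tune them to your own SLOs.
post_restart_decision() {
  urp=$1 offline=$2 client_errors=$3   # counts taken from your monitoring
  if [ "$offline" -gt 0 ] || [ "$client_errors" -gt 0 ]; then
    echo "rollback-binary"   # revert this broker only; leave the pins alone
  elif [ "$urp" -gt 0 ]; then
    echo "wait"              # replicas are still catching up
  else
    echo "proceed"           # safe to move on to the next broker
  fi
}

post_restart_decision 0 0 0   # prints proceed
```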

Strategy Comparison and Case-by-Case Guidance

Rolling updates are the default, but depending on requirements you may opt for a stop-the-world upgrade or a parallel cluster (Blue/Green). Weigh audit obligations, the risk of large version skips, and infrastructure cost when choosing.

From a CCAAK perspective, you need to be able to clearly articulate the prerequisites of a rolling update: pin IBPV, unpin only after every broker is upgraded, and update clients last.

  • Traffic profile: workloads with no off-peak window (busy even at night) are better served by Blue/Green.
  • Scale and cost: the more brokers you have, the costlier a duplicate cluster gets and the more practical rolling becomes.
  • Risk tolerance: large version skips or sweeping config changes benefit the most from parallel validation.

Strategy | Downtime | Risk / Complexity | Extra Cost
Rolling update (recommended) | Near zero (only momentary leader re-election) | Medium (requires ordering and pin management) | Low
Stop-the-world upgrade (full shutdown -> upgrade all) | High (full outage) | Low (work is simple) | Low
Parallel cluster (Blue/Green) | Zero (minimal blip at cutover) | High (requires dual-ingest and consistency checks) | High

Exam Prep Summary and Checklist

The exam frequently tests the pin -> upgrade -> unpin order, the client upgrade order, how the message format is handled, and the existence of KRaft feature finalization. Distractors such as "bump IBPV before every broker is upgraded" or "upgrade clients first" are wrong.

In production, the stable approach is to spell out observability checkpoints before and after each change, and never advance to the next broker until the current one meets the pass criteria.

  • Keep pins identical across every broker (IBPV, plus message format if applicable).
  • Upgrade one broker at a time. Wait for URP to return to 0 on each.
  • Unpin only after all brokers are done (move to the latest). Do a final rolling restart if needed.
  • Clients are upgraded last. Enable new features only after the server side has been finalized.
  • For KRaft, the order is: controller quorum -> brokers -> feature finalization.
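
The "one broker at a time, never advance on a failed gate" rule from the checklist can be sketched as a loop with two hooks. upgrade_one and health_ok are hypothetical functions you would implement yourself (for example ssh plus systemctl, URP checks, smoke tests); the stubs below exist only so the sketch runs as written.

```shell
# Roll through brokers one at a time, stopping at the first one that fails
# its upgrade or its health gate. Later brokers are left untouched.
rolling_upgrade() {
  for b in "$@"; do
    echo "upgrading $b"
    upgrade_one "$b" || { echo "upgrade failed on $b" >&2; return 1; }
    health_ok "$b"   || { echo "$b failed health gate; stopping" >&2; return 1; }
  done
  echo "all brokers upgraded; safe to unpin"
}

# Stub hooks so the sketch runs as-is; replace them with real implementations.
upgrade_one() { :; }
health_ok()   { :; }

rolling_upgrade b1 b2 b3
```

Because the loop returns on the first failure, the pins stay in place and only the most recently touched broker needs attention.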

Check Your Understanding

CCAAK

Question 1

You are performing a zero-downtime upgrade from Kafka 2.8 to 3.x in ZooKeeper mode. Which procedure is the safest?

  A. Pin IBPV and message format to 2.8 -> upgrade brokers to 3.x one at a time -> after every broker is done, bump IBPV / message format to 3.x -> optional final rolling restart -> upgrade clients
  B. First upgrade the clients for 3.x -> bump IBPV to 3.x -> shut down every broker at once and upgrade them all to 3.x -> start them up
  C. Upgrade brokers to 3.x one at a time and bump IBPV to 3.x with each broker -> upgrade clients last
  D. Upgrade every broker to 3.x in parallel and, after they start, set IBPV back to 2.8

Correct answer: A

In a rolling update, you preserve compatibility by first pinning IBPV (and message format if applicable) to the current value, then unpinning only after every broker has been upgraded. Clients are upgraded last by convention.

Frequently Asked Questions

Is log.message.format.version always required?

Depending on the environment and version, the property may not exist. Where it does, pin it to the current version before upgrading to preserve compatibility during the rolling update, then bump it to the latest version after all brokers are upgraded. Existing log segments are not rewritten when you switch.

What is additionally required in KRaft mode?

Roll the controller quorum first, then the brokers, and finally finalize the metadata feature level (using tools like kafka-features). The old feature level stays in effect until finalization, so wait until after finalization to use new features.

When should clients be upgraded?

As a rule, only after the broker upgrade is complete and IBPV / message format (and the KRaft feature level) have been finalized. Most clients are backward compatible, so upgrading the server side first is the safe path.
