Kafka

Kafka Cooperative Rebalance: The Modern Protocol That Avoids Stop-the-World

2026-04-19
NicheeLab Editorial Team

The traditional Eager Rebalance causes all consumers to release their partitions at once, leading to brief but full stops that often spike latency and degrade throughput.

Cooperative Rebalance is the modern protocol that avoids stop-the-world by reassigning partitions incrementally and gradually. This article explains the mechanics, rollout steps, callback design, and operational tips from a perspective that also helps with the CCDAK exam.

Why Stop-the-World Happens (The Limits of Eager Rebalance)

With Eager Rebalance, any membership change or config update causes every consumer in the group to release all of its partitions and wait for reassignment. Even a short consumption pause ripples across the whole group, causing latency spikes and throughput drops.

In production, rolling deploys, scale-outs, and network jitter occur frequently, driving up the number of rebalances. Suppressing stop-the-world directly translates into SLO stability and peak-time resilience.

  • All members revoking and then being assigned at once is the fundamental bottleneck of Eager
  • When long-running processing drifts from commits, duplicate processing and delays become noticeable
  • Cache warm-ups and connection re-establishments tend to occur on every rebalance

Typical configuration example for Eager Rebalance (traditional)

props.put("partition.assignment.strategy", java.util.List.of(
  "org.apache.kafka.clients.consumer.RangeAssignor"
));
// デフォルトは RangeAssignor(Eager)で、全メンバーが一旦 revoke される

How Cooperative Rebalance Works (Incremental, Step-by-Step Rebalancing)

Incremental Cooperative Rebalancing moves only the minimum necessary partitions, step by step. Most existing members keep their assignments, and only the partitions that need to move are revoked and reassigned, which avoids stop-the-world.

On the consumer side, the ConsumerRebalanceListener callbacks let you safely commit and release only what was revoked, and start consuming what was assigned. You also need to account for cases where onPartitionsLost is called when partitions could not be revoked gracefully.

  • Because the migration is gradual, full stops are unlikely
  • onPartitionsRevoked only receives the partitions that are actually leaving
  • onPartitionsLost is the last resort for error scenarios (be careful with uncommitted work)
  • Sticky + Cooperative together minimize the volume of partition movement
StrategyBehaviorStop-the-worldMain callbacks
Eager (Range/RoundRobin/Sticky)Every member releases all partitions at onceLikely to occuronPartitionsRevoked/Assigned (Lost is not expected)
CooperativeStickyOnly the minimum necessary moves incrementally (step by step)Largely avoidableonPartitionsRevoked/Assigned/Lost
Sticky (Eager)Movement is suppressed, but a full revoke still occursMitigated but not avoidableonPartitionsRevoked/Assigned

Visualizing incremental movement (Cooperative Rebalance)

revoke P1move P1 onlyConsumer A (before)P0, P1Consumer B (before)P2, P3Consumer A (after)P0Consumer B (after)P2, P3, P1In Cooperative, only P1 moves incrementally from A → B. A keeps P0 and B keeps P2/P3, so a full stop is avoided.

Enabling the Cooperative Sticky Assignor

props.put("partition.assignment.strategy", java.util.List.of(
  "org.apache.kafka.clients.consumer.CooperativeStickyAssignor"
));
// 可能なら全メンバーで同一戦略に統一すること

A Safe Rollout Plan (Tips for Rolling Migrations)

Within the same consumer group, the effective assignor is whatever sits at the head of the intersection of all members' strategies. Mixed periods can cause unexpected fallback to Eager behavior, so plan the migration carefully.

The safest approach is to first upgrade every member to a version that supports CooperativeStickyAssignor, and then either unify the strategy or migrate to a new group.id.

  • Watch for the possibility of temporarily reverting to Eager behavior during a rolling deploy
  • Combining group.instance.id (static membership) further reduces rebalance frequency
  • Avoid mixing components of unknown compatibility; migrate gradually while observing behavior

A safe migration example (gradual move using a new group.id)

# 1) 先にアプリを CooperativeStickyAssignor 対応版へデプロイ
# 2) 新しい group.id で起動し、旧グループから段階的にトラフィック移行
props.put("group.id", "orders-consumer-v2");
props.put("partition.assignment.strategy", java.util.List.of(
  "org.apache.kafka.clients.consumer.CooperativeStickyAssignor"
));

Callback and Commit Design (Implementation Pitfalls)

With Cooperative, you can cleanly close just the partitions being revoked. Inside onPartitionsRevoked, flush/commit the pending work for the target partitions and release external resources. In onPartitionsAssigned, keep the initialization for newly assigned partitions lightweight.

onPartitionsLost is called in error scenarios where partitions could not be returned gracefully. The safest move is to discard and reinitialize state on the assumption that unprocessed work will definitely be reprocessed.

  • Use synchronous commits and commit only the target partitions (avoid unnecessary full-group commits)
  • For long-running processing, allow plenty of headroom in max.poll.interval.ms
  • Managing handles to external stores (DB/cache) on a per-partition basis makes life easier

Skeleton of a ConsumerRebalanceListener (Java)

consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
  @Override
  public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
    // revoke 対象だけを同期コミット・クローズ
    commitOffsetsFor(partitions);
    closeResourcesFor(partitions);
  }
  @Override
  public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
    // 必要最小の初期化。重い処理は避ける
    initStateFor(partitions);
  }
  @Override
  public void onPartitionsLost(Collection<TopicPartition> partitions) {
    // 異常系。未処理は再処理前提で状態を破棄
    dropStateFor(partitions);
  }
});

Operational Tuning and Monitoring Points

Even when stop-the-world is avoided, rebalances still happen. Align timeouts and heartbeats with your application's processing time. In particular, max.poll.interval.ms must match the maximum time required to process a single poll.

Focus monitoring on consumer lag, the count and duration of rebalances, and whether onPartitionsLost fires. Use static membership (group.instance.id) and the cooperative strategy to minimize group churn.

  • Keep session.timeout.ms and heartbeat.interval.ms aligned (to avoid session expiration)
  • Set max.poll.interval.ms to the longest workload processing time plus headroom
  • Note that long GC pauses or external API waits can get a member kicked out of the group
  • Use kafka-consumer-groups.sh to periodically inspect lag and stability

Example monitoring and inspection commands

$ kafka-consumer-groups.sh \
  --bootstrap-server broker:9092 \
  --group orders-consumer \
  --describe

# 再均衡の兆候はアプリログ(JoinGroup/SyncGroup)やメトリクスで確認

CCDAK Exam Angles and Question Patterns

CCDAK frequently tests the difference between Eager and Cooperative, the advantages of CooperativeStickyAssignor, the related callbacks (especially onPartitionsLost), and stabilization through static membership.

Locking in precise concepts — config precedence, behavior in mixed-strategy groups, safety measures during rolling deploys, and commit design principles (duplicate tolerance and ordering) — translates directly into points on the exam.

  • Keywords: CooperativeStickyAssignor, Incremental Rebalancing, onPartitionsLost, group.instance.id
  • Be able to instantly contrast Eager (revoke-all) with Cooperative (minimal-revoke)
  • Safe rollout steps (upgrade all members → unify strategy or migrate to a new group.id)

Frequently tested configuration keys (excerpt)

# コンシューマー設定(代表例)
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
max.poll.interval.ms=900000
session.timeout.ms=45000
heartbeat.interval.ms=15000
group.instance.id=orders-c1   # 静的メンバーの例(ユニークに)

Check Your Understanding

CCDAK

問題 1

You want to minimize stop-the-world when scaling out a consumer group. Which approach is most appropriate?

  1. Unify partition.assignment.strategy on CooperativeStickyAssignor and, if needed, also set group.instance.id
  2. Lower max.poll.interval.ms to trigger frequent rebalances and optimize the assignment
  3. Use StickyAssignor and prepare a script that pauses every member before each rebalance
  4. Use RoundRobinAssignor to even out partitions and smooth processing

正解: A

CooperativeStickyAssignor avoids stop-the-world with incremental, gradual rebalancing. Combining it with static membership further reduces the number of rebalances. B is counterproductive, C adds operational burden without solving Eager's limits, and D is Eager and cannot avoid the full revoke.

Frequently Asked Questions

Does Cooperative Rebalance completely eliminate duplicate processing?

No. It is a mechanism for avoiding stop-the-world, and duplicate processing is still possible. You need to combine it with duplicate-tolerant processing design (idempotency, accurate offset commits) to maintain consistency.

Does onPartitionsLost always fire?

Normally it does not fire. It is called when partitions could not be returned gracefully due to network disconnects or long pauses. The safe approach is to discard local state and reinitialize, assuming any unprocessed data will be reprocessed.

Can I gradually migrate to CooperativeSticky in a mixed environment?

If strategies are mixed within the same group, the assignor falls back to the head of the common intersection and can revert to Eager behavior. The safe approach is to upgrade all members to a supporting version first and then unify the strategy, or migrate to a new group.id.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる
Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


Related articles
Kafka

Kafka Topics & Partitions: Distribution Fundamentals (2026)

How Kafka topics and partitions enable scale — ordering guar...

Kafka

CCDAK Exam Guide: Confluent Certified Developer (2026)

Complete prep for the CCDAK exam — Producer/Consumer API, St...

Kafka

CCAAK Exam Guide: Confluent Certified Administrator (2026)

Pass the CCAAK exam — cluster management, partitions, securi...

Kafka

Kafka Replicas & ISR: Fault Tolerance Explained (2026)

Replica placement, in-sync replicas (ISR), leader election. ...

Kafka

Kafka Offsets: Commit Modes & Consumer Position (2026)

Offset semantics — auto vs. manual commit, __consumer_offset...

Browse all Kafka articles (101)
© 2026 NicheeLab All rights reserved.