Kafka Rebalance: Eager vs Cooperative (2026)

Rebalances directly affect consumer group stability and availability. On the exam, what matters is concepts and the cause-and-effect of settings; in operations, minimizing downtime is what counts.

This article follows Kafka's documented behavior and walks through trigger conditions, protocol differences, configuration trade-offs, and monitoring and operations end to end.

Kafka Rebalance Basics and Prerequisites

A rebalance is the process of recomputing partition assignments within a consumer group. Members rejoin through a JoinGroup round trip, the elected group leader computes the new assignment, and the group coordinator distributes it to every member in the SyncGroup response. Each member then runs its revoke/assign handling before resuming.

There are two main protocols: Eager and Cooperative (incremental). Eager has every member release all partitions at once, which tends to lengthen downtime. Cooperative hands off partitions incrementally, shortening downtime.

The group's steady state is Stable; during a rebalance it transitions through PreparingRebalance and CompletingRebalance
The basic listener pattern is to commit in onPartitionsRevoked and initialize in onPartitionsAssigned
Downtime is best measured as the sum of the time each member cannot read its partitions, not only the time during which no member can read

Phase	Key RPCs / Events	Behavior on Failure
Detection	Heartbeat / subscription change detection	Timeout marks member as missing and triggers rejoin
Join	JoinGroup request and leader election	Members that miss the join deadline are removed and must rejoin; an excessively long deadline extends the gap
Sync	SyncGroup finalizes the assignment	Another round if there is conflict or change (Cooperative applies it incrementally)

Basic sequence (simplified)

Java consumer rebalance listener (key points only)

consumer.subscribe(Collections.singletonList("orders"), new ConsumerRebalanceListener() {
  @Override
  public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
    // Safely finish in-flight messages and sync-commit the last read position
    consumer.commitSync();
  }
  @Override
  public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
    // The consumer resumes from the last committed offset by default.
    // To start elsewhere, call consumer.seek(partition, offset) for each partition here.
  }
});

Rebalance Trigger Conditions

Rebalances do not fire on every arbitrary change. The main triggers are changes to group membership, subscriptions, or topic metadata such as partition count. Heartbeat or polling timeouts are the most common cause in practice.

A coordinator failover can cause members to rejoin, but assignments do not necessarily change significantly. Even so, expect short gaps and resync periods.

New member joins/leaves (process exit, crash, scaling operations)
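session.timeout.ms elapses without a heartbeat reaching the coordinator (for example, heartbeats delayed by GC pauses, or heartbeat.interval.ms set too close to the session timeout), so the member is considered lost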
max.poll.interval.ms exceeded (app fails to call poll for too long and is deemed unhealthy)
Subscription changes (updating the targets of subscribe)
Increase in a topic's partition count
Rejoin due to group coordinator move or restart

Event	Typical Cause	Can It Be Avoided/Reduced?
Member leave	Pod/VM restart, deploy, crash	Reducible with static membership and rolling restarts
Heartbeat timeout	GC/CPU saturation, network latency, session.timeout.ms misconfiguration	Reducible with tuning and monitoring
max.poll exceeded	Heavy processing stretches poll intervals; oversized batches	Reducible by splitting work, backpressure, and max.poll settings
Partition count increase	Scaling requirements	Unavoidable, but Cooperative reduces impact

Conceptual map of triggers

Key consumer settings (foundation for avoiding rebalances)

# Heartbeat and member health
session.timeout.ms=...        # Too short causes false failure detection; too long delays real detection
heartbeat.interval.ms=...     # Typically no more than one third of session.timeout.ms

# Application processing health
max.poll.interval.ms=...      # Extend if processing is long-running
max.poll.records=...          # Cap per-poll work to prevent overruns
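
The heartbeat relationship in the comments above can be checked mechanically before deploying a configuration. The following is a minimal sketch, assuming the commonly cited guideline that heartbeat.interval.ms should be no more than one third of session.timeout.ms. The class and method names are hypothetical, and the numbers are illustrative rather than recommended values.

```java
final class HeartbeatConfigCheck {

    // Returns true when heartbeatIntervalMs is at most one third of sessionTimeoutMs,
    // which leaves room for roughly three heartbeats before the session expires.
    static boolean isHeartbeatRatioHealthy(long sessionTimeoutMs, long heartbeatIntervalMs) {
        if (sessionTimeoutMs <= 0 || heartbeatIntervalMs <= 0) {
            throw new IllegalArgumentException("timeouts must be positive");
        }
        return heartbeatIntervalMs * 3 <= sessionTimeoutMs;
    }

    public static void main(String[] args) {
        // Illustrative values only, not recommendations.
        System.out.println(isHeartbeatRatioHealthy(45_000, 3_000)); // true
        System.out.println(isHeartbeatRatioHealthy(10_000, 5_000)); // false
    }
}
```

A check like this could run at application startup so that a misconfigured ratio fails fast instead of surfacing later as intermittent rebalances.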

Eager vs Cooperative: Differences and Selection Guide

Eager protocols (Range/Sticky, etc.) have every member revoke all partitions at the start of a rebalance and stop reading until assignment completes. Simple and broadly compatible, but downtime tends to grow.

Cooperative Sticky hands off only the impacted partitions incrementally. It stabilizes over multiple rounds, but total downtime is shorter in most cases.

If short downtime and rolling deploys are priorities, consider Cooperative Sticky
Eager may still be the right choice with older clients in the mix or specific ecosystem requirements
Standardize on one assignor per group; all members must share a common assignor, and mixing Eager and Cooperative assignors outside a planned migration makes behavior unstable

Aspect	Eager (Range/Sticky)	Cooperative Sticky
Downtime tendency	Long (revokes all)	Short (revokes only impacted partitions, incrementally)
Convergence rounds	Usually 1	Multiple (incremental application)
Duplicate/drop risk	Increases if listener implementation is weak	Easier to control with incremental handoff
Compatibility / requirements	Broadly supported	Explicit configuration on supported clients is safest

Time behavior of Eager vs Cooperative (conceptual)

Eager:
[Revoke all]----[Assign all]----[Resume]
         ^ Full stop

Cooperative:
[Revoke impacted]--[Assign subset]--[Resume some]--[Next round]--[Stable]
         ^ Only impacted partitions stop
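
The difference between the two timelines can be made concrete by counting how many partitions stop being read. The sketch below is a simplified model, not the assignor implementation: it takes a current and a target assignment as given (the class name, member names, and partition numbers are hypothetical) and compares what each protocol would revoke.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RevocationModel {

    // Eager: every member gives up everything it currently owns.
    static Map<String, List<Integer>> eagerRevocations(Map<String, List<Integer>> current) {
        Map<String, List<Integer>> revoked = new TreeMap<>();
        current.forEach((member, owned) -> revoked.put(member, new ArrayList<>(owned)));
        return revoked;
    }

    // Cooperative: a member gives up only the partitions that the target
    // assignment moves to another member.
    static Map<String, List<Integer>> cooperativeRevocations(
            Map<String, List<Integer>> current, Map<String, List<Integer>> target) {
        Map<String, List<Integer>> revoked = new TreeMap<>();
        current.forEach((member, owned) -> {
            List<Integer> keep = target.getOrDefault(member, List.of());
            List<Integer> lost = new ArrayList<>(owned);
            lost.removeAll(keep);
            revoked.put(member, lost);
        });
        return revoked;
    }

    // Total number of revoked partitions across all members.
    static int count(Map<String, List<Integer>> revoked) {
        return revoked.values().stream().mapToInt(List::size).sum();
    }

    public static void main(String[] args) {
        // Two members own six partitions; a third member "c" joins.
        Map<String, List<Integer>> current = Map.of(
                "a", List.of(0, 1, 2),
                "b", List.of(3, 4, 5));
        Map<String, List<Integer>> target = Map.of(
                "a", List.of(0, 1),
                "b", List.of(3, 4),
                "c", List.of(2, 5));

        System.out.println("eager revokes " + count(eagerRevocations(current)));                    // 6
        System.out.println("cooperative revokes " + count(cooperativeRevocations(current, target))); // 2
    }
}
```

In this example four of the six partitions keep being consumed throughout the cooperative rebalance, which is where the shorter perceived downtime comes from.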

Assignment strategy configuration examples

# Cooperative (recommended, supported clients)
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor

# Eager (compatibility-first)
# partition.assignment.strategy=org.apache.kafka.clients.consumer.StickyAssignor
# or explicitly org.apache.kafka.clients.consumer.RangeAssignor

Estimating Downtime and Impact Points

Effective consumer downtime is the span from revoke to resume plus the application's interruption and re-initialization costs. With Cooperative, the scope is limited to a subset, so total gaps shrink.

If the application's commit design is weak, duplicate processing and missed messages around a rebalance become more likely. Decide explicitly where the synchronous commit happens in onPartitionsRevoked and what gets initialized on reassignment.

Split batches so heavy processing doesn't exceed max.poll.interval.ms
Long commit intervals mean more duplicates on resume; very short intervals add commit overhead
When partitions grow, warm-up time (caches, connections) also impacts perceived downtime

Parameter	Main Effect	Risk if Misconfigured
max.poll.interval.ms	Maximum allowed gap between poll() calls (the processing budget per batch)	Exceeded → member leaves the group → frequent rebalances
session.timeout.ms	Grace period for member liveness	Too short causes false positives; too long delays failure detection
max.poll.records	Records processed per poll	Too large extends processing time; too small adds overhead
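
The interaction between max.poll.records and max.poll.interval.ms in the table above is simple arithmetic, so it is worth working through once. The sketch below assumes a known worst-case per-record processing time and a chosen headroom fraction. Both are hypothetical inputs you would measure or pick yourself, and the class name is illustrative.

```java
final class PollBudget {

    // Worst-case time needed to process one poll() batch.
    static long worstCaseBatchMs(int maxPollRecords, long worstCasePerRecordMs) {
        return (long) maxPollRecords * worstCasePerRecordMs;
    }

    // Largest max.poll.records that keeps a batch within the given fraction
    // (headroom, for example 0.5) of max.poll.interval.ms.
    static int maxSafeRecords(long maxPollIntervalMs, long worstCasePerRecordMs, double headroom) {
        if (maxPollIntervalMs <= 0 || worstCasePerRecordMs <= 0 || headroom <= 0 || headroom > 1) {
            throw new IllegalArgumentException("invalid inputs");
        }
        return (int) Math.floor(maxPollIntervalMs * headroom / worstCasePerRecordMs);
    }

    public static void main(String[] args) {
        long maxPollIntervalMs = 300_000; // illustrative value
        long perRecordMs = 800;           // hypothetical worst-case per-record latency

        // 500 records * 800 ms = 400000 ms, which exceeds the 300000 ms budget.
        System.out.println(worstCaseBatchMs(500, perRecordMs));

        // With 50% headroom: floor(300000 * 0.5 / 800) = 187 records.
        System.out.println(maxSafeRecords(maxPollIntervalMs, perRecordMs, 0.5));
    }
}
```

The headroom leaves slack for GC pauses and downstream latency spikes, so a slow batch does not push the member over the limit.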

Timeline (from a single member's perspective)

time --->
[poll]--process--process--RebalanceStart--revoke--commit--assign--init--resume--process
                       ^ Downtime window

Basic pause-and-resume pattern

// Safe stop on revoke
consumer.pause(currentPartitions);
consumer.commitSync();
// Close handling as needed...

// Re-initialize on assign
// Fetching resumes from the committed offset by default;
// call consumer.seek(tp, offset) per partition only to override that position.
consumer.resume(assignedPartitions);

Practical Techniques to Reduce Downtime

The most direct way to lower downtime is to make rebalances rarer and to narrow their impact when they do happen. The pillars are Cooperative Sticky combined with static membership, tuned heartbeat and polling settings, and planned scaling.

On rolling deploys, hold slots via static membership and swap one instance at a time to limit the ripple of reassignment.

Explicitly configure Cooperative Sticky to avoid full revokes
Set group.instance.id for static membership and suppress reassignment during rolling restarts
Keep the heartbeat.interval.ms to session.timeout.ms ratio healthy
Prevent max.poll overruns with batch splitting and backpressure
Run partition increases during off-peak hours and verify in stages

Technique	Expected Effect	Notes / Caveats
Cooperative Sticky	Shorter downtime via incremental handoff of impacted partitions only	Standardize within a group and verify compatibility
Static membership	Suppress unnecessary rebalances during rolling restarts	group.instance.id must be unique within the group
Heartbeat tuning	Fewer false positives and right-sized detection time	Overly short adds load; overly long delays detection
Gradual scaling	Avoid large-scale reassignment	Observe latency and lag at each step

How static membership preserves a slot

Basic configuration and rolling restart example

# Static membership + Cooperative
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
# group.instance.id must be unique within the group
group.instance.id=order-svc-1

# Rolling procedure (conceptual)
# 1) Stop one instance
# 2) Restart with the same group.instance.id before session.timeout.ms expires
# 3) Confirm stability, then move to the next instance
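
The same settings can be assembled in code when each instance needs its own stable ID. This is a minimal sketch using only java.util.Properties. The configuration keys are the standard consumer keys shown above, while the class name, the group name, and the ordinal-based ID convention are assumptions for illustration. A real consumer would also need bootstrap.servers and deserializer settings, which are omitted here.

```java
import java.util.Properties;

final class StaticCooperativeConfig {

    // Builds the rebalance-related consumer properties for one instance.
    // groupId and instanceId are supplied by the caller.
    static Properties forInstance(String groupId, String instanceId) {
        if (instanceId == null || instanceId.trim().isEmpty()) {
            throw new IllegalArgumentException("group.instance.id must be non-empty");
        }
        Properties props = new Properties();
        props.setProperty("group.id", groupId);
        props.setProperty("group.instance.id", instanceId);
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical convention: derive a stable ID from the instance ordinal,
        // so a restarted instance comes back with the same group.instance.id.
        for (int ordinal = 0; ordinal < 3; ordinal++) {
            Properties props = forInstance("order-svc", "order-svc-" + ordinal);
            System.out.println(props.getProperty("group.instance.id"));
        }
    }
}
```

Deriving the ID from something stable across restarts (a StatefulSet ordinal or a host name, for instance) is what lets the restarted process reclaim its previous slot.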

Monitoring, Troubleshooting, and Operational Commands

Excessive rebalances directly cause app downtime and latency degradation. Combine client and broker metrics, logs, and CLI to detect and triage them early.

Typical signs are spikes in rebalance-total, drops in heartbeat-rate, and frequent RebalanceInProgressException. Separate the cause into application processing, infrastructure latency, or configuration.

Client JMX: consumer-coordinator-metrics.rebalance-total, heartbeat-rate, commit-latency-avg
Logs: Member x in group y has failed, RebalanceInProgressException
CLI: kafka-consumer-groups to immediately inspect assignments and lag

Metric / Log	Meaning	Action
Increase in rebalance-total	Frequent rebalances	Correlate with time-of-day and deploys to find the source
Drop in heartbeat-rate	Heartbeat delay/stop	Investigate GC/CPU/network and re-tune timeouts
RebalanceInProgressException	A commit was attempted while a rebalance was in progress	Call poll() to let the rebalance complete, then retry the commit; verify listener health
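
Because the rebalance counter is cumulative, a spike shows up as a large increase between two consecutive samples rather than as a large absolute value. The sketch below assumes the counter has already been scraped at a fixed interval into an array. The class name, sample values, and threshold are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class RebalanceSpikeDetector {

    // Given cumulative counter samples taken at a fixed interval, returns the
    // indices of samples whose increase over the previous sample exceeds the threshold.
    static List<Integer> spikes(long[] cumulativeSamples, long maxPerInterval) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 1; i < cumulativeSamples.length; i++) {
            long delta = cumulativeSamples[i] - cumulativeSamples[i - 1];
            // A negative delta indicates a counter reset (client restart) and is not flagged.
            if (delta > maxPerInterval) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        long[] samples = {4, 4, 5, 11, 11, 12}; // hypothetical scrape results
        System.out.println(spikes(samples, 2)); // [3]
    }
}
```

Flagged indices can then be lined up against deploy times and GC logs to separate planned restarts from genuine instability.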

Group state transitions (conceptual)

Commonly used operational commands

# Current assignments and lag
kafka-consumer-groups --bootstrap-server <bkr:9092> --describe --group <group>

# List group members
kafka-consumer-groups --bootstrap-server <bkr:9092> --describe --members --group <group>

# Safely adjust offsets (planned maintenance; the group must have no active members)
# Always verify with --dry-run first, then replace it with --execute to apply
kafka-consumer-groups --bootstrap-server <bkr:9092> --group <group> --topic <t> --reset-offsets --to-latest --dry-run

Check with a Sample Question

CCDAK / CCAAK

Question 1

You want to run a rolling deploy while minimizing consumer group downtime. Which mitigation is most effective and aligned with official behavior?

A. Set partition.assignment.strategy to Cooperative Sticky, assign a unique group.instance.id to each member, and restart one instance at a time
B. Make max.poll.interval.ms extremely short so polling happens very frequently
C. Make session.timeout.ms very long to delay failure detection
D. Use RangeAssignor and restart all members simultaneously

Correct answer: A

Cooperative Sticky shrinks downtime via incremental handoff, and static membership suppresses unnecessary reassignment during rolling restarts. B makes the max.poll.interval.ms check trip too easily, which is counterproductive; C delays failure detection and extends gaps; D tends to cause temporary full stops.

Frequently Asked Questions

Does a coordinator failover always trigger a rebalance?

Member rejoins can happen, but a coordinator failover does not necessarily cause large-scale reassignment. Brief gaps or resyncs may occur, so make client retries and listener implementations robust.

What happens when you increase the partition count?

The group recomputes assignments including the new partitions, triggering a rebalance. With Cooperative, only the impacted partitions are handed off incrementally, reducing perceived downtime. Doing it during off-peak hours with monitoring is recommended.

What is the difference between session.timeout.ms and max.poll.interval.ms?

session.timeout.ms is the heartbeat-based member liveness window, monitoring network and process health. max.poll.interval.ms is the application processing health check; if the gap between poll() calls is too long, the member is considered unhealthy and removed. They are independent, but exceeding either one triggers a rebalance.
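
The independence of the two timers can be illustrated with a small model. This is a deliberate simplification of the client and coordinator logic, not the actual implementation: the class, enum, and parameter names are hypothetical, and the timeout values are illustrative.

```java
final class LivenessModel {

    enum Verdict { HEALTHY, SESSION_EXPIRED, POLL_INTERVAL_EXCEEDED }

    // Simplified model: the two timers are evaluated independently, and either
    // one expiring is enough to remove the member and trigger a rebalance.
    static Verdict evaluate(long msSinceLastHeartbeat, long msSinceLastPoll,
                            long sessionTimeoutMs, long maxPollIntervalMs) {
        if (msSinceLastHeartbeat > sessionTimeoutMs) {
            return Verdict.SESSION_EXPIRED;
        }
        if (msSinceLastPoll > maxPollIntervalMs) {
            return Verdict.POLL_INTERVAL_EXCEEDED;
        }
        return Verdict.HEALTHY;
    }

    public static void main(String[] args) {
        long session = 45_000;  // illustrative session.timeout.ms
        long maxPoll = 300_000; // illustrative max.poll.interval.ms

        // Heartbeats still flowing, but processing is stuck between poll() calls.
        System.out.println(evaluate(3_000, 420_000, session, maxPoll)); // POLL_INTERVAL_EXCEEDED
        // Process frozen or network cut: no heartbeats reach the coordinator.
        System.out.println(evaluate(60_000, 10_000, session, maxPoll)); // SESSION_EXPIRED
        // Both timers within limits.
        System.out.println(evaluate(3_000, 10_000, session, maxPoll));  // HEALTHY
    }
}
```

The first case is the one that tuning session.timeout.ms alone cannot fix, because heartbeats keep arriving while the application fails to call poll().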


Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
