Rebalances directly affect consumer group stability and availability. On the exam, what matters is concepts and the cause-and-effect of settings; in operations, minimizing downtime is what counts.
Based on documented Kafka behavior, this article walks through trigger conditions, protocol differences, configuration trade-offs, and monitoring/operations end-to-end.
A rebalance is the process of recomputing partition assignments within a consumer group. Members rejoin with a JoinGroup request and receive their assignments in the SyncGroup response from the group coordinator; each member then runs its revoke/assign handling before resuming.
There are two main protocols: Eager and Cooperative (incremental). Eager has every member release all partitions at once, which tends to lengthen downtime. Cooperative hands off partitions incrementally, shortening downtime.
| Phase | Key RPCs / Events | Behavior on Failure |
|---|---|---|
| Detection | Heartbeat / subscription change detection | Timeout marks member as missing and triggers rejoin |
| Join | JoinGroup request and leader election | Retries on expiry; excessively long deadlines cause extended gaps |
| Sync | SyncGroup finalizes the assignment | Another round if there is conflict or change (Cooperative applies it incrementally) |
Basic sequence (simplified)
Java consumer rebalance listener (key points only)
consumer.subscribe(Collections.singletonList("orders"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Finish in-flight messages, then synchronously commit the consumed position
        consumer.commitSync();
    }
    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Adjust the position on resume if needed. The consumer resumes from the last
        // committed offset by default; call consumer.seek(tp, offset) to override it.
    }
});
Rebalances do not fire on every arbitrary change. The main triggers are changes to membership, subscriptions, or topic metadata (such as the partition count). Heartbeat or polling timeouts are the most common cause in practice.
A coordinator failover can cause members to rejoin, but assignments do not necessarily change significantly. Even so, expect short gaps and resync periods.
| Event | Typical Cause | Can It Be Avoided/Reduced? |
|---|---|---|
| Member leave | Pod/VM restart, deploy, crash | Reducible with static membership and rolling restarts |
| Heartbeat timeout | GC/CPU saturation, network latency, session.timeout.ms misconfiguration | Reducible with tuning and monitoring |
| max.poll exceeded | Heavy processing stretches poll intervals; oversized batches | Reducible by splitting work, backpressure, and max.poll settings |
| Partition count increase | Scaling requirements | Unavoidable, but Cooperative reduces impact |
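The heartbeat-timeout row above comes down to how heartbeat.interval.ms relates to session.timeout.ms. The following is a minimal sketch of a pre-deploy sanity check, assuming the commonly cited guideline that the heartbeat interval should be no more than one third of the session timeout. The class and method names are hypothetical helpers, not part of the Kafka client API.

```java
/** Sanity checks for heartbeat-related consumer timeouts (illustrative helper, not a Kafka API). */
final class HeartbeatSettingsCheck {
    private HeartbeatSettingsCheck() {}

    /**
     * Returns true when the heartbeat interval is at most one third of the session timeout,
     * so that at least three heartbeats fit inside one session window.
     */
    static boolean isHeartbeatIntervalSafe(long sessionTimeoutMs, long heartbeatIntervalMs) {
        if (sessionTimeoutMs <= 0 || heartbeatIntervalMs <= 0) {
            throw new IllegalArgumentException("timeouts must be positive");
        }
        return heartbeatIntervalMs * 3 <= sessionTimeoutMs;
    }

    /** Whole number of heartbeat intervals that fit inside one session window. */
    static long heartbeatsPerSession(long sessionTimeoutMs, long heartbeatIntervalMs) {
        if (heartbeatIntervalMs <= 0) {
            throw new IllegalArgumentException("heartbeat interval must be positive");
        }
        return sessionTimeoutMs / heartbeatIntervalMs;
    }
}
```

For example, a 45000 ms session timeout with a 3000 ms heartbeat interval passes the check, while 10000 ms with 5000 ms does not. The values are illustrative only.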
Conceptual map of triggers
Key consumer settings (foundation for avoiding rebalances)
# Heartbeat and member health
# Too short or too long is destabilizing
session.timeout.ms=...
# Keep it a fraction of session.timeout.ms (commonly no more than one third)
heartbeat.interval.ms=...
# Application processing health
# Extend if processing is long-running
max.poll.interval.ms=...
# Cap per-poll work to prevent overruns
max.poll.records=...
Eager assignors (Range, Sticky, and similar) have every member revoke all partitions at the start of a rebalance and stop reading until assignment completes. This is simple and broadly compatible, but downtime tends to grow.
Cooperative Sticky hands off only the impacted partitions incrementally. It stabilizes over multiple rounds, but total downtime is shorter in most cases.
| Aspect | Eager (Range/Sticky) | Cooperative Sticky |
|---|---|---|
| Downtime tendency | Long (revokes all) | Short (revokes only impacted partitions, incrementally) |
| Convergence rounds | Usually 1 | Multiple (incremental application) |
| Duplicate/drop risk | Increases if listener implementation is weak | Easier to control with incremental handoff |
| Compatibility / requirements | Broadly supported | Explicit configuration on supported clients is safest |
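To make the "revokes all" versus "revokes only impacted partitions" rows concrete, here is a small sketch that compares the two revocation scopes for a given ownership change. It is a deliberate simplification: plain strings stand in for partitions and members, the target assignment is supplied by the caller, and nothing here reproduces the real assignor algorithms. All names are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Compares revocation scope under eager and cooperative rebalancing (conceptual model only). */
final class RevocationSketch {
    private RevocationSketch() {}

    /** Eager: every currently owned partition is revoked, whether or not its owner changes. */
    static Set<String> revokedEager(Map<String, String> currentOwners) {
        return new TreeSet<>(currentOwners.keySet());
    }

    /** Cooperative: only partitions whose owner differs in the target assignment are revoked. */
    static Set<String> revokedCooperative(Map<String, String> currentOwners,
                                          Map<String, String> targetOwners) {
        Set<String> revoked = new TreeSet<>();
        for (Map.Entry<String, String> entry : currentOwners.entrySet()) {
            String newOwner = targetOwners.get(entry.getKey());
            // A partition that keeps its owner continues to be consumed without interruption.
            if (!entry.getValue().equals(newOwner)) {
                revoked.add(entry.getKey());
            }
        }
        return revoked;
    }
}
```

With four partitions split across two members and a third member taking over one partition, the eager variant reports all four partitions as revoked while the cooperative variant reports only the one that moves.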
Time behavior of Eager vs Cooperative (conceptual)
Eager:
[Revoke all]----[Assign all]----[Resume]
^ Full stop
Cooperative:
[Revoke impacted]--[Assign subset]--[Resume some]--[Next round]--[Stable]
^ Only impacted partitions stop
Assignment strategy configuration examples
# Cooperative (recommended, supported clients)
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
# Eager (compatibility-first)
# partition.assignment.strategy=org.apache.kafka.clients.consumer.StickyAssignor
# or explicitly org.apache.kafka.clients.consumer.RangeAssignor
Effective consumer downtime is the span from revoke to resume plus the application's interruption and re-initialization costs. With Cooperative, the scope is limited to a subset of partitions, so total gaps shrink.
If your commit design is weak on the app side, duplicate processing and missed messages around a rebalance increase. Clearly define sync commits in onPartitionsRevoked and initialization on reassignment.
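One way to make the commit design explicit is to track, per partition, the offset that should be committed when that partition is revoked. The sketch below assumes the usual Kafka convention that a committed offset is the offset of the next record to read (last processed offset plus one). Plain strings stand in for TopicPartition, and the class is a hypothetical helper, not a client API.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Tracks the next offset to commit per partition (illustrative; strings stand in for TopicPartition). */
final class OffsetTracker {
    private final Map<String, Long> nextOffsets = new HashMap<>();

    /** Records that the record at the given offset has been fully processed. */
    void markProcessed(String partition, long offset) {
        // The committed position is the offset of the NEXT record to read, hence +1.
        nextOffsets.merge(partition, offset + 1, Math::max);
    }

    /** Returns offsets to commit for revoked partitions and forgets their state. */
    Map<String, Long> offsetsToCommit(Collection<String> revoked) {
        Map<String, Long> result = new HashMap<>();
        for (String partition : revoked) {
            Long next = nextOffsets.remove(partition);
            // Partitions with no processed records have nothing to commit.
            if (next != null) {
                result.put(partition, next);
            }
        }
        return result;
    }
}
```

In a real listener, the map returned by offsetsToCommit would be converted to TopicPartition and OffsetAndMetadata entries and passed to a synchronous commit inside onPartitionsRevoked.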
| Parameter | Main Effect | Risk if Misconfigured |
|---|---|---|
| max.poll.interval.ms | Maximum permitted processing time per poll | Exceeded → forced leave → frequent rebalances |
| session.timeout.ms | Grace period for member liveness | Too short causes false positives; too long delays failure detection |
| max.poll.records | Records processed per poll | Too large extends processing time; too small adds overhead |
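The interaction between max.poll.records and max.poll.interval.ms in the table above is simple arithmetic: the worst-case time to process one batch has to fit inside the poll interval. A rough sketch, assuming you can estimate a worst-case per-record processing time; the headroom factor and all names are illustrative assumptions.

```java
/** Back-of-the-envelope sizing for max.poll.records (illustrative helper, not a Kafka API). */
final class PollBudget {
    private PollBudget() {}

    /** Worst-case time to process one full batch, in milliseconds. */
    static long worstCaseBatchMs(int maxPollRecords, long worstCasePerRecordMs) {
        return Math.multiplyExact((long) maxPollRecords, worstCasePerRecordMs);
    }

    /**
     * Largest max.poll.records whose worst-case batch time stays within
     * headroom * maxPollIntervalMs, where 0 < headroom <= 1.
     */
    static int maxSafeRecords(long maxPollIntervalMs, long worstCasePerRecordMs, double headroom) {
        if (maxPollIntervalMs <= 0 || worstCasePerRecordMs <= 0 || headroom <= 0 || headroom > 1) {
            throw new IllegalArgumentException("invalid arguments");
        }
        long budgetMs = (long) (maxPollIntervalMs * headroom);
        return (int) Math.min(Integer.MAX_VALUE, budgetMs / worstCasePerRecordMs);
    }
}
```

As a worked example, 500 records at 800 ms each take 400000 ms in the worst case, which overruns a 300000 ms interval. Keeping the batch within half of that interval allows at most 150000 / 800 = 187 records per poll.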
Timeline (from a single member's perspective)
time --->
[poll]--process--process--RebalanceStart--revoke--commit--assign--init--resume--process
^ Downtime window
Basic pause-and-resume pattern
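// Safe stop on revoke: stop fetching, then commit what has been processed
consumer.pause(currentPartitions);
consumer.commitSync();
// Close handling as needed...
// Re-initialize on assign: optionally reposition to the committed offset
// for (TopicPartition tp : assignedPartitions) {
//     OffsetAndMetadata committed = consumer.committed(Collections.singleton(tp)).get(tp);
//     if (committed != null) consumer.seek(tp, committed.offset());
// }
consumer.resume(assignedPartitions);
The most direct way to lower downtime is to make rebalances less frequent and to narrow their impact when they do happen. The pillars are Cooperative Sticky combined with static membership, tuned heartbeat and polling settings, and planned scaling.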
On rolling deploys, hold slots via static membership and swap one instance at a time to limit the ripple of reassignment.
| Technique | Expected Effect | Notes / Caveats |
|---|---|---|
| Cooperative Sticky | Shorter downtime via incremental handoff of impacted partitions only | Standardize within a group and verify compatibility |
| Static membership | Suppress unnecessary rebalances during rolling restarts | group.instance.id must be unique within the group |
| Heartbeat tuning | Fewer false positives and right-sized detection time | Overly short adds load; overly long delays detection |
| Gradual scaling | Avoid large-scale reassignment | Observe latency and lag at each step |
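The static membership row requires a group.instance.id that is unique within the group yet stable across restarts of the same instance. One hedged way to get both is to derive it from a service name and a per-instance ordinal (for example, a pod index). The sketch below uses only java.util.Properties; the helper name and the naming scheme are assumptions, and required settings such as bootstrap.servers and deserializers are omitted.

```java
import java.util.Properties;

/** Builds consumer properties for static membership (illustrative helper, not a Kafka API). */
final class StaticMemberConfig {
    private StaticMemberConfig() {}

    /**
     * Derives a stable group.instance.id from a service name and an instance ordinal,
     * so that a restarted instance rejoins under the same identity.
     */
    static Properties build(String groupId, String serviceName, int ordinal) {
        if (ordinal < 0) {
            throw new IllegalArgumentException("ordinal must be >= 0");
        }
        Properties props = new Properties();
        props.setProperty("group.id", groupId);
        // Unique per instance, identical across restarts of that instance.
        props.setProperty("group.instance.id", serviceName + "-" + ordinal);
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        return props;
    }
}
```

Two instances started with ordinals 0 and 1 receive distinct IDs, and restarting ordinal 1 yields the same ID as before, which is what lets the coordinator keep its slot.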
How static membership preserves a slot
Basic configuration and rolling restart example
# Static membership + Cooperative
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
# Must be unique within the group
group.instance.id=order-svc-1
# Rolling procedure (conceptual)
# 1) Stop one instance
# 2) Restart it with the same group.instance.id before session.timeout.ms expires
# 3) Confirm stability, then move to the next instance
Excessive rebalances directly cause app downtime and latency degradation. Combine client and broker metrics, logs, and CLI to detect and triage them early.
Typical signs are spikes in rebalance-total, drops in heartbeat-rate, and frequent RebalanceInProgressException. Separate the cause into application processing, infrastructure latency, or configuration.
| Metric / Log | Meaning | Action |
|---|---|---|
| Increase in rebalance-total | Frequent rebalances | Correlate with time-of-day and deploys to find the source |
| Drop in heartbeat-rate | Heartbeat delay/stop | Investigate GC/CPU/network and re-tune timeouts |
| RebalanceInProgressException | A commit was attempted while a rebalance was in progress | Call poll() to let the rebalance complete, then retry; verify retry logic and listener handler health |
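Detecting the first row of the table amounts to watching how fast a cumulative rebalance counter grows between scrapes. A minimal sketch, assuming you already collect periodic samples of that counter; the threshold, sampling interval, and class name are all hypothetical.

```java
/** Flags bursts in a cumulative rebalance counter (illustrative helper, not a Kafka API). */
final class RebalanceRateAlert {
    private RebalanceRateAlert() {}

    /**
     * Rebalances between two samples of a monotonically increasing counter.
     * A lower current value is treated as a client restart, so counting starts from zero.
     */
    static long delta(long previousTotal, long currentTotal) {
        return currentTotal >= previousTotal ? currentTotal - previousTotal : currentTotal;
    }

    /** True when any consecutive pair of samples shows more than the given number of rebalances. */
    static boolean exceeds(long[] samples, long threshold) {
        for (int i = 1; i < samples.length; i++) {
            if (delta(samples[i - 1], samples[i]) > threshold) {
                return true;
            }
        }
        return false;
    }
}
```

Samples of 3, 3, 4, 9 with a threshold of 3 raise an alert because the last interval saw 5 rebalances. Correlating the flagged interval with deploy times is the next triage step.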
Group state transitions (conceptual)
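As a rough stand-in for the state diagram, the sketch below models the group states commonly reported for classic consumer groups (Empty, PreparingRebalance, CompletingRebalance, Stable, Dead) as an enum. The transition set is a simplification assumed for illustration, not a copy of the broker's implementation.

```java
import java.util.EnumSet;
import java.util.Set;

/** Simplified model of consumer group states; the allowed transitions are an assumption. */
enum GroupStateSketch {
    EMPTY, PREPARING_REBALANCE, COMPLETING_REBALANCE, STABLE, DEAD;

    /** States this state may move to in this simplified model. */
    Set<GroupStateSketch> next() {
        switch (this) {
            case EMPTY:
                return EnumSet.of(PREPARING_REBALANCE, DEAD);        // first member joins
            case PREPARING_REBALANCE:
                return EnumSet.of(COMPLETING_REBALANCE, EMPTY, DEAD); // all joined, or all left
            case COMPLETING_REBALANCE:
                return EnumSet.of(STABLE, PREPARING_REBALANCE, DEAD); // synced, or membership changed
            case STABLE:
                return EnumSet.of(PREPARING_REBALANCE, DEAD);        // any trigger restarts the cycle
            default:
                return EnumSet.noneOf(GroupStateSketch.class);       // DEAD is terminal here
        }
    }

    /** True when the model allows a direct move to the target state. */
    boolean canMoveTo(GroupStateSketch target) {
        return next().contains(target);
    }
}
```

The point to take from the model is that a stable group cannot skip the join phase: every trigger sends it back through PreparingRebalance and CompletingRebalance before it is stable again.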
Commonly used operational commands
# Current assignments and lag
kafka-consumer-groups --bootstrap-server <bkr:9092> --describe --group <group>
# List group members
kafka-consumer-groups --bootstrap-server <bkr:9092> --describe --members --group <group>
# Safely adjust offsets (planned maintenance)
# Always verify with --dry-run before executing
kafka-consumer-groups --bootstrap-server <bkr:9092> --group <group> --topic <t> --reset-offsets --to-latest --dry-run
CCDAK / CCAAK
Question 1
You want to run a rolling deploy while minimizing consumer group downtime. Which mitigation is most effective and consistent with documented behavior?
Correct answer: A
Cooperative Sticky shrinks downtime via incremental handoff, and static membership suppresses unnecessary reassignment during rolling restarts. B makes max.poll.interval.ms detection overly sensitive (counterproductive); C delays failure detection and extends gaps; D tends to cause temporary full stops.
Does a coordinator failover always trigger a rebalance?
Member rejoins can happen, but a coordinator failover does not necessarily cause large-scale reassignment. Brief gaps or resyncs may occur, so make client retries and listener implementations robust.
What happens when you increase the partition count?
The group recomputes assignments including the new partitions, triggering a rebalance. With Cooperative, only the impacted partitions are handed off incrementally, reducing perceived downtime. Doing it during off-peak hours with monitoring is recommended.
What is the difference between session.timeout.ms and max.poll.interval.ms?
session.timeout.ms is the heartbeat-based member liveness window, monitoring network and process health. max.poll.interval.ms is the application processing health check; if the gap between poll() calls is too long, the member is considered unhealthy and removed. They are independent, but exceeding either one triggers a rebalance.
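The two checks can be pictured as independent conditions evaluated against different clocks. The sketch below is a conceptual model only: in a real deployment the session check happens on the coordinator and the poll check inside the client, and the class, enum, and parameter names here are hypothetical.

```java
/** Conceptual model of the two independent liveness checks (not a Kafka API). */
final class LivenessCheck {
    enum Verdict { HEALTHY, SESSION_TIMEOUT, POLL_TIMEOUT }

    private LivenessCheck() {}

    /**
     * Classifies a member from the time since its last heartbeat and its last poll() call.
     * Either limit being exceeded is enough to remove the member and trigger a rebalance.
     */
    static Verdict evaluate(long msSinceLastHeartbeat, long msSinceLastPoll,
                            long sessionTimeoutMs, long maxPollIntervalMs) {
        if (msSinceLastHeartbeat > sessionTimeoutMs) {
            return Verdict.SESSION_TIMEOUT;   // process or network looks dead
        }
        if (msSinceLastPoll > maxPollIntervalMs) {
            return Verdict.POLL_TIMEOUT;      // process is alive but stuck in processing
        }
        return Verdict.HEALTHY;
    }
}
```

A member that heartbeats normally but has not called poll() for 400000 ms against a 300000 ms limit is classified as a poll timeout, which points at slow processing rather than infrastructure.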
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.