Kafka Consumer Groups: Parallelism & Rebalance (2026)

Consumer Groups are at the heart of Kafka's scale-out and fault-tolerance story. Partitions cap the upper bound on parallelism, while group coordination (rebalancing) drives availability and latency.

This article distills the CCDAK exam perspective alongside the configuration and boundary behaviors that matter in production, sticking to stable concepts aligned with the official documentation.

1. Parallelism Fundamentals: Partitions and Consumer Groups

A Consumer Group is made up of multiple consumers sharing the same group.id, and each partition of a given topic is assigned to exactly one consumer in the group. As a result, the maximum parallelism is bounded by the total number of assignable partitions across the group.

Ordering guarantees are per-partition. Even as you scale the group horizontally, ordering is preserved within a single partition but not across partitions. Secure the parallelism you need up front through partition design. Consumers are not thread-safe, so the basic rule is one thread per instance; if you need internal parallelism, decouple message processing with a worker-queue pattern.

Max parallelism ≈ min(number of consumers, total partitions)
Each partition is exclusively assigned to one consumer within the group
Ordering is guaranteed per partition; there is no global ordering across the group
Consumers are generally not thread-safe; one thread per instance is the rule

Consumer Group and partition assignment (conceptual diagram)

2. Group Coordination: How Rebalancing Works and What Triggers It

Consumer Groups are managed by the Group Coordinator on a broker, and members send heartbeats to signal liveness. When sessions expire or membership changes, a rebalance fires and partition assignments are recomputed. Fetching and processing can pause briefly during a rebalance, so optimizing its frequency and duration is the key to throughput and latency.

The main triggers are members joining or leaving, changes to the subscribed topic set, partition count changes, and exceeding the session timeout or max poll interval. Configuration centers on keeping session.timeout.ms, heartbeat.interval.ms, and max.poll.interval.ms consistent, plus selecting the right assignment strategy.

Rebalance triggers: join/leave, subscribe changes, partition count changes, session expiry, max.poll.interval.ms exceeded
Monitoring priorities: rebalance count, average/max duration, heartbeat latency
Stabilization essentials: adopt the Cooperative strategy, use static membership, and tune timeouts appropriately

Aspect	Eager (legacy) rebalance	Cooperative (incremental) rebalance	Exam and operational takeaway
Disruption level	All members release every partition first, causing major processing interruptions	Only the affected partitions move incrementally, minimizing interruption	Cooperative is generally preferred for low-latency, stable operations
Movement volume	Broad reassignment tends to occur after recomputation	Stays at the minimum necessary reassignment	The difference is most pronounced during rolling restarts
Listener handling	Tends to require a full commit/cleanup in onPartitionsRevoked	Allows minimal commits in step with fine-grained revoke/assign events	The rebalance-listener implementation makes a real difference
Applicability and benefits	Simpler to implement; acceptable for small, short-lived groups	Effective for always-on workloads where stability and continuous processing matter	Setting it explicitly is the safe choice (defaults depend on implementation/version)

3. Assignment Strategies and Stabilization Tuning

Assignment strategies are specified as a list under partition.assignment.strategy. Common options include RangeAssignor, RoundRobinAssignor, StickyAssignor, and CooperativeStickyAssignor. In practice, incremental rebalancing via CooperativeStickyAssignor is widely recommended, but defaults vary across clients and distributions, so explicit configuration is the safer choice (to avoid version-driven surprises).

Static membership (setting group.instance.id) treats a restarted member as the same identity, dramatically reducing rebalancing during rolling restarts. Tune timeout settings with their interdependencies in mind so that you don't trigger unnecessary leaves by exceeding max.poll.interval.ms.

partition.assignment.strategy: list CooperativeStickyAssignor first explicitly (recommended)
Assign group.instance.id to enable static membership (stabilizes rolling restarts)
Keep the ratio between session.timeout.ms and heartbeat.interval.ms (e.g., heartbeat at about 1/3 of session)
Set max.poll.interval.ms long enough to cover your worst-case processing time

4. Offset Management and Delivery Semantics

Consumer offsets are committed by default to the internal __consumer_offsets topic. enable.auto.commit will commit periodically, but it advances regardless of processing success or failure, so consider manual commits (commitSync/commitAsync) at least for critical processing paths.

Delivery semantics depend on commit ordering. Commit after processing for at-least-once. Commit before processing for at-most-once — but with the risk of loss. True exactly-once requires producer/consumer coordination via Kafka transactions, which broadens the unit of design.

auto-commit is convenient but makes reprocessing and duplicate control difficult
at-least-once: commit offsets after processing completes (the common, safe default)
at-most-once: commit immediately after fetch (only for loss-tolerant use cases)
Transactions enable read-process-write EOS at the cost of additional design complexity

5. Patterns for Scaling and Throughput Optimization

To raise parallelism, you generally either add partitions or add consumers to the group. The latter is bounded by the former, so estimating the required partition count at planning time is critical. Too many partitions inflate metadata and replication overhead, so scale design needs to be grounded in evidence.

For fetch/processing optimization, useful tools include batch processing via max.poll.records, bandwidth efficiency via fetch.min.bytes and fetch.max.wait.ms, and temporary backpressure control via pause/resume. Distribute processing threads horizontally through an internal queue, and manage offset commits against processing completion.

Partition count is the main axis of scale design; adding partitions later can disrupt key-based locality
Use max.poll.records for batch processing to balance throughput vs. latency
pause/resume temporarily reduces intake — useful as a throttle during overload
client.rack enables fetching from nearby replicas first (network efficiency)

6. Operational Essentials: Monitoring and Safe Rolling Restarts

In production, surface rebalance metrics (count and duration), consumer lag, fetch latency, and heartbeat delay. During incidents and deployments, combine static membership with the Cooperative strategy to minimize per-release impact.

During rolling restarts, stop and restart one instance at a time, waiting for each member to rejoin before moving on. With static membership intact, reassignment is localized and interruption stays minimal. The kafka-consumer-groups CLI with --describe or --reset-offsets is useful for inspecting and correcting state.

Monitoring: group lag, rebalance count/duration, heartbeat, fetch metrics
Procedure: roll one instance at a time with static membership plus Cooperative for a graceful transition
Tooling: proper use of kafka-consumer-groups --describe and --reset-offsets

Java Consumer with Cooperative + static membership (key excerpts)

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-consumer-g");
// Explicitly set Cooperative Sticky (list order is priority)
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
         "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
// Static membership (set a unique ID per instance)
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, System.getenv("POD_NAME"));
// Align timeouts with processing time and network characteristics
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "15000");
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

CountDownLatch latch = new CountDownLatch(1);

ConsumerRebalanceListener listener = new ConsumerRebalanceListener() {
  @Override
  public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
    // Reliably commit processed offsets (at-least-once)
    try { consumer.commitSync(); } catch (Exception e) { /* log */ }
  }
  @Override
  public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
    // Seek or initialize as needed
  }
};

consumer.subscribe(Arrays.asList("orders"), listener);

try {
  while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> r : records) {
      process(r); // App-specific processing (don't swallow exceptions; route to DLQ etc.)
    }
    consumer.commitAsync(); // Usually async; sync before shutdown/revoke
  }
} finally {
  try { consumer.commitSync(); } catch (Exception ignore) {}
  consumer.close();
}

Check Your Understanding

CCDAK

問題 1

Topic A has 6 partitions. You launch 8 instances in Consumer Group G with CooperativeStickyAssignor and static membership configured. Which combination correctly describes the maximum concurrent parallelism and the rolling-restart behavior?

Parallelism is 6. Rebalancing during rolling restarts is localized in scope, minimizing interruption
Parallelism is 8. Cooperative shares some partitions with the surplus consumers
Parallelism is 6. Static membership eliminates rebalancing entirely
Parallelism is 8. Static membership replicates the unassigned partitions

正解: A

The group's maximum parallelism is capped by the number of assignable partitions (6); the 2 surplus instances stay unassigned. Cooperative minimizes interruption via incremental reassignment, and static membership suppresses the full reassignment that would otherwise come from leave/rejoin during rolling restarts — but it does not eliminate rebalancing entirely.

Frequently Asked Questions

Why doesn't throughput improve when I add more consumers?

Because the total number of partitions caps parallelism — any extra consumers stay unassigned. First revisit your partition-count design, or look at processing bottlenecks (downstream databases or APIs).

How can I increase parallelism while preserving ordering?

Improve key locality and increase the partition count. Ordering guarantees apply only within a single partition, so revisit your key/hashing design so that one key maps to one partition.

How can I reduce downtime caused by rebalances?

Explicitly set CooperativeStickyAssignor, enable static membership via group.instance.id, tune max.poll.interval.ms to match your processing time, set session/heartbeat appropriately, and implement onPartitionsRevoked/Assigned to keep commits and re-initialization minimal.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.

Kafka Consumer Groups in Practice: Parallelism, Coordination, and Tuning