Consumer Lag is the difference between the latest end offset of each partition and the consumer's position (or committed position). Mixing up these definitions leads to false positives and missed incidents in monitoring.
This article covers the lag formula and key metrics, alert design for production, a tool comparison, common root causes and remediations, and the points most frequently tested on CCDAK/CCAAK.
Consumer Lag is the gap between a partition's End Offset (latest write position) and the consumer's position. In practice, some tools use the committed offset as the baseline (kafka-consumer-groups.sh and most exporters), while client metrics like records-lag use the current fetched position. Both are valid, but they mean different things for alerts, so pick one baseline and stick to it across your monitoring design.
Each consumer group commits, per partition, the offset of the next record it will process to an internal topic called __consumer_offsets. Lag is computed per partition, and group-level lag is typically reported as the simple sum across partitions.
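The per-partition arithmetic and the group-level sum can be sketched as follows. This is a minimal illustration that assumes the end offsets and committed offsets have already been fetched into plain dictionaries; the function names and sample numbers are hypothetical and not part of any Kafka client API.

```python
from typing import Dict, Optional, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition number)


def partition_lags(
    end_offsets: Dict[TopicPartition, int],
    committed: Dict[TopicPartition, Optional[int]],
) -> Dict[TopicPartition, int]:
    """Committed-based lag per partition: end offset minus committed offset."""
    lags: Dict[TopicPartition, int] = {}
    for tp, end in end_offsets.items():
        offset = committed.get(tp)
        if offset is None:
            # Assumption: partitions without a commit are skipped, not counted as zero.
            continue
        # Offsets are read at slightly different times, so floor the result at 0.
        lags[tp] = max(end - offset, 0)
    return lags


def group_lag(lags: Dict[TopicPartition, int]) -> int:
    """Group-level lag as the simple sum across partitions."""
    return sum(lags.values())


# Illustrative numbers for one topic with two partitions.
end_offsets = {("orders", 0): 1500, ("orders", 1): 2300}
committed = {("orders", 0): 1450, ("orders", 1): 2000}

lags = partition_lags(end_offsets, committed)
print(lags)             # {('orders', 0): 50, ('orders', 1): 300}
print(group_lag(lags))  # 350
```

A sum hides skew: here one partition holds most of the backlog, so it is usually worth keeping the per-partition values alongside the total.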
Consumer Lag conceptual diagram (1 topic, 2 partitions)
The LAG column in kafka-consumer-groups.sh is normally LOG-END-OFFSET - CURRENT-OFFSET, where CURRENT-OFFSET is the group's committed offset (that is, EndOffset - CommittedOffset). Because it reflects how far processing has been finalized, it makes a natural baseline for operational alerts. The client-side JMX records-lag metrics, by contrast, use the current fetched/processing position, which is effective for catching short-term stalls.
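To make the two baselines concrete, here is a small sketch that computes both from the same three offsets. The class and field names are hypothetical, and the numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class PartitionOffsets:
    end_offset: int  # latest write position on the partition
    position: int    # consumer's current fetched position
    committed: int   # last offset the group has committed

    def committed_lag(self) -> int:
        """Baseline used by kafka-consumer-groups.sh: end offset - committed offset."""
        return max(self.end_offset - self.committed, 0)

    def position_lag(self) -> int:
        """Baseline used by client-side records-lag: end offset - current position."""
        return max(self.end_offset - self.position, 0)

    def uncommitted(self) -> int:
        """Records fetched but not yet committed (the gap between the two baselines)."""
        return max(self.position - self.committed, 0)


# A consumer that is nearly caught up but commits infrequently.
p = PartitionOffsets(end_offset=10_000, position=9_950, committed=9_000)
print(p.committed_lag())  # 1000
print(p.position_lag())   # 50
print(p.uncommitted())    # 950
```

The same partition reports a lag of 1000 or 50 depending on the baseline, which is why a threshold tuned for one should not be reused for the other.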
The main metrics are listed below. JMX names are stable, but exporter names on the monitoring backend vary by tool, so confirm the exact name in the documentation of whatever you scrape from.
Commands you will actually use in production, plus a Prometheus alert example
# 1) Check lag per group (committed-based)
$ kafka-consumer-groups.sh \
--bootstrap-server broker1:9092 \
--group orders-app \
--describe
# 2) Alert when a specific group's lag stays above 10,000 for 5 minutes (Prometheus example)
# Note: metric names vary by exporter implementation (e.g. Kafka Lag Exporter).
groups:
  - name: kafka-consumer-lag
    rules:
      - alert: KafkaConsumerLagHigh
        expr: kafka_consumergroup_group_lag{group="orders-app"} > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Kafka consumer lag high (orders-app)"
          description: "Lag > 10k for 5m. Investigate consumer throughput, rebalancing, or downstream slowness."

Fixed thresholds alone make it hard to distinguish a temporary peak from a real stall, so combine quantity with duration. Layering in Backlog Seconds on top of that yields alerts that map directly to business impact (how far behind you actually are).
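As a rough sketch of Backlog Seconds, the function below assumes it is defined as lag divided by the current consumption rate. That definition and the function name are assumptions for illustration; some tools derive a time-based lag differently.

```python
def backlog_seconds(lag_records: int, consume_rate_per_sec: float) -> float:
    """Approximate time needed to work through the backlog at the current rate."""
    if lag_records <= 0:
        return 0.0
    if consume_rate_per_sec <= 0:
        # Nothing is being consumed, so the backlog never clears at this rate.
        return float("inf")
    return lag_records / consume_rate_per_sec


# 12,000 records behind while consuming 400 records/s: about 30 seconds behind.
print(backlog_seconds(12_000, 400.0))  # 30.0

# The same lag with a consumer crawling at 5 records/s is 40 minutes behind.
print(backlog_seconds(12_000, 5.0))    # 2400.0
```

The two calls show why a record-count threshold alone can mislead: an identical lag of 12,000 is harmless in the first case and a clear SLO breach in the second.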
When the consumer ingestion rate sits near zero while records-lag-max keeps rising, that combination is a classic signature of a stuck processing loop or frequent rebalances. Design alerts around the combination of rate and slope, not just the absolute value of a single metric.
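One way to express the rate-plus-slope check is sketched below. It assumes evenly spaced lag samples and a separately measured consumption rate; the function names and default thresholds are hypothetical and would need tuning per workload.

```python
from typing import Sequence


def lag_slope(samples: Sequence[float], interval_sec: float) -> float:
    """Average change in lag per second across evenly spaced samples."""
    if len(samples) < 2:
        return 0.0
    return (samples[-1] - samples[0]) / ((len(samples) - 1) * interval_sec)


def looks_stalled(
    lag_samples: Sequence[float],
    consume_rate_per_sec: float,
    interval_sec: float = 60.0,
    rate_floor: float = 1.0,  # below this, treat the consumer as near zero
    min_slope: float = 0.0,   # lag must be growing faster than this
) -> bool:
    """Flag the combination of near-zero consumption and rising lag."""
    rate_is_low = consume_rate_per_sec < rate_floor
    lag_is_rising = lag_slope(lag_samples, interval_sec) > min_slope
    return rate_is_low and lag_is_rising


samples = [1000, 4000, 7000]  # lag sampled once per minute

print(lag_slope(samples, 60.0))       # 50.0
print(looks_stalled(samples, 0.2))    # True: almost nothing consumed, lag rising
print(looks_stalled(samples, 500.0))  # False: lag rising, but the consumer is busy
```

The third call corresponds to a traffic peak rather than a stall, which is the case a single absolute threshold tends to misreport.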
Tools differ in which lag baseline they use (committed vs current position) and how they evaluate it. Pick based on your operational requirements (SLA compliance, trend monitoring, dashboard focus).
Confluent Control Center is platform-integrated and makes visualization and alerting easy. Burrow evaluates the slope of commits to classify state as OK/WARN/ERROR, which suppresses false positives. For a lightweight start, the CLI plus an exporter feeding Prometheus/Alertmanager is the most approachable combo.
| Tool | Lag baseline | Alerting | Visualization / characteristics |
|---|---|---|---|
| kafka-consumer-groups.sh | Committed-based (EndOffset - CommittedOffset) | Aimed at manual checks; scriptable for simple monitoring | Lightweight, ships with Kafka, one-shot |
| Confluent Control Center | Both committed and client-side metrics | Thresholds / anomaly detection with notification integrations | GUI dashboards, trends, cross-topic/group views |
| Burrow | Primarily committed-based plus slope evaluation | Evaluator classifies state as OK/WARN/ERROR | No built-in UI (designed for external integration); low false positive rate |
| Kafka Lag Exporter + Prometheus | Committed-based (aggregated per partition/group) | Flexible PromQL definitions with Alertmanager integration | Easy Grafana dashboards, low cost |
Lag accumulates whenever production exceeds consumption for a sustained period. The main culprits are insufficient processing throughput, frequent rebalances, retry storms, and waits on external dependencies (DB/HTTP). Remediation has to cover not just config tuning but also processing design — batching, async patterns, DLQ.
Scaling out is not a silver bullet. Adding consumers beyond the partition count does not increase parallelism. If the throughput bottleneck is downstream I/O, simple parallelization barely helps — queueing, caching, and bulk APIs are far more effective.
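The partition ceiling can be illustrated with a few lines of arithmetic. This sketch assumes a single consumer group reading one topic and a constant per-consumer rate, which is a simplification; the function names and numbers are hypothetical.

```python
def active_consumers(consumers: int, partitions: int) -> int:
    """Each partition is assigned to at most one consumer in a group."""
    return min(consumers, partitions)


def idle_consumers(consumers: int, partitions: int) -> int:
    """Consumers beyond the partition count receive no assignment."""
    return max(consumers - partitions, 0)


def max_group_throughput(consumers: int, partitions: int, per_consumer_rate: float) -> float:
    """Upper bound on group throughput under the constant-rate assumption."""
    return active_consumers(consumers, partitions) * per_consumer_rate


partitions = 6
for consumers in (4, 6, 8):
    print(
        consumers,
        active_consumers(consumers, partitions),
        idle_consumers(consumers, partitions),
        max_group_throughput(consumers, partitions, 200.0),
    )
# 4 4 0 800.0
# 6 6 0 1200.0
# 8 6 2 1200.0
```

Going from 6 to 8 consumers adds two idle members and no throughput, so past that point the options are more partitions or faster per-record processing.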
Exams frequently test the definition of lag, the difference between committed and current position, the scaling ceiling (partition-count constraint), offset management, and rebalance-related settings. Expect questions about interpreting CLI output and prioritizing settings for a given implementation.
End-to-end delay (time from production to processing completion) is not synonymous with lag. Even when lag is small, internal processing waits or external I/O can produce significant real-time delay.
CCDAK / CCAAK
Question 1
You are designing Consumer Lag alerts for a Kafka environment. Using kafka-consumer-groups.sh LAG as the baseline while minimizing false positives, which combination is most appropriate?
Correct answer: A
The LAG from kafka-consumer-groups.sh is committed-based and reflects how far finalized processing is behind. Requiring the threshold to hold over time (quantity × duration, via a for clause) and pairing it with a drop in records-consumed-rate suppresses false positives. Immediate alerting on records-lag-max alone (B) is fragile against spikes. Producer rate only (C) or end-to-end delay only (D) cannot directly detect consumer-side stalls.
Does Consumer Lag have to be zero?
You do not need to aim for a constant zero. Lag typically accumulates briefly during traffic peaks and clears once the consumption rate catches up. What matters is that it resolves within your tolerated delay window — define an SLO that accounts for Backlog Seconds and duration.
I added more consumers but the lag will not drop. Why?
Parallelism is capped by the partition count. Once consumers >= partitions, adding more consumers does not increase parallel processing. Also, if downstream I/O is the bottleneck, batching or scaling the destination usually helps more than adding consumers.
After disabling auto-commit, the LAG looks huge. Is that a problem?
With enable.auto.commit=false, the lag depends on how often your app commits. If you intentionally delay commits, the Committed-based LAG can look large even though processing is keeping up. Combine it with records-lag-max (current position) for alerts and pick a baseline that matches your commit design.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.