Kafka is loved for its high throughput and fault tolerance, but without proper monitoring you cannot catch performance regressions or data loss risks. This article covers the core broker and client metrics, how to read them, and the key points of threshold design.
The content focuses on stable concepts from the Apache Kafka and Confluent documentation, picking metrics that do not depend heavily on version differences. Points that the CCAAK exam likes to test are also called out.
Kafka brokers expose metrics through JMX. The common pattern is to convert them into an HTTP endpoint with the JMX Exporter (a Java Agent), scrape with Prometheus, visualize in Grafana, and alert via Alertmanager. In Confluent environments, Control Center and the Metrics API (Confluent Cloud) are also available.
Separate collection paths and responsibilities, and design freshness (scrape interval) and granularity (average vs. percentile) per use case. The rule of thumb in production is high frequency for latency and lower frequency for capacity metrics.
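To make the average-versus-percentile choice concrete, here is a minimal Python sketch. The latency values are invented for illustration, and `percentile` is a hypothetical helper using the nearest-rank method, not something provided by Kafka or Prometheus.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical produce latencies in ms: 90 fast requests and 10 slow ones.
latencies_ms = [10] * 90 + [400] * 10

avg = mean(latencies_ms)             # 49
p50 = percentile(latencies_ms, 50)   # 10
p95 = percentile(latencies_ms, 95)   # 400
```

The average (49 ms) looks healthy even though one request in ten takes 400 ms, which is why latency dashboards and alerts usually lean on p95/p99 rather than the mean.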
Typical flow for collecting Kafka metrics
JMX Exporter + Prometheus configuration example (excerpt)
# Example Kafka broker startup option (systemd, etc.)
# Enable the JMX Exporter as a Java Agent
Environment="KAFKA_OPTS=-javaagent:/opt/jmx/jmx_prometheus_javaagent.jar=7071:/opt/jmx/kafka.yml"

# Example Prometheus scrape configuration
scrape_configs:
  - job_name: 'kafka-brokers'
    scrape_interval: 15s
    static_configs:
      - targets: ['broker-1:7071','broker-2:7071','broker-3:7071']
    relabel_configs:
      - source_labels: [__address__]
        regex: '([^:]+):.*'
        target_label: instance
        replacement: '$1'

Start with the metrics that directly touch availability and consistency. UnderReplicatedPartitions (URP), OfflinePartitionsCount, and ActiveControllerCount are the canonical primary alerting signals. Throughput metrics (BytesIn/Out, MessagesIn) are used for capacity planning and bottleneck analysis.
Request latency is not a single metric: it is the sum of queue wait, local processing (including disk I/O), time spent waiting on other brokers, response send time, and more. The right pattern for both the exam and the field is to break down RequestMetrics by request type (Produce, FetchConsumer, FetchFollower).
| Metric | Meaning | Healthy range | Example alert condition |
|---|---|---|---|
| UnderReplicatedPartitions | Number of partitions with fewer in-sync replicas than their replication factor | Always 0 | > 0 sustained (e.g. 1 minute or more) |
| ActiveControllerCount | Number of active controllers | Always 1 | != 1 |
| OfflinePartitionsCount | Number of offline partitions | Always 0 | >= 1 |
| Request TotalTimeMs p95 (Produce) | Total latency of produce requests | Workload-dependent (e.g. < 50–100ms) | Sustained at 2–3x the normal level |
| BytesInPerSec/BytesOutPerSec | I/O throughput | Baseline for capacity planning | Sharp rate of change (e.g. ±50% in 5 minutes) |
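The flat-value rows of the table above can be sketched as a small check. This is an illustrative Python function, not part of any Kafka tooling: `primary_alerts` and its parameter names are hypothetical, it assumes `active_controllers` is the sum of ActiveControllerCount across all brokers, and it ignores the "sustained" duration that a real alert rule would add for URP.

```python
def primary_alerts(under_replicated, active_controllers, offline_partitions):
    """Return the names of primary signals that violate their flat thresholds.

    active_controllers is assumed to be the cluster-wide sum of
    ActiveControllerCount, so a healthy cluster reports exactly 1.
    """
    alerts = []
    if under_replicated > 0:
        alerts.append("UnderReplicatedPartitions")
    if active_controllers != 1:
        alerts.append("ActiveControllerCount")
    if offline_partitions >= 1:
        alerts.append("OfflinePartitionsCount")
    return alerts


healthy = primary_alerts(under_replicated=0, active_controllers=1, offline_partitions=0)
degraded = primary_alerts(under_replicated=3, active_controllers=2, offline_partitions=0)
```

Here `healthy` is an empty list, while `degraded` names both URP and the controller count, since two active controllers is as much a fault as none.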
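Kafka's RequestMetrics lets you split TotalTimeMs into RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, ResponseSendTimeMs, and so on. Growing request queue time suggests request handler thread starvation or GC pauses, growing local time points to disk I/O on the leader, and growing remote time indicates waiting on other brokers, typically follower replication for acks=all produce requests. For fetch requests, remote time also includes the intentional wait governed by fetch.min.bytes and fetch.max.wait.ms, so read it per request type.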
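As a rough illustration of reading the breakdown, the Python sketch below picks the component that dominates a request's total time. The function name and the sample values are hypothetical. The dictionary keys follow the JMX metric names (RequestQueueTimeMs is the queue wait), and the values would come from whatever system stores your scraped metrics.

```python
def dominant_component(breakdown_ms):
    """Return (component_name, share_of_total) for the largest time component."""
    total = sum(breakdown_ms.values())
    if total <= 0:
        raise ValueError("breakdown must contain positive time")
    name = max(breakdown_ms, key=breakdown_ms.get)
    return name, breakdown_ms[name] / total


# Hypothetical p95 values (ms) for Produce requests on one broker.
sample = {
    "RequestQueueTimeMs": 2.0,
    "LocalTimeMs": 30.0,
    "RemoteTimeMs": 6.0,
    "ResponseQueueTimeMs": 1.0,
    "ResponseSendTimeMs": 1.0,
}

component, share = dominant_component(sample)  # ("LocalTimeMs", 0.75)
```

With these numbers, local time accounts for 75% of the total, which would point the investigation at disk I/O on the leader rather than at the network or the request queue.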
Base alerts on percentiles (p95/p99) and treat the average as a dashboard reference only. Splitting Produce from Fetch (especially the consumer-facing FetchConsumer) is a firm rule in both the exam and production.
Consumer lag is the difference between each partition's Log End Offset and the Consumer Group's Committed Offset. Momentary spikes are normal, but sustained lag with no downward trend needs action. Settings like auto.commit.interval.ms, max.poll.interval.ms, and max.poll.records directly shape the behavior.
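The definition above translates directly into a subtraction per partition. The following Python sketch is illustrative only: `partition_lag` is a hypothetical helper, the offsets are made-up values, and reporting `None` for a partition with no committed offset is an assumption of this sketch rather than the behavior of any Kafka tool.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Lag per partition = Log End Offset minus the group's committed offset.

    Partitions without a committed offset are reported as None
    (an assumption of this sketch).
    """
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        committed = committed_offsets.get(partition)
        if committed is None:
            lag[partition] = None
        else:
            lag[partition] = max(end_offset - committed, 0)
    return lag


# Hypothetical offsets for a two-partition topic.
lag = partition_lag(
    log_end_offsets={0: 128400, 1: 128001},
    committed_offsets={0: 128345, 1: 127998},
)
total_lag = sum(v for v in lag.values() if v is not None)  # 58
```

Because the calculation uses the committed offset, a consumer that processes records promptly but commits rarely will still show inflated lag between commits.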
On the CCAAK exam, questions about misreading lag (temporary swelling from batch processing, inflated values from delayed commits, etc.) are standard. Use kafka-consumer-groups to inspect at the Group/Member/Partition level.
Checking lag with kafka-consumer-groups (example)
# Typical command
$ kafka-consumer-groups --bootstrap-server broker-1:9092 \
    --group analytics-app --describe
GROUP          TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID     HOST        CLIENT-ID
analytics-app  events  0          128345          128400          55   consumer-1-...  /10.0.0.21  analytics-1
analytics-app  events  1          127998          128001          3    consumer-2-...  /10.0.0.22  analytics-2

# Interpretation tips
# - LAG oscillates but converges on average → within the normal range
# - LAG grows cumulatively or stays stuck → insufficient throughput or a stall

Disk I/O and replication behavior tie directly to latency and availability. LogFlushRateAndTimeMs, Produce/FetchRequestPurgatorySize, and IsrShrinksPerSec/IsrExpandsPerSec are the keys to stable operations. Even when URP stays clean, catch a rising IsrShrinks trend early.
Too many partitions also drives up open file counts and metadata processing cost. Watch PartitionCount and LeaderCount regularly to prevent unchecked topic/partition growth.
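One simple way to notice unchecked growth is to fit a line through periodic samples of PartitionCount and watch the slope. The Python sketch below uses an ordinary least-squares fit over evenly spaced samples. `trend_slope` is a hypothetical helper and the daily values are invented.

```python
def trend_slope(samples):
    """Least-squares slope of evenly spaced samples (units per sample interval)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical daily PartitionCount readings for one broker.
daily_partition_count = [1000, 1010, 1020, 1030, 1040]
growth_per_day = trend_slope(daily_partition_count)  # 10.0
```

A steady positive slope is not a problem by itself, but multiplying it out against file-descriptor limits or a per-broker partition budget gives an early estimate of when action will be needed.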
Design alerts in three layers: primary availability signals, leading performance-degradation signals, and capacity-pressure trends. Notify immediately on primary signals, use sustained conditions and hysteresis for leading signals, and rely on early detection (trend slope) for capacity.
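For the leading-signal layer, a sustained condition combined with hysteresis might look like the following Python sketch. The class name, thresholds, and sample values are all hypothetical. In practice the same idea is usually expressed in the alerting system's own rule language rather than in application code.

```python
class HysteresisAlert:
    """Fire after `sustain` consecutive samples above `high`; clear only below `low`."""

    def __init__(self, high, low, sustain):
        if low > high:
            raise ValueError("low must not exceed high")
        self.high = high
        self.low = low
        self.sustain = sustain
        self.firing = False
        self._streak = 0

    def update(self, value):
        """Feed one sample and return whether the alert is currently firing."""
        if self.firing:
            # Once firing, stay on until the value drops below the lower bound.
            if value < self.low:
                self.firing = False
                self._streak = 0
        else:
            # Count consecutive samples above the upper bound.
            self._streak = self._streak + 1 if value > self.high else 0
            if self._streak >= self.sustain:
                self.firing = True
        return self.firing


# Hypothetical produce p95 samples in ms.
alert = HysteresisAlert(high=100, low=60, sustain=3)
states = [alert.update(v) for v in [120, 50, 110, 120, 130, 80, 90, 55]]
# states == [False, False, False, False, True, True, True, False]
```

The single spike at the start never fires because it is not sustained, and the alert does not flap while the value hovers between 60 and 100: it clears only once the sample falls below the lower bound.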
For exam prep, the meanings and normal values of URP/ActiveController/OfflinePartitions, the RequestMetrics breakdown, correct lag interpretation, and visualization with Confluent tools all come up frequently. In production, taking a baseline and combining it with relative-change alerts reduces false positives.
CCAAK
Question 1
Which metric is the primary signal that most directly indicates an availability risk in a Kafka cluster and should be monitored continuously?
Correct answer: A
URP directly indicates that replicas have dropped out of the ISR and fault tolerance is degraded. A drop in BytesIn or a high CPU plateau are indirect factors, and LeaderElection latency matters but is not a steady-state signal. URP is the top priority as the primary indicator.
Can I monitor Kafka without Prometheus?
Yes. Kafka exposes metrics via JMX, so you can read them with tools like JConsole or Jolokia. With Confluent you can use Control Center, and with Confluent Cloud the Metrics API exposes the key indicators. That said, Prometheus-based stacks are the common choice because of long-term retention and easier alert design.
Is there a universal threshold value?
No. Workloads vary widely, so the practical approach is to capture a baseline and percentiles during normal operation and design alerts around relative change (for example, 2–3x the normal p95 sustained for some duration). The exceptions where a flat value applies are URP=0, ActiveController=1, and OfflinePartitions=0.
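A relative-change rule of this kind can be sketched in a few lines of Python. The function name, the default factor of 2.0, and the three-sample window are illustrative assumptions, and the baseline is assumed to have been measured separately during normal operation.

```python
def sustained_relative_breach(p95_samples, baseline_p95, factor=2.0, min_consecutive=3):
    """True if the most recent `min_consecutive` samples all exceed factor * baseline."""
    if baseline_p95 <= 0:
        raise ValueError("baseline_p95 must be positive")
    if len(p95_samples) < min_consecutive:
        return False
    threshold = factor * baseline_p95
    return all(value > threshold for value in p95_samples[-min_consecutive:])


# Hypothetical p95 readings in ms against a 40 ms baseline (threshold: 80 ms).
breach = sustained_relative_breach([45, 90, 95, 100], baseline_p95=40)     # True
no_breach = sustained_relative_breach([45, 90, 70, 100], baseline_p95=40)  # False
```

Because the threshold is derived from the baseline, the same rule can be reused across clusters with very different normal latencies.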
Why is processing slow even when lag is 0?
Batch commits and uneven processing time can leave offsets advancing while the application internals are stuck. Common causes include max.poll.records being too small, slow serializers or external API waits, and async commits that make the surface metric look better than reality. Measure in-app processing time and throughput in parallel.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.