Kafka brokers expose a large set of metrics through JMX. For both the exam and real operations, getting the namespaces and attribute granularity right is critical.
This article covers the MBeans worth relying on, along with threshold design guidance that minimizes false alarms. Details can shift between ZooKeeper mode and KRaft, but the core JMX names and Yammer Metrics attributes stay consistent.
Kafka's JMX is built on Yammer Metrics. Gauges expose Value; Meters expose Count/MeanRate/OneMinuteRate; Timers expose 50thPercentile/95thPercentile/99thPercentile and more. The exam may test the exact spelling of attribute names like 99thPercentile.
The categories that matter most for operations and CCAAK are: availability (ActiveControllerCount, OfflinePartitionsCount), replication health (UnderReplicatedPartitions, IsrShrinksPerSec), throughput (BytesIn/OutPerSec, MessagesInPerSec), latency (RequestMetrics TotalTimeMs), thread headroom (RequestHandlerAvgIdlePercent), and JVM (java.lang:type=Memory, GarbageCollector).
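As a study aid, the naming rules above can be captured in a few lines of Python. This is a minimal sketch: the helper names `object_name` and `check_attribute` are hypothetical, and the attribute lists are a partial selection, not the full set a broker exposes.

```python
# Partial list of attribute names per Yammer metric kind, as seen over JMX.
YAMMER_ATTRIBUTES = {
    "gauge": ["Value"],
    "meter": ["Count", "MeanRate", "OneMinuteRate", "FiveMinuteRate", "FifteenMinuteRate"],
    "timer": ["Count", "Min", "Max", "Mean", "50thPercentile", "95thPercentile", "99thPercentile"],
}


def object_name(domain: str, type_: str, name: str, **tags: str) -> str:
    """Build a JMX object name such as kafka.server:type=ReplicaManager,name=..."""
    parts = [f"type={type_}", f"name={name}"]
    parts += [f"{key}={value}" for key, value in tags.items()]
    return f"{domain}:" + ",".join(parts)


def check_attribute(kind: str, attribute: str) -> bool:
    """Return True if the attribute is in the (partial) list for this metric kind."""
    return attribute in YAMMER_ATTRIBUTES[kind]


print(object_name("kafka.network", "RequestMetrics", "TotalTimeMs", request="Produce"))
print(check_attribute("timer", "p99"))  # the exact spelling is 99thPercentile
```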
Dump core metrics with JmxTool (specify stable attribute names)
kafka-run-class kafka.tools.JmxTool \
--jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
--object-name 'kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec' \
--attributes Count,OneMinuteRate,FiveMinuteRate,MeanRate \
--reporting-interval 3000
UnderReplicatedPartitions is the single most important health signal. If Value > 0 persists, suspect network latency, broker outages, or degraded disk I/O. OfflinePartitionsCount is even more severe and demands immediate recovery.
IsrShrinksPerSec / IsrExpandsPerSec show ISR churn frequency, and spikes can indicate latency, GC pauses, or disk contention. Replica lag is captured via MaxLag-style MBeans on ReplicaFetcherManager (the exact object name varies by version, so build the habit of enumerating JMX on the live broker).
Check URP immediately with jmxterm (Gauge: Value)
# jmxterm example
open localhost:9999
get -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions Value
# Example response: mbean = kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:
# Value = 0
The sum of ActiveControllerCount across the entire cluster must be exactly 1: one node reports 1 and every other node reports 0. Whether you run ZooKeeper mode or KRaft, there is always exactly one active controller (the exposed namespace can vary by implementation, so confirm by enumerating on the live broker).
PreferredReplicaImbalanceCount counts partitions whose leader has drifted off the preferred replica; a persistent increase signals imbalance. The ratio and sudden shifts between PartitionCount and LeaderCount are also useful for health checks.
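The cluster-wide rule for ActiveControllerCount can be sketched as a small check over per-broker readings. This is a minimal sketch that assumes the values have already been collected by some means; the function name `controller_status` and the status strings are hypothetical.

```python
def controller_status(per_broker_counts: dict[str, int]) -> str:
    """Classify cluster state from each broker's ActiveControllerCount value."""
    total = sum(per_broker_counts.values())
    if total == 1:
        return "ok"
    if total == 0:
        return "no-active-controller"
    return "multiple-active-controllers"


# Hypothetical readings from three brokers.
print(controller_status({"broker1": 0, "broker2": 1, "broker3": 0}))
```

Summing first matters: any single broker legitimately reports 0 when it is not the controller, so the check only makes sense on the cluster total.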
Fetch ActiveControllerCount via Jolokia
curl -s http://broker1:8778/jolokia/read/kafka.controller:type=KafkaController,name=ActiveControllerCount | jq .value.Value
BytesInPerSec/BytesOutPerSec and MessagesInPerSec are the foundation for capacity planning and sudden-change detection. Use OneMinuteRate as the primary signal rather than MeanRate to track changes more responsively.
RequestHandlerAvgIdlePercent measures headroom on the request-handler thread pool. Sustained drops (e.g., < 0.2) lead to queue buildup and worse latency. Use the 95thPercentile/99thPercentile of RequestMetrics TotalTimeMs to monitor latency SLOs directly.
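To see why OneMinuteRate reacts faster than MeanRate, the following sketch simulates a traffic jump. It assumes the usual Yammer meter behavior: MeanRate is total events divided by elapsed time, and OneMinuteRate is an exponentially weighted moving average updated on 5-second ticks. The function name `simulate` and the traffic numbers are illustrative assumptions.

```python
import math

TICK_SECONDS = 5.0  # moving averages are updated on 5-second ticks
ONE_MINUTE_ALPHA = 1.0 - math.exp(-TICK_SECONDS / 60.0)


def simulate(events_per_tick: list[float]) -> tuple[float, float]:
    """Return (mean_rate, one_minute_rate) in events/second after all ticks."""
    if not events_per_tick:
        raise ValueError("need at least one tick")
    total = 0.0
    ewma = None
    for events in events_per_tick:
        total += events
        instant = events / TICK_SECONDS
        # The first tick seeds the average; later ticks move it toward the instant rate.
        ewma = instant if ewma is None else ewma + ONE_MINUTE_ALPHA * (instant - ewma)
    mean_rate = total / (len(events_per_tick) * TICK_SECONDS)
    return mean_rate, ewma


# 10 minutes at 100 events/s, then 1 minute at 1000 events/s.
ticks = [500.0] * 120 + [5000.0] * 12
mean_rate, one_minute_rate = simulate(ticks)
print(f"MeanRate={mean_rate:.1f}/s OneMinuteRate={one_minute_rate:.1f}/s")
```

One minute after the jump, the lifetime mean has moved to roughly 182 events/s while the one-minute average is already near 669 events/s, which is why the latter is the better alerting input.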
Path from client request to log append and replication, with the key MBeans
Clients --> [NetworkProcessor] --> [RequestQueue] --> [RequestHandler]
                   |                                         |
                   v                                         v
             RequestMetrics                          BrokerTopicMetrics
                                                             |
                                                             v
                                                       [Log Append]
                                                             |
                                                             v
                                          ReplicaFetcherManager --> Followers
Monitoring points:
- RequestHandlerAvgIdlePercent (kafka.server:KafkaRequestHandlerPool)
- TotalTimeMs (kafka.network:RequestMetrics)
- BytesIn/OutPerSec, MessagesInPerSec (kafka.server:BrokerTopicMetrics)
- MaxLag (kafka.server:ReplicaFetcherManager)
Check latency and headroom with jmxterm
open localhost:9999
# 99th-percentile Produce latency
get -b kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce 99thPercentile
# Handler headroom (0.0 to 1.0); this MBean is a Meter, so read OneMinuteRate rather than Value
get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate
Track JVM heap health via HeapMemoryUsage on java.lang:type=Memory. If used/committed stays above 0.8, suspect GC pressure or OOM risk. Use CollectionTime/CollectionCount on GarbageCollector as an auxiliary signal for GC load.
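The used/committed check on HeapMemoryUsage can be sketched as follows. This is a minimal sketch assuming the composite value has already been read into a dict with `used` and `committed` keys in bytes; the function names and the sample numbers are hypothetical.

```python
def heap_usage_ratio(heap: dict[str, int]) -> float:
    """Compute used / committed from a HeapMemoryUsage composite (bytes)."""
    committed = heap["committed"]
    if committed <= 0:
        raise ValueError("committed must be positive")
    return heap["used"] / committed


def heap_under_pressure(samples: list[dict[str, int]], threshold: float = 0.8) -> bool:
    """True only if every sample exceeds the threshold, i.e. usage stays high."""
    return bool(samples) and all(heap_usage_ratio(s) > threshold for s in samples)


# Hypothetical readings taken a minute apart on a 4 GiB heap.
GIB = 1024 ** 3
readings = [
    {"used": int(3.5 * GIB), "committed": 4 * GIB},
    {"used": int(3.6 * GIB), "committed": 4 * GIB},
]
print(heap_under_pressure(readings))
```

Requiring every sample to exceed the threshold avoids flagging the normal sawtooth pattern, where usage climbs and then drops after each collection.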
Latency for log writes and flushes is visible via the percentiles on kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs. Correlating with OS-level disk utilization makes root-cause isolation much easier.
Retrieve JVM heap state via JMX
open localhost:9999
get -b java.lang:type=Memory HeapMemoryUsage
# Use the returned used and committed values to compute the usage ratio
Availability-critical metrics (OfflinePartitionsCount, ActiveControllerCount) warrant static, immediate thresholds. Traffic and latency have business-cycle rhythms, so combining them with baselines (moving averages and quantiles) reduces false positives.
With JMX Exporter (Prometheus), map Yammer attribute names directly to time series. Tie Timer 99thPercentile to your SLA and use duration conditions to keep paging noise down.
| Approach | Setup Difficulty | Strengths | Typical Metrics |
|---|---|---|---|
| Static thresholds | Low | Simple and immediate | ActiveControllerCount, OfflinePartitionsCount, URP |
| Baseline deviation | Medium | Lower false positives and seasonality-aware | BytesIn/OutPerSec, MessagesInPerSec, p99 latency |
| SLO / error-budget based | Medium-high | Directly tied to business impact | RequestMetrics TotalTimeMs p99, error rates (timeouts, etc.) |
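The baseline-deviation row in the table can be illustrated with a simple mean and standard-deviation band. This is a sketch under assumptions: the function name `deviates_from_baseline`, the multiplier `k`, and the sample values are hypothetical, and a real setup would usually also account for time-of-day seasonality.

```python
import statistics


def deviates_from_baseline(history: list[float], current: float, k: float = 3.0) -> bool:
    """Flag the current value if it lies more than k standard deviations from the mean."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return abs(current - mean) > k * stdev


# Hypothetical recent OneMinuteRate samples for MessagesInPerSec.
recent = [100.0, 102.0, 98.0, 101.0, 99.0]
print(deviates_from_baseline(recent, 103.0))  # within the band
print(deviates_from_baseline(recent, 150.0))  # well outside the band
```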
Minimal example of Prometheus rules and JMX Exporter config
# Excerpt of jmx_exporter (jmx_prometheus_javaagent) rules
rules:
  - pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
    name: kafka_under_replicated_partitions
    type: GAUGE
  - pattern: 'kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value'
    name: kafka_active_controller_count
    type: GAUGE
  - pattern: 'kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(.+)><>(99thPercentile)'
    name: kafka_request_latency_p99
    labels:
      request: "$1"
    type: GAUGE
# Alerting rules (Prometheus rule file; these entries go under groups[].rules)
- alert: KafkaURPCritical
  expr: kafka_under_replicated_partitions > 0
  for: 5m
  labels: {severity: critical}
  annotations:
    summary: "URP has persisted for 5 minutes or more"
# Sum across brokers: each non-controller broker reports 0 on its own
- alert: KafkaActiveControllerAnomaly
  expr: sum(kafka_active_controller_count) != 1
  for: 1m
  labels: {severity: critical}
  annotations:
    summary: "Cluster-wide ActiveControllerCount is not 1"
- alert: KafkaProduceLatencyHigh
  expr: kafka_request_latency_p99{request="Produce"} > 200
  for: 10m
  labels: {severity: warning}
  annotations:
    summary: "Produce p99 latency has exceeded 200 ms for 10 minutes"
CCAAK
Question 1
You want to detect Kafka cluster availability degradation as quickly as possible. Which MBean/attribute combination should be monitored most directly and with the highest priority?
Correct answer: A
ActiveControllerCount and OfflinePartitionsCount tie directly to availability (leader election and partition health). Throughput, latency, and JVM metrics matter, but they are secondary signals for availability.
Do MBean names change between ZooKeeper mode and KRaft?
Most server-side MBeans (BrokerTopicMetrics, ReplicaManager, etc.) are shared between modes. Controller-related object names can differ by implementation, so enumerate JMX on the live broker to confirm. For the exam, the key point is the meaning of ActiveControllerCount (it must always be 1).
Are Timer percentile attribute names 99thPercentile rather than p99?
Yes. Kafka's Yammer Metrics use attribute names like 99thPercentile, 95thPercentile, and 50thPercentile. As with MeanRate/OneMinuteRate, memorizing the exact spelling helps you avoid traps on the exam.
URP briefly spikes to 1. Should I page on it?
Brief leader transfers and transient network throttling can cause momentary fluctuations. In production, add a duration condition (e.g., warn if >1-2 min, critical if >5 min) to suppress false positives.
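The duration condition described in this answer can be sketched as a tiered check over timestamped URP samples. This is a minimal sketch: the function name `urp_severity`, the status strings, and the default thresholds of 120 and 300 seconds are assumptions chosen to match the example ranges above.

```python
def urp_severity(
    samples: list[tuple[float, int]],
    warn_after: float = 120.0,
    crit_after: float = 300.0,
) -> str:
    """Classify URP from (unix_seconds, value) samples in ascending time order."""
    breach_start = None
    for timestamp, value in samples:
        if value > 0:
            if breach_start is None:
                breach_start = timestamp  # start of the current continuous breach
        else:
            breach_start = None  # any healthy sample resets the timer
    if breach_start is None:
        return "ok"
    duration = samples[-1][0] - breach_start
    if duration > crit_after:
        return "critical"
    if duration > warn_after:
        return "warning"
    return "pending"


# A momentary blip that recovers: no page.
print(urp_severity([(0, 0), (30, 1), (60, 0), (90, 0)]))
```

Resetting the timer on every healthy sample is what suppresses brief blips, much like the `for:` clause in a Prometheus alerting rule.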
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.