Kafka JMX Metrics: Essential MBeans and Threshold Design Guide

2026-04-19
NicheeLab Editorial Team

Kafka brokers expose a large set of metrics through JMX. For both the exam and real operations, getting the namespaces and attribute granularity right is critical.

This article covers the MBeans you can rely on for both exams and operations, along with threshold design guidance that minimizes false alarms. Details can shift between ZooKeeper mode and KRaft, but the core JMX names and Yammer metrics attributes stay consistent.

Key MBean Overview and Metric Categories

Kafka's JMX is built on Yammer Metrics. Gauges expose Value; Meters expose Count/MeanRate/OneMinuteRate (plus FiveMinuteRate and FifteenMinuteRate); Histograms and Timers expose 50thPercentile/95thPercentile/99thPercentile and more. The exam may test the exact spelling of attribute names such as 99thPercentile.

The categories that matter most for operations and CCAAK are: availability (ActiveControllerCount, OfflinePartitionsCount), replication health (UnderReplicatedPartitions, IsrShrinksPerSec), throughput (BytesIn/OutPerSec, MessagesInPerSec), latency (RequestMetrics TotalTimeMs), thread headroom (RequestHandlerAvgIdlePercent), and JVM (java.lang:type=Memory, GarbageCollector).

  • kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions (Gauge: Value)
  • kafka.controller:type=KafkaController,name=ActiveControllerCount (Gauge: Value)
  • kafka.controller:type=KafkaController,name=OfflinePartitionsCount (Gauge: Value)
  • kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec|BytesOutPerSec|MessagesInPerSec (Meter: Count, OneMinuteRate, etc.)
  • kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent (Meter: OneMinuteRate, etc.; read as an idle fraction from 0.0 to 1.0)
  • kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce|FetchConsumer (Histogram: 95thPercentile, 99thPercentile, etc.)
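
Because getting the namespace and key properties right matters so much, it can help to split an object name into its parts before comparing it with what the live broker enumerates. The following is a minimal sketch, not an official parser: the function name parse_object_name is hypothetical, and it assumes simple unquoted names like the ones listed above (quoted values containing commas are not handled).

```python
def parse_object_name(object_name: str) -> tuple[str, dict[str, str]]:
    """Split 'domain:key=value,key=value' into (domain, properties).

    Assumption: values are unquoted and contain no commas, which holds
    for the MBeans listed in this article.
    """
    domain, sep, props = object_name.partition(":")
    if not sep or not domain or not props:
        raise ValueError(f"not a JMX object name: {object_name!r}")

    properties: dict[str, str] = {}
    for pair in props.split(","):
        key, eq, value = pair.partition("=")
        if not eq:
            raise ValueError(f"malformed key property: {pair!r}")
        properties[key.strip()] = value.strip()
    return domain, properties


if __name__ == "__main__":
    domain, props = parse_object_name(
        "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce"
    )
    print(domain, props)
```

Comparing the parsed domain, type, and name against the enumerated MBeans catches the most common mistake, which is attaching a metric to the wrong namespace (for example kafka.server instead of kafka.controller).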

Dump core metrics with JmxTool (specify stable attribute names)

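# Polls every 3 s until interrupted with Ctrl+C.
# Apache Kafka tarballs name the script kafka-run-class.sh; newer releases ship the class as org.apache.kafka.tools.JmxTool.
kafka-run-class kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name 'kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec' \
  --attributes Count,OneMinuteRate,FiveMinuteRate,MeanRate \
  --reporting-interval 3000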

Replication Health: Thresholds and Operational Actions

UnderReplicatedPartitions is the single most important health signal. If Value > 0 persists, suspect network latency, broker outages, or degraded disk I/O. OfflinePartitionsCount is even more severe and demands immediate recovery.

IsrShrinksPerSec / IsrExpandsPerSec show ISR churn frequency, and spikes can indicate latency, GC pauses, or disk contention. Replica lag is captured via MaxLag-style MBeans on ReplicaFetcherManager (the exact object name varies by version, so build the habit of enumerating JMX on the live broker).

  • UnderReplicatedPartitions: warn if > 0 for 1-2 min, crit if > 0 for 5+ min
  • OfflinePartitionsCount: page immediately on crit > 0
  • IsrShrinksPerSec: warn on spikes 3x or more above baseline
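  • ReplicaFetcherManager MaxLag: reported in messages, not seconds, so convert topic SLAs through the expected produce rate (e.g., warn at the volume produced in 10 s, crit at the volume produced in 60 s)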
  • Use duration windows to filter noise (exclude brief leader transfers)
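
The duration-window idea above can be sketched in a few lines. This is an illustrative example only: the function name classify_urp and the constants are hypothetical, the warn boundary uses the upper end of the "1-2 min" range, and samples are assumed to arrive as (unix_seconds, value) pairs ordered oldest first.

```python
WARN_AFTER_S = 120  # assumption: upper end of the "1-2 min" warn window
CRIT_AFTER_S = 300  # "5+ min" critical window


def classify_urp(samples):
    """Return 'ok', 'warn', or 'crit' for UnderReplicatedPartitions samples.

    samples: iterable of (unix_seconds, value), oldest first.
    Only the current uninterrupted run of value > 0 counts, so a brief
    leader transfer that clears on its own never escalates.
    """
    breach_start = None
    last_ts = None
    for ts, value in samples:
        if value > 0:
            if breach_start is None:
                breach_start = ts  # a new breach begins here
        else:
            breach_start = None  # the breach cleared; reset the timer
        last_ts = ts

    if breach_start is None:
        return "ok"
    elapsed = last_ts - breach_start
    if elapsed >= CRIT_AFTER_S:
        return "crit"
    if elapsed >= WARN_AFTER_S:
        return "warn"
    return "ok"


if __name__ == "__main__":
    print(classify_urp([(0, 1), (60, 1), (120, 1)]))
```

Resetting the timer whenever the value returns to 0 is the design choice that filters noise: the alert tracks how long the condition has held continuously, not how often it has occurred.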

Check URP immediately with jmxterm (Gauge: Value)

# jmxterm example
open localhost:9999
get -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions Value
# Example response: mbean = kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:
# Value = 0

Monitoring Controller and Partition State

ActiveControllerCount must sum to exactly 1 across the entire cluster: the node currently acting as controller reports 1, and every other node that exposes the metric reports 0. Whether you run ZooKeeper mode or KRaft, there is always exactly one active controller (the exposed namespace can vary by implementation, so confirm by enumerating on the live broker).

PreferredReplicaImbalanceCount counts partitions whose leader has drifted off the preferred replica; a persistent increase signals imbalance. The ratio and sudden shifts between PartitionCount and LeaderCount are also useful for health checks.

  • ActiveControllerCount: a cluster-wide sum other than 1 is critical (0 or multiple requires immediate recovery)
  • OfflinePartitionsCount: continuously monitor that it stays at 0
  • PreferredReplicaImbalanceCount: warn on continuous increase; consider reassignment or rebalancing

Fetch ActiveControllerCount via Jolokia

curl -s http://broker1:8778/jolokia/read/kafka.controller:type=KafkaController,name=ActiveControllerCount | jq .value.Value
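
Since the check is cluster-wide, the per-node values need to be added up before they are judged. The sketch below assumes each node's Jolokia read response has already been fetched as a JSON string with the shape the jq filter above relies on (a value object holding Value, plus a status field). The function names and status strings are hypothetical.

```python
import json


def active_controller_count(jolokia_body: str) -> int:
    """Extract ActiveControllerCount from one Jolokia read response body."""
    payload = json.loads(jolokia_body)
    if payload.get("status") != 200:
        raise RuntimeError(f"Jolokia read failed: {payload.get('error')}")
    return int(payload["value"]["Value"])


def controller_status(counts) -> str:
    """Judge the cluster from the per-node counts (each expected to be 0 or 1)."""
    total = sum(counts)
    if total == 1:
        return "ok"
    if total == 0:
        return "no-active-controller"
    return "multiple-active-controllers"


if __name__ == "__main__":
    # Hand-written sample bodies standing in for three nodes' responses.
    bodies = [
        '{"value": {"Value": 1}, "status": 200}',
        '{"value": {"Value": 0}, "status": 200}',
        '{"value": {"Value": 0}, "status": 200}',
    ]
    print(controller_status(active_controller_count(b) for b in bodies))
```

Judging the sum rather than each node's value is what keeps the non-controller nodes, which legitimately report 0, from raising alarms.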

Throughput and Latency: BrokerTopicMetrics and RequestMetrics

BytesInPerSec/BytesOutPerSec and MessagesInPerSec are the foundation for capacity planning and sudden-change detection. Use OneMinuteRate as the primary signal rather than MeanRate to track changes more responsively.

RequestHandlerAvgIdlePercent measures headroom on the request-handler thread pool. Sustained drops (e.g., < 0.2) lead to queue buildup and worse latency. Use the 95thPercentile/99thPercentile of RequestMetrics TotalTimeMs to monitor latency SLOs directly.

  • kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec|BytesOutPerSec (OneMinuteRate)
  • kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec (OneMinuteRate)
  • kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent (warn if OneMinuteRate < 0.2)
  • kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce|FetchConsumer (tie 99thPercentile to your SLO)
  • kafka.network:type=Processor,name=IdlePercent,networkProcessor=0..N (network processing headroom)
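
To see why OneMinuteRate reacts faster than MeanRate, it helps to model both. The sketch below is a simplified model, not Kafka's code: it assumes a Yammer-style exponentially weighted moving average updated on a 5-second tick with a one-minute decay constant, and it treats MeanRate as the plain average since startup. All names are hypothetical.

```python
import math

TICK_S = 5.0  # assumption: the meter is ticked every 5 seconds
ALPHA_1M = 1.0 - math.exp(-TICK_S / 60.0)  # one-minute decay factor


def one_minute_rate(per_tick_rates):
    """EWMA over per-tick rates; the first sample seeds the average."""
    rate = None
    for r in per_tick_rates:
        rate = r if rate is None else rate + ALPHA_1M * (r - rate)
    return rate


def mean_rate(per_tick_rates):
    """Lifetime average, which is how MeanRate behaves over a long uptime."""
    return sum(per_tick_rates) / len(per_tick_rates)


if __name__ == "__main__":
    # One hour at 100 msg/s, then two minutes at 1000 msg/s.
    ticks = [100.0] * 720 + [1000.0] * 24
    print(round(one_minute_rate(ticks), 1), round(mean_rate(ticks), 1))
```

In this scenario the modeled one-minute rate has climbed to roughly 878 msg/s two minutes after the jump, while the lifetime mean is still near 129 msg/s, which is why MeanRate is a poor trigger for sudden-change detection.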

Path from client request to log append and replication, with the key MBeans

Clients --> [NetworkProcessor] --> [RequestQueue] --> [RequestHandler]
                                       |                        |
                                       v                        v
                                   RequestMetrics         BrokerTopicMetrics
                                                                |
                                                                v
                                                          [Log Append]
                                                                |
                                                                v
                                                   ReplicaFetcherManager --> Followers

Monitoring points:
- RequestHandlerAvgIdlePercent (kafka.server:KafkaRequestHandlerPool)
- TotalTimeMs (kafka.network:RequestMetrics)
- BytesIn/OutPerSec, MessagesInPerSec (kafka.server:BrokerTopicMetrics)
- MaxLag (kafka.server:ReplicaFetcherManager)

Check latency and headroom with jmxterm

open localhost:9999
# 99th-percentile produce latency
get -b kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce 99thPercentile
# Handler headroom (0.0 to 1.0); this MBean is a Meter, so read OneMinuteRate rather than Value
get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate

Detecting JVM and Disk I/O Bottlenecks

Track JVM heap health via HeapMemoryUsage on java.lang:type=Memory. If used/committed stays above 0.8, suspect GC pressure or OOM risk. Use CollectionTime/CollectionCount on GarbageCollector as an auxiliary signal for GC load.

Latency for log writes and flushes is visible via the percentiles on kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs. Correlating with OS-level disk utilization makes root-cause isolation much easier.

  • java.lang:type=Memory HeapMemoryUsage.used/committed > 0.8 sustained for 5 min: warn
  • java.lang:type=GarbageCollector,name=G1 Old Generation CollectionTime: warn on sudden surges
  • kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs 99thPercentile: warn at 2x baseline or higher

Retrieve JVM heap state via JMX

open localhost:9999
get -b java.lang:type=Memory HeapMemoryUsage
# Use the returned used and committed values to compute utilization
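
The utilization calculation and the "sustained for 5 min" condition can be sketched as follows. This is a minimal illustration: the function names are hypothetical, each sample is assumed to be a (unix_seconds, HeapMemoryUsage) pair where HeapMemoryUsage is a dict with used and committed keys, and the check deliberately refuses to fire until a full window of history is available.

```python
def heap_utilization(heap_usage: dict) -> float:
    """Return used / committed for one HeapMemoryUsage reading."""
    committed = heap_usage["committed"]
    if committed <= 0:
        raise ValueError("committed must be positive")
    return heap_usage["used"] / committed


def heap_warn(samples, threshold: float = 0.8, window_s: int = 300) -> bool:
    """True when every sample in the trailing window exceeds the threshold.

    samples: list of (unix_seconds, heap_usage_dict), oldest first.
    """
    if not samples:
        return False
    window_start = samples[-1][0] - window_s
    if samples[0][0] > window_start:
        return False  # not enough history to call it "sustained"
    recent = [usage for ts, usage in samples if ts >= window_start]
    return all(heap_utilization(u) > threshold for u in recent)


if __name__ == "__main__":
    high = {"used": 900, "committed": 1000}
    print(heap_warn([(t, high) for t in range(0, 301, 60)]))
```

Requiring every sample in the window to exceed the threshold means a single post-GC dip resets the condition, which keeps normal sawtooth heap behavior from triggering the warning.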

Alert Design: Combining Static Thresholds and Baselines

Availability-critical metrics (OfflinePartitionsCount, ActiveControllerCount) warrant static, immediate thresholds. Traffic and latency have business-cycle rhythms, so combining them with baselines (moving averages and quantiles) reduces false positives.

With JMX Exporter (Prometheus), map Yammer attribute names directly to time series. Tie the 99thPercentile of the latency metrics to your SLO and use duration conditions to keep paging noise down.

  • Fatal signals: static thresholds with immediate paging (OfflinePartitionsCount > 0, ActiveControllerCount != 1)
  • Performance signals: baseline deviation plus duration (p99 latency, IdlePercent drops)
  • Use correlation rules to infer cause (rising URP plus dropping network Idle suggests a bandwidth bottleneck)

Approach                | Setup Difficulty | Strengths                                   | Typical Metrics
Static thresholds       | Low              | Simple and immediate                        | ActiveControllerCount, OfflinePartitionsCount, URP
Baseline deviation      | Medium           | Lower false positives and seasonality-aware | BytesIn/OutPerSec, MessagesInPerSec, p99 latency
SLO / error budget tied | Medium-high      | Directly tied to business impact            | RequestMetrics TotalTimeMs p99, error rates (timeouts, etc.)

Minimal example of Prometheus rules and JMX Exporter config

# Excerpt of jmx_exporter (jmx_prometheus_javaagent) rules
rules:
- pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
  name: kafka_under_replicated_partitions
  type: GAUGE
- pattern: 'kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value'
  name: kafka_active_controller_count
  type: GAUGE
- pattern: 'kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(.+)><>(99thPercentile)'
  name: kafka_request_latency_p99
  labels:
    request: "$1"
  type: GAUGE

# alerting rules (Prometheus rule file, kept separate from the exporter config; the group name is a placeholder)
groups:
- name: kafka-core
  rules:
  - alert: KafkaURPCritical
    expr: kafka_under_replicated_partitions > 0
    for: 5m
    labels: {severity: critical}
    annotations:
      summary: "URP has persisted for more than 5 minutes"
  # Sum across nodes: only the active controller reports 1, the rest report 0
  - alert: KafkaActiveControllerAnomaly
    expr: sum(kafka_active_controller_count) != 1
    for: 1m
    labels: {severity: critical}
    annotations:
      summary: "Cluster-wide ActiveControllerCount is not 1"
  - alert: KafkaProduceLatencyHigh
    expr: kafka_request_latency_p99{request="Produce"} > 200
    for: 10m
    labels: {severity: warning}
    annotations:
      summary: "Produce p99 latency has exceeded 200 ms for 10 minutes"

Check Your Understanding

CCAAK

Question 1

You want to detect Kafka cluster availability degradation as quickly as possible. Which MBean/attribute combination should be monitored most directly and with the highest priority?

  1. Value of kafka.controller:type=KafkaController,name=ActiveControllerCount and Value of kafka.controller:type=KafkaController,name=OfflinePartitionsCount
  2. OneMinuteRate of kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec and MeanRate of MessagesInPerSec
  3. 50thPercentile and 95thPercentile of kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce
  4. HeapMemoryUsage.used of java.lang:type=Memory and CollectionTime of GarbageCollector

Correct answer: 1

ActiveControllerCount and OfflinePartitionsCount tie directly to availability (leader election and partition health). Throughput, latency, and JVM metrics matter, but they are secondary signals for availability.

Frequently Asked Questions

Do MBean names change between ZooKeeper mode and KRaft?

Most server-side MBeans (BrokerTopicMetrics, ReplicaManager, etc.) are shared between modes. Controller-related object names can differ by implementation, so enumerate JMX on the live broker to confirm. For the exam, the key point is the meaning of ActiveControllerCount (it must always be 1).

Are Timer percentile attribute names 99thPercentile rather than p99?

Yes. Kafka's Yammer Metrics use attribute names like 99thPercentile, 95thPercentile, and 50thPercentile. As with MeanRate/OneMinuteRate, memorizing the exact spelling helps you avoid traps on the exam.

URP briefly spikes to 1. Should I page on it?

Brief leader transfers and transient network throttling can cause momentary fluctuations. In production, add a duration condition (e.g., warn if >1-2 min, critical if >5 min) to suppress false positives.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


© 2026 NicheeLab All rights reserved.