Kafka Monitoring: Key Metrics and How to Read Them (CCAAK + Production)

2026-04-19
NicheeLab Editorial Team

Kafka is loved for its high throughput and fault tolerance, but without proper monitoring you cannot catch performance regressions or data loss risks. This article covers the core broker and client metrics, how to read them, and the key points of threshold design.

The content focuses on stable concepts from the Apache Kafka and Confluent documentation, picking metrics that do not depend heavily on version differences. Points that the CCAAK exam likes to test are also called out.

The Monitoring Architecture at a Glance

Kafka brokers expose metrics through JMX. The common pattern is to convert them into an HTTP endpoint with the JMX Exporter (a Java Agent), scrape with Prometheus, visualize in Grafana, and alert via Alertmanager. In Confluent environments, Control Center and the Metrics API (Confluent Cloud) are also available.

Separate collection paths and responsibilities, and design freshness (scrape interval) and granularity (average vs. percentile) per use case. The rule of thumb in production is high frequency for latency and lower frequency for capacity metrics.

  • Source: Kafka broker JMX metrics and client-side (producer/consumer) metrics
  • Conversion: JMX Exporter (jmx_prometheus_javaagent)
  • Scrape/storage: Prometheus (watch the scrape_interval)
  • Visualization/alerts: Grafana and Alertmanager
  • Managed: Confluent Control Center / Confluent Cloud Metrics API

Typical flow for collecting Kafka metrics

[Diagram: Producers / Kafka Clients / Consumers (client metrics) and Kafka Brokers (JMX + Exporter) → HTTP /metrics → Prometheus → Grafana and Alertmanager]

JMX Exporter + Prometheus configuration example (excerpt)

# Example Kafka broker startup option (systemd, etc.)
# Enables the JMX Exporter as a Java Agent
Environment="KAFKA_OPTS=-javaagent:/opt/jmx/jmx_prometheus_javaagent.jar=7071:/opt/jmx/kafka.yml"

# Example Prometheus scrape configuration
scrape_configs:
  - job_name: 'kafka-brokers'
    scrape_interval: 15s
    static_configs:
      - targets: ['broker-1:7071','broker-2:7071','broker-3:7071']
    relabel_configs:
      - source_labels: [__address__]
        regex: '([^:]+):.*'
        target_label: instance
        replacement: '$1'

Core Broker Metrics: Meaning and Healthy Ranges

Start with the metrics that directly touch availability and consistency. UnderReplicatedPartitions (URP), OfflinePartitionsCount, and ActiveControllerCount are the canonical primary alerting signals. Throughput metrics (BytesIn/Out, MessagesIn) are used for capacity planning and bottleneck analysis.

Request latency is not a single metric: it is the sum of queue wait, network, disk I/O, and more. The right pattern for both the exam and the field is to break down RequestMetrics by request type (Produce, FetchConsumer, FetchFollower).

  • Availability: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
  • Availability: kafka.controller:type=KafkaController,name=ActiveControllerCount (each node reports 0 or 1; the sum across the cluster is exactly 1 when healthy)
  • Availability: kafka.controller:type=KafkaController,name=OfflinePartitionsCount (0 is healthy)
  • Throughput: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec/BytesOutPerSec
  • Latency breakdown: kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce/Fetch
  • Load indicator: kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent

| Metric | Meaning | Healthy range | Example alert condition |
|---|---|---|---|
| UnderReplicatedPartitions | Number of partitions with fewer in-sync replicas than assigned replicas | Always 0 | > 0 sustained (e.g. 1 minute or more) |
| ActiveControllerCount | Number of active controllers (cluster-wide sum) | Always 1 | != 1 |
| OfflinePartitionsCount | Number of offline partitions (no active leader) | Always 0 | >= 1 |
| Request TotalTimeMs p95 (Produce) | Total latency of produce requests | Workload-dependent (e.g. < 50–100 ms) | Sustained at 2–3x the normal level |
| BytesInPerSec/BytesOutPerSec | I/O throughput | Baseline for capacity planning | Sharp rate of change (e.g. ±50% in 5 minutes) |
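
As a rough illustration of the three primary signals in the table, the sketch below evaluates one round of scraped values. The `BrokerSample` shape and the `primary_alerts` helper are hypothetical names introduced here, not part of Kafka or Prometheus, and the "sustained for N minutes" condition is deliberately left out to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class BrokerSample:
    """One scrape of the primary gauges from a single broker (hypothetical shape)."""
    broker: str
    under_replicated_partitions: int
    active_controller_count: int
    offline_partitions_count: int


def primary_alerts(samples: list[BrokerSample]) -> list[str]:
    """Return alert messages for the cluster-wide primary signals."""
    alerts = []

    # URP is reported per broker (for partitions it leads), so sum it.
    urp = sum(s.under_replicated_partitions for s in samples)
    if urp > 0:
        alerts.append(f"UnderReplicatedPartitions={urp} (expected 0)")

    # Each node reports 0 or 1; the cluster-wide sum must be exactly 1.
    controllers = sum(s.active_controller_count for s in samples)
    if controllers != 1:
        alerts.append(f"ActiveControllerCount={controllers} (expected 1)")

    # Take the maximum so the value is not double-counted if several
    # nodes happen to report it.
    offline = max((s.offline_partitions_count for s in samples), default=0)
    if offline > 0:
        alerts.append(f"OfflinePartitionsCount={offline} (expected 0)")

    return alerts


if __name__ == "__main__":
    cluster = [
        BrokerSample("broker-1", 0, 1, 0),
        BrokerSample("broker-2", 2, 0, 0),
        BrokerSample("broker-3", 0, 0, 0),
    ]
    for message in primary_alerts(cluster):
        print(message)
```

Summing ActiveControllerCount across nodes is what catches both the "no controller" (0) and "split brain" (2 or more) cases with a single `!= 1` condition.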

Decomposing Latency and Throughput

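Kafka's RequestMetrics lets you split TotalTimeMs into RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, ResponseSendTimeMs, and so on. Growing request-queue time suggests request handler thread starvation or GC pauses, and growing local time points to disk I/O on the leader. Growing remote time means the broker is waiting on something else: follower acknowledgements for acks=all produce requests (slow replication or network), or, for fetch requests, simply waiting for fetch.min.bytes to accumulate within fetch.max.wait.ms.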

Build alert candidates on percentiles (p95/p99) and treat the average as just a dashboard reference. Splitting Produce and Fetch (especially the consumer-facing FetchConsumer) is the iron rule in both the exam and production.

  • Visualize Produce and Fetch on separate axes (do not mix them)
  • Show the TotalTimeMs breakdown side by side to localize the cause
  • Use p95/p99 as the primary signal and the average as a supplement
  • Cross-check the correlation between throughput (BytesIn/Out, MessagesIn) and latency
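
The two ideas above (percentiles as the primary signal, and a side-by-side breakdown to localize the cause) can be sketched in a few lines. This is a minimal illustration, not how Kafka computes its own histograms: `percentile` uses the simple nearest-rank method, `dominant_component` is a hypothetical helper, and the sample numbers are made up. Note also that per-component percentiles do not add up exactly to the TotalTimeMs percentile, so treat the shares as a rough pointer.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def dominant_component(breakdown: dict[str, float]) -> tuple[str, float]:
    """Return the component with the largest share of the summed breakdown."""
    total = sum(breakdown.values())
    if total <= 0:
        raise ValueError("breakdown must contain positive time")
    name = max(breakdown, key=lambda key: breakdown[key])
    return name, breakdown[name] / total


if __name__ == "__main__":
    # Illustrative produce latencies in milliseconds (made-up data).
    produce_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    print("avg:", sum(produce_ms) / len(produce_ms))
    print("p95:", percentile(produce_ms, 95))

    # Illustrative breakdown of a slow Produce request, in milliseconds.
    produce_breakdown = {
        "RequestQueueTimeMs": 2.0,
        "LocalTimeMs": 30.0,
        "RemoteTimeMs": 6.0,
        "ResponseQueueTimeMs": 1.0,
        "ResponseSendTimeMs": 1.0,
    }
    print(dominant_component(produce_breakdown))
```

Running the same breakdown separately for Produce and FetchConsumer keeps the long, intentional waits of fetch requests from hiding a genuine produce-side regression.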

Reading Consumer Lag Correctly

Consumer lag is the difference between each partition's Log End Offset and the Consumer Group's Committed Offset. Momentary spikes are normal, but sustained lag with no downward trend needs action. Settings like auto.commit.interval.ms, max.poll.interval.ms, and max.poll.records directly shape the behavior.

On the CCAAK exam, questions about misreading lag (temporary swelling from batch processing, inflated values from delayed commits, etc.) are standard. Use kafka-consumer-groups to inspect at the Group/Member/Partition level.

  • Align the observation unit to Group x Topic x Partition
  • Check whether an increase-then-decrease cycle exists (from processing cycles)
  • Exceeding max.poll.interval.ms triggers rebalances and tends to worsen lag
  • Judge scaling first by consumer parallelism (capped at the partition count)
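
To make the "momentary spike vs. sustained lag" distinction concrete, here is a small sketch. `partition_lag` follows the definition above (Log End Offset minus Committed Offset); `classify_lag` is a hypothetical heuristic over a time-ordered window of lag samples, and the three labels and the `tolerance` parameter are assumptions made for illustration rather than anything Kafka defines.

```python
def partition_lag(log_end_offset: int, committed_offset: int) -> int:
    """Lag for one partition.

    Clamped at 0 because the two offsets are read at slightly different
    moments and can briefly appear out of order.
    """
    return max(0, log_end_offset - committed_offset)


def classify_lag(samples: list[int], tolerance: int = 0) -> str:
    """Classify a time-ordered window of lag samples for one group.

    'growing'   - every sample is larger than the previous one
    'healthy'   - lag drained to `tolerance` or below at some point
    'sustained' - lag never drained within the window
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    if all(later > earlier for earlier, later in zip(samples, samples[1:])):
        return "growing"
    if min(samples) <= tolerance:
        return "healthy"
    return "sustained"


if __name__ == "__main__":
    print(partition_lag(128400, 128345))
    # Batch-style cycle: lag swells, then drains back near zero.
    print(classify_lag([55, 400, 30, 3], tolerance=10))
    # Lag that never drains and keeps climbing.
    print(classify_lag([100, 250, 600, 1400]))
    # Lag that hovers at a high level without draining.
    print(classify_lag([500, 520, 510, 530]))
```

A window-based check like this is what keeps a batch consumer's normal sawtooth pattern from paging someone, while still catching lag that only ever moves upward.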

Checking lag with kafka-consumer-groups (example)

# Typical command
$ kafka-consumer-groups --bootstrap-server broker-1:9092 \
  --group analytics-app --describe

GROUP           TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID        HOST            CLIENT-ID
analytics-app   events     0          128345          128400          55    consumer-1-...     /10.0.0.21      analytics-1
analytics-app   events     1          127998          128001          3     consumer-2-...     /10.0.0.22      analytics-2

# Tips for interpretation
# - LAG fluctuates but converges on average → within the normal range
# - LAG grows cumulatively or stays fixed → insufficient throughput or a stall
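
If you want to feed this output into a script, a minimal parser might look like the sketch below. It assumes the column layout shown in the sample above and maps columns by header name; the exact columns can differ between Kafka versions, so `parse_describe` and `total_lag_by_topic` are hypothetical helpers to adapt, not a stable interface. A `-` in the LAG column (no committed offset yet) is treated as unknown.

```python
from collections import defaultdict
from typing import Optional


def parse_describe(output: str) -> list[dict]:
    """Parse `kafka-consumer-groups --describe` text into per-partition rows."""
    rows = []
    header: Optional[list[str]] = None
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "GROUP":
            header = fields  # remember column names from the header row
            continue
        if header is None or len(fields) != len(header):
            continue  # skip warnings or lines that do not match the header
        row = dict(zip(header, fields))
        lag = row["LAG"]
        rows.append({
            "group": row["GROUP"],
            "topic": row["TOPIC"],
            "partition": int(row["PARTITION"]),
            "lag": None if lag == "-" else int(lag),
        })
    return rows


def total_lag_by_topic(rows: list[dict]) -> dict[str, int]:
    """Sum known lag values per topic, ignoring partitions with unknown lag."""
    totals: dict[str, int] = defaultdict(int)
    for row in rows:
        if row["lag"] is not None:
            totals[row["topic"]] += row["lag"]
    return dict(totals)


if __name__ == "__main__":
    sample = """
GROUP           TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID        HOST            CLIENT-ID
analytics-app   events     0          128345          128400          55    consumer-1-...     /10.0.0.21      analytics-1
analytics-app   events     1          127998          128001          3     consumer-2-...     /10.0.0.22      analytics-2
"""
    print(total_lag_by_topic(parse_describe(sample)))
```

Collecting these totals at regular intervals, rather than reading a single snapshot, is what makes the trend visible.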

Storage and Resilience Metrics

Disk I/O and replication behavior tie directly to latency and availability. LogFlushRateAndTimeMs, Produce/FetchRequestPurgatorySize, and IsrShrinksPerSec/IsrExpandsPerSec are the keys to stable operations. Even when URP stays clean, catch a rising IsrShrinks trend early.

Too many partitions also drives up open file counts and metadata processing cost. Watch PartitionCount and LeaderCount regularly to prevent unchecked topic/partition growth.

  • kafka.server:type=ReplicaManager,name=IsrShrinksPerSec / IsrExpandsPerSec
  • kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
  • kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce (and delayedOperation=Fetch)
  • kafka.server:type=ReplicaManager,name=PartitionCount / LeaderCount
  • Also watch OS metrics (disk wait, IOPS, file descriptor exhaustion)
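
One way to catch ISR trouble before URP turns non-zero is to compare how often the ISR shrank and expanded over a window. The sketch below assumes you can read the cumulative event counts behind IsrShrinksPerSec and IsrExpandsPerSec at the start and end of the window; how those counts are exposed depends on your exporter rules. The `isr_churn` function, its labels, and the `flap_threshold` default are hypothetical choices for illustration.

```python
def isr_churn(
    shrinks_start: int,
    shrinks_end: int,
    expands_start: int,
    expands_end: int,
    flap_threshold: int = 10,
) -> str:
    """Classify ISR behavior over one window from cumulative event counts.

    'counter-reset' - a count went backwards (for example, a broker restart)
    'shrinking'     - more shrinks than expands: replicas left and have not returned
    'flapping'      - shrinks are matched by expands, but there are many of them
    'stable'        - little or no ISR movement
    """
    shrinks = shrinks_end - shrinks_start
    expands = expands_end - expands_start
    if shrinks < 0 or expands < 0:
        return "counter-reset"
    if shrinks > expands:
        return "shrinking"
    if shrinks >= flap_threshold:
        return "flapping"
    return "stable"


if __name__ == "__main__":
    print(isr_churn(100, 100, 80, 80))   # no movement
    print(isr_churn(100, 103, 80, 81))   # replicas dropped out and stayed out
    print(isr_churn(100, 120, 80, 100))  # repeated drop-and-rejoin cycles
```

The "flapping" case is the one worth watching: URP can read 0 at every scrape while replicas repeatedly fall behind and catch up between scrapes.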

Alert Design and Key CCAAK Points

Design alerts in three layers: primary availability signals, leading performance-degradation signals, and capacity-pressure trends. Notify immediately on primary signals, use sustained conditions and hysteresis for leading signals, and rely on early detection (trend slope) for capacity.

For exam prep, the meanings and normal values of URP/ActiveController/OfflinePartitions, the RequestMetrics breakdown, correct lag interpretation, and visualization with Confluent tools all come up frequently. In production, taking a baseline and combining it with relative-change alerts reduces false positives.

  • Primary: URP>0, ActiveController!=1, OfflinePartitions>0 (immediate)
  • Leading: p95 Produce/Fetch latency sustained at 2–3x normal
  • Capacity: weekly growth rate of BytesIn/Out, slope of disk usage
  • For lag, first check whether it resolves — do not judge from a snapshot value alone
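
For the leading-signal layer, "sustained conditions and hysteresis" can be sketched as a small state machine. This is only an illustration of the idea, not how Alertmanager or any specific tool implements it: the `SustainedAlert` class and its parameters are hypothetical, and the thresholds in the example (2x a made-up 40 ms baseline to fire, 1.5x to clear) are assumptions.

```python
class SustainedAlert:
    """Fires after `fire_after` consecutive samples above `high`.

    Once firing, it clears only when a sample drops below `low`, so a value
    hovering around `high` does not cause the alert to flip on and off.
    """

    def __init__(self, high: float, low: float, fire_after: int) -> None:
        if low > high:
            raise ValueError("low must not exceed high")
        if fire_after < 1:
            raise ValueError("fire_after must be at least 1")
        self.high = high
        self.low = low
        self.fire_after = fire_after
        self.firing = False
        self._streak = 0

    def observe(self, value: float) -> bool:
        """Feed one sample and return whether the alert is firing afterwards."""
        if self.firing:
            if value < self.low:
                self.firing = False
                self._streak = 0
        elif value > self.high:
            self._streak += 1
            if self._streak >= self.fire_after:
                self.firing = True
        else:
            self._streak = 0
        return self.firing


if __name__ == "__main__":
    baseline_p95_ms = 40.0  # made-up baseline for illustration
    alert = SustainedAlert(
        high=2.0 * baseline_p95_ms,
        low=1.5 * baseline_p95_ms,
        fire_after=3,
    )
    p95_series = [90, 95, 50, 85, 90, 100, 70, 65, 55]
    print([alert.observe(v) for v in p95_series])
```

In this run the first two high samples do not fire because the streak is broken by the 50 ms sample; the alert fires on the third consecutive high sample and stays on through 70 and 65 until the value falls below the lower threshold.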

Check Your Understanding

CCAAK

Question 1

Which metric is the primary signal that most directly indicates an availability risk in a Kafka cluster and should be monitored continuously?

  A. UnderReplicatedPartitions (URP)
  B. A 10% drop in BytesInPerSec
  C. Broker CPU usage at 80%
  D. A rise in p95 of LeaderElectionRateAndTimeMs

Correct answer: A

URP directly indicates that replicas have dropped out of the ISR and fault tolerance is degraded. A drop in BytesIn or a high CPU plateau are indirect factors, and LeaderElection latency matters but is not a steady-state signal. URP is the top priority as the primary indicator.

Frequently Asked Questions

Can I monitor Kafka without Prometheus?

Yes. Kafka exposes metrics via JMX, so you can read them with tools like JConsole or Jolokia. With Confluent you can use Control Center, and with Confluent Cloud the Metrics API exposes the key indicators. That said, Prometheus-based stacks are the common choice because of long-term retention and easier alert design.

Is there a universal threshold value?

No. Workloads vary widely, so the practical approach is to capture a baseline and percentiles during normal operation and design alerts around relative change (for example, 2–3x the normal p95 sustained for some duration). The exceptions where a flat value applies are URP=0, ActiveController=1, and OfflinePartitions=0.

Why is processing slow even when lag is 0?

Batch commits and uneven processing time can leave offsets advancing while the application internals are stuck. Common causes include max.poll.records being too small, slow serializers or external API waits, and async commits that make the surface metric look better than reality. Measure in-app processing time and throughput in parallel.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


© 2026 NicheeLab All rights reserved.