Kafka Monitoring: Key Metrics & Alerts (2026)

Kafka is loved for its high throughput and fault tolerance, but without proper monitoring you cannot catch performance regressions or data loss risks. This article covers the core broker and client metrics, how to read them, and the key points of threshold design.

The content focuses on stable concepts from the Apache Kafka and Confluent documentation, picking metrics that do not depend heavily on version differences. Points that the CCAAK exam likes to test are also called out.

The Monitoring Architecture at a Glance

Kafka brokers expose metrics through JMX. The common pattern is to convert them into an HTTP endpoint with the JMX Exporter (a Java Agent), scrape with Prometheus, visualize in Grafana, and alert via Alertmanager. In Confluent environments, Control Center and the Metrics API (Confluent Cloud) are also available.

Separate collection paths and responsibilities, and design freshness (scrape interval) and granularity (average vs. percentile) per use case. The rule of thumb in production is high frequency for latency and lower frequency for capacity metrics.

Source: Kafka broker JMX and client (producer/consumer) client metrics
Conversion: JMX Exporter (jmx_prometheus_javaagent)
Scrape/storage: Prometheus (watch the scrape_interval)
Visualization/alerts: Grafana and Alertmanager
Managed: Confluent Control Center / Confluent Cloud Metrics API

Typical flow for collecting Kafka metrics

JMX Exporter + Prometheus configuration example (excerpt)

# Example Kafka broker startup option (systemd or similar)
# Enable the JMX Exporter as a Java agent
Environment="KAFKA_OPTS=-javaagent:/opt/jmx/jmx_prometheus_javaagent.jar=7071:/opt/jmx/kafka.yml"

# Example Prometheus scrape configuration
scrape_configs:
  - job_name: 'kafka-brokers'
    scrape_interval: 15s
    static_configs:
      - targets: ['broker-1:7071','broker-2:7071','broker-3:7071']
    relabel_configs:
      - source_labels: [__address__]
        regex: '([^:]+):.*'
        target_label: instance
        replacement: '$1'
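
Once the agent is running, a quick sanity check is to read the exporter's text output and confirm that the series you plan to alert on actually exist. The following is a minimal sketch that parses a sample of the Prometheus text exposition format. The metric names in it are assumptions for illustration: real names depend on the rules in your kafka.yml.

```python
import re

# Sample of what an exporter endpoint might return. The metric names are
# hypothetical: real names depend on the rules in the JMX Exporter config.
SAMPLE = """\
# HELP kafka_server_replicamanager_underreplicatedpartitions Under-replicated partitions
# TYPE kafka_server_replicamanager_underreplicatedpartitions gauge
kafka_server_replicamanager_underreplicatedpartitions 0.0
kafka_controller_kafkacontroller_activecontrollercount 1.0
kafka_server_brokertopicmetrics_bytesinpersec_total{topic="events"} 12345.0
"""

# name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text):
    """Return {(name, labels): value} for each sample line, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[(name, labels or "")] = float(value)
    return samples


def missing_metrics(samples, required_names):
    """List required metric names that have no sample at all."""
    present = {name for name, _ in samples}
    return [name for name in required_names if name not in present]
```

Running missing_metrics against each broker after a config change is a cheap way to catch a rule that silently stopped matching.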

Core Broker Metrics: Meaning and Healthy Ranges

Start with the metrics that directly touch availability and consistency. UnderReplicatedPartitions (URP), OfflinePartitionsCount, and ActiveControllerCount are the canonical primary alerting signals. Throughput metrics (BytesIn/Out, MessagesIn) are used for capacity planning and bottleneck analysis.

Request latency is not a single metric: it is the sum of queue wait, local processing, waiting on other brokers, and response send time. The right pattern for both the exam and production is to break down RequestMetrics by request type (Produce, FetchConsumer, FetchFollower).

Availability: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
Availability: kafka.controller:type=KafkaController,name=ActiveControllerCount (0 or 1 per node; the cluster-wide sum is exactly 1 when healthy)
Availability: kafka.controller:type=KafkaController,name=OfflinePartitionsCount (0 is healthy)
Throughput: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec/BytesOutPerSec
Latency breakdown: kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}
Load indicator: kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent

Metric	Meaning	Healthy range	Example alert condition
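UnderReplicatedPartitions	Number of partitions whose ISR is smaller than the replica set	Always 0	> 0 sustained (e.g. 1 minute or more)
ActiveControllerCount	Whether this node is the active controller (0 or 1)	Exactly 1 when summed across the cluster	Cluster-wide sum != 1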
OfflinePartitionsCount	Number of offline partitions	Always 0	>= 1
Request TotalTimeMs p95 (Produce)	Total latency of produce requests	Workload-dependent (e.g. < 50–100ms)	Sustained at 2–3x the normal level
BytesInPerSec/BytesOutPerSec	I/O throughput	Baseline for capacity planning	Sharp rate of change (e.g. ±50% in 5 minutes)
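
The three flat-threshold rows in the table can be sketched as a single cluster-level check. This is an illustration only: the dictionary keys are hypothetical field names rather than Kafka or Prometheus identifiers, and ActiveControllerCount is summed because each node reports 0 or 1.

```python
def primary_alerts(brokers):
    """Evaluate the flat-threshold availability signals for a cluster.

    `brokers` maps a broker name to its latest gauge values. The dict keys
    are hypothetical field names chosen for this sketch.
    """
    alerts = []
    urp = sum(b["under_replicated_partitions"] for b in brokers.values())
    controllers = sum(b["active_controller_count"] for b in brokers.values())
    offline = sum(b["offline_partitions_count"] for b in brokers.values())

    if urp > 0:
        alerts.append(f"URP={urp} (expected 0)")
    if controllers != 1:
        alerts.append(f"ActiveControllerCount sum={controllers} (expected 1)")
    if offline >= 1:
        alerts.append(f"OfflinePartitionsCount={offline} (expected 0)")
    return alerts


healthy = {
    "broker-1": {"under_replicated_partitions": 0, "active_controller_count": 1, "offline_partitions_count": 0},
    "broker-2": {"under_replicated_partitions": 0, "active_controller_count": 0, "offline_partitions_count": 0},
}
print(primary_alerts(healthy))
```

In a real setup the "sustained" qualifier for URP would come from the alerting layer rather than from a point-in-time check like this one.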

Decomposing Latency and Throughput

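Kafka's RequestMetrics lets you split TotalTimeMs into RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, ResponseSendTimeMs, and so on. Growing RequestQueueTimeMs suggests request handler thread starvation or GC pauses, and growing LocalTimeMs points to disk I/O on the leader. Growing RemoteTimeMs means the request is waiting on something outside the leader: follower replication for acks=all produce requests, or fetch.min.bytes/fetch.max.wait.ms for fetches. Growing ResponseSendTimeMs is the component that most directly points to the network or to slow clients.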
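
As a rough sketch of that reading, the function below takes one request type's breakdown and reports which component dominates. The hint strings are heuristics written for this example, not official diagnostics, and the input values are made up.

```python
# Rough interpretation hints per component; heuristics for this sketch only.
HINTS = {
    "RequestQueueTimeMs": "request handler threads saturated or long GC pauses",
    "LocalTimeMs": "slow local processing on the leader, often disk I/O",
    "RemoteTimeMs": "waiting on followers (acks=all) or on fetch wait conditions",
    "ResponseQueueTimeMs": "network threads slow to pick up responses",
    "ResponseSendTimeMs": "slow network or slow client reads",
}


def dominant_component(breakdown):
    """Return (component, share_of_total, hint) for the largest component."""
    total = sum(breakdown.values())
    if total <= 0:
        raise ValueError("breakdown must contain positive time")
    name = max(breakdown, key=breakdown.get)
    return name, breakdown[name] / total, HINTS.get(name, "unknown component")


# Hypothetical p95 values in milliseconds for Produce requests.
produce_p95 = {
    "RequestQueueTimeMs": 2.0,
    "LocalTimeMs": 30.0,
    "RemoteTimeMs": 5.0,
    "ResponseQueueTimeMs": 1.0,
    "ResponseSendTimeMs": 2.0,
}
print(dominant_component(produce_p95))
```

Percentiles of the components do not add up exactly to the percentile of the total, so treat the share as a pointer for where to look first.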

Build alert candidates on percentiles (p95/p99) and treat the average as a dashboard reference only. Keeping Produce and Fetch separate (especially the consumer-facing FetchConsumer) is a firm rule in both the exam and production.

Visualize Produce and Fetch on separate axes (do not mix them)
Show the TotalTimeMs breakdown side by side to localize the cause
Use p95/p99 as the primary signal and the average as a supplement
Cross-check the correlation between throughput (BytesIn/Out, MessagesIn) and latency
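
A small worked example shows why the average is only a supplement. The latencies below are invented: 90% of requests are fast and 10% sit in a slow tail. The percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical produce latencies in ms: mostly fast, with a slow tail.
latencies = [5] * 90 + [200] * 10

mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
print(f"mean={mean} p50={p50} p95={p95}")
```

The mean lands at a latency that almost no request actually experienced, while p95 exposes the tail that users and downstream consumers feel.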

Reading Consumer Lag Correctly

Consumer lag is the difference between each partition's Log End Offset and the Consumer Group's Committed Offset. Momentary spikes are normal, but sustained lag with no downward trend needs action. Settings like auto.commit.interval.ms, max.poll.interval.ms, and max.poll.records directly shape the behavior.

On the CCAAK exam, questions about misreading lag (temporary swelling from batch processing, inflated values from delayed commits, etc.) are standard. Use kafka-consumer-groups to inspect at the Group/Member/Partition level.

Align the observation unit to Group x Topic x Partition
Check whether an increase-then-decrease cycle exists (from processing cycles)
Exceeding max.poll.interval.ms triggers rebalances and tends to worsen lag
Judge scaling first by consumer parallelism (capped at the partition count)
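
One way to tell an increase-then-decrease cycle apart from a real backlog is to fit a trend line over a window of lag samples rather than looking at a single value. The sketch below uses a least-squares slope; the threshold of 1 message per second is an arbitrary illustrative default and would need tuning per workload.

```python
def lag_slope(samples):
    """Least-squares slope of lag over time, in messages per second.

    `samples` is a list of (timestamp_seconds, lag) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_l = sum(lag for _, lag in samples) / n
    cov = sum((t - mean_t) * (lag - mean_l) for t, lag in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("timestamps must not all be equal")
    return cov / var


def classify_lag(samples, growth_threshold=1.0):
    """Label a window as 'growing', 'draining', or 'stable'.

    growth_threshold (messages/second) is an illustrative default.
    """
    slope = lag_slope(samples)
    if slope > growth_threshold:
        return "growing"
    if slope < -growth_threshold:
        return "draining"
    return "stable"


# Batch-style sawtooth: lag swells and drains each cycle.
sawtooth = [(0, 100), (60, 500), (120, 100), (180, 500), (240, 100)]
# Backlog: lag climbs steadily.
backlog = [(0, 100), (60, 400), (120, 700), (180, 1000)]
print(classify_lag(sawtooth), classify_lag(backlog))
```

The window should span at least a couple of full processing cycles, otherwise the rising half of a sawtooth looks like a backlog.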

Checking lag with kafka-consumer-groups (example)

# Typical command
$ kafka-consumer-groups --bootstrap-server broker-1:9092 \
  --group analytics-app --describe

GROUP           TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID        HOST            CLIENT-ID
analytics-app   events     0          128345          128400          55    consumer-1-...     /10.0.0.21      analytics-1
analytics-app   events     1          127998          128001          3     consumer-2-...     /10.0.0.22      analytics-2

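# Tips for reading the output
# - LAG oscillates but converges on average: within the normal range
# - LAG grows cumulatively or stays pinned: insufficient throughput or a stall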
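
For trend tracking it helps to turn this output into numbers. The sketch below parses text shaped like the example above and recomputes lag as LOG-END-OFFSET minus CURRENT-OFFSET. It assumes the column layout shown here, which may differ between Kafka versions, so treat it as a starting point rather than a robust parser.

```python
SAMPLE_OUTPUT = """\
GROUP           TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID        HOST            CLIENT-ID
analytics-app   events     0          128345          128400          55    consumer-1-...     /10.0.0.21      analytics-1
analytics-app   events     1          127998          128001          3     consumer-2-...     /10.0.0.22      analytics-2
"""


def parse_describe(output):
    """Parse describe-style text into per-partition dicts.

    Rows without numeric offsets (for example "-" when nothing has been
    committed yet) are skipped.
    """
    header = None
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "GROUP":
            header = fields
            continue
        if header is None or len(fields) < 6:
            continue
        row = dict(zip(header, fields))
        if not (row["CURRENT-OFFSET"].isdigit() and row["LOG-END-OFFSET"].isdigit()):
            continue
        rows.append({
            "group": row["GROUP"],
            "topic": row["TOPIC"],
            "partition": int(row["PARTITION"]),
            "lag": int(row["LOG-END-OFFSET"]) - int(row["CURRENT-OFFSET"]),
        })
    return rows


rows = parse_describe(SAMPLE_OUTPUT)
total_lag = sum(r["lag"] for r in rows)
worst = max(rows, key=lambda r: r["lag"])
print(total_lag, worst["partition"])
```

Keeping the per-partition rows, not only the total, makes it possible to spot a single hot partition hiding behind a healthy-looking group total.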

Storage and Resilience Metrics

Disk I/O and replication behavior tie directly to latency and availability. LogFlushRateAndTimeMs, the produce and fetch purgatory sizes (PurgatorySize), and IsrShrinksPerSec/IsrExpandsPerSec are the keys to stable operations. Even when URP stays at 0, catch a rising IsrShrinksPerSec trend early.

Too many partitions also drives up open file counts and metadata processing cost. Watch PartitionCount and LeaderCount regularly to prevent unchecked topic/partition growth.

kafka.server:type=ReplicaManager,name=IsrShrinksPerSec / IsrExpandsPerSec
kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce (also delayedOperation=Fetch)
kafka.server:type=ReplicaManager,name=PartitionCount / LeaderCount
Also watch OS metrics (disk wait, IOPS, file descriptor exhaustion)
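
ISR shrinks that keep recurring while URP reads 0 at scrape time are easy to miss on a gauge. A simple sketch of catching them is to look at the cumulative shrink counter over several intervals. The input here is assumed to be successive readings of a cumulative count, and the three-interval threshold is an arbitrary example value.

```python
def interval_deltas(counter_samples):
    """Per-interval increases from successive cumulative counter readings."""
    deltas = []
    for prev, cur in zip(counter_samples, counter_samples[1:]):
        # A drop means the counter was reset, for example by a broker restart.
        deltas.append(cur if cur < prev else cur - prev)
    return deltas


def isr_flapping(shrink_counter_samples, min_intervals=3):
    """True when ISR shrinks occurred in at least `min_intervals` intervals.

    min_intervals is an illustrative default, not a recommended value.
    """
    deltas = interval_deltas(shrink_counter_samples)
    return sum(1 for d in deltas if d > 0) >= min_intervals


# Hypothetical cumulative shrink counts sampled once per scrape.
print(isr_flapping([10, 10, 12, 12, 15, 16]))
```

Outside of broker restarts and planned maintenance, both the shrink and expand rates are expected to sit at zero, so repeated non-zero intervals deserve a look even without a URP alert.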

Alert Design and Key CCAAK Points

Design alerts in three layers: primary availability signals, leading performance-degradation signals, and capacity-pressure trends. Notify immediately on primary signals, use sustained conditions and hysteresis for leading signals, and rely on early detection (trend slope) for capacity.
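
The "sustained conditions and hysteresis" idea for leading signals can be sketched as a small state machine: fire only after several consecutive breaches of a high threshold, and clear only after several consecutive samples below a lower one. The class name and default counts are made up for this illustration. In a Prometheus setup much of this is usually expressed in the alert rules themselves.

```python
class HysteresisAlert:
    """Fires after `fire_after` consecutive samples above `high`;
    clears only after `clear_after` consecutive samples below `low`."""

    def __init__(self, high, low, fire_after=3, clear_after=3):
        if low > high:
            raise ValueError("low must not exceed high")
        self.high = high
        self.low = low
        self.fire_after = fire_after
        self.clear_after = clear_after
        self.firing = False
        self._streak = 0

    def observe(self, value):
        """Feed one sample and return whether the alert is firing."""
        if not self.firing:
            self._streak = self._streak + 1 if value > self.high else 0
            if self._streak >= self.fire_after:
                self.firing = True
                self._streak = 0
        else:
            self._streak = self._streak + 1 if value < self.low else 0
            if self._streak >= self.clear_after:
                self.firing = False
                self._streak = 0
        return self.firing


# Hypothetical p95 produce latency samples in ms.
alert = HysteresisAlert(high=100, low=60, fire_after=2, clear_after=2)
states = [alert.observe(v) for v in [120, 50, 120, 130, 80, 50, 70, 50, 40]]
print(states)
```

The gap between the two thresholds is what stops the alert from flapping when latency hovers around a single cutoff.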

For exam prep, the meanings and normal values of URP/ActiveController/OfflinePartitions, the RequestMetrics breakdown, correct lag interpretation, and visualization with Confluent tools all come up frequently. In production, taking a baseline and combining it with relative-change alerts reduces false positives.

Primary: URP>0, ActiveController!=1, OfflinePartitions>0 (immediate)
Leading: p95 Produce/Fetch latency sustained at 2–3x normal
Capacity: weekly growth rate of BytesIn/Out, slope of disk usage
For lag, first check whether it resolves — do not judge from a snapshot value alone
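
For the capacity layer, the slope of disk usage can be turned into a rough time-to-full estimate. The sketch below fits a straight line to daily usage samples. Linear growth is an assumption that will not hold across retention changes or traffic step changes, and the numbers are invented.

```python
def days_until_full(daily_used_bytes, capacity_bytes):
    """Project days until the disk fills, from a least-squares slope.

    `daily_used_bytes` holds one usage sample per day, oldest first.
    Returns None when usage is flat or shrinking.
    """
    n = len(daily_used_bytes)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_used_bytes) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(daily_used_bytes))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var  # bytes per day
    if slope <= 0:
        return None
    return (capacity_bytes - daily_used_bytes[-1]) / slope


# Hypothetical usage in GB for four days on a 200 GB volume.
print(days_until_full([100, 110, 120, 130], 200))
```

Alerting on the projected days remaining, rather than on a fixed percentage, gives fast-growing clusters an earlier warning than slow-growing ones.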

Check Your Understanding

CCAAK

Question 1

Which metric is the primary signal that most directly indicates an availability risk in a Kafka cluster and should be monitored continuously?

A. UnderReplicatedPartitions (URP)
B. A 10% drop in BytesInPerSec
C. Broker CPU usage at 80%
D. A rise in p95 of LeaderElectionRateAndTimeMs

Answer: A

URP directly indicates that replicas have dropped out of the ISR and fault tolerance is degraded. A drop in BytesIn or a high CPU plateau are indirect factors, and LeaderElection latency matters but is not a steady-state signal. URP is the top priority as the primary indicator.

Frequently Asked Questions

Can I monitor Kafka without Prometheus?

Yes. Kafka exposes metrics via JMX, so you can read them with tools like JConsole or Jolokia. With Confluent you can use Control Center, and with Confluent Cloud the Metrics API exposes the key indicators. That said, Prometheus-based stacks are the common choice because of long-term retention and easier alert design.

Is there a universal threshold value?

No. Workloads vary widely, so the practical approach is to capture a baseline and percentiles during normal operation and design alerts around relative change (for example, 2–3x the normal p95 sustained for some duration). The exceptions where a flat value applies are URP=0, ActiveController=1, and OfflinePartitions=0.

Why is processing slow even when lag is 0?

Lag only compares committed offsets with the log end, so it says nothing about whether the work behind those commits has finished. With batch or async commits and uneven processing time, offsets can keep advancing while the application internals are stuck. Common causes include max.poll.records being too small, slow deserialization or external API waits, and commits that make the surface metric look better than reality. Measure in-app processing time and throughput alongside lag.
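
A minimal sketch of that in-app measurement is a timing wrapper around the per-record or per-batch handler. The class and method names are made up for this example, and time.sleep stands in for the real processing work.

```python
import time
from contextlib import contextmanager


class ProcessingStats:
    """Collects handler durations so they can be compared with consumer lag."""

    def __init__(self):
        self.durations = []

    @contextmanager
    def timed(self):
        """Time the wrapped block, recording the duration even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations.append(time.perf_counter() - start)

    def throughput(self):
        """Handled items per second of measured processing time."""
        total = sum(self.durations)
        return len(self.durations) / total if total > 0 else 0.0


stats = ProcessingStats()
for _ in range(3):
    with stats.timed():
        time.sleep(0.01)  # placeholder for real record handling
print(len(stats.durations), round(stats.throughput(), 1))
```

Exporting these durations as percentiles next to lag makes it visible when offsets advance normally but the handler itself is getting slower.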


Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
