Visibility into broker health and replica status is the first thing you cannot skip when operating Kafka. This article assumes Prometheus + Grafana and walks through JMX Exporter / Node Exporter setup, dashboards, and alerting, all built around the representative metrics that appear frequently on the CCAAK exam.
The content follows the official Apache Kafka JMX metrics specification and uses metric names and monitoring logic that are largely version-agnostic. The goal is a setup you can apply as-is for both the CCAAK exam and real-world operations.
Kafka exposes a rich set of internal metrics via JMX. The most common and stable way to feed those into Prometheus is the JMX Exporter (javaagent). Attach the javaagent to each Java process — broker, Kafka Connect, Schema Registry, REST Proxy — and expose metrics over HTTP. Complement that with Node Exporter at the OS level.
Prometheus scrapes each endpoint using its pull model, and Grafana visualizes via PromQL. Standardizing label design across cluster, environment, and role keeps dashboard variables and alert aggregation stable.
| Exporter | Where to deploy | Key metrics (JMX name / common name) |
|---|---|---|
| JMX Exporter (javaagent) | Kafka Broker / Connect / Schema Registry / REST Proxy | kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions / UnderReplicatedPartitions, kafka.controller:type=KafkaController,name=ActiveControllerCount / ActiveControllerCount, kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent / RequestHandlerAvgIdlePercent |
| Node Exporter | Linux nodes (broker hosts, etc.) | node_cpu_seconds_total, node_filesystem_avail_bytes, node_network_receive_bytes_total |
| JMX Exporter (javaagent, Connect) | Kafka Connect Worker | kafka.connect:type=connect-worker-metrics (attributes connector-count, task-count), plus connector and task status metrics, etc. |
Overall view of Kafka monitoring (Prometheus pull model)
Example inventory of monitored endpoints (logical names and URLs)
# Environment: production, cluster: cluster-a
- job: kafka-broker
  targets:
    - broker-1.prod.example.com:7071
    - broker-2.prod.example.com:7071
    - broker-3.prod.example.com:7071
- job: kafka-connect
  targets:
    - connect-1.prod.example.com:7072
- job: node
  targets:
    - broker-1.prod.example.com:9100
    - broker-2.prod.example.com:9100
    - broker-3.prod.example.com:9100

Attach jmx_prometheus_javaagent.jar to each Kafka-family process and expose metrics over HTTP. For the Kafka broker, add the javaagent to the JVM startup arguments through the KAFKA_OPTS environment variable (set it in the systemd unit or in the shell environment). Explicitly listing the official JMX names in your rules keeps PromQL stable.
At a minimum, start by collecting UnderReplicatedPartitions, ActiveControllerCount, RequestHandlerAvgIdlePercent, and network I/O metrics. Drop unnecessary attributes in the rules to avoid high-cardinality explosions at the topic/partition level.
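One way to confirm that the minimum set is actually being exported is to check an exporter's text output for the expected names. The following is a minimal sketch, assuming the exporter emits the metric names used in this article's rules; `REQUIRED`, `metric_names`, `missing_metrics`, and the sample text are illustrative, not part of any library.

```python
from typing import Set

# Metric names as produced by the exporter rules in this article (assumption:
# your rules use the same names).
REQUIRED = {
    "kafka_server_replicamanager_underreplicatedpartitions",
    "kafka_controller_kafkacontroller_activecontrollercount",
    "kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent",
}


def metric_names(exposition_text: str) -> Set[str]:
    """Return the metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first '{' (labels) or the first space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def missing_metrics(exposition_text: str) -> Set[str]:
    """Required metrics that do not appear in the given scrape output."""
    return REQUIRED - metric_names(exposition_text)


# Hypothetical scrape output from one broker.
SAMPLE = """\
# HELP kafka_server_replicamanager_underreplicatedpartitions Number of under-replicated partitions
# TYPE kafka_server_replicamanager_underreplicatedpartitions gauge
kafka_server_replicamanager_underreplicatedpartitions 0.0
kafka_controller_kafkacontroller_activecontrollercount 1.0
kafka_server_brokertopicmetrics_bytesinpersec_count{topic="app-orders"} 1024.0
"""

if __name__ == "__main__":
    print(sorted(missing_metrics(SAMPLE)))
```

With the sample above, the request-handler idle metric is reported as missing, which would point at a rule whose pattern does not match the MBean.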
Example of attaching the javaagent to Kafka Broker and configuring jmx_exporter
# 1) Attach the javaagent at Kafka startup (via systemd or an environment variable)
export KAFKA_OPTS="-javaagent:/opt/jmx/jmx_prometheus_javaagent.jar=7071:/opt/jmx/kafka-broker-jmx.yml $KAFKA_OPTS"

# 2) /opt/jmx/kafka-broker-jmx.yml (minimal stable set)
---
lowercaseOutputName: true
rules:
  # UnderReplicatedPartitions (a gauge, so the attribute is Value)
  - pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
    name: kafka_server_replicamanager_underreplicatedpartitions
    type: GAUGE
    help: Number of under-replicated partitions
  # ActiveControllerCount (1 on the active controller, 0 elsewhere; the cluster-wide sum should be 1)
  - pattern: 'kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value'
    name: kafka_controller_kafkacontroller_activecontrollercount
    type: GAUGE
  # RequestHandlerAvgIdlePercent (0 to 1); exposed as a meter, so read OneMinuteRate
  - pattern: 'kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate'
    name: kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent
    type: GAUGE
  # Network I/O per topic: $1 is the topic, $2 the attribute.
  # Count is a cumulative byte total; apply rate() to it in PromQL.
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=BytesInPerSec, topic=(.+)><>(OneMinuteRate|Count)'
    name: kafka_server_brokertopicmetrics_bytesinpersec_$2
    type: GAUGE
    labels:
      topic: "$1"
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=BytesOutPerSec, topic=(.+)><>(OneMinuteRate|Count)'
    name: kafka_server_brokertopicmetrics_bytesoutpersec_$2
    type: GAUGE
    labels:
      topic: "$1"

In Prometheus, separate targets by job and carry environment and cluster names as labels: target labels for queries against the local Prometheus, and external_labels for federation, remote write, and Alertmanager. Suppress high cardinality on the Prometheus side via metric_relabel_configs (for example, keep only the most important topics and drop series that carry a partition label by default).
For latency and throughput evaluation, use rate functions and combine 1m/5m/15m windows to improve noise tolerance.
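To see why combining windows helps, it can be useful to work the arithmetic by hand. The sketch below computes a simplified per-second rate of a counter over trailing 1m, 5m, and 15m windows. It is not PromQL's `rate()` (it ignores counter resets and extrapolation), and the sample series is hypothetical: a steady 1000 B/s with a one-minute burst at 5000 B/s at the end.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, cumulative counter value)


def simple_rate(samples: List[Sample], now: float, window_s: float) -> float:
    """Per-second increase of a counter over the trailing window.

    Simplified on purpose: no counter-reset handling and no extrapolation,
    both of which PromQL's rate() performs.
    """
    in_window = [(t, v) for t, v in samples if now - window_s <= t <= now]
    if len(in_window) < 2:
        raise ValueError("need at least two samples in the window")
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical byte counter scraped every 15 s for 15 minutes.
samples: List[Sample] = []
value = 0.0
for t in range(0, 901, 15):
    samples.append((float(t), value))
    # Bytes written during the next 15 s: burst in the final minute.
    value += 15 * (5000.0 if t >= 840 else 1000.0)

if __name__ == "__main__":
    for window in (60, 300, 900):
        print(window, simple_rate(samples, now=900.0, window_s=window))
```

The 1m window reports the burst in full (5000 B/s), the 5m window dampens it to 1800 B/s, and the 15m window to roughly 1267 B/s, which is why a short window suits dashboards and a longer one suits alert conditions.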
Example prometheus.yml (job separation and high-cardinality suppression)
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  # Added only when talking to external systems (federation, remote write, Alertmanager)
  external_labels:
    env: production
    cluster: cluster-a
scrape_configs:
  - job_name: 'kafka-broker'
    static_configs:
      - targets: ['broker-1.prod.example.com:7071','broker-2.prod.example.com:7071','broker-3.prod.example.com:7071']
        # Target labels, so that local queries and rules can filter on cluster/env
        labels: {env: production, cluster: cluster-a}
    metric_relabel_configs:
      # Drop series that carry a partition label by default
      - action: drop
        source_labels: [partition]
        regex: .+
      # Keep only important topics (here, names starting with app-).
      # The optional group also keeps series that have no topic label at all.
      - action: keep
        source_labels: [topic]
        regex: '(app-.*)?'
      - action: labeldrop
        regex: (client_id|fetcher_id)
  - job_name: 'kafka-connect'
    static_configs:
      - targets: ['connect-1.prod.example.com:7072']
        labels: {env: production, cluster: cluster-a}
  - job_name: 'node'
    static_configs:
      - targets: ['broker-1.prod.example.com:9100','broker-2.prod.example.com:9100','broker-3.prod.example.com:9100']
        labels: {env: production, cluster: cluster-a}

For both CCAAK prep and real operations, the first four panel groups you want are replica health, controller state, broker request-handler headroom, and I/O. Below are minimal query examples that use dashboard variables such as cluster, broker, and topic.
Rather than cramming everything into a single chart, provide both aggregations directly tied to alert conditions (sum/max) and drill-down views (by broker/topic). This speeds up incident triage.
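The split between an alert-facing aggregate and a drill-down view can be sketched outside PromQL as well. The example below is illustrative only: the sample values are hypothetical, and `sum_by_cluster` and `affected_brokers` are made-up helper names that mimic what a `sum by (cluster)` panel and a per-instance panel would show.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str, float]  # (cluster, instance, under-replicated partitions)


def sum_by_cluster(samples: List[Sample]) -> Dict[str, float]:
    """Cluster-level total, the shape an alert condition usually needs."""
    totals: Dict[str, float] = defaultdict(float)
    for cluster, _instance, value in samples:
        totals[cluster] += value
    return dict(totals)


def affected_brokers(samples: List[Sample], cluster: str) -> List[Tuple[str, float]]:
    """Per-broker drill-down: brokers with a non-zero value, worst first."""
    rows = [(inst, v) for c, inst, v in samples if c == cluster and v > 0]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical per-broker readings.
SAMPLES: List[Sample] = [
    ("cluster-a", "broker-1.prod.example.com:7071", 0.0),
    ("cluster-a", "broker-2.prod.example.com:7071", 4.0),
    ("cluster-a", "broker-3.prod.example.com:7071", 1.0),
]

if __name__ == "__main__":
    print(sum_by_cluster(SAMPLES))
    print(affected_brokers(SAMPLES, "cluster-a"))
```

The aggregate answers "is something wrong in this cluster?" in one number, while the drill-down tells the on-call engineer which broker to look at first.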
PromQL examples for representative panels (paste into the dashboard)
# 1) Replica health
sum(kafka_server_replicamanager_underreplicatedpartitions{cluster="$cluster"})
# 2) Controller state (the cluster-wide sum should be exactly 1)
sum(kafka_controller_kafkacontroller_activecontrollercount{cluster="$cluster"})
# 3) Broker request-handler headroom (average)
avg by (cluster) (
  kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent{cluster="$cluster"}
)
# 4) Bytes in/out (all topics summed, 5-minute rate of the cumulative byte count)
sum by (cluster) (rate(kafka_server_brokertopicmetrics_bytesinpersec_count{cluster="$cluster"}[5m]))
sum by (cluster) (rate(kafka_server_brokertopicmetrics_bytesoutpersec_count{cluster="$cluster"}[5m]))
# 5) UnderReplicatedPartitions per broker (drill-down)
max by (instance) (kafka_server_replicamanager_underreplicatedpartitions{cluster="$cluster"})

To avoid false positives, attach a for clause of 1 to 5 minutes so that short spikes are suppressed. UnderReplicatedPartitions calls for prompt action, so 1m is recommended. For RequestHandlerAvgIdlePercent, a sustained observation of 5m is recommended.
Route notifications using cluster and broker labels, and include investigation starting points in the message (target broker, recent reassignments, incident history).
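The effect of a for clause on short spikes can be sketched as a small simulation. This is a simplified model, not Prometheus code: `first_firing_eval` is a hypothetical helper that assumes one boolean per rule evaluation, resets the pending state on any evaluation where the condition does not hold, and fires once the condition has held for at least the for duration.

```python
from typing import List, Optional


def first_firing_eval(condition: List[bool], interval_s: int, for_s: int) -> Optional[int]:
    """Index of the first evaluation at which the alert would fire, or None.

    condition[i] is whether the alert expression held at evaluation i.
    """
    active_since: Optional[int] = None
    for i, holds in enumerate(condition):
        if not holds:
            active_since = None  # any gap resets the pending state
            continue
        if active_since is None:
            active_since = i  # alert becomes pending here
        if (i - active_since) * interval_s >= for_s:
            return i
    return None


# Evaluations every 15 s: a 45-second spike, a gap, then a sustained problem.
SPIKE_THEN_SUSTAINED = [False, True, True, True, False,
                        True, True, True, True, True, True]

if __name__ == "__main__":
    print(first_firing_eval(SPIKE_THEN_SUSTAINED, interval_s=15, for_s=60))
```

In this run the 45-second spike never fires, and the sustained condition fires at evaluation index 9, sixty seconds after it became pending at index 5.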
Example alerting rules (Prometheus)
groups:
  - name: kafka.rules
    rules:
      # Aggregate by cluster so that the cluster label survives into the annotations
      - alert: KafkaUnderReplicatedPartitions
        expr: sum by (cluster) (kafka_server_replicamanager_underreplicatedpartitions) > 0
        for: 1m
        labels:
          severity: critical
          team: kafka
        annotations:
          summary: "Under-replicated partitions detected"
          description: "There are {{ $value }} under-replicated partitions in {{ $labels.cluster }}. Check broker health and network."
      # Use sum, not max: two active controllers would still give max = 1
      - alert: KafkaControllerNotOne
        expr: sum by (cluster) (kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 3m
        labels:
          severity: warning
          team: kafka
        annotations:
          summary: "ActiveControllerCount is not 1"
          description: "Active controller count is {{ $value }} in {{ $labels.cluster }} (expected 1). Investigate controller elections."
      - alert: KafkaRequestHandlersSaturated
        expr: avg by (cluster) (kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent) < 0.2
        for: 5m
        labels:
          severity: warning
          team: kafka
        annotations:
          summary: "Low RequestHandler idle percent"
          description: "Handler idle percent below 0.2 for 5m in {{ $labels.cluster }}. Consider scaling or tuning."

Exporters commonly serve plain HTTP by default. Either keep them inside the network perimeter or terminate TLS/auth at a reverse proxy when needed. On the Prometheus side you can configure basic_auth and tls_config. Restrict endpoint exposure with network ACLs.
For multi-cluster operations, attach a cluster label consistently so series do not collide when aggregated: as a target label when one Prometheus scrapes several clusters, or via external_labels when each cluster has its own Prometheus. Plan high-cardinality reduction via metric_relabel_configs and make dashboards switchable through variables.
Example of scraping exporters behind a proxy from Prometheus (Basic auth / TLS)
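scrape_configs:
  - job_name: 'kafka-broker-proxied'
    scheme: https
    basic_auth:
      username: prometheus
      # Prometheus does not expand environment variables in this file;
      # read the secret from a file instead (example path)
      password_file: /etc/prometheus/secrets/exporter_basic_auth_password
    tls_config:
      insecure_skip_verify: false
    static_configs:
      - targets: ['broker-1.proxy.example.com:443','broker-2.proxy.example.com:443']

CCAAK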
Question 1
Which combination of metrics should Prometheus monitor to detect Kafka cluster replica health most directly and earliest?
Correct answer: A
UnderReplicatedPartitions directly indicates unhealthy replicas, and ActiveControllerCount signals an abnormal controller state. Together they enable early detection of replica health issues. I/O and node-resource metrics are important but only provide indirect signals.
If I open the JMX port, can Prometheus scrape it directly? Is the JMX Exporter required?
Prometheus does not connect to JMX directly. A JMX Exporter (typically the javaagent) that exposes metrics over HTTP is effectively required. With the javaagent in place, you do not need to enable remote JMX at all.
Can I get accurate Consumer Lag from JMX alone?
Standard JMX metrics alone make it hard to comprehensively visualize per-group lag. In practice you pair Prometheus with a dedicated tool or product that reads Kafka group offsets (__consumer_offsets). For exam prep, prioritize JMX monitoring of broker health (replicas, controller, request-handler headroom).
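The arithmetic such a tool performs is simple once both offsets are available. The following is a minimal sketch with hypothetical offsets; `consumer_lag` is an illustrative helper, and treating a partition without a committed offset as lagging from offset 0 is a simplifying assumption (a real tool would consider the log start offset and the group's reset policy).

```python
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]


def consumer_lag(end_offsets: Dict[TopicPartition, int],
                 committed: Dict[TopicPartition, int]) -> Dict[TopicPartition, int]:
    """Lag per partition = log end offset - committed offset, floored at 0."""
    lag = {}
    for tp, end in end_offsets.items():
        # Assumption: no committed offset means the group starts from offset 0.
        lag[tp] = max(end - committed.get(tp, 0), 0)
    return lag


# Hypothetical offsets for one consumer group on topic app-orders.
END_OFFSETS = {("app-orders", 0): 1500, ("app-orders", 1): 1200}
COMMITTED = {("app-orders", 0): 1450, ("app-orders", 1): 1200}

if __name__ == "__main__":
    per_partition = consumer_lag(END_OFFSETS, COMMITTED)
    print(per_partition, sum(per_partition.values()))
```

Summing the per-partition values gives the group-level lag that is usually alerted on, while the per-partition breakdown shows whether a single partition is stuck.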
Where should I tackle high-cardinality metrics?
Use a two-layer approach. First, suppress unwanted attributes in the JMX Exporter rules. Second, drop series that carry labels like partition via Prometheus metric_relabel_configs. On dashboards, default to aggregations (sum by, max by) and only expose detail through variables when needed.
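The Prometheus-side layer can be reasoned about with a small simulation of drop, keep, and labeldrop semantics. This sketch is not Prometheus code: `apply_rules` is a hypothetical helper, the series are made up, and `re.fullmatch` stands in for Prometheus's fully anchored regexes. It also illustrates one pitfall: a missing label is treated as an empty string, so a keep rule on topic must allow the empty value or it discards every series without a topic label.

```python
import re
from typing import Dict, List

Labels = Dict[str, str]


def apply_rules(series: List[Labels]) -> List[Labels]:
    """Emulate three relabel steps on a list of label sets."""
    kept = []
    for labels in series:
        # drop: any series with a non-empty partition label
        if re.fullmatch(r".+", labels.get("partition", "")):
            continue
        # keep: app-* topics, or series with no topic label (empty string)
        if not re.fullmatch(r"(app-.*)?", labels.get("topic", "")):
            continue
        # labeldrop: remove per-client labels but keep the series
        kept.append({k: v for k, v in labels.items()
                     if not re.fullmatch(r"client_id|fetcher_id", k)})
    return kept


# Hypothetical series; the example_* names are placeholders.
SERIES: List[Labels] = [
    {"__name__": "kafka_server_replicamanager_underreplicatedpartitions"},
    {"__name__": "kafka_server_brokertopicmetrics_bytesinpersec_count", "topic": "app-orders"},
    {"__name__": "kafka_server_brokertopicmetrics_bytesinpersec_count", "topic": "tmp-debug"},
    {"__name__": "example_partition_metric", "topic": "app-orders", "partition": "3"},
    {"__name__": "example_client_metric", "client_id": "c1"},
]

if __name__ == "__main__":
    for labels in apply_rules(SERIES):
        print(labels)
```

Three series survive: the broker-level gauge, the app-orders topic series, and the client metric with its client_id label removed.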
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.