Kafka Monitoring with Prometheus & Grafana (2026)

When operating Kafka, the first things you cannot skip are visibility into broker health and replica status. This article assumes Prometheus + Grafana and walks through JMX Exporter / Node Exporter setup, dashboards, and alerting, built around the representative metrics that show up frequently on the CCAAK exam.

The content follows the official Apache Kafka JMX metrics specification and uses metric names and monitoring logic that are largely version-agnostic. The goal is a setup you can apply as-is for both the CCAAK exam and real-world operations.

Monitoring Architecture and Responsibilities

Kafka exposes a rich set of internal metrics via JMX. The most common and stable way to feed those into Prometheus is the JMX Exporter (javaagent). Attach the javaagent to each Java process — broker, Kafka Connect, Schema Registry, REST Proxy — and expose metrics over HTTP. Complement that with Node Exporter at the OS level.

Prometheus scrapes each endpoint using its pull model, and Grafana visualizes via PromQL. Standardizing label design across cluster, environment, and role keeps dashboard variables and alert aggregation stable.

The JMX Exporter (javaagent) co-locates with each Kafka-family process. It runs over plain HTTP by default; terminate TLS at a reverse proxy if you need it.
Node Exporter collects common OS metrics for CPU, memory, disk, and network. Essential for spotting I/O bottlenecks in Kafka.
Splitting dashboards into broker health, replica health, network I/O, and thread headroom makes operational decisions much faster.

Exporter	Where to deploy	Key metrics (JMX name / common name)
JMX Exporter (javaagent)	Kafka Broker / Connect / Schema Registry / REST Proxy	kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions / UnderReplicatedPartitions, kafka.controller:type=KafkaController,name=ActiveControllerCount / ActiveControllerCount, kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent / RequestHandlerAvgIdlePercent
Node Exporter	Linux nodes (broker hosts, etc.)	node_cpu_seconds_total, node_filesystem_avail_bytes, node_network_receive_bytes_total
JMX Exporter (javaagent, Connect)	Kafka Connect Worker	kafka.connect:type=connect-worker-metrics (attributes such as connector-count and task-count)

Overall view of Kafka monitoring (Prometheus pull model)

Example inventory of monitored endpoints (logical names and URLs)

# Environment: production, cluster: cluster-a
- job: kafka-broker
  targets:
    - broker-1.prod.example.com:7071
    - broker-2.prod.example.com:7071
    - broker-3.prod.example.com:7071
- job: kafka-connect
  targets:
    - connect-1.prod.example.com:7072
- job: node
  targets:
    - broker-1.prod.example.com:9100
    - broker-2.prod.example.com:9100
    - broker-3.prod.example.com:9100
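
One way to keep the cluster, environment, and role labels consistent is to generate the target groups from a single inventory. The sketch below is a minimal illustration, assuming Prometheus file-based service discovery (file_sd_configs) is used to load the result; INVENTORY, to_target_groups, and the role label are hypothetical names chosen for this example.

```python
import json

# Hypothetical inventory mirroring the endpoint list above (role -> targets).
INVENTORY = {
    "kafka-broker": [
        "broker-1.prod.example.com:7071",
        "broker-2.prod.example.com:7071",
        "broker-3.prod.example.com:7071",
    ],
    "kafka-connect": ["connect-1.prod.example.com:7072"],
    "node": [
        "broker-1.prod.example.com:9100",
        "broker-2.prod.example.com:9100",
        "broker-3.prod.example.com:9100",
    ],
}


def to_target_groups(inventory, env, cluster):
    """Build target groups in the JSON shape read by file_sd_configs."""
    return [
        {
            "targets": list(targets),
            "labels": {"env": env, "cluster": cluster, "role": role},
        }
        for role, targets in sorted(inventory.items())
    ]


groups = to_target_groups(INVENTORY, env="production", cluster="cluster-a")

# Write this JSON to a file that a file_sd_configs entry points at.
print(json.dumps(groups, indent=2))
```

Because every group gets the same env and cluster values from one call, a typo in a single job cannot silently split a dashboard variable in two.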

Setting Up the JMX Exporter (javaagent)

Attach jmx_prometheus_javaagent.jar to each Kafka-family process and expose metrics over HTTP. For Kafka Broker, add the javaagent to the startup arguments via KAFKA_OPTS or environment variables. Explicitly listing the official JMX names in your rules keeps PromQL stable.

At a minimum, start by collecting UnderReplicatedPartitions, ActiveControllerCount, RequestHandlerAvgIdlePercent, and network I/O metrics. Drop unnecessary attributes in the rules to avoid high-cardinality explosions at the topic/partition level.

Example ports: Broker 7071, Connect 7072. Node Exporter conventionally uses 9100.
Enable lowercaseOutputName so metric names come out in lowercase, and set explicit name values in the rules to keep naming consistent.
Per-topic and per-partition detail can be collected solely to feed aggregations on the Prometheus side (covered below); the raw series can then be dropped.

Example of attaching the javaagent to Kafka Broker and configuring jmx_exporter

# 1) Attach the javaagent at Kafka startup (via systemd or environment variables)
export KAFKA_OPTS="-javaagent:/opt/jmx/jmx_prometheus_javaagent.jar=7071:/opt/jmx/kafka-broker-jmx.yml $KAFKA_OPTS"

# 2) /opt/jmx/kafka-broker-jmx.yml (minimal stable set)
# The exporter matches rules against "domain<key=value, ...><>Attribute: value",
# so each pattern needs the "<>" separator before the attribute name.
---
lowercaseOutputName: true
rules:
  # UnderReplicatedPartitions (a Gauge, so the attribute is Value)
  - pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
    name: kafka_server_replicamanager_underreplicatedpartitions
    type: GAUGE
    help: Number of under-replicated partitions
  # ActiveControllerCount (0 or 1 per node; the cluster-wide sum should be 1)
  - pattern: 'kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value'
    name: kafka_controller_kafkacontroller_activecontrollercount
    type: GAUGE
  # RequestHandlerAvgIdlePercent (a Meter; OneMinuteRate is the idle ratio, 0 to 1)
  - pattern: 'kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate'
    name: kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent
    type: GAUGE
  # Network I/O per topic: $1 is the topic, $2 is the attribute.
  # type is left unset because OneMinuteRate is a gauge while Count is cumulative.
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=BytesInPerSec, topic=(.+)><>(OneMinuteRate|Count)'
    name: kafka_server_brokertopicmetrics_bytesinpersec_$2
    labels:
      topic: "$1"
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=BytesOutPerSec, topic=(.+)><>(OneMinuteRate|Count)'
    name: kafka_server_brokertopicmetrics_bytesoutpersec_$2
    labels:
      topic: "$1"
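
To see how a rule turns an MBean attribute into a metric name and labels, the matching step can be approximated in a few lines. This is a simplified sketch, not the exporter's code: it assumes the documented input format domain<key=value, ...><>Attribute: value, unanchored patterns, and numbered capture groups ($1, $2) in name and labels. apply_rule and the sample line are hypothetical.

```python
import re


def apply_rule(pattern, name, labels, bean_line, lowercase=True):
    """Approximate one jmx_exporter rule against one flattened MBean attribute.

    Returns (metric_name, labels), or None when the pattern does not match.
    """
    match = re.search(pattern, bean_line)  # rule patterns are not anchored
    if match is None:
        return None

    def expand(template):
        # Replace $1, $2, ... with the corresponding capture group.
        return re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), template)

    metric = expand(name)
    if lowercase:
        metric = metric.lower()  # effect of lowercaseOutputName on the name
    return metric, {key: expand(value) for key, value in labels.items()}


# Assumed example of the string a per-topic meter attribute is matched against.
line = (
    "kafka.server<type=BrokerTopicMetrics, name=BytesInPerSec, "
    "topic=app-orders><>OneMinuteRate: 1024.0"
)

result = apply_rule(
    pattern=r"kafka.server<type=BrokerTopicMetrics, name=BytesInPerSec, topic=(.+)><>(OneMinuteRate|Count)",
    name="kafka_server_brokertopicmetrics_bytesinpersec_$2",
    labels={"topic": "$1"},
    bean_line=line,
)
print(result)
```

The topic label keeps its original case because only the metric name is lowercased here. A pattern that omits the <> separator or names the wrong attribute simply returns no match, which on a real broker shows up as a missing metric rather than an error.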

Prometheus Scrape Configuration and Relabel Design

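In Prometheus, separate targets by job and attach environment and cluster names as target labels. Mirror them in external_labels for remote write, federation, and Alertmanager, because external_labels are not added to series queried locally. Suppress high cardinality on the metrics side via metric_relabel_configs (e.g., keep only an allow-list of important topics; drop series carrying a partition label by default).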

For latency and throughput evaluation, use rate functions and combine 1m/5m/15m windows to improve noise tolerance.

Attach cluster and env consistently as target labels, and in external_labels for data that leaves Prometheus. This helps with dashboard variables and alert routing.
Drop per-partition series and non-essential topics via metric_relabel_configs. Essential at large cluster scale.
Use tls_config / basic_auth when the endpoint sits behind a TLS-terminating proxy or the exporter itself is configured for TLS or auth. Exporters serve plain HTTP by default.

Example prometheus.yml (job separation and high-cardinality suppression)

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    env: production
    cluster: cluster-a

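scrape_configs:
  - job_name: 'kafka-broker'
    static_configs:
      - targets: ['broker-1.prod.example.com:7071','broker-2.prod.example.com:7071','broker-3.prod.example.com:7071']
        # Target labels are what {cluster="..."} selectors match in local queries;
        # external_labels only apply to remote write, federation, and alerts.
        labels:
          env: production
          cluster: cluster-a
    metric_relabel_configs:
      # As a rule, drop series that carry a partition label
      - action: drop
        regex: .+
        source_labels: [partition]
      # Keep only important topics (names starting with app-); the optional
      # group lets series without a topic label pass through
      - action: keep
        source_labels: [topic]
        regex: '(app-.*)?'
      - action: labeldrop
        regex: (client_id|fetcher_id)

  - job_name: 'kafka-connect'
    static_configs:
      - targets: ['connect-1.prod.example.com:7072']
        labels:
          env: production
          cluster: cluster-a

  - job_name: 'node'
    static_configs:
      - targets: ['broker-1.prod.example.com:9100','broker-2.prod.example.com:9100','broker-3.prod.example.com:9100']
        labels:
          env: production
          cluster: cluster-a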
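
Relabel regexes are fully anchored, and a label that is absent reads as an empty string. A keep rule on topic therefore has to let the empty value through if broker-level series (which carry no topic label) are meant to survive. The following is a rough model of the keep, drop, and labeldrop actions for experimenting with rules offline; apply_metric_relabel and the sample series are hypothetical, and real Prometheus uses RE2 rather than Python's re.

```python
import re


def apply_metric_relabel(series_labels, rules):
    """Return the surviving label set, or None when the series is dropped."""
    labels = dict(series_labels)
    for rule in rules:
        action = rule["action"]
        regex = re.compile(rule["regex"])
        if action == "labeldrop":
            # labeldrop matches label names, not values.
            labels = {k: v for k, v in labels.items() if not regex.fullmatch(k)}
            continue
        # Missing labels contribute an empty string; values are joined by ';'.
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        matched = regex.fullmatch(value) is not None
        if (action == "drop" and matched) or (action == "keep" and not matched):
            return None
    return labels


RULES = [
    {"action": "drop", "source_labels": ["partition"], "regex": ".+"},
    {"action": "keep", "source_labels": ["topic"], "regex": "(app-.*)?"},
    {"action": "labeldrop", "regex": "(client_id|fetcher_id)"},
]

broker_level = {"__name__": "kafka_server_replicamanager_underreplicatedpartitions"}
app_topic = {"__name__": "bytes_in", "topic": "app-orders", "client_id": "c1"}
other_topic = {"__name__": "bytes_in", "topic": "internal-audit"}
per_partition = {"__name__": "log_size", "topic": "app-orders", "partition": "0"}

for series in (broker_level, app_topic, other_topic, per_partition):
    print(series, "->", apply_metric_relabel(series, RULES))
```

Swapping the keep regex for a plain app-.* makes broker_level come back as None in this model, which is the failure mode to watch for: the replica and controller metrics would vanish along with the unwanted topics.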

Grafana Dashboard Design (Essential Panels and PromQL)

For both CCAAK prep and real operations, the first four panel groups you want are replica health, controller state, broker request-handler headroom, and I/O. Below are minimal query examples. Variables: cluster, broker, topic, etc.

Rather than cramming everything into a single chart, provide both aggregations directly tied to alert conditions (sum/max) and drill-down views (by broker/topic). This speeds up incident triage.

Detect UnderReplicatedPartitions > 0 immediately. First response is to check broker health and network.
ActiveControllerCount, summed across the cluster, should always be exactly 1, since each node reports 0 or 1. Anything other than 1 is a sign of trouble.
If RequestHandlerAvgIdlePercent stays below 0.2, suspect saturation.

PromQL examples for representative panels (paste into the dashboard)

# 1) Replica health
sum(kafka_server_replicamanager_underreplicatedpartitions{cluster="$cluster"})

# 2) Controller state (healthy when the cluster-wide sum is 1; max would hide a second active controller)
sum(kafka_controller_kafkacontroller_activecontrollercount{cluster="$cluster"})

# 3) Broker request-handler headroom (average)
avg by (cluster) (
  kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent{cluster="$cluster"}
)

# 4) Bytes In/Out (summed over topics, 5-minute moving average)
# OneMinuteRate is already a per-second gauge, so smooth it with avg_over_time instead of rate()
sum by (cluster) (avg_over_time(kafka_server_brokertopicmetrics_bytesinpersec_oneminuterate{cluster="$cluster"}[5m]))
sum by (cluster) (avg_over_time(kafka_server_brokertopicmetrics_bytesoutpersec_oneminuterate{cluster="$cluster"}[5m]))

# 5) UnderReplicatedPartitions per broker (drill-down)
max by (instance) (kafka_server_replicamanager_underreplicatedpartitions{cluster="$cluster"})
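
To check a panel query outside Grafana, the same PromQL can be sent to the Prometheus HTTP API (/api/v1/query). The sketch below is illustrative only: the server URL is a placeholder, the sample payload is made up to match the shape of an instant-vector response, and Grafana variables such as $cluster have to be replaced with literal values before sending.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Return the instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_vector(payload):
    """Turn an instant-vector response into a list of (labels, float value)."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]


def instant_query(base_url, promql, timeout=5):
    """Run one instant query against a live Prometheus server."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_vector(json.load(resp))


# Made-up payload in the shape returned for a vector result.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"cluster": "cluster-a"}, "value": [1700000000.0, "0"]}
        ],
    },
}

url = build_query_url("http://prometheus.example.com:9090", "sum(up)")
print(url)
print(parse_vector(sample))
```

Sample values arrive as strings in the JSON response, which is why parse_vector converts them with float before any comparison against a threshold.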

Alert Design (Prometheus Rules)

To avoid false positives, suppress short spikes and attach a 1-5 minute for clause. UnderReplicatedPartitions needs immediate action, so 1m is recommended. For RequestHandlerAvgIdlePercent, a sustained observation of 5m is recommended.

Route notifications using cluster and broker labels, and include investigation starting points in the message (target broker, recent reassignments, incident history).

Detect UnderReplicatedPartitions > 0 with the highest priority.
ActiveControllerCount != 1 can simply mean a controller failover is in progress, so notify only if it persists. A sustained 0 means no active controller; a value above 1 suggests split-brain.
If RequestHandlerAvgIdlePercent < 0.2 persists, suspect thread/CPU saturation.
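
How a for clause filters out short spikes can be pictured as a small state machine: the first true evaluation makes the alert pending, any false evaluation resets it, and it fires once the condition has held for the full duration. The code below is a simplified model of that behaviour, not Prometheus' implementation; alert_states and the timeline are hypothetical.

```python
def alert_states(samples, for_seconds):
    """Return the alert state after each rule evaluation.

    samples: list of (timestamp_seconds, condition_is_true), in time order.
    """
    states = []
    active_since = None
    for timestamp, is_true in samples:
        if not is_true:
            active_since = None       # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp  # first true evaluation starts the timer
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Evaluations every 15s: a short spike, then a sustained breach, with for: 1m.
spike = [(0, True), (15, True), (30, False)]
sustained = [(45, True), (60, True), (75, True), (90, True), (105, True)]

states = alert_states(spike + sustained, for_seconds=60)
for (timestamp, _), state in zip(spike + sustained, states):
    print(timestamp, state)
```

In this timeline the spike never leaves the pending state, and the sustained breach only fires at the evaluation where it has held for 60 seconds, which is the trade-off behind choosing 1m for replicas and 5m for handler idle time.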

Example alerting rules (Prometheus)

groups:
- name: kafka.rules
  rules:
  - alert: KafkaUnderReplicatedPartitions
    expr: sum by (cluster) (kafka_server_replicamanager_underreplicatedpartitions) > 0
    for: 1m
    labels:
      severity: critical
      team: kafka
    annotations:
      summary: "Under-replicated partitions detected"
      description: "There are {{ $value }} under-replicated partitions in {{ $labels.cluster }}. Check broker health and network."

  - alert: KafkaControllerNotOne
    expr: sum by (cluster) (kafka_controller_kafkacontroller_activecontrollercount) != 1
    for: 3m
    labels:
      severity: warning
      team: kafka
    annotations:
      summary: "ActiveControllerCount is not 1"
      description: "Active controller count is {{ $value }} in {{ $labels.cluster }} (expected 1). Investigate controller elections."

  - alert: KafkaRequestHandlersSaturated
    expr: avg by (cluster) (kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent) < 0.2
    for: 5m
    labels:
      severity: warning
      team: kafka
    annotations:
      summary: "Low RequestHandler idle percent"
      description: "Handler idle percent below 0.2 for 5m in {{ $labels.cluster }}. Consider scaling or tuning."

Security and Operational Pitfalls (TLS, Multi-Cluster, Tuning)

Exporters serve plain HTTP by default. Either keep them inside the network perimeter, or add TLS/auth when needed, whether at a reverse proxy or through the exporter's own TLS and basic-auth settings where the version you run offers them. On the Prometheus side, configure the matching basic_auth and tls_config. Restrict endpoint exposure with network ACLs.

For multi-cluster operations, attach a cluster label consistently so series do not collide when aggregated: as a target label when several clusters share one Prometheus, and via external_labels when separate Prometheus servers feed a central store. Plan high-cardinality reduction via metric_relabel_configs and make dashboards switchable through variables.

Exporters are HTTP by default. Use a proxy to terminate TLS plus Basic auth if you need it.
Use a consistent cluster label and job separation to safely aggregate multiple clusters.
Manage collection load via scrape_interval, scrape_timeout, and target count. At 15s with 1000 targets you should benchmark performance.
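
A back-of-the-envelope estimate helps decide when such a benchmark is due: ingest rate is roughly targets times series per target, divided by the scrape interval. The figure of 1500 series per target below is purely an assumption for illustration; on a running server the real number can be read from the scrape_samples_scraped metric.

```python
def active_series(targets, series_per_target):
    """Approximate number of series Prometheus keeps in memory."""
    return targets * series_per_target


def samples_per_second(targets, series_per_target, scrape_interval_s):
    """Approximate ingest rate: every series yields one sample per scrape."""
    return active_series(targets, series_per_target) / scrape_interval_s


# 1000 targets at a 15s interval, with an assumed 1500 series per target.
series = active_series(1000, 1500)
rate_15s = samples_per_second(1000, 1500, 15)
rate_30s = samples_per_second(1000, 1500, 30)

print(f"active series: {series}")
print(f"samples/s at 15s: {rate_15s:.0f}, at 30s: {rate_30s:.0f}")
```

Doubling the interval halves the ingest rate but leaves the active series count untouched, so memory pressure from cardinality has to be tackled through relabeling rather than through scrape_interval.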

Example of scraping exporters behind a proxy from Prometheus (Basic auth / TLS)

scrape_configs:
  - job_name: 'kafka-broker-proxied'
    scheme: https
    basic_auth:
      username: prometheus
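      # Prometheus does not expand ${VAR} in its config file; read the secret from a file (example path)
      password_file: /etc/prometheus/secrets/exporter_basic_auth_password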
    tls_config:
      insecure_skip_verify: false
    static_configs:
      - targets: ['broker-1.proxy.example.com:443','broker-2.proxy.example.com:443']

Check Your Understanding

CCAAK

Question 1

Which combination of metrics should Prometheus monitor to detect Kafka cluster replica health most directly and earliest?

A. UnderReplicatedPartitions and ActiveControllerCount
B. BytesInPerSec and BytesOutPerSec
C. CPU utilization and memory usage (Node Exporter)
D. Request latency (p95) and free disk space

Correct answer: A

UnderReplicatedPartitions directly indicates unhealthy replicas, and ActiveControllerCount signals an abnormal controller state. Together they enable early detection of replica health issues. I/O and node-resource metrics are important but only provide indirect signals.

Frequently Asked Questions

If I open the JMX port, can Prometheus scrape it directly? Is the JMX Exporter required?

Prometheus does not connect to JMX directly. A JMX Exporter (typically the javaagent) that exposes metrics over HTTP is effectively required. With the javaagent in place, you do not need to enable remote JMX at all.

Can I get accurate Consumer Lag from JMX alone?

Standard JMX metrics alone make it hard to comprehensively visualize per-group lag. In practice you pair Prometheus with a dedicated tool or product that reads Kafka group offsets (__consumer_offsets). For exam prep, prioritize JMX monitoring of broker health (replicas, controller, request-handler headroom).
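
The quantity those tools compute is simple: for each partition, lag is the log end offset minus the group's committed offset, and the group's lag is the sum over its partitions. A minimal sketch of that arithmetic follows, with made-up offsets; in practice the two maps would come from the Kafka admin API or a lag exporter, and group_lag is a hypothetical helper.

```python
def group_lag(log_end_offsets, committed_offsets):
    """Return (per-partition lag, total lag) for one consumer group.

    Both arguments map (topic, partition) -> offset. A partition without a
    committed offset is reported as None instead of being guessed.
    """
    per_partition = {}
    for tp, end_offset in log_end_offsets.items():
        committed = committed_offsets.get(tp)
        if committed is None:
            per_partition[tp] = None
        else:
            # Clamp at zero: the two offsets are read at slightly different times.
            per_partition[tp] = max(end_offset - committed, 0)
    total = sum(lag for lag in per_partition.values() if lag is not None)
    return per_partition, total


# Made-up offsets for a three-partition topic.
end_offsets = {("app-orders", 0): 1200, ("app-orders", 1): 950, ("app-orders", 2): 400}
committed = {("app-orders", 0): 1150, ("app-orders", 1): 950}

per_partition, total = group_lag(end_offsets, committed)
print(per_partition, total)
```

Partition 2 has no committed offset in this example, so it is surfaced as None rather than counted as 400 records of lag; whether to treat that case as an alert is a policy decision for the monitoring tool.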

Where should I tackle high-cardinality metrics?

Use a two-layer approach. First, suppress unwanted attributes in the JMX Exporter rules. Second, drop per-partition series and noisy labels via Prometheus metric_relabel_configs. On dashboards, default to aggregations (sum by, max by) and only expose detail through variables when needed.


Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
