Vault Metrics: What to Monitor (2026)

Vault is a core platform for distributing secrets with high availability and strong security guarantees. In operations, knowing that the service is up is not enough: the real value comes from watching metrics that surface early warning signs such as leader elections, seal state changes, request latency, storage health, and lease renewal failures.

Building on the behavior documented in the official docs, this article walks through how to collect metrics with Prometheus and how to read and alert on the key signals. Rather than leaning on individual metric names (which shift between versions), we focus on categories and the operational lens you actually use day-to-day.

The Big Picture and How to Collect Metrics (Prometheus/StatsD)

Once the telemetry stanza is enabled, Vault exposes metrics in Prometheus format from the /v1/sys/metrics endpoint (with format=prometheus). You can also push to StatsD/DogStatsD. Pick the collection method that fits your platform; in recent years direct Prometheus scraping has become the mainstream choice because it makes dashboards and alerts easier to design.

There are two important prerequisites: (1) configure telemetry and then restart the server (or perform a rolling restart following a safe maintenance procedure), and (2) in production, gate access to /v1/sys/metrics with a token whose policy grants read on that path. The most reliable approach is to first inspect the metric list in a development environment, get familiar with the available labels (path, method, status, mount type, etc.), and only then build dashboards on top.

Endpoint: /v1/sys/metrics?format=prometheus
telemetry.prometheus_retention_time sets how long Vault keeps Prometheus-format metrics in memory (setting it to 0 disables Prometheus output)
StatsD/DogStatsD integrates well with existing APM/monitoring stacks, but label expressiveness tends to be limited
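
To make the "inspect the metric list first" step concrete, here is a minimal sketch that takes Prometheus text output and lists each metric name with the label names it carries. The sample text and the metric names in it are hypothetical placeholders, not real Vault metric names; in practice you would feed it the output you saved from /v1/sys/metrics?format=prometheus.

```python
import re
from collections import defaultdict

# Matches `name{label="value",...} 1.0` or `name 1.0`; trailing timestamps are ignored.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

def inventory(exposition_text):
    """Return {metric_name: sorted list of label names} from Prometheus text output."""
    labels_by_metric = defaultdict(set)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, label_blob, _value = match.groups()
        labels_by_metric[name].update(k for k, _ in LABEL_RE.findall(label_blob or ""))
    return {name: sorted(labels) for name, labels in labels_by_metric.items()}

# Hypothetical sample; real names and labels depend on the Vault version.
SAMPLE = '''\
# HELP example_request_count hypothetical counter
# TYPE example_request_count counter
example_request_count{path="kv/",method="GET"} 42
example_request_count{path="auth/",method="POST"} 7
example_up 1
'''

if __name__ == "__main__":
    for name, labels in sorted(inventory(SAMPLE).items()):
        print(name, labels)
```

Running an inventory like this against a development server before designing dashboards shows which aggregation axes actually exist in your version.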

Collection method	Main destination / retrieval method	Strengths	Caveats
Direct Prometheus scrape	Prometheus scrapes /v1/sys/metrics	Rich labels, lots of visualization templates	Requires endpoint protection plus certificate/token management
StatsD	UDP push to statsd_address	Lightweight; easy to drop onto existing infrastructure	Weak label expression makes detailed slicing difficult
DogStatsD	Push to dogstatsd_address	Tags allow some level of label expression	Depends on a collection agent; watch network reachability
File / external bridge	Collect with an agent and forward to another system	Flexible integration with existing SOC/monitoring	Risk of operational complexity and added latency

Typical configuration for collecting Vault metrics

Example Vault telemetry and Prometheus scrape configuration (HCL/YAML)

# server.hcl (excerpt)
telemetry {
  prometheus_retention_time = "24h"
  disable_hostname = true
  # Optional (use one or the other)
  # statsd_address    = "127.0.0.1:8125"
  # dogstatsd_address = "127.0.0.1:8125"
}

# Minimal example policy for reading sys/metrics (Vault policy)
path "sys/metrics" {
  capabilities = ["read"]
}

# Prometheus (prometheus.yml excerpt)
scrape_configs:
  - job_name: "vault"
    scheme: https
    metrics_path: /v1/sys/metrics
    params:
      format: ["prometheus"]
    bearer_token: "s.xxxxxxxx"  # token with read capability on sys/metrics
    tls_config:
      insecure_skip_verify: false
    static_configs:
      - targets: ["vault.example.com:8200"]

Availability / HA Signals (Leader, Seal, Health)

When running HA, keep an eye on the frequency of leader changes and their leading indicators (rising latency, storage slowdowns), seal/unseal events, and standby promotion success or failure. Stable metric signals include leadership-related counters and gauges, the number of seal-state change events, and standby RPC failure counts. Pair this with /v1/sys/health to track state transitions at fixed intervals.

Whether for the exam or for real operations, the key is to focus on rising trends in leader changes or chained sequences (change → latency → change) rather than one-off down detection. If those correlate with 5xx errors or storage warnings, you can intervene early.

Angles to watch: leader change count, leader tenure, seal state changes, standby promotion failures
The health API (/v1/sys/health) gives immediate state; metrics are stronger for trend analysis

Symptom	Underlying risk	First-response action
Spike in leader changes	Storage latency / unstable network	Check storage I/O and network latency. Test reachability of the standbys
Frequent seal/unseal	Auto-unseal KMS failures or key management issues	Check KMS reachability and permissions; scrutinize audit log errors
Standby promotion failure	ACL/network restrictions, Raft inconsistency	Check ports and certificates; wait for or correct Raft status convergence

HA state transitions (conceptual)

Example health/metrics retrieval (curl)

# Check leader and seal state (health API)
curl -sS https://vault.example.com:8200/v1/sys/health \
  -H "X-Vault-Token: s.xxxxx" | jq '{sealed, standby, initialized, version}'

# Metrics (Prometheus format)
curl -sS "https://vault.example.com:8200/v1/sys/metrics?format=prometheus" \
  -H "X-Vault-Token: s.xxxxx" | head -n 50
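
As a rough illustration of tracking state transitions at fixed intervals, the sketch below maps the sealed, standby, and initialized fields of a health response to a coarse state label and counts how often that label changes across polls. The function names and the polled samples are made up for illustration; only the three field names come from the health output shown above.

```python
def classify_health(body):
    """Map a /v1/sys/health JSON body (as a dict) to a coarse state label."""
    if not body.get("initialized", False):
        return "uninitialized"
    if body.get("sealed", True):
        return "sealed"
    return "standby" if body.get("standby", False) else "active"

def count_state_changes(states):
    """Count transitions between consecutive polled states."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)

# Hypothetical poll results for a single node, oldest first.
polls = [
    {"initialized": True, "sealed": False, "standby": False},
    {"initialized": True, "sealed": False, "standby": True},
    {"initialized": True, "sealed": False, "standby": True},
    {"initialized": True, "sealed": True, "standby": True},
]
states = [classify_health(p) for p in polls]

if __name__ == "__main__":
    print(states, count_state_changes(states))
```

A node that flips between active and standby several times within an hour is the kind of rising trend worth correlating with latency and storage signals.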

Reading Request Latency, Throughput, and Error Rate

Vault API requests are exported with labels such as path (e.g. auth/, sys/, kv/), HTTP method, and status code. That lets you visualize throughput, latency distribution, and 5xx error rate broken down by mount, method, and status. For SLO design, p95/p99 latency and 5xx rate (over the most recent 5-15 minutes) are the practical levers.

Specific metric names and labels can be added or removed across versions, so first inspect the actual output of /v1/sys/metrics, and define dashboards on the assumption that you can aggregate by labels (path, method, status, etc.).

Monitor latency by quantile (p95/p99). Visualize sudden spikes and sustained increases separately
Aggregate 5xx by route. Rising auth/* errors should also raise suspicion of external IdP or network factors
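
A small worked example shows why quantiles matter more than averages here. This sketch uses the nearest-rank definition of a percentile on a hypothetical set of request latencies; Prometheus computes quantiles differently (from histogram buckets or summaries), so treat it as an illustration of the idea rather than a reproduction of what a dashboard shows.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: nine fast requests and one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 480]

if __name__ == "__main__":
    mean = sum(latencies_ms) / len(latencies_ms)
    print("mean:", mean)
    print("p50:", percentile(latencies_ms, 50))
    print("p95:", percentile(latencies_ms, 95))
```

The median stays near the typical request while the tail percentile exposes the slow outlier, which is why p95/p99 are the values to alert on.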

Metric category	Aggregation axis to watch	Operational note
Throughput	path × method	Identify hot paths and confirm rate-limit impact
Latency	p95/p99 × path	Surfaces backend latency and the cost of key generation
Error rate	status(5xx) × path	Selectively break out 429/403 in separate charts as well

Understanding the latency distribution (conceptual)

Dashboard and alert design templates (PromQL approach)

# Replace the placeholder metric names with the ones in your /v1/sys/metrics output.
# Example: p95 latency (if a histogram/summary is available)
# histogram_quantile(0.95, sum by (le, path) (<request_duration_bucket>))

# Example: 5xx rate (last 5 minutes)
# sum by (path)(rate(<request_total>{status=~"5.."}[5m])) / sum by (path)(rate(<request_total>[5m]))

# Example: throughput (RPS)
# sum(rate(<request_total>[1m]))
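
The 5xx-rate template above boils down to simple arithmetic on counter deltas. The sketch below works that arithmetic through for two hypothetical counter snapshots keyed by (path, status); the snapshot structure and the numbers are assumptions for illustration, not Vault output.

```python
from collections import defaultdict

def error_ratio_by_path(earlier, later):
    """Per-path share of 5xx responses between two counter snapshots.

    Each snapshot maps (path, status) -> cumulative request count.
    """
    total = defaultdict(float)
    errors = defaultdict(float)
    for (path, status), count in later.items():
        delta = count - earlier.get((path, status), 0)
        if delta < 0:  # counter reset (for example after a restart): count from zero
            delta = count
        total[path] += delta
        if status.startswith("5"):
            errors[path] += delta
    return {p: errors[p] / total[p] for p in total if total[p] > 0}

# Hypothetical snapshots taken five minutes apart.
t0 = {("kv/", "200"): 1000, ("kv/", "500"): 5, ("auth/", "200"): 400}
t1 = {("kv/", "200"): 1190, ("kv/", "500"): 15, ("auth/", "200"): 450}

if __name__ == "__main__":
    print(error_ratio_by_path(t0, t1))
```

Here kv/ served 200 requests in the window, 10 of them 5xx, giving a 5% error ratio, while auth/ stayed at zero.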

Storage / Replication Signals (Raft / Integrated Storage)

When using Integrated Storage (built-in Raft), watch commit latency, peer health, snapshot/log compaction progress, and disk I/O bottlenecks. Metrics are exported as storage operation latency and error counts. Weight sustained latency increases and growing retransmissions more heavily than short-lived spikes.

Alongside the metrics, checking Autopilot state and peer health via the API speeds up triage considerably.

Number of Raft peers and leader identification; commit/apply latency; snapshot frequency
Disk fsync delays and storage-layer errors propagate up into request latency and leader changes

Angle	Verification method	Threshold / rule of thumb
Peer health	Autopilot state / API	All peers should be healthy; investigate degraded peers immediately
Commit latency	Storage-related metrics	Watch for a doubling vs. baseline sustained for 5+ minutes
Snapshot	Log size / generations	Bloat or high frequency signals an I/O bottleneck

Raft conceptual diagram (Active / Peers)

Retrieving Autopilot / peer state (reference APIs)

# Autopilot state (reference: supplementary health check; the payload is nested under .data)
curl -sS https://vault.example.com:8200/v1/sys/storage/raft/autopilot/state \
  -H "X-Vault-Token: s.xxxxx" | jq '.data | {healthy, failure_tolerance, servers: [.servers[] | {id, name, status, healthy}]}'

# Peer list (servers are nested under .data.config)
curl -sS https://vault.example.com:8200/v1/sys/storage/raft/configuration \
  -H "X-Vault-Token: s.xxxxx" | jq '.data.config.servers[] | {node_id, address, leader, voter}'
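
For triage, it can help to reduce the Autopilot response to the few facts you act on. The following is a sketch under assumptions: it assumes the payload carries healthy, failure_tolerance, and a servers collection whose entries have name (or id) and healthy fields, optionally nested under a data key. Verify the actual shape against your Vault version before relying on it; the sample payload is hypothetical.

```python
def unhealthy_peers(autopilot_state):
    """Return names of servers reported as not healthy in an Autopilot state payload."""
    data = autopilot_state.get("data", autopilot_state)
    servers = data.get("servers") or {}
    # Assumed shape: a mapping of server ID -> details; a plain list is also tolerated.
    entries = servers.values() if isinstance(servers, dict) else servers
    return sorted(
        s.get("name") or s.get("id", "unknown")
        for s in entries
        if not s.get("healthy", False)
    )

def summarize(autopilot_state):
    """Collapse the payload into the fields used for a first triage decision."""
    data = autopilot_state.get("data", autopilot_state)
    return {
        "cluster_healthy": bool(data.get("healthy", False)),
        "failure_tolerance": data.get("failure_tolerance", 0),
        "unhealthy": unhealthy_peers(autopilot_state),
    }

# Hypothetical payload for a three-node cluster with one degraded peer.
sample_state = {
    "data": {
        "healthy": False,
        "failure_tolerance": 0,
        "servers": {
            "raft1": {"id": "raft1", "name": "raft1", "healthy": True},
            "raft2": {"id": "raft2", "name": "raft2", "healthy": False},
            "raft3": {"id": "raft3", "name": "raft3", "healthy": True},
        },
    }
}

if __name__ == "__main__":
    print(summarize(sample_state))
```

A failure tolerance of zero combined with any unhealthy peer means the next node loss costs quorum, so it deserves immediate attention.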

Token / Lease / Secret Lifetime Signals

Token issuance and lease renewal failures map directly to authorization defects or trouble in external dependencies (KMS/PKI/DB). On the metrics side, track counts and failure rates for issue/renew/revoke operations and the trend in held leases. A steady upward trend signals a leak; failure spikes point to external-dependency outages or the impact of policy changes.

Adding spot checks against the /sys/leases API (total leases under a specific path and the distribution of remaining TTL) to your operational Runbook speeds up triage during incidents.

Numbers to watch: total leases, renewal success rate, source of failures (broken out by path / mount)
A buildup of long-lived leases is a signal to run tidy and revisit the policy

Metric	Sign of trouble	Direction of response
Lease renewal failure rate	Sharp short-window spike	Verify reachability and permissions of external dependencies (DB/PKI)
Issuance rate	Double the baseline level	Check spike causes (batches/deployments) and rate limits
Total lease count	Climbs and stays high	Suspect a leak. Revisit TTL design and schedule a tidy run
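
One way to tell a steady upward trend from ordinary fluctuation is to fit a line through recent lease-count samples and look at the slope. The sketch below does an ordinary least-squares fit by hand and also computes a renewal failure rate; the sample values and function names are hypothetical, and where the lease counts come from (a gauge in your metrics output, for example) is left to your environment.

```python
def slope_per_hour(samples):
    """Least-squares slope of (timestamp_seconds, lease_count) samples, in leases per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("timestamps must differ")
    return num / den * 3600

def renewal_failure_rate(failed, attempted):
    """Share of renewal attempts that failed; 0.0 when nothing was attempted."""
    return failed / attempted if attempted else 0.0

# Hypothetical hourly samples of the total lease count.
lease_samples = [(0, 1000), (3600, 1100), (7200, 1200), (10800, 1300)]

if __name__ == "__main__":
    print("growth (leases/hour):", round(slope_per_hour(lease_samples), 2))
    print("renewal failure rate:", renewal_failure_rate(3, 120))
```

A slope that stays positive across several days, with no matching growth in workload, is the pattern that suggests a leak rather than a batch spike.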

Lease lifecycle (conceptual)

Helper APIs for lease inspection

# List leases under a specific path (example; uses the LIST verb and requires sudo capability)
curl -sS --request LIST https://vault.example.com:8200/v1/sys/leases/lookup/db/creds/ \
  -H "X-Vault-Token: s.xxxxx" | jq '.'

# Token information (example; the fields are nested under .data)
curl -sS https://vault.example.com:8200/v1/auth/token/lookup-self \
  -H "X-Vault-Token: s.xxxxx" | jq '.data | {id, policies, ttl, orphan}'

Alert Design and Key Points for the Exam

Design alerts around sustained deviation rather than instantaneous peaks. Cover at minimum these five families — HA stability, API latency, 5xx, storage latency, lease renewal failures — and pair them with a dashboard that surfaces correlations between them.

On the Operations exam, recurring themes include enabling telemetry, the prerequisites for fetching /v1/sys/metrics (format=prometheus, permissions), the relationship between HA and storage, and signs of rate limits and audit log errors. The safe bet is to memorize the terminology exactly as the official docs use it.

Set thresholds relative to the baseline (e.g. baseline x 2 sustained for 5 minutes)
Pair the immediacy of sys/health with the trend visibility of Prometheus
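
The "relative to baseline, sustained for a window" rule can be sketched in a few lines. This is an illustrative implementation of the idea, not how Prometheus evaluates alert rules (there you would typically express it with a `for:` duration on the alert); the function name, parameters, and sample data are assumptions.

```python
def sustained_breach(samples, baseline, factor=2.0, min_duration=300):
    """True if values stay above baseline * factor for at least min_duration seconds.

    samples: list of (timestamp_seconds, value), sorted by timestamp.
    """
    threshold = baseline * factor
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach run
            if ts - breach_start >= min_duration:
                return True
        else:
            breach_start = None  # any recovery resets the run
    return False

# Hypothetical p95 latency samples (ms) at 60-second intervals, baseline 100 ms.
spike = [(0, 100), (60, 500), (120, 100), (180, 100), (240, 100), (300, 100), (360, 100)]
sustained = [(t, 250) for t in range(0, 361, 60)]

if __name__ == "__main__":
    print("spike fires:", sustained_breach(spike, baseline=100))
    print("sustained fires:", sustained_breach(sustained, baseline=100))
```

The single 500 ms spike does not fire, while the run held at 2.5x baseline for five minutes does, which matches the intent of alerting on sustained deviation.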

Category	Suggested SLI/SLO example	Primary alert condition (example)
HA / Leader	Changes per 24h, median leader tenure	3+ changes within the last 1h
Latency	p95 ≤ 200ms (hot paths)	p95 exceeds 2x baseline sustained for 5 minutes
Error rate	5xx ≤ 0.5%	Above 1% in the most recent 5 minutes

Correlation dashboard (conceptual layout)

Operational Runbook template (pseudo-recipe)

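# 1) Alert fires (e.g. rising 5xx)
# 2) Fetch /v1/sys/metrics and check path/method/status
# 3) Check the health of the affected path's downstream dependencies (DB/PKI/network)
# 4) Also check /v1/sys/health and the Raft Autopilot state
# 5) Assess the blast radius (tokens/leases) and apply rate limits or workarounds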

Check Yourself with a Question

Ops

Question 1

You want to collect Vault metrics with Prometheus and build dashboards. Which combination of steps is the most appropriate?

A. Configure prometheus_retention_time in the telemetry stanza of server.hcl, have Prometheus scrape /v1/sys/metrics?format=prometheus, and use a token with read on sys/metrics
B. If you collect Vault audit logs with Fluentd, there is no need to enable /v1/sys/metrics
C. Have Prometheus scrape /v1/sys/health every 1 second and derive the latency distribution from that
D. Enabling only StatsD lets Prometheus pick up tag information directly

Answer: A

Prometheus collection works by enabling telemetry and scraping /v1/sys/metrics with format=prometheus. For access control, use a token with read permission on sys/metrics. Audit logs are not a substitute for metrics; /v1/sys/health is a state check and cannot produce a latency distribution; and StatsD tags are not a substitute for direct Prometheus scraping.

Frequently Asked Questions

Are telemetry settings applied dynamically, or is a restart required?

In most cases you need a safe server restart (or a planned rolling restart). Whether settings can be changed live varies by version, so consult the official docs for your version, validate the behavior in a staging environment, and then roll it out to production.

Does /v1/sys/metrics require authentication? How does Prometheus connect?

For production, gate access with a token. Use a Vault policy that grants read on path "sys/metrics", and send that token from Prometheus as a Bearer Token. Make sure TLS certificate verification is configured correctly as well.

Concrete metric names differ across environments. How do I reconcile them?

First dump the actual output of /v1/sys/metrics and design dashboards and alerts around the available labels (path, method, status, mount, and so on). Rather than depending on individual metric names, organize around categories (latency, 5xx, storage delays, leader changes, lease renewal failures); that approach holds up across version differences.


Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
