
Vault Metrics: How to Read the Key Signals

2026-04-19
NicheeLab Editorial Team

Vault is a core platform for distributing secrets at high availability and with strong security guarantees. In operations, the point is not just whether the service is up: the real value comes from watching metrics that surface early warning signs — leader elections, seal state, request latency, storage health, and lease renewal failures.

Building on the behavior documented in the official docs, this article walks through how to collect metrics with Prometheus and how to read and alert on the key signals. Rather than leaning on individual metric names (which shift between versions), we focus on categories and the operational lens you actually use day-to-day.

The Big Picture and How to Collect Metrics (Prometheus/StatsD)

Once the telemetry stanza is enabled, Vault exposes metrics in Prometheus format from the /v1/sys/metrics endpoint (with format=prometheus). You can also push to StatsD/DogStatsD. Pick the collection method that fits your platform, but in recent years direct Prometheus scraping has become the mainstream choice thanks to its ease of dashboard and alert design.

There are two important prerequisites: (1) configure telemetry and then restart the server (or perform a rolling restart that follows a safe maintenance procedure), and (2) in production, gate access to /v1/sys/metrics with a policy that grants read. The most reliable approach is to first inspect the metric list in a development environment, get familiar with the available labels (path, method, status, mount type, etc.), and only then build dashboards on top.

  • Endpoint: /v1/sys/metrics?format=prometheus
  • Setting telemetry.prometheus_retention_time enables in-Vault metric retention
  • StatsD/DogStatsD integrates well with existing APM/monitoring stacks, but label expressiveness tends to be limited

Collection method | Main destination / retrieval method | Strengths | Caveats
Direct Prometheus scrape | Prometheus scrapes /v1/sys/metrics | Rich labels; many visualization templates | Requires endpoint protection plus certificate/token management
StatsD | UDP push to statsd_address | Lightweight; easy to drop onto existing infrastructure | Weak label expression makes detailed slicing difficult
DogStatsD | Push to dogstatsd_address | Tags allow some level of label expression | Depends on a collection agent; watch network reachability
File / external bridge | Collect with an agent and forward to another system | Flexible integration with existing SOC/monitoring | Risk of operational complexity and added latency
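
To make the "inspect the metric list first" step concrete, here is a minimal Python sketch that summarizes which label keys appear on each metric in a saved dump of the Prometheus-format output. The function name summarize_labels and the sample metric names are hypothetical and exist only for illustration; real names and labels come from your own Vault's output.

```python
import re
from collections import defaultdict

# One sample line of the Prometheus text format: name{label="value",...} 123.4
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def summarize_labels(exposition_text):
    """Map each metric name to the sorted list of label keys seen on it."""
    labels_by_metric = defaultdict(set)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, _value = match.groups()
        labels_by_metric[name].update(
            key for key, _ in LABEL_RE.findall(raw_labels or "")
        )
    return {name: sorted(keys) for name, keys in labels_by_metric.items()}


# Illustrative sample only (hypothetical metric names, not real Vault output).
SAMPLE = """\
# HELP example_request_total Hypothetical request counter
# TYPE example_request_total counter
example_request_total{path="auth/",status="200"} 42
example_request_total{path="kv/",status="500"} 3
example_up 1
"""

print(summarize_labels(SAMPLE))
```

Running this against a dump from a development cluster gives a quick inventory of the label keys you can actually aggregate on before any dashboard work starts.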

Typical configuration for collecting Vault metrics

[Diagram: Vault servers (HA: active/standby) -> Prometheus scraping /v1/sys/metrics?format=prometheus (TLS, token) -> Grafana]

Example Vault telemetry and Prometheus scrape configuration (HCL/YAML)

# server.hcl (excerpt)
telemetry {
  prometheus_retention_time = "24h"
  disable_hostname = true
  # Optional (use one of the following)
  # statsd_address   = "127.0.0.1:8125"
  # dogstatsd_address= "127.0.0.1:8125"
}

# Minimal example policy for reading sys/metrics (Vault policy)
path "sys/metrics" {
  capabilities = ["read"]
}

# Prometheus (prometheus.yml excerpt)
scrape_configs:
  - job_name: "vault"
    scheme: https
    metrics_path: /v1/sys/metrics
    params:
      format: ["prometheus"]
    # Token with read permission on sys/metrics. To keep it out of this file,
    # consider authorization.credentials_file instead of an inline token.
    bearer_token: "s.xxxxxxxx"
    tls_config:
      insecure_skip_verify: false
    static_configs:
      - targets: ["vault.example.com:8200"]

Availability / HA Signals (Leader, Seal, Health)

When running HA, keep an eye on the frequency of leader changes and their leading indicators (rising latency, storage slowdowns), seal/unseal events, and standby promotion success or failure. Stable metric signals include leadership-related counters and gauges, the number of seal-state change events, and standby RPC failure counts. Pair this with /v1/sys/health to track state transitions at fixed intervals.

Whether for the exam or for real operations, the key is to focus on rising trends in leader changes or chained sequences (change → latency → change) rather than one-off down detection. If those correlate with 5xx errors or storage warnings, you can intervene early.

  • Angles to watch: leader change count, leader tenure, seal state changes, standby promotion failures
  • The health API (/v1/sys/health) gives immediate state; metrics are stronger for trend analysis

Symptom | Underlying risk | First-response action
Spike in leader changes | Storage latency / unstable network | Check storage I/O and network latency. Test reachability of the standbys
Frequent seal/unseal | Auto-unseal KMS failures or key management issues | Check KMS reachability and permissions; scrutinize audit log errors
Standby promotion failure | ACL/network restrictions, Raft inconsistency | Check ports and certificates; wait for or correct Raft status convergence
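
As a rough illustration of pairing health polling with trend tracking, the Python sketch below maps the default /v1/sys/health status codes to a coarse node state and counts leader changes across a series of polls. The helper names and the (timestamp, leader_id) poll format are assumptions made for this example. The status codes are the documented defaults, which query parameters such as standbyok can override.

```python
# Default status codes returned by /v1/sys/health. Query parameters can
# override them, so treat this table as the default case only.
HEALTH_STATES = {
    200: "active",
    429: "standby",
    472: "dr-secondary",
    473: "performance-standby",
    501: "uninitialized",
    503: "sealed",
}


def classify_health(status_code):
    """Translate a sys/health HTTP status code into a coarse node state."""
    return HEALTH_STATES.get(status_code, "unknown")


def count_leader_changes(observations, window_seconds, now):
    """Count leader changes among polls taken in the last window_seconds.

    observations: list of (unix_timestamp, leader_id) tuples sorted by time.
    A change is counted when two consecutive polls inside the window
    report different leaders.
    """
    recent = [leader for ts, leader in observations if now - ts <= window_seconds]
    return sum(1 for prev, cur in zip(recent, recent[1:]) if prev != cur)


# Hypothetical polls taken every 10 minutes; node names are made up.
polls = [(0, "vault-0"), (600, "vault-0"), (1200, "vault-1"),
         (1800, "vault-0"), (2400, "vault-2"), (3000, "vault-2")]

print(classify_health(429))
print(count_leader_changes(polls, window_seconds=3600, now=3000))
```

Counting changes over a sliding window, rather than reacting to a single failover, is what makes chained sequences visible.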

HA state transitions (conceptual)

[Diagram: Active (Leader) -> failover / preemption -> Standby (New)]

Example health/metrics retrieval (curl)

# Check leader and seal state (health API). This endpoint is unauthenticated,
# so the token header is optional here.
curl -sS https://vault.example.com:8200/v1/sys/health \
  -H "X-Vault-Token: s.xxxxx" | jq '{sealed, standby, initialized, version}'

# Metrics (Prometheus format)
curl -sS "https://vault.example.com:8200/v1/sys/metrics?format=prometheus" \
  -H "X-Vault-Token: s.xxxxx" | head -n 50

Reading Request Latency, Throughput, and Error Rate

Vault API requests are exported with labels such as path (e.g. auth/, sys/, kv/), HTTP method, and status code. That lets you visualize throughput, latency distribution, and 5xx error rate broken down by mount, method, and status. For SLO design, p95/p99 latency and 5xx rate (over the most recent 5-15 minutes) are the practical levers.

Specific metric names and labels can be added or removed across versions, so first inspect the actual output of /v1/sys/metrics, and define dashboards on the assumption that you can aggregate by labels (path, method, status, etc.).

  • Monitor latency by quantile (p95/p99). Visualize sudden spikes and sustained increases separately
  • Aggregate 5xx by route. Rising auth/* errors should also raise suspicion of external IdP or network factors

Metric category | Aggregation axis to watch | Operational note
Throughput | path × method | Identify hot paths and confirm rate-limit impact
Latency | p95/p99 × path | Surfaces backend latency and the cost of key generation
Error rate | status (5xx) × path | Selectively break out 429/403 in separate charts as well

Understanding the latency distribution (conceptual)

[Figure: request count versus latency, with the p50, p90, p95, and p99 positions marked]

Dashboard and alert design templates (PromQL approach)

# Replace the placeholder metric names with the ones in your /v1/sys/metrics output.
# Example: p95 latency (if a histogram/summary is available)
# histogram_quantile(0.95, sum by (le, path) (<request_duration_bucket>))

# Example: 5xx rate (last 5 minutes)
# sum by (path)(rate(<request_total>{status=~"5.."}[5m])) / sum by (path)(rate(<request_total>[5m]))

# Example: throughput (RPS)
# sum(rate(<request_total>[1m]))
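
To show what a histogram-based p95 actually computes, here is a simplified Python sketch of the linear interpolation that histogram_quantile performs over cumulative buckets. It is an approximation for intuition, not the PromQL implementation: edge cases are reduced to the essentials, and the bucket boundaries and counts are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs ending with the
    +Inf bucket. The estimate interpolates linearly inside the bucket that
    contains the target rank, assuming the lowest bucket starts at 0.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            span = count - lower_count
            fraction = (rank - lower_count) / span if span else 0.0
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented latency buckets in seconds: (upper bound, cumulative request count).
BUCKETS = [(0.05, 60), (0.1, 90), (0.25, 98), (0.5, 100), (math.inf, 100)]

print(histogram_quantile(0.95, BUCKETS))
print(histogram_quantile(0.50, BUCKETS))
```

The takeaway for dashboard design is that the estimate is only as fine-grained as the bucket boundaries: a p95 that lands in a wide bucket can move a lot without any real change in latency.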

Storage / Replication Signals (Raft / Integrated Storage)

When using Integrated Storage (built-in Raft), watch commit latency, peer health, snapshot/log compaction progress, and disk I/O bottlenecks. Metrics are exported as storage operation latency and error counts. Weight sustained latency increases and growing retransmissions more heavily than short-lived spikes.

Alongside the metrics, checking Autopilot state and peer health via the API speeds up triage considerably.

  • Number of Raft peers and leader identification; commit/apply latency; snapshot frequency
  • Disk fsync delays and storage-layer errors propagate up into request latency and leader changes

Angle | Verification method | Threshold / rule of thumb
Peer health | Autopilot state / API | All peers should be healthy; investigate degraded peers immediately
Commit latency | Storage-related metrics | Watch for a doubling vs. baseline sustained for 5+ minutes
Snapshot | Log size / generations | Bloat or high frequency signals an I/O bottleneck

Raft conceptual diagram (Active / Peers)

[Diagram: Leader -> Append/Commit (Replication) -> Follower]

Retrieving Autopilot / peer state (reference APIs)

# Autopilot state (reference: supplementary health check)
# The payload is nested under .data, and .servers is a map keyed by node ID.
curl -sS https://vault.example.com:8200/v1/sys/storage/raft/autopilot/state \
  -H "X-Vault-Token: s.xxxxx" | jq '.data | {healthy, failure_tolerance, servers: [.servers[] | {id, name, status, healthy}]}'

# Peer list (the server entries live under .data.config.servers)
curl -sS https://vault.example.com:8200/v1/sys/storage/raft/configuration \
  -H "X-Vault-Token: s.xxxxx" | jq '.data.config.servers[] | {node_id, address, leader, voter}'
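
For scripted triage, the Autopilot response can be reduced to a short summary. The Python sketch below assumes the payload is nested under "data" and that "servers" is a mapping of node ID to server details; confirm that shape against your Vault version before relying on it. The function names and the sample payload are hypothetical.

```python
def unhealthy_servers(autopilot_state):
    """Return the names of servers that Autopilot reports as not healthy.

    autopilot_state: parsed JSON body from sys/storage/raft/autopilot/state.
    Assumption: payload under "data", "servers" keyed by node ID.
    """
    data = autopilot_state.get("data", autopilot_state)
    servers = data.get("servers", {})
    return sorted(
        info.get("name", node_id)
        for node_id, info in servers.items()
        if not info.get("healthy", False)
    )


def triage_summary(autopilot_state):
    """Condense the Autopilot state into the fields used during triage."""
    data = autopilot_state.get("data", autopilot_state)
    bad = unhealthy_servers(autopilot_state)
    return {
        "cluster_healthy": bool(data.get("healthy")) and not bad,
        "failure_tolerance": data.get("failure_tolerance"),
        "unhealthy": bad,
    }


# Hypothetical payload with one degraded peer.
SAMPLE_STATE = {
    "data": {
        "healthy": False,
        "failure_tolerance": 0,
        "servers": {
            "raft1": {"name": "raft1", "healthy": True, "status": "leader"},
            "raft2": {"name": "raft2", "healthy": False, "status": "voter"},
            "raft3": {"name": "raft3", "healthy": True, "status": "voter"},
        },
    }
}

print(triage_summary(SAMPLE_STATE))
```

A failure_tolerance of 0 alongside an unhealthy peer is the combination to escalate first, since one more failure would cost the cluster its quorum.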

Token / Lease / Secret Lifetime Signals

Token issuance and lease renewal failures map directly to authorization defects or trouble in external dependencies (KMS/PKI/DB). On the metrics side, track counts and failure rates for issue/renew/revoke operations and the trend in held leases. A steady upward trend signals a leak; failure spikes point to external-dependency outages or the impact of policy changes.

Adding spot checks against the /sys/leases API (total leases under a specific path and the distribution of remaining TTL) to your operational Runbook speeds up triage during incidents.

  • Numbers to watch: total leases, renewal success rate, source of failures (broken out by path / mount)
  • A buildup of long-lived leases is a signal to run tidy and revisit the policy

Metric | Sign of trouble | Direction of response
Lease renewal failure rate | Sharp short-window spike | Verify reachability and permissions of external dependencies (DB/PKI)
Issuance rate | Double the baseline level | Check spike causes (batches/deployments) and rate limits
Total lease count | Climbs and stays high | Suspect a leak. Revisit TTL design and schedule a tidy run
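
One way to tell a steady climb from ordinary noise is to fit a line through recent lease-count samples. The Python sketch below computes a least-squares slope in leases per hour plus a simple renewal failure rate. The function names and sample numbers are illustrative assumptions, and the inputs would come from whichever lease metrics your Vault version exposes.

```python
def lease_growth_per_hour(samples):
    """Least-squares slope of lease count over time, in leases per hour.

    samples: list of (unix_timestamp, lease_count) tuples.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_c = sum(c for _, c in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return 0.0
    cov = sum((t - mean_t) * (c - mean_c) for t, c in samples)
    return cov / var_t * 3600


def renewal_failure_rate(failed, attempted):
    """Fraction of renewals that failed; 0.0 when nothing was attempted."""
    return failed / attempted if attempted else 0.0


# Invented hourly samples: the lease count grows by 100 every hour.
HOURLY = [(0, 1000), (3600, 1100), (7200, 1200), (10800, 1300)]

print(lease_growth_per_hour(HOURLY))
print(renewal_failure_rate(failed=5, attempted=200))
```

A slope that stays positive across several windows, while issuance is flat, points toward leases that are never revoked rather than toward a traffic increase.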

Lease lifecycle (conceptual)

[Diagram: issue -> active lease -> renew -> active -> expire/revoke -> end]

Helper APIs for lease inspection

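# List leases under a specific path (example). Listing uses the LIST method
# and requires the sudo capability on the sys/leases/lookup path.
curl -sS -X LIST https://vault.example.com:8200/v1/sys/leases/lookup/db/creds/ \
  -H "X-Vault-Token: s.xxxxx" | jq '.'

# Token information (example); the fields are nested under .data
curl -sS https://vault.example.com:8200/v1/auth/token/lookup-self \
  -H "X-Vault-Token: s.xxxxx" | jq '.data | {id, policies, ttl, orphan}'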

Alert Design and Key Points for the Exam

Design alerts around sustained deviation rather than instantaneous peaks. Cover at minimum these five families — HA stability, API latency, 5xx, storage latency, lease renewal failures — and pair them with a dashboard that surfaces correlations between them.

On the Operations exam, recurring themes include enabling telemetry, the prerequisites for fetching /v1/sys/metrics (format=prometheus, permissions), the relationship between HA and storage, and signs of rate limits and audit log errors. The safe bet is to memorize the terminology exactly as the official docs use it.

  • Set thresholds relative to the baseline (e.g. baseline x 2 sustained for 5 minutes)
  • Pair the immediacy of sys/health with the trend visibility of Prometheus

Category | Suggested SLI/SLO example | Primary alert condition (example)
HA / Leader | Changes per 24h, median leader tenure | 3+ changes within the last 1h
Latency | p95 ≤ 200ms (hot paths) | p95 exceeds 2x baseline sustained for 5 minutes
Error rate | 5xx ≤ 0.5% | Above 1% in the most recent 5 minutes
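
The "sustained deviation" rule can be sketched in a few lines of Python. This is an illustrative stand-in for what an alerting rule with a hold duration does, not a replacement for one. The function name, the default factor of 2, and the 300-second duration simply mirror the example thresholds above.

```python
def sustained_breach(samples, baseline, factor=2.0, min_duration=300):
    """Return True if values stay above baseline * factor for min_duration seconds.

    samples: list of (unix_timestamp, value) tuples sorted by time.
    Any sample at or below the threshold resets the breach timer.
    """
    threshold = baseline * factor
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= min_duration:
                return True
        else:
            breach_start = None
    return False


# Invented p95 latency samples (seconds) taken once a minute; baseline is 0.1s.
SPIKE = [(0, 0.1), (60, 0.5), (120, 0.1), (180, 0.1), (240, 0.1), (300, 0.1)]
SUSTAINED = [(0, 0.25), (60, 0.3), (120, 0.25), (180, 0.3), (240, 0.25), (300, 0.3)]

print(sustained_breach(SPIKE, baseline=0.1))
print(sustained_breach(SUSTAINED, baseline=0.1))
```

The one-minute spike never trips the rule, while the series that stays above 2x baseline for the full five minutes does, which is the behavior the thresholds above are meant to produce.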

Correlation dashboard (conceptual layout)

[Dashboard layout: three panels side by side: Latency p95 | 5xx Error % | Leader changes]

Operational Runbook template (pseudo-recipe)

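# 1) Alert fires (e.g. rising 5xx)
# 2) Fetch /v1/sys/metrics and check path/method/status
# 3) Check the health of the downstream dependencies of the affected path (DB/PKI/network)
# 4) Also check /v1/sys/health and the Raft Autopilot state
# 5) Confirm the blast radius (tokens/leases) and apply rate limits or workarounds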

Check Yourself with a Question

Ops

Question 1

You want to collect Vault metrics with Prometheus and build dashboards. Which combination of steps is the most appropriate?

  1. Configure prometheus_retention_time in the telemetry stanza of server.hcl, have Prometheus scrape /v1/sys/metrics?format=prometheus, and use a token with read on sys/metrics
  2. If you collect Vault audit logs with Fluentd, there is no need to enable /v1/sys/metrics
  3. Have Prometheus scrape /v1/sys/health every 1 second and derive the latency distribution from that
  4. Enabling only StatsD lets Prometheus pick up tag information directly

Correct answer: 1

Prometheus collection works by enabling telemetry and scraping /v1/sys/metrics with format=prometheus. For access control, use a token with read permission on sys/metrics. Audit logs are not a substitute for metrics; /v1/sys/health is a state check and cannot produce a latency distribution; and StatsD tags are not a substitute for direct Prometheus scraping.

Frequently Asked Questions

Are telemetry settings applied dynamically, or is a restart required?

In most cases you need a safe server restart (or a planned rolling restart). Whether settings can be changed live varies by version, so consult the official docs for your version, validate the behavior in a staging environment, and then roll it out to production.

Does /v1/sys/metrics require authentication? How does Prometheus connect?

For production, gate access with a token. Use a Vault policy that grants read on path "sys/metrics", and send that token from Prometheus as a Bearer Token. Make sure TLS certificate verification is configured correctly as well.

Concrete metric names differ across environments. How do I reconcile them?

First dump the actual output of /v1/sys/metrics and design dashboards and alerts around the available labels (path, method, status, mount, and so on). Rather than depending on individual metric names, organize around categories (latency, 5xx, storage delays, leader changes, lease renewal failures); that approach holds up across version differences.
