Vault is a core platform for distributing secrets with high availability and strong security guarantees. In operations, knowing whether the service is up is not enough. The real value comes from watching metrics that surface early warning signs such as leader elections, seal state changes, request latency, storage health, and lease renewal failures.
Building on the behavior documented in the official docs, this article walks through how to collect metrics with Prometheus and how to read and alert on the key signals. Rather than leaning on individual metric names (which shift between versions), we focus on categories and the operational lens you actually use day-to-day.
Once the telemetry stanza is enabled, Vault exposes metrics in Prometheus format from the /v1/sys/metrics endpoint (with format=prometheus). You can also push to StatsD/DogStatsD. Pick the collection method that fits your platform, but in recent years direct Prometheus scraping has become the mainstream choice thanks to its ease of dashboard and alert design.
There are two important prerequisites: (1) configure telemetry and then restart the server (or perform a rolling restart following a safe maintenance procedure), and (2) in production, gate access to /v1/sys/metrics with a policy granting read. The most reliable approach is to first inspect the metric list in a development environment, get familiar with the available labels (path, method, status, mount type, etc.), and only then build dashboards on top.
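To make that inspection step concrete, the following is a minimal sketch that lists metric names and the label keys attached to them from a saved dump of the Prometheus text output. The sample text and the `example_*` metric names are hypothetical placeholders, not real Vault output, and the regular expressions cover only the common `name{labels} value` line shape.

```python
import re
from collections import defaultdict

# One sample line of the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One key="value" pair inside the braces
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def summarize_metrics(text: str) -> dict[str, set[str]]:
    """Map each metric name to the set of label keys seen on its samples."""
    summary: dict[str, set[str]] = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, labels, _value = match.groups()
        summary[name]  # register the metric even when it carries no labels
        for key, _val in LABEL_RE.findall(labels or ""):
            summary[name].add(key)
    return dict(summary)


# Hypothetical sample for illustration only; substitute your own dump.
SAMPLE = '''\
# HELP example_request_total Illustrative counter
# TYPE example_request_total counter
example_request_total{path="kv/",method="GET",status="200"} 42
example_request_total{path="sys/",method="PUT",status="500"} 3
example_up 1
'''

if __name__ == "__main__":
    for name, keys in sorted(summarize_metrics(SAMPLE).items()):
        print(name, sorted(keys))
```

Running this against a dump from a development server gives a quick inventory of which labels are actually available before any dashboard is designed around them.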
| Collection method | Main destination / retrieval method | Strengths | Caveats |
|---|---|---|---|
| Direct Prometheus scrape | Prometheus scrapes /v1/sys/metrics | Rich labels, lots of visualization templates | Requires endpoint protection plus certificate/token management |
| StatsD | UDP push to statsd_address | Lightweight; easy to drop onto existing infrastructure | Weak label expression makes detailed slicing difficult |
| DogStatsD | Push to dogstatsd_address | Tags allow some level of label expression | Depends on a collection agent; watch network reachability |
| File / external bridge | Collect with an agent and forward to another system | Flexible integration with existing SOC/monitoring | Risk of operational complexity and added latency |
Typical configuration for collecting Vault metrics
Example Vault telemetry and Prometheus scrape configuration (HCL/YAML)
# server.hcl (excerpt)
telemetry {
  prometheus_retention_time = "24h"
  disable_hostname          = true
  # Optional: use one of the following if you also push metrics
  # statsd_address    = "127.0.0.1:8125"
  # dogstatsd_address = "127.0.0.1:8125"
}

# Minimal policy for reading sys/metrics (Vault policy)
path "sys/metrics" {
  capabilities = ["read"]
}

# Prometheus (prometheus.yml excerpt)
scrape_configs:
  - job_name: "vault"
    scheme: https
    metrics_path: /v1/sys/metrics
    params:
      format: ["prometheus"]
    bearer_token: "s.xxxxxxxx" # token with read access to sys/metrics
    tls_config:
      insecure_skip_verify: false
    static_configs:
      - targets: ["vault.example.com:8200"]

When running HA, keep an eye on the frequency of leader changes and their leading indicators (rising latency, storage slowdowns), seal/unseal events, and standby promotion success or failure. Stable metric signals include leadership-related counters and gauges, the number of seal-state change events, and standby RPC failure counts. Pair this with /v1/sys/health to track state transitions at fixed intervals.
Whether for the exam or for real operations, the key is to focus on rising trends in leader changes or chained sequences (change → latency → change) rather than one-off down detection. If those correlate with 5xx errors or storage warnings, you can intervene early.
| Symptom | Underlying risk | First-response action |
|---|---|---|
| Spike in leader changes | Storage latency / unstable network | Check storage I/O and network latency. Test reachability of the standbys |
| Frequent seal/unseal | Auto-unseal KMS failures or key management issues | Check KMS reachability and permissions; scrutinize audit log errors |
| Standby promotion failure | ACL/network restrictions, Raft inconsistency | Check ports and certificates; wait for or correct Raft status convergence |
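As an illustration of tracking state transitions at fixed intervals, here is a minimal sketch that counts leader changes inside a time window. It assumes you already poll each node's health endpoint and record which node reported itself as active; the `HealthSample` type and its `active_node` field are hypothetical names introduced for this example, not part of any Vault API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSample:
    timestamp: float   # Unix seconds when the poll was taken
    active_node: str   # node that reported standby == false (hypothetical label)


def leader_changes(samples: list[HealthSample], now: float,
                   window_s: float = 3600.0) -> int:
    """Count transitions of the active node among samples inside the window."""
    recent = sorted(
        (s for s in samples if now - window_s <= s.timestamp <= now),
        key=lambda s: s.timestamp,
    )
    # Compare each sample with its successor; a differing node is one change.
    return sum(
        1 for prev, cur in zip(recent, recent[1:])
        if prev.active_node != cur.active_node
    )


if __name__ == "__main__":
    history = [
        HealthSample(0, "a"), HealthSample(600, "a"), HealthSample(1200, "b"),
        HealthSample(1800, "a"), HealthSample(2400, "c"),
    ]
    print(leader_changes(history, now=3000))
```

A count like this is more useful as a trend than as a single reading: plotting it next to latency and storage signals makes chained sequences easier to spot.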
HA state transitions (conceptual)
Example health/metrics retrieval (curl)
# Check leader and seal state (health API)
curl -sS https://vault.example.com:8200/v1/sys/health \
  -H "X-Vault-Token: s.xxxxx" | jq '{sealed, standby, initialized, version}'

# Metrics (Prometheus format)
curl -sS "https://vault.example.com:8200/v1/sys/metrics?format=prometheus" \
  -H "X-Vault-Token: s.xxxxx" | head -n 50

Vault API requests are exported with labels such as path (e.g. auth/, sys/, kv/), HTTP method, and status code. That lets you visualize throughput, latency distribution, and 5xx error rate broken down by mount, method, and status. For SLO design, p95/p99 latency and 5xx rate (over the most recent 5-15 minutes) are the practical levers.
Specific metric names and labels can be added or removed across versions, so first inspect the actual output of /v1/sys/metrics, and define dashboards on the assumption that you can aggregate by labels (path, method, status, etc.).
| Metric category | Aggregation axis to watch | Operational note |
|---|---|---|
| Throughput | path × method | Identify hot paths and confirm rate-limit impact |
| Latency | p95/p99 × path | Surfaces backend latency and the cost of key generation |
| Error rate | status(5xx) × path | Selectively break out 429/403 in separate charts as well |
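The error-rate row above can be sketched outside PromQL as well. The following minimal example computes the share of 5xx responses per path from two snapshots of a cumulative request counter. The `(path, status)` keying mirrors the labels described in this section, but the snapshot structure itself is an assumption made for illustration.

```python
from collections import defaultdict

# (path, status) -> cumulative request count at one point in time
Snapshot = dict[tuple[str, str], float]


def error_ratio_by_path(earlier: Snapshot, later: Snapshot) -> dict[str, float]:
    """Return the share of 5xx responses per path between two counter snapshots."""
    total: dict[str, float] = defaultdict(float)
    errors: dict[str, float] = defaultdict(float)
    for (path, status), value in later.items():
        delta = value - earlier.get((path, status), 0.0)
        if delta < 0:  # counter reset (e.g. a restart): count from zero again
            delta = value
        total[path] += delta
        if status.startswith("5"):
            errors[path] += delta
    # Paths with no traffic in the interval are omitted to avoid dividing by zero.
    return {path: errors[path] / count for path, count in total.items() if count > 0}


if __name__ == "__main__":
    before = {("kv/", "200"): 100, ("kv/", "500"): 1, ("sys/", "200"): 50}
    after = {("kv/", "200"): 290, ("kv/", "500"): 11, ("sys/", "200"): 60}
    print(error_ratio_by_path(before, after))
```

Working from deltas rather than raw totals is the same idea as `rate()` in PromQL: a counter that has been running for weeks would otherwise hide a recent burst of errors.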
Understanding the latency distribution (conceptual)
Dashboard and alert design templates (PromQL approach)
# Replace the placeholder metric names to match your actual /v1/sys/metrics output.
# Example: p95 latency (requires a histogram; summaries already expose a quantile label)
#   histogram_quantile(0.95, sum by (le, path) (rate(<request_duration_bucket>[5m])))
# Example: 5xx rate (last 5 minutes)
#   sum by (path)(rate(<request_total>{status=~"5.."}[5m])) / sum by (path)(rate(<request_total>[5m]))
# Example: throughput (RPS)
#   sum(rate(<request_total>[1m]))

When using Integrated Storage (built-in Raft), watch commit latency, peer health, snapshot/log compaction progress, and disk I/O bottlenecks. Metrics are exported as storage operation latency and error counts. Weight sustained latency increases and growing retransmissions more heavily than short-lived spikes.
Alongside the metrics, checking Autopilot state and peer health via the API speeds up triage considerably.
| Angle | Verification method | Threshold / rule of thumb |
|---|---|---|
| Peer health | Autopilot state / API | All peers should be healthy; investigate degraded peers immediately |
| Commit latency | Storage-related metrics | Watch for a doubling vs. baseline sustained for 5+ minutes |
| Snapshot | Log size / generations | Bloat or high frequency signals an I/O bottleneck |
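The "doubling vs. baseline sustained for 5+ minutes" rule of thumb can be expressed as a small check. This is a minimal sketch under the assumption that you have a list of `(unix_seconds, latency)` samples and a baseline value measured separately; the function name and parameters are illustrative.

```python
def sustained_breach(samples: list[tuple[float, float]], baseline: float,
                     factor: float = 2.0, hold_s: float = 300.0) -> bool:
    """True if the value has stayed above factor * baseline for at least
    hold_s seconds, measured up to the newest sample."""
    threshold = factor * baseline
    breach_start = None
    for ts, value in sorted(samples):
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach run
        else:
            breach_start = None    # any recovery resets the run
    if breach_start is None:
        return False
    newest = max(ts for ts, _ in samples)
    return newest - breach_start >= hold_s


if __name__ == "__main__":
    # Baseline of 10 ms, one sample per minute.
    sustained = [(i * 60.0, v) for i, v in enumerate([12, 25, 30, 28, 27, 26, 29])]
    spike = [(i * 60.0, v) for i, v in enumerate([12, 50, 11, 12, 12, 12, 12])]
    print(sustained_breach(sustained, baseline=10.0))
    print(sustained_breach(spike, baseline=10.0))
```

Resetting the run on any recovery is a deliberately strict choice; a more tolerant variant could allow a small number of samples below the threshold before resetting.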
Raft conceptual diagram (Active / Peers)
Retrieving Autopilot / peer state (reference APIs)
# Autopilot state (reference: supplementary health check)
curl -sS https://vault.example.com:8200/v1/sys/storage/raft/autopilot/state \
  -H "X-Vault-Token: s.xxxxx" | jq '.data | {healthy, failure_tolerance, servers: [.servers[] | {id, name, status, healthy}]}'

# Peer list
curl -sS https://vault.example.com:8200/v1/sys/storage/raft/configuration \
  -H "X-Vault-Token: s.xxxxx" | jq '.data.config.servers[] | {node_id, address, leader, voter}'

Token issuance and lease renewal failures map directly to authorization defects or trouble in external dependencies (KMS/PKI/DB). On the metrics side, track counts and failure rates for issue/renew/revoke operations and the trend in held leases. A steady upward trend signals a leak; failure spikes point to external-dependency outages or the impact of policy changes.
Adding spot checks against the /sys/leases API (total leases under a specific path and the distribution of remaining TTL) to your operational Runbook speeds up triage during incidents.
| Metric | Sign of trouble | Direction of response |
|---|---|---|
| Lease renewal failure rate | Sharp short-window spike | Verify reachability and permissions of external dependencies (DB/PKI) |
| Issuance rate | Double the baseline level | Check spike causes (batches/deployments) and rate limits |
| Total lease count | Climbs and stays high | Suspect a leak. Revisit TTL design and schedule a tidy run |
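To tell a steady climb in lease count apart from ordinary fluctuation, one option is to fit a straight line through recent samples and look at its slope. The sketch below uses an ordinary least-squares fit; the `(unix_seconds, lease_count)` sample format is an assumption, and what slope counts as a leak depends on your own baseline.

```python
def slope_per_hour(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of (unix_seconds, lease_count) samples, in leases per hour."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:  # all samples share one timestamp: no trend can be fitted
        return 0.0
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return cov / var_t * 3600.0


if __name__ == "__main__":
    climbing = [(0.0, 1000.0), (3600.0, 1100.0), (7200.0, 1200.0)]
    flat = [(0.0, 500.0), (3600.0, 500.0)]
    print(slope_per_hour(climbing))
    print(slope_per_hour(flat))
```

A positive slope that persists across several evaluation windows is a stronger leak signal than one large reading, which matches the "climbs and stays high" wording in the table.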
Lease lifecycle (conceptual)
Helper APIs for lease inspection
# List leases under a specific path (example; uses the LIST method and requires sudo capability)
curl -sS -X LIST https://vault.example.com:8200/v1/sys/leases/lookup/db/creds/ \
  -H "X-Vault-Token: s.xxxxx" | jq '.'

# Token information (example)
curl -sS https://vault.example.com:8200/v1/auth/token/lookup-self \
  -H "X-Vault-Token: s.xxxxx" | jq '.data | {id, policies, ttl, orphan}'

Design alerts around sustained deviation rather than instantaneous peaks. Cover at minimum these five families (HA stability, API latency, 5xx, storage latency, lease renewal failures) and pair them with a dashboard that surfaces correlations between them.
On the Operations exam, recurring themes include enabling telemetry, the prerequisites for fetching /v1/sys/metrics (format=prometheus, permissions), the relationship between HA and storage, and signs of rate limits and audit log errors. The safe bet is to memorize the terminology exactly as the official docs use it.
| Category | Suggested SLI/SLO example | Primary alert condition (example) |
|---|---|---|
| HA / Leader | Changes per 24h, median leader tenure | 3+ changes within the last 1h |
| Latency | p95 ≤ 200ms (hot paths) | p95 exceeds 2x baseline sustained for 5 minutes |
| Error rate | 5xx ≤ 0.5% | Above 1% in the most recent 5 minutes |
Correlation dashboard (conceptual layout)
Operational Runbook template (pseudo-recipe)
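# 1) An alert fires (e.g. 5xx responses increase)
# 2) Fetch /v1/sys/metrics and check path/method/status
# 3) Check the health of the downstream dependencies (DB/PKI/network) behind the affected path
# 4) Also check /v1/sys/health and the Raft Autopilot state
# 5) Confirm the blast radius (tokens/leases) and apply rate limits or workarounds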
Question 1
You want to collect Vault metrics with Prometheus and build dashboards. Which combination of steps is the most appropriate?
Correct answer: A
Prometheus collection works by enabling telemetry and scraping /v1/sys/metrics with format=prometheus. For access control, use a token with read permission on sys/metrics. Audit logs are not a substitute for metrics; /v1/sys/health is a state check and cannot produce a latency distribution; and StatsD tags are not a substitute for direct Prometheus scraping.
Are telemetry settings applied dynamically, or is a restart required?
In most cases you need a safe server restart (or a planned rolling restart). Whether settings can be changed live varies by version, so consult the official docs for your version, validate the behavior in a staging environment, and then roll it out to production.
Does /v1/sys/metrics require authentication? How does Prometheus connect?
For production, gate access with a token. Use a Vault policy that grants read on path "sys/metrics", and send that token from Prometheus as a Bearer Token. Make sure TLS certificate verification is configured correctly as well.
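For a quick check outside Prometheus, the same request can be assembled by hand. This is a minimal sketch that only builds the request object; the host name and token are the placeholders used elsewhere in this article, and `build_metrics_request` is a hypothetical helper name.

```python
import urllib.parse
import urllib.request


def build_metrics_request(base_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for Vault metrics in Prometheus format."""
    query = urllib.parse.urlencode({"format": "prometheus"})
    url = f"{base_url.rstrip('/')}/v1/sys/metrics?{query}"
    # The token is sent as a Bearer token, matching what Prometheus does.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


if __name__ == "__main__":
    req = build_metrics_request("https://vault.example.com:8200", "s.xxxxxxxx")
    print(req.get_method(), req.full_url)
```

Passing the result to `urllib.request.urlopen(req, timeout=5)` would send it; certificate verification is on by default there, so a private CA needs to be trusted by the client first.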
Concrete metric names differ across environments. How do I reconcile them?
First dump the actual output of /v1/sys/metrics and design dashboards and alerts around the available labels (path, method, status, mount, and so on). Rather than depending on individual metric names, organize around categories (latency, 5xx, storage delays, leader changes, lease renewal failures); that approach holds up across version differences.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.