This article catalogs the failures you actually run into when operating Vault, paired with patterns that lead you to the root cause as quickly as possible. It follows the official documentation's baseline behavior and focuses on concepts and commands that hold up across environment and version differences.
It also covers themes that appear frequently on HashiCorp certifications (Security Automation / Vault) — health endpoint return codes, redirects under HA, token/lease constraints, and Raft quorum and snapshots — phrased the way the exam tends to ask about them.
Vault failures are easiest to debug when you split them into five layers: client (CLI/SDK), network/TLS, Vault server (init/seal/state), storage/HA (Raft/Consul), and auth/policy/secrets. Pin a working hypothesis on which layer is failing first, then drill down.
At minimum, check server state (initialized/sealed/standby/active), API reachability, load balancer health, and major log errors (panic, permission denied, context deadline exceeded), all on a single timeline. Mis-set environment variables (VAULT_ADDR, VAULT_TOKEN, VAULT_CACERT, etc.) are an extremely common cause.
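Because mis-set environment variables are such a common cause, a quick sanity check can run before any other diagnosis. The following is a minimal sketch, assuming a POSIX-style shell; `check_vault_env` is a hypothetical helper name, not part of the Vault CLI, and it only catches obvious mistakes (missing scheme, unreadable CA file).

```shell
# Hypothetical helper (not part of the Vault CLI): flag obviously mis-set client variables.
# Prints one line per problem and returns non-zero if any problem was found.
check_vault_env() {
  local problems=0
  case "${VAULT_ADDR:-}" in
    "")
      echo "VAULT_ADDR is not set; the CLI will use its built-in default address"
      problems=$((problems + 1)) ;;
    http://*|https://*) ;;  # scheme present, nothing to report
    *)
      echo "VAULT_ADDR is missing an http:// or https:// scheme: $VAULT_ADDR"
      problems=$((problems + 1)) ;;
  esac
  if [ -n "${VAULT_CACERT:-}" ] && [ ! -r "$VAULT_CACERT" ]; then
    echo "VAULT_CACERT does not point to a readable file: $VAULT_CACERT"
    problems=$((problems + 1))
  fi
  [ "$problems" -eq 0 ]
}
```

Calling `check_vault_env || echo "fix the environment first"` before the other diagnostics keeps a client-side typo from being mistaken for a server failure.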
Initial diagnostic commands (connectivity, state, environment)
export VAULT_ADDR="https://vault.example.com:8200"
export VAULT_CACERT="/etc/ssl/certs/vault-ca.pem"
# Server state (CLI)
vault status
# Health endpoint (compare via the LB and directly against each node)
curl -sS --cacert "$VAULT_CACERT" "$VAULT_ADDR/v1/sys/health"
# Environment check (the output includes VAULT_TOKEN if it is set, so do not paste it into tickets)
env | grep -E "^VAULT_|^HTTPS?_" | sort
# Server logs (systemd)
sudo journalctl -u vault -n 200 --no-pager
# Port reachability (firewall/LB check)
nc -vz vault.example.com 8200 || true
Misjudging whether the cluster is initialized leads you down the wrong path. Use /v1/sys/init to confirm the initialized boolean, and never re-init an already-initialized cluster. Shamir unseal requires threshold key shares from the same cluster; mixing keys from a different cluster is a classic incident.
Auto-unseal (KMS, etc.) failures are usually caused by insufficient KMS permissions, network unreachability, or a misconfigured seal block (type or parameters) in the Vault config. KMS errors are typically reported in the Vault server logs, so check there first. Startup loops can also be triggered when an audit device's output destination is not writable.
Checking init state and performing unseal
# Check init state (confirm the cluster is truly uninitialized)
curl -sS --cacert "$VAULT_CACERT" "$VAULT_ADDR/v1/sys/init" | jq .
# Initialize (only when uninitialized; choose shares/threshold per your operational standards)
# Split the resulting Unseal Keys and Initial Root Token and store them securely
vault operator init \
  -key-shares=5 \
  -key-threshold=3 \
  -format=json > /secure/offline/vault-init.json
# Unseal with Shamir (run once per key share until the threshold is reached)
vault operator unseal
# With auto-unseal, check the logs for KMS-side failures (example)
sudo journalctl -u vault -n 200 --no-pager | grep -i -E "kms|seal|unseal|permission|denied"
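The rule "never re-init an already-initialized cluster" can be enforced with a small guard around the sys/init check. This is a dependency-free sketch; `needs_init` is a hypothetical helper name, and it deliberately treats anything other than an explicit `"initialized": false` (including an empty or failed response) as "do not initialize".

```shell
# Hypothetical guard (not a Vault command): succeed only when the sys/init
# response body explicitly reports "initialized": false.
needs_init() {
  local compact
  # Strip whitespace so both compact and pretty-printed JSON match.
  compact=$(printf '%s' "$1" | tr -d ' \t\r\n')
  case "$compact" in
    *'"initialized":false'*) return 0 ;;
    *) return 1 ;;   # true, empty, or unparseable: do not initialize
  esac
}
```

A possible use is to capture the body with `body=$(curl -sS --cacert "$VAULT_CACERT" "$VAULT_ADDR/v1/sys/init")` and run `vault operator init` only when `needs_init "$body"` succeeds. Failing closed matters here, because a network error must never be read as "uninitialized".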
Raft requires majority quorum. If node failures or network partitions prevent leader election, the whole cluster stalls. Start with list-peers and autopilot to see voting status, the leader, and replication lag.
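To make the majority rule concrete, the arithmetic can be sketched as two small shell functions. The function names are illustrative only; the input is the number of voting nodes.

```shell
# Majority quorum for a cluster of N voting nodes: floor(N/2) + 1.
raft_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of voters that can fail while a majority remains: floor((N-1)/2).
raft_failure_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}
```

For example, `raft_quorum 5` gives 3 and `raft_failure_tolerance 5` gives 2, while a 4-node cluster still needs 3 votes and tolerates only 1 failure, which is why odd cluster sizes are the usual choice.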
Disk pressure is the most common failure. Take and retain snapshots on a schedule, with operations ready to restore on demand. TLS/hostname mismatches and misconfigured api_addr/cluster_addr are also major causes of forwarding and redirect failures.
Investigating Raft and operating snapshots
# Peer status and autopilot state
vault operator raft list-peers -format=json | jq .
vault operator raft autopilot state -format=json | jq .
# Take a snapshot (build this into scheduled backups)
vault operator raft snapshot save "/backup/vault-$(date +%F).snap"
# Restore (always rehearse the procedure in a test environment first)
# vault operator raft snapshot restore /backup/vault-YYYY-MM-DD.snap
# Disk check
df -h /var/lib/vault
inode=$(df -i /var/lib/vault | awk 'NR==2{print $4}') && echo "free inodes: $inode"
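Retaining snapshots on a schedule also means pruning old ones, otherwise the backups themselves contribute to disk pressure. The sketch below assumes the date-stamped `vault-YYYY-MM-DD.snap` naming used above, so that lexicographic order equals chronological order; `prune_snapshots` is a hypothetical helper name.

```shell
# Hypothetical helper: keep only the newest KEEP snapshots in DIR.
# Relies on date-stamped names so that glob order is oldest-first.
prune_snapshots() {
  local dir=$1 keep=$2 excess
  set -- "$dir"/vault-*.snap
  [ -e "$1" ] || return 0          # the glob matched nothing
  excess=$(( $# - keep ))
  while [ "$excess" -gt 0 ]; do
    rm -f -- "$1"                  # delete the oldest remaining snapshot
    shift
    excess=$((excess - 1))
  done
}
```

Running `prune_snapshots /backup 14` after each successful `snapshot save` would keep two weeks of daily snapshots; the retention count is an assumption to adapt to your RPO/RTO.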
When you front an HA cluster with a load balancer, api_addr must be resolvable and reachable from clients, and standby nodes must be able to forward/redirect to the active node correctly. cluster_addr is used for inter-node communication.
The health endpoint returns different status codes per state. By default: 200 for active, 429 for standby, 503 for sealed, 501 for uninitialized. If you want LB health checks to also pass standby nodes, use standbyok=true and adjust the return codes via query parameters as needed.
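When comparing LB-routed and direct-node results, it can help to translate the raw status code into a state name. The mapping below is a sketch using the default codes listed above; 472 and 473 are the documented defaults for DR secondary and performance standby nodes (Enterprise), added here for completeness. `health_state` is a hypothetical helper name, and the mapping no longer holds if you override codes via query parameters.

```shell
# Hypothetical helper: map a default sys/health HTTP status code to a state name.
health_state() {
  case "$1" in
    200) echo "active" ;;
    429) echo "standby" ;;
    472) echo "dr-secondary" ;;
    473) echo "performance-standby" ;;
    501) echo "uninitialized" ;;
    503) echo "sealed" ;;
    *)   echo "unknown" ;;   # e.g. 000 when curl could not connect at all
  esac
}
```

One way to use it is `health_state "$(curl -sS -o /dev/null -w '%{http_code}' --cacert "$VAULT_CACERT" "$VAULT_ADDR/v1/sys/health")"`, run once through the LB and once against each node.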
[Figure: Vault request flow (HA + LB)]
Practical health endpoint examples
# Treat both active and standby as healthy (the LB expects 200)
curl -sS --cacert "$VAULT_CACERT" \
  "$VAULT_ADDR/v1/sys/health?standbyok=true&perfstandbyok=true" | jq .
# Specify per-state codes explicitly (as needed)
# Example: active 200, standby 200, sealed 503, uninitialized 501
curl -sS --cacert "$VAULT_CACERT" \
  "$VAULT_ADDR/v1/sys/health?standbyok=true&activecode=200&standbycode=200&sealedcode=503&uninitcode=501" -o /dev/null -w "%{http_code}\n"
For token TTLs, the issuance TTL must not exceed the mount's max_ttl, and the token itself must be renewable. Periodic tokens are exempt from max_ttl but must be renewed within each period, or they expire. When renewal fails, the lease/token is most likely either not renewable or attempting to exceed an upper TTL limit.
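The upper-limit case can be reasoned about with simple arithmetic. The sketch below assumes a non-periodic token whose max TTL is measured from its creation time; `ttl_headroom` is a hypothetical helper name, and all three arguments (creation time, max TTL, current time, in seconds) are values you supply yourself.

```shell
# Hypothetical helper: the longest lifetime the token can still have,
# given that it can never live past creation time + max TTL.
# Arguments: creation time (epoch seconds), max TTL (seconds), current time (epoch seconds).
ttl_headroom() {
  local remaining=$(( $1 + $2 - $3 ))
  [ "$remaining" -gt 0 ] || remaining=0   # already past the cap
  echo "$remaining"
}
```

When the result is close to zero, a renewal cannot meaningfully extend the token, and issuing a new token (or using a periodic token) is the realistic fix.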
OIDC/JWT auth failures classically come from audience, iss, or bound_claims mismatches, or clock drift. For permission denied caused by missing policies, the fastest check is to inspect the target path with token capabilities. Lease renew/revoke for dynamic secrets (databases, etc.) often fails on network or backend permissions, so correlate with engine-side logs.
Token / lease investigation examples
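# Inspect token attributes
vault token lookup "$VAULT_TOKEN"
# Check effective capabilities on the target path (to isolate permission denied)
vault token capabilities "$VAULT_TOKEN" secret/data/app/config
# List and renew leases (renewal errors out if the lease is not renewable)
vault list sys/leases/lookup/database/creds/app-role || true
vault lease renew <lease_id>
# Check TTL settings on mounts and roles (-format=json is required for jq)
vault read -format=json sys/mounts | jq '.data'
# Example: for KV, mind the version and path (kv-v2 uses the data/ path)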
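To script the permission denied check, the capability list can be tested for a specific capability. This is a sketch that assumes the CLI prints capabilities as a comma-separated list such as `create, read, update` (or `deny` / `root`); `has_capability` is a hypothetical helper name.

```shell
# Hypothetical helper: does a comma-separated capability list grant WANTED?
# Usage: has_capability "<capability list>" <wanted>
has_capability() {
  local list wanted=$2
  list=$(printf '%s' "$1" | tr -d ' \t\r\n')   # "read, list" -> "read,list"
  case ",$list," in
    *,deny,*) return 1 ;;          # deny overrides every other capability
    *,root,*) return 0 ;;          # the root policy allows everything
    *",$wanted,"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For instance, `has_capability "$(vault token capabilities "$VAULT_TOKEN" secret/data/app/config)" read` succeeds only if the token can read that path, which separates a policy gap from a wrong path (for kv-v2, remember the data/ segment).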
The table below maps common production log messages to their causes, what to check, and commands you can run immediately. The golden rule: pin down a reproducer first, then compare LB-routed vs. direct node calls and with/without TLS.
| Symptom / Log excerpt | Main cause | Where to check | Representative command |
|---|---|---|---|
| Vault is sealed | Not unsealed / Auto-unseal failure | /v1/sys/health, server logs, seal config | vault status; journalctl -u vault |
| permission denied | Insufficient policy / wrong path | token capabilities, policies | vault token capabilities <token> <path> |
| context deadline exceeded | Network outage / redirect target unreachable | LB vs. direct node comparison, api_addr reachability | curl /v1/sys/health; ping/trace |
| request forwarding failed | api_addr unreachable / certificate SAN mismatch | Each node's api_addr and certificate | vault status (against each node directly) |
| cluster leader not found | Raft election unfinished / quorum not met | raft list-peers, autopilot, node liveness | vault operator raft list-peers |
| x509: certificate signed by unknown authority | CA not configured / not trusted | VAULT_CACERT / system CA, root certificate | curl --cacert <ca.pem> ... |
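The table can also be applied mechanically as a first-pass triage. The sketch below matches each log excerpt from the table against a line and prints a short label for the layer to investigate; `classify_log_line` and the label names are illustrative assumptions, and real log lines may need additional patterns.

```shell
# Hypothetical helper: map a log line to the layer suggested by the table above.
classify_log_line() {
  local line
  line=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')   # case-insensitive match
  case "$line" in
    *"x509"*)                       echo "tls" ;;         # CA not configured / not trusted
    *"vault is sealed"*)            echo "seal" ;;        # not unsealed / auto-unseal failure
    *"permission denied"*)          echo "policy" ;;      # insufficient policy / wrong path
    *"request forwarding failed"*)  echo "forwarding" ;;  # api_addr unreachable / SAN mismatch
    *"leader not found"*)           echo "raft" ;;        # election unfinished / no quorum
    *"context deadline exceeded"*)  echo "network" ;;     # outage / redirect target unreachable
    *)                              echo "unclassified" ;;
  esac
}
```

Feeding recent journal lines through the function and counting labels (for example with `sort | uniq -c`) gives a quick view of which layer dominates during an incident.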
Pulling diagnostic keywords from logs (example)
sudo journalctl -u vault -S "-5min" | grep -i -E \
"sealed|unseal|forward|leader|raft|denied|deadline|certificate|x509" -n
Question 1
You operate Vault as an HA cluster with a load balancer in front. Which setting is most appropriate for letting standby nodes also pass the health check?
Correct answer: A (add standbyok=true to the health check)
The health endpoint returns 429 by default for standby nodes, so to make standby pass the LB health check you add standbyok=true (and perfstandbyok=true if needed). api_addr should be a per-node reachable value — setting the same VIP on every node is wrong. Forcibly promoting standby to active is neither necessary nor appropriate. /v1/sys/leader has a different purpose and is insufficient as a standalone LB health check.
Is a 429 from sys/health a failure? Can standby nodes return 200?
429 is the default behavior and indicates standby state; it is not a failure. If you want load balancers to treat standby as healthy, use /v1/sys/health?standbyok=true (and perfstandbyok=true for performance standbys), and explicitly override status codes via query parameters if needed.
How often should you take Integrated Storage (Raft) snapshots?
It depends on change volume and your RPO/RTO, but at least daily (every few hours in critical environments) is recommended, with generational retention and cross-region storage. Rehearse both backup and restore procedures regularly in a staging environment, and prepare runbooks for disk pressure scenarios (rotating old snapshots).
What are the keys to safely migrating from Shamir to Auto-unseal?
Perform the migration during a maintenance window after pre-validating the seal block configuration, KMS permissions, and connectivity. Before migrating, take a Raft snapshot, enable audit logging, and prepare a rollback procedure (reverting to the old configuration). After migration, verify auto-unseal via vault status and a restart test.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.