This article walks through a rolling upgrade procedure for HA Vault clusters that either avoids downtime entirely or keeps it to an absolute minimum.
In particular, it focuses on the differences between Integrated Storage (Raft) and Consul storage, health checks and traffic control, snapshot capture, and how to handle the leader node.
The basic policy: patch versions can be rolled out directly, while minor versions must follow the release notes and compatibility notices. Do not skip across multiple minor versions in a single jump; step through them one at a time. In HA clusters, update followers first and the leader last.
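The patch-versus-minor distinction above can be checked mechanically before a rollout. The following is a minimal sketch, not an official tool: `classify_upgrade` is a hypothetical helper, and it assumes plain three-part numeric versions such as `1.15.2` with no `v` prefix or edition suffix.

```shell
# Hypothetical helper: classify an upgrade from CURRENT to TARGET.
# Assumes both arguments look like MAJOR.MINOR.PATCH (numeric only).
classify_upgrade() {
    current=$1
    target=$2
    cur_major=${current%%.*}; cur_rest=${current#*.}
    cur_minor=${cur_rest%%.*}; cur_patch=${cur_rest#*.}
    tgt_major=${target%%.*}; tgt_rest=${target#*.}
    tgt_minor=${tgt_rest%%.*}; tgt_patch=${tgt_rest#*.}

    if [ "$cur_major" -ne "$tgt_major" ]; then
        echo "major-change"
    elif [ "$tgt_minor" -lt "$cur_minor" ]; then
        echo "downgrade"
    elif [ "$tgt_minor" -eq "$cur_minor" ]; then
        if [ "$tgt_patch" -lt "$cur_patch" ]; then
            echo "downgrade"
        elif [ "$tgt_patch" -eq "$cur_patch" ]; then
            echo "same"
        else
            echo "patch"          # can be rolled out directly
        fi
    elif [ "$tgt_minor" -eq $((cur_minor + 1)) ]; then
        echo "minor-step"         # read the release notes first
    else
        echo "minor-skip"         # split into one minor step at a time
    fi
}
```

A runbook could refuse to proceed when the result is `minor-skip` or `downgrade`, for example `[ "$(classify_upgrade 1.15.2 1.17.0)" = "minor-skip" ] && echo "plan an intermediate 1.16.x step"`.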
By storage type: Raft is designed to tolerate mixed versions within the cluster for short periods, but the supported window is spelled out in each release. With Consul storage, rolling updates of the Vault nodes are still the baseline, but you also need to watch Consul's own compatibility and health. If you use plugins (Secrets, Auth, Database, etc.), verify their compatibility with the target version, along with signatures and ABI, in advance.
| Topology / Storage | Recommended Approach | Downtime Characteristics |
|---|---|---|
| Raft (odd number of nodes, 3 or more) | Roll in the order follower → follower → leader. Step the leader down as needed. | Effectively zero-downtime (assuming quorum is preserved) |
| Consul storage (Vault in HA) | Roll the Vault nodes; separately snapshot and monitor Consul's health | Effectively zero-downtime (assuming LB draining) |
| Single node (test only) | Stop, update, start | Downtime is unavoidable |
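The follower-first, leader-last ordering in the table can be expressed as a small filter. This is an illustrative sketch: `upgrade_order` and its `name role` input format are assumptions made for the example. For Raft, the role can be read from the State column of `vault operator raft list-peers`.

```shell
# Hypothetical helper: reads "name role" pairs on stdin and prints the node
# names in upgrade order, with every non-leader first and the leader last.
upgrade_order() {
    awk '
        NF >= 2 && $2 != "leader" { print $1 }       # followers keep input order
        NF >= 2 && $2 == "leader" { leader = $1 }    # remember the leader
        END { if (leader != "") print leader }       # leader goes last
    '
}
```

For example, `printf 'n3 leader\nn1 follower\nn2 follower\n' | upgrade_order` lists `n1`, `n2`, then `n3`. Because the leader can change during the rollout, it is safer to recompute the order before each step than to rely on a list built at the start.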
Rolling upgrade (3-node Raft cluster behind a load balancer)
                 Clients
                    |
[ Load Balancer ] (Health: /v1/sys/health)
     |          |          |
   [n1]-------[n2]-------[n3]
     |          |          |
 follower   follower    leader
  ^ Step1    ^ Step2    ^ Step3 (last)
Step1: Drain n1 at the LB → update and restart n1 → return it to the LB once health is OK
Step2: Update n2 the same way
Step3: Step down the leader → update n3 (last, once a new leader has been elected)
Pre-flight checks (compatibility, health, peers)
# Check the version
vault version
# Cluster state (leader/standby)
vault status
# Health endpoint (match what the LB checks)
curl -s -o /dev/null -w "%{http_code}\n" http://vault.example.com:8200/v1/sys/health
# Typical codes: 200=active, 429=standby, 501=not initialized, 503=sealed
# Check Raft peers (Integrated Storage only)
vault operator raft list-peers
Upgrade safety is decided before you start. Review the release notes for the target version, storage compatibility, plugin signatures and ABI, any replication topology, and how health checks behave (return codes and timeouts). If the upgrade includes RBAC or TLS configuration changes, validate them in a separate environment first and clearly document the configuration diff.
For backups, capture a Vault snapshot if you use Raft and a Consul snapshot if you use Consul storage. Document the restore procedure and agree in advance on the decision points for rolling back (what triggers a revert and to which state).
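A snapshot only helps if it actually exists and is non-empty when you need it. The sketch below shows one way to name snapshots consistently and refuse to continue when the file is missing or empty. `snapshot_path` and `snapshot_ok` are hypothetical helpers, and the `/backups` directory in the usage comment is an assumption. A non-empty file is a weak check and does not replace a tested restore.

```shell
# Hypothetical helper: build a timestamped snapshot path, e.g.
# /backups/vault-2025-01-31-0930.snap
snapshot_path() {
    dir=$1
    prefix=$2
    echo "$dir/$prefix-$(date +%F-%H%M).snap"
}

# Hypothetical helper: succeed only if the snapshot file exists and is non-empty.
snapshot_ok() {
    [ -s "$1" ]
}

# Usage against a live cluster (commented out; requires a valid VAULT_TOKEN):
# snap=$(snapshot_path /backups vault)
# vault operator raft snapshot save "$snap"
# snapshot_ok "$snap" || { echo "snapshot missing or empty: $snap" >&2; exit 1; }
```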
Example backup commands
# Raft (Integrated Storage) snapshot
env VAULT_TOKEN=... vault operator raft snapshot save "/backups/vault-$(date +%F-%H%M).snap"
# Consul (when Vault uses Consul as its storage backend)
consul snapshot save "/backups/consul-$(date +%F-%H%M).snap"
# Restore (for reference; test it in an isolated environment first)
# vault operator raft snapshot restore /backups/vault-xxxx.snap
The key to minimal downtime is reliably detaching the node being upgraded from traffic and not returning it to traffic until its health has been confirmed. Combine LB draining with Vault's health API, and verify quorum and responsiveness at every step of the rollout.
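One way to enforce "no traffic until health is confirmed" is to poll the health endpoint with a bounded number of attempts instead of relying on a fixed sleep. The following is a sketch under stated assumptions: `probe_health` and `wait_for_health` are hypothetical helpers, and the attempt count and delay defaults are arbitrary. The probe command is passed in as an argument so the loop can be exercised without a live cluster.

```shell
# Hypothetical helper: print the HTTP status code returned by a health URL.
probe_health() {
    curl -s -o /dev/null -w '%{http_code}' "$1"
}

# wait_for_health PROBE_CMD ACCEPTED_CODES [ATTEMPTS] [DELAY_SECONDS]
# Runs PROBE_CMD until it prints one of ACCEPTED_CODES (space-separated).
# Prints the accepted code and returns 0, or returns 1 after ATTEMPTS tries.
wait_for_health() {
    probe=$1
    accepted=$2
    attempts=${3:-30}
    delay=${4:-2}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if [ "$i" -gt 0 ]; then
            sleep "$delay"
        fi
        code=$($probe) || code=000      # treat a failed probe as "no answer"
        case " $accepted " in
            *" $code "*) echo "$code"; return 0 ;;
        esac
        i=$((i + 1))
    done
    return 1
}

# Usage (commented out; needs a reachable node):
# wait_for_health "probe_health http://localhost:8200/v1/sys/health" "200 429" \
#     || { echo "node did not become healthy; keep it detached" >&2; exit 1; }
```

Keeping the accepted codes as a parameter lets the same loop serve both an active-only policy and one that also admits standby nodes.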
By default, Vault's /v1/sys/health returns 200 (active), 429 (standby), 501 (not initialized), or 503 (sealed). Decide ahead of time whether your LB should forward only to nodes returning 200 or also to standby nodes, and lock in that rule.
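The routing rule can be written down as a small decision function so that the LB configuration and the runbook agree. This sketch assumes Vault's default status codes (200 active, 429 standby, 501 not initialized, 503 sealed). `health_state`, `lb_should_route`, and the policy names `active-only` and `active-and-standby` are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: map a /v1/sys/health status code to a state name,
# assuming the endpoint's default codes.
health_state() {
    case "$1" in
        200) echo "active" ;;
        429) echo "standby" ;;
        501) echo "uninitialized" ;;
        503) echo "sealed" ;;
        *)   echo "unknown" ;;
    esac
}

# lb_should_route CODE POLICY
# POLICY is "active-only" or "active-and-standby" (names are assumptions).
# Returns 0 if a node answering CODE should receive traffic under POLICY.
lb_should_route() {
    state=$(health_state "$1")
    case "$2:$state" in
        active-only:active) return 0 ;;
        active-and-standby:active|active-and-standby:standby) return 0 ;;
        *) return 1 ;;
    esac
}
```

For instance, `lb_should_route 429 active-only` fails while `lb_should_route 429 active-and-standby` succeeds. Sealed and uninitialized nodes are rejected under both policies.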
Example health checks and LB draining (illustrative)
# Health check (as run from the LB)
curl -s -o /dev/null -w "%{http_code}\n" http://n1:8200/v1/sys/health
# Example: Nginx-style rule that passes only the active node (pseudocode; adapt to your environment)
# if (status == 200) upstream enable; else disable;
# Drain (pseudo-command; in practice use your LB vendor's API/CLI)
# lbcli target detach --pool vault --node n1 --drain --timeout 120
With at least 3 nodes (and an odd count) to keep quorum, update followers first. Finish by stepping the leader down and updating it, then confirm stability after re-election. If you are not using Auto Unseal, prepare the unseal key submission procedure for use after each restart.
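The quorum requirement is simple arithmetic: a Raft cluster of N voters needs floor(N/2) + 1 of them available. The sketch below makes that check explicit before a node is taken down. The function names are hypothetical, and the voter and healthy counts are assumed to come from your own inspection of the peer list.

```shell
# Quorum size for a cluster of N voters: floor(N/2) + 1.
raft_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of voters that can be lost while still keeping quorum.
raft_failure_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# safe_to_take_down VOTERS HEALTHY
# Succeeds if stopping one healthy voter still leaves a quorum.
safe_to_take_down() {
    voters=$1
    healthy=$2
    [ $((healthy - 1)) -ge "$(raft_quorum "$voters")" ]
}
```

With 3 voters the quorum is 2, so one node at a time can be restarted. With 4 voters the quorum rises to 3 and the tolerance is still 1, which is why an odd count is preferred. `safe_to_take_down 3 2` fails, signalling that an already degraded cluster should be repaired before the rollout continues.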
On each node, binary updates follow the order stop, replace, start. The examples assume systemd-equivalent service management; adapt them to whatever startup manager you actually run.
Example Raft node update commands (Linux/systemd)
# 1) Detach the target node (on the LB side)
# lbcli target detach --pool vault --node <node> --drain
# 2) Check cluster state
vault status
vault operator raft list-peers
# 3) Stop the service
sudo systemctl stop vault
# 4) Replace the binary (install the validated version)
sudo install -m 0755 /tmp/vault-new /usr/local/bin/vault
vault version
# 5) Start and check health
sudo systemctl start vault
# (Without Auto Unseal, submit unseal keys with `vault operator unseal` at this point)
sleep 3
vault status
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8200/v1/sys/health
# 6) Return the node to the LB
# lbcli target attach --pool vault --node <node>
# (When updating the leader)
# Explicitly step the leader down before updating it
vault operator step-down
# After a new leader is elected, run the same stop → replace → start sequence
When Vault uses Consul for storage, the Vault nodes themselves are nearly stateless, which makes rolling updates straightforward. You still have to watch Consul's health, snapshots, and network/TLS configuration. Start by capturing a Consul snapshot and checking cluster state, then update the Vault nodes followers-first.
Handle the leader the same way as with Raft: update it last. Use LB draining and health checks to contain traffic impact, and verify reachability and token operations on each node after it comes back up.
Example update commands with Consul storage
# Back up Consul first
consul snapshot save "/backups/consul-$(date +%F-%H%M).snap"
# Roll the Vault nodes (followers first)
# LB drain → stop → replace → start → health OK → return to LB
sudo systemctl stop vault
sudo install -m 0755 /tmp/vault-new /usr/local/bin/vault
sudo systemctl start vault
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8200/v1/sys/health
# Leader last
vault operator step-down
Run functional verification after each node update and finish with a cluster-wide test. The classic checks are auth (e.g., approle/login), secret read/write (KV v2 put/get), critical Transit encrypt/decrypt, replication state, and audit log output. Also confirm there is no spike in 5xx errors from your major consumer applications.
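Per-node verification is easier to repeat when each check is a named command and failures are collected rather than stopping at the first one. The harness below is a minimal sketch: `run_check` and `failed_checks` are hypothetical names, and the commented usage lines assume a valid token and the `secret/app/foo` path used elsewhere in this article.

```shell
# Names of checks that failed, space-separated.
failed_checks=""

# run_check NAME COMMAND [ARGS...]
# Runs COMMAND quietly, prints ok/FAIL, and records NAME on failure.
run_check() {
    name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "ok   $name"
    else
        echo "FAIL $name"
        failed_checks="$failed_checks $name"
    fi
}

# Usage against a live node (commented out; requires VAULT_ADDR and VAULT_TOKEN):
# run_check kv-write vault kv put secret/app/foo bar=baz
# run_check kv-read  vault kv get secret/app/foo
# [ -z "$failed_checks" ] || { echo "failed:$failed_checks" >&2; exit 1; }
```

Call `run_check` directly rather than inside `$(...)`, since a subshell would discard the updates to `failed_checks`.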
Rollback assumes you have a recent snapshot in hand and the old binary preserved. Detach the failing node and revert it to the old binary. Absent destructive storage changes, swapping the binary back is often enough to recover, but always check the release notes for storage schema changes before you rely on that.
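Since rollback depends on the old binary still being available, it helps to copy it aside before installing the new one. The following sketch shows one way to do that. `preserve_binary` is a hypothetical helper, and the backup directory and version label in the usage comment are assumptions.

```shell
# preserve_binary BIN BACKUP_DIR LABEL
# Copies BIN to BACKUP_DIR/<basename of BIN>-<LABEL> and prints the destination.
# Refuses to overwrite an existing backup so an earlier copy is never lost.
preserve_binary() {
    bin=$1
    backup_dir=$2
    label=$3
    dest="$backup_dir/$(basename "$bin")-$label"
    if [ -e "$dest" ]; then
        echo "backup already exists: $dest" >&2
        return 1
    fi
    mkdir -p "$backup_dir" && cp -p "$bin" "$dest" && echo "$dest"
}

# Usage (commented out; path and label are illustrative):
# old=$(preserve_binary /usr/local/bin/vault /opt/vault/rollback 1.15.2) || exit 1
# ...install the new binary...
# To revert: sudo install -m 0755 "$old" /usr/local/bin/vault
```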
Representative verification commands
# Replication status
vault read -format=json sys/replication/status | jq .
# KV v2 smoke test
env VAULT_TOKEN=... vault kv put secret/app/foo bar=baz
env VAULT_TOKEN=... vault kv get secret/app/foo
# Check the most recent audit log events (adjust to your audit device's destination)
sudo tail -n 100 /var/log/vault/audit.log
Question 1
A 3-node Vault cluster (Integrated Storage: Raft, behind an LB). Which is the appropriate procedure for performing a patch upgrade with minimal downtime?
Correct answer: A
The crux of an HA rolling upgrade is preserving quorum and controlling traffic. The standard pattern is to update followers in sequence and finish by stepping the leader down and updating it last.
Can I run a rolling upgrade without Auto Unseal?
Yes. Every restart requires submitting enough unseal key shares to satisfy the threshold on each node. Document the unseal procedure and assigned operators in your runbook, and budget time for the health state transitions while keys are being entered.
Is it safe to skip versions (jump across multiple minor releases) in one shot?
Not recommended. As a rule, step through minor versions one at a time, performing a rolling upgrade and verification at each stage. Follow the compatibility and migration notes in the release notes for the safest path.
How should I decide when to roll back if something goes wrong?
As soon as post-update health verification on a node fails, detach the node from the load balancer and revert it to the old binary. If you have data-level concerns, restore the most recent snapshot into a standalone environment for verification before applying it to production. When storage schema changes are involved, document the rollback procedure in advance.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.