Vault Upgrade Procedure: Rolling Restart (2026)

This article walks through a rolling upgrade procedure for HA Vault clusters that either avoids downtime entirely or keeps it to an absolute minimum.

In particular, it focuses on the differences between Integrated Storage (Raft) and Consul storage, health checks and traffic control, snapshot capture, and how to handle the leader node.

Upgrade Strategy and Compatibility Principles

The basic policy comes down to two points: patch versions can be rolled out directly, and minor versions must follow the release notes and compatibility notices. Do not jump across multiple minor versions at once; step through them incrementally. In HA clusters, update followers first and the leader last.

By storage type: Raft is designed to tolerate mixed versions within the cluster for short periods, but the supported window is spelled out in each release. With Consul storage, rolling updates of the Vault nodes are still the baseline, but you also need to watch Consul's own compatibility and health. If you use plugins (Secrets, Auth, Database, etc.), verify their compatibility with the target version, along with signatures and ABI, in advance.

Patches are rollable; minors follow the release notes (no skipping — step through them)
Update followers first and the leader last; use step-down as needed
Always capture a snapshot (either Raft or Consul)
Use load-balancer health checks to reliably exclude nodes that are being updated
Verify plugin and replication (DR/Performance) compatibility ahead of time
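
As a rough illustration of the "no skipping" rule, the sketch below classifies a planned version jump. The function name classify_upgrade and its output labels are hypothetical, and it assumes plain MAJOR.MINOR.PATCH version strings; the release notes remain the authority on which upgrade paths are actually supported.

```shell
# classify_upgrade CURRENT TARGET
# Prints one of: same | patch | minor | skips-minor | major | downgrade
# Assumes MAJOR.MINOR.PATCH, with an optional leading "v" and an optional
# suffix such as "+ent" after the patch number.
classify_upgrade() {
    cur=${1#v}
    tgt=${2#v}
    cur_major=${cur%%.*}; cur_rest=${cur#*.}
    tgt_major=${tgt%%.*}; tgt_rest=${tgt#*.}
    cur_minor=${cur_rest%%.*}; cur_patch=${cur_rest#*.}
    tgt_minor=${tgt_rest%%.*}; tgt_patch=${tgt_rest#*.}
    # Drop anything after the numeric patch component (for example "+ent").
    cur_patch=${cur_patch%%[!0-9]*}
    tgt_patch=${tgt_patch%%[!0-9]*}

    if [ "$tgt_major" -gt "$cur_major" ]; then echo major
    elif [ "$tgt_major" -lt "$cur_major" ]; then echo downgrade
    elif [ "$tgt_minor" -lt "$cur_minor" ]; then echo downgrade
    elif [ "$tgt_minor" -eq "$cur_minor" ]; then
        if [ "$tgt_patch" -gt "$cur_patch" ]; then echo patch
        elif [ "$tgt_patch" -lt "$cur_patch" ]; then echo downgrade
        else echo same
        fi
    elif [ $((tgt_minor - cur_minor)) -eq 1 ]; then echo minor
    else echo skips-minor
    fi
}

classify_upgrade 1.14.8 1.16.1   # prints: skips-minor
```

A result of skips-minor is the cue to plan an intermediate stop at each minor release rather than a single jump.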

Topology / Storage	Recommended Approach	Downtime Characteristics
Raft (odd number of nodes, 3 or more)	Roll in the order follower → follower → leader. Step the leader down as needed.	Effectively zero-downtime (assuming quorum is preserved)
Consul storage (Vault in HA)	Roll the Vault nodes; separately snapshot and monitor Consul's health	Effectively zero-downtime (assuming LB draining)
Single node (test only)	Stop, update, start	Downtime is unavoidable
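
The "assuming quorum is preserved" condition in the table can be made concrete with a little arithmetic. The following is a minimal sketch; the function names are hypothetical, and it assumes every node counted is a Raft voter.

```shell
# Majority needed for a Raft cluster with N voters.
raft_quorum() { echo $(( $1 / 2 + 1 )); }

# How many voters can be lost while still keeping quorum.
raft_failure_tolerance() { echo $(( ($1 - 1) / 2 )); }

# can_take_node_down TOTAL_VOTERS HEALTHY_VOTERS
# Succeeds only if quorum survives after removing one healthy voter.
can_take_node_down() {
    [ $(( $2 - 1 )) -ge "$(raft_quorum "$1")" ]
}

for n in 3 4 5; do
    echo "voters=$n quorum=$(raft_quorum "$n") tolerates=$(raft_failure_tolerance "$n")"
done
# prints:
# voters=3 quorum=2 tolerates=1
# voters=4 quorum=3 tolerates=1
# voters=5 quorum=3 tolerates=2
```

A 4-node cluster tolerates no more failures than a 3-node one, which is why odd sizes are recommended. With 3 voters, taking one node down is safe only while the other two are healthy.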

Rolling upgrade (3-node Raft cluster behind a load balancer)

Clients
   |
[ Load Balancer ]  (Health: /v1/sys/health)
   |        |        |
 [n1]----[n2]----[n3]
  |        |        |
 follower  follower  leader
   ^  Step1   ^ Step2   ^ Step3(last)

Step1: Drain n1 at the LB → update/restart n1 → return it to the LB once health is OK
Step2: Update n2 the same way
Step3: Step the leader down → update n3 last (after re-election)

Pre-flight checks (compatibility, health, peers)

# Check the version
vault version

# Cluster state (leader/standby)
vault status

# Health endpoint (match what the LB checks)
curl -s -o /dev/null -w "%{http_code}\n" http://vault.example.com:8200/v1/sys/health
# Typical codes: 200=active, 429=standby, 501=not initialized, 503=sealed

# Raft peers (Integrated Storage only)
vault operator raft list-peers
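
To turn the peer listing into a quick pass/fail signal, one option is to summarize it with awk. This is only a sketch: summarize_peers is a hypothetical helper, and it assumes the default table output of vault operator raft list-peers with the columns Node, Address, State, and Voter.

```shell
# summarize_peers: reads the list-peers table on stdin and prints
# "voters=<count> leaders=<count>".
summarize_peers() {
    awk '$1 == "Node" || $1 ~ /^-+$/ || NF < 4 { next }   # skip header, rule, blanks
         $4 == "true"   { voters++ }
         $3 == "leader" { leaders++ }
         END { printf "voters=%d leaders=%d\n", voters, leaders }'
}

# Intended use against a live cluster (not run here):
#   vault operator raft list-peers | summarize_peers
```

For a healthy 3-node cluster the expected summary is voters=3 leaders=1; anything else is a reason to pause before touching a node.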

Pre-Upgrade Checklist and Backups

Upgrade safety is decided before you start. Review the release notes for the target version, storage compatibility, plugin signatures and ABI, any replication topology, and how health checks behave (return codes and timeouts). If the upgrade includes RBAC or TLS configuration changes, validate them in a separate environment first and clearly document the configuration diff.

For backups, capture a Vault snapshot if you use Raft and a Consul snapshot if you use Consul storage. Document the restore procedure and agree in advance on the decision points for rolling back (what triggers a revert and to which state).

Review the target release's compatibility notes (especially storage, API compatibility, and plugins)
Agree on the maintenance window, communication plan, and rollback triggers
Capture snapshots and store them in redundant locations
Rehearse the LB health thresholds and drain procedure
Temporarily suppress monitoring/alert noise (maintenance mode)

Example backup commands

# Raft (Integrated Storage) snapshot
env VAULT_TOKEN=... vault operator raft snapshot save "/backups/vault-$(date +%F-%H%M).snap"

# Consul (when Vault uses Consul for storage)
consul snapshot save "/backups/consul-$(date +%F-%H%M).snap"

# Restore (for reference; validate in an isolated environment first)
# vault operator raft snapshot restore /backups/vault-xxxx.snap
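
A snapshot is only useful if the file actually exists and can be checked later. The sketch below adds a basic sanity check and a checksum; snapshot_name and verify_snapshot are hypothetical helpers, and sha256sum (GNU coreutils) is assumed to be available. It does not prove the snapshot is restorable, so a test restore is still needed.

```shell
# snapshot_name PREFIX: prints PREFIX-YYYY-MM-DD-HHMM.snap
snapshot_name() {
    printf '%s-%s.snap\n' "$1" "$(date +%F-%H%M)"
}

# verify_snapshot FILE: fails if the file is missing or empty,
# otherwise writes a SHA-256 checksum next to it as FILE.sha256.
verify_snapshot() {
    if [ ! -s "$1" ]; then
        echo "snapshot missing or empty: $1" >&2
        return 1
    fi
    sha256sum "$1" > "$1.sha256"
}

# Intended use (not run here):
#   f=$(snapshot_name /backups/vault)
#   vault operator raft snapshot save "$f" && verify_snapshot "$f"
```

Copying both the snapshot and its .sha256 file to the redundant location lets the copy be validated with sha256sum -c before it is ever needed.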

Traffic Control and Health Check Design

The key to minimal downtime is reliably detaching the node being upgraded from traffic and not returning it to service until its health has been confirmed. Combine LB draining with Vault's health API, and verify quorum and responsiveness at every step of the rollout.

By default, Vault's /v1/sys/health returns 200 (active), 429 (standby), 501 (not initialized), or 503 (sealed). Decide ahead of time whether your LB should forward only to nodes returning 200 or also accept standby nodes, and lock in that rule.

Drain nodes manually or automatically at the LB (configure connection-close timeouts)
Clearly define health API thresholds (e.g., only accept 200)
During updates, do not return the node to the LB until its health has stabilized
Avoid parallel updates; always confirm quorum and leader presence
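
The health-code rule can be written down as a small lookup so that the LB policy is explicit. This is a sketch, not an LB configuration: health_state and lb_accepts are hypothetical helpers, and the policy names active-only and allow-standby are made up for illustration. The codes 472, 473, and 501 are Vault's documented defaults for DR secondary, performance standby, and not-initialized nodes.

```shell
# health_state HTTP_CODE: map a /v1/sys/health status code to a node state.
health_state() {
    case "$1" in
        200) echo active ;;
        429) echo standby ;;
        472) echo dr-secondary ;;
        473) echo performance-standby ;;
        501) echo uninitialized ;;
        503) echo sealed ;;
        *)   echo unknown ;;
    esac
}

# lb_accepts HTTP_CODE POLICY
# POLICY is "active-only" or "allow-standby" (hypothetical names).
lb_accepts() {
    case "$2:$1" in
        active-only:200) return 0 ;;
        allow-standby:200|allow-standby:429) return 0 ;;
        *) return 1 ;;
    esac
}

health_state 429   # prints: standby
```

Under active-only, a freshly restarted standby stays out of rotation; under allow-standby it is admitted, relying on Vault's request forwarding to the active node.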

Example health checks and LB draining (illustrative)

# Health check (as issued by the LB)
curl -s -o /dev/null -w "%{http_code}\n" http://n1:8200/v1/sys/health

# Example: Nginx-style rule that passes only the HA active node (pseudocode; adapt to your environment)
# if (status == 200) upstream enable; else disable;

# Drain (pseudo-command; use your LB vendor's API/CLI in practice)
# lbcli target detach --pool vault --node n1 --drain --timeout 120

Rolling Procedure for Raft (Integrated Storage)

With at least 3 nodes (and an odd count) to keep quorum, update followers first. Finish by stepping the leader down and updating it, then confirm stability after re-election. If you are not using Auto Unseal, prepare the unseal key submission procedure for use after each restart.

On each node, binary updates follow the order stop, replace, start. The examples assume systemd-equivalent service management; adapt them to whatever startup manager you actually run.

Drain the target node at the LB
Confirm cluster health with vault status and raft list-peers
Follower flow: stop → replace binary → start → wait for health to stabilize → return to LB
Leader last: step-down → update → confirm stability
Re-confirm that the snapshot is up to date before each update
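
The "followers first, leader last" ordering can be derived from the peer listing instead of being worked out by hand. A minimal sketch, assuming the default table output of vault operator raft list-peers (columns Node, Address, State, Voter); rolling_order is a hypothetical helper.

```shell
# rolling_order: reads the list-peers table on stdin and prints node names
# in update order: every non-leader first, the current leader last.
rolling_order() {
    awk '$1 == "Node" || $1 ~ /^-+$/ || NF < 3 { next }   # skip header, rule, blanks
         $3 == "leader" { leader = $1; next }             # hold the leader back
         { print $1 }
         END { if (leader != "") print leader }'
}

# Intended use against a live cluster (not run here):
#   vault operator raft list-peers | rolling_order
```

Because leadership can move during the rollout, it is safer to recompute the order before each node than to trust a list produced at the start.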

Example Raft node update commands (Linux/systemd)

# 1) Detach the target node at the LB (pseudo-command)
# lbcli target detach --pool vault --node <node> --drain

# 2) Check cluster state
vault status
vault operator raft list-peers

# 3) Stop the service
sudo systemctl stop vault

# 4) Replace the binary (install the validated version)
sudo install -m 0755 /tmp/vault-new /usr/local/bin/vault
vault version

# 5) Start and check health
sudo systemctl start vault
sleep 3
vault status
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8200/v1/sys/health

# 6) Return the node to the LB (pseudo-command)
# lbcli target attach --pool vault --node <node>

# (When updating the leader)
# Explicitly step the leader down before updating it
vault operator step-down
# After a new leader is elected, run the same stop → replace → start sequence
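
A fixed sleep after starting the service can be too short or needlessly long. One alternative is to poll the health endpoint until it returns an acceptable code. The following is a sketch under stated assumptions: wait_for_health and probe_local are hypothetical helpers, and the localhost URL mirrors the example above.

```shell
# wait_for_health PROBE ACCEPT TRIES DELAY
#   PROBE  - name of a command that prints an HTTP status code
#   ACCEPT - space-separated acceptable codes, e.g. "200 429"
#   TRIES  - maximum number of attempts
#   DELAY  - seconds to sleep between attempts
wait_for_health() {
    probe=$1; accept=$2; tries=$3; delay=$4
    i=1
    while [ "$i" -le "$tries" ]; do
        code=$("$probe")
        case " $accept " in
            *" $code "*)
                echo "healthy after $i attempt(s): $code"
                return 0 ;;
        esac
        sleep "$delay"
        i=$((i + 1))
    done
    echo "still unhealthy after $tries attempt(s), last code: $code" >&2
    return 1
}

# Probe for the local node; curl reports 000 if the connection fails.
probe_local() {
    curl -s -o /dev/null -w '%{http_code}' http://localhost:8200/v1/sys/health
}

# Intended use after "systemctl start vault" (not run here):
#   wait_for_health probe_local "200 429" 30 2 || echo "do not reattach this node"
```

Accepting 429 here reflects that an upgraded follower is expected to come back as a standby; tighten the list to "200" if your LB rule admits only the active node.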

Rolling Procedure with Consul Storage

When Vault uses Consul for storage, the Vault nodes themselves are nearly stateless, which makes rolling updates straightforward. You still have to watch Consul's health, snapshots, and network/TLS configuration. Start by capturing a Consul snapshot and checking cluster state, then update the Vault nodes followers-first.

Handle the leader the same way as with Raft: update it last. Use LB draining and health checks to contain traffic impact, and verify reachability and token operations on each node after it comes back up.

Consul: take a snapshot backup with consul snapshot save
Vault: roll followers first, leader last (step-down as needed)
Control acceptance via LB draining and the health API
Also monitor Consul's own health and leader election
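
For the Consul side, a lightweight leader check can use Consul's /v1/status/leader endpoint, which returns the leader's address as a JSON string and an empty string when there is no leader. A minimal sketch; consul_leader is a hypothetical helper, and the address 127.0.0.1:8500 assumes a local agent on Consul's default HTTP port without TLS or ACL tokens.

```shell
# consul_leader BODY: BODY is the raw response of /v1/status/leader.
# Prints the leader address and succeeds, or fails if no leader is reported.
consul_leader() {
    addr=$(printf '%s' "$1" | tr -d '"[:space:]')
    [ -n "$addr" ] || return 1
    echo "$addr"
}

# Intended use between Vault node updates (not run here):
#   consul_leader "$(curl -s http://127.0.0.1:8500/v1/status/leader)" \
#       || echo "Consul has no leader; pause the rollout"
```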

Example update commands with Consul storage

# Back up Consul first
consul snapshot save "/backups/consul-$(date +%F-%H%M).snap"

# Roll the Vault nodes (followers first)
# Drain at LB → stop → replace → start → health OK → return to LB
sudo systemctl stop vault
sudo install -m 0755 /tmp/vault-new /usr/local/bin/vault
sudo systemctl start vault
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8200/v1/sys/health

# Leader last
vault operator step-down

Verification, Rollback, and Exam-Ready Key Points

Run functional verification after each node update and finish with a cluster-wide test. The classic checks are auth (e.g., approle/login), secret read/write (KV v2 put/get), critical Transit encrypt/decrypt, replication state, and audit log output. Also confirm there is no spike in 5xx errors from your major consumer applications.

Rollback assumes you have a recent snapshot in hand and the old binary preserved. Detach the failing node and revert it to the old binary. Swapping the binary back may be enough when the upgrade made no storage changes, but Vault does not guarantee backward compatibility for its data store, so check the release notes for storage changes and be ready to restore the pre-upgrade snapshot together with the old binary.

Verification: cover auth, secrets, API, audit logs, and replication
Metrics: monitor 5xx, latency, and the leader re-election count
Rollback: detach at the LB → revert to old binary → restore from snapshot if needed
Exam keywords: followers first, leader last, health 200/429/503, snapshot capture, LB draining

Representative verification commands

# Replication status
vault read -format=json sys/replication/status | jq .

# KV v2 smoke test
env VAULT_TOKEN=... vault kv put secret/app/foo bar=baz
env VAULT_TOKEN=... vault kv get secret/app/foo

# Most recent audit log events (path depends on your audit device)
sudo tail -n 100 /var/log/vault/audit.log
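
To run the same checks after every node update, the individual commands can be wrapped in a tiny pass/fail runner. This is a sketch; run_check and summary are hypothetical helpers, and the commented examples reuse the commands and paths shown above.

```shell
passed=0
failed=0

# run_check NAME COMMAND [ARGS...]
# Runs the command quietly and records whether it succeeded.
run_check() {
    name=$1; shift
    if "$@" >/dev/null 2>&1; then
        passed=$((passed + 1)); echo "PASS $name"
    else
        failed=$((failed + 1)); echo "FAIL $name"
    fi
}

# summary: prints the totals and fails if any check failed.
summary() {
    echo "passed=$passed failed=$failed"
    [ "$failed" -eq 0 ]
}

# Intended use (not run here; VAULT_ADDR and VAULT_TOKEN must be set):
#   run_check "kv write"  vault kv put secret/app/foo bar=baz
#   run_check "kv read"   vault kv get secret/app/foo
#   run_check "audit log" test -s /var/log/vault/audit.log
#   summary || echo "verification failed; consider rollback"
```

Call run_check directly rather than inside a command substitution, since the counters live in the current shell.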

Check Your Understanding


Question 1

A 3-node Vault cluster (Integrated Storage: Raft, behind an LB). Which is the appropriate procedure for performing a patch upgrade with minimal downtime?

A. Detach followers from the LB one at a time, update and reattach them, then step the leader down and update it last
B. Stop and update the leader first, then update the remaining nodes in parallel
C. Stop all nodes simultaneously and update/start them in a single batch
D. Update only the Performance secondary; defer the primary to a later date

Correct answer: A

The crux of an HA rolling upgrade is preserving quorum and controlling traffic. The standard pattern is to update followers in sequence and finish by stepping the leader down and updating it last.

Frequently Asked Questions

Can I run a rolling upgrade without Auto Unseal?

Yes. Every restart requires submitting enough unseal key shares to satisfy the threshold on each node. Document the unseal procedure and assigned operators in your runbook, and budget time for the health state transitions while keys are being entered.
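
When unsealing by hand, it helps to script the "is this node still sealed?" question. vault status exits with 0 when the node is unsealed, 2 when it is sealed, and 1 on error, and its JSON output reports the threshold (t) and the shares entered so far (progress). The helpers below are a hypothetical sketch built on those two facts.

```shell
# seal_state EXIT_CODE: interpret the exit code of "vault status".
seal_state() {
    case "$1" in
        0) echo unsealed ;;
        2) echo sealed ;;
        *) echo error ;;
    esac
}

# shares_remaining THRESHOLD PROGRESS
# How many more key shares must be entered on this node.
shares_remaining() {
    left=$(( $1 - $2 ))
    [ "$left" -lt 0 ] && left=0
    echo "$left"
}

# Intended use after a restart (not run here):
#   vault status >/dev/null 2>&1; seal_state "$?"
shares_remaining 3 1   # prints: 2
```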

Is it safe to skip versions (jump across multiple minor releases) in one shot?

Not recommended. As a rule, step through minor versions one at a time, performing a rolling upgrade and verification at each stage. Follow the compatibility and migration notes in the release notes for the safest path.

How should I decide when to roll back if something goes wrong?

As soon as post-update health verification on a node fails, detach the node from the load balancer and revert it to the old binary. If you have data-level concerns, restore the most recent snapshot into a standalone environment for verification before applying it to production. When storage schema changes are involved, document the rollback procedure in advance.


Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
