Vault

Backup and Restore Operations with Vault Raft Snapshots

2026-04-19
NicheeLab Editorial Team

Vault's integrated storage (Raft) lets you capture the cluster's consistent state as a snapshot. Snapshots can be taken with minimal downtime and are effective for disaster recovery and migrations.

This article focuses on the commands and ordering you actually need in operations, with Ops certification exam tips included along the way.

Raft Snapshot Basics and Safety

Vault's integrated storage is replicated across multiple nodes via Raft. Snapshots are created from the leader's applied state, so the resulting backup is consistent. You can issue the command against any node — it is forwarded internally to the leader.

A snapshot contains Vault's entire storage state (secrets, policies, auth backend configuration, namespaces, replication metadata, and so on). Server configuration files (vault.hcl), TLS certificates and their private keys, and unseal keys are not included. Treat snapshots themselves as sensitive material: store them with access control, encryption, and integrity verification in place.

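Restoration requires the original unseal keys (Shamir key shares) or, with auto-unseal, access to the same seal (for example, the same KMS key). Because the stored data remains encrypted by Vault's barrier, someone who obtains a snapshot cannot read its contents without those keys. Version compatibility is not guaranteed across arbitrary versions: restoring to the same version as the source, or to a newer version you have verified as compatible, is safest.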

  • Snapshots can be taken online (no downtime is typically required)
  • Any node accepts the command; the leader handles it (when forwarding is enabled)
  • Snapshots only cover storage state; configuration files and certificates are managed separately
  • Restoration requires the original Unseal keys
  • Version compatibility is limited; restore to the same version as the source, or to a newer one you have tested

Basic verification commands (leader check and peer status)

export VAULT_ADDR=https://vault.example.com:8200
export VAULT_TOKEN=s.xxxxx  # token with sudo privileges

# Check the leader (HA mode and active node address)
vault status | grep -E 'HA Mode|HA Cluster|Active Node Address'

# Check the Raft peers
vault operator raft list-peers
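The same pre-check can be scripted. The sketch below assumes the default table output of `vault status` (rows such as `Sealed  false`, `Storage Type  raft`, `HA Enabled  true`); `vault_status_ok` is a hypothetical helper, not a Vault command, and it reads the output on stdin so it can be exercised without a live cluster.

```shell
# Hypothetical helper (not part of the Vault CLI): reads `vault status` table
# output on stdin and succeeds only if the node is unsealed, uses Raft storage,
# and has HA enabled.
vault_status_ok() {
  awk '
    # Every field except the last forms the key; the last field is the value.
    {
      key = $1
      for (i = 2; i < NF; i++) key = key " " $i
      val[key] = $NF
    }
    END {
      if (val["Sealed"] == "false" && val["Storage Type"] == "raft" && val["HA Enabled"] == "true")
        exit 0
      exit 1
    }'
}

# Usage against a live node:
#   vault status | vault_status_ok && echo "pre-check passed"
```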

Backup Procedure (Taking a Snapshot)

Routine backups follow this order: 1) check peers and health, 2) save the snapshot to a secure temporary location, 3) compute a hash, 4) rotate according to the retention policy, and 5) record the operation in audit logs.

Creating a snapshot captures a point-in-time consistent view on the leader. In large environments, writing the file can take a while, so check free space and write throughput on the destination volume in advance.

  • Pre-check: ensure vault status and raft list-peers are stable
  • Reserve free space at the destination (pre-compression size roughly matches your data usage)
  • Store the hash and metadata (creation time, source node, Vault version) alongside the snapshot
  • Watch latency during business hours (you may need to temporarily relax monitoring thresholds)
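The free-space check in the list above can be automated before each run. A minimal sketch, assuming POSIX `df -Pk` output (the fourth field of the second line is available space in KiB); `has_free_space` is a hypothetical helper, and the required size is something you choose, for example a multiple of the previous snapshot's size.

```shell
# Hypothetical helper: succeeds when the filesystem holding DIR has at least
# REQUIRED_KB KiB available. With -P (POSIX format) and -k (1024-byte blocks),
# field 4 of the second line of df output is the available space.
has_free_space() {
  local dir=$1 required_kb=$2 avail_kb
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$required_kb" ]
}

# Usage (the 2x margin over the previous snapshot is an assumption, not a rule):
#   last=$(ls -1t /var/backups/vault/vault-raft-*.snap | head -n 1)
#   need_kb=$(( $(du -k "$last" | cut -f1) * 2 ))
#   has_free_space /var/backups/vault "$need_kb" || echo "not enough space" >&2
```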

Example: taking and verifying a snapshot

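# Take the snapshot
SNAP_DIR=/var/backups/vault
SNAP_FILE=${SNAP_DIR}/vault-raft-$(date +%Y%m%d-%H%M%S).snap
mkdir -p "$SNAP_DIR"

vault operator raft snapshot save "$SNAP_FILE"

# Integrity check (hash)
sha256sum "$SNAP_FILE" | tee -a "${SNAP_DIR}/SHA256SUMS"

# Record metadata (creation time, Vault version, leader address)
{
  echo "created_at=$(date -Is)"
  echo "vault_version=$(vault version)"
  # list-peers JSON: data.config.servers[] entries carry "address" and a boolean "leader"
  echo "leader=$(vault operator raft list-peers -format=json | jq -r '.data.config.servers[] | select(.leader) | .address')"
} | tee -a "${SNAP_FILE}.meta"

# Rotation (example: keep 30 generations; each .meta file is deleted with its snapshot)
ls -1t "${SNAP_DIR}"/vault-raft-*.snap | tail -n +31 | xargs -r -I{} rm -f {} {}.meta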

Snapshot Storage, Encryption, and Retention

Copy snapshots to at least one offsite location, such as another region or availability zone, and send the hash and metadata along with them so that tampering can be detected. On cloud storage, versioning and WORM (object lock) further improve recovery resilience.

Encrypt snapshots both in transit and at rest. Where possible, layer client-side encryption (GPG) on top of server-side encryption (SSE-KMS), and keep the set of people who hold decryption keys small. Logging each backup's success or failure and its hash to your audit trail also makes compliance reviews easier.

  • Upload immediately to redundant storage like S3 or GCS
  • Server-side encryption (SSE-KMS) or client-side encryption (GPG)
  • Use WORM, versioning, and lifecycle rules for generation management
  • Run restore drills quarterly and record measured recovery times
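For the restore drills in the last bullet, a small wrapper can record the outcome and elapsed time in a consistent format. This is a hypothetical sketch: `record_drill`, the log path, and the drill script named in the usage line are illustrative, and timing uses whole seconds from `date +%s`.

```shell
# Hypothetical wrapper: run one restore-drill command, then append the outcome
# and the elapsed wall-clock seconds to a log file. Returns the drill's success.
record_drill() {
  local log=$1 start end status
  shift
  start=$(date +%s)
  if "$@"; then status=ok; else status=failed; fi
  end=$(date +%s)
  printf '%s drill=%s duration_s=%s cmd=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$status" "$((end - start))" "$*" >> "$log"
  [ "$status" = ok ]
}

# Usage (restore-drill.sh is a placeholder for your own drill script):
#   record_drill /var/log/vault-restore-drills.log ./restore-drill.sh staging
```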

Example: transfer to S3 (SSE-KMS) and GPG encryption

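# Upload to S3 (server-side encryption with a KMS key)
AWS_BUCKET=s3://org-backup-vault/
KMS_KEY_ID=arn:aws:kms:us-east-1:123456789012:key/xxxx
aws s3 cp "$SNAP_FILE" "$AWS_BUCKET" --sse aws:kms --sse-kms-key-id "$KMS_KEY_ID"
aws s3 cp "${SNAP_FILE}.meta" "$AWS_BUCKET" --sse aws:kms --sse-kms-key-id "$KMS_KEY_ID"
aws s3 cp "${SNAP_DIR}/SHA256SUMS" "$AWS_BUCKET" --sse aws:kms --sse-kms-key-id "$KMS_KEY_ID"

# Example: encrypt client-side with GPG before sending (gpg writes ${SNAP_FILE}.gpg)
# GPG_RECIPIENT: key ID or email address of the backup recipient (placeholder)
gpg --encrypt --recipient "$GPG_RECIPIENT" "$SNAP_FILE"
aws s3 cp "${SNAP_FILE}.gpg" "$AWS_BUCKET" --sse aws:kms --sse-kms-key-id "$KMS_KEY_ID"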

Restore Procedure (Single-Node Recovery and Full Cluster Rebuild)

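When the whole cluster has been lost, the standard approach is to bring up one node with an empty Raft data directory, initialize and unseal it temporarily so that it can accept an authenticated request, apply the snapshot to it, unseal it with the original keys, and then have the other nodes join. The recovery node's data directory must be empty so that it starts as a new, uninitialized node instead of coming back with its old Raft state. If only one node of an otherwise healthy cluster fails, a snapshot is usually not needed: remove the failed peer (vault operator raft remove-peer) and join a replacement node that has an empty data directory.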
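A pre-flight check for the empty data directory can stop a recovery script before it starts Vault on a node that still holds old Raft data. A minimal sketch: `data_dir_is_empty` is a hypothetical helper, and `/opt/vault/data` is the example path used in the recovery commands in this article (substitute the `path` from your `storage "raft"` stanza).

```shell
# Hypothetical pre-flight check: succeeds only if DIR exists and contains
# no entries at all (hidden files included, via ls -A).
data_dir_is_empty() {
  local dir=$1
  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir" >&2
    return 1
  fi
  [ -z "$(ls -A "$dir")" ]
}

# Usage:
#   data_dir_is_empty /opt/vault/data || { echo "refusing to start recovery node" >&2; exit 1; }
```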

Restore on the same Vault version as the source, or on a compatible one. After restoration, expired dynamic secrets and leases are cleaned up based on their TTLs. Server configuration (vault.hcl) and TLS certificates are not in the snapshot, so prepare equivalent versions separately.

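  • Full recovery order: start one node with an empty data_dir → temporary init and unseal → restore -force → unseal with the original keys → other nodes join
  • Empty the recovery node's data_dir so it starts as a fresh, uninitialized node (leftover Raft data would bring back the old state and peer set)
  • vault operator raft snapshot restore needs -force when the snapshot's seal keys differ from those of the running cluster (for example, a freshly initialized recovery node); -force bypasses that key-consistency check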
  • After recovery, verify with raft list-peers and functional checks (auth, secret reads)
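Before a restore, it is worth confirming that the snapshot still matches the hash recorded at backup time. A sketch, assuming the `SHA256SUMS` file is in the default `sha256sum` format (`HASH  PATH`) and paths contain no spaces; `verify_snapshot_hash` is a hypothetical helper, and if a path was recorded more than once, the last entry wins.

```shell
# Hypothetical helper: compare the current SHA-256 of SNAP with the hash
# recorded for the same path in SUMS. Fails if no entry exists or they differ.
verify_snapshot_hash() {
  local snap=$1 sums=$2 expected actual
  # Last recorded hash whose path field equals the snapshot path
  expected=$(awk -v f="$snap" '$2 == f { h = $1 } END { print h }' "$sums")
  if [ -z "$expected" ]; then
    echo "no recorded hash for $snap" >&2
    return 1
  fi
  actual=$(sha256sum "$snap" | awk '{ print $1 }')
  [ "$expected" = "$actual" ]
}

# Usage:
#   verify_snapshot_hash /var/backups/vault/vault-raft-20240401-000000.snap /var/backups/vault/SHA256SUMS
```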

Example: full cluster recovery (assuming systemd)

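# 1) Stop Vault on every node
sudo systemctl stop vault

# 2) Clear the data only on node A, the node used for recovery
sudo rm -rf /opt/vault/data/*

# 3) Start node A; with an empty data_dir it comes up uninitialized
sudo systemctl start vault
export VAULT_ADDR=https://node-a.example.com:8200

# 4) Initialize temporarily so the restore API can be called with a token
#    (the temporary key and root token are discarded by the restore)
vault operator init -key-shares=1 -key-threshold=1
vault operator unseal                 # enter the temporary unseal key
export VAULT_TOKEN=s.xxxxx            # temporary root token printed by init

# 5) Apply the snapshot on A (-force: its keys differ from the temporary ones)
vault operator raft snapshot restore -force /var/backups/vault/vault-raft-20240401-000000.snap

# 6) Unseal with the ORIGINAL unseal keys (repeat until the threshold is met, e.g. 3 times)
vault operator unseal
vault operator unseal
vault operator unseal

# 7) Verify (list-peers needs a token from the original cluster)
vault status
vault operator raft list-peers

# 8) On nodes B/C: empty the data_dir, start Vault, then join A
# Run on node B (after exporting B's VAULT_ADDR)
sudo rm -rf /opt/vault/data/*
sudo systemctl start vault
vault operator raft join https://node-a.example.com:8200
vault operator unseal   # repeat with the original keys until the threshold is met

# Node C: same steps
vault operator raft join https://node-a.example.com:8200
vault operator unseal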
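Between steps it helps to wait until a node reports unsealed before moving on. `vault status` exits with 0 when unsealed, 2 when sealed, and 1 on error, so a generic polling helper can wait on it. `wait_until`, the 120-second timeout, and the 5-second interval are illustrative assumptions.

```shell
# Hypothetical helper: re-run a command every INTERVAL_S seconds until it
# exits 0, giving up after roughly TIMEOUT_S seconds.
wait_until() {
  local timeout_s=$1 interval_s=$2 deadline
  shift 2
  deadline=$(( $(date +%s) + timeout_s ))
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout_s}s: $*" >&2
      return 1
    fi
    sleep "$interval_s"
  done
}

# Usage: after unsealing node A, wait until vault status exits 0 (unsealed)
#   wait_until 120 5 vault status || exit 1
```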

Comparison with Alternatives and Architecture Overview

Backup options for Vault include Raft snapshots, filesystem-level copies, and Consul snapshots (when Consul storage is in use). When you are running on Raft integrated storage, Raft snapshots are almost always the safest choice.

When designing operations, decide upfront on snapshot storage location, encryption, retention, separate handling for configuration files and certificates, and recovery drill frequency. That preparation eliminates hesitation when a real incident hits.

  • On Raft, use Raft snapshots as the first choice
  • Configuration files and TLS materials need separate backups
  • Always validate version compatibility for migrations (rehearse in staging)
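Because vault.hcl and TLS material are outside the snapshot, they need their own backup job. A minimal sketch with tar; `backup_config` and the example paths in the usage line are hypothetical, and the resulting archive contains TLS private keys, so it needs the same encryption and access control as the snapshots.

```shell
# Hypothetical helper: archive the given config/TLS files into a timestamped
# .tar.gz under DEST_DIR and print the archive path.
backup_config() {
  local dest_dir=$1 out
  shift
  mkdir -p "$dest_dir" || return 1
  out="$dest_dir/vault-config-$(date +%Y%m%d-%H%M%S).tar.gz"
  tar -czf "$out" "$@" || return 1
  printf '%s\n' "$out"
}

# Usage (paths are examples only):
#   backup_config /var/backups/vault-config /etc/vault.d/vault.hcl /etc/vault.d/tls
```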

Method | Consistency | Downtime / Impact | Operational Notes
--- | --- | --- | ---
Raft snapshot (vault operator raft snapshot) | Consistent, based on the leader's applied state | Effectively none (taken online) | Officially recommended. Manage config/TLS separately. Restore by applying to one node, then having the others join.
Filesystem copy (stop, then rsync/snapshot) | Consistent when stopped; risky while running | Requires downtime | For small or single-node setups that can be stopped; involves more manual steps
Consul snapshot (when using Consul storage) | Depends on Consul's consistency model | Low (can be taken online) | Only valid with the Consul backend; not available on Vault integrated storage

Raft snapshot capture and restore flow (conceptual diagram)

[Figure: Capture: clients → NodeA / NodeB / ... → forwarded to the leader → consistent Raft state → snapshot save → /backups/vault/raft.snap → offsite storage (S3/GCS). Restore: new node (clean data_dir) → snapshot restore -force → unseal and verify → other nodes join.]

For comparison: example with the Consul backend (reference only)

# Reference only: when Vault uses Consul storage (not applicable to Raft integrated storage)
# Take a Consul snapshot
consul snapshot save consul-$(date +%Y%m%d).snap
# Restore
consul snapshot restore consul-20240401.snap

Exam Checklist and Troubleshooting

Ops exams frequently ask about the snapshot commands, the restore order, the requirement for Unseal keys, the fact that configuration is not in the snapshot, and how version compatibility is handled. Make sure you can state the recovery order — apply to one node, unseal, then join — without hesitation.

Typical errors include restoring a snapshot from another cluster without -force (the seal-key consistency check rejects it), calling restore on a node that is uninitialized or sealed (the restore API needs an authenticated token), leader-forwarding failures during network partitions, and version-compatibility errors. Scripting the restore procedure and rehearsing it in staging minimizes hesitation when production is on the line.

  • Memorize the commands: vault operator raft snapshot save / restore (restore needs -force when the snapshot's keys differ from the running cluster's)
  • Back up configuration and TLS separately — they are not in the snapshot
  • Unsealing is mandatory after restore — Unseal key management is often the bottleneck
  • Run join from the new node and point it at the leader's address
  • Restore on a version equal to or newer than the source (watch for backward-incompatible changes)
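The version rule in the last bullet can be checked before a restore, for example by comparing the `vault_version` recorded in the snapshot's `.meta` file with `vault version` on the target. A sketch, assuming `vault version` output starts with `Vault vX.Y.Z` and GNU `sort -V` is available; the helpers are hypothetical, and passing this check does not prove compatibility, it only catches restores onto an older binary.

```shell
# Hypothetical helpers.
# vault_semver: extract "1.15.2" from a line such as "Vault v1.15.2 (abc123), built ..."
# (leading whitespace is tolerated)
vault_semver() {
  printf '%s\n' "$1" | sed -n 's/^[[:space:]]*Vault v\([0-9][0-9.]*\).*/\1/p'
}

# version_ge TARGET SOURCE: succeeds when TARGET >= SOURCE in version order
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage:
#   src=$(vault_semver "$(sed -n 's/^vault_version=//p' snapshot.snap.meta)")
#   dst=$(vault_semver "$(vault version)")
#   version_ge "$dst" "$src" || echo "target Vault is older than the snapshot source" >&2
```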

Commands for troubleshooting

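# Check leader status (HA Mode, Active Node Address) when forwarding to the leader fails
vault status

# Inspect peers and voter status
vault operator raft list-peers -format=json | jq .

# Snapshot sizes, and re-verify hashes against the recorded SHA256SUMS
ls -lh /var/backups/vault/*.snap
sha256sum -c --ignore-missing /var/backups/vault/SHA256SUMS

# Autopilot state (a reference for judging cluster stability)
vault operator raft autopilot state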
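The peer check can also be scripted against the table that `vault operator raft list-peers` prints (columns Node, Address, State, Voter, after a header and a dashed line). `raft_peers_ok` is a hypothetical helper; the expected voter count is an input you choose, such as 3 for a three-node cluster.

```shell
# Hypothetical check on `vault operator raft list-peers` table output (stdin):
# prints the counts and succeeds when there is exactly one leader and at
# least MIN_VOTERS voters.
raft_peers_ok() {
  local min_voters=$1
  awk -v min="$min_voters" '
    NR <= 2 { next }                 # skip the header and the dashed line
    NF >= 4 {
      if ($3 == "leader") leaders++
      if ($4 == "true") voters++
    }
    END {
      printf "leaders=%d voters=%d\n", leaders, voters
      if (leaders == 1 && voters >= min) exit 0
      exit 1
    }'
}

# Usage:
#   vault operator raft list-peers | raft_peers_ok 3
```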

Check with a Practice Question

Ops

Question 1

You are running Vault on integrated storage (Raft). A disk failure has wiped out the data directories on all nodes, but the most recent Raft snapshot is safely stored. Which recovery procedure is most appropriate?

  1. Start one Vault node with a clean data directory, initialize and unseal it temporarily, run vault operator raft snapshot restore -force, and unseal it with the original keys. Then start the remaining nodes with clean data directories and run vault operator raft join on each one.
  2. Extract the snapshot on any node, copy the files into data_dir, and start all nodes at the same time.
  3. Since the snapshot includes configuration, vault.hcl is unnecessary. Restoring on any node will sync the rest automatically.
  4. Run vault operator raft snapshot restore against a different live cluster and fail over immediately.

Correct answer: 1

The correct recovery is to apply the snapshot to one node, unseal it, and have the other nodes join. Copying directly into data_dir or starting all nodes simultaneously produces inconsistencies. The snapshot does not contain configuration, so vault.hcl must be provided separately. Overwriting another cluster is both discouraged and dangerous.

Frequently Asked Questions

Is it safe to take a snapshot while writes are happening?

Yes. The request is forwarded to the leader, which captures a consistent view as of its latest applied index. No downtime is typically required, but monitor for increased latency.

Do snapshots include server configuration or TLS certificates?

No. Back up vault.hcl, TLS certificates, and keys separately, and prepare an equivalent configuration during recovery. Snapshots only cover storage state (secrets, policies, etc.).

Can I restore to a different version of Vault?

Compatibility is limited. As a rule, restore to the same version as the source, or to a newer version you have verified as compatible. If there is a large version gap, validate in staging beforehand and upgrade in stages if needed.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


© 2026 NicheeLab All rights reserved.