Raft Autopilot keeps Vault's integrated storage (Raft) healthy autonomously, reducing human error during failures and operational events.
This article centers on stable concepts from the official documentation and covers both exam-favorite topics and the practical settings and checks you use day to day in operations.
Vault's integrated storage uses Raft consensus to replicate and reconcile data. Autopilot handles health monitoring for this Raft cluster and automatic corrective actions (for example, cleaning up long-unhealthy nodes). The goal is to maintain availability and reduce operational burden.
Autopilot operates within bounds that preserve Raft's fundamental behavior, including leader election, and takes safe actions based on thresholds. Configuration can be changed via the API or CLI and applied incrementally. It can also make decisions that respect zone redundancy, which pays off in multi-AZ and multi-rack deployments.
Autopilot evaluates the cluster from several angles: node reachability, replication lag, and stabilization time. Evaluation is not instantaneous — it is based on observation over a window of time. This design prevents short spikes from triggering false actions.
Whether for the exam or production use, focus on what each threshold means and what side effects it produces. Overly strict settings cause false positives; settings that are too loose delay cleanup. Adjust incrementally based on your network characteristics (LAN/WAN) and node count.
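To make the meaning of each threshold concrete, here is a minimal sketch of how the three checks could be combined. It is a simplification for study purposes and not Vault's actual implementation. The function names and the numeric values are assumptions chosen for illustration; only the setting names come from Autopilot.

```shell
#!/bin/sh
# Illustrative thresholds (assumed values; tune them for your own network).
LAST_CONTACT_THRESHOLD_MS=400    # max time since the leader last heard from the node
MAX_TRAILING_LOGS=250            # max number of log entries the node may lag behind
SERVER_STABILIZATION_TIME_S=12   # how long a node must stay healthy to count as stable

# node_is_healthy LAST_CONTACT_MS TRAILING_LOGS
# Hypothetical helper: succeeds only when both values are within their thresholds.
node_is_healthy() {
  [ "$1" -le "$LAST_CONTACT_THRESHOLD_MS" ] && [ "$2" -le "$MAX_TRAILING_LOGS" ]
}

# node_is_stable HEALTHY_FOR_SECONDS
# Hypothetical helper: succeeds once the node has been healthy long enough.
node_is_stable() {
  [ "$1" -ge "$SERVER_STABILIZATION_TIME_S" ]
}

node_is_healthy 120 40 && echo "healthy"
node_is_healthy 900 40 || echo "unhealthy: last contact exceeds threshold"
node_is_stable 5 || echo "healthy, but not yet stable"
```

Lowering the first two values makes detection faster but raises the chance that a brief network spike marks a node unhealthy. Raising them has the opposite effect.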
Raft elects leaders automatically; Autopilot adds supporting functions on top of that (health maintenance and cleanup). Before planned maintenance on the leader, step it down voluntarily (step-down) so that the re-election happens immediately and at a time you control.
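Nodes that are unreachable or that exceed log-lag thresholds for an extended period get marked unhealthy. With cleanup_dead_servers enabled (which also requires min_quorum to be set), they become candidates for removal. Assigning nodes to redundancy zones reduces the risk of voters skewing into a single zone; in Vault this is an Enterprise feature configured per node with autopilot_redundancy_zone in the raft storage stanza.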
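The idea that cleanup must never endanger quorum can be sketched as a simple guard. The two rules below are a simplified assumption about the kind of safety check involved, not a transcription of Vault's logic, and `can_cleanup` is a hypothetical helper name.

```shell
#!/bin/sh
# can_cleanup VOTERS DEAD MIN_QUORUM
# Hypothetical guard: succeeds only when removing DEAD servers looks safe.
can_cleanup() {
  voters=$1
  dead=$2
  min_quorum=$3
  remaining=$((voters - dead))

  # Assumed rule 1: the dead servers must be a strict minority of the voters.
  [ $((dead * 2)) -lt "$voters" ] || return 1

  # Assumed rule 2: never shrink the cluster below min_quorum.
  [ "$remaining" -ge "$min_quorum" ]
}

if can_cleanup 5 1 3; then echo "5 voters, 1 dead: cleanup allowed"; fi
if ! can_cleanup 5 3 3; then echo "5 voters, 3 dead: cleanup refused"; fi
if ! can_cleanup 3 1 3; then echo "3 voters, 1 dead, min_quorum 3: cleanup refused"; fi
```

The last case shows why min_quorum matters: the dead node stays in the peer list until an operator decides what to do.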
Visualizing Autopilot behavior in a 5-node multi-AZ cluster
Start with observation. Once state is visible, adjust thresholds in small increments. Sensible values differ between low-latency LAN clusters and clusters that span AZs or regions.
Configuration changes can be applied online. After each change, verify the effect via Autopilot state and cluster logs to confirm there is no excessive automatic retirement or false detection.
Inspecting and configuring Autopilot state (API/CLI examples)
# Check state (CLI)
vault operator raft autopilot state
# List peers (check each node's state and voter status)
vault operator raft list-peers
# Check state (API)
curl -sS \
  -H "X-Vault-Token: $VAULT_TOKEN" \
  "$VAULT_ADDR/v1/sys/storage/raft/autopilot/state" | jq .
# Get configuration (API)
curl -sS \
  -H "X-Vault-Token: $VAULT_TOKEN" \
  "$VAULT_ADDR/v1/sys/storage/raft/autopilot/configuration" | jq .
# Update configuration (API)
# cleanup_dead_servers requires min_quorum; the value 3 assumes a 5-node cluster.
# Redundancy zones are set per node in the storage stanza
# (autopilot_redundancy_zone, Enterprise), not through this endpoint.
curl -sS -X PUT \
  -H "X-Vault-Token: $VAULT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "cleanup_dead_servers": true,
    "min_quorum": 3,
    "last_contact_threshold": "400ms",
    "max_trailing_logs": 250,
    "server_stabilization_time": "12s"
  }' \
  "$VAULT_ADDR/v1/sys/storage/raft/autopilot/configuration"
# Step the leader down before planned maintenance (minimizes impact)
vault operator step-down
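The threshold settings from the API call above can also be applied with the CLI. The following is a sketch that wraps the `vault operator raft autopilot set-config` and `get-config` subcommands in two small functions. The function names and the `VAULT_BIN` variable are assumptions added here so the command line can be previewed without touching a cluster.

```shell
#!/bin/sh
# VAULT_BIN is a hypothetical override; set it to "echo" to preview the command.
VAULT_BIN=${VAULT_BIN:-vault}

# Apply the same thresholds as the API example (values are illustrative).
set_autopilot_thresholds() {
  "$VAULT_BIN" operator raft autopilot set-config \
    -last-contact-threshold=400ms \
    -max-trailing-logs=250 \
    -server-stabilization-time=12s
}

# Read the configuration back to confirm the change took effect.
show_autopilot_config() {
  "$VAULT_BIN" operator raft autopilot get-config
}

# Usage (requires VAULT_ADDR and VAULT_TOKEN to be set):
#   set_autopilot_thresholds && show_autopilot_config
# Preview only:
#   VAULT_BIN=echo; set_autopilot_thresholds
```

Reading the configuration back right after each change keeps adjustments small and verifiable.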
Autopilot pays off more where failures and recoveries are frequent. That said, you shouldn't hand everything over to it — document the rationale for your thresholds and your exception procedures (manual intervention in emergencies). On the exam, expect questions about whether automatic retirement is allowed without breaking quorum, zone-aware decisions, and what each threshold actually means.
| Aspect | Raft+Autopilot | Raft (manual ops) | Consul backend |
|---|---|---|---|
| Failure detection / correction | Threshold-driven automatic cleanup and stabilization | Handled case by case by the operator | Consul-side health plus manual tuning |
| Zone-redundancy awareness | Redundancy zones (autopilot_redundancy_zone, Enterprise) suppress skew | Depends on the operator's judgment | Depends on Consul topology design |
| Operational load | Low to medium (mostly monitoring and tuning) | Medium to high (frequent night and emergency response) | Medium (keeping two products in alignment) |
| Architectural simplicity | High (self-contained within Vault) | High (Vault only) | Medium (requires running an external cluster) |
| Migration and scaling | Safe incremental scaling via server_stabilization_time | Success depends on the procedure design | Hinges on Consul-side scale design |
Here are the shortest viable procedures for common operational events, assuming Autopilot is enabled. Always prioritize preserving quorum, even in edge cases.
This section maps directly to the practical scenario questions you see on the exam.
Question 1
You operate a Vault Raft cluster across multiple AZs. Nodes in a particular AZ occasionally become unreachable for long periods. You want to automatically retire them without breaking quorum, while also avoiding load or risk concentrating in a single AZ. Which combination of settings is most appropriate?
Correct answer: A
To balance safe automatic retirement with avoiding zone skew, enable cleanup_dead_servers (together with min_quorum) and assign nodes to redundancy zones. Extreme thresholds or immediate-effect automation lead to false positives and instability.
Do Autopilot configuration changes require a restart?
Typical Autopilot settings can be changed online via the API or CLI. After changing them, observe autopilot state and cluster logs to confirm the behavior matches your expectations.
What is the minimum cluster size? Does a 2-node cluster work?
We recommend an odd number of 3 or more nodes (often 5 in real-world deployments). A 2-node cluster has no quorum safety margin and tends to lose availability during failures, so avoid it.
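The arithmetic behind this recommendation is short enough to check by hand. The sketch below uses the standard Raft majority rule; the function names are chosen here for illustration.

```shell
#!/bin/sh
# quorum_size N: smallest majority of N voters.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# failure_tolerance N: voters that can fail while a majority remains.
failure_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

for n in 2 3 4 5; do
  printf '%s nodes: quorum=%s, tolerates %s failure(s)\n' \
    "$n" "$(quorum_size "$n")" "$(failure_tolerance "$n")"
done
```

A 2-node cluster needs both nodes for quorum and so tolerates no failures, and a 4-node cluster tolerates no more failures than a 3-node one, which is why odd sizes are preferred.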
How does Autopilot participate during a minor upgrade?
Autopilot does not manage versions directly, but tuning server_stabilization_time and last_contact_threshold controls how stability is judged when nodes restart and rejoin. The safe approach is to step the leader down before planned maintenance and then apply rolling updates one node at a time.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.