This article focuses on adding and removing nodes from a Vault Integrated Storage (Raft) cluster. It walks through how to maintain odd-numbered topologies, preserve quorum, design TLS and addressing, and execute removals during planned downtime or outages — with a strong operations bias.
For the exam, knowing which commands apply where (leader vs. any node), how to calculate quorum, the requirements for join and remove, and the role of Autopilot translates directly into points. The discussion follows the terminology and behavior in the official documentation.
Vault's Integrated Storage uses Raft for both replication and leader election. Availability hinges on a majority of voting nodes being reachable. The recommended topology is an odd number — 3 or 5 nodes — and any plan to add or remove members must keep the cluster above quorum at every step.
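The majority rule can be made concrete with a little arithmetic. The following is a minimal shell sketch; the function names `quorum` and `failure_tolerance` are illustrative helpers, not Vault commands.

```shell
#!/usr/bin/env bash
# Illustrative helpers; these are not Vault commands.

# Majority needed for N voting nodes: floor(N/2) + 1
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Voters that can be lost while a majority remains: floor((N-1)/2)
failure_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

# Print the numbers for common cluster sizes
for n in 3 4 5 6 7; do
  echo "voters=$n quorum=$(quorum "$n") tolerance=$(failure_tolerance "$n")"
done
```

The loop shows why odd counts are preferred: 4 voters tolerate one failure, the same as 3, and 6 voters tolerate two, the same as 5.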
Each node holds a local raft data directory and uses an API port (default 8200) plus an internal cluster-communication port (default 8201), both expected to run over TLS. api_addr and cluster_addr must be configured with valid FQDNs/SANs, and mutual name resolution and connectivity are prerequisites for adding a node.
Minimal Vault server configuration example (for a new node)
storage "raft" {
path = "/opt/vault/data" # Point at an empty directory
node_id = "vault-3" # Must be unique per node
}
cluster_name = "vault-prod"
api_addr = "https://vault-3.example.com:8200"
cluster_addr = "https://vault-3.example.com:8201"
listener "tcp" {
address = "0.0.0.0:8200"
tls_cert_file = "/etc/vault.d/certs/vault.crt"
tls_key_file = "/etc/vault.d/certs/vault.key"
tls_disable = false # Keep TLS enabled; tls_disable is a listener parameter, so it belongs inside this block
}
Here is the canonical procedure for adding a new node to a running cluster. You can scale out while preserving availability. Before starting the target node, configuration, certificates, and port connectivity must all be in place.
Run join from the new node. Set VAULT_ADDR to the new node's own API, and pass the API address of the existing cluster's leader (or any reachable member) as the argument.
| Add / Deploy Approach | Primary Purpose / Use Case | Downtime Impact | Key Commands / Steps |
|---|---|---|---|
| Online join (add more nodes to the same cluster) | Scale-out, keep an odd node count | None (as long as quorum is maintained) | vault operator raft join, verify with list-peers |
| Replace a failed node (keep node count constant) | Replace a permanently failed node, restore health | None (remove-peer first, then join) | Confirm peer_id with list-peers → remove-peer → join from the new node |
| Rebuild a new cluster from a snapshot | Large-scale migration or disaster recovery (separate environment) | Planned downtime required at cutover | vault operator raft snapshot save/restore (build as a separate cluster) |
Node addition flow (overview)
Before (3 nodes):
[vault-1] (Leader) ---- [vault-2] (Follower)
\
\---------------- [vault-3] (Follower)
Add vault-4:
New Node: [vault-4]
- start w/ empty raft dir
- api_addr/cluster_addr set
- join -> https://vault-1:8200
After (4 nodes, recommended end state is 5 nodes):
[vault-1] (Leader) ---- [vault-2]
\
\---- [vault-3] ---- [vault-4]
Example commands for adding a node
# Run on the new node (targeting its own Vault API)
export VAULT_ADDR="https://vault-4.example.com:8200"
export VAULT_CACERT="/etc/vault.d/certs/ca.crt"
# Specify the existing cluster's API address and join
vault operator raft join https://vault-1.example.com:8200
# Verify peers from any node in the existing cluster
export VAULT_ADDR="https://vault-1.example.com:8200"
vault operator raft list-peers
Always perform removals in a way that preserves quorum. Dropping one node from a 5-node cluster leaves 4 voters, where the majority is 3, so the cluster then tolerates only one further failure. Plan the next operation so that two nodes are never stopped at the same time.
For planned downtime (the node is healthy), if the target is the leader, run step-down first, then remove it from cluster membership with remove-peer. When the node is permanently failed and will not return, run remove-peer with the peer_id as well.
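Before removing a node, it can help to check the arithmetic for the membership that will exist afterwards. The sketch below is a hypothetical helper (`can_remove` is not a Vault command); it assumes you supply the voter count and healthy-voter count yourself, for example from list-peers and autopilot state.

```shell
#!/usr/bin/env bash
# Hypothetical helper: is it safe to remove one voter?
#   $1 = current number of voters
#   $2 = number of currently healthy voters
#   $3 = state of the node being removed: "healthy" or "failed"
can_remove() {
  local voters=$1 healthy=$2 target_state=$3

  # The membership change itself needs a majority of the current voters.
  if [ "$healthy" -lt $(( voters / 2 + 1 )) ]; then
    return 1
  fi

  local remaining_voters=$(( voters - 1 ))
  local remaining_healthy=$healthy
  if [ "$target_state" = "healthy" ]; then
    remaining_healthy=$(( healthy - 1 ))
  fi

  # After removal, the healthy nodes must still form a majority.
  [ "$remaining_healthy" -ge $(( remaining_voters / 2 + 1 )) ]
}

# 5 voters, 1 permanently failed: removing the failed node is safe.
can_remove 5 4 failed && echo "ok to remove the failed node"
# 3 voters, 1 already down: removing a healthy node would lose the majority.
can_remove 3 2 healthy || echo "do not remove a healthy node now"
```

This only models majority arithmetic; it does not replace checking real cluster state before each step.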
Safe removal procedure (planned downtime / failure)
# 1) Check peers (from any node)
export VAULT_ADDR="https://vault-1.example.com:8200"
vault operator raft list-peers
# Identify the target PEER_ID from the Node column
# 2) If the target is the leader, step it down first (VAULT_ADDR must point at the leader)
vault operator step-down
# 3) Remove it from membership (the peer ID is a positional argument)
vault operator raft remove-peer <PEER_ID>
# 4) Stop the service and handle data on the target node
# systemctl stop vault
# rm -rf /opt/vault/data # Only if the directory will not be reused (be careful)
Raft refuses writes unless a majority of voting nodes is alive. For additions, removals, and rolling upgrades, always verify quorum against the number of voters that will remain after the next step.
Documenting basic rules in your operations runbook — never stop two or more nodes simultaneously, freeze planned work when an incident occurs, and prioritize stabilizing the cluster first — prevents avoidable incidents.
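A runbook rule such as "never stop two nodes at once" is easier to enforce when the voter count is computed rather than eyeballed. The sketch below counts voters from list-peers table output. It assumes the default table layout (a header row, a dashed separator, then Node, Address, State, Voter columns); `count_voters` is a hypothetical helper, and the sample rows are illustrative, not captured from a real cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count voting peers from list-peers table output on stdin.
# Assumes rows 1-2 are the header and separator, and column 4 is "Voter".
count_voters() {
  awk 'NR > 2 && $4 == "true" { n++ } END { print n + 0 }'
}

# Illustrative sample data (not real cluster output)
sample='Node       Address                   State       Voter
----       -------                   -----       -----
vault-1    vault-1.example.com:8201  leader      true
vault-2    vault-2.example.com:8201  follower    true
vault-3    vault-3.example.com:8201  follower    true'

voters=$(printf '%s\n' "$sample" | count_voters)
echo "voters=$voters quorum=$(( voters / 2 + 1 ))"
```

Against a live cluster the same function could be fed with `vault operator raft list-peers | count_voters`. Note that list-peers reports membership, not liveness, so pair it with autopilot state before stopping anything.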
Quick health check
export VAULT_ADDR="https://vault-1.example.com:8200"
# Peer information
vault operator raft list-peers
# Autopilot health (lag and candidacy)
vault operator raft autopilot state
Most join failures come from mismatched TLS/SAN and address settings. api_addr is the URL clients and CLIs reach, while cluster_addr is used for inter-node replication (port 8201). Confirm that the certificate's SAN includes both hostnames (or a wildcard entry that covers them).
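The SAN check can be scripted so that it fails loudly instead of relying on a visual scan. This is a minimal sketch: `san_has_dns` is a hypothetical helper that looks for an exact DNS entry in SAN text read from stdin, and the sample text is illustrative. Wildcard entries are deliberately not expanded.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the SAN text on stdin lists the exact DNS name $1.
# Only literal matches are checked; wildcard entries are not expanded.
san_has_dns() {
  tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -qxF "DNS:$1"
}

# Illustrative SAN text in the shape openssl prints (not from a real certificate)
sample_san='X509v3 Subject Alternative Name:
    DNS:vault-3.example.com, DNS:vault.example.com, IP Address:10.0.0.13'

if printf '%s\n' "$sample_san" | san_has_dns "vault-3.example.com"; then
  echo "SAN covers vault-3.example.com"
fi
```

With a real certificate, the input could come from `openssl x509 -in /etc/vault.d/certs/vault.crt -noout -ext subjectAltName` (the `-ext` option requires OpenSSL 1.1.1 or later). Run the check once for the api_addr host and once for the cluster_addr host.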
Another classic cause is attempting to join with a raft directory that already contains data. The new node's raft data directory must be empty.
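A small guard before the join step can catch a non-empty data directory early. The following is a sketch under stated assumptions: `raft_dir_is_empty` is a hypothetical helper, and the demo uses a temporary directory rather than the real /opt/vault/data path.

```shell
#!/usr/bin/env bash
# Hypothetical pre-join guard: succeed only if directory $1 exists and is empty.
raft_dir_is_empty() {
  local dir=$1
  [ -d "$dir" ] || return 1
  # ls -A lists every entry except . and .., including dotfiles
  [ -z "$(ls -A "$dir")" ]
}

# Demo against a throwaway directory
demo_dir=$(mktemp -d)
raft_dir_is_empty "$demo_dir" && echo "empty: safe to join"

# Simulate leftover data from an earlier cluster
touch "$demo_dir/vault.db"
raft_dir_is_empty "$demo_dir" || echo "not empty: do not join with this directory"

rm -rf "$demo_dir"
```

In a real runbook the argument would be the `path` value from the node's `storage "raft"` block.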
Examples for pre-checking TLS/SAN and connectivity
# Check the SAN (whether the cert extension includes the api_addr/cluster_addr hosts)
openssl x509 -in /etc/vault.d/certs/vault.crt -noout -text | grep -A1 "Subject Alternative Name"
# API / cluster connectivity
curl --cacert /etc/vault.d/certs/ca.crt https://vault-1.example.com:8200/v1/sys/health
# On 8201 this only confirms reachability, even if certificate verification fails
curl --cacert /etc/vault.d/certs/ca.crt https://vault-1.example.com:8201/raft/configuration || true
For rolling upgrades or OS patching, work one node at a time: if the target is the leader, run step-down, then stop, start, and verify health. Run list-peers and autopilot state after each node to confirm the cluster stays above quorum before moving on.
Autopilot helps with cluster health monitoring and cleaning up unneeded servers. In operations, lean on state visibility first, and apply any configuration changes (set-config) in stages — validate before promoting to production.
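The "verify health before moving on" step in a rolling operation is essentially a polling loop. Below is a minimal sketch of one; `wait_until` is a hypothetical helper, and which command you poll is your own choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
#   $1 = maximum attempts, $2 = seconds to sleep between attempts,
#   remaining arguments = the command to run
wait_until() {
  local attempts=$1 delay=$2
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only if another attempt will follow
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$(( i + 1 ))
  done
  return 1
}

# Demo with trivial commands
wait_until 3 0 true && echo "check passed"
wait_until 2 0 false || echo "check never passed; stop the rollout"
```

One possible use after restarting a node is `wait_until 30 5 vault status`, since `vault status` exits 0 only when the node is reachable and unsealed. Treat a timeout as a signal to pause the rollout rather than continue to the next node.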
Sample shell snippet for rolling operations (excerpt)
# Step down if the target host is the leader
vault operator step-down || true
# Verify health
vault operator raft list-peers
vault operator raft autopilot state
# Apply OS / binary updates here -> restart
# Re-verify health and move on to the next node
Question 1
In a 5-node Vault Raft cluster, one node has permanently failed with no chance of recovery. Which procedure most appropriately removes it and restores cluster health without harming availability?
Correct answer: A
When a failed node will not return, remove-peer is the correct way to formally drop it from cluster membership; once the cluster is healthy, join the replacement. The majority threshold is 3/5, so availability is preserved throughout. Joining first to reach 6 nodes can be acceptable from a pure quorum standpoint, but for membership consistency and operational clarity, removing the dead node before replacing it is preferred. The other options rely on a misconception about automatic rejoin or involve an unnecessary full stop.
On which node should I run remove-peer? Can only the leader execute it?
You can run it on any reachable node and it will be forwarded to the leader internally. Before executing, confirm the target peer_id with vault operator raft list-peers.
Which address does join need: api_addr or cluster_addr?
Pass the existing cluster's API address (https://<host>:8200) to the join command. Internal replication uses each node's cluster_addr (typically 8201), so certificates and connectivity for both ports must be configured correctly.
Should I avoid even-numbered clusters (4 or 6 nodes)?
By the nature of Raft, an odd number of voting nodes is preferred. Even-numbered clusters still work, but the extra node adds no failure tolerance: 4 nodes tolerate one failure, the same as 3, because the majority rises to 3. That leaves less headroom for simultaneous failures and planned downtime. In operations, 3 or 5 nodes is the baseline, and any scale-out plan should land back on an odd count.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.