Vault HA Cluster: Active/Standby Architecture (2026)

Running Vault in production starts with a precise understanding of the Active/Standby roles and their behavior. This article distills the key points from an Ops perspective, grounded in the stable concepts of the official documentation.

For certification prep, the most common topics are: Active is the only write endpoint, how Standby handles requests, how to choose between health checks, and failover behavior and procedures.

Active/Standby Fundamentals and HA Architecture Overview

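In a Vault HA cluster, exactly one node is Active (the leader) and handles all writes. The rest wait as Standby nodes, monitoring and following the Active node. If the Active node fails, a Standby takes over: with Integrated Storage (Raft), a new leader is elected by a majority (quorum) of voting nodes, while with Consul the Standby nodes compete for a leadership lock held in Consul.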

By default, Standby nodes forward client requests to the Active node over the cluster port; where forwarding is not available, they redirect the client to the Active node instead. In Enterprise, you can add Performance Standby nodes that serve read operations and certain APIs locally with low latency (writes still go only to Active).

Active: the sole write endpoint. Responsible for all cluster state changes.
Standby: monitors Active and waits. Forwards or redirects requests to the Active node.
Performance Standby (Enterprise): used to scale out read traffic.
HA storage: typically built on Integrated Storage (Raft) or Consul.

Role	Read / Write	/v1/sys/health status (default)	Primary responsibilities
Active	Reads and writes (sole write endpoint)	200	Acts as leader, drives state changes and replication
Standby	No local reads or writes (forwards or redirects requests to Active)	429	Watches and follows Active, participates in elections when there is no leader
Performance Standby (Ent)	Serves reads locally with low latency; writes go to Active	473	Read scale-out and latency reduction

Vault HA (Active/Standby) conceptual diagram

Common status check commands (selected)

# Check the current leader
curl -s http://vault.service:8200/v1/sys/leader | jq .

# List Raft peers (when using Integrated Storage)
vault operator raft list-peers

# Step down (e.g., for planned maintenance)
vault operator step-down

Request Flow and Health Check Design

When a client reaches a Standby, Vault funnels processing back to the Active node via redirects (or by forwarding the request). If your app cannot handle redirects or retries, the standard practice is to have the load balancer route only to the Active node.

Use /v1/sys/health or /v1/sys/leader for health checks. By default, /v1/sys/health returns 200 only on the Active node and 429 on a Standby. Query parameters such as standbyok=true and perfstandbyok=true change this, letting you choose whether to treat only Active as UP or to also accept Standby. Accepting Standby raises availability, but assumes clients can tolerate forwarding and retries.

Allow only Active: configure /v1/sys/health so Standby is not treated as UP.
Allow Standby too: add standbyok=true to the health check, rely on request forwarding, and let the LB distribute across all nodes.
The health and leader APIs are also useful for monitoring and automation (failover detection).

Concrete examples of health and leader checks

# Treat only Active as UP (example)
# -f makes curl exit non-zero on the 429 that a Standby returns by default
curl -sf http://vault.service:8200/v1/sys/health

# Health check that allows Standby (example)
# standbyok=true makes a Standby return 200 as well, so it also passes
curl -sf "http://vault.service:8200/v1/sys/health?standbyok=true"

# Leader info (useful for discovering the forwarding target URL)
curl -s http://vault.service:8200/v1/sys/leader | jq '{ha_enabled, is_self, leader_address}'
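
The status code returned by /v1/sys/health with default parameters is enough to tell roles apart in monitoring scripts. The following is a minimal sketch of that mapping; the function name vault_role_from_health_code and the role labels are hypothetical, chosen for illustration, and the address in the usage comment is the same assumed hostname used above.

```shell
# Map a /v1/sys/health HTTP status code (default query parameters) to a role label.
# The labels are illustrative; only the numeric codes come from Vault.
vault_role_from_health_code() {
  case "$1" in
    200) echo "active" ;;               # initialized, unsealed, active
    429) echo "standby" ;;              # unsealed, standby
    472) echo "dr-secondary" ;;         # DR replication secondary
    473) echo "performance-standby" ;;  # Enterprise performance standby
    501) echo "uninitialized" ;;
    503) echo "sealed" ;;
    *)   echo "unknown" ;;
  esac
}

# Usage against a live node (address is an assumption):
# code=$(curl -s -o /dev/null -w '%{http_code}' http://vault.service:8200/v1/sys/health)
# vault_role_from_health_code "$code"
```

An Active-only load balancer check corresponds to accepting just the "active" case; a Standby-allowed check also accepts "standby".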

Failover Flow and Election Principles

When the Active node becomes unresponsive, the Standby nodes elect a new Active. With Integrated Storage (Raft), the election needs a majority (quorum) of voting nodes; this prevents two leaders from accepting writes at the same time and so protects storage consistency. An odd number of nodes (3, 5, 7, ...) is therefore recommended.

For planned maintenance, manually step down the leader and hand off the role to a Standby before the work to minimize downtime. Election and log-replay convergence times depend on network latency and load, so standardize monitoring and wait procedures to meet your SLOs.

Prefer an odd-numbered cluster: an even node count raises the quorum size without raising the number of failures the cluster can tolerate.
Run planned outages safely as: step-down -> LB swap -> work -> rejoin.
Monitor both the leader API and storage health together.
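
The odd-number recommendation follows from simple majority arithmetic. As a small sketch (the function names quorum and failure_tolerance are made up for this example), the votes required and the number of node losses a cluster survives can be computed like this:

```shell
# quorum N: votes needed for a majority among N voting nodes.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# failure_tolerance N: voting nodes that can be lost while a majority remains.
failure_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

for n in 3 4 5; do
  echo "nodes=$n quorum=$(quorum "$n") tolerance=$(failure_tolerance "$n")"
done
# nodes=3 quorum=2 tolerance=1
# nodes=4 quorum=3 tolerance=1
# nodes=5 quorum=3 tolerance=2
```

Going from 3 to 4 nodes raises the quorum from 2 to 3 but still tolerates only one failure, which is why the fourth node buys nothing for availability.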

Common commands used in failover operations

# Planned failover (step down the leader)
vault operator step-down

# Health of Raft peers (reachability / voting rights)
vault operator raft list-peers

# Raft health and last log index per node (Integrated Storage autopilot)
vault operator raft autopilot state

# Take a snapshot as a backup before maintenance
vault operator raft snapshot save /tmp/vault.snap
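
The "wait" part of a planned failover can be standardized as a bounded retry loop, so that automation neither proceeds too early nor hangs forever. This is only a sketch: retry_until and node_is_active are hypothetical helper names, and the node address in the usage comment is an assumption.

```shell
# retry_until ATTEMPTS DELAY CMD...: run CMD until it succeeds,
# at most ATTEMPTS times, sleeping DELAY seconds between tries.
retry_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Hypothetical check: succeeds only when the node answers the default
# health check with a 2xx status, i.e. when it is the Active node.
node_is_active() {
  curl -sf -o /dev/null "$1/v1/sys/health"
}

# Usage after "vault operator step-down" (address is an assumption):
# retry_until 30 2 node_is_active http://vault-node2:8200 || echo "no new Active within 60s"
```

Choosing the attempt count and delay from your SLO keeps the maximum wait explicit instead of burying it in ad hoc sleeps.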

Storage Patterns: Integrated Storage (Raft) and Consul

The current recommendation is Integrated Storage (Raft). It minimizes external dependencies and lets you build a self-contained HA cluster. Leader election is handled by the built-in Raft, achieving high reliability with a simple architecture.

On the other hand, in environments already standardized on Consul, using Consul for storage remains a valid choice. Pick the option that matches your network design and failure-domain isolation strategy.

Integrated Storage: few dependencies, tends to simplify build-out and operations.
Consul: integrates easily with an existing service mesh / catalog.
Active/Standby fundamentals are the same either way (writes go only to Active).

Vault server configuration examples (Raft vs Consul)

# Raft (Integrated Storage) example
listener "tcp" {
  address     = "0.0.0.0:8200"
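  tls_disable = 1  # lab only; enable TLS in production
}
api_addr = "http://vault-node1:8200"
# Cluster (node-to-node) traffic always uses TLS; https is the conventional scheme here
cluster_addr = "https://vault-node1:8201"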
storage "raft" {
  path = "/opt/vault/data"
  # You can use retry_join etc. to configure peer joining
  # retry_join { leader_api_addr = "http://vault-nodeA:8200" }
}

# Consul example (when leveraging existing Consul)
# storage "consul" {
#   address = "consul.service:8500"
#   path    = "vault/"
# }

Load Balancer Design and Rolling Upgrade Procedure

If clients cannot handle redirects or retries, the safe design is to have the LB route only to the Active node. Use health checks to exclude Standby nodes and to switch to the new Active as soon as roles change. If you rely on request forwarding instead, the LB can spread connections evenly across all nodes, but every request still ends up at the Active node, so this distributes connections rather than load.

Run rolling upgrades starting from Standby nodes, then step down the Active last and upgrade it. Insert health, leader, and functional tests between each step and verify you stay within your SLO and maintenance window.

Use /v1/sys/health for LB health judgement (Active-only or Standby-allowed).
Default upgrade order: Standby -> Standby -> Active (step-down).
Automation: predefine health-stability waits, error-budget monitoring, and rollback procedures.
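
The default upgrade order above can be derived mechanically from the current roles. A minimal sketch, assuming the roles have already been collected as "name role" lines (the function name upgrade_order, the node names, and the role labels are all hypothetical):

```shell
# upgrade_order: read "name role" lines on stdin and print the node names
# with every non-active node first and the active node last.
upgrade_order() {
  local name role active=""
  while read -r name role; do
    if [ "$role" = "active" ]; then
      active="$name"
    else
      echo "$name"
    fi
  done
  if [ -n "$active" ]; then
    echo "$active"
  fi
}

printf 'node1 active\nnode2 standby\nnode3 standby\n' | upgrade_order
# node2
# node3
# node1
```

The Active node comes last so that it can be stepped down only after the already upgraded Standby nodes are ready to take over.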

HAProxy example that lets only Active through (sketch)

backend vault
  option httpchk GET /v1/sys/health
  http-check expect status 200
  server node1 vault-node1:8200 check
  server node2 vault-node2:8200 check
  server node3 vault-node3:8200 check

# Note: by default /v1/sys/health returns 200 only on the Active node (a Standby returns 429), so only Active passes this check (adjust to your environment)

Exam Key Points and Pitfalls

Writes always go only to Active. Standby either forwards/relays, or with Enterprise's Performance Standby handles some reads. Memorize this division of roles first. DR replication (primary/secondary) and HA (Active/Standby) are separate concepts, and exam questions love to mix them up.

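A cluster is initialized once, and every node must be unsealed before it can take part in the cluster. With Integrated Storage, additional nodes join the existing cluster (for example via retry_join) rather than being initialized themselves. Adopting auto-unseal, strictly managing unseal keys, and standardizing leader step-down and upgrade procedures are common topics in both operations and the exam.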

Active is the sole write endpoint. Standby is primarily for forwarding.
LB design: route only to Active if redirects are not supported; otherwise use forwarding.
Do not confuse DR with HA (a classic exam trick).

Common commands for exam prep (in a lab environment)

# Initialize (lab example; in production, design thresholds and use auto-unseal)
vault operator init -key-shares=1 -key-threshold=1

# Unseal (repeat with different key shares until the threshold is met; once is enough with the lab settings above)
vault operator unseal

# Check status
vault status

Check Your Understanding

Ops

Question 1

Your application has no redirect or retry logic, and write requests to Vault must always go only to the Active node. Which load balancer design is most appropriate?

A. Use /v1/sys/health with a setting that does not treat Standby as UP, allowing only Active through
B. Enable Standby request forwarding and have the LB round-robin across all nodes
C. Skip health checks and send traffic to whichever node booted first
D. Include the DR secondary as a target to boost availability

Correct answer: A

If clients do not support redirects or retries, the safe approach is for the LB to route only to the Active node. Using /v1/sys/health with a check that does not consider Standby as UP means the LB sends write requests only to Active. B sends requests to Standby nodes and relies on forwarding, which violates the requirement that requests go only to Active. C has no way to detect a role change, so it cannot fail over. D confuses DR with HA and is wrong.

Frequently Asked Questions

What is the difference between Standby and Performance Standby?

Neither node is the leader, but a Standby mainly forwards requests to the Active node. A Performance Standby (Enterprise) can handle some read operations locally and is used to scale out read performance. In both cases, only the Active node accepts writes.

How do I identify the current Active node?

Query the leader API. Hitting /v1/sys/leader with curl returns ha_enabled, is_self, leader_address, and so on. On the CLI, vault status also helps you identify the role.

What happens if the storage backend fails?

Vault's consistency depends on storage. If quorum cannot be reached, writes stop and leader election cannot succeed. Design around a redundant storage layout (odd-numbered Raft nodes or a highly available Consul) with health monitoring.


Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
