Vault DR Replication: Standby Cluster (2026)

DR Replication is a Vault Enterprise feature that replicates data asynchronously and one-way so that service can continue through a disaster. Under normal conditions the secondary stays on standby; during an outage it is promoted and takes over as the primary.

This article draws on behavior documented as stable in the official documentation and covers design fundamentals, setup procedures, failover and failback, monitoring and drills, and a comparison of DR and Performance Replication, with both certification prep and real-world operations in mind.

DR Replication Basics and Exam Hot Spots

Vault DR Replication asynchronously replicates the same dataset to a different region or cluster. The secondary stays on standby in normal operation and is promoted during an outage to resume service. The secondary does not serve client requests until it is promoted.

From an Ops perspective, the key items are the roles (DR Primary / DR Secondary), promote/demote operations, RPO/RTO thinking, TLS and certificate prerequisites, and the difference from Performance Replication. The exam frequently tests these conceptual distinctions.

Purpose: business continuity in the face of a site failure. Standby during normal operation, cutover only on a failure.
Communication: asynchronous and one-way (Primary → Secondary).
Secondary behavior: client requests are not served in principle until promotion (standby).
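RPO/RTO: because replication is asynchronous, RPO is greater than zero: writes the secondary had not yet received when the primary failed are lost, typically a few seconds' worth on a healthy link. RTO is the time to promote plus the time to cut over routing (automation shortens it).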
Storage: Integrated Storage (Raft) is recommended. Designs using Consul are possible, but compare operational requirements before choosing.
Caveat: DR and Performance have different goals. DR is for disaster recovery, Performance is for read scale.

Minimal commands to check status (preparation phase)

export VAULT_ADDR=https://primary.example.com:8200
export VAULT_TOKEN=<admin_or_appropriate_token>

# Check DR status (on Primary)
vault read sys/replication/dr/status

# Check server state
vault status
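
The status check can also be wrapped in a small guard for scripts and runbooks. The sketch below assumes the status response carries a mode field (primary, secondary, or disabled); the function name expect_dr_mode is hypothetical and not part of Vault.

```shell
# Hypothetical helper: succeed only if this cluster reports the expected DR mode.
# Assumes VAULT_ADDR is exported and that the status response has a "mode" field.
expect_dr_mode() {
  expected=$1
  actual=$(vault read -field=mode sys/replication/dr/status) || return 2
  if [ "$actual" = "$expected" ]; then
    echo "OK: DR mode is $actual"
  else
    echo "MISMATCH: expected $expected, got $actual" >&2
    return 1
  fi
}
```

For example, expect_dr_mode primary could run at the start of a setup or drill script so that it stops early when pointed at the wrong cluster.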

Architecture Design: Inter-Region DR Cluster Topology

In production, the DR secondary is placed in a different region or data center. The network must allow TLS connectivity in both directions, because the roles reverse after a failover, and the hostnames in the API and cluster addresses must match the SANs in the certificates.

Using Integrated Storage (Raft) simplifies the setup because it removes the dependency on an external K/V store. The two settings to get right are cluster_addr (port 8201, server-to-server traffic) and api_addr (port 8200, client traffic).

Required ports: 8200/TCP (API), 8201/TCP (cluster traffic: Raft and replication).
Certificates: align the hostnames in api_addr/cluster_addr with SANs, and consider mutual TLS.
Monitoring: use /sys/replication/dr/status and /sys/metrics (Prometheus), described below.
Routing: automate DNS / load balancer cutover during failover.
Drills: perform planned promote/demote exercises every quarter.
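
The certificate item above can be checked mechanically before enabling replication. The following is a minimal sketch that extracts the hostname from an address and looks for it in a certificate's SAN text; the helper names addr_host and san_has_host are hypothetical, matching is exact only (no wildcard or IP handling), and the SAN text is assumed to be in the form printed by openssl x509 -ext subjectAltName.

```shell
# Hypothetical helpers for checking that an address's hostname appears in a SAN list.

# Print the hostname part of an address such as https://vault-a.example.com:8201
addr_host() {
  h=${1#*://}               # drop the scheme
  h=${h%%/*}                # drop any path
  printf '%s\n' "${h%%:*}"  # drop the port
}

# Succeed if host $1 is listed as a DNS entry in SAN text $2.
# Exact match only: wildcard and IP entries are not handled in this sketch.
san_has_host() {
  printf '%s\n' "$2" | tr ',' '\n' | sed 's/^[[:space:]]*//' | grep -qxF "DNS:$1"
}

# Example (assumes OpenSSL 1.1.1 or later for the -ext option):
#   san=$(openssl x509 -in /opt/vault/tls/tls.crt -noout -ext subjectAltName)
#   san_has_host "$(addr_host https://vault-a.example.com:8200)" "$san" \
#     || echo "api_addr host missing from SANs"
```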

DR Replication (inter-region, Raft)

server.hcl (excerpt, example of stable parameters)

storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault-node-1"
}

api_addr     = "https://vault-a.example.com:8200"
cluster_addr = "https://vault-a.example.com:8201"

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_disable   = false
  tls_cert_file = "/opt/vault/tls/tls.crt"
  tls_key_file  = "/opt/vault/tls/tls.key"
}

Setup Procedure (Production-Oriented, Safe Path)

Below is the minimal procedure. Run it after verifying TLS, policies, audit logging, and storage health (Raft/Consul). A dual-control admin approval workflow makes it safer.

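Generate the secondary activation token on the primary and use it on the secondary for activation. The token is response-wrapped, single-use, and expires after its TTL (30 minutes by default), so generate it only when you are ready to activate and treat it as a secret.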

Prerequisite: both clusters are initialized, unsealed, and healthy (vault status shows Initialized true and Sealed false).
Step 1: Enable the primary as a DR Primary.
Step 2: Generate the secondary activation token and transfer it over a secure channel.
Step 3: Enable the DR Secondary on the secondary side.
Step 4: Verify status on both sides and wait for replication to converge.
Step 5: Verify the routing design (only the Primary accepts clients in normal operation).

Example commands to enable DR Replication

# Primary side
export VAULT_ADDR=https://primary.example.com:8200
export VAULT_TOKEN=<admin_token>

# 1) Enable DR Primary
vault write -f sys/replication/dr/primary/enable

# 2) Generate Secondary activation token (id is required; the token is returned response-wrapped)
DR_TOKEN=$(vault write -field=wrapping_token sys/replication/dr/primary/secondary-token id="dr-secondary")

# Secondary side
export VAULT_ADDR=https://secondary.example.com:8200
export VAULT_TOKEN=<admin_token_on_secondary>

# 3) Enable DR Secondary (this erases the secondary's existing data; set primary_api_addr per your environment)
vault write sys/replication/dr/secondary/enable \
  token="$DR_TOKEN" \
  primary_api_addr="https://primary.example.com:8200"

# 4) Check status (on both sides)
vault read sys/replication/dr/status
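
Waiting for replication to converge can be scripted as a polling loop. This is a minimal sketch that assumes a healthy DR secondary reports the state stream-wals (worth confirming against your Vault version); wait_for_dr_state and its default timings are hypothetical.

```shell
# Hypothetical helper: poll DR status until the expected state appears or we time out.
# Assumes a healthy DR secondary reports state "stream-wals"; verify this on your version.
wait_for_dr_state() {
  want=${1:-stream-wals}
  timeout_s=${2:-300}
  interval_s=${3:-5}
  waited=0
  while :; do
    state=$(vault read -field=state sys/replication/dr/status 2>/dev/null) || state=""
    if [ "$state" = "$want" ]; then
      echo "DR state is $state after ${waited}s"
      return 0
    fi
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "timed out after ${waited}s; last state: ${state:-unknown}" >&2
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
}
```

Run with VAULT_ADDR pointing at the secondary, for example wait_for_dr_state stream-wals 600 10. A non-zero exit lets a pipeline stop before any routing change is attempted.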

Failover and Failback (Planned / Emergency)

For a planned failover, confirm that the DR secondary is fully caught up, then promote the secondary and switch traffic. The basic flow is the same for an emergency failover, except that writes not yet replicated when the primary failed are lost, so the effective RPO is the replication lag at that moment.

For failback, the safe operational pattern is to rejoin the old primary as a secondary under the new primary. Rather than a simple “reset to original”, follow the demote and rejoin procedure.

Prerequisite for planned failover: replication lag is minimal (verify via DR status).
After promotion: switch DNS / load balancer, update client and CI/CD targets.
Handling the old primary: demote, then reconfigure as a secondary under the new primary.
Rollback: monitor the new primary's health, and if issues arise, apply the same procedure in the reverse direction.
Auditing: record and preserve promote/demote events via audit devices.
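
The "replication lag is minimal" prerequisite can be made concrete by comparing WAL indexes on the two clusters. The sketch below assumes the primary's DR status exposes last_wal and the secondary's exposes last_remote_wal; field names can differ between Vault versions, so check your own status output first. The helper names, the address variables, and the threshold are hypothetical.

```shell
# Hypothetical preflight: compare WAL indexes before a planned failover.
# Assumption: the primary reports last_wal and the secondary last_remote_wal
# in sys/replication/dr/status; confirm the field names on your version.
PRIMARY_ADDR=${PRIMARY_ADDR:-https://primary.example.com:8200}
SECONDARY_ADDR=${SECONDARY_ADDR:-https://secondary.example.com:8200}

# Print how many WAL entries the secondary is behind (never negative).
wal_lag() {
  lag=$(( $1 - $2 ))
  [ "$lag" -lt 0 ] && lag=0
  echo "$lag"
}

# Succeed only if the lag is at or below the allowed maximum (default 0).
failover_preflight() {
  max_lag=${1:-0}
  p=$(VAULT_ADDR=$PRIMARY_ADDR vault read -field=last_wal sys/replication/dr/status) || return 2
  s=$(VAULT_ADDR=$SECONDARY_ADDR vault read -field=last_remote_wal sys/replication/dr/status) || return 2
  lag=$(wal_lag "$p" "$s")
  echo "primary=$p secondary=$s lag=$lag"
  [ "$lag" -le "$max_lag" ]
}
```

A quiet or freshly restarted secondary may not report a current index, so a large lag is best treated as a prompt to investigate rather than as proof of missing data.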

Representative commands for failover and failback

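# 1) Planned failover (promote Secondary)
export VAULT_ADDR=https://secondary.example.com:8200
# A DR secondary does not accept ordinary tokens. Promotion takes a DR operation
# token, generated beforehand with `vault operator generate-root -dr-token`.
DR_OP_TOKEN=<dr_operation_token_for_secondary>
# Promote
vault write sys/replication/dr/secondary/promote dr_operation_token="$DR_OP_TOKEN"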

# 2) Demote the old primary to prepare for rejoin
export VAULT_ADDR=https://old-primary.example.com:8200
export VAULT_TOKEN=<admin_token_on_old_primary>
# Demote
vault write -f sys/replication/dr/primary/demote

# 3) Issue a new secondary token on the new primary, then repoint the old primary
export VAULT_ADDR=https://new-primary.example.com:8200
export VAULT_TOKEN=<admin_token_on_new_primary>
# id is required and must be unique; the token is returned response-wrapped
NEW_TOKEN=$(vault write -field=wrapping_token sys/replication/dr/primary/secondary-token id="old-primary")

export VAULT_ADDR=https://old-primary.example.com:8200
# The demoted cluster is now a DR secondary, so it needs its own DR operation token
OLD_DR_OP_TOKEN=<dr_operation_token_for_old_primary>
# Update to point at the new primary (use update-primary, not enable)
vault write sys/replication/dr/secondary/update-primary \
  dr_operation_token="$OLD_DR_OP_TOKEN" \
  token="$NEW_TOKEN" \
  primary_api_addr="https://new-primary.example.com:8200"

# 4) Check status on both sides
vault read sys/replication/dr/status
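
Operations on a DR secondary, such as promote and update-primary, expect a DR operation token rather than an ordinary Vault token. A minimal sketch of generating one with vault operator generate-root -dr-token follows. The wrapper function names are hypothetical, and how many key shares must be supplied depends on your seal configuration.

```shell
# Hypothetical wrappers around the DR operation token workflow.
# Run against the DR secondary (VAULT_ADDR must point at it).

# Step 1: start generation; the output includes a nonce and a one-time password (OTP).
dr_token_init() {
  vault operator generate-root -dr-token -init
}

# Step 2: repeat once per key holder until the threshold is met.
# Omitting the key argument makes the CLI prompt for it, keeping it out of shell history.
dr_token_provide_key() {
  vault operator generate-root -dr-token -nonce="$1"
}

# Step 3: decode the encoded token returned by the final key submission.
dr_token_decode() {
  vault operator generate-root -dr-token -decode="$1" -otp="$2"
}
```

The decoded value is what gets passed as dr_operation_token. Generating it ahead of a drill, and revoking or letting it lapse afterwards, keeps the step off the critical path during a real failover.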

Monitoring, Drills, and Security Essentials

Monitoring centers on replication health (status, lag, connection errors) and the health of certificates and time synchronization. Collect Prometheus-format metrics and alert on thresholds.

For drills, regularly run planned failover → failback to measure actual RTO and keep runbooks fresh. Snapshots do not replace DR, but they are useful for incident analysis and worst-case recovery.

Status collection: regularly collect vault read sys/replication/dr/status (mode/state/known secondaries, etc.).
Metrics: scrape /sys/metrics?format=prometheus (requires prometheus_retention_time to be set in the telemetry stanza; mind network and certificate settings).
Auditing: enable audit devices on both clusters so promotions and demotions can be tracked.
Certificates: prepare expiration alerts and a rollover plan for CA changes.
Backup: take Raft snapshots regularly as a supplement to DR.

Practical monitoring and backup snippets

# DR status (JSON output)
vault read -format=json sys/replication/dr/status | jq .

# Prometheus metrics (set appropriate token/headers)
curl -s -H "X-Vault-Token: $VAULT_TOKEN" \
  "$VAULT_ADDR/v1/sys/metrics?format=prometheus" | grep replication

# Raft snapshot (supplementary backup)
vault operator raft snapshot save /backup/vault-$(date +%Y%m%d%H%M%S).snap
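
Regular snapshots accumulate, so a retention step is usually paired with the save command. The following is a minimal sketch that assumes the vault-YYYYmmddHHMMSS.snap naming shown above, so that lexical order equals chronological order; prune_snapshots and the retention count are hypothetical.

```shell
# Hypothetical helper: keep only the newest N snapshots in a directory.
# Relies on timestamped names (vault-YYYYmmddHHMMSS.snap) sorting oldest-first.
prune_snapshots() {
  dir=$1
  keep=$2
  set -- "$dir"/vault-*.snap        # glob expands in sorted order
  [ -e "$1" ] || return 0           # no snapshots: nothing to do
  excess=$(( $# - keep ))
  while [ "$excess" -gt 0 ]; do
    rm -f -- "$1"                   # oldest remaining snapshot
    shift
    excess=$((excess - 1))
  done
}
```

For example, prune_snapshots /backup 14 after each save would keep the fourteen most recent files.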

DR vs. Performance (A Common Exam Pitfall)

DR is a standby system for disaster recovery; Performance is a distributed system for read scale. Because the design goals differ, client availability, write capability, and cutover operations are fundamentally different.

The exam often asks whether the secondary can serve requests during normal operation and which operation performs the promotion. The table below clarifies these differences.

A DR Secondary does not serve client requests until promotion.
A Performance Secondary serves client requests locally and forwards writes that change shared state to the primary.
Cutover ops: DR uses secondary/promote; Performance usually needs no cutover (topology preserved).

Item	DR Replication	Performance Replication	Single Cluster (reference)
Purpose	Disaster recovery (regional-failure protection)	Read scale / geographic distribution	Availability (within a single site)
Client requests in normal operation	Not served (standby until promotion)	Served (reads handled locally)	Served (as usual)
Writes	Not accepted (accepted after promotion)	Forwarded to the primary	Accepted
Cutover operation	secondary/promote and routing cutover	Not needed (topology preserved)	Not needed
RPO/RTO tendency	Asynchronous, so RPO>0; RTO = promotion + cutover	RPO>0 (asynchronous)	RPO/RTO depend on intra-site HA
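
Because the two replication types are configured independently, it can help to print a cluster's role in each before running any cutover step. A minimal sketch, assuming both status endpoints expose a mode field; replication_roles is a hypothetical name.

```shell
# Hypothetical helper: show this cluster's role in each replication type.
# Assumes both status endpoints expose a "mode" field.
replication_roles() {
  dr=$(vault read -field=mode sys/replication/dr/status) || dr=unknown
  perf=$(vault read -field=mode sys/replication/performance/status) || perf=unknown
  echo "dr=$dr performance=$perf"
}
```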

(Reference) Minimal example of enabling Performance Replication

# Enable Performance Primary on the primary cluster
vault write -f sys/replication/performance/primary/enable

# Generate a token for the Secondary (id is required; the token is returned response-wrapped)
P_TOKEN=$(vault write -field=wrapping_token sys/replication/performance/primary/secondary-token id="perf-secondary")

# Enable Performance Secondary on a separate cluster
vault write sys/replication/performance/secondary/enable \
  token="$P_TOKEN" \
  primary_api_addr="https://primary.example.com:8200"

Check Your Understanding

Ops

Question 1

In a Vault Enterprise DR Replication setup, which procedure is most appropriate for performing a planned failover with minimal data loss?

A. Verify that the DR secondary is fully caught up, then run sys/replication/dr/secondary/promote and switch DNS/load balancer to the new primary
B. Run vault operator seal on the DR primary to force-stop it and let clients be automatically routed to the DR secondary
C. Disable DR on the DR primary, then manually re-initialize the DR secondary
D. Newly enable Performance Replication and switch the read target

Correct answer: A

The correct path for a planned failover is to confirm that the DR secondary is caught up, promote it via secondary/promote, and switch routing. Force-stopping the primary with seal is not a recommended procedure and may worsen RPO/RTO. Disabling DR or newly enabling Performance does not match the goal.

Frequently Asked Questions

Can a DR secondary serve read-only traffic during normal operation?

No. A DR secondary stays in standby until it is promoted and does not handle client requests in principle. If your goal is read scale, consider Performance Replication instead.

How do I perform a failback?

Issue a secondary token on the new primary, demote the old primary, then rejoin it as a secondary under the new primary using sys/replication/dr/secondary/update-primary, which requires a DR operation token on the demoted cluster. Rather than directly “reverting”, follow the promote/demote procedure for safety.

Can backups (Raft snapshots) replace DR?

No. DR aims for low-RTO service continuity, while snapshots are a supplement for worst-case recovery, auditing, and verification. Use both: DR for business continuity, snapshots for an additional safety net.


Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
