Vault DR Promotion Ops Guide: Failover Procedure and Practical Checklist

2026-04-19
NicheeLab Editorial Team

This article walks through the concrete procedure for promoting a DR secondary to primary in a HashiCorp Vault Enterprise Disaster Recovery (DR) replication environment, balancing real-world ops practice with exam-relevant patterns.

All commands use endpoints from the official documentation. We also cover the decision criteria for avoiding split-brain during a network partition and the practical points of LB/DNS cutover.

Background and Terminology: Where DR Replication and Promotion Fit

Vault Enterprise offers Performance Replication and DR Replication. DR keeps a standby cluster for disaster recovery and, on failure, promotes the secondary into the primary role. During normal operation the DR secondary is a pure standby that does not even accept reads — it only activates at cutover.

DR promotion is invoked explicitly against the secondary via API/CLI. Around the cutover, update client endpoints (LB/DNS) and network-isolate the former primary before re-joining it — that is the safe pattern.

  • Scope is Vault Enterprise DR replication (official endpoint: sys/replication/dr/*).
  • Run the promotion on the leader node of the DR secondary cluster.
  • At cutover, isolate the old primary (remove it from the LB, block at the firewall) to prevent split-brain.

Aspect | DR Replication | Performance Replication
Primary purpose | Disaster-time primary substitute (standby) | Read scale-out / geo-distribution (used in steady state)
Steady-state I/O | Essentially no client I/O (standby) | Reads served on the secondary (with some limits)
Cutover op | Explicitly run promote on the secondary (sys/replication/dr/secondary/promote) | Promotion is also possible (sys/replication/performance/secondary/promote), but secondaries already serve clients in steady state
RPO profile | Near zero when healthy (asynchronous WAL streaming; lag grows during network interruptions) | Replicas serve in steady state; RPO depends on use case
Real-world RTO | On the order of a few minutes (including verification and LB cutover) | No cutover needed for reads (steady-state operation)

DR topology (before cutover)

[Figure: DC-A (Primary) and DC-B (DR secondary) each run a Vault cluster [Leader][Standby] behind its own LB. WAL streams from DC-A to DC-B. Client traffic is pointed at the DC-B LB at cutover.]

Pre-checks for the Failover Decision

In an unplanned failover, the most dangerous failure mode is a partial recovery of the old primary. Before promoting the DR side, always verify DR-side health, replication lag, and reachability of the old primary.

Automate the checks via JSON output so the decision does not depend on individual operators.

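  • DR secondary state check: vault status (Sealed: false; HA Mode: active on the leader, standby on the other nodes), sys/replication/dr/status (mode: secondary, state: stream-wals when in sync, etc.)
  • Identify the DR secondary leader: vault operator raft list-peers -dr-token=<DR operation token> to find the leader and voters (when using Raft). A DR secondary rejects this call without a DR operation token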
  • Plan for cutting off the old primary: stop/switch LB health probes, temporarily block at FW/SG, and announce the maintenance window
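  • Replication lag: inspect sys/replication/dr/status on the secondary (e.g. state, connection_state, last_remote_wal) and compare last_remote_wal with the last_wal value most recently recorded from the primary. If lag is large, decide whether the resulting RPO is acceptable
  • Ops credentials: have a DR operation token ready. A DR secondary does not accept ordinary tokens, so either generate the token on the secondary with a quorum of unseal/recovery keys (sys/replication/dr/secondary/generate-operation-token) or issue a batch DR operation token on the primary in advance under a least-privilege policy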
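
The lag check in the list above can be reduced to comparing two WAL indexes. The following is a minimal sketch, assuming you already have the primary's last recorded last_wal (for example from monitoring) and the secondary's last_remote_wal; the function names and the threshold are hypothetical, and the field names should be verified against the status output of your Vault version.

```sh
# wal_lag PRIMARY_LAST_WAL SECONDARY_LAST_REMOTE_WAL
# Prints how many WAL entries the secondary is behind (never negative).
wal_lag() (
  lag=$(( $1 - $2 ))
  if [ "$lag" -lt 0 ]; then
    lag=0
  fi
  echo "$lag"
)

# lag_acceptable LAG MAX_LAG
# Exit status 0 when LAG is within the agreed budget, 1 otherwise.
lag_acceptable() {
  [ "$1" -le "$2" ]
}

# Example with made-up numbers (MAX_LAG=50 is an assumed team policy):
#   lag=$(wal_lag 1500 1480)
#   if lag_acceptable "$lag" 50; then echo "proceed"; else echo "escalate"; fi
```

Keeping the threshold as an explicit argument makes the RPO decision a reviewed number rather than an operator's judgment call during the incident.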

DR Promotion (Failover) Procedure

This is the standard flow for safely cutting over with minimum downtime. Automating LB/DNS changes ahead of time stabilizes recovery times.

Run the promotion operation on the leader node of the DR secondary cluster.

  • 1. Stop traffic to the old primary (remove from LB / fail its health check / isolate it from the network).
  • 2. Final check of the DR secondary state (vault status, sys/replication/dr/status).
  • 3. Run the promotion on the DR secondary (see code below).
  • 4. Confirm the mode has transitioned to primary and HA state has moved to active.
  • 5. Switch LB/DNS targets to the DR side. Confirm 200 from /sys/health and monitor client reconnection.
  • 6. Watch the audit log and metrics (auth success rate, 400/429/5xx). Communicate throttling or backoff guidance to clients if needed.

Runbook (promotion on the DR secondary)

# Point VAULT_ADDR at the leader of the DR secondary first.
# DR_OP_TOKEN must hold a DR operation token (or a pre-issued batch DR
# operation token); a DR secondary does not accept ordinary tokens.
set -euo pipefail
: "${VAULT_ADDR:?set VAULT_ADDR}" "${DR_OP_TOKEN:?set DR_OP_TOKEN}"

# 0) Remove the old primary from the LB (delegated to LB-side automation)
# ... (invoke the LB/DNS automation script)

# 1) Verify DR secondary health
vault status
# Integrated Storage (Raft) only; not needed for Consul HA
vault operator raft list-peers -dr-token="$DR_OP_TOKEN" || true

# 2) Check DR status (JSON for machine-readable health)
vault read -format=json sys/replication/dr/status | jq .

# 3) Run promotion (valid only on a secondary)
vault write sys/replication/dr/secondary/promote dr_operation_token="$DR_OP_TOKEN"

# 4) Wait for state to settle and re-check
sleep 3
vault read sys/replication/dr/status   # Expect mode: primary
vault status                           # Expect Sealed: false and HA Mode: active on the leader

# 5) Cut LB/DNS over to the DR side (probe via /v1/sys/health)
# Example: active node returns 200, standby 429, sealed 503 by default (tunable per environment)

# 6) Smoke-test the key paths (e.g. auth, KV read, Transit sign)
# Run vault login, vault kv get, vault write transit/sign, etc.
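
A fixed sleep may be too short on a busy cluster. As an alternative for the "wait for state to settle" step, here is a polling sketch. The function name and the MODE_CMD hook are hypothetical; the default probe assumes that vault read -field=mode sys/replication/dr/status prints the current DR mode.

```sh
# MODE_CMD is a hypothetical hook: the command whose output is the DR mode.
MODE_CMD=${MODE_CMD:-"vault read -field=mode sys/replication/dr/status"}

# wait_for_mode WANTED [TRIES] [DELAY_SECONDS]
# Polls MODE_CMD until it prints WANTED; returns 1 if it never does.
wait_for_mode() {
  wanted=$1
  tries=${2:-20}
  delay=${3:-3}
  i=0
  mode=""
  while [ "$i" -lt "$tries" ]; do
    # Word splitting of MODE_CMD is intentional (command plus arguments).
    mode=$($MODE_CMD 2>/dev/null || true)
    if [ "$mode" = "$wanted" ]; then
      echo "mode=$mode after $i retries"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for mode=$wanted (last: ${mode:-unknown})" >&2
  return 1
}

# Example: wait up to about 60 seconds for the promoted cluster.
#   wait_for_mode primary 20 3
```

Bounding the number of attempts keeps the runbook from hanging indefinitely if the promotion did not take effect.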

Post-cutover Verification and Practical Client-Switching Points

Once the DR side has become primary, verify client reconnection and the continuity of secrets operations. Pay particular attention to token/lease expiry timing and the behavior of background renewals.

Use /sys/health for health checks. Assuming the default behavior of 200 for active and 429 for standby keeps the cutover logic simple.
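
To make the health codes concrete, the sketch below maps the default /sys/health status codes to node states. The function name is hypothetical, and the mapping assumes the defaults have not been overridden with query parameters. Note that an active DR secondary answers 472 by default, so a probe expecting 200 only passes after promotion.

```sh
# classify_health_code HTTP_STATUS
# Maps a default /v1/sys/health status code to a short state label.
classify_health_code() {
  case "$1" in
    200) echo "active" ;;
    429) echo "standby" ;;
    472) echo "dr-secondary" ;;
    473) echo "performance-standby" ;;
    501) echo "uninitialized" ;;
    503) echo "sealed" ;;
    *)   echo "unknown" ;;
  esac
}

# Example probe (requires curl and a reachable VAULT_ADDR):
#   code=$(curl -s -o /dev/null -w '%{http_code}' "$VAULT_ADDR/v1/sys/health")
#   classify_health_code "$code"
```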

  • Client cutover: keep LB/DNS TTLs short and measure the time until clients re-resolve.
  • Auth verification: confirm that login succeeds via the main Auth Methods (OIDC, AppRole, etc.).
  • Secrets verification: smoke-test read/write/sign on representative paths such as KV, Transit, and Database.
  • Audit and metrics: confirm that audit devices on the new primary are writing entries and that metrics pipelines (e.g. Prometheus) point at the new primary.

Handling the Old Primary and How to Think About Fallback (Re-join)

When the old primary comes back, do not immediately return it to the network. First inspect its state in an isolated environment. The safe pattern is to start from the new primary (the promoted DR side), issue a fresh DR secondary token, and re-register the old primary as a secondary.

Because the exact re-join steps and API paths differ by version, follow the official documentation procedure for your specific Vault version strictly. If data divergence is suspected, prioritize a snapshot / re-sync plan over a quick re-join.

  • Bring the old primary up while still isolated from the network and confirm no client traffic accidentally reaches it.
  • On the new primary, issue a DR secondary registration token (the official secondary-token API, sys/replication/dr/primary/secondary-token).
  • On the old primary, demote it from the DR primary role (sys/replication/dr/primary/demote) and re-enable it as a secondary using the issued token (use the API appropriate for your version).
  • Only return it to the network after re-sync completes and the cluster reports healthy.
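
The re-join steps above can be sketched as a dry-run plan that prints commands without executing anything. This is only an illustration under assumptions: the function name, the secondary id, and the address are hypothetical, and the update-primary step reflects one commonly documented path, so check the procedure for your Vault version before running any of it.

```sh
# print_rejoin_plan SECONDARY_ID NEW_PRIMARY_API_ADDR
# Dry run only: prints the candidate commands, executes nothing.
# Values in angle brackets are placeholders to fill in by hand.
print_rejoin_plan() {
  cat <<EOF
# On the new primary: issue a registration token for the returning cluster
vault write sys/replication/dr/primary/secondary-token id=$1
# On the old primary, still isolated: give up the DR primary role
vault write -f sys/replication/dr/primary/demote
# On the old primary: point it at the new primary with the issued token
vault write sys/replication/dr/secondary/update-primary dr_operation_token=<DR_OP_TOKEN> token=<WRAPPED_SECONDARY_TOKEN> primary_api_addr=$2
EOF
}

# Example (hypothetical id and address):
#   print_rejoin_plan old-dc-a https://vault-dc-b.example.com:8200
```

Printing the plan first gives reviewers a chance to confirm the target of each command before anything touches the isolated cluster.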

Exam Hotspots and Operational Pitfalls

Ops-focused exams frequently test the correct endpoint, which node executes the operation, and split-brain countermeasures. Real operations also tend to fail at exactly those points.

Two facts in particular are worth committing to memory: DR promotion runs on the DR secondary side, and the old primary must be reliably isolated before cutover.

  • Correct API path: sys/replication/dr/secondary/promote (run on the secondary).
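  • Permissions: promotion requires a DR operation token, passed as dr_operation_token. Generate it on the secondary with a quorum of unseal/recovery keys, or pre-issue a batch DR operation token on the primary whose policy allows update on that path.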
  • Health codes: /sys/health defaults to active=200, standby=429, sealed=503.
  • LB cutover order: detach the old side → promote the DR side → bring the DR side into the LB.
  • Audit log continuity: verify in advance that audit-device sinks (S3, syslog, etc.) are reachable from the DR side.

Check with a Practice Question

Question 1 (Ops)

In a Vault Enterprise DR replication environment, which operation correctly promotes the DR secondary to primary during a failover?

  1. Run vault write sys/replication/dr/secondary/promote dr_operation_token=<token> on the leader of the DR secondary
  2. Run vault operator raft promote on the old primary
  3. Run vault write -f sys/replication/performance/secondary/promote
  4. Run vault operator unseal -promote on the new primary candidate

Correct answer: 1

DR promotion is executed by running sys/replication/dr/secondary/promote on the DR secondary side. There is no raft promote, the path is not under performance replication, and unseal has no promote option.

Frequently Asked Questions

Do existing tokens and leases expire after promotion?

Because DR replication mirrors cluster state, tokens and leases do not all expire at the instant of promotion. Existing tokens and leases continue to honor their remaining TTL. However, anything that expired during the outage will not be revived. Also verify that each secrets engine's backend remains reachable from the new primary.

How can we minimize downtime?

Automate LB/DNS switching, use /sys/health for unambiguous health detection, and rehearse smoke tests beforehand. Strictly follow the order: detach the old primary → promote the DR secondary → expose the new destination. Make sure smoke tests over key paths (auth, KV, Transit) can complete in 1-2 minutes.
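
To rehearse the 1-2 minute smoke-test target, a small timing harness can help. The sketch below is an assumption-laden example: the function name is hypothetical, each check is passed as a single command string run with sh -c, and the example paths are placeholders for your own key paths.

```sh
# run_smoke_tests BUDGET_SECONDS CMD...
# Runs each CMD with `sh -c`, reports PASS/FAIL, and succeeds only if
# every check passed and the total time stayed within the budget.
run_smoke_tests() {
  budget=$1
  shift
  start=$(date +%s)
  failed=0
  for cmd in "$@"; do
    if sh -c "$cmd" >/dev/null 2>&1; then
      echo "PASS: $cmd"
    else
      echo "FAIL: $cmd"
      failed=$((failed + 1))
    fi
  done
  elapsed=$(( $(date +%s) - start ))
  echo "elapsed=${elapsed}s budget=${budget}s failed=$failed"
  [ "$failed" -eq 0 ] && [ "$elapsed" -le "$budget" ]
}

# Example with a 120-second budget (paths are placeholders):
#   run_smoke_tests 120 'vault token lookup' 'vault kv get secret/smoke'
```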

Who can execute the promotion?

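You need a DR operation token, passed to sys/replication/dr/secondary/promote as dr_operation_token. A DR secondary does not accept ordinary tokens, so the token is either generated on the secondary with a quorum of unseal/recovery keys or pre-issued on the primary as a batch DR operation token. In practice, define a dedicated ops policy and execute the promotion through an auditable emergency runbook. Avoid routine use of root tokens.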
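
One way to prepare such a credential before any disaster is a batch DR operation token issued on the primary. The sketch below only renders a least-privilege policy; the function name, policy name, and role name are hypothetical, and the capability set is an assumption to verify against the official documentation for your version.

```sh
# render_failover_policy
# Prints a least-privilege policy for a batch DR operation token.
render_failover_policy() {
  cat <<'EOF'
path "sys/replication/dr/secondary/promote" {
  capabilities = ["update"]
}
path "sys/replication/dr/secondary/update-primary" {
  capabilities = ["update"]
}
EOF
}

# Example usage on the primary (names are placeholders):
#   render_failover_policy | vault policy write dr-failover -
#   vault write auth/token/roles/dr-failover \
#     allowed_policies=dr-failover orphan=true renewable=false token_type=batch
#   vault token create -role=dr-failover -ttl=8h
```

Because a batch token cannot be renewed, the TTL chosen here decides how often the emergency credential must be reissued and stored.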


