Disaster Recovery Site — your DR site isn't a dusty rack in a closet. It's a live replica that's always current.
Most DR plans are fantasies. The backup tapes are six months old. The runbook was written three years ago by someone who left. Nobody has tested a failover since the last audit. When the primary site actually dies, it takes days — sometimes weeks — to recover. With ZFS replication over WireGuard, your DR site receives incremental updates every hour. The replica is always current. The failover is a documented, tested, repeatable procedure. RTO measured in minutes. RPO equals the time since the last syncoid run.
The recipe
Step 1: Set up the DR site
# Install kldload server profile on the DR hardware
# This machine mirrors your production environment
# Create the receiving datasets — one per production server
kdir /srv/dr
kdir /srv/dr/web-prod-01
kdir /srv/dr/db-prod-01
kdir /srv/dr/app-prod-01
# Set compression — replicated data compresses further on receive
zfs set compression=zstd rpool/srv/dr
Step 2: WireGuard tunnel between sites
# On the DR site — generate keys
wg genkey | tee /etc/wireguard/dr-private.key | wg pubkey > /etc/wireguard/dr-public.key
chmod 600 /etc/wireguard/dr-private.key
# Configure the site-to-site tunnel
cat > /etc/wireguard/wg-prod.conf <<'WG'
[Interface]
PrivateKey = DR_PRIVATE_KEY_HERE
Address = 10.200.0.2/24
ListenPort = 51820
[Peer]
PublicKey = PROD_PUBLIC_KEY_HERE
Endpoint = prod-site.example.com:51820
AllowedIPs = 10.200.0.1/32,10.0.0.0/16
PersistentKeepalive = 25
WG
systemctl enable --now wg-quick@wg-prod
# Verify the tunnel
wg show wg-prod
ping -c 3 10.200.0.1
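Step 2 shows only the DR end of the tunnel; the production end needs the mirror-image config. A sketch, with the same placeholder convention as above (`dr-site.example.com` and the key placeholders are assumptions to substitute):

```shell
# On the production site -- mirror of the DR config above
cat > /etc/wireguard/wg-dr.conf <<'WG'
[Interface]
PrivateKey = PROD_PRIVATE_KEY_HERE
Address = 10.200.0.1/24
ListenPort = 51820

[Peer]
PublicKey = DR_PUBLIC_KEY_HERE
Endpoint = dr-site.example.com:51820
AllowedIPs = 10.200.0.2/32
PersistentKeepalive = 25
WG
systemctl enable --now wg-quick@wg-dr
```

Each side lists the other's public key, and AllowedIPs on the production side only needs the DR tunnel address since replication is push-only.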
Step 3: Configure syncoid replication from production
# On each production server, set up syncoid to push to the DR site
# Use the WireGuard tunnel IP (10.200.0.2)
# SSH key setup — production pushes to DR
ssh-keygen -t ed25519 -f /root/.ssh/dr-sync -N ""
ssh-copy-id -i /root/.ssh/dr-sync.pub root@10.200.0.2
# Initial full sync (this takes a while the first time)
syncoid --recursive --sshkey /root/.ssh/dr-sync \
rpool/ROOT 10.200.0.2:rpool/srv/dr/$(hostname)/ROOT
syncoid --recursive --sshkey /root/.ssh/dr-sync \
rpool/home 10.200.0.2:rpool/srv/dr/$(hostname)/home
syncoid --recursive --sshkey /root/.ssh/dr-sync \
rpool/srv 10.200.0.2:rpool/srv/dr/$(hostname)/srv
# Cron job for hourly incremental replication
cat > /etc/cron.d/syncoid-dr <<'CRON'
0 * * * * root syncoid --recursive --no-sync-snap --sshkey /root/.ssh/dr-sync rpool/ROOT 10.200.0.2:rpool/srv/dr/$(hostname)/ROOT >> /var/log/syncoid-dr.log 2>&1
10 * * * * root syncoid --recursive --no-sync-snap --sshkey /root/.ssh/dr-sync rpool/home 10.200.0.2:rpool/srv/dr/$(hostname)/home >> /var/log/syncoid-dr.log 2>&1
20 * * * * root syncoid --recursive --no-sync-snap --sshkey /root/.ssh/dr-sync rpool/srv 10.200.0.2:rpool/srv/dr/$(hostname)/srv >> /var/log/syncoid-dr.log 2>&1
CRON
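One failure mode with hourly cron jobs: a slow WAN transfer still running when the next hour fires, stacking syncoid processes. Wrapping each job in flock (from util-linux) makes a run skip instead of stack; a sketch of the ROOT line with a per-dataset lock file:

```shell
# -n = give up immediately if the lock is held (previous run still going)
0 * * * * root flock -n /run/lock/syncoid-dr-ROOT syncoid --recursive --no-sync-snap --sshkey /root/.ssh/dr-sync rpool/ROOT 10.200.0.2:rpool/srv/dr/$(hostname)/ROOT >> /var/log/syncoid-dr.log 2>&1
```

A skipped hour just means the next successful run catches up; the incremental send covers all snapshots since the last common one.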
Step 4: Monitor replication health
# On the DR site — check how current each replica is
cat > /usr/local/bin/dr-status.sh <<'STATUS'
#!/bin/bash
echo "=== DR Replication Status ==="
echo ""
for server_dir in /srv/dr/*/; do
server=$(basename "${server_dir}")
echo "--- ${server} ---"
for dataset in ROOT home srv; do
ds="rpool/srv/dr/${server}/${dataset}"
if zfs list "${ds}" &>/dev/null; then
latest=$(zfs list -H -t snapshot -o name,creation -s creation \
-r "${ds}" | tail -1)
echo " ${dataset}: ${latest}"
else
echo " ${dataset}: NOT REPLICATED"
fi
done
echo ""
done
STATUS
chmod +x /usr/local/bin/dr-status.sh
# Run it
dr-status.sh
# === DR Replication Status ===
# --- web-prod-01 ---
# ROOT: rpool/srv/dr/web-prod-01/ROOT@syncoid_2026-03-23:13:00 2026-03-23 13:00
# home: rpool/srv/dr/web-prod-01/home@syncoid_2026-03-23:13:10 2026-03-23 13:10
# srv: rpool/srv/dr/web-prod-01/srv@syncoid_2026-03-23:13:20 2026-03-23 13:20
# Alert if replication is stale (add to cron)
cat > /etc/cron.d/dr-alert <<'CRON'
30 * * * * root /usr/local/bin/dr-check-stale.sh || echo "DR replication stale!" | mail -s "DR ALERT" oncall@example.com
CRON
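The cron line above assumes a dr-check-stale.sh that this recipe doesn't show. A minimal sketch matching the dataset layout used here; the two-hour threshold is an assumption (one missed hourly run plus slack):

```shell
#!/bin/bash
# /usr/local/bin/dr-check-stale.sh (sketch) -- exit non-zero if any
# replica's newest snapshot is older than MAX_AGE seconds
MAX_AGE="${MAX_AGE:-7200}"

# Pure helper: args are now-epoch, snapshot-epoch, max-age-seconds.
# Returns 0 (true) when the snapshot is stale.
snap_is_stale() {
    (( $1 - $2 > $3 ))
}

check_all() {
    shopt -s nullglob
    local now stale=0 server ds newest
    now=$(date +%s)
    for server_dir in /srv/dr/*/; do
        server=$(basename "${server_dir}")
        for name in ROOT home srv; do
            ds="rpool/srv/dr/${server}/${name}"
            # -p prints creation as epoch seconds; -d 1 stays on this
            # dataset's own snapshots
            newest=$(zfs list -H -p -t snapshot -o creation -s creation \
                -d 1 "${ds}" 2>/dev/null | tail -1)
            if [[ -z "${newest}" ]] || snap_is_stale "${now}" "${newest}" "${MAX_AGE}"; then
                echo "STALE: ${ds}"
                stale=1
            fi
        done
    done
    return "${stale}"
}

# Only touch zfs where it exists (keeps the helper testable elsewhere)
command -v zfs >/dev/null && check_all
```

A missing dataset counts as stale too, so a server that was never replicated also pages the on-call.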
Step 5: The failover runbook
# ============================================
# DISASTER RECOVERY FAILOVER RUNBOOK
# ============================================
# Scenario: Production site is down. Fail over to DR.
# STEP 1: Confirm production is truly dead
# Don't fail over for a network blip
ping -c 10 prod-site.example.com
ssh web-prod-01 "echo alive" 2>/dev/null || echo "CONFIRMED DOWN"
# STEP 2: Check how current the replica is
dr-status.sh
# RPO = time since last successful syncoid
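To put a number on that RPO from the DR box itself, compare "now" against the creation time of the newest replica snapshot (arithmetic split into a helper; the dataset path matches the layout above):

```shell
# Minutes between two epoch timestamps -- the RPO, given "now" and the
# creation time of the newest replica snapshot
rpo_minutes() {
    echo $(( ($1 - $2) / 60 ))
}

# zfs-only part guarded so the helper stands on its own
if command -v zfs >/dev/null; then
    snap=$(zfs list -H -p -t snapshot -o creation -s creation \
        -d 1 rpool/srv/dr/web-prod-01/ROOT | tail -1)
    [[ -n "${snap}" ]] && echo "RPO: $(rpo_minutes "$(date +%s)" "${snap}") minutes"
fi
```

If that prints more than ~70 minutes, replication has missed a cycle and you're failing over onto older data than you think.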
# STEP 3: Promote the replica
# For the web server:
# Copy the replicated data into a new pool, using each dataset's own
# newest syncoid snapshot (the three datasets snapshot at different times)
for ds in ROOT home srv; do
latest=$(zfs list -H -t snapshot -o name -s creation -d 1 \
"rpool/srv/dr/web-prod-01/${ds}" | tail -1)
zfs send -R "${latest}" | zfs recv -F "rpool-recover/${ds}"
done
# STEP 4: Set mountpoints and boot
zfs set mountpoint=/ rpool-recover/ROOT/kldload-node
zfs set mountpoint=/home rpool-recover/home
zfs set mountpoint=/srv rpool-recover/srv
# STEP 5: Install bootloader on the DR hardware
krecovery reinstall-bootloader /dev/sda
# STEP 6: Update DNS to point to DR site
# Your DNS provider's API or web panel
# web.example.com -> DR site IP
# STEP 7: Verify services
systemctl status nginx
systemctl status postgresql
curl -s http://localhost/ | head -5
# STEP 8: Notify the team
echo "Failover complete. Services running on DR site." | \
mail -s "DR FAILOVER COMPLETE" team@example.com
Step 6: Test the failover quarterly
# A DR plan that isn't tested is a wish, not a plan.
# Test failover quarterly using a clone — no impact on production.
# Snapshot the current DR replica
zfs snapshot -r rpool/srv/dr/web-prod-01@test-$(date +%Y%m%d)
# Clone it for testing (instant, no extra disk space; -p creates the
# rpool/srv/dr-test parents on first run)
zfs clone -p rpool/srv/dr/web-prod-01/ROOT@test-$(date +%Y%m%d) \
rpool/srv/dr-test/web-prod-01/ROOT
zfs clone -p rpool/srv/dr/web-prod-01/srv@test-$(date +%Y%m%d) \
rpool/srv/dr-test/web-prod-01/srv
# Boot the clone in a VM to verify it works
# CAVEAT: /dev/zvol/... paths exist only for zvols; if ROOT replicates
# as a filesystem dataset, expose the clone to the VM another way
# (e.g. copy it into a scratch zvol first)
virt-install --name dr-test \
--memory 4096 --vcpus 2 \
--disk path=/dev/zvol/rpool/srv/dr-test/web-prod-01/ROOT \
--import --os-variant centos-stream9 \
--noautoconsole
# Run your verification checks
# Can you reach the web server?
# Does the database respond?
# Are the application configs correct?
# Clean up after testing
virsh destroy dr-test && virsh undefine dr-test
zfs destroy -r rpool/srv/dr-test
What makes this different
Always current
Syncoid runs every hour. The DR replica is never more than an hour behind production. No more stale backup tapes in a vault somewhere.
Tested quarterly
Clone the replica, boot it in a VM, run your checks. If the test fails, fix the runbook now. Not during a real disaster at 3 AM.
Encrypted in transit
All replication traffic flows through WireGuard. An attacker sniffing the wire sees encrypted noise. Your data stays private between sites.
Minutes, not days
Traditional DR takes days to recover. ZFS replication + a tested runbook = services online in 10-15 minutes. The difference between "we lost the weekend" and "we lost an hour."