Disaster Recovery Site — your DR site isn't a dusty rack in a closet. It's a live replica that's always current.
Most DR plans are fantasies. The backup tapes are six months old. The runbook was written three years ago by someone who left. Nobody has tested a failover since the last audit. When the primary site actually dies, it takes days — sometimes weeks — to recover. With ZFS replication over WireGuard, your DR site receives incremental updates every hour. The replica is always current. The failover is a documented, tested, repeatable procedure. RTO measured in minutes. RPO equals the time since the last syncoid run.
On ext4, DR means rsync jobs that miss open files, backup agents with their own failure modes, and restore procedures nobody has tested. With ZFS, zfs send over WireGuard IS your DR: incremental, checksummed, and encrypted on the wire. The DR site is not a backup. It is a live replica that receives snapshot-consistent dataset updates every hour. Failover means promoting the replicated datasets and booting from them. No restore from backup media, no data reconstruction, no prayer. Recovery time: minutes, not hours. And because ZFS checksums every block, corruption on the replica is detected rather than silently served: you know the replica is correct, not just "probably fine."
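One way to check that the replica really holds the same snapshot as production: a snapshot keeps its `guid` property across `zfs send` and `zfs recv`, so equal GUIDs on both sides mean the same snapshot. The sketch below is a minimal illustration, not part of the recipe. The function names and the snapshot name in the usage comment are hypothetical, and the tunnel IP is taken from the steps later in this guide.

```shell
#!/bin/sh
# Sketch: confirm a replicated snapshot is the same snapshot as the source
# by comparing the ZFS "guid" property, which send/recv preserves.

# Print the guid of a local snapshot, e.g. rpool/home@some-snapshot
local_guid() {
    zfs get -H -o value guid "$1"
}

# Print the guid of a snapshot on a remote host reached over SSH.
remote_guid() {
    ssh "$1" zfs get -H -o value guid "$2"
}

# Succeed only if both guids are non-empty and identical.
same_snapshot() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage (hypothetical snapshot name "hourly-01"):
#   src=$(local_guid rpool/home@hourly-01)
#   dst=$(remote_guid root@10.200.0.2 "rpool/srv/dr/$(hostname)/home@hourly-01")
#   same_snapshot "$src" "$dst" && echo "replica matches" || echo "MISMATCH"
```

A GUID match confirms snapshot identity only. Block-level integrity on the DR pool still comes from ZFS checksums, which a periodic scrub exercises.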
Why this changes how you think about DR:
Traditional DR is a separate system. You buy backup software. You configure backup jobs per server, per database, per application. Each backup has its own schedule, its own retention, its own restore procedure. Some are file-level (miss open files). Some are image-level (can't restore individual files). Some are application-specific (pg_dump, mysqldump). Testing restore is a project. Nobody tests it. When you need it, you discover it doesn't work.
ZFS DR is a property of the storage. syncoid -r rpool dr-node:tank/replica replicates everything (OS, databases, configs, user data, application state) in one recursive run, and each dataset arrives as an atomic, all-or-nothing snapshot. The DR site is a mirror of production at the last snapshot. Not a backup. A mirror. Same datasets, same hierarchy, same checksums. Failover is: promote the replicated datasets on the DR node, boot from them. The services start because their configs are in the datasets. The data is there because it was replicated. The users connect once DNS points to the new node.
The test nobody does, made trivial. Clone the DR replica. Boot the clone. Verify the services start. Destroy the clone. You just tested your DR failover without touching production or the real DR replica. Do it weekly. Do it in CI. Do it every time you change the replication config. The test is a zfs clone and a boot — 30 seconds, zero risk.
Service-to-service: your application's DR strategy is "do nothing." The application doesn't know it's being replicated. It writes to a path. ZFS snapshots and replicates the path. The DR node has the same path with the same data. The application starts on the DR node and finds its data where it expects it. No DR-specific code. No failover scripts. No "switch the connection string." The storage layer handles everything.
The recipe
Step 1: Set up the DR site
# Install kldload server profile on the DR hardware
# This machine mirrors your production environment
# Create the receiving datasets — one per production server
kdir /srv/dr
kdir /srv/dr/web-prod-01
kdir /srv/dr/db-prod-01
kdir /srv/dr/app-prod-01
# Set compression on the parent so the received datasets inherit it
zfs set compression=zstd rpool/srv/dr
Step 2: WireGuard tunnel between sites
# On the DR site — generate keys
wg genkey | tee /etc/wireguard/dr-private.key | wg pubkey > /etc/wireguard/dr-public.key
chmod 600 /etc/wireguard/dr-private.key
# Configure the site-to-site tunnel
cat > /etc/wireguard/wg-prod.conf <<'WG'
[Interface]
PrivateKey = DR_PRIVATE_KEY_HERE
Address = 10.200.0.2/24
ListenPort = 51820
[Peer]
PublicKey = PROD_PUBLIC_KEY_HERE
Endpoint = prod-site.example.com:51820
AllowedIPs = 10.200.0.1/32,10.0.0.0/16
PersistentKeepalive = 25
WG
systemctl enable --now wg-quick@wg-prod
# Verify the tunnel
wg show wg-prod
ping -c 3 10.200.0.1
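A ping proves the tunnel works right now. For ongoing monitoring, handshake age is a useful extra signal: `wg show wg-prod latest-handshakes` prints one line per peer with its public key and the epoch time of the last handshake. The sketch below flags peers whose handshake is too old. The function name and the 180-second threshold are assumptions for illustration, not values from this recipe.

```shell
#!/bin/sh
# Sketch: flag WireGuard peers whose last handshake is older than a threshold.
# Input (stdin): lines of "<public-key> <epoch-seconds>", as printed by
#   wg show wg-prod latest-handshakes
# Output: the public key of each stale peer; exit status 1 if any is stale.
stale_peers() {
    now=$1       # current time in epoch seconds
    max_age=$2   # allowed handshake age in seconds (180 is an assumption)
    awk -v now="$now" -v max="$max_age" '
        # An epoch of 0 means the peer has never completed a handshake.
        $2 == 0 || (now - $2) > max { print $1; stale = 1 }
        END { exit (stale ? 1 : 0) }
    '
}

# Usage:
#   wg show wg-prod latest-handshakes | stale_peers "$(date +%s)" 180 \
#       || echo "WireGuard tunnel to production looks stale"
```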
Step 3: Configure syncoid replication from production
# On each production server, set up syncoid to push to the DR site
# Use the WireGuard tunnel IP (10.200.0.2)
# SSH key setup — production pushes to DR
ssh-keygen -t ed25519 -f /root/.ssh/dr-sync -N ""
ssh-copy-id -i /root/.ssh/dr-sync.pub root@10.200.0.2
# Initial full sync (this takes a while the first time)
syncoid --recursive --sshkey /root/.ssh/dr-sync \
rpool/ROOT 10.200.0.2:rpool/srv/dr/$(hostname)/ROOT
syncoid --recursive --sshkey /root/.ssh/dr-sync \
rpool/home 10.200.0.2:rpool/srv/dr/$(hostname)/home
syncoid --recursive --sshkey /root/.ssh/dr-sync \
rpool/srv 10.200.0.2:rpool/srv/dr/$(hostname)/srv
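# Cron job for hourly incremental replication
# --no-sync-snap sends existing snapshots only, so the source needs its own snapshot schedule (for example sanoid)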
cat > /etc/cron.d/syncoid-dr <<'CRON'
0 * * * * root syncoid --recursive --no-sync-snap --sshkey /root/.ssh/dr-sync rpool/ROOT 10.200.0.2:rpool/srv/dr/$(hostname)/ROOT >> /var/log/syncoid-dr.log 2>&1
10 * * * * root syncoid --recursive --no-sync-snap --sshkey /root/.ssh/dr-sync rpool/home 10.200.0.2:rpool/srv/dr/$(hostname)/home >> /var/log/syncoid-dr.log 2>&1
20 * * * * root syncoid --recursive --no-sync-snap --sshkey /root/.ssh/dr-sync rpool/srv 10.200.0.2:rpool/srv/dr/$(hostname)/srv >> /var/log/syncoid-dr.log 2>&1
CRON
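The three cron entries are staggered by ten minutes, so a slow transfer can overlap with the next one. An alternative is a single wrapper that replicates the datasets one after another. The sketch below shows that idea under stated assumptions: the function names and the `DR_HOST`, `DR_BASE`, and `SYNCOID` variables are hypothetical, and the syncoid flags are the ones already used above.

```shell
#!/bin/sh
# Sketch: replicate each dataset in turn so runs never overlap.
# DR_HOST, DR_BASE and the SYNCOID override are assumptions for illustration.
DR_HOST=${DR_HOST:-10.200.0.2}
DR_BASE=${DR_BASE:-rpool/srv/dr}
SYNCOID=${SYNCOID:-syncoid}

# Build the replication target for one dataset, for example:
#   dr_target web-prod-01 rpool/ROOT
#   prints 10.200.0.2:rpool/srv/dr/web-prod-01/ROOT
dr_target() {
    # ${2#*/} strips the pool name, leaving the dataset path below it.
    printf '%s:%s/%s/%s\n' "$DR_HOST" "$DR_BASE" "$1" "${2#*/}"
}

# Replicate every dataset given after the host name.
# Returns non-zero if any single replication failed.
dr_sync_all() {
    host=$1; shift
    failed=0
    for ds in "$@"; do
        $SYNCOID --recursive --no-sync-snap --sshkey /root/.ssh/dr-sync \
            "$ds" "$(dr_target "$host" "$ds")" || failed=1
    done
    return "$failed"
}

# Usage, as the last line of a hypothetical /usr/local/bin/dr-sync.sh:
#   dr_sync_all "$(hostname)" rpool/ROOT rpool/home rpool/srv
```

Run from cron under a lock, for example `flock -n /run/dr-sync.lock /usr/local/bin/dr-sync.sh` (the lock path is a placeholder), a run that is still in progress causes the next one to be skipped rather than started in parallel.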
Step 4: Monitor replication health
# On the DR site — check how current each replica is
cat > /usr/local/bin/dr-status.sh <<'STATUS'
#!/bin/bash
# Print the newest replicated snapshot for each server and dataset
echo "=== DR Replication Status ==="
echo ""
for server_dir in /srv/dr/*/; do
    server=$(basename "${server_dir}")
    echo "--- ${server} ---"
    for dataset in ROOT home srv; do
        ds="rpool/srv/dr/${server}/${dataset}"
        if zfs list "${ds}" &>/dev/null; then
            latest=$(zfs list -t snapshot -o name,creation -s creation \
                -r "${ds}" | tail -1)
            echo " ${dataset}: ${latest}"
        else
            echo " ${dataset}: NOT REPLICATED"
        fi
    done
    echo ""
done
STATUS
chmod +x /usr/local/bin/dr-status.sh
# Run it
dr-status.sh
# === DR Replication Status ===
# --- web-prod-01 ---
# ROOT: rpool/srv/dr/web-prod-01/ROOT@syncoid_2026-03-23:13:00 2026-03-23 13:00
# home: rpool/srv/dr/web-prod-01/home@syncoid_2026-03-23:13:10 2026-03-23 13:10
# srv: rpool/srv/dr/web-prod-01/srv@syncoid_2026-03-23:13:20 2026-03-23 13:20
# Alert if replication is stale (add to cron)
cat > /etc/cron.d/dr-alert <<'CRON'
30 * * * * root /usr/local/bin/dr-check-stale.sh || echo "DR replication stale!" | mail -s "DR ALERT" oncall@example.com
CRON
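The cron entry above calls /usr/local/bin/dr-check-stale.sh, which this recipe does not show. The sketch below is one guess at what such a script might contain, not the original. It exits non-zero when the newest snapshot under any server's replica is older than a threshold. The two-hour `MAX_AGE`, the `DR_BASE` variable, and the function names are all assumptions.

```shell
#!/bin/sh
# Sketch of a staleness check: fail if any replica's newest snapshot is
# older than MAX_AGE seconds. Threshold and dataset layout are assumptions.
MAX_AGE=${MAX_AGE:-7200}          # roughly two missed hourly runs
DR_BASE=${DR_BASE:-rpool/srv/dr}

# Print the creation time (epoch seconds) of the newest snapshot under a
# dataset, or nothing if it has no snapshots. -p makes zfs print raw numbers.
newest_snapshot_epoch() {
    zfs list -H -p -t snapshot -o creation -s creation -r "$1" 2>/dev/null | tail -n 1
}

# Succeed (stale) if the epoch is missing or older than MAX_AGE.
is_stale() {
    last=$1; now=$2
    [ -z "$last" ] || [ $(( now - last )) -gt "$MAX_AGE" ]
}

# Check every first-level child of DR_BASE (one per production server).
check_all() {
    now=$(date +%s); rc=0
    # -d 1 lists DR_BASE itself first, then its children; skip the first line.
    for ds in $(zfs list -H -o name -d 1 "$DR_BASE" | tail -n +2); do
        if is_stale "$(newest_snapshot_epoch "$ds")" "$now"; then
            echo "STALE: $ds"; rc=1
        fi
    done
    return "$rc"
}

# As a script, the last line would simply be:
#   check_all
```

Because the exit status is non-zero only when something is stale, the `|| echo ... | mail` part of the cron line fires only in that case.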
Step 5: The failover runbook
# ============================================
# DISASTER RECOVERY FAILOVER RUNBOOK
# ============================================
# Scenario: Production site is down. Fail over to DR.
# STEP 1: Confirm production is truly dead
# Don't failover for a network blip
ping -c 10 prod-site.example.com
ssh -o ConnectTimeout=10 web-prod-01 "echo alive" 2>/dev/null || echo "CONFIRMED DOWN"
# STEP 2: Check how current the replica is
dr-status.sh
# RPO = time since last successful syncoid
# STEP 3: Promote the replica
# For the web server:
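# Copy the replicated data into a recovery pool
# Assumes the pool rpool-recover already exists on the DR hardware
# Replace @latest with the newest snapshot name shown by dr-status.sh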
zfs send -R rpool/srv/dr/web-prod-01/ROOT@latest | \
zfs recv -F rpool-recover/ROOT
zfs send -R rpool/srv/dr/web-prod-01/home@latest | \
zfs recv -F rpool-recover/home
zfs send -R rpool/srv/dr/web-prod-01/srv@latest | \
zfs recv -F rpool-recover/srv
# STEP 4: Set mountpoints and boot
zfs set mountpoint=/ rpool-recover/ROOT/kldload-node
zfs set mountpoint=/home rpool-recover/home
zfs set mountpoint=/srv rpool-recover/srv
# STEP 5: Install bootloader on the DR hardware
krecovery reinstall-bootloader /dev/sda
# STEP 6: Update DNS to point to DR site
# Your DNS provider's API or web panel
# web.example.com -> DR site IP
# STEP 7: Verify services
systemctl status nginx
systemctl status postgresql
curl -s http://localhost/ | head -5
# STEP 8: Notify the team
echo "Failover complete. Services running on DR site." | \
mail -s "DR FAILOVER COMPLETE" team@example.com
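The runbook uses `@latest` as a stand-in for the newest replicated snapshot. During a real failover the name can be resolved rather than typed by hand. The sketch below shows one way to do that. The function names are hypothetical, and it assumes the recovery pool already exists.

```shell
#!/bin/sh
# Sketch: resolve the newest snapshot of a replicated dataset and promote it,
# instead of relying on a literal "@latest" snapshot name.

# Print the full name (dataset@snapshot) of the newest snapshot belonging to
# the dataset itself. -d 1 excludes snapshots of child datasets.
latest_snapshot() {
    zfs list -H -t snapshot -o name -s creation -d 1 "$1" | tail -n 1
}

# Send the newest snapshot of $1 (with descendants) into target dataset $2.
# Refuses to run if no snapshot can be found.
promote_replica() {
    snap=$(latest_snapshot "$1")
    [ -n "$snap" ] || { echo "no snapshot found for $1" >&2; return 1; }
    zfs send -R "$snap" | zfs recv -F "$2"
}

# Usage:
#   promote_replica rpool/srv/dr/web-prod-01/ROOT rpool-recover/ROOT
#   promote_replica rpool/srv/dr/web-prod-01/home rpool-recover/home
#   promote_replica rpool/srv/dr/web-prod-01/srv  rpool-recover/srv
```

Refusing to run on an empty snapshot name keeps a typo in the dataset path from turning into a confusing `zfs send` error in the middle of an incident.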
Step 6: Test the failover quarterly
# A DR plan that isn't tested is a wish, not a plan.
# Test failover quarterly using a clone — no impact on production.
# Snapshot the current DR replica
snap="test-$(date +%Y%m%d)"
zfs snapshot -r "rpool/srv/dr/web-prod-01@${snap}"
# Clone it for testing (instant, uses no extra space until the clone diverges)
# -p creates the missing parent datasets under rpool/srv/dr-test
zfs clone -p "rpool/srv/dr/web-prod-01/ROOT@${snap}" \
rpool/srv/dr-test/web-prod-01/ROOT
zfs clone -p "rpool/srv/dr/web-prod-01/srv@${snap}" \
rpool/srv/dr-test/web-prod-01/srv
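# Boot the clone in a VM to verify it works
# Note: a /dev/zvol/... path exists only for zvols; if ROOT is a filesystem dataset, adapt the --disk line to your setup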
virt-install --name dr-test \
--memory 4096 --vcpus 2 \
--disk path=/dev/zvol/rpool/srv/dr-test/web-prod-01/ROOT \
--import --os-variant centos-stream9 \
--noautoconsole
# Run your verification checks
# Can you reach the web server?
# Does the database respond?
# Are the application configs correct?
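# Clean up after testing
virsh destroy dr-test && virsh undefine dr-test
zfs destroy -r rpool/srv/dr-test
# Remove the test snapshots so they do not linger on the replica
zfs destroy -r rpool/srv/dr/web-prod-01@test-$(date +%Y%m%d)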
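The verification questions above can be turned into a repeatable pass/fail run. The sketch below reads one check command per line and reports each result. The function name is hypothetical, and the example checks and the 192.0.2.10 address in the usage comment are placeholders for the test VM.

```shell
#!/bin/sh
# Sketch: run verification commands (one per line on stdin) and report
# PASS or FAIL for each. Succeeds only if every check passed.
run_checks() {
    failures=0
    while IFS= read -r check; do
        [ -n "$check" ] || continue          # skip blank lines
        # </dev/null stops a check from swallowing the remaining input lines.
        if sh -c "$check" >/dev/null 2>&1 </dev/null; then
            echo "PASS: $check"
        else
            echo "FAIL: $check"
            failures=$((failures + 1))
        fi
    done
    [ "$failures" -eq 0 ]
}

# Usage (placeholder address for the test VM):
#   run_checks <<'CHECKS'
#   curl -fsS http://192.0.2.10/
#   pg_isready -h 192.0.2.10
#   CHECKS
```

Keeping the checks as plain command lines means the same list can be reused in the real failover runbook under "Verify services".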
What makes this different
Always current
Syncoid runs every hour, so the DR replica is normally no more than about an hour behind production, plus transfer time. No more stale backup tapes in a vault somewhere.
Tested quarterly
Clone the replica, boot it in a VM, run your checks. If the test fails, fix the runbook now. Not during a real disaster at 3 AM.
Encrypted in transit
All replication traffic flows through WireGuard. An attacker sniffing the wire sees encrypted noise. Your data stays private between sites.
Minutes, not days
Traditional DR takes days to recover. ZFS replication + a tested runbook = services online in 10-15 minutes. The difference between "we lost the weekend" and "we lost an hour."