
Disaster Recovery Site — your DR site isn't a dusty rack in a closet. It's a live replica that's always current.

Most DR plans are fantasies. The backup tapes are six months old. The runbook was written three years ago by someone who left. Nobody has tested a failover since the last audit. When the primary site actually dies, it takes days — sometimes weeks — to recover. With ZFS replication over WireGuard, your DR site receives incremental updates every hour. The replica is always current. The failover is a documented, tested, repeatable procedure. RTO measured in minutes. RPO equals the time since the last syncoid run.

The recipe

Step 1: Set up the DR site

# Install kldload server profile on the DR hardware
# This machine mirrors your production environment

# Create the receiving datasets — one per production server
zfs create -p rpool/srv/dr
zfs create rpool/srv/dr/web-prod-01
zfs create rpool/srv/dr/db-prod-01
zfs create rpool/srv/dr/app-prod-01

# Set compression — replicated data compresses further on receive
zfs set compression=zstd rpool/srv/dr
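One optional hardening step, assuming the layout above: mark the receiving tree read-only. `zfs receive` still applies incoming streams to read-only datasets, but nothing on the DR box can accidentally modify (and thereby diverge) the replica.

```shell
# Protect the replica from local writes (incoming receives still work)
zfs set readonly=on rpool/srv/dr
```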

Step 2: WireGuard tunnel between sites

# On the DR site — generate keys
wg genkey | tee /etc/wireguard/dr-private.key | wg pubkey > /etc/wireguard/dr-public.key
chmod 600 /etc/wireguard/dr-private.key

# Configure the site-to-site tunnel
cat > /etc/wireguard/wg-prod.conf <<'WG'
[Interface]
PrivateKey = DR_PRIVATE_KEY_HERE
Address = 10.200.0.2/24
ListenPort = 51820

[Peer]
PublicKey = PROD_PUBLIC_KEY_HERE
Endpoint = prod-site.example.com:51820
AllowedIPs = 10.200.0.1/32,10.0.0.0/16
PersistentKeepalive = 25
WG

systemctl enable --now wg-quick@wg-prod

# Verify the tunnel
wg show wg-prod
ping -c 3 10.200.0.1
WireGuard creates a permanent encrypted link between your production site and DR site. All replication traffic flows through this tunnel. No VPN appliances, no certificates to expire.
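For completeness, the production side needs the mirror-image config. This is a sketch: `dr-site.example.com` and the key placeholders are assumptions, so substitute your real endpoint and the keys generated above.

```shell
# On the production site: peer config pointing back at the DR box
cat > /etc/wireguard/wg-dr.conf <<'WG'
[Interface]
PrivateKey = PROD_PRIVATE_KEY_HERE
Address = 10.200.0.1/24
ListenPort = 51820

[Peer]
PublicKey = DR_PUBLIC_KEY_HERE
Endpoint = dr-site.example.com:51820
AllowedIPs = 10.200.0.2/32
PersistentKeepalive = 25
WG

systemctl enable --now wg-quick@wg-dr
```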

Step 3: Configure syncoid replication from production

# On each production server, set up syncoid to push to the DR site
# Use the WireGuard tunnel IP (10.200.0.2)

# SSH key setup — production pushes to DR
ssh-keygen -t ed25519 -f /root/.ssh/dr-sync -N ""
ssh-copy-id -i /root/.ssh/dr-sync.pub root@10.200.0.2

# Initial full sync (this takes a while the first time)
syncoid --recursive --sshkey /root/.ssh/dr-sync \
    rpool/ROOT 10.200.0.2:rpool/srv/dr/$(hostname)/ROOT
syncoid --recursive --sshkey /root/.ssh/dr-sync \
    rpool/home 10.200.0.2:rpool/srv/dr/$(hostname)/home
syncoid --recursive --sshkey /root/.ssh/dr-sync \
    rpool/srv 10.200.0.2:rpool/srv/dr/$(hostname)/srv

# Cron job for hourly incremental replication
cat > /etc/cron.d/syncoid-dr <<'CRON'
0 * * * * root syncoid --recursive --no-sync-snap --sshkey /root/.ssh/dr-sync rpool/ROOT 10.200.0.2:rpool/srv/dr/$(hostname)/ROOT >> /var/log/syncoid-dr.log 2>&1
10 * * * * root syncoid --recursive --no-sync-snap --sshkey /root/.ssh/dr-sync rpool/home 10.200.0.2:rpool/srv/dr/$(hostname)/home >> /var/log/syncoid-dr.log 2>&1
20 * * * * root syncoid --recursive --no-sync-snap --sshkey /root/.ssh/dr-sync rpool/srv 10.200.0.2:rpool/srv/dr/$(hostname)/srv >> /var/log/syncoid-dr.log 2>&1
CRON
The first sync sends everything. After that, syncoid sends only the blocks that changed since the last snapshot. A 2TB database with 5GB of daily changes? The hourly sync takes seconds.
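One subtlety: `--no-sync-snap` tells syncoid to replicate the newest *existing* snapshot rather than creating its own, so something on the production side has to be taking snapshots on a schedule, typically sanoid, syncoid's companion tool. A minimal policy might look like this (a sketch; tune the retention counts to your environment):

```shell
# Hypothetical sanoid policy on each production server; sanoid's
# cron/timer takes and prunes snapshots, and syncoid --no-sync-snap
# then replicates the newest one to the DR site.
mkdir -p /etc/sanoid
cat > /etc/sanoid/sanoid.conf <<'SANOID'
[rpool/ROOT]
	use_template = production
	recursive = yes

[rpool/home]
	use_template = production
	recursive = yes

[rpool/srv]
	use_template = production
	recursive = yes

[template_production]
	hourly = 36
	daily = 30
	monthly = 3
	autosnap = yes
	autoprune = yes
SANOID
```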

Step 4: Monitor replication health

# On the DR site — check how current each replica is
cat > /usr/local/bin/dr-status.sh <<'STATUS'
#!/bin/bash
echo "=== DR Replication Status ==="
echo ""
for server_dir in /srv/dr/*/; do
    server=$(basename "${server_dir}")
    echo "--- ${server} ---"

    for dataset in ROOT home srv; do
        ds="rpool/srv/dr/${server}/${dataset}"
        if zfs list "${ds}" &>/dev/null; then
            latest=$(zfs list -t snapshot -o name,creation -s creation \
                -r "${ds}" | tail -1)
            echo "  ${dataset}: ${latest}"
        else
            echo "  ${dataset}: NOT REPLICATED"
        fi
    done
    echo ""
done
STATUS
chmod +x /usr/local/bin/dr-status.sh

# Run it
dr-status.sh
# === DR Replication Status ===
# --- web-prod-01 ---
#   ROOT: rpool/srv/dr/web-prod-01/ROOT@syncoid_2026-03-23:13:00  2026-03-23 13:00
#   home: rpool/srv/dr/web-prod-01/home@syncoid_2026-03-23:13:10  2026-03-23 13:10
#   srv:  rpool/srv/dr/web-prod-01/srv@syncoid_2026-03-23:13:20   2026-03-23 13:20

# Alert if replication is stale (add to cron)
cat > /etc/cron.d/dr-alert <<'CRON'
30 * * * * root /usr/local/bin/dr-check-stale.sh || echo "DR replication stale!" | mail -s "DR ALERT" oncall@example.com
CRON
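The cron job above calls `dr-check-stale.sh`, which hasn't been defined yet. A sketch, assuming the dataset layout above and a two-hour staleness threshold:

```shell
cat > /usr/local/bin/dr-check-stale.sh <<'STALE'
#!/bin/bash
# Exit non-zero if any replica's newest snapshot is older than MAX_AGE_MIN
MAX_AGE_MIN=120
now=$(date +%s)
status=0
for ds in $(zfs list -H -o name -r rpool/srv/dr | tail -n +2); do
    # -p prints creation time as epoch seconds; -s creation sorts oldest-first
    newest=$(zfs list -H -p -t snapshot -o creation -s creation "${ds}" | tail -1)
    [ -z "${newest}" ] && continue
    age=$(( (now - newest) / 60 ))
    if [ "${age}" -gt "${MAX_AGE_MIN}" ]; then
        echo "STALE: ${ds} newest snapshot is ${age} minutes old"
        status=1
    fi
done
exit ${status}
STALE
chmod +x /usr/local/bin/dr-check-stale.sh
```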

Step 5: The failover runbook

# ============================================
# DISASTER RECOVERY FAILOVER RUNBOOK
# ============================================
# Scenario: Production site is down. Fail over to DR.

# STEP 1: Confirm production is truly dead
# Don't fail over for a network blip
ping -c 10 prod-site.example.com
ssh web-prod-01 "echo alive" 2>/dev/null || echo "CONFIRMED DOWN"

# STEP 2: Check how current the replica is
dr-status.sh
# RPO = time since last successful syncoid

# STEP 3: Promote the replica
# For the web server:
# Restore the replicated data into a recovery pool (rpool-recover must already exist)
# "@latest" is a placeholder: substitute the newest snapshot name from dr-status.sh
zfs send -R rpool/srv/dr/web-prod-01/ROOT@latest | \
    zfs recv -F rpool-recover/ROOT
zfs send -R rpool/srv/dr/web-prod-01/home@latest | \
    zfs recv -F rpool-recover/home
zfs send -R rpool/srv/dr/web-prod-01/srv@latest | \
    zfs recv -F rpool-recover/srv

# STEP 4: Set mountpoints and boot
zfs set mountpoint=/ rpool-recover/ROOT/kldload-node
zfs set mountpoint=/home rpool-recover/home
zfs set mountpoint=/srv rpool-recover/srv

# STEP 5: Install bootloader on the DR hardware
krecovery reinstall-bootloader /dev/sda

# STEP 6: Update DNS to point to DR site
# Your DNS provider's API or web panel
# web.example.com -> DR site IP

# STEP 7: Verify services
systemctl status nginx
systemctl status postgresql
curl -s http://localhost/ | head -5

# STEP 8: Notify the team
echo "Failover complete. Services running on DR site." | \
    mail -s "DR FAILOVER COMPLETE" team@example.com
Detect failure. Check the replica. Promote it. Update DNS. Verify services. That's five steps and ten minutes. Not five days and ten people.
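Failover is only half the procedure. Once the primary site is rebuilt, the same machinery runs in reverse before you move DNS back. A hedged sketch, assuming the DR box's SSH key has been authorized on the rebuilt production host:

```shell
# On the DR site: the promoted data is now authoritative, so
# replicate it back to the rebuilt production server first
syncoid --recursive --sshkey /root/.ssh/dr-sync \
    rpool-recover/ROOT root@prod-site.example.com:rpool/ROOT
syncoid --recursive --sshkey /root/.ssh/dr-sync \
    rpool-recover/srv root@prod-site.example.com:rpool/srv

# Then: move DNS back, re-enable the normal prod-to-DR cron jobs,
# and confirm dr-status.sh shows fresh snapshots again
```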

Step 6: Test the failover quarterly

# A DR plan that isn't tested is a wish, not a plan.
# Test failover quarterly using a clone — no impact on production.

# Snapshot the current DR replica
zfs snapshot -r rpool/srv/dr/web-prod-01@test-$(date +%Y%m%d)

# Clone it for testing (instant, no extra disk space)
zfs clone rpool/srv/dr/web-prod-01/ROOT@test-$(date +%Y%m%d) \
    rpool/srv/dr-test/web-prod-01/ROOT
zfs clone rpool/srv/dr/web-prod-01/srv@test-$(date +%Y%m%d) \
    rpool/srv/dr-test/web-prod-01/srv

# Boot the clone in a VM to verify it works
# (the /dev/zvol path assumes ROOT was replicated as a zvol; a
#  filesystem clone would need to be packed into a disk image first)
virt-install --name dr-test \
    --memory 4096 --vcpus 2 \
    --disk path=/dev/zvol/rpool/srv/dr-test/web-prod-01/ROOT \
    --import --os-variant centos-stream9 \
    --noautoconsole

# Run your verification checks
# Can you reach the web server?
# Does the database respond?
# Are the application configs correct?

# Clean up after testing
virsh destroy dr-test && virsh undefine dr-test
zfs destroy -r rpool/srv/dr-test
Test the failover with a clone. No risk to production, no risk to the DR replica. If the test fails, fix the runbook before you need it for real.
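The verification checks above are easy to forget under pressure, so it helps to script them. A hypothetical example: the test VM's IP is passed as an argument, and the service names are assumptions carried over from the runbook.

```shell
cat > /usr/local/bin/dr-verify.sh <<'VERIFY'
#!/bin/bash
# Usage: dr-verify.sh <test-vm-ip>
set -e
VM_IP="${1:?usage: dr-verify.sh <test-vm-ip>}"

# Can you reach the web server?
curl -fsS --max-time 10 "http://${VM_IP}/" >/dev/null && echo "web: OK"

# Are the services up? (names assumed from the runbook above)
ssh -o ConnectTimeout=5 "root@${VM_IP}" \
    "systemctl is-active nginx postgresql" && echo "services: OK"
VERIFY
chmod +x /usr/local/bin/dr-verify.sh
```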

DR by the numbers

RPO
Recovery Point Objective = time since last syncoid run. With hourly replication, worst case is 59 minutes of data loss. Run syncoid every 15 minutes for tighter RPO.
RTO
Recovery Time Objective = time to bring services online at DR. With the runbook above: 10-15 minutes. Most of that is DNS propagation.
Bandwidth
Only changed blocks travel over the wire. A 2TB server with 5GB of hourly changes uses ~5GB/hour of bandwidth. WireGuard encrypts it. Syncoid compresses it.
Storage
The DR site needs enough disk to hold the replica plus snapshot history. With ZFS compression, you typically need 30-50% less raw storage than the source.
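The storage estimate is easy to check against your own data, since ZFS tracks the achieved ratio per dataset:

```shell
# Show the actual compression ratio of the replicated data
zfs get -r -o name,value compressratio rpool/srv/dr
```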

What makes this different

Always current

Syncoid runs every hour. The DR replica is never more than an hour behind production. No more stale backup tapes in a vault somewhere.

Tested quarterly

Clone the replica, boot it in a VM, run your checks. If the test fails, fix the runbook now. Not during a real disaster at 3 AM.

Encrypted in transit

All replication traffic flows through WireGuard. An attacker sniffing the wire sees encrypted noise. Your data stays private between sites.

Minutes, not days

Traditional DR takes days to recover. ZFS replication + a tested runbook = services online in 10-15 minutes. The difference between "we lost the weekend" and "we lost an hour."