
Disaster Recovery Site — your DR site isn't a dusty rack in a closet. It's a live replica that's always current.

Most DR plans are fantasies. The backup tapes are six months old. The runbook was written three years ago by someone who left. Nobody has tested a failover since the last audit. When the primary site actually dies, it takes days — sometimes weeks — to recover. With ZFS replication over WireGuard, your DR site receives incremental updates every hour. The replica is always current. The failover is a documented, tested, repeatable procedure. RTO measured in minutes. RPO equals the time since the last syncoid run.

The recipe

Step 1: Set up the DR site

# Install kldload server profile on the DR hardware
# This machine mirrors your production environment

# Create the receiving datasets — one per production server
zfs create -p rpool/srv/dr
zfs create rpool/srv/dr/web-prod-01
zfs create rpool/srv/dr/db-prod-01
zfs create rpool/srv/dr/app-prod-01

# Set compression — replicated data compresses further on receive
zfs set compression=zstd rpool/srv/dr
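One optional hardening step, assuming the layout above: mark the receiving tree read-only. `zfs receive` still applies incoming streams to read-only datasets, but nothing on the DR box can accidentally modify (and thereby diverge) the replica.

```shell
# Protect the replica from local writes (incoming receives still work)
zfs set readonly=on rpool/srv/dr
```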

Step 2: WireGuard tunnel between sites

# On the DR site — generate keys
wg genkey | tee /etc/wireguard/dr-private.key | wg pubkey > /etc/wireguard/dr-public.key
chmod 600 /etc/wireguard/dr-private.key

# Configure the site-to-site tunnel
cat > /etc/wireguard/wg-prod.conf <<'WG'
[Interface]
PrivateKey = DR_PRIVATE_KEY_HERE
Address = 10.200.0.2/24
ListenPort = 51820

[Peer]
PublicKey = PROD_PUBLIC_KEY_HERE
Endpoint = prod-site.example.com:51820
AllowedIPs = 10.200.0.1/32,10.0.0.0/16
PersistentKeepalive = 25
WG

systemctl enable --now wg-quick@wg-prod

# Verify the tunnel
wg show wg-prod
ping -c 3 10.200.0.1
WireGuard creates a permanent encrypted link between your production site and DR site. All replication traffic flows through this tunnel. No VPN appliances, no certificates to expire.
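For completeness, the production side needs the mirror-image config. This is a sketch: `dr-site.example.com` and the key placeholders are assumptions, so substitute your real endpoint and the keys generated above.

```shell
# On the production site: peer config pointing back at the DR box
cat > /etc/wireguard/wg-dr.conf <<'WG'
[Interface]
PrivateKey = PROD_PRIVATE_KEY_HERE
Address = 10.200.0.1/24
ListenPort = 51820

[Peer]
PublicKey = DR_PUBLIC_KEY_HERE
Endpoint = dr-site.example.com:51820
AllowedIPs = 10.200.0.2/32
PersistentKeepalive = 25
WG

systemctl enable --now wg-quick@wg-dr
```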

Step 3: Configure syncoid replication from production

# On each production server, set up syncoid to push to the DR site
# Use the WireGuard tunnel IP (10.200.0.2)

# SSH key setup — production pushes to DR
ssh-keygen -t ed25519 -f /root/.ssh/dr-sync -N ""
ssh-copy-id -i /root/.ssh/dr-sync.pub root@10.200.0.2

# Initial full sync (this takes a while the first time)
syncoid --recursive --sshkey /root/.ssh/dr-sync \
    rpool/ROOT 10.200.0.2:rpool/srv/dr/$(hostname)/ROOT
syncoid --recursive --sshkey /root/.ssh/dr-sync \
    rpool/home 10.200.0.2:rpool/srv/dr/$(hostname)/home
syncoid --recursive --sshkey /root/.ssh/dr-sync \
    rpool/srv 10.200.0.2:rpool/srv/dr/$(hostname)/srv

# Cron job for hourly incremental replication
cat > /etc/cron.d/syncoid-dr <<'CRON'
0 * * * * root syncoid --recursive --no-sync-snap --sshkey /root/.ssh/dr-sync rpool/ROOT 10.200.0.2:rpool/srv/dr/$(hostname)/ROOT >> /var/log/syncoid-dr.log 2>&1
10 * * * * root syncoid --recursive --no-sync-snap --sshkey /root/.ssh/dr-sync rpool/home 10.200.0.2:rpool/srv/dr/$(hostname)/home >> /var/log/syncoid-dr.log 2>&1
20 * * * * root syncoid --recursive --no-sync-snap --sshkey /root/.ssh/dr-sync rpool/srv 10.200.0.2:rpool/srv/dr/$(hostname)/srv >> /var/log/syncoid-dr.log 2>&1
CRON
The first sync sends everything. After that, syncoid sends only the blocks that changed since the last snapshot. A 2TB database with 5GB of daily changes? The hourly sync takes seconds.
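One subtlety: `--no-sync-snap` tells syncoid to replicate the newest *existing* snapshot rather than creating its own, so something on the production side has to be taking snapshots on a schedule, typically sanoid, syncoid's companion tool. A minimal policy might look like this (a sketch; tune the retention counts to your environment):

```shell
# Hypothetical sanoid policy on each production server; sanoid's
# cron/timer takes and prunes snapshots, and syncoid --no-sync-snap
# then replicates the newest one to the DR site.
mkdir -p /etc/sanoid
cat > /etc/sanoid/sanoid.conf <<'SANOID'
[rpool/ROOT]
	use_template = production
	recursive = yes

[rpool/home]
	use_template = production
	recursive = yes

[rpool/srv]
	use_template = production
	recursive = yes

[template_production]
	hourly = 36
	daily = 30
	monthly = 3
	autosnap = yes
	autoprune = yes
SANOID
```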

Step 4: Monitor replication health

# On the DR site — check how current each replica is
cat > /usr/local/bin/dr-status.sh <<'STATUS'
#!/bin/bash
echo "=== DR Replication Status ==="
echo ""
for server_dir in /srv/dr/*/; do
    server=$(basename "${server_dir}")
    echo "--- ${server} ---"

    for dataset in ROOT home srv; do
        ds="rpool/srv/dr/${server}/${dataset}"
        if zfs list "${ds}" &>/dev/null; then
            latest=$(zfs list -t snapshot -o name,creation -s creation \
                -r "${ds}" | tail -1)
            echo "  ${dataset}: ${latest}"
        else
            echo "  ${dataset}: NOT REPLICATED"
        fi
    done
    echo ""
done
STATUS
chmod +x /usr/local/bin/dr-status.sh

# Run it
dr-status.sh
# === DR Replication Status ===
# --- web-prod-01 ---
#   ROOT: rpool/srv/dr/web-prod-01/ROOT@syncoid_2026-03-23:13:00  2026-03-23 13:00
#   home: rpool/srv/dr/web-prod-01/home@syncoid_2026-03-23:13:10  2026-03-23 13:10
#   srv:  rpool/srv/dr/web-prod-01/srv@syncoid_2026-03-23:13:20   2026-03-23 13:20

# Alert if replication is stale (add to cron)
cat > /etc/cron.d/dr-alert <<'CRON'
30 * * * * root /usr/local/bin/dr-check-stale.sh || echo "DR replication stale!" | mail -s "DR ALERT" oncall@example.com
CRON
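The cron job above calls `dr-check-stale.sh`, which hasn't been defined yet. A sketch, assuming the dataset layout above and a two-hour staleness threshold:

```shell
cat > /usr/local/bin/dr-check-stale.sh <<'STALE'
#!/bin/bash
# Exit non-zero if any replica's newest snapshot is older than MAX_AGE_MIN
MAX_AGE_MIN=120
now=$(date +%s)
status=0
for ds in $(zfs list -H -o name -r rpool/srv/dr | tail -n +2); do
    # -p prints creation time as epoch seconds; -s creation sorts oldest-first
    newest=$(zfs list -H -p -t snapshot -o creation -s creation "${ds}" | tail -1)
    [ -z "${newest}" ] && continue
    age=$(( (now - newest) / 60 ))
    if [ "${age}" -gt "${MAX_AGE_MIN}" ]; then
        echo "STALE: ${ds} newest snapshot is ${age} minutes old"
        status=1
    fi
done
exit ${status}
STALE
chmod +x /usr/local/bin/dr-check-stale.sh
```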

Step 5: The failover runbook

# ============================================
# DISASTER RECOVERY FAILOVER RUNBOOK
# ============================================
# Scenario: Production site is down. Fail over to DR.

# STEP 1: Confirm production is truly dead
# Don't fail over for a network blip
ping -c 10 prod-site.example.com
ssh web-prod-01 "echo alive" 2>/dev/null || echo "CONFIRMED DOWN"

# STEP 2: Check how current the replica is
dr-status.sh
# RPO = time since last successful syncoid

# STEP 3: Promote the replica
# For the web server:
# Restore the replicated data into a recovery pool (rpool-recover must already exist)
# "@latest" is a placeholder: substitute the newest snapshot name from dr-status.sh
zfs send -R rpool/srv/dr/web-prod-01/ROOT@latest | \
    zfs recv -F rpool-recover/ROOT
zfs send -R rpool/srv/dr/web-prod-01/home@latest | \
    zfs recv -F rpool-recover/home
zfs send -R rpool/srv/dr/web-prod-01/srv@latest | \
    zfs recv -F rpool-recover/srv

# STEP 4: Set mountpoints and boot
zfs set mountpoint=/ rpool-recover/ROOT/kldload-node
zfs set mountpoint=/home rpool-recover/home
zfs set mountpoint=/srv rpool-recover/srv

# STEP 5: Install bootloader on the DR hardware
krecovery reinstall-bootloader /dev/sda

# STEP 6: Update DNS to point to DR site
# Your DNS provider's API or web panel
# web.example.com -> DR site IP

# STEP 7: Verify services
systemctl status nginx
systemctl status postgresql
curl -s http://localhost/ | head -5

# STEP 8: Notify the team
echo "Failover complete. Services running on DR site." | \
    mail -s "DR FAILOVER COMPLETE" team@example.com
Detect failure. Check the replica. Promote it. Update DNS. Verify services. That's five steps and ten minutes. Not five days and ten people.
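Failover is only half the procedure. Once the primary site is rebuilt, the same machinery runs in reverse before you move DNS back. A hedged sketch, assuming the DR box's SSH key has been authorized on the rebuilt production host:

```shell
# On the DR site: the promoted data is now authoritative, so
# replicate it back to the rebuilt production server first
syncoid --recursive --sshkey /root/.ssh/dr-sync \
    rpool-recover/ROOT root@prod-site.example.com:rpool/ROOT
syncoid --recursive --sshkey /root/.ssh/dr-sync \
    rpool-recover/srv root@prod-site.example.com:rpool/srv

# Then: move DNS back, re-enable the normal prod-to-DR cron jobs,
# and confirm dr-status.sh shows fresh snapshots again
```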

Step 6: Test the failover quarterly

# A DR plan that isn't tested is a wish, not a plan.
# Test failover quarterly using a clone — no impact on production.

# Snapshot the current DR replica
zfs snapshot -r rpool/srv/dr/web-prod-01@test-$(date +%Y%m%d)

# Clone it for testing (instant, no extra disk space)
zfs clone rpool/srv/dr/web-prod-01/ROOT@test-$(date +%Y%m%d) \
    rpool/srv/dr-test/web-prod-01/ROOT
zfs clone rpool/srv/dr/web-prod-01/srv@test-$(date +%Y%m%d) \
    rpool/srv/dr-test/web-prod-01/srv

# Boot the clone in a VM to verify it works
# (the /dev/zvol path assumes ROOT was replicated as a zvol; a
#  filesystem clone would need to be packed into a disk image first)
virt-install --name dr-test \
    --memory 4096 --vcpus 2 \
    --disk path=/dev/zvol/rpool/srv/dr-test/web-prod-01/ROOT \
    --import --os-variant centos-stream9 \
    --noautoconsole

# Run your verification checks
# Can you reach the web server?
# Does the database respond?
# Are the application configs correct?

# Clean up after testing
virsh destroy dr-test && virsh undefine dr-test
zfs destroy -r rpool/srv/dr-test
Test the failover with a clone. No risk to production, no risk to the DR replica. If the test fails, fix the runbook before you need it for real.
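The verification checks above are easy to forget under pressure, so it helps to script them. A hypothetical example: the test VM's IP is passed as an argument, and the service names are assumptions carried over from the runbook.

```shell
cat > /usr/local/bin/dr-verify.sh <<'VERIFY'
#!/bin/bash
# Usage: dr-verify.sh <test-vm-ip>
set -e
VM_IP="${1:?usage: dr-verify.sh <test-vm-ip>}"

# Can you reach the web server?
curl -fsS --max-time 10 "http://${VM_IP}/" >/dev/null && echo "web: OK"

# Are the services up? (names assumed from the runbook above)
ssh -o ConnectTimeout=5 "root@${VM_IP}" \
    "systemctl is-active nginx postgresql" && echo "services: OK"
VERIFY
chmod +x /usr/local/bin/dr-verify.sh
```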

DR by the numbers

RPO
Recovery Point Objective = time since last syncoid run. With hourly replication, worst case is 59 minutes of data loss. Run syncoid every 15 minutes for tighter RPO.
RTO
Recovery Time Objective = time to bring services online at DR. With the runbook above: 10-15 minutes. Most of that is DNS propagation.
Bandwidth
Only changed blocks travel over the wire. A 2TB server with 5GB of hourly changes uses ~5GB/hour of bandwidth. WireGuard encrypts it. Syncoid compresses it.
Storage
The DR site needs enough disk to hold the replica plus snapshot history. With ZFS compression, you typically need 30-50% less raw storage than the source.
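The storage estimate is easy to check against your own data, since ZFS tracks the achieved ratio per dataset:

```shell
# Show the actual compression ratio of the replicated data
zfs get -r -o name,value compressratio rpool/srv/dr
```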

What makes this different

Always current

Syncoid runs every hour. The DR replica is never more than an hour behind production. No more stale backup tapes in a vault somewhere.

Tested quarterly

Clone the replica, boot it in a VM, run your checks. If the test fails, fix the runbook now. Not during a real disaster at 3 AM.

Encrypted in transit

All replication traffic flows through WireGuard. An attacker sniffing the wire sees encrypted noise. Your data stays private between sites.

Minutes, not days

Traditional DR takes days to recover. ZFS replication + a tested runbook = services online in 10-15 minutes. The difference between "we lost the weekend" and "we lost an hour."