
Backup & Disaster Recovery Masterclass

This guide covers the complete lifecycle of protecting your infrastructure: from the philosophical distinction between backups and disaster recovery, through automated ZFS snapshot policies with Sanoid, cross-site replication with Syncoid, boot environments for safe rollback, database-specific backup strategies, Kubernetes cluster protection, failover runbooks, ransomware recovery, and compliance-driven retention. By the end you will have a tested, documented DR plan that actually works when the building is on fire.

The premise: Everyone has backups. Almost nobody has disaster recovery. A backup is a copy of data. Disaster recovery is the proven ability to restore service within a defined time window after a catastrophic failure. The difference between "we have backups" and "we have DR" is the difference between "we have parachutes in the warehouse" and "we have tested parachutes strapped to every seat." This masterclass teaches you to build and test the latter.

What this page covers: DR fundamentals, RPO/RTO definitions, ZFS copy-on-write snapshots, Sanoid automated policies, Syncoid incremental replication, cross-site replication architecture over WireGuard, boot environments, database-consistent backups, Kubernetes etcd and PV protection, failover procedures, DR testing methodology, ransomware recovery workflows, compliance and retention, monitoring replication health, and a complete runbook reference.

Prerequisites: a running kldload system with ZFS on root. The replication sections assume at least two nodes. The Kubernetes sections assume a cluster from the Kubernetes on KVM guide. Everything else works on any kldload node.

Here is the single most important thing about disaster recovery: untested DR is not DR. Every organization I have ever audited that suffered a real disaster — ransomware, datacenter fire, cloud region outage — had backups. Most of them could not restore from those backups within any reasonable timeframe. Some discovered their backups were corrupted. Some discovered their restore process required a server that no longer existed. Some discovered their backups did not include the database schema, only the data. The common thread is that nobody had ever tested a full restore end-to-end. This masterclass exists to make sure you test before the fire.

1. DR Fundamentals

Before configuring any tools, you need to understand the conceptual hierarchy of data protection. These terms are not interchangeable, and confusing them is how organizations end up with a false sense of security.

Backup vs. High Availability vs. Disaster Recovery

Backup

A point-in-time copy of data, stored separately from the primary. Backups protect against data loss — accidental deletion, corruption, application bugs. They do not protect against site loss. A backup on the same server as the data it protects is not a backup. A backup on the same rack is barely a backup.

// Backup = photocopying your passport and keeping it in a different bag.

High Availability (HA)

Redundant infrastructure that continues operating when individual components fail — dual power supplies, RAID arrays, clustered databases, load-balanced application servers. HA protects against component failure. It does not protect against site failure, correlated failures (ransomware), or data corruption that replicates to all replicas before detection.

// HA = two engines on a plane. If one fails, the other keeps you flying. If both catch fire, you need a parachute.

Disaster Recovery (DR)

The ability to restore full service from a geographically separate location after a catastrophic failure of the primary site. DR encompasses data replication, infrastructure provisioning, network failover, DNS cutover, and validated runbooks. DR is not a product — it is a process that you test regularly.

// DR = a second plane, pre-fueled, at a different airport, with tested flight plans.

Business Continuity (BC)

The superset of all the above, plus non-technical concerns: communication plans, legal obligations, regulatory reporting, customer notification, staff relocation, vendor contracts. DR is the technical component of BC. This masterclass focuses on the technical side, but your DR plan must fit inside a larger BC plan.

// BC = the whole emergency manual. DR is the chapter about getting the servers back.

Why backups are not DR

A backup tells you the data exists somewhere. DR tells you: where the data is, how long it takes to restore, what infrastructure is needed to run it, what the service restart order is, who is responsible for each step, and how you know the restore succeeded. If you cannot answer all of those questions, you have backups, not DR.

The most common failure mode I see is this: the team has excellent automated backups — nightly, encrypted, offsite, verified checksums. Then a disaster happens and they discover that restoring a 2 TB database from S3 takes 14 hours over their connection, the application requires three other services to be running first, the TLS certificates for those services expired, and nobody documented the service startup order. They had great backups and zero DR. The backup was fine. The recovery was never tested.
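A figure like that 14-hour restore is simple arithmetic on dataset size and usable bandwidth, so it can be estimated before the disaster rather than during it. A minimal sketch, assuming a hypothetical helper named restore_hours, decimal units (1 GB = 8000 Mbit), and a network-bound restore. The 320 Mbit/s link speed is an illustrative assumption, not a number from the scenario above:

```shell
#!/usr/bin/env bash
# restore_hours SIZE_GB LINK_MBIT
# Estimate how many hours a restore takes when the network is the bottleneck.
# Hypothetical helper; assumes decimal units and a fully usable link.
restore_hours() {
  local size_gb=$1 link_mbit=$2
  awk -v gb="$size_gb" -v mbit="$link_mbit" \
    'BEGIN { printf "%.1f\n", (gb * 8000) / mbit / 3600 }'
}

# 2 TB over an assumed 320 Mbit/s link
restore_hours 2000 320   # prints 13.9
```

Real restores are usually slower than this estimate, because decompression, database replay, and service startup all come on top of the transfer.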

2. RPO & RTO — Defining Your Tolerance

Every DR plan starts with two numbers. These numbers drive every architectural decision that follows — replication frequency, storage costs, infrastructure investment, and staffing requirements.

Recovery Point Objective (RPO)

The maximum acceptable amount of data loss, measured in time. An RPO of 1 hour means you can tolerate losing up to 1 hour of data. An RPO of zero means you need synchronous replication — every write must be confirmed at both sites before the application proceeds. RPO directly determines your replication frequency.

// RPO answers: "How much work are we willing to lose?"

Recovery Time Objective (RTO)

The maximum acceptable downtime, measured from the moment the disaster is declared to the moment service is restored. An RTO of 4 hours means full service must be running at the DR site within 4 hours. An RTO of zero means active-active — both sites serve traffic simultaneously. RTO determines your DR infrastructure investment.

// RTO answers: "How long can we be down before it becomes catastrophic?"

The cost-tolerance tradeoff

RPO          | RTO        | Architecture                                                             | Relative Cost
24 hours     | 48 hours   | Nightly backup to offsite storage, cold DR site                          | $
1 hour       | 4 hours    | Hourly ZFS snapshots + syncoid replication, warm DR node                 | $$
15 minutes   | 1 hour     | Frequent snapshots + syncoid + pre-staged DR with automation             | $$$
~0 (seconds) | 15 minutes | Near-synchronous replication, hot standby, automated failover            | $$$$
0            | 0          | Active-active multi-site, synchronous replication, global load balancing | $$$$$

For most kldload deployments — homelabs, small businesses, edge sites — an RPO of 15 minutes and an RTO of 1 hour is the sweet spot. ZFS makes this achievable at commodity hardware prices. The rest of this masterclass builds that architecture.

The dirty secret of RPO/RTO is that most organizations pick numbers without ever measuring whether they can meet them. They write "RTO: 4 hours" in a compliance document and never test whether a restore actually completes in 4 hours. When I run DR tests for clients, the actual RTO is typically 3-5x the documented number on the first attempt. After two rounds of testing and fixing the gaps, it converges to something close to the target. The testing is the point — not the document.
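One way to keep the testing honest is to record each drill's measured restore time next to the documented target. A small sketch, assuming a hypothetical rto_check helper and times given in minutes:

```shell
#!/usr/bin/env bash
# rto_check TARGET_MIN MEASURED_MIN
# Compare a drill's measured recovery time against the documented RTO.
# Prints PASS or FAIL with the measured/target ratio; returns 1 on FAIL.
rto_check() {
  local target=$1 measured=$2
  awk -v t="$target" -v m="$measured" 'BEGIN {
    verdict = (m <= t) ? "PASS" : "FAIL"
    printf "%s %.1fx\n", verdict, m / t
    exit (m > t)
  }'
}

# Documented RTO of 4 hours, first drill took 16 hours:
#   rto_check 240 960    prints "FAIL 4.0x"
```

Logging the ratio after every drill makes the convergence toward the target visible over successive tests.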

3. The ZFS Advantage

ZFS fundamentally changes what is possible for backup and DR because of three properties that traditional filesystems lack: copy-on-write semantics, atomic snapshots, and incremental send/receive. These are not incremental improvements — they are a different category of capability.

Copy-on-Write (CoW)

ZFS never overwrites data in place. When a block is modified, the new version is written to a new location, and the metadata tree is updated atomically. The old block remains on disk until no snapshot references it. This means snapshots are free — they just prevent old blocks from being freed.

// CoW = writing on a new page instead of erasing the old one. Snapshots = keeping the old pages.

Atomic Snapshots

A ZFS snapshot captures the entire dataset at a single point in time, atomically. There is no "snapshot window" where data is inconsistent. The snapshot is created by updating a single pointer in the metadata tree — it takes microseconds regardless of dataset size. No application quiesce required for filesystem consistency.

// "zfs snapshot" is like freezing time. Everything is consistent to that instant.

Incremental Send/Receive

ZFS can compute the delta between any two snapshots and send only the changed blocks. This is not a file-level diff — it is a block-level diff computed from the metadata tree. A 10 TB dataset with 50 MB of changes since the last snapshot sends 50 MB of data, regardless of how many files were modified or how large the files are.

// Incremental send = "here are only the blocks that changed." Rsync wishes it could do this.

Self-Healing Checksums

Every block in ZFS has a checksum stored in its parent block (not alongside the data). On read, ZFS verifies the checksum. On a mirror or raidz pool, a checksum failure triggers automatic repair from the redundant copy. Your backups are verified every time they are read — silent corruption is detected and corrected automatically.

// Every read is a verify. Every verify on redundant storage is an auto-repair. Bit rot is not a ZFS problem.

Why this changes everything for DR

On a traditional filesystem, a "backup" means copying files — which takes hours for large datasets, produces inconsistent state if files change during the copy, and the incremental logic is fragile (rsync compares mtimes and sizes, which is wrong for databases). On ZFS, a "backup" means: take a snapshot (microseconds), send the delta to the replica (only changed blocks, checksummed end-to-end). The result is a block-for-block identical copy of the dataset at that exact point in time, verified by checksums on both sides. This is not better rsync. This is a different thing entirely.

# Take a snapshot — instant, atomic, zero performance impact
zfs snapshot rpool/data@2026-04-05_14:30

# Send the full dataset to a remote host (initial seed)
zfs send rpool/data@2026-04-05_14:30 | ssh dr-node zfs recv backup/data

# Send only the changes since the last snapshot (incremental)
zfs send -i rpool/data@2026-04-05_14:00 rpool/data@2026-04-05_14:30 \
  | ssh dr-node zfs recv backup/data

# Verify: list snapshots on both sides
zfs list -t snapshot -r rpool/data
ssh dr-node zfs list -t snapshot -r backup/data

4. Sanoid — Automated Snapshot Policies

Taking snapshots manually does not scale. Sanoid is a policy-driven snapshot manager for ZFS that creates and prunes snapshots on a schedule. It is the standard tool in the ZFS ecosystem for automated snapshot lifecycle management, and kldload installs it by default on desktop and server profiles.

Install Sanoid

# CentOS / RHEL / Rocky (kldload includes sanoid in darksite)
dnf install -y sanoid

# Debian / Ubuntu
apt install -y sanoid

# Verify
sanoid --version

Configuration: /etc/sanoid/sanoid.conf

Sanoid uses an INI-style configuration file. Each section defines a dataset and its snapshot retention policy. The use_template directive lets you define reusable policies and apply them to multiple datasets.

# /etc/sanoid/sanoid.conf

# ── Templates ──────────────────────────────────────────────

[template_production]
  frequently = 0
  hourly = 48
  daily = 30
  monthly = 12
  yearly = 2
  autosnap = yes
  autoprune = yes

[template_database]
  # More frequent snapshots for databases — tighter RPO
  frequently = 4
  hourly = 72
  daily = 60
  monthly = 24
  yearly = 5
  autosnap = yes
  autoprune = yes

[template_scratch]
  # Scratch/temp datasets — minimal retention
  frequently = 0
  hourly = 24
  daily = 7
  monthly = 0
  yearly = 0
  autosnap = yes
  autoprune = yes

# ── Dataset policies ──────────────────────────────────────

[rpool/ROOT]
  use_template = production
  recursive = yes

[rpool/data]
  use_template = production

[rpool/data/postgres]
  use_template = database

[rpool/data/mysql]
  use_template = database

[rpool/docker]
  use_template = scratch

[rpool/tmp]
  use_template = scratch
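Before deploying a policy it helps to know how many snapshots each template will hold at steady state. The sketch below is a hypothetical helper (retention_totals is not part of Sanoid) that sums the retention counts per section of a sanoid.conf read from stdin. It does not resolve use_template inheritance, so dataset sections are only listed when they set counts of their own:

```shell
#!/usr/bin/env bash
# retention_totals: read a sanoid.conf on stdin and print, per section,
# the sum of its frequently/hourly/daily/weekly/monthly/yearly counts.
retention_totals() {
  awk -F'=' '
    /^[ \t]*#/ { next }                      # skip comment lines
    /^[ \t]*\[.*\]/ {                        # section header, e.g. [template_production]
      section = $0
      sub(/^[ \t]*\[/, "", section)
      sub(/\].*$/, "", section)
      next
    }
    section != "" && NF == 2 {
      key = $1
      gsub(/[ \t]/, "", key)
      if (key ~ /^(frequently|hourly|daily|weekly|monthly|yearly)$/) {
        if (!(section in total)) order[++n] = section
        total[section] += $2
      }
    }
    END { for (i = 1; i <= n; i++) print order[i], total[order[i]] + 0 }
  '
}

# Usage:
#   retention_totals < /etc/sanoid/sanoid.conf
```

Run against the configuration above, this reports 92 snapshots for template_production, 165 for template_database, and 31 for template_scratch, per dataset that uses the template.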

Sanoid systemd timer

Sanoid ships with a systemd timer that fires every 15 minutes by default. Each run creates any scheduled snapshots and prunes any that have exceeded their retention window.

# Enable and start the timer
systemctl enable --now sanoid.timer

# Check timer status
systemctl list-timers sanoid.timer

# Run sanoid manually to verify config
sanoid --take-snapshots --verbose

# View the snapshots it created
zfs list -t snapshot -r rpool -o name,creation,used | head -30

Per-dataset overrides

You can override any template value at the dataset level. This is how you give critical datasets tighter retention without duplicating the entire template.

# Override: keep 90 days of daily snapshots for compliance data
[rpool/data/audit-logs]
  use_template = production
  daily = 90
  monthly = 36
  yearly = 7

The single most common mistake with Sanoid is running it with autoprune disabled. Snapshots are nearly free to create but they hold space: every block referenced by any snapshot cannot be freed. A dataset with 500 unpruned snapshots spanning 6 months can show nearly zero "used" space per snapshot, because USED counts only the blocks unique to that one snapshot, while the dataset's usedbysnapshots total is massive and cannot be reclaimed until snapshots are destroyed. I have seen production pools hit 100% capacity because someone configured autosnap without autoprune. Always set both. Always monitor pool free space.
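Monitoring pool free space can start as a simple threshold check. A minimal sketch, assuming a hypothetical pool_capacity_alerts function that reads name/capacity pairs on stdin (the format printed by zpool list -H -o name,capacity). The 80% threshold in the usage line is chosen only for illustration:

```shell
#!/usr/bin/env bash
# pool_capacity_alerts THRESHOLD_PERCENT
# Reads "pool<TAB>capacity" lines on stdin and prints every pool at or
# above the threshold. Returns 1 if any pool is over, 0 otherwise.
pool_capacity_alerts() {
  local threshold=$1 name cap over=0
  while read -r name cap; do
    cap=${cap%\%}                        # strip the trailing % sign
    [[ $cap =~ ^[0-9]+$ ]] || continue   # skip malformed lines
    if (( cap >= threshold )); then
      echo "WARNING: ${name} is ${cap}% full"
      over=1
    fi
  done
  return "$over"
}

# Usage on a live system:
#   zpool list -H -o name,capacity | pool_capacity_alerts 80
```

The non-zero return code makes it easy to hook into a systemd timer or any alerting wrapper that reacts to failed units.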

5. Syncoid — Incremental Replication

Syncoid is Sanoid's companion tool. Where Sanoid manages the snapshot lifecycle on a single host, Syncoid replicates datasets between hosts using ZFS incremental send/receive. It handles the bookkeeping of finding the common snapshot, computing the delta, and applying it on the remote side.

Basic replication

# Replicate a single dataset to a remote host
syncoid rpool/data root@dr-node:backup/data

# First run sends a full stream (initial seed)
# Subsequent runs send only incremental deltas

# Replicate recursively — all child datasets
syncoid --recursive rpool/data root@dr-node:backup/data

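# Syncoid has no dry-run mode; to estimate how much an incremental would send,
# ask zfs send directly (-n = do not send, -v = print the estimated size)
zfs send -nv -i rpool/data@2026-04-05_14:00 rpool/data@2026-04-05_14:30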

Push vs. Pull

Syncoid can operate in push mode (run on the source, send to the destination) or pull mode (run on the destination, pull from the source). Pull mode is generally preferred for DR because the DR node controls the replication schedule and the production node does not need SSH access to the DR node.

# Push mode — run on production server
syncoid rpool/data root@dr-node:backup/data

# Pull mode — run on DR server
syncoid root@prod-node:rpool/data backup/data

# Pull mode with a dedicated SSH key
syncoid --sshkey=/root/.ssh/syncoid_ed25519 \
  root@prod-node:rpool/data backup/data

Bandwidth limiting and compression

# Limit bandwidth to roughly 50 Mbit/s to avoid saturating the WireGuard link
# (the limit is in bytes per second and requires mbuffer; 6M is about 50 Mbit/s)
syncoid --source-bwlimit=6M rpool/data root@dr-node:backup/data

# Syncoid buffers through mbuffer automatically when it is installed;
# a larger buffer reduces stalls on variable links
syncoid --mbuffer-size=128M rpool/data root@dr-node:backup/data

# Compress the send stream with lz4 (useful over WAN)
syncoid --compress=lz4 rpool/data root@dr-node:backup/data

# Use pigz for parallel gzip compression on multi-core machines
syncoid --compress=pigz-fast rpool/data root@dr-node:backup/data

Automated replication with systemd

# /etc/systemd/system/syncoid-replication.service
[Unit]
Description=ZFS replication via syncoid
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/syncoid --recursive --no-sync-snap \
  --compress=lz4 rpool/data root@dr-node:backup/data
ExecStart=/usr/sbin/syncoid --recursive --no-sync-snap \
  --compress=lz4 rpool/ROOT root@dr-node:backup/ROOT
# Limit CPU and I/O impact on production
# (IOWeight defaults to 100, so a lower value is needed to deprioritize I/O)
CPUQuota=50%
IOWeight=50
Nice=10

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/syncoid-replication.timer
[Unit]
Description=Run ZFS replication every 15 minutes

[Timer]
OnCalendar=*:0/15
RandomizedDelaySec=60
Persistent=true

[Install]
WantedBy=timers.target

# Enable the timer
systemctl daemon-reload
systemctl enable --now syncoid-replication.timer

# Verify
systemctl list-timers syncoid-replication.timer
journalctl -u syncoid-replication.service --since "1 hour ago"
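Timer status shows that the job ran, not that the replica is fresh. A rough freshness check compares the creation time of the newest snapshot on the DR side with the clock. The sketch below assumes a hypothetical replication_lag helper that takes epoch seconds (as printed by zfs list -Hp -o creation) and a maximum tolerated lag. The dataset name and the 30-minute limit in the usage comment are illustrative:

```shell
#!/usr/bin/env bash
# replication_lag NEWEST_SNAPSHOT_EPOCH NOW_EPOCH MAX_LAG_SECONDS
# Prints the lag in seconds and returns 1 if it exceeds the allowed maximum.
replication_lag() {
  local newest=$1 now=$2 max=$3
  local lag=$(( now - newest ))
  if (( lag > max )); then
    echo "STALE lag=${lag}s max=${max}s"
    return 1
  fi
  echo "OK lag=${lag}s max=${max}s"
}

# Usage on the DR node:
#   newest=$(zfs list -Hp -d 1 -t snapshot -o creation -s creation backup/data | tail -n 1)
#   replication_lag "$newest" "$(date +%s)" 1800
```

Passing the timestamps in as arguments keeps the comparison testable without a live pool.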

Encryption in transit

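Syncoid uses SSH by default, so replication is encrypted in transit. Over a WireGuard tunnel the stream is encrypted twice (SSH inside WireGuard), which costs some CPU but is otherwise harmless. If the datasets use ZFS native encryption, raw sends (--sendoptions="w") also skip decryption and re-encryption on the sending side, because blocks are transmitted exactly as they are stored on disk.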

# Over WireGuard (already encrypted) — use raw send for speed
syncoid --sendoptions="w" --recursive \
  rpool/data root@10.100.0.2:backup/data

# The "w" flag sends the raw encrypted stream if the dataset
# uses ZFS native encryption — the DR node cannot read the data
# without the encryption key, providing encryption at rest on
# the DR side without trusting the DR node

The raw send flag (-w) is one of the most underappreciated features in ZFS for DR. If your datasets use ZFS native encryption, sending with -w means the DR node receives and stores the data in its encrypted form. The DR node never sees the plaintext. This means: (1) the DR node does not need the encryption key during normal replication, (2) a compromised DR node cannot read your data, (3) you only need the key when you actually fail over and mount the datasets. This is encryption at rest on the DR side for free, with zero performance overhead, and it works with syncoid out of the box. Most people do not know this exists.

6. Cross-Site Replication Architecture

A real DR architecture requires a dedicated replica at a geographically separate location, connected via an encrypted network, with automated replication and monitoring. This section describes the reference architecture for kldload cross-site DR.

Topology

┌─────────────────────────────┐          WireGuard           ┌─────────────────────────────┐
│       PRIMARY SITE          │      Storage Backplane        │         DR SITE             │
│                             │    10.100.0.0/24 (wg-dr)     │                             │
│  ┌───────────┐              │◄────────────────────────────►│              ┌───────────┐  │
│  │ prod-node │ rpool/data   │                               │  backup/data │  dr-node  │  │
│  │           │ rpool/ROOT   │   syncoid every 15 min        │  backup/ROOT │           │  │
│  │           │ rpool/db     │──────────────────────────────►│  backup/db   │           │  │
│  └───────────┘              │                               │              └───────────┘  │
│                             │                               │                             │
│  Sanoid snapshots locally   │                               │  Sanoid prunes old snaps    │
│  Syncoid pushes to DR       │                               │  Prometheus monitors lag     │
└─────────────────────────────┘                               └─────────────────────────────┘

WireGuard storage plane

DR replication traffic should travel over a dedicated WireGuard interface, separate from your service traffic. This provides encryption, access control, and bandwidth isolation.

# /etc/wireguard/wg-dr.conf — on production node
[Interface]
Address = 10.100.0.1/24
PrivateKey = <prod-private-key>
ListenPort = 51821

[Peer]
PublicKey = <dr-public-key>
AllowedIPs = 10.100.0.2/32
Endpoint = dr-site.example.com:51821
PersistentKeepalive = 25
# /etc/wireguard/wg-dr.conf — on DR node
[Interface]
Address = 10.100.0.2/24
PrivateKey = <dr-private-key>
ListenPort = 51821

[Peer]
PublicKey = <prod-public-key>
AllowedIPs = 10.100.0.1/32
Endpoint = prod-site.example.com:51821
PersistentKeepalive = 25

# Bring up the DR WireGuard interface on both nodes
systemctl enable --now wg-quick@wg-dr

# Verify connectivity
ping -c 3 10.100.0.2   # from prod
ping -c 3 10.100.0.1   # from DR

# Configure syncoid to use the WireGuard IP
syncoid --recursive rpool/data root@10.100.0.2:backup/data

SSH hardening for replication

The replication SSH key should be restricted to ZFS commands only. Use command= in authorized_keys to limit what the production node can do on the DR node.

# Generate a dedicated key pair for syncoid
ssh-keygen -t ed25519 -f /root/.ssh/syncoid_ed25519 -N "" -C "syncoid-replication"

# On the DR node, add to /root/.ssh/authorized_keys:
command="zfs recv -Fdu backup",restrict ssh-ed25519 AAAA... syncoid-replication

# For recursive replication, the command restriction needs to be broader:
command="/usr/local/bin/syncoid-receiver.sh",restrict ssh-ed25519 AAAA... syncoid-replication
# /usr/local/bin/syncoid-receiver.sh — on DR node
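#!/bin/bash
# Allow a single zfs subcommand and nothing else. Never eval the client's
# command: "zfs recv x; rm -rf /" would pass a prefix check and run in full.
# Note: syncoid also issues helper commands and pipelines on the remote side,
# so test replication end to end before relying on this filter.
cmd="${SSH_ORIGINAL_COMMAND:-}"
allowed='^[A-Za-z0-9_@/:.,= -]+$'
if [[ ! $cmd =~ $allowed ]]; then
  echo "Command not allowed: $cmd" >&2
  exit 1
fi
case "$cmd" in
  zfs\ recv*|zfs\ receive*|zfs\ list*|zfs\ get*|zfs\ snapshot*)
    read -r -a argv <<<"$cmd"
    exec "${argv[@]}"
    ;;
  *)
    echo "Command not allowed: $cmd" >&2
    exit 1
    ;;
esac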

I cannot overstate the importance of the dedicated WireGuard interface for DR traffic. If your replication shares the same network path as your production traffic, a network issue affects both simultaneously, exactly when you need replication the most. A separate WireGuard tunnel means: separate routing, separate firewall rules, separate bandwidth accounting, and you can rate-limit replication without affecting production. It also means the DR node's SSH access can be restricted to the WireGuard IP (via sshd's ListenAddress or a firewall rule) instead of being reachable from the public internet.

7. Boot Environments

A boot environment is a bootable clone of your root filesystem. ZFS makes this trivial: clone the root dataset, update the bootloader to point at the clone, and reboot into it. If the new environment has problems, reboot into the old one. This is the safest way to perform OS upgrades, kernel updates, and major configuration changes.

How boot environments work on kldload

# kldload's root layout (default install)
# rpool/ROOT/centos      — the root dataset
# rpool/ROOT/centos@safe  — pre-upgrade snapshot

# List current boot environments
zfs list -r rpool/ROOT -o name,used,mountpoint,origin

# The active boot environment is the one mounted at /

Creating a boot environment before an upgrade

# 1. Snapshot the current root
zfs snapshot rpool/ROOT/centos@pre-upgrade-2026-04-05

# 2. Clone it to a new boot environment
zfs clone rpool/ROOT/centos@pre-upgrade-2026-04-05 rpool/ROOT/centos-upgrade

# 3. Mount the clone temporarily and perform the upgrade
mount -t zfs rpool/ROOT/centos-upgrade /mnt
# ... chroot into /mnt and run dnf upgrade, or:
dnf --installroot=/mnt upgrade -y

# 4. Update the bootloader to offer both environments
# For systemd-boot (kldload default):
cat > /boot/efi/loader/entries/upgrade.conf <<'EOF'
title    CentOS Stream 9 (upgrade)
linux    /vmlinuz-upgrade
initrd   /initramfs-upgrade.img
options  root=zfs:rpool/ROOT/centos-upgrade rw
EOF

# 5. Reboot and select the upgrade entry
# If it works: keep it. If it breaks: reboot into the old entry.

Rolling back a failed upgrade

# Option 1: Reboot into the previous boot environment
# Select the old entry in the bootloader menu — done.

# Option 2: Rollback the snapshot (destructive — overwrites current state)
zfs rollback rpool/ROOT/centos@pre-upgrade-2026-04-05

# Option 3: Promote the old clone back to the active dataset
# (if you promoted the upgrade clone earlier)
zfs promote rpool/ROOT/centos

Boot environments as DR safety net

Boot environments are not a replacement for offsite replication, but they are an essential complement. Every upgrade, every kernel update, every major config change should create a boot environment first. The cost is essentially zero (CoW clone), and the safety net is immediate — no need to restore from a remote backup for a botched upgrade.

The boot environment model is why ZFS-on-root changes the calculus of system administration. On a traditional Linux system, a failed kernel upgrade means booting from a rescue ISO, mounting the root filesystem, manually reverting packages, fixing GRUB, and praying. On ZFS, it means selecting the previous boot entry and rebooting. The difference in mean time to recovery is the difference between 2 hours of stressful troubleshooting and 30 seconds of choosing a menu entry. Every kldload node should snapshot before every upgrade. Automate it — put it in your dnf plugin or your pre-upgrade script.
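A pre-upgrade wrapper can be as small as the sketch below. The function names (be_snapshot_name, safe_upgrade) are hypothetical. The snapshot name follows the pre-upgrade-YYYY-MM-DD pattern used earlier, and the sketch assumes findmnt reports the root dataset as the mount source on a ZFS-on-root system:

```shell
#!/usr/bin/env bash
# be_snapshot_name DATASET [DATE]
# Build the snapshot name used before an upgrade,
# e.g. rpool/ROOT/centos@pre-upgrade-2026-04-05
be_snapshot_name() {
  local dataset=$1 day=${2:-$(date +%F)}
  printf '%s@pre-upgrade-%s\n' "$dataset" "$day"
}

# safe_upgrade: snapshot the active root dataset, then upgrade
# only if the snapshot was created successfully.
safe_upgrade() {
  local root_ds snap
  root_ds=$(findmnt -n -o SOURCE /) || return 1
  snap=$(be_snapshot_name "$root_ds")
  zfs snapshot "$snap" || return 1
  echo "Created ${snap}"
  dnf upgrade -y
}
```

Because the name contains only the date, a second run on the same day stops at the zfs snapshot step instead of upgrading without a fresh safety net. Add a time component to the name if you upgrade more than once a day.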

8. Database Backup

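Databases require special handling because a crash-consistent filesystem snapshot is not the same as an application-consistent backup. A snapshot of a running database captures the data files mid-flight: write-ahead logs partially applied, dirty pages not yet flushed. Engines with a write-ahead log recover from that state on startup, provided the data files and the log are captured in the same atomic snapshot; engines without one (MyISAM, for example) can be left genuinely corrupted. The correct approach depends on the database.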

PostgreSQL: pg_dump + ZFS snapshots

# Strategy 1: Logical backup with pg_dump (portable, slow for large DBs)
pg_dump -Fc -Z 4 -f /backup/postgres/mydb_$(date +%Y%m%d_%H%M).dump mydb

# Restore from logical backup
pg_restore -d mydb /backup/postgres/mydb_20260405_1430.dump

# Strategy 2: ZFS snapshot (instant, block-level consistent)
# PostgreSQL on ZFS: put $PGDATA on its own dataset
# rpool/data/postgres — mountpoint /var/lib/pgsql/data

# Take a consistent snapshot:
# PostgreSQL's WAL ensures crash consistency — a ZFS snapshot
# of a running PostgreSQL instance is equivalent to a power failure,
# and PostgreSQL recovers from that using WAL replay on startup.
zfs snapshot rpool/data/postgres@$(date +%Y%m%d_%H%M)

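# For an explicit backup window, bracket the snapshot with pg_backup_start/pg_backup_stop
# (PostgreSQL 15+; older releases call them pg_start_backup/pg_stop_backup).
# Both calls must run in the same session, so drive the snapshot from inside psql:
SNAP="rpool/data/postgres@consistent_$(date +%Y%m%d_%H%M)"
psql <<EOF
SELECT pg_backup_start('zfs-snap');
\! zfs snapshot ${SNAP}
SELECT pg_backup_stop();
EOF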

PostgreSQL point-in-time recovery (PITR)

# Enable WAL archiving in postgresql.conf
archive_mode = on
archive_command = 'cp %p /backup/postgres/wal/%f'
wal_level = replica

# With WAL archiving + ZFS snapshots, you can recover to any
# point in time between snapshots:
# 1. Restore the most recent snapshot before the target time
# 2. Replay WAL files up to the exact target timestamp

# recovery.conf (PostgreSQL 12+: recovery.signal + postgresql.conf)
restore_command = 'cp /backup/postgres/wal/%f %p'
recovery_target_time = '2026-04-05 14:25:00'
recovery_target_action = 'promote'

MySQL / MariaDB with consistent snapshots

# For InnoDB (default): ZFS snapshots are crash-consistent
# InnoDB's redo log provides the same crash recovery as PostgreSQL's WAL
zfs snapshot rpool/data/mysql@$(date +%Y%m%d_%H%M)

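# For mixed InnoDB/MyISAM: flush and lock first.
# The read lock is released when the session ends, so the snapshot
# must be taken from inside the same mysql session:
SNAP="rpool/data/mysql@consistent_$(date +%Y%m%d_%H%M)"
mysql <<EOF
FLUSH TABLES WITH READ LOCK;
system zfs snapshot ${SNAP}
UNLOCK TABLES;
EOF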

# Logical backup (portable, slow)
mysqldump --single-transaction --routines --triggers \
  --all-databases | gzip > /backup/mysql/all_$(date +%Y%m%d_%H%M).sql.gz

# Binary backup with xtrabackup (fast, InnoDB only)
xtrabackup --backup --target-dir=/backup/mysql/xtra_$(date +%Y%m%d_%H%M)

Automated database backup script

#!/bin/bash
# /usr/local/bin/db-backup.sh — run via systemd timer
set -euo pipefail

TIMESTAMP=$(date +%Y%m%d_%H%M)
LOG="/var/log/db-backup.log"

echo "[${TIMESTAMP}] Starting database backup" >> "$LOG"

# PostgreSQL: logical + snapshot
if systemctl is-active --quiet postgresql; then
    pg_dump -Fc -Z 4 -f "/backup/postgres/mydb_${TIMESTAMP}.dump" mydb
    zfs snapshot "rpool/data/postgres@${TIMESTAMP}"
    echo "[${TIMESTAMP}] PostgreSQL backup complete" >> "$LOG"
fi

# MySQL: snapshot (InnoDB crash-consistent)
if systemctl is-active --quiet mysqld; then
    zfs snapshot "rpool/data/mysql@${TIMESTAMP}"
    echo "[${TIMESTAMP}] MySQL backup complete" >> "$LOG"
fi

# Prune local logical backups older than 7 days
find /backup/postgres/ -name "*.dump" -mtime +7 -delete
find /backup/mysql/ -name "*.sql.gz" -mtime +7 -delete

echo "[${TIMESTAMP}] Database backup finished" >> "$LOG"

The question I get asked most about database backups on ZFS is: "Do I still need pg_dump if I have ZFS snapshots?" The answer is yes, for two reasons. First, logical backups are portable: you can restore a pg_dump to a different server, a different version, a different OS. A ZFS snapshot can only be restored to a ZFS pool. Second, logical backups let you restore individual tables or rows. A ZFS snapshot is all-or-nothing for the dataset. The correct strategy is both: ZFS snapshots for fast RPO (15-minute incremental replicas) and periodic pg_dump for portability and granular recovery. They serve different purposes.

9. Kubernetes Backup

Kubernetes clusters have two categories of state that need protection: the cluster state (stored in etcd) and the persistent data (stored on PersistentVolumes). Losing etcd means losing all Kubernetes object definitions — deployments, services, configmaps, secrets. Losing PVs means losing application data. You need to protect both.

etcd snapshots

# Take an etcd snapshot (run on a control plane node)
SNAP=/backup/etcd/snapshot_$(date +%Y%m%d_%H%M).db
ETCDCTL_API=3 etcdctl snapshot save "$SNAP" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot (reuse the same path; re-running date could yield a different name)
ETCDCTL_API=3 etcdctl snapshot status "$SNAP" \
  --write-out=table

# Automated etcd backup — systemd timer
cat > /etc/systemd/system/etcd-backup.service <<'EOF'
[Unit]
Description=etcd snapshot backup
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/etcd-backup.sh
EOF

cat > /etc/systemd/system/etcd-backup.timer <<'EOF'
[Unit]
Description=Run etcd backup every hour

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now etcd-backup.timer

Restore etcd from snapshot

# Stop the kube-apiserver and etcd
systemctl stop kube-apiserver
systemctl stop etcd

# Restore the snapshot
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd/snapshot_20260405_1400.db \
  --data-dir=/var/lib/etcd-restore \
  --name=$(hostname) \
  --initial-cluster="$(hostname)=https://$(hostname):2380" \
  --initial-advertise-peer-urls="https://$(hostname):2380"

# Replace the etcd data directory
mv /var/lib/etcd /var/lib/etcd.old
mv /var/lib/etcd-restore /var/lib/etcd
chown -R etcd:etcd /var/lib/etcd

# Restart services
systemctl start etcd
systemctl start kube-apiserver

# Verify cluster health
kubectl get nodes
kubectl get pods --all-namespaces

Velero for cluster state backup

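# Install Velero CLI (release assets carry the version in their file name, so pin one;
# v1.13.0 is an example chosen to pair with the v1.9.0 AWS plugin below)
VELERO_VERSION=v1.13.0
curl -Lo /tmp/velero.tar.gz \
  "https://github.com/vmware-tanzu/velero/releases/download/${VELERO_VERSION}/velero-${VELERO_VERSION}-linux-amd64.tar.gz"
tar -xzf /tmp/velero.tar.gz -C /tmp
install -m 0755 /tmp/velero-*/velero /usr/local/bin/velero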

# Install Velero server components with MinIO backend (on-prem S3)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero-backups \
  --secret-file /etc/velero/credentials-velero \
  --backup-location-config \
    region=us-east-1,s3ForcePathStyle=true,s3Url=http://minio.storage.svc:9000 \
  --use-node-agent

# Create a backup of all namespaces
velero backup create full-cluster-$(date +%Y%m%d) \
  --include-namespaces '*' \
  --default-volumes-to-fs-backup

# Schedule daily backups with 30-day retention
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces '*' \
  --ttl 720h

# List backups
velero backup get

# Restore from a Velero backup
velero restore create --from-backup full-cluster-20260405

PersistentVolume snapshots via ZFS CSI

# If using OpenZFS CSI driver, PV snapshots are ZFS snapshots
# Create a VolumeSnapshot for a PVC
cat <<'EOF' | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snap
  namespace: production
spec:
  volumeSnapshotClassName: zfs-snapshot-class
  source:
    persistentVolumeClaimName: postgres-data-pvc
EOF

# List snapshots
kubectl get volumesnapshot -n production

# Restore from snapshot — create a new PVC from the snapshot
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-restored
  namespace: production
spec:
  storageClassName: zfs-csi
  dataSource:
    name: postgres-data-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
EOF
The number one Kubernetes DR mistake is backing up etcd but never testing the restore. A snapshot restore writes a new cluster ID and new member IDs into the data directory, so every etcd member has to be restored from the same snapshot and started as part of the new cluster. If you are running a multi-member etcd cluster, the procedure is significantly more complex: stop etcd on all members, restore each one with matching --initial-cluster and peer URL settings, then start them together. Test this quarterly. During a real incident, the last thing you want is to be reading documentation about etcd cluster recovery for the first time while your CEO is standing behind you asking when the site will be back up.

10. Failover Procedures

When a disaster is declared, you need a step-by-step runbook that anyone on the team can execute. This is not the time for improvisation. The runbook should be tested quarterly and updated after every test.

Step-by-step failover runbook

# ═══════════════════════════════════════════════════════════
# DR FAILOVER RUNBOOK — kldload
# ═══════════════════════════════════════════════════════════
#
# TRIGGER: Primary site unreachable for > 15 minutes
#          AND confirmed by two team members
#          AND incident commander declares DR activation
#
# ESTIMATED TIME: 45-60 minutes
# LAST TESTED: 2026-04-01 (quarterly drill)
# ═══════════════════════════════════════════════════════════

# ── PHASE 1: ASSESS (5 minutes) ──────────────────────────

# 1.1 Verify primary is truly down (not just monitoring flap)
ping -c 5 prod-node.example.com
ssh -o ConnectTimeout=10 root@prod-node.example.com echo "alive"
# If either succeeds: stand down, investigate, do not failover.

# 1.2 Check replication lag — how much data will we lose?
ssh root@dr-node "zfs list -t snapshot -r backup/data -o name,creation \
  -s creation | tail -5"
# The newest snapshot timestamp = your actual RPO for this incident.

# ── PHASE 2: PROMOTE DR DATASETS (10 minutes) ────────────

# 2.1 Stop replication to prevent partial writes
ssh root@dr-node "systemctl stop syncoid-replication.timer"

# 2.2 Set datasets to read-write on DR node
ssh root@dr-node "zfs set readonly=off backup/data"
ssh root@dr-node "zfs set readonly=off backup/ROOT"
ssh root@dr-node "zfs set readonly=off backup/db"

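# 2.3 Mount the datasets at their production paths
ssh root@dr-node "zfs set mountpoint=/data backup/data"
ssh root@dr-node "zfs set mountpoint=/var/lib/pgsql backup/db"
# backup/ROOT is the replicated OS root. Do NOT set mountpoint=/ on a running
# DR node: it would mount over the live root filesystem. Mount it at a side
# path (/mnt/prod-root is a placeholder) and copy the configs you need from it.
ssh root@dr-node "zfs set mountpoint=/mnt/prod-root backup/ROOT"
ssh root@dr-node "zfs mount -a"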

# ── PHASE 3: START SERVICES (15 minutes) ─────────────────

# 3.1 Start services in dependency order
ssh root@dr-node "systemctl start postgresql"
ssh root@dr-node "systemctl start redis"
ssh root@dr-node "systemctl start application"
ssh root@dr-node "systemctl start nginx"

# 3.2 Verify services are healthy
ssh root@dr-node "systemctl is-active postgresql redis application nginx"
ssh root@dr-node "curl -sf http://localhost:8080/health"

# ── PHASE 4: DNS CUTOVER (5 minutes) ─────────────────────

# 4.1 Update DNS to point at DR node
# If using Cloudflare:
curl -X PUT "https://api.cloudflare.com/client/v4/zones/ZONE_ID/dns_records/RECORD_ID" \
  -H "Authorization: Bearer CF_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"type":"A","name":"app.example.com","content":"DR_NODE_IP","ttl":60}'

# 4.2 Lower TTL before failover (should already be low — 60s)
# If TTL was 3600, it takes up to 1 hour for all clients to see the change.
# PRE-WORK: Set TTL to 60 on all critical DNS records NOW, before any disaster.

# ── PHASE 5: VALIDATE (10 minutes) ───────────────────────

# 5.1 External health check
curl -sf https://app.example.com/health

# 5.2 Run smoke tests
./scripts/dr-smoke-test.sh

# 5.3 Check monitoring — alerts should clear
# Grafana: https://grafana.example.com/d/dr-status

# ── PHASE 6: COMMUNICATE ─────────────────────────────────
# Notify stakeholders: DR activated, services restored,
# data loss = [RPO from step 1.2], ETA for primary restore = TBD

DNS pre-work: low TTL

Critical pre-work: Set DNS TTLs to 60 seconds on all records that would change during failover. Do this now, during peacetime. If your TTL is 3600 (1 hour) and you change the DNS record during a disaster, some clients will not see the change for up to an hour. A 60-second TTL means convergence within 2 minutes.

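# Verify current TTLs
# The second column of each answer line is the TTL in seconds.
# A caching resolver reports the remaining TTL; query your authoritative
# nameserver (dig @<nameserver> ...) to see the configured value.
dig +noall +answer app.example.com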

# Set TTL to 60 in your DNS provider (Cloudflare example)
# Proxied records: TTL is managed by Cloudflare (automatic)
# DNS-only records: set TTL to 60
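
To check many records at once, the TTL column can be pulled out of the `dig` answer section and compared against the 60-second target. This is a small sketch; `max_answer_ttl` is a hypothetical helper name, and it assumes the standard `name TTL IN type value` layout that `dig +noall +answer` prints.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `dig +noall +answer` output on stdin and print the
# largest TTL seen, which bounds how long clients may cache the old address.
# Prints 0 when there are no answer lines.
max_answer_ttl() {
    awk '$3 == "IN" && $2 ~ /^[0-9]+$/ { if ($2 > max) max = $2 }
         END { print max + 0 }'
}

# Usage:
# ttl=$(dig +noall +answer app.example.com | max_answer_ttl)
# [ "$ttl" -le 60 ] || echo "WARN: app.example.com TTL is ${ttl}s, lower it before you need it"
```
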
The single biggest time sink in every DR failover I have participated in is service dependency order. You start PostgreSQL, then the application, and it crashes because Redis is not running. You start Redis, then the application, and it times out because the background worker has not processed the queue. Every service depends on two other services, and nobody documented the startup graph. The fix is to write the startup order in the runbook, test it quarterly, and update it whenever you add a new service. The runbook is a living document — a stale runbook is worse than no runbook because it gives false confidence.
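
One way to make the startup order executable rather than tribal knowledge is to encode it in a small helper that starts each unit and waits for it to become active before moving on. This is a sketch only: `start_in_order` is a hypothetical function, the unit names are the ones used in Phase 3 of the runbook, and "active" in systemd terms is not the same as "healthy", so the runbook's health checks still apply afterwards.

```bash
#!/usr/bin/env bash
# SYSTEMCTL is overridable so the logic can be exercised without systemd.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Hypothetical helper: start units one at a time, in the order given, and wait
# up to <timeout> seconds for each to report active before starting the next.
start_in_order() {
    local timeout="$1"; shift
    local unit waited
    for unit in "$@"; do
        "$SYSTEMCTL" start "$unit" || { echo "FAIL: could not start $unit" >&2; return 1; }
        waited=0
        until "$SYSTEMCTL" is-active --quiet "$unit"; do
            if [ "$waited" -ge "$timeout" ]; then
                echo "FAIL: $unit not active after ${timeout}s" >&2
                return 1
            fi
            sleep 1
            waited=$((waited + 1))
        done
        echo "OK: $unit"
    done
}

# Usage, with the dependency order from the runbook:
# start_in_order 60 postgresql redis application nginx
```

Stopping at the first failure is deliberate: it points at the exact link in the dependency chain that needs attention.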

11. DR Testing

A DR plan that has not been tested is a hypothesis, not a plan. Testing reveals gaps in documentation, missing dependencies, expired credentials, incorrect assumptions, and timing issues. Test quarterly at minimum.

Quarterly fire drill procedure

# DR Fire Drill Checklist
# ═══════════════════════════════════════════════════════════

# 1. Schedule: pick a date, notify the team, block 4 hours
# 2. Pre-drill: verify replication is current, check DR node health
# 3. Simulate: follow the failover runbook exactly as written
# 4. Validate: run the full smoke test suite against DR
# 5. Measure: record actual failover time vs. target RTO
# 6. Document: write up findings — what worked, what broke, what to fix
# 7. Remediate: fix every issue found within 2 weeks
# 8. Re-test: if critical issues were found, schedule a re-test

Automated DR validation script

#!/bin/bash
# /usr/local/bin/dr-validate.sh
# Run on DR node to validate replica health
set -euo pipefail

ERRORS=0
REPORT="/var/log/dr-validation-$(date +%Y%m%d).log"

echo "=== DR Validation Report — $(date) ===" > "$REPORT"

# Check 1: Replication lag
# -p prints the creation time as epoch seconds, so no date parsing is needed.
SNAP_EPOCH=$(zfs list -Hp -t snapshot -r backup/data -o creation -s creation \
  2>/dev/null | tail -1) || true
SNAP_EPOCH=${SNAP_EPOCH:-0}   # no snapshot found: report a very large lag
NOW_EPOCH=$(date +%s)
LAG_MINUTES=$(( (NOW_EPOCH - SNAP_EPOCH) / 60 ))

echo "Replication lag: ${LAG_MINUTES} minutes" >> "$REPORT"
if [ "$LAG_MINUTES" -gt 30 ]; then
    echo "FAIL: Replication lag exceeds 30 minutes" >> "$REPORT"
    ERRORS=$((ERRORS + 1))
fi

# Check 2: Dataset integrity
for ds in backup/data backup/ROOT backup/db; do
    if zfs list "$ds" &>/dev/null; then
        echo "OK: Dataset $ds exists" >> "$REPORT"
    else
        echo "FAIL: Dataset $ds missing" >> "$REPORT"
        ERRORS=$((ERRORS + 1))
    fi
done

# Check 3: Disk space
POOL_FREE=$(zpool list -Hp -o free backup 2>/dev/null || echo 0)
POOL_FREE_GB=$((POOL_FREE / 1073741824))
echo "DR pool free space: ${POOL_FREE_GB} GB" >> "$REPORT"
if [ "$POOL_FREE_GB" -lt 50 ]; then
    echo "FAIL: DR pool free space below 50 GB" >> "$REPORT"
    ERRORS=$((ERRORS + 1))
fi

# Check 4: Services can start (dry run)
echo "Service readiness checks:" >> "$REPORT"
for svc in postgresql redis nginx; do
    if systemctl cat "$svc" &>/dev/null; then
        echo "  OK: $svc unit file present" >> "$REPORT"
    else
        echo "  WARN: $svc unit file missing on DR node" >> "$REPORT"
    fi
done

# Summary
echo "" >> "$REPORT"
if [ "$ERRORS" -eq 0 ]; then
    echo "RESULT: ALL CHECKS PASSED" >> "$REPORT"
else
    echo "RESULT: ${ERRORS} CHECK(S) FAILED — investigate immediately" >> "$REPORT"
fi

cat "$REPORT"
exit "$ERRORS"

Chaos engineering for DR

Beyond scheduled fire drills, introduce controlled failures to validate resilience:

# Kill the primary node's network (simulate site failure)
# On the primary node:
iptables -A OUTPUT -j DROP
# Wait, then verify DR monitoring detects the outage

# Simulate disk failure on primary
# On the primary node (test only — use a spare disk):
zpool offline rpool /dev/sdb

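# Simulate a failed replication run
# Point syncoid at a target pool that does not exist so the receive fails
# ("no-such-pool" is a placeholder). Avoid --force-delete here: it removes
# target datasets rather than producing a failure.
syncoid rpool/test root@dr-node:no-such-pool/test 2>&1 || true
# Verify the monitoring alert fires for failed replication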

# ALWAYS have a rollback plan for chaos tests
# ALWAYS run chaos tests in a maintenance window
# NEVER run chaos tests in production without team consensus
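
A rollback plan is most reliable when it is built into the test itself. The sketch below applies the outbound drop rule from the example above, holds it for a fixed time, and removes it when the subshell exits. `inject_outbound_drop` is a hypothetical helper; run it from a local console or similar, since the rule also cuts off your own SSH session.

```bash
#!/usr/bin/env bash
# IPT is overridable so the sequence can be previewed with IPT=echo.
IPT="${IPT:-iptables}"

# Hypothetical helper: drop all outbound traffic for <hold_seconds>, then
# remove the rule again.
inject_outbound_drop() {
    local hold_seconds="$1"
    (
        # The EXIT trap runs when this subshell exits, so the rule is removed
        # after the hold period without a separate manual step.
        trap '"$IPT" -D OUTPUT -j DROP' EXIT
        "$IPT" -A OUTPUT -j DROP
        sleep "$hold_seconds"
    )
}

# Dry run (prints the two iptables invocations instead of executing them):
# IPT=echo inject_outbound_drop 0
# Real run, 20 minutes of simulated site failure:
# inject_outbound_drop 1200
```
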
I run DR drills for clients where the rule is: the person who built the infrastructure is not allowed to touch the keyboard. Someone else on the team must follow the runbook and execute the failover. This immediately reveals every undocumented step, every assumption that only lives in one person's head, every "oh, you also need to do X" that is not in the runbook. If your DR plan requires the one person who knows everything to be available, it is not a plan — it is a single point of failure wearing a human suit. Write the runbook so that any competent engineer on the team can execute it.

12. Ransomware Recovery

Ransomware is one of the most common disaster scenarios in modern infrastructure. The attack encrypts your data and demands payment for the decryption key. ZFS provides unusually strong defenses, but only if it is configured correctly before the attack.

Immutable snapshots

ZFS snapshots are inherently read-only — nothing can modify a snapshot after creation. However, a root-level attacker can destroy snapshots. The defense is to replicate snapshots to a separate system where the production node does not have destroy permissions.

# The attack surface: if ransomware gets root on the production node
# it can destroy snapshots:
# zfs destroy rpool/data@2026-04-05_14:00   # attacker destroys your backups

# The defense: the DR node holds the replicas, and the production
# node's SSH key is restricted to "zfs recv" — it cannot destroy
# anything on the DR node.

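# On the DR node, the authorized_keys restriction limits this key to receiving:
command="zfs recv -Fdu backup",restrict ssh-ed25519 AAAA... syncoid-replication
# The production node can SEND data but cannot run "zfs destroy" on DR.
# Caveat: with -F, receiving an incremental replication stream (zfs send -R)
# removes target snapshots that no longer exist on the source. If the sender
# may be compromised, protect long-lived DR snapshots separately, for example
# with "zfs hold" on the DR node.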

Offline replica (air-gapped backup)

# For maximum ransomware protection: maintain an air-gapped replica
# that is only connected during replication windows

# 1. Connect the offline DR drive/node
# 2. Run replication
syncoid --recursive rpool/data root@offline-node:airgap/data

# 3. Disconnect the offline DR drive/node
# The air gap means ransomware cannot reach this replica even with root

# For USB-attached ZFS pools:
zpool import airgap-pool
syncoid --recursive rpool/data airgap-pool/data
zpool export airgap-pool
# Physically disconnect the drive and store it offsite

Detection and response

# Signs of ransomware on ZFS:
# - Sudden massive space consumption (encryption creates new blocks)
# - Unusual snapshot destruction activity
# - High write I/O on datasets that are normally read-heavy
# - Files with ransomware extensions (.encrypted, .locked, etc.)

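# Monitor for snapshot destruction attempts
# Add to /etc/zfs/zed.d/zed.rc:
ZED_NOTIFY_VERBOSE=1

# Custom zedlet to alert on snapshot destruction.
# ZED selects zedlets by file name prefix, so the name must start with the
# event subclass (history_event). Expect alerts for routine Sanoid pruning too.
cat > /etc/zfs/zed.d/history_event-snapshot-destroy-alert.sh <<'ZEDEOF'
#!/bin/bash
# Alert when a "destroy" history event names a snapshot (dataset@snap)
if [ "$ZEVENT_HISTORY_INTERNAL_NAME" = "destroy" ] && \
   [[ "$ZEVENT_HISTORY_DSNAME" == *@* ]]; then
    echo "ALERT: Snapshot destroyed: $ZEVENT_HISTORY_DSNAME" | \
      mail -s "ZFS Snapshot Destruction Alert" admin@example.com
fi
ZEDEOF
chmod +x /etc/zfs/zed.d/history_event-snapshot-destroy-alert.sh
systemctl restart zfs-zed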
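
The first sign in the list, sudden space consumption, can be checked from the snapshot history itself: each snapshot's `written` property records how much data changed since the previous one. The sketch below flags snapshots whose `written` value is far above the average of the earlier ones. `flag_write_spikes` and the factor of 10 are assumptions for illustration; a legitimate bulk import will trip it as well, so treat a hit as a pointer for `zfs diff`, not as proof.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "name<TAB>written_bytes" lines (oldest first) on
# stdin and print every snapshot whose written size exceeds FACTOR times the
# average of all earlier snapshots.
flag_write_spikes() {
    local factor="${1:-10}"
    awk -F'\t' -v factor="$factor" '
        NR > 1 && sum > 0 && $2 > factor * (sum / (NR - 1)) {
            print "SPIKE: " $1 " wrote " $2 " bytes"
        }
        { sum += $2 }
    '
}

# Usage (-p prints written in bytes, -d 1 limits output to this dataset):
# zfs list -Hp -t snapshot -d 1 -o name,written -s creation rpool/data \
#   | flag_write_spikes 10
```

The snapshot just before the first spike is a reasonable first candidate for the last clean state.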

Ransomware recovery workflow

# ═══════════════════════════════════════════════════════════
# RANSOMWARE RECOVERY PROCEDURE
# ═══════════════════════════════════════════════════════════

# 1. ISOLATE — disconnect the infected node from the network immediately
#    Do NOT shut down — you may need memory forensics
ssh root@infected-node "nmcli device disconnect eth0"

# 2. ASSESS — determine the scope of infection
#    - Which datasets are affected?
#    - When did the encryption start? (check snapshot diffs)
#    - Are other nodes affected?

# 3. IDENTIFY the last clean snapshot
# Compare snapshots to find when the attack started:
zfs diff rpool/data@2026-04-05_14:00 rpool/data@2026-04-05_14:15
# Look for mass file modifications or ransomware note files

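# 4. RESTORE from the last clean snapshot on the DR node
# On the DR node. -r is required when newer snapshots exist, and it destroys
# them. If you need the encrypted state for forensics, use the clone below
# instead of rolling back.
zfs rollback -r backup/data@2026-04-05_14:00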

# Or clone the clean snapshot for forensic preservation:
zfs clone backup/data@2026-04-05_14:00 backup/data-clean
# Keep the encrypted state for forensic analysis

# 5. FAILOVER to DR using the standard failover runbook
# Follow Section 10 — Failover Procedures

# 6. REBUILD the primary node from scratch
# Do NOT attempt to "clean" the infected node
# Reinstall the OS, restore from clean backups
# Change ALL credentials — the attacker had root

# 7. POST-INCIDENT
# - Forensic analysis of the infected node
# - Root cause analysis — how did they get in?
# - Update firewall rules, access controls, monitoring
# - File insurance claim if applicable
# - Regulatory notification if PII was exposed
The reason ZFS snapshots are so effective against ransomware is the copy-on-write architecture. When ransomware encrypts your files, it writes new blocks — it cannot modify the blocks referenced by existing snapshots. The snapshots still contain your original, unencrypted data. The attacker knows this, which is why sophisticated ransomware explicitly destroys ZFS snapshots before encrypting. The defense is that your replicas are on a separate node where the attacker does not have destroy permissions. This is why the SSH key restriction in the cross-site replication section is not optional — it is your last line of defense.

13. Compliance & Retention

Regulatory frameworks impose minimum retention periods for different categories of data. Your snapshot and backup retention policies must satisfy these requirements. Failure to retain data for the required period is a compliance violation; failure to delete data after the retention period (in some jurisdictions) is also a violation.

Common retention requirements

| Regulation | Data Type | Minimum Retention | Notes |
|---|---|---|---|
| SOX (Sarbanes-Oxley) | Financial records, audit logs | 7 years | Applies to public companies and their IT systems |
| HIPAA | Patient health information | 6 years | From date of creation or last effective date |
| GDPR | Personal data | No minimum (only as long as necessary) | Right to erasure conflicts with backup retention |
| PCI DSS | Cardholder data, audit logs | 1 year (logs), 3 months online | Audit trail must be immediately available for 3 months |
| SEC Rule 17a-4 | Broker-dealer records | 3-6 years | Must be stored in non-rewritable, non-erasable format |
| FISMA | Federal system audit logs | 3 years | Depends on system categorization (Low/Moderate/High) |

Implementing retention with Sanoid

# Compliance dataset — 7-year retention for SOX
[rpool/data/financial]
  use_template = production
  daily = 90
  monthly = 84       # 7 years of monthly snapshots
  yearly = 7
  autoprune = yes

# Audit log dataset: 3-year retention (exceeds the PCI DSS 1-year minimum, matches FISMA)
[rpool/data/audit-logs]
  use_template = production
  daily = 90         # 3 months of daily snapshots (PCI: immediately available)
  monthly = 36       # 3 years of monthly snapshots
  yearly = 3
  autoprune = yes

# GDPR-sensitive dataset — shorter retention, must support deletion
[rpool/data/user-pii]
  use_template = production
  daily = 30
  monthly = 12
  yearly = 0         # No yearly — minimize PII retention
  autoprune = yes
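
Retention settings drift over time, so it can be worth checking the config against the required period mechanically. The following sketch reads a stanza from `sanoid.conf` and verifies that `monthly` covers a given number of years. Both function names are hypothetical, and the check only sees values set directly in the stanza; anything inherited through `use_template` is ignored.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of KEY inside [SECTION] of an INI-style
# file such as sanoid.conf. Trailing "# comments" are stripped.
stanza_value() {
    local file="$1" section="$2" key="$3"
    awk -v section="[$section]" -v key="$key" '
        /^\[/ { in_section = ($1 == section); next }
        in_section {
            line = $0
            sub(/#.*/, "", line)
            n = split(line, kv, "=")
            if (n == 2) {
                gsub(/[ \t]/, "", kv[1]); gsub(/[ \t]/, "", kv[2])
                if (kv[1] == key) { print kv[2]; exit }
            }
        }
    ' "$file"
}

# Hypothetical helper: succeed if the stanza keeps at least YEARS * 12 monthly
# snapshots.
meets_monthly_retention() {
    local file="$1" section="$2" years="$3" monthly
    monthly=$(stanza_value "$file" "$section" monthly)
    [ -n "$monthly" ] && [ "$monthly" -ge $(( years * 12 )) ]
}

# Usage:
# meets_monthly_retention /etc/sanoid/sanoid.conf rpool/data/financial 7 \
#   || echo "FAIL: financial dataset keeps less than 7 years of monthlies"
```
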

Legal hold on snapshots

# When litigation requires preserving data, you need to prevent
# snapshot pruning for the affected datasets. One option is to set
# autoprune = no in that dataset's sanoid.conf stanza for the duration.
# The wrapper below uses a ZFS user property instead, so the hold is
# recorded on the dataset itself.

# Place a legal hold: mark the dataset
zfs set com.kldload:legal_hold=active rpool/data/financial

# In the wrapper script that your sanoid timer calls:
HOLD=$(zfs get -H -o value com.kldload:legal_hold rpool/data/financial 2>/dev/null)
if [ "$HOLD" = "active" ]; then
    echo "Legal hold active on rpool/data/financial, skipping prune"
    # --take-snapshots without --prune-snapshots takes snapshots and prunes nothing.
    # Note: this pauses pruning for every dataset in the config, not only the held one.
    sanoid --take-snapshots --configdir=/etc/sanoid
else
    sanoid --cron
fi

# Release the hold when litigation concludes
zfs set com.kldload:legal_hold=released rpool/data/financial
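
ZFS also has a native mechanism for this: `zfs hold` places a named tag on a snapshot, and a held snapshot cannot be destroyed until the tag is released. The sketch below applies or releases a tag across a list of snapshots. `legal_hold` and the tag name are hypothetical, and this is offered as a complement to the property-based wrapper, not as the kldload method. Snapshots taken after the hold is placed are not covered, so the command needs to be re-run or combined with disabled pruning.

```bash
#!/usr/bin/env bash
# ZFS is overridable so the commands can be previewed with ZFS="echo zfs".
# It is expanded unquoted on purpose so that the override may contain a space.
ZFS="${ZFS:-zfs}"

# Hypothetical helper: read snapshot names on stdin and run
# "zfs hold <tag> <snap>" or "zfs release <tag> <snap>" for each one.
legal_hold() {
    local action="$1" tag="$2" snap
    case "$action" in
        hold|release) ;;
        *) echo "usage: legal_hold hold|release <tag>" >&2; return 2 ;;
    esac
    while read -r snap; do
        [ -n "$snap" ] || continue
        $ZFS "$action" "$tag" "$snap"
    done
}

# Usage (the tag "litigation-2026" is a placeholder):
# zfs list -H -t snapshot -d 1 -o name rpool/data/financial | legal_hold hold litigation-2026
# zfs list -H -t snapshot -d 1 -o name rpool/data/financial | legal_hold release litigation-2026
```
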
GDPR right-to-erasure and backup retention are fundamentally in conflict. A user requests deletion of their data. You delete it from the live database. But their data still exists in every snapshot and backup for the retention period. The pragmatic approach is to document this in your privacy policy: "Data may persist in encrypted backups for up to N days after deletion from active systems." The legal approach varies by jurisdiction; consult counsel. The technical approach is crypto-shredding: keep each user's or tenant's data in its own encrypted ZFS dataset and destroy that dataset's key material on deletion, which renders the copies in snapshots and replicas unrecoverable without modifying the snapshots. ZFS keys are per dataset, so this only works if the data was partitioned that way up front. This is not perfect, but it is the best balance of compliance and operational reality.

14. Monitoring DR Health

DR infrastructure that is not monitored will silently fail. Replication will stop and nobody will notice until the disaster happens. The following metrics must be monitored continuously and alerted on.

Key metrics to monitor

| Metric | Alert Threshold | Why |
|---|---|---|
| Replication lag (minutes since last successful syncoid) | > 30 minutes | Directly impacts your actual RPO |
| Newest snapshot age on DR node | > 30 minutes | Catches replication failures even if timer is running |
| DR pool free space | < 20% | Full pool = failed replication = no DR |
| Syncoid exit code | != 0 | Catches auth failures, network issues, ZFS errors |
| Snapshot count per dataset | > 1000 | Indicates autoprune failure; will eventually fill pool |
| WireGuard handshake age | > 5 minutes | DR tunnel is down; replication will fail |
| etcd backup age (K8s) | > 2 hours | Stale etcd backup = larger blast radius |

Prometheus exporter for ZFS replication

#!/bin/bash
# /usr/local/bin/zfs-dr-exporter.sh
# Textfile collector for node_exporter
# Run via cron every 5 minutes

METRICS_FILE="/var/lib/prometheus/node-exporter/zfs_dr.prom"
TMPFILE="${METRICS_FILE}.tmp"

cat > "$TMPFILE" <<'HEADER'
# HELP zfs_dr_replication_lag_seconds Seconds since last successful replication snapshot
# TYPE zfs_dr_replication_lag_seconds gauge
# HELP zfs_dr_pool_free_bytes Free bytes on DR pool
# TYPE zfs_dr_pool_free_bytes gauge
# HELP zfs_dr_pool_size_bytes Total size of DR pool in bytes
# TYPE zfs_dr_pool_size_bytes gauge
# HELP zfs_dr_snapshot_count Number of snapshots on dataset
# TYPE zfs_dr_snapshot_count gauge
HEADER

# Replication lag per dataset (-p prints creation as epoch seconds)
for ds in backup/data backup/ROOT backup/db; do
    SNAP_EPOCH=$(zfs list -Hp -t snapshot -r "$ds" -o creation -s creation 2>/dev/null | tail -1)
    if [ -n "$SNAP_EPOCH" ]; then
        NOW_EPOCH=$(date +%s)
        LAG=$((NOW_EPOCH - SNAP_EPOCH))
        echo "zfs_dr_replication_lag_seconds{dataset=\"${ds}\"} ${LAG}" >> "$TMPFILE"
    fi
done

# Pool free space and total size (the DRPoolSpaceLow alert divides one by the other)
POOL_FREE=$(zpool list -Hp -o free backup 2>/dev/null || echo 0)
POOL_SIZE=$(zpool list -Hp -o size backup 2>/dev/null || echo 0)
echo "zfs_dr_pool_free_bytes ${POOL_FREE}" >> "$TMPFILE"
echo "zfs_dr_pool_size_bytes ${POOL_SIZE}" >> "$TMPFILE"

# Snapshot counts
for ds in backup/data backup/ROOT backup/db; do
    COUNT=$(zfs list -t snapshot -r "$ds" -H 2>/dev/null | wc -l)
    echo "zfs_dr_snapshot_count{dataset=\"${ds}\"} ${COUNT}" >> "$TMPFILE"
done

mv "$TMPFILE" "$METRICS_FILE"
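
The metrics table also lists WireGuard handshake age, which the alert rules below take from a separate WireGuard exporter. If you do not run one, the age can be derived from `wg show` directly. A minimal sketch, assuming the interface is named `wg-dr` as elsewhere in this guide; `handshake_age_seconds` is a hypothetical helper that reports the most recent handshake across all peers.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `wg show <iface> latest-handshakes` output on stdin
# ("<public-key><TAB><epoch>" per peer, epoch 0 = never) and print the seconds
# since the most recent handshake. Returns 1 if no peer has ever completed one.
handshake_age_seconds() {
    local now="${1:-$(date +%s)}"
    awk -v now="$now" '
        $2 > newest { newest = $2 }
        END { if (newest > 0) print now - newest; else exit 1 }
    '
}

# Usage:
# age=$(wg show wg-dr latest-handshakes | handshake_age_seconds) \
#   && [ "$age" -le 300 ] || echo "WARN: DR tunnel handshake stale or missing"
```
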

Prometheus alerting rules

# /etc/prometheus/rules/dr-alerts.yml
groups:
  - name: disaster_recovery
    interval: 1m
    rules:
      - alert: ReplicationLagCritical
        expr: zfs_dr_replication_lag_seconds > 1800
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ZFS replication lag exceeds 30 minutes"
          description: "Dataset {{ $labels.dataset }} last replicated {{ $value | humanizeDuration }} ago"

      - alert: DRPoolSpaceLow
        expr: (zfs_dr_pool_free_bytes / zfs_dr_pool_size_bytes) < 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DR pool free space below 20%"
          description: "DR pool free space is at {{ $value | humanizePercentage }}"

      - alert: SnapshotCountHigh
        expr: zfs_dr_snapshot_count > 1000
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Snapshot count exceeds 1000 on {{ $labels.dataset }}"
          description: "Autoprune may not be running. Current count: {{ $value }}"

      - alert: SyncoidFailed
        expr: node_systemd_unit_state{name="syncoid-replication.service",state="failed"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Syncoid replication service failed"
          description: "Check journalctl -u syncoid-replication.service for details"

      - alert: WireGuardDRTunnelDown
        expr: time() - wireguard_latest_handshake_seconds{interface="wg-dr"} > 300
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "WireGuard DR tunnel handshake stale"
          description: "DR tunnel {{ $labels.interface }} last handshake {{ $value | humanizeDuration }} ago"

Grafana dashboard

Create a dedicated DR health dashboard with panels for: replication lag per dataset (time series), pool free space (gauge), snapshot counts (table), last syncoid run status (stat panel), and WireGuard tunnel status (state timeline). Pin this dashboard to the team's TV monitor.

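# Import a DR dashboard via Grafana API
# The file must contain the API payload, i.e. the dashboard JSON wrapped as
# {"dashboard": { ... }, "overwrite": true}, not a bare dashboard export.
curl -X POST http://localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -d @/etc/grafana/dashboards/dr-health.json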

# Key panels for the DR dashboard:
# 1. Replication Lag (time series) — query: zfs_dr_replication_lag_seconds
# 2. Pool Free Space (gauge) — query: zfs_dr_pool_free_bytes
# 3. Snapshot Counts (table) — query: zfs_dr_snapshot_count
# 4. Syncoid Status (stat) — query: node_systemd_unit_state{name="syncoid-replication.service"}
# 5. WireGuard Handshake (state timeline) — query: wireguard_latest_handshake_seconds{interface="wg-dr"}
I had a client whose syncoid replication silently failed for 47 days. The SSH key had been rotated during a security audit, and nobody updated the syncoid configuration. The timer ran every 15 minutes, failed every 15 minutes, and nobody looked at the logs because there was no alerting. When they had a disk failure on the primary, they discovered their "15-minute RPO" was actually a 47-day RPO. The monitoring in this section is not optional. If you take nothing else from this masterclass: set up the replication lag alert. One alert. Pager-level severity. That one alert would have saved my client from losing 47 days of data.

15. Complete Runbook Reference

Daily operations reference

| Task | Command | Frequency | Owner |
|---|---|---|---|
| Verify Sanoid is running | `systemctl status sanoid.timer` | Daily (automated) | Monitoring |
| Verify syncoid is running | `systemctl status syncoid-replication.timer` | Daily (automated) | Monitoring |
| Check replication lag | `zfs list -t snap -r backup -s creation \| tail -5` | Daily (automated) | Monitoring |
| Check DR pool space | `zpool list backup` | Daily (automated) | Monitoring |
| Verify WireGuard DR tunnel | `wg show wg-dr` | Daily (automated) | Monitoring |
| Check etcd backup freshness | `ls -lt /backup/etcd/ \| head -3` | Daily (automated) | Monitoring |
| Review DR alerts in Grafana | Dashboard: DR Health | Daily (manual) | On-call |

Quarterly DR test checklist

| Step | Action | Expected Result | Pass/Fail |
|---|---|---|---|
| 1 | Verify replication is current (lag < 30 min) | Latest snapshot within 30 minutes | |
| 2 | Execute failover runbook (Section 10) | Services running on DR node | |
| 3 | Run smoke tests against DR | All health checks pass | |
| 4 | Measure actual failover time | Within RTO target | |
| 5 | Verify data integrity on DR | Spot-check recent records exist | |
| 6 | Test database restore from logical backup | pg_restore completes, data verified | |
| 7 | Test etcd restore (K8s) | Cluster recovers, pods running | |
| 8 | Test boot environment rollback | System boots into previous environment | |
| 9 | Fail back to primary | Reverse replication, DNS restored | |
| 10 | Document findings and remediation plan | Written report distributed to team | |

Troubleshooting reference

| Symptom | Likely Cause | Resolution |
|---|---|---|
| Syncoid fails with "cannot receive: dataset has been modified" | Someone wrote to the DR dataset, breaking the replication chain | Reset DR dataset: `zfs rollback backup/data@latest_common_snap`, then re-run syncoid |
| Syncoid fails with "no matching snapshots" | All common snapshots between source and target were pruned | Full re-seed required: destroy target dataset, re-run syncoid for full send |
| Replication is slow / stalls | WireGuard tunnel congestion, mbuffer not installed, or --compress not set | Check `wg show wg-dr`, install mbuffer, add `--compress=lz4` |
| DR pool is full | Autoprune disabled or snapshots accumulating faster than pruning | Check sanoid.conf autoprune=yes, manually prune oldest snapshots: `zfs destroy backup/data@oldest` |
| Services fail to start on DR node | Config files reference primary hostname/IP, missing deps | Pre-stage configs on DR: /etc/dr-configs/ with DR-specific settings |
| DNS cutover is slow | TTL was not lowered in peacetime | Lower TTL to 60 now. During incident, wait for old TTL to expire |
| etcd restore fails with "member ID mismatch" | Restoring to a running cluster without stopping all members | Stop etcd on all nodes, restore on all nodes simultaneously, restart |
| Boot environment will not boot | Bootloader entry incorrect or kernel/initramfs missing | Boot from previous environment, check /boot/efi/loader/entries/ |
| ZFS encrypted datasets unreadable on DR | Encryption key not available on DR node | Load key: `zfs load-key backup/data` (the key must be available during failover) |
| Sanoid creates snapshots but does not prune | autoprune = no in config or sanoid running with --take-snapshots only | Set autoprune = yes, ensure timer runs `sanoid --cron` (not just --take-snapshots) |

Emergency contacts template

# ═══════════════════════════════════════════════════════════
# DR EMERGENCY CONTACTS
# ═══════════════════════════════════════════════════════════
# Incident Commander:    [Name] — [Phone] — [Email]
# Primary SRE:           [Name] — [Phone] — [Email]
# Secondary SRE:         [Name] — [Phone] — [Email]
# Database Admin:        [Name] — [Phone] — [Email]
# Network Engineer:      [Name] — [Phone] — [Email]
# Management Escalation: [Name] — [Phone] — [Email]
# DNS Provider Support:  Cloudflare — support.cloudflare.com
# Hosting Provider:      [Provider] — [Support Phone/URL]
# Insurance Broker:      [Name] — [Phone] (for ransomware/data loss claims)
# Legal Counsel:         [Name] — [Phone] (for regulatory notification)
# ═══════════════════════════════════════════════════════════

Summary

Disaster recovery on kldload is built on ZFS's unique capabilities: atomic snapshots, incremental send/receive, and copy-on-write immutability. Sanoid automates the snapshot lifecycle. Syncoid automates replication. WireGuard provides the encrypted transport. Boot environments provide instant local rollback. And the runbook — tested quarterly — ties it all together into a process that works when everything else is on fire.

The tools are straightforward. The architecture is documented above. The hard part is the discipline: testing the DR plan regularly, monitoring replication health continuously, updating the runbook when the infrastructure changes, and ensuring more than one person can execute the failover. That discipline is what separates "we have backups" from "we have disaster recovery."

If you read this entire masterclass and only do one thing: set up the syncoid replication with the Prometheus alert for replication lag. One cron job, one alert. That single change takes your infrastructure from "hope we can recover" to "know we can recover within 15 minutes of the last snapshot." Everything else in this masterclass is refinement. That one step is the foundation.

Related pages