Backup & Disaster Recovery Masterclass
This guide covers the complete lifecycle of protecting your infrastructure: from the philosophical distinction between backups and disaster recovery, through automated ZFS snapshot policies with Sanoid, cross-site replication with Syncoid, boot environments for safe rollback, database-specific backup strategies, Kubernetes cluster protection, failover runbooks, ransomware recovery, and compliance-driven retention. By the end you will have a tested, documented DR plan that actually works when the building is on fire.
The premise: Everyone has backups. Almost nobody has disaster recovery. A backup is a copy of data. Disaster recovery is the proven ability to restore service within a defined time window after a catastrophic failure. The difference between "we have backups" and "we have DR" is the difference between "we have parachutes in the warehouse" and "we have tested parachutes strapped to every seat." This masterclass teaches you to build and test the latter.
What this page covers: DR fundamentals, RPO/RTO definitions, ZFS copy-on-write snapshots, Sanoid automated policies, Syncoid incremental replication, cross-site replication architecture over WireGuard, boot environments, database-consistent backups, Kubernetes etcd and PV protection, failover procedures, DR testing methodology, ransomware recovery workflows, compliance and retention, monitoring replication health, and a complete runbook reference.
Prerequisites: a running kldload system with ZFS on root. The replication sections assume at least two nodes. The Kubernetes sections assume a cluster from the Kubernetes on KVM guide. Everything else works on any kldload node.
1. DR Fundamentals
Before configuring any tools, you need to understand the conceptual hierarchy of data protection. These terms are not interchangeable, and confusing them is how organizations end up with a false sense of security.
Backup vs. High Availability vs. Disaster Recovery
Backup
A point-in-time copy of data, stored separately from the primary. Backups protect against data loss — accidental deletion, corruption, application bugs. They do not protect against site loss. A backup on the same server as the data it protects is not a backup. A backup on the same rack is barely a backup.
High Availability (HA)
Redundant infrastructure that continues operating when individual components fail — dual power supplies, RAID arrays, clustered databases, load-balanced application servers. HA protects against component failure. It does not protect against site failure, correlated failures (ransomware), or data corruption that replicates to all replicas before detection.
Disaster Recovery (DR)
The ability to restore full service from a geographically separate location after a catastrophic failure of the primary site. DR encompasses data replication, infrastructure provisioning, network failover, DNS cutover, and validated runbooks. DR is not a product — it is a process that you test regularly.
Business Continuity (BC)
The superset of all the above, plus non-technical concerns: communication plans, legal obligations, regulatory reporting, customer notification, staff relocation, vendor contracts. DR is the technical component of BC. This masterclass focuses on the technical side, but your DR plan must fit inside a larger BC plan.
Why backups are not DR
A backup tells you the data exists somewhere. DR tells you: where the data is, how long it takes to restore, what infrastructure is needed to run it, what the service restart order is, who is responsible for each step, and how you know the restore succeeded. If you cannot answer all of those questions, you have backups, not DR.
2. RPO & RTO — Defining Your Tolerance
Every DR plan starts with two numbers. These numbers drive every architectural decision that follows — replication frequency, storage costs, infrastructure investment, and staffing requirements.
Recovery Point Objective (RPO)
The maximum acceptable amount of data loss, measured in time. An RPO of 1 hour means you can tolerate losing up to 1 hour of data. An RPO of zero means you need synchronous replication — every write must be confirmed at both sites before the application proceeds. RPO directly determines your replication frequency.
Recovery Time Objective (RTO)
The maximum acceptable downtime, measured from the moment the disaster is declared to the moment service is restored. An RTO of 4 hours means full service must be running at the DR site within 4 hours. An RTO of zero means active-active — both sites serve traffic simultaneously. RTO determines your DR infrastructure investment.
The cost-tolerance tradeoff
| RPO | RTO | Architecture | Relative Cost |
|---|---|---|---|
| 24 hours | 48 hours | Nightly backup to offsite storage, cold DR site | $ |
| 1 hour | 4 hours | Hourly ZFS snapshots + syncoid replication, warm DR node | $$ |
| 15 minutes | 1 hour | Frequent snapshots + syncoid + pre-staged DR with automation | $$$ |
| ~0 (seconds) | 15 minutes | Near-synchronous replication, hot standby, automated failover | $$$$ |
| 0 | 0 | Active-active multi-site, synchronous replication, global load balancing | $$$$$ |
For most kldload deployments — homelabs, small businesses, edge sites — an RPO of 15 minutes and an RTO of 1 hour is the sweet spot. ZFS makes this achievable at commodity hardware prices. The rest of this masterclass builds that architecture.
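To sanity-check a target RPO against the schedule you actually run, it helps to add up the delays in the pipeline. The sketch below rests on an assumption that is mine rather than a kldload rule: the replication job only ships snapshots that already exist, so a write can wait up to one snapshot interval, then up to one replication interval, then the transfer itself. The function names and arguments are illustrative.

```shell
#!/usr/bin/env bash
# Worst-case data-loss window, in minutes, for a snapshot-then-replicate pipeline.
# Assumption: the replication job only ships snapshots that already exist.
#   $1 = snapshot interval (minutes)
#   $2 = replication interval (minutes)
#   $3 = typical transfer time (minutes)
worst_case_rpo_minutes() {
  local snap_interval="$1" repl_interval="$2" transfer="$3"
  echo $(( snap_interval + repl_interval + transfer ))
}

# Does a schedule meet a target RPO? Exit status 0 = yes, 1 = no.
#   $1 = target RPO (minutes), remaining arguments as above
meets_rpo() {
  local target="$1"; shift
  [ "$(worst_case_rpo_minutes "$@")" -le "$target" ]
}

# Example: 15-minute snapshots, 15-minute replication, 2-minute transfers.
worst_case_rpo_minutes 15 15 2
```

With those example numbers the worst case comes out at 32 minutes, so under this model a strict 15-minute RPO needs shorter intervals, or a replication job that takes its own snapshot at send time.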
3. The ZFS Advantage
ZFS fundamentally changes what is possible for backup and DR because of three properties that traditional filesystems lack: copy-on-write semantics, atomic snapshots, and incremental send/receive. These are not incremental improvements — they are a different category of capability.
Copy-on-Write (CoW)
ZFS never overwrites data in place. When a block is modified, the new version is written to a new location, and the metadata tree is updated atomically. The old block remains on disk until no snapshot references it. This means a snapshot costs almost nothing to create: it only prevents old blocks from being freed, and it consumes space gradually as the live data diverges from it.
Atomic Snapshots
A ZFS snapshot captures the entire dataset at a single point in time, atomically. There is no "snapshot window" where data is inconsistent. Creating one is a metadata operation whose cost does not depend on dataset size, so it typically completes in a fraction of a second. No application quiesce is required for filesystem consistency (application consistency is a separate question, covered in the database section).
Incremental Send/Receive
ZFS can compute the delta between any two snapshots and send only the changed blocks. This is not a file-level diff — it is a block-level diff computed from the metadata tree. A 10 TB dataset with 50 MB of changes since the last snapshot sends 50 MB of data, regardless of how many files were modified or how large the files are.
Self-Healing Checksums
Every block in ZFS has a checksum stored in its parent block (not alongside the data). On read, ZFS verifies the checksum. On a mirror or raidz pool, a checksum failure triggers automatic repair from the redundant copy; on a pool without redundancy the error is reported but cannot be repaired. Replicas are therefore verified whenever they are read, and a periodic zpool scrub extends that verification to blocks that are rarely touched.
Why this changes everything for DR
On a traditional filesystem, a "backup" means copying files. That takes hours for large datasets, produces inconsistent state if files change during the copy, and relies on fragile incremental logic (rsync's default quick check compares mtimes and sizes, which is unreliable for database files that change while they are being read). On ZFS, a "backup" means: take a snapshot (near-instant), then send the delta to the replica (only changed blocks, checksummed end-to-end). The result is a block-for-block identical copy of the dataset at that exact point in time, verified by checksums on both sides. This is not better rsync. This is a different thing entirely.
# Take a snapshot — instant, atomic, zero performance impact
zfs snapshot rpool/data@2026-04-05_14:30
# Send the full dataset to a remote host (initial seed)
zfs send rpool/data@2026-04-05_14:30 | ssh dr-node zfs recv backup/data
# Send only the changes since the last snapshot (incremental)
zfs send -i rpool/data@2026-04-05_14:00 rpool/data@2026-04-05_14:30 \
| ssh dr-node zfs recv backup/data
# Verify: list snapshots on both sides
zfs list -t snapshot -r rpool/data
ssh dr-node zfs list -t snapshot -r backup/data
4. Sanoid — Automated Snapshot Policies
Taking snapshots manually does not scale. Sanoid is a policy-driven snapshot manager for ZFS that creates and prunes snapshots on a schedule. It is the standard tool in the ZFS ecosystem for automated snapshot lifecycle management, and kldload installs it by default on desktop and server profiles.
Install Sanoid
# CentOS / RHEL / Rocky (kldload includes sanoid in darksite)
dnf install -y sanoid
# Debian / Ubuntu
apt install -y sanoid
# Verify
sanoid --version
Configuration: /etc/sanoid/sanoid.conf
Sanoid uses an INI-style configuration file. Each section defines a dataset and its
snapshot retention policy. The use_template directive lets you define reusable
policies and apply them to multiple datasets.
# /etc/sanoid/sanoid.conf
# ── Templates ──────────────────────────────────────────────
[template_production]
frequently = 0
hourly = 48
daily = 30
monthly = 12
yearly = 2
autosnap = yes
autoprune = yes
[template_database]
# More frequent snapshots for databases — tighter RPO
frequently = 4
hourly = 72
daily = 60
monthly = 24
yearly = 5
autosnap = yes
autoprune = yes
[template_scratch]
# Scratch/temp datasets — minimal retention
frequently = 0
hourly = 24
daily = 7
monthly = 0
yearly = 0
autosnap = yes
autoprune = yes
# ── Dataset policies ──────────────────────────────────────
[rpool/ROOT]
use_template = production
recursive = yes
[rpool/data]
use_template = production
[rpool/data/postgres]
use_template = database
[rpool/data/mysql]
use_template = database
[rpool/docker]
use_template = scratch
[rpool/tmp]
use_template = scratch
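To see how many snapshots a template will hold per dataset once it has been running for a while, you can sum its retention counts. The helper below is a minimal sketch: the function name is hypothetical, it only understands the simple key = value layout shown above, and it also counts a weekly key in case you add one.

```shell
#!/usr/bin/env bash
# Sum the retention counts of one Sanoid template section.
# Usage: sanoid_template_total <conf-file> <template-name>
# Example: sanoid_template_total /etc/sanoid/sanoid.conf production
sanoid_template_total() {
  local conf="$1" tmpl="$2"
  awk -v want="[template_${tmpl}]" '
    # A new [section] header starts or ends the section we care about.
    /^[[:space:]]*\[/ { in_section = ($1 == want); next }
    # Inside the section, add up the numeric retention keys only.
    in_section && /^[[:space:]]*(frequently|hourly|daily|weekly|monthly|yearly)[[:space:]]*=/ {
      split($0, kv, "=")
      gsub(/[[:space:]]/, "", kv[2])
      total += kv[2]
    }
    END { print total + 0 }
  ' "$conf"
}
```

For the production template above this prints 92 (0 + 48 + 30 + 12 + 2). Multiply by the number of datasets using the template to get a feel for how long `zfs list -t snapshot` output will become.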
Sanoid systemd timer
Sanoid ships with a systemd timer that fires every 15 minutes by default. Each run creates any scheduled snapshots and prunes any that have exceeded their retention window.
# Enable and start the timer
systemctl enable --now sanoid.timer
# Check timer status
systemctl list-timers sanoid.timer
# Run sanoid manually to verify config
sanoid --take-snapshots --verbose
# View the snapshots it created
zfs list -t snapshot -r rpool -o name,creation,used | head -30
Per-dataset overrides
You can override any template value at the dataset level. This is how you give critical datasets longer retention without duplicating the entire template.
# Override: keep 90 days of daily snapshots for compliance data
[rpool/data/audit-logs]
use_template = production
daily = 90
monthly = 36
yearly = 7
5. Syncoid — Incremental Replication
Syncoid is Sanoid's companion tool. Where Sanoid manages the snapshot lifecycle on a single host, Syncoid replicates datasets between hosts using ZFS incremental send/receive. It handles the bookkeeping of finding the common snapshot, computing the delta, and applying it on the remote side.
Basic replication
# Replicate a single dataset to a remote host
syncoid rpool/data root@dr-node:backup/data
# First run sends a full stream (initial seed)
# Subsequent runs send only incremental deltas
# Replicate recursively — all child datasets
syncoid --recursive rpool/data root@dr-node:backup/data
# Dry run: syncoid has no dry-run switch, so estimate the stream with zfs send -n
zfs send -nv -i rpool/data@2026-04-05_14:00 rpool/data@2026-04-05_14:30
Push vs. Pull
Syncoid can operate in push mode (run on the source, send to the destination) or pull mode (run on the destination, pull from the source). Pull mode is generally preferred for DR because the DR node controls the replication schedule and the production node does not need SSH access to the DR node.
# Push mode — run on production server
syncoid rpool/data root@dr-node:backup/data
# Pull mode — run on DR server
syncoid root@prod-node:rpool/data backup/data
# Pull mode with a dedicated SSH key
syncoid --sshkey=/root/.ssh/syncoid_ed25519 \
root@prod-node:rpool/data backup/data
Bandwidth limiting and compression
# Limit bandwidth to roughly 50 Mbit/s to avoid saturating the WireGuard link
# (the limit is applied through mbuffer, which counts bytes per second: 6M ≈ 50 Mbit/s)
syncoid --source-bwlimit=6M rpool/data root@dr-node:backup/data
# Syncoid buffers through mbuffer automatically when it is installed;
# enlarge the buffer to reduce stalls on variable links
syncoid --mbuffer-size=128M rpool/data root@dr-node:backup/data
# Compress the send stream with lz4 (useful over WAN)
syncoid --compress=lz4 rpool/data root@dr-node:backup/data
# Use pigz for parallel gzip compression on multi-core machines
syncoid --compress=pigz-fast rpool/data root@dr-node:backup/data
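Link budgets are usually quoted in megabits per second, while rate limits applied through mbuffer are, to the best of my knowledge, read as bytes per second. That is where syncoid's --source-bwlimit and --target-bwlimit values end up. The small helper below does the conversion; its name is hypothetical, and you should confirm the unit against your installed syncoid and mbuffer versions before relying on it.

```shell
#!/usr/bin/env bash
# Convert a link budget in megabits per second to bytes per second.
#   $1 = rate in Mbit/s (whole number)
mbit_to_bytes_per_sec() {
  local mbit="$1"
  echo $(( mbit * 1000000 / 8 ))
}

# Example: a 50 Mbit/s budget expressed in bytes per second.
mbit_to_bytes_per_sec 50
```

The example prints 6250000. The result could then be passed along the lines of `syncoid --source-bwlimit="$(mbit_to_bytes_per_sec 50)" rpool/data root@dr-node:backup/data`.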
Automated replication with systemd
# /etc/systemd/system/syncoid-replication.service
[Unit]
Description=ZFS replication via syncoid
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
ExecStart=/usr/sbin/syncoid --recursive --no-sync-snap \
--compress=lz4 rpool/data root@dr-node:backup/data
ExecStart=/usr/sbin/syncoid --recursive --no-sync-snap \
--compress=lz4 rpool/ROOT root@dr-node:backup/ROOT
# Limit CPU and I/O impact on production
CPUQuota=50%
# IOWeight defaults to 100, so a lower value is needed to deprioritize replication I/O
IOWeight=50
Nice=10
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/syncoid-replication.timer
[Unit]
Description=Run ZFS replication every 15 minutes
[Timer]
OnCalendar=*:0/15
RandomizedDelaySec=60
Persistent=true
[Install]
WantedBy=timers.target
# Enable the timer
systemctl daemon-reload
systemctl enable --now syncoid-replication.timer
# Verify
systemctl list-timers syncoid-replication.timer
journalctl -u syncoid-replication.service --since "1 hour ago"
Encryption in transit
Syncoid transports streams over SSH, so replication is encrypted in transit by default. Over a
WireGuard tunnel the traffic is simply encrypted twice, which costs some CPU but is otherwise harmless.
Raw sends (--sendoptions="w") address a different concern: natively encrypted datasets travel as ciphertext.
# Raw send over the WireGuard address
syncoid --sendoptions="w" --recursive \
rpool/data root@10.100.0.2:backup/data
# The "w" flag sends the raw encrypted stream if the dataset
# uses ZFS native encryption — the DR node cannot read the data
# without the encryption key, providing encryption at rest on
# the DR side without trusting the DR node
6. Cross-Site Replication Architecture
A real DR architecture requires a dedicated replica at a geographically separate location, connected via an encrypted network, with automated replication and monitoring. This section describes the reference architecture for kldload cross-site DR.
Topology
┌─────────────────────────────┐ WireGuard ┌─────────────────────────────┐
│ PRIMARY SITE │ Storage Backplane │ DR SITE │
│ │ 10.100.0.0/24 (wg-dr) │ │
│ ┌───────────┐ │◄────────────────────────────►│ ┌───────────┐ │
│ │ prod-node │ rpool/data │ │ backup/data │ dr-node │ │
│ │ │ rpool/ROOT │ syncoid every 15 min │ backup/ROOT │ │ │
│ │ │ rpool/db │──────────────────────────────►│ backup/db │ │ │
│ └───────────┘ │ │ └───────────┘ │
│ │ │ │
│ Sanoid snapshots locally │ │ Sanoid prunes old snaps │
│ Syncoid pushes to DR │ │ Prometheus monitors lag │
└─────────────────────────────┘ └─────────────────────────────┘
WireGuard storage plane
DR replication traffic should travel over a dedicated WireGuard interface, separate from your service traffic. This provides encryption, access control, and bandwidth isolation.
# /etc/wireguard/wg-dr.conf — on production node
[Interface]
Address = 10.100.0.1/24
PrivateKey = <prod-private-key>
ListenPort = 51821
[Peer]
PublicKey = <dr-public-key>
AllowedIPs = 10.100.0.2/32
Endpoint = dr-site.example.com:51821
PersistentKeepalive = 25
# /etc/wireguard/wg-dr.conf — on DR node
[Interface]
Address = 10.100.0.2/24
PrivateKey = <dr-private-key>
ListenPort = 51821
[Peer]
PublicKey = <prod-public-key>
AllowedIPs = 10.100.0.1/32
Endpoint = prod-site.example.com:51821
PersistentKeepalive = 25
# Bring up the DR WireGuard interface on both nodes
systemctl enable --now wg-quick@wg-dr
# Verify connectivity
ping -c 3 10.100.0.2 # from prod
ping -c 3 10.100.0.1 # from DR
# Configure syncoid to use the WireGuard IP
syncoid --recursive rpool/data root@10.100.0.2:backup/data
SSH hardening for replication
The replication SSH key should be restricted to ZFS commands only. Use
command= in authorized_keys to limit what the production node can do on the DR node.
# Generate a dedicated key pair for syncoid
ssh-keygen -t ed25519 -f /root/.ssh/syncoid_ed25519 -N "" -C "syncoid-replication"
# On the DR node, add to /root/.ssh/authorized_keys:
command="zfs recv -Fdu backup",restrict ssh-ed25519 AAAA... syncoid-replication
# For recursive replication, the command restriction needs to be broader:
command="/usr/local/bin/syncoid-receiver.sh",restrict ssh-ed25519 AAAA... syncoid-replication
# /usr/local/bin/syncoid-receiver.sh — on DR node
#!/bin/bash
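# Allow only a few zfs subcommands. Anything containing characters outside a
# conservative set (no ; | & $ ` or newlines) is refused before eval, so a
# prefix match cannot smuggle in a second command.
# Note: syncoid may wrap zfs receive in a pipeline (mbuffer, decompression);
# test against your syncoid options and extend the allowlist deliberately.
cmd="${SSH_ORIGINAL_COMMAND:-}"
case "$cmd" in
*[!A-Za-z0-9\ _./:@=,\'-]*)
echo "Command rejected (unexpected characters): $cmd" >&2
exit 1
;;
zfs\ recv\ *|zfs\ receive\ *|zfs\ list\ *|zfs\ get\ *|zfs\ snapshot\ *)
eval "$cmd"
;;
*)
echo "Command not allowed: $cmd" >&2
exit 1
;;
esac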
7. Boot Environments
A boot environment is a bootable clone of your root filesystem. ZFS makes this trivial: clone the root dataset, update the bootloader to point at the clone, and reboot into it. If the new environment has problems, reboot into the old one. This is the safest way to perform OS upgrades, kernel updates, and major configuration changes.
How boot environments work on kldload
# kldload's root layout (default install)
# rpool/ROOT/centos — the root dataset
# rpool/ROOT/centos@safe — pre-upgrade snapshot
# List current boot environments
zfs list -r rpool/ROOT -o name,used,mountpoint,origin
# The active boot environment is the one mounted at /
Creating a boot environment before an upgrade
# 1. Snapshot the current root
zfs snapshot rpool/ROOT/centos@pre-upgrade-2026-04-05
# 2. Clone it to a new boot environment
zfs clone rpool/ROOT/centos@pre-upgrade-2026-04-05 rpool/ROOT/centos-upgrade
# 3. Mount the clone temporarily and perform the upgrade
# (-o zfsutil lets mount(8) handle a dataset whose mountpoint is not "legacy")
mount -t zfs -o zfsutil rpool/ROOT/centos-upgrade /mnt
# ... chroot into /mnt and run dnf upgrade, or:
dnf --installroot=/mnt upgrade -y
# 4. Update the bootloader to offer both environments
# For systemd-boot (kldload default):
cat > /boot/efi/loader/entries/upgrade.conf <<'EOF'
title CentOS Stream 9 (upgrade)
linux /vmlinuz-upgrade
initrd /initramfs-upgrade.img
options root=zfs:rpool/ROOT/centos-upgrade rw
EOF
# 5. Reboot and select the upgrade entry
# If it works: keep it. If it breaks: reboot into the old entry.
Rolling back a failed upgrade
# Option 1: Reboot into the previous boot environment
# Select the old entry in the bootloader menu — done.
# Option 2: Rollback the snapshot (destructive — overwrites current state)
zfs rollback rpool/ROOT/centos@pre-upgrade-2026-04-05
# Option 3: Promote the old clone back to the active dataset
# (if you promoted the upgrade clone earlier)
zfs promote rpool/ROOT/centos
Boot environments as DR safety net
Boot environments are not a replacement for offsite replication, but they are an essential complement. Every upgrade, every kernel update, every major config change should create a boot environment first. The cost is essentially zero (CoW clone), and the safety net is immediate — no need to restore from a remote backup for a botched upgrade.
8. Database Backup
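Databases require special handling because a filesystem-consistent snapshot is not automatically an application-consistent backup. A snapshot of a running database captures it mid-flight, exactly as a power cut would: write-ahead logs partially applied, pages not yet flushed. Engines with crash recovery can replay their way back from that state, provided data files and logs are captured in the same atomic snapshot; engines without it (MyISAM, for example) can be left damaged. The correct approach depends on the database.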
PostgreSQL: pg_dump + ZFS snapshots
# Strategy 1: Logical backup with pg_dump (portable, slow for large DBs)
pg_dump -Fc -Z 4 -f /backup/postgres/mydb_$(date +%Y%m%d_%H%M).dump mydb
# Restore from logical backup
pg_restore -d mydb /backup/postgres/mydb_20260405_1430.dump
# Strategy 2: ZFS snapshot (instant, block-level consistent)
# PostgreSQL on ZFS: put $PGDATA on its own dataset
# rpool/data/postgres — mountpoint /var/lib/pgsql/data
# Take a consistent snapshot:
# PostgreSQL's WAL ensures crash consistency — a ZFS snapshot
# of a running PostgreSQL instance is equivalent to a power failure,
# and PostgreSQL recovers from that using WAL replay on startup.
zfs snapshot rpool/data/postgres@$(date +%Y%m%d_%H%M)
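# For a backup PostgreSQL itself brackets, use pg_backup_start()/pg_backup_stop()
# (PostgreSQL 15+; earlier releases call them pg_start_backup()/pg_stop_backup()).
# Both calls must run in the same session, so the snapshot is taken via \! inside psql:
psql <<'EOF'
SELECT pg_backup_start('zfs-snap');
\! zfs snapshot rpool/data/postgres@consistent_$(date +%Y%m%d_%H%M)
SELECT pg_backup_stop();
EOF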
PostgreSQL point-in-time recovery (PITR)
# Enable WAL archiving in postgresql.conf
archive_mode = on
archive_command = 'cp %p /backup/postgres/wal/%f'
wal_level = replica
# With WAL archiving + ZFS snapshots, you can recover to any
# point in time between snapshots:
# 1. Restore the most recent snapshot before the target time
# 2. Replay WAL files up to the exact target timestamp
# recovery.conf (PostgreSQL 12+: recovery.signal + postgresql.conf)
restore_command = 'cp /backup/postgres/wal/%f %p'
recovery_target_time = '2026-04-05 14:25:00'
recovery_target_action = 'promote'
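Step 1 of the PITR procedure, picking the most recent snapshot before the target time, is easy to get wrong under pressure. The sketch below assumes snapshots are named with the `YYYYMMDD_HHMM` suffix used in the manual `zfs snapshot` examples in this section; Sanoid's automatic snapshot names use a different pattern and are ignored. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Print the newest snapshot whose timestamp suffix is <= the target.
# Reads snapshot names (dataset@YYYYMMDD_HHMM) on stdin, one per line.
#   $1 = target in the same YYYYMMDD_HHMM form
# Exits 1 if no snapshot qualifies.
latest_snapshot_before() {
  local target="$1"
  awk -F'@' -v target="$target" '
    # Fixed-width timestamps sort correctly as plain strings.
    length($2) == 13 && $2 ~ /^[0-9]+_[0-9]+$/ \
      && ($2 "") <= (target "") && ($2 "") > (best "") {
      best = $2
      line = $0
    }
    END { if (line != "") print line; else exit 1 }
  '
}
```

For a recovery target of 2026-04-05 14:25, you would feed it the dataset's snapshot list, for example `zfs list -H -t snapshot -o name -r rpool/data/postgres | latest_snapshot_before 20260405_1425`, then restore that snapshot and let WAL replay cover the remaining minutes.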
MySQL / MariaDB with consistent snapshots
# For InnoDB (default): ZFS snapshots are crash-consistent
# InnoDB's redo log provides the same crash recovery as PostgreSQL's WAL
zfs snapshot rpool/data/mysql@$(date +%Y%m%d_%H%M)
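# For mixed InnoDB/MyISAM: flush and lock first.
# The lock is released when the session ends, so lock, snapshot, and unlock
# must happen in one mysql session (system runs a shell command from the client):
mysql <<EOF
FLUSH TABLES WITH READ LOCK;
system zfs snapshot rpool/data/mysql@consistent_$(date +%Y%m%d_%H%M)
UNLOCK TABLES;
EOF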
# Logical backup (portable, slow)
mysqldump --single-transaction --routines --triggers \
--all-databases | gzip > /backup/mysql/all_$(date +%Y%m%d_%H%M).sql.gz
# Binary backup with xtrabackup (fast, InnoDB only)
xtrabackup --backup --target-dir=/backup/mysql/xtra_$(date +%Y%m%d_%H%M)
Automated database backup script
#!/bin/bash
# /usr/local/bin/db-backup.sh — run via systemd timer
set -euo pipefail
TIMESTAMP=$(date +%Y%m%d_%H%M)
LOG="/var/log/db-backup.log"
echo "[${TIMESTAMP}] Starting database backup" >> "$LOG"
# PostgreSQL: logical + snapshot
if systemctl is-active --quiet postgresql; then
pg_dump -Fc -Z 4 -f "/backup/postgres/mydb_${TIMESTAMP}.dump" mydb
zfs snapshot "rpool/data/postgres@${TIMESTAMP}"
echo "[${TIMESTAMP}] PostgreSQL backup complete" >> "$LOG"
fi
# MySQL: snapshot (InnoDB crash-consistent)
if systemctl is-active --quiet mysqld; then
zfs snapshot "rpool/data/mysql@${TIMESTAMP}"
echo "[${TIMESTAMP}] MySQL backup complete" >> "$LOG"
fi
# Prune local logical backups older than 7 days
find /backup/postgres/ -name "*.dump" -mtime +7 -delete
find /backup/mysql/ -name "*.sql.gz" -mtime +7 -delete
echo "[${TIMESTAMP}] Database backup finished" >> "$LOG"
9. Kubernetes Backup
Kubernetes clusters have two categories of state that need protection: the cluster state (stored in etcd) and the persistent data (stored on PersistentVolumes). Losing etcd means losing all Kubernetes object definitions — deployments, services, configmaps, secrets. Losing PVs means losing application data. You need to protect both.
etcd snapshots
# Take an etcd snapshot (run on a control plane node)
# Store the path once so save and status refer to the same file
SNAP=/backup/etcd/snapshot_$(date +%Y%m%d_%H%M).db
ETCDCTL_API=3 etcdctl snapshot save "$SNAP" \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status "$SNAP" \
--write-out=table
# Automated etcd backup — systemd timer
cat > /etc/systemd/system/etcd-backup.service <<'EOF'
[Unit]
Description=etcd snapshot backup
After=network.target
[Service]
Type=oneshot
ExecStart=/usr/local/bin/etcd-backup.sh
EOF
cat > /etc/systemd/system/etcd-backup.timer <<'EOF'
[Unit]
Description=Run etcd backup every hour
[Timer]
OnCalendar=hourly
Persistent=true
[Install]
WantedBy=timers.target
EOF
systemctl daemon-reload
systemctl enable --now etcd-backup.timer
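The service unit above calls /usr/local/bin/etcd-backup.sh, which this page does not show. The following is a hypothetical sketch of what such a script could look like: it reuses the etcdctl invocation and certificate paths from the manual example, and adds pruning so hourly snapshots do not fill the disk. The retention count of 48 files, the environment variable names, and the file-name guard at the bottom are all assumptions of this sketch.

```shell
#!/usr/bin/env bash
# /usr/local/bin/etcd-backup.sh (hypothetical sketch, not a shipped kldload script)
set -euo pipefail

BACKUP_DIR="${BACKUP_DIR:-/backup/etcd}"   # same directory as the manual example
KEEP="${KEEP:-48}"                         # assumption: keep the newest 48 snapshots

# Save a snapshot and print its status table.
take_snapshot() {
  local snap
  snap="$BACKUP_DIR/snapshot_$(date +%Y%m%d_%H%M).db"
  mkdir -p "$BACKUP_DIR"
  ETCDCTL_API=3 etcdctl snapshot save "$snap" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
  ETCDCTL_API=3 etcdctl snapshot status "$snap" --write-out=table
}

# Delete all but the newest $2 snapshot files in directory $1.
# Timestamped names sort chronologically, so a reverse name sort is newest-first.
prune_snapshots() {
  local dir="$1" keep="$2"
  find "$dir" -maxdepth 1 -type f -name 'snapshot_*.db' | sort -r \
    | tail -n +"$((keep + 1))" \
    | while IFS= read -r old; do rm -f -- "$old"; done
}

# Run only when installed under its real name, so the functions above can be
# sourced or tested without touching etcd.
if [ "$(basename -- "$0")" = "etcd-backup.sh" ]; then
  take_snapshot
  prune_snapshots "$BACKUP_DIR" "$KEEP"
fi
```

Because /backup/etcd lives on the node itself, these files still need to reach the DR site, for example by placing the directory on a dataset that syncoid already replicates.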
Restore etcd from snapshot
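# Stop the kube-apiserver and etcd
# (systemd-managed control plane shown; on kubeadm clusters both run as static
# pods, so move their manifests out of /etc/kubernetes/manifests instead)
systemctl stop kube-apiserver
systemctl stop etcd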
# Restore the snapshot
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd/snapshot_20260405_1400.db \
--data-dir=/var/lib/etcd-restore \
--name=$(hostname) \
--initial-cluster="$(hostname)=https://$(hostname):2380" \
--initial-advertise-peer-urls="https://$(hostname):2380"
# Replace the etcd data directory
mv /var/lib/etcd /var/lib/etcd.old
mv /var/lib/etcd-restore /var/lib/etcd
chown -R etcd:etcd /var/lib/etcd
# Restart services
systemctl start etcd
systemctl start kube-apiserver
# Verify cluster health
kubectl get nodes
kubectl get pods --all-namespaces
Velero for cluster state backup
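# Install Velero CLI
# Release assets are named per version, so pin one (placeholder below)
VELERO_VERSION=vX.Y.Z # replace with the release you have validated
curl -Lo /tmp/velero.tar.gz \
"https://github.com/vmware-tanzu/velero/releases/download/${VELERO_VERSION}/velero-${VELERO_VERSION}-linux-amd64.tar.gz"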
tar -xzf /tmp/velero.tar.gz -C /tmp
install -m 0755 /tmp/velero-*/velero /usr/local/bin/velero
# Install Velero server components with MinIO backend (on-prem S3)
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.9.0 \
--bucket velero-backups \
--secret-file /etc/velero/credentials-velero \
--backup-location-config \
region=us-east-1,s3ForcePathStyle=true,s3Url=http://minio.storage.svc:9000 \
--use-node-agent
# Create a backup of all namespaces
velero backup create full-cluster-$(date +%Y%m%d) \
--include-namespaces '*' \
--default-volumes-to-fs-backup
# Schedule daily backups with 30-day retention
velero schedule create daily-backup \
--schedule="0 2 * * *" \
--include-namespaces '*' \
--ttl 720h
# List backups
velero backup get
# Restore from a Velero backup
velero restore create --from-backup full-cluster-20260405
PersistentVolume snapshots via ZFS CSI
# If using OpenZFS CSI driver, PV snapshots are ZFS snapshots
# Create a VolumeSnapshot for a PVC
cat <<'EOF' | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: postgres-data-snap
namespace: production
spec:
volumeSnapshotClassName: zfs-snapshot-class
source:
persistentVolumeClaimName: postgres-data-pvc
EOF
# List snapshots
kubectl get volumesnapshot -n production
# Restore from snapshot — create a new PVC from the snapshot
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-data-restored
namespace: production
spec:
storageClassName: zfs-csi
dataSource:
name: postgres-data-snap
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
EOF
10. Failover Procedures
When a disaster is declared, you need a step-by-step runbook that anyone on the team can execute. This is not the time for improvisation. The runbook should be tested quarterly and updated after every test.
Step-by-step failover runbook
# ═══════════════════════════════════════════════════════════
# DR FAILOVER RUNBOOK — kldload
# ═══════════════════════════════════════════════════════════
#
# TRIGGER: Primary site unreachable for > 15 minutes
# AND confirmed by two team members
# AND incident commander declares DR activation
#
# ESTIMATED TIME: 45-60 minutes
# LAST TESTED: 2026-04-01 (quarterly drill)
# ═══════════════════════════════════════════════════════════
# ── PHASE 1: ASSESS (5 minutes) ──────────────────────────
# 1.1 Verify primary is truly down (not just monitoring flap)
ping -c 5 prod-node.example.com
ssh -o ConnectTimeout=10 root@prod-node.example.com echo "alive"
# If either succeeds: stand down, investigate, do not failover.
# 1.2 Check replication lag — how much data will we lose?
ssh root@dr-node "zfs list -t snapshot -r backup/data -o name,creation \
-s creation | tail -5"
# The newest snapshot timestamp = your actual RPO for this incident.
# ── PHASE 2: PROMOTE DR DATASETS (10 minutes) ────────────
# 2.1 Stop replication to prevent partial writes
ssh root@dr-node "systemctl stop syncoid-replication.timer"
# 2.2 Set datasets to read-write on DR node
ssh root@dr-node "zfs set readonly=off backup/data"
ssh root@dr-node "zfs set readonly=off backup/ROOT"
ssh root@dr-node "zfs set readonly=off backup/db"
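# 2.3 Mount the data datasets at their production paths
# backup/ROOT stays unmounted: mountpoint=/ would collide with the DR node's
# own running root. It is kept for rebuilding the primary or for booting as a
# separate boot environment.
ssh root@dr-node "zfs set mountpoint=/data backup/data"
ssh root@dr-node "zfs set mountpoint=/var/lib/pgsql backup/db"
ssh root@dr-node "zfs mount -a"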
# ── PHASE 3: START SERVICES (15 minutes) ─────────────────
# 3.1 Start services in dependency order
ssh root@dr-node "systemctl start postgresql"
ssh root@dr-node "systemctl start redis"
ssh root@dr-node "systemctl start application"
ssh root@dr-node "systemctl start nginx"
# 3.2 Verify services are healthy
ssh root@dr-node "systemctl is-active postgresql redis application nginx"
ssh root@dr-node "curl -sf http://localhost:8080/health"
# ── PHASE 4: DNS CUTOVER (5 minutes) ─────────────────────
# 4.1 Update DNS to point at DR node
# If using Cloudflare:
curl -X PUT "https://api.cloudflare.com/client/v4/zones/ZONE_ID/dns_records/RECORD_ID" \
-H "Authorization: Bearer CF_TOKEN" \
-H "Content-Type: application/json" \
--data '{"type":"A","name":"app.example.com","content":"DR_NODE_IP","ttl":60}'
# 4.2 Lower TTL before failover (should already be low — 60s)
# If TTL was 3600, it takes up to 1 hour for all clients to see the change.
# PRE-WORK: Set TTL to 60 on all critical DNS records NOW, before any disaster.
# ── PHASE 5: VALIDATE (10 minutes) ───────────────────────
# 5.1 External health check
curl -sf https://app.example.com/health
# 5.2 Run smoke tests
./scripts/dr-smoke-test.sh
# 5.3 Check monitoring — alerts should clear
# Grafana: https://grafana.example.com/d/dr-status
# ── PHASE 6: COMMUNICATE ─────────────────────────────────
# Notify stakeholders: DR activated, services restored,
# data loss = [RPO from step 1.2], ETA for primary restore = TBD
DNS pre-work: low TTL
Critical pre-work: Set DNS TTLs to 60 seconds on all records that would change during failover. Do this now, during peacetime. If your TTL is 3600 (1 hour) and you change the DNS record during a disaster, some clients will not see the change for up to an hour. A 60-second TTL means convergence within 2 minutes.
# Verify current TTLs (the second column of each answer line is the TTL in seconds)
dig +noall +answer app.example.com
# Set TTL to 60 in your DNS provider (Cloudflare example)
# Proxied records: TTL is managed by Cloudflare (automatic)
# DNS-only records: set TTL to 60
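Checking TTLs by eye does not scale past a handful of records. The helper below is a small sketch that reads `dig +noall +answer` output and flags any record above a limit; the function name is hypothetical. Keep in mind that a caching resolver reports the remaining TTL, so query an authoritative name server if you want the configured value.

```shell
#!/usr/bin/env bash
# Read `dig +noall +answer` output on stdin and print every record whose
# TTL (second column) is above the limit. Exit status 1 if any were found.
#   $1 = maximum acceptable TTL in seconds
ttl_above() {
  local limit="$1"
  awk -v limit="$limit" '
    NF >= 5 && $2 ~ /^[0-9]+$/ && ($2 + 0) > (limit + 0) {
      print $1, $2
      bad = 1
    }
    END { exit bad }
  '
}
```

A typical call would be `dig +noall +answer app.example.com | ttl_above 60`, which stays silent and returns 0 when every answer is at or below 60 seconds.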
11. DR Testing
A DR plan that has not been tested is a hypothesis, not a plan. Testing reveals gaps in documentation, missing dependencies, expired credentials, incorrect assumptions, and timing issues. Test quarterly at minimum.
Quarterly fire drill procedure
# DR Fire Drill Checklist
# ═══════════════════════════════════════════════════════════
# 1. Schedule: pick a date, notify the team, block 4 hours
# 2. Pre-drill: verify replication is current, check DR node health
# 3. Simulate: follow the failover runbook exactly as written
# 4. Validate: run the full smoke test suite against DR
# 5. Measure: record actual failover time vs. target RTO
# 6. Document: write up findings — what worked, what broke, what to fix
# 7. Remediate: fix every issue found within 2 weeks
# 8. Re-test: if critical issues were found, schedule a re-test
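Step 5 of the checklist, measuring actual failover time against the target RTO, is worth recording the same way every drill so results stay comparable. A minimal sketch, assuming you note the epoch time when the drill starts and when the external health check first passes; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Compare a drill's measured failover time with the target RTO.
#   $1 = drill start (epoch seconds)
#   $2 = service restored (epoch seconds)
#   $3 = target RTO in minutes
rto_result() {
  local start="$1" end="$2" target_min="$3"
  # Round partial minutes up so a 60m01s failover does not pass a 60-minute RTO.
  local elapsed_min=$(( (end - start + 59) / 60 ))
  if [ "$elapsed_min" -le "$target_min" ]; then
    echo "PASS: ${elapsed_min} min (target ${target_min} min)"
  else
    echo "FAIL: ${elapsed_min} min (target ${target_min} min)"
  fi
}
```

During a drill you might capture `start=$(date +%s)` when the trigger is declared, then call `rto_result "$start" "$(date +%s)" 60` once validation passes and paste the line into the write-up.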
Automated DR validation script
#!/bin/bash
# /usr/local/bin/dr-validate.sh
# Run on DR node to validate replica health
set -euo pipefail
ERRORS=0
REPORT="/var/log/dr-validation-$(date +%Y%m%d).log"
echo "=== DR Validation Report — $(date) ===" > "$REPORT"
# Check 1: Replication lag
# -p prints creation as epoch seconds, which avoids parsing date strings
SNAP_EPOCH=$(zfs list -Hp -t snapshot -r backup/data -o creation \
-s creation | tail -1)
SNAP_EPOCH=${SNAP_EPOCH:-0}
NOW_EPOCH=$(date +%s)
LAG_MINUTES=$(( (NOW_EPOCH - SNAP_EPOCH) / 60 ))
echo "Replication lag: ${LAG_MINUTES} minutes" >> "$REPORT"
if [ "$LAG_MINUTES" -gt 30 ]; then
echo "FAIL: Replication lag exceeds 30 minutes" >> "$REPORT"
ERRORS=$((ERRORS + 1))
fi
# Check 2: Dataset integrity
for ds in backup/data backup/ROOT backup/db; do
if zfs list "$ds" &>/dev/null; then
echo "OK: Dataset $ds exists" >> "$REPORT"
else
echo "FAIL: Dataset $ds missing" >> "$REPORT"
ERRORS=$((ERRORS + 1))
fi
done
# Check 3: Disk space
POOL_FREE=$(zpool list -Hp -o free backup 2>/dev/null || echo 0)
POOL_FREE_GB=$((POOL_FREE / 1073741824))
echo "DR pool free space: ${POOL_FREE_GB} GB" >> "$REPORT"
if [ "$POOL_FREE_GB" -lt 50 ]; then
echo "FAIL: DR pool free space below 50 GB" >> "$REPORT"
ERRORS=$((ERRORS + 1))
fi
# Check 4: Services can start (dry run)
echo "Service readiness checks:" >> "$REPORT"
for svc in postgresql redis nginx; do
if systemctl cat "$svc" &>/dev/null; then
echo " OK: $svc unit file present" >> "$REPORT"
else
echo " WARN: $svc unit file missing on DR node" >> "$REPORT"
fi
done
# Summary
echo "" >> "$REPORT"
if [ "$ERRORS" -eq 0 ]; then
echo "RESULT: ALL CHECKS PASSED" >> "$REPORT"
else
echo "RESULT: ${ERRORS} CHECK(S) FAILED — investigate immediately" >> "$REPORT"
fi
cat "$REPORT"
exit "$ERRORS"
Chaos engineering for DR
Beyond scheduled fire drills, introduce controlled failures to validate resilience:
# Kill the primary node's network (simulate site failure)
# On the primary node:
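# Insert at the top of the chain so earlier ACCEPT rules cannot bypass the drop
iptables -I OUTPUT 1 -j DROP
# Roll back from the console (SSH replies are dropped too): iptables -D OUTPUT -j DROP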
# Wait, then verify DR monitoring detects the outage
# Simulate disk failure on primary
# On the primary node (test only — use a spare disk):
zpool offline rpool /dev/sdb
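# Simulate a failed replication run
# Point syncoid at a source dataset that does not exist so the run should fail
# without touching real data (rpool/chaos-missing is a placeholder name).
# Avoid --force-delete here: it removes target datasets rather than simulating a failure.
syncoid rpool/chaos-missing root@dr-node:backup/test 2>&1 || true
# Verify the monitoring alert fires for failed replication
# (unit-state alerts only fire if the failure happens inside syncoid-replication.service)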
# ALWAYS have a rollback plan for chaos tests
# ALWAYS run chaos tests in a maintenance window
# NEVER run chaos tests in production without team consensus
12. Ransomware Recovery
Ransomware is one of the most common disaster scenarios in modern infrastructure. The attack encrypts your data and demands payment for the decryption key. ZFS provides strong defenses, but only if they are configured before the attack.
Immutable snapshots
ZFS snapshots are inherently read-only — nothing can modify a snapshot after creation. However, a root-level attacker can destroy snapshots. The defense is to replicate snapshots to a separate system where the production node does not have destroy permissions.
# The attack surface: if ransomware gets root on the production node
# it can destroy snapshots:
# zfs destroy rpool/data@2026-04-05_14:00 # attacker destroys your backups
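# The defense: the DR node holds the replicas, and the production
# node's SSH key is restricted to a forced "zfs recv" command, so it
# cannot run "zfs destroy" (or any other command) on the DR node.
# On the DR node, the authorized_keys entry enforces this:
command="zfs recv -Fdu backup",restrict ssh-ed25519 AAAA... syncoid-replication
# The production node can SEND data but cannot issue commands on DR.
# Caveat: "zfs recv -F" rolls the target back and, for an incremental
# replication stream (zfs send -R), destroys snapshots that no longer
# exist on the sender. Drop -F or pin snapshots with "zfs hold" if the
# sender must never be able to remove DR snapshots.
# Syncoid in push mode also runs helper commands such as "zfs list" on
# the target, so test this restriction against your replication job.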
Offline replica (air-gapped backup)
# For maximum ransomware protection: maintain an air-gapped replica
# that is only connected during replication windows
# 1. Connect the offline DR drive/node
# 2. Run replication
syncoid --recursive rpool/data root@offline-node:airgap/data
# 3. Disconnect the offline DR drive/node
# The air gap means ransomware cannot reach this replica even with root
# For USB-attached ZFS pools:
zpool import airgap-pool
syncoid --recursive rpool/data airgap-pool/data
zpool export airgap-pool
# Physically disconnect the drive and store it offsite
Detection and response
# Signs of ransomware on ZFS:
# - Sudden massive space consumption (encryption creates new blocks)
# - Unusual snapshot destruction activity
# - High write I/O on datasets that are normally read-heavy
# - Files with ransomware extensions (.encrypted, .locked, etc.)
# Monitor for snapshot destruction attempts
# Add to /etc/zfs/zed.d/zed.rc:
ZED_NOTIFY_VERBOSE=1
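# Custom zedlet to alert on snapshot destruction.
# ZED runs zedlets whose filename starts with the event subclass (or "all-"),
# so this one is named for history_event and filters on the "destroy" action.
cat > /etc/zfs/zed.d/history_event-snapshot-destroy-alert.sh <<'ZEDEOF'
#!/bin/bash
# Alert when a snapshot (a dataset name containing "@") is destroyed
if [ "${ZEVENT_HISTORY_INTERNAL_NAME:-}" = "destroy" ]; then
    case "${ZEVENT_HISTORY_DSNAME:-}" in
        *@*) echo "ALERT: Snapshot destroyed: $ZEVENT_HISTORY_DSNAME" | \
                 mail -s "ZFS Snapshot Destruction Alert" admin@example.com ;;
    esac
fi
ZEDEOF
chmod +x /etc/zfs/zed.d/history_event-snapshot-destroy-alert.sh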
Ransomware recovery workflow
# ═══════════════════════════════════════════════════════════
# RANSOMWARE RECOVERY PROCEDURE
# ═══════════════════════════════════════════════════════════
# 1. ISOLATE — disconnect the infected node from the network immediately
# Do NOT shut down — you may need memory forensics
ssh root@infected-node "nmcli device disconnect eth0"
# 2. ASSESS — determine the scope of infection
# - Which datasets are affected?
# - When did the encryption start? (check snapshot diffs)
# - Are other nodes affected?
# 3. IDENTIFY the last clean snapshot
# Compare snapshots to find when the attack started:
zfs diff rpool/data@2026-04-05_14:00 rpool/data@2026-04-05_14:15
# Look for mass file modifications or ransomware note files
# 4. RESTORE from the last clean snapshot on the DR node
# On the DR node, first clone the clean snapshot. This leaves the later
# (encrypted) snapshots in place for forensic analysis:
zfs clone backup/data@2026-04-05_14:00 backup/data-clean
# Or roll back in place. -r is required when newer snapshots exist, and
# it DESTROYS every snapshot taken after the target:
zfs rollback -r backup/data@2026-04-05_14:00
# 5. FAILOVER to DR using the standard failover runbook
# Follow Section 10 — Failover Procedures
# 6. REBUILD the primary node from scratch
# Do NOT attempt to "clean" the infected node
# Reinstall the OS, restore from clean backups
# Change ALL credentials — the attacker had root
# 7. POST-INCIDENT
# - Forensic analysis of the infected node
# - Root cause analysis — how did they get in?
# - Update firewall rules, access controls, monitoring
# - File insurance claim if applicable
# - Regulatory notification if PII was exposed
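Step 3 of the procedure relies on reading `zfs diff` output, which can run to hundreds of thousands of lines during an attack. A minimal sketch of a summariser follows; `summarize_diff` and its default threshold of 1000 are assumptions for illustration, and the right threshold depends on how busy the dataset normally is.

```shell
# Summarise "zfs diff" output read from stdin: count changes by type and
# flag the interval when modified plus renamed entries exceed a threshold.
# zfs diff marks each line with M (modified), + (added), - (removed), R (renamed).
# summarize_diff and the default of 1000 are illustrative choices.
summarize_diff() {
    local threshold="${1:-1000}"
    awk -v limit="$threshold" '
        BEGIN { m = 0; a = 0; d = 0; r = 0 }
        $1 == "M" { m++ }
        $1 == "+" { a++ }
        $1 == "-" { d++ }
        $1 == "R" { r++ }
        END {
            printf "modified=%d added=%d removed=%d renamed=%d\n", m, a, d, r
            if (m + r > limit) { print "SUSPICIOUS: mass modification"; exit 1 }
        }
    '
}
```

Running `zfs diff rpool/data@2026-04-05_14:00 rpool/data@2026-04-05_14:15 | summarize_diff` over successive snapshot pairs narrows down the interval in which the encryption started; the snapshot just before the first flagged interval is the candidate for the last clean one.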
13. Compliance & Retention
Regulatory frameworks impose minimum retention periods for different categories of data. Your snapshot and backup retention policies must satisfy these requirements. Failure to retain data for the required period is a compliance violation; failure to delete data after the retention period (in some jurisdictions) is also a violation.
Common retention requirements
| Regulation | Data Type | Minimum Retention | Notes |
|---|---|---|---|
| SOX (Sarbanes-Oxley) | Financial records, audit logs | 7 years | Applies to public companies and their IT systems |
| HIPAA | Required documentation (policies, procedures, security records) | 6 years | From date of creation or last effective date; medical record retention itself is set by state law |
| GDPR | Personal data | No minimum — only as long as necessary | Right to erasure conflicts with backup retention |
| PCI DSS | Cardholder data, audit logs | 1 year (logs), 3 months online | Audit trail must be immediately available for 3 months |
| SEC Rule 17a-4 | Broker-dealer records | 3-6 years | Must be stored in non-rewritable, non-erasable format |
| FISMA | Federal system audit logs | 3 years | Depends on system categorization (Low/Moderate/High) |
Implementing retention with Sanoid
# Compliance dataset — 7-year retention for SOX
[rpool/data/financial]
use_template = production
daily = 90
# 7 years of monthly snapshots
monthly = 84
yearly = 7
autoprune = yes
# Audit log dataset: 3-year retention (exceeds the PCI DSS 1-year minimum, meets the FISMA 3 years)
[rpool/data/audit-logs]
use_template = production
# 3 months of daily snapshots (PCI: immediately available)
daily = 90
# 3 years of monthly snapshots
monthly = 36
yearly = 3
autoprune = yes
# GDPR-sensitive dataset — shorter retention, must support deletion
[rpool/data/user-pii]
use_template = production
daily = 30
monthly = 12
# No yearly snapshots: minimize PII retention
yearly = 0
autoprune = yes
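The monthly counts in these policies come from multiplying the retention period in years by twelve. A tiny sketch of that arithmetic, where `retention_counts` is a hypothetical helper and not part of Sanoid:

```shell
# Convert a retention period in years into Sanoid monthly/yearly counts.
# retention_counts is an illustrative helper name.
retention_counts() {
    local years="$1"
    printf 'monthly = %d\nyearly = %d\n' $(( years * 12 )) "$years"
}
```

For the 7-year SOX policy, `retention_counts 7` prints `monthly = 84` and `yearly = 7`, matching the financial dataset stanza.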
Legal hold on snapshots
# When litigation requires preserving data, you need to prevent
# snapshot pruning for specific datasets. Sanoid supports this per
# dataset by setting autoprune = no temporarily; the wrapper below
# automates the switch with a ZFS user property.
# Place a legal hold: mark the dataset
zfs set com.kldload:legal_hold=active rpool/data/financial
# In your sanoid wrapper script, check the property before pruning.
# Note: this suspends pruning for every dataset, not just the held one.
HOLD=$(zfs get -H -o value com.kldload:legal_hold rpool/data/financial 2>/dev/null)
if [ "$HOLD" = "active" ]; then
    echo "Legal hold active on rpool/data/financial: skipping prune"
    # --take-snapshots on its own never prunes
    sanoid --take-snapshots --configdir=/etc/sanoid
else
    sanoid --cron
fi
# Release the hold when litigation concludes
zfs set com.kldload:legal_hold=released rpool/data/financial
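ZFS also has a native mechanism for this: `zfs hold` places a named tag on a snapshot, and a held snapshot cannot be destroyed until the tag is released with `zfs release`. The sketch below only prints the commands so they can be reviewed first; `emit_hold_commands` and the tag name `legal_hold` are assumptions for illustration, and it assumes snapshot names without whitespace.

```shell
# Print a "zfs hold" command for every snapshot name read from stdin.
# Dry run by design: review the output, then pipe it to "sh" to apply.
# Lines without "@" (plain datasets) are skipped.
# emit_hold_commands and the default tag are illustrative.
emit_hold_commands() {
    local tag="${1:-legal_hold}"
    local snap
    while IFS= read -r snap; do
        case "$snap" in
            *@*) printf 'zfs hold %s %s\n' "$tag" "$snap" ;;
        esac
    done
}
```

A typical invocation would be `zfs list -H -t snapshot -o name -r rpool/data/financial | emit_hold_commands legal_hold`. Existing holds can be listed with `zfs holds <snapshot>`. Pruning attempts against held snapshots will fail, so expect errors from the pruning job until the holds are released.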
14. Monitoring DR Health
DR infrastructure that is not monitored will silently fail. Replication will stop and nobody will notice until the disaster happens. The following metrics must be monitored continuously and alerted on.
Key metrics to monitor
| Metric | Alert Threshold | Why |
|---|---|---|
| Replication lag (minutes since last successful syncoid) | > 30 minutes | Directly impacts your actual RPO |
| Newest snapshot age on DR node | > 30 minutes | Catches replication failures even if timer is running |
| DR pool free space | < 20% | Full pool = failed replication = no DR |
| Syncoid exit code | != 0 | Catches auth failures, network issues, ZFS errors |
| Snapshot count per dataset | > 1000 | Indicates autoprune failure — will eventually fill pool |
| WireGuard handshake age | > 5 minutes | DR tunnel is down — replication will fail |
| etcd backup age (K8s) | > 2 hours | Stale etcd backup = larger blast radius |
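The WireGuard threshold in the table can be checked directly from `wg show wg-dr latest-handshakes`, which prints one public key and one epoch timestamp per peer (0 when no handshake has happened yet). A minimal sketch, assuming that output on stdin; `handshake_ages` is a hypothetical helper and the 300-second default mirrors the 5-minute threshold.

```shell
# Report handshake age per peer from "wg show <interface> latest-handshakes"
# output on stdin (one "<public-key> <epoch>" pair per line; 0 means never).
# Arguments: current time as epoch seconds (default: now), max age in seconds.
# handshake_ages is an illustrative helper name.
handshake_ages() {
    local now="${1:-$(date +%s)}" max_age="${2:-300}"
    awk -v now="$now" -v max="$max_age" '
        BEGIN { stale = 0 }
        NF == 2 {
            if ($2 == 0)             { printf "%s never STALE\n", $1; stale = 1 }
            else if (now - $2 > max) { printf "%s %d STALE\n", $1, now - $2; stale = 1 }
            else                     { printf "%s %d OK\n", $1, now - $2 }
        }
        END { exit stale }
    '
}
```

Used as `wg show wg-dr latest-handshakes | handshake_ages`, the function exits non-zero when any peer is stale, which makes it usable as a pre-flight check before a replication run.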
Prometheus exporter for ZFS replication
#!/bin/bash
# /usr/local/bin/zfs-dr-exporter.sh
# Textfile collector for node_exporter
# Run via cron every 5 minutes
METRICS_FILE="/var/lib/prometheus/node-exporter/zfs_dr.prom"
TMPFILE="${METRICS_FILE}.tmp"
cat > "$TMPFILE" <<'HEADER'
# HELP zfs_dr_replication_lag_seconds Seconds since last successful replication snapshot
# TYPE zfs_dr_replication_lag_seconds gauge
# HELP zfs_dr_pool_free_bytes Free bytes on DR pool
# TYPE zfs_dr_pool_free_bytes gauge
# HELP zfs_dr_pool_size_bytes Total size of DR pool in bytes
# TYPE zfs_dr_pool_size_bytes gauge
# HELP zfs_dr_snapshot_count Number of snapshots on dataset
# TYPE zfs_dr_snapshot_count gauge
HEADER
# Replication lag per dataset (-p prints creation as epoch seconds)
for ds in backup/data backup/ROOT backup/db; do
    SNAP_EPOCH=$(zfs list -Hp -t snapshot -r "$ds" -o creation -s creation 2>/dev/null | tail -1)
    if [ -n "$SNAP_EPOCH" ]; then
        NOW_EPOCH=$(date +%s)
        LAG=$((NOW_EPOCH - SNAP_EPOCH))
        echo "zfs_dr_replication_lag_seconds{dataset=\"${ds}\"} ${LAG}" >> "$TMPFILE"
    fi
done
# Pool free space and total size (the DRPoolSpaceLow alert divides one by the other)
POOL_FREE=$(zpool list -Hp -o free backup 2>/dev/null || echo 0)
POOL_SIZE=$(zpool list -Hp -o size backup 2>/dev/null || echo 0)
echo "zfs_dr_pool_free_bytes ${POOL_FREE}" >> "$TMPFILE"
echo "zfs_dr_pool_size_bytes ${POOL_SIZE}" >> "$TMPFILE"
# Snapshot counts
for ds in backup/data backup/ROOT backup/db; do
    COUNT=$(zfs list -t snapshot -r "$ds" -H 2>/dev/null | wc -l)
    echo "zfs_dr_snapshot_count{dataset=\"${ds}\"} ${COUNT}" >> "$TMPFILE"
done
# Rename into place so node_exporter never reads a partial file
mv "$TMPFILE" "$METRICS_FILE"
Prometheus alerting rules
# /etc/prometheus/rules/dr-alerts.yml
groups:
  - name: disaster_recovery
    interval: 1m
    rules:
      - alert: ReplicationLagCritical
        expr: zfs_dr_replication_lag_seconds > 1800
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ZFS replication lag exceeds 30 minutes"
          description: "Dataset {{ $labels.dataset }} last replicated {{ $value | humanizeDuration }} ago"
      # zfs_dr_pool_size_bytes must be exported alongside zfs_dr_pool_free_bytes
      - alert: DRPoolSpaceLow
        expr: (zfs_dr_pool_free_bytes / zfs_dr_pool_size_bytes) < 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DR pool free space below 20%"
          description: "DR pool has {{ $value | humanizePercentage }} free"
      - alert: SnapshotCountHigh
        expr: zfs_dr_snapshot_count > 1000
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Snapshot count exceeds 1000 on {{ $labels.dataset }}"
          description: "Autoprune may not be running. Current count: {{ $value }}"
      # node_systemd_unit_state is a 0/1 gauge, so compare it directly
      - alert: SyncoidFailed
        expr: node_systemd_unit_state{name="syncoid-replication.service",state="failed"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Syncoid replication service failed"
          description: "Check journalctl -u syncoid-replication.service for details"
      - alert: WireGuardDRTunnelDown
        expr: time() - wireguard_latest_handshake_seconds{interface="wg-dr"} > 300
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "WireGuard DR tunnel handshake stale"
          description: "DR tunnel {{ $labels.interface }} last handshake {{ $value | humanizeDuration }} ago"
Grafana dashboard
Create a dedicated DR health dashboard with panels for: replication lag per dataset (time series), pool free space (gauge), snapshot counts (table), last syncoid run status (stat panel), and WireGuard tunnel status (state timeline). Pin this dashboard to the team's TV monitor.
# Import a DR dashboard via Grafana API
# The JSON file must wrap the dashboard model: {"dashboard": {...}, "overwrite": true}
curl -X POST http://localhost:3000/api/dashboards/db \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $GRAFANA_API_KEY" \
-d @/etc/grafana/dashboards/dr-health.json
# Key panels for the DR dashboard:
# 1. Replication Lag (time series) — query: zfs_dr_replication_lag_seconds
# 2. Pool Free Space (gauge) — query: zfs_dr_pool_free_bytes
# 3. Snapshot Counts (table) — query: zfs_dr_snapshot_count
# 4. Syncoid Status (stat) — query: node_systemd_unit_state{name="syncoid-replication.service"}
# 5. WireGuard Handshake (state timeline) — query: wireguard_latest_handshake_seconds{interface="wg-dr"}
15. Complete Runbook Reference
Daily operations reference
| Task | Command | Frequency | Owner |
|---|---|---|---|
| Verify Sanoid is running | systemctl status sanoid.timer | Daily (automated) | Monitoring |
| Verify syncoid is running | systemctl status syncoid-replication.timer | Daily (automated) | Monitoring |
| Check replication lag | zfs list -t snap -r backup -s creation | tail -5 | Daily (automated) | Monitoring |
| Check DR pool space | zpool list backup | Daily (automated) | Monitoring |
| Verify WireGuard DR tunnel | wg show wg-dr | Daily (automated) | Monitoring |
| Check etcd backup freshness | ls -lt /backup/etcd/ | head -3 | Daily (automated) | Monitoring |
| Review DR alerts in Grafana | Dashboard: DR Health | Daily (manual) | On-call |
Quarterly DR test checklist
| Step | Action | Expected Result | Pass/Fail |
|---|---|---|---|
| 1 | Verify replication is current (lag < 30 min) | Latest snapshot within 30 minutes | |
| 2 | Execute failover runbook (Section 10) | Services running on DR node | |
| 3 | Run smoke tests against DR | All health checks pass | |
| 4 | Measure actual failover time | Within RTO target | |
| 5 | Verify data integrity on DR | Spot-check recent records exist | |
| 6 | Test database restore from logical backup | pg_restore completes, data verified | |
| 7 | Test etcd restore (K8s) | Cluster recovers, pods running | |
| 8 | Test boot environment rollback | System boots into previous environment | |
| 9 | Fail back to primary | Reverse replication, DNS restored | |
| 10 | Document findings and remediation plan | Written report distributed to team | |
Troubleshooting reference
| Symptom | Likely Cause | Resolution |
|---|---|---|
| Syncoid fails with "cannot receive: dataset has been modified" | Someone wrote to the DR dataset, breaking the replication chain | Reset DR dataset: zfs rollback backup/data@latest_common_snap, then re-run syncoid |
| Syncoid fails with "no matching snapshots" | All common snapshots between source and target were pruned | Full re-seed required: destroy target dataset, re-run syncoid for full send |
| Replication is slow / stalls | WireGuard tunnel congestion, mbuffer not installed, or --compress not set | Check wg show wg-dr, install mbuffer, add --compress=lz4 |
| DR pool is full | Autoprune disabled or snapshots accumulating faster than pruning | Check sanoid.conf autoprune=yes, manually prune oldest snapshots: zfs destroy backup/data@oldest |
| Services fail to start on DR node | Config files reference primary hostname/IP, missing deps | Pre-stage configs on DR: /etc/dr-configs/ with DR-specific settings |
| DNS cutover is slow | TTL was not lowered in peacetime | Lower TTL to 60 now. During incident, wait for old TTL to expire |
| etcd restore fails with "member ID mismatch" | Restoring to a running cluster without stopping all members | Stop etcd on all nodes, restore on all nodes simultaneously, restart |
| Boot environment will not boot | Bootloader entry incorrect or kernel/initramfs missing | Boot from previous environment, check /boot/efi/loader/entries/ |
| ZFS encrypted datasets unreadable on DR | Encryption key not available on DR node | Load key: zfs load-key backup/data — key must be available during failover |
| Sanoid creates snapshots but does not prune | autoprune = no in config or sanoid running with --take-snapshots only | Set autoprune = yes, ensure timer runs sanoid --cron (not just --take-snapshots) |
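Several of the resolutions above depend on knowing the newest snapshot that exists on both the source and the DR target. A minimal sketch for finding it from two name lists; `latest_common_snapshot` is a hypothetical helper, and it assumes each list holds only the part of the snapshot name after `@`, with the source list sorted oldest first.

```shell
# Print the newest snapshot name present in both lists.
# $1: file of target snapshot names, $2: file of source snapshot names,
# one name per line (the part after "@"), source sorted oldest first.
# Exits 1 when there is no common snapshot (a full re-seed is needed).
# latest_common_snapshot is an illustrative helper name.
#
# Example invocation (dataset and host names follow this guide):
#   latest_common_snapshot \
#     <(ssh root@dr-node zfs list -H -t snapshot -o name -s creation -d 1 backup/data | cut -d@ -f2) \
#     <(zfs list -H -t snapshot -o name -s creation -d 1 rpool/data | cut -d@ -f2)
latest_common_snapshot() {
    awk 'NR == FNR { on_target[$1] = 1; next }
         ($1 in on_target) { last = $1 }
         END { if (last == "") exit 1; print last }' "$1" "$2"
}
```

The printed name is the rollback point for the "dataset has been modified" case, and a non-zero exit corresponds to the "no matching snapshots" case.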
Emergency contacts template
# ═══════════════════════════════════════════════════════════
# DR EMERGENCY CONTACTS
# ═══════════════════════════════════════════════════════════
# Incident Commander: [Name] — [Phone] — [Email]
# Primary SRE: [Name] — [Phone] — [Email]
# Secondary SRE: [Name] — [Phone] — [Email]
# Database Admin: [Name] — [Phone] — [Email]
# Network Engineer: [Name] — [Phone] — [Email]
# Management Escalation: [Name] — [Phone] — [Email]
# DNS Provider Support: Cloudflare — support.cloudflare.com
# Hosting Provider: [Provider] — [Support Phone/URL]
# Insurance Broker: [Name] — [Phone] (for ransomware/data loss claims)
# Legal Counsel: [Name] — [Phone] (for regulatory notification)
# ═══════════════════════════════════════════════════════════
Summary
Disaster recovery on kldload is built on ZFS's unique capabilities: atomic snapshots, incremental send/receive, and copy-on-write immutability. Sanoid automates the snapshot lifecycle. Syncoid automates replication. WireGuard provides the encrypted transport. Boot environments provide instant local rollback. And the runbook — tested quarterly — ties it all together into a process that works when everything else is on fire.
The tools are straightforward. The architecture is documented above. The hard part is the discipline: testing the DR plan regularly, monitoring replication health continuously, updating the runbook when the infrastructure changes, and ensuring more than one person can execute the failover. That discipline is what separates "we have backups" from "we have disaster recovery."
Related pages
- ZFS Masterclass — deep dive into ZFS architecture, pools, datasets, and tuning
- Snapshots Guide — tutorial-level introduction to ZFS snapshots
- Boot Environments — step-by-step boot environment management
- WireGuard Masterclass — building the encrypted backplane for replication
- Databases on ZFS — database-specific ZFS tuning and configuration
- Kubernetes Masterclass — cluster architecture, etcd, and storage
- Observability Masterclass — Prometheus, Grafana, and alerting
- Security Hardening — defense in depth, access control, and audit
- Operations Guide: Upgrades & Boot Environments — day-to-day operational procedures and runbooks
- Build: Disaster Recovery — building a dedicated DR appliance with kldload
- ZFS Wiki: Snapshots & Replication — reference documentation for ZFS send/receive