pick your distro, get ZFS on root
kldload — your platform, your way, free

I am an operator. Computers exist to facilitate my needs.

This is the operations manual for kldload infrastructure. Not theory. Not architecture diagrams. Day-to-day operations — what you do when you wake up, what you do when something breaks at 3 AM, what you do when the business asks for more capacity, and what you do when you need to prove that everything is working. Every command on this page has been run in production. Every workflow has been tested under pressure. This is the operator's handbook.

The operator philosophy: You are not a ticket router. You are not a button-clicker in a web UI. You are an operator. You understand the system from bare metal to application layer. You automate the toil so you can focus on the work that matters: reliability, capacity, and recovery. The machine serves you. You do not serve the machine.

SRE principles applied to real infrastructure: define SLOs, measure error budgets, automate everything that happens more than twice, and always have a rollback path. ZFS gives you the rollback path. WireGuard gives you the connectivity. eBPF gives you the visibility. kldload gives you all three on every host, every distro, every profile. The rest is discipline.

I have been an SRE for over a decade. The single most important thing I have learned is this: the difference between a good operator and a bad operator is not knowledge — it is the ability to undo. A good operator can undo anything in under 30 seconds. ZFS snapshots make that possible. Boot environments make that possible. Everything on this page is built around that one principle: you can always go back.

Automate the toil

If you do something more than twice, script it. If you script it more than twice, make it a systemd timer. Sanoid handles snapshot rotation. Syncoid handles replication. kupgrade handles OS upgrades. Your job is to set the policy and monitor the outcome.

Toil is work that scales linearly with the size of your fleet. Automation scales to zero.

Measure everything

If you cannot measure it, you cannot manage it. Pool health, ARC hit rate, replication lag, snapshot age, disk utilization, network throughput. Every metric tells you something. The absence of a metric tells you more.

The dashboard that is always green is the dashboard that is not measuring anything.

Always have a rollback

Before every change: snapshot. Before every upgrade: boot environment. Before every migration: replicate. The cost of a snapshot is zero. The cost of not having one is your weekend.

ZFS snapshots are free insurance. Not taking them is uninsured driving.

Blast radius awareness

Every action has a blast radius. A bad dnf update affects one host. A bad ZFS pool import affects one pool. A bad WireGuard config affects one tunnel. Understand the blast radius before you act. Limit it with snapshots, boot environments, and staged rollouts.

Rolling upgrades exist because nobody wants to learn what "fleet-wide outage" means.

Cattle, not pets

Every host should be replaceable from a golden image + cloud-init + ZFS replication in under 10 minutes. If you cannot rebuild a host from scratch without a runbook, it is a pet. Pets die and take your weekend with them. Cattle get replaced.

kexport + cloud-init + syncoid = every host is cattle. Even the ones you love.

Defense in depth

Snapshots protect against operator error. Replication protects against disk failure. Boot environments protect against bad upgrades. WireGuard protects the backplane. eBPF provides visibility without agents. Layer them. Every layer you skip is a gap in your armor.

The operator who says "I don't need backups, I have RAID" is the operator who loses data.

Day 1 — First Boot & Initial Setup

You have installed kldload. The machine has rebooted into your chosen distro with ZFS on root. Here is exactly what you do in the first 30 minutes to turn a fresh install into a production-ready host.

Verify the install

# Confirm ZFS pool is healthy
zpool status rpool
# Expected: state: ONLINE, no errors

# Confirm boot environment exists
zfs list -r rpool/ROOT -o name,used,mountpoint

# Confirm kldload tools are installed (desktop/server profiles)
which kst ksnap kbe kupgrade kpkg kdf kdir kclone kexport krecovery

# Confirm distro
cat /etc/os-release | head -3

# Confirm kernel and ZFS module
uname -r
modinfo zfs | head -3
zfs version

# kldload system health (one command)
kst

Network configuration

# Check current network state
nmcli device status
ip addr show

# Set a static IP (production hosts should never use DHCP)
nmcli connection modify "Wired connection 1" \
  ipv4.method manual \
  ipv4.addresses 10.100.10.50/24 \
  ipv4.gateway 10.100.10.1 \
  ipv4.dns "1.1.1.1 1.0.0.1"
nmcli connection up "Wired connection 1"

# Set hostname
hostnamectl set-hostname prod-web-01.example.com

# Verify DNS resolution
dig +short example.com
ping -c 3 1.1.1.1

WireGuard enrollment

# Generate keypair (do this on every new host)
umask 077
wg genkey | tee /etc/wireguard/private.key | wg pubkey > /etc/wireguard/public.key
wg genpsk > /etc/wireguard/psk.key

# Show public key (send this to your WireGuard hub)
cat /etc/wireguard/public.key

# Create WireGuard config
cat > /etc/wireguard/wg0.conf << 'EOF'
[Interface]
PrivateKey = CONTENTS_OF_PRIVATE_KEY
Address = 10.200.0.50/24
ListenPort = 51820

[Peer]
PublicKey = HUB_PUBLIC_KEY
PresharedKey = CONTENTS_OF_PSK
Endpoint = hub.example.com:51820
AllowedIPs = 10.200.0.0/24
PersistentKeepalive = 25
EOF

# Enable and start WireGuard
systemctl enable --now wg-quick@wg0

# Verify tunnel is up
wg show wg0
ping -c 3 10.200.0.1

Storage verification

# Full pool status with all details
zpool status -v rpool

# Verify compression is enabled
zfs get compression rpool
zfs get compressratio rpool

# Verify dataset layout
zfs list -o name,used,avail,compress,mountpoint

# Run an initial scrub (verify all checksums on disk)
zpool scrub rpool

# Watch scrub progress
zpool status rpool | grep scan

# Set up the data dataset structure
zfs create -o mountpoint=/srv -o compression=lz4 rpool/srv
zfs create -o mountpoint=/srv/data rpool/srv/data
zfs create -o mountpoint=/srv/logs -o compression=zstd rpool/srv/logs
zfs create -o mountpoint=/srv/db -o recordsize=8k rpool/srv/db

Monitoring setup

# Install node_exporter for Prometheus (RPM distros)
kpkg install golang-github-prometheus-node-exporter
systemctl enable --now node_exporter

# Install node_exporter (Debian/Ubuntu)
kpkg install prometheus-node-exporter
systemctl enable --now prometheus-node-exporter

# Verify metrics endpoint
curl -s http://localhost:9100/metrics | head -20

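# Install ZFS exporter (all distros)
# Confirm the asset name on the project's releases page first; release assets
# may be versioned archives rather than a bare binary. -f makes curl fail
# instead of saving an HTML error page as the "binary".
curl -fL -o /usr/local/bin/zfs_exporter \
  https://github.com/pdf/zfs_exporter/releases/latest/download/zfs_exporter-linux-amd64
chmod +x /usr/local/bin/zfs_exporter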

# Create systemd unit for ZFS exporter
cat > /etc/systemd/system/zfs-exporter.service << 'EOF'
[Unit]
Description=ZFS Exporter
After=zfs-mount.service

[Service]
ExecStart=/usr/local/bin/zfs_exporter
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now zfs-exporter

# Verify ZFS metrics
curl -s http://localhost:9134/metrics | grep zfs_pool_health

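# Open firewall for Prometheus scrape (over WireGuard only)
# Zone rules apply only to interfaces bound to the zone, so bind wg0 first
firewall-cmd --permanent --zone=trusted --add-interface=wg0
firewall-cmd --permanent --zone=trusted --add-port=9100/tcp
firewall-cmd --permanent --zone=trusted --add-port=9134/tcp
firewall-cmd --reload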

Snapshot policy (Sanoid)

# Configure Sanoid for automated snapshots
cat > /etc/sanoid/sanoid.conf << 'EOF'
[rpool/ROOT]
  use_template = production
  recursive = yes

[rpool/srv]
  use_template = production
  recursive = yes

[rpool/home]
  use_template = production
  recursive = yes

[template_production]
  frequently = 4
  hourly = 24
  daily = 30
  monthly = 6
  yearly = 1
  autosnap = yes
  autoprune = yes
EOF

# Enable Sanoid timer
systemctl enable --now sanoid.timer

# Verify Sanoid is running
systemctl status sanoid.timer
sanoid --cron --verbose
Day 1 should take 30 minutes. Not 30 hours. Not a week of "hardening." kldload gives you ZFS, WireGuard, eBPF, Sanoid, and monitoring out of the box. Your job on Day 1 is to configure the policy — static IP, WireGuard peer, snapshot retention, monitoring endpoints. The tools are already installed. The kernel modules are already loaded. The services are already there. You just need to turn them on.

Day 2 — Routine Maintenance

Day 2 is every day after the first. This is the steady state. Scrubs, snapshots, updates, certificate renewal. Most of this should be automated. Your job is to verify the automation is working and respond when it is not.

Weekly scrubs

# Manual scrub
zpool scrub rpool

# Check scrub status
zpool status rpool | grep -A 3 scan

# Automated weekly scrub (systemd timer)
cat > /etc/systemd/system/zfs-scrub.service << 'EOF'
[Unit]
Description=ZFS Pool Scrub
After=zfs-mount.service

[Service]
Type=oneshot
ExecStart=/usr/sbin/zpool scrub rpool
EOF

cat > /etc/systemd/system/zfs-scrub.timer << 'EOF'
[Unit]
Description=Weekly ZFS Scrub

[Timer]
OnCalendar=Sun *-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now zfs-scrub.timer

# Verify timer is scheduled
systemctl list-timers --all | grep zfs-scrub

Snapshot rotation verification

# Check Sanoid's last run
systemctl status sanoid.timer
journalctl -u sanoid --since "1 hour ago" --no-pager

# List all snapshots with ages
zfs list -t snapshot -o name,used,creation -S creation | head -30

# Count snapshots by dataset
zfs list -t snapshot -H -o name | awk -F@ '{print $1}' | sort | uniq -c | sort -rn

# Verify snapshot retention policy is pruning correctly
zfs list -t snapshot -o name,creation -S creation | grep rpool/ROOT | head -10

# Manual snapshot (before any risky change)
zfs snapshot -r rpool@manual-$(date +%Y%m%d-%H%M%S)
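
Checking rotation by eye stops working past a handful of datasets. The sketch below flags datasets whose newest snapshot is too old. It assumes input in the tab-separated format of `zfs list -Hp -t snapshot -o name,creation` (name, then creation time in epoch seconds). The function name `stale_datasets` and the two-hour threshold are illustrative and not part of kldload or Sanoid.

```bash
# stale_datasets NOW_EPOCH MAX_AGE_SECONDS
# Reads "name<TAB>creation_epoch" lines on stdin and prints every dataset
# whose newest snapshot is older than MAX_AGE_SECONDS, with that age in seconds.
stale_datasets() {
  local now=$1 max_age=$2
  awk -F'\t' -v now="$now" -v max="$max_age" '
    {
      split($1, parts, "@")          # parts[1] = dataset, parts[2] = snapshot name
      ds = parts[1]
      ts = $2 + 0
      if (!(ds in newest) || ts > newest[ds]) newest[ds] = ts
    }
    END {
      for (ds in newest)
        if (now - newest[ds] > max)
          printf "%s %d\n", ds, now - newest[ds]
    }' | sort
}

# Usage on a live host (hypothetical 2-hour threshold):
#   zfs list -Hp -t snapshot -o name,creation | stale_datasets "$(date +%s)" 7200
```

Empty output means every dataset has a recent snapshot. Any line printed is a dataset Sanoid has stopped covering.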

Package updates

# kldload way: kpkg snapshots before every operation
kpkg update            # refresh package index
kpkg upgrade           # snapshot + upgrade all packages

# Manual way (CentOS/RHEL/Rocky/Fedora)
zfs snapshot -r rpool@pre-update-$(date +%Y%m%d)
dnf check-update
dnf upgrade -y
needs-restarting -r    # check if reboot needed

# Manual way (Debian/Ubuntu)
zfs snapshot -r rpool@pre-update-$(date +%Y%m%d)
apt update
apt upgrade -y
[ -f /var/run/reboot-required ] && echo "REBOOT NEEDED"

# Manual way (Arch)
zfs snapshot -r rpool@pre-update-$(date +%Y%m%d)
pacman -Syu

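# If the update broke something, rollback is instant
# zfs rollback acts on one dataset at a time: -r destroys snapshots newer than
# the target, it does not recurse into child datasets. Roll back each affected dataset.
zfs rollback -r rpool/ROOT/default@pre-update-20260404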

Kernel upgrades with boot environments

# The safe way: boot environment before kernel upgrade
kbe create before-kernel-6.12
kupgrade

# Or manually
zfs snapshot rpool/ROOT/default@before-kernel-6.12

# Upgrade kernel (CentOS/RHEL/Rocky)
dnf upgrade kernel kernel-devel kernel-headers -y

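# Rebuild ZFS module for new kernel
# (dkms builds for the running kernel unless -k names another one)
new_kernel=$(ls -1 /lib/modules | sort -V | tail -n 1)   # assumes newest = highest version
dkms autoinstall -k "$new_kernel"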

# Verify ZFS module built for new kernel
dkms status | grep zfs

# Reboot into new kernel
reboot

# After reboot: verify
uname -r
zfs version
zpool status rpool

# If new kernel broke ZFS or anything else:
# Boot the kldload ISO, import pool, activate previous BE
krecovery import rpool
krecovery activate before-kernel-6.12
reboot

Certificate renewal

# Check certificate expiry
openssl x509 -in /etc/pki/tls/certs/server.crt -noout -enddate

# Let's Encrypt renewal (certbot)
certbot renew --dry-run
certbot renew

# Automated renewal timer (certbot-renew.timer on RPM distros, certbot.timer on Debian/Ubuntu)
systemctl enable --now certbot-renew.timer

# Check all certificates on the system
find /etc -name '*.crt' -o -name '*.pem' | while read cert; do
  expiry=$(openssl x509 -in "$cert" -noout -enddate 2>/dev/null)
  [ -n "$expiry" ] && echo "$cert: $expiry"
done

# Restart services after renewal
systemctl reload nginx
systemctl reload httpd
The single most common failure mode in production is "I forgot to renew the certificate." Not disk failure. Not kernel panic. Not ransomware. An expired TLS certificate on a Friday afternoon. Automate certbot. Set an alert for 14 days before expiry. This is the easiest outage to prevent and the most embarrassing to explain.
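
The 14-day alert can be sketched as a small check that turns a certificate's end date into days remaining. It assumes GNU `date` (for `-d` parsing) and the `notAfter=` text printed by `openssl x509 -noout -enddate`. The function names `days_until` and `cert_expiring_soon` are illustrative.

```bash
# days_until END_DATE NOW_EPOCH
# END_DATE is the text after "notAfter=" in `openssl x509 -noout -enddate` output.
# Prints whole days between NOW_EPOCH and END_DATE (negative once expired).
days_until() {
  local end_epoch
  end_epoch=$(date -u -d "$1" +%s) || return 2
  echo $(( (end_epoch - $2) / 86400 ))
}

# cert_expiring_soon END_DATE NOW_EPOCH [THRESHOLD_DAYS]
# Succeeds (exit 0) when fewer than THRESHOLD_DAYS remain (default 14).
cert_expiring_soon() {
  local left
  left=$(days_until "$1" "$2") || return 2
  [ "$left" -lt "${3:-14}" ]
}

# Usage against a real certificate (path taken from the example above):
#   end=$(openssl x509 -in /etc/pki/tls/certs/server.crt -noout -enddate | cut -d= -f2)
#   cert_expiring_soon "$end" "$(date +%s)" && echo "RENEW SOON: expires $end"
```

For a plain yes/no answer without the day count, `openssl x509 -checkend 1209600` does the same comparison in one call. The day count is the more useful value to export as a metric.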

Upgrades — The kupgrade Workflow

Upgrades are the single most dangerous routine operation. A bad upgrade can brick a host, break a kernel module, or silently corrupt a configuration file. The kldload upgrade workflow does not make upgrades safe; it makes them reversible. Snapshot the current state, upgrade, test, then keep or roll back. Every upgrade is one boot-environment activation away from never having happened.

The kupgrade command

# kupgrade does all of this in one command:
# 1. Creates a boot environment snapshot
# 2. Updates all packages (distro-aware: dnf/apt/pacman)
# 3. Rebuilds ZFS DKMS if kernel changed
# 4. Updates bootloader if needed
# 5. Reports what changed
kupgrade

# Dry run (see what would change)
kupgrade --check

# Force rebuild of ZFS module
kupgrade --rebuild-zfs

Manual upgrade workflow (when you want full control)

# Step 1: Snapshot everything
zfs snapshot -r rpool@pre-upgrade-$(date +%Y%m%d)

# Step 2: Create named boot environment
kbe create pre-upgrade-$(date +%Y%m%d)
kbe list

# Step 3: Upgrade packages
# CentOS/RHEL/Rocky:
dnf upgrade -y

# Debian/Ubuntu:
apt update && apt upgrade -y

# Fedora:
dnf upgrade -y --refresh

# Arch:
pacman -Syu --noconfirm

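# Step 4: Rebuild ZFS if kernel changed
# (dkms and modinfo default to the running kernel, so name the new one explicitly)
new_kernel=$(ls -1 /lib/modules | sort -V | tail -n 1)   # assumes newest = highest version
dkms autoinstall -k "$new_kernel"
dkms status
modinfo -k "$new_kernel" zfs | grep ^version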

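# Step 5: Verify and reboot
# RPM distros: needs-restarting -r exits 1 when a reboot is required
# Debian/Ubuntu: /var/run/reboot-required exists when a reboot is required
needs-restarting -r 2>/dev/null
[ -f /var/run/reboot-required ] && echo "REBOOT NEEDED"
reboot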

# Step 6: Post-reboot verification
uname -r
zfs version
zpool status rpool
systemctl --failed
kst

Rollback if upgrade failed

# If the system boots fine but something is broken:
kbe activate pre-upgrade-20260404
reboot

# If the system does not boot (boot from kldload ISO):
krecovery import rpool
krecovery list-be
krecovery activate pre-upgrade-20260404
reboot

# If you need to inspect before rolling back:
krecovery chroot
# You're now inside the installed system — check logs, fix configs
journalctl -b -1 --no-pager | tail -50
exit
reboot

Per-distro upgrade paths

CentOS Stream 9 / RHEL 9 / Rocky 9

Point releases (9.1 to 9.2 to 9.3) are just dnf upgrade. No special procedure. The kernel and ZFS module upgrade together via DKMS. CentOS Stream tracks ahead of RHEL, so you test on Stream and deploy to RHEL/Rocky.

dnf upgrade + dkms autoinstall + reboot. That is the entire CentOS upgrade path.

Debian 13 (Trixie)

Stable Debian rarely changes. Security updates come through apt upgrade. Major version upgrades (Trixie to the next release) require editing /etc/apt/sources.list and running apt dist-upgrade. Always snapshot first.

Debian stable is the calm distro. It only breaks when you tell it to.

Ubuntu 24.04 (Noble)

LTS releases get 5 years of support. Point upgrades via apt upgrade. Major version jumps via do-release-upgrade. HWE kernels bring newer hardware support without a full release upgrade.

Ubuntu LTS: 5 years of "apt upgrade" and it just works. The boring choice is the right choice.

Fedora 41

Fedora releases every 6 months. Upgrade via dnf system-upgrade download --releasever=42 then dnf system-upgrade reboot. Fedora is the bleeding edge — snapshot before every upgrade, no exceptions.

Fedora is where RHEL features come from. Running Fedora in production requires boot environments.

Arch Linux

Rolling release. Every pacman -Syu is a potential breaking change. There is no "safe" upgrade path — there is only "snapshot before upgrade." This is why kldload installs ZFS on Arch: because pacman -Syu is Russian roulette without a rollback.

Arch + ZFS snapshots = the bleeding edge with a safety net. Arch without snapshots = chaos.

ZFS module coordination

When the kernel upgrades, the ZFS DKMS module must rebuild. If DKMS fails, the new kernel will boot without ZFS — and your root filesystem will not mount. Always verify dkms status shows the ZFS module built for the new kernel before rebooting. kupgrade does this automatically.

Kernel + ZFS must match. dkms autoinstall. Check dkms status. Then reboot. Never before.
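
A pre-reboot gate for this check can be sketched as a filter over `dkms status` output. It assumes the two line layouts dkms is known to print (`zfs, <ver>, <kernel>, <arch>: installed` and `zfs/<ver>, <kernel>, <arch>: installed`). The function name `zfs_built_for` is illustrative, and the "newest kernel" lookup in the usage comment is a heuristic, not how kupgrade does it.

```bash
# zfs_built_for KERNEL_VERSION
# Reads `dkms status` output on stdin. Succeeds only if a zfs module line for
# KERNEL_VERSION reports "installed".
zfs_built_for() {
  local kver=$1
  awk -F'[,:] *' -v kver="$kver" '
    ($1 == "zfs" || index($1, "zfs/") == 1) {
      seen = 0
      for (i = 2; i <= NF; i++) if ($i == kver) seen = 1
      if (seen && $NF ~ /^installed/) ok = 1
    }
    END { exit (ok ? 0 : 1) }'
}

# Usage before rebooting into a freshly installed kernel:
#   new_kernel=$(ls -1 /lib/modules | sort -V | tail -n 1)   # heuristic: highest version
#   if dkms status | zfs_built_for "$new_kernel"; then
#     echo "ZFS is built for $new_kernel, safe to reboot"
#   else
#     echo "ZFS is NOT built for $new_kernel, do not reboot" >&2
#   fi
```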

Backup & Recovery

Backups are not optional. Backups that have never been tested are not backups — they are hopes. The kldload backup strategy is simple: Sanoid for local snapshot retention, Syncoid for replication to a remote host, and regular restore tests to prove it all works. ZFS makes this trivially easy because snapshots and replication are native, atomic, and block-level efficient.

Sanoid retention policies

# /etc/sanoid/sanoid.conf — production policy
cat > /etc/sanoid/sanoid.conf << 'EOF'
[rpool/ROOT]
  use_template = production
  recursive = yes

[rpool/srv]
  use_template = production
  recursive = yes

[rpool/home]
  use_template = production
  recursive = yes

[rpool/srv/db]
  use_template = database
  recursive = yes

[template_production]
  frequently = 4
  hourly = 24
  daily = 30
  monthly = 12
  yearly = 2
  autosnap = yes
  autoprune = yes

[template_database]
  frequently = 12
  hourly = 48
  daily = 90
  monthly = 24
  yearly = 5
  autosnap = yes
  autoprune = yes
EOF

# Verify configuration
sanoid --configcheck

# Run Sanoid manually to test
sanoid --cron --verbose

# Enable the timer
systemctl enable --now sanoid.timer

Syncoid replication to remote

# One-time: set up SSH key for automated replication
ssh-keygen -t ed25519 -f /root/.ssh/syncoid -N ""
ssh-copy-id -i /root/.ssh/syncoid.pub root@backup-server

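# Test Syncoid manually (--sshkey points at the dedicated key created above)
syncoid -r --sshkey=/root/.ssh/syncoid rpool root@backup-server:tank/backup/prod-web-01

# Syncoid with compression over the wire
syncoid -r --sshkey=/root/.ssh/syncoid --compress=zstd-fast rpool root@backup-server:tank/backup/prod-web-01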

# Syncoid systemd timer for automated replication
cat > /etc/systemd/system/syncoid.service << 'EOF'
[Unit]
Description=ZFS Replication via Syncoid
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/syncoid -r --no-sync-snap --sshkey=/root/.ssh/syncoid rpool root@backup-server:tank/backup/prod-web-01
TimeoutStartSec=3600
EOF

cat > /etc/systemd/system/syncoid.timer << 'EOF'
[Unit]
Description=Hourly ZFS Replication

[Timer]
OnCalendar=*-*-* *:15:00
Persistent=true

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now syncoid.timer

# Check last replication
systemctl status syncoid
journalctl -u syncoid --since "2 hours ago" --no-pager

Testing restores (the part everyone skips)

# On the backup server: clone a snapshot and verify it
# This proves your backups are real, not just hopes

# List available backups
ssh root@backup-server "zfs list -t snapshot -r tank/backup/prod-web-01 -o name,creation -S creation | head -10"

# Clone the latest backup snapshot
ssh root@backup-server "zfs clone tank/backup/prod-web-01/ROOT/default@autosnap_2026-04-04_hourly \
  tank/restore-test"

# Mount and verify contents
ssh root@backup-server "ls -la /tank/restore-test/etc/ /tank/restore-test/srv/"

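# For a full boot test: build a scratch pool on a zvol and receive the backup into it
# (tank/restore-test is a filesystem clone, so it has no /dev/zvol device to copy with dd;
#  "restorepool" is a placeholder pool name)
ssh root@backup-server "zfs create -V 40G tank/restore-test-zvol"
ssh root@backup-server "zpool create -R /mnt/restore-test restorepool /dev/zvol/tank/restore-test-zvol"
ssh root@backup-server "zfs send -R tank/backup/prod-web-01/ROOT@autosnap_2026-04-04_hourly | zfs receive -u restorepool/ROOT"
ssh root@backup-server "zpool export restorepool"
# Attach the zvol to a VM, boot the kldload ISO, and verify the pool imports and services start

# Clean up after test
ssh root@backup-server "zfs destroy tank/restore-test"
ssh root@backup-server "zfs destroy tank/restore-test-zvol"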

# Automate monthly restore tests
cat > /etc/systemd/system/restore-test.service << 'EOF'
[Unit]
Description=Monthly Backup Restore Test

[Service]
Type=oneshot
ExecStart=/usr/local/bin/test-restore.sh
EOF

cat > /etc/systemd/system/restore-test.timer << 'EOF'
[Unit]
Description=Monthly Restore Test

[Timer]
OnCalendar=*-*-01 03:00:00
Persistent=true

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now restore-test.timer
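
The timer above calls `/usr/local/bin/test-restore.sh`, which this page does not define. The sketch below shows one possible shape for it. It assumes the script runs on the backup server (wrap the calls in `ssh` if the timer lives on the production host), uses the dataset names from the examples above, and treats a non-empty `etc/os-release` as a stand-in for "this is a real root filesystem". The function names are illustrative.

```bash
# newest_snapshot
# Reads "name<TAB>creation_epoch" lines on stdin (zfs list -Hp output)
# and prints the name of the most recent snapshot.
newest_snapshot() {
  sort -t "$(printf '\t')" -k2,2n | tail -n 1 | cut -f1
}

# run_restore_test DATASET CLONE_NAME
# Clones the newest snapshot of DATASET, checks a marker file, destroys the clone.
run_restore_test() {
  local dataset=$1 clone=$2 snap mnt rc=0
  snap=$(zfs list -Hp -t snapshot -d 1 -o name,creation "$dataset" | newest_snapshot)
  [ -n "$snap" ] || { echo "FAIL: no snapshots for $dataset" >&2; return 1; }
  zfs clone "$snap" "$clone" || return 1
  mnt=$(zfs get -H -o value mountpoint "$clone")
  if [ -s "$mnt/etc/os-release" ]; then
    echo "OK: $snap restored to $mnt"
  else
    echo "FAIL: $snap is missing etc/os-release" >&2
    rc=1
  fi
  zfs destroy "$clone"
  return "$rc"
}

# Example invocation (not run automatically):
#   run_restore_test tank/backup/prod-web-01/ROOT/default tank/restore-test
```

The non-zero exit status matters: systemd marks the unit as failed, so a broken restore shows up in `systemctl --failed`.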

Disaster recovery runbook

# ═══════════════════════════════════════════
# DISASTER RECOVERY RUNBOOK
# ═══════════════════════════════════════════

# Scenario 1: Host won't boot — disk is fine
# Boot from kldload ISO, recover from boot environment
krecovery import rpool
krecovery list-be
krecovery activate last-known-good
reboot

# Scenario 2: Single disk failed in mirror
# Replace disk, resilver (ZFS handles this automatically)
zpool status rpool                  # identify failed disk
zpool replace rpool /dev/sdX /dev/sdY
zpool status rpool                  # watch resilver progress

# Scenario 3: Total host loss
# New hardware, fresh kldload install, restore from Syncoid backup
# On new host after install:
zfs receive -F rpool < <(ssh root@backup-server "zfs send -R tank/backup/prod-web-01@latest")

# Scenario 4: Accidental data deletion
# Find the snapshot, clone or rollback
zfs list -t snapshot -o name,creation -S creation | grep srv/data | head -5
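# Plain rollback only works for the most recent snapshot;
# add -r for an older one (this destroys every snapshot newer than the target)
zfs rollback rpool/srv/data@autosnap_2026-04-04_hourly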

# Scenario 5: Accidental data deletion (need specific files)
# Mount snapshot read-only, copy what you need
mkdir /tmp/recovery
mount -t zfs rpool/srv/data@autosnap_2026-04-04_hourly /tmp/recovery -o ro
cp /tmp/recovery/important-file.txt /srv/data/
umount /tmp/recovery
The restore test is the most important backup operation. A backup you have never restored is Schrödinger's backup: it both exists and does not exist until you try to use it. I run monthly restore tests on every production dataset. It takes 5 minutes because ZFS clones are instant. There is no excuse for not testing your restores.

Capacity Planning

ZFS pools have one iron rule: never let a pool exceed 80% capacity. Performance degrades sharply after 80% because COW operations need free space to write new blocks before releasing old ones. At 90%, writes slow to a crawl. At 95%, the pool may become unusable. Capacity planning is not optional — it is a survival requirement.

Pool monitoring

# Current pool utilization
zpool list -o name,size,alloc,free,cap,health
# cap = percentage used — alert at 70%, act at 80%

# Per-dataset usage
zfs list -o name,used,avail,refer,mountpoint -S used | head -20

# Snapshot space consumption (often the surprise)
zfs list -t snapshot -o name,used -S used | head -20

# Space consumed by snapshots per dataset
zfs list -o name,usedbysnapshots -S usedbysnapshots | head -10

# Compression savings
zfs get -H -r -t filesystem -o name,value compressratio rpool | awk '$2 != "1.00x"'

# ARC usage
cat /proc/spl/kstat/zfs/arcstats | grep -E "^(size|c_max|hits|misses)" | column -t

# ARC hit rate (should be >90%)
arc_hits=$(awk '/^hits/ {print $3}' /proc/spl/kstat/zfs/arcstats)
arc_misses=$(awk '/^misses/ {print $3}' /proc/spl/kstat/zfs/arcstats)
echo "ARC hit rate: $(echo "scale=2; $arc_hits * 100 / ($arc_hits + $arc_misses)" | bc)%"

# Growth rate (compare weekly pool usage)
# Record this in your monitoring system
echo "$(date +%Y%m%d) $(zpool list -H -o cap rpool)" >> /var/log/pool-capacity.log
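
The "alert at 70%, act at 80%" rule can be sketched as a filter over `zpool list -H -o name,cap` output. The function name `pool_capacity_report` and the OK/WARN/ACT labels are illustrative. The thresholds default to the values stated above.

```bash
# pool_capacity_report [WARN_PCT] [ACT_PCT]
# Reads "pool<TAB>cap" lines on stdin (e.g. "rpool<TAB>72%") and labels each
# pool OK, WARN, or ACT. Exit status is 1 if any pool has reached ACT_PCT.
pool_capacity_report() {
  local warn=${1:-70} act=${2:-80}
  awk -v warn="$warn" -v act="$act" '
    {
      pct = $2; sub(/%$/, "", pct); pct += 0
      level = (pct >= act) ? "ACT" : ((pct >= warn) ? "WARN" : "OK")
      if (level == "ACT") bad = 1
      printf "%s %s %d%%\n", level, $1, pct
    }
    END { exit (bad ? 1 : 0) }'
}

# Usage on a live host:
#   zpool list -H -o name,cap | pool_capacity_report || echo "expand or prune now"
```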

When and how to expand

# RULE: Grow a pool by adding vdevs, or by replacing every disk in a vdev with a larger one.
# ZFS pools are the sum of their vdevs. Add more vdevs for more space.

# Add a mirror vdev to existing pool
zpool add rpool mirror /dev/sdC /dev/sdD

# Add a RAIDZ1 vdev to existing pool
zpool add rpool raidz1 /dev/sdC /dev/sdD /dev/sdE

# Verify new capacity
zpool list rpool
zpool status rpool

# NEVER do this (hard to undo, and mixing vdev types in one pool is bad):
# zpool add rpool /dev/sdC    ← single disk, no redundancy: losing it loses the pool

# If you need to replace smaller disks with larger ones:
# Replace each disk in the mirror/raidz one at a time
zpool replace rpool /dev/sdA-old /dev/sdA-new
# Wait for resilver to complete
zpool status rpool | grep scan
# Then replace the next disk
zpool replace rpool /dev/sdB-old /dev/sdB-new
# After ALL disks in the vdev are replaced, run:
zpool online -e rpool /dev/sdA-new
zpool online -e rpool /dev/sdB-new
# The pool will now use the full capacity of the larger disks

ARC sizing

# Check current ARC size and max
arc_size=$(awk '/^size/ {print $3}' /proc/spl/kstat/zfs/arcstats)
arc_max=$(awk '/^c_max/ {print $3}' /proc/spl/kstat/zfs/arcstats)
echo "ARC size: $((arc_size / 1024 / 1024)) MB"
echo "ARC max: $((arc_max / 1024 / 1024)) MB"

# Set ARC max to 8GB (for a 32GB host)
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs-arc.conf

# Set ARC max to 4GB (for a 16GB host)
echo "options zfs zfs_arc_max=4294967296" > /etc/modprobe.d/zfs-arc.conf

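# Apply without reboot
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
# With ZFS on root the module loads from the initramfs, so rebuild it
# (dracut -f, update-initramfs -u, or mkinitcpio -P) for the modprobe.d value to apply at boot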

# Guidelines:
# - Dedicated file server: 50-75% of RAM for ARC
# - Database server: 25% of RAM for ARC (databases have own cache)
# - VM host: 25-33% of RAM for ARC (VMs need their own RAM)
# - Desktop: 25% of RAM for ARC
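
The percentages in these guidelines translate to a `zfs_arc_max` byte value with one line of arithmetic. The sketch below assumes total RAM is read in kB from `/proc/meminfo`. The function name `arc_max_bytes` is illustrative.

```bash
# arc_max_bytes MEMTOTAL_KB PERCENT
# Converts total RAM in kB and a target percentage into bytes for zfs_arc_max.
arc_max_bytes() {
  echo $(( $1 * 1024 * $2 / 100 ))
}

# Usage (25% of RAM, the database and desktop guideline above):
#   mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
#   echo "options zfs zfs_arc_max=$(arc_max_bytes "$mem_kb" 25)" > /etc/modprobe.d/zfs-arc.conf
```

With exactly 32 GiB (33554432 kB) and 25% this yields 8589934592, the 8GB value used above. Real hosts report slightly less than nominal RAM in `MemTotal`, so the computed value will be slightly lower.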

Disk replacement procedure

# Identify the failed disk
zpool status rpool
# Look for: DEGRADED, FAULTED, or checksum errors

# If the disk is still present but failing:
zpool offline rpool /dev/sdX

# Physically replace the disk, then:
zpool replace rpool /dev/sdX-old /dev/sdX-new

# Monitor resilver progress
zpool status rpool
# Look for: "scan: resilver in progress"

# Watch resilver I/O
zpool iostat rpool 5

# After resilver completes:
zpool status rpool
# Should show: state: ONLINE, no errors

# Run a scrub after resilver to verify
zpool scrub rpool

# If using disk by-id (recommended for production):
ls -la /dev/disk/by-id/ | grep -v part
zpool replace rpool /dev/disk/by-id/old-disk-serial /dev/disk/by-id/new-disk-serial
The number one capacity surprise is snapshot space. You set a 30-day retention policy and forget about it. Then someone writes 50GB of temp files, deletes them, and the snapshots hold onto every byte for 30 days. Monitor usedbysnapshots per dataset. That is where your "missing" space is hiding.
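
To find that hidden space, the sketch below lists datasets whose snapshots hold more than a chosen amount. It assumes the tab-separated, byte-valued output of `zfs list -Hp -o name,usedbysnapshots`. The function name `snapshot_hogs` and the 10 GiB threshold in the usage line are illustrative.

```bash
# snapshot_hogs MIN_BYTES
# Reads "dataset<TAB>usedbysnapshots_bytes" lines on stdin and prints datasets
# whose snapshots hold at least MIN_BYTES, largest first, with the size in GiB.
snapshot_hogs() {
  awk -F'\t' -v min="$1" '
    $2 + 0 >= min + 0 { printf "%s\t%.1f GiB\n", $1, $2 / 1073741824 }' \
    | sort -t "$(printf '\t')" -k2,2nr
}

# Usage on a live host (datasets with 10 GiB or more held by snapshots):
#   zfs list -Hp -o name,usedbysnapshots -r rpool | snapshot_hogs 10737418240
```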

Incident Response — Something Broke at 3 AM

The phone rings. Something is down. Here is the exact sequence of commands you run, in order, every time. Triage first, fix second, document third. ZFS and boot environments mean that most incidents end with a two-second rollback. The goal is not to debug at 3 AM — the goal is to restore service at 3 AM and debug at 10 AM with coffee.

The triage sequence

# Step 1: Is the host reachable?
ping -c 3 10.200.0.50                 # WireGuard IP
ping -c 3 10.100.10.50                # physical IP

# Step 2: SSH in and check pool health
ssh admin@10.200.0.50

# Step 3: Pool status (the first thing you check, always)
zpool status rpool

# Step 4: System health overview
kst

# Step 5: Failed services
systemctl --failed

# Step 6: Recent logs
journalctl -b --priority=err --no-pager | tail -30

# Step 7: Disk I/O (is something thrashing the disk?)
iostat -xz 2 5

# Step 8: Memory pressure
free -h
cat /proc/meminfo | grep -E "MemTotal|MemAvailable|SwapTotal|SwapFree"

# Step 9: Process load
top -bn1 | head -20

# Step 10: Network connectivity
ss -tlnp          # listening ports
wg show           # WireGuard tunnels

Common incidents and immediate fixes

# ── Service crashed ──
systemctl status myservice
journalctl -u myservice --since "1 hour ago" --no-pager | tail -50
systemctl restart myservice

# ── Bad config deployed ──
# Roll back to the most recent snapshot (instant; an older target needs -r, which destroys newer snapshots)
zfs list -t snapshot -o name,creation -S creation | grep srv | head -5
zfs rollback rpool/srv/myapp@autosnap_2026-04-04_hourly
systemctl restart myservice

# ── Bad package update broke things ──
# Rollback the boot environment
kbe list
kbe activate pre-upgrade-20260404
reboot

# ── Disk filled up ──
zfs list -o name,used,avail -S used | head -10
zfs list -t snapshot -o name,used -S used | head -10
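# Delete old snapshots to free space
zfs destroy rpool/srv/logs@autosnap_2026-03-01_daily
# Or destroy every snapshot whose name matches a pattern
# Dry run first: -n changes nothing, -v lists what would be destroyed
zfs list -t snapshot -H -o name | grep "2026-01" | xargs -n1 zfs destroy -nv
# Then repeat the pipeline with plain "zfs destroy" once the list looks right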

# ── Pool degraded (disk failing) ──
zpool status rpool
# If mirrored: the pool is still online with redundancy
# Schedule disk replacement during business hours
# DO NOT resilver at 3 AM unless the second disk is also failing
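
For the disk-full case, selecting snapshots by creation time is safer than matching on names. The sketch below assumes the tab-separated output of `zfs list -Hp -t snapshot -o name,creation`. The function name `snapshots_older_than` and the 60-day cutoff are illustrative.

```bash
# snapshots_older_than CUTOFF_EPOCH
# Reads "name<TAB>creation_epoch" lines on stdin and prints the names of
# snapshots created before CUTOFF_EPOCH, oldest first.
snapshots_older_than() {
  awk -F'\t' -v cutoff="$1" '$2 + 0 < cutoff + 0 { print $2 "\t" $1 }' \
    | sort -n | cut -f2
}

# Usage: dry run first (-n changes nothing, -v prints what would be destroyed)
#   cutoff=$(date -d '60 days ago' +%s)
#   zfs list -Hp -t snapshot -o name,creation -r rpool/srv/logs \
#     | snapshots_older_than "$cutoff" | xargs -r -n1 zfs destroy -nv
#   # Re-run without -n once the list looks right.
```

Snapshots that Sanoid manages will be pruned by its own policy, so this is for the emergency where waiting for the next prune is not an option.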

eBPF for live diagnosis

# What is using the disk right now?
biotop

# What is using the CPU right now?
bpftrace -e 'profile:hz:99 { @[comm] = count(); }'
# Ctrl+C after 10 seconds to see results

# What files are being opened?
opensnoop -d 10

# What network connections are being made?
tcpconnect

# Disk I/O latency histogram in milliseconds (look for a tail above 10 ms)
biolatency -m

# ZFS operations slower than 1ms
zfsslower 1

# Which process is eating memory?
bpftrace -e 'tracepoint:kmem:mm_page_alloc { @[comm] = count(); }'

# Snapshot for forensic evidence before fixing
zfs snapshot -r rpool@incident-$(date +%Y%m%d-%H%M%S)

The "I don't even get up anymore" workflow

# This is the endgame: fully automated incident response.
# When a host is so broken that fixing it is slower than replacing it.

# Step 1: The monitoring system detects the failure
# Step 2: Automation clones the golden image
qemu-img create -f qcow2 -b /var/lib/libvirt/images/golden.qcow2 \
  -F qcow2 /var/lib/libvirt/images/replacement.qcow2

# Step 3: Cloud-init configures the replacement
cat > /var/lib/libvirt/images/user-data << 'EOF'
#cloud-config
hostname: prod-web-01
manage_etc_hosts: true
ssh_authorized_keys:
  - ssh-ed25519 AAAA... admin@fleet
runcmd:
  - systemctl enable --now wg-quick@wg0
  - syncoid -r root@backup-server:tank/backup/prod-web-01 rpool
EOF

# Step 4: Boot the replacement
virt-install --name prod-web-01-replacement --ram 4096 --vcpus 4 \
  --disk path=/var/lib/libvirt/images/replacement.qcow2 \
  --import --os-variant centos-stream9 --network network=default \
  --cloud-init user-data=/var/lib/libvirt/images/user-data \
  --noautoconsole

# Step 5: Syncoid restores data from backup
# Step 6: DNS/load balancer cuts over to replacement
# Step 7: You wake up, check the log, nod, go back to sleep
The goal of mature operations is not to prevent incidents. Incidents will happen. Hardware fails. Software has bugs. Operators make mistakes. The goal is to make incidents boring. A two-second rollback is boring. A clone-and-replace is boring. Debugging a corrupted filesystem at 3 AM is not boring — it is a sign that your operations process has failed long before the incident.

Fleet Operations

A fleet is two or more hosts running kldload. The principles are the same at 2 hosts or 200: WireGuard mesh for connectivity, Syncoid for backups, rolling upgrades for safety, and SSH for everything else. You do not need Ansible for a kldload fleet. You need for-loops, SSH keys, and discipline.

WireGuard mesh for fleet connectivity

# Hub-and-spoke topology (simplest, works for 2-50 nodes)
# Hub config (/etc/wireguard/wg0.conf on the hub):
[Interface]
PrivateKey = HUB_PRIVATE_KEY
Address = 10.200.0.1/24
ListenPort = 51820

[Peer]
# prod-web-01
PublicKey = NODE1_PUBLIC_KEY
AllowedIPs = 10.200.0.10/32

[Peer]
# prod-web-02
PublicKey = NODE2_PUBLIC_KEY
AllowedIPs = 10.200.0.11/32

[Peer]
# prod-db-01
PublicKey = NODE3_PUBLIC_KEY
AllowedIPs = 10.200.0.20/32

# Full mesh (every node peers with every other node)
# Generate a keypair for each of N nodes:
umask 077   # keep the private keys unreadable to other users
for i in $(seq 1 16); do
  echo "=== Node $i: 10.200.0.$i ==="
  wg genkey | tee /tmp/node${i}.key | wg pubkey > /tmp/node${i}.pub
done

# Verify mesh connectivity from any node:
for ip in 10.200.0.{1..16}; do
  ping -c 1 -W 1 $ip >/dev/null 2>&1 && echo "$ip: UP" || echo "$ip: DOWN"
done
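
The key-generation loop stops short of the configs themselves. A minimal sketch of the missing step follows: given a node inventory, print the `[Peer]` stanzas one node needs to reach all the others. The inventory format (`name pubkey wg_ip endpoint`, one node per line) and the function name `mesh_peers` are assumptions made for illustration.

```bash
# mesh_peers SELF_NAME
# Reads "name pubkey wg_ip endpoint" lines on stdin and prints a [Peer] stanza
# for every node except SELF_NAME. In a full mesh each peer owns exactly one /32.
mesh_peers() {
  local self=$1 name pubkey ip endpoint
  while read -r name pubkey ip endpoint; do
    if [ -z "$name" ] || [ "$name" = "$self" ]; then
      continue
    fi
    printf '[Peer]\n# %s\nPublicKey = %s\nAllowedIPs = %s/32\nEndpoint = %s\nPersistentKeepalive = 25\n\n' \
      "$name" "$pubkey" "$ip" "$endpoint"
  done
}

# Usage (nodes.txt is a hypothetical inventory file):
#   mesh_peers prod-web-01 < nodes.txt >> /etc/wireguard/wg0.conf
```

A full mesh needs N-1 peers per node, so 16 nodes means 240 stanzas across the fleet. That is why they are generated rather than typed.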

Fleet health checks

# Define fleet nodes (brace expansion does not happen inside quotes, so expand via echo)
FLEET=$(echo 10.200.0.{1..16})

# Pool health across fleet
for ip in $FLEET; do
  health=$(ssh -o ConnectTimeout=5 admin@$ip 'zpool list -H -o health rpool' 2>/dev/null || echo "UNREACHABLE")
  printf "%-15s %s\n" "$ip" "$health"
done

# Kernel versions across fleet
for ip in $FLEET; do
  ver=$(ssh -o ConnectTimeout=5 admin@$ip 'uname -r' 2>/dev/null || echo "UNREACHABLE")
  printf "%-15s %s\n" "$ip" "$ver"
done

# ZFS versions across fleet
for ip in $FLEET; do
  ver=$(ssh -o ConnectTimeout=5 admin@$ip 'zfs version 2>/dev/null | head -1' 2>/dev/null || echo "UNREACHABLE")
  printf "%-15s %s\n" "$ip" "$ver"
done

# Pool capacity across fleet
for ip in $FLEET; do
  cap=$(ssh -o ConnectTimeout=5 admin@$ip 'zpool list -H -o cap rpool' 2>/dev/null || echo "UNREACHABLE")
  printf "%-15s %s\n" "$ip" "$cap"
done

# Failed systemd units across fleet
for ip in $FLEET; do
  failed=$(ssh -o ConnectTimeout=5 admin@$ip 'systemctl --failed --no-legend --no-pager | wc -l' 2>/dev/null || echo "?")
  printf "%-15s %s failed units\n" "$ip" "$failed"
done

# Last Sanoid run across fleet
for ip in $FLEET; do
  last=$(ssh -o ConnectTimeout=5 admin@$ip 'systemctl show sanoid.timer -p LastTriggerUSec --value' 2>/dev/null || echo "?")
  printf "%-15s %s\n" "$ip" "$last"
done
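
The six loops above differ only in the remote command, so they can be folded into one helper. This is a sketch, assuming `FLEET` holds a space-separated list of addresses and that the `admin` account used above exists on every node. `fleet_run` is a hypothetical name.

```shell
# fleet_run COMMAND
# Run COMMAND on every address in $FLEET and print one "ip result" row per node.
# A node that cannot be reached (or whose command fails) is shown as UNREACHABLE.
fleet_run() {
  local cmd=$1 ip out
  for ip in $FLEET; do
    out=$(ssh -o ConnectTimeout=5 "admin@$ip" "$cmd" 2>/dev/null) || out="UNREACHABLE"
    printf '%-15s %s\n' "$ip" "$out"
  done
}
```

Usage: `fleet_run 'zpool list -H -o health rpool'` or `fleet_run 'uname -r'`. The same function then serves every check in this section.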

Centralized monitoring (Prometheus + Grafana)

# Prometheus file-based service discovery targets for the fleet
# (reference this file from a file_sd_configs entry in /etc/prometheus/prometheus.yml)
cat > /etc/prometheus/fleet-targets.yml << 'EOF'
- targets:
  - '10.200.0.1:9100'
  - '10.200.0.2:9100'
  - '10.200.0.3:9100'
  - '10.200.0.10:9100'
  - '10.200.0.11:9100'
  - '10.200.0.20:9100'
  labels:
    job: 'node'

- targets:
  - '10.200.0.1:9134'
  - '10.200.0.2:9134'
  - '10.200.0.3:9134'
  - '10.200.0.10:9134'
  - '10.200.0.11:9134'
  - '10.200.0.20:9134'
  labels:
    job: 'zfs'
EOF

# Key Prometheus alerting rules for fleet
cat > /etc/prometheus/rules/fleet.yml << 'EOF'
groups:
- name: fleet
  rules:
  - alert: ZFSPoolUnhealthy
    expr: zfs_pool_health != 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "ZFS pool unhealthy on {{ $labels.instance }}"

  - alert: ZFSPoolCapacityHigh
    expr: zfs_pool_allocated_bytes / zfs_pool_size_bytes > 0.80
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Pool capacity >80% on {{ $labels.instance }}"

  - alert: ZFSPoolCapacityCritical
    expr: zfs_pool_allocated_bytes / zfs_pool_size_bytes > 0.90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Pool capacity >90% on {{ $labels.instance }}"

  - alert: NodeDown
    expr: up == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.instance }} is unreachable"
EOF

# Reload Prometheus (the endpoint only works when Prometheus runs with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
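
Hand-maintained target lists drift as the fleet grows. As a sketch, the targets file can be generated from the same address list the health checks use. `sd_targets` is a hypothetical helper; ports 9100 and 9134 are the ones this page already uses for the node and ZFS exporters.

```shell
# sd_targets JOB PORT IP...
# Emit one file_sd target group for JOB, listing IP:PORT for each address.
sd_targets() {
  local job=$1 port=$2 ip
  shift 2
  printf '%s\n' '- targets:'
  for ip in "$@"; do
    printf "  - '%s:%s'\n" "$ip" "$port"
  done
  printf "  labels:\n    job: '%s'\n\n" "$job"
}
```

Usage, assuming `FLEET` is a space-separated address list: `{ sd_targets node 9100 $FLEET; sd_targets zfs 9134 $FLEET; } > /etc/prometheus/fleet-targets.yml`.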

ZFS replication for fleet backups

# Backup server: receive from all fleet nodes
# Create per-host backup datasets
for host in prod-web-01 prod-web-02 prod-db-01; do
  zfs create -p tank/backup/$host
done

# Run from the backup server: pull from each fleet node over WireGuard
# (pull mode keeps backup credentials off the production hosts)
for ip in 10.200.0.10 10.200.0.11 10.200.0.20; do
  host=$(ssh admin@$ip hostname -s)
  echo "=== Replicating $host ($ip) ==="
  syncoid -r admin@$ip:rpool tank/backup/$host
done

# Check replication lag per host
for dir in /tank/backup/*/; do
  host=$(basename "$dir")
  latest=$(zfs list -t snapshot -H -o creation -S creation -r "tank/backup/$host" | head -1)
  echo "$host: last snapshot $latest"
done
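
The lag check above prints a date and leaves the judgment to the reader. A small sketch that turns two timestamps into a verdict, using the 2-hour warning and 6-hour critical thresholds this manual applies to replication alerts. `lag_status` is a hypothetical helper; adjust the thresholds to your own RPO.

```shell
# lag_status NOW LATEST
# Both arguments are Unix timestamps in seconds. LATEST can come from:
#   zfs list -Hp -t snapshot -o creation -S creation -r tank/backup/HOST | head -1
# (-p makes zfs print the creation time as epoch seconds)
lag_status() {
  local lag=$(( $1 - $2 ))
  if   [ "$lag" -gt 21600 ]; then echo "CRITICAL ${lag}s"
  elif [ "$lag" -gt 7200 ];  then echo "WARNING ${lag}s"
  else                            echo "OK ${lag}s"
  fi
}
```

Usage: `lag_status "$(date +%s)" "$latest_epoch"` inside the per-host loop.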

Rolling upgrades

# Rolling upgrade: one host at a time, verify before proceeding
FLEET_NODES="10.200.0.10 10.200.0.11 10.200.0.20"

for ip in $FLEET_NODES; do
  host=$(ssh admin@$ip hostname -s)
  echo "════════════════════════════════════════"
  echo "  Upgrading $host ($ip)"
  echo "════════════════════════════════════════"

  # Pre-upgrade snapshot
  ssh admin@$ip "sudo zfs snapshot -r rpool@pre-upgrade-$(date +%Y%m%d)"

  # Run upgrade
  ssh admin@$ip "sudo kupgrade"

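  # Reboot if needed. needs-restarting -r (RHEL family) exits 1 when a reboot
  # IS required; the Debian family creates /var/run/reboot-required instead.
  if ssh admin@$ip "if command -v needs-restarting >/dev/null 2>&1; then ! sudo needs-restarting -r >/dev/null; else [ -f /var/run/reboot-required ]; fi"; then
    ssh admin@$ip "sudo reboot" || true   # the SSH session drops as the host goes down
    sleep 60
  fi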

  # Verify post-upgrade
  ssh admin@$ip "zpool status rpool | head -3; uname -r; systemctl --failed --no-pager"

  # Pause between nodes — operator reviews before continuing
  echo "Press Enter to continue to next node (or Ctrl+C to abort)..."
  read
done
These are bash for-loops over SSH. That is deliberate. You do not need Ansible, Puppet, Chef, or Salt to manage a kldload fleet. SSH is the transport. ZFS is the state. WireGuard is the encryption. A for-loop is the orchestrator. If your fleet is 16 nodes, a for-loop is the right tool. If it is 1,600 nodes, you probably want syncoid for replication and a proper orchestrator for config — but the commands are the same. They just run from a different trigger.
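
A fixed `sleep 60` after the reboot is a guess: some hosts need less, some need far more. A sketch of a polling alternative, assuming the same `admin` account; `wait_for_host` is a hypothetical helper.

```shell
# wait_for_host IP [TIMEOUT_SECONDS]
# Poll SSH every 5 seconds until the host answers or the timeout expires.
# Returns 0 once SSH works, 1 on timeout.
wait_for_host() {
  local ip=$1 timeout=${2:-300} waited=0
  until ssh -o ConnectTimeout=5 "admin@$ip" true 2>/dev/null; do
    waited=$((waited + 5))
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 5
  done
}
```

Keep a short pause between issuing the reboot and calling `wait_for_host`, because a host that has not yet gone down will still answer SSH. On timeout, stop the rolling upgrade rather than moving on to the next node.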

Performance Tuning

The first rule of performance tuning: measure before you tune. The second rule: most systems do not need tuning. kldload's defaults are correct for 90% of workloads. Tuning is for the remaining 10% — databases, high-throughput media servers, and VM hypervisors with hundreds of zvols. If you cannot measure the problem, do not tune for it.

Measure first

# Baseline I/O performance (-l adds the latency columns)
zpool iostat -l rpool 5
# Look at: operations/sec, bandwidth, latency

# ARC efficiency (the most important metric)
arc_summary
# Or manually:
awk '/^hits/ {h=$3} /^misses/ {m=$3} END {printf "Hit rate: %.1f%%\n", h*100/(h+m)}' \
  /proc/spl/kstat/zfs/arcstats

# Disk latency histogram (eBPF)
biolatency -m 10
# If most I/O is <1ms on NVMe, don't tune. If >10ms, investigate.

# ZFS operation latency
zfsslower 1
# Shows operations slower than 1ms — normal for HDDs, investigate on NVMe

# CPU profile (sample at 99 Hz for 10 seconds, count samples per process)
bpftrace -e 'profile:hz:99 { @[comm] = count(); } interval:s:10 { exit(); }'

# Memory pressure
vmstat 2 10

Recordsize tuning by workload

# Default: 128K — good for general purpose, large files, media
zfs get recordsize rpool/srv/data

# Database workloads: match the database page size
# PostgreSQL (8K pages)
zfs set recordsize=8k rpool/srv/db/postgres

# MySQL/MariaDB (16K pages)
zfs set recordsize=16k rpool/srv/db/mysql

# MongoDB (WiredTiger uses variable block sizes)
zfs set recordsize=32k rpool/srv/db/mongo

# VM zvols (match guest filesystem block size, usually 4K-8K)
# NOTE: volblocksize can only be set at creation time
# To change it, you must destroy and recreate the zvol
zfs create -V 40G -o volblocksize=8k rpool/vms/webserver   # 40G is an example size

# Large file storage (video, backups, archives): 1M for throughput
zfs set recordsize=1M rpool/srv/media

# Small files (source code, node_modules): 16K-32K
zfs set recordsize=32k rpool/srv/code

Write performance tuning

# sync=standard (default): safe. Synchronous writes (fsync, O_SYNC) are committed to the ZIL before they return
zfs get sync rpool/srv/data

# sync=disabled — DANGEROUS: data loss possible on power failure
# Only for truly ephemeral data (build caches, temp directories)
zfs set sync=disabled rpool/srv/tmp

# SLOG (separate intent log) — accelerate sync writes
# Use a fast NVMe device as SLOG for database workloads
zpool add rpool log mirror /dev/nvme2n1 /dev/nvme3n1

# L2ARC (level 2 ARC) — extend read cache to SSD
# Only useful when ARC hit rate is low AND you have spare SSD
zpool add rpool cache /dev/ssd-cache

# Check if L2ARC is helping
awk '/^l2_hits/ {h=$3} /^l2_misses/ {m=$3} END {printf "L2ARC hit rate: %.1f%%\n", h*100/(h+m)}' \
  /proc/spl/kstat/zfs/arcstats

# Special allocation (metadata on fast device)
zpool add rpool special mirror /dev/nvme-fast-1 /dev/nvme-fast-2
# Metadata and small blocks go to fast device, data to bulk storage
zfs set special_small_blocks=64k rpool/srv/data

Compression tuning

# Default: lz4 — fast, good compression, nearly zero CPU
# Use lz4 for everything unless you have a reason not to
zfs get compression rpool

# zstd — better compression, slightly more CPU
# Good for archives, logs, cold data
zfs set compression=zstd rpool/srv/archive

# zstd-fast — speed of lz4, compression between lz4 and zstd
zfs set compression=zstd-fast rpool/srv/logs

# Check actual compression ratios per dataset
zfs list -o name,compressratio,used,logicalused -S compressratio

# Off — only for pre-compressed data (already-compressed media, encrypted blobs)
zfs set compression=off rpool/srv/encrypted-backups
I have seen operators spend a week tuning recordsize and compression for a 1% improvement, while their ARC hit rate was 60% because they gave ZFS 1GB of RAM on a 64GB host. Check ARC first. If ARC hit rate is below 90%, give ZFS more RAM. That single change will do more than every other tuning knob combined.
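
A sketch of that "check ARC first" step as a reusable function. It reads the same arcstats file as the awk one-liner earlier in this section; `arc_hit_rate` is a hypothetical name, and the optional file argument exists only so the function can be pointed at a saved copy.

```shell
# arc_hit_rate [FILE]
# Print the ARC hit rate as a percentage with one decimal place.
# FILE defaults to the live kernel statistics.
arc_hit_rate() {
  awk '$1 == "hits"   { h = $3 }
       $1 == "misses" { m = $3 }
       END { if (h + m > 0) printf "%.1f\n", h * 100 / (h + m); else print "0.0" }' \
    "${1:-/proc/spl/kstat/zfs/arcstats}"
}
```

If the result stays below 90, raise the ARC ceiling. For example, `echo $((16 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max` sets it to 16 GiB at runtime (an illustrative value), and an `options zfs zfs_arc_max=...` line in /etc/modprobe.d/zfs.conf makes it persistent.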

Runbooks

A runbook is a step-by-step procedure that anyone on the team can follow. No guessing. No improvising. No "I think the command was something like...". Every step has a concrete command. Every step has a verification. The runbook is the safety net for the operator — when your brain is fog at 3 AM, the runbook remembers.

Runbook: Disk replacement

# ── PRE: Identify and prepare ──
# 1. Identify the failed disk
zpool status rpool
#    Look for: FAULTED, DEGRADED, checksum errors
#    Note the device path (e.g., /dev/sdB or /dev/disk/by-id/...)

# 2. Confirm pool redundancy (mirror or raidz)
zpool status rpool | grep -E "mirror|raidz"
#    If no redundancy: STOP — this is a data-loss scenario
#    Backup everything before proceeding

# 3. Take a pre-replacement snapshot
zfs snapshot -r rpool@pre-disk-replace-$(date +%Y%m%d)

# 4. Note the disk serial for physical identification
smartctl -a /dev/sdB | grep "Serial Number"
#    Or blink the disk LED:
ledctl locate=/dev/sdB

# ── REPLACE ──
# 5. Offline the disk (if it's still responding)
zpool offline rpool /dev/sdB

# 6. Physically swap the disk

# 7. Tell ZFS about the new disk
zpool replace rpool /dev/sdB /dev/sdB-new
#    Or if the new disk is in the same slot:
zpool replace rpool /dev/sdB

# ── VERIFY ──
# 8. Monitor resilver
watch -n 5 'zpool status rpool | grep -A 2 scan'

# 9. After resilver completes, verify
zpool status rpool
#    Expected: state: ONLINE, no errors, no DEGRADED

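# 10. Run a scrub to double-check, and wait for it to finish
zpool scrub rpool
zpool wait -t scrub rpool   # OpenZFS 2.0+; on older versions, poll zpool status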

# 11. Clear any error counters
zpool clear rpool

# 12. Verify final state
zpool status -v rpool
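
The final verification can be made mechanical, so nobody has to judge `zpool status` output by eye at 3 AM. A sketch that checks only the two lines the runbook cares about; `pool_is_healthy` is a hypothetical helper.

```shell
# pool_is_healthy
# Reads `zpool status POOL` output on stdin.
# Succeeds only if the state is ONLINE and no data errors are reported.
pool_is_healthy() {
  awk '$1 == "state:" { state = $2 }
       /errors: No known data errors/ { clean = 1 }
       END { exit !(state == "ONLINE" && clean) }'
}
```

Usage: `zpool status rpool | pool_is_healthy && echo "replacement verified" || echo "NOT healthy, do not close the ticket"`.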

Runbook: Pool expansion

# ── PRE ──
# 1. Document current state
zpool list rpool
zpool status rpool

# 2. Snapshot before expansion
zfs snapshot -r rpool@pre-expand-$(date +%Y%m%d)

# 3. Verify new disks are visible
lsblk
ls /dev/disk/by-id/ | grep -v part

# ── EXPAND ──
# 4. Add new vdev (MUST match existing redundancy level)
#    If pool uses mirrors, add a mirror:
zpool add rpool mirror /dev/disk/by-id/new-disk-1 /dev/disk/by-id/new-disk-2

#    If pool uses raidz1, add a raidz1:
zpool add rpool raidz1 /dev/disk/by-id/new-disk-1 /dev/disk/by-id/new-disk-2 /dev/disk/by-id/new-disk-3

# ── VERIFY ──
# 5. Confirm new capacity
zpool list rpool

# 6. Confirm new vdev is online
zpool status rpool

# 7. Run a scrub
zpool scrub rpool

Runbook: Host migration

# ── PRE ──
# 1. Install kldload on new host (same distro and profile)

# 2. Set up WireGuard on new host and verify connectivity
wg show wg0
ping -c 3 old-host-wg-ip

# 3. Initial full replication (may take hours for large datasets)
syncoid -r admin@old-host:rpool rpool

# ── MIGRATE ──
# 4. Stop services on old host
ssh admin@old-host "sudo systemctl stop myservice"

# 5. Final incremental sync (seconds, not hours)
syncoid -r admin@old-host:rpool rpool

# 6. Update DNS / load balancer to point to new host

# 7. Start services on new host
systemctl start myservice

# ── VERIFY ──
# 8. Verify services are running
systemctl status myservice
curl -s http://localhost:8080/health

# 9. Monitor for 24 hours before decommissioning old host

# ── ROLLBACK (if needed) ──
# 10. Update DNS back to old host
# 11. Start services on old host
ssh admin@old-host "sudo systemctl start myservice"

Runbook: Network reconfiguration

# ── PRE ──
# 1. Document current config
nmcli connection show "Wired connection 1" | grep ipv4
ip addr show
ip route show
cat /etc/wireguard/wg0.conf

# 2. Ensure you have out-of-band access (IPMI, console, physical)

# ── CHANGE ──
# 3. Apply new IP
nmcli connection modify "Wired connection 1" \
  ipv4.addresses 10.100.20.50/24 \
  ipv4.gateway 10.100.20.1
nmcli connection up "Wired connection 1"

# 4. Update WireGuard endpoint if needed
wg set wg0 peer PEER_PUBKEY endpoint new-endpoint:51820
wg-quick save wg0   # writes /etc/wireguard/wg0.conf and keeps Address and other wg-quick fields, which wg showconf drops

# 5. Update DNS records
# 6. Update monitoring targets
# 7. Update firewall rules if subnets changed

# ── VERIFY ──
# 8. Test connectivity
ping -c 3 10.100.20.1
ping -c 3 10.200.0.1   # WireGuard hub
wg show wg0

Runbook: Ransomware recovery

# ═══════════════════════════════════════════
# RANSOMWARE RECOVERY — ZFS is your shield
# ═══════════════════════════════════════════

# ── ISOLATE ──
# 1. Disconnect the host from the network IMMEDIATELY
#    Physical: unplug the cable
#    Remote: drop all traffic except your management IP.
#    Load the ruleset as one atomic transaction so the drop policy
#    never takes effect before the accept rules exist:
nft -f - << 'EOF'
flush ruleset
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ip saddr YOUR_IP accept
  }
  chain output {
    type filter hook output priority 0; policy drop;
    ip daddr YOUR_IP accept
  }
}
EOF

# ── ASSESS ──
# 2. Check ZFS snapshots — ransomware cannot delete ZFS snapshots
#    unless the attacker has root AND knows ZFS
zfs list -t snapshot -o name,creation -S creation | head -20

# 3. Find the last clean snapshot (before encryption started)
#    Compare file contents in snapshots vs current
ls /srv/data/.zfs/snapshot/autosnap_2026-04-03_hourly/
#    If files look normal (not encrypted), this snapshot is clean

# ── RECOVER ──
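# 4. Roll back to clean snapshot
#    -r is required when newer snapshots exist. It DESTROYS every snapshot
#    taken after the target, so confirm the target first.
zfs rollback -r rpool/srv/data@autosnap_2026-04-03_hourly

# 5. For multiple datasets:
for ds in rpool/srv/data rpool/srv/logs rpool/home; do
  clean_snap=$(zfs list -t snapshot -H -o name -s creation "$ds" | grep "2026-04-03" | tail -1)
  [ -n "$clean_snap" ] || { echo "No snapshot found for $ds, skipping"; continue; }
  echo "Rolling back $ds to $clean_snap"
  zfs rollback -r "$clean_snap"
done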

# ── HARDEN ──
# 6. Before reconnecting to the network:
#    - Change all passwords
#    - Rotate all SSH keys
#    - Check for persistence (crontabs, systemd units, authorized_keys)
crontab -l
ls /etc/cron.d/
systemctl list-unit-files --state=enabled | grep -v default
cat /root/.ssh/authorized_keys
for user in $(awk -F: '$3 >= 1000 {print $1}' /etc/passwd); do
  echo "=== $user ===" && cat /home/$user/.ssh/authorized_keys 2>/dev/null
done

# 7. If in doubt: rebuild from golden image + restore data from backup
ZFS snapshots are the best ransomware defense that exists. Ransomware encrypts files. ZFS snapshots are read-only: they cannot be modified, and they can be deleted only by someone with root access running explicit zfs destroy commands. If your Sanoid retention keeps 30 days of snapshots, you have a 30-day window to detect and recover from ransomware. The recovery is a single zfs rollback command. No paying ransoms. No FBI. No "thoughts and prayers" press release. Just roll back and harden.
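
Finding the last clean snapshot is the step most likely to go wrong under pressure. A sketch that selects it by timestamp rather than by eyeballing names, assuming you have established roughly when encryption started. `last_snap_before` is a hypothetical helper.

```shell
# last_snap_before CUTOFF
# Reads "name<TAB>creation_epoch" lines on stdin, for example from:
#   zfs list -Hp -t snapshot -o name,creation -s creation rpool/srv/data
# Prints the newest snapshot created strictly before CUTOFF (epoch seconds),
# or nothing if no snapshot is old enough.
last_snap_before() {
  awk -F '\t' -v cutoff="$1" '
    $2 < cutoff && $2 >= best { best = $2; name = $1 }
    END { if (name != "") print name }'
}
```

Usage, with an illustrative cutoff and GNU date: `cutoff=$(date -d '2026-04-03 14:00' +%s)`, then pipe the `zfs list` output shown in the comment into `last_snap_before "$cutoff"`. Inspect the snapshot under `.zfs/snapshot/` before rolling back to it.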

On-Call

On-call with kldload infrastructure is fundamentally different from on-call with traditional infrastructure. The 2-second rollback changes everything. When your response to most incidents is "roll back the snapshot and investigate tomorrow," on-call becomes a lot less painful. Here is how to set it up so the pager wakes you as rarely as possible.

What to monitor

Pool health (CRITICAL)

If zpool status shows anything other than ONLINE, you need to know immediately. A DEGRADED pool means you have lost redundancy. A FAULTED pool means you have lost data access. This is the single most important alert.

Prometheus: zfs_pool_health != 0. Alert immediately. No silence. No snooze.

Pool capacity (WARNING at 70%, CRITICAL at 85%)

A pool at 80% is slow. A pool at 90% is dangerously slow. A pool at 95% may refuse writes. Alert at 70% so you have time to plan expansion. Alert at 85% so you act before performance degrades. Never let a production pool exceed 80%.

zfs_pool_allocated_bytes / zfs_pool_size_bytes > 0.70 = time to order disks.

Scrub errors (CRITICAL)

If a scrub finds checksum errors, you have data corruption. On a mirrored pool, ZFS auto-repairs. On a single-disk pool, you have lost blocks. Either way, you need to know.

zpool status | grep errors. If it's not "No known data errors," drop everything.

Replication lag (WARNING at 2h, CRITICAL at 6h)

If Syncoid hasn't run in 6 hours, your RPO has blown out. The backup server is behind. If the primary dies now, you lose 6 hours of data. Acceptable in dev. Not in production.

Compare latest snapshot timestamps between primary and backup. The gap is your RPO.

Failed systemd units (WARNING)

A failed service is not always urgent, but it is always information. A failed sanoid.timer means snapshots stopped. A failed syncoid.timer means replication stopped. A failed wg-quick@wg0 means the backplane is down.

systemctl --failed --no-legend | wc -l greater than 0 = something broke. Check what.

Node unreachable (CRITICAL after 3m)

If Prometheus cannot scrape a node for 3 minutes, the node is down or the network is broken. Either way, you need to know. Use WireGuard IPs for monitoring so you are testing the backplane path, not just the public network.

up == 0 for 3m. Page. If it's a network blip, the alert auto-resolves.

Alert thresholds

# Prometheus alerting rules for on-call
cat > /etc/prometheus/rules/oncall.yml << 'EOF'
groups:
- name: oncall
  rules:

  # ── CRITICAL: Wake up ──
  - alert: ZFSPoolNotOnline
    expr: zfs_pool_health != 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "POOL DEGRADED on {{ $labels.instance }}"
      runbook: "https://wiki.internal/runbooks/pool-degraded"

  - alert: HostDown
    expr: up == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "HOST DOWN: {{ $labels.instance }}"

  - alert: PoolCapacityCritical
    expr: zfs_pool_allocated_bytes / zfs_pool_size_bytes > 0.85
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Pool >85% on {{ $labels.instance }}"

  # ── WARNING: Fix during business hours ──
  - alert: PoolCapacityWarning
    expr: zfs_pool_allocated_bytes / zfs_pool_size_bytes > 0.70
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Pool >70% on {{ $labels.instance }} — plan expansion"

  - alert: ReplicationLagHigh
    expr: time() - zfs_dataset_snapshot_latest_timestamp > 7200
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Replication >2h behind on {{ $labels.instance }}"

  - alert: ReplicationLagCritical
    expr: time() - zfs_dataset_snapshot_latest_timestamp > 21600
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: "Replication >6h behind on {{ $labels.instance }}"

  - alert: ScrubErrors
    expr: zfs_pool_scrub_errors > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "SCRUB ERRORS on {{ $labels.instance }}"

  - alert: FailedSystemdUnits
    expr: node_systemd_unit_state{state="failed"} > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Failed unit {{ $labels.name }} on {{ $labels.instance }}"

  - alert: ARCHitRateLow
    expr: rate(zfs_arc_hits_total[5m]) / (rate(zfs_arc_hits_total[5m]) + rate(zfs_arc_misses_total[5m])) < 0.80
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "ARC hit rate <80% on {{ $labels.instance }} — consider increasing ARC max"
EOF

Escalation procedures

# ── LEVEL 1: Automated response (no human needed) ──
# - Service crashed → systemd auto-restarts (Restart=always)
# - WireGuard tunnel flap → PersistentKeepalive reconnects
# - ARC pressure → ZFS automatically shrinks ARC

# ── LEVEL 2: On-call operator (1 person, 15 min response) ──
# - Host unreachable → check IPMI, physical access if needed
# - Pool degraded → schedule disk replacement (business hours)
# - Service won't restart → rollback snapshot, restart
# - Bad deploy → kbe activate previous-be && reboot

# ── LEVEL 3: Senior operator (2 people, 30 min response) ──
# - Pool faulted → DR runbook, restore from backup
# - Data corruption → forensic snapshot, assess scope
# - Security incident → isolate, assess, recover
# - Multiple hosts down → network-level investigation

# ── LEVEL 4: All hands (entire team, immediate) ──
# - Ransomware → isolate all hosts, ransomware recovery runbook
# - Total fleet outage → DR site activation
# - Data breach → legal + security + ops coordination
The best on-call rotation is the one where you never get paged. That is not a joke — it is the goal. If your alerts are tuned correctly and your automation handles Level 1 issues, the on-call operator should be paged fewer than twice a month. If you are getting paged more than that, your alerts are too noisy or your automation is incomplete. Fix the system, not the symptoms.

SLOs & Error Budgets

An SLO (Service Level Objective) is a promise you make about your infrastructure's behavior. An error budget is how much failure that promise allows. Together, they transform operations from "fight every fire" to "spend the error budget wisely." kldload infrastructure has natural SLOs built into the stack — ZFS pool health, ARC hit rate, replication lag. Measure them. Set targets. Alert on burn rate, not on individual violations.

Defining SLOs for kldload infrastructure

Pool availability: 99.99%

The ZFS pool must be ONLINE and accepting I/O 99.99% of the time. That allows 52.6 minutes of downtime per year. On a mirrored pool, this target is trivially achievable — disk failures are handled transparently. The only threat is double-disk failure or controller failure.

52 minutes of downtime per year. A disk replacement with resilver takes 20 minutes. Budget allows 2-3 failures.

ARC hit rate: >90%

If the ARC hit rate drops below 90% for a sustained period, the workload is thrashing disk. This is either a sizing issue (give ARC more RAM) or a workload change (random access pattern that exceeds cache capacity). Either way, it needs attention.

90% ARC = 90% of reads served from RAM. Below that, you are paying the disk tax.

Replication lag: <1 hour

Syncoid should complete within each hourly window. If replication falls behind, your RPO (Recovery Point Objective) is degrading. The business thinks they can lose 1 hour of data. If replication is 6 hours behind, they can lose 7.

RPO = how much data you can afford to lose. Replication lag = how much you will actually lose.

Snapshot retention: 100% compliance

Sanoid must run every scheduled interval. Missing a snapshot window means your point-in-time recovery granularity is degraded. If you promised hourly snapshots and Sanoid missed 3 hours, your recovery window just went from 1 hour to 4.

Sanoid.timer must fire every time. If it doesn't, your retention policy is a lie.

Scrub completion: weekly, zero errors

Every pool should be scrubbed weekly. Every scrub should complete with zero errors. A skipped scrub means undetected bit rot. A scrub with errors means active corruption. Both degrade your data integrity SLO.

Scrubs are your checksum audit. Skipping them is like skipping financial audits.

Recovery time: <5 minutes

From "something is broken" to "service is restored" should take under 5 minutes. ZFS rollback takes 2 seconds. Boot environment switch takes 30 seconds + reboot. Clone-and-replace takes 5 minutes. If your MTTR exceeds this, your tooling or runbooks need improvement.

MTTR is the metric that matters. Not MTBF. Things will break. How fast can you fix them?
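
The downtime figures behind these targets are plain arithmetic, and it helps to be able to recompute them when someone proposes a different SLO. A sketch; `downtime_budget_minutes` is a hypothetical helper.

```shell
# downtime_budget_minutes SLO_PERCENT DAYS
# Allowed downtime in minutes for an availability SLO over a period of DAYS.
downtime_budget_minutes() {
  awk -v slo="$1" -v days="$2" \
    'BEGIN { printf "%.2f\n", (100 - slo) / 100 * days * 1440 }'
}
```

`downtime_budget_minutes 99.99 365.25` gives the yearly figure quoted for pool availability, and `downtime_budget_minutes 99.99 30` gives the monthly budget.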

Measuring with Prometheus

# SLO: Pool availability 99.99%
# Recording rule: fraction of the last 30 days the pool was healthy
# (health 0 = ONLINE; any other value counts as unavailable)
- record: slo:zfs_pool_availability:ratio_30d
  expr: 1 - avg_over_time((zfs_pool_health != bool 0)[30d:1m])

# SLO: ARC hit rate >90%
- record: slo:zfs_arc_hit_rate:ratio_5m
  expr: rate(zfs_arc_hits_total[5m]) / (rate(zfs_arc_hits_total[5m]) + rate(zfs_arc_misses_total[5m]))

# SLO: Replication lag <1h
- record: slo:syncoid_lag_seconds
  expr: time() - zfs_dataset_snapshot_latest_timestamp

# Error budget: how much downtime remains this month
# 0.01% of 30 days (2,592,000 seconds) = 259.2 seconds = 4.32 minutes of allowed downtime
- record: slo:pool_error_budget_remaining
  expr: (1 - 0.9999) * 30 * 86400 - sum_over_time((zfs_pool_health != bool 0)[30d:1m]) * 60

Alerting on burn rate

# Burn rate alerting: alert when you are consuming your error budget
# too fast, not when a single violation occurs.

# If the error budget is being consumed 14x faster than allowed
# (1-hour window), you'll exhaust it in 2 days → page immediately
- alert: PoolAvailabilityBudgetBurn
  expr: |
    (
      sum_over_time((zfs_pool_health != bool 0)[1h:1m]) / 60
    ) / (1 - 0.9999) > 14
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Pool availability error budget burning 14x on {{ $labels.instance }}"

# If consumed 3x faster than allowed (6-hour window),
# you'll exhaust it in 10 days → warning
- alert: PoolAvailabilityBudgetBurnSlow
  expr: |
    (
      sum_over_time((zfs_pool_health != bool 0)[6h:1m]) / 360
    ) / (1 - 0.9999) > 3
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Pool availability error budget burning 3x on {{ $labels.instance }}"

# Replication SLO burn rate
- alert: ReplicationBudgetBurn
  expr: |
    (
      sum_over_time((slo:syncoid_lag_seconds > bool 3600)[3h:5m]) / 36
    ) / (1 - 0.999) > 10
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Replication SLO budget burning on {{ $labels.instance }}"
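
The burn-rate thresholds in these rules can be sanity-checked by hand. A sketch of the two calculations involved, with hypothetical helper names: burn rate is the observed error ratio divided by the allowed error ratio, and time to exhaustion is the budget period divided by the burn rate.

```shell
# burn_rate BAD_MINUTES WINDOW_MINUTES SLO_PERCENT
# Observed error ratio divided by the error ratio the SLO allows.
burn_rate() {
  awk -v bad="$1" -v win="$2" -v slo="$3" \
    'BEGIN { printf "%.1f\n", (bad / win) / ((100 - slo) / 100) }'
}

# days_to_exhaustion BURN_RATE [PERIOD_DAYS]
# Days until the budget is gone if the burn rate holds (default period: 30 days).
days_to_exhaustion() {
  awk -v burn="$1" -v days="${2:-30}" 'BEGIN { printf "%.1f\n", days / burn }'
}
```

`days_to_exhaustion 14` and `days_to_exhaustion 3` reproduce the "2 days" and "10 days" figures in the comments above. Note what `burn_rate 1 60 99.99` shows: at a 99.99% target, a single unhealthy minute in an hour is already far above the 14x threshold, so the fast-burn alert is sensitive to any sustained degradation.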

The error budget as a decision tool

# Check remaining error budget for this month
# (run this in Grafana or ad-hoc with promtool)

# Pool availability budget (99.99% = 4.32 min/month allowed downtime)
# If budget remaining > 50%: safe to push changes, do maintenance
# If budget remaining 10-50%: proceed with caution, extra testing
# If budget remaining < 10%: freeze changes, focus on reliability

# Practical example:
# It's April 15th. Pool was degraded for 2 minutes on April 3rd.
# Monthly budget: 4.32 minutes. Consumed: 2 minutes. Remaining: 2.32 minutes.
# Budget is 54% remaining → safe to proceed with kernel upgrade.

# Another example:
# Pool was degraded for 3.5 minutes already this month.
# Monthly budget: 4.32 minutes. Remaining: 0.82 minutes.
# Budget is 19% remaining → proceed with caution: extra testing, and delay non-critical changes until next month.
Error budgets are the SRE's secret weapon. They turn subjective arguments ("Is this change safe?") into objective decisions ("Do we have the budget for this risk?"). When the budget is healthy, you ship fast. When it is exhausted, you lock down and fix reliability. No arguments. No blame. Just math. This is how Google, Netflix, and every serious SRE team operates. kldload gives you the metrics. Prometheus gives you the math. The error budget gives you the decision framework.
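
The decision bands above are simple enough to script. A sketch that takes the monthly budget and the minutes already consumed and prints the remaining percentage with the matching band. `budget_decision` is a hypothetical helper, and the 50% and 10% cut-offs are the ones listed above.

```shell
# budget_decision BUDGET_MINUTES CONSUMED_MINUTES
# Print the remaining error budget as a percentage and the matching policy band.
budget_decision() {
  awk -v budget="$1" -v used="$2" 'BEGIN {
    pct = (budget - used) / budget * 100
    if      (pct > 50)  band = "safe to push changes"
    else if (pct >= 10) band = "proceed with caution"
    else                band = "freeze changes"
    printf "%.0f%% remaining: %s\n", pct, band
  }'
}
```

`budget_decision 4.32 2` and `budget_decision 4.32 3.5` correspond to the two worked examples above.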

Quick Reference — The Commands You Will Use Every Day

This is the cheat sheet. The commands that become muscle memory. If you remember nothing else from this page, remember these.

# ── Health ──
kst                                    # system overview
zpool status rpool                     # pool health
zpool list                             # pool capacity
systemctl --failed                     # broken services
wg show                                # WireGuard tunnels
journalctl -b --priority=err           # errors since boot

# ── Snapshots ──
ksnap                                  # snapshot key datasets
zfs snapshot -r rpool@$(date +%Y%m%d)  # snapshot everything
zfs list -t snapshot -S creation       # list snapshots (newest first)
zfs rollback rpool/srv/data@snap       # instant rollback

# ── Boot environments ──
kbe create before-change               # save current state
kbe list                               # show all BEs
kbe activate before-change             # switch to a BE

# ── Upgrades ──
kupgrade                               # safe upgrade (snapshot first)
kpkg install nginx                     # install with auto-snapshot
kpkg upgrade                           # upgrade all with snapshot

# ── Replication ──
syncoid -r rpool backup:tank/backup    # replicate to remote
systemctl status syncoid.timer         # check replication timer

# ── Recovery ──
krecovery import rpool                 # from kldload ISO
krecovery list-be                      # show available BEs
krecovery activate snap-name           # boot to a previous state

# ── Diagnosis ──
biotop                                 # disk I/O by process
execsnoop                              # process launches
tcpconnect                             # network connections
zfsslower 1                            # slow ZFS operations

The operator's pledge: I will snapshot before every change. I will test my backups monthly. I will monitor my error budgets. I will automate every task I perform more than twice. I will not debug at 3 AM; I will roll back at 3 AM and debug at 10 AM. I will treat my infrastructure as cattle, not pets. I will measure before I tune. I will keep my runbooks current. I will sleep well because my systems are reliable, my backups are tested, and my rollback is two seconds away.

This page is the one you bookmark. Everything here is copy-pasteable. If you want to understand why these commands work, read the ZFS Zero to Hero tutorial. If you want to understand the architecture underneath, read How Things Work. This page assumes you already know. It is the operator's manual for people who operate.