Operations Guide Masterclass
The daily ops bible for kldload in production. What to check every morning. What to run every week. What to review every month. What to test every quarter. Copy-paste commands throughout — no theory, just the playbook.
SRE gives you principles. This guide gives you the playbook. The Blue/Green & SRE Masterclass covers philosophy: error budgets, toil reduction, the reliability engineering mindset. This guide covers execution: what you actually do, in what order, on what schedule. On OpenZFS, most of these operations are one command. Knowing which command to run when is the difference between a system that runs for years and one that fails at the worst possible time.
Who this is for: Anyone operating kldload in production. A solo homelabber with three nodes. A small team running twenty. A platform team managing a fleet. The cadence scales with the environment — the commands are the same. Daily → weekly → monthly → quarterly → annually. Follow the cadence. That’s the whole job.
2. The Morning Check — Daily Ops (5 Minutes)
Five minutes every morning. Same commands, same order. If everything is green, your infrastructure survived the night and you can get on with your day. If anything is yellow or red, you have the rest of the day to fix it before users notice. That’s the deal.
The one-command health check
# kst — kldload status: pool status, snapshot freshness, service health, WireGuard
kst
kst runs the canonical health check and formats the output for fast scanning. Green means go. Anything else means dig in. The sections below show what to look for in each check and how to respond.
What to look for
| Check | Green | Red — Do This |
|---|---|---|
| Pool status | ONLINE | DEGRADED/FAULTED → section 12 (disk failed) |
| Snapshot freshness | Latest < 1h old | Stale → check sanoid: systemctl status sanoid |
| Failed services | 0 failed units | Any failed → systemctl status <unit> |
| WireGuard peers | All handshakes < 3min | Stale handshake → wg show, check peer reachability |
| Overnight errors | 0 err/crit/alert | Any errors → investigate before they compound |
| Disk space | All pools < 70% used | >80% → prune snapshots or expand pool immediately |
The morning check script
#!/bin/bash
# /usr/local/bin/morning-check
# Run every morning. Should complete in under 30 seconds.
set -euo pipefail
BOLD='\033[1m'; GREEN='\033[0;32m'; RED='\033[0;31m'; YELLOW='\033[0;33m'; NC='\033[0m'
ERRORS=0
section() { echo -e "\n${BOLD}=== $1 ===${NC}"; }
ok() { echo -e " ${GREEN}OK${NC} $1"; }
warn() { echo -e " ${YELLOW}WARN${NC} $1"; ERRORS=$((ERRORS+1)); }
fail() { echo -e " ${RED}FAIL${NC} $1"; ERRORS=$((ERRORS+1)); }
section "Pool Status"
while IFS= read -r line; do
pool=$(echo "$line" | awk '{print $1}')
state=$(echo "$line" | awk '{print $2}')
if [[ "$state" == "ONLINE" ]]; then
ok "$pool: $state"
else
fail "$pool: $state"
fi
done < <(zpool list -H -o name,health)
section "Snapshot Freshness (last 5)"
zfs list -t snapshot -o name,creation -s creation 2>/dev/null | tail -5
section "Snapshot Age Check"
LATEST=$(zfs list -t snapshot -Hp -o creation -s creation 2>/dev/null | tail -1)
if [[ -n "$LATEST" ]]; then
# -p prints creation as epoch seconds, so no date parsing is needed
AGE=$(( $(date +%s) - LATEST ))
if [[ $AGE -lt 7200 ]]; then
ok "Latest snapshot is $((AGE/60)) minutes old"
elif [[ $AGE -lt 86400 ]]; then
warn "Latest snapshot is $((AGE/3600)) hours old — check sanoid"
else
fail "No snapshot in 24h — sanoid may be broken"
fi
fi
section "Failed Services"
FAILED=$(systemctl --failed --no-legend --no-pager 2>/dev/null | grep -c "failed" || true)
if [[ "$FAILED" -eq 0 ]]; then
ok "No failed systemd units"
else
fail "$FAILED failed unit(s):"
systemctl --failed --no-legend --no-pager
fi
section "WireGuard Peers"
if command -v wg &>/dev/null; then
# Feed the loop via process substitution: a pipeline would run it in a
# subshell and the ERRORS count from warn/fail would be lost
while read -r iface peer ts; do
if [[ -z "$ts" || "$ts" == "0" ]]; then
fail "$iface peer $peer: never connected"
else
AGE=$(( $(date +%s) - ts ))
if [[ $AGE -lt 180 ]]; then
ok "$iface peer ${peer:0:8}...: ${AGE}s ago"
elif [[ $AGE -lt 300 ]]; then
warn "$iface peer ${peer:0:8}...: ${AGE}s ago (slow)"
else
fail "$iface peer ${peer:0:8}...: ${AGE}s ago (stale)"
fi
fi
done < <(wg show all latest-handshakes 2>/dev/null)
else
warn "WireGuard not installed"
fi
section "Overnight Errors (since yesterday)"
COUNT=$(journalctl -p err --since yesterday --no-pager -q 2>/dev/null | wc -l)
if [[ "$COUNT" -eq 0 ]]; then
ok "No errors in journal since yesterday"
elif [[ "$COUNT" -lt 10 ]]; then
warn "$COUNT error(s) in journal — review:"
journalctl -p err --since yesterday --no-pager -q | head -10
else
fail "$COUNT errors in journal — review urgently:"
journalctl -p err --since yesterday --no-pager -q | tail -20
fi
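section "Disk Space"
# Apply the thresholds from the table: OK below 70%, WARN from 70%, FAIL from 80%
while read -r pool cap; do
pct=${cap%\%}
if [[ "$pct" -ge 80 ]]; then
fail "$pool: $cap used (prune snapshots or expand the pool)"
elif [[ "$pct" -ge 70 ]]; then
warn "$pool: $cap used"
else
ok "$pool: $cap used"
fi
done < <(zpool list -H -o name,capacity)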
echo ""
if [[ "$ERRORS" -eq 0 ]]; then
echo -e "${GREEN}${BOLD}Morning check: ALL GREEN. Infrastructure survived the night.${NC}"
else
echo -e "${RED}${BOLD}Morning check: $ERRORS issue(s) require attention.${NC}"
exit 1
fi
# Install and schedule
chmod +x /usr/local/bin/morning-check
# Run manually
morning-check
# Or alias it
alias mc='morning-check'
Pipe the output through less -R if your terminal doesn’t scroll.
3. Weekly Maintenance
Weekly tasks catch problems that don’t trigger alerts — silent corruption, missed snapshots, stale backups. Block 30 minutes on the calendar. The same 30 minutes every week. The habit is the system.
ZFS scrub: verify every block on every pool
A scrub reads every block on the pool and verifies the checksum. OpenZFS checksums detect corruption when data is read. The only way to find corruption on cold (rarely accessed) data is to scrub it. Without weekly scrubs, you might not discover corruption until you need to restore from backup — and the backup is also corrupt.
# Start a scrub on all pools
for pool in $(zpool list -H -o name); do
echo "Starting scrub on $pool..."
zpool scrub "$pool"
done
# Check scrub status (run again after scrub completes)
zpool status | grep -A4 "scan:"
# Stagger across pools for large fleets: pool A on Monday, pool B on Tuesday
# Monday: zpool scrub tank-a
# Tuesday: zpool scrub tank-b
# etc.
Monitoring scrub progress:
# Watch scrub progress live
watch -n 10 'zpool status | grep -A3 "scan:"'
# One-liner: show scrub state and ETA for all pools (the ETA is on the "to go" line)
zpool status | grep -E "pool:|scan:|to go"
What scrub errors mean:
| Error type | What it means | Action |
|---|---|---|
| 0 errors | All blocks intact | Nothing. Sleep well. |
| Checksum errors, repaired | Corruption detected and fixed from redundancy | Note the disk. Order a replacement. It’s starting to fail. |
| Checksum errors, unrepaired | Corruption that cannot be fixed (no redundancy or multiple failures) | Restore affected files from snapshot immediately. |
| Read errors | Disk is having trouble reading blocks | Replace disk before it fails completely. |
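The table above can be turned into a quick triage helper. This is a minimal sketch, assuming the scan line that zpool status prints after a completed scrub looks like "scrub repaired <bytes> in <duration> with <N> errors on <date>". The function name scrub_verdict is hypothetical, and treating any non-zero error count as unrepaired is an assumption to check against your own pools.

```shell
# scrub_verdict: classify one "scan:" line from `zpool status`.
# Prints CLEAN, REPAIRED, UNREPAIRED, or UNKNOWN (line not in the expected format).
scrub_verdict() {
    local line="$1" repaired errors
    local re='.*scrub repaired \([^ ]*\) in .* with \([0-9][0-9]*\) errors.*'
    # Extract the repaired amount and the error count; both stay empty on no match
    repaired=$(printf '%s\n' "$line" | sed -n "s/$re/\1/p")
    errors=$(printf '%s\n' "$line" | sed -n "s/$re/\2/p")
    if [ -z "$errors" ]; then
        echo "UNKNOWN"
    elif [ "$errors" -gt 0 ]; then
        echo "UNREPAIRED"
    elif [ "$repaired" != "0B" ] && [ "$repaired" != "0" ]; then
        echo "REPAIRED"
    else
        echo "CLEAN"
    fi
}
```

One way to use it: zpool status | grep "scan:" | while IFS= read -r l; do scrub_verdict "$l"; done. REPAIRED maps to "note the disk", UNREPAIRED to "restore affected files".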
Package updates
# ALWAYS snapshot before updating
zfs snapshot -r rpool/ROOT@pre-update-$(date +%Y%m%d)
# Check what would be updated (review before applying)
dnf check-update # CentOS / RHEL / Rocky / Fedora
apt list --upgradable # Debian / Ubuntu
# Apply security patches only
dnf upgrade --security # RPM-based
apt-get upgrade # Debian-based (all updates; use unattended-upgrades for security-only)
# If kernel was updated, create boot environment before rebooting
kbe create pre-kernel-update-$(date +%Y%m%d)
# Reboot and verify
systemctl reboot
# After reboot:
uname -r # confirm new kernel
kst # confirm everything is healthy
zpool status # confirm pool healthy after reboot
Backup verification: actually restore something
# Pick a recent snapshot and clone it to a temp dataset
SNAP=$(zfs list -t snapshot -H -o name -s creation | grep "^rpool/data" | tail -1)
echo "Verifying restore from: $SNAP"
# Clone the snapshot and mount the clone at /mnt/verify in one step
# (mount -t zfs only works for datasets with a legacy mountpoint)
zfs clone -o mountpoint=/mnt/verify "$SNAP" rpool/verify-$(date +%Y%m%d)
# Check the data
ls -la /mnt/verify/
# Spot-check: open files, verify sizes, check modification times
# Verify replication is current on DR host
ssh dr-host "zfs list -t snapshot -H -o name,creation -s creation | grep rpool | tail -5"
# Cleanup
umount /mnt/verify
zfs destroy rpool/verify-$(date +%Y%m%d)
WireGuard key age check
# Check how old your WireGuard keys are
for conf in /etc/wireguard/wg*.conf; do
iface=$(basename "$conf" .conf)
# Use the config file's mtime as a proxy for key age (no need to read the private key)
mtime=$(stat -c %y "$conf" | cut -d' ' -f1)
echo "$iface: config last modified $mtime (rotate if older than your policy)"
done
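To turn the modification time into a pass/fail answer, compare it against a rotation policy. The sketch below is illustrative only: key_age_status is a hypothetical helper, and the 90-day default mirrors the rotation threshold mentioned in the monthly security review. It takes epoch timestamps so the arithmetic can be checked without touching a real config.

```shell
# key_age_status: report whether a key (by file mtime) is past the rotation policy.
# Arguments: mtime as epoch seconds, current time as epoch seconds,
#            optional maximum age in days (default 90, an assumed policy).
key_age_status() {
    local mtime="$1" now="$2" max_days="${3:-90}"
    local age_days=$(( (now - mtime) / 86400 ))
    if [ "$age_days" -gt "$max_days" ]; then
        echo "ROTATE ${age_days}d"
    else
        echo "OK ${age_days}d"
    fi
}
```

With GNU stat, a call could look like: key_age_status "$(stat -c %Y /etc/wireguard/wg0.conf)" "$(date +%s)".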
Log review
# Scan the past week for patterns worth knowing about
journalctl --since "1 week ago" -p warning --no-pager | \
grep -v "audit\[" | \
awk '{$1=$2=$3=""; print}' | sort | uniq -c | sort -rn | head -30
# ZFS-specific events this week
journalctl --since "1 week ago" -t kernel --no-pager | grep -i "zfs\|zpool\|arc" | tail -50
# Authentication failures this week
journalctl --since "1 week ago" -t sshd --no-pager | grep -i "fail\|invalid\|refused" | wc -l
The weekly maintenance script
#!/bin/bash
# /usr/local/bin/weekly-maintenance
# Run once a week. Automate with: systemd timer or cron.
set -euo pipefail
LOG="/var/log/kldload/weekly-$(date +%Y%m%d).log"
mkdir -p "$(dirname "$LOG")"
exec > >(tee -a "$LOG") 2>&1
echo "=== Weekly Maintenance: $(date) ==="
echo "--- Snapshot before maintenance ---"
zfs snapshot -r rpool@weekly-maint-$(date +%Y%m%d)
echo "--- Starting scrubs ---"
for pool in $(zpool list -H -o name); do
zpool scrub "$pool"
echo "Scrub started on $pool"
done
echo "--- Checking for package updates ---"
dnf check-update --quiet || true
echo "--- Verifying latest snapshot is recent ---"
LATEST=$(zfs list -t snapshot -H -o name,creation -s creation | tail -1)
echo "Latest snapshot: $LATEST"
echo "--- Log summary: errors this week ---"
journalctl --since "1 week ago" -p err --no-pager -q | wc -l | xargs echo "Error count:"
echo "--- Weekly maintenance complete: $(date) ---"
# Schedule with systemd timer
cat > /etc/systemd/system/weekly-maintenance.service << 'EOF'
[Unit]
Description=kldload Weekly Maintenance
After=network.target
[Service]
Type=oneshot
ExecStart=/usr/local/bin/weekly-maintenance
EOF
cat > /etc/systemd/system/weekly-maintenance.timer << 'EOF'
[Unit]
Description=Run kldload weekly maintenance every Monday at 04:00
[Timer]
OnCalendar=Mon *-*-* 04:00:00
RandomizedDelaySec=1800
Persistent=true
[Install]
WantedBy=timers.target
EOF
systemctl daemon-reload # pick up the new unit files
systemctl enable --now weekly-maintenance.timer
4. Monthly Review
The monthly review is where you catch slow-burning problems. Gradual capacity growth. Declining ARC hit rates. Increasing scrub times. These don’t trigger alerts because they’re not sudden — they’re trends. The review catches them before they become incidents.
Capacity review
# Growth rate per pool: compare used space month-over-month
zfs list -H -o name,used,avail,usedsnap,usedds -t filesystem | \
column -t
# Which datasets are growing fastest? (sort by used size, descending)
zfs list -H -o name,used -t filesystem | sort -k2 -h -r | head -20
# Are any pools approaching 80%? (performance degrades past 80%)
zpool list -o name,size,alloc,free,capacity,health | \
awk 'NR==1 {print} NR>1 { cap=$5+0; if(cap>70) print "\033[0;33mWARN\033[0m " $0; else print " OK " $0 }'
# Procurement forecast: estimate months until 80% capacity
# Usage: zfs-capacity-forecast poolname
zfs-capacity-forecast() {
local pool="$1"
local used total
used=$(zpool list -Hp -o alloc "$pool") # -p prints exact bytes, whatever the display unit
total=$(zpool list -Hp -o size "$pool")
awk -v p="$pool" -v u="$used" -v t="$total" 'BEGIN { g = 1024 ^ 3; printf "Pool %s: %.1fG used of %.1fG (limit: %.1fG)\n", p, u/g, t/g, t*0.8/g }'
echo "Track growth manually: compare this output month over month"
echo "Rule: if doubling time < 6 months, start procurement now"
}
Performance review
# ARC statistics: hit rate, size, evictions
arc_summary 2>/dev/null || \
grep -E "^(hits|misses|c|size|mru_hits|mfu_hits) " /proc/spl/kstat/zfs/arcstats
# Calculate ARC hit ratio manually
awk '
/^hits/ { hits=$3 }
/^misses/ { misses=$3 }
END {
total=hits+misses
if(total>0) printf "ARC hit ratio: %.2f%% (%d hits, %d misses)\n", (hits/total)*100, hits, misses
}
' /proc/spl/kstat/zfs/arcstats
# I/O latency: check iostat for sustained high latency
zpool iostat -v 1 5 # 5 samples, 1 second apart
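# Scrub duration trend: zpool status keeps only the most recent scrub per pool,
# so compare this line against the SCRUB HISTORY section of earlier monthly reports
zpool status | grep "scrub repaired"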
# If ARC hit ratio is declining (e.g., was 95%, now 80%), your working set
# has grown beyond your RAM. Options: add RAM, or accept the performance hit.
Security review
# TLS certificate expiry check
for cert in /etc/ssl/certs/*.pem /etc/letsencrypt/live/*/cert.pem; do
[[ -f "$cert" ]] || continue
expiry=$(openssl x509 -noout -enddate -in "$cert" 2>/dev/null | cut -d= -f2)
days=$(( ($(date -d "$expiry" +%s) - $(date +%s)) / 86400 ))
if [[ $days -lt 30 ]]; then
echo "EXPIRING SOON ($days days): $cert"
elif [[ $days -lt 60 ]]; then
echo "Warning ($days days): $cert"
else
echo "OK ($days days): $cert"
fi
done
# Failed authentication attempts this month
journalctl --since "1 month ago" -t sshd --no-pager | \
grep -i "failed\|invalid user" | \
grep -oE "from [0-9a-fA-F:.]+" | awk '{print $2}' | sort | uniq -c | sort -rn | head -20
# WireGuard key ages (see weekly section for key age check)
# Flag any keys older than 90 days for rotation
# eBPF anomaly summary (if eBPF security tooling is deployed)
if command -v bpftool &>/dev/null; then
echo "Active eBPF programs:"
bpftool prog list | grep -cE "^[0-9]+:" | xargs echo " count:"
fi
Snapshot policy review
# How much space are snapshots consuming?
zfs list -H -o name,usedsnap -t filesystem | sort -k2 -h -r | head -20
# How many snapshots per dataset?
zfs list -H -t snapshot -o name | awk -F@ '{print $1}' | sort | uniq -c | sort -rn | head -20
# Oldest snapshot per dataset
zfs list -H -t snapshot -o name,creation -s creation | \
awk '{ ds=substr($1,1,index($1,"@")-1); if(!(ds in seen)){ seen[ds]=$0; print } }'
# Review sanoid configuration: is retention policy correct?
cat /etc/sanoid/sanoid.conf
# If snapshots are consuming too much space, tighten retention:
# [dataset]
# hourly = 24 # keep 24 hourly (was 168)
# daily = 30 # keep 30 daily (was 90)
# monthly = 6 # keep 6 monthly (was 12)
# yearly = 1 # keep 1 yearly
Replication review
# Check DR replication lag: compare snapshot timestamps on source vs DR
echo "=== Source snapshots ==="
zfs list -t snapshot -H -o name,creation -s creation | grep "auto-" | tail -5
echo "=== DR snapshots (SSH to DR host) ==="
ssh dr-host "zfs list -t snapshot -H -o name,creation -s creation | grep 'auto-' | tail -5"
# Check syncoid last run time
journalctl -u syncoid --no-pager --since "1 week ago" | tail -20
# Check replication bandwidth utilization
# (if syncoid is running, it shows transfer rate in output)
journalctl -u syncoid --no-pager -g "sent" | tail -10
Monthly review report script
#!/bin/bash
# /usr/local/bin/monthly-review
# Generates a monthly health report. Run on the 1st of each month.
REPORT="/var/log/kldload/monthly-report-$(date +%Y%m).txt"
mkdir -p "$(dirname "$REPORT")"
{
echo "kldload Monthly Review Report"
echo "Generated: $(date)"
echo "Hostname: $(hostname)"
echo ""
echo "=== POOL HEALTH ==="
zpool list
echo ""
echo "=== CAPACITY (top datasets by usage) ==="
zfs list -H -o name,used,avail,usedsnap -t filesystem | sort -k2 -h -r | head -15
echo ""
echo "=== SNAPSHOT SPACE USAGE ==="
zfs list -H -o name,usedsnap -t filesystem | \
awk '$2!="0" && $2!="0B" {print}' | sort -k2 -h -r | head -10
echo ""
echo "=== ARC PERFORMANCE ==="
awk '
/^hits/ { hits=$3 }
/^misses/ { misses=$3 }
/^size/ { size=$3 }
/^c / { target=$3 }
END {
total=hits+misses
if(total>0) {
printf "ARC hit ratio: %.2f%%\n", (hits/total)*100
printf "ARC size: %d MB / target %d MB\n", size/1024/1024, target/1024/1024
}
}
' /proc/spl/kstat/zfs/arcstats
echo ""
echo "=== ERROR SUMMARY (past 30 days) ==="
journalctl --since "30 days ago" -p err --no-pager -q | wc -l | xargs echo "Total errors:"
journalctl --since "30 days ago" -p err --no-pager -q | \
awk '{$1=$2=$3=$4=""; print}' | sort | uniq -c | sort -rn | head -10
echo ""
echo "=== CERTIFICATE EXPIRY ==="
for cert in /etc/ssl/certs/*.pem /etc/letsencrypt/live/*/cert.pem; do
[[ -f "$cert" ]] || continue
expiry=$(openssl x509 -noout -enddate -in "$cert" 2>/dev/null | cut -d= -f2)
days=$(( ($(date -d "$expiry" +%s) - $(date +%s)) / 86400 ))
echo "$days days: $cert"
done | sort -n
echo ""
echo "=== SCRUB HISTORY ==="
zpool status | grep "scrub repaired"
echo ""
echo "=== Report complete ==="
} | tee "$REPORT"
echo "Report saved to: $REPORT"
5. Quarterly Operations
Quarterly operations are the ones most teams skip — because they require dedicated time, not just a few minutes. They’re also the most important. A quarterly DR test is the difference between “we have replication” and “we have verified disaster recovery.”
DR test: clone production, boot it, verify, destroy
This is the most valuable operation on this page. A system that has never been tested has no known RTO. A system that was tested last quarter has a measured RTO. OpenZFS makes the test free — clone, boot, verify, destroy. Total cost: one afternoon.
# Step 1: Create a recursive snapshot of production
zfs snapshot -r rpool@dr-test-$(date +%Y%m%d)
# Step 2: Send snapshot to DR host (if not already replicated)
# If syncoid is already replicating, skip this step — use the existing snapshot on DR
zfs send -R rpool@dr-test-$(date +%Y%m%d) | \
ssh dr-host "zfs recv -F dr-test/production"
# Step 3: On DR host, clone the snapshot into a bootable dataset
ssh dr-host "
zfs clone dr-test/production@dr-test-$(date +%Y%m%d) dr-test/boot
# If using VMs: import the dataset as a disk, boot the VM
# If bare metal: boot from DR host with cloned pool
"
# Step 4: Run smoke tests against the DR clone
# Test connectivity, application health, data integrity
curl -sf http://dr-host:8080/health || echo "FAIL: app health check"
ssh dr-host "psql -U app -c 'SELECT count(*) FROM critical_table;'" || echo "FAIL: DB check"
# Step 5: Measure RTO
# Record: time from "declare disaster" to "services serving traffic on DR"
# Step 6: Document the test
cat > /var/log/kldload/dr-test-$(date +%Y%m%d).txt << EOF
DR Test: $(date)
Snapshot used: rpool@dr-test-$(date +%Y%m%d)
Snapshot age at test: (measure from snapshot timestamp to boot)
RTO achieved: (time from clone to serving traffic)
What worked:
What didn't:
Action items:
EOF
# Step 7: Destroy the test clone on DR host
ssh dr-host "
zfs destroy -r dr-test/boot
zfs destroy -r dr-test/production@dr-test-$(date +%Y%m%d)
"
echo "DR test complete. Clean up done."
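Step 5 is easier to keep honest if the timestamps are captured by the shell rather than remembered afterwards. A minimal sketch, assuming you record one epoch timestamp when the disaster is declared and another when the smoke tests pass. Both function names are hypothetical.

```shell
# format_duration: render a number of seconds as "Hh MMm SSs".
format_duration() {
    local s="$1"
    printf '%dh %02dm %02ds\n' $(( s / 3600 )) $(( s % 3600 / 60 )) $(( s % 60 ))
}

# rto_between: elapsed time between two epoch timestamps (start, end).
rto_between() {
    local start="$1" end="$2"
    format_duration $(( end - start ))
}
```

In practice: run START=$(date +%s) at "declare disaster", END=$(date +%s) once the DR clone serves traffic, then paste the output of rto_between "$START" "$END" into the "RTO achieved" line of the test record.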
Key rotation: WireGuard keys and TLS certificates
# WireGuard key rotation procedure
# 1. Generate new keys for each interface on each node
for iface in wg0 wg1 wg2 wg3; do
[[ -f /etc/wireguard/$iface.conf ]] || continue
echo "Generating new keys for $iface..."
NEW_PRIV=$(wg genkey)
NEW_PUB=$(echo "$NEW_PRIV" | wg pubkey)
echo " New public key: $NEW_PUB"
echo " Update peers with this public key, then update PrivateKey in $iface.conf"
done
# 2. Update each peer's config with the new public keys
# 3. Reload WireGuard (connections drop momentarily, re-establish with new keys)
for iface in wg0 wg1 wg2 wg3; do
[[ -f /etc/wireguard/$iface.conf ]] || continue
wg syncconf "$iface" <(wg-quick strip "$iface")
echo "Reloaded $iface"
done
# 4. Verify all peers reconnected
wg show all
# All peers should show recent handshakes within 2 minutes
# TLS certificate renewal (Let's Encrypt)
certbot renew --dry-run # test first
certbot renew # renew if within 30 days of expiry
systemctl reload nginx haproxy 2>/dev/null || true
Hardware health audit
# SMART data for all disks
for disk in /dev/sd? /dev/nvme?n?; do
[[ -b "$disk" ]] || continue
echo "=== $disk ==="
smartctl -H "$disk" 2>/dev/null | grep -E "SMART overall|result"
smartctl -A "$disk" 2>/dev/null | grep -E "Reallocated|Pending|Uncorrectable|Temperature|Power_On"
echo ""
done
# Disk age: how long has each disk been powered on?
for disk in /dev/sd?; do
[[ -b "$disk" ]] || continue
hours=$(smartctl -A "$disk" 2>/dev/null | awk '/Power_On_Hours/ {print $10}')
[[ -n "$hours" ]] && printf "%-10s %s hours (%s years)\n" \
"$disk" "$hours" "$(echo "scale=1; $hours/8760" | bc)"
done
# Flag disks approaching end of warranty (5-year typical warranty)
# Flag disks with increasing reallocated sector counts
# Order replacements before they fail, not after
Capacity planning: 3-month and 6-month projections
# Compare current month to last month's report (run monthly-review monthly)
# Estimate growth rate
LAST_MONTH_USED="2.3T" # from last month's report
THIS_MONTH_USED=$(zfs list -H -o used rpool | tail -1)
echo "Last month: $LAST_MONTH_USED"
echo "This month: $THIS_MONTH_USED"
echo "Trend: calculate delta, project forward 3-6 months"
echo "Rule: if current growth rate continues, when do you hit 80%?"
echo "Order hardware when you’re 3 months from 80%, not when you hit it."
# Upgrade planning checklist
echo ""
echo "=== Upgrade Planning ==="
uname -r | xargs echo "Current kernel:"
rpm -q zfs 2>/dev/null || dpkg -l zfs-dkms 2>/dev/null | grep zfs | head -1
echo "Check upstream: https://github.com/openzfs/zfs/releases"
echo "Policy: test on dev node, schedule production upgrade with maintenance window"
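The "calculate delta, project forward" step is plain arithmetic once the readings are in bytes. The sketch below assumes linear growth between two monthly readings, which is a simplification. months_until_80 is a hypothetical helper, not part of kldload.

```shell
# months_until_80: whole months until a pool reaches 80% at the current growth rate.
# Arguments: allocated bytes last month, allocated bytes now, pool size in bytes.
# Prints "over" if already at or past 80%, "flat" if usage did not grow.
months_until_80() {
    local last="$1" now="$2" size="$3"
    local limit=$(( size * 8 / 10 ))
    local growth=$(( now - last ))
    if [ "$now" -ge "$limit" ]; then
        echo "over"
    elif [ "$growth" -le 0 ]; then
        echo "flat"
    else
        # Integer division rounds down, which errs on the side of ordering early
        echo $(( (limit - now) / growth ))
    fi
}
```

Exact byte values come from zpool list -Hp -o alloc,size with the pool name. If the result is 3 or less, the ordering rule above says to start procurement now.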
6. Annual Operations
Annual operations are the strategic layer — where you step back from the daily and weekly cadence and review the system as a whole. Is the architecture still correct? Is the hardware aging out? Is the documentation current?
Full hardware audit
# Physical inspection checklist (do this with eyes on the hardware)
# 1. Verify every disk label matches zpool configuration
zpool status -v | grep -E "^\s+(ada|sd|nvme|wwn)" | awk '{print $1}'
# 2. Check disk serial numbers against inventory
for disk in /dev/sd?; do
[[ -b "$disk" ]] || continue
serial=$(smartctl -i "$disk" 2>/dev/null | grep "Serial Number" | awk '{print $3}')
echo "$disk: $serial"
done
# 3. Verify labels and physical location tags are legible and correct
# 4. Check cable condition, seating, controller firmware
# 5. Reconcile physical inventory with asset management system
License and warranty review
# What hardware warranties expire in the next 12 months?
# (Pull from your hardware inventory/CMDB — this should be tracked)
# Flag for replacement budget planning
# What software licenses or subscriptions expire?
# Red Hat: subscription-manager list --consumed
# SSL certificates: see monthly security review
# Any SaaS tooling in the stack?
Architecture review
The architecture review is not a technical audit — it’s a design question. Ask: does the current design still meet requirements? Is the pool layout appropriate for the workload? Is the WireGuard mesh the right topology? Is the monitoring stack adequate? Are the runbooks current? Are the names still meaningful?
The output of the architecture review is a list of action items — not changes to make today, but changes to plan for the next 12 months. Major version upgrades, topology changes, hardware refreshes. These go into the budget and the roadmap.
Documentation review
# Are all runbooks current? Test each one.
# Are all asset labels correct? Walk the rack.
# Are all naming conventions followed? Check dataset names.
# Are all emergency contacts current? Check the on-call rotation.
# The documentation review test: give the runbook to someone new.
# Can they execute it without asking questions?
# If not, the runbook is incomplete. Fix it.
# Verify the ops runbook covers:
# 1. How to run the morning check
# 2. How to replace a failed disk
# 3. How to restore from a snapshot
# 4. How to fail over to DR
# 5. How to roll back a bad upgrade
# 6. How to add a new node to the mesh
# 7. Who to call when something breaks
7. Fleet Management — Operating at Scale
Fleet management is where individual node operations become coordinated operations. Every kldload in the fleet has the same tools and the same interface. Scale is a multiplier, not a different job.
Running commands across multiple nodes
# Salt: run morning check across all nodes
salt '*' cmd.run 'kst'
salt '*' cmd.run 'zpool status' --out=yaml
# Salt: target by role/label
salt -G 'role:storage' cmd.run 'zpool status'
salt -G 'tier:production' cmd.run 'morning-check'
# Ansible: health check across fleet
ansible all -m shell -a 'kst'
ansible all -m shell -a 'zpool status | grep -E "state|errors"'
# Ansible: targeted by group
ansible production -m shell -a 'morning-check'
ansible storage_nodes -m shell -a 'zpool iostat -v 1 3'
# Simple SSH loop for ad-hoc runs
while IFS= read -r host; do
echo "=== $host ==="
# -n stops ssh from reading stdin, which would swallow the rest of hosts.txt
ssh -n "$host" kst 2>/dev/null || echo " UNREACHABLE"
done < /etc/kldload/fleet/hosts.txt
# Parallel SSH loop (faster for large fleets)
parallel -j 10 'echo "=== {} ==="; ssh {} kst 2>/dev/null || echo "UNREACHABLE"' \
:::: /etc/kldload/fleet/hosts.txt
Fleet-wide snapshots: before every coordinated change
# Snapshot every node in the fleet before a coordinated change
SNAP_NAME="pre-fleet-change-$(date +%Y%m%d-%H%M)"
while IFS= read -r host; do
echo "Snapshotting $host..."
ssh -n "$host" "zfs snapshot -r rpool@${SNAP_NAME}" &
done < /etc/kldload/fleet/hosts.txt
wait
echo "All nodes snapshotted: $SNAP_NAME"
echo "Rollback (per node, per dataset): ssh <host> 'zfs rollback -r rpool@${SNAP_NAME}'"
Fleet-wide rolling upgrade
#!/bin/bash
# Rolling upgrade with health check gates
# Upgrades one node at a time. Stops if any node fails health check.
set -euo pipefail
HOSTS_FILE="/etc/kldload/fleet/hosts.txt"
HEALTH_CHECK="morning-check"
WAIT_BETWEEN=300 # 5 minutes between nodes
# Read hosts on fd 3 so ssh cannot swallow the rest of the list from stdin
while IFS= read -r host <&3; do
echo "=== Upgrading $host ==="
# Health check before upgrade
echo "Pre-upgrade health check..."
ssh "$host" "$HEALTH_CHECK" || { echo "FAIL: $host failed pre-upgrade health check. Stopping."; exit 1; }
# Snapshot before upgrade
ssh "$host" "zfs snapshot -r rpool@pre-upgrade-$(date +%Y%m%d)"
# Apply updates
ssh "$host" "dnf upgrade -y --security" || true
# Reboot if needed: needs-restarting -r exits 1 when a reboot is required
if ! ssh "$host" "needs-restarting -r"; then
echo "Reboot required. Rebooting $host..."
ssh "$host" "kbe create pre-upgrade-$(date +%Y%m%d) && systemctl reboot" || true
sleep 120 # Wait for reboot
fi
# Post-upgrade health check: retry, then stop the rollout if the node never recovers
echo "Post-upgrade health check (waiting for $host to be ready)..."
healthy=0
for i in {1..12}; do
if ssh "$host" "$HEALTH_CHECK"; then healthy=1; break; fi
sleep 15
done
[[ "$healthy" -eq 1 ]] || { echo "FAIL: $host failed post-upgrade health check. Stopping."; exit 1; }
echo "$host upgraded successfully."
echo "Waiting ${WAIT_BETWEEN}s before next node..."
sleep "$WAIT_BETWEEN"
done 3< "$HOSTS_FILE"
echo "Fleet upgrade complete."
Fleet-wide replication status
# Check replication lag across fleet
while IFS= read -r host; do
LATEST=$(ssh -n "$host" "zfs list -t snapshot -H -o name,creation -s creation | \
grep auto- | tail -1" 2>/dev/null || echo "UNREACHABLE")
printf "%-20s %s\n" "$host:" "$LATEST"
done < /etc/kldload/fleet/hosts.txt
zfs get -r com.kldload:tier | grep production gives you every production dataset across the fleet. Salt and Ansible target by grain or group. The labels ARE the inventory, the inventory IS the targeting. If your datasets don’t have labels, fleet management is ad-hoc. If they do, it’s systematic. Label everything. The Labeling & Asset Management Masterclass covers the labeling strategy in full.
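A plain grep for "production" also matches datasets whose name merely contains the word. A slightly stricter sketch, assuming the tab-separated name/value output of zfs get -H -o name,value. The helper name datasets_in_tier is hypothetical.

```shell
# datasets_in_tier: read "name<TAB>value" lines on stdin and print the names
# whose property value equals the requested tier exactly.
datasets_in_tier() {
    local tier="$1"
    awk -F'\t' -v tier="$tier" '$2 == tier { print $1 }'
}
```

Usage on one node: zfs get -H -r -o name,value com.kldload:tier rpool | datasets_in_tier production. Wrapped in a Salt or Ansible call, the same filter runs on every node.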
8. Maintenance Windows
A maintenance window is not just time blocked on the calendar. It’s a structured procedure: announce, snapshot, execute, verify, close. Every step is documented. Every rollback path is identified before the window starts. The window is a success if the change is applied AND the rollback is never needed. The window is a failure if the rollback is needed and wasn’t prepared.
Planning a maintenance window
| Phase | Action | Who |
|---|---|---|
| T-48h | Announce window, expected duration, affected services | Operator |
| T-24h | Verify rollback procedure. Document it. Test it on dev. | Operator |
| T-1h | Final health check. Take pre-maintenance snapshot. | Operator |
| T+0 | Execute change. Follow the written procedure. | Operator |
| T+change | Verify: health check, smoke test, user confirmation | Operator + users |
| T+close | Close window. Announce completion. Document outcome. | Operator |
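The T-minus phases can be turned into concrete timestamps for the announcement. This is a rough sketch, assuming GNU date (for -d "@epoch") and a window start given in epoch seconds. maint_timeline is a hypothetical helper.

```shell
# maint_timeline: print the wall-clock time (UTC) of each phase before a window.
# Argument: window start as epoch seconds.
maint_timeline() {
    local start_epoch="$1" entry label offset
    for entry in \
        "T-48h announce:172800" \
        "T-24h verify rollback:86400" \
        "T-1h health check and snapshot:3600" \
        "T+0 execute:0"
    do
        label="${entry%:*}"      # text before the last colon
        offset="${entry##*:}"    # seconds before the window start
        printf '%s  %s\n' \
            "$(date -u -d "@$(( start_epoch - offset ))" '+%Y-%m-%d %H:%M UTC')" \
            "$label"
    done
}
```

For example: maint_timeline "$(date -u -d '2026-04-02 02:00' +%s)" lists the four deadlines in order. Paste them into the T-48h announcement.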
The maintenance window script
#!/bin/bash
# /usr/local/bin/maint-window
# Usage: maint-window start|finish|rollback "description"
set -euo pipefail
ACTION="${1:-}" # default to empty so set -u reaches the usage message
DESC="${2:-maintenance}"
SNAP="pre-maint-$(date +%Y%m%d-%H%M)"
case "$ACTION" in
start)
echo "=== Starting maintenance window: $DESC ==="
echo "Time: $(date)"
echo "Taking pre-maintenance snapshot: $SNAP"
zfs snapshot -r rpool@"$SNAP"
echo "Snapshot created. Rollback command if needed:"
echo " maint-window rollback '$DESC'"
echo ""
echo "Maintenance window is OPEN. Execute your change."
;;
finish)
echo "=== Closing maintenance window: $DESC ==="
echo "Running post-maintenance health check..."
morning-check || echo "WARNING: Health check reported issues. Investigate before declaring success."
echo ""
echo "Maintenance window CLOSED at $(date)."
# SNAP above is stamped with the current time, so look up the snapshot taken at start
SNAP=$(zfs list -t snapshot -H -o name -s creation | grep "^rpool@pre-maint-" | tail -1 || true)
echo "Pre-maintenance snapshot retained: $SNAP"
echo "Remove when confident: zfs destroy -r $SNAP"
;;
rollback)
# Find the most recent pre-maint snapshot (|| true keeps set -e from exiting on no match)
SNAP=$(zfs list -t snapshot -H -o name -s creation | grep "^rpool@pre-maint-" | tail -1 || true)
if [[ -z "$SNAP" ]]; then
echo "No pre-maintenance snapshot found. Manual recovery required."
exit 1
fi
echo "=== ROLLBACK: $DESC ==="
echo "Rolling back to: $SNAP"
echo "WARNING: This will discard all changes since the snapshot."
read -rp "Type 'ROLLBACK' to confirm: " confirm
[[ "$confirm" == "ROLLBACK" ]] || { echo "Aborted."; exit 1; }
# zfs rollback does not recurse into child datasets (-r only destroys newer snapshots),
# so roll back every dataset that carries this snapshot name
TAG="${SNAP#*@}"
zfs list -H -t snapshot -o name | grep "@${TAG}\$" | while IFS= read -r snap; do
zfs rollback -r "$snap"
done
echo "Rollback complete. Run morning-check to verify."
;;
*)
echo "Usage: maint-window start|finish|rollback 'description'"
exit 1
;;
esac
When maintenance goes wrong: rollback and communication
# Rollback procedure (fastest path)
# 1. Identify the pre-maintenance snapshot
zfs list -t snapshot -H -o name,creation -s creation | grep "pre-maint-" | tail -5
# 2. Roll back (instant per dataset; zfs rollback does not recurse, so repeat for each child dataset you changed)
zfs rollback -r rpool@pre-maint-20260402-0200
# 3. If kernel was updated: reboot into previous boot environment
kbe list
kbe activate previous-be-name
systemctl reboot
# 4. Verify
morning-check
# Communication template (send to stakeholders immediately):
cat << 'EOF'
Subject: Maintenance Rollback - [service] - [date]
We encountered an issue during tonight's maintenance window for [service].
Status: Rolled back to pre-maintenance state at [time].
Impact: [describe what users experienced, duration]
Current status: All services restored to normal operation.
Root cause: [brief description]
Next steps: [when will we retry, what changes to the procedure]
We apologize for any disruption.
EOF
zfs rollback takes less time than the application restart that follows it. There is no scenario where not snapshotting is the right call.
9. Health Check Endpoints
Every service should expose a health endpoint. Every health check should include pool health as a component — if the pool is degraded, the service is degraded, even if the application process is running.
Standard health endpoint conventions
# What each endpoint should return:
# GET /health — liveness: is the process running? Returns 200 or 503
# GET /ready — readiness: is the service ready to serve traffic? Returns 200 or 503
# GET /metrics — Prometheus metrics (if instrumented)
# Minimal health script (embed in any service)
cat > /usr/local/bin/health-check-service << 'EOF'
#!/bin/bash
# Returns 0 (healthy) or 1 (unhealthy)
# Check 1: ZFS pool
POOL_HEALTH=$(zpool list -H -o health tank 2>/dev/null || echo "FAULTED")
[[ "$POOL_HEALTH" == "ONLINE" ]] || { echo "UNHEALTHY: pool $POOL_HEALTH"; exit 1; }
# Check 2: Application process
pgrep -x myapp &>/dev/null || { echo "UNHEALTHY: myapp not running"; exit 1; }
# Check 3: Application port responding
curl -sf http://127.0.0.1:8080/ping &>/dev/null || { echo "UNHEALTHY: app not responding"; exit 1; }
echo "HEALTHY"
exit 0
EOF
chmod +x /usr/local/bin/health-check-service
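The script above returns an exit code, while the endpoint conventions call for 200 or 503. A minimal sketch of the glue between the two, assuming a hypothetical helper named `health_response` that wraps any check command and prints a bare HTTP response:

```shell
#!/bin/bash
# Sketch: turn a health check's exit status into a minimal HTTP response.
# health_response is a hypothetical helper, not part of kldload.

health_response() {
  # $@: the check command to run. Its output becomes the response body.
  local body status
  if body=$("$@" 2>&1); then
    status="200 OK"
  else
    status="503 Service Unavailable"
  fi
  # Connection: close lets the client read the body until the socket closes.
  printf 'HTTP/1.1 %s\r\nContent-Type: text/plain\r\nConnection: close\r\n\r\n%s\n' \
    "$status" "$body"
}

# Typical use (not run here):
#   health_response /usr/local/bin/health-check-service
```

Any listener that runs one command per connection could serve this output, for example a systemd socket unit with Accept=yes. That wiring is an assumption and is not shown in this guide.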
HAProxy health checks
# /etc/haproxy/haproxy.cfg
backend app_servers
option httpchk GET /health
http-check expect status 200
server app1 10.201.0.1:8080 check inter 10s rise 2 fall 3
server app2 10.201.0.2:8080 check inter 10s rise 2 fall 3
# ZFS health as a backend check: use custom health endpoint
# GET /health returns 503 if pool is not ONLINE
backend storage_nodes
option httpchk GET /health
http-check expect status 200
server stor1 10.203.0.1:9100 check inter 30s
Prometheus blackbox_exporter endpoint monitoring
# /etc/prometheus/prometheus.yml (blackbox probe section)
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://10.201.0.1:8080/health
          - http://10.201.0.2:8080/health
          - http://10.202.0.1:9090/-/healthy # Prometheus itself
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115 # blackbox_exporter
# Alert: service down for > 1 minute
# - alert: ServiceDown
#   expr: probe_success == 0
#   for: 1m
#   labels:
#     severity: critical
10. Log Management
Logs are the audit trail of everything that happened. OpenZFS events, WireGuard handshakes, authentication attempts, and application errors all flow into the systemd journal. The discipline is knowing what to look for and when.
What to log and where
| Source | Where | Retention |
|---|---|---|
| systemd units | Journal (automatic) | 30 days local |
| ZFS events (zed) | Journal + /var/log/zed.log | 30 days local, 1 year archive |
| WireGuard handshakes | Journal (kernel messages) | 30 days local |
| Application logs | Journal (via stdout) or /var/log/app/ | 90 days central (Loki) |
| Audit log (auth) | Journal + /var/log/audit/ | 1 year (compliance) |
Journal configuration
# /etc/systemd/journald.conf
[Journal]
# journald does not accept trailing comments, so each comment sits on its own line
# Cap the journal at 4 GB
SystemMaxUse=4G
# Leave 2 GB free on the partition
SystemKeepFree=2G
# 30 days
MaxRetentionSec=2592000
# Compress large journal entries
Compress=yes
# Don’t double-log to syslog
ForwardToSyslog=no
# Apply
systemctl restart systemd-journald
ZFS event daemon
# Enable ZFS event daemon: sends email/executes scripts on ZFS events
systemctl enable --now zfs-zed
# ZED config: /etc/zfs/zed.d/zed.rc
# Set ZED_EMAIL_ADDR for email alerts on pool degradation, checksum errors, etc.
# ZED_EMAIL_ADDR="ops@yourdomain.com"
# ZED_EMAIL_PROG="sendmail"
# Test ZED is working:
zpool scrub tank
journalctl -t zed --no-pager -n 20
Log-based alerting: patterns that matter
# Patterns to alert on (via Loki alerting rules or a simple grep in cron)
# OOM killer
journalctl -p err --no-pager | grep "Out of memory"
# Segfault
journalctl --no-pager | grep "segfault at"
# Authentication failure (more than 10 in an hour)
journalctl -t sshd --since "1 hour ago" --no-pager | grep -c "Failed password"
# ZFS checksum errors
journalctl -t kernel --no-pager | grep -i "zfs.*checksum\|zfs.*error"
# Disk I/O errors
journalctl -t kernel --no-pager | grep -i "I/O error\|blk_update_request"
# Cron job for alert scanning (runs every 5 minutes)
cat > /etc/cron.d/log-alert-scan << 'EOF'
*/5 * * * * root /usr/local/bin/log-alert-scan 2>&1 | systemd-cat -t log-alert-scan -p warning
EOF
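The cron entry calls /usr/local/bin/log-alert-scan, whose body is not shown in this guide. A minimal sketch of what its core could look like, built from the patterns listed above. The function names and the failed-login threshold variable are illustrative assumptions.

```shell
#!/bin/bash
# Sketch of the scan logic behind a log-alert-scan script.
# Names and threshold are assumptions; the patterns come from the list above.

FAILED_LOGIN_THRESHOLD=10   # alert only above this many failures in the scanned window

count_matches() {
  # $1: text to search, $2: extended regex. Prints the number of matching lines.
  printf '%s\n' "$1" | grep -Ec "$2" || true
}

scan_log() {
  # stdin: journal text. stdout: one ALERT line per pattern that fired.
  local input n
  input=$(cat)

  n=$(count_matches "$input" "Out of memory")
  if [ "$n" -gt 0 ]; then echo "ALERT oom-killer count=$n"; fi

  n=$(count_matches "$input" "segfault at")
  if [ "$n" -gt 0 ]; then echo "ALERT segfault count=$n"; fi

  n=$(count_matches "$input" "Failed password")
  if [ "$n" -gt "$FAILED_LOGIN_THRESHOLD" ]; then echo "ALERT ssh-auth-failures count=$n"; fi

  n=$(count_matches "$input" "I/O error|blk_update_request")
  if [ "$n" -gt 0 ]; then echo "ALERT disk-io count=$n"; fi

  return 0
}

# Typical use (not run here):
#   journalctl --since "5 minutes ago" --no-pager | scan_log
```

Silence means nothing fired, so the cron line above only writes to the journal when there is something to read. The threshold applies to whatever window is piped in; scan a one-hour window if you want "more than 10 in an hour".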
The log review ritual
# Weekly log review: what to look for
# 1. Error count trend (is it going up?)
journalctl --since "1 week ago" -p err --no-pager -q | wc -l
# 2. Top error sources
journalctl --since "1 week ago" -p err --no-pager -q | \
awk '{print $5}' | sort | uniq -c | sort -rn | head -20
# 3. Any new error patterns not seen before?
journalctl --since "1 week ago" -p err --no-pager -q | \
grep -v "audit\[" | grep -v "CRON\[" | head -50
# 4. Authentication anomalies
journalctl --since "1 week ago" -t sshd --no-pager | \
grep -E "Accepted|Failed|Invalid" | \
awk '{print $9, $11}' | sort | uniq -c | sort -rn | head -20
11. Upgrade Procedures
Every upgrade has the same structure: snapshot, update, verify. The only variable is what you’re updating and whether it requires a reboot. The boot environment is the safety net. Use it for every kernel and ZFS update.
The upgrade calendar
| Type | Cadence | Procedure |
|---|---|---|
| Security patches | Within 48h of release | Snapshot, update, verify |
| Feature updates | Monthly (maintenance window) | Snapshot, update, verify |
| Kernel updates | Monthly or as needed | Boot environment, update, reboot, verify |
| ZFS updates | Monthly (with kernel) | Boot environment, update, reboot, verify module |
| Major OS versions | Quarterly (planned) | Test on dev, maintenance window, full rollback ready |
Package updates (no reboot required)
# Snapshot first
zfs snapshot -r rpool@pre-update-$(date +%Y%m%d)
# Review what changes
dnf check-update # RPM
apt list --upgradable 2>/dev/null # APT
# Apply updates
dnf upgrade -y # RPM
apt-get upgrade -y # APT
# Verify: check critical services still running
morning-check
Kernel updates (reboot required)
# Create a boot environment BEFORE updating the kernel
kbe create pre-kernel-$(date +%Y%m%d)
kbe list
# Apply kernel update
dnf upgrade -y kernel kernel-headers # RPM
apt-get install -y linux-image-generic # APT
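# Rebuild the initramfs for the newly installed kernel so it includes ZFS (critical: if it is missing, the pool won’t mount at boot)
dracut --force --regenerate-all # RPM (plain "dracut --force" rebuilds only the running kernel's image)
update-initramfs -u -k all # APT (rebuild initramfs for all installed kernels)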
# Reboot
systemctl reboot
# After reboot: verify new kernel booted and pool is healthy
uname -r
zpool status
kst
# If the new kernel fails to boot:
# Select previous boot environment from GRUB menu
# kbe activate pre-kernel-YYYYMMDD
# systemctl reboot
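Rebuilding the initramfs is not the same as verifying it. A minimal sketch of a check that could run before the reboot, assuming a hypothetical helper `has_zfs_module` that inspects an initramfs file listing on stdin. The image paths in the usage comments are placeholders and vary by distro.

```shell
#!/bin/bash
# Sketch: confirm an initramfs listing contains the ZFS kernel module.
# has_zfs_module is a hypothetical helper; it only inspects text on stdin.

has_zfs_module() {
  # Matches a path ending in zfs.ko, with an optional compression suffix.
  grep -Eq '(^|/)zfs\.ko(\.(xz|zst|gz))?$'
}

# Typical use (not run here):
#   lsinitrd /boot/initramfs-<new-kernel-version>.img | has_zfs_module || echo "ZFS missing"   # RPM
#   lsinitramfs /boot/initrd.img-<new-kernel-version> | has_zfs_module || echo "ZFS missing"  # APT
```

If the check fails, fix the DKMS build and rebuild the initramfs before rebooting. The boot environment is still there if you miss it, but catching it here saves a failed boot.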
ZFS module updates
# ZFS needs special handling: DKMS rebuild on kernel update
# This is automatic if zfs-dkms is installed correctly
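# Verify the ZFS module was built for the running kernel (vermagic should match uname -r)
modinfo zfs | grep vermagic
# Check DKMS build status (also shows the installed ZFS version)
dkms status | grep zfs
# If DKMS build failed, rebuild manually (substitute the version reported by dkms status)
dkms build zfs/<version> -k $(uname -r)
dkms install zfs/<version> -k $(uname -r)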
# Verify module loads
modprobe zfs
zfs version
Application updates
# Snapshot the application dataset (store the name once so the rollback targets the same snapshot)
SNAP="rpool/data/myapp@pre-update-$(date +%Y%m%d)"
zfs snapshot "$SNAP"
# Update the application
systemctl stop myapp
# (deploy new version here)
systemctl start myapp
# Smoke test
curl -sf http://localhost:8080/health || {
  echo "FAIL: rolling back..."
  systemctl stop myapp
  zfs rollback "$SNAP"
  systemctl start myapp
  echo "Rolled back."
  exit 1
}
echo "Update successful. Monitoring for 1 hour before confirming."
# Monitor logs:
journalctl -u myapp -f
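A single curl immediately after `systemctl start` can fail simply because the application is still starting, which would trigger a needless rollback. A minimal sketch of a retry wrapper for the smoke test; `retry` is a hypothetical helper, and the attempt count and delay are assumptions to tune per service.

```shell
#!/bin/bash
# Sketch: retry a smoke test a few times before declaring the update failed.
# retry is a hypothetical helper, not a kldload command.

retry() {
  # usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  return 1
}

# Typical use (not run here):
#   retry 5 2 curl -sf http://localhost:8080/health || echo "FAIL: rolling back..."
```

Keep the total wait short. If the service is not healthy within the retry budget, roll back and investigate from the snapshot.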
12. Troubleshooting Decision Tree
When something breaks, the decision tree tells you where to look. Start at the top. Follow the branches. Each branch ends with a command. Run it. Read the output. Go to the next branch.
"Something is slow"
# Step 1: Is it ZFS?
zpool iostat -vl 1 5 # -l adds latency columns; look for high read/write wait (>10ms sustained = investigate)
zfs get compressratio,logicalused tank
# Step 2: Is ARC being missed?
awk '/^hits/{h=$3}/^misses/{m=$3}END{printf "ARC hit ratio: %.1f%%\n",(h/(h+m))*100}' \
/proc/spl/kstat/zfs/arcstats
# Hit ratio < 80% = working set exceeds ARC; add RAM or expect cache-cold I/O
# Step 3: Is it CPU?
top -bn1 | head -20
sar -u 1 5 # CPU utilization per second
# Step 4: Is it network?
iftop -i eth0 # Live network utilization
ss -s # Socket summary: any backlogs?
wg show # WireGuard: are all tunnels up and healthy?
# Step 5: Is it the application?
journalctl -u myapp -p warning --since "1 hour ago" --no-pager | tail -50
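The ARC one-liner in Step 2 can be wrapped as a function that also handles the case where both counters are zero. A minimal sketch, assuming a hypothetical name `arc_hit_ratio` and arcstats-formatted text on stdin:

```shell
#!/bin/bash
# Sketch: ARC hit ratio from arcstats-formatted text (name, type, value columns).
# arc_hit_ratio is a hypothetical helper name.

arc_hit_ratio() {
  # Exact match on the first column avoids picking up other *_hits counters.
  LC_ALL=C awk '
    $1 == "hits"   { h = $3 }
    $1 == "misses" { m = $3 }
    END {
      if (h + m == 0) print "n/a"
      else printf "%.1f\n", h / (h + m) * 100
    }'
}

# Typical use (not run here):
#   arc_hit_ratio < /proc/spl/kstat/zfs/arcstats
```

The counters are cumulative since the module loaded, so this is a lifetime average. For a live view, sample twice and compare the deltas.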
"Something is down"
# Step 1: What is down?
systemctl --failed --no-legend
# Step 2: Why did it fail?
systemctl status <unit>
journalctl -u <unit> --since "30 minutes ago" --no-pager | tail -50
# Step 3: Is it a dependency?
systemctl list-dependencies <unit> --failed
# Step 4: Network issue?
wg show # WireGuard tunnels
ss -tlnp # What is listening?
nft list ruleset # Is nftables blocking?
# Step 5: Storage issue?
zpool status # Pool degraded?
df -h # Out of space?
"Disk failed"
# Step 1: Identify the failed disk
zpool status -v # Shows which disk(s) are FAULTED or removed
# Step 2: Map to physical disk (use labels)
# kldload labels disks with dataset and pool information
zpool status -P # Show full device paths
zdb -C | grep -A5 "path" # Pool configuration with paths
# Step 3: If the pool is DEGRADED (not FAULTED or UNAVAIL), it is still serving data and it’s safe to replace
# Order the replacement disk. Data is still readable, but redundancy is reduced until the resilver finishes.
# Step 4: Offline the failed disk (if it hasn’t auto-offlined)
zpool offline tank sda3
# Step 5: Replace the disk
zpool replace tank /dev/old-disk /dev/new-disk
# Step 6: Monitor resilver progress
watch -n 10 'zpool status | grep -A5 "scan:"'
# Step 7: Verify resilver completes without errors
zpool status
# Should show: scan: resilvered X with 0 errors
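On a pool with many disks, Step 1 is faster with a filter. A minimal sketch that pulls every non-ONLINE entry out of `zpool status` output, assuming the usual config table layout (NAME, STATE, READ, WRITE, CKSUM). `bad_vdevs` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: list pool members whose state is not ONLINE, from zpool status text.
# bad_vdevs is a hypothetical helper; it reads stdin and changes nothing.

bad_vdevs() {
  # Config rows have at least five columns, which excludes the "state:" header line.
  awk 'NF >= 5 && $2 ~ /^(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ { print $1, $2 }'
}

# Typical use (not run here):
#   zpool status tank | bad_vdevs
```

The pool and its parent vdev show up alongside the failed disk, because they inherit the degraded state. The deepest entry is the one to replace.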
"Out of space"
# Step 1: Where is the space going?
zfs list -H -o name,used,usedsnap,usedbydataset -t filesystem | \
sort -k2 -h -r | head -20
# Step 2: Snapshot space?
zfs list -H -t snapshot -o name,used | sort -k2 -h -r | head -20
# Step 3: Prune old snapshots (carefully)
# List snapshots older than 30 days (-p prints creation as epoch seconds)
zfs list -Hp -t snapshot -o name,creation -s creation | \
awk -v now="$(date +%s)" '{ age = int((now - $2) / 86400); if (age > 30) print age " days: " $1 }'
# Step 4: Destroy selected old snapshots
zfs destroy rpool@old-snapshot-name
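# Step 5: Use send|recv to offload cold data (example: archive dataset)
zfs send tank/cold-data@snap | ssh dr-host "zfs recv archive/cold-data"
# Confirm the copy arrived before destroying the source
ssh dr-host "zfs list -t snapshot archive/cold-data@snap" && zfs destroy -r tank/cold-data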
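# Step 6: Expand the pool (add a vdev, or grow an existing vdev if possible)
# Caution: adding a single disk to a redundant pool creates a non-redundant top-level vdev.
# Preview the resulting layout with -n before committing:
zpool add -n tank /dev/new-disk
zpool add tank /dev/new-disk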
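Running out of space is easier to prevent than to fix. A minimal sketch of a capacity check that flags pools at or above a threshold. `pools_over` is a hypothetical helper, and the 80% default is a common rule of thumb assumed here, not a value taken from this guide.

```shell
#!/bin/bash
# Sketch: flag pools at or above a capacity threshold.
# pools_over is a hypothetical helper; the default threshold is an assumption.

pools_over() {
  # $1: threshold in percent (default 80). stdin: "name<TAB>capacity" lines.
  local limit="${1:-80}"
  awk -v limit="$limit" '{
    cap = $2
    sub(/%$/, "", cap)                 # accept "83%" as well as "83"
    if (cap + 0 >= limit + 0) print $1, cap "%"
  }'
}

# Typical use (not run here):
#   zpool list -H -o name,capacity | pools_over 80
```

No output means every pool is under the threshold, which makes this easy to drop into a cron job or the morning check.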
"Replication lagging"
# Step 1: Check WireGuard tunnel to replication target
wg show wg3 # Storage plane should have recent handshake
ping 10.203.0.2 # Can you reach the DR host over the storage plane?
# Step 2: Check source snapshot freshness
zfs list -t snapshot -H -o name,creation -s creation | tail -5
# Step 3: Check syncoid status and errors
systemctl status syncoid
journalctl -u syncoid --since "24 hours ago" --no-pager | tail -50
# Step 4: Check bandwidth between source and DR (pv runs locally so the data actually crosses the link)
ssh 10.203.0.2 "dd if=/dev/zero bs=1M count=100" | pv > /dev/null
# Step 5: Run syncoid manually to see live output
syncoid --no-privilege-elevation rpool/data 10.203.0.2:backup/data
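"Lagging" is easier to act on as a number. A minimal sketch that reports the age of the newest snapshot in a listing, meant to be fed from the replication target. `newest_snapshot_age_hours` is a hypothetical helper, and the host and dataset in the usage comment are taken from the syncoid example above.

```shell
#!/bin/bash
# Sketch: age in hours of the newest snapshot in a "name<TAB>epoch" listing.
# newest_snapshot_age_hours is a hypothetical helper name.

newest_snapshot_age_hours() {
  # $1: current time as epoch seconds. stdin: lines from zfs list -Hp -o name,creation.
  awk -v now="$1" '
    $2 > newest { newest = $2; name = $1 }
    END {
      if (name == "") exit 1           # no snapshots at all
      printf "%s %d\n", name, (now - newest) / 3600
    }'
}

# Typical use (not run here):
#   ssh 10.203.0.2 zfs list -Hp -t snapshot -o name,creation -r backup/data \
#     | newest_snapshot_age_hours "$(date +%s)"
```

Compare the result with your replication interval. A target whose newest snapshot is several intervals old is lagging, whatever the service status says.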
"Can’t SSH"
# Step 1: Is WireGuard up?
wg show wg1 # Management plane
# Look for recent handshake (< 180 seconds)
# Step 2: If handshake is stale, check the peer’s endpoint
wg show wg1 endpoints # Correct IP:port?
# Step 3: Is sshd bound to the correct address?
ss -tlnp | grep :22 # Should show 10.201.0.1:22 (management plane only)
# Step 4: Is nftables blocking?
nft list ruleset | grep -A20 "chain input"
# Confirm management plane (wg1) is allowed for SSH
# Step 5: Check sshd config
grep -E "ListenAddress|PermitRootLogin|AllowUsers|AuthorizedKeysFile" /etc/ssh/sshd_config
# Step 6: Check journal for sshd errors
journalctl -t sshd --since "1 hour ago" --no-pager | tail -30
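Step 1 asks for a handshake younger than 180 seconds, but `wg show wg1 latest-handshakes` prints raw epoch timestamps. A minimal sketch that converts them to ages and lists only the stale peers. `stale_peers` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: list WireGuard peers whose last handshake is older than a limit.
# stale_peers is a hypothetical helper; it reads "pubkey<TAB>epoch" lines on stdin.

stale_peers() {
  # $1: current time as epoch seconds. $2: maximum age in seconds (default 180).
  awk -v now="$1" -v max="${2:-180}" '
    $2 == 0        { print $1, "never"; next }   # 0 means no handshake yet
    now - $2 > max { print $1, (now - $2) "s" }'
}

# Typical use (not run here):
#   wg show wg1 latest-handshakes | stale_peers "$(date +%s)" 180
```

An empty result means every peer has completed a recent handshake, and the problem is further up the stack: sshd, nftables, or keys.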
13. The Ops Toolkit
Every command an operator needs, organized by category. Print this section. Pin it to the wall. Reference it when you’re working under pressure and can’t remember the exact flags.
Health
kst — kldload status overview
zpool status — pool state and disk health
zpool status -v — verbose: lists files affected by errors (use -P for full device paths)
systemctl --failed — failed units
wg show — WireGuard peer status
morning-check — full daily health check
Storage
zfs list — all filesystems
zfs list -t snapshot -s creation — snapshots by age
zpool iostat -v 1 — live I/O per vdev
arc_summary — ARC statistics
zfs get compressratio — compression ratio
zpool list -o name,capacity,health — pool usage
Snapshots
ksnap — create a snapshot
sanoid --run — force sanoid snapshot cycle
zfs snapshot -r rpool@name — recursive snapshot
zfs rollback -r rpool/data@snap — rollback
zfs clone snap dest — create a clone
zfs destroy -r rpool@old-snap — prune snapshot
Replication
syncoid --dryrun src dst — test replication
syncoid src host:dst — replicate dataset
zfs send -R snap | ssh host zfs recv dst — manual send
zfs get com.kldload:dr-target — check DR label
journalctl -u syncoid — replication logs
Network
wg show — all WireGuard interfaces
wg show wg1 latest-handshakes — handshake times
ip addr — interface addresses
ss -tlnp — listening services
nft list ruleset — firewall rules
ping 10.201.0.2 — test WireGuard connectivity
Monitoring
curl localhost:9090/metrics — Prometheus metrics
journalctl -p err --since today — today’s errors
journalctl -f -u myservice — follow service log
bpftrace -e 'tracepoint:...' — eBPF tracing
tcpconnect — trace outbound connections
sar -u 1 10 — CPU utilization
Lifecycle
kbe list — list boot environments
kbe create name — create boot environment
kbe activate name — switch boot environment
kupgrade — upgrade with boot environment
maint-window start "desc" — open maintenance window
maint-window finish "desc" — close maintenance window
Scrub & Health
zpool scrub tank — start scrub
zpool scrub -s tank — stop scrub
zpool status | grep scan — scrub progress
smartctl -H /dev/sda — SMART health
smartctl -A /dev/sda — SMART attributes
zpool clear tank — clear error counts after fix
14. Handoff and Documentation
The ultimate test of your operations: if you disappeared tomorrow, could someone else run your infrastructure from the documentation alone? If the answer is no, the documentation is incomplete. This section is about making the answer yes.
The ops runbook: one document, entire environment
# Template: /etc/kldload/runbook.md
# Keep this document current. Review it quarterly. Test it annually.
# Section 1: Infrastructure overview
# - How many nodes? What are they? What do they run?
# - Network topology (WireGuard planes, subnets)
# - Storage pools: which pool on which node, what data
# Section 2: Access
# - How to SSH into each node (WireGuard first!)
# - Where are the keys stored?
# - Emergency access if WireGuard is down (IPMI/console)
# Section 3: Daily operations
# - How to run the morning check
# - What to do for each red/yellow result
# Section 4: Common procedures (link to each procedure)
# - Replace a failed disk
# - Restore from snapshot
# - Roll back an upgrade
# - Fail over to DR
# - Add a new node
# - Rotate WireGuard keys
# Section 5: Emergency contacts
# - Primary operator
# - Backup operator
# - Hardware vendor support contacts
# - Network provider support
# Section 6: Escalation path
# - What can be fixed without escalation?
# - When to call the backup operator?
# - When to declare a disaster and invoke DR?
The on-call handoff
# On-call handoff template (written at shift change):
cat << 'EOF'
On-Call Handoff - [date] [time]
Outgoing: [name]
Incoming: [name]
CURRENT STATUS:
- All pools: [ONLINE/any issues]
- All services: [green/any issues]
- Active incidents: [none / describe]
- Pending maintenance: [none / describe]
WATCH LIST (things to keep an eye on):
- [any slow-burning issues, trends to watch]
RECENT CHANGES:
- [what changed in the last shift]
OPEN ITEMS:
- [anything that needs follow-up but isn’t urgent]
The runbook is at /etc/kldload/runbook.md
Morning check: morning-check
Alerts: Grafana at https://10.202.0.1:3000
EOF
New operator onboarding
# First hour: orient
# 1. Read the runbook (/etc/kldload/runbook.md)
# 2. Run the morning check (morning-check)
# 3. Review the pool topology (zpool status -v for each pool)
# 4. Review the WireGuard mesh (wg show all)
# 5. Look at Grafana dashboards with a senior operator
# First week: supervised operations
# 1. Run the morning check every day, discuss results
# 2. Observe a maintenance window
# 3. Practice snapshot and rollback on a non-production dataset
# 4. Walk through the DR procedure (don’t execute, just read)
# First month: independent operations
# 1. Run all weekly maintenance tasks
# 2. Participate in monthly review
# 3. Execute at least one maintenance window
# 4. Add or update a section of the runbook based on what was unclear
Related pages
- Blue/Green & SRE Masterclass — the philosophy and principles that underpin everything on this page
- ZFS Masterclass — deep dive on pool design, snapshots, replication, and tuning
- Backplane Networks Masterclass — designing and operating the encrypted infrastructure network
- Observability Masterclass — Prometheus, Grafana, and alerting in depth
- Boot Environments tutorial — full guide to creating, activating, and managing boot environments
- Snapshots Guide — sanoid configuration, retention policies, and manual snapshot management
- Disaster Recovery build guide — end-to-end DR setup with syncoid and automated failover