Operations Guide Masterclass

The daily ops bible for kldload in production. What to check every morning. What to run every week. What to review every month. What to test every quarter. Copy-paste commands throughout — no theory, just the playbook.

SRE gives you principles. This guide gives you the playbook. The Blue/Green & SRE Masterclass covers philosophy: error budgets, toil reduction, the reliability engineering mindset. This guide covers execution: what you actually do, in what order, on what schedule. On OpenZFS, most of these operations are one command. Knowing which command to run when is the difference between a system that runs for years and one that fails at the worst possible time.

Who this is for: Anyone operating kldload in production. A solo homelabber with three nodes. A small team running twenty. A platform team managing a fleet. The cadence scales with the environment — the commands are the same. Daily → weekly → monthly → quarterly → annually. Follow the cadence. That’s the whole job.

Out-of-character note: This page is designed to be printed and pinned to the wall next to the terminal. Or bookmarked as the first tab you open every morning. Every section has copy-paste commands. No theory — just “run this, check that, fix if broken.” Start at section 2 (daily), work your way forward. The first time you do the quarterly DR test (section 5) and it actually works, you’ll understand why the cadence exists.

2. The Morning Check — Daily Ops (5 Minutes)

Five minutes every morning. Same commands, same order. If everything is green, your infrastructure survived the night and you can get on with your day. If anything is yellow or red, you have the rest of the day to fix it before users notice. That’s the deal.

The one-command health check

# kst — kldload status: pool status, snapshot freshness, service health, WireGuard
kst

kst runs the canonical health check and formats the output for fast scanning. Green means go. Anything else means dig in. The sections below show what to look for in each check and how to respond.

What to look for

Check	Green	Red — Do This
Pool status	ONLINE	DEGRADED/FAULTED → section 12 (disk failed)
Snapshot freshness	Latest < 2h old	Stale → check sanoid: `systemctl status sanoid`
Failed services	0 failed units	Any failed → `systemctl status <unit>`
WireGuard peers	All handshakes < 3min	Stale handshake → `wg show`, check peer reachability
Overnight errors	0 err/crit/alert	Any errors → investigate before they compound
Disk space	All pools < 70% used	>80% → prune snapshots or expand pool immediately

The morning check script

#!/bin/bash
# /usr/local/bin/morning-check
# Run every morning. Should complete in under 30 seconds.
set -euo pipefail

BOLD='\033[1m'; GREEN='\033[0;32m'; RED='\033[0;31m'; YELLOW='\033[0;33m'; NC='\033[0m'
ERRORS=0

section() { echo -e "\n${BOLD}=== $1 ===${NC}"; }
ok()      { echo -e "  ${GREEN}OK${NC}  $1"; }
warn()    { echo -e "  ${YELLOW}WARN${NC} $1"; ERRORS=$((ERRORS+1)); }
fail()    { echo -e "  ${RED}FAIL${NC} $1"; ERRORS=$((ERRORS+1)); }

section "Pool Status"
while IFS= read -r line; do
  pool=$(echo "$line" | awk '{print $1}')
  state=$(echo "$line" | awk '{print $2}')
  if [[ "$state" == "ONLINE" ]]; then
    ok "$pool: $state"
  else
    fail "$pool: $state"
  fi
done < <(zpool list -H -o name,health)

section "Snapshot Freshness (last 5)"
zfs list -t snapshot -o name,creation -s creation 2>/dev/null | tail -5

section "Snapshot Age Check"
# -p prints creation time as epoch seconds, so no date-string parsing is needed
LATEST=$(zfs list -t snapshot -Hp -o creation -s creation 2>/dev/null | tail -1 || true)
if [[ -n "$LATEST" ]]; then
  AGE=$(( $(date +%s) - LATEST ))
  if [[ $AGE -lt 7200 ]]; then
    ok "Latest snapshot is $((AGE/60)) minutes old"
  elif [[ $AGE -lt 86400 ]]; then
    warn "Latest snapshot is $((AGE/3600)) hours old: check sanoid"
  else
    fail "No snapshot in 24h: sanoid may be broken"
  fi
else
  fail "No snapshots found"
fi

section "Failed Services"
FAILED=$(systemctl --failed --no-legend --no-pager 2>/dev/null | grep -c "failed" || true)
if [[ "$FAILED" -eq 0 ]]; then
  ok "No failed systemd units"
else
  fail "$FAILED failed unit(s):"
  systemctl --failed --no-legend --no-pager
fi

section "WireGuard Peers"
if command -v wg &>/dev/null; then
  # Feed the loop via process substitution (not a pipe) so that
  # ok/warn/fail update ERRORS in this shell rather than in a subshell
  while read -r iface peer ts; do
    if [[ -z "$ts" || "$ts" == "0" ]]; then
      fail "$iface peer $peer: never connected"
    else
      AGE=$(( $(date +%s) - ts ))
      if [[ $AGE -lt 180 ]]; then
        ok "$iface peer ${peer:0:8}...: ${AGE}s ago"
      elif [[ $AGE -lt 300 ]]; then
        warn "$iface peer ${peer:0:8}...: ${AGE}s ago (slow)"
      else
        fail "$iface peer ${peer:0:8}...: ${AGE}s ago (stale)"
      fi
    fi
  done < <(wg show all latest-handshakes 2>/dev/null)
else
  warn "WireGuard not installed"
fi

section "Overnight Errors (since yesterday)"
COUNT=$(journalctl -p err --since yesterday --no-pager -q 2>/dev/null | wc -l)
if [[ "$COUNT" -eq 0 ]]; then
  ok "No errors in journal since yesterday"
elif [[ "$COUNT" -lt 10 ]]; then
  warn "$COUNT error(s) in journal — review:"
  journalctl -p err --since yesterday --no-pager -q | head -10
else
  fail "$COUNT errors in journal — review urgently:"
  journalctl -p err --since yesterday --no-pager -q | tail -20
fi

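section "Disk Space"
# Pool capacity is reported as a percentage, e.g. "42%"
while IFS=$'\t' read -r pool cap; do
  pct=${cap%\%}
  if [[ $pct -lt 70 ]]; then
    ok "$pool: ${pct}% used"
  elif [[ $pct -le 80 ]]; then
    warn "$pool: ${pct}% used (plan to prune snapshots or expand)"
  else
    fail "$pool: ${pct}% used (prune snapshots or expand pool now)"
  fi
done < <(zpool list -H -o name,capacity)
# Per-dataset breakdown for context
zfs list -H -o name,used,avail -t filesystem | \
  awk '{ printf "  %-40s used=%-8s avail=%s\n", $1, $2, $3 }'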

echo ""
if [[ "$ERRORS" -eq 0 ]]; then
  echo -e "${GREEN}${BOLD}Morning check: ALL GREEN. Infrastructure survived the night.${NC}"
else
  echo -e "${RED}${BOLD}Morning check: $ERRORS issue(s) require attention.${NC}"
  exit 1
fi

# Install and schedule
chmod +x /usr/local/bin/morning-check

# Run manually
morning-check

# Or alias it
alias mc='morning-check'

Out-of-character note: If the morning check is green, your infrastructure survived the night. If anything is yellow or red, you have the rest of the day to fix it before users notice. Five minutes every morning prevents 3 AM pages. The script above is intentionally verbose: when something is wrong, you want to see exactly what it is, not just “something failed.” If your terminal doesn’t scroll, pipe it through less -R so the color codes still render.

3. Weekly Maintenance

Weekly tasks catch problems that don’t trigger alerts — silent corruption, missed snapshots, stale backups. Block 30 minutes on the calendar. The same 30 minutes every week. The habit is the system.

ZFS scrub: verify every block on every pool

A scrub reads every block on the pool and verifies the checksum. OpenZFS checksums detect corruption when data is read. The only way to find corruption on cold (rarely accessed) data is to scrub it. Without weekly scrubs, you might not discover corruption until you need to restore from backup — and the backup is also corrupt.

# Start a scrub on all pools
for pool in $(zpool list -H -o name); do
  echo "Starting scrub on $pool..."
  zpool scrub "$pool"
done

# Check scrub status (run again after scrub completes)
zpool status | grep -A4 "scan:"

# Stagger across pools for large fleets: pool A on Monday, pool B on Tuesday
# Monday: zpool scrub tank-a
# Tuesday: zpool scrub tank-b
# etc.
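
The staggering idea in the comments above can be sketched as a round-robin over weekdays. This is a minimal sketch, not a kldload command: `pools_for_today` is a hypothetical helper, and assigning pools to days purely by their list order is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: assign pools to weekdays round-robin (sketch only).
# Usage: pools_for_today <weekday 1-7, 1 = Monday> <pool>...
pools_for_today() {
  local today="$1"; shift
  local i=0 pool
  for pool in "$@"; do
    # Pool 0 -> Monday (1), pool 1 -> Tuesday (2), ... wrapping after Sunday
    if (( i % 7 + 1 == today )); then
      echo "$pool"
    fi
    i=$((i + 1))
  done
}

# Example wiring (commented out so the sketch runs without ZFS):
# for pool in $(pools_for_today "$(date +%u)" $(zpool list -H -o name)); do
#   zpool scrub "$pool"
# done
```

Run daily from a timer, this scrubs each pool once a week while keeping at most a few scrubs active on any given day.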

Monitoring scrub progress:

# Watch scrub progress live
watch -n 10 'zpool status | grep -A3 "scan:"'

# Show progress and ETA for scrubs that are currently running
# (the ETA is on the lines following "scan:")
zpool status | grep -A2 "scan: scrub in progress"

What scrub errors mean:

Error type	What it means	Action
0 errors	All blocks intact	Nothing. Sleep well.
Checksum errors, repaired	Corruption detected and fixed from redundancy	Note the disk. Order a replacement. It’s starting to fail.
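Checksum errors, unrepaired	Corruption that cannot be fixed (no redundancy or multiple failures)	Run `zpool status -v` to list the affected files, then restore them from a replica or backup immediately. A snapshot on the same pool usually shares the damaged blocks.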
Read errors	Disk is having trouble reading blocks	Replace disk before it fails completely.
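
The first three rows of the table can be mapped onto the `scan:` line that `zpool status` prints after a scrub. The sketch below assumes that line has the form shown in the comment; `classify_scrub` is a hypothetical helper, and treating a non-zero error count as "unrepaired" is an assumption you should check against `zpool status -v` on your own pools.

```shell
#!/bin/bash
# Hypothetical helper: classify the "scan:" line from `zpool status`.
# Assumes the line looks like:
#   scan: scrub repaired 0B in 00:12:34 with 0 errors on Sun Apr  5 04:12:34 2026
classify_scrub() {
  local line="$1" repaired errors
  local re='scrub repaired ([^[:space:]]+) in .* with ([0-9]+) errors'
  if [[ "$line" =~ $re ]]; then
    repaired="${BASH_REMATCH[1]}"
    errors="${BASH_REMATCH[2]}"
    if (( errors > 0 )); then
      echo "UNREPAIRED"      # restore affected files
    elif [[ "$repaired" != "0B" && "$repaired" != "0" ]]; then
      echo "REPAIRED"        # note the disk, order a replacement
    else
      echo "CLEAN"
    fi
  else
    echo "UNKNOWN"           # scrub running, never run, or unexpected format
  fi
}
```

Read errors are reported per device in the READ column of `zpool status` rather than on the `scan:` line, so they are outside the scope of this helper.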

Package updates

# ALWAYS snapshot before updating
zfs snapshot -r rpool/ROOT@pre-update-$(date +%Y%m%d)

# Check what would be updated (review before applying)
dnf check-update              # CentOS / RHEL / Rocky / Fedora
apt list --upgradable         # Debian / Ubuntu

# Apply security patches only
dnf upgrade --security        # RPM-based
apt-get upgrade               # Debian-based (all updates; use unattended-upgrades for security-only)

# If kernel was updated, create boot environment before rebooting
kbe create pre-kernel-update-$(date +%Y%m%d)

# Reboot and verify
systemctl reboot
# After reboot:
uname -r                      # confirm new kernel
kst                           # confirm everything is healthy
zpool status                  # confirm pool healthy after reboot

Backup verification: actually restore something

# Pick a recent snapshot and clone it to a temp dataset
SNAP=$(zfs list -t snapshot -H -o name -s creation | grep "^rpool/data" | tail -1)
echo "Verifying restore from: $SNAP"

# Clone the snapshot and mount the clone at /mnt/verify
VERIFY="rpool/verify-$(date +%Y%m%d)"
zfs clone -o mountpoint=/mnt/verify "$SNAP" "$VERIFY"

# Check the data
ls -la /mnt/verify/
# Spot-check: open files, verify sizes, check modification times

# Verify replication is current on DR host
ssh dr-host "zfs list -t snapshot -H -o name,creation -s creation | grep rpool | tail -5"

# Cleanup (destroying the clone also unmounts it)
zfs destroy "$VERIFY"

WireGuard key age check

# Check how old your WireGuard keys are
# Config mtime is only a proxy for key age: any edit to the file resets it
for conf in /etc/wireguard/wg*.conf; do
  [[ -f "$conf" ]] || continue
  iface=$(basename "$conf" .conf)
  mtime=$(stat -c %Y "$conf")
  days=$(( ($(date +%s) - mtime) / 86400 ))
  echo "$iface: config last modified $days days ago (rotate if older than your policy)"
done

Log review

# Scan the past week for patterns worth knowing about
journalctl --since "1 week ago" -p warning --no-pager | \
  grep -v "audit\[" | \
  awk '{$1=$2=$3=""; print}' | sort | uniq -c | sort -rn | head -30

# ZFS-specific events this week
journalctl --since "1 week ago" -t kernel --no-pager | grep -i "zfs\|zpool\|arc" | tail -50

# Authentication failures this week
journalctl --since "1 week ago" -t sshd --no-pager | grep -i "fail\|invalid\|refused" | wc -l

The weekly maintenance script

#!/bin/bash
# /usr/local/bin/weekly-maintenance
# Run once a week. Automate with: systemd timer or cron.
set -euo pipefail

LOG="/var/log/kldload/weekly-$(date +%Y%m%d).log"
mkdir -p "$(dirname "$LOG")"
exec > >(tee -a "$LOG") 2>&1

echo "=== Weekly Maintenance: $(date) ==="

echo "--- Snapshot before maintenance ---"
zfs snapshot -r rpool@weekly-maint-$(date +%Y%m%d)

echo "--- Starting scrubs ---"
for pool in $(zpool list -H -o name); do
  zpool scrub "$pool"
  echo "Scrub started on $pool"
done

echo "--- Checking for package updates ---"
dnf check-update --quiet || true

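echo "--- Verifying latest snapshot is recent ---"
# Exclude the snapshot this script just took; otherwise the check always looks fresh
LATEST=$(zfs list -t snapshot -H -o name,creation -s creation | grep -v "@weekly-maint-" | tail -1 || true)
echo "Latest snapshot: $LATEST"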

echo "--- Log summary: errors this week ---"
journalctl --since "1 week ago" -p err --no-pager -q | wc -l | xargs echo "Error count:"

echo "--- Weekly maintenance complete: $(date) ---"

# Schedule with systemd timer
cat > /etc/systemd/system/weekly-maintenance.service << 'EOF'
[Unit]
Description=kldload Weekly Maintenance
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/weekly-maintenance
EOF

cat > /etc/systemd/system/weekly-maintenance.timer << 'EOF'
[Unit]
Description=Run kldload weekly maintenance every Monday at 04:00

[Timer]
OnCalendar=Mon *-*-* 04:00:00
RandomizedDelaySec=1800
Persistent=true

[Install]
WantedBy=timers.target
EOF

systemctl enable --now weekly-maintenance.timer

Out-of-character note: Scrub is the most important weekly task. Without it, you don’t know if your data is intact. OpenZFS checksums detect corruption when blocks are read, and only when blocks are read. Scrub reads every block. A monthly scrub on a heavily-used pool might miss corruption that sits on cold data for weeks. Weekly is the right cadence for production. The scrub costs nothing but read I/O and it can run while the pool is in use. There’s no excuse for skipping it.

4. Monthly Review

The monthly review is where you catch slow-burning problems. Gradual capacity growth. Declining ARC hit rates. Increasing scrub times. These don’t trigger alerts because they’re not sudden — they’re trends. The review catches them before they become incidents.

Capacity review

# Growth rate per pool: compare used space month-over-month
zfs list -H -o name,used,avail,usedsnap,usedds -t filesystem | \
  column -t

# Which datasets are growing fastest? (sort by used size, descending)
zfs list -H -o name,used -t filesystem | sort -k2 -h -r | head -20

# Are any pools approaching 80%? (performance degrades past 80%; warn from 70%)
# capacity is column 5 of this output; column 6 is health
zpool list -o name,size,alloc,free,capacity,health | \
  awk 'NR==1 {print} NR>1 { cap=$5+0; if(cap>70) print "\033[0;33mWARN\033[0m " $0; else print "  OK " $0 }'

# Procurement forecast: show current usage against the 80% limit
# Usage: zfs-capacity-forecast poolname
zfs-capacity-forecast() {
  local pool="$1" used total
  # -p prints exact byte counts, so pools sized in M, G, or T are all handled
  used=$(zpool list -Hp -o alloc "$pool")
  total=$(zpool list -Hp -o size "$pool")
  awk -v p="$pool" -v u="$used" -v t="$total" 'BEGIN {
    g = 1024 ^ 3
    printf "Pool %s: %.0fG used of %.0fG (limit: %.0fG)\n", p, u/g, t/g, t*0.8/g
  }'
  echo "Track growth manually: compare this output month over month"
  echo "Rule: if doubling time < 6 months, start procurement now"
}

Performance review

# ARC statistics: hit rate, size, evictions
# Falls back to the raw kstat file when arc_summary is not installed
arc_summary 2>/dev/null || \
  grep -E "^(hits|misses|c |size|mru|mfu)" /proc/spl/kstat/zfs/arcstats | head -20

# Calculate ARC hit ratio manually
awk '
  /^hits/ { hits=$3 }
  /^misses/ { misses=$3 }
  END {
    total=hits+misses
    if(total>0) printf "ARC hit ratio: %.2f%% (%d hits, %d misses)\n", (hits/total)*100, hits, misses
  }
' /proc/spl/kstat/zfs/arcstats

# I/O latency: check iostat for sustained high latency
zpool iostat -v 1 5    # 5 samples, 1 second apart

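# Scrub duration trend: start and finish events for the last 4 scrubs
# (-i includes internal events; scrubs appear there as "scan setup" and "scan done")
zpool history -i | grep -E "scan (setup|done)" | tail -8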

# If ARC hit ratio is declining (e.g., was 95%, now 80%), your working set
# has grown beyond your RAM. Options: add RAM, or accept the performance hit.

Security review

# TLS certificate expiry check
for cert in /etc/ssl/certs/*.pem /etc/letsencrypt/live/*/cert.pem; do
  [[ -f "$cert" ]] || continue
  expiry=$(openssl x509 -noout -enddate -in "$cert" 2>/dev/null | cut -d= -f2)
  [[ -n "$expiry" ]] || continue   # skip files openssl cannot parse as a certificate
  days=$(( ($(date -d "$expiry" +%s) - $(date +%s)) / 86400 ))
  if [[ $days -lt 30 ]]; then
    echo "EXPIRING SOON ($days days): $cert"
  elif [[ $days -lt 60 ]]; then
    echo "Warning ($days days): $cert"
  else
    echo "OK ($days days): $cert"
  fi
done

# Failed authentication attempts this month, grouped by source address
journalctl --since "1 month ago" -t sshd --no-pager | \
  grep -i "failed\|invalid user" | \
  grep -oE "from [0-9a-fA-F:.]+" | awk '{print $2}' | sort | uniq -c | sort -rn | head -20

# WireGuard key ages (see weekly section for key age check)
# Flag any keys older than 90 days for rotation

# eBPF anomaly summary (if eBPF security tooling is deployed)
if command -v bpftool &>/dev/null; then
  echo "Active eBPF programs:"
  # Each loaded program starts a line with "<id>:"; count those lines
  bpftool prog list | grep -cE "^[0-9]+:" | xargs echo "  count:"
fi

Snapshot policy review

# How much space are snapshots consuming?
zfs list -H -o name,usedsnap -t filesystem | sort -k2 -h -r | head -20

# How many snapshots per dataset?
zfs list -H -t snapshot -o name | awk -F@ '{print $1}' | sort | uniq -c | sort -rn | head -20

# Oldest snapshot per dataset
zfs list -H -t snapshot -o name,creation -s creation | \
  awk '{ ds=substr($1,1,index($1,"@")-1); if(!(ds in seen)){ seen[ds]=$0; print } }'

# Review sanoid configuration: is retention policy correct?
cat /etc/sanoid/sanoid.conf

# If snapshots are consuming too much space, tighten retention:
# [dataset]
#   hourly = 24       # keep 24 hourly (was 168)
#   daily = 30        # keep 30 daily (was 90)
#   monthly = 6       # keep 6 monthly (was 12)
#   yearly = 1        # keep 1 yearly
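
To sanity-check a retention change, it can help to compare the number of snapshots the policy should keep against the per-dataset counts from the command above. A minimal sketch, assuming only the four retention classes shown in the comments are in use; `retention_total` is a hypothetical helper, not part of sanoid.

```shell
#!/bin/bash
# Hypothetical helper: expected steady-state snapshot count per dataset.
# Usage: retention_total <hourly> <daily> <monthly> <yearly>
retention_total() {
  echo $(( $1 + $2 + $3 + $4 ))
}

retention_total 168 90 12 1   # the looser policy from the comments above
retention_total 24 30 6 1     # the tightened policy
```

A dataset holding far more snapshots than this figure suggests pruning is not running, or that snapshots are being created outside sanoid (manual, pre-update, and maintenance snapshots are not counted here).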

Replication review

# Check DR replication lag: compare snapshot timestamps on source vs DR
echo "=== Source snapshots ==="
zfs list -t snapshot -H -o name,creation -s creation | grep "auto-" | tail -5

echo "=== DR snapshots (SSH to DR host) ==="
ssh dr-host "zfs list -t snapshot -H -o name,creation -s creation | grep 'auto-' | tail -5"

# Check syncoid last run time
journalctl -u syncoid --no-pager --since "1 week ago" | tail -20

# Check replication bandwidth utilization
# (if syncoid is running, it shows transfer rate in output)
journalctl -u syncoid --no-pager -g "sent" | tail -10

Monthly review report script

#!/bin/bash
# /usr/local/bin/monthly-review
# Generates a monthly health report. Run on the 1st of each month.

REPORT="/var/log/kldload/monthly-report-$(date +%Y%m).txt"
mkdir -p "$(dirname "$REPORT")"

{
echo "kldload Monthly Review Report"
echo "Generated: $(date)"
echo "Hostname: $(hostname)"
echo ""

echo "=== POOL HEALTH ==="
zpool list
echo ""

echo "=== CAPACITY (top datasets by usage) ==="
zfs list -H -o name,used,avail,usedsnap -t filesystem | sort -k2 -h -r | head -15
echo ""

echo "=== SNAPSHOT SPACE USAGE ==="
zfs list -H -o name,usedsnap -t filesystem | \
  awk '$2!="0" {print}' | sort -k2 -h -r | head -10
echo ""

echo "=== ARC PERFORMANCE ==="
awk '
  /^hits/ { hits=$3 }
  /^misses/ { misses=$3 }
  /^size/ { size=$3 }
  /^c / { target=$3 }
  END {
    total=hits+misses
    if(total>0) {
      printf "ARC hit ratio: %.2f%%\n", (hits/total)*100
      printf "ARC size: %d MB / target %d MB\n", size/1024/1024, target/1024/1024
    }
  }
' /proc/spl/kstat/zfs/arcstats
echo ""

echo "=== ERROR SUMMARY (past 30 days) ==="
journalctl --since "30 days ago" -p err --no-pager -q | wc -l | xargs echo "Total errors:"
journalctl --since "30 days ago" -p err --no-pager -q | \
  awk '{$1=$2=$3=$4=""; print}' | sort | uniq -c | sort -rn | head -10
echo ""

echo "=== CERTIFICATE EXPIRY ==="
for cert in /etc/ssl/certs/*.pem /etc/letsencrypt/live/*/cert.pem; do
  [[ -f "$cert" ]] || continue
  expiry=$(openssl x509 -noout -enddate -in "$cert" 2>/dev/null | cut -d= -f2)
  [[ -n "$expiry" ]] || continue   # skip files openssl cannot parse as a certificate
  days=$(( ($(date -d "$expiry" +%s) - $(date +%s)) / 86400 ))
  echo "$days days: $cert"
done | sort -n
echo ""

echo "=== SCRUB HISTORY ==="
# -i includes internal events; scrubs appear there as "scan setup" and "scan done"
zpool history -i | grep -E "scan (setup|done)" | tail -8
echo ""

echo "=== Report complete ==="
} | tee "$REPORT"

echo "Report saved to: $REPORT"

Out-of-character note: The monthly review is where you catch slow-burning problems. Gradual capacity growth of 3% of the pool per month is invisible in the daily check. Over a year, it adds up to 36% of the pool. The monthly review shows the trend before it becomes a crisis. The ARC hit ratio is the same: a drop from 96% to 91% over three months is hard to see in daily checks, but obvious in a monthly comparison. Run the report script. Compare it to last month. The differences are the story.

5. Quarterly Operations

Quarterly operations are the ones most teams skip — because they require dedicated time, not just a few minutes. They’re also the most important. A quarterly DR test is the difference between “we have replication” and “we have verified disaster recovery.”

DR test: clone production, boot it, verify, destroy

This is the most valuable operation on this page. A system that has never been tested has no known RTO. A system that was tested last quarter has a measured RTO. OpenZFS makes the test free — clone, boot, verify, destroy. Total cost: one afternoon.

# Step 1: Create a recursive snapshot of production
zfs snapshot -r rpool@dr-test-$(date +%Y%m%d)

# Step 2: Send snapshot to DR host (if not already replicated)
# If syncoid is already replicating, skip this step — use the existing snapshot on DR
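# -u leaves the received datasets unmounted, so production mountpoints
# cannot shadow paths on the DR host
zfs send -R rpool@dr-test-$(date +%Y%m%d) | \
  ssh dr-host "zfs recv -u -F dr-test/production"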

# Step 3: On DR host, clone the snapshot into a bootable dataset
ssh dr-host "
  zfs clone dr-test/production@dr-test-$(date +%Y%m%d) dr-test/boot
  # If using VMs: import the dataset as a disk, boot the VM
  # If bare metal: boot from DR host with cloned pool
"

# Step 4: Run smoke tests against the DR clone
# Test connectivity, application health, data integrity
curl -sf http://dr-host:8080/health || echo "FAIL: app health check"
ssh dr-host "psql -U app -c 'SELECT count(*) FROM critical_table;'" || echo "FAIL: DB check"

# Step 5: Measure RTO
# Record: time from "declare disaster" to "services serving traffic on DR"

# Step 6: Document the test
cat > /var/log/kldload/dr-test-$(date +%Y%m%d).txt << EOF
DR Test: $(date)
Snapshot used: rpool@dr-test-$(date +%Y%m%d)
Snapshot age at test: (measure from snapshot timestamp to boot)
RTO achieved: (time from clone to serving traffic)
What worked:
What didn't:
Action items:
EOF

# Step 7: Destroy the test clone on DR host, then the test snapshot on the source
ssh dr-host "
  zfs destroy -r dr-test/boot
  zfs destroy -r dr-test/production@dr-test-$(date +%Y%m%d)
"
zfs destroy -r rpool@dr-test-$(date +%Y%m%d)
echo "DR test complete. Cleanup done."
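
Step 5 asks you to record the RTO but leaves the measurement to you. One way to do it, sketched under the assumption that wall-clock timestamps taken by the operator are accurate enough; `rto_elapsed` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: turn two epoch timestamps into a readable RTO figure.
# Usage: rto_elapsed <start_epoch> <end_epoch>
rto_elapsed() {
  local secs=$(( $2 - $1 ))
  printf '%dh %02dm %02ds\n' $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
}

# Sketch of use during the test:
# START=$(date +%s)          # the moment you "declare disaster"
# ... clone, boot, smoke tests ...
# END=$(date +%s)            # first successful health check on DR
# echo "RTO achieved: $(rto_elapsed "$START" "$END")"
```

Pasting the resulting line into the Step 6 report gives you a figure to compare quarter over quarter.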

Key rotation: WireGuard keys and TLS certificates

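# WireGuard key rotation procedure
# 1. Generate new keys for each interface on each node
umask 077   # new key files must not be readable by other users
for iface in wg0 wg1 wg2 wg3; do
  [[ -f /etc/wireguard/$iface.conf ]] || continue
  echo "Generating new keys for $iface..."
  # Save the private key so it can be copied into the config
  # (the .key.new path is an example; use whatever your policy dictates)
  wg genkey > "/etc/wireguard/$iface.key.new"
  NEW_PUB=$(wg pubkey < "/etc/wireguard/$iface.key.new")
  echo "  New public key: $NEW_PUB"
  echo "  Update peers with this public key, then set PrivateKey in $iface.conf from $iface.key.new"
done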

# 2. Update each peer's config with the new public keys
# 3. Reload WireGuard (connections drop momentarily, re-establish with new keys)
for iface in wg0 wg1 wg2 wg3; do
  [[ -f /etc/wireguard/$iface.conf ]] || continue
  wg syncconf "$iface" <(wg-quick strip "$iface")
  echo "Reloaded $iface"
done

# 4. Verify all peers reconnected
wg show all
# All peers should show recent handshakes within 2 minutes

# TLS certificate renewal (Let's Encrypt)
certbot renew --dry-run   # test first
certbot renew             # renew if within 30 days of expiry
systemctl reload nginx haproxy 2>/dev/null || true

Hardware health audit

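# SMART data for all disks (NVMe namespaces are block devices named nvmeXnY)
for disk in /dev/sd? /dev/nvme?n?; do
  [[ -b "$disk" ]] || continue
  echo "=== $disk ==="
  smartctl -H "$disk" 2>/dev/null | grep -E "SMART overall|result"
  # ATA attribute names use underscores; NVMe health log fields use spaces
  smartctl -A "$disk" 2>/dev/null | \
    grep -E "Reallocated|Pending|Uncorrectable|Temperature|Power_On|Power On Hours|Percentage Used|Media and Data Integrity"
  echo ""
done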

# Disk age: how long has each disk been powered on?
for disk in /dev/sd?; do
  [[ -b "$disk" ]] || continue
  hours=$(smartctl -A "$disk" 2>/dev/null | awk '/Power_On_Hours/ {print $10}')
  [[ -n "$hours" ]] && printf "%-10s %s hours (%s years)\n" \
    "$disk" "$hours" "$(echo "scale=1; $hours/8760" | bc)"
done

# Flag disks approaching end of warranty (5-year typical warranty)
# Flag disks with increasing reallocated sector counts
# Order replacements before they fail, not after
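
The warranty flag in the comments above can be sketched as a small check on power-on hours. `disk_age_flag` is a hypothetical helper; the 5-year warranty comes from the comment, while flagging from 4 years onward is an assumption. Power-on hours also understate calendar age for disks that spent time powered off.

```shell
#!/bin/bash
# Hypothetical helper: flag a disk by its power-on hours.
# Usage: disk_age_flag <power_on_hours>
disk_age_flag() {
  local hours="$1"
  awk -v h="$hours" 'BEGIN {
    years = h / 8760                      # 8760 hours in a non-leap year
    flag = (years >= 4) ? "PLAN-REPLACEMENT" : "OK"
    printf "%.1f years %s\n", years, flag
  }'
}
```

Feed it the `$hours` value from the disk age loop above to turn the raw number into a procurement signal.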

Capacity planning: 3-month and 6-month projections

# Compare current month to last month's report (run monthly-review monthly)
# Estimate growth rate
LAST_MONTH_USED="2.3T"   # from last month's report
THIS_MONTH_USED=$(zfs list -H -o used rpool | tail -1)
echo "Last month: $LAST_MONTH_USED"
echo "This month: $THIS_MONTH_USED"
echo "Trend: calculate delta, project forward 3-6 months"
echo "Rule: if current growth rate continues, when do you hit 80%?"
echo "Order hardware when you’re 3 months from 80%, not when you hit it."

# Upgrade planning checklist
echo ""
echo "=== Upgrade Planning ==="
uname -r | xargs echo "Current kernel:"
rpm -q zfs 2>/dev/null || dpkg -l zfs-dkms 2>/dev/null | grep zfs | head -1
echo "Check upstream: https://github.com/openzfs/zfs/releases"
echo "Policy: test on dev node, schedule production upgrade with maintenance window"
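
The projection described in the echo lines above can be sketched as simple linear extrapolation. `months_to_80` is a hypothetical helper; it assumes growth continues at last month's rate and that all three inputs are byte counts (for example from `zpool list -Hp -o alloc,size`).

```shell
#!/bin/bash
# Hypothetical helper: months until a pool reaches 80%, assuming linear growth.
# Usage: months_to_80 <used_last_month> <used_now> <pool_size>   (all in bytes)
months_to_80() {
  awk -v prev="$1" -v now="$2" -v size="$3" 'BEGIN {
    growth = now - prev          # bytes added over the last month
    limit = size * 0.8
    if (now >= limit) { print "already past 80%"; exit }
    if (growth <= 0)  { print "no growth"; exit }
    printf "%.1f months\n", (limit - now) / growth
  }'
}
```

If the result is three months or less, the rule above says it is time to order hardware. A single month of data is noisy, so averaging the growth over several monthly reports gives a steadier estimate.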

Out-of-character note: The quarterly DR test is the most important operation that most teams skip. “We have replication” is not DR. “We tested failover last quarter and it took 4 minutes, here’s the report” is DR. OpenZFS makes the test free — clone, boot, verify, destroy. Total cost: one afternoon per quarter. The first time you run it, something will be broken. That’s the point. Find it on a Tuesday afternoon, not during a Sunday night incident.

6. Annual Operations

Annual operations are the strategic layer — where you step back from the daily and weekly cadence and review the system as a whole. Is the architecture still correct? Is the hardware aging out? Is the documentation current?

Full hardware audit

# Physical inspection checklist (do this with eyes on the hardware)
# 1. Verify every disk label matches zpool configuration
zpool status -v | grep -E "^\s+(ada|sd|nvme|wwn)" | awk '{print $1}'

# 2. Check disk serial numbers against inventory
for disk in /dev/sd?; do
  [[ -b "$disk" ]] || continue
  serial=$(smartctl -i "$disk" 2>/dev/null | grep "Serial Number" | awk '{print $3}')
  echo "$disk: $serial"
done

# 3. Verify labels and physical location tags are legible and correct
# 4. Check cable condition, seating, controller firmware
# 5. Reconcile physical inventory with asset management system

License and warranty review

# What hardware warranties expire in the next 12 months?
# (Pull from your hardware inventory/CMDB — this should be tracked)
# Flag for replacement budget planning

# What software licenses or subscriptions expire?
# Red Hat: subscription-manager list --consumed
# SSL certificates: see monthly security review
# Any SaaS tooling in the stack?

Architecture review

The architecture review is not a technical audit — it’s a design question. Ask: does the current design still meet requirements? Is the pool layout appropriate for the workload? Is the WireGuard mesh the right topology? Is the monitoring stack adequate? Are the runbooks current? Are the names still meaningful?

The output of the architecture review is a list of action items — not changes to make today, but changes to plan for the next 12 months. Major version upgrades, topology changes, hardware refreshes. These go into the budget and the roadmap.

Documentation review

# Are all runbooks current? Test each one.
# Are all asset labels correct? Walk the rack.
# Are all naming conventions followed? Check dataset names.
# Are all emergency contacts current? Check the on-call rotation.

# The documentation review test: give the runbook to someone new.
# Can they execute it without asking questions?
# If not, the runbook is incomplete. Fix it.

# Verify the ops runbook covers:
# 1. How to run the morning check
# 2. How to replace a failed disk
# 3. How to restore from a snapshot
# 4. How to fail over to DR
# 5. How to roll back a bad upgrade
# 6. How to add a new node to the mesh
# 7. Who to call when something breaks

7. Fleet Management — Operating at Scale

Fleet management is where individual node operations become coordinated operations. Every kldload in the fleet has the same tools and the same interface. Scale is a multiplier, not a different job.

Running commands across multiple nodes

# Salt: run morning check across all nodes
salt '*' cmd.run 'kst'
salt '*' cmd.run 'zpool status' --out=yaml

# Salt: target by role/label
salt -G 'role:storage' cmd.run 'zpool status'
salt -G 'tier:production' cmd.run 'morning-check'

# Ansible: health check across fleet
ansible all -m shell -a 'kst'
ansible all -m shell -a 'zpool status | grep -E "state|errors"'

# Ansible: targeted by group
ansible production -m shell -a 'morning-check'
ansible storage_nodes -m shell -a 'zpool iostat -v 1 3'

# Simple SSH loop for ad-hoc runs
# ssh -n keeps ssh from reading stdin, which would swallow the rest of the host list
while IFS= read -r host; do
  echo "=== $host ==="
  ssh -n "$host" kst 2>/dev/null || echo "  UNREACHABLE"
done < /etc/kldload/fleet/hosts.txt

# Parallel SSH loop (faster for large fleets)
parallel -j 10 'echo "=== {} ==="; ssh {} kst 2>/dev/null || echo "UNREACHABLE"' \
  :::: /etc/kldload/fleet/hosts.txt
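
The loops above assume `hosts.txt` holds exactly one hostname per line. If you want to annotate that file, a small filter keeps the loops working. This is a sketch: `fleet_hosts` is a hypothetical helper, and the `#` comment syntax is an assumption about how you choose to format the file.

```shell
#!/bin/bash
# Hypothetical helper: print usable hostnames from a fleet hosts file,
# skipping blank lines and "#" comments.
# Usage: fleet_hosts <file>
fleet_hosts() {
  local file="$1" line
  while IFS= read -r line || [[ -n "$line" ]]; do
    line="${line%%#*}"                 # drop everything after a "#"
    line="${line//[[:space:]]/}"       # drop whitespace
    [[ -n "$line" ]] && echo "$line"
  done < "$file"
  return 0
}

# Example wiring (commented out so the sketch runs without a fleet):
# fleet_hosts /etc/kldload/fleet/hosts.txt | parallel -j 10 'ssh {} kst'
```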

Fleet-wide snapshots: before every coordinated change

# Snapshot every node in the fleet before a coordinated change
SNAP_NAME="pre-fleet-change-$(date +%Y%m%d-%H%M)"

while IFS= read -r host; do
  echo "Snapshotting $host..."
  # -n keeps the backgrounded ssh from reading the host list on stdin
  ssh -n "$host" "zfs snapshot -r rpool@${SNAP_NAME}" &
done < /etc/kldload/fleet/hosts.txt
wait

echo "All nodes snapshotted: $SNAP_NAME"
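echo "Rollback (zfs rollback -r does not descend into child datasets, so roll back each one):"
echo "  ssh <host> \"zfs list -H -o name -t snapshot -r rpool | grep '@${SNAP_NAME}\$' | xargs -n1 zfs rollback -r\""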

Fleet-wide rolling upgrade

#!/bin/bash
# Rolling upgrade with health check gates
# Upgrades one node at a time. Stops if any node fails health check.
set -euo pipefail

HOSTS_FILE="/etc/kldload/fleet/hosts.txt"
HEALTH_CHECK="morning-check"
WAIT_BETWEEN=300   # 5 minutes between nodes

# Read hosts on fd 3 so ssh cannot consume the host list from stdin
while IFS= read -r host <&3; do
  echo "=== Upgrading $host ==="
  STAMP=$(date +%Y%m%d)

  # Health check before upgrade
  echo "Pre-upgrade health check..."
  ssh "$host" "$HEALTH_CHECK" || { echo "FAIL: $host failed pre-upgrade health check. Stopping."; exit 1; }

  # Snapshot before upgrade
  ssh "$host" "zfs snapshot -r rpool@pre-upgrade-$STAMP"

  # Apply updates; stop the rollout if the package manager fails
  ssh "$host" "dnf upgrade -y --security" || { echo "FAIL: upgrade failed on $host. Stopping."; exit 1; }

  # Reboot if required (needs-restarting -r exits 1 when a reboot is needed)
  if ! ssh "$host" "needs-restarting -r"; then
    echo "Reboot required. Rebooting $host..."
    ssh "$host" "kbe create pre-upgrade-$STAMP && systemctl reboot" || true
    sleep 120  # Wait for reboot
  fi

  # Post-upgrade health check: retry for up to 3 minutes, then stop the rollout
  echo "Post-upgrade health check (waiting for $host to be ready)..."
  healthy=0
  for i in {1..12}; do
    if ssh "$host" "$HEALTH_CHECK"; then healthy=1; break; fi
    sleep 15
  done
  (( healthy )) || { echo "FAIL: $host failed post-upgrade health check. Stopping."; exit 1; }

  echo "$host upgraded successfully."
  echo "Waiting ${WAIT_BETWEEN}s before next node..."
  sleep "$WAIT_BETWEEN"

done 3< "$HOSTS_FILE"

echo "Fleet upgrade complete."

Fleet-wide replication status

# Check replication lag across fleet
while IFS= read -r host; do
  # -n keeps ssh from reading the rest of the host list on stdin
  LATEST=$(ssh -n "$host" "zfs list -t snapshot -H -o name,creation -s creation | \
    grep auto- | tail -1" 2>/dev/null || echo "UNREACHABLE")
  printf "%-20s %s\n" "$host:" "$LATEST"
done < /etc/kldload/fleet/hosts.txt

Out-of-character note: Fleet management is where proper ZFS labeling pays off. zfs get -r com.kldload:tier | grep production gives you every production dataset across the fleet. Salt and Ansible target by grain or group. The labels ARE the inventory, the inventory IS the targeting. If your datasets don’t have labels, fleet management is ad-hoc. If they do, it’s systematic. Label everything. The Labeling & Asset Management Masterclass covers the labeling strategy in full.

8. Maintenance Windows

A maintenance window is not just time blocked on the calendar. It’s a structured procedure: announce, snapshot, execute, verify, close. Every step is documented. Every rollback path is identified before the window starts. The window is a success if the change is applied AND the rollback is never needed. The window is a failure if the rollback is needed and wasn’t prepared.

Planning a maintenance window

Phase	Action	Who
T-48h	Announce window, expected duration, affected services	Operator
T-24h	Verify rollback procedure. Document it. Test it on dev.	Operator
T-1h	Final health check. Take pre-maintenance snapshot.	Operator
T+0	Execute change. Follow the written procedure.	Operator
T+change	Verify: health check, smoke test, user confirmation	Operator + users
T+close	Close window. Announce completion. Document outcome.	Operator
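
The phase offsets in the table translate directly into calendar times once the window start is fixed. A minimal sketch, assuming GNU `date` and working in UTC to keep the arithmetic unambiguous; `window_timeline` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: print the pre-window timeline for a given start time.
# Usage: window_timeline "<YYYY-MM-DD HH:MM>"   (interpreted as UTC)
window_timeline() {
  local start_epoch pair label offset
  start_epoch=$(date -u -d "$1" +%s) || return 1
  for pair in "T-48h:172800" "T-24h:86400" "T-1h:3600" "T+0:0"; do
    label="${pair%%:*}"      # phase name from the table
    offset="${pair##*:}"     # seconds before the window opens
    printf '%-6s %s UTC\n' "$label" \
      "$(date -u -d "@$(( start_epoch - offset ))" '+%Y-%m-%d %H:%M')"
  done
}

window_timeline "2026-04-02 02:00"
```

Converting to epoch seconds first avoids the ambiguity of relative expressions such as "- 48 hours" appended to a timestamp, which GNU `date` can misread as a time zone offset.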

The maintenance window script

#!/bin/bash
# /usr/local/bin/maint-window
# Usage: maint-window start|finish|rollback "description"
set -euo pipefail

ACTION="${1:-}"
DESC="${2:-maintenance}"

# Most recent pre-maint snapshot on the pool root (-d 1 skips child datasets)
latest_snap() {
  zfs list -t snapshot -H -o name -s creation -d 1 rpool | grep "@pre-maint-" | tail -1 || true
}

case "$ACTION" in
  start)
    SNAP="pre-maint-$(date +%Y%m%d-%H%M)"
    echo "=== Starting maintenance window: $DESC ==="
    echo "Time: $(date)"
    echo "Taking pre-maintenance snapshot: $SNAP"
    zfs snapshot -r rpool@"$SNAP"
    echo "Snapshot created. Rollback command if needed:"
    echo "  maint-window rollback '$DESC'"
    echo ""
    echo "Maintenance window is OPEN. Execute your change."
    ;;
  finish)
    # Look up the snapshot taken by "start" rather than recomputing its name
    SNAP=$(latest_snap)
    echo "=== Closing maintenance window: $DESC ==="
    echo "Running post-maintenance health check..."
    morning-check || echo "WARNING: Health check reported issues. Investigate before declaring success."
    echo ""
    echo "Maintenance window CLOSED at $(date)."
    echo "Pre-maintenance snapshot retained: ${SNAP:-none found}"
    echo "Remove when confident: zfs destroy -r $SNAP"
    ;;
  rollback)
    # Find the most recent pre-maint snapshot
    SNAP=$(latest_snap)
    if [[ -z "$SNAP" ]]; then
      echo "No pre-maintenance snapshot found. Manual recovery required."
      exit 1
    fi
    echo "=== ROLLBACK: $DESC ==="
    echo "Rolling back to: $SNAP"
    echo "WARNING: This will discard all changes since the snapshot."
    read -rp "Type 'ROLLBACK' to confirm: " confirm
    [[ "$confirm" == "ROLLBACK" ]] || { echo "Aborted."; exit 1; }
    # zfs rollback -r discards newer snapshots but does not descend into
    # child datasets, so roll back every dataset that has this snapshot
    zfs list -H -o name -t snapshot -r rpool | grep "@${SNAP#*@}\$" | \
      xargs -n1 zfs rollback -r
    echo "Rollback complete. Run morning-check to verify."
    ;;
  *)
    echo "Usage: maint-window start|finish|rollback 'description'"
    exit 1
    ;;
esac

When maintenance goes wrong: rollback and communication

# Rollback procedure (fastest path)
# 1. Identify the pre-maintenance snapshot
zfs list -t snapshot -H -o name,creation -s creation | grep "pre-maint-" | tail -5

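# 2. Roll back. zfs rollback acts on one dataset at a time (near-instant for each).
#    -r destroys snapshots newer than the target; it does not recurse into child datasets.
#    Repeat for each affected dataset, or run: maint-window rollback "description"
zfs rollback -r rpool@pre-maint-20260402-0200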

# 3. If kernel was updated: reboot into previous boot environment
kbe list
kbe activate previous-be-name
systemctl reboot

# 4. Verify
morning-check

# Communication template (send to stakeholders immediately):
cat << 'EOF'
Subject: Maintenance Rollback - [service] - [date]

We encountered an issue during tonight's maintenance window for [service].

Status: Rolled back to pre-maintenance state at [time].
Impact: [describe what users experienced, duration]
Current status: All services restored to normal operation.

Root cause: [brief description]
Next steps: [when will we retry, what changes to the procedure]

We apologize for any disruption.
EOF

Out-of-character note: Every maintenance window starts with a snapshot. Every. Single. One. The cost is zero and the insurance is total. “I forgot to snapshot before the maintenance” is the sentence that precedes every “we need to rebuild from backup” conversation. The snapshot takes three seconds. Take it. The rollback is instant — zfs rollback takes less time than the application restart that follows it. There is no scenario where not snapshotting is the right call.

9. Health Check Endpoints

Every service should expose a health endpoint. Every health check should include pool health as a component — if the pool is degraded, the service is degraded, even if the application process is running.

Standard health endpoint conventions

# What each endpoint should return:
# GET /health  — liveness: is the process running? Returns 200 or 503
# GET /ready   — readiness: is the service ready to serve traffic? Returns 200 or 503
# GET /metrics — Prometheus metrics (if instrumented)

# Minimal health script (embed in any service)
cat > /usr/local/bin/health-check-service << 'EOF'
#!/bin/bash
# Returns 0 (healthy) or 1 (unhealthy)
# Check 1: ZFS pool
POOL_HEALTH=$(zpool list -H -o health tank 2>/dev/null || echo "FAULTED")
[[ "$POOL_HEALTH" == "ONLINE" ]] || { echo "UNHEALTHY: pool $POOL_HEALTH"; exit 1; }

# Check 2: Application process
pgrep -x myapp &>/dev/null || { echo "UNHEALTHY: myapp not running"; exit 1; }

# Check 3: Application port responding
curl -sf http://127.0.0.1:8080/ping &>/dev/null || { echo "UNHEALTHY: app not responding"; exit 1; }

echo "HEALTHY"
exit 0
EOF
chmod +x /usr/local/bin/health-check-service
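
The rule that a degraded pool means a degraded service maps directly onto the 200/503 convention listed above. Here is a minimal sketch of that mapping, assuming a hypothetical helper named health_http_status that whatever serves /health would call. The function name and the "up" argument are illustrative and not part of kldload.

```shell
# Hypothetical helper: translate pool health and app state into the
# HTTP status a /health endpoint should return.
#   $1 = pool health string, as printed by: zpool list -H -o health tank
#   $2 = "up" if the application checks passed, anything else otherwise
health_http_status() {
  local pool_health="$1" app_state="$2"
  if [ "$pool_health" = "ONLINE" ] && [ "$app_state" = "up" ]; then
    echo 200
  else
    # DEGRADED, FAULTED, UNAVAIL, or an application failure all report unhealthy
    echo 503
  fi
}

# Example wiring (commented out so the sketch has no side effects):
# health_http_status "$(zpool list -H -o health tank 2>/dev/null || echo FAULTED)" up
```

Keeping the decision in one small function means the load balancer, the probe, and the local script all agree on what "healthy" means.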

HAProxy health checks

# /etc/haproxy/haproxy.cfg
backend app_servers
  option httpchk GET /health
  http-check expect status 200

  server app1 10.201.0.1:8080 check inter 10s rise 2 fall 3
  server app2 10.201.0.2:8080 check inter 10s rise 2 fall 3

# ZFS health as a backend check: use custom health endpoint
# GET /health returns 503 if pool is not ONLINE
backend storage_nodes
  option httpchk GET /health
  http-check expect status 200
  server stor1 10.203.0.1:9100 check inter 30s

Prometheus blackbox_exporter endpoint monitoring

# /etc/prometheus/prometheus.yml (blackbox probe section)
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://10.201.0.1:8080/health
          - http://10.201.0.2:8080/health
          - http://10.202.0.1:9090/-/healthy    # Prometheus itself
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115            # blackbox_exporter

# Alert: service down for > 1 minute
# - alert: ServiceDown
#   expr: probe_success == 0
#   for: 1m
#   labels:
#     severity: critical

10. Log Management

Logs are the audit trail of everything that happened. OpenZFS events, WireGuard handshakes, authentication attempts, application errors: all of them flow into the systemd journal. The discipline is knowing what to look for and when.

What to log and where

Source	Where	Retention
systemd units	Journal (automatic)	30 days local
ZFS events (zed)	Journal + /var/log/zed.log	30 days local, 1 year archive
WireGuard handshakes	Journal (kernel messages)	30 days local
Application logs	Journal (via stdout) or /var/log/app/	90 days central (Loki)
Audit log (auth)	Journal + /var/log/audit/	1 year (compliance)

Journal configuration

# /etc/systemd/journald.conf
[Journal]
# journald.conf does not support trailing comments; keep each comment on its own line
# Cap journal at 4GB
SystemMaxUse=4G
# Leave 2GB free on the partition
SystemKeepFree=2G
# 30 days
MaxRetentionSec=2592000
# Compress large journal entries
Compress=yes
# Don’t double-log to syslog
ForwardToSyslog=no

# Apply
systemctl restart systemd-journald

ZFS event daemon

# Enable ZFS event daemon: sends email/executes scripts on ZFS events
systemctl enable --now zfs-zed

# ZED config: /etc/zfs/zed.d/zed.rc
# Set ZED_EMAIL_ADDR for email alerts on pool degradation, checksum errors, etc.
# ZED_EMAIL_ADDR="ops@yourdomain.com"
# ZED_EMAIL_PROG="mail"        # zed.rc default; a different mailer also needs ZED_EMAIL_OPTS adjusted

# Test ZED is working:
zpool scrub tank
journalctl -t zed --no-pager -n 20

Log-based alerting: patterns that matter

# Patterns to alert on (via Loki alerting rules or a simple grep in cron)
# OOM killer
journalctl -p err --no-pager | grep "Out of memory"

# Segfault
journalctl --no-pager | grep "segfault at"

# Authentication failure (more than 10 in an hour)
journalctl -t sshd --since "1 hour ago" --no-pager | grep -c "Failed password"

# ZFS checksum errors
journalctl -t kernel --no-pager | grep -i "zfs.*checksum\|zfs.*error"

# Disk I/O errors
journalctl -t kernel --no-pager | grep -i "I/O error\|blk_update_request"

# Cron job for alert scanning (runs every 5 minutes)
cat > /etc/cron.d/log-alert-scan << 'EOF'
*/5 * * * * root /usr/local/bin/log-alert-scan 2>&1 | systemd-cat -t log-alert-scan -p warning
EOF
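
The cron entry above calls /usr/local/bin/log-alert-scan, which this guide does not show. The following is a minimal sketch of what its core could look like, built only from the patterns listed above. The function name scan_journal_text, the ALERT output format, and the choice to apply the threshold of 10 to whatever window is piped in are all assumptions.

```shell
# Hypothetical core of /usr/local/bin/log-alert-scan (not shown in the guide).
# Reads journal text on stdin and prints one ALERT line per pattern that fired.
scan_journal_text() {
  local text oom segv auth io
  text="$(cat)"
  # grep -c prints the match count; "|| true" absorbs grep's exit 1 on zero matches
  oom=$(printf '%s\n' "$text" | grep -c "Out of memory" || true)
  segv=$(printf '%s\n' "$text" | grep -c "segfault at" || true)
  auth=$(printf '%s\n' "$text" | grep -c "Failed password" || true)
  io=$(printf '%s\n' "$text" | grep -ci "I/O error" || true)
  [ "$oom" -gt 0 ] && echo "ALERT oom count=$oom"
  [ "$segv" -gt 0 ] && echo "ALERT segfault count=$segv"
  # More than 10 failed passwords in the scanned window counts as an alert
  [ "$auth" -gt 10 ] && echo "ALERT auth-failures count=$auth"
  [ "$io" -gt 0 ] && echo "ALERT disk-io count=$io"
  return 0
}

# Example wiring for the cron job (commented out; the 5-minute window is an assumption):
# journalctl --since "5 minutes ago" --no-pager | scan_journal_text
```

Because the function reads plain text on stdin, it can be exercised with canned log lines before it is trusted in cron.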

The log review ritual

# Weekly log review: what to look for
# 1. Error count trend (is it going up?)
journalctl --since "1 week ago" -p err --no-pager -q | wc -l

# 2. Top error sources
journalctl --since "1 week ago" -p err --no-pager -q | \
  awk '{print $5}' | sort | uniq -c | sort -rn | head -20

# 3. Any new error patterns not seen before?
journalctl --since "1 week ago" -p err --no-pager -q | \
  grep -v "audit\[" | grep -v "CRON\[" | head -50

# 4. Authentication anomalies
journalctl --since "1 week ago" -t sshd --no-pager | \
  grep -E "Accepted|Failed|Invalid" | \
  awk '{print $9, $11}' | sort | uniq -c | sort -rn | head -20
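
A single weekly count does not show a trend by itself; it needs something to compare against. The following sketch compares this week with the previous one, assuming a hypothetical helper named error_trend. The wiring in the comments reuses the journalctl filters from step 1.

```shell
# Hypothetical helper: classify a week-over-week change in error count.
#   $1 = errors in the previous week, $2 = errors in the current week
error_trend() {
  local prev="$1" cur="$2"
  if [ "$cur" -gt "$prev" ]; then
    echo "up $((cur - prev))"
  elif [ "$cur" -lt "$prev" ]; then
    echo "down $((prev - cur))"
  else
    echo "flat"
  fi
}

# Example wiring (commented out so the sketch has no side effects):
# prev=$(journalctl --since "2 weeks ago" --until "1 week ago" -p err --no-pager -q | wc -l)
# cur=$(journalctl --since "1 week ago" -p err --no-pager -q | wc -l)
# error_trend "$prev" "$cur"
```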

11. Upgrade Procedures

Every upgrade is the same structure: snapshot, update, verify. The only variable is what you’re updating and whether it requires a reboot. The boot environment is the safety net. Use it for every kernel and ZFS update.

The upgrade calendar

Type	Cadence	Procedure
Security patches	Within 48h of release	Snapshot, update, verify
Feature updates	Monthly (maintenance window)	Snapshot, update, verify
Kernel updates	Monthly or as needed	Boot environment, update, reboot, verify
ZFS updates	Monthly (with kernel)	Boot environment, update, reboot, verify module
Major OS versions	Quarterly (planned)	Test on dev, maintenance window, full rollback ready

Package updates (no reboot required)

# Snapshot first
zfs snapshot -r rpool@pre-update-$(date +%Y%m%d)

# Review what changes
dnf check-update                          # RPM
apt-get update && apt list --upgradable 2>/dev/null   # APT (refresh package lists first)

# Apply updates
dnf upgrade -y                            # RPM
apt-get upgrade -y                        # APT

# Verify: check critical services still running
morning-check

Kernel updates (reboot required)

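# Create a boot environment BEFORE updating, so it captures the known-good system
kbe create pre-kernel-$(date +%Y%m%d)
kbe list

# Apply kernel update (DKMS builds against kernel-devel / linux-headers, so update them together)
dnf upgrade -y kernel kernel-devel                            # RPM
apt-get install -y linux-image-generic linux-headers-generic  # APT

# Rebuild the initramfs for every installed kernel
# (plain "dracut --force" only rebuilds for the running kernel, not the new one)
dracut --force --regenerate-all            # RPM
update-initramfs -u -k all                 # APT

# Verify the new initramfs includes ZFS (critical: if missing, the pool won’t mount at boot)
# NEW_KERNEL_VERSION is a placeholder for the release you just installed
lsinitrd -k NEW_KERNEL_VERSION | grep -i zfs                     # RPM
lsinitramfs /boot/initrd.img-NEW_KERNEL_VERSION | grep -i zfs    # APT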

# Reboot
systemctl reboot

# After reboot: verify new kernel booted and pool is healthy
uname -r
zpool status
kst

# If the new kernel fails to boot:
# Select previous boot environment from GRUB menu
# kbe activate pre-kernel-YYYYMMDD
# systemctl reboot

ZFS module updates

# ZFS needs special handling: DKMS rebuild on kernel update
# This is automatic if zfs-dkms is installed correctly

# Verify ZFS module is loaded for the new kernel
modinfo zfs | grep vermagic

# Check DKMS build status
dkms status | grep zfs

# If DKMS build failed, rebuild manually
# VERSION is a placeholder: use the zfs version printed by "dkms status"
dkms build zfs/VERSION -k $(uname -r)
dkms install zfs/VERSION -k $(uname -r)

# Verify module loads
modprobe zfs
zfs version
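
The vermagic check above is easy to misread by eye, and it is most useful before the reboot, against the newly installed kernel. The following sketch does the comparison, assuming a hypothetical helper named zfs_module_matches_kernel. The modinfo -k and -F options used in the comments select a kernel release and a single field.

```shell
# Hypothetical helper: does the zfs module's vermagic match a kernel release?
#   $1 = vermagic string, e.g. from: modinfo -F vermagic zfs
#   $2 = kernel release to compare against, e.g. from: uname -r
zfs_module_matches_kernel() {
  local vermagic="$1" kernel="$2"
  # The first whitespace-separated word of vermagic is the kernel release
  [ "${vermagic%% *}" = "$kernel" ]
}

# Example wiring (commented out; NEW_KERNEL_VERSION is a placeholder):
# new=NEW_KERNEL_VERSION
# zfs_module_matches_kernel "$(modinfo -k "$new" -F vermagic zfs)" "$new" \
#   || echo "zfs module missing or built for a different kernel; rebuild with dkms"
```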

Application updates

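# Stop the application, then snapshot its dataset (a quiesced snapshot is consistent).
# Build the snapshot name once so the rollback below targets exactly the same snapshot.
SNAP="rpool/data/myapp@pre-update-$(date +%Y%m%d-%H%M)"
systemctl stop myapp
zfs snapshot "$SNAP"

# Update the application
# (deploy new version here)
systemctl start myapp

# Smoke test (retries for about 30 seconds while the service starts)
curl -sf --retry 10 --retry-delay 3 --retry-connrefused http://localhost:8080/health || {
  echo "FAIL: rolling back..."
  systemctl stop myapp
  zfs rollback "$SNAP"
  systemctl start myapp
  echo "Rolled back."
  exit 1
}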

echo "Update successful. Monitoring for 1 hour before confirming."
# Monitor logs:
journalctl -u myapp -f

Out-of-character note: The boot environment is the safety net for kernel and ZFS updates. If the new kernel doesn’t boot, select the previous boot environment from the GRUB menu — the old environment is intact and the pool is unchanged. Total downtime: one reboot cycle (30 seconds). Without boot environments, a bad kernel update means booting from USB and manually recovering. The cost of creating a boot environment before every kernel update is three seconds. Take it.

12. Troubleshooting Decision Tree

When something breaks, the decision tree tells you where to look. Start at the top. Follow the branches. Each branch ends with a command. Run it. Read the output. Go to the next branch.

"Something is slow"

# Step 1: Is it ZFS?
zpool iostat -vl 1 5      # -l adds latency columns; look for high w/r latency (>10ms sustained = investigate)
zfs get compressratio,logicalused tank

# Step 2: Is ARC being missed?
awk '/^hits/{h=$3}/^misses/{m=$3}END{printf "ARC hit ratio: %.1f%%\n",(h/(h+m))*100}' \
  /proc/spl/kstat/zfs/arcstats
# Hit ratio < 80% = working set exceeds ARC; add RAM or expect cache-cold I/O

# Step 3: Is it CPU?
top -bn1 | head -20
sar -u 1 5                # CPU utilization per second

# Step 4: Is it network?
iftop -i eth0             # Live network utilization
ss -s                     # Socket summary: any backlogs?
wg show                   # WireGuard: are all tunnels up and healthy?

# Step 5: Is it the application?
journalctl -u myapp -p warning --since "1 hour ago" --no-pager | tail -50

"Something is down"

# Step 1: What is down?
systemctl --failed --no-legend

# Step 2: Why did it fail? (UNIT is a placeholder for a unit name from step 1)
systemctl status UNIT
journalctl -u UNIT --since "30 minutes ago" --no-pager | tail -50

# Step 3: Is it a dependency?
systemctl list-dependencies UNIT --failed

# Step 4: Network issue?
wg show                   # WireGuard tunnels
ss -tlnp                  # What is listening?
nft list ruleset          # Is nftables blocking?

# Step 5: Storage issue?
zpool status              # Pool degraded?
df -h                     # Out of space?

"Disk failed"

# Step 1: Identify the failed disk
zpool status -v           # Shows which disk(s) are FAULTED or removed

# Step 2: Map to physical disk (use labels)
# kldload labels disks with dataset and pool information
zpool status -P           # Show full device paths
zdb -C | grep -A5 "path"  # Pool configuration with paths

# Step 3: A DEGRADED pool is still imported and serving data, so it’s safe to replace the disk online
# Order the replacement disk. Data stays available, but redundancy is reduced until the resilver completes.

# Step 4: Offline the failed disk (if it hasn’t auto-offlined)
zpool offline tank sda3

# Step 5: Replace the disk
zpool replace tank /dev/old-disk /dev/new-disk

# Step 6: Monitor resilver progress
watch -n 10 'zpool status | grep -A5 "scan:"'

# Step 7: Verify resilver completes without errors
zpool status
# Should show: scan: resilvered X with 0 errors

"Out of space"

# Step 1: Where is the space going?
zfs list -H -o name,used,usedsnap,usedbydataset -t filesystem | \
  sort -k2 -h -r | head -20

# Step 2: Snapshot space?
zfs list -H -t snapshot -o name,used | sort -k2 -h -r | head -20

# Step 3: Prune old snapshots (carefully)
# List snapshots older than 30 days (-p prints creation as epoch seconds, so no date parsing is needed)
zfs list -t snapshot -Hp -o name,creation -s creation | \
  awk -v now="$(date +%s)" '{ age=int((now-$2)/86400); if (age>30) print age" days: "$1 }'

# Step 4: Destroy selected old snapshots
zfs destroy rpool@old-snapshot-name

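# Step 5: Use send|recv to offload cold data (example: archive dataset)
zfs send tank/cold-data@snap | ssh dr-host "zfs recv archive/cold-data"
# Confirm the snapshot arrived before destroying the source; the destroy is irreversible
ssh dr-host "zfs list -t snapshot archive/cold-data@snap" && zfs destroy -r tank/cold-data

# Step 6: Expand the pool (add a vdev, or grow an existing vdev if possible)
# Match the redundancy of the existing vdevs: a single disk added to a mirror or raidz pool
# has no redundancy of its own. Example for a pool built from mirrors;
# -n previews the new layout without changing the pool, drop it to apply.
zpool add -n tank mirror /dev/new-disk-1 /dev/new-disk-2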
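
Running out of space is easier to handle before it happens. The following sketch flags pools that have crossed a capacity threshold, assuming a hypothetical helper named pools_over_capacity. The 80 percent default is an assumed value, not a figure from this guide.

```shell
# Hypothetical helper: flag pools at or above a capacity threshold.
# Reads "name<TAB>capacity%" lines on stdin, as printed by:
#   zpool list -H -o name,capacity
#   $1 = threshold in percent (default 80, an assumed value)
pools_over_capacity() {
  local threshold="${1:-80}"
  awk -v t="$threshold" '{
    pct = $2
    sub(/%$/, "", pct)                       # strip the trailing percent sign
    if (pct + 0 >= t + 0) print $1 " " pct "%"
  }'
}

# Example wiring (commented out so the sketch has no side effects):
# zpool list -H -o name,capacity | pools_over_capacity 80
```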

"Replication lagging"

# Step 1: Check WireGuard tunnel to replication target
wg show wg3               # Storage plane should have recent handshake
ping 10.203.0.2           # Can you reach the DR host over the storage plane?

# Step 2: Check source snapshot freshness
zfs list -t snapshot -H -o name,creation -s creation | tail -5

# Step 3: Check syncoid status and errors
systemctl status syncoid
journalctl -u syncoid --since "24 hours ago" --no-pager | tail -50

# Step 4: Check bandwidth between source and DR (push 100MB through the SSH link; dd prints the rate)
dd if=/dev/zero bs=1M count=100 | ssh 10.203.0.2 "cat > /dev/null"

# Step 5: Run syncoid manually to see live output
syncoid --no-privilege-elevation rpool/data 10.203.0.2:backup/data
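
To put a number on "lagging", compare the age of the newest snapshot on the DR side with your replication interval. The following sketch assumes a hypothetical helper named newest_snapshot_age_hours. It relies on zfs list -p, which prints creation times as epoch seconds.

```shell
# Hypothetical helper: age in whole hours of the newest snapshot in a listing.
# Reads "name<TAB>creation-epoch" lines on stdin, as printed by:
#   zfs list -t snapshot -Hp -o name,creation
#   $1 = current time as epoch seconds (pass "$(date +%s)")
# Returns non-zero and prints nothing when the listing is empty.
newest_snapshot_age_hours() {
  local now="$1"
  awk -v now="$now" '
    $2 > newest { newest = $2 }
    END { if (NR == 0) exit 1; print int((now - newest) / 3600) }
  '
}

# Example wiring (commented out; host and dataset are taken from the syncoid example above):
# ssh 10.203.0.2 "zfs list -t snapshot -Hp -o name,creation -r backup/data" \
#   | newest_snapshot_age_hours "$(date +%s)"
```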

"Can’t SSH"

# Step 1: Is WireGuard up?
wg show wg1               # Management plane
# Look for recent handshake (< 180 seconds)

# Step 2: If handshake is stale, check the peer’s endpoint
wg show wg1 endpoints     # Correct IP:port?

# Step 3: Is sshd bound to the correct address?
ss -tlnp | grep :22       # Should show 10.201.0.1:22 (management plane only)

# Step 4: Is nftables blocking?
nft list ruleset | grep -A20 "chain input"
# Confirm management plane (wg1) is allowed for SSH

# Step 5: Check sshd config
grep -E "ListenAddress|PermitRootLogin|AllowUsers|AuthorizedKeysFile" /etc/ssh/sshd_config

# Step 6: Check journal for sshd errors
journalctl -t sshd --since "1 hour ago" --no-pager | tail -30
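
The handshake check in step 1 can be scripted rather than read by eye. The following sketch applies the 180-second rule, assuming a hypothetical helper named stale_wg_peers. It expects the tab-separated "public key, epoch" lines printed by wg show wg1 latest-handshakes, where a timestamp of 0 means the peer has never completed a handshake.

```shell
# Hypothetical helper: list WireGuard peers whose last handshake is stale.
# Reads "public-key<TAB>epoch" lines on stdin, as printed by:
#   wg show wg1 latest-handshakes
#   $1 = current time as epoch seconds, $2 = max age in seconds (default 180)
stale_wg_peers() {
  local now="$1" max_age="${2:-180}"
  awk -v now="$now" -v max="$max_age" '
    $2 == 0        { print $1 " never"; next }      # no handshake yet
    now - $2 > max { print $1 " " (now - $2) "s" }  # older than the limit
  '
}

# Example wiring (commented out so the sketch has no side effects):
# wg show wg1 latest-handshakes | stale_wg_peers "$(date +%s)"
```

Empty output means every peer on the interface has a fresh handshake.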

13. The Ops Toolkit

Every command an operator needs, organized by category. Print this section. Pin it to the wall. Reference it when you’re working under pressure and can’t remember the exact flags.

Health

kst — kldload status overview
zpool status — pool state and disk health
zpool status -v — with per-file error details (use -P for full device paths)
systemctl --failed — failed units
wg show — WireGuard peer status
morning-check — full daily health check

Storage

zfs list — all filesystems
zfs list -t snapshot -s creation — snapshots by age
zpool iostat -v 1 — live I/O per vdev
arc_summary — ARC statistics
zfs get compressratio — compression ratio
zpool list -o name,capacity,health — pool usage

Snapshots

ksnap — create a snapshot
sanoid --take-snapshots — force sanoid snapshot cycle
zfs snapshot -r rpool@name — recursive snapshot
zfs rollback -r rpool/data@snap — rollback
zfs clone snap dest — create a clone
zfs destroy -r rpool@old-snap — prune snapshot

Replication

syncoid --dryrun src dst — test replication
syncoid src host:dst — replicate dataset
zfs send -R snap | ssh host zfs recv dst — manual send
zfs get com.kldload:dr-target — check DR label
journalctl -u syncoid — replication logs

Network

wg show — all WireGuard interfaces
wg show wg1 latest-handshakes — handshake times
ip addr — interface addresses
ss -tlnp — listening services
nft list ruleset — firewall rules
ping 10.201.0.2 — test WireGuard connectivity

Monitoring

curl localhost:9090/metrics — Prometheus metrics
journalctl -p err --since today — today’s errors
journalctl -f -u myservice — follow service log
bpftrace -e 'tracepoint:...' — eBPF tracing
tcpconnect — trace outbound connections
sar -u 1 10 — CPU utilization

Lifecycle

kbe list — list boot environments
kbe create name — create boot environment
kbe activate name — switch boot environment
kupgrade — upgrade with boot environment
maint-window start "desc" — open maintenance window
maint-window finish "desc" — close maintenance window

Scrub & Health

zpool scrub tank — start scrub
zpool scrub -s tank — stop scrub
zpool status | grep scan — scrub progress
smartctl -H /dev/sda — SMART health
smartctl -A /dev/sda — SMART attributes
zpool clear tank — clear error counts after fix

14. Handoff and Documentation

The ultimate test of your operations: if you disappeared tomorrow, could someone else run your infrastructure from the documentation alone? If the answer is no, the documentation is incomplete. This section is about making the answer yes.

The ops runbook: one document, entire environment

# Template: /etc/kldload/runbook.md
# Keep this document current. Review it quarterly. Test it annually.

# Section 1: Infrastructure overview
# - How many nodes? What are they? What do they run?
# - Network topology (WireGuard planes, subnets)
# - Storage pools: which pool on which node, what data

# Section 2: Access
# - How to SSH into each node (WireGuard first!)
# - Where are the keys stored?
# - Emergency access if WireGuard is down (IPMI/console)

# Section 3: Daily operations
# - How to run the morning check
# - What to do for each red/yellow result

# Section 4: Common procedures (link to each procedure)
# - Replace a failed disk
# - Restore from snapshot
# - Roll back an upgrade
# - Fail over to DR
# - Add a new node
# - Rotate WireGuard keys

# Section 5: Emergency contacts
# - Primary operator
# - Backup operator
# - Hardware vendor support contacts
# - Network provider support

# Section 6: Escalation path
# - What can be fixed without escalation?
# - When to call the backup operator?
# - When to declare a disaster and invoke DR?

The on-call handoff

# On-call handoff template (written at shift change):
cat << 'EOF'
On-Call Handoff - [date] [time]
Outgoing: [name]
Incoming: [name]

CURRENT STATUS:
- All pools: [ONLINE/any issues]
- All services: [green/any issues]
- Active incidents: [none / describe]
- Pending maintenance: [none / describe]

WATCH LIST (things to keep an eye on):
- [any slow-burning issues, trends to watch]

RECENT CHANGES:
- [what changed in the last shift]

OPEN ITEMS:
- [anything that needs follow-up but isn’t urgent]

The runbook is at /etc/kldload/runbook.md
Morning check: morning-check
Alerts: Grafana at https://10.202.0.1:3000
EOF

New operator onboarding

# First hour: orient
# 1. Read the runbook (/etc/kldload/runbook.md)
# 2. Run the morning check (morning-check)
# 3. Review the pool topology (zpool status -v for each pool)
# 4. Review the WireGuard mesh (wg show all)
# 5. Look at Grafana dashboards with a senior operator

# First week: supervised operations
# 1. Run the morning check every day, discuss results
# 2. Observe a maintenance window
# 3. Practice snapshot and rollback on a non-production dataset
# 4. Walk through the DR procedure (don’t execute, just read)

# First month: independent operations
# 1. Run all weekly maintenance tasks
# 2. Participate in monthly review
# 3. Execute at least one maintenance window
# 4. Add or update a section of the runbook based on what was unclear

Out-of-character note: The “hit by a bus” test: if you disappeared tomorrow, could someone who’s never seen this infrastructure operate it from the documentation alone? If the labeling is correct, the naming conventions are followed, and the runbooks exist, the answer should be yes. If the answer is no, find out what’s missing and write it down. Not for the hypothetical emergency — for the Monday morning when you’re on vacation and someone needs to replace a disk. The runbook is the system. The infrastructure just runs it.

Blue/Green & SRE Masterclass — the philosophy and principles that underpin everything on this page
ZFS Masterclass — deep dive on pool design, snapshots, replication, and tuning
Backplane Networks Masterclass — designing and operating the encrypted infrastructure network
Observability Masterclass — Prometheus, Grafana, and alerting in depth
Boot Environments tutorial — full guide to creating, activating, and managing boot environments
Snapshots Guide — sanoid configuration, retention policies, and manual snapshot management
Backup & Disaster Recovery masterclass — end-to-end DR setup with syncoid and automated failover
