Blue/Green & SRE Masterclass
This guide treats site reliability engineering the way Google intended it: as an engineering discipline with measurable targets, defined error budgets, and automation that eliminates manual toil. It then applies every principle to infrastructure you own (bare metal, OpenZFS, WireGuard, kldload), where the primitives are often better than what cloud providers charge a premium for.
By the end you will have defined SLOs, built blue/green and canary workflows around OpenZFS clones, written runbooks for every alert, and mapped your deployment maturity to a concrete checklist. The progression is zero to hero: each section builds on the last.
Site Reliability Engineering is not DevOps with a different title. SRE is the discipline of running production systems reliably at scale, developed inside Google and documented publicly in its 2016 SRE book. The core principles are simple but radical: SLOs define your reliability target, error budgets determine how much risk you are allowed to take, toil is the enemy of every operations team, automation is the only answer that scales, and every change is a potential incident until proven otherwise.
On OpenZFS, these principles gain superpowers. Rollback is 2 seconds — not 2 hours. Cloning production for a test environment is free — not a separate infrastructure cost. Replication is incremental and block-level — not rsync guessing which files changed. Boot environment rollback means a bad kernel or bad configuration is a reboot away from recovery, not a reinstall. Every SRE operation that is expensive on other platforms becomes trivial on OpenZFS.
What this masterclass covers: SLIs, SLOs, and error budgets from first principles. OpenZFS as an SRE primitive. Blue/green and canary deployments with kldload. Incident response, change management, toil elimination, capacity planning, disaster recovery, observability, on-call runbooks, and a maturity model you can use to assess where you are and what to fix next.
1. SRE Is Not DevOps with a Different Title
DevOps is a cultural movement: break down the wall between development and operations, ship faster, deploy more often. SRE is a specific implementation of that culture with engineering discipline: you measure reliability precisely, you set targets for it, and you treat the gap between current reliability and the target as a budget you can spend on risk.
The Google SRE Book (2016, free online) defines the discipline. The core insight is that reliability is a feature, and like any feature, you need to decide how much of it to build. Too little reliability and users leave. Too much reliability and you are wasting engineering resources on uptime you do not need. The error budget is the mechanism that enforces this balance automatically.
SLI — Service Level Indicator
A quantitative measure of some aspect of the service. The raw signal. Examples: request success rate, latency at the 99th percentile, storage pool availability, replication lag in seconds.
SLO — Service Level Objective
A target for an SLI. The goal. Not a promise to users — that is an SLA. The SLO is your internal engineering target that gives you a budget for risk. Set it below what you can actually achieve.
Error Budget
The allowed unreliability: 100% minus the SLO. At 99.9% availability, your budget is 0.1% — 43.8 minutes per month. That budget is yours to spend on deployments, experiments, and migrations. When it runs out, you freeze changes.
Toil
Manual, repetitive, automatable work that scales linearly with service size. Running backup scripts by hand. Pruning snapshots manually. Renewing certificates by reminder email. Toil is the enemy — it consumes engineering time without improving reliability.
2. SLIs, SLOs, and Error Budgets — the Foundation
The theory is simple. Making it concrete requires choosing the right SLIs for your services, setting realistic SLO targets, and building the infrastructure to measure them continuously. Here is how to do it for a kldload web application.
Defining SLIs for a kldload web application
Start with the four golden signals (detailed in Section 12): latency, traffic, errors, and saturation. Map each to a concrete SLI:
# SLI 1: Availability — fraction of requests that succeed
# Measure: HTTP 2xx responses / total HTTP responses (excluding health checks)
# Source: nginx access log, application metrics, or Prometheus http_requests_total
# SLI 2: Latency — fraction of requests served within threshold
# Measure: requests with latency < 300ms / total requests
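# Source: sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
# (assumes a 0.3s histogram bucket exists; histogram_quantile(0.99, ...) returns the p99 value, not this fraction)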
# SLI 3: Throughput — requests per second (capacity SLI)
# Measure: rate(http_requests_total[5m])
# Alert when: drops below expected baseline (traffic anomaly)
# SLI 4: Freshness — for data services, age of newest record
# Measure: time since last successful write or replication cycle
# Source: custom gauge, or syncoid completion timestamp
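The availability SLI above can be computed directly from an access log. The following is a minimal sketch, assuming nginx's default "combined" log format (request path in field 7, status in field 9) and a health-check path of /health; the function name `availability_sli` and the sample log lines are illustrative, not part of kldload.

```shell
#!/usr/bin/env bash
# availability_sli: reads nginx "combined" access log lines on stdin,
# skips health checks, and prints the fraction of 2xx responses.
availability_sli() {
    awk '
        $7 == "/health" { next }            # health checks are excluded from the SLI
        { total++ }                         # every other request counts toward the total
        $9 ~ /^2[0-9][0-9]$/ { good++ }     # 2xx responses count as successes
        END {
            if (total == 0) { print "no-data"; exit }
            printf "%.4f\n", good / total
        }'
}

# Demo with four sample requests plus one health check (hypothetical data)
availability_sli <<'EOF'
10.0.0.1 - - [02/Apr/2026:14:30:00 +0000] "GET /api/items HTTP/1.1" 200 512 "-" "curl"
10.0.0.2 - - [02/Apr/2026:14:30:01 +0000] "GET /api/items HTTP/1.1" 200 512 "-" "curl"
10.0.0.3 - - [02/Apr/2026:14:30:02 +0000] "POST /api/cart HTTP/1.1" 201 64 "-" "curl"
10.0.0.4 - - [02/Apr/2026:14:30:03 +0000] "GET /api/items HTTP/1.1" 500 0 "-" "curl"
10.0.0.5 - - [02/Apr/2026:14:30:04 +0000] "GET /health HTTP/1.1" 200 2 "-" "probe"
EOF
```

On a real host the input would be the access log itself, for example `availability_sli < /var/log/nginx/access.log`, if that is where your logs live.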
Setting SLOs
SLOs should be set at what your users actually need, not at what you think you can achieve. Start conservative. A 99.5% SLO with a real error budget is more valuable than an aspirational 99.99% that you never measure.
| SLO | Downtime / year | Downtime / month | Downtime / week | Error budget (30 days) |
|---|---|---|---|---|
| 99% | 87.6 hours | 7.3 hours | 1.7 hours | 432 minutes |
| 99.5% | 43.8 hours | 3.6 hours | 50 minutes | 216 minutes |
| 99.9% | 8.7 hours | 43.8 minutes | 10.1 minutes | 43.2 minutes |
| 99.95% | 4.4 hours | 21.9 minutes | 5.0 minutes | 21.6 minutes |
| 99.99% | 52.6 minutes | 4.4 minutes | 1.0 minute | 4.3 minutes |
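The budget figures follow from one multiplication: window length times the allowed failure fraction. The sketch below is a hypothetical helper (`error_budget_minutes` is not a kldload command) that computes the budget for any SLO and window. Note that a strict 30-day window gives 43.2 minutes at 99.9%, while the 43.8-minute figure corresponds to an average calendar month of 365.25/12 days.

```shell
#!/usr/bin/env bash
# error_budget_minutes SLO_PERCENT WINDOW_DAYS
# Prints the allowed downtime, in minutes, for the SLO over the window.
error_budget_minutes() {
    awk -v slo="$1" -v days="$2" 'BEGIN {
        window_minutes = days * 24 * 60
        printf "%.1f\n", window_minutes * (100 - slo) / 100
    }'
}

# Budgets for a 30-day rolling window
for slo in 99 99.5 99.9 99.95 99.99; do
    printf '%s%% -> %s minutes\n' "$slo" "$(error_budget_minutes "$slo" 30)"
done
```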
The error budget policy
An error budget without a policy is just a number. The policy defines what happens when the budget reaches specific thresholds:
# Error budget policy — example for a 99.9% SLO (43.8 min/month)
# Budget > 50% remaining: normal operations, deploy freely
# Budget 25-50% remaining: all deployments require pre-deploy snapshot + rollback plan
# Budget 10-25% remaining: new features freeze, reliability work only
# Budget < 10% remaining: all changes freeze, on-call focus shifts to root cause
# Budget exhausted (0%): incident declared, postmortem required before any deploys resume
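A policy like this is easiest to enforce when a deploy pipeline can query it. The following is a minimal sketch, assuming the remaining budget is already available as a percentage; the function name `budget_policy` and the output labels are illustrative. The policy above leaves the exact boundaries open, so this sketch treats 25% as still in the "snapshot + rollback plan" tier and 10% as still in the "feature freeze" tier.

```shell
#!/usr/bin/env bash
# budget_policy REMAINING_PERCENT
# Maps the remaining error budget to the action defined by the policy.
budget_policy() {
    awk -v r="$1" 'BEGIN {
        if (r <= 0)       print "exhausted: incident declared, postmortem before deploys resume"
        else if (r < 10)  print "freeze: all changes stop, on-call focuses on root cause"
        else if (r < 25)  print "restricted: feature freeze, reliability work only"
        else if (r <= 50) print "guarded: pre-deploy snapshot and rollback plan required"
        else              print "normal: deploy freely"
    }'
}

budget_policy 72
budget_policy 30
budget_policy 4
```

A deploy script could call this first and refuse to continue unless the result starts with `normal` or `guarded`.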
3. OpenZFS as an SRE Primitive
Traditional SRE treats rollback as expensive. Restore from backup (slow), redeploy (slow), verify (slow), notify stakeholders (slow). The total time from "we need to rollback" to "rollback complete" is measured in hours. On OpenZFS, that same operation is measured in seconds — and the rollback is atomic and byte-perfect.
This changes the economics of reliability engineering. When rollback is expensive, you are conservative about changes. When rollback is 2 seconds, you can take more risk — which means more deployments, more experiments, and faster iteration, all within the same error budget.
Snapshots — instant recovery points
An OpenZFS snapshot is atomic, zero-cost at creation, and captures the exact filesystem state at that moment. Create one before every change: before a deploy, before a migration, before a package update. If anything breaks, roll back instantly.
Clones — zero-cost environment copies
A ZFS clone is a writable copy of a snapshot. It consumes no additional disk space until it diverges from the parent. Clone production to create a staging environment, the green half of a blue/green pair, or a canary node, all at no initial storage cost.
Send/receive — measurable RPO
zfs send | zfs receive streams incremental block-level changes to a remote replica. Every snapshot sent gives you a measurable RPO: if you replicate every 15 minutes, your worst-case RPO is about 15 minutes plus the time the transfer takes. syncoid automates this with retry logic and progress reporting.
Boot environments — OS-level rollback
kldload installs the OS itself onto a ZFS dataset. Before a kernel update, package upgrade, or configuration change, create a boot environment. If the system fails to boot or misbehaves, select the previous boot environment at the bootloader. Recovery is one reboot.
Checksums — silent corruption detection
Every block written to an OpenZFS pool is checksummed on write and verified on read. Bit rot, failing hardware, and write errors are detected instead of passing silently. On a mirror or RAIDZ pool, corrupted blocks are automatically repaired from redundant copies. No manual fsck needed.
Compression — transparent and always-on
lz4 compression on all datasets costs nearly nothing in CPU and typically reduces storage usage by 30-60% for text, logs, and database content. This extends the time before you need to expand storage capacity — directly affecting your capacity planning SLIs.
The SRE operations table
| SRE operation | Traditional approach | OpenZFS approach | Time difference |
|---|---|---|---|
| Pre-deploy checkpoint | rsync backup to staging server | zfs snapshot (atomic) | minutes → <1s |
| Application rollback | Restore backup, redeploy, verify | zfs rollback to snapshot | hours → 2s |
| Staging environment | Provision new VM, copy data, configure | zfs clone of production snapshot | hours → seconds |
| DR replica | Full rsync, periodic, bandwidth-heavy | zfs send incremental (block-level) | GBs → MBs/cycle |
| OS rollback (bad kernel) | Boot from rescue, reinstall, reconfigure | Select previous boot environment at GRUB | hours → 1 reboot |
| DR test | Spin up separate DR site, hope it works | Clone replica, boot it, verify, destroy | days → minutes |
| Corruption detection | Discover it at restore time (too late) | Checksummed on every read, repaired from mirrors | silent failure → automatic |
4. Blue/Green Deployments — the Core Pattern
Blue/green is the most reliable deployment strategy for stateless services. The idea is simple: you run two identical environments. One is live (blue). One is idle (green). You deploy the new version to green, test it, then switch all traffic to green. Blue becomes your instant rollback. When you are confident green is healthy, you retire blue.
The traditional problem with blue/green is cost: you need 2x the hardware, 2x the configuration, and a reliable mechanism to switch traffic. OpenZFS eliminates all three problems. A clone of blue costs zero disk until it diverges. The clone IS production at the moment of cloning — no configuration drift possible. And the rollback is a load balancer change, not a data restoration.
OpenZFS blue/green — step by step
# Assume: production web application running on KVM VM with ZFS root
# blue-app: the current production VM
# Disk: tank/vms/blue-app (ZFS dataset backing the VM image)
# Step 1: Snapshot blue (production) before the deploy
# Store the name once so later steps reference the same snapshot (e.g. ...@pre-deploy-20260402-1430)
SNAP="tank/vms/blue-app@pre-deploy-$(date +%Y%m%d-%H%M)"
zfs snapshot "$SNAP"
# Step 2: Clone blue to create green (instant, zero disk cost)
zfs clone "$SNAP" tank/vms/green-app
# Step 3: Boot green as a new VM using the clone
# (kvm-clone does this automatically — clones ZFS dataset + generates new VM config)
kvm-clone blue-app green-app
# Step 4: Deploy new application version to green
# (green is identical to blue — same OS, same config, same data)
ssh green-app 'cd /opt/app && git pull && systemctl restart app'
# Step 5: Run smoke tests against green
curl -sf http://green-app.internal/health || exit 1
./smoke-tests.sh --target green-app.internal
# Step 6: Switch traffic — update load balancer or DNS
# (DNS change, HAProxy backend swap, floating IP reassignment, etc.)
# Example: update HAProxy to route all traffic to green-app
haproxy-swap blue-app green-app # your automation here
# Green is now production. Blue is your rollback.
# Step 7a: If green breaks after traffic switch — rollback in 2 seconds
haproxy-swap green-app blue-app
# Users are back on blue. Investigate green at leisure.
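# Step 7b: If green is healthy after the confidence period (e.g. 30 min)
# Retire blue (or keep it as a warm spare for one release cycle)
virsh destroy blue-app && virsh undefine blue-app # old blue VM gone
# The pre-deploy snapshot cannot be destroyed while green is cloned from it,
# so promote green first; blue then becomes the dependent clone and can be removed
zfs promote tank/vms/green-app
zfs destroy -r tank/vms/blue-app
# tank/vms/green-app is now the new blue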
The blue/green naming convention
Do not hardcode the names "blue" and "green" into your configuration. Instead, use a symbolic pointer — a symlink, a DNS record, a load balancer backend group — that points to whichever environment is currently live. The environments themselves can be named by timestamp or release version:
# Convention: production always points to current, standby to previous
# DNS: app.internal → points to current production IP
# Standby: standby-app.internal → points to idle environment
# Release naming instead of blue/green:
# tank/vms/app-v1.4.2 ← current production
# tank/vms/app-v1.5.0 ← new version, being tested
# After successful cutover:
# tank/vms/app-v1.4.2 ← warm standby for this release cycle
# tank/vms/app-v1.5.0 ← production
# After confidence period:
# tank/vms/app-v1.4.2 → destroy
# tank/vms/app-v1.5.0 → rename to app-current
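One way to implement the symbolic pointer is a symlink that is replaced atomically. The sketch below is a hypothetical helper (`switch_live` is not a kldload command) and assumes GNU coreutils, since it relies on `mv -T`; it creates a temporary link and renames it over the live one, so readers never see a missing pointer. The directory names are stand-ins for whatever your releases are called.

```shell
#!/usr/bin/env bash
# switch_live POINTER TARGET
# Atomically repoints the symlink POINTER at TARGET: create a temporary
# symlink, then rename it over the old one (rename replaces it in one step).
switch_live() {
    local pointer="$1" target="$2"
    ln -s "$target" "${pointer}.tmp.$$" && mv -T "${pointer}.tmp.$$" "$pointer"
}

# Demo in a scratch directory: cut over to v1.5.0, then roll back to v1.4.2
workdir=$(mktemp -d)
mkdir "$workdir/app-v1.4.2" "$workdir/app-v1.5.0"
switch_live "$workdir/app-current" "$workdir/app-v1.4.2"
switch_live "$workdir/app-current" "$workdir/app-v1.5.0"
readlink "$workdir/app-current"
switch_live "$workdir/app-current" "$workdir/app-v1.4.2"
readlink "$workdir/app-current"
rm -rf "$workdir"
```

The same idea applies to a DNS record or load balancer backend group: the environments keep their release names and only the pointer moves.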
5. Canary Deployments
Canary is blue/green at a smaller blast radius. Instead of switching all traffic at once, you route a small percentage (10%, 5%, even 1%) to the new version and observe how it behaves under real production load. If the canary metrics match production, you expand. If they diverge, you pull the canary without ever affecting the majority of users.
Canary with kvm-clone
# Assume: 5-node application cluster: app-1 through app-5
# Load balancer routes requests round-robin across all five nodes
# Step 1: Clone app-1 to create the canary
zfs snapshot tank/vms/app-1@pre-canary-$(date +%Y%m%d-%H%M)
kvm-clone app-1 app-canary
# Step 2: Deploy new version to canary only
ssh app-canary 'cd /opt/app && git checkout v1.5.0 && systemctl restart app'
# Step 3: Route ~10% of traffic to canary (HAProxy weight-based)
# app-1 through app-5: weight 10 each (50 total)
# app-canary: weight 6 → 6/56 ≈ 10.7% of traffic (weight 10 would give 10/60 ≈ 16.7%)
# Adjust to your load balancer's weighting mechanism
# Step 4: Monitor the canary vs production
# Key comparison: error rate, latency p99, resource usage
# If canary error rate is 2x production error rate: pull the canary immediately
# Step 5a: Canary healthy — expand to full fleet
for node in app-1 app-2 app-3 app-4 app-5; do
ssh $node 'cd /opt/app && git checkout v1.5.0 && systemctl restart app'
sleep 30 # rolling, one node at a time
done
# Remove canary from rotation, destroy it
# Step 5b: Canary broken, pull it
# Remove app-canary from load balancer rotation
# The pre-canary snapshot belongs to app-1, not to the clone, so there is nothing
# to roll the canary back to; destroy it instead:
# virsh destroy app-canary && virsh undefine app-canary && zfs destroy tank/vms/app-canary
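With weight-based balancing, the canary's share of traffic is its weight divided by the sum of all weights. The sketch below is a hypothetical helper (`canary_share` is an illustrative name) that makes the arithmetic explicit so the weight can be chosen before touching the load balancer; it assumes every production node carries the same weight.

```shell
#!/usr/bin/env bash
# canary_share NODE_COUNT NODE_WEIGHT CANARY_WEIGHT
# Prints the percentage of traffic the canary receives.
canary_share() {
    awk -v n="$1" -v w="$2" -v c="$3" 'BEGIN {
        printf "%.1f\n", 100 * c / (n * w + c)
    }'
}

# Five production nodes at weight 10, trying a few canary weights
for weight in 1 3 6 10; do
    printf 'canary weight %s -> %s%% of traffic\n' "$weight" "$(canary_share 5 10 "$weight")"
done
```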
Canary with Cilium L7 traffic splitting
For Kubernetes workloads on kldload, Cilium enables L7-aware traffic splitting. You can canary a specific API endpoint: route 10% of /api/v2/checkout requests to the new version while all other endpoints stay on the stable version. This is more surgical than weight-based load balancing:
# CiliumEnvoyConfig for L7 traffic splitting (canary for /api/v2/checkout)
apiVersion: cilium.io/v2
kind: CiliumEnvoyConfig
metadata:
  name: checkout-canary
spec:
  resources:
    - "@type": type.googleapis.com/envoy.config.route.v3.RouteConfiguration
      name: checkout-route
      virtual_hosts:
        - name: checkout
          domains: ["app.internal"]
          routes:
            - match:
                prefix: "/api/v2/checkout"
              route:
                weighted_clusters:
                  clusters:
                    - name: checkout-stable
                      weight: 90
                    - name: checkout-canary
                      weight: 10
What to monitor during a canary
The canary decision is binary: expand or rollback. Make it quantitative, not a gut feeling. Define the rollback criterion before you deploy:
# Canary success criteria (define BEFORE deploying):
# 1. Error rate on canary <= 1.5x error rate on production for 10 minutes
# 2. p99 latency on canary <= 1.2x p99 latency on production
# 3. No crash loops (canary process restarts > 2 in 10 minutes = fail)
# 4. No error budget consumption > 10% of monthly budget in first 10 minutes
# Prometheus queries for canary comparison:
# Error rate ratio (canary / production).
# Parentheses are required: "/" is left-associative, so a / b / c / d is not (a/b) / (c/d).
(
  sum(rate(http_requests_total{job="app-canary",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="app-canary"}[5m]))
)
/
(
  sum(rate(http_requests_total{job="app-production",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="app-production"}[5m]))
)
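Once the metrics are collected, the expand-or-rollback decision can be mechanical. The following is a minimal sketch of criteria 1 to 3 above, assuming the error rates, p99 latencies, and restart count have already been fetched (for example from the Prometheus HTTP API); `canary_verdict` is a hypothetical helper, and the error-budget criterion is left out because it needs the budget accounting from Section 2.

```shell
#!/usr/bin/env bash
# canary_verdict CANARY_ERR PROD_ERR CANARY_P99 PROD_P99 RESTARTS
# Error rates are fractions (0.002 = 0.2%), latencies are in seconds.
canary_verdict() {
    awk -v ce="$1" -v pe="$2" -v cl="$3" -v pl="$4" -v rs="$5" 'BEGIN {
        if (ce > 1.5 * pe) { print "rollback: error rate"; exit }   # criterion 1
        if (cl > 1.2 * pl) { print "rollback: p99 latency"; exit }  # criterion 2
        if (rs > 2)        { print "rollback: crash loop"; exit }   # criterion 3
        print "expand"
    }'
}

canary_verdict 0.002 0.002 0.250 0.240 0
canary_verdict 0.010 0.002 0.250 0.240 0
canary_verdict 0.002 0.002 0.400 0.240 0
```

Note that a production error rate of exactly zero makes any canary error a rollback under criterion 1; whether that is too strict depends on your traffic volume.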
6. Rolling Updates with Rollback
Rolling updates are the safest deployment strategy for stateful services and clusters. Instead of swapping all traffic at once (blue/green) or exposing a slice of traffic to a separate new-version node (canary), you update one node at a time, verify it is healthy, then proceed to the next. If any node fails, you roll back that node and stop the rollout.
The kvm-snap rolling pattern
# Rolling update across a 5-node application cluster
# Pattern: snapshot → update → verify → continue → snapshot next
NODES="app-1 app-2 app-3 app-4 app-5"
NEW_VERSION="v1.5.0"
VERIFY_WAIT=300 # 5 minutes between nodes
for node in $NODES; do
    echo "=== Updating $node ==="
    # 1. Snapshot the node before update; keep the name so rollback targets the same snapshot
    SNAP="tank/vms/${node}@pre-update-$(date +%Y%m%d-%H%M)"
    zfs snapshot "$SNAP"
    # 2. Drain the node from the load balancer
    haproxy-drain "$node"
    # 3. Apply the update
    ssh "$node" "cd /opt/app && git checkout $NEW_VERSION && systemctl restart app"
    # 4. Wait for the node to become healthy
    if ! timeout 60 bash -c "until curl -sf http://$node.internal/health; do sleep 2; done"; then
        echo "ERROR: $node health check failed. Rolling back."
        # Stop the guest before rolling back the dataset that backs its disk
        virsh destroy "$node"
        zfs rollback "$SNAP"
        virsh start "$node"
        haproxy-restore "$node"
        exit 1
    fi
    # 5. Re-add to load balancer
    haproxy-restore "$node"
    # 6. Wait before proceeding to next node
    echo "Node $node healthy. Waiting ${VERIFY_WAIT}s before next node..."
    sleep "$VERIFY_WAIT"
done
echo "Rolling update complete. All nodes on $NEW_VERSION."
The "pause and assess" pattern
Do not update all nodes in a tight loop. Use a graduated rollout: update one node, wait, update a few more, wait longer, update the rest. The longer wait periods give slow-burn failures (memory leaks, connection pool exhaustion, caching bugs) time to surface before they affect the entire cluster:
# Graduated rollout schedule (5-node cluster):
# Node 1: update, verify 10 minutes
# Nodes 2-3: update, verify 30 minutes
# Nodes 4-5: update, verify 60 minutes, declare success
# The pause-and-assess pattern for Kubernetes:
kubectl set image deployment/app app=app:v1.5.0
kubectl rollout pause deployment/app
# Wait 10 minutes, check metrics, then:
kubectl rollout resume deployment/app # or:
kubectl rollout undo deployment/app # rollback immediately
7. Change Management — Every Change Is a Potential Incident
The Google SRE book is unambiguous: the majority of production incidents are caused by changes. Not hardware failures, not cosmic rays, not mysterious network events — changes. A configuration change. A package update. A schema migration. A kernel upgrade. The implication is that every change needs a rollback plan before it executes, not after.
The pre-change checklist
# Pre-change checklist — required before every significant change
# 1. Create a pre-change snapshot
CHANGE_ID="pkg-update-kernel-$(date +%Y%m%d-%H%M)"
zfs snapshot -r tank@${CHANGE_ID}
# 2. Verify the rollback procedure works
# (actually test it — don't assume)
zfs rollback tank/app@${CHANGE_ID} # test on non-prod equivalent
# 3. Document the rollback command in the change ticket
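# Example: "To rollback: zfs rollback tank/app@pkg-update-kernel-20260402-1430, repeated for each affected dataset"
# (zfs rollback works on one dataset at a time; its -r flag destroys newer snapshots, it does not recurse into children)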
# 4. Notify stakeholders (maintenance window, who to contact if broken)
# 5. Set a change window with a defined abort time
# "If the change is not complete and verified by 16:00, abort and rollback"
# 6. Verify monitoring is in place to detect the change failing
# Check that Prometheus alerts are active, Grafana dashboard is open
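Because `zfs snapshot -r` covers a whole tree while `zfs rollback` acts on one dataset at a time, the rollback plan in the change ticket is really a list of commands. The sketch below is a hypothetical helper (`rollback_commands` is an illustrative name) that generates that list from dataset names on stdin. It only prints the commands; review them before running anything.

```shell
#!/usr/bin/env bash
# rollback_commands SNAPSHOT_NAME
# Reads dataset names on stdin (one per line) and prints one
# "zfs rollback" command per dataset for the given snapshot name.
rollback_commands() {
    local snap="$1" dataset
    while IFS= read -r dataset; do
        if [ -n "$dataset" ]; then
            printf 'zfs rollback %s@%s\n' "$dataset" "$snap"
        fi
    done
}

# Demo with a fixed dataset list. On a real host the list would come from:
#   zfs list -H -o name -r tank | rollback_commands "$CHANGE_ID"
printf 'tank\ntank/app\ntank/vms\n' | rollback_commands pkg-update-kernel-20260402-1430
```

If snapshots newer than the target exist on a dataset, `zfs rollback` refuses unless `-r` is added, which destroys those newer snapshots; decide that per dataset rather than by default.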
Change categories and rollback strategies
| Category | Examples | Rollback strategy | Rollback time |
|---|---|---|---|
| Routine | App config change, package update | zfs rollback to pre-change snapshot | 2 seconds |
| Significant | Kernel update, major config change | Boot environment rollback at GRUB | 1 reboot |
| Critical | Schema migration, cluster upgrade | Full blue/green — clone before, swap back if broken | <60 seconds |
| Emergency | Hotfix under active incident | Snapshot before hotfix, rollback if hotfix fails | 2 seconds |
The blameless postmortem
Every significant incident produces a postmortem. The postmortem is not about blame — it is about learning. The blameless postmortem assumes that every engineer involved acted with the information they had at the time, and that the failure is systemic, not individual. The goal is to find the systemic causes and fix them:
# Postmortem template
# Incident: [name] — [date]
# Severity: S[1-4]
# Duration: [start] to [end] = [X minutes of impact]
# Affected: [users, services, error rate, revenue]
# Timeline:
# [timestamp] — [what happened]
# [timestamp] — [who detected it, how]
# [timestamp] — [first mitigation attempt]
# [timestamp] — [service restored]
# [timestamp] — [root cause identified]
# Root cause:
# [single sentence: what actually went wrong]
# Contributing factors:
# [what made the root cause possible or harder to detect]
# [what slowed down detection or mitigation]
# What went well:
# [things that helped — monitoring caught it fast, rollback worked instantly]
# Action items:
# [ID] [Owner] [Due date] — [what we are changing to prevent this]
8. Incident Response — the SRE Playbook
Incident response is the hardest SRE discipline to teach because every incident is different. But the structure — the lifecycle — is always the same. The key insight that separates SRE incident response from traditional ops is this: mitigation and resolution are different phases, and you should never be trying to resolve while users are down.
Severity levels
| Severity | Definition | Response time | Requires postmortem? |
|---|---|---|---|
| S1 | All users affected, service down or unusable | Immediate (minutes) | Always |
| S2 | Significant user impact, degraded service | <30 minutes | Always |
| S3 | Minor impact, subset of users affected | <4 hours | If novel failure mode |
| S4 | Cosmetic, no user impact | Next business day | No |
The incident lifecycle
# Phase 1: Detect
# Alert fires → on-call paged → incident channel opened
# "Something is broken" — do not try to diagnose yet
# Phase 2: Assess (< 5 minutes)
# What is broken? What is the blast radius?
# "API endpoint /checkout returning HTTP 500 for 60% of requests"
# "ZFS pool degraded — one disk offline, data redundancy lost"
# Set severity level here.
# Phase 3: Mitigate — restore service ASAP (not root cause!)
# OpenZFS mitigations:
zfs rollback tank/app@pre-deploy-20260402-1430 # bad deploy → rollback
kvm-snap rollback web-01 # broken node → restore snapshot
# GRUB → previous boot environment # bad kernel/packages → 1 reboot
haproxy-swap green-app blue-app # bad green → swap back to blue
# Phase 4: Communicate
# Update status page, notify stakeholders, post in incident channel:
# "[14:45] Mitigation applied — rolling back to v1.4.2. Monitoring recovery."
# "[14:47] Service restored. Error rate back to 0.1%. Root cause investigation ongoing."
# Phase 5: Resolve
# NOW you investigate root cause — after service is restored
# Preserve evidence: logs, metrics, snapshots of the bad state
# Do not destroy the bad snapshot until postmortem is complete
# Phase 6: Postmortem (within 48 hours for S1/S2)
# Blameless. Timeline. Root cause. Contributing factors. Action items.
9. Toil Elimination — Automate the Repetitive
Toil is manual, repetitive, automatable work that scales linearly with the size of your service. As your fleet grows, the toil grows with it — unless you automate it. Google's SRE teams have a hard rule: no more than 50% of an engineer's time on toil. If it exceeds that, you stop adding features and fix the automation.
Identifying toil in your infrastructure
# Signs you are doing toil:
# - You have a weekly "maintenance" task you do by hand
# - You SSH into machines to run commands that should be automated
# - You prune snapshots manually ("cleanup time again")
# - You renew certificates by calendar reminder
# - You check disk space by logging in and running df -h
# - You rotate logs manually
# - You run backup scripts because "cron was flaky"
# - You provision new VMs by copy-pasting commands from a wiki
# These are all toil. All of them have better answers.
How kldload eliminates toil
sanoid — snapshot lifecycle automation
Automated snapshot creation and pruning. Define a policy: hourly snapshots for 24 hours, daily for 30 days, monthly for 12 months. sanoid runs via systemd timer and enforces the policy continuously. You never think about snapshots again.
syncoid — replication automation
Automated incremental ZFS replication. A systemd timer runs syncoid every 15 minutes. It streams only the changed blocks since the last sync. No rsync, no full copies, no manual intervention. RPO is 15 minutes with zero operational toil.
cert-manager + step-ca — certificate automation
cert-manager (Kubernetes) or step-ca (anywhere) handles the full certificate lifecycle: issue, renew, rotate, revoke. Certificates are renewed automatically before expiry. No calendar reminders, no manual openssl commands, no "our cert expired at 3am" incidents.
systemd timers — maintenance automation
Replace all cron jobs with systemd timers. Timers have built-in logging (journald), dependency management, randomized delays to spread load, and restart-on-failure semantics. They are observable in a way cron never was.
kvm-clone — VM provisioning automation
Clone a golden image to provision a new VM: copy ZFS dataset, generate new MAC and hostname, boot. What takes 30 minutes manually (download ISO, install OS, configure, test) takes 30 seconds with kvm-clone from a pre-built image.
Prometheus alerting — reactive toil elimination
Alert on conditions that would otherwise require manual checking. Disk space at 75%? Alert. ARC hit ratio dropping? Alert. Replication lag exceeding 30 minutes? Alert. You do not check dashboards manually — you get paged when something needs attention.
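The snapshot policy described under sanoid (hourly for 24 hours, daily for 30 days, monthly for 12 months) translates into a short configuration file. The following is a minimal sketch that writes such a file; `write_sanoid_policy` is a hypothetical helper, the template name `production` is arbitrary, and the output path in the demo is a scratch file rather than the location sanoid reads on your system.

```shell
#!/usr/bin/env bash
# write_sanoid_policy DATASET OUTFILE
# Writes a sanoid policy: 24 hourly, 30 daily, 12 monthly snapshots,
# created and pruned automatically for DATASET and its children.
write_sanoid_policy() {
    local dataset="$1" outfile="$2"
    cat > "$outfile" <<EOF
[$dataset]
    use_template = production
    recursive = yes

[template_production]
    hourly = 24
    daily = 30
    monthly = 12
    autosnap = yes
    autoprune = yes
EOF
}

# Demo: write the policy to a scratch file and show it
policy_file=$(mktemp)
write_sanoid_policy tank/app "$policy_file"
cat "$policy_file"
rm -f "$policy_file"
```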
10. Capacity Planning with OpenZFS
Capacity planning is the SRE discipline of ensuring you do not run out of resources before you have time to add more. On OpenZFS, this is easier than on any other platform because ZFS gives you exact, consistent numbers. No df vs du discrepancies. No hidden filesystem overhead. No guessing.
The numbers ZFS gives you
# Pool-level capacity — the source of truth
zpool list -v tank
# NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH
# tank 10.9T 3.21T 7.69T - - 4% 29% 1.00x ONLINE
# Dataset-level usage — per service, per environment
zfs list -o name,used,avail,refer,compressratio -r tank
# NAME USED AVAIL REFER RATIO
# tank/app 142G 7.55T 118G 2.31x
# tank/vms 890G 7.55T 200G 1.87x
# tank/snapshots 240G 7.55T - 1.00x (snapshot accumulation)
# Snapshot growth over time — how fast is your snapshot space growing?
zfs list -t snapshot -o name,used -r tank/app | sort -k2 -h | tail -20
# ARC efficiency — is your cache working?
arc_summary | grep -E "Hit|Miss|Size"
# or via Prometheus: node_zfs_arc_hits / (node_zfs_arc_hits + node_zfs_arc_misses)
The 80% rule
ZFS performance degrades past 80% pool utilization due to fragmentation and the copy-on-write transaction group management. The 80% threshold is your planning trigger, not your emergency threshold. When a pool hits 75%, you start planning expansion. When it hits 80%, you execute the expansion plan:
# Prometheus alert: pool approaching 80%
# ignoring(state) is needed so the "free" and "total" series match each other
- alert: ZFSPoolCapacityHigh
  expr: node_zfs_zpool_size_bytes{state="free"} / ignoring(state) node_zfs_zpool_size_bytes{state="total"} < 0.25
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "ZFS pool {{ $labels.zpool }} is over 75% full"
    description: "Pool has {{ $value | humanizePercentage }} free. Plan expansion now."
- alert: ZFSPoolCapacityCritical
  expr: node_zfs_zpool_size_bytes{state="free"} / ignoring(state) node_zfs_zpool_size_bytes{state="total"} < 0.15
  for: 30m
  labels:
    severity: critical
  annotations:
    summary: "ZFS pool {{ $labels.zpool }} is over 85% full: performance degrading"
Forecasting storage growth
# Track used space daily and project when you hit 80%
# Simple linear forecast from Prometheus:
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[7d], 30*24*3600)
# → predicts free bytes 30 days from now based on the past 7 days of trend
# For ZFS-specific:
predict_linear(node_zfs_zpool_size_bytes{state="alloc"}[14d], 90*24*3600)
# → how much will be allocated 90 days from now?
# Clone space accounting — clones use space as they diverge
# Account for blue/green clones in your capacity plan:
# At peak: production + one full green clone = 2x production data space
# After cleanup: back to 1x
# Plan for 2.5x headroom during active blue/green periods
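The linear forecast can also be done by hand from two numbers: current allocation and average daily growth. The sketch below is a hypothetical helper (`days_until_threshold` is an illustrative name) that estimates how many days remain before the pool crosses the planning threshold. Units are arbitrary as long as all three values use the same one; the demo uses GB figures close to the 10.9T pool shown earlier, with an assumed growth of 20 GB per day.

```shell
#!/usr/bin/env bash
# days_until_threshold SIZE ALLOC GROWTH_PER_DAY [THRESHOLD_PERCENT]
# Prints whole days until ALLOC reaches THRESHOLD_PERCENT of SIZE (default 80).
days_until_threshold() {
    awk -v size="$1" -v alloc="$2" -v growth="$3" -v pct="${4:-80}" 'BEGIN {
        limit = size * pct / 100
        if (alloc >= limit) { print 0; exit }        # already past the threshold
        if (growth <= 0)    { print "never"; exit }  # flat or shrinking usage
        printf "%d\n", (limit - alloc) / growth      # truncated to whole days
    }'
}

# 10900 GB pool, 3210 GB allocated, growing 20 GB/day (assumed rate)
days_until_threshold 10900 3210 20
# Same pool against the 75% planning trigger
days_until_threshold 10900 3210 20 75
```

Remember to add the temporary blue/green clone divergence to the allocation figure when a deployment is in flight.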
zfs list output shows exactly how much space each dataset uses, how much the pool has free, and what the compression ratio is. There are no df vs du discrepancies (ZFS reports both correctly), no hidden overhead from journal files or filesystem metadata that surprises you, no confusion about whether snapshot space counts toward quotas. The compression ratio is particularly useful for capacity planning: if your pool has 2.3x compression and you are adding a new workload, you know from the first 24 hours of data how much real space it will consume. No guessing.
11. Disaster Recovery — the Ultimate SRE Test
Disaster recovery is the most important SRE capability to have and the least tested in practice. The Google SRE book says: DR plans that have not been tested are DR hopes. OpenZFS makes DR testing free — you clone the production replica, boot it, verify it, and destroy the clone. The test costs nothing and proves the DR is real.
DR tiers
| Tier | Mechanism | RTO | RPO | Use case |
|---|---|---|---|---|
| Tier 1 | Local snapshot rollback | 2 seconds | 0 (atomic) | Bad deploy, bad config change |
| Tier 2 | Node recovery from snapshot | 30 seconds | Snapshot interval | Node failure, VM corruption |
| Tier 3 | syncoid replica failover at DR site | 5–15 minutes | Replication interval | Site failure, hardware loss |
| Tier 4 | Cold rebuild: kldload ISO + ZFS import | 30 minutes | Last offsite snapshot | Total site loss, catastrophic failure |
Tier 3 DR runbook — site failover
# Prerequisites:
# - syncoid replicating tank → dr-host:tank every 15 minutes
# - DR host has kldload installed but services dormant
# - WireGuard backplane connects production and DR sites
# Failover procedure:
# 1. Declare the incident — do not fail over on a hunch
# 2. Verify production is actually unreachable (not a monitoring flap)
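# 3. On DR host: confirm the replica pool is imported (import it only if it was exported)
ssh dr-host 'zpool list tank || zpool import tank'
# 4. Check the last successful replication (on the DR host, sorted by creation time)
ssh dr-host 'zfs list -t snapshot -o name,creation -s creation -r tank | tail -5'
# → confirms how old the last sync is (your actual RPO)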
# 5. Boot the replicated VMs on DR host
ssh dr-host 'kvm-clone --from-snapshot tank/vms/app-1@syncoid_auto tank/vms/app-1'
ssh dr-host 'virsh start app-1'
# 6. Verify applications are healthy on DR site
curl -sf http://app-1.dr.internal/health
# 7. Cut DNS or BGP route to point production traffic at DR site
# (your specific mechanism: DNS TTL update, BGP announcement, floating IP reassign)
# 8. Notify stakeholders: DR site active, RPO = [timestamp of last sync]
# Recovery (returning to production site):
# 1. Restore production hardware
# 2. Sync from DR back to production: syncoid --recursive dr-host:tank tank
# 3. Verify production is current
# 4. Cut traffic back to production (planned maintenance window)
# 5. Resume normal replication from production to DR
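Step 4 of the failover turns into a number you can report if the snapshot's creation time is read as an epoch timestamp. The following is a minimal sketch, assuming the timestamp comes from `zfs get -Hp -o value creation <snapshot>` (the `-p` flag prints it as seconds since the epoch); `snapshot_age_minutes` and `rpo_status` are hypothetical helpers, and the demo uses fixed timestamps so it runs without ZFS.

```shell
#!/usr/bin/env bash
# snapshot_age_minutes CREATION_EPOCH [NOW_EPOCH]
# Prints the age of a snapshot in whole minutes.
snapshot_age_minutes() {
    local created="$1" now="${2:-$(date +%s)}"
    echo $(( (now - created) / 60 ))
}

# rpo_status AGE_MINUTES TARGET_MINUTES
# Compares the measured age against the RPO target.
rpo_status() {
    local age_min="$1" target_min="$2"
    if [ "$age_min" -le "$target_min" ]; then
        echo "within-target"
    else
        echo "exceeded"
    fi
}

# Demo: snapshot created at t=1000s, checked at t=1900s, against a 15-minute RPO
age=$(snapshot_age_minutes 1000 1900)
echo "last replicated snapshot is ${age} minutes old: $(rpo_status "$age" 15)"
```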
Testing your DR
# DR test — run monthly, costs nothing, proves everything
# Step 1: On DR host, clone the production replica
ssh dr-host 'zfs clone tank/app@syncoid_auto-2026-04-01 tank/app-drtest'
# Step 2: Boot a test VM from the clone
ssh dr-host 'kvm-clone --from-snapshot tank/vms/app-1@syncoid_auto tank/vms/app-1-drtest'
ssh dr-host 'virsh start app-1-drtest'
# Step 3: Verify application is healthy (internal, not exposed to users)
ssh app-1-drtest 'systemctl is-active app && curl -sf localhost/health'
# Step 4: Run smoke tests against the DR test environment
./smoke-tests.sh --target app-1-drtest.internal
# Step 5: Destroy the test (production was never touched)
ssh dr-host 'virsh destroy app-1-drtest && virsh undefine app-1-drtest'
ssh dr-host 'zfs destroy tank/vms/app-1-drtest && zfs destroy tank/app-drtest'
# Result: documented proof that DR is operational, RPO and RTO measured
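To record a measured RTO rather than an estimate, the health check in Step 3 can be wrapped in a timed retry loop. This is a minimal sketch under stated assumptions: `wait_healthy` is a hypothetical helper, it uses one-second clock resolution from `date +%s`, and it treats any zero exit status from the supplied command as "healthy".

```shell
# Hypothetical helper: run a health check repeatedly until it succeeds or a
# timeout expires, then print the elapsed seconds.
# $1: timeout in seconds   $2: seconds between attempts   $3...: command to run
wait_healthy() {
  timeout_s="$1"; interval_s="$2"; shift 2
  start=$(date +%s)
  until "$@" >/dev/null 2>&1; do
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -ge "$timeout_s" ]; then
      echo "unhealthy after ${elapsed}s"
      return 1
    fi
    sleep "$interval_s"
  done
  echo "healthy after $(( $(date +%s) - start ))s"
}
```

For example, `wait_healthy 300 5 ssh app-1-drtest 'systemctl is-active app && curl -sf localhost/health'` started right after `virsh start` prints the seconds from boot to a passing health check, which is the figure to write into the DR test record.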
12. Observability for SRE
SRE observability is built around the four golden signals, defined in the Google SRE book: latency, traffic, errors, and saturation. These four signals, measured at the service level, tell you whether your service is healthy. Supplement them with infrastructure signals (ZFS health, WireGuard connectivity, node resource usage) and you cover both the service and the platform underneath it.
The four golden signals
# 1. Latency — how long requests take
# Measure: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Alert: p99 latency > 300ms for 5 minutes
# Dashboard: p50, p95, p99 on the same graph
# 2. Traffic — how much demand the system is handling
# Measure: rate(http_requests_total[5m])
# Alert: traffic drops > 50% from baseline (traffic anomaly, possible upstream failure)
# Dashboard: requests/second with historical comparison
# 3. Errors — rate of failed requests
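# Measure: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# (aggregate both sides with sum(); without it the division only matches series with identical labels and always returns 1)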
# Alert: error rate > 1% for 2 minutes
# Dashboard: error rate with SLO budget burn overlay
# 4. Saturation — how full the service is
# Measure: CPU, memory, ZFS ARC, disk I/O queue depth
# Alert: any saturation metric > 85% for 10 minutes
# Dashboard: all four saturation signals on one pane
ZFS-specific signals
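# Metric names in this section depend on the exporter you run; confirm each one against your /metrics output before alerting on it
# ZFS pool health (0=ONLINE, 1=DEGRADED, 2=FAULTED)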
node_zfs_zpool_state_value
# ARC hit ratio — cache efficiency (below 80% means cold cache or undersized ARC)
rate(node_zfs_arc_hits[5m]) / (rate(node_zfs_arc_hits[5m]) + rate(node_zfs_arc_misses[5m]))
# Pool capacity — percentage used (alert at 75%)
1 - (node_zfs_zpool_size_bytes{state="free"} / node_zfs_zpool_size_bytes{state="total"})
# Scrub status — when did the last scrub complete? (alert if > 30 days)
time() - node_zfs_zpool_scrub_end_time
# Replication lag — how far behind is the DR replica?
time() - max(node_zfs_snapshot_creation_time{snapshot=~".*syncoid.*"})
# Checksum errors — potential disk hardware failure (any value > 0 = investigate)
node_zfs_zpool_checksum_errors_total
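When a dashboard value looks wrong, it helps to recompute it from the source on the host. The two helpers below are a sketch for that kind of spot check, not part of any exporter: the function names are hypothetical, `arc_hit_ratio` uses the lifetime counters in `/proc/spl/kstat/zfs/arcstats` (so it reports the ratio since module load, not the 5-minute rate the query above uses), and `pools_over_capacity` expects `zpool list -Hp -o name,size,allocated` output.

```shell
# Hypothetical helpers for spot-checking exporter values from the shell.

# ARC hit ratio as a percentage.
# stdin: contents of /proc/spl/kstat/zfs/arcstats (columns: name type data)
arc_hit_ratio() {
  awk '
    $1 == "hits"   { hits = $3 }
    $1 == "misses" { misses = $3 }
    END {
      if (hits + misses == 0) { print "no ARC activity"; exit 1 }
      printf "%.1f\n", 100 * hits / (hits + misses)
    }'
}

# Pools at or above a capacity threshold.
# stdin: output of `zpool list -Hp -o name,size,allocated` (tab-separated, bytes)
# $1:    threshold in percent (the text above alerts at 75)
pools_over_capacity() {
  awk -F '\t' -v limit="$1" '
    $2 > 0 && 100 * $3 / $2 >= limit { printf "%s %.1f\n", $1, 100 * $3 / $2 }'
}
```

Typical use would be `arc_hit_ratio < /proc/spl/kstat/zfs/arcstats` and `zpool list -Hp -o name,size,allocated | pools_over_capacity 75`. An empty result from the second command means no pool has crossed the threshold.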
WireGuard signals
# WireGuard peer handshake age — stale > 3 minutes means peer is unreachable
time() - wireguard_peer_last_handshake_seconds{interface="wg1"}
# Transfer bytes (rate) — check if traffic is actually flowing
rate(wireguard_peer_received_bytes_total[5m])
rate(wireguard_peer_sent_bytes_total[5m])
# Number of connected peers (below expected = backplane failure)
count((time() - wireguard_peer_last_handshake_seconds) < 180) by (interface)
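The same staleness test can be run directly on a node when Prometheus itself is unreachable over the backplane. The sketch below assumes the output format of `wg show <interface> latest-handshakes` (one line per peer: public key, a tab, and the handshake time in epoch seconds, with 0 meaning no handshake yet); `stale_peers` is a hypothetical name.

```shell
# Hypothetical helper: list peers whose last handshake is older than a threshold.
# stdin: output of `wg show <interface> latest-handshakes`
#        (tab-separated: peer public key, handshake time in epoch seconds;
#         0 means the peer has never completed a handshake)
# $1: current time in epoch seconds   $2: staleness threshold in seconds
stale_peers() {
  awk -F '\t' -v now="$1" -v limit="$2" '
    $2 == 0          { print $1, "never"; next }
    now - $2 > limit { print $1, now - $2 }'
}
```

With the 3-minute threshold used above: `wg show wg1 latest-handshakes | stale_peers "$(date +%s)" 180`. Each output line is a peer key followed by its handshake age in seconds, or `never`.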
SLO burn rate alerting
Standard threshold alerts (error rate > 1%) fire when you are already in trouble. SLO burn rate alerts fire when you are burning through your error budget faster than sustainable — before the budget is exhausted:
# Burn rate alerting — fires when you are consuming error budget too fast
# A burn rate of 1.0 = consuming budget at exactly the rate that would exhaust it over the SLO window
# A burn rate of 14.4 = consuming budget 14.4x faster → exhausts 30-day budget in 2 days
# Fast burn (critical): > 2% error rate sustained for 2 minutes
# Burn rate = 2% / 0.1% (error budget) = 20x → exhausts budget in 36 hours
- alert: SLOBurnRateFast
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.02
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High error rate: burning SLO budget at 20x the sustainable rate"
# Slow burn (warning): > 0.5% error rate sustained for 30 minutes
# Burn rate = 5x → exhausts budget in 6 days
- alert: SLOBurnRateSlow
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[30m]))
      / sum(rate(http_requests_total[30m])) > 0.005
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Elevated error rate: burning SLO budget at 5x the sustainable rate"
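The thresholds in these rules follow from two divisions, and it is worth being able to redo them when the SLO or window changes. The helpers below are a sketch of that arithmetic only; the function names are hypothetical, and the 99.9% SLO is inferred from the 0.1% error budget quoted in the comments above.

```shell
# Hypothetical helpers for the burn-rate arithmetic above.

# Burn rate = observed error rate / error budget, both in percent.
# $1: observed error rate in percent   $2: SLO in percent (e.g. 99.9)
burn_rate() {
  awk -v err="$1" -v slo="$2" 'BEGIN {
    budget = 100 - slo
    if (budget <= 0) { print "no error budget at this SLO"; exit 1 }
    printf "%.1f\n", err / budget
  }'
}

# Hours until the whole window's budget is gone at a given burn rate.
# $1: burn rate   $2: SLO window in days
hours_to_exhaustion() {
  awk -v burn="$1" -v days="$2" 'BEGIN { printf "%.1f\n", days * 24 / burn }'
}
```

`burn_rate 2 99.9` gives 20.0 and `hours_to_exhaustion 20 30` gives 36.0, matching the fast-burn rule. `burn_rate 0.5 99.9` gives 5.0 and `hours_to_exhaustion 5 30` gives 144.0 hours, which is the 6 days quoted for the slow-burn rule.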
13. On-Call and Runbooks
On-call is the mechanism that connects alerting to human response. Every alert that pages an engineer must have a runbook: a documented procedure that tells the on-call engineer exactly what to do. Without a runbook, the alert is noise — the engineer has to figure out the procedure from scratch at 3 AM under pressure, which produces mistakes and slow recovery.
Runbook format
# Runbook template
# Alert: [alert name from Prometheus]
# Severity: [S1/S2/S3/S4]
# Symptom: [what the user experiences]
# Likely causes: [ordered by probability]
# Diagnostic steps:
# 1. [what to check first — fastest to answer "is this the cause?"]
# 2. [second check]
# 3. [escalation trigger: if X is true, page the database team]
# Mitigation steps (restore service):
# 1. [fastest mitigation — often a rollback or restart]
# 2. [fallback if step 1 fails]
# Resolution steps (fix root cause — after service is restored):
# 1. [how to investigate and permanently fix]
# Escalation:
# [when to escalate, who to page, what information to provide]
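A template only helps if runbooks actually follow it. One way to enforce that, sketched here under the assumption that runbooks are stored as text files using the section labels from the template, is a small lint that reports missing sections. `lint_runbook` is a hypothetical name, and the label list is taken from the template above.

```shell
# Hypothetical lint: print every template section missing from a runbook.
# stdin: the runbook text. Exit status is 1 if anything is missing.
lint_runbook() {
  awk '
    BEGIN {
      n = split("Alert:;Severity:;Symptom:;Likely causes:;Diagnostic steps:;Mitigation steps;Resolution steps;Escalation:", want, ";")
    }
    { for (i = 1; i <= n; i++) if (index($0, want[i])) seen[i] = 1 }
    END {
      for (i = 1; i <= n; i++) if (!(i in seen)) { print "missing: " want[i]; bad = 1 }
      exit bad + 0
    }'
}
```

Run as `lint_runbook < runbooks/ZFSPoolDegraded.txt` (the path is illustrative). The match is a plain substring test, so a runbook that abbreviates a label, for example `Diagnostic:` instead of `Diagnostic steps:`, is reported as incomplete.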
Runbooks for common kldload alerts
# ─────────────────────────────────────────────────────────────────────
# RUNBOOK: ZFSPoolDegraded
# Severity: S2 (data redundancy lost — pool still running, data at risk)
# Symptom: ZFS pool is in DEGRADED state (one or more disks failed)
# ─────────────────────────────────────────────────────────────────────
# Diagnostic:
zpool status tank
# → identify which VDEV is faulted or offline
# → check if it is a transient error or a permanent disk failure
dmesg | grep -E "sd[a-z]|nvme|ata" | tail -30
# → look for disk I/O errors, SMART errors, timeout errors
smartctl -a /dev/sda # check SMART data on the suspect disk
# Mitigation (pool is still online — data is accessible, redundancy is lost):
# If disk dropped due to a transient error:
zpool clear tank # clear transient errors, may bring disk back online
# If disk is permanently failed:
# Order a replacement now. The pool is still serving data, so there is no outage, but do not defer the swap.
# Alert is S2 because you have lost redundancy — any further failure = data loss.
# Resolution:
# Replace the failed disk:
zpool replace tank /dev/sda /dev/sdb # online replacement
# Wait for resilver to complete (zpool status shows progress)
# ─────────────────────────────────────────────────────────────────────
# ─────────────────────────────────────────────────────────────────────
# RUNBOOK: ZFSARCHitRatioLow
# Severity: S3 (performance degraded, not an outage)
# Symptom: ARC hit ratio below 80% — reads going to disk instead of cache
# ─────────────────────────────────────────────────────────────────────
# Diagnostic:
arc_summary | grep -E "Hit|Miss|ARC Size|Target"
# → is the ARC size at its maximum? If yes, ARC is full and evicting.
# → is there a new workload pattern causing cold reads?
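awk '$1 ~ /^(size|c|c_max|hits|misses)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats
# → arcstats field names are lowercase: size = current ARC size, c = target, c_max = limit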
# Mitigation:
# If ARC is hitting its limit (arc_max), increase it:
echo $((16 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max
# (sets ARC max to 16GB — adjust to available RAM)
# Resolution:
# Persist the ARC size in /etc/modprobe.d/zfs.conf:
# options zfs zfs_arc_max=17179869184
# Then: dracut -f (rebuilds the initramfs so the setting applies at the next boot; no reboot is needed now because the runtime value is already set)
# ─────────────────────────────────────────────────────────────────────
# ─────────────────────────────────────────────────────────────────────
# RUNBOOK: WireGuardPeerStale
# Severity: S2 (backplane connectivity lost to a peer)
# Symptom: WireGuard peer handshake age > 3 minutes (peer is unreachable)
# ─────────────────────────────────────────────────────────────────────
# Diagnostic:
wg show wg1
# → check last handshake time for the stale peer
# → check allowed-IPs and endpoint for the stale peer
ping 10.201.0.2 # can we reach the peer's WireGuard IP?
curl -sf http://10.202.0.2:9100/metrics # is the peer's node_exporter responding?
# Mitigation:
# Attempt to re-trigger handshake:
wg set wg1 peer PEER_PUBLIC_KEY endpoint PEER_HOST:PORT # re-set the endpoint (substitute the stale peer's public key and address)
# Or restart WireGuard on this node:
systemctl restart wg-quick@wg1
# If peer node is down, this is a node outage — escalate to node recovery runbook
# Resolution:
# If peer node is healthy but WireGuard is not reconnecting:
ssh peer-node 'systemctl restart wg-quick@wg1'
# If peer node has rebooted and WireGuard is not starting:
ssh peer-node 'systemctl enable --now wg-quick@wg1'
# ─────────────────────────────────────────────────────────────────────
# ─────────────────────────────────────────────────────────────────────
# RUNBOOK: DiskSpaceHigh (>80% ZFS pool utilization)
# Severity: S2 (performance degrading, approaching critical)
# ─────────────────────────────────────────────────────────────────────
# Diagnostic:
zfs list -o name,used,avail,refer -r tank | sort -k2 -rh | head -20
# → find the largest datasets
zfs list -t snapshot -o name,used -r tank | sort -k2 -rh | head -20
# → find snapshot accumulation (common cause)
# Mitigation (free space immediately):
# Destroy old snapshots (verify with sanoid policy before deleting manually):
zfs list -t snapshot -o name,creation -s creation -r tank/app | head -10 # oldest first
zfs destroy tank/app@manual-backup-2025-11-01 # destroy oldest manual snapshots
# Resolution:
# Review sanoid retention policy — is it keeping more snapshots than needed?
# Plan storage expansion if usage is growing beyond sanoid pruning
# ─────────────────────────────────────────────────────────────────────
# ─────────────────────────────────────────────────────────────────────
# RUNBOOK: ReplicationLagHigh (>2 hours behind)
# Severity: S2 (RPO target violated — DR is stale)
# ─────────────────────────────────────────────────────────────────────
# Diagnostic:
systemctl status syncoid.service
journalctl -u syncoid.service --since "3 hours ago"
# → is syncoid failing? What error?
# → is the DR host reachable over the storage WireGuard plane?
ping 10.203.0.2 # storage backplane IP of DR host
ssh dr-host 'zpool status tank' # is the DR pool healthy?
# Mitigation:
# Manually trigger a sync:
syncoid --recursive tank/app dr-host:tank/app
# → watch for errors
# Resolution:
# If syncoid is failing due to snapshot mismatch:
# On DR host: zfs rollback to the last common snapshot, then sync again
# If DR host is offline: address the outage, then sync when restored
# ─────────────────────────────────────────────────────────────────────
14. The SRE Maturity Model for kldload Deployments
Most infrastructure is at Level 0 or Level 1. The maturity model gives you a clear map of where you are and what to do next. Each level builds on the previous — you cannot skip levels, because the higher levels depend on the foundations built below.
Level 0 — Manual Everything
No automated snapshots. No monitoring. No runbooks. Changes are made ad hoc. DR is "we have a backup somewhere." Incidents are discovered by users reporting them. Recovery is measured in hours.
Level 1 — Automated Basics
sanoid running with a reasonable retention policy. Prometheus and node_exporter deployed. Basic alerts for disk, CPU, memory, and service health. SSH via WireGuard backplane. Pre-change snapshots taken (manually but consistently).
Level 2 — SLO-Driven
Defined SLIs and SLOs for every critical service. Error budget policy documented and enforced. Runbooks for every alert. Blameless postmortem process in place. syncoid replication with verified RPO. Monthly DR tests documented.
Level 3 — Fully Automated
Blue/green deploys via ZFS clone and kvm-clone. Canary deployments with automated rollback on SLO violation. Automated DR tests on schedule. Toil below 50% of operational time. cert-manager handling certificate lifecycle. All VMs provisioned via kvm-clone from golden images.
Level 4 — Self-Healing
Automatic rollback triggers when SLO burn rate exceeds threshold. Automatic failover to DR site when primary site health checks fail. Automatic capacity scaling (new storage device added to pool, VMs rescheduled on available hosts). Human involvement limited to policy decisions, not operational execution.
Getting from Level 1 to Level 2 — the highest-value jump
The jump from Level 1 to Level 2 is the most valuable reliability investment you can make. Level 1 gives you visibility and basic automation. Level 2 gives you accountability and measurable improvement. Here is the concrete sequence:
# Week 1: Define SLIs and SLOs
# For each critical service, write down:
# 1. What does "working" mean? (the SLI)
# 2. What percentage of the time must it work? (the SLO)
# 3. How do you measure it? (the Prometheus query)
# Week 2: Build the measurement infrastructure
# Add SLO recording rules to Prometheus:
groups:
  - name: slo
    interval: 30s
    rules:
      - record: job:slo_availability:rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status!~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))
# Week 3: Write runbooks for every existing alert
# For each alert in AlertManager: write the runbook in your wiki
# Template: symptom, diagnosis steps, mitigation, resolution, escalation
# Week 4: Test your DR
# Run the DR test from Section 11
# Document the result: what worked, what did not, what the actual RPO was
# Month 2: Implement the error budget policy
# Document it. Get buy-in from development.
# Start tracking error budget consumption in your weekly ops review.
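For the weekly ops review, error budget consumption reduces to one ratio: failures observed so far divided by the failures the SLO permits over the window. The helper below is a sketch of that calculation; `budget_consumed_pct` is a hypothetical name, and the request counts are assumed to come from your own totals (for example, an `increase()` over the SLO window in Prometheus).

```shell
# Hypothetical helper: percentage of the error budget consumed so far.
# $1: total requests in the SLO window   $2: failed requests   $3: SLO in percent
budget_consumed_pct() {
  awk -v total="$1" -v failed="$2" -v slo="$3" 'BEGIN {
    allowed = total * (100 - slo) / 100      # failures the SLO permits
    if (allowed <= 0) { print "no error budget at this SLO"; exit 1 }
    printf "%.1f\n", 100 * failed / allowed
  }'
}
```

With a 99.9% SLO, 1,000,000 requests permit 1,000 failures, so `budget_consumed_pct 1000000 400 99.9` reports 40.0. A value above 100 means the budget is spent and the error budget policy should take effect.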
15. The SRE Reading List
These are the books and resources that underpin everything in this masterclass. The Google books are free online — there is no reason not to read them.
- Site Reliability Engineering (Beyer, Jones, Petoff, Murphy — Google, 2016) — the original. Free at sre.google. Read chapters 3 (Embracing Risk), 4 (SLOs), 5 (Toil Elimination), and 13 (Emergency Response) first.
- The Site Reliability Workbook (Beyer, Murphy, Rensin, Kawahara, Thorne — Google, 2018) — the practical companion. Concrete implementations of every concept in the SRE book. Also free at sre.google.
- Implementing Service Level Objectives (Alex Hidalgo, 2020) — the best deep dive on SLI/SLO/error budget practice. Goes well beyond the Google books into the organizational and measurement details.
- Release It! (Michael Nygard, 2nd ed. 2018) — stability patterns and antipatterns for production software. Circuit breakers, timeouts, bulkheads. Complementary to SRE — covers the application design side of reliability.
- The Practice of Cloud System Administration (Limoncelli, Chalup, Hogan, 2014) — the operations engineering textbook. Covers capacity planning, change management, on-call design, and runbook writing in depth.
Related kldload masterclasses
- ZFS Masterclass — the foundation: pool design, snapshots, replication, encryption, and all the primitives this page uses
- Observability Masterclass — Prometheus, Grafana, alerting, the full monitoring stack
- Backplane Networks Masterclass — the encrypted infrastructure network that makes multi-site SRE possible
- Security Hardening Masterclass — hardening your infrastructure so incidents are less likely and blast radius is smaller
- Kubernetes Masterclass — Kubernetes-specific deployment patterns: rolling updates, canary with Cilium, pod disruption budgets