
Blue/Green & SRE Masterclass

This guide treats site reliability engineering the way Google intended it: as an engineering discipline with measurable targets, defined error budgets, and automation that eliminates manual toil. It then applies every principle to infrastructure you own (bare metal, OpenZFS, WireGuard, kldload), where the primitives are often better than what cloud providers charge a premium for.

By the end you will have defined SLOs, built blue/green and canary workflows around OpenZFS clones, written runbooks for every alert, and mapped your deployment maturity to a concrete checklist. The progression is zero to hero: each section builds on the last.

Site Reliability Engineering is not DevOps with a different title. SRE is the discipline of running production systems reliably at scale, practiced at Google since 2003 and codified in its 2016 book. The core principles are simple but radical: SLOs define your reliability target, error budgets determine how much risk you are allowed to take, toil is the enemy of every operations team, automation is the only answer that scales, and every change is a potential incident until proven otherwise.

On OpenZFS, these principles gain superpowers. Rollback is 2 seconds — not 2 hours. Cloning production for a test environment is free — not a separate infrastructure cost. Replication is incremental and block-level — not rsync guessing which files changed. Boot environment rollback means a bad kernel or bad configuration is a reboot away from recovery, not a reinstall. Every SRE operation that is expensive on other platforms becomes trivial on OpenZFS.

What this masterclass covers: SLIs, SLOs, and error budgets from first principles. OpenZFS as an SRE primitive. Blue/green and canary deployments with kldload. Incident response, change management, toil elimination, capacity planning, disaster recovery, observability, on-call runbooks, and a maturity model you can use to assess where you are and what to fix next.

Most SRE content assumes you are on GCP or AWS with their managed tooling: Cloud Monitoring, SLO burn rate alerts built into the console, blue/green handled by load balancer backends, DR handled by multi-region replication you pay for per GB. This page adapts every SRE principle to bare metal and OpenZFS. The principles are identical. The implementation is different, and in many ways better, because OpenZFS gives you primitives that cloud providers charge a premium for or do not offer at all. A ZFS clone of a 500 GB production dataset costs zero disk until it diverges. A cloud provider charges you for the full 500 GB copy. That is not a detail. That is a fundamental economic difference that changes how you design reliability.

1. SRE Is Not DevOps with a Different Title

DevOps is a cultural movement: break down the wall between development and operations, ship faster, deploy more often. SRE is a specific implementation of that culture with engineering discipline: you measure reliability precisely, you set targets for it, and you treat the gap between current reliability and the target as a budget you can spend on risk.

The Google SRE Book (2016, free online) defines the discipline. The core insight is that reliability is a feature, and like any feature, you need to decide how much of it to build. Too little reliability and users leave. Too much reliability and you are wasting engineering resources on uptime you do not need. The error budget is the mechanism that enforces this balance automatically.

SLI — Service Level Indicator

A quantitative measure of some aspect of the service. The raw signal. Examples: request success rate, latency at the 99th percentile, storage pool availability, replication lag in seconds.

# SLI = what you measure
# "99.3% of requests returned HTTP 200 this week"
# "p99 latency was 187ms over the last hour"

SLO — Service Level Objective

A target for an SLI. The goal. Not a promise to users — that is an SLA. The SLO is your internal engineering target that gives you a budget for risk. Set it below what you can actually achieve.

# SLO = what you aim for
# "99.9% of requests must succeed over any 30-day window"
# "p99 latency must stay below 300ms"

Error Budget

The allowed unreliability: 100% minus the SLO. At 99.9% availability, your budget is 0.1% — 43.8 minutes per month. That budget is yours to spend on deployments, experiments, and migrations. When it runs out, you freeze changes.

# Error budget = 100% - SLO
# 99.9% SLO → 43.8 min/month to spend on risk
# Budget exhausted → freeze deploys, fix reliability
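The arithmetic behind that number is easy to script. The sketch below assumes a hypothetical helper named error_budget_minutes (not part of kldload) and uses awk for the floating-point math. The result depends on the window: a strict 30-day window gives slightly less budget than an average calendar month of about 30.44 days.

```shell
# error_budget_minutes SLO_PERCENT WINDOW_DAYS
# Prints the allowed minutes of unavailability for the window, to one decimal.
# Hypothetical helper for illustration only.
error_budget_minutes() {
  awk -v slo="$1" -v days="$2" 'BEGIN {
    printf "%.1f\n", (100 - slo) / 100 * days * 24 * 60
  }'
}

error_budget_minutes 99.9 30      # strict 30-day window
error_budget_minutes 99.9 30.44   # average calendar month
```

The two calls print 43.2 and 43.8, which is why a figure quoted "per month" and a figure quoted "per 30 days" differ slightly for the same SLO.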

Toil

Manual, repetitive, automatable work that scales linearly with service size. Running backup scripts by hand. Pruning snapshots manually. Renewing certificates by reminder email. Toil is the enemy — it consumes engineering time without improving reliability.

# Toil: manual snapshots, manual pruning, manual restarts
# Not toil: writing the automation that does these
# Rule: if you do it more than twice, automate it

Error budgets are the most counterintuitive SRE concept. Being TOO reliable is actually wasteful. If your SLO is 99.9% and you are delivering 99.999%, you are spending engineering effort on five nines of reliability that your SLO does not require. The error budget says: you have 0.1% to spend. Spend it on deployments, migrations, experiments. If the budget runs out, stop deploying and fix reliability. This creates a natural, automatic balance between velocity and stability — enforced by measurement, not politics. Development wants to ship. Operations wants stability. The error budget is the shared currency that makes both teams accountable to the same number.

2. SLIs, SLOs, and Error Budgets — the Foundation

The theory is simple. Making it concrete requires choosing the right SLIs for your services, setting realistic SLO targets, and building the infrastructure to measure them continuously. Here is how to do it for a kldload web application.

Defining SLIs for a kldload web application

Start with the four golden signals (detailed in Section 12): latency, traffic, errors, and saturation. Map each to a concrete SLI:

# SLI 1: Availability — fraction of requests that succeed
# Measure: HTTP 2xx responses / total HTTP responses (excluding health checks)
# Source: nginx access log, application metrics, or Prometheus http_requests_total

# SLI 2: Latency — fraction of requests served within threshold
# Measure: requests with latency < 300ms / total requests
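# Source: sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
#         (requires a 0.3 s bucket boundary; histogram_quantile(0.99, ...) gives p99 latency, not this fraction)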

# SLI 3: Throughput — requests per second (capacity SLI)
# Measure: rate(http_requests_total[5m])
# Alert when: drops below expected baseline (traffic anomaly)

# SLI 4: Freshness — for data services, age of newest record
# Measure: time since last successful write or replication cycle
# Source: custom gauge, or syncoid completion timestamp
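As a minimal sketch of SLI 1 measured from an access log, the function below counts 2xx responses over total responses and skips health checks. It assumes nginx's "combined" log format (request path in field 7, status in field 9) and a health endpoint at /health; the function name and the sample log lines are hypothetical.

```shell
# availability_sli: reads access-log lines on stdin, prints the percent of 2xx responses.
# Assumes nginx "combined" format: field 7 is the request path, field 9 the status code.
availability_sli() {
  awk '
    $7 == "/health" { next }            # exclude health checks, as the SLI defines
    { total++ }
    $9 ~ /^2[0-9][0-9]$/ { ok++ }
    END {
      if (total == 0) { print "no-data"; exit }
      printf "%.2f\n", ok / total * 100
    }'
}

# Hypothetical sample: three application requests (one HTTP 500) and one health check.
availability_sli <<'EOF'
10.0.0.5 - - [02/Apr/2026:14:30:01 +0000] "GET /api/v2/cart HTTP/1.1" 200 512 "-" "curl/8.5"
10.0.0.6 - - [02/Apr/2026:14:30:02 +0000] "POST /api/v2/checkout HTTP/1.1" 500 87 "-" "curl/8.5"
10.0.0.9 - - [02/Apr/2026:14:30:02 +0000] "GET /health HTTP/1.1" 200 2 "-" "kube-probe"
10.0.0.7 - - [02/Apr/2026:14:30:03 +0000] "GET /api/v2/cart HTTP/1.1" 200 512 "-" "curl/8.5"
EOF
```

A log-based SLI like this is a reasonable starting point before Prometheus metrics exist; once they do, the same ratio is better computed from counters so that it survives log rotation.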

Setting SLOs

SLOs should be set at what your users actually need, not at what you think you can achieve. Start conservative. A 99.5% SLO with a real error budget is more valuable than an aspirational 99.99% that you never measure.

SLO	Downtime / year	Downtime / month	Downtime / week	Error budget (30 days)
99%	87.6 hours	7.3 hours	1.7 hours	432 minutes
99.5%	43.8 hours	3.6 hours	50 minutes	216 minutes
99.9%	8.8 hours	43.8 minutes	10.1 minutes	43.2 minutes
99.95%	4.4 hours	21.9 minutes	5.0 minutes	21.6 minutes
99.99%	52.6 minutes	4.4 minutes	1.0 minute	4.3 minutes

The error budget policy

An error budget without a policy is just a number. The policy defines what happens when the budget reaches specific thresholds:

# Error budget policy — example for a 99.9% SLO (43.8 min/month)

# Budget > 50% remaining: normal operations, deploy freely
# Budget 25-50% remaining: all deployments require pre-deploy snapshot + rollback plan
# Budget 10-25% remaining: new features freeze, reliability work only
# Budget < 10% remaining: all changes freeze, on-call focus shifts to root cause
# Budget exhausted (0%): incident declared, postmortem required before any deploys resume
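The policy tiers can be encoded as a small gate that a deploy script consults before it runs. This is a sketch only: budget_policy is a hypothetical helper, and the handling of exact boundary values (25% counts as the guarded tier, 10% as reliability-only) is an assumption the policy text leaves open.

```shell
# budget_policy REMAINING_PERCENT
# Maps the remaining error budget to the policy tier listed above.
# Hypothetical helper; boundary handling is an assumption.
budget_policy() {
  awk -v r="$1" 'BEGIN {
    if      (r <= 0)  print "exhausted: declare incident, postmortem before deploys resume"
    else if (r < 10)  print "freeze: all changes stop, focus on root cause"
    else if (r < 25)  print "reliability-only: feature freeze"
    else if (r <= 50) print "guarded: snapshot and rollback plan required"
    else              print "normal: deploy freely"
  }'
}

budget_policy 62
budget_policy 18
```

A deploy pipeline could refuse to proceed unless the first word of the output is "normal:" or "guarded:", which turns the written policy into an enforced one.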

Most teams set SLOs and then ignore them. The error budget policy is what makes SLOs actionable. Without it, the SLO is a dashboard number. With it, the SLO is a gate: when the budget runs low, the policy automatically restricts what engineers are allowed to deploy. This is not bureaucracy — it is the mechanism that prevents "just one more deploy" from running you out of reliability budget. The policy should be agreed on by both development and operations before you ever start measuring. If you try to impose it after the fact, development will resist. If it is agreed upfront, it is just the rules of the game.

3. OpenZFS as an SRE Primitive

Traditional SRE treats rollback as expensive. Restore from backup (slow), redeploy (slow), verify (slow), notify stakeholders (slow). The total time from "we need to rollback" to "rollback complete" is measured in hours. On OpenZFS, that same operation is measured in seconds — and the rollback is atomic and byte-perfect.

This changes the economics of reliability engineering. When rollback is expensive, you are conservative about changes. When rollback is 2 seconds, you can take more risk — which means more deployments, more experiments, and faster iteration, all within the same error budget.

Snapshots — instant recovery points

An OpenZFS snapshot is atomic, zero-cost at creation, and captures the exact filesystem state at that moment. Create one before every change: before a deploy, before a migration, before a package update. If anything breaks, roll back instantly.

zfs snapshot tank/app@pre-deploy-2026-04-02
# ... something breaks ...
zfs rollback tank/app@pre-deploy-2026-04-02
# done. 2 seconds.

Clones — zero-cost environment copies

A ZFS clone is a writable copy of a snapshot. It uses zero disk until it diverges from the parent. Clone production to create a staging environment, the green side of a blue/green pair, or a canary node, all at no up-front storage cost.

zfs clone tank/app@pre-deploy tank/app-green
# green is now a byte-identical copy of production
# disk cost: 0 until green diverges

Send/receive — measurable RPO

zfs send | zfs receive streams incremental block-level changes to a remote replica. Every snapshot sent gives you a measurable RPO: if you replicate every 15 minutes, your worst-case data loss is roughly 15 minutes plus the time the transfer takes. syncoid automates this with retry logic and progress reporting.

syncoid --recursive tank/app dr-host:tank/app-replica
# replicates only blocks changed since last sync
# RPO = replication interval (e.g. 15 minutes)
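An RPO is only measurable if something checks it. The sketch below compares the time of the last successful sync with the current time and reports whether the lag is within target. The helper name rpo_check is hypothetical, and where the last-sync timestamp comes from (for example a file touched after each successful syncoid run) is left as an assumption.

```shell
# rpo_check LAST_SYNC_EPOCH NOW_EPOCH TARGET_SECONDS
# Prints the replication lag and whether it is within the RPO target.
# Returns non-zero on a breach so it can drive an alert.
rpo_check() {
  lag=$(( $2 - $1 ))
  if [ "$lag" -le "$3" ]; then
    echo "ok lag=${lag}s target=${3}s"
  else
    echo "breach lag=${lag}s target=${3}s"
    return 1
  fi
}

# Example: last successful sync 10 minutes ago against a 15-minute target.
now=$(date +%s)
rpo_check "$(( now - 600 ))" "$now" 900
```

Passing both timestamps as arguments keeps the function testable; a timer-driven wrapper would supply the real values.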

Boot environments — OS-level rollback

kldload installs the OS itself onto a ZFS dataset. Before a kernel update, package upgrade, or configuration change, create a boot environment. If the system fails to boot or misbehaves, select the previous boot environment at the bootloader. Recovery is one reboot.

ksnap pre-kernel-update
dnf update kernel
# system boots broken
# select previous BE at grub menu
# system is back. root cause: investigate later.

Checksums — silent corruption detection

Every block written to an OpenZFS pool is checksummed on write and verified on read. Bit rot, hardware failure, and write errors are all detected instead of passing silently. On a mirror or RAIDZ pool, corrupted blocks are automatically repaired from redundant copies, and a periodic scrub verifies data that is rarely read. No manual fsck needed.

zpool status tank
# scan: scrub repaired 0B in 00:04:22
# no errors, or:
# checksum errors: 3 → investigate disk health

Compression — transparent and always-on

lz4 compression on all datasets costs nearly nothing in CPU and typically reduces storage usage by 30-60% for text, logs, and database content. This extends the time before you need to expand storage capacity — directly affecting your capacity planning SLIs.

zfs set compression=lz4 tank/app
zfs get compressratio tank/app
# compressratio: 2.41x (storage burn rate more than halved)

The SRE operations table

SRE operation	Traditional approach	OpenZFS approach	Time difference
Pre-deploy checkpoint	rsync backup to staging server	zfs snapshot (atomic)	minutes → <1s
Application rollback	Restore backup, redeploy, verify	zfs rollback to snapshot	hours → 2s
Staging environment	Provision new VM, copy data, configure	zfs clone of production snapshot	hours → seconds
DR replica	Full rsync, periodic, bandwidth-heavy	zfs send incremental (block-level)	GBs → MBs/cycle
OS rollback (bad kernel)	Boot from rescue, reinstall, reconfigure	Select previous boot environment at GRUB	hours → 1 reboot
DR test	Spin up separate DR site, hope it works	Clone replica, boot it, verify, destroy	days → minutes
Corruption detection	Discover it at restore time (too late)	Checksummed on every read, repaired from mirrors	silent failure → automatic

The Google SRE book talks about "defense in depth" for data integrity. OpenZFS implements it at the filesystem level: every block is checksummed on write and verified on read. If the checksum does not match and you have a mirror or RAIDZ, ZFS reads the correct copy from redundant storage and silently repairs the bad block. This eliminates an entire class of silent failures that ext4 and XFS, which checksum metadata but not file contents, cannot detect: bit rot, marginal sectors that read bad intermittently, firmware bugs that write wrong data. On a kldload system, this protection is on by default from the first boot. There is nothing to configure.

4. Blue/Green Deployments — the Core Pattern

Blue/green is the most reliable deployment strategy for stateless services. The idea is simple: you run two identical environments. One is live (blue). One is idle (green). You deploy the new version to green, test it, then switch all traffic to green. Blue becomes your instant rollback. When you are confident green is healthy, you retire blue.

The traditional problem with blue/green is cost: you need 2x the hardware, 2x the configuration, and a reliable mechanism to switch traffic. OpenZFS removes the first two as far as storage and configuration are concerned. A clone of blue costs zero disk until it diverges, although green still needs its own CPU and RAM while it runs. The clone IS production at the moment of cloning, so no configuration drift is possible. You still need a traffic switch, but because blue stays intact, the rollback is a load balancer change, not a data restoration.

OpenZFS blue/green — step by step

# Assume: production web application running on KVM VM with ZFS root
# blue-app: the current production VM
# Disk: tank/vms/blue-app (ZFS dataset backing the VM image)

# Step 1: Snapshot blue (production) before the deploy
# Capture the name once so later steps refer to the same snapshot
SNAP="pre-deploy-$(date +%Y%m%d-%H%M)"
zfs snapshot "tank/vms/blue-app@${SNAP}"

# Step 2: Clone blue to create green (instant, zero disk cost)
# Shown for clarity: per the note in step 3, kvm-clone performs this clone itself,
# so skip this line if you use kvm-clone
zfs clone "tank/vms/blue-app@${SNAP}" tank/vms/green-app

# Step 3: Boot green as a new VM using the clone
# (kvm-clone does this automatically: clones ZFS dataset + generates new VM config)
kvm-clone blue-app green-app

# Step 4: Deploy new application version to green
# (green is identical to blue — same OS, same config, same data)
ssh green-app 'cd /opt/app && git pull && systemctl restart app'

# Step 5: Run smoke tests against green
curl -sf http://green-app.internal/health || exit 1
./smoke-tests.sh --target green-app.internal

# Step 6: Switch traffic — update load balancer or DNS
# (DNS change, HAProxy backend swap, floating IP reassignment, etc.)
# Example: update HAProxy to route all traffic to green-app
haproxy-swap blue-app green-app   # your automation here

# Green is now production. Blue is your rollback.

# Step 7a: If green breaks after traffic switch — rollback in 2 seconds
haproxy-swap green-app blue-app
# Users are back on blue. Investigate green at leisure.

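# Step 7b: If green is healthy after the confidence period (e.g. 30 min)
# Retire blue (or keep it as a warm spare for one release cycle)
virsh destroy blue-app              # stop the old blue VM if it is still running
virsh undefine blue-app             # old blue VM gone
zfs promote tank/vms/green-app      # green takes over the shared snapshot; blue becomes the dependent clone
zfs destroy -r tank/vms/blue-app    # now removable, along with any snapshots blue still holds
# tank/vms/green-app is now the new blue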
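The single curl in step 5 fails the deploy if green is still starting up. A retry loop makes the gate more forgiving without making it indefinite. The sketch below assumes a hypothetical helper, wait_healthy, that re-runs any command until it succeeds or a fixed number of attempts is used up.

```shell
# wait_healthy ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Hypothetical helper for illustration only.
wait_healthy() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$(( i + 1 ))
    if [ "$i" -le "$attempts" ]; then sleep "$delay"; fi
  done
  echo "unhealthy after $attempts attempt(s)" >&2
  return 1
}

# Gate the traffic switch on green's health endpoint (hostnames from the steps above):
#   wait_healthy 15 2 curl -sf -o /dev/null http://green-app.internal/health \
#     && haproxy-swap blue-app green-app
```

Because the check command is passed as arguments, the same function can gate on the smoke-test script instead of curl.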

The blue/green naming convention

Do not hardcode the names "blue" and "green" into your configuration. Instead, use a symbolic pointer — a symlink, a DNS record, a load balancer backend group — that points to whichever environment is currently live. The environments themselves can be named by timestamp or release version:

# Convention: production always points to current, standby to previous
# DNS: app.internal → points to current production IP
# Standby: standby-app.internal → points to idle environment

# Release naming instead of blue/green:
# tank/vms/app-v1.4.2  ← current production
# tank/vms/app-v1.5.0  ← new version, being tested

# After successful cutover:
# tank/vms/app-v1.4.2  ← warm standby for this release cycle
# tank/vms/app-v1.5.0  ← production

# After confidence period:
# tank/vms/app-v1.4.2  → destroy (zfs promote app-v1.5.0 first if it was cloned from v1.4.2)
# tank/vms/app-v1.5.0  → rename to app-current

The OpenZFS blue/green pattern eliminates the two biggest costs of traditional blue/green: hardware cost (clones use zero disk — you are not paying for a second server full of data) and configuration drift (the clone IS production at the moment of cloning, byte for byte, so there is no "is green configured the same as blue?" question). The clone diverges only as you write to it. This property — that the starting point is guaranteed identical — is something no CM tool, no Ansible playbook, no Terraform module can give you. Those tools approximate production. A ZFS clone is production, forked at a point in time.

5. Canary Deployments

Canary is blue/green at a smaller blast radius. Instead of switching all traffic at once, you route a small percentage (10%, 5%, even 1%) to the new version and observe how it behaves under real production load. If the canary metrics match production, you expand. If they diverge, you pull the canary without ever affecting the majority of users.

Canary with kvm-clone

# Assume: 5-node application cluster: app-1 through app-5
# Load balancer routes requests round-robin across all five nodes

# Step 1: Clone app-1 to create the canary
zfs snapshot tank/vms/app-1@pre-canary-$(date +%Y%m%d-%H%M)
kvm-clone app-1 app-canary

# Step 2: Deploy new version to canary only
ssh app-canary 'cd /opt/app && git checkout v1.5.0 && systemctl restart app'

# Step 3: Route ~10% of traffic to canary (HAProxy weight-based)
# app-1 through app-5: weight 10 each = 50 of 56 total, ~89% of traffic
# app-canary: weight 6 = 6 of 56, ~11% (weight 10 would give it ~17%)
# Adjust to your load balancer's weighting mechanism

# Step 4: Monitor the canary vs production
# Key comparison: error rate, latency p99, resource usage
# If the canary breaches the success criteria you defined before deploying: pull it immediately

# Step 5a: Canary healthy — expand to full fleet
for node in app-1 app-2 app-3 app-4 app-5; do
  ssh $node 'cd /opt/app && git checkout v1.5.0 && systemctl restart app'
  sleep 30  # rolling, one node at a time
done
# Remove canary from rotation, destroy it

# Step 5b: Canary broken, pull it
# Remove app-canary from load balancer rotation
# The pre-canary snapshot belongs to app-1, so the clone cannot be rolled back to it.
# Destroy the canary and re-clone when you are ready to retry:
# virsh destroy app-canary; virsh undefine app-canary; zfs destroy tank/vms/app-canary
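Weight-based splitting is easy to get wrong by eye, because the canary's share is its weight divided by the sum of all weights, not by the weight of a single node. The sketch below assumes two hypothetical helpers: canary_share reports the share a given weight yields, and canary_weight finds the integer weight closest to a target share.

```shell
# canary_share NODES NODE_WEIGHT CANARY_WEIGHT
# Prints the percentage of requests the canary receives under weighted round-robin.
canary_share() {
  awk -v n="$1" -v w="$2" -v c="$3" 'BEGIN {
    printf "%.1f\n", c / (n * w + c) * 100
  }'
}

# canary_weight NODES NODE_WEIGHT TARGET_PERCENT
# Prints the integer canary weight closest to the target share.
canary_weight() {
  awk -v n="$1" -v w="$2" -v t="$3" 'BEGIN {
    printf "%d\n", t * n * w / (100 - t) + 0.5
  }'
}

canary_weight 5 10 10    # weight for ~10% with five nodes at weight 10
canary_share 5 10 6      # share that weight actually yields
canary_share 5 10 10     # share if the canary had the same weight as each node
```

For five nodes at weight 10 this prints 6, 10.7, and 16.7: an equally weighted canary takes about a sixth of the traffic, not a tenth.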

Canary with Cilium L7 traffic splitting

For Kubernetes workloads on kldload, Cilium enables L7-aware traffic splitting. You can canary a specific API endpoint — route 10% of /api/v2/checkout requests to the new version while all other endpoints stay on the stable version. This is more surgical than weight-based load balancing:

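# CiliumEnvoyConfig fragment for L7 traffic splitting (canary for /api/v2/checkout)
# Route section only: a complete resource also needs spec.services, a listener,
# and cluster names in the form Cilium expects (namespace/service)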
apiVersion: cilium.io/v2
kind: CiliumEnvoyConfig
metadata:
  name: checkout-canary
spec:
  resources:
  - "@type": type.googleapis.com/envoy.config.route.v3.RouteConfiguration
    name: checkout-route
    virtual_hosts:
    - name: checkout
      domains: ["app.internal"]
      routes:
      - match:
          prefix: "/api/v2/checkout"
        route:
          weighted_clusters:
            clusters:
            - name: checkout-stable
              weight: 90
            - name: checkout-canary
              weight: 10

What to monitor during a canary

The canary decision is binary: expand or rollback. Make it quantitative, not a gut feeling. Define the rollback criterion before you deploy:

# Canary success criteria (define BEFORE deploying):
# 1. Error rate on canary <= 1.5x error rate on production for 10 minutes
# 2. p99 latency on canary <= 1.2x p99 latency on production
# 3. No crash loops (canary process restarts > 2 in 10 minutes = fail)
# 4. No error budget consumption > 10% of monthly budget in first 10 minutes

# Prometheus queries for canary comparison:
# Error rate ratio (parentheses matter: a / b / c / d evaluates left to right):
(
  sum(rate(http_requests_total{job="app-canary",status=~"5.."}[5m])) /
  sum(rate(http_requests_total{job="app-canary"}[5m]))
)
/
(
  sum(rate(http_requests_total{job="app-production",status=~"5.."}[5m])) /
  sum(rate(http_requests_total{job="app-production"}[5m]))
)
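Once the metrics are in hand, the expand-or-rollback decision can be made mechanical. The sketch below encodes the first three success criteria in a hypothetical helper, canary_verdict; criterion 4 (error budget consumption) is left out because it needs the budget calculation as an extra input. Note that when production has a zero error rate, any canary errors at all trigger a rollback.

```shell
# canary_verdict CANARY_ERR PROD_ERR CANARY_P99 PROD_P99 RESTARTS
# Error rates are fractions (0.002 = 0.2%); both latencies use the same unit.
# Hypothetical helper covering criteria 1 to 3 only.
canary_verdict() {
  awk -v ce="$1" -v pe="$2" -v cl="$3" -v pl="$4" -v rs="$5" 'BEGIN {
    if (ce > 1.5 * pe) { print "rollback: error rate"; exit }
    if (cl > 1.2 * pl) { print "rollback: p99 latency"; exit }
    if (rs > 2)        { print "rollback: crash loop"; exit }
    print "expand"
  }'
}

canary_verdict 0.002 0.002 190 180 0   # within all thresholds
canary_verdict 0.004 0.002 190 180 0   # error rate is 2x production
```

Feeding the function from Prometheus query results is left to the surrounding automation; the point is that the thresholds live in one reviewed place rather than in someone's head during the rollout.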

Canary deployments are blue/green at a smaller scale. Instead of cloning the entire environment, you clone one node. The risk is proportional to the percentage of traffic you route to the canary. At 10% canary traffic, 10% of users are affected if the canary is broken. Combined with Cilium's L7 traffic splitting, you can canary specific API endpoints rather than whole services. This means the blast radius of a bad deploy can be as small as "10% of users hitting the checkout endpoint" rather than "10% of all users." The more precisely you can define the blast radius, the more aggressively you can canary. OpenZFS makes the clone free in storage terms, Cilium makes the routing precise, and Prometheus makes the success criteria measurable. Together they give you canary infrastructure whose only real cost is the CPU and RAM of one extra node, with a blast radius you choose in advance.

6. Rolling Updates with Rollback

Rolling updates are the safest deployment strategy for stateful services and clusters. Instead of swapping all traffic at once (blue/green) or running a separate canary node alongside the fleet (canary), you update one node at a time, verify it is healthy, then proceed to the next. If any node fails, you roll back that node and stop the rollout.

The kvm-snap rolling pattern

# Rolling update across a 5-node application cluster
# Pattern: snapshot → update → verify → continue → snapshot next

NODES="app-1 app-2 app-3 app-4 app-5"
NEW_VERSION="v1.5.0"
VERIFY_WAIT=300  # 5 minutes between nodes

for node in $NODES; do
  echo "=== Updating $node ==="

  # 1. Snapshot the node before update (capture the name once so the rollback uses the same one)
  SNAP="tank/vms/${node}@pre-update-$(date +%Y%m%d-%H%M)"
  zfs snapshot "$SNAP"

  # 2. Drain the node from the load balancer
  haproxy-drain "$node"

  # 3. Apply the update
  ssh "$node" "cd /opt/app && git checkout $NEW_VERSION && systemctl restart app"

  # 4. Wait for the node to become healthy
  if ! timeout 60 bash -c "until curl -sf http://$node.internal/health; do sleep 2; done"; then
    echo "ERROR: $node health check failed. Rolling back."
    # Assumes the libvirt domain is named after the node.
    virsh destroy "$node"     # stop the VM so its disk is not rolled back underneath it
    zfs rollback "$SNAP"
    virsh start "$node"
    haproxy-restore "$node"
    exit 1
  fi

  # 5. Re-add to load balancer
  haproxy-restore $node

  # 6. Wait before proceeding to next node
  echo "Node $node healthy. Waiting ${VERIFY_WAIT}s before next node..."
  sleep $VERIFY_WAIT
done

echo "Rolling update complete. All nodes on $NEW_VERSION."

The "pause and assess" pattern

Do not update all nodes in a tight loop. Use a graduated rollout: update one node, wait, update a few more, wait longer, update the rest. The longer wait periods give slow-burn failures (memory leaks, connection pool exhaustion, caching bugs) time to surface before they affect the entire cluster:

# Graduated rollout schedule (5-node cluster):
# Node 1:         update, verify 10 minutes
# Nodes 2-3:      update, verify 30 minutes
# Nodes 4-5:      update, verify 60 minutes, declare success

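# The pause-and-assess pattern for Kubernetes:
kubectl set image deployment/app app=app:v1.5.0
kubectl rollout pause deployment/app   # pause right away; some pods may already be updated

# Wait 10 minutes, check metrics, then either continue:
kubectl rollout resume deployment/app
# or roll back (a paused rollout must be resumed before undo is accepted):
kubectl rollout resume deployment/app && kubectl rollout undo deployment/app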

Rolling updates are the right strategy for stateful services. Blue/green is the right strategy for stateless services. The difference: stateless services can have all traffic swapped at once because there is no session state tied to a specific node. Stateful services (database clusters, Kubernetes clusters, ZFS storage fleets) need to update one node at a time because abrupt wholesale replacement risks split-brain, quorum loss, or data inconsistency. On OpenZFS, the per-node snapshot before each update gives you granular rollback: if node 3 breaks during a rolling update, you roll back node 3 alone, not the entire cluster. Nodes 1 and 2 are already running the new version successfully. You do not lose their progress.

7. Change Management — Every Change Is a Potential Incident

The Google SRE book is unambiguous: roughly 70% of outages are caused by changes to a live system. Not hardware failures, not cosmic rays, not mysterious network events. Changes. A configuration change. A package update. A schema migration. A kernel upgrade. The implication is that every change needs a rollback plan before it executes, not after.

The pre-change checklist

# Pre-change checklist — required before every significant change

# 1. Create a pre-change snapshot
CHANGE_ID="pkg-update-kernel-$(date +%Y%m%d-%H%M)"
zfs snapshot -r tank@${CHANGE_ID}

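# 2. Verify the rollback procedure works
# (actually test it, don't assume; run this on a non-prod equivalent, not the live dataset)
# zfs rollback tank/app@${CHANGE_ID}

# 3. Document the rollback command in the change ticket
# zfs rollback works on one dataset at a time; -r destroys newer snapshots, it does not recurse into children
# Example: "To rollback: zfs rollback -r tank/app@pkg-update-kernel-20260402-1430"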

# 4. Notify stakeholders (maintenance window, who to contact if broken)

# 5. Set a change window with a defined abort time
# "If the change is not complete and verified by 16:00, abort and rollback"

# 6. Verify monitoring is in place to detect the change failing
# Check that Prometheus alerts are active, Grafana dashboard is open
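A recursive snapshot covers every child dataset, but a rollback has to be issued per dataset. Generating the rollback lines for the change ticket avoids typing them under pressure. The sketch below assumes a hypothetical helper, rollback_commands, and an example dataset tank/db that does not appear elsewhere in this guide.

```shell
# rollback_commands CHANGE_ID DATASET...
# Prints one rollback command per dataset, ready to paste into the change ticket.
# Hypothetical helper; it only prints commands and never runs them.
rollback_commands() {
  change_id=$1; shift
  for ds in "$@"; do
    printf 'zfs rollback %s@%s\n' "$ds" "$change_id"
  done
}

CHANGE_ID="pkg-update-kernel-20260402-1430"
rollback_commands "$CHANGE_ID" tank/app tank/db
```

On a live host the dataset list could come from `zfs list -H -o name -r tank`, filtered down to the datasets the change actually touches.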

Change categories and rollback strategies

Category	Examples	Rollback strategy	Rollback time
Routine	App config change, package update	zfs rollback to pre-change snapshot	2 seconds
Significant	Kernel update, major config change	Boot environment rollback at GRUB	1 reboot
Critical	Schema migration, cluster upgrade	Full blue/green — clone before, swap back if broken	<60 seconds
Emergency	Hotfix under active incident	Snapshot before hotfix, rollback if hotfix fails	2 seconds

The blameless postmortem

Every significant incident produces a postmortem. The postmortem is not about blame — it is about learning. The blameless postmortem assumes that every engineer involved acted with the information they had at the time, and that the failure is systemic, not individual. The goal is to find the systemic causes and fix them:

# Postmortem template

# Incident: [name] — [date]
# Severity: S[1-4]
# Duration: [start] to [end] = [X minutes of impact]
# Affected: [users, services, error rate, revenue]

# Timeline:
# [timestamp] — [what happened]
# [timestamp] — [who detected it, how]
# [timestamp] — [first mitigation attempt]
# [timestamp] — [service restored]
# [timestamp] — [root cause identified]

# Root cause:
# [single sentence: what actually went wrong]

# Contributing factors:
# [what made the root cause possible or harder to detect]
# [what slowed down detection or mitigation]

# What went well:
# [things that helped — monitoring caught it fast, rollback worked instantly]

# Action items:
# [ID] [Owner] [Due date] — [what we are changing to prevent this]

The Google SRE book says hope is not a strategy. Every change needs a rollback plan before you execute it. On OpenZFS, the rollback plan is always the same: snapshot before, roll back if broken. The cost of this insurance is close to zero: snapshots are instant, and they consume space only as live data diverges from them, so prune old ones on a schedule. There is no excuse for not snapshotting before every change. The pre-change snapshot is not optional ceremony. It is the foundation that makes every SRE operation on this list possible. Without it, you are back to traditional ops: hope nothing breaks, spend hours recovering when it does.

8. Incident Response — the SRE Playbook

Incident response is the hardest SRE discipline to teach because every incident is different. But the structure — the lifecycle — is always the same. The key insight that separates SRE incident response from traditional ops is this: mitigation and resolution are different phases, and you should never be trying to resolve while users are down.

Severity levels

Severity	Definition	Response time	Requires postmortem?
S1	All users affected, service down or unusable	Immediate (minutes)	Always
S2	Significant user impact, degraded service	<30 minutes	Always
S3	Minor impact, subset of users affected	<4 hours	If novel failure mode
S4	Cosmetic, no user impact	Next business day	No

The incident lifecycle

# Phase 1: Detect
# Alert fires → on-call paged → incident channel opened
# "Something is broken" — do not try to diagnose yet

# Phase 2: Assess (< 5 minutes)
# What is broken? What is the blast radius?
# "API endpoint /checkout returning HTTP 500 for 60% of requests"
# "ZFS pool degraded — one disk offline, data redundancy lost"
# Set severity level here.

# Phase 3: Mitigate — restore service ASAP (not root cause!)
# OpenZFS mitigations:
zfs rollback tank/app@pre-deploy-20260402-1430   # bad deploy → rollback
kvm-snap rollback web-01                          # broken node → restore snapshot
# GRUB → previous boot environment                # bad kernel/packages → 1 reboot
haproxy-swap green-app blue-app                   # bad green → swap back to blue

# Phase 4: Communicate
# Update status page, notify stakeholders, post in incident channel:
# "[14:45] Mitigation applied — rolling back to v1.4.2. Monitoring recovery."
# "[14:47] Service restored. Error rate back to 0.1%. Root cause investigation ongoing."

# Phase 5: Resolve
# NOW you investigate root cause — after service is restored
# Preserve evidence: logs, metrics, snapshots of the bad state
# Do not destroy the bad snapshot until postmortem is complete

# Phase 6: Postmortem (within 48 hours for S1/S2)
# Blameless. Timeline. Root cause. Contributing factors. Action items.

The key SRE insight is that mitigation and resolution are different phases. Traditional ops tries to fix the problem while users are down. The engineer is under pressure, trying to diagnose and repair simultaneously, and often makes things worse. SRE says: restore service first by any means available — rollback, failover, disable the broken feature — and THEN investigate. On OpenZFS, mitigation is always fast because rollback is always available. You restore service in seconds, then spend hours understanding why it broke — with the system healthy and users satisfied. The incident timeline looks completely different: instead of "users affected for 2 hours while we diagnosed and fixed," it becomes "users affected for 3 minutes while we rolled back, then 2 hours of investigation with the system healthy." Same root cause work. Radically different user experience.
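The difference between those two timelines can be expressed in error budget terms. The sketch below assumes a hypothetical helper, budget_consumed_pct, and treats the incident as a full outage for its whole duration, which overstates the cost of a partial degradation.

```shell
# budget_consumed_pct IMPACT_MINUTES SLO_PERCENT WINDOW_DAYS
# Percentage of the window's error budget used by a full outage of that length.
# Hypothetical helper for illustration only.
budget_consumed_pct() {
  awk -v m="$1" -v slo="$2" -v days="$3" 'BEGIN {
    budget = (100 - slo) / 100 * days * 24 * 60
    printf "%.1f\n", m / budget * 100
  }'
}

budget_consumed_pct 3 99.9 30     # roll back first, investigate later
budget_consumed_pct 120 99.9 30   # diagnose and fix while users are down
```

At a 99.9% SLO over 30 days, the 3-minute rollback uses 6.9% of the budget, while the 2-hour outage uses 277.8%, more than two and a half months of budget in one incident.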

9. Toil Elimination — Automate the Repetitive

Toil is manual, repetitive, automatable work that scales linearly with the size of your service. As your fleet grows, the toil grows with it — unless you automate it. Google's SRE teams have a hard rule: no more than 50% of an engineer's time on toil. If it exceeds that, you stop adding features and fix the automation.

Identifying toil in your infrastructure

# Signs you are doing toil:
# - You have a weekly "maintenance" task you do by hand
# - You SSH into machines to run commands that should be automated
# - You prune snapshots manually ("cleanup time again")
# - You renew certificates by calendar reminder
# - You check disk space by logging in and running df -h
# - You rotate logs manually
# - You run backup scripts because "cron was flaky"
# - You provision new VMs by copy-pasting commands from a wiki

# These are all toil. All of them have better answers.
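As one small example of replacing a manual check, the sketch below turns "log in and run df -h" into a filter that flags pools above a capacity threshold. It assumes the tab-separated output of `zpool list -H -o name,capacity`; the function name, the 80% threshold, and the sample input are hypothetical.

```shell
# pools_over_threshold THRESHOLD_PERCENT
# Reads "name<TAB>capacity" lines on stdin and prints pools at or above the threshold.
# Exits non-zero when at least one pool is flagged, so it can drive an alert.
pools_over_threshold() {
  awk -F '\t' -v limit="$1" '
    { cap = $2; sub(/%$/, "", cap) }      # tolerate a trailing percent sign
    cap + 0 >= limit { printf "%s %d%%\n", $1, cap; found = 1 }
    END { exit found ? 1 : 0 }
  '
}

# On a live host (run from a systemd timer, not by hand):
#   zpool list -H -o name,capacity | pools_over_threshold 80
# Hypothetical sample input:
printf 'tank\t83%%\nrpool\t41%%\n' | pools_over_threshold 80 || echo "capacity alert"
```

In practice a Prometheus exporter would replace this, but the shape is the same: the machine checks on a schedule and a human hears about it only when a threshold is crossed.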

How kldload eliminates toil

sanoid — snapshot lifecycle automation

Automated snapshot creation and pruning. Define a policy: hourly snapshots for 24 hours, daily for 30 days, monthly for 12 months. sanoid runs via systemd timer and enforces the policy continuously. You never think about snapshots again.

// /etc/sanoid/sanoid.conf
// [tank/app]
// use_template = production
// [template_production]
// hourly = 24
// daily = 30
// monthly = 12
// autosnap = yes
// autoprune = yes

syncoid — replication automation

Automated incremental ZFS replication. A systemd timer runs syncoid every 15 minutes. It streams only the changed blocks since the last sync. No rsync, no full copies, no manual intervention. RPO is 15 minutes with zero operational toil.

// systemd timer: syncoid.timer every 15 minutes
// syncoid --recursive tank/app dr-host:tank/app-replica
// RPO: 15 minutes. Toil: zero.

cert-manager + step-ca — certificate automation

cert-manager (Kubernetes) or step-ca (anywhere) handles the full certificate lifecycle: issue, renew, rotate, revoke. Certificates are renewed automatically before expiry. No calendar reminders, no manual openssl commands, no "our cert expired at 3am" incidents.

// cert-manager + step-ca issuer
// Certificate renewed automatically well before expiry (cert-manager defaults to two-thirds of lifetime; tune with renewBefore)
// No manual intervention. No expiry surprises.

systemd timers — maintenance automation

Replace all cron jobs with systemd timers. Timers have built-in logging (journald), dependency management, randomized delays to spread load, and restart-on-failure semantics. They are observable in a way cron never was.

// systemctl list-timers
// → see every scheduled task, last run, next run
// journalctl -u syncoid.service
// → full logs of every sync, errors included

kvm-clone — VM provisioning automation

Clone a golden image to provision a new VM: clone the ZFS dataset, generate a new MAC and hostname, boot. What takes 30 minutes manually (download ISO, install OS, configure, test) takes 30 seconds with kvm-clone from a pre-built image.

// kvm-clone golden-image new-worker-04
// → new VM running in 30 seconds
// → byte-identical starting point to golden image

Prometheus alerting — reactive toil elimination

Alert on conditions that would otherwise require manual checking. Disk space at 75%? Alert. ARC hit ratio dropping? Alert. Replication lag exceeding 30 minutes? Alert. You do not check dashboards manually — you get paged when something needs attention.

// AlertManager routes → PagerDuty / SMS / Slack
// You do not log in to check things
// You get paged when something is actually wrong

Toil is the silent killer of operations teams. Every manual step is a step that can be forgotten, done wrong, or done late. The kldload tool collection — sanoid, syncoid, ksnap, kvm-clone, systemd timers, cert-manager — exists specifically to eliminate the manual steps that other platforms require. If you are still running backup scripts by hand, pruning snapshots manually, or provisioning VMs by copying commands from a wiki, you are doing toil. Measure how much time per week your team spends on toil. If it is more than 50% of operational time, stop adding features and automate. The 50% rule is not arbitrary — it is what Google discovered was the threshold beyond which teams burn out and reliability degrades.
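To measure toil rather than guess at it, tag each logged task as toil or engineering work and compute the share. A minimal sketch, assuming a hypothetical weekly work log; `WorkEntry`, its fields, and the sample hours are illustrative and not part of kldload:

```python
from dataclasses import dataclass

TOIL_CEILING = 0.50  # the 50% rule described above


@dataclass
class WorkEntry:
    task: str      # hypothetical free-text label
    hours: float   # time spent this week
    is_toil: bool  # manual, repetitive, automatable


def toil_fraction(entries: list[WorkEntry]) -> float:
    """Return the share of logged hours spent on toil (0.0 when nothing is logged)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    return sum(e.hours for e in entries if e.is_toil) / total


week = [
    WorkEntry("prune snapshots by hand", 3.0, True),
    WorkEntry("renew certificates manually", 2.0, True),
    WorkEntry("write syncoid timer unit", 5.0, False),
]
fraction = toil_fraction(week)
print(f"toil: {fraction:.0%}, over ceiling: {fraction > TOIL_CEILING}")
```

Tracking this number weekly turns "we feel busy" into a trend you can put in front of the team.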

10. Capacity Planning with OpenZFS

Capacity planning is the SRE discipline of ensuring you do not run out of resources before you have time to add more. On OpenZFS, this is easier than on any other platform because ZFS gives you exact, consistent numbers. No df vs du discrepancies. No hidden filesystem overhead. No guessing.

The numbers ZFS gives you

# Pool-level capacity — the source of truth
zpool list -v tank
# NAME        SIZE   ALLOC    FREE  CKPOINT  EXPANDSZ  FRAG   CAP  DEDUP  HEALTH
# tank       10.9T  3.21T   7.69T        -         -     4%   29%  1.00x  ONLINE

# Dataset-level usage — per service, per environment
zfs list -o name,used,avail,refer,compressratio -r tank
# NAME               USED   AVAIL   REFER  RATIO
# tank/app          142G   7.55T   118G   2.31x
# tank/vms          890G   7.55T   200G   1.87x
# tank/snapshots    240G   7.55T     -    1.00x  (snapshot accumulation)

# Snapshot growth over time — how fast is your snapshot space growing?
zfs list -t snapshot -o name,used -r tank/app | sort -k2 -h | tail -20

# ARC efficiency — is your cache working?
arc_summary | grep -E "Hit|Miss|Size"
# or via Prometheus: node_zfs_arc_hits / (node_zfs_arc_hits + node_zfs_arc_misses)

The 80% rule

ZFS performance degrades past roughly 80% pool utilization: free space becomes fragmented, and the copy-on-write allocator has to work harder to find room for every transaction group it writes. The 80% threshold is your planning trigger, not your emergency threshold. When a pool hits 75%, you start planning expansion. When it hits 80%, you execute the expansion plan:

# Prometheus alert: pool approaching 80%
- alert: ZFSPoolCapacityHigh
  expr: node_zfs_zpool_size_bytes{state="free"} / node_zfs_zpool_size_bytes{state="total"} < 0.25
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "ZFS pool {{ $labels.zpool }} is over 75% full"
    description: "Pool has {{ $value | humanizePercentage }} free. Plan expansion now."

- alert: ZFSPoolCapacityCritical
  expr: node_zfs_zpool_size_bytes{state="free"} / node_zfs_zpool_size_bytes{state="total"} < 0.15
  for: 30m
  labels:
    severity: critical
  annotations:
    summary: "ZFS pool {{ $labels.zpool }} is over 85% full — performance degrading"
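The thresholds in the two rules above can also be expressed as a small classifier, useful in a capacity report or a pre-deploy check. A sketch only: `pool_capacity_state` and its stage names are hypothetical, and the inputs are assumed to be the ALLOC and SIZE values from zpool list, in bytes:

```python
def pool_capacity_state(alloc_bytes: float, size_bytes: float) -> str:
    """Map pool utilization to the planning stages described above."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    used = alloc_bytes / size_bytes
    if used > 0.85:
        return "critical"  # same boundary as ZFSPoolCapacityCritical (< 15% free)
    if used > 0.80:
        return "expand"    # execute the expansion plan
    if used > 0.75:
        return "plan"      # same boundary as ZFSPoolCapacityHigh (< 25% free)
    return "ok"


# Example with the zpool list numbers shown earlier: 3.21T allocated of 10.9T
print(pool_capacity_state(3.21e12, 10.9e12))
```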

Forecasting storage growth

# Track used space daily and project when you hit 80%
# Simple linear forecast from Prometheus:
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[7d], 30*24*3600)
# → predicts free bytes 30 days from now based on the past 7 days of trend

# For ZFS-specific:
predict_linear(node_zfs_zpool_size_bytes{state="alloc"}[14d], 90*24*3600)
# → how much will be allocated 90 days from now?

# Clone space accounting — clones use space as they diverge
# Account for blue/green clones in your capacity plan:
# At peak: production + one full green clone = 2x production data space
# After cleanup: back to 1x
# Plan for 2.5x headroom during active blue/green periods
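predict_linear fits a least-squares line through the samples in its range. The same calculation is easy to reproduce offline, for example from a daily log of allocated bytes. A sketch under that assumption; `days_until_threshold` is a hypothetical helper, not a kldload tool:

```python
from typing import Optional


def days_until_threshold(daily_alloc_bytes: list[float], pool_size_bytes: float,
                         threshold: float = 0.80) -> Optional[float]:
    """Fit a least-squares line to daily allocation samples and return the
    days from the latest sample until the pool crosses `threshold`.
    Returns 0.0 if the fit is already past it, None if usage is flat or shrinking."""
    n = len(daily_alloc_bytes)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_alloc_bytes) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_alloc_bytes))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    target = threshold * pool_size_bytes
    current_fit = intercept + slope * (n - 1)
    if current_fit >= target:
        return 0.0
    if slope <= 0:
        return None
    return (target - current_fit) / slope


# Illustrative numbers: growing 10 units per day on a 1000-unit pool
print(days_until_threshold([100, 110, 120, 130], 1000))
```

If the answer is shorter than your hardware lead time, the expansion plan is already late.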

Capacity planning on OpenZFS is easier than on any other filesystem because ZFS gives you exact numbers. The zfs list output shows exactly how much space each dataset uses, how much the pool has free, and what the compression ratio is. There are no df vs du discrepancies (ZFS reports both correctly), no hidden overhead from journal files or filesystem metadata that surprises you, no confusion about whether snapshot space counts toward quotas. The compression ratio is particularly useful for capacity planning: if your pool has 2.3x compression and you are adding a new workload, you know from the first 24 hours of data how much real space it will consume. No guessing.
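The compression arithmetic is a single division: physical space is logical size divided by the observed compressratio. A minimal sketch, assuming the ratio measured on the first day of data stays representative; `physical_bytes` is a hypothetical helper name:

```python
def physical_bytes(logical_bytes: float, compressratio: float) -> float:
    """Estimate on-disk space for a workload from its logical size and compressratio."""
    if compressratio <= 0:
        raise ValueError("compressratio must be positive")
    return logical_bytes / compressratio


# Illustrative: 230 GiB of logical data at a 2.3x ratio
gib = 1024 ** 3
print(f"{physical_bytes(230 * gib, 2.3) / gib:.0f} GiB on disk")
```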

11. Disaster Recovery — the Ultimate SRE Test

Disaster recovery is the most important SRE capability to have and the least tested in practice. The Google SRE book says: DR plans that have not been tested are DR hopes. OpenZFS makes DR testing free — you clone the production replica, boot it, verify it, and destroy the clone. The test costs nothing and proves the DR is real.

DR tiers

Tier	Mechanism	RTO	RPO	Use case
Tier 1	Local snapshot rollback	2 seconds	0 (atomic)	Bad deploy, bad config change
Tier 2	Node recovery from snapshot	30 seconds	Snapshot interval	Node failure, VM corruption
Tier 3	syncoid replica failover at DR site	5–15 minutes	Replication interval	Site failure, hardware loss
Tier 4	Cold rebuild: kldload ISO + ZFS import	30 minutes	Last offsite snapshot	Total site loss, catastrophic failure

Tier 3 DR runbook — site failover

# Prerequisites:
# - syncoid replicating tank → dr-host:tank every 15 minutes
# - DR host has kldload installed but services dormant
# - WireGuard backplane connects production and DR sites

# Failover procedure:
# 1. Declare the incident — do not fail over on a hunch
# 2. Verify production is actually unreachable (not a monitoring flap)

# 3. On DR host: confirm the replica pool is healthy
# (syncoid receives into an imported pool, so no separate import step is needed)
ssh dr-host 'zpool status -x tank'

# 4. Check the last successful replication
ssh dr-host 'zfs list -t snapshot -o name,creation -s creation -r tank | tail -5'
# → confirms how old the last sync is (your actual RPO)

# 5. Boot the replicated VMs on DR host
ssh dr-host 'kvm-clone --from-snapshot tank/vms/app-1@syncoid_auto tank/vms/app-1'
ssh dr-host 'virsh start app-1'

# 6. Verify applications are healthy on DR site
curl -sf http://app-1.dr.internal/health

# 7. Cut DNS or BGP route to point production traffic at DR site
# (your specific mechanism: DNS TTL update, BGP announcement, floating IP reassign)

# 8. Notify stakeholders: DR site active, RPO = [timestamp of last sync]

# Recovery (returning to production site):
# 1. Restore production hardware
# 2. Sync from DR back to production: syncoid --recursive dr-host:tank tank
# 3. Verify production is current
# 4. Cut traffic back to production (planned maintenance window)
# 5. Resume normal replication from production to DR
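Step 8 asks you to report the RPO you actually achieved, which is the gap between the last replicated snapshot and the moment production failed. A small sketch of that calculation; the function names and timestamps are hypothetical, and the 15-minute target is the replication interval from the prerequisites:

```python
from datetime import datetime, timedelta, timezone


def achieved_rpo(last_sync: datetime, failure_time: datetime) -> timedelta:
    """Data-loss window: time between the last replicated snapshot and the failure."""
    if last_sync > failure_time:
        raise ValueError("last_sync is later than failure_time")
    return failure_time - last_sync


def rpo_met(achieved: timedelta, target: timedelta = timedelta(minutes=15)) -> bool:
    """True when the achieved RPO is within the replication target."""
    return achieved <= target


# Illustrative timestamps only
last_sync = datetime(2026, 4, 1, 11, 45, tzinfo=timezone.utc)
failure = datetime(2026, 4, 1, 11, 57, tzinfo=timezone.utc)
rpo = achieved_rpo(last_sync, failure)
print(f"RPO = {rpo}, within target: {rpo_met(rpo)}")
```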

Testing your DR

# DR test — run monthly, costs nothing, proves everything
# Step 1: On DR host, clone the production replica
ssh dr-host 'zfs clone tank/app@syncoid_auto-2026-04-01 tank/app-drtest'

# Step 2: Boot a test VM from the clone
ssh dr-host 'kvm-clone --from-snapshot tank/vms/app-1@syncoid_auto tank/vms/app-1-drtest'
ssh dr-host 'virsh start app-1-drtest'

# Step 3: Verify application is healthy (internal, not exposed to users)
ssh app-1-drtest 'systemctl is-active app && curl -sf localhost/health'

# Step 4: Run smoke tests against the DR test environment
./smoke-tests.sh --target app-1-drtest.internal

# Step 5: Destroy the test (production was never touched)
ssh dr-host 'virsh destroy app-1-drtest && virsh undefine app-1-drtest'
ssh dr-host 'zfs destroy tank/vms/app-1-drtest && zfs destroy tank/app-drtest'

# Result: documented proof that DR is operational, RPO and RTO measured
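To turn the test into documented proof, time each step and keep the record. A sketch assuming each step is wrapped in a Python callable that returns True on success; `run_dr_test` and the step names are hypothetical, and the lambdas stand in for the real ssh, virsh, and smoke-test calls:

```python
import time
from typing import Callable


def run_dr_test(steps: list[tuple[str, Callable[[], bool]]]) -> dict:
    """Run steps in order, stop at the first failure, and return a timing report."""
    results = []
    started = time.perf_counter()
    passed = True
    for name, action in steps:
        step_started = time.perf_counter()
        ok = bool(action())
        results.append({"step": name, "ok": ok,
                        "seconds": time.perf_counter() - step_started})
        if not ok:
            passed = False
            break
    return {"passed": passed,
            "total_seconds": time.perf_counter() - started,  # measured RTO of the test
            "steps": results}


report = run_dr_test([
    ("clone replica", lambda: True),
    ("boot test VM", lambda: True),
    ("health check", lambda: True),
])
print(report["passed"], [s["step"] for s in report["steps"]])
```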

The Google SRE book says disaster recovery plans that have not been tested are disaster recovery hopes. OpenZFS makes DR testing free: clone the replicated datasets on the DR host, boot the clone, verify everything works, destroy the clone. The test itself is a clone of an existing replication snapshot, which is instant and costs zero disk until the clone diverges. No separate DR environment to maintain between tests. No "we think the DR site works but we have never actually tried it." You run the test monthly. It takes 20 minutes. It produces documented evidence that your DR works. When you actually need DR, you are executing a procedure you have done twelve times this year, not a procedure you are trying for the first time under pressure at 3 AM.

12. Observability for SRE

SRE observability is built around the four golden signals, defined in the Google SRE book: latency, traffic, errors, and saturation. These four signals, measured at the service level, tell you whether your service is healthy. Supplement them with infrastructure signals — ZFS health, WireGuard connectivity, node resource usage — and you have complete coverage.

The four golden signals

# 1. Latency — how long requests take
# Measure: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Alert: p99 latency > 300ms for 5 minutes
# Dashboard: p50, p95, p99 on the same graph

# 2. Traffic — how much demand the system is handling
# Measure: rate(http_requests_total[5m])
# Alert: traffic drops > 50% from baseline (traffic anomaly, possible upstream failure)
# Dashboard: requests/second with historical comparison

# 3. Errors — rate of failed requests
# Measure: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Alert: error rate > 1% for 2 minutes
# Dashboard: error rate with SLO budget burn overlay

# 4. Saturation — how full the service is
# Measure: CPU, memory, ZFS ARC, disk I/O queue depth
# Alert: any saturation metric > 85% for 10 minutes
# Dashboard: all four saturation signals on one pane
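The p99 in the latency signal is an estimate: histogram_quantile finds the bucket containing the target rank and interpolates linearly inside it. A simplified sketch of that idea, useful for understanding why bucket boundaries limit precision; it is an approximation of the behavior, not the Prometheus source, and the bucket values are invented:

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket, mirroring the `le` label.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    if not ordered or not math.isinf(ordered[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in ordered:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Overflow or empty bucket: report the highest finite bound seen.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


latency_buckets = [(0.1, 50), (0.3, 90), (1.0, 99), (math.inf, 100)]
p99 = histogram_quantile(0.99, latency_buckets)
print(f"p99 = {p99:.3f}s, breaches the 300ms alert: {p99 > 0.3}")
```

With these buckets anything between 300ms and 1s lands in one bucket, so add a boundary near your alert threshold if you need a sharper answer.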

ZFS-specific signals

# ZFS pool health (0=ONLINE, 1=DEGRADED, 2=FAULTED)
node_zfs_zpool_state_value

# ARC hit ratio — cache efficiency (below 80% means cold cache or undersized ARC)
rate(node_zfs_arc_hits[5m]) / (rate(node_zfs_arc_hits[5m]) + rate(node_zfs_arc_misses[5m]))

# Pool capacity — percentage used (alert at 75%)
1 - (node_zfs_zpool_size_bytes{state="free"} / node_zfs_zpool_size_bytes{state="total"})

# Scrub status — when did the last scrub complete? (alert if > 30 days)
time() - node_zfs_zpool_scrub_end_time

# Replication lag — how far behind is the DR replica?
time() - max(node_zfs_snapshot_creation_time{snapshot=~".*syncoid.*"})

# Checksum errors — potential disk hardware failure (any value > 0 = investigate)
node_zfs_zpool_checksum_errors_total
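The ARC hit ratio query divides the increase in hits by the increase in hits plus misses over the window. The same arithmetic from two counter samples, as a sketch; `arc_hit_ratio` is a hypothetical helper and the counter values are invented:

```python
from typing import Optional


def arc_hit_ratio(hits_before: int, misses_before: int,
                  hits_after: int, misses_after: int) -> Optional[float]:
    """Hit ratio over the interval between two samples of the ARC counters.
    Returns None when there were no reads in the interval."""
    delta_hits = hits_after - hits_before
    delta_misses = misses_after - misses_before
    if delta_hits < 0 or delta_misses < 0:
        raise ValueError("counter went backwards (module reload or reboot)")
    total = delta_hits + delta_misses
    if total == 0:
        return None
    return delta_hits / total


ratio = arc_hit_ratio(1000, 100, 1900, 200)
print(f"ARC hit ratio: {ratio:.0%}, below 80% threshold: {ratio < 0.80}")
```

Using deltas matters: the lifetime ratio since boot hides a cache that went cold an hour ago.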

WireGuard signals

# WireGuard peer handshake age — stale > 3 minutes means peer is unreachable
time() - wireguard_peer_last_handshake_seconds{interface="wg1"}

# Transfer bytes (rate) — check if traffic is actually flowing
rate(wireguard_peer_received_bytes_total[5m])
rate(wireguard_peer_sent_bytes_total[5m])

# Number of connected peers (below expected = backplane failure)
count((time() - wireguard_peer_last_handshake_seconds) < 180) by (interface)
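The handshake metric is a Unix timestamp, so every staleness check subtracts it from the current time before comparing to the 3-minute limit. The same logic as a sketch; the peer names and timestamps are invented, and a timestamp of 0 is treated as "never completed a handshake":

```python
def connected_peers(last_handshake: dict[str, float], now: float,
                    max_age_seconds: float = 180.0) -> list[str]:
    """Return peers whose last handshake is newer than max_age_seconds."""
    return sorted(
        peer for peer, ts in last_handshake.items()
        if ts > 0 and (now - ts) < max_age_seconds
    )


now = 1_000.0  # illustrative clock value, seconds
peers = {"node-a": 900.0, "node-b": 700.0, "node-c": 0.0}
up = connected_peers(peers, now)
print(f"{len(up)} of {len(peers)} peers connected: {up}")
```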

SLO burn rate alerting

Standard threshold alerts (error rate > 1%) fire when you are already in trouble. SLO burn rate alerts fire when you are burning through your error budget faster than sustainable — before the budget is exhausted:

# Burn rate alerting — fires when you are consuming error budget too fast
# A burn rate of 1.0 = consuming budget at exactly the rate that would exhaust it over the SLO window
# A burn rate of 14.4 = consuming budget 14.4x faster → exhausts 30-day budget in 2 days

# Fast burn (critical): > 2% error rate sustained for 2 minutes
# Burn rate = 2% / 0.1% (error budget) = 20x → exhausts budget in 36 hours
- alert: SLOBurnRateFast
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) /
    sum(rate(http_requests_total[5m])) > 0.02
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High error rate — burning SLO budget at 20x sustainable rate"

# Slow burn (warning): > 0.5% error rate sustained for 30 minutes
# Burn rate = 5x → exhausts budget in 6 days
- alert: SLOBurnRateSlow
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[30m])) /
    sum(rate(http_requests_total[30m])) > 0.005
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Elevated error rate — burning SLO budget at 5x sustainable rate"
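The burn-rate figures in the comments above follow from two divisions: burn rate is the observed error rate over the error budget, and time to exhaustion is the SLO window over the burn rate. A sketch that reproduces them, assuming the 99.9% SLO and 30-day window implied by the 0.1% budget; the function names are hypothetical:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("slo must be below 1.0")
    return error_rate / budget


def hours_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Hours until the whole budget is gone if the current burn rate continues."""
    if rate <= 0:
        raise ValueError("burn rate must be positive")
    return window_days * 24.0 / rate


for label, error_rate in (("fast", 0.02), ("slow", 0.005)):
    rate = burn_rate(error_rate, slo=0.999)
    print(f"{label}: {rate:.1f}x, budget gone in {hours_to_exhaustion(rate):.0f} hours")
```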

The four golden signals come directly from the Google SRE book. They are universal — every service has latency, traffic, errors, and saturation. The key word in "alert on symptoms, not causes" is symptoms: users experience high latency, not high CPU. Users experience errors, not disk I/O queue depth. Alert on what users feel. Then use the underlying metrics (CPU, disk, ARC, WireGuard) to diagnose the cause after the alert fires. On kldload, add ZFS-specific signals (ARC hit ratio IS saturation for storage-heavy workloads) and WireGuard signals (handshake age IS availability for the backplane network). If the eight signals — the four golden plus ZFS pool health, ARC ratio, WireGuard peer age, and replication lag — are all green, your infrastructure is healthy.

13. On-Call and Runbooks

On-call is the mechanism that connects alerting to human response. Every alert that pages an engineer must have a runbook: a documented procedure that tells the on-call engineer exactly what to do. Without a runbook, the alert is noise — the engineer has to figure out the procedure from scratch at 3 AM under pressure, which produces mistakes and slow recovery.

Runbook format

# Runbook template
# Alert: [alert name from Prometheus]
# Severity: [S1/S2/S3/S4]
# Symptom: [what the user experiences]
# Likely causes: [ordered by probability]

# Diagnostic steps:
# 1. [what to check first — fastest to answer "is this the cause?"]
# 2. [second check]
# 3. [escalation trigger: if X is true, page the database team]

# Mitigation steps (restore service):
# 1. [fastest mitigation — often a rollback or restart]
# 2. [fallback if step 1 fails]

# Resolution steps (fix root cause — after service is restored):
# 1. [how to investigate and permanently fix]

# Escalation:
# [when to escalate, who to page, what information to provide]
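If runbooks live as structured data rather than free text, you can lint them: flag incomplete entries and list alerts that have no runbook at all. A sketch under that assumption; the `Runbook` fields mirror the template above, while the class, function names, and sample alerts are hypothetical:

```python
from dataclasses import dataclass, field

VALID_SEVERITIES = {"S1", "S2", "S3", "S4"}


@dataclass
class Runbook:
    alert: str
    severity: str
    symptom: str
    diagnostics: list[str] = field(default_factory=list)
    mitigation: list[str] = field(default_factory=list)
    resolution: list[str] = field(default_factory=list)
    escalation: str = ""

    def problems(self) -> list[str]:
        """Return the template sections that are missing or invalid."""
        issues = []
        if self.severity not in VALID_SEVERITIES:
            issues.append("severity")
        if not self.diagnostics:
            issues.append("diagnostics")
        if not self.mitigation:
            issues.append("mitigation")
        if not self.escalation:
            issues.append("escalation")
        return issues


def alerts_without_runbooks(alert_names: list[str], runbooks: list[Runbook]) -> list[str]:
    """Alerts that can page someone but have no runbook."""
    covered = {rb.alert for rb in runbooks}
    return sorted(set(alert_names) - covered)


books = [Runbook("ZFSPoolDegraded", "S2", "pool DEGRADED",
                 diagnostics=["zpool status tank"],
                 mitigation=["zpool clear tank"],
                 escalation="page storage owner")]
print(alerts_without_runbooks(["ZFSPoolDegraded", "WireGuardPeerStale"], books))
```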

Runbooks for common kldload alerts

# ─────────────────────────────────────────────────────────────────────
# RUNBOOK: ZFSPoolDegraded
# Severity: S2 (data redundancy lost — pool still running, data at risk)
# Symptom: ZFS pool is in DEGRADED state (one or more disks failed)
# ─────────────────────────────────────────────────────────────────────
# Diagnostic:
zpool status tank
# → identify which VDEV is faulted or offline
# → check if it is a transient error or a permanent disk failure
dmesg | grep -E "sd[a-z]|nvme|ata" | tail -30
# → look for disk I/O errors, SMART errors, timeout errors
smartctl -a /dev/sda   # check SMART data on the suspect disk

# Mitigation (pool is still online — data is accessible, redundancy is lost):
# If disk dropped due to a transient error:
zpool clear tank       # clear transient errors, may bring disk back online
# If disk is permanently failed:
# Order replacement. Pool is still running — no urgent action required.
# Alert is S2 because you have lost redundancy — any further failure = data loss.

# Resolution:
# Replace the failed disk:
zpool replace tank /dev/sda /dev/sdb   # online replacement
# Wait for resilver to complete (zpool status shows progress)
# ─────────────────────────────────────────────────────────────────────

# ─────────────────────────────────────────────────────────────────────
# RUNBOOK: ZFSARCHitRatioLow
# Severity: S3 (performance degraded, not an outage)
# Symptom: ARC hit ratio below 80% — reads going to disk instead of cache
# ─────────────────────────────────────────────────────────────────────
# Diagnostic:
arc_summary | grep -E "Hit|Miss|ARC Size|Target"
# → is the ARC size at its maximum? If yes, ARC is full and evicting.
# → is there a new workload pattern causing cold reads?
grep -E "^(hits|misses|size|c|c_max) " /proc/spl/kstat/zfs/arcstats | awk '{print $1, $3}'

# Mitigation:
# If ARC is hitting its limit (arc_max), increase it:
echo $((16 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max
# (sets ARC max to 16GB — adjust to available RAM)

# Resolution:
# Persist the ARC size in /etc/modprobe.d/zfs.conf:
# options zfs zfs_arc_max=17179869184
# Then: dracut -f && reboot (to rebuild initramfs with new parameter)
# ─────────────────────────────────────────────────────────────────────

# ─────────────────────────────────────────────────────────────────────
# RUNBOOK: WireGuardPeerStale
# Severity: S2 (backplane connectivity lost to a peer)
# Symptom: WireGuard peer handshake age > 3 minutes (peer is unreachable)
# ─────────────────────────────────────────────────────────────────────
# Diagnostic:
wg show wg1
# → check last handshake time for the stale peer
# → check allowed-IPs and endpoint for the stale peer
ping -c 3 10.201.0.2   # can we reach the peer's WireGuard IP?
curl -sf http://10.202.0.2:9100/metrics   # is the peer's node_exporter responding?

# Mitigation:
# Attempt to re-trigger handshake:
wg set wg1 peer PEER_PUBLIC_KEY endpoint PEER_IP:PORT   # reset the endpoint (substitute the stale peer's key and address)
# Or restart WireGuard on this node:
systemctl restart wg-quick@wg1
# If peer node is down, this is a node outage — escalate to node recovery runbook

# Resolution:
# If peer node is healthy but WireGuard is not reconnecting:
ssh peer-node 'systemctl restart wg-quick@wg1'
# If peer node has rebooted and WireGuard is not starting:
ssh peer-node 'systemctl enable --now wg-quick@wg1'
# ─────────────────────────────────────────────────────────────────────

# ─────────────────────────────────────────────────────────────────────
# RUNBOOK: DiskSpaceHigh (>80% ZFS pool utilization)
# Severity: S2 (performance degrading, approaching critical)
# ─────────────────────────────────────────────────────────────────────
# Diagnostic:
zfs list -o name,used,avail,refer -r tank | sort -k2 -rh | head -20
# → find the largest datasets
zfs list -t snapshot -o name,used -r tank | sort -k2 -rh | head -20
# → find snapshot accumulation (common cause)

# Mitigation (free space immediately):
# Destroy old snapshots (verify with sanoid policy before deleting manually):
zfs list -t snapshot -o name,creation -r tank/app | head -10
zfs destroy tank/app@manual-backup-2025-11-01   # destroy oldest manual snapshots

# Resolution:
# Review sanoid retention policy — is it keeping more snapshots than needed?
# Plan storage expansion if usage is growing beyond sanoid pruning
# ─────────────────────────────────────────────────────────────────────

# ─────────────────────────────────────────────────────────────────────
# RUNBOOK: ReplicationLagHigh (>2 hours behind)
# Severity: S2 (RPO target violated — DR is stale)
# ─────────────────────────────────────────────────────────────────────
# Diagnostic:
systemctl status syncoid.service
journalctl -u syncoid.service --since "3 hours ago"
# → is syncoid failing? What error?
# → is the DR host reachable over the storage WireGuard plane?
ping -c 3 10.203.0.2   # storage backplane IP of DR host
ssh dr-host 'zpool status tank'   # is the DR pool healthy?

# Mitigation:
# Manually trigger a sync:
syncoid --recursive tank/app dr-host:tank/app
# → watch for errors

# Resolution:
# If syncoid is failing due to snapshot mismatch:
# On DR host: zfs rollback to the last common snapshot, then sync again
# If DR host is offline: address the outage, then sync when restored
# ─────────────────────────────────────────────────────────────────────

The Google SRE book says an alert without a runbook is an alert that will be ignored or handled inconsistently. Runbooks are the bridge between "something is wrong" and "here is exactly what to do." They encode operational knowledge so the on-call engineer at 3 AM does not have to figure it out from scratch under pressure with a phone full of notifications. The rule is simple: every alert must have a runbook. If you add an alert without a runbook, the alert is not finished. On-call rotation design: primary plus secondary, with defined escalation criteria. If the primary cannot mitigate within 15 minutes, they page secondary. If neither can mitigate within 30 minutes for an S1, they escalate to the service owner. The escalation path must be documented before the incident, not discovered during it.
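The escalation rule above is simple enough to encode, which makes it testable and removes any 3 AM debate about who gets paged next. A sketch using the 15- and 30-minute limits from the paragraph; `who_to_page` and the role names are hypothetical:

```python
def who_to_page(severity: str, minutes_unmitigated: float) -> str:
    """Return the role that should be engaged for an unmitigated incident."""
    if severity == "S1" and minutes_unmitigated >= 30:
        return "service-owner"
    if minutes_unmitigated >= 15:
        return "secondary"
    return "primary"


for sev, minutes in (("S1", 5), ("S2", 20), ("S1", 45)):
    print(sev, minutes, "->", who_to_page(sev, minutes))
```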

14. The SRE Maturity Model for kldload Deployments

Most infrastructure is at Level 0 or Level 1. The maturity model gives you a clear map of where you are and what to do next. Each level builds on the previous — you cannot skip levels, because the higher levels depend on the foundations built below.

Level 0 — Manual Everything

No automated snapshots. No monitoring. No runbooks. Changes are made ad hoc. DR is "we have a backup somewhere." Incidents are discovered by users reporting them. Recovery is measured in hours.

// Signs: "we do backups manually on Fridays"
// "I know how to fix it, it's in my head"
// "we found out it was down when the CEO called"

Level 1 — Automated Basics

sanoid running with a reasonable retention policy. Prometheus and node_exporter deployed. Basic alerts for disk, CPU, memory, and service health. SSH via WireGuard backplane. Pre-change snapshots taken (manually but consistently).

// Checklist:
// [ ] sanoid running, snapshots verified
// [ ] Prometheus scraping all nodes
// [ ] Alerts for disk, CPU, memory, service health
// [ ] ZFS pool health monitored

Level 2 — SLO-Driven

Defined SLIs and SLOs for every critical service. Error budget policy documented and enforced. Runbooks for every alert. Blameless postmortem process in place. syncoid replication with verified RPO. Monthly DR tests documented.

// Checklist:
// [ ] SLOs defined and measured for all services
// [ ] Error budget policy: what changes when budget depletes
// [ ] Every alert has a runbook
// [ ] Postmortem process: blameless, action items tracked
// [ ] DR tested monthly, results documented

Level 3 — Fully Automated

Blue/green deploys via ZFS clone and kvm-clone. Canary deployments with automated rollback on SLO violation. Automated DR tests on schedule. Toil below 50% of operational time. cert-manager handling certificate lifecycle. All VMs provisioned via kvm-clone from golden images.

// Checklist:
// [ ] Blue/green deploy pipeline using ZFS clones
// [ ] Canary with defined success criteria and auto-rollback
// [ ] DR test automated (monthly systemd timer, results logged)
// [ ] Toil measured: < 50% of operational time
// [ ] Certificate renewal: fully automated

Level 4 — Self-Healing

Automatic rollback triggers when SLO burn rate exceeds threshold. Automatic failover to DR site when primary site health checks fail. Automatic capacity scaling (new storage device added to pool, VMs rescheduled on available hosts). Human involvement limited to policy decisions, not operational execution.

// Checklist:
// [ ] SLO burn rate → automatic rollback (no human needed)
// [ ] Site health check failure → automatic DR failover
// [ ] Disk space high → automatic alert + capacity plan
// [ ] Engineer role: set policy, review outcomes, improve systems
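Because levels cannot be skipped, your maturity level is the highest level whose checklist, and every checklist below it, is fully ticked. A sketch of that rule, assuming you record each checklist as a list of booleans; `maturity_level` and the sample answers are hypothetical:

```python
def maturity_level(checklists: dict[int, list[bool]]) -> int:
    """Return the highest level reached without skipping an incomplete one."""
    level = 0
    while True:
        items = checklists.get(level + 1)
        if not items or not all(items):
            return level
        level += 1


# Illustrative self-assessment: Level 1 done, one Level 2 item open,
# Level 3 items done early (they do not count until Level 2 is complete).
assessment = {
    1: [True, True, True, True],
    2: [True, True, False, True, True],
    3: [True, True, True, True, True],
}
print(f"Current maturity: Level {maturity_level(assessment)}")
```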

Getting from Level 1 to Level 2 — the highest-value jump

The jump from Level 1 to Level 2 is the most valuable reliability investment you can make. Level 1 gives you visibility and basic automation. Level 2 gives you accountability and measurable improvement. Here is the concrete sequence:

# Week 1: Define SLIs and SLOs
# For each critical service, write down:
# 1. What does "working" mean? (the SLI)
# 2. What percentage of the time must it work? (the SLO)
# 3. How do you measure it? (the Prometheus query)

# Week 2: Build the measurement infrastructure
# Add SLO recording rules to Prometheus:
groups:
- name: slo
  interval: 30s
  rules:
  - record: job:slo_availability:rate5m
    expr: |
      sum by (job) (rate(http_requests_total{status!~"5.."}[5m])) /
      sum by (job) (rate(http_requests_total[5m]))

# Week 3: Write runbooks for every existing alert
# For each alert in AlertManager: write the runbook in your wiki
# Template: symptom, diagnosis steps, mitigation, resolution, escalation

# Week 4: Test your DR
# Run the DR test from Section 11
# Document the result: what worked, what did not, what the actual RPO was

# Month 2: Implement the error budget policy
# Document it. Get buy-in from development.
# Start tracking error budget consumption in your weekly ops review.
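For the weekly ops review, error budget consumption reduces to one ratio: failed requests divided by the failures the SLO allows. A sketch assuming a request-based availability SLI; `error_budget_remaining` and the sample counts are hypothetical:

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget left in the window (negative when overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures <= 0:
        raise ValueError("no budget: need slo < 1.0 and total_requests > 0")
    return 1.0 - failed_requests / allowed_failures


# Illustrative: 99.9% SLO, one million requests, 250 failures
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"error budget remaining: {remaining:.0%}")
```

A negative result is the trigger for whatever your error budget policy says happens when the budget is gone.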

Getting to Level 2 — SLO-driven — is the biggest reliability jump for most teams. Level 1 gives you observability: you can see what is happening. Level 2 gives you accountability: you have agreed-on targets, and you measure whether you are hitting them. OpenZFS makes the jump easier than on other platforms because the hardest SRE operations — rollback, DR, environment cloning — are built into the filesystem and cost almost nothing to implement. The first time you demonstrate a 2-second rollback from a bad deploy to a skeptical development team, the SLO conversation becomes much easier. "We can deploy as aggressively as the error budget allows, because rollback is instant" is a powerful argument for both reliability and velocity.

15. The SRE Reading List

These are the books and resources that underpin everything in this masterclass. The Google books are free online — there is no reason not to read them.

Site Reliability Engineering (Beyer, Jones, Petoff, Murphy — Google, 2016) — the original. Free at sre.google. Read chapters 3 (Embracing Risk), 4 (SLOs), 5 (Toil Elimination), and 13 (Emergency Response) first.
The Site Reliability Workbook (Beyer, Murphy, Rensin, Kawahara, Thorne — Google, 2018) — the practical companion. Concrete implementations of every concept in the SRE book. Also free at sre.google.
Implementing Service Level Objectives (Alex Hidalgo, 2020) — the best deep dive on SLI/SLO/error budget practice. Goes well beyond the Google books into the organizational and measurement details.
Release It! (Michael Nygard, 2nd ed. 2018) — stability patterns and antipatterns for production software. Circuit breakers, timeouts, bulkheads. Complementary to SRE — covers the application design side of reliability.
The Practice of Cloud System Administration (Limoncelli, Chalup, Hogan, 2014) — the operations engineering textbook. Covers capacity planning, change management, on-call design, and runbook writing in depth.

Related kldload masterclasses

ZFS Masterclass — the foundation: pool design, snapshots, replication, encryption, and all the primitives this page uses
Observability Masterclass — Prometheus, Grafana, alerting, the full monitoring stack
Backplane Networks Masterclass — the encrypted infrastructure network that makes multi-site SRE possible
Security Hardening Masterclass — hardening your infrastructure so incidents are less likely and blast radius is smaller
Kubernetes Masterclass — Kubernetes-specific deployment patterns: rolling updates, canary with Cilium, pod disruption budgets
