ZFS Test Lab Masterclass
OpenZFS is the most consequential storage code most kldload users will ever run. Pool corruption, ARC eviction stalls, txg sync regressions, encryption edge cases — they don't surface in unit tests. They surface under real workloads, on real kernel + userspace combinations, across distros that ship different OpenZFS versions. The ZFS Test Lab is kldload's answer: a dedicated appliance template that boots into a host configured for nothing but running the OpenZFS test suite across multiple distros, with eBPF / Tetragon / Cilium giving you visibility into what the kernel is actually doing while the tests run.
What you will learn: what the zfslab template is for, how its golden lineage differs from the regular klab goldens, how to drive the OpenZFS test suite across the five distro flavors, how to layer eBPF + Tetragon + Cilium observability onto live test runs (so a "the test failed" trace reads like a movie of what the kernel did, not a postmortem), how to wire test results into Grafana, how to author custom test recipes (pool corruption, ARC pressure, dedup workloads, encryption verify, replication chaos), and how this plugs into the broader kldload CI matrix.
Audience: anyone who depends on OpenZFS and wants more than "scrub once a month, hope for the best" — kldload operators running production storage, contributors who care whether OpenZFS 2.4 vs 2.2 vs 2.3 has the regression they think they're seeing, anyone tracking distro-vendor patches.
1. Why the ZFS Test Lab Exists
Most "ZFS testing" by users is one of two extremes. Either you trust the upstream OpenZFS regression suite, run it once when you build a pool, and never touch it again — or you ship to production and let bug reports do the testing for you. Neither catches the failure modes that actually hurt:
- Distro-vendor patch divergence. Ubuntu's ZFS packaging applies patches that aren't in upstream. RHEL/CentOS use a third-party DKMS (zfsonlinux). Fedora is closest to upstream but lags by a few weeks. Debian ships a DFSG-cleared version. The same workload behaves differently on each.
- Kernel-vs-OpenZFS combinatorics. ZFS is a kernel module. A kernel update can subtly break OpenZFS even when the userspace tools say "version 2.2.9, all green". DKMS rebuild succeeds; semantics shift; pools hit asserts under load.
- Test-suite version drift. The ZFS test suite ships with each OpenZFS release. Tests added in 2.4 don't exist in 2.2. Running "the test suite" against multiple distros means running different versions of the suite — so equivalence requires a careful matrix, not a single check.
- Observability gaps. When a test fails (or worse, hangs) the existing harness gives you a log. You want a trace of every zio, every ARC eviction, every txg, every userspace ioctl that fired during the run. That's what eBPF + Tetragon make possible.
The ZFS Test Lab template (zfslab) is purpose-built to address
all four. The host runs a minimal Kubernetes cluster (just enough for the
observability stack), no kldload feature tools, and gives the rest of its
budget to a separate fleet of test goldens: five cloud images, one per
distro, pre-loaded with ksh, zfs-tests.sh, vdev images, and the matching
upstream test suite for the version of OpenZFS that distro ships.
Two golden lineages, one host
The same kldload host can hold two disjoint sets of golden VMs. klab-golden-<distro> is the lean lineage — small, boots fast, used for general testing. klab-ztest-<distro> is the ZFS-test lineage — larger, includes ksh + zfs-tests.sh + vdev images, dedicated to running the OpenZFS suite. zfslab builds the second lineage by default; klab builds only the first.
Per-distro OpenZFS version
centos9 → zfs-2.2.9, rocky9 → zfs-2.2.9, fedora44 → zfs-2.4.1, debian13 → zfs-2.3.2, ubuntu24 → zfs-2.2.2. Each ztest golden has the matching upstream test suite installed. A regression that lands in 2.3 but not 2.2 shows up as fedora-fail / debian-fail / centos-pass, which is exactly the signal you want.
Three-tier observability
Layer 1: eBPF probes inside the kernel — zio_latency, txg_sync_time, arc_eviction_rate, spl_alloc_track. Layer 2: Tetragon at the syscall boundary — every zpool, zfs, ioctl call captured with arguments and stack. Layer 3: Cilium / Hubble for replication network tests — see syncoid traffic in flight, latency between sender and receiver.
Reporting that means something
Grafana dashboards specifically for ZFS test runs: a pass/fail matrix per distro × test group, time-series of zio latency during the run, scrub completion-time trend across builds, ARC efficiency score, dedup pressure curve. Not "all tests passed" — instead, "scrub took 12% longer than the last build" before anyone noticed.
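The per-distro version list above is what turns a pass/fail pattern into a version statement. As a minimal sketch (the function names are illustrative and not part of klab; the version numbers are copied from the list above), the following helpers answer "which distros carry an OpenZFS at or above release X", meaning the distros where a regression introduced in X would be expected to appear:

```shell
# Print the OpenZFS version shipped on a ztest distro (versions from the list above).
zfs_version_for() {
  case "$1" in
    centos9|rocky9) echo "2.2.9" ;;
    fedora44)       echo "2.4.1" ;;
    debian13)       echo "2.3.2" ;;
    ubuntu24)       echo "2.2.2" ;;
    *)              return 1 ;;
  esac
}

# version_ge A B: succeed when version A >= version B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# affected_distros MIN_VERSION: list the distros whose OpenZFS is >= MIN_VERSION.
# The body runs in a subshell so its variables stay local.
affected_distros() (
  out=""
  for d in centos9 rocky9 fedora44 debian13 ubuntu24; do
    v=$(zfs_version_for "$d")
    if version_ge "$v" "$1"; then
      out="${out:+$out }$d"
    fi
  done
  echo "$out"
)
```

For example, `affected_distros 2.3.0` prints `fedora44 debian13`: a failure confined to those two columns points at code that arrived in 2.3 or later.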
2. Two Golden Lineages
kldload's golden VM system has two coexisting lineages. They live in separate ZFS zvols, build via different code paths, and serve different test needs.
| Property | Lean (klab-golden-*) | ZFS-test (klab-ztest-*) |
|---|---|---|
| Build command | klab golden all | klab golden-ztest all |
| Size per golden | ~3 GB | ~6 GB |
| Build time per distro | ~5 min | ~12 min |
| Includes ksh | No (Korn shell, only needed for zfs-tests.sh) | Yes |
| Includes zfs-tests.sh | No | Yes (matching upstream version) |
| Includes vdev images | No | Yes (8 × 1 GB sparse files) |
| Includes ARC pressure tools | No | Yes (arcstat, arc_summary, fio, iozone) |
| Has eBPF probes pre-loaded | Userspace tools only | Yes (zfs.bt, arc.bt bpftrace scripts) |
| Used by | klab tile (everything except ZFS test runs) | zfslab tile, kzfs-test framework |
The split exists because the dev tools are heavy and useless for non-ZFS
testing. A klab user spinning up a Fedora VM to test "does my web app work
on F44" doesn't need ksh and a 1 GB scratch disk. A zfslab user running
zfs-tests.sh needs both. Forcing every klab golden to carry the
ztest payload would inflate disk use 2× and build time 2.5× — both bad
trades for the common case.
The naming convention is deliberate. klab-ztest-fedora
visually marks "ZFS test golden, Fedora flavor". You can list both lineages
with one command:
$ zfs list -H -t snapshot -o name | grep '@golden'
rpool/vms/klab-golden-centos@golden
rpool/vms/klab-golden-rocky@golden
rpool/vms/klab-golden-fedora@golden
rpool/vms/klab-golden-debian@golden
rpool/vms/klab-golden-ubuntu@golden
rpool/vms/klab-ztest-centos@golden
rpool/vms/klab-ztest-rocky@golden
rpool/vms/klab-ztest-fedora@golden
rpool/vms/klab-ztest-debian@golden
rpool/vms/klab-ztest-ubuntu@golden
Test runs always clone from the ztest snapshot. The clone is a copy-on-write
view; tear-down is instantaneous (zfs destroy on the clone). One
ztest golden supports unlimited concurrent test VMs, capped only by RAM and
CPU.
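The clone-and-destroy lifecycle described above maps onto two ordinary `zfs` commands. The sketch below only prints them (a dry run), so it is safe to read and adapt. The `rpool/vms/klab-ztest-<distro>@golden` names come from the listing above, while the `rpool/vms/run-<id>` clone name is a made-up convention, not necessarily what klab uses internally.

```shell
# clone_plan DISTRO RUN_ID: print the zfs commands for one test VM's disk lifecycle.
# Dry run only: nothing is executed against the pool.
clone_plan() (
  distro="$1"
  run_id="$2"
  golden="rpool/vms/klab-ztest-${distro}@golden"
  clone="rpool/vms/run-${run_id}"   # hypothetical naming scheme
  echo "zfs clone ${golden} ${clone}"
  echo "# boot the VM from ${clone} and run the tests"
  echo "zfs destroy ${clone}"
)
```

Calling `clone_plan fedora 42` prints the clone command, a placeholder for the test run, and the destroy command that discards every block the test wrote.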
3. The Five ZFS-Test Distros
The five distros aren't arbitrary — each one anchors a specific OpenZFS version that catches a specific class of bug. Running tests against all five is how you triangulate whether a regression you're seeing is upstream, distro, or kernel-specific.
| Distro | OpenZFS version | Kernel | What this combination tells you |
|---|---|---|---|
| centos-stream-9 | 2.2.9 (zfsonlinux DKMS) | 5.14.x (RHEL-9 derived) | Anchored point: stable 2.2.x on a stable kernel. If a test passes here and fails everywhere else, it's a regression in the newer code. |
| rocky-9 | 2.2.9 (same DKMS as centos) | 5.14.x (RHEL-9 derived) | Sanity check on the centos result. Differences here vs centos isolate distro-vendor patches that AlmaLinux / Rocky apply on top of the el9 base. |
| fedora-44 | 2.4.1 (cutting edge, zfsonlinux fc-1 fallback) | 6.19.x (current) | Forward-looking. New 2.4 features (block cloning, vdev rebuild, dRAID hardening). Fedora-only failures often mean a kernel-API change broke a 2.4-specific code path. |
| debian-13 | 2.3.2 (DFSG, distro-packaged) | 6.12.x | Middle-ground. Catches the long tail of DFSG-clean patches Debian applies. Useful for "does the workaround for X land cleanly across upstream + Debian's tree". |
| ubuntu-24.04 | 2.2.2 (Ubuntu-patched) | 6.8.x (HWE) | Most-deployed kernel + Canonical's patch set. Ubuntu ships ZFS in the kernel proper (not as a DKMS) — kernel updates can break things in ways DKMS distros don't see. |
A test that passes on all five is solid. A test that fails only on Ubuntu is probably a Canonical-applied patch interaction. A test that fails on Fedora but passes on the el9 pair is a 2.4-regression candidate. A test that fails across the board on the same kernel version is a kernel-API change. The matrix is what makes the diagnosis fast.
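The triage rules in the paragraph above are mechanical enough to write down. This is a rough sketch, not part of klab: `classify` is a hypothetical helper that takes one pass/fail result per distro, in the column order used throughout this page, and prints the first diagnosis that fits. The el9 rule comes from the rocky-9 row of the table.

```shell
# classify CENTOS ROCKY FEDORA DEBIAN UBUNTU   (each argument is "pass" or "fail")
classify() (
  pattern="$1 $2 $3 $4 $5"
  case "$pattern" in
    "pass pass pass pass pass")
      echo "solid" ;;
    "fail fail fail fail fail")
      echo "all-fail: kernel-API change candidate" ;;
    "pass fail "*|"fail pass "*)
      echo "el9-split: distro-vendor patch difference" ;;
    "pass pass pass pass fail")
      echo "ubuntu-only: suspect a Canonical patch interaction" ;;
    "pass pass fail "*)
      echo "fedora-vs-el9: 2.4 regression candidate" ;;
    *)
      echo "mixed: needs manual triage" ;;
  esac
)
```

The order of the cases matters: the el9-split check runs before the Fedora check so that a centos/rocky disagreement is never mistaken for an upstream regression.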
4. Building the Test Lineage
Building the ztest lineage is one command, run on the kldload host. It
downloads the cloud image for each distro (or uses the cached copy from the
lean lineage), boots a clone, runs the dev-tools install hook, snapshots
@golden, and shuts down.
sudo klab golden-ztest all
Internally, this sets KLAB_GOLDEN_KIND=ztest in the environment
of the per-distro build sub-script. The naming logic in
klab's golden_name() function honors that flag:
# From klab:376-384 — naming logic
golden_name() {
    local distro="$1"
    if [[ "${KLAB_GOLDEN_KIND:-regular}" == "ztest" ]]; then
        echo "klab-ztest-${distro}"
    else
        echo "klab-golden-${distro}"
    fi
}
The dev-tools hook runs inside each VM after the cloud image boots and
before @golden is snapped. Per distro it does roughly:
- Install ksh — required for zfs-tests.sh (the test harness is written in Korn shell, not bash).
- Install zfs-tests.sh — comes with the OpenZFS package on most distros at /usr/share/zfs/zfs-tests.sh; if missing, fall back to a clone of the OpenZFS source tree at the version matching the running ZFS module.
- Install ARC + zio observability tools — arcstat, arc_summary, zinject, fio, iozone, bpftrace.
- Pre-create vdev images — eight 1-GB sparse files at /var/tmp/zfs-test-vdevs/disk{0..7}.img. The test suite creates many small pools; pre-allocating the backing files makes each test ~30 seconds faster.
- Pre-load bpftrace scripts for the live observability layer (see Section 7).
- Snapshot as @golden on the zvol, mark read-only, shut down.
Total per-distro build: ~10-12 minutes. Total for all five: ~55 minutes sequential. Run it once; the goldens are immutable until you bump distro versions or want a fresh suite.
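The vdev pre-creation step in the hook above is easy to reproduce by hand. A minimal sketch, assuming GNU `truncate` is available; `make_vdevs` is an illustrative name, and the real hook may create the files differently:

```shell
# make_vdevs DIR [COUNT] [SIZE]: create COUNT sparse files named disk0.img,
# disk1.img, ... of SIZE each. Defaults match the hook: eight 1 GB files.
make_vdevs() (
  dir="$1"
  count="${2:-8}"
  size="${3:-1G}"
  mkdir -p "$dir" || exit 1
  i=0
  while [ "$i" -lt "$count" ]; do
    # truncate only sets the file length, so no data blocks are written
    truncate -s "$size" "$dir/disk${i}.img" || exit 1
    i=$((i + 1))
  done
)
```

`make_vdevs /var/tmp/zfs-test-vdevs` would recreate the layout the goldens ship with. Because the files are sparse, they occupy almost no disk until a test pool writes to them.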
Useful variants:
# Just one distro (e.g. you're tracking a Fedora 2.4 regression)
sudo klab golden-ztest fedora
# Force-rebuild even if a golden exists (after distro updates)
sudo KLAB_FORCE_REBUILD=1 klab golden-ztest fedora
# Custom ZFS source tree (test a PR before it lands)
sudo KLAB_ZFS_SOURCE=https://github.com/openzfs/zfs.git \
KLAB_ZFS_BRANCH=pr-12345 \
klab golden-ztest fedora
5. The OpenZFS Test Suite
The actual tests live upstream in tests/ of the OpenZFS source tree. Each
ztest golden ships the suite version that matches its OpenZFS module —
running 2.4 tests against a 2.2 module would mostly produce false failures.
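A quick guard against a mismatched suite can be sketched as below. Treating "same major.minor" as "matching" is an assumption made for this example, and both helper names are illustrative.

```shell
# major_minor VERSION: reduce "2.2.9" or "2.2.9-1" to "2.2".
major_minor() {
  echo "$1" | cut -d. -f1,2
}

# suite_matches MODULE_VERSION SUITE_VERSION: succeed when both share major.minor.
suite_matches() {
  [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}

# On a live Linux system the loaded module's version can be read with:
#   cat /sys/module/zfs/version
```

A harness wrapper could call `suite_matches "$(cat /sys/module/zfs/version)" "$suite_version"` and refuse to start when it fails, rather than spending hours producing false failures.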
The harness is zfs-tests.sh. It reads a runfile (a list of
test groups + functions to run), spins up a small pool from the vdev images,
runs the requested tests, tears down, and prints results. The
klab ztest hook installs three runfiles per VM:
| Runfile | Test count | Wall-clock | When to use it |
|---|---|---|---|
| quick.run | ~50 | ~10 min | Smoke. Bring up a VM, confirm ZFS works at all. |
| linux.run | ~1500 | ~3-4 hr | Standard. The "did this OpenZFS build break anything" gate. |
| linux-fault.run | ~200 | ~1 hr | Fault injection (zinject). Validates pool resilience. |
The suite is organized into ~80 test groups. The most important ones for day-to-day kldload work:
- cli_root — every zpool and zfs subcommand exercised. If your install can't run zpool create, this fails first. Cheapest sanity test.
- atime / mount — mount semantics, atime behavior under load.
- casenorm / userquota — case sensitivity and quota enforcement. Distro-vendor patches sometimes break these.
- rsend / send — replication. Critical for sanoid/syncoid users; often regresses subtly.
- scrub_mirror / resilver — pool repair. The bedrock guarantees of ZFS.
- encryption_* — encrypted dataset behavior across keystore types. Multiple regressions in the 2.2 line.
- arc_001_pos / arc_dnode_limit — ARC behavior under memory pressure. Surface for the "ARC hang" kernel-API bugs that periodically affect Ubuntu HWE.
- l2arc — L2ARC persistence and rebuild. The 2.2 → 2.4 shift introduced new L2ARC behavior that sometimes panics older kernels.
- removal / raidz_expand — top-level vdev removal and RAIDZ expansion. The most fragile features in OpenZFS; tests here catch a real bug roughly every other release.
- zpool_import — pool import semantics. The "I rebooted and my pool is missing" code path. Critical for kldload's hostid-pinning work.
6. Running Tests
Three modes, by ascending scope:
6.1 Single test on a single distro
# Boot an instant clone of the ztest golden, run one test, tear down
sudo klab test fedora --run cli_root/zpool_create_001_pos
Wall clock: ~3 minutes (clone + boot + test + teardown). The clone is
discarded automatically — every run starts from the same known-good state.
The test output is written to
/var/log/klab/<run-id>/test.log on the host, and the
full zfs-tests.sh output is also stored in the SQLite history.
6.2 Test group across all distros
sudo klab test all --group encryption_001_pos,encryption_002_pos
Spins up five clones in parallel (one per distro), runs the named tests on each, collects results, and prints a matrix with one row per test and one column per distro:
centos rocky fedora debian ubuntu
encryption_001_pos ✅ ✅ ✅ ✅ ❌ <-- regression
encryption_002_pos ✅ ✅ ✅ ✅ ❌ <-- regression
The asymmetric failure pattern (only Ubuntu) is exactly the signal the matrix is designed for — it points you straight at Canonical's patch tree.
6.3 Full quick.run on all distros (the "is anything broken" gate)
sudo klab test all --suite quick
Five distros × ~50 tests = ~250 test invocations. Wall clock: ~50 minutes sequential, ~15 minutes if all five run in parallel (default on a host with >= 24 GB RAM). Result is the same matrix view as 6.2, just bigger.
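The wall-clock figures above follow from simple arithmetic. A small sketch, with the caveat that the 5-minute overhead used in the parallel example is inferred from the quoted ~15 minutes minus the ~10-minute quick.run, not a measured number:

```shell
# estimate_minutes PER_DISTRO_MIN DISTROS MODE [OVERHEAD_MIN]
# MODE is "sequential" (runs add up) or "parallel" (runs overlap).
estimate_minutes() (
  per="$1"
  n="$2"
  mode="$3"
  overhead="${4:-0}"
  case "$mode" in
    sequential) echo $((per * n + overhead)) ;;
    parallel)   echo $((per + overhead)) ;;
    *)          exit 1 ;;
  esac
)
```

`estimate_minutes 10 5 sequential` gives 50 and `estimate_minutes 10 5 parallel 5` gives 15, which matches the two figures quoted for the quick suite.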
6.4 Full linux.run on all distros (the overnight regression matrix)
sudo klab test all --suite linux
~7,500 test invocations. Wall clock: ~4 hours parallel. This is the "is the next release safe to ship" gate. Most kldload users won't run this locally, but the kldload CI host runs it weekly (see Section 10), and the results land in the ZFS test Grafana dashboard.
6.5 Custom recipe (your own scenario)
sudo klab test fedora --recipe my-pool-corruption-test.sh
The --recipe option runs an arbitrary script inside the test
VM with ZFS, ksh, fio, and the observability tools available. Recipes are
how you turn "I think there's a regression" into a reproducible test (see
Section 9 for examples).
7. Three-Tier Observability
This is where ZFS testing in kldload starts to feel different from elsewhere. A traditional test failure gives you a log: "test X timed out at line 47". The Test Lab gives you three concurrent traces of what the kernel was doing while X failed.
7.1 Layer 1 — eBPF probes inside ZFS
The ztest golden ships with a curated set of bpftrace scripts that attach
to OpenZFS kernel functions and emit per-event traces or histograms. These
load into the kernel automatically when a test starts, unload when it ends,
and feed Prometheus via a small exporter (zfs-bpf-exporter) on
the test VM.
The four most useful scripts (all in /usr/local/share/zfs-bt/
on the ztest goldens):
| Script | Hooks | Emits |
|---|---|---|
| zio.bt | kprobe:zio_create, kprobe:zio_done | zio latency histogram by op (read / write / scrub / repair); count of zios in flight |
| txg.bt | kprobe:dsl_pool_sync, kretprobe:dsl_pool_sync | txg sync latency, txg sync count, max sync time per pool |
| arc.bt | kprobe:arc_evict, kprobe:arc_get_data_buf | eviction rate, MFU/MRU split, ARC hits/misses, ghost-list churn |
| spl.bt | kprobe:spl_kmem_alloc, kprobe:spl_kmem_free | SPL allocator pressure, fragmentation, slab churn |
Example: a test fails with "scrub took 41s but expected <30s". The
zio.bt trace shows the scrub's read latency p99 jumped from
800µs to 18ms during the failing window. The cause was an L2ARC rebuild
storm triggered by a side-effect of the test setup. You'd never have caught
that from log lines alone — it's a kernel timing issue.
Run a probe manually during a test:
# Inside the ztest VM
sudo bpftrace /usr/local/share/zfs-bt/zio.bt
# (in another shell)
sudo zpool scrub testpool
# zio.bt prints a histogram every 10s while attached:
@latency_us[scrub_read]:
[100, 200) 142 |@@@@@@@@ |
[200, 500) 816 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[500, 1K) 234 |@@@@@@@@@@@@@@ |
[1K, 2K) 18 |@ |
[2K, 5K) 3 | |
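A histogram like the one above can be reduced to a single number for alerting. The sketch below is an illustrative helper, not something shipped on the goldens: it reads output in exactly that bucket layout on stdin and prints the bucket containing the 99th percentile. It reports a bucket, not an exact latency, because the histogram has already discarded the individual samples.

```shell
# p99_bucket: read a bpftrace-style histogram on stdin, print the p99 bucket label.
# Assumes lines of the form "[LOW, HIGH) COUNT |bars|", as in the sample above.
p99_bucket() {
  awk '
    /^[ \t]*\[/ {
      n++
      label[n] = $1 " " $2   # e.g. "[1K," + "2K)" becomes "[1K, 2K)"
      count[n] = $3
      total += $3
    }
    END {
      if (total == 0) exit 1
      target = total * 0.99
      for (i = 1; i <= n; i++) {
        cum += count[i]
        if (cum >= target) { print label[i]; exit 0 }
      }
    }'
}
```

Fed the sample histogram, the cumulative count first reaches 99% of the 1,213 samples in the `[1K, 2K)` bucket, so that is what the function prints.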
7.2 Layer 2 — Tetragon at the syscall boundary
Tetragon (the eBPF-based runtime security observer that ships with klab and zfslab templates) captures every userspace event with kernel-level context. For ZFS testing, that means you get a per-test audit trail of:
- Every zpool and zfs command invocation, with arguments and exit code
- Every ioctl call into the ZFS kernel module (zfs_ioc_pool_create, zfs_ioc_send, etc.) with the structured ioctl payload
- File-system operations on test files (open, write, fsync, unlink)
- Process tree of the test harness so you can see exactly what zfs-tests.sh spawned
The default Tetragon policy on a zfslab host (shipped at
/etc/tetragon/zfs-test.yaml) captures these events at JSON-line
granularity. They flow into Loki via promtail, where they're queryable from
Grafana. A failing test dumps the last 10,000 events to a per-run artifact
file.
Example tracing policy excerpt:
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: zfs-ioctl-trace
spec:
  kprobes:
  - call: "zfsdev_ioctl_common"
    syscall: false              # a kernel function, not a syscall
    return: true
    args:
    - index: 0
      type: "uint32"            # vecnum (ioctl command index)
    - index: 1
      type: "uint64"            # zfs_cmd_t pointer
    returnArg:
      index: 0
      type: "int64"             # return code
    selectors:
    - matchPIDs:
      - operator: NotIn
        values: [0]
What this gets you: when zfs-tests.sh hangs, you can pull
the Tetragon log and see exactly which ioctl was issued, what
arguments it had, and whether the kernel ever returned. "Test hung" goes
from a 2-hour debug session to a Grafana query.
7.3 Layer 3 — Cilium / Hubble for replication tests
The replication tests (rsend, send,
send_encrypted, plus syncoid integration tests) cross network
boundaries. Cilium's eBPF-native CNI, with Hubble exposing flow records,
turns the network half of those tests into a flight recorder.
You can answer questions like:
- Did the sender's TCP window grow during the failing transfer, or did it stall?
- How many packets were retransmitted? Was there packet loss the test should have tolerated?
- What was the per-byte cost of encryption-at-the-wire vs. ZFS native encryption-at-rest?
- For zfs-receive failures: did the receiver's socket actually receive the data, and at what rate?
Hubble queries:
# Live flow trace for the sender pod during a syncoid run
hubble observe --pod sender --type trace --follow
# All flows from sender to receiver over the last 24 hours
hubble observe --from-pod sender --to-pod receiver --since 24h
# Find TCP resets on the sender→receiver path in the last hour.
# Flow records carry ports and TCP flags, not window sizes or latency,
# so those numbers have to come from other metrics.
hubble observe --since 1h --from-pod sender --to-pod receiver \
    --output jsonpb | jq 'select(.flow.l4.TCP.flags.RST == true)'
The first time you watch a syncoid run with Hubble live-traced you realize ZFS-replication-over-TCP has fascinating behavior at 10G speeds — TCP window collapse during txg fence-and-flush events, retransmit storms when receiver applies a snapshot, etc. Most of these are benign. A few are bugs. Hubble lets you tell them apart.
8. Reporting — Grafana Dashboards
Six dashboards ship with the zfslab template (and on klab, since the
observability stack is the same; see Section 11 for the integration
story). All six pull from the kldload-default Prometheus +
zfs-bpf-exporter + Loki stack. They live at
https://<host>:3000/d/zfs-test/.
1. Test Pass/Fail Matrix
A 5×N grid (distros × test groups) showing the latest run's pass/fail/skip status. Cell color: green (pass), red (fail), yellow (skip / new). Click a cell, see the full run log and Tetragon event trace for that combo.
2. Pass-rate Trend
Time-series of "tests passing this run / total" across the last 100 runs, per distro. A regression that lands shows up as a notch. A flaky test shows up as oscillation. A real improvement is a step up.
3. zio Latency Heatmap
Per-test heatmap of zio latency (read vs write vs scrub) over time. Outliers stand out visually. The "scrub took longer than expected" class of failure is diagnosable from this panel alone.
4. ARC Efficiency
ARC hit ratio, MFU/MRU split, eviction rate during each test. This is the early-warning panel for memory-pressure-related failures. Useful for both "did the workload fit in ARC" and "did this kernel update break ARC sizing".
5. Replication Network
Hubble-fed: throughput, retransmit rate, TCP window timeline for each replication test. Catches the "did syncoid actually achieve full bandwidth" question instead of inferring it from completion time.
6. Tetragon Audit Stream
Live tail of every zfs/zpool ioctl + command across all running test VMs. Searchable by host, test name, command. The "what was the kernel asked to do during the failing window" panel.
The dashboards are JSON files in
/etc/grafana/provisioning/dashboards/zfs-test/; they auto-load
on Grafana startup. You can clone any of them, customize, save under a
different name, and the source-of-truth files are still there next time
you redeploy.
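The "notch" that the Pass-rate Trend dashboard makes visible can also be detected in a script. A minimal sketch, assuming the run history has been exported as plain text lines of `run_id passed total`; that export format and the `find_dips` name are both hypothetical:

```shell
# find_dips THRESHOLD: read "run_id passed total" lines on stdin and print the
# run_id of each run whose pass rate fell by more than THRESHOLD percentage
# points compared with the run before it.
find_dips() {
  awk -v thr="$1" '
    $3 > 0 {
      rate = 100 * $2 / $3
      if (seen && prev - rate > thr) print $1
      prev = rate
      seen = 1
    }'
}
```

With runs at 1490, 1488, 1400, and 1495 passes out of 1500, `find_dips 1` prints only the third run: the drop from 1490 to 1488 is below the threshold, and the recovery afterwards is not a dip.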
9. Custom Test Recipes
Recipes are bash scripts that run inside a fresh ztest VM with ZFS,
fio, ksh, the bpftrace probes, and Tetragon all available. They're how you
turn a real-world failure into a reproducible test in the lab. Five
templates ship in
/usr/local/share/klab/recipes/; you can copy and adapt.
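Because a recipe runs unattended, it helps to fail fast with a clear message when a tool it depends on is missing from the VM. A small preflight sketch follows; `require_tools` is an illustrative helper and is not something the shipped templates are known to include:

```shell
# require_tools TOOL...: succeed only if every named command is on PATH.
# The body runs in a subshell, so "exit 1" ends the check, not the caller.
require_tools() (
  for t in "$@"; do
    if ! command -v "$t" >/dev/null 2>&1; then
      echo "missing tool: $t" >&2
      exit 1
    fi
  done
)
```

A recipe could open with `require_tools zpool zfs fio || exit 1`, so that a missing binary shows up as one clear line in the run log instead of a confusing failure halfway through.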
9.1 Pool corruption — block injection
#!/bin/bash
# /usr/local/share/klab/recipes/pool-corruption-block.sh
set -euo pipefail
VDEVS=/var/tmp/zfs-test-vdevs
# 1. Build a two-way mirror from vdev images (a plain stripe has no
#    redundancy, so it could not mask the fault)
zpool create -f testpool mirror "$VDEVS/disk0.img" "$VDEVS/disk1.img"
# 2. Write known data
dd if=/dev/urandom of=/testpool/canary bs=1M count=100
sha256sum /testpool/canary > /tmp/canary.sha256
# 3. Export and re-import so the read below comes from disk, not from the ARC
zpool export testpool
zpool import -d "$VDEVS" testpool
# 4. Inject read I/O errors on disk1
zinject -d "$VDEVS/disk1.img" -e io -T read testpool
# 5. Verify the file still reads correctly (the mirror should mask the fault)
sha256sum -c /tmp/canary.sha256 || exit 1
# 6. Confirm the mirror absorbed the fault: no permanent data errors reported
zpool status -v testpool | grep -q 'No known data errors' || exit 1
# 7. Clear the injection and error counters, scrub (-w waits for completion),
#    and verify clean state
zinject -c all
zpool clear testpool
zpool scrub -w testpool
zpool status -v testpool | grep -q 'No known data errors' || exit 1
echo PASS
Run it: sudo klab test fedora --recipe pool-corruption-block.sh.
The output goes to the run log; the bpftrace probes capture the zio path the
fault took; Tetragon logs every zinject call.
9.2 ARC pressure — memory squeeze
#!/bin/bash
# Fill the ARC, then squeeze memory and confirm the ARC still reports statistics
set -euo pipefail
zpool create -f testpool /var/tmp/zfs-test-vdevs/disk0.img \
    /var/tmp/zfs-test-vdevs/disk1.img
# Write 1 GB into the pool (two 1-GB vdev images give less than 2 GB usable)
fio --name=fill --rw=write --size=1G --bs=1M --filename=/testpool/fill
# Read it 3 times; after the first pass it should be served from the ARC
fio --name=warm --rw=read --size=1G --bs=1M --filename=/testpool/fill
fio --name=warm2 --rw=read --size=1G --bs=1M --filename=/testpool/fill
fio --name=warm3 --rw=read --size=1G --bs=1M --filename=/testpool/fill
# ARC hit ratio should be >95% after warm3
ratio=$(arcstat -f hit% 1 1 | tail -n 1 | tr -d ' %')
[[ ${ratio%.*} -ge 95 ]] || exit 1
# Now squeeze: 2 workers x 3 GB = 6 GB allocated outside the cache
# (stress-ng is not in the ztest tool list, so install it in the VM first)
stress-ng --vm 2 --vm-bytes 3G --timeout 60s &
wait "$!" || true
# After the squeeze, the ARC kstats should still be readable and populated
grep -q '^misses ' /proc/spl/kstat/zfs/arcstats || exit 1
echo PASS
9.3 Encryption — verify roundtrip
#!/bin/bash
# Encryption verify: create encrypted dataset, write, unload and reload the key, read back
set -euo pipefail
# Pool to hold the dataset (each recipe starts in a fresh VM)
zpool create -f testpool /var/tmp/zfs-test-vdevs/disk0.img
# Encrypted dataset (the passphrase is read from stdin)
echo 'testpassphrase123' | zfs create -o encryption=on -o keyformat=passphrase \
    testpool/secret
# Write known content
dd if=/dev/urandom of=/testpool/secret/data bs=1M count=50
sha256sum /testpool/secret/data > /tmp/secret.sha256
# Unmount and unload the key
zfs unmount testpool/secret
zfs unload-key testpool/secret
# Verify dataset is now unreadable
[[ ! -f /testpool/secret/data ]] || exit 1
# Reload key, mount, verify
echo 'testpassphrase123' | zfs load-key testpool/secret
zfs mount testpool/secret
sha256sum -c /tmp/secret.sha256 || exit 1
echo PASS
9.4 Dedup workload — pressure test
#!/bin/bash
# Dedup deserves its own test: it's the most-deferred-but-still-shipping feature
set -euo pipefail
zpool create -f -O dedup=on testpool /var/tmp/zfs-test-vdevs/disk0.img \
    /var/tmp/zfs-test-vdevs/disk1.img
# Write ~1 GB of duplicate data: 100 copies of one 10 MB random seed file.
# Every block after the first copy should dedup.
dd if=/dev/urandom of=/tmp/seed bs=1M count=10
for i in $(seq 1 100); do
    cp /tmp/seed "/testpool/copy-${i}"
done
# Now write some real diversity
dd if=/dev/urandom of=/testpool/random bs=1M count=100
# Flush pending writes so the dedup statistics are current
zpool sync testpool
# Pool should report dedup ratio > 1.0
dedup_ratio=$(zpool list -H -o dedupratio testpool | tr -d 'x')
echo "dedup ratio: ${dedup_ratio}"
[[ $(awk "BEGIN { print (${dedup_ratio} > 1.0) }") = 1 ]] || exit 1
# Record the DDT summary line (entries, on-disk size, in-core size) in the
# run log; this is reported, not asserted
zpool status -D testpool | grep 'DDT entries' || true
echo PASS
9.5 Replication — chaos / network failure
#!/bin/bash
# Test zfs send/receive replication with simulated network drops
set -euo pipefail
zpool create -f sender /var/tmp/zfs-test-vdevs/disk0.img
zpool create -f receiver /var/tmp/zfs-test-vdevs/disk1.img
# Generate a snapshot stream
dd if=/dev/urandom of=/sender/data bs=1M count=200
zfs snapshot sender@s1
# Simulate 5% packet loss on lo; remove the qdisc on exit, even after a failure
tc qdisc add dev lo root netem loss 5%
trap 'tc qdisc del dev lo root' EXIT
# Send → receive across the lossy link. A plain local pipe never touches lo,
# so route the stream through SSH to localhost (assumes key-based root SSH to
# localhost is configured in the test VM).
SEND_RC=0
zfs send sender@s1 | ssh root@localhost zfs receive receiver/copy || SEND_RC=$?
# Even with packet loss, TCP retransmits should deliver the stream intact
[[ $SEND_RC -eq 0 ]] || exit 1
zfs list receiver/copy >/dev/null || exit 1
# Verify checksums match
sha256sum /sender/data > /tmp/sender.sha
sha256sum /receiver/copy/data > /tmp/receiver.sha
diff <(awk '{print $1}' /tmp/sender.sha) <(awk '{print $1}' /tmp/receiver.sha) || exit 1
echo PASS
For each recipe, the run produces:
- Pass/fail status in the SQLite history
- Test stdout/stderr in the run log
- bpftrace data dumped to a per-run artifact (zio latency histogram, txg sync timeline, ARC events)
- Tetragon event log with every command and ioctl
- Hubble traces if the recipe touched the network
That's enough fidelity to investigate any failure without re-running.
10. Integration with the kldload CI Matrix
The ZFS Test Lab plugs into the broader kldload CI in two ways. First,
the regular install matrix (5 distros × 3 generic profiles) runs nightly
on the kldload CI host (fiend) and posts to the same Grafana
backend. See the
Testing & CI Masterclass
for the full pipeline.
Second, the zfslab tile is one of the entries in the install matrix
itself — when a release candidate goes through CI, one of the runs is a
fresh zfslab install followed by klab test all --suite quick
on every distro. That gate runs in ~30 minutes on a 24-core host (5
parallel test VMs × ~10 min each, plus install overhead). It's the
"can we ship 1.2.0" check.
For the long-form regression check, a separate cron fires
klab test all --suite linux weekly. That's the 4-hour
overnight run. The Grafana "Pass-rate Trend" panel includes only the
weekly runs (the quick.run results are too noisy at the test-group
granularity to be useful for trending).
The integration is loosely coupled by design: the ZFS test results are records in the same SQLite + Prometheus backend that drives the install matrix dashboards, so you can answer cross-cutting questions like "did the install regression I saw on Tuesday correlate with a zfs-tests pass-rate dip?" with one query.
11. Adding ZFS Observability to Any kldload Lab
The eBPF probes, Tetragon policies, and Grafana dashboards described above don't require the zfslab template. They work on any kldload install with the observability stack present (which is every workload template — kvm, k8s, klab, zfslab). To opt in on a non-zfslab box:
# Install the ZFS-specific bpftrace scripts
sudo dnf install -y bpftrace # or apt-get on Debian/Ubuntu
sudo cp -r /usr/local/share/zfs-bt /etc/bpftrace/
sudo systemctl enable --now zfs-bpf-exporter
# Apply the Tetragon policy
sudo kubectl apply -f /usr/local/share/kldload/tetragon-policies/zfs-trace.yaml
# Provision the Grafana dashboards
sudo cp /usr/local/share/grafana-dashboards/zfs-test/*.json \
/var/lib/grafana/dashboards/
sudo systemctl restart grafana-server
Once enabled, your default klab dashboard at
https://<host>:8443/dashboards/zfs/ picks up the new
panels automatically. The Tetragon policy applies cluster-wide; any
zfs/zpool command anywhere on a labeled
node generates audit events.
The default "ZFS Pool Health" dashboard that ships with every kldload install gets new sections when these are present:
- Live zio latency (when the eBPF probes are loaded)
- Audit log of recent zpool/zfs operations (when the Tetragon policy is applied)
- Replication health (when Hubble flows are tagged with the syncoid pod label)
That's the goal: a kldload host with this observability layer is a ZFS box you can debug live, not a ZFS box that surprises you in the morning.
12. Why Logs Lie. The eBPF Paradigm Shift.
Almost everything written about "monitoring" or "observability" is one generation behind. The default mental model — your application writes log lines, an aggregator scrapes them, a UI pretty-prints them, you grep when something breaks — is how it was done in 1995, and it survives because most people have never seen the alternative. If you have kldload running, you've already been handed the alternative. This section is the part that explains why, and why it changes everything about how you think about a running system.
The blunt version: logs are fundamentally lies the application chose to tell you. Logs are what code decided to admit happened. They omit what the kernel did, what the network did, what the disk did, what other processes did, and what the application did but forgot to log. They are also retrospective — by the time you read a log line, the system has moved on and the surrounding context is gone. Replacing logs with kernel-level event streams is not a "better" version of the same idea; it is a different category of tool. The difference is the difference between reading a diary entry and watching the day on video.
12.1 The nginx 404 — a worked example
You run nginx in front of three upstreams. A user reports a 404. Walk through what each model gives you.
The traditional logs approach
$ tail -f /var/log/nginx/access.log | grep '404'
192.0.2.7 - - [06/May/2026:14:23:11] "GET /api/users/42 HTTP/1.1" 404 153
You learn: a 404 happened, sometime around 14:23:11, for
/api/users/42, from 192.0.2.7. That's it. You don't know:
- Which upstream nginx actually tried to talk to.
- Whether that upstream was reachable at all, or timed out, or returned 401, or had a DNS resolution failure that made nginx fall back to a different host.
- Whether the TCP connection to the upstream was healthy, or whether there were retransmits, or whether the kernel's conntrack table was full and dropped the SYN.
- What the upstream thought it returned. Maybe it returned 200 to a different request and nginx mis-routed. Maybe it returned 404 because its database lookup failed for a reason it didn't log.
- Whether anyone else was affected at the same time, or whether this is isolated to this user.
- What changed in the last hour that might explain it.
Six obvious questions, six dead ends. The traditional response is to add more logging — turn on debug-level logging in nginx, restart, hope the bug reproduces, scrape more lines, grep harder. The cost: every log line you add is overhead on every single request, even when nothing is wrong. You pay 100% of the time for visibility you need 0.01% of the time. And you still don't have answers to the kernel-level questions because the application can't see the kernel.
The kldload approach
Same nginx, same 404. You open Grafana. The dashboards for the nginx node are already there — kldload provisioned them when you installed. Every node is wired into the cluster's time-series store via eBPF and Hubble; every flow, every syscall, every kernel function call of interest is already being recorded. You don't grep. You query.
# What did nginx actually try to talk to in the last 5 minutes?
hubble observe --since 5m --pod nginx --type trace
nginx → upstream-1 TCP/8080 ACK 2.1ms
nginx → upstream-2 TCP/8080 RST 3.4ms <-- upstream-2 reset!
nginx → upstream-3 TCP/8080 ACK 1.8ms
One query, and you see that upstream-2 sent a TCP RST during that window. That explains the 404: upstream-2 failed its health checks, and the request happened to land in the fallback path of the upstream block, which returns 404. The application log told you "404"; the kernel told you "TCP RST from upstream-2". The difference: cause vs. symptom.
You drill further:
# Is upstream-2 healthy from any other client?
hubble observe --since 5m --to-pod upstream-2 --verdict DROPPED
... (long list of drops from many sources, all RST around 14:21-14:24)
# What's upstream-2's CPU / memory in that window?
# (Grafana panel — already provisioned — shows CPU at 100%, OOM at 14:22)
# Was there a deploy?
# (Grafana annotations — kldload auto-tags deploys via Tetragon —
# show "upstream-2 image bumped to v3.2.1 at 14:20")
Total time to root cause: about 90 seconds. Without ever opening a single log file. The new image had a memory leak; it OOM'd; the kernel killed it; nginx saw RSTs; users saw 404s. Every step of that chain was visible in real time, in Grafana, on every kldload node by default.
12.2 Every node is a living entity with x-ray vision
The right mental model for a Linux host with a modern eBPF stack is not "a server running some software". It's a living organism wired into the cluster's central nervous system, broadcasting its kernel state continuously, queryable from any point in time. The eBPF subsystem is what enables this — eBPF is an upstream Linux kernel feature, not a kldload thing. Cilium, Tetragon, Hubble, Grafana, and Prometheus are upstream open-source projects. Any Linux distro could provide this experience; most don't, because the assembly is the work. Every kldload node:
- Speaks Grafana fluently from boot. The Prometheus endpoints, the Loki tail, the Hubble flow exporters, the Tetragon event stream — all of them are running on every klab/zfslab/k8s install without you doing anything. The dashboards are pre-provisioned. The data is already flowing.
- Watches itself at the kernel level. eBPF programs attached at scheduler, network, syscall, and disk hooks emit events for every meaningful kernel action. The application doesn't have to participate; the kernel reports.
- Knows about the rest of the cluster. Every flow it sends is tagged with cluster identity (pod, namespace, node). Hubble joins those flows across nodes into multi-hop traces.
- Records what it did. Tetragon captures every process exec, every privileged syscall, every file open against sensitive paths — at JSON-line granularity, exported to Loki.
- Surfaces deviations. Once you build a fingerprint (Section 13), the same dashboards highlight anomalies in real time.
This is what "x-ray vision" means in practice. You can look at any running kldload node and see, in real time, what its kernel is doing, what its network is doing, what its processes are doing, and how that compares to baseline. You can do this without logging into the box. You can do this without modifying the application. You can do this for nodes you've never seen before. Every install ships with this property.
Compare that to a stock distro. To get even half this visibility on a stock CentOS / Debian / Ubuntu, you need to: install Prometheus, set up exporters, install eBPF tooling, decide on a flow exporter, install Cilium (which means uninstalling whatever CNI you have), install Hubble, configure Loki and promtail, set up Grafana, build dashboards, write Tetragon policies, integrate it all. That's a multi-week project, and most teams never finish — they get partway through and call it "good enough" with logs and a few metrics. kldload's choice is to make that the default state of every install.
12.3 Five questions logs can't answer that observability can
The point of the new model isn't "better logs". It's "questions you couldn't ask before". Five examples, all answerable in seconds on a kldload node:
| Question | Logs | kldload |
|---|---|---|
| Why was this request slow? | Time it took (maybe). Cause: unknown. | Per-hop latency, retransmit count, conntrack state, disk wait, CPU steal — for that exact request, on every node it touched. |
| Is anyone else having the same problem? | Need to grep across every node. Slow. | One Hubble query: every flow with the matching error verdict, cluster-wide, in < 1s. |
| Is this an attack or normal usage? | Look at request rate, hope you know what normal is. | Compare current event stream to fingerprint. New process? New destination? New ioctl? Schema drift = anomaly. |
| Did this deploy break anything? | Wait for log line patterns to change. Reactive. | Tetragon-tagged deploy event in Grafana; pre/post fingerprint comparison shows new error rates instantly. |
| What changed in the last hour? | Read every log file. Hours. | Hubble + Tetragon time-window query, faceted by service. Seconds. |
Each row is a daily-operations question. Each row is a question that existed in 1995 and was answered with logs because there wasn't anything better. There is now. The reason a kldload masterclass spends time on this is that getting the mental model right is the hard part. Once you internalize that the kernel is your observation point and the network is your trace, "monitoring" stops being a thing you bolt on later and starts being a property of the system itself.
12.4 None of this is kldload-specific. That's the point.
Worth being honest: every tool in the previous sections is upstream open-source software, completely independent of kldload. eBPF is a Linux kernel feature. Cilium is a CNCF graduated project. Tetragon is a sub-project of Cilium. Hubble is part of Cilium. Grafana and Prometheus are their own ecosystems. bpftrace is its own upstream project, a sibling of BCC rather than a part of it. OpenZFS is itself upstream. Anyone with a modern Linux kernel and a few weeks of integration work could assemble this stack themselves, on any distro.
What kldload contributes is making them first-class citizens of the OS — installed by default, configured to work together by default, dashboards pre-provisioned, policies pre-applied, everything integrated end-to-end the moment a node finishes installing. The kernel was always capable of this; the tools were always free; the assembly was the work. kldload is the assembly, shipped as a distro.
If you don't run kldload, you can still do everything in this masterclass — you just have to install Cilium, deploy Tetragon, set up the Hubble exporters, configure Prometheus and Grafana, build the dashboards, write the policies. Multi-week project on a stock distro; zero-step on kldload. Same destination, different starting line.
That's a feature, not a marketing pitch. It means kldload doesn't trap you — every skill you build in this masterclass transfers to any Linux box. The pattern of "treat your servers as living entities you can x-ray" applies wherever you have eBPF, which is everywhere from RHEL 9 to Debian 13 to Ubuntu 24.04. kldload just gets you to the starting line for free.
The shift, summarized — credit where it's due: the kernel always saw everything that happened. eBPF made the kernel's view queryable from userspace. Cilium and Tetragon and Hubble made that queryability cluster-aware. Grafana made it visualizable. The OpenZFS test suite gave us a deterministic workload to practice on. None of these are kldload's contribution. Default integration is. The skill you build here — fingerprinting, deviation detection, real-time tracing, security from baseline — is portable to any Linux system the moment you internalize the mental model. kldload is a head start, not a cage.
13. From a Defined Environment to Your Own — The Real Payoff
Everything up to this point has used the ZFS Test Lab as the example. That is intentional: the ZFS test suite is one of the rare workloads where you already know what "normal" looks like. zfs-tests.sh quick.run fires a known set of zfs commands, allocates a known set of zios, syncs a known number of txgs, and produces a stable fingerprint of kernel events. When that fingerprint changes, something changed.
The technique generalizes. The ZFS lab is a defined environment — known workload, known expected behavior, known acceptable variance. Once you internalize how to fingerprint the defined case, you can apply the same pattern to anything you run: a web stack, a K8s pod, a database under load, a file server with daily replication, the kldload host itself. The lab is the lecture hall; your production box is the lab.
This section is the bridge. We use the ZFS test workload as the canonical example of how to capture, fingerprint, and trigger on a known state, then walk through applying that same pattern to a "bring your own" workload running on klab.
The pattern, in one paragraph: run your workload through a representative period in a controlled environment. Capture every kernel-visible event (syscalls via Tetragon, network flows via Hubble, kernel function activity via bpftrace, syslog/journal). Statistically summarize that capture into a fingerprint — the set of events that "should" happen, with frequencies and timing. From there, every future run can be compared against the fingerprint. Deviations are signal: regression, anomaly, attack, or workload change. The lab teaches you the discipline; production is where the value compounds.
13.1 The four moves of the pattern
1. Profile (capture known-good)
Run the workload in a clean environment under representative load. Record everything kernel-visible: syscall events, ioctls, network flows, file accesses, kernel function call counts and latencies. The capture is dense — for the ZFS test suite, ~3 million Tetragon events per quick.run, ~30,000 zio kprobe hits per minute. Let it run; storage is cheap.
2. Fingerprint (summarize)
Reduce the capture to a small set of statistical summaries: distinct syscall numbers seen, sorted by frequency; distinct binary paths spawned, with parent-process patterns; network destination set with ports and per-flow byte distributions; bpftrace histograms of latency per kernel function. The fingerprint is small — a few KB of JSON. It is the workload's signature.
3. Detect (compare against fingerprint)
Subsequent runs (or live production) feed events into the same summarizer. The result is compared against the fingerprint. New keys (syscalls / paths / destinations not in baseline) are flagged. Distribution shifts on existing keys are flagged. Latency outliers are flagged. The result is a "deviation report" — quantified differences with severity.
4. Trigger (act on deviations)
Wire deviations into actions: Tetragon enforcement policies that block the deviating event in real time, alert rules in Prometheus that page on threshold breaches, Loki queries that surface log lines correlated with deviation windows. The trigger turns observation into prevention or rapid response. This is where security comes from.
13.2 Why the lab is the right place to learn this
You can read the four-move pattern in any security-engineering textbook. Internalizing it requires doing it on a workload where you can verify your fingerprint is correct, where the workload is deterministic enough to detect spurious failures, and where you can deliberately inject known-bad events to confirm your triggers fire. The ZFS test suite is purpose-built for all three:
- Deterministic. zfs-tests.sh quick.run with the same vdev images and same OpenZFS version produces the same event sequence within statistical bounds. If your fingerprint shows wide variance, something in your capture pipeline is unstable, not the workload.
- Verifiable. The OpenZFS source documents what each test does. You can read it and confirm "yes, scrub_001_pos creates a pool, writes data, scrubs, expects no errors". The fingerprint should reflect that.
- Inject-able. zinject deliberately injects faults. Run it during your captured period; your fingerprint should show the resulting events. Now run it during a fingerprint-comparison period; the diff should highlight the same events. If it does, your detector works. If it doesn't, your fingerprint is too coarse.
Practice on the ZFS lab until you can predict the fingerprint, then extend the technique to your real workloads with confidence.
14. Building a Workload Fingerprint — Concrete Recipe
The fingerprint is a small JSON document. Here is what it actually looks like for the ZFS quick.run baseline, captured on a fresh klab-ztest-fedora clone with no fault injection:
{
  "workload": "zfs-tests.sh quick.run",
  "captured_at": "2026-05-06T14:00:00Z",
  "duration_s": 423,
  "syscalls": {
    "ioctl":   { "count": 3214,  "rate_per_s": 7.6 },
    "openat":  { "count": 18302, "rate_per_s": 43.2 },
    "fsync":   { "count": 412,   "rate_per_s": 0.97 },
    "mount":   { "count": 47,    "rate_per_s": 0.11 },
    "umount2": { "count": 47,    "rate_per_s": 0.11 }
  },
  "binaries_spawned": [
    { "path": "/usr/sbin/zpool", "count": 312 },
    { "path": "/usr/sbin/zfs", "count": 1024 },
    { "path": "/bin/dd", "count": 89 },
    { "path": "/usr/bin/diff", "count": 47 }
  ],
  "ioctl_codes": {
    "0x5a00": { "name": "ZFS_IOC_POOL_CREATE", "count": 47 },
    "0x5a01": { "name": "ZFS_IOC_POOL_DESTROY", "count": 47 },
    "0x5a07": { "name": "ZFS_IOC_POOL_SCAN", "count": 12 }
  },
  "kfunc_latency_p99_us": {
    "zio_create": 18,
    "zio_done": 12,
    "dsl_pool_sync": 45000,
    "arc_evict": 8
  },
  "network_destinations": [],
  "filesystem_paths_modified": [
    "/var/tmp/zfs-test-vdevs/disk*.img",
    "/var/tmp/test-results/**"
  ]
}
That is what a fingerprint looks like. Five sections — syscalls, spawned binaries, ioctl codes, kernel-function latency, network and filesystem footprint. Each section is a manageable summary; together they characterize the workload tightly enough that anomalies are statistically detectable.
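The kfunc_latency_p99_us values come from histograms, so it helps to see how a percentile falls out of bucketed counts. The following is a minimal sketch, assuming the histogram has already been parsed into (upper bound, count) pairs; the function name and the sample numbers are hypothetical, and the shipped summarizer may interpolate or work from raw samples instead.

```python
from typing import List, Tuple


def p99_upper_bound(buckets: List[Tuple[int, int]], quantile: float = 0.99) -> int:
    """Return the upper bound (in us) of the bucket that contains the quantile.

    `buckets` is a list of (upper_bound_us, count) pairs sorted by upper bound,
    the shape you get after parsing a power-of-two histogram.
    """
    total = sum(count for _, count in buckets)
    if total == 0:
        raise ValueError("empty histogram")
    threshold = quantile * total
    seen = 0
    for upper, count in buckets:
        seen += count
        if seen >= threshold:
            return upper
    return buckets[-1][0]


# Hypothetical zio_create histogram: most calls finish in under 8 us.
example = [(2, 100), (4, 400), (8, 450), (16, 45), (32, 5)]
print(p99_upper_bound(example))  # 990 of 1000 samples are reached in the 16 us bucket
```

A bucket upper bound is a conservative estimate: the true p99 lies somewhere inside that bucket, which is one reason latency tolerances are usually looser than count tolerances.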
14.1 The capture script
The fingerprint is generated by a single bash script that runs alongside your workload. The implementation is in /usr/local/share/klab/profile/capture.sh and looks roughly like:
#!/bin/bash
# capture.sh: record events during workload, then summarize
set -euo pipefail
WORKLOAD="$1"                 # e.g. "zfs-tests.sh quick.run"
DURATION="${2:-300}"          # hard cap on workload runtime, seconds
OUT_DIR="${3:-/var/lib/klab/profile/$(date +%s)}"
mkdir -p "$OUT_DIR"

# 1. Start Tetragon JSON-line capture
tetra getevents -o json > "$OUT_DIR/tetragon.jsonl" &
TETRA_PID=$!

# 2. Start bpftrace probes for kernel-function latency.
#    One start map per function: dsl_pool_sync calls into the zio path on
#    the same thread, so a shared @start[tid] would be overwritten.
bpftrace -e '
kprobe:zio_create { @s_create[tid] = nsecs; }
kretprobe:zio_create /@s_create[tid]/ {
    @lat["zio_create"] = hist((nsecs - @s_create[tid]) / 1000);
    delete(@s_create[tid]);
}
kprobe:zio_done { @s_done[tid] = nsecs; }
kretprobe:zio_done /@s_done[tid]/ {
    @lat["zio_done"] = hist((nsecs - @s_done[tid]) / 1000);
    delete(@s_done[tid]);
}
kprobe:dsl_pool_sync { @s_sync[tid] = nsecs; }
kretprobe:dsl_pool_sync /@s_sync[tid]/ {
    @lat["dsl_pool_sync"] = hist((nsecs - @s_sync[tid]) / 1000);
    delete(@s_sync[tid]);
}' > "$OUT_DIR/bpftrace.txt" &
BPF_PID=$!

# 3. Count syscalls system-wide (perf trace summary mode, lower overhead than strace)
perf trace -a -s -o "$OUT_DIR/syscalls.txt" &
PERF_PID=$!

# 4. Start hubble flow capture (filtered to flows seen on this node)
hubble observe --follow --output jsonpb \
    --node-name "$(hostname)" \
    > "$OUT_DIR/hubble.jsonl" &
HUBBLE_PID=$!

# 5. Run the workload; a failing test run must not skip the cleanup below
timeout "$DURATION" bash -c "$WORKLOAD" > "$OUT_DIR/workload.log" 2>&1 \
    || echo "workload exited non-zero (see workload.log)" >&2

# 6. Stop captures (SIGINT lets bpftrace and perf print their summaries)
kill -INT "$BPF_PID" "$PERF_PID" 2>/dev/null || true
kill -TERM "$TETRA_PID" "$HUBBLE_PID" 2>/dev/null || true
wait

# 7. Summarize into fingerprint.json
/usr/local/share/klab/profile/summarize.py "$OUT_DIR" \
    > "$OUT_DIR/fingerprint.json"
echo "Fingerprint at: $OUT_DIR/fingerprint.json"
summarize.py reads the four event streams, applies the reductions (top syscalls, top binaries, ioctl name lookup, latency percentiles, network destination set), and emits the JSON shown above.
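To make one of those reductions concrete, here is a minimal sketch of the "top binaries" step. It is not the shipped summarize.py; it assumes each line of tetragon.jsonl is one JSON event and that exec events carry the binary path under process_exec.process.binary, which is the shape Tetragon's JSON export uses. The function name and sample lines are illustrative.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List


def binaries_spawned(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Count process_exec events per binary path, most frequent first."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # a capture stopped mid-write can leave a partial last line
        binary = event.get("process_exec", {}).get("process", {}).get("binary")
        if binary:
            counts[binary] += 1
    return [{"path": path, "count": count} for path, count in counts.most_common()]


# Hypothetical sample: three execs, one exit event, one truncated line.
sample = [
    '{"process_exec": {"process": {"binary": "/usr/sbin/zfs"}}}',
    '{"process_exec": {"process": {"binary": "/usr/sbin/zpool"}}}',
    '{"process_exec": {"process": {"binary": "/usr/sbin/zfs"}}}',
    '{"process_exit": {"process": {"binary": "/usr/sbin/zfs"}}}',
    '{"process_exec": {"proc',
]
print(binaries_spawned(sample))
```

Against a real capture you would pass an open file handle instead of the sample list, for example `binaries_spawned(open(out_dir + "/tetragon.jsonl"))`.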
14.2 Verifying the fingerprint is good
Run the same workload twice; compare fingerprints; they should match within tolerance. If they don't, the workload isn't deterministic enough or your capture is unstable. A quick sanity check:
# Capture twice
sudo /usr/local/share/klab/profile/capture.sh "zfs-tests.sh quick.run" 600 \
/var/lib/klab/profile/run1
sudo /usr/local/share/klab/profile/capture.sh "zfs-tests.sh quick.run" 600 \
/var/lib/klab/profile/run2
# Diff the fingerprints
diff /var/lib/klab/profile/run1/fingerprint.json \
/var/lib/klab/profile/run2/fingerprint.json
# Healthy outcome: counts within 5% of each other across all categories
# Sick outcome: large diff in any single category → unstable workload OR
# capture infrastructure issue
For the ZFS quick.run baseline, run-to-run variance is typically <3% on counts and <10% on latency p99. Your tolerance settings (in the deviation detector below) should match the variance you observe.
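A plain diff reports every differing byte, so it cannot tell you whether two runs agree within 5%. A small comparison like the following sketch does that for the syscalls section; the function name and the second run's numbers are hypothetical, and the 5% cut-off is the one used in the comments above.

```python
from typing import Dict


def max_count_variance(run1: Dict, run2: Dict, section: str = "syscalls") -> float:
    """Largest relative difference in `count` across keys present in both runs."""
    worst = 0.0
    for key, a in run1[section].items():
        b = run2[section].get(key)
        if b is None:
            continue  # a key missing from one run is schema drift, not variance
        base = max(a["count"], b["count"])
        if base == 0:
            continue
        worst = max(worst, abs(a["count"] - b["count"]) / base)
    return worst


# run1 uses counts from the baseline fingerprint; run2 is made up for illustration.
run1 = {"syscalls": {"ioctl": {"count": 3214}, "fsync": {"count": 412}}}
run2 = {"syscalls": {"ioctl": {"count": 3190}, "fsync": {"count": 420}}}
variance = max_count_variance(run1, run2)
print(f"max run-to-run variance: {variance:.1%}")
print("stable" if variance <= 0.05 else "unstable")
```

The worst-case variance you measure this way is a reasonable floor for the detector's tolerance: set it tighter and healthy runs will raise false alarms.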
15. Deviation Detection and Triggers
Once you have a fingerprint, the next step is comparison against fresh captures. The diff has two flavors:
- Schema drift: a new key appears (a syscall the fingerprint never saw, a binary that was never spawned, a network destination not in the baseline). Schema drift is the high-signal category — usually it means the workload changed, an attack happened, or a regression added a new code path.
- Statistical drift: existing keys present, but counts or latencies have shifted beyond tolerance. Lower-signal — sometimes a benign change (workload load increased), sometimes a real regression (a new bug added 50ms to txg sync).
The detector logic is short:
#!/usr/bin/env python3
# detect.py: compare a fresh capture against the saved baseline fingerprint
import json
import sys

with open(sys.argv[1]) as f:
    baseline = json.load(f)
with open(sys.argv[2]) as f:
    current = json.load(f)

TOLERANCE_RATIO = 1.30  # 30% above baseline = drift
deviations = []

# 1. Schema drift: new keys
for k in current["syscalls"]:
    if k not in baseline["syscalls"]:
        deviations.append(("NEW_SYSCALL", k))
base_paths = {x["path"] for x in baseline["binaries_spawned"]}
for b in current["binaries_spawned"]:
    if b["path"] not in base_paths:
        deviations.append(("NEW_BINARY", b["path"]))
for dest in current["network_destinations"]:
    if dest not in baseline["network_destinations"]:
        deviations.append(("NEW_DESTINATION", dest))

# 2. Statistical drift: existing keys, big changes
for k, v in current["syscalls"].items():
    base_v = baseline["syscalls"].get(k)
    if not base_v:
        continue  # new key, already reported as schema drift
    if v["count"] > base_v["count"] * TOLERANCE_RATIO:
        deviations.append(("RATE_DRIFT", k, v["count"], base_v["count"]))
for k, lat_us in current["kfunc_latency_p99_us"].items():
    base_lat = baseline["kfunc_latency_p99_us"].get(k)
    if not base_lat:
        continue
    if lat_us > base_lat * TOLERANCE_RATIO:
        deviations.append(("LATENCY_DRIFT", k, lat_us, base_lat))

# Output: zero exit code if no deviations, non-zero with deviation report
if not deviations:
    print("OK")
    sys.exit(0)
else:
    print(json.dumps(deviations, indent=2))
    sys.exit(1)
The exit code makes this scriptable. Wire it into a Tetragon enforcement policy, a Prometheus alert rule, or a CI gate.
15.1 From detection to action — three example triggers
Trigger 1 — Tetragon real-time block
If your fingerprint says "this workload should never spawn /bin/sh", you can promote that into a real-time block. The moment a process in this workload's process tree calls execve on a shell, Tetragon kills it before the call returns:
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: zfs-test-no-shell
spec:
  podSelector:
    matchLabels:
      app: zfs-test
  kprobes:
  - call: "sys_execve"
    syscall: true
    args:
    - index: 0
      type: "string"
    selectors:
    - matchArgs:
      - index: 0
        operator: "Postfix"
        values:
        - "/bin/sh"
        - "/bin/bash"
      matchActions:
      - action: Sigkill
The fingerprint told you "this workload doesn't spawn shells". The policy enforces it. If something tries — a regression, an exploit, a script bug — it gets stopped before doing damage.
Trigger 2 — Prometheus alert on latency drift
If your fingerprint says "txg sync p99 is 45ms, baseline tolerance 30%", the alert rule fires when the live metric exceeds 58.5ms:
- alert: ZFSTxgSyncLatencyDrift
  expr: histogram_quantile(0.99, sum by (le) (rate(zfs_txg_sync_duration_us_bucket[5m]))) > 58500
  for: 5m
  labels:
    severity: warning
    workload: zfs-test
  annotations:
    summary: "txg sync p99 above baseline tolerance"
    description: "p99 = {{ $value }}µs, baseline 45000µs (+30% tol)"
    runbook: https://kldload.com/runbooks/txg-latency.html
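The 58500 in the rule is just the baseline p99 plus the tolerance. A short sketch of that arithmetic, applied to every latency key in a fingerprint, is below; the helper name is hypothetical, and integer math is used so the thresholds come out exact.

```python
def drift_threshold_us(baseline_p99_us: int, tolerance_pct: int) -> int:
    """Baseline plus tolerance, in integer microseconds."""
    return baseline_p99_us * (100 + tolerance_pct) // 100


# Two latency entries taken from the baseline fingerprint.
fingerprint = {"kfunc_latency_p99_us": {"dsl_pool_sync": 45000, "zio_create": 18}}
thresholds = {
    name: drift_threshold_us(p99, 30)
    for name, p99 in fingerprint["kfunc_latency_p99_us"].items()
}
print(thresholds)  # dsl_pool_sync: 45000 * 130 // 100 = 58500
```

Deriving thresholds from the fingerprint, rather than typing them into the rule by hand, keeps the alert in step with the baseline whenever you re-profile.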
Trigger 3 — Loki + Grafana correlation panel
When schema drift is detected, dump every Tetragon event from the deviation window to a per-incident Loki stream, and surface it as a Grafana panel that correlates the deviation timestamp with kernel logs and pod restarts. The result: the on-call engineer opens one URL and sees "this is when the new event started, here's everything happening in the kernel and userspace at that moment".
16. Security and Debug Applications
The fingerprint-and-trigger pattern is the same machinery used by EDR products (Endpoint Detection and Response) and modern eBPF-based runtime security tools (Falco, Tetragon's own enforcement mode). What's different about doing it in kldload's lab is you control the whole stack — you build the fingerprint of the workload you actually run, not a generic "Linux in general" fingerprint that has to be conservative enough to fit every system.
16.1 Security — anomaly detection from baseline
Three scenarios where the fingerprint plus a Tetragon enforcement policy stops a real attack class cold:
- Reverse shell from a vulnerable web app. A web app under attack typically tries to exec a shell. Your nginx fingerprint shows it never spawns /bin/sh. Tetragon kills any execve of /bin/sh in nginx's process tree. The exploit lands but the payload dies on its first action.
- Cryptominer post-compromise. Cryptominers reach out to a small, well-known set of mining pools. Your workload's fingerprint says it talks to three internal hosts and one upstream API. Hubble alerts on any new outbound destination. The miner phones home; you see it within seconds.
- Container escape. A container escape attempts to mount the host's /proc or open /dev/sda. Your container's fingerprint says no host-path mounts and only its own namespaces. The attempt triggers schema drift the moment it happens.
The pattern in all three: known-good baseline + real-time enforcement = compromise detected at the first action, not after the data is gone.
16.2 Debug — first-failure capture
Same machinery, different goal. Instead of "block the deviation", "capture everything around it". When a fingerprint deviation fires:
- Snapshot ZFS state for the affected dataset (one-line zfs snapshot)
- Dump dmesg ring buffer to a file
- Capture /proc/<pid>/stack for every process in the workload's cgroup
- Snapshot every Prometheus metric series for the last 60s
- Dump the last 10,000 Tetragon events from the workload's pod label
- Bundle as a tarball, save to the lab's /var/lib/klab/incidents/
The next time someone looks at "why did the test hang last Tuesday at 3 AM", the bundle is sitting there. The bundle has every piece of state needed to reconstruct what the kernel was doing at the moment of failure, without anyone having to be online when it happened.
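The bundling step at the end of that list can be sketched in a few lines. This is an illustration rather than the kldload implementation: the function name is hypothetical, and it assumes the earlier steps have already collected each artifact as bytes (for example the dmesg output and the per-process stack dumps).

```python
import io
import tarfile
import tempfile
import time
from pathlib import Path
from typing import Dict


def bundle_incident(artifacts: Dict[str, bytes], out_dir: Path, name: str) -> Path:
    """Write the collected artifacts into <out_dir>/<name>.tar.gz and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    bundle = out_dir / f"{name}.tar.gz"
    with tarfile.open(bundle, "w:gz") as tar:
        for member_name, data in artifacts.items():
            info = tarfile.TarInfo(name=member_name)
            info.size = len(data)
            info.mtime = int(time.time())
            tar.addfile(info, io.BytesIO(data))
    return bundle


# Demo against a temporary directory; in the lab, out_dir would be
# /var/lib/klab/incidents/ and the byte strings would be real captures.
with tempfile.TemporaryDirectory() as tmp:
    path = bundle_incident(
        {"dmesg.txt": b"placeholder dmesg", "stacks/1234.txt": b"placeholder stack"},
        Path(tmp),
        "incident-demo",
    )
    with tarfile.open(path) as tar:
        print(sorted(tar.getnames()))
```

Building the archive from in-memory bytes means the bundle is written in one pass, so a half-collected incident never leaves stray files next to older bundles.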
16.3 Real-time observability — the live cockpit
The kldload dashboards (Section 8) all support a "deviation overlay" mode. Once a fingerprint is registered for a workload, the dashboard panels render in two color schemes: anything within tolerance shows in the normal palette; anything beyond tolerance shows in red. Watching a test run live, you see the system's behavior relative to its own known-good baseline, not relative to abstract thresholds someone guessed at.
The same overlay applies to Hubble flow charts, Tetragon event streams, and Prometheus time-series. A 30-minute test that finishes all-green is uneventful; a 30-minute test where a single panel pulses red for 8 seconds at minute 14 is a precise debugging breadcrumb. You go to the panel that pulsed, drill into that timestamp, and read the correlated event log.
17. Live Traffic Tracing Across the Cluster — the Cilium Difference
Most "service mesh observability" stories assume sidecars. Every pod gets an Envoy or Linkerd-proxy injected next to it; the proxy intercepts egress, decorates headers, exports spans. That works, but the cost is real: 50-150 MB of memory per pod, 2-5 ms of added latency per hop, and visibility that ends at the sidecar. The kernel layer below the sidecar — where retransmits happen, where TCP window collapses, where conntrack overflows — stays invisible.
Cilium replaces that model entirely. There are no sidecars. eBPF programs attached to every pod's veth interface, every node's physical interface, and the kernel's TC and XDP hooks, observe every packet as it traverses. For the kldload klab and zfslab templates this is the default networking stack — every node already has Cilium installed, already exports flows to Hubble, and already speaks to the cluster's Prometheus.
What this gets you for the fingerprint-and-trigger pattern: the network half of any workload's fingerprint is captured at the kernel, not at a userspace shim. The full multi-node path of every request is traceable in real time. You can ask "show me the exact route this packet took from pod A on node 1 to pod D on node 4, with the per-hop latency" and get an answer in tens of milliseconds. No application changes, no sidecar deploys, no retrofit.
17.1 Why the sidecar model loses the kernel layer
Sidecar meshes intercept traffic at the application level. The sidecar sees what the application sends; it does not see what the kernel did between sends. That's enough for "what HTTP requests are in flight" but not enough for "why did this 200ms request actually take 800ms".
Concrete examples of failures the sidecar model can't explain:
- TCP retransmit storms. A flaky link causes retransmits. The sidecar sees normal request flow; the application sees long latency. The retransmits happen in the kernel between the sidecars. Cause: invisible. With the eBPF stack, a trace on the kernel's tcp_retransmit_skb tracepoint fires on every retransmit; you see them per-flow, per-pod, per-node.
- conntrack table pressure. At high connection churn, the kernel's conntrack table fills, new connections drop. The sidecar reports timeouts; the kernel reports table-full. Without visibility into the kernel's NF_CONNTRACK_TABLE_FULL counter, you're guessing. Cilium exposes it as a per-node metric.
- SYN-cookie fallback. Under SYN flood (real or simulated load test) the kernel falls back to SYN cookies, dropping state for half-open connections. Sidecars don't observe the SYN/SYN-ACK exchange.
- Inter-node encapsulation overhead. Cluster CNI overlays (VXLAN, Geneve, IPIP) add encap headers. MTU mis-sizing causes fragmentation. Sidecars see L7 only — the L3 fragmentation problem is invisible.
All four of these are common failure modes in production K8s clusters. All four are diagnosable in seconds with Cilium + Hubble; all four are mysteries with Envoy alone.
17.2 The Cilium model — eBPF instead of sidecars
Cilium attaches eBPF programs at four kernel hook points per node:
| Hook point | What it sees | Used for |
|---|---|---|
| XDP (eXpress Data Path) | Every packet at NIC ingress, before any kernel networking | DDoS protection, load balancing, early drop of disallowed traffic |
| TC (Traffic Control) | Every packet on every veth (per-pod) and physical interface | Policy enforcement (network policies), flow record export, encryption |
| Socket layer | Every TCP socket op (connect, accept, sendmsg, recvmsg) | L7 protocol parsing (HTTP, gRPC, Kafka), service identity assignment |
| cgroup hooks | Process / pod context for any networking syscall | Per-pod identity, per-pod policy, observability tagging |
The four hooks together produce a stream of events with full kernel context: source pod, destination pod, full L3/L4 header, protocol parse, latency, retransmits, drops, bytes, timing. Hubble (Cilium's observability layer) collects these from every node into a unified flow log. Every kldload klab and zfslab host has Hubble running by default; the stack is already deployed, already collecting, already queryable.
17.3 Live cross-node tracing — what it actually looks like
The classic worked example: a microservice request that hops three nodes. frontend on node-1 calls api on node-2, which calls worker on node-3, which writes to postgres back on node-1. Four pods, three nodes, real multi-hop path.
# 1. Watch the full flow in real time, scoped to one user's request
hubble observe --since 0s --follow \
--label app=frontend \
--type trace --output compact
# Output (real time):
1 [node-1]: frontend → api TCP/8080 SYN 0.1ms
2 [node-2]: api → worker TCP/9090 SYN 12.4ms
3 [node-3]: worker → postgres TCP/5432 SYN 14.2ms
4 [node-1]: postgres → worker TCP/5432 ACK 14.6ms +bytes=842
5 [node-3]: worker → api TCP/9090 ACK 16.1ms +bytes=156
6 [node-2]: api → frontend TCP/8080 ACK 16.4ms +bytes=423
# Total request: 16.4ms
# Slowest leg: api→worker (12.3ms — investigate the network between
# node-2 and node-3, or the api's outbound TCP stack)
That trace is impossible to produce with sidecars without instrumenting the application code. With Cilium, it's a one-line query against the flow log. Every node's eBPF program already captured every packet; Hubble aggregates them server-side; the CLI just renders the join.
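The "slowest leg" note in that trace is simple arithmetic over the timestamps. As a sketch, with the six events copied by hand from the output above (the function name is hypothetical, and real use would parse the flow stream instead):

```python
from typing import List, Tuple

# (label, timestamp in ms), copied from the trace output above
events: List[Tuple[str, float]] = [
    ("frontend -> api", 0.1),
    ("api -> worker", 12.4),
    ("worker -> postgres", 14.2),
    ("postgres -> worker", 14.6),
    ("worker -> api", 16.1),
    ("api -> frontend", 16.4),
]


def slowest_gap(trace: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Return the event preceded by the longest wait, and that wait in ms."""
    gaps = [
        (label, round(ts - prev_ts, 1))
        for (_, prev_ts), (label, ts) in zip(trace, trace[1:])
    ]
    return max(gaps, key=lambda gap: gap[1])


print(slowest_gap(events))  # ('api -> worker', 12.3)
```

The 12.3 ms gap covers everything between the frontend's SYN and the api's outbound SYN, so it includes the api's own processing time as well as the network; it narrows the search rather than settling it.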
17.4 Five concrete monitoring scenarios this enables
Scenario 1 — "what is service X actually talking to?"
You inherit a service. You don't know what it depends on. Sidecars would tell you the L7 hosts. Cilium tells you every L4 destination, every node, every flow:
hubble observe --since 24h \
--label app=mystery-service \
--output json \
| jq '.flow.destination | "\(.namespace)/\(.pod_name)"' \
| sort -u
# returns the deduplicated set of (namespace, pod_name) it talked to
# in the last 24 hours — your dependency graph, no APM agent required
Scenario 2 — "find the slow leg of this request"
A user complaint: "the dashboard is slow". You know the request goes through frontend → api → cache → postgres. Cilium gives you per-leg latency without app instrumentation:
hubble observe --since 1h \
--to-label app=postgres \
--type trace \
| awk '{ print $1, $4, $5, $7 }' \
| sort -k 4 -n -r \
| head -10
# top 10 slowest connections to postgres in the last hour
# with which pod initiated and on which node
Scenario 3 — "did this packet actually reach its destination?"
App logs say request went out, downstream service says it never arrived. With sidecars: gone forever. With Cilium: every node's eBPF confirms whether the packet entered, was forwarded, was dropped, and where:
hubble observe --since 5m \
--pod default/frontend \
--to-pod default/api \
--verdict DROPPED
# shows every drop, with the verdict reason (policy, conntrack-full,
# fragmentation, no-route). If the list is empty, the packet got
# through; if not, the reason is visible.
Scenario 4 — "trace this exact transaction end-to-end"
Cilium can extract HTTP headers from packet payloads at the kernel level (no application changes). If your services propagate an X-Request-ID header, Cilium can match it across hops:
hubble observe --since 5m \
--http-header 'X-Request-ID=abc-123' \
--type trace
# returns every packet across every node carrying that request ID,
# in time order. The full distributed trace, by header, for free.
Scenario 5 — "show me cluster-wide flows during the incident"
An incident at 14:23. Pull every flow from every pod across the window:
hubble observe \
--since '2026-05-06T14:20:00Z' \
--until '2026-05-06T14:30:00Z' \
--output json \
> /tmp/incident-flows.json
# now you have ~10 minutes of every flow from every pod across the
# cluster. grep for the affected service, look for unusual destinations,
# look for spikes in retransmits or drops. The whole network state
# during the incident, archivable.
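Once the archive exists, a first pass over it could look like the following. This is a rough sketch, not a kldload tool: it assumes one JSON object per line with flow.source.pod_name, flow.destination.pod_name and flow.verdict fields, and the sample records are invented. The function name summarize is hypothetical.

```python
import json
from collections import Counter

def summarize(lines, service):
    """Count flows touching `service`, grouped by (destination pod, verdict)."""
    counts = Counter()
    for line in lines:
        try:
            flow = json.loads(line).get("flow") or {}
        except (json.JSONDecodeError, AttributeError):
            continue  # skip blank, partial, or non-object lines
        src = (flow.get("source") or {}).get("pod_name", "")
        dst = (flow.get("destination") or {}).get("pod_name", "")
        if service in src or service in dst:  # crude substring match on pod name
            counts[(dst or "<non-pod>", flow.get("verdict", "UNKNOWN"))] += 1
    return counts

# Invented sample records standing in for /tmp/incident-flows.json.
sample = [
    '{"flow": {"source": {"pod_name": "frontend-1"}, "destination": {"pod_name": "api-0"}, "verdict": "FORWARDED"}}',
    '{"flow": {"source": {"pod_name": "frontend-1"}, "destination": {"pod_name": "api-0"}, "verdict": "DROPPED"}}',
    '{"flow": {"source": {"pod_name": "api-0"}, "destination": {"pod_name": "postgres-0"}, "verdict": "FORWARDED"}}',
    '{"flow": {"source": {"pod_name": "cache-0"}, "destination": {"pod_name": "postgres-0"}, "verdict": "FORWARDED"}}',
]
for (dst, verdict), n in sorted(summarize(sample, "api").items()):
    print(dst, verdict, n)
```

A sudden DROPPED count, or a destination that never appears outside the incident window, is usually the place to start reading.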
17.5 How this strengthens the fingerprint pattern
The fingerprint pattern (Section 13) listed network_destinations as one of the five sections. With Cilium, that section is rich:
{
  "network_destinations": [
    {
      "to_pod": "default/api",
      "to_node": "node-2",
      "port": 8080,
      "protocol": "TCP",
      "rate_per_s": 24.6,
      "p99_rtt_us": 1240,
      "retrans_rate": 0.001
    },
    {
      "to_pod": "default/postgres",
      "to_node": "node-1",
      "port": 5432,
      "protocol": "TCP",
      "rate_per_s": 8.2,
      "p99_rtt_us": 380,
      "retrans_rate": 0.0
    }
  ]
}
Compare against a fresh capture. Schema drift = a new destination appeared (pod talked to something it never had before — possible exfiltration, misconfig, or new feature). Statistical drift = a known destination's RTT shifted (network problem, downstream slowdown). Either is signal you'd never see without per-flow visibility, and both are trivial to detect when the flows are already captured.
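The two kinds of drift can be illustrated with a short comparison routine. This is a sketch under stated assumptions: it uses the field names from the fingerprint JSON above, treats (to_pod, port, protocol) as the identity of a destination, and flags statistical drift when p99 RTT moves by more than a tolerance. The function name detect_drift and the 50% default tolerance are hypothetical choices, not kldload's.

```python
def detect_drift(baseline, fresh, rtt_tolerance=0.5):
    """Compare two network_destinations lists.

    Returns (schema_drift, statistical_drift):
      schema_drift      - destinations present in `fresh` but not in `baseline`
      statistical_drift - known destinations whose p99 RTT changed by more
                          than `rtt_tolerance` (as a fraction of the baseline)
    """
    def key(d):
        return (d["to_pod"], d["port"], d["protocol"])

    base = {key(d): d for d in baseline}
    new = {key(d): d for d in fresh}

    schema_drift = sorted(k for k in new if k not in base)
    statistical_drift = []
    for k in sorted(new.keys() & base.keys()):
        old_rtt = base[k]["p99_rtt_us"]
        new_rtt = new[k]["p99_rtt_us"]
        if old_rtt > 0 and abs(new_rtt - old_rtt) / old_rtt > rtt_tolerance:
            statistical_drift.append((k, old_rtt, new_rtt))
    return schema_drift, statistical_drift

baseline = [
    {"to_pod": "default/api", "port": 8080, "protocol": "TCP", "p99_rtt_us": 1240},
    {"to_pod": "default/postgres", "port": 5432, "protocol": "TCP", "p99_rtt_us": 380},
]
# Invented fresh capture: postgres RTT has shifted and one new destination appeared.
fresh = [
    {"to_pod": "default/api", "port": 8080, "protocol": "TCP", "p99_rtt_us": 1300},
    {"to_pod": "default/postgres", "port": 5432, "protocol": "TCP", "p99_rtt_us": 900},
    {"to_pod": "kube-system/unknown", "port": 443, "protocol": "TCP", "p99_rtt_us": 200},
]
schema, stats = detect_drift(baseline, fresh)
print("schema drift:", schema)
print("statistical drift:", stats)
```

Keying on (to_pod, port, protocol) rather than on to_node means a pod rescheduled to another node does not count as a new destination; whether that is the right call depends on how stable your scheduling is.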
17.6 Cilium + the cross-node K8s example
The K8s pod example in "Bring Your Own Workload" (Section 18.2) becomes much richer with Cilium tracing. The full per-pod fingerprint captures every cross-node flow as part of the baseline. The fingerprint controller running on klab knows every connection that pod made during its first 24 hours; it generates a Tetragon policy that allows exactly that connection set, blocks anything new, and publishes a Grafana panel that highlights deviations live.
Your microservice gets, with one annotation in its Deployment:
- An auto-generated allowlist of egress destinations — exact pod-to-pod, port-precise.
- A cross-node trace for every request — visible in Grafana via the kldload-shipped Hubble dashboards.
- A real-time alert the moment the pod talks to a destination not in the baseline (policy attempt, exfil, or genuine new feature requiring a fingerprint refresh).
- Per-flow latency tracking: drift on existing destinations raises a Prometheus alert, with Hubble drill-down available for any window.
None of this requires application code changes. None of it requires sidecars. The pod just runs; the kernel watches everything; the lab's existing observability stack does the rest.
18. Bring Your Own Workload — Three Walked Examples
The lab pattern works on any workload running on klab. Three examples of increasing scope.
18.1 A web stack — nginx + PostgreSQL on klab
You're running nginx in front of PostgreSQL on a klab VM. The workload is "user requests come in, queries fire, responses go out". You want a fingerprint and triggers.
# 1. Capture during a representative load period (your ramp test, or
# just normal business hours for half a day)
sudo /usr/local/share/klab/profile/capture.sh "your-load-test.sh" 21600 \
/var/lib/klab/profile/web-stack-baseline
# 2. Inspect the fingerprint:
cat /var/lib/klab/profile/web-stack-baseline/fingerprint.json
# Expected highlights:
# binaries_spawned: nginx, postgres, postgres workers, no shells
# network_destinations: client IPs (varied), DB UNIX socket, no Internet
# syscalls: lots of openat (template files), some accept4, almost no execve
# ioctl_codes: empty (web stack uses none)
# 3. Save fingerprint as the baseline
sudo cp /var/lib/klab/profile/web-stack-baseline/fingerprint.json \
/etc/kldload/fingerprints/nginx-postgres.json
# 4. Apply the matching Tetragon policy
sudo /usr/local/share/klab/profile/apply-policy.sh nginx-postgres
Now anything new in nginx's process tree (a shell, a curl outbound, a new binary) triggers either an alert or a kill, depending on how strict you set the policy. That single fingerprint covers a real class of attacks (web RCE → shell → outbound) without you ever writing a specific rule for it.
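The alert-or-kill decision can be modeled in a few lines. This is only a sketch of the logic: real enforcement happens in the Tetragon policy, and the assumption that binaries_spawned is a flat list of binary basenames is mine, not something the fingerprint format guarantees. check_exec and its strict flag are hypothetical names.

```python
import os

def check_exec(fingerprint, binary_path, strict=False):
    """Return 'allow', 'alert', or 'kill' for a newly observed exec.

    Assumes fingerprint['binaries_spawned'] is a list of binary basenames.
    """
    allowed = set(fingerprint.get("binaries_spawned", []))
    if os.path.basename(binary_path) in allowed:
        return "allow"
    return "kill" if strict else "alert"

# Hypothetical baseline matching the highlights listed above.
fingerprint = {"binaries_spawned": ["nginx", "postgres"]}

print(check_exec(fingerprint, "/usr/sbin/nginx"))       # known binary
print(check_exec(fingerprint, "/bin/sh"))               # new binary, lenient policy
print(check_exec(fingerprint, "/bin/sh", strict=True))  # new binary, strict policy
```

Starting in alert mode and switching to strict only after a quiet week or two is a reasonable way to avoid killing a legitimate process the capture window happened to miss.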
18.2 A K8s pod — your microservice in production
Same pattern, fingerprint scoped to a pod label. The deployment YAML gets one annotation:
metadata:
  labels:
    app: my-service
  annotations:
    kldload.io/fingerprint: my-service-v1.2.0
The kldload-fingerprint-controller (a small operator that ships with the klab template) watches for that annotation, captures live events for the pod's first 24 hours of running, generates the fingerprint, and applies a Tetragon policy. The pod's deployment is its own fingerprint boundary — when you bump the version, the controller re-captures and adapts.
Result: every microservice gets a tailored anomaly detector with no per-service config. The platform team sets the pattern; app teams add one annotation.
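One way to picture the controller's behavior is as a small decision function evaluated on every reconcile. This is a hypothetical sketch of the lifecycle described above, not the controller's actual code: the state names, the controller_action function, and the idea that the controller stores the annotation value it last started capturing for are all assumptions.

```python
from datetime import datetime, timedelta, timezone

CAPTURE_WINDOW = timedelta(hours=24)  # "first 24 hours" from the text above

def controller_action(annotation, stored_version, pod_started, now):
    """Decide what to do for one pod.

    annotation     - value of kldload.io/fingerprint, or None if absent
    stored_version - annotation value the current capture or fingerprint
                     belongs to, or None if nothing has been captured yet
    """
    if annotation is None:
        return "ignore"          # pod has not opted in
    if annotation != stored_version:
        return "recapture"       # first sighting, or the version was bumped
    if now - pod_started < CAPTURE_WINDOW:
        return "capturing"       # still inside the baseline window
    return "enforce"             # fingerprint complete, policy applied

start = datetime(2026, 5, 6, 12, 0, tzinfo=timezone.utc)
print(controller_action("my-service-v1.2.0", None, start, start))
print(controller_action("my-service-v1.2.0", "my-service-v1.2.0",
                        start, start + timedelta(hours=3)))
print(controller_action("my-service-v1.2.0", "my-service-v1.2.0",
                        start, start + timedelta(hours=30)))
print(controller_action("my-service-v1.3.0", "my-service-v1.2.0",
                        start, start + timedelta(hours=30)))
```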
18.3 A file server — backups + replication
Your file server runs Samba, exports a few directories, and replicates to a peer via syncoid every hour. Capture the workload through one replication cycle:
# Capture the cycle: sample → archive → syncoid run → finish
sudo /usr/local/share/klab/profile/capture.sh "wait-for-syncoid.sh" 4200 \
/var/lib/klab/profile/fileserver-cycle
The fingerprint will include:
- Samba's smbd process tree, no shells or unusual binaries
- The single outbound TCP destination of the syncoid peer
- ZFS ioctls for snapshot create / send
- Filesystem paths under the exported shares (read-side) and the replication-staging zvol (write-side)
Triggers worth setting:
- Block any binary other than smbd / syncoid / zfs / zpool from running in the file server's mount namespace.
- Alert on any outbound destination that's not the syncoid peer.
- Latency drift on syncoid send: if the typical 1-hour cycle stretches to 90+ minutes, page on-call (often a sign of a struggling network or a ZFS issue).
Same four moves (profile / fingerprint / detect / trigger), totally different workload. The lab patterns extend cleanly because the underlying mechanism — eBPF observation of kernel-visible behavior — is universal.
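The syncoid latency trigger above reduces to a comparison against a typical cycle time. A minimal sketch, assuming you already collect cycle durations in minutes; the cycle_alert name, the use of the median as "typical", and the sample history are illustrative choices, while the 1.5 factor mirrors the 1-hour to 90-minute threshold mentioned above.

```python
import statistics

def cycle_alert(history_min, latest_min, factor=1.5):
    """Return (should_page, typical) for the latest replication cycle.

    Pages when the latest cycle takes at least `factor` times the median
    of past cycles.
    """
    typical = statistics.median(history_min)
    return latest_min >= factor * typical, typical

history = [58, 60, 61, 59, 62]  # invented past cycle durations, in minutes

print(cycle_alert(history, 75))  # slower than usual, below the threshold
print(cycle_alert(history, 92))  # past 1.5x the typical cycle
```

The median is used instead of the mean so that one pathological cycle in the history does not quietly raise the threshold.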
19. What Hardware Adds — and Doesn't
The Test Lab runs in VMs. That covers OpenZFS userspace and the kernel-module behavior under controlled load. What it does not cover:
- Real disk failure modes. SMART-reported failing sectors, controller resets, sudden detach. ZFS handles all of these and the lab's zinject-based fault injection is a good approximation, but real hardware adds latency / timeout patterns synthetic injection misses.
- Vibrational / thermal effects. A hot HBA card under sustained scrub load drops bytes in ways that don't show up on a qcow2-backed VM. Production-grade testing benches should include thermal soak tests on real metal.
- Multipath / SAN behaviors. Anyone running ZFS over iSCSI or FC needs to test the actual path-failover; that's hardware-only.
- Power-cycle integrity. The lab can simulate a kernel panic with zinject -p (panic injection) but not real loss-of-power mid-txg. For storage you actually trust, you do this on real hardware at least once.
The pattern: use the Test Lab for the 90% of regressions that are kernel/userspace logic, and reserve real hardware for the remaining 10% that depend on physical effects.
20. Getting a ZFS Test Lab Up
From zero to running your first test takes about 90 minutes of mostly-unattended time. The full sequence:
- Install kldload with the zfslab tile selected. Pick any distro (Fedora 44 recommended for the freshest OpenZFS). On install, the autodeploy phase fires klab golden-ztest all in the background; the five ZFS-test goldens build over the next ~55 minutes.
- Wait for the goldens. Watch the progress on the dashboard at https://<host>:8443/dashboards/lab/ (each distro shows up as it finishes). While that runs, the K8s cluster + Tetragon + Grafana come up in parallel; they're ready before the goldens finish.
- Confirm the lab is ready:
$ sudo klab status
=== klab status ===
Lineages:
klab-golden-* : 0/5 ready (this lineage was skipped — zfslab template)
klab-ztest-* : 5/5 ready
centos: zfs-2.2.9, ksh, zfs-tests.sh, vdev images ✅
rocky: zfs-2.2.9, ksh, zfs-tests.sh, vdev images ✅
fedora: zfs-2.4.1, ksh, zfs-tests.sh, vdev images ✅
debian: zfs-2.3.2, ksh, zfs-tests.sh, vdev images ✅
ubuntu: zfs-2.2.2, ksh, zfs-tests.sh, vdev images ✅
Observability:
zfs-bpf-exporter: active (port 9101)
tetragon: 5/5 nodes loaded zfs-trace policy
hubble: ready
grafana: https://<host>:3000/d/zfs-test/
Try: sudo klab test all --suite quick
- Run a smoke test on every distro:
$ sudo klab test all --suite quick
[fedora] starting ztest clone... done (12s)
[fedora] zfs-tests.sh quick.run: 47 passed, 3 skipped, 0 failed (8m13s)
[debian] starting ztest clone... done (15s)
[debian] zfs-tests.sh quick.run: 47 passed, 3 skipped, 0 failed (9m02s)
...
=== Summary ===
centos rocky fedora debian ubuntu
quick.run 47/50 47/50 47/50 47/50 47/50
total wall: 12m41s
- Open the dashboards at https://<host>:3000/d/zfs-test/. The Pass/Fail Matrix panel populates immediately. The trend panels need a few runs of history before they're useful.
- Run your first custom recipe:
$ sudo cp /usr/local/share/klab/recipes/pool-corruption-block.sh \
/var/lib/klab/recipes/my-first-test.sh
# edit as needed
$ sudo klab test fedora --recipe my-first-test.sh
From here you have a full lab. Add recipes for whatever workload you care about, and every test run feeds back into the dashboards. Over a week of runs the trend panels start to surface real signals — flaky tests, perf regressions, kernel-update side effects.
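If you want to gate a script or a cron job on the smoke-test results, the per-distro lines printed by the quick suite in the steps above are easy to parse. This is a sketch that assumes the line format shown in that sample output stays stable; the parse_results and failing_distros helpers are hypothetical, and the failing ubuntu line in the sample is invented to exercise the failure path.

```python
import re

# Matches lines like:
# [fedora] zfs-tests.sh quick.run: 47 passed, 3 skipped, 0 failed (8m13s)
LINE = re.compile(
    r"^\[(?P<distro>\w+)\] zfs-tests\.sh (?P<suite>\S+): "
    r"(?P<passed>\d+) passed, (?P<skipped>\d+) skipped, (?P<failed>\d+) failed"
)

def parse_results(output):
    """Map distro -> counts for every result line found in `output`."""
    results = {}
    for line in output.splitlines():
        m = LINE.match(line.strip())
        if m:
            results[m["distro"]] = {
                "suite": m["suite"],
                "passed": int(m["passed"]),
                "skipped": int(m["skipped"]),
                "failed": int(m["failed"]),
            }
    return results

def failing_distros(results):
    """Distros with at least one failed test, sorted by name."""
    return sorted(d for d, r in results.items() if r["failed"] > 0)

sample_output = """\
[fedora] starting ztest clone... done (12s)
[fedora] zfs-tests.sh quick.run: 47 passed, 3 skipped, 0 failed (8m13s)
[debian] zfs-tests.sh quick.run: 47 passed, 3 skipped, 0 failed (9m02s)
[ubuntu] zfs-tests.sh quick.run: 45 passed, 3 skipped, 2 failed (9m40s)
"""
results = parse_results(sample_output)
print(results["fedora"])
print("failing:", failing_distros(results))
```

A wrapper could exit non-zero whenever failing_distros returns a non-empty list, which is enough to hook the lab into an existing alerting path.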
21. Closing — Why This Layer Exists
OpenZFS is good code. The upstream test suite is thorough. Most distros package it cleanly. None of that protects you from the unique combination of "the OpenZFS version I am running, on the kernel I am running, under the workload I am running". That combination isn't tested anywhere except where you run it. The Test Lab gives you the framework to test it routinely instead of by accident.
The observability layer is the part that makes it work. Without eBPF / Tetragon / Hubble, a ZFS test failure looks the same as it has for twenty years — a log line, maybe a stack trace, hours of squinting. With the three-tier traces, the failure looks like a movie of what the kernel did. That difference is the difference between "filed a bug upstream and waited a release cycle" and "patched it locally before the next sprint".
The bottom line: if your storage is OpenZFS and you care whether the bytes you wrote yesterday are the bytes you read tomorrow, the Test Lab is the cheapest insurance you'll ever buy. One ASUS box, one ISO, eight hours of overnight test runs per week, and Grafana panels that tell you when something drifts. Your future self will appreciate it.