Testing Masterclass
kldload ships an installer that lays down ZFS-on-root across nine distros, plus a post-install autodeploy that brings up Kubernetes, Tetragon, the klab sandbox, and Bob AI. That is a lot of moving parts. This document describes how kldload tests itself: the two-layer model (CI in VMs, hardware on real metal), the 15-combo distro × profile matrix, what each test actually checks, the fix loop that turns a red CI run into a green one, and the buried gotchas that earlier sessions paid for in debugging time so you don't have to.
What you will learn: the philosophy behind kldload's two-layer test model, the smoke-test framework end to end (from deploy.sh smoke-test down to the per-test _pass/_fail calls), how to drive the matrix yourself, how to interpret a failure, how to land a fix and verify it, how to bootstrap a CI host from scratch, and a 13-item "war stories" list of every non-obvious bug the framework has surfaced in the last week.
Audience: anyone running kldload in production, anyone considering contributing, anyone curious how a one-person project keeps a 9-distro × 3-profile install matrix from rotting under their feet.
1. Why kldload Tests This Way
Most distro install testing is hand-driven: someone burns a USB, walks an
installer, eyeballs the result, files a bug if something looks wrong. That works
when you have a paid QA team and a small target matrix. kldload has neither — it
ships nine distros (CentOS Stream 9, Rocky 9, Fedora 44, Debian 13, Ubuntu 24.04,
RHEL 9, Arch, Alpine, FreeBSD) across three generic profiles
(core / server / desktop) and four workload
templates (kvm / k8s / klab /
zfslab). Hand-testing every meaningful combination would take days
per release, and humans get bored — bored humans miss things.
So kldload took the opposite approach. The test matrix runs in throwaway KVM
VMs on a single dedicated box (fiend), under a runner that captures
per-combo logs, records every result in SQLite, and dumps console traces on
failure. Every commit can re-light the matrix overnight. The first time the matrix
ran, in a single 2-hour run, it surfaced eight distinct latent bugs that had been
shipping unnoticed across multiple releases. That is the loop this masterclass
documents.
The two-layer test model
Layer 1 is the CI matrix in VMs — fast, reproducible, runs on every change. Layer 2 is hardware install on real metal — slow, manual, only on releases. They catch different bugs. CI is the airbag (always on, prevents the worst); hardware is the seatbelt (you put it on for the trip).
One box, two roles
You don't need a fleet to run this. fiend is a single ASUS TUF X570 that serves as both the CI runner (15 throwaway VMs in libvirt) AND the hardware test target (when you nuke and reinstall it). The matrix tests other configurations in VMs, while fiend itself is tested by the install + self-validate flow.
The framework finds itself
Most "testing infrastructure" projects spend the first six months building the framework and the next six months finding bugs. kldload's framework is pre-existing (tests/lifecycle.sh, tests/smoke-*.sh) — the matrix runs simply exposed bugs in both the framework AND the installer in cascading layers. Each fix unblocked the next layer.
SQLite, not Jenkins
The runner records every result in /opt/kldload-ci/history.sqlite. Want to know which combos failed in the most recent run? Two-line SQL. No CI portal, no YAML pipelines, no plugin ecosystem. The whole runner is 318 lines of bash.
2. The Two-Layer Test Model
Every test in kldload falls into one of two categories. Knowing which is which saves a lot of debugging time when something goes wrong.
| Layer | Where it runs | What it catches | What it misses |
|---|---|---|---|
| CI matrix (Layer 1) | 15 KVM VMs on fiend, OVMF firmware, libvirt default network | Installer logic, ZFS layout, package set drift, bootloader chain in standardized firmware, service config, post-install smoke validation | Real-firmware UEFI quirks, NVRAM behavior of MOK enrollment, NVIDIA driver compat with running kernel, slow-USB races, IOMMU groupings, BIOS-specific boot order, real network drivers, suspend / resume |
| Hardware install (Layer 2) | Real metal, you burn a USB and install | Everything in the CI row's "What it misses" column | Most installer logic bugs (you usually only test one combo per hardware run) |
The crucial point: they are not substitutes. CI tells you "the installer is logically correct"; hardware tells you "it actually works on this specific machine". You need both. A rough cadence:
- On every commit (or every push): CI matrix runs in 2 hours. If anything regresses, you know within the day. Most fixes never need hardware verification — they are pure-logic bugs (a typo in a package set, a missing profile gate, a typo in an SQL query).
- On every release candidate: burn the ISO to USB and install on fiend. Confirm the firmware path works (Secure Boot enroll, NVIDIA driver build, real disk geometry). This is where the bugs CI cannot see show up.
- Per major release: install on a second physical box if you have one. ASUS / Dell / HP / Lenovo all have firmware quirks that don't generalize.
Concretely from the recent kldload 1.1.0 work: the four-template architecture landed clean in CI on the first matrix run, but fiend itself wouldn't boot the new USB at all (a separate boot-path regression invisible to CI). The two layers were catching different bugs at the same time.
3. The Matrix
The Layer-1 matrix is fixed at five distros × three generic profiles = fifteen combinations.
| distro | core | server | desktop |
|---|---|---|---|
| centos | ✅ | ✅ | ✅ |
| rocky | ✅ | ✅ | ✅ |
| fedora | ✅ | ✅ | ✅ |
| debian | ✅ | ✅ | ✅ |
| ubuntu | ✅ | ✅ | ✅ |
Why these five distros and not all nine?
- Arch is in the installer (rolling release; no darksite, so it needs internet) but not in the matrix because there is no tests/smoke-arch.sh yet. It would work; it just hasn't been written.
- Alpine: same as Arch. The installer supports it, but there is no smoke wrapper.
- FreeBSD is on the roadmap; the installer plumbing exists but needs a darksite and bhyve test path.
- RHEL needs subscription credentials to fetch packages; the matrix can't run unattended without them. RHEL is hand-tested before releases.
And why these three profiles and not the four workload templates?
The generic profiles (core, server,
desktop) have well-defined post-install state — they're additive.
core = ZFS + stock distro + WireGuard + eBPF + diagnostic tools
(deliberately minimal). server = core + the k* tools, sanoid, the
observability stack, NVIDIA. desktop = server + GNOME + GDM +
Firefox. Each one has a smoke-$profile.sh script that asserts the
expected post-install state.
The workload templates (kvm, k8s, klab,
zfslab) need their own smoke wrappers — they bring up libvirt
networks, K8s clusters, klab goldens (5 cloud-init VMs), Tetragon, and Bob. None
of that is asserted yet by an automated test. That is phase-2 work; the framework
supports it (just add tests/smoke-{kvm,klab,zfslab}.sh and they'll
appear in the matrix).
4. What Each Test Checks
Five test scripts, each covering a layer of the install. They live in
tests/ in the kldload-free repo. Reading them is a fast way to
understand what kldload promises a finished install will look like.
4.1 tests/smoke-build.sh — pre-install ISO sanity
Runs after ./deploy.sh build but before any VM is touched.
Validates the ISO file itself:
- File exists, size > 8 GB and < 16 GB (sanity bounds for a kldload ISO)
- SHA256 checksum file present and matches
- ISO mounts cleanly via loopback
- Contains squashfs.img (the live root)
- EFI directory is present (so it can UEFI-boot)
- GRUB config in the right place
- ISO timestamp is fresh (catches the "you forgot to rebuild" bug)
Without this check, you can spend two hours running the matrix against yesterday's ISO and never know — every combo would behave the same way it did yesterday.
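The size and freshness bounds can be sketched as two small predicates. This is only an illustration of the idea: the function names, the 24-hour freshness window, and the reading of "GB" as GiB are all assumptions, and the real smoke-build.sh may check these differently.

```bash
#!/usr/bin/env bash
# Sketch of two smoke-build style checks. The function names and the
# 24-hour freshness window are assumptions, not kldload's actual code.

GIB=$((1024 * 1024 * 1024))

# Succeeds when a size in bytes is strictly between 8 GiB and 16 GiB.
iso_size_ok() {
    [ "$1" -gt $((8 * GIB)) ] && [ "$1" -lt $((16 * GIB)) ]
}

# Succeeds when mtime (epoch seconds) is at most max_age seconds before now.
iso_is_fresh() {
    local mtime=$1 now=$2 max_age=${3:-86400}
    [ $((now - mtime)) -ge 0 ] && [ $((now - mtime)) -le "$max_age" ]
}

# Apply both checks to a real file (GNU stat assumed for -c).
check_iso() {
    local iso=$1
    [ -f "$iso" ] || { echo "FAIL: $iso missing"; return 1; }
    iso_size_ok "$(stat -c %s "$iso")" || { echo "FAIL: size out of bounds"; return 1; }
    iso_is_fresh "$(stat -c %Y "$iso")" "$(date +%s)" || { echo "FAIL: ISO is stale"; return 1; }
    echo "PASS: $iso"
}
```

Keeping the predicates free of file access means the bounds can be exercised with plain numbers, while check_iso does the actual stat calls.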
4.2 tests/smoke-core.sh — 51 tests, run on every profile
This is the baseline. Every profile must pass smoke-core; server and desktop add to it but never replace it. The 51 tests fall into eight categories:
# ── ZFS userspace and pool ─────────────────────────────────────
test_cmd "ZFS userspace (zfs)" "zfs"
test_cmd "ZFS pool tools (zpool)" "zpool"
test_output_contains "ZFS module loaded" "lsmod" "zfs"
test_output_contains "Pool rpool exists" "zpool list" "rpool"
test_output_contains "Pool rpool is ONLINE" "zpool list -H -o health rpool" "ONLINE"
test_output_contains "Pool has zero errors" "zpool status rpool" "No known data errors"
test_succeeds "Pool scrub runs" "zpool scrub rpool"
# ── Datasets ──────────────────────────────────────────────────
test_dataset "rpool/ROOT exists" "rpool/ROOT"
test_output_contains "Root dataset mounted at /" \
"zfs get -rH -o value mountpoint rpool/ROOT" "/"
test_dataset "rpool/home exists" "rpool/home"
test_dataset "rpool/var exists" "rpool/var"
test_dataset "rpool/var/log exists" "rpool/var/log"
test_dataset "rpool/srv exists" "rpool/srv"
# ... and ten more dataset checks
Then EFI / bootloader, networking, hostid match between live and target,
universal install markers (/etc/kldload-build-sha,
/etc/kldload/edition, /etc/kldload/profile,
/etc/kldload/boot-environment), and finally the
negative assertions that make core meaningful:
# Core profile: kldload feature tools must be ABSENT
for tool in kst ksnap kbe kclone kdf kdir kpkg kupgrade \
kexport krecovery kldload-webui sanoid; do
if command -v "$tool" >/dev/null 2>&1; then
_fail "$tool absent (core)" "$tool found — should not be in core profile"
else
_pass "$tool absent (core)"
fi
done
Plus a runtime ZFS test (create a snapshot, verify it appears in the listing,
destroy it cleanly) and a check that the diagnostic tool
kldload-debug-bundle is installed and prints help. That's 51 tests
total; a passing core install hits all 51.
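For readers who have not opened tests/ yet, here is a hypothetical minimal version of the helpers used above. The names _pass, _fail, test_cmd, test_succeeds and test_output_contains come from the text; the bodies are guesses at the simplest implementation that fits the call sites, not the repo's code.

```bash
#!/usr/bin/env bash
# Hypothetical minimal versions of the helpers that smoke-core.sh calls.
# The real implementations live in tests/ and may differ.

PASS_COUNT=0
FAIL_COUNT=0

_pass() { PASS_COUNT=$((PASS_COUNT + 1)); printf '  PASS  %s\n' "$1"; }
_fail() { FAIL_COUNT=$((FAIL_COUNT + 1)); printf '  FAIL  %s: %s\n' "$1" "$2"; }

# test_cmd NAME COMMAND: pass when COMMAND is on PATH.
test_cmd() {
    if command -v "$2" >/dev/null 2>&1; then _pass "$1"
    else _fail "$1" "$2 not found"; fi
}

# test_succeeds NAME COMMAND_STRING: pass when the command exits 0.
test_succeeds() {
    if eval "$2" >/dev/null 2>&1; then _pass "$1"
    else _fail "$1" "command failed"; fi
}

# test_output_contains NAME COMMAND_STRING NEEDLE: pass when the combined
# stdout/stderr of the command contains NEEDLE as a fixed string.
test_output_contains() {
    local out
    # "|| true" keeps a failing command from halting a set -e script.
    out=$(eval "$2" 2>&1) || true
    case $out in
        *"$3"*) _pass "$1" ;;
        *)      _fail "$1" "'$3' not in output" ;;
    esac
}
```

A script built on these would finish by exiting non-zero when FAIL_COUNT is greater than zero, so the caller can treat the exit status as the verdict.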
4.3 tests/smoke-server.sh — extends core
Server inherits everything in smoke-core, then adds the positive assertions that core just made negative assertions about:
- All k* tools (kst, ksnap, kbe, etc.) PRESENT and respond to --help
- sanoid binary installed and the systemd service is enabled
- WireGuard userspace (wg command) installed
- eBPF tools (bpftool, bpftrace) installed
- If a GPU is detected: NVIDIA driver loaded and nvidia-smi works
- kldload-webui binary in /usr/local/sbin
The negative-then-positive pattern is deliberate — it means a misconfigured install that partially includes server bits will fail core's negative assertion. You can't accidentally pass both core and server with the same install.
4.4 tests/smoke-desktop.sh — extends server
Desktop = server + the GUI:
- GNOME session files present
- GDM service enabled
- Firefox installed
- The Firefox autostart desktop file points at https://localhost:8443 (the kldload web UI)
4.5 tests/smoke-auto.sh — the dispatcher
Reads /etc/kldload/profile on the running system to figure out
which profile is installed, then invokes the matching smoke-$profile.sh.
This is what tests/lifecycle.sh calls on the freshly-installed VM —
the test runner doesn't need to know the profile; the installed system tells it.
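The dispatch step is small enough to sketch in a few lines. pick_smoke_script is a hypothetical name and the default tests directory is taken from the lifecycle description; the real smoke-auto.sh may be structured differently.

```bash
#!/usr/bin/env bash
# Sketch of a profile dispatcher in the spirit of smoke-auto.sh.
# pick_smoke_script is a hypothetical name; the real script may differ.

# Print the smoke script path for the profile recorded in a marker file.
# Usage: pick_smoke_script [profile_file] [tests_dir]
pick_smoke_script() {
    local profile_file=${1:-/etc/kldload/profile}
    local tests_dir=${2:-/tmp/tests}
    local profile=

    [ -r "$profile_file" ] || { echo "no profile marker at $profile_file" >&2; return 2; }
    # read strips surrounding whitespace; "|| true" tolerates a missing newline.
    read -r profile < "$profile_file" || true
    case $profile in
        core|server|desktop) printf '%s/smoke-%s.sh\n' "$tests_dir" "$profile" ;;
        *) echo "unknown profile: '$profile'" >&2; return 2 ;;
    esac
}
```

On an installed target this would be used as `bash "$(pick_smoke_script)"`. Rejecting unknown profile names explicitly keeps a corrupted marker file from silently running nothing.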
5. The Per-Combo Lifecycle
What happens when you run sudo ./deploy.sh smoke-test fedora core?
The wrapper invokes tests/lifecycle.sh with arguments
fedora core. That script does the following, in order:
- Spawn a clean VM with virt-install: 4 vCPU, 8 GB RAM, 32 GB qcow2 disk, OVMF UEFI firmware, attached to libvirt's default network, boot-from-ISO. The VM is named kldload-smoke-fedora-core.
- Wait up to 15 minutes for the live env to be SSH-able. The runner polls virsh net-dhcp-leases default + tries a TCP connect to port 22 + does a no-op SSH login as live/live. The 15 minute ceiling exists because the kldload live ISO is heavy (~5 GB squashfs + ZFS rootfs init + every kldload tool loaded); on a 4-core VM it routinely needs 7-10 minutes from boot to ssh-ready.
- Compose an answers file with the install parameters and SCP it to /tmp/answers.env on the live env. The SCP wraps with sshpass -p live; direct scp falls back to pubkey auth which the live ISO doesn't have configured.
- Kick off the headless install with setsid nohup /usr/sbin/kldload-install-target --config /tmp/answers.env >/tmp/install.log 2>&1 &. The setsid + nohup means the install survives the SSH disconnect that follows immediately after the launch. Output goes to /tmp/install.log on the VM.
- Poll for completion by checking pgrep -f "[/]usr/sbin/kldload-install-target" over SSH. Once no match (the install process exited), grep the kldload installer log for "Install completed successfully". The bracket trick on the path is critical: without it pgrep matches its own SSH session's argv and the loop runs to its 60-minute ceiling without ever detecting the exit.
- Reboot the VM with disk-first boot order (the CDROM stays attached but ignored). virsh shutdown with a 30-second graceful window, then virsh destroy, then virt-xml --edit --boot hd,cdrom, then virsh start.
- Wait up to 15 minutes for the installed system to be SSH-able as admin/admin. This is the moment of truth: if the bootloader chain is broken, this is where it shows up as "installed system never came up".
- SCP the entire tests/ directory to the installed target's /tmp/tests/.
- Run tests/smoke-auto.sh on the installed target. That dispatcher reads /etc/kldload/profile and calls the matching smoke-$profile.sh. The full output is captured.
- Pass if smoke-auto reports zero failures; fail otherwise. On failure, leave the VM defined for virsh console inspection and dump the installer log + /tmp/install.log + storage log into the combo log file.
Total wall-clock per combo: ~25 minutes for core, ~30 minutes for server, ~45 minutes for desktop (more packages = longer install). Sequential matrix: ~6-8 hours. Parallel under libvirt's scheduling on a 24-core box: ~2 hours.
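Several of the lifecycle steps are the same shape: retry a check until it succeeds or a ceiling is hit. A minimal sketch of that pattern follows. wait_until is a hypothetical helper name, not something lifecycle.sh is known to define.

```bash
#!/usr/bin/env bash
# Sketch of a poll-with-ceiling helper like the waits in the lifecycle.
# wait_until is a hypothetical name.

# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 once the ceiling is reached.
wait_until() {
    local timeout=$1 interval=$2 deadline
    shift 2
    deadline=$(( $(date +%s) + timeout ))
    while :; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Give up only after a failed attempt, so the command is always tried once.
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep "$interval"
    done
}
```

With this shape, the 15-minute SSH wait would read roughly as `wait_until 900 10 sshpass -p live ssh "${SSH_OPTS[@]}" "live@$ip" true`, and the 60-minute install poll would use the same helper with a different ceiling.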
6. The CI Runner: kldload-ci-run
The matrix wraps the per-combo lifecycle in a loop. The runner is in
ci/kldload-ci-run in the repo (~318 lines of bash). It does six
things:
- Acquires a flock on /opt/kldload-ci/.run.lock. Refuses to start if another run is in progress. This prevents the nightly timer from racing a manual run.
- Optionally syncs source via CI_SYNC_CMD env var (or git pull if the source dir is a git checkout, or skips if neither).
- Builds the ISO if not --skip-build. Calls ./deploy.sh builder-image first (idempotent; only fires if the builder container image isn't present), then ./deploy.sh build. Output goes to a per-run log under /opt/kldload-ci/results/$run_id/.
- Runs ./deploy.sh smoke-build for the ISO sanity check.
- Loops the matrix: for each combo, calls sudo ./deploy.sh smoke-test <distro> <profile>, captures stdout/stderr to a per-combo log, records the result in SQLite, and on failure dumps the VM's serial console + domain XML to a failures/<combo>/ subdirectory.
- Writes a summary as Markdown at $results_dir/SUMMARY.md: pass/fail count plus a 5×3 matrix table. Useful for grepping or pasting into a release thread.
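The summary step can be pictured as a small table renderer. The sketch below is an assumption about how such a step might look, not the runner's code: render_summary, lookup_status and the "distro profile status" input format are invented for illustration.

```bash
#!/usr/bin/env bash
# Sketch of rendering the pass/fail matrix as a Markdown table.
# render_summary, lookup_status and the "distro profile status" input
# format are assumptions, not the runner's actual code.

# Print the status recorded for one combo, or nothing if absent.
lookup_status() {
    printf '%s\n' "$1" | while read -r d p s; do
        if [ "$d" = "$2" ] && [ "$p" = "$3" ]; then echo "$s"; fi
    done
}

# Reads "distro profile status" lines on stdin, prints the table.
render_summary() {
    local data distro profile cell
    data=$(cat)
    printf '| distro | core | server | desktop |\n'
    printf '%s\n' '|---|---|---|---|'
    for distro in centos rocky fedora debian ubuntu; do
        printf '| %s |' "$distro"
        for profile in core server desktop; do
            cell=$(lookup_status "$data" "$distro" "$profile")
            # Combos with no recorded result are shown as "skip".
            printf ' %s |' "${cell:-skip}"
        done
        printf '\n'
    done
}
```

The input could come straight from the results table, for example by piping `sqlite3 -separator ' '` output for one run_id into render_summary.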
SQLite schema:
CREATE TABLE runs (
run_id TEXT PRIMARY KEY, -- 2026-05-06-044342
started_at TEXT NOT NULL,
finished_at TEXT,
iso_sha TEXT, -- sha256 of the ISO this run tested
iso_size INTEGER,
source_sha TEXT, -- git rev or 'rsync'
matrix_total INTEGER NOT NULL,
matrix_pass INTEGER NOT NULL DEFAULT 0,
matrix_fail INTEGER NOT NULL DEFAULT 0,
rc INTEGER
);
CREATE TABLE results (
run_id TEXT NOT NULL,
distro TEXT NOT NULL,
profile TEXT NOT NULL,
status TEXT NOT NULL, -- pass | fail | skip
duration_s INTEGER,
fail_reason TEXT, -- one-line excerpt
log_path TEXT,
PRIMARY KEY (run_id, distro, profile)
);
Two indexes worth knowing about: idx_results_status for
"show me everything that failed in run X", and started_at DESC for
"show me the last N runs". That's all you need.
7. Driving the Tests
Three modes, depending on what you're trying to do.
7.1 Single combo (the dev loop)
sudo /usr/local/bin/kldload-ci-run --only fedora-core --skip-build
Uses the most recent ISO. Runs only the fedora-core combo. ~25 minutes total. This is what you use when you've made a change you think fixes one specific combo's failure, and you want to verify before kicking off the full matrix.
7.2 Full matrix (the nightly run)
sudo /usr/local/bin/kldload-ci-run
Builds a fresh ISO. Runs all 15 combos. ~2-3 hours. This is what fires from the systemd timer at 03:00 UTC nightly. Manual invocation works the same way.
7.3 Just what failed last time
sudo /usr/local/bin/kldload-ci-run --diff-last
Subsets the matrix to only the combos that failed in the last completed run. Use this when you've fixed a specific bug and want to verify it without re-running combos that were already passing.
7.4 Status and reports
# Last 10 runs, columnar
sudo /usr/local/bin/kldload-ci-run --status
# Full report for a specific run
sudo /usr/local/bin/kldload-ci-run --report 2026-05-06-044342
# Or query SQLite directly
sudo sqlite3 /opt/kldload-ci/history.sqlite \
"SELECT distro, profile, status, fail_reason
FROM results WHERE run_id='2026-05-06-044342'"
7.5 Watching a run live
sudo tail -f /var/log/kldload-ci-bootstrap.log
# or for a specific combo:
sudo tail -f /opt/kldload-ci/results/<run-id>/smoke-fedora-core.log
7.6 Inspecting a failed VM
The framework leaves failed VMs running so you can poke at them:
# What's there
sudo virsh list --all | grep smoke
# Serial console
sudo virsh console kldload-smoke-fedora-core
# Or SSH (find the IP via DHCP lease)
mac=$(sudo virsh domiflist kldload-smoke-fedora-core \
| awk 'NR>2 {print $5; exit}')
ip=$(sudo virsh net-dhcp-leases default \
| awk -v m="$mac" 'tolower($3)==tolower(m){print $5}' \
| cut -d/ -f1)
# Installed system (after install completed)
sshpass -p admin ssh admin@${ip}
# Live env (if the install never finished)
sshpass -p live ssh live@${ip}
Set KEEP_VM=1 to keep the VM around even on success — useful
when you want to poke at a passing install to understand its layout.
8. The Fix Loop
Bug surfaces in the matrix. You need to land a fix and verify it cleared. The loop is five steps:
- Diagnose from the per-combo log. The smoke log captures the specific test name that failed and a one-line reason. If the failure is in the install (not the post-install validation), the lifecycle script also dumped the installer log + /tmp/install.log + storage log into the same file.
- Edit source on the build host (typically onyx, the kldload-free repo). Add a comment in the fixed code referencing the run ID that caught it; future readers will thank you.
- Commit. Use a message that names the bug, the failure mode, and the fix. Reference the run ID in the commit body. Example:
fix(install-target): kldload-debug-bundle on core too
Bug seen by fiend matrix run #2 (2026-05-06-155717): all combos pass
49/51 (was 47/51), with the new failure being:
✗ kldload-debug-bundle present — kldload-debug-bundle not found
✗ kldload-debug-bundle --help works — command failed
Earlier commit 307e7f7 added an early-return for core in k_install_tools
to stop the kldload-webui leak. That correctly skipped kldload-webui
but ALSO skipped kldload-debug-bundle and kldload-recovery — which are
diagnostic / incident-response tools, not kldload features. Even a
core install should ship these for support purposes.
Fix: install both diagnostic tools BEFORE the core early-return.
- Rsync the fix to the CI host. The runner doesn't pull from git automatically (most kldload repos are private; we use rsync push). Either sync the whole tree or just the changed file:
sshpass -p Passw0rd rsync -av \
-e 'ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null' \
/root/kldload-free/live-build/config/includes.chroot/usr/sbin/kldload-install-target \
admin@fiend.unixbox.net:/opt/kldload-ci/kldload-free/live-build/config/includes.chroot/usr/sbin/kldload-install-target
- Re-trigger the matrix. The runner builds a fresh ISO with the new source, then re-runs whatever subset you specify:
sshpass -p Passw0rd ssh admin@fiend.unixbox.net \
'sudo systemd-run --unit=kldload-ci-bootstrap --collect \
--property=StandardOutput=append:/var/log/kldload-ci-bootstrap.log \
--property=TimeoutStartSec=12h \
bash -c "env CI_SYNC_CMD= /usr/local/bin/kldload-ci-run --diff-last"'
One critical gotcha here: the rsync must complete BEFORE the build starts. If you kick off the matrix before the rsync finishes, the build uses the stale source and the fix won't be in the new ISO. The runner acquires a flock; the rsync doesn't. So always rsync, then trigger.
Wait for the matrix to finish (typically 2 hours for a full run, ~25 minutes
for one combo with --skip-build). Check --status. If the pass
count went up, you fixed something. If a NEW failure surfaces, that's the
next layer of bug; by design, this is how the cascading fix loop works.
9. War Stories — 13 Buried Gotchas
This is the section that earlier sessions paid for in debugging time. Each gotcha lists the bug, the symptom, the root cause, and the fix. Every one of these is a comment in the relevant source file too — but knowing they exist saves a future debugging session.
9.1 F44 zram-generator and kubelet
Symptom: kubelet fails to start on every Fedora 44 K8s
golden VM with "running with swap on is not supported". The script ran
swapoff -a; you can confirm swap is off; kubelet still refuses.
Cause: Fedora 44 cloud images ship
zram-generator-defaults which auto-creates /dev/zram0
swap on every boot via systemd-zram-setup@zram0.service.
swapoff kills the live instance but the next boot brings it back.
Kubelet refuses → kubeadm init retries → the second init fails on existing
manifests → the cluster silently never converges.
Fix: in kube-setup setup_kernel(), also mask
the generator unit and remove the package:
swapoff -a || true
sed -ri '/\sswap\s/s/^/#/' /etc/fstab
systemctl disable --now 'systemd-zram-setup@*.service' 2>/dev/null || true
systemctl mask 'systemd-zram-setup@zram0.service' 2>/dev/null || true
rm -f /etc/systemd/zram-generator.conf /usr/lib/systemd/zram-generator.conf
dnf remove -y zram-generator zram-generator-defaults 2>/dev/null || true
9.2 AI ERR trap fires on missing GPU
Symptom: on hosts without an NVIDIA GPU, the autodeploy
ai-pull subshell fails immediately and the install is marked
ai-failed even though the install would have skipped AI gracefully
(insufficient VRAM check).
Cause: nvidia-smi exits 1 if no GPU is
present. With set -e + set -o pipefail, the GPU
probe pipeline trips the ai-pull subshell's ERR trap, which is supposed to
catch install/pull failures.
Fix: wrap the GPU probe with
set +e/set -e so probes can return non-zero without
triggering the trap.
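The failure mode and one possible guard can be reproduced without a GPU. Note that this sketch is not the change that landed in kldload: instead of toggling set +e / set -e, it captures the probe's exit status with ||, which bash exempts from both errexit and ERR traps. probe_unguarded and probe_guarded are hypothetical names, and the probe command is passed as arguments so `false` can stand in for nvidia-smi.

```bash
#!/usr/bin/env bash
# Sketch: a probe that may exit non-zero, with and without a guard.
# Function names are hypothetical; bash is required for the ERR trap.

# Unguarded: the failing probe fires the ERR trap and errexit ends the subshell.
probe_unguarded() (
    set -e -o pipefail
    trap 'echo "ERR trap fired"' ERR
    "$@" >/dev/null 2>&1
    echo "unreachable when the probe fails"
)

# Guarded: a command on the left of || is exempt from set -e and ERR traps,
# so the exit status can be recorded and acted on.
probe_guarded() (
    set -e -o pipefail
    trap 'echo "ERR trap fired"' ERR
    rc=0
    "$@" >/dev/null 2>&1 || rc=$?
    echo "probe exit=$rc, still running"
)
```

In an autodeploy script the call would look like `probe_guarded nvidia-smi -L`, with the recorded status deciding whether the AI step is skipped.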
9.3 Core profile leaks kldload-webui via TWO copy sites
Symptom: smoke-core asserts kldload-webui is
absent. On a clean core install, the binary appears at
/usr/local/sbin/kldload-webui.
Cause: profiles.sh:649 copies
/usr/local/bin/kldload-webui to the target. A SECOND copy site is
kldload-install-target:925, which does
for f in /usr/local/sbin/kldload-*; do install ... — this catches
kldload-webui because the live env has the binary in BOTH bin and sbin.
Fixing only the first site leaves the second leaking.
Fix: early-return for core in BOTH functions. Keep diagnostic tools (kldload-debug-bundle, kldload-recovery) above the early-return because they should ship even on core.
9.4 Universal install markers stripped by core early-return
Symptom: smoke-core fails three tests on every core install:
/etc/kldload-build-sha, /etc/kldload/edition,
/etc/kldload/boot-environment all missing.
Cause: these markers are written deep inside the non-core
branch of k_install_system_files. Adding an early-return for core
to fix bug 9.3 also skipped these.
Fix: move marker writes BEFORE the core early-return — they're install-identification metadata, not kldload features. Every install needs them.
9.5 pgrep matches its own SSH session
Symptom: smoke-test poll loop says "install still running (20 min elapsed)" forever. The install actually finished 18 minutes ago and exited cleanly.
Cause: pgrep -f kldload-install-target over
SSH matches the parent bash process invoked by sshd, whose argv contains the
literal pattern. Even pgrep -f /usr/sbin/kldload-install-target
doesn't help — the SSH session has the full path verbatim too.
Fix: the classic bracket trick. Use
pgrep -f "[/]usr/sbin/kldload-install-target". The brackets are
a regex character class — they match the literal "/" in the install
process's argv but NOT the literal characters
[/]usr/sbin/... in the SSH session's argv (which is what pgrep
sees when it scans /proc/PID/cmdline on its own caller).
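The effect is easy to check offline by running the two patterns against sample command lines with grep -E, which uses the same extended-regex family as pgrep. The sample argv strings below are illustrative, not captured from a real run, and matches is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Demonstrates the bracket trick on sample argv strings with grep -E.
# The command lines are illustrative, not captured from a real run.

# Usage: matches PATTERN CMDLINE
matches() {
    if printf '%s\n' "$2" | grep -Eq -- "$1"; then echo match; else echo no-match; fi
}

installer='/usr/sbin/kldload-install-target --config /tmp/answers.env'
plain_session='bash -c pgrep -f /usr/sbin/kldload-install-target'
bracket_session='bash -c pgrep -f "[/]usr/sbin/kldload-install-target"'

# Plain pattern: the SSH session's own argv matches (false positive).
matches '/usr/sbin/kldload-install-target' "$plain_session"      # match

# Bracketed pattern: still matches the real installer process...
matches '[/]usr/sbin/kldload-install-target' "$installer"        # match

# ...but not the session, whose argv contains "[/]usr", never "/usr".
matches '[/]usr/sbin/kldload-install-target' "$bracket_session"  # no-match
```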
9.6 virsh domifaddr returns 127.0.0.1 first
Symptom: smoke-test reports "live env never came up" after 15 minutes, even though manual SSH to the VM works fine.
Cause: virsh domifaddr --source agent emits ALL
interface addresses, including 127.0.0.1 (lo). The first match in
the awk pipe was the loopback, which short-circuited the function. The caller
then tried SSH to 127.0.0.1, got "Permission denied" forever, and timed out.
Fix: filter loopback and link-local before consuming:
ip=$(virsh domifaddr "$VM_NAME" --source agent 2>/dev/null \
| awk '/ipv4/ {print $4}' \
| cut -d/ -f1 \
| grep -vE '^(127\.|169\.254\.|0\.0\.0\.0$)' \
| head -1)
9.7 scp without sshpass falls back to pubkey auth
Symptom: smoke-test reports "couldn't scp answers to live env" instantly. SCP works fine when you do it manually with the same arguments.
Cause: the smoke-test framework's SCP calls used plain
scp ${SSH_OPTS[@]} ... without sshpass. SSH falls
back to pubkey auth, the live ISO has no key configured for the source host,
and auth fails silently in non-interactive mode.
Fix: always wrap SCP for the live env with
sshpass -p live (and sshpass -p admin for the
installed system).
9.8 printf with format string starting with dash
Symptom: the runner's summary table generation throws "printf: --: invalid option" and the SUMMARY.md is partially generated.
Cause: printf '---x---'. Bash's printf parses its first
argument for options before treating it as the format string, sees the
leading dashes, and errors out.
Fix: printf '%s' '---x---' or
printf -- '---x---'. The first is more idiomatic.
9.9 zfs get rejects the * character
Symptom: smoke-core halts silently after "rpool/ROOT exists" passes. No FAIL line is printed; the script just exits.
Cause: the test
test_output_contains "Root mounted at /" "zfs get -H -o value mountpoint
rpool/ROOT/*" "/" uses a shell glob inside the eval. ZFS rejects
* as an invalid character in dataset names. The error message
contains "/" so technically the test should pass — but the eval failure
cascaded through set -e and halted the script.
Fix: use the ZFS recursive flag instead of a shell glob:
zfs get -rH -o value mountpoint rpool/ROOT. Same semantic, no
'*' character.
9.10 Concurrent runs race the VM name
Symptom: two matrix runs in flight at the same time stomp
on each other's kldload-smoke-<distro>-<profile> VM
names. The second run sees the first's VM, gets confused, fails weirdly.
Cause: the systemd timer can fire while a manual run is in progress. No mutex was in place initially.
Fix: the runner now acquires a flock at
/opt/kldload-ci/.run.lock at startup. Second invoker dies cleanly
with "another kldload-ci-run is in progress (holder: PID, started_at)".
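A mutex of this kind can be sketched with flock(1) from util-linux. The function name, the use of fd 9, and the holder-file format are assumptions for illustration; only the lock path and the shape of the error message come from the text.

```bash
#!/usr/bin/env bash
# Sketch of a single-run mutex using flock(1) from util-linux.
# acquire_run_lock, fd 9 and the holder-file format are assumptions.

# Usage: acquire_run_lock LOCKFILE
# On success the lock is held on fd 9 until the calling shell exits.
acquire_run_lock() {
    local lock=$1
    # Open in append mode so a losing invoker does not wipe the holder info.
    exec 9>>"$lock"
    if ! flock -n 9; then
        echo "another kldload-ci-run is in progress (holder: $(cat "$lock" 2>/dev/null))" >&2
        exec 9>&-
        return 1
    fi
    # Record who holds the lock so a second invoker can report it.
    printf 'pid=%s started_at=%s\n' "$$" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$lock"
}
```

Because the kernel drops the lock when the holding process exits, a crashed run cannot leave a stale lock behind, which is the main reason to prefer flock over a PID file.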
9.11 Podman short-name-mode = enforcing kills non-TTY builds
Symptom: the build container fails with "short-name resolution enforced but cannot prompt without a TTY". Manual builds in your shell work fine; CI builds fail.
Cause: default Fedora /etc/containers/registries.conf
sets short-name-mode = "enforcing", which prompts for a registry choice
when an image is pulled by short name (for example fedora:44 rather than
docker.io/library/fedora:44). The CI runner has no TTY, so the pull fails.
Fix: add a permissive override in
/etc/containers/registries.conf.d/00-kldload-ci-permissive.conf:
short-name-mode = "permissive"
unqualified-search-registries = ["docker.io", "quay.io", "registry.fedoraproject.org"]
9.12 Podman ZFS storage driver wants the dataset pre-created
Symptom: the build fails immediately with "cannot find root filesystem rpool/var/lib/containers/storage/zfs: dataset does not exist".
Cause: kldload installs configure podman to use the ZFS
storage driver but don't create the underlying dataset. The default ZFS
layout has rpool/var/lib but not the containers subtree.
Fix: create with -p to materialize the
parents:
sudo zfs create -p -o mountpoint=/var/lib/containers/storage \
rpool/var/lib/containers/storage
sudo zfs create rpool/var/lib/containers/storage/zfs
This needs to be added to the kldload installer; for now it's a manual post-install step on any new CI host.
9.13 Failure-debug capture itself was broken
Symptom: when an installer aborts (without a success marker), the smoke-test was supposed to dump the last 40 lines of the installer log into the combo log. For early aborts (Ubuntu, ~50s), the dump was empty.
Cause: the original capture only tailed
/var/log/installer/kldload-installer.log and
/var/log/installer/bootstrap.log. Neither file exists yet
when the installer aborts in pre-flight. Meanwhile the actual error was in
/tmp/install.log (the smoke-test's own stdout redirect from the
nohup command), which was never tailed.
Fix: dump file existence + sizes first, then tail every
plausible log file (/tmp/install.log,
/var/log/installer/*.log, storage.log,
zfs.log) — only those that exist and have content.
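The inventory-then-tail approach can be sketched as one function. dump_logs is a hypothetical name and the output format is invented; the candidate paths would be the ones listed above.

```bash
#!/usr/bin/env bash
# Sketch of a failure dump that reports what exists before tailing it.
# dump_logs is a hypothetical name; the output format is an assumption.

# Usage: dump_logs TAIL_LINES FILE...
dump_logs() {
    local lines=$1 f
    shift
    # First an inventory, so missing files are visible in the combo log.
    for f in "$@"; do
        if [ -e "$f" ]; then
            printf 'found   %s (%s bytes)\n' "$f" "$(wc -c < "$f" | tr -d ' ')"
        else
            printf 'missing %s\n' "$f"
        fi
    done
    # Then the tail of every file that exists and is non-empty.
    for f in "$@"; do
        [ -s "$f" ] || continue
        # %s keeps the leading dashes out of printf's option parsing.
        printf '%s\n' "----- tail -n $lines $f -----"
        tail -n "$lines" "$f"
    done
}
```

A call such as `dump_logs 40 /tmp/install.log /var/log/installer/*.log` degrades gracefully: an unmatched glob is passed through literally and simply shows up as a "missing" line.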
10. Bootstrap a Fresh CI Host
If fiend gets nuked or you want to bring up a second CI runner, the full setup is below. It assumes you've installed kldload (klab template recommended) on the new box and have static-IP'd it.
# 1. Install build deps on the new host:
ssh admin@new-ci-host 'echo Passw0rd | sudo -S dnf install -y \
git ShellCheck sqlite jq qemu-img sshpass'
# 2. Layout:
ssh admin@new-ci-host 'sudo mkdir -p /opt/kldload-ci/{results,bin}; \
sudo chown -R admin:admin /opt/kldload-ci'
# 3. ZFS dataset for podman storage (the kldload installer doesn't create this):
ssh admin@new-ci-host 'sudo zfs create -p \
-o mountpoint=/var/lib/containers/storage \
rpool/var/lib/containers/storage; \
sudo zfs create rpool/var/lib/containers/storage/zfs'
# 4. Podman: permissive short-name resolution for non-TTY builds:
ssh admin@new-ci-host 'echo "short-name-mode = \"permissive\"
unqualified-search-registries = [\"docker.io\",\"quay.io\",\"registry.fedoraproject.org\"]" \
| sudo tee /etc/containers/registries.conf.d/00-kldload-ci-permissive.conf'
# 5. Rsync source from your dev host (excludes build output, caches, logs, .claude):
sshpass -p Passw0rd rsync -av --delete \
--exclude='live-build/output' \
--exclude='live-build/output-pass*' \
--exclude='live-build/cache' \
--exclude='live-build/darksite-ollama-cache.disabled' \
--exclude='live-build/logs' \
--exclude='.claude' \
-e 'ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null' \
/root/kldload-free/ \
admin@new-ci-host:/opt/kldload-ci/kldload-free/
# 6. Install runner + sudoers:
sshpass -p Passw0rd ssh admin@new-ci-host '
echo Passw0rd | sudo -S install -m 0755 \
/opt/kldload-ci/kldload-free/ci/kldload-ci-run \
/usr/local/bin/kldload-ci-run
echo "admin ALL=(ALL) NOPASSWD: /usr/local/bin/kldload-ci-run, \
/opt/kldload-ci/kldload-free/deploy.sh" | sudo tee \
/etc/sudoers.d/95-kldload-ci > /dev/null
sudo chmod 0440 /etc/sudoers.d/95-kldload-ci'
# 7. Systemd unit + nightly timer:
ssh admin@new-ci-host 'sudo tee /etc/systemd/system/kldload-ci.service > /dev/null <<UNIT
[Unit]
Description=kldload CI matrix
Wants=network-online.target
After=network-online.target
[Service]
Type=oneshot
TimeoutStartSec=12h
ExecStart=/usr/local/bin/kldload-ci-run
StandardOutput=append:/var/log/kldload-ci.log
StandardError=inherit
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=4
UNIT
sudo tee /etc/systemd/system/kldload-ci.timer > /dev/null <<TIMER
[Unit]
Description=Run kldload CI nightly at 03:00 UTC
[Timer]
OnCalendar=*-*-* 03:00:00
RandomizedDelaySec=15min
Persistent=true
Unit=kldload-ci.service
[Install]
WantedBy=timers.target
TIMER
sudo systemctl daemon-reload
sudo systemctl enable --now kldload-ci.timer'
# 8. First smoke run to verify:
ssh admin@new-ci-host 'sudo kldload-ci-run --only fedora-core'
That's it. Total wall-clock: about 10 minutes of setup, plus the duration of the first smoke run.
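Before trusting the nightly timer, it can help to confirm that every file the steps above install actually landed. The following is a minimal sketch, not part of the kldload tree: the function name `verify_ci_host` and its optional root-prefix argument are assumptions made for illustration. The paths are the ones written by steps 4, 6, and 7.

```shell
#!/usr/bin/env bash
# Hypothetical post-bootstrap check (not shipped with kldload).
# Run as root on the new CI host: /etc/sudoers.d is root-only on some distros.
# The optional first argument is a root prefix, so the check can also be
# pointed at a staging tree instead of the live filesystem.
verify_ci_host() {
    local root="${1:-}" rc=0 path
    for path in \
        /etc/containers/registries.conf.d/00-kldload-ci-permissive.conf \
        /usr/local/bin/kldload-ci-run \
        /etc/sudoers.d/95-kldload-ci \
        /etc/systemd/system/kldload-ci.service \
        /etc/systemd/system/kldload-ci.timer
    do
        if [ -e "${root}${path}" ]; then
            printf 'ok       %s\n' "$path"
        else
            printf 'MISSING  %s\n' "$path"
            rc=1
        fi
    done
    # Non-zero when at least one expected file is absent.
    return "$rc"
}
```

File presence is only half the story. `sudo visudo -cf /etc/sudoers.d/95-kldload-ci` checks the sudoers syntax, and `systemctl list-timers kldload-ci.timer` shows when the next run is scheduled.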
11. What CI Doesn't Catch
Be explicit about the limitations. Treating a green matrix as proof that a release works on real machines is how you ship a release that boots in OVMF and bricks on real hardware.
The matrix runs in OVMF (the EDK II UEFI firmware build used by QEMU/libvirt guests). Real-machine UEFI implementations diverge wildly. We've seen all of these in the last few releases:
- Slow-USB rootdelay races — HP-branded USB sticks consistently need an extra rootdelay parameter to mount the squashfs at boot. OVMF doesn't exhibit the timing.
- MOK enrollment NVRAM behavior — different firmware vendors implement MokManager differently. Some boot once after enrollment then never again. OVMF's NVRAM is a flat file and behaves predictably.
- NVIDIA driver vs running kernel — a kernel update that shipped at 2 AM can break NVIDIA DKMS rebuilds. Caught only on hardware with an actual GPU.
- BIOS boot order — some firmwares ignore the EFI variable your installer set and boot the wrong device. OVMF respects what we set.
- Real disk geometry — 4Kn vs 512n drives, NVMe namespaces, RAID controllers. OVMF gives you a clean qcow2 with predictable geometry.
- Real network drivers — Intel I210 vs Realtek r8169 vs Mellanox ConnectX. Virtio hides all of that.
- Power management — suspend/resume, idle states, AC events. None of this is in the VM.
So the per-release hardware install on fiend (or your equivalent) isn't optional. CI shrinks the surface to "will it install correctly given the right firmware"; hardware install confirms the firmware is right.
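One of the divergences listed above, disk geometry, is easy to record during a hardware install so it ends up in the install notes. This is a sketch under stated assumptions: the function names are invented for illustration, and the labels follow the usual convention (512n is 512-byte logical and physical sectors, 512e is 512 logical on 4096 physical, 4Kn is 4096 for both). The sysfs files it reads are standard Linux block-layer attributes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: label each disk as 512n, 512e, or 4Kn.

# classify_sectors LOGICAL PHYSICAL: print the conventional geometry label.
classify_sectors() {
    case "$1/$2" in
        512/512)   echo 512n ;;
        512/4096)  echo 512e ;;
        4096/4096) echo 4Kn ;;
        *)         echo "unknown ($1/$2)" ;;
    esac
}

# report_disks [SYSDIR]: print "<device> <label>" for every block device.
# SYSDIR defaults to /sys/block; the override exists only to make testing easy.
report_disks() {
    local sys="${1:-/sys/block}" dev log phy
    for dev in "$sys"/*; do
        [ -r "$dev/queue/logical_block_size" ] || continue
        log=$(cat "$dev/queue/logical_block_size")
        phy=$(cat "$dev/queue/physical_block_size")
        printf '%s %s\n' "${dev##*/}" "$(classify_sectors "$log" "$phy")"
    done
}
```

Running `report_disks` on the hardware box and on a matrix VM side by side makes the difference in test surface concrete.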
12. Roadmap
What's not done yet, in rough priority order:
Workload-template smoke wrappers
The matrix covers core / server / desktop. The four workload templates (kvm, k8s, klab, zfslab) need their own tests/smoke-{kvm,klab,zfslab}.sh. Each wrapper needs to assert the post-install state for its template; for klab that means 5 cloud-init goldens snapshot-ready, the K8s cluster Ready, and Tetragon Running. Expect roughly 200 lines of code each.
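As a rough idea of what the klab assertions could look like, the sketch below parses `kubectl get ... --no-headers` output from stdin. The function names are hypothetical, and how they would be wired into the existing _pass/_fail helpers is left open because those helpers' signatures are not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical assertion helpers for a workload-template smoke wrapper.

# all_nodes_ready: reads `kubectl get nodes --no-headers` on stdin.
# Succeeds only if there is at least one node and every STATUS column
# (field 2) is exactly "Ready".
all_nodes_ready() {
    awk '$2 != "Ready" { bad = 1 } END { exit (NR == 0 || bad) ? 1 : 0 }'
}

# all_pods_running: reads `kubectl get pods --no-headers` on stdin.
# Succeeds only if there is at least one pod and every STATUS column
# (field 3) is exactly "Running".
all_pods_running() {
    awk '$3 != "Running" { bad = 1 } END { exit (NR == 0 || bad) ? 1 : 0 }'
}

# Example wiring (assumes kubectl is configured inside the installed system):
#   kubectl get nodes --no-headers | all_nodes_ready || echo "cluster not Ready"
```

Treating an empty listing as a failure is deliberate: a cluster that reports no nodes at all should not count as Ready.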
RHEL in the matrix
RHEL needs subscription credentials (username + password OR activation key + org). Adding it requires either embedding a test account in the runner config or skipping when creds aren't present. Probably "skip with a warning" is right.
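The "skip with a warning" option could be sketched as a guard in the runner. Everything named here is an assumption for illustration: the `KLDLOAD_RHSM_*` environment variables, the function names, and the `rhel-<profile>` combo naming (modelled on `fedora-core`).

```shell
#!/usr/bin/env bash
# Hypothetical credential guard; variable and function names are illustrative.

# rhel_creds_available: succeeds if either complete credential pair is set
# (username + password, or activation key + org).
rhel_creds_available() {
    if [ -n "${KLDLOAD_RHSM_USER:-}" ] && [ -n "${KLDLOAD_RHSM_PASS:-}" ]; then
        return 0
    fi
    if [ -n "${KLDLOAD_RHSM_ACTIVATION_KEY:-}" ] && [ -n "${KLDLOAD_RHSM_ORG:-}" ]; then
        return 0
    fi
    return 1
}

# should_run_combo COMBO: print "run" or "skip" on stdout.
# The warning goes to stderr so it shows up in the log without
# polluting the value the caller captures.
should_run_combo() {
    local combo="$1"
    case "$combo" in
        rhel-*)
            if ! rhel_creds_available; then
                echo "WARN: skipping $combo: no RHEL subscription credentials set" >&2
                echo skip
                return 0
            fi
            ;;
    esac
    echo run
}
```

A half-configured pair (for example an activation key without an org) is treated the same as no credentials, which avoids a confusing registration failure halfway through an install.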
Arch + Alpine wrappers
The installer supports both, but no smoke-arch.sh or smoke-alpine.sh exists. Arch is interesting because it's rolling-release — package versions change daily and the matrix would catch that drift. Alpine has its own quirks (apk vs dnf/apt, OpenRC vs systemd). Each wrapper is a 100-line script.
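Much of the per-distro difference in such wrappers comes down to two lookups: how to ask whether a package is installed, and which init system is running. A minimal sketch follows, with invented function names. The query commands are the standard ones for each package manager. The init detection relies on `/run/systemd/system` existing under systemd and on the assumption that OpenRC keeps its state in `/run/openrc`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for distro-specific smoke wrappers.

# pkg_query_cmd MANAGER PKG: print the command that checks whether PKG
# is installed; returns 1 for a package manager it does not know.
pkg_query_cmd() {
    case "$1" in
        dnf|rpm)  echo "rpm -q $2" ;;
        apt|dpkg) echo "dpkg -s $2" ;;
        apk)      echo "apk info -e $2" ;;
        pacman)   echo "pacman -Q $2" ;;
        *)        return 1 ;;
    esac
}

# detect_init [ROOT]: print systemd, openrc, or unknown.
# ROOT is an optional prefix used only for testing against a fake tree.
detect_init() {
    local root="${1:-}"
    if [ -d "$root/run/systemd/system" ]; then
        echo systemd
    elif [ -d "$root/run/openrc" ]; then
        echo openrc
    else
        echo unknown
    fi
}
```

With these two in place, an Alpine wrapper and an Arch wrapper could share most of their assertions and differ only in the manager name they pass in.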
FreeBSD path
Installer plumbing exists; smoke wrapper does not; no darksite. The bigger question is whether to run the smoke test in bhyve VMs (real BSD test surface) or in a Linux VM with FreeBSD as the install target (faster, weaker test). Probably both eventually.
Per-failure artifact upload to R2
Today the failure VM is left running and the per-combo logs stay on fiend. For long-running investigations, uploading the failure tarball (VM console + journal + ZFS state) to Cloudflare R2 would make failures inspectable from anywhere, not just by SSH'ing into fiend.
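Since R2 exposes an S3-compatible API, the upload could be as small as a tarball step plus one `aws s3 cp` call pointed at the R2 endpoint. The sketch below is an assumption-heavy illustration: the function names, the `R2_BUCKET` and `R2_ACCOUNT_ID` variables, the `failures/` key prefix, and the idea that one directory holds all artifacts for a combo are not taken from the runner.

```shell
#!/usr/bin/env bash
# Hypothetical failure-artifact upload; names and layout are illustrative.

# bundle_failure COMBO LOGDIR OUTDIR: tar up everything in LOGDIR and
# print the path of the resulting tarball. OUTDIR must be outside LOGDIR.
bundle_failure() {
    local combo="$1" logdir="$2" outdir="$3" stamp tarball
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    tarball="$outdir/${combo}-${stamp}.tar.gz"
    tar -C "$logdir" -czf "$tarball" . || return 1
    echo "$tarball"
}

# upload_failure TARBALL: copy the tarball to R2 through its S3 endpoint.
# Expects R2_BUCKET and R2_ACCOUNT_ID in the environment, plus an R2 API
# token exported as AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.
upload_failure() {
    local tarball="$1"
    aws s3 cp "$tarball" \
        "s3://${R2_BUCKET:?set R2_BUCKET}/failures/${tarball##*/}" \
        --endpoint-url "https://${R2_ACCOUNT_ID:?set R2_ACCOUNT_ID}.r2.cloudflarestorage.com"
}
```

Keeping the bundling separate from the upload means a failed upload still leaves the tarball on the CI host, so nothing is lost compared with today's behaviour.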
Hardware test self-validation
The reverse of CI: when you do a hardware install, run the smoke suite against the running install (not in a VM). The autodeploy could fire kldload-self-test and write the result to /var/lib/kldload/self-test-results.json. Then "did the install work" is automated even on hardware.
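A possible shape for such a self-test is sketched below. It is not the real kldload-self-test, which does not exist yet: the function name, the `NAME=COMMAND` argument format, and the JSON layout are all assumptions. Only the output path comes from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical self-test runner; JSON layout and argument format are illustrative.

# run_self_test OUTFILE NAME=COMMAND...
# Runs each COMMAND through sh -c, records pass/fail per NAME, writes a JSON
# summary to OUTFILE, and returns non-zero if any check failed.
# NAME must not contain double quotes or backslashes (no JSON escaping is done).
run_self_test() {
    local out="$1"; shift
    local spec name cmd status overall=pass sep="" body=""
    for spec in "$@"; do
        name="${spec%%=*}"   # text before the first '='
        cmd="${spec#*=}"     # everything after the first '='
        if sh -c "$cmd" >/dev/null 2>&1; then
            status=pass
        else
            status=fail
            overall=fail
        fi
        body="${body}${sep}{\"name\":\"${name}\",\"status\":\"${status}\"}"
        sep=","
    done
    mkdir -p "$(dirname "$out")"
    printf '{"overall":"%s","checks":[%s]}\n' "$overall" "$body" > "$out"
    [ "$overall" = pass ]
}

# Example invocation (the two checks are suggestions, not kldload's suite):
#   run_self_test /var/lib/kldload/self-test-results.json \
#       'root-on-zfs=[ "$(findmnt -n -o FSTYPE /)" = zfs ]' \
#       'pools-healthy=zpool status -x | grep -q "all pools are healthy"'
```

Writing the file even when checks fail matters here: on hardware there is no runner watching the console, so the JSON is the only record of what went wrong.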
13. Closing — Why This Approach Holds Up
Most distro projects (and most open-source infrastructure) get tested by their users. A user installs, hits a bug, files an issue, the maintainer investigates, ships a fix in the next release. That feedback loop is measured in weeks. For nine distros across three profiles, the loop never closes — by the time you've fixed the bug a user found on Debian-server, three more bugs have shipped in CentOS-desktop.
The kldload approach inverts that. Every commit triggers (or can trigger) a 15-combo matrix that reproduces what a user would do — burn ISO, install, validate. The matrix found 8 latent bugs in its first overnight run that could have been shipping for weeks or months otherwise. That's not because the maintainer was sloppy — it's because no human can verify a 9 × 3 matrix by hand on every commit. The framework can.
The tradeoff is honest: you lose the "real hardware" coverage CI can't give you. You compensate with periodic hardware installs on the one box you have. The combination is more thorough than either alone, and it's cheap enough to run that "did this commit break anything?" becomes a 2-hour question instead of a release-cycle question.
If you've read this far and you're standing up your own kldload (or any distro project) — the test infrastructure is the unglamorous part of the project that compounds the most over time. Build the matrix early. Add to it every time you ship a new feature. Trust the framework when it surfaces a regression even if it looks like a flake. The dollar you save by skipping test infrastructure costs ten in user-facing bugs three months later.
The bottom line: kldload's test infrastructure is 318 lines of bash, one ASUS box, and a SQLite database. It catches more regressions per dollar than any CI cloud you could rent. It runs overnight while you sleep. The next time you stand up a project that needs to ship across heterogeneous targets, copy this pattern. It works because it stays small.