
Testing Masterclass

kldload ships an installer that lays down ZFS-on-root across nine distros, plus a post-install autodeploy that brings up Kubernetes, Tetragon, the klab sandbox, and Bob AI. That is a lot of moving parts. This document describes how kldload tests itself: the two-layer model (CI in VMs, hardware on real metal), the 15-combo distro × profile matrix, what each test actually checks, the fix loop that turns a red CI run into a green one, and the buried gotchas that earlier sessions paid for in debugging time so you don't have to.

What you will learn: the philosophy behind kldload's two-layer test model, the smoke-test framework end to end (from deploy.sh smoke-test down to the per-test _pass/_fail calls), how to drive the matrix yourself, how to interpret a failure, how to land a fix and verify it, how to bootstrap a CI host from scratch, and a 13-item "war stories" list of every non-obvious bug the framework has surfaced in the last week.

Audience: anyone running kldload in production, anyone considering contributing, anyone curious how a one-person project keeps a 9-distro × 3-profile install matrix from rotting under their feet.

1. Why kldload Tests This Way

Most distro install testing is hand-driven: someone burns a USB, walks an installer, eyeballs the result, files a bug if something looks wrong. That works when you have a paid QA team and a small target matrix. kldload has neither — it ships nine distros (CentOS Stream 9, Rocky 9, Fedora 44, Debian 13, Ubuntu 24.04, RHEL 9, Arch, Alpine, FreeBSD) across three generic profiles (core / server / desktop) and four workload templates (kvm / k8s / klab / zfslab). Hand-testing every meaningful combination would take days per release, and humans get bored — bored humans miss things.

So kldload took the opposite approach. The test matrix runs in throwaway KVM VMs on a single dedicated box (fiend), under a runner that captures per-combo logs, records every result in SQLite, and dumps console traces on failure. Every commit can re-light the matrix overnight. The first time the matrix ran, in a single 2-hour run, it surfaced eight distinct latent bugs that had been shipping unnoticed across multiple releases. That is the loop this masterclass documents.

The two-layer test model

Layer 1 is the CI matrix in VMs — fast, reproducible, runs on every change. Layer 2 is hardware install on real metal — slow, manual, only on releases. They catch different bugs. CI is the airbag (always on, prevents the worst); hardware is the seatbelt (you put it on for the trip).

// CI catches: installer bugs, package drift, content regressions
// Hardware catches: UEFI quirks, NVIDIA, Secure Boot NVRAM, real disks

One box, two roles

You don't need a fleet to run this. fiend is a single ASUS TUF X570 that serves as both the CI runner (15 throwaway VMs in libvirt) and the hardware test target (when you nuke and reinstall it). The matrix tests other configurations in VMs, while fiend itself is tested by the install + self-validate flow.

// Total budget: 1 box, electricity, 12 GB of disk for the ISO.
// No CI cloud, no hosted runners, no monthly bills.

The framework finds itself

Most "testing infrastructure" projects spend the first six months building the framework and the next six months finding bugs. kldload's framework is pre-existing (tests/lifecycle.sh, tests/smoke-*.sh) — the matrix runs simply exposed bugs in both the framework AND the installer in cascading layers. Each fix unblocked the next layer.

// Layer 1 fixed: smoke-test wait, get_vm_ip, scp auth, pgrep
// Layer 2 fixed: marker writes, kldload-webui leak, debug-bundle
// Layer 3 still in flight: ubuntu installer, fedora-core no-boot

SQLite, not Jenkins

The runner records every result in /opt/kldload-ci/history.sqlite. Want to know which combos failed in the last run? Two-line SQL. No CI portal, no YAML pipelines, no plugin ecosystem. The whole runner is 318 lines of bash.

// SELECT distro, profile, status FROM results
// WHERE run_id = '2026-05-06-044342';

A common objection: "isn't this fragile? bash + sqlite + no docker, no k8s — how does that scale?" Answer: it doesn't scale, on purpose. kldload is one person's project that needs to ship across nine distros. The right tool is the smallest tool that catches the regressions. If you scale to a 100-engineer team, replace this with Jenkins + cloud runners + artifact storage. Until then, 318 lines of bash on one ASUS box catches more regressions than you have time to fix.

2. The Two-Layer Test Model

Every test in kldload falls into one of two categories. Knowing which is which saves a lot of debugging time when something goes wrong.

Layer	Where it runs	What it catches	What it misses
CI matrix (Layer 1)	15 KVM VMs on fiend, OVMF firmware, libvirt default network	Installer logic, ZFS layout, package set drift, bootloader chain in standardized firmware, service config, post-install smoke validation	Real-firmware UEFI quirks, NVRAM behavior of MOK enrollment, NVIDIA driver compat with running kernel, slow-USB races, IOMMU groupings, BIOS-specific boot order, real network drivers, suspend / resume
Hardware install (Layer 2)	Real metal, you burn a USB and install	Everything in the CI row's "What it misses" column	Most installer logic bugs (you usually only test one combo per hardware run)

The crucial point: they are not substitutes. CI tells you "the installer is logically correct"; hardware tells you "it actually works on this specific machine". You need both. A rough cadence:

On every commit (or every push): CI matrix runs in 2 hours. If anything regresses, you know within the day. Most fixes never need hardware verification — they are pure-logic bugs (a typo in a package set, a missing profile gate, a typo in an SQL query).
On every release candidate: burn the ISO to USB and install on fiend. Confirm the firmware path works (Secure Boot enroll, NVIDIA driver build, real disk geometry). This is where the bugs CI cannot see show up.
Per major release: install on a second physical box if you have one. ASUS / Dell / HP / Lenovo all have firmware quirks that don't generalize.

Concretely from the recent kldload 1.1.0 work: the four-template architecture landed clean in CI on the first matrix run, but fiend itself wouldn't boot the new USB at all (a separate boot-path regression invisible to CI). The two layers were catching different bugs at the same time.

3. The Matrix

The Layer-1 matrix is fixed at five distros × three generic profiles = fifteen combinations.

            core        server      desktop
centos      ✅          ✅          ✅
rocky       ✅          ✅          ✅
fedora      ✅          ✅          ✅
debian      ✅          ✅          ✅
ubuntu      ✅          ✅          ✅

Why these five distros and not all nine?

Arch is in the installer (rolling release; no darksite — needs internet) but not in the matrix because there is no tests/smoke-arch.sh yet. It would work; it just hasn't been written.
Alpine is in the same position as Arch: the installer supports it, but there is no smoke wrapper.
FreeBSD is on the roadmap; the installer plumbing exists but needs a darksite and bhyve test path.
RHEL needs subscription credentials to fetch packages; the matrix can't run unattended without them. RHEL is hand-tested before releases.

And why these three profiles and not the four workload templates?

The generic profiles (core, server, desktop) have well-defined post-install state — they're additive. core = ZFS + stock distro + WireGuard + eBPF + diagnostic tools (deliberately minimal). server = core + the k* tools, sanoid, the observability stack, NVIDIA. desktop = server + GNOME + GDM + Firefox. Each one has a smoke-$profile.sh script that asserts the expected post-install state.

The workload templates (kvm, k8s, klab, zfslab) need their own smoke wrappers — they bring up libvirt networks, K8s clusters, klab goldens (5 cloud-init VMs), Tetragon, and Bob. None of that is asserted yet by an automated test. That is phase-2 work; the framework supports it (just add tests/smoke-{kvm,klab,zfslab}.sh and they'll appear in the matrix).

The matrix size of 15 isn't accidental. Run sequentially at ~25 minutes per combo, it's a roughly 6-hour run, which fits comfortably overnight. With 4 vCPUs per VM and parallel scheduling under libvirt, it actually finishes in ~2 hours on a 24-core box. Adding more combos requires a second box, more parallel execution within the runner, or accepting a longer wall-clock time. For a one-person project, 15 combos in ~2 hours is the right size.

4. What Each Test Checks

Five test scripts, each covering a layer of the install. They live in tests/ in the kldload-free repo. Reading them is a fast way to understand what kldload promises a finished install will look like.

4.1 `tests/smoke-build.sh` — pre-install ISO sanity

Runs after ./deploy.sh build but before any VM is touched. Validates the ISO file itself:

File exists, size > 8 GB and < 16 GB (sanity bounds for a kldload ISO)
SHA256 checksum file present and matches
ISO mounts cleanly via loopback
Contains squashfs.img (the live root)
EFI directory is present (so it can UEFI-boot)
GRUB config in the right place
ISO timestamp is fresh (catches the "you forgot to rebuild" bug)

Without this check, you can spend two hours running the matrix against yesterday's ISO and never know — every combo would behave the same way it did yesterday.
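Three of the checks above (size bounds, checksum, freshness) can be sketched in bash as follows. This is not the real tests/smoke-build.sh: the function names, the `.sha256` sidecar naming, and the freshness threshold are assumptions made for illustration, and the loopback-mount checks are left out because they need root.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of smoke-build style checks (GNU coreutils assumed).

GiB=$((1024 * 1024 * 1024))

# check_iso_size FILE MIN_BYTES MAX_BYTES: pass if MIN < size < MAX.
check_iso_size() {
  local file=$1 min=$2 max=$3 size
  [ -f "$file" ] || { echo "FAIL: $file missing"; return 1; }
  size=$(stat -c %s "$file")
  if [ "$size" -gt "$min" ] && [ "$size" -lt "$max" ]; then
    echo "PASS: size $size within bounds"
  else
    echo "FAIL: size $size outside ($min, $max)"
    return 1
  fi
}

# check_iso_sha FILE: pass if FILE.sha256 (assumed sidecar name) exists and matches.
check_iso_sha() {
  local file=$1
  [ -f "$file.sha256" ] || { echo "FAIL: $file.sha256 missing"; return 1; }
  if (cd "$(dirname "$file")" &&
      sha256sum --check --status "$(basename "$file").sha256"); then
    echo "PASS: checksum matches"
  else
    echo "FAIL: checksum mismatch"
    return 1
  fi
}

# check_iso_fresh FILE MAX_AGE_SECONDS: pass if the file was modified recently.
check_iso_fresh() {
  local file=$1 max_age=$2 now mtime
  now=$(date +%s)
  mtime=$(stat -c %Y "$file")
  if [ $((now - mtime)) -le "$max_age" ]; then
    echo "PASS: ISO is fresh"
  else
    echo "FAIL: ISO older than ${max_age}s"
    return 1
  fi
}

# Example call with the bounds quoted in the text:
#   check_iso_size kldload.iso $((8 * GiB)) $((16 * GiB))
```

Keeping each check in its own function means a stale ISO fails by name instead of surfacing later as fifteen identical matrix results.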

4.2 `tests/smoke-core.sh` — 51 tests, run on every profile

This is the baseline. Every profile must pass smoke-core; server and desktop add to it but never replace it. The 51 tests fall into eight categories:

# ── ZFS userspace and pool ─────────────────────────────────────
test_cmd            "ZFS userspace (zfs)"      "zfs"
test_cmd            "ZFS pool tools (zpool)"   "zpool"
test_output_contains "ZFS module loaded"        "lsmod" "zfs"
test_output_contains "Pool rpool exists"        "zpool list" "rpool"
test_output_contains "Pool rpool is ONLINE"     "zpool list -H -o health rpool" "ONLINE"
test_output_contains "Pool has zero errors"     "zpool status rpool" "No known data errors"
test_succeeds        "Pool scrub runs"          "zpool scrub rpool"

# ── Datasets ──────────────────────────────────────────────────
test_dataset        "rpool/ROOT exists"        "rpool/ROOT"
test_output_contains "Root dataset mounted at /" \
                     "zfs get -rH -o value mountpoint rpool/ROOT" "/"
test_dataset        "rpool/home exists"        "rpool/home"
test_dataset        "rpool/var exists"         "rpool/var"
test_dataset        "rpool/var/log exists"     "rpool/var/log"
test_dataset        "rpool/srv exists"         "rpool/srv"
# ... and ten more dataset checks

Then EFI / bootloader, networking, hostid match between live and target, universal install markers (/etc/kldload-build-sha, /etc/kldload/edition, /etc/kldload/profile, /etc/kldload/boot-environment), and finally the negative assertions that make core meaningful:

# Core profile: kldload feature tools must be ABSENT
for tool in kst ksnap kbe kclone kdf kdir kpkg kupgrade \
            kexport krecovery kldload-webui sanoid; do
  if command -v "$tool" >/dev/null 2>&1; then
    _fail "$tool absent (core)" "$tool found — should not be in core profile"
  else
    _pass "$tool absent (core)"
  fi
done

Plus a runtime ZFS test (create a snapshot, verify it appears in the listing, destroy it cleanly) and a check that the diagnostic tool kldload-debug-bundle is installed and prints help. That's 51 tests total; a passing core install hits all 51.
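The helper names used in these excerpts (`_pass`, `_fail`, `test_cmd`, `test_output_contains`, `test_succeeds`) come from the source, but their bodies are not shown. A minimal sketch of how such helpers might be written, assuming simple global counters, is:

```shell
#!/usr/bin/env bash
# Hypothetical helper implementations; the real ones in tests/ may differ.
PASS=0
FAIL=0

_pass() { PASS=$((PASS + 1)); printf '  PASS %s\n' "$1"; }
_fail() { FAIL=$((FAIL + 1)); printf '  FAIL %s: %s\n' "$1" "$2"; }

# test_cmd NAME BINARY: pass if BINARY is on PATH.
test_cmd() {
  if command -v "$2" >/dev/null 2>&1; then
    _pass "$1"
  else
    _fail "$1" "$2 not found"
  fi
}

# test_output_contains NAME COMMAND NEEDLE: pass if COMMAND's combined
# output contains NEEDLE as a literal substring.
test_output_contains() {
  local out
  # `|| true` keeps a failing command from aborting the script under set -e.
  out=$(eval "$2" 2>&1) || true
  case $out in
    *"$3"*) _pass "$1" ;;
    *)      _fail "$1" "expected '$3' in output of: $2" ;;
  esac
}

# test_succeeds NAME COMMAND: pass if COMMAND exits 0.
test_succeeds() {
  if eval "$2" >/dev/null 2>&1; then
    _pass "$1"
  else
    _fail "$1" "command failed: $2"
  fi
}
```

With counters like these, a wrapper can finish with `[ "$FAIL" -eq 0 ]` so that the script's exit status reflects the whole run.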

4.3 `tests/smoke-server.sh` — extends core

Server inherits everything in smoke-core, then adds positive assertions for the things core asserted to be absent:

All k* tools (kst, ksnap, kbe, etc.) PRESENT and respond to --help
sanoid binary installed and the systemd service is enabled
WireGuard userspace (wg command) installed
eBPF tools (bpftool, bpftrace) installed
If a GPU is detected: NVIDIA driver loaded and nvidia-smi works
kldload-webui binary in /usr/local/sbin

The negative-then-positive pattern is deliberate — it means a misconfigured install that partially includes server bits will fail core's negative assertion. You can't accidentally pass both core and server with the same install.

4.4 `tests/smoke-desktop.sh` — extends server

Desktop = server + the GUI:

GNOME session files present
GDM service enabled
Firefox installed
The Firefox autostart desktop file points at https://localhost:8443 (the kldload web UI)

4.5 `tests/smoke-auto.sh` — the dispatcher

Reads /etc/kldload/profile on the running system to figure out which profile is installed, then invokes the matching smoke-$profile.sh. This is what tests/lifecycle.sh calls on the freshly-installed VM — the test runner doesn't need to know the profile; the installed system tells it.
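The dispatch step can be sketched as below. The function name and the overridable arguments are assumptions added so the logic can be exercised away from an installed target; only the marker path /etc/kldload/profile, the /tmp/tests location, and the smoke-$profile.sh naming come from the text.

```shell
#!/usr/bin/env bash
# Hypothetical dispatcher sketch; not the real tests/smoke-auto.sh.

# resolve_smoke_script [PROFILE_FILE] [TESTS_DIR]
# Prints the path of the smoke script matching the installed profile.
resolve_smoke_script() {
  local profile_file=${1:-/etc/kldload/profile}
  local tests_dir=${2:-/tmp/tests}
  local profile

  if [ ! -r "$profile_file" ]; then
    echo "no profile marker at $profile_file" >&2
    return 2
  fi

  # Strip whitespace so a trailing newline in the marker is harmless.
  profile=$(tr -d '[:space:]' < "$profile_file")

  case $profile in
    core|server|desktop)
      printf '%s/smoke-%s.sh\n' "$tests_dir" "$profile"
      ;;
    *)
      echo "unknown profile: '$profile'" >&2
      return 2
      ;;
  esac
}

# Typical use on the installed target:
#   script=$(resolve_smoke_script) && bash "$script"
```

Rejecting unknown profile values explicitly matters here: a missing or corrupted marker should fail loudly rather than silently run no tests.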

5. The Per-Combo Lifecycle

What happens when you run sudo ./deploy.sh smoke-test fedora core? The wrapper invokes tests/lifecycle.sh with arguments fedora core. That script does the following, in order:

Spawn a clean VM with virt-install: 4 vCPU, 8 GB RAM, 32 GB qcow2 disk, OVMF UEFI firmware, attached to libvirt's default network, boot-from-ISO. The VM is named kldload-smoke-fedora-core.
Wait up to 15 minutes for the live env to be SSH-able. The runner polls virsh net-dhcp-leases default, tries a TCP connect to port 22, and does a no-op SSH login as live/live. The 15-minute ceiling exists because the kldload live ISO is heavy (~5 GB squashfs + ZFS rootfs init + every kldload tool loaded); on a 4-core VM it routinely needs 7-10 minutes from boot to SSH-ready.
Compose an answers file with the install parameters and SCP it to /tmp/answers.env on the live env. The SCP is wrapped in sshpass -p live because a bare scp falls back to pubkey auth, which the live ISO doesn't have configured.
Kick off the headless install with setsid nohup /usr/sbin/kldload-install-target --config /tmp/answers.env >/tmp/install.log 2>&1 &. The setsid + nohup means the install survives the SSH disconnect that follows immediately after the launch. Output goes to /tmp/install.log on the VM.
Poll for completion by checking pgrep -f "[/]usr/sbin/kldload-install-target" over SSH. Once no match (the install process exited), grep the kldload installer log for "Install completed successfully". The bracket trick on the path is critical — without it pgrep matches its own SSH session's argv and the loop runs to its 60-minute ceiling without ever detecting the exit.
Reboot the VM with disk-first boot order (the CDROM stays attached but ignored). virsh shutdown with a 30-second graceful window, then virsh destroy, then virt-xml --edit --boot hd,cdrom, then virsh start.
Wait up to 15 minutes for the installed system to be SSH-able as admin/admin. This is the moment of truth — if the bootloader chain is broken, this is where it shows up as "installed system never came up".
SCP the entire tests/ directory to the installed target's /tmp/tests/.
Run tests/smoke-auto.sh on the installed target. That dispatcher reads /etc/kldload/profile and calls the matching smoke-$profile.sh. The full output is captured.
Pass if smoke-auto reports zero failures; fail otherwise. On failure, leave the VM defined for virsh console inspection and dump the installer log + /tmp/install.log + storage log into the combo log file.
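Several of the steps above are the same pattern: retry a probe until it succeeds or a ceiling is reached. A minimal sketch of that pattern follows; `wait_until` and `ssh_port_open` are hypothetical names, not functions taken from tests/lifecycle.sh.

```shell
#!/usr/bin/env bash
# Hypothetical polling helpers for the "wait up to N minutes" steps.

# wait_until TIMEOUT_S INTERVAL_S CMD...
# Re-runs CMD until it succeeds (returns 0) or the deadline passes (returns 1).
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# ssh_port_open HOST: TCP-connect probe using bash's /dev/tcp pseudo-device,
# bounded by a 3-second timeout so a filtered port cannot stall the loop.
ssh_port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/22"' "$1" 2>/dev/null
}

# Example with the 15-minute ceiling from the text:
#   wait_until 900 10 ssh_port_open "$ip" || echo "live env never came up"
```

The same helper covers the install poll by passing a probe that succeeds once the installer process is gone.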

Total wall-clock per combo: ~25 minutes for core, ~30 minutes for server, ~45 minutes for desktop (more packages = longer install). Sequential matrix: ~6-8 hours. Parallel under libvirt's scheduling on a 24-core box: ~2 hours.

6. The CI Runner: `kldload-ci-run`

The matrix wraps the per-combo lifecycle in a loop. The runner is in ci/kldload-ci-run in the repo (~318 lines of bash). It does six things:

Acquires a flock on /opt/kldload-ci/.run.lock. Refuses to start if another run is in progress. This prevents the nightly timer from racing a manual run.
Optionally syncs source via CI_SYNC_CMD env var (or git pull if the source dir is a git checkout, or skips if neither).
Builds the ISO if not --skip-build. Calls ./deploy.sh builder-image first (idempotent — only fires if the builder container image isn't present), then ./deploy.sh build. Output goes to a per-run log under /opt/kldload-ci/results/$run_id/.
Runs ./deploy.sh smoke-build for the ISO sanity check.
Loops the matrix — for each combo, calls sudo ./deploy.sh smoke-test <distro> <profile>, captures stdout/stderr to a per-combo log, records the result in SQLite, and on failure dumps the VM's serial console + domain XML to a failures/<combo>/ subdirectory.
Writes a summary as Markdown at $results_dir/SUMMARY.md — pass/fail count plus a 5×3 matrix table. Useful for grepping or pasting into a release thread.
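The loop and the summary table can be sketched like this. It is an illustration under assumptions, not the code in ci/kldload-ci-run: the array and function names are invented, and the per-combo command is passed in as a callback so the sketch can run without libvirt.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the matrix loop and the 5x3 Markdown summary.
# Requires bash 4+ for associative arrays.
DISTROS=(centos rocky fedora debian ubuntu)
PROFILES=(core server desktop)
declare -A STATUS   # STATUS[distro-profile] = pass | fail

# run_matrix RUNNER: calls "RUNNER distro profile" for every combination
# and records the outcome. In the real runner the command would be
# something like: sudo ./deploy.sh smoke-test "$d" "$p"
run_matrix() {
  local runner=$1 d p
  for d in "${DISTROS[@]}"; do
    for p in "${PROFILES[@]}"; do
      if "$runner" "$d" "$p"; then
        STATUS[$d-$p]=pass
      else
        STATUS[$d-$p]=fail
      fi
    done
  done
}

# summary_table: prints the recorded results as a Markdown table.
summary_table() {
  local d p
  printf '%s\n' '| distro | core | server | desktop |'
  printf '%s\n' '|---|---|---|---|'
  for d in "${DISTROS[@]}"; do
    printf '| %s |' "$d"
    for p in "${PROFILES[@]}"; do
      printf ' %s |' "${STATUS[$d-$p]:-skip}"
    done
    printf '\n'
  done
}
```

Because every literal row is passed as an argument to a `%s` format, the separator row cannot be mistaken for printf options.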

SQLite schema:

CREATE TABLE runs (
  run_id        TEXT PRIMARY KEY,        -- 2026-05-06-044342
  started_at    TEXT NOT NULL,
  finished_at   TEXT,
  iso_sha       TEXT,                    -- sha256 of the ISO this run tested
  iso_size      INTEGER,
  source_sha    TEXT,                    -- git rev or 'rsync'
  matrix_total  INTEGER NOT NULL,
  matrix_pass   INTEGER NOT NULL DEFAULT 0,
  matrix_fail   INTEGER NOT NULL DEFAULT 0,
  rc            INTEGER
);

CREATE TABLE results (
  run_id        TEXT NOT NULL,
  distro        TEXT NOT NULL,
  profile       TEXT NOT NULL,
  status        TEXT NOT NULL,           -- pass | fail | skip
  duration_s    INTEGER,
  fail_reason   TEXT,                    -- one-line excerpt
  log_path      TEXT,
  PRIMARY KEY (run_id, distro, profile)
);

Two indexes worth knowing about: idx_results_status for "show me everything that failed in run X", and started_at DESC for "show me the last N runs". That's all you need.
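The two lookups those indexes serve can be wrapped in small shell functions. The function names are hypothetical; the table and column names are the ones in the schema above. The arguments are interpolated straight into the SQL, so this sketch assumes trusted input such as a run ID copied from --status.

```shell
#!/usr/bin/env bash
# Hypothetical query helpers around the sqlite3 CLI.

# failed_combos DB RUN_ID: one "distro-profile" line per failed combo.
failed_combos() {
  sqlite3 "$1" "SELECT distro || '-' || profile
                FROM results
                WHERE run_id = '$2' AND status = 'fail'
                ORDER BY distro, profile;"
}

# last_runs DB N: the N most recent runs as run_id|matrix_pass|matrix_fail.
last_runs() {
  sqlite3 "$1" "SELECT run_id, matrix_pass, matrix_fail
                FROM runs
                ORDER BY started_at DESC
                LIMIT $2;"
}

# Example:
#   failed_combos /opt/kldload-ci/history.sqlite 2026-05-06-044342
```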

7. Driving the Tests

Three modes, depending on what you're trying to do.

7.1 Single combo (the dev loop)

sudo /usr/local/bin/kldload-ci-run --only fedora-core --skip-build

Uses the most recent ISO. Runs only the fedora-core combo. ~25 minutes total. This is what you use when you've made a change you think fixes one specific combo's failure, and you want to verify before kicking off the full matrix.

7.2 Full matrix (the nightly run)

sudo /usr/local/bin/kldload-ci-run

Builds a fresh ISO. Runs all 15 combos. ~2-3 hours. This is what fires from the systemd timer at 03:00 UTC nightly. Manual invocation works the same way.

7.3 Just what failed last time

sudo /usr/local/bin/kldload-ci-run --diff-last

Subsets the matrix to only the combos that failed in the last completed run. Use this when you've fixed a specific bug and want to verify it without re-running combos that were already passing.

7.4 Status and reports

# Last 10 runs, columnar
sudo /usr/local/bin/kldload-ci-run --status

# Full report for a specific run
sudo /usr/local/bin/kldload-ci-run --report 2026-05-06-044342

# Or query SQLite directly
sudo sqlite3 /opt/kldload-ci/history.sqlite \
  "SELECT distro, profile, status, fail_reason
   FROM results WHERE run_id='2026-05-06-044342'"

7.5 Watching a run live

sudo tail -f /var/log/kldload-ci-bootstrap.log
# or for a specific combo:
sudo tail -f /opt/kldload-ci/results/<run-id>/smoke-fedora-core.log

7.6 Inspecting a failed VM

The framework leaves failed VMs running so you can poke at them:

# What's there
sudo virsh list --all | grep smoke

# Serial console
sudo virsh console kldload-smoke-fedora-core

# Or SSH (find the IP via DHCP lease)
mac=$(sudo virsh domiflist kldload-smoke-fedora-core \
        | awk 'NR>2 {print $5; exit}')
ip=$(sudo virsh net-dhcp-leases default \
        | awk -v m="$mac" 'tolower($3)==tolower(m){print $5}' \
        | cut -d/ -f1)

# Installed system (after install completed)
sshpass -p admin ssh admin@${ip}

# Live env (if the install never finished)
sshpass -p live ssh live@${ip}

Set KEEP_VM=1 to keep the VM around even on success — useful when you want to poke at a passing install to understand its layout.

8. The Fix Loop

A bug surfaces in the matrix. You need to land a fix and verify it cleared. The loop is six steps:

Diagnose from the per-combo log. The smoke log captures the specific test name that failed and a one-line reason. If the failure is in the install (not the post-install validation), the lifecycle script also dumped the installer log + /tmp/install.log + storage log into the same file.
Edit source on the build host (typically onyx, the kldload-free repo). Add a comment in the fixed code referencing the run ID that caught it — future readers will thank you.
Commit. Use a message that names the bug, the failure mode, and the fix. Reference the run ID in the commit body. Example:

fix(install-target): kldload-debug-bundle on core too

Bug seen by fiend matrix run #2 (2026-05-06-155717): all combos pass
49/51 (was 47/51), with the new failure being:
  ✗ kldload-debug-bundle present — kldload-debug-bundle not found
  ✗ kldload-debug-bundle --help works — command failed

Earlier commit 307e7f7 added an early-return for core in k_install_tools
to stop the kldload-webui leak. That correctly skipped kldload-webui
but ALSO skipped kldload-debug-bundle and kldload-recovery — which are
diagnostic / incident-response tools, not kldload features. Even a
core install should ship these for support purposes.

Fix: install both diagnostic tools BEFORE the core early-return.

Rsync the fix to the CI host. The runner doesn't pull from git automatically (most kldload repos are private; we use rsync push). Either sync the whole tree or just the changed file:

sshpass -p Passw0rd rsync -av \
  -e 'ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null' \
  /root/kldload-free/live-build/config/includes.chroot/usr/sbin/kldload-install-target \
  admin@fiend.unixbox.net:/opt/kldload-ci/kldload-free/live-build/config/includes.chroot/usr/sbin/kldload-install-target

Re-trigger the matrix. The runner builds a fresh ISO with the new source, then re-runs whatever subset you specify:

sshpass -p Passw0rd ssh admin@fiend.unixbox.net \
  'sudo systemd-run --unit=kldload-ci-bootstrap --collect \
    --property=StandardOutput=append:/var/log/kldload-ci-bootstrap.log \
    --property=TimeoutStartSec=12h \
    bash -c "env CI_SYNC_CMD= /usr/local/bin/kldload-ci-run --diff-last"'

One critical gotcha here: the rsync must complete BEFORE the build starts. If you kick off the matrix before the rsync finishes, the build uses the stale source and the fix won't be in the new ISO. The runner acquires a flock; the rsync doesn't. So always rsync, then trigger.

Wait out the matrix duration (typically 2 hours for full, ~25 minutes for one combo with --skip-build). Check --status. If the pass count went up, you fixed something. If a NEW failure surfaces, that's the next layer of bug. By design, this is how the cascading fix loop works.

A real example from the kldload 1.1.0 work: the matrix's first run was 0/15. After landing 4 fixes (3 missing markers, kldload-webui leak in profiles.sh, kldload-webui leak in install-target, debug-bundle stripped), 11 of the 15 combos went from ❌ to ✅. The remaining 4 failures were specific issues (3 ubuntu installer aborts, 1 fedora-core post-install boot failure) that needed their own fixes. Each layer surfaced when the previous layer was unblocked. That is exactly what CI is for.

9. War Stories — 13 Buried Gotchas

This is the section that earlier sessions paid for in debugging time. Each gotcha lists the bug, the symptom, the root cause, and the fix. Every one of these is a comment in the relevant source file too — but knowing they exist saves a future debugging session.

9.1 F44 zram-generator and kubelet

Symptom: kubelet fails to start on every Fedora 44 K8s golden VM with "running with swap on is not supported". The script ran swapoff -a; you can confirm swap is off; kubelet still refuses.

Cause: Fedora 44 cloud images ship zram-generator-defaults which auto-creates /dev/zram0 swap on every boot via systemd-zram-setup@zram0.service. swapoff kills the live instance but the next boot brings it back. Kubelet refuses → kubeadm init retries → the second init fails on existing manifests → the cluster silently never converges.

Fix: in kube-setup setup_kernel(), also mask the generator unit and remove the package:

swapoff -a || true
sed -ri '/\sswap\s/s/^/#/' /etc/fstab
systemctl disable --now 'systemd-zram-setup@*.service' 2>/dev/null || true
systemctl mask 'systemd-zram-setup@zram0.service' 2>/dev/null || true
rm -f /etc/systemd/zram-generator.conf /usr/lib/systemd/zram-generator.conf
dnf remove -y zram-generator zram-generator-defaults 2>/dev/null || true

9.2 AI ERR trap fires on missing GPU

Symptom: on hosts without an NVIDIA GPU, the autodeploy ai-pull subshell fails immediately and the install is marked ai-failed even though the install would have skipped AI gracefully (insufficient VRAM check).

Cause: nvidia-smi exits non-zero if no GPU is present. With set -e + set -o pipefail, the GPU probe pipeline trips the ai-pull subshell's ERR trap, which is only supposed to catch install/pull failures.

Fix: wrap the GPU probe with set +e/set -e so probes can return non-zero without triggering the trap.
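A related variant, shown here as an assumption rather than the code in the autodeploy script, puts the probe in a conditional context with `|| rc=$?`. Bash exempts commands in that position from both errexit and the ERR trap, so the probe can fail freely. The function name `gpu_count` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical GPU probe that is safe to call under `set -e`, `pipefail`,
# and an ERR trap.

# gpu_count: prints the number of NVIDIA GPUs; prints 0 when nvidia-smi
# is missing, fails, or reports nothing.
gpu_count() {
  local out rc=0
  # The `||` makes this a conditional list: a non-zero exit is captured in
  # rc instead of aborting the script or firing the ERR trap.
  out=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null) || rc=$?
  if [ "$rc" -ne 0 ] || [ -z "$out" ]; then
    echo 0
  else
    # One line per GPU; count the non-empty lines.
    printf '%s\n' "$out" | grep -c .
  fi
}

# Example:
#   if [ "$(gpu_count)" -eq 0 ]; then echo "no GPU, skipping AI pull"; fi
```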

9.3 Core profile leaks kldload-webui via TWO copy sites

Symptom: smoke-core asserts kldload-webui is absent. On a clean core install, the binary appears at /usr/local/sbin/kldload-webui.

Cause: profiles.sh:649 copies /usr/local/bin/kldload-webui to the target. A SECOND copy site is kldload-install-target:925, which does for f in /usr/local/sbin/kldload-*; do install ... — this catches kldload-webui because the live env has the binary in BOTH bin and sbin. Fixing only the first site leaves the second leaking.

Fix: early-return for core in BOTH functions. Keep diagnostic tools (kldload-debug-bundle, kldload-recovery) above the early-return because they should ship even on core.

9.4 Universal install markers stripped by core early-return

Symptom: smoke-core fails three tests on every core install: /etc/kldload-build-sha, /etc/kldload/edition, /etc/kldload/boot-environment all missing.

Cause: these markers are written deep inside the non-core branch of k_install_system_files. Adding an early-return for core to fix bug 9.3 also skipped these.

Fix: move marker writes BEFORE the core early-return — they're install-identification metadata, not kldload features. Every install needs them.

9.5 pgrep matches its own SSH session

Symptom: smoke-test poll loop says "install still running (20 min elapsed)" forever. The install actually finished 18 minutes ago and exited cleanly.

Cause: pgrep -f kldload-install-target over SSH matches the parent bash process invoked by sshd, whose argv contains the literal pattern. Even pgrep -f /usr/sbin/kldload-install-target doesn't help — the SSH session has the full path verbatim too.

Fix: the classic bracket trick. Use pgrep -f "[/]usr/sbin/kldload-install-target". The brackets are a regex character class — they match the literal "/" in the install process's argv but NOT the literal characters [/]usr/sbin/... in the SSH session's argv (which is what pgrep sees when it scans /proc/PID/cmdline on its own caller).
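The effect is easy to reproduce without pgrep by matching the same two patterns against sample command lines with grep -E. The command-line strings below are made-up stand-ins for what would appear in /proc/PID/cmdline, and `matches` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Demonstration of the bracket trick using plain regex matching.

# matches PATTERN CMDLINE: does the extended regex match the command line?
# This mimics how `pgrep -f` tests each process's full argv.
matches() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

installer='/usr/sbin/kldload-install-target --config /tmp/answers.env'
plain_probe='bash -c pgrep -f /usr/sbin/kldload-install-target'
bracket_probe='bash -c pgrep -f [/]usr/sbin/kldload-install-target'

# The plain pattern matches the installer AND the shell running the probe.
matches '/usr/sbin/kldload-install-target' "$installer" &&
  echo "plain pattern: matches installer"
matches '/usr/sbin/kldload-install-target' "$plain_probe" &&
  echo "plain pattern: also matches the probe shell"

# The bracket pattern still matches the installer, but the probe shell's
# argv contains the characters "[/]usr", which the regex does not match.
matches '[/]usr/sbin/kldload-install-target' "$installer" &&
  echo "bracket pattern: matches installer"
matches '[/]usr/sbin/kldload-install-target' "$bracket_probe" ||
  echo "bracket pattern: ignores the probe shell"
```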

9.6 virsh domifaddr returns 127.0.0.1 first

Symptom: smoke-test reports "live env never came up" after 15 minutes, even though manual SSH to the VM works fine.

Cause: virsh domifaddr --source agent emits ALL interface addresses, including 127.0.0.1 (lo). The first match in the awk pipe was the loopback, which short-circuited the function. The caller then tried SSH to 127.0.0.1, got "Permission denied" forever, and timed out.

Fix: filter loopback and link-local before consuming:

ip=$(virsh domifaddr "$VM_NAME" --source agent 2>/dev/null \
       | awk '/ipv4/ {print $4}' \
       | cut -d/ -f1 \
       | grep -vE '^(127\.|169\.254\.|0\.0\.0\.0$)' \
       | head -1)

9.7 scp without sshpass falls back to pubkey auth

Symptom: smoke-test reports "couldn't scp answers to live env" instantly. SCP works fine when you do it manually with the same arguments.

Cause: the smoke-test framework's SCP calls used plain scp ${SSH_OPTS[@]} ... without sshpass. SSH falls back to pubkey auth, the live ISO has no key configured for the source host, auth fails silently in non-interactive mode.

Fix: always wrap SCP for the live env with sshpass -p live (and sshpass -p admin for the installed system).

9.8 printf with format string starting with dash

Symptom: the runner's summary table generation throws "printf: --: invalid option" and the SUMMARY.md is partially generated.

Cause: printf '---x---'. Bash's printf builtin checks its first argument for options before treating it as the format string, sees the leading dashes, and errors out.

Fix: printf '%s' '---x---' or printf -- '---x---'. The first is more idiomatic.
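Both forms side by side, plus a small helper for Markdown separator rows. `sep_row` is a hypothetical name, not a function from the runner.

```shell
#!/usr/bin/env bash
# Two safe ways to print text that starts with dashes.

# Form 1: keep the dashes out of the format string entirely.
printf '%s\n' '---x---'

# Form 2: `--` ends option parsing, so the next argument is the format.
printf -- '---x---\n'

# sep_row COLS: prints a Markdown table separator with COLS columns.
sep_row() {
  local cols=$1 i
  for ((i = 0; i < cols; i++)); do
    printf '%s' '|---'
  done
  printf '%s\n' '|'
}

sep_row 4   # prints |---|---|---|---|
```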

9.9 zfs get rejects the * character

Symptom: smoke-core halts silently after "rpool/ROOT exists" passes. No FAIL line is printed; the script just exits.

Cause: the test test_output_contains "Root mounted at /" "zfs get -H -o value mountpoint rpool/ROOT/*" "/" uses a shell glob inside the eval. ZFS rejects * as an invalid character in dataset names. The error message contains "/" so technically the test should pass — but the eval failure cascaded through set -e and halted the script.

Fix: use the ZFS recursive flag instead of a shell glob: zfs get -rH -o value mountpoint rpool/ROOT. Same semantic, no '*' character.

9.10 Concurrent runs race the VM name

Symptom: two matrix runs in flight at the same time stomp on each other's kldload-smoke-<distro>-<profile> VM names. The second run sees the first's VM, gets confused, fails weirdly.

Cause: the systemd timer can fire while a manual run is in progress. No mutex was in place initially.

Fix: the runner now acquires a flock at /opt/kldload-ci/.run.lock at startup. Second invoker dies cleanly with "another kldload-ci-run is in progress (holder: PID, started_at)".
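A minimal sketch of that kind of mutex, assuming util-linux flock(1) and a hypothetical function name. It also records the holder's PID and start time in the lock file so a refused invoker has something to report; the exact message format of the real runner is not reproduced here.

```shell
#!/usr/bin/env bash
# Hypothetical run-lock sketch using flock(1) on a dedicated file descriptor.

# acquire_run_lock LOCKFILE
# Takes an exclusive, non-blocking lock on fd 9. The lock is held until the
# shell exits or fd 9 is closed. Returns 1 if another process holds it.
acquire_run_lock() {
  local lockfile=$1
  # Open in append mode so a refused invoker does not truncate the
  # holder's PID and timestamp before the lock attempt.
  exec 9>>"$lockfile"
  if ! flock -n 9; then
    echo "another run is in progress (holder: $(cat "$lockfile" 2>/dev/null))" >&2
    return 1
  fi
  # We own the lock now; record who we are for the next invoker's message.
  printf '%s %s\n' "$$" "$(date -Is)" > "$lockfile"
}

# Example:
#   acquire_run_lock /opt/kldload-ci/.run.lock || exit 1
```

Tying the lock to an open file descriptor means it is released automatically if the runner crashes, which a PID file alone would not guarantee.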

9.11 Podman short-name-mode = enforcing kills non-TTY builds

Symptom: the build container fails with "short-name resolution enforced but cannot prompt without a TTY". Manual builds in your shell work fine; CI builds fail.

Cause: the default Fedora /etc/containers/registries.conf sets short-name-mode = "enforcing", which prompts for confirmation when an image is pulled by short name (for example fedora:44 instead of docker.io/library/fedora:44). The CI runner has no TTY → fails.

Fix: add a permissive override in /etc/containers/registries.conf.d/00-kldload-ci-permissive.conf:

short-name-mode = "permissive"
unqualified-search-registries = ["docker.io", "quay.io", "registry.fedoraproject.org"]

9.12 Podman ZFS storage driver wants the dataset pre-created

Symptom: the build fails immediately with "cannot find root filesystem rpool/var/lib/containers/storage/zfs: dataset does not exist".

Cause: kldload installs configure podman to use the ZFS storage driver but don't create the underlying dataset. The default ZFS layout has rpool/var/lib but not the containers subtree.

Fix: create with -p to materialize the parents:

sudo zfs create -p -o mountpoint=/var/lib/containers/storage \
  rpool/var/lib/containers/storage
sudo zfs create rpool/var/lib/containers/storage/zfs

This needs to be added to the kldload installer; for now it's a manual post-install step on any new CI host.

9.13 Failure-debug capture itself was broken

Symptom: when an installer aborts (without a success marker), the smoke-test was supposed to dump the last 40 lines of the installer log into the combo log. For early aborts (Ubuntu, ~50s), the dump was empty.

Cause: the original capture only tailed /var/log/installer/kldload-installer.log and /var/log/installer/bootstrap.log. Neither file exists yet when the installer aborts in pre-flight. Meanwhile the actual error was in /tmp/install.log (the smoke-test's own stdout redirect from the nohup command), which was never tailed.

Fix: dump file existence + sizes first, then tail every plausible log file (/tmp/install.log, /var/log/installer/*.log, storage.log, zfs.log) — only those that exist and have content.
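The inventory-then-tail approach can be sketched as follows. `dump_logs` is a hypothetical name; in the real framework this runs over SSH against the VM, while the sketch works on local paths.

```shell
#!/usr/bin/env bash
# Hypothetical failure-capture sketch (GNU stat assumed).

# dump_logs N FILE...
# First reports whether each candidate exists and how large it is, then
# tails the last N lines of every file that exists and is non-empty.
dump_logs() {
  local n=$1 f
  shift
  echo "== log inventory =="
  for f in "$@"; do
    if [ -e "$f" ]; then
      printf '%s: %s bytes\n' "$f" "$(stat -c %s "$f")"
    else
      printf '%s: missing\n' "$f"
    fi
  done
  for f in "$@"; do
    [ -s "$f" ] || continue
    printf '== tail -n %s %s ==\n' "$n" "$f"
    tail -n "$n" "$f"
  done
}

# Example with the 40-line tail from the text. An unmatched glob stays
# literal and is simply reported as missing:
#   dump_logs 40 /tmp/install.log /var/log/installer/*.log
```

Printing the inventory first is what makes an early abort diagnosable: an empty dump then says which logs were never created instead of saying nothing.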

10. Bootstrap a Fresh CI Host

If fiend gets nuked or you want to bring up a second CI runner, the full setup is below. It assumes you've installed kldload (klab template recommended) on the new box and have static-IP'd it.

# 1. Install build deps on the new host:
ssh admin@new-ci-host 'echo Passw0rd | sudo -S dnf install -y \
  git ShellCheck sqlite jq qemu-img sshpass'

# 2. Layout:
ssh admin@new-ci-host 'sudo mkdir -p /opt/kldload-ci/{results,bin}; \
  sudo chown -R admin:admin /opt/kldload-ci'

# 3. ZFS dataset for podman storage (the kldload installer doesn't create this):
ssh admin@new-ci-host 'sudo zfs create -p \
  -o mountpoint=/var/lib/containers/storage \
  rpool/var/lib/containers/storage; \
  sudo zfs create rpool/var/lib/containers/storage/zfs'

# 4. Podman: permissive short-name resolution for non-TTY builds:
ssh admin@new-ci-host 'echo "short-name-mode = \"permissive\"
unqualified-search-registries = [\"docker.io\",\"quay.io\",\"registry.fedoraproject.org\"]" \
  | sudo tee /etc/containers/registries.conf.d/00-kldload-ci-permissive.conf'

# 5. Rsync source from your dev host (excludes build output, caches, logs, .claude):
sshpass -p Passw0rd rsync -av --delete \
  --exclude='live-build/output' \
  --exclude='live-build/output-pass*' \
  --exclude='live-build/cache' \
  --exclude='live-build/darksite-ollama-cache.disabled' \
  --exclude='live-build/logs' \
  --exclude='.claude' \
  -e 'ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null' \
  /root/kldload-free/ \
  admin@new-ci-host:/opt/kldload-ci/kldload-free/

# 6. Install runner + sudoers:
sshpass -p Passw0rd ssh admin@new-ci-host '
  echo Passw0rd | sudo -S install -m 0755 \
    /opt/kldload-ci/kldload-free/ci/kldload-ci-run \
    /usr/local/bin/kldload-ci-run
  echo "admin ALL=(ALL) NOPASSWD: /usr/local/bin/kldload-ci-run, \
    /opt/kldload-ci/kldload-free/deploy.sh" | sudo tee \
    /etc/sudoers.d/95-kldload-ci > /dev/null
  sudo chmod 0440 /etc/sudoers.d/95-kldload-ci'

# 7. Systemd unit + nightly timer:
ssh admin@new-ci-host 'sudo tee /etc/systemd/system/kldload-ci.service > /dev/null <<UNIT
[Unit]
Description=kldload CI matrix
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
TimeoutStartSec=12h
ExecStart=/usr/local/bin/kldload-ci-run
StandardOutput=append:/var/log/kldload-ci.log
StandardError=inherit
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=4
UNIT
sudo tee /etc/systemd/system/kldload-ci.timer > /dev/null <<TIMER
[Unit]
Description=Run kldload CI nightly at 03:00 UTC

[Timer]
OnCalendar=*-*-* 03:00:00 UTC
RandomizedDelaySec=15min
Persistent=true
Unit=kldload-ci.service

[Install]
WantedBy=timers.target
TIMER
sudo systemctl daemon-reload
sudo systemctl enable --now kldload-ci.timer'

# 8. First smoke run to verify:
ssh admin@new-ci-host 'sudo kldload-ci-run --only fedora-core'

That's it. Total wall-clock: ~10 minutes of setup plus the duration of the first smoke run.
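
Before kicking off that first run, a quick check that the tools are on PATH can save a failed combo. A minimal sketch, assuming the binary names below; missing_tools is a hypothetical helper, not part of the runner.

```shell
# Hypothetical helper: print each named command that is not on PATH.
missing_tools() {
  for t in "$@"; do
    command -v "$t" >/dev/null 2>&1 || echo "$t"
  done
}

# Binary names for the step-1 packages (ShellCheck installs `shellcheck`,
# sqlite installs `sqlite3`), plus podman and zfs used by steps 3 and 4:
#   missing_tools git shellcheck sqlite3 jq qemu-img sshpass podman zfs
```

An empty result means every tool was found; anything printed is a name to install or investigate.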

11. What CI Doesn't Catch

Be explicit about the limitations. Treating CI as the seatbelt instead of the airbag is how you ship a release that boots in OVMF and bricks on real hardware.

The matrix runs in OVMF (the UEFI firmware libvirt uses for its guests). Real-machine UEFI implementations diverge wildly. We've seen all of these in the last few releases:

Slow-USB rootdelay races — HP-branded USB sticks consistently need an extra rootdelay parameter to mount the squashfs at boot. OVMF doesn't exhibit the timing.
MOK enrollment NVRAM behavior — different firmware vendors implement MokManager differently. Some boot once after enrollment then never again. OVMF's NVRAM is a flat file and behaves predictably.
NVIDIA driver vs running kernel — a kernel update that shipped at 2 AM can break NVIDIA DKMS rebuilds. Caught only on hardware with an actual GPU.
BIOS boot order — some firmwares ignore the EFI variable your installer set and boot the wrong device. OVMF respects what we set.
Real disk geometry — 4Kn vs 512n drives, NVMe namespaces, RAID controllers. OVMF gives you a clean qcow2 with predictable geometry.
Real network drivers — Intel I210 vs Realtek r8169 vs Mellanox ConnectX. Virtio hides all of that.
Power management — suspend/resume, idle states, AC events. None of this is in the VM.

So the per-release hardware install on fiend (or your equivalent) isn't optional. CI shrinks the surface to "will it install correctly given the right firmware"; hardware install confirms the firmware is right.

12. Roadmap

What's not done yet, in rough priority order:

Workload-template smoke wrappers

The matrix covers core / server / desktop. The four workload templates (kvm, k8s, klab, zfslab) need their own tests/smoke-{kvm,k8s,klab,zfslab}.sh. Each needs to assert post-install state for its template; for klab that means 5 cloud-init goldens snapshot-ready, K8s cluster Ready, Tetragon Running. ~200 LoC each.

// Once these land: 9 distros × 7 profiles = 63 combos
// Sequential: ~24 hours. Parallel: ~6 hours. Probably want a 2nd box.
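
The 63-combo figure is just the cross product of distros and profiles. The sketch below enumerates it; the combo naming follows the fedora-core form used with --only earlier, but the other identifiers are guesses at what the runner would call them.

```shell
# Identifiers are assumptions modelled on `fedora-core`; the runner's real
# names may differ.
distros="centos rocky fedora debian ubuntu rhel arch alpine freebsd"
profiles="core server desktop kvm k8s klab zfslab"

# Print one distro-profile combo per line.
list_combos() {
  for d in $distros; do
    for p in $profiles; do
      echo "$d-$p"
    done
  done
}

# Usage:
#   list_combos | wc -l    # counts the full matrix
```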

RHEL in the matrix

RHEL needs subscription credentials (username + password OR activation key + org). Adding it requires either embedding a test account in the runner config or skipping when creds aren't present. Probably "skip with a warning" is right.

// Goal: smoke-test rhel core/server/desktop unattended
// Blocker: subscription-manager auth in the VM
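
The "skip with a warning" option could look roughly like this. The environment variable names (RHSM_USERNAME, RHSM_PASSWORD, RHSM_ACTIVATION_KEY, RHSM_ORG) and both function names are assumptions made for illustration; the runner's real config keys may differ.

```shell
# Hypothetical: credentials are present if either pair is fully set.
rhel_creds_present() {
  { [ -n "${RHSM_USERNAME:-}" ] && [ -n "${RHSM_PASSWORD:-}" ]; } ||
  { [ -n "${RHSM_ACTIVATION_KEY:-}" ] && [ -n "${RHSM_ORG:-}" ]; }
}

# Hypothetical gate: returns 1 (after warning on stderr) for rhel-* combos
# when no credentials are configured, 0 for everything else.
should_run_combo() {
  case "$1" in
    rhel-*)
      if ! rhel_creds_present; then
        echo "WARN: skipping $1: no RHEL subscription credentials" >&2
        return 1
      fi ;;
  esac
  return 0
}
```

Recording the skip as its own result, rather than as a pass, would keep a missing credential from reading as green.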

Arch + Alpine wrappers

The installer supports both, but no smoke-arch.sh or smoke-alpine.sh exists. Arch is interesting because it's rolling-release — package versions change daily and the matrix would catch that drift. Alpine has its own quirks (apk vs dnf/apt, OpenRC vs systemd). Each wrapper is a 100-line script.

FreeBSD path

Installer plumbing exists; smoke wrapper does not; no darksite. The bigger question is whether to do smoke-test in bhyve VMs (real BSD test surface) or in a Linux VM with FreeBSD as the install target (faster, weaker test). Probably both eventually.

Per-failure artifact upload to R2

Today the failure VM is left running and the per-combo logs stay on fiend. For long-running investigation, uploading the failure tarball (VM console + journal + ZFS state) to Cloudflare R2 would mean failures are inspectable from anywhere, not just by SSH'ing into fiend.
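
The bundling half of that could be as small as the sketch below. The function name and the tarball naming scheme are hypothetical, and the upload itself is deliberately left out: R2 exposes an S3-compatible API, so any S3 client could push the resulting file.

```shell
# Hypothetical helper: pack a combo's artifact directory into a timestamped
# tarball under dest and print the tarball path.
bundle_failure() {
  combo="$1"; src="$2"; dest="$3"
  out="$dest/$combo-$(date -u +%Y%m%dT%H%M%SZ).tar.gz"
  tar -czf "$out" -C "$src" . && echo "$out"
}

# Usage (paths are illustrative):
#   bundle_failure ubuntu-core /path/to/combo-artifacts /path/to/outbox
```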

Hardware test self-validation

The reverse of CI: when you do a hardware install, run the smoke suite against the running install (not in a VM). The autodeploy could fire kldload-self-test and write the result to /var/lib/kldload/self-test-results.json. Then "did the install work" is automated even on hardware.
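
A sketch of the result-writing half, assuming a flat JSON schema. Only the output path comes from the text above; the field names and the helper name write_self_test_result are invented for illustration.

```shell
# Hypothetical helper: write a pass/fail summary as one JSON object.
# Arguments: output path, number of passed checks, number of failed checks.
write_self_test_result() {
  out="$1"; passed="$2"; failed="$3"
  if [ "$failed" -eq 0 ]; then status=pass; else status=fail; fi
  mkdir -p "$(dirname "$out")"
  printf '{"status":"%s","passed":%d,"failed":%d,"timestamp":"%s"}\n' \
    "$status" "$passed" "$failed" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$out"
}

# Usage:
#   write_self_test_result /var/lib/kldload/self-test-results.json 41 0
```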

13. Closing — Why This Approach Holds Up

Most distro projects (and most open-source infrastructure) get tested by their users. A user installs, hits a bug, files an issue, the maintainer investigates, ships a fix in the next release. That feedback loop is measured in weeks. For nine distros across three profiles, the loop never closes — by the time you've fixed the bug a user found on Debian-server, three more bugs have shipped in CentOS-desktop.

The kldload approach inverts that. Every commit triggers (or can trigger) a 15-combo matrix that reproduces what a user would do — burn ISO, install, validate. The matrix found 8 latent bugs in its first overnight run that could have been shipping for weeks or months otherwise. That's not because the maintainer was sloppy — it's because no human can verify a 9 × 3 matrix by hand on every commit. The framework can.

The tradeoff is honest: you lose the "real hardware" coverage CI can't give you. You compensate with periodic hardware installs on the one box you have. The combination is more thorough than either alone, and it's cheap enough to run that "did this commit break anything?" becomes a 2-hour question instead of a release-cycle question.

If you've read this far and you're standing up your own kldload (or any distro project), remember that the test infrastructure is the unglamorous part of the project that compounds the most over time. Build the matrix early. Add to it every time you ship a new feature. Trust the framework when it surfaces a regression even if it looks like a flake. The dollar you save by skipping test infrastructure costs ten in user-facing bugs three months later.

The bottom line: kldload's test infrastructure is 318 lines of bash, one ASUS box, and a SQLite database. It catches more regressions per dollar than any CI cloud you could rent. It runs overnight while you sleep. The next time you stand up a project that needs to ship across heterogeneous targets, copy this pattern. It works because it stays small.