kldload — your platform, your way, free

The Image IS the Code

Infrastructure as Code does not mean YAML files in a git repository. It means your infrastructure is defined, versioned, reproducible, and deployable from a single artifact. kldload builds that artifact. A bootable disk image with ZFS on root, WireGuard networking, eBPF observability, boot environments, and every package your workload needs — baked in at build time. Not converged after the fact. Not hoped into existence by a configuration management run. Built once. Tested once. Deployed everywhere. Identical every time.

The thesis: Real Infrastructure as Code means the image defines the machine. Not a playbook. Not a state file. Not a manifest that hopes the package mirror is up. The image. It boots, it is already done. Every machine that receives this image is identical to every other machine that received this image. That is reproducibility. That is IaC.

kldload is Stage 0 of every pipeline. It produces the base image. Packer layers your application on top. Terraform stamps out copies. Ansible handles runtime configuration. Git versions everything. CI/CD automates the flow. But it all starts with the image. And the image starts with kldload.

The industry has been lying to itself about IaC for a decade. "Infrastructure as Code" became "we have Terraform files in git" — but those Terraform files do not define the OS. They define cloud resources. The OS comes from an AMI that someone built by hand six months ago, or worse, from a stock cloud image that gets mutated by Ansible after launch. The actual infrastructure — the operating system, the filesystem, the kernel modules, the boot chain — is not in code at all. It is in a wiki page titled "How to set up a new server" that nobody has updated since 2019.

kldload fixes this. The ISO build process IS the code. deploy.sh is in git. The package sets are in git. The answer file is in git. The postinstaller is in git. Run ./deploy.sh build and you get a byte-identical image every time. That image IS your infrastructure definition. Boot it on bare metal, a VM, or export it to any cloud. The machine is defined by the image, not by whatever happened to it after it booted.

What Infrastructure as Code Actually Means

Infrastructure as Code has three requirements. Declarative definition — you describe what the infrastructure should be, not the steps to get there. Version control — every change is tracked, diffable, reviewable, and reversible. Reproducibility — the same inputs always produce the same outputs. If your "IaC" requires a human to click through a cloud console, SSH into a box, or run ad-hoc commands to fix drift, it is not IaC. It is automation with extra steps.

Declarative definition

kldload's answer file declares the target state: distro, profile, disk layout, ZFS topology, networking, packages, WireGuard config, export format. Every parameter has a default. You override what you need. The installer reads the declaration and produces the machine. No imperative steps. No ordering dependencies. No "run this before that."

The answer file is a blueprint. kldload is the factory. The image is the product.

Version control

The entire kldload build is in git. deploy.sh, package sets, answer files, postinstallers, build scripts. git diff shows you exactly what changed between two image builds. git blame shows you who changed it and why. git revert undoes it. The image is a function of the git commit. Tag the commit, tag the image.

git log IS your infrastructure changelog. Every image maps to a commit SHA.

Reproducibility

kldload embeds offline package mirrors (darksites) into the ISO. No network dependency during install. No "the package mirror was down" failures. No "the GPG key rotated" surprises. No "the new version of nginx broke our config" drift. The darksite pins every package at build time. Same ISO, same packages, same image. Every time. On every machine.

The darksite is a lockfile for your entire operating system.
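
One way to check the reproducibility claim in practice is to build the same inputs twice and compare digests. The sketch below assumes GNU sha256sum is available; same_digest and the ISO paths in the usage note are hypothetical names, not part of kldload.

```shell
#!/usr/bin/env bash
# same_digest FILE_A FILE_B
# Returns 0 when both files have the same SHA-256 digest,
# 1 when they differ, and 2 when either file cannot be read.
same_digest() {
  local a b
  [ -r "$1" ] && [ -r "$2" ] || return 2
  a=$(sha256sum "$1" | cut -d' ' -f1)
  b=$(sha256sum "$2" | cut -d' ' -f1)
  [ "$a" = "$b" ]
}
```

A possible use after two independent builds: `same_digest build-1/kldload-free.iso build-2/kldload-free.iso && echo "reproducible"`.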

Most teams think they have IaC because their Terraform is in git. But Terraform does not define the OS. It defines cloud resources — VPCs, subnets, security groups, instances. The instance gets a stock AMI. The AMI gets mutated by user-data scripts or Ansible. Those scripts pull packages from the internet at deploy time. Different deploy, different package versions. Different package versions, different behavior. "But it worked in staging" — because staging was deployed on Tuesday when the package mirror had version 2.3.1, and production was deployed on Thursday when the mirror had 2.4.0.

kldload eliminates this entire class of failure. The darksite pins packages. The image is deterministic. There is no "works on my deploy." There is one image, built once, tested once, deployed to everything. That is IaC. Not "we have YAML files."
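
The staging-versus-production drift described above can be made visible by comparing package lists from two machines. This is a minimal sketch, assuming each list holds one "name version" pair per line (on Debian-family systems such a list can be produced with dpkg-query -W); pkg_drift is a hypothetical helper name. Packages present in only one list are ignored.

```shell
#!/usr/bin/env bash
# pkg_drift LIST_A LIST_B
# Prints "name: version_a -> version_b" for every package that appears
# in both lists with different versions.
pkg_drift() {
  awk '
    NR == FNR { a[$1] = $2; next }   # first file: remember each version
    ($1 in a) && a[$1] != $2 {       # second file: report mismatches
      printf "%s: %s -> %s\n", $1, a[$1], $2
    }
  ' "$1" "$2"
}
```

On two machines booted from the same sealed image, the expectation is that this prints nothing.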

The image as a function

Think of the kldload image as a pure function:

f(distro, profile, packages, answers, postinstaller) = disk_image

# Same inputs, same output. Every time.
f(debian, server, darksite-v42, answers-prod.env, harden.sh) = golden-prod-v42.qcow2

# Change one input, get a new output. The diff is the change.
f(debian, server, darksite-v43, answers-prod.env, harden.sh) = golden-prod-v43.qcow2

# The git diff between darksite-v42 and darksite-v43 IS the changelog.

No side effects. No external dependencies at deploy time. No state that drifts. The image is immutable. Deploy it to one machine or one thousand. Identical.
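
The function view suggests a simple bookkeeping trick: derive one fingerprint from the build inputs and record it next to the image. The sketch below is an illustration under assumptions, not how kldload identifies images; build_fingerprint is a hypothetical helper, and the input files (answer file, postinstaller, a darksite manifest) are whatever your build actually consumes.

```shell
#!/usr/bin/env bash
# build_fingerprint DISTRO PROFILE FILE...
# Hashes the scalar inputs plus the digest of every input file into a
# single SHA-256 value. Same inputs give the same fingerprint.
build_fingerprint() {
  [ "$#" -ge 3 ] || return 2   # need at least one input file
  local distro=$1 profile=$2
  shift 2
  {
    printf 'distro=%s\nprofile=%s\n' "$distro" "$profile"
    # One digest per file, in argument order.
    sha256sum -- "$@" | cut -d' ' -f1
  } | sha256sum | cut -d' ' -f1
}
```

If two builds report the same fingerprint but produce different images, some input is not being captured, which is exactly the kind of hidden dependency the darksite is meant to remove.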

kldload vs Traditional IaC Tools

Every IaC tool solves a different slice of the problem. None of them build what kldload builds. kldload is not a replacement for Terraform or Ansible. It is Stage 0 — the foundation image that every other tool consumes.

| Dimension | kldload | Terraform | Ansible | Packer |
|---|---|---|---|---|
| What it builds | Bootable disk image | Cloud resources | System state | Machine image |
| ZFS on root | Built in | | | |
| Bare metal | | | | |
| Offline install | ✓ (darksite) | | | |
| Deterministic | ✓ (pinned pkgs) | ✓ (state file) | Drift possible | |
| Deploy speed | Boot time (15s) | API calls (2-5m) | Convergence (10-30m) | Build only |
| State management | Stateless (image) | State file required | Inventory + facts | Stateless |
| Rollback | ZFS boot env (30s) | Destroy + recreate | Manual | |
| Multi-distro | 8 distros, 1 ISO | Per-AMI | With effort | Per-builder |

kldload: the image factory

Builds bootable disk images with ZFS on root from scratch. Handles disk partitioning, pool creation, DKMS compilation, bootloader installation, darksite embedding, kernel module signing, and image sealing. Works offline. Works on bare metal. Produces qcow2, vmdk, vhd, raw, and OVA. No existing infrastructure required — just a USB stick.

Stage 0: raw materials. The foundation everything else builds on.

Terraform: the orchestrator

Creates and manages cloud/virtualization resources via provider APIs. Deploys instances, networks, storage, DNS, load balancers. Consumes images that already exist — AMIs, templates, base images. Does not build operating systems. Does not touch bare metal. Requires a state file and API access.

Stage 2: deployment. Takes finished images and stamps them into infrastructure.

Ansible: the configurator

Pushes configuration to running machines via SSH. Installs packages, writes config files, restarts services. Each run re-evaluates the entire state. Depends on network, package mirrors, and SSH access. Powerful for application-layer changes on top of a known-good base. Fragile as the only source of truth for the entire OS.

Stage 3: application layer. Best when the OS is already done.

Packer: the customizer

Builds machine images by booting a source image, running provisioners (shell scripts, Ansible, Chef), and capturing the result. Powerful for layering applications onto a base. Cannot build an OS from scratch. Cannot set up ZFS on root. Cannot create darksites. Needs a base image to start from — kldload provides it.

Stage 1: customization. Takes the kldload base and adds your application.

The key insight: these tools are not competitors. They are stages in a pipeline. kldload builds the OS image (Stage 0). Packer layers your application on top (Stage 1). Terraform deploys the result (Stage 2). Ansible handles runtime config (Stage 3). Each tool does one job well. The problem is that most teams skip Stage 0 and start with a stock cloud image — which means their "infrastructure as code" does not include the infrastructure. It includes the configuration management that runs on top of infrastructure someone built by hand.

kldload is the missing stage. It turns "we have Terraform files" into "we have the entire stack, from disk partitioning to application deployment, defined in code and versioned in git."

The Golden Image Pipeline

A golden image is a tested, sealed, versioned disk image that serves as the single source of truth for every machine in your fleet. kldload builds golden images. The pipeline has six stages: build, test, seal, store, deploy, verify.

The complete pipeline

# Stage 1: BUILD — produce the ISO and install to a VM
cd /root/kldload-free
git pull origin main
PROFILE=server ./deploy.sh build

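# --wait -1 blocks until the unattended install finishes; --noreboot keeps the VM
# powered off afterwards so the test stage can safely reuse the disk.
# Note: virt-install accepts --extra-args only together with --location, not --cdrom;
# adjust how kldload.seed=auto is passed if your virt-install rejects this combination.
virt-install --name golden-build --ram 4096 --vcpus 4 \
  --disk path=/var/lib/libvirt/images/golden-build.qcow2,size=40,format=qcow2 \
  --cdrom output/kldload-free-*.iso \
  --os-variant centos-stream9 --boot uefi --noautoconsole --wait -1 --noreboot \
  --extra-args "kldload.seed=auto"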

# Stage 2: TEST — boot the image and run smoke tests
virt-install --name golden-test --ram 4096 --vcpus 4 \
  --disk path=/var/lib/libvirt/images/golden-build.qcow2,bus=virtio \
  --import --os-variant centos-stream9 --boot uefi --noautoconsole

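# wait for SSH to come up (up to ~5 minutes) instead of guessing a boot time
for _ in $(seq 1 60); do
  ssh -o ConnectTimeout=5 -o BatchMode=yes admin@golden-test true && break
  sleep 5
done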
ssh admin@golden-test "zpool status && systemctl is-system-running && uname -r"
ssh admin@golden-test "sudo zfs list -t snapshot"
ssh admin@golden-test "wg show 2>/dev/null || echo 'WireGuard: not configured (expected for core)'"

# Stage 3: SEAL — export as cloud-ready image
VERSION=$(date +%Y%m%d)  # compute once so every stage uses the same tag
ssh admin@golden-test "sudo kexport qcow2"
scp "admin@golden-test:/root/kldload-export-*.qcow2" "./golden-v${VERSION}.qcow2"

# Stage 4: STORE — version and archive
sha256sum "golden-v${VERSION}.qcow2" > "golden-v${VERSION}.sha256"
aws s3 cp "golden-v${VERSION}.qcow2" s3://infra-images/golden/
aws s3 cp "golden-v${VERSION}.sha256" s3://infra-images/golden/

# Stage 5: DEPLOY — stamp out instances via Terraform
cd terraform/
# Save the plan and apply exactly that plan, so image_version reaches the apply step
terraform plan -var="image_version=${VERSION}" -out=tfplan
terraform apply tfplan

# Stage 6: VERIFY — confirm deployed instances match golden image
for host in $(terraform output -json instance_ips | jq -r '.[]'); do
  ssh admin@$host "zpool status -x && systemctl is-system-running"
done

Every stage is automatable. Every stage is in a script. Every script is in git. The pipeline runs in CI. A developer pushes a package set change, CI builds a new ISO, installs it to a VM, runs smoke tests, seals the image, uploads it, deploys it to staging, verifies it works, and creates a PR for production promotion. No human touches a keyboard between "git push" and "staging is green."
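
In CI, the fragile part of a pipeline like this is usually waiting for a freshly booted VM to accept connections. A bounded retry helper is one common way to handle it; the sketch below is generic shell, and retry is a hypothetical name rather than anything shipped with kldload.

```shell
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most ATTEMPTS times, sleeping
# DELAY_SECONDS between attempts. Returns 1 if every attempt failed.
retry() {
  local attempts=$1 delay=$2 n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

For example, `retry 60 5 ssh -o ConnectTimeout=5 admin@golden-test true` waits up to roughly five minutes for SSH and fails the job cleanly if the VM never comes up.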

The critical part: the seal stage. kexport clears machine-id, removes SSH host keys, enables cloud-init, and exports the ZFS pool cleanly. Every instance that boots from this image generates fresh identity. No two machines share host keys. No two machines share machine-id. The image is a template, not a clone of a specific machine.

What sealing does

The kexport command prepares the image for cloning:

# Machine identity cleared
/etc/machine-id         → emptied (regenerated on first boot)
/var/lib/dbus/machine-id → emptied

# SSH host keys removed
/etc/ssh/ssh_host_*     → deleted (regenerated by cloud-init)

# Cloud-init enabled
cloud-init              → enabled with multi-datasource config
                          (NoCloud, ConfigDrive, GCE, EC2, Azure)

# ZFS pool exported cleanly
zpool export rpool      → ensures consistent on-disk state
qemu-img convert        → produces the final image format
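
A sealed image can be spot-checked before it is published. The sketch below assumes the image's root filesystem has already been mounted read-only at some directory (how to mount a ZFS-root image is outside this sketch); check_sealed is a hypothetical helper, not a kldload command. Symlinked machine-id files are skipped because an absolute symlink would resolve against the host, not the image.

```shell
#!/usr/bin/env bash
# check_sealed ROOT
# ROOT is the directory where the image's root filesystem is mounted.
# Prints each problem found; returns 1 if any exist, 0 if the tree looks sealed.
check_sealed() {
  local root=$1 rc=0 f key
  for f in /etc/machine-id /var/lib/dbus/machine-id; do
    if [ ! -L "$root$f" ] && [ -s "$root$f" ]; then
      echo "$f is not empty"
      rc=1
    fi
  done
  for key in "$root"/etc/ssh/ssh_host_*; do
    # An unmatched glob stays literal, so confirm the path exists.
    if [ -e "$key" ]; then
      echo "SSH host key present: ${key#"$root"}"
      rc=1
    fi
  done
  return "$rc"
}
```

Running this as a gate before the upload step would catch an image that was exported without going through the seal.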

Image versioning strategy

Tag images with the git commit SHA that built them. Every image traces back to the exact code that produced it:

# Image naming convention
golden-debian-server-v20260404-a1b2c3d.qcow2
│       │      │      │         │
│       │      │      │         └── git short SHA
│       │      │      └── build date
│       │      └── profile
│       └── distro
└── image type

# The SHA traces back to the exact commit
git show a1b2c3d  # shows exactly what was in this image
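
The naming convention is easy to generate and parse mechanically. A minimal sketch, assuming the five-field layout shown above and a .qcow2 extension; image_name and image_sha are hypothetical helper names.

```shell
#!/usr/bin/env bash
# image_name DISTRO PROFILE BUILD_DATE GIT_SHA
# Prints a file name following the golden-<distro>-<profile>-v<date>-<sha> layout.
image_name() {
  printf 'golden-%s-%s-v%s-%s.qcow2\n' "$1" "$2" "$3" "$4"
}

# image_sha FILENAME
# Prints the git short SHA embedded in an image file name.
image_sha() {
  local base=${1##*/}          # strip any directory
  base=${base%.qcow2}          # strip the extension
  printf '%s\n' "${base##*-}"  # last dash-separated field
}
```

In a build script this could be called as `image_name debian server "$(date +%Y%m%d)" "$(git rev-parse --short HEAD)"`, and `git show "$(image_sha <file>)"` then leads back to the commit.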

Packer Integration

Packer takes a base image, boots it, runs provisioners, and captures the result. kldload provides the base image. Together they produce application-ready golden images with ZFS on root — something Packer cannot do alone because it does not know how to partition disks, create ZFS pools, compile DKMS modules, or configure ZFSBootMenu.

Full Packer template — QEMU builder with kldload base

# kldload-webapp.pkr.hcl
# Builds an application-ready golden image from a kldload Core export

variable "kldload_image" {
  type    = string
  default = "golden-debian-core-v20260404.qcow2"
}

variable "ssh_username" {
  type    = string
  default = "admin"
}

variable "ssh_password" {
  type    = string
  default = "changeme"
  sensitive = true
}

source "qemu" "kldload-webapp" {
  disk_image       = true
  iso_url          = var.kldload_image
  iso_checksum     = "file:${var.kldload_image}.sha256"
  output_directory = "output-webapp"
  format           = "qcow2"

  # UEFI boot — kldload images are always UEFI
  qemuargs = [
    ["-bios", "/usr/share/OVMF/OVMF_CODE.fd"],
    ["-m", "4096"],
    ["-smp", "4"],
  ]

  ssh_username     = var.ssh_username
  ssh_password     = var.ssh_password
  ssh_timeout      = "5m"
  shutdown_command  = "sudo shutdown -h now"

  # Disk sizing — kldload Core is ~4GB, expand for app data
  disk_size        = "40G"
  disk_interface   = "virtio"
}

build {
  sources = ["source.qemu.kldload-webapp"]

  # Snapshot before any changes — ZFS gives us this for free
  provisioner "shell" {
    inline = [
      "sudo zfs snapshot rpool/ROOT/debian@pre-packer",
      "sudo zfs snapshot rpool@pre-packer",
    ]
  }

  # Create ZFS datasets for application data first, so that PostgreSQL and Redis
  # initialize their data directories on the tuned datasets rather than under them
  provisioner "shell" {
    inline = [
      # Separate dataset for app — independent snapshots
      # (-p creates any missing parent datasets)
      "sudo zfs create -p -o mountpoint=/srv/app -o compression=lz4 rpool/srv/app",

      # Database dataset with tuned recordsize
      "sudo zfs create -p -o mountpoint=/var/lib/postgresql -o recordsize=8k -o logbias=throughput rpool/data/postgres",

      # Redis dataset — small recordsize for key-value
      "sudo zfs create -p -o mountpoint=/var/lib/redis -o recordsize=4k rpool/data/redis",

      # Logs dataset — high compression, no atime
      "sudo zfs create -p -o mountpoint=/var/log/app -o compression=zstd -o atime=off rpool/logs/app",
    ]
  }

  # Install application packages
  provisioner "shell" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx postgresql-16 redis-server certbot python3-certbot-nginx",
      "sudo systemctl enable nginx postgresql redis-server",
    ]
  }

  # Deploy application config
  provisioner "file" {
    source      = "configs/nginx-webapp.conf"
    destination = "/tmp/nginx-webapp.conf"
  }

  provisioner "shell" {
    inline = [
      "sudo mv /tmp/nginx-webapp.conf /etc/nginx/sites-available/webapp.conf",
      "sudo ln -sf /etc/nginx/sites-available/webapp.conf /etc/nginx/sites-enabled/",
      "sudo rm -f /etc/nginx/sites-enabled/default",
      "sudo nginx -t",
    ]
  }

  # Snapshot after — this IS the rollback point
  provisioner "shell" {
    inline = [
      "sudo zfs snapshot rpool/ROOT/debian@post-packer",
      "sudo zfs snapshot rpool@post-packer",
    ]
  }

  # Seal for cloning
  provisioner "shell" {
    inline = [
      "sudo cloud-init clean --logs",
      "sudo truncate -s 0 /etc/machine-id",
      "sudo rm -f /etc/ssh/ssh_host_*",
      "sudo sync",
    ]
  }

  # Post-processors — convert to multiple formats
  post-processor "shell-local" {
    inline = [
      "qemu-img convert -f qcow2 -O vmdk output-webapp/packer-kldload-webapp golden-webapp.vmdk",
      "qemu-img convert -f qcow2 -O raw output-webapp/packer-kldload-webapp golden-webapp.raw",
      "sha256sum output-webapp/packer-kldload-webapp golden-webapp.vmdk golden-webapp.raw > checksums.sha256",
    ]
  }
}

Notice the zfs snapshot calls in the provisioners. They only work because the base image has ZFS on root. On an ext4 base, if the Packer build fails at step 7 of 12, you start over from scratch. On ZFS (kldload base), you snapshot before each major step. If step 7 fails, you roll back to the step 6 snapshot and debug from there, provided the build VM is still alive: by default Packer destroys it on error, so run packer build -on-error=abort when you want to keep it. The snapshots also serve as documentation. Running zfs diff between the pre-packer and post-packer snapshots shows exactly which files changed.

The ZFS datasets created in the provisioner are not just directories — they are independent filesystems with their own snapshot schedules, compression settings, and recordsizes tuned for the workload. PostgreSQL gets 8K recordsize matching its page size. Redis gets 4K for small key-value operations. Logs get zstd compression because they are write-heavy and highly compressible. This is infrastructure engineering baked into the image, not bolted on after the fact.

Multi-distro matrix builds

Build the same application image across multiple distros in parallel:

# kldload-matrix.pkr.hcl
# Build the same app on Debian, Rocky, and Ubuntu simultaneously

variable "distros" {
  type = map(object({
    image   = string
    pkg_mgr = string
    packages = string
  }))
  default = {
    debian = {
      image    = "golden-debian-core-v20260404.qcow2"
      pkg_mgr  = "apt-get"
      packages = "nginx postgresql-16 redis-server"
    }
    rocky = {
      image    = "golden-rocky-core-v20260404.qcow2"
      pkg_mgr  = "dnf"
      packages = "nginx postgresql-server redis"
    }
    ubuntu = {
      image    = "golden-ubuntu-core-v20260404.qcow2"
      pkg_mgr  = "apt-get"
      packages = "nginx postgresql-16 redis-server"
    }
  }
}

source "qemu" "kldload-matrix" {
  disk_image = true
  format     = "qcow2"
  qemuargs   = [["-bios", "/usr/share/OVMF/OVMF_CODE.fd"]]
  ssh_username     = "admin"
  ssh_password     = "changeme"
  ssh_timeout      = "5m"
  shutdown_command  = "sudo shutdown -h now"
  disk_size        = "40G"
}

build {
  dynamic "source" {
    for_each = var.distros
    labels   = ["qemu.kldload-matrix"]
    content {
      name             = source.key
      iso_url          = source.value.image
      output_directory = "output-${source.key}"
    }
  }

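  provisioner "shell" {
    inline = [
      "sudo zfs snapshot rpool@pre-app",
      # source.name is the name set in the dynamic block above (the distro key);
      # Packer has no each.value inside provisioners
      "sudo ${var.distros[source.name].pkg_mgr} install -y ${var.distros[source.name].packages}",
      "sudo zfs snapshot rpool@post-app",
    ]
  }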
}

Terraform Deployment

Terraform takes the golden image and stamps it into infrastructure. kldload images work with any Terraform provider that can boot a custom disk image: libvirt for KVM, Proxmox for homelab clusters, AWS/GCP/Azure for cloud. The image is the same. The provider changes.

KVM/libvirt — on-premises deployment

# providers.tf
terraform {
  required_providers {
    libvirt = {
      source  = "dmacvicar/libvirt"
      version = "~> 0.8"
    }
  }
}

provider "libvirt" {
  uri = "qemu:///system"
}

# main.tf — deploy kldload golden images to KVM

variable "instance_count" {
  type    = number
  default = 3
}

variable "golden_image" {
  type    = string
  default = "golden-debian-server-v20260404.qcow2"
}

# Upload the golden image as a base volume
resource "libvirt_volume" "golden_base" {
  name   = "golden-base.qcow2"
  pool   = "default"
  source = var.golden_image
  format = "qcow2"
}

# Create a copy-on-write clone for each instance (a qcow2 overlay backed by the golden base volume)
resource "libvirt_volume" "instance_disk" {
  count          = var.instance_count
  name           = "app-${count.index}.qcow2"
  pool           = "default"
  base_volume_id = libvirt_volume.golden_base.id
  format         = "qcow2"
  size           = 42949672960  # 40GB
}

# Cloud-init configuration — kldload images have cloud-init ready
resource "libvirt_cloudinit_disk" "init" {
  count = var.instance_count
  name  = "init-${count.index}.iso"

  user_data = templatefile("${path.module}/cloud-init.yaml", {
    hostname = "app-${count.index}"
    fqdn     = "app-${count.index}.infra.local"
    ssh_key  = file("~/.ssh/id_ed25519.pub")
  })

  network_config = templatefile("${path.module}/network-config.yaml", {
    address = "10.0.1.${10 + count.index}/24"
    gateway = "10.0.1.1"
  })
}

# The VMs themselves
resource "libvirt_domain" "app" {
  count  = var.instance_count
  name   = "app-${count.index}"
  memory = 4096
  vcpu   = 4

  firmware = "/usr/share/OVMF/OVMF_CODE.fd"

  disk {
    volume_id = libvirt_volume.instance_disk[count.index].id
  }

  cloudinit = libvirt_cloudinit_disk.init[count.index].id

  network_interface {
    bridge = "br0"
  }

  console {
    type        = "pty"
    target_type = "serial"
    target_port = "0"
  }

  graphics {
    type        = "vnc"
    listen_type = "address"
    autoport    = true
  }
}

# cloud-init.yaml template
# #cloud-config
# hostname: ${hostname}
# fqdn: ${fqdn}
# manage_etc_hosts: true
# users:
#   - name: admin
#     sudo: ALL=(ALL) NOPASSWD:ALL
#     ssh_authorized_keys:
#       - ${ssh_key}
# runcmd:
#   - zpool status  # verify ZFS is healthy on first boot

output "instance_ips" {
  value = [for i in range(var.instance_count) : "10.0.1.${10 + i}"]
}

Every one of those three instances boots with ZFS on root. Not because Terraform configured ZFS. Not because Ansible installed it. Because the golden image already has it. Terraform's job is to stamp the image onto VMs and hand them cloud-init data. That is it. The hard part — the ZFS pools, the boot environments, the DKMS modules, the snapshot timers, the dataset hierarchy — was done once, when kldload built the image. Terraform just copies it.

This is the correct separation of concerns. The image factory (kldload) handles OS-level complexity. The orchestrator (Terraform) handles deployment-level complexity. Neither needs to know the other's internals. kldload produces a qcow2. Terraform consumes a qcow2. The interface is a file.

Proxmox — homelab and production clusters

# proxmox.tf — deploy kldload images to Proxmox cluster

terraform {
  required_providers {
    proxmox = {
      source  = "Telmate/proxmox"
      version = "~> 2.9"  # the disk and network block syntax below matches the 2.9.x schema
    }
  }
}

provider "proxmox" {
  pm_api_url      = "https://pve1.infra.local:8006/api2/json"
  pm_tls_insecure = true
}

resource "proxmox_vm_qemu" "app" {
  count       = 3
  name        = "app-${count.index}"
  target_node = "pve1"
  clone       = "kldload-golden-template"  # import qcow2 as template first

  cores   = 4
  memory  = 4096
  scsihw  = "virtio-scsi-single"
  bios    = "ovmf"
  machine = "q35"
  agent   = 1

  disk {
    storage = "local-zfs"         # Proxmox ZFS storage — CoW clones are instant
    size    = "40G"
    type    = "scsi"
    ssd     = 1
    discard = "on"
  }

  network {
    model  = "virtio"
    bridge = "vmbr0"
    tag    = 10
  }

  os_type   = "cloud-init"
  ipconfig0 = "ip=10.0.1.${10 + count.index}/24,gw=10.0.1.1"
  ciuser    = "admin"
  sshkeys   = file("~/.ssh/id_ed25519.pub")
}

# Import the golden image as a Proxmox template (run once)
# qm create 9000 --name kldload-golden-template --memory 4096 --cores 4 \
#   --net0 virtio,bridge=vmbr0,tag=10 --scsihw virtio-scsi-single \
#   --machine q35 --bios ovmf --boot order=scsi0 \
#   --efidisk0 local-zfs:1,format=raw,efitype=4m,pre-enrolled-keys=0
# qm set 9000 --scsi0 local-zfs:0,import-from=/root/golden-v20260404.qcow2
# qm template 9000

AWS — import and deploy

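# Upload raw image to S3
aws s3 cp golden-v20260404.raw s3://infra-images/

# Import the disk as an EBS snapshot. (import-image inspects the guest filesystem,
# is unlikely to accept a ZFS root, and cannot set the AMI name the data source below filters on.)
aws ec2 import-snapshot \
  --disk-container "Format=RAW,UserBucket={S3Bucket=infra-images,S3Key=golden-v20260404.raw}"

# When the import task completes, register the snapshot as a UEFI AMI.
# SNAPSHOT_ID comes from: aws ec2 describe-import-snapshot-tasks
aws ec2 register-image \
  --name kldload-golden-v20260404 \
  --architecture x86_64 --virtualization-type hvm \
  --boot-mode uefi --ena-support \
  --root-device-name /dev/xvda \
  --block-device-mappings "DeviceName=/dev/xvda,Ebs={SnapshotId=${SNAPSHOT_ID}}"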

# Terraform deployment
data "aws_ami" "kldload" {
  filter {
    name   = "name"
    values = ["kldload-golden-*"]
  }
  owners = ["self"]
  most_recent = true
}

resource "aws_instance" "app" {
  count         = 3
  ami           = data.aws_ami.kldload.id
  instance_type = "t3.xlarge"
  key_name      = "ops-key"

  root_block_device {
    volume_size = 40
    volume_type = "gp3"
    iops        = 3000
    throughput  = 125
  }

  tags = {
    Name    = "app-${count.index}"
    Image   = "kldload-golden-v20260404"
    ManagedBy = "terraform"
  }
}

Azure — import and deploy

# Upload VHD to Azure
az storage blob upload \
  --account-name infraimages \
  --container-name golden \
  --name golden-v20260404.vhd \
  --file golden-v20260404.vhd \
  --type page

# Create managed image (Generation 2, because kldload images boot via UEFI)
az image create \
  --resource-group infra-rg \
  --name kldload-golden-v20260404 \
  --os-type Linux \
  --hyper-v-generation V2 \
  --source "https://infraimages.blob.core.windows.net/golden/golden-v20260404.vhd"

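# Terraform deployment
# Look up the managed image created by the az CLI above
data "azurerm_image" "kldload" {
  name                = "kldload-golden-v20260404"
  resource_group_name = "infra-rg"
}

resource "azurerm_linux_virtual_machine" "app" {
  count               = 3
  name                = "app-${count.index}"
  resource_group_name = azurerm_resource_group.infra.name
  location            = "westus2"
  size                = "Standard_D4s_v5"
  # Azure rejects reserved names such as "admin"; "kldadmin" is a placeholder
  admin_username      = "kldadmin"
  source_image_id     = data.azurerm_image.kldload.id
  # NICs are assumed to be defined elsewhere as azurerm_network_interface.app
  network_interface_ids = [azurerm_network_interface.app[count.index].id]

  admin_ssh_key {
    username   = "kldadmin"
    public_key = file("~/.ssh/id_ed25519.pub")
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Premium_LRS"
    disk_size_gb         = 40
  }
}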

GCP — import and deploy

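# Upload raw image to GCS
# Compute Engine expects a gzipped, oldgnu-format tar whose only member is named disk.raw
ln -f golden-v20260404.raw disk.raw  # hard link, avoids copying the image
tar --format=oldgnu -Sczf golden-v20260404.tar.gz disk.raw
gsutil cp golden-v20260404.tar.gz gs://infra-images/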

# Import as GCE image
gcloud compute images create kldload-golden-v20260404 \
  --source-uri=gs://infra-images/golden-v20260404.tar.gz \
  --guest-os-features=UEFI_COMPATIBLE

# Terraform deployment
resource "google_compute_instance" "app" {
  count        = 3
  name         = "app-${count.index}"
  machine_type = "e2-standard-4"
  zone         = "us-west1-a"

  boot_disk {
    initialize_params {
      image = "projects/my-project/global/images/kldload-golden-v20260404"
      size  = 40
      type  = "pd-ssd"
    }
  }

  network_interface {
    subnetwork = "projects/my-project/regions/us-west1/subnetworks/default"
    access_config {}
  }

  metadata = {
    ssh-keys = "admin:${file("~/.ssh/id_ed25519.pub")}"
  }
}

Ansible on Top

Ansible's job changes when the base image is already done. Instead of 400 tasks that build an OS from scratch, you write 40 tasks that deploy your application. The OS, ZFS, WireGuard, eBPF, boot environments — all of that is in the image. Ansible handles what changes between deployments: application code, configuration, secrets, feature flags.

Inventory from WireGuard mesh

kldload machines with WireGuard configured form a mesh network. Use WireGuard IPs as your Ansible inventory — encrypted, authenticated, routable across sites:

# inventory/hosts.yaml
all:
  children:
    web:
      hosts:
        web-01:
          ansible_host: 10.100.0.11  # WireGuard IP — encrypted tunnel
        web-02:
          ansible_host: 10.100.0.12
        web-03:
          ansible_host: 10.100.0.13
    db:
      hosts:
        db-01:
          ansible_host: 10.100.0.21
          zfs_recordsize: 8k
        db-02:
          ansible_host: 10.100.0.22
          zfs_recordsize: 8k
    cache:
      hosts:
        cache-01:
          ansible_host: 10.100.0.31
          zfs_recordsize: 4k

  vars:
    ansible_user: admin
    ansible_become: true
    ansible_python_interpreter: /usr/bin/python3
    # All traffic over WireGuard — no VPN needed, no bastion needed

WireGuard as the Ansible transport is underappreciated. Every kldload machine with WireGuard configured has a stable, encrypted, authenticated IP on the mesh. No SSH bastion hosts. No VPN concentrators. No firewall rules for Ansible. The WireGuard tunnel is already there. Ansible just uses it. The inventory is just WireGuard IPs. The traffic is encrypted at the kernel level by WireGuard, then encrypted again by SSH. Two layers of encryption without trying.

This also means your Ansible controller can be anywhere. In the office, at home, in CI. As long as it has a WireGuard peer configured, it can reach every machine in the fleet. No network topology dependencies. No "the Ansible server must be in the same datacenter." The mesh is the network.
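
Because the inventory is only a mapping from host names to WireGuard addresses, it can be generated rather than hand-edited. The sketch below assumes a plain-text peer list with one "group host wireguard_ip" triple per line; that file format and the wg_inventory name are assumptions made for illustration, not something kldload produces.

```shell
#!/usr/bin/env bash
# wg_inventory
# Reads "group host wireguard_ip" lines on stdin and prints an Ansible
# YAML inventory with one child group per distinct group name.
wg_inventory() {
  LC_ALL=C sort -k1,1 -k2,2 | awk '
    BEGIN { print "all:"; print "  children:" }
    NF < 3 { next }                 # skip blank or malformed lines
    $1 != group {                   # first host of a new group
      group = $1
      printf "    %s:\n      hosts:\n", group
    }
    { printf "        %s:\n          ansible_host: %s\n", $2, $3 }
  '
}
```

Something like `wg_inventory < peers.txt > inventory/hosts.yaml` would regenerate the hosts section; group-level vars such as ansible_user still need to be added separately.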

Application deployment playbook

# playbooks/deploy-webapp.yaml
---
- name: Deploy web application on kldload base
  hosts: web
  become: true

  vars:
    app_version: "2.4.1"
    app_domain: "app.example.com"
    zfs_snap_prefix: "deploy"

  tasks:
    # ZFS snapshot before deploy — instant rollback point
    - name: Create pre-deploy ZFS snapshot
      command: "zfs snapshot rpool/srv/app@{{ zfs_snap_prefix }}-{{ app_version }}-pre"

    # Deploy application
    - name: Pull application container image
      containers.podman.podman_image:
        name: "registry.example.com/webapp:{{ app_version }}"

    - name: Deploy application container
      containers.podman.podman_container:
        name: webapp
        image: "registry.example.com/webapp:{{ app_version }}"
        state: started
        restart_policy: always
        ports:
          - "127.0.0.1:8080:8080"
        volumes:
          - "/srv/app/data:/app/data:Z"
          - "/srv/app/config:/app/config:ro,Z"
        env:
          DATABASE_URL: "postgresql://app:{{ vault_db_password }}@db-01.infra.local/webapp"
          REDIS_URL: "redis://cache-01.infra.local:6379"

    - name: Configure nginx reverse proxy
      template:
        src: templates/nginx-webapp.conf.j2
        dest: /etc/nginx/sites-available/webapp.conf
      notify: reload nginx

    - name: Enable nginx site
      file:
        src: /etc/nginx/sites-available/webapp.conf
        dest: /etc/nginx/sites-enabled/webapp.conf
        state: link
      notify: reload nginx

    # Health check
    - name: Wait for application to be healthy
      uri:
        url: "http://127.0.0.1:8080/health"
        status_code: 200
      register: health
      until: health.status == 200
      retries: 30
      delay: 2

    # ZFS snapshot after successful deploy
    - name: Create post-deploy ZFS snapshot
      command: "zfs snapshot rpool/srv/app@{{ zfs_snap_prefix }}-{{ app_version }}-post"

  handlers:
    - name: reload nginx
      service:
        name: nginx
        state: reloaded

Rollback playbook — 30 seconds, not 30 minutes

# playbooks/rollback-webapp.yaml
---
- name: Rollback web application to previous version
  hosts: web
  become: true

  vars:
    rollback_to: "2.3.8"
    zfs_snap_prefix: "deploy"

  tasks:
    - name: Stop current application
      containers.podman.podman_container:
        name: webapp
        state: stopped

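    # ZFS rollback — instant, atomic, complete
    # -r is required when newer snapshots exist (for example the failed deploy's
    # -pre and -post snapshots); it destroys them as part of the rollback
    - name: Rollback ZFS dataset to previous version
      command: "zfs rollback -r rpool/srv/app@{{ zfs_snap_prefix }}-{{ rollback_to }}-post"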

    - name: Start previous application version
      containers.podman.podman_container:
        name: webapp
        image: "registry.example.com/webapp:{{ rollback_to }}"
        state: started
        restart_policy: always
        ports:
          - "127.0.0.1:8080:8080"
        volumes:
          - "/srv/app/data:/app/data:Z"
          - "/srv/app/config:/app/config:ro,Z"

    - name: Verify rollback
      uri:
        url: "http://127.0.0.1:8080/health"
        status_code: 200
      register: health
      until: health.status == 200
      retries: 10
      delay: 2

    - name: Report rollback status
      debug:
        msg: "Rolled back to {{ rollback_to }} on {{ inventory_hostname }}"

The rollback playbook is about 40 lines, and only one task does the real work. On ext4, a rollback means: stop the app, restore the backup (if you have one), redeploy the old version, hope the config files match, restart everything, pray. On ZFS (kldload base), a rollback is zfs rollback. One command. Atomic. Instant. The dataset returns to the exact state it was in at the snapshot point: files, permissions, timestamps, everything. Anything written to the dataset after that snapshot is discarded. The app container starts against the old data. Done.

This changes how you think about deployments. Deploys become cheap because rollbacks are free. You deploy more often because the risk is lower. You snapshot before every deploy because it costs nothing. Your deployment history is your snapshot history. zfs list -t snapshot -o name,creation,used rpool/srv/app shows you every deployment, when it happened, and how much space each snapshot uniquely holds.

GitOps Workflow

GitOps means git is the single source of truth. Push a change, CI builds the image, tests it, promotes it, deploys it. No SSH-ing into boxes. No ad-hoc commands. No "I updated the config but forgot to commit." Git push is the only deployment mechanism.

GitHub Actions — complete CI/CD pipeline

# .github/workflows/golden-image.yaml
name: Build Golden Image

on:
  push:
    branches: [main]
    paths:
      - 'build/**'
      - 'live-build/**'
      - 'answers/**'
      - 'postinstallers/**'

  workflow_dispatch:
    inputs:
      distro:
        description: 'Target distro'
        default: 'debian'
        type: choice
        options: [debian, ubuntu, centos, rocky, fedora, rhel, arch, alpine]
      profile:
        description: 'Install profile'
        default: 'server'
        type: choice
        options: [desktop, server, core]

env:
  REGISTRY: ghcr.io
  IMAGE_TAG: ${{ github.sha }}

jobs:
  build-iso:
    runs-on: self-hosted  # Needs KVM — use a bare-metal runner
    timeout-minutes: 120

    steps:
      - uses: actions/checkout@v4

      - name: Build kldload ISO
        run: |
          PROFILE=${{ inputs.profile || 'server' }} ./deploy.sh build
        env:
          KLDLOAD_DISTRO: ${{ inputs.distro || 'debian' }}

      - name: Upload ISO artifact
        uses: actions/upload-artifact@v4
        with:
          name: kldload-iso
          path: live-build/output/kldload-free-*.iso
          retention-days: 7

  install-and-test:
    needs: build-iso
    runs-on: self-hosted
    timeout-minutes: 30

    steps:
      - uses: actions/download-artifact@v4
        with:
          name: kldload-iso

      - name: Create answer file for unattended install
        run: |
          cat > answers.env <<'EOF'
          KLDLOAD_DISTRO=${{ inputs.distro || 'debian' }}
          KLDLOAD_PROFILE=${{ inputs.profile || 'server' }}
          KLDLOAD_DISK=/dev/vda
          KLDLOAD_HOSTNAME=golden-ci
          KLDLOAD_USERNAME=admin
          KLDLOAD_PASSWORD=ci-temp-pass
          KLDLOAD_NET_METHOD=dhcp
          KLDLOAD_FORCE_WIPE=1
          KLDLOAD_EXPORT_FORMAT=qcow2
          EOF

      - name: Create seed disk
        run: |
          truncate -s 10M seed.img
          mkfs.vfat -n KLDLOAD-SEED seed.img
          mcopy -i seed.img answers.env ::

      - name: Install to VM (unattended)
        run: |
          qemu-img create -f qcow2 golden-build.qcow2 40G
          qemu-system-x86_64 -enable-kvm -m 4096 -smp 4 \
            -bios /usr/share/OVMF/OVMF_CODE.fd \
            -drive file=golden-build.qcow2,format=qcow2,if=virtio \
            -cdrom kldload-free-*.iso \
            -drive file=seed.img,format=raw,if=virtio \
            -nographic -serial mon:stdio \
            -net nic,model=virtio -net user,hostfwd=tcp::2222-:22

      - name: Boot and run smoke tests
        run: |
          # Boot the installed image
          qemu-system-x86_64 -enable-kvm -m 4096 -smp 4 \
            -bios /usr/share/OVMF/OVMF_CODE.fd \
            -drive file=golden-build.qcow2,format=qcow2,if=virtio \
            -display none -daemonize \
            -net nic,model=virtio -net user,hostfwd=tcp::2222-:22

          # Wait for SSH
          for i in $(seq 1 60); do
            ssh -o StrictHostKeyChecking=no -p 2222 admin@localhost "echo ready" 2>/dev/null && break
            sleep 2
          done

          # Smoke tests
          ssh -p 2222 admin@localhost "zpool status -x | grep -q 'all pools are healthy'"  # ZFS healthy (status -x exits 0 either way)
          ssh -p 2222 admin@localhost "zfs list"                  # Datasets exist
          ssh -p 2222 admin@localhost "systemctl is-system-running" # No failed units
          ssh -p 2222 admin@localhost "uname -r"                  # Kernel version
          ssh -p 2222 admin@localhost "zfs list -H -t snapshot | grep -q ."  # Snapshots exist

      - name: Extract golden image
        run: |
          scp -P 2222 admin@localhost:/root/kldload-export-*.qcow2 \
            golden-${{ env.IMAGE_TAG }}.qcow2

      - name: Upload golden image
        uses: actions/upload-artifact@v4
        with:
          name: golden-image
          path: golden-${{ env.IMAGE_TAG }}.qcow2
          retention-days: 30

  promote:
    needs: install-and-test
    runs-on: self-hosted
    if: github.ref == 'refs/heads/main'

    steps:
      - uses: actions/download-artifact@v4
        with:
          name: golden-image

      - name: Upload to image registry
        run: |
          sha256sum golden-*.qcow2 > checksum.sha256
          aws s3 cp golden-*.qcow2 s3://infra-images/golden/
          aws s3 cp checksum.sha256 s3://infra-images/golden/

      - name: Deploy to staging
        run: |
          cd terraform/staging
          terraform init
          terraform apply -auto-approve \
            -var="image_path=s3://infra-images/golden/golden-${{ env.IMAGE_TAG }}.qcow2"

      - name: Verify staging
        run: |
          for host in $(cd terraform/staging && terraform output -json ips | jq -r '.[]'); do
            ssh admin@$host "zpool status -x && systemctl is-system-running"
          done

The entire pipeline is triggered by git push. A developer changes a package set, pushes to main, and CI does the rest: build ISO, install to a VM, run smoke tests, extract the golden image, upload it, deploy to staging, verify. No human intervention. No SSH sessions. No "let me just quickly fix this on the server." The pipeline is the only deployment path. Git is the only interface.

The self-hosted runner requirement is real. You need KVM for the install-and-test step. A bare-metal runner works, as does a VM runner with nested virtualization enabled or a dedicated build server. The ISO build takes ~20 minutes, the install ~5 minutes, and the tests ~2 minutes. Total pipeline: roughly 30 minutes from push to staging. Compare that to "SSH in and hope the Ansible run works" which takes longer and is less reliable.
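
A preflight check can make the KVM requirement explicit instead of letting QEMU fail halfway through a job. A minimal sketch; kvm_usable is a hypothetical helper name, and /dev/kvm is the standard Linux KVM device node:

```shell
#!/bin/bash
# Preflight for CI runners: succeed only if the KVM device is usable by this user.
kvm_usable() {
  local dev="${1:-/dev/kvm}"
  # Must be a character device that the current user can read and write
  [ -c "$dev" ] && [ -r "$dev" ] && [ -w "$dev" ]
}

# Example: fail the job early with a clear message
#   kvm_usable || { echo "no usable /dev/kvm on this runner" >&2; exit 1; }
```

Running this as the first step of build-iso and install-and-test turns a confusing emulator error into a one-line diagnosis.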

GitLab CI — equivalent pipeline

# .gitlab-ci.yml
stages:
  - build
  - test
  - promote
  - deploy

variables:
  KLDLOAD_DISTRO: debian
  KLDLOAD_PROFILE: server

build-iso:
  stage: build
  tags: [kvm]  # runner with KVM access
  script:
    - PROFILE=${KLDLOAD_PROFILE} ./deploy.sh build
  artifacts:
    paths:
      - live-build/output/kldload-free-*.iso
    expire_in: 7 days

install-test:
  stage: test
  tags: [kvm]
  needs: [build-iso]
  script:
    - |
      # Create seed disk with answers
      # Unquoted delimiter so ${KLDLOAD_DISTRO} and ${KLDLOAD_PROFILE} are expanded
      cat > answers.env <<ANSWERS
      KLDLOAD_DISTRO=${KLDLOAD_DISTRO}
      KLDLOAD_PROFILE=${KLDLOAD_PROFILE}
      KLDLOAD_DISK=/dev/vda
      KLDLOAD_HOSTNAME=golden-ci
      KLDLOAD_USERNAME=admin
      KLDLOAD_PASSWORD=ci-temp
      KLDLOAD_FORCE_WIPE=1
      KLDLOAD_EXPORT_FORMAT=qcow2
      ANSWERS

      truncate -s 10M seed.img
      mkfs.vfat -n KLDLOAD-SEED seed.img
      mcopy -i seed.img answers.env ::

      # Install and export
      qemu-img create -f qcow2 golden.qcow2 40G
      qemu-system-x86_64 -enable-kvm -m 4096 -smp 4 \
        -bios /usr/share/OVMF/OVMF_CODE.fd \
        -drive file=golden.qcow2,format=qcow2,if=virtio \
        -cdrom live-build/output/kldload-free-*.iso \
        -drive file=seed.img,format=raw,if=virtio \
        -nographic -serial mon:stdio

      # Boot and smoke test
      qemu-system-x86_64 -enable-kvm -m 4096 -smp 4 \
        -bios /usr/share/OVMF/OVMF_CODE.fd \
        -drive file=golden.qcow2,format=qcow2,if=virtio \
        -display none -daemonize \
        -net nic,model=virtio -net user,hostfwd=tcp::2222-:22

      for i in $(seq 1 60); do
        ssh -o StrictHostKeyChecking=no -p 2222 admin@localhost "echo ok" && break
        sleep 2
      done

      ssh -p 2222 admin@localhost "zpool status -x | grep -q 'all pools are healthy'"
      ssh -p 2222 admin@localhost "systemctl is-system-running"
      ssh -p 2222 admin@localhost "sudo kexport qcow2"
      scp -P 2222 admin@localhost:/root/kldload-export-*.qcow2 golden-${CI_COMMIT_SHORT_SHA}.qcow2
  artifacts:
    paths:
      - golden-${CI_COMMIT_SHORT_SHA}.qcow2
    expire_in: 30 days

promote-image:
  stage: promote
  needs: [install-test]
  only: [main]
  script:
    - sha256sum golden-*.qcow2 > checksum.sha256
    - aws s3 cp golden-*.qcow2 s3://infra-images/golden/
    - aws s3 cp checksum.sha256 s3://infra-images/golden/

deploy-staging:
  stage: deploy
  needs: [promote-image]
  only: [main]
  script:
    - cd terraform/staging
    - terraform init
    - terraform apply -auto-approve -var="image_tag=${CI_COMMIT_SHORT_SHA}"

Fleet Updates with ZFS

Traditional fleet updates mean running apt upgrade or dnf update on every machine and hoping nothing breaks. kldload fleet updates use ZFS primitives: send/receive for image distribution, boot environments for atomic rollback, and canary deployments for safe promotion.

ZFS send/receive for image distribution

Build a new golden image. Send the incremental delta to every machine in the fleet. No full image transfer — only the blocks that changed:

# On the build server: create new boot environment from golden image
# Assume golden-v42 is the current version, golden-v43 is the new one

# Take a snapshot of the new version
zfs snapshot rpool/ROOT/debian@golden-v43

# Send incremental delta to every machine in the fleet
for host in web-01 web-02 web-03 db-01 db-02 cache-01; do
  echo "Updating $host..."
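  # An incremental stream needs the same golden-v42 snapshot on the target;
  # -o origin receives the delta as a new clone of that snapshot
  zfs send -i rpool/ROOT/debian@golden-v42 rpool/ROOT/debian@golden-v43 | \
    ssh admin@$host "sudo zfs receive -o origin=rpool/ROOT/debian@golden-v42 rpool/ROOT/debian-v43"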
  echo "$host: received $(ssh admin@$host 'sudo zfs list -o used rpool/ROOT/debian-v43 | tail -1')"
done

# The incremental send only transfers changed blocks
# A kernel update + 50 package updates might be 200MB, not 4GB

# On each machine: create a boot environment from the received snapshot
for host in web-01 web-02 web-03 db-01 db-02 cache-01; do
  ssh admin@$host "sudo kbe create golden-v43 rpool/ROOT/debian-v43"
done
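
An incremental stream can only be applied on hosts that already hold the base snapshot. A minimal sketch of a preflight that reports hosts missing it, assuming the admin user and passwordless sudo used in the loop above; hosts_missing_base is a hypothetical helper name:

```shell
#!/bin/bash
# Print the hosts that do NOT have the incremental base snapshot.
# Usage: hosts_missing_base <snapshot> <host>...
hosts_missing_base() {
  local snap="$1" host
  shift
  for host in "$@"; do
    # zfs list exits non-zero when the named snapshot does not exist;
    # ssh -n keeps ssh from swallowing the caller's stdin
    if ! ssh -n "admin@$host" "sudo zfs list -H -o name '$snap'" >/dev/null 2>&1; then
      echo "$host"
    fi
  done
}

# Example:
#   hosts_missing_base rpool/ROOT/debian@golden-v42 web-01 web-02 web-03
```

Hosts printed here would need a full send, or a re-image, before they can take incremental updates.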

Rolling updates with boot environments

#!/bin/bash
# rolling-update.sh — update fleet one machine at a time with automatic rollback

FLEET="web-01 web-02 web-03"
NEW_BE="golden-v43"
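HEALTH_URL="http://localhost:8080/health"  # probed on each host after reboot
ROLLBACK_TIMEOUT=120

for host in $FLEET; do
  echo "=== Updating $host ==="

  # 1. Drain from load balancer
  ssh admin@lb "sudo nft delete element inet filter lb_pool { $host }"

  # 2. Set new boot environment as default
  ssh admin@$host "sudo kbe activate $NEW_BE"

  # 3. Record the current boot ID, then reboot into the new boot environment
  old_boot=$(ssh admin@$host "cat /proc/sys/kernel/random/boot_id")
  ssh admin@$host "sudo reboot" || true

  # 4. Wait until the host answers with a different boot ID, so a reply
  #    from the system that has not shut down yet is not mistaken for success
  echo "Waiting for $host to reboot..."
  for i in $(seq 1 $ROLLBACK_TIMEOUT); do
    new_boot=$(ssh -o ConnectTimeout=2 admin@$host "cat /proc/sys/kernel/random/boot_id" 2>/dev/null) || new_boot=""
    [ -n "$new_boot" ] && [ "$new_boot" != "$old_boot" ] && break
    sleep 1
  done

  # 5. Verify health
  if ssh admin@$host "curl -sf $HEALTH_URL > /dev/null"; then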
    echo "$host: healthy on $NEW_BE"
    # 6. Add back to load balancer
    ssh admin@lb "sudo nft add element inet filter lb_pool { $host }"
  else
    echo "$host: UNHEALTHY — rolling back"
    # Rollback: reactivate previous boot environment
    ssh admin@$host "sudo kbe activate golden-v42 && sudo reboot"
    echo "ABORT: $host failed health check. Fleet update halted."
    exit 1
  fi

  echo "$host: complete. Moving to next."
  sleep 10  # soak time between hosts
done

echo "Fleet update complete. All hosts on $NEW_BE."

The key insight: the rollback is just a reboot. The previous boot environment is still there, untouched, on the same disk. kbe activate golden-v42 && reboot and the machine comes back on the old version. No restoring backups. No re-running Ansible. No hoping the package downgrade works. The old boot environment is a complete, bootable, tested snapshot of the previous state. Boot into it. Done.

This is why ZFS boot environments change the economics of updates. The cost of a failed update is one reboot (30 seconds), not a 2-hour incident. When rollback is free, you update more often. When you update more often, each update is smaller. When each update is smaller, each update is safer. It is a virtuous cycle that starts with ZFS and boot environments.
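
After a reboot it is worth confirming which root dataset the machine actually came up on, since a bootloader fallback would otherwise look like a successful update. A minimal sketch, assuming a boot environment maps to a root dataset such as rpool/ROOT/debian-v43 (how kbe names its datasets is an assumption here); booted_into is a hypothetical helper:

```shell
#!/bin/bash
# Print the dataset currently mounted at / (on a ZFS root, findmnt reports the dataset name).
booted_root_dataset() {
  findmnt -n -o SOURCE /
}

# Succeed only if the running system's root is the expected dataset.
booted_into() {
  local expected="$1"
  [ "$(booted_root_dataset)" = "$expected" ]
}

# Example, run on the updated host:
#   booted_into rpool/ROOT/debian-v43 || echo "still on the old boot environment" >&2
```
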

Canary deployments

Update one machine first. If it survives a soak period, update the rest. If it fails, roll back just the canary:

#!/bin/bash
# canary-deploy.sh — test on one machine before fleet rollout

CANARY="web-01"
FLEET="web-02 web-03"
NEW_BE="golden-v43"
SOAK_MINUTES=30

# Phase 1: Deploy to canary
echo "Phase 1: Deploying to canary ($CANARY)"
ssh admin@$CANARY "sudo kbe activate $NEW_BE && sudo reboot"

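# Wait for canary to come back, then fail fast if the base system is unhealthy
sleep 60
if ! ssh admin@$CANARY "zpool status -x | grep -q 'all pools are healthy' && systemctl is-system-running"; then
  echo "CANARY FAILED post-boot checks, rolling back"
  ssh admin@$CANARY "sudo kbe activate golden-v42 && sudo reboot"
  exit 1
fi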

# Phase 2: Soak test
echo "Phase 2: Soak testing for $SOAK_MINUTES minutes"
for i in $(seq 1 $SOAK_MINUTES); do
  if ! ssh admin@$CANARY "curl -sf http://localhost:8080/health > /dev/null"; then
    echo "CANARY FAILED at minute $i — rolling back"
    ssh admin@$CANARY "sudo kbe activate golden-v42 && sudo reboot"
    exit 1
  fi
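  # Check error rate from metrics (-G with --data-urlencode stops curl from globbing the {} and [] in the query)
  ERROR_RATE=$(curl -sG "http://prometheus.infra.local:9090/api/v1/query" \
    --data-urlencode "query=rate(http_errors_total{host=\"$CANARY\"}[5m])" | jq -r '.data.result[0].value[1] // "0"')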
  if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
    echo "CANARY ERROR RATE HIGH ($ERROR_RATE) — rolling back"
    ssh admin@$CANARY "sudo kbe activate golden-v42 && sudo reboot"
    exit 1
  fi
  sleep 60
done

# Phase 3: Canary passed — update fleet
echo "Phase 3: Canary passed. Updating fleet."
for host in $FLEET; do
  ssh admin@$host "sudo kbe activate $NEW_BE && sudo reboot"
  sleep 60
  ssh admin@$host "systemctl is-system-running" || {
    echo "FAILED: $host — halting fleet update"
    exit 1
  }
done

echo "Fleet update complete. All hosts on $NEW_BE."
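
The soak loop compares the error rate with bc, which does not parse exponent notation, so a small value reported as 1.5e-05 would break the comparison. A minimal sketch of the same threshold check using awk instead; rate_exceeds is a hypothetical helper name:

```shell
#!/bin/bash
# Succeed (exit 0) when rate > threshold. awk parses both plain decimals
# and exponent notation such as 1.5e-05.
rate_exceeds() {
  local rate="$1" threshold="$2"
  awk -v r="$rate" -v t="$threshold" 'BEGIN { exit !((r + 0) > (t + 0)) }'
}

# Example, as a drop-in for the bc test in the soak loop:
#   if rate_exceeds "$ERROR_RATE" 0.01; then echo "error rate too high"; fi
```
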

Testing Infrastructure

Every golden image should be tested before deployment. Boot it in KVM, run smoke tests, verify ZFS health, check services, validate configuration. If the tests pass, the image is good. If they fail, fix the build and try again. The image never reaches production untested.

Automated validation pipeline

#!/bin/bash
# test-golden-image.sh — boot a golden image, validate, destroy
# Usage: ./test-golden-image.sh golden-v20260404.qcow2

set -euo pipefail

IMAGE="$1"
VM_NAME="test-$(date +%s)"
SSH_PORT=$((2200 + RANDOM % 100))
PASS=0
FAIL=0

log()  { echo "[$(date +%H:%M:%S)] $*"; }
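# Plain assignment: ((PASS++)) returns status 1 when PASS is 0, which aborts the script under set -e
pass() { PASS=$((PASS + 1)); log "PASS: $*"; }
fail() { FAIL=$((FAIL + 1)); log "FAIL: $*"; }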

# Create a temporary CoW copy — do not modify the original
cp --reflink=auto "$IMAGE" "/tmp/${VM_NAME}.qcow2"

# Boot the image
log "Booting $IMAGE as $VM_NAME (SSH port $SSH_PORT)"
qemu-system-x86_64 -enable-kvm -m 4096 -smp 4 \
  -bios /usr/share/OVMF/OVMF_CODE.fd \
  -drive "file=/tmp/${VM_NAME}.qcow2,format=qcow2,if=virtio" \
  -display none -daemonize \
  -pidfile "/tmp/${VM_NAME}.pid" \
  -net nic,model=virtio -net "user,hostfwd=tcp::${SSH_PORT}-:22"

# Wait for SSH
log "Waiting for SSH..."
for i in $(seq 1 90); do
  ssh -o StrictHostKeyChecking=no -o ConnectTimeout=2 -p $SSH_PORT admin@localhost \
    "echo ready" 2>/dev/null && break
  sleep 2
done

run_test() {
  local desc="$1"; shift
  if ssh -p $SSH_PORT admin@localhost "$@" &>/dev/null; then
    pass "$desc"
  else
    fail "$desc"
  fi
}

# ── ZFS tests ──
run_test "ZFS pool healthy"           "sudo zpool status -x | grep -q 'all pools are healthy'"
run_test "ZFS pool is rpool"          "sudo zpool list rpool"
run_test "Root dataset exists"        "sudo zfs list rpool/ROOT"
run_test "Compression enabled"        "sudo zfs get compression rpool | grep -q lz4"
run_test "Snapshots exist"            "sudo zfs list -t snapshot | grep -q rpool"
run_test "Dataset hierarchy correct"  "sudo zfs list rpool/home && sudo zfs list rpool/var/log"

# ── Boot tests ──
run_test "System running"             "systemctl is-system-running | grep -qE 'running|degraded'"
run_test "No failed units"            "systemctl --failed --no-pager | grep -q '0 loaded'"
run_test "UEFI boot"                  "test -d /sys/firmware/efi"
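# el9 only matches RHEL-family 9 kernels; KERNEL_PATTERN is a hypothetical override for other distros
run_test "Correct kernel"             "uname -r | grep -q '${KERNEL_PATTERN:-el9}'"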

# ── Network tests ──
run_test "Network configured"         "ip addr show | grep -q 'inet '"
run_test "DNS resolves"               "getent hosts github.com"
run_test "SSH listening"              "ss -tlnp | grep -q ':22'"

# ── Security tests ──
run_test "Root login disabled"        "sudo grep -q 'PermitRootLogin no' /etc/ssh/sshd_config"
run_test "Firewall active"            "sudo nft list ruleset | grep -q 'table'"
run_test "SELinux enforcing (if installed)" "! command -v getenforce >/dev/null || getenforce | grep -q Enforcing"

# ── Module tests ──
run_test "ZFS module loaded"          "lsmod | grep -q zfs"
run_test "WireGuard module available" "modinfo wireguard &>/dev/null"

# ── Cleanup ──
log "Destroying test VM"
kill $(cat "/tmp/${VM_NAME}.pid") 2>/dev/null || true
rm -f "/tmp/${VM_NAME}.qcow2" "/tmp/${VM_NAME}.pid"

# ── Report ──
echo ""
echo "════════════════════════════════════════"
echo "  Results: $PASS passed, $FAIL failed"
echo "  Image:   $IMAGE"
echo "════════════════════════════════════════"

[ $FAIL -eq 0 ] && exit 0 || exit 1

This test script is the gate between build and deploy. No image passes to production without running it. The tests are fast — boot (30 seconds), test (10 seconds), destroy (instant). The temporary copy ensures the golden image is never modified by testing. The test matrix covers ZFS health, boot correctness, networking, security, and kernel modules.

In a CI pipeline, this script runs automatically after every image build. Green means the image is promotable. Red means the build broke something. The developer gets immediate feedback. No "we deployed it and found out three days later that ZFS was not mounted." The test catches it before the image leaves the build server.
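
Because the script runs under set -e, an early failure would skip the cleanup section and leave the temporary disk copy behind. A minimal sketch of one way to guarantee cleanup with an EXIT trap; with_scratch_dir is a hypothetical helper, not part of the script above:

```shell
#!/bin/bash
# Run a command with a scratch directory that is removed on exit,
# whether the command succeeds or fails. The directory path is
# appended as the command's last argument.
with_scratch_dir() {
  (
    scratch=$(mktemp -d) || exit 1
    trap 'rm -rf "$scratch"' EXIT
    "$@" "$scratch"
  )
}

# Example: a test function that copies the image into the scratch dir
#   boot_and_test() { cp --reflink=auto "$1" "$2/disk.qcow2"; ...; }
#   with_scratch_dir boot_and_test golden-v43.qcow2
```

The subshell keeps the trap local, so the caller's own traps are left alone and the command's exit status is passed through.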

Multi-distro test matrix

Test every distro your fleet runs. One script, all distros:

#!/bin/bash
# test-all-distros.sh — validate golden images for every supported distro

DISTROS="debian ubuntu centos rocky fedora"
RESULTS=()

for distro in $DISTROS; do
  IMAGE="golden-${distro}-server-v20260404.qcow2"

  if [ ! -f "$IMAGE" ]; then
    echo "SKIP: $IMAGE not found"
    continue
  fi

  echo "Testing $distro..."
  if ./test-golden-image.sh "$IMAGE"; then
    RESULTS+=("$distro: PASS")
  else
    RESULTS+=("$distro: FAIL")
    FAILED=1
  fi
done

echo ""
echo "════════════════════════════"
echo "  Multi-Distro Test Results"
echo "════════════════════════════"
printf '%s\n' "${RESULTS[@]}"

# Exit non-zero if any distro failed so CI can gate on the result
exit "${FAILED:-0}"

Image-Based vs Config Management

The fundamental question: do you build a new image for every change, or do you patch running systems? The answer is both — but with clear boundaries. The base OS is an image. The application layer is config management. kldload builds the image. Ansible manages the application. The two never overlap.

| Dimension          | Image-based (kldload)      | Config management (Ansible) |
|--------------------|----------------------------|-----------------------------|
| Deploy time        | Boot time (15-30s)         | Convergence (10-30min)      |
| Reproducibility    | Byte-identical             | Eventual consistency        |
| Network dependency | None (offline)             | Package mirrors, SSH        |
| Drift              | Impossible (immutable)     | Constant battle             |
| Rollback           | Boot env switch (30s)      | Revert playbook (maybe)     |
| Testing            | Boot image, test, destroy  | Molecule/Vagrant + hope     |
| Scaling            | Copy image (O(1) per node) | Run playbook (O(n) tasks)   |
| Security audit     | Scan one image             | Scan every machine          |

The drift problem

Configuration management tools converge systems to a desired state. But between convergence runs, anything can happen. Someone SSHs in and changes a config. A package auto-updates. A cron job modifies a file. The system drifts from the declared state. The next Ansible run tries to fix it, but drift accumulates faster than convergence corrects it. After six months, no two machines in the fleet are identical.
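
Drift is measurable. A minimal sketch that compares files against a checksum manifest, assuming such a manifest was recorded when the image was built; the manifest and the detect_drift name are hypothetical, and kldload is not known to ship either:

```shell
#!/bin/bash
# Report files that differ from a checksum manifest.
# The manifest is standard `sha256sum` output: "<hash>  <path>" per line.
detect_drift() {
  local manifest="$1"
  # --quiet prints only files that fail verification;
  # the exit status is non-zero if any file differs or is missing
  sha256sum --check --quiet "$manifest" 2>/dev/null
}

# Example:
#   sha256sum /etc/ssh/sshd_config /etc/nftables.conf > manifest.sha256   # at build time
#   detect_drift manifest.sha256 || echo "drift detected"                  # on a running host
```
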

Config management is a janitor that cleans up after people. Image-based deployment locks the door.

The immutable answer

Image-based deployment eliminates drift by definition. The base OS is an image. It boots from that image. It cannot drift from that image because it IS the image. There is no "someone SSHed in and changed the config" because the config is in the image and the image is read-only. Need a change? Build a new image. Test it. Deploy it. The old image stays around as a boot environment in case you need to go back.

You do not patch a container. You build a new one. Same principle, applied to the entire OS.

I am not anti-Ansible. I use Ansible every day. But I use it for the right job: application-layer changes on top of a known-good base image. Deploy a new version of the app. Rotate a secret. Update a feature flag. Toggle a configuration. These are things that change frequently and need to change without rebuilding the entire OS.

What I stopped using Ansible for: installing ZFS, compiling DKMS modules, configuring bootloaders, setting up WireGuard, creating dataset hierarchies, tuning kernel parameters, hardening SSH, configuring nftables base rules. All of that is in the image now. Baked in. Tested. Immutable. Ansible went from 400 tasks to 40. The 40 remaining tasks are application-specific. They run in 2 minutes instead of 20. They fail less because they depend on less. The base is solid. Ansible just decorates it.

This is the ideal architecture: image-based for the platform layer, config management for the application layer. Two tools, clean separation, clear responsibilities. kldload builds the platform. Ansible manages the application. Git versions both. CI automates both. The result is infrastructure that is truly defined by code — not by whatever Ansible managed to converge this time.

The two-layer architecture

┌─────────────────────────────────────────────────────────┐
│  APPLICATION LAYER  (changes often — config management) │
│                                                         │
│  App containers, configs, secrets, feature flags        │
│  Managed by: Ansible / Salt / Puppet / shell scripts    │
│  Deploys: per-change, multiple times per day            │
│  Rollback: ZFS snapshot rollback (30 seconds)           │
├─────────────────────────────────────────────────────────┤
│  PLATFORM LAYER    (changes rarely — golden image)      │
│                                                         │
│  OS, ZFS, WireGuard, eBPF, boot environments, kernel    │
│  Managed by: kldload → Packer → Terraform               │
│  Deploys: per-release, weekly/monthly                   │
│  Rollback: boot environment switch (30 seconds)         │
└─────────────────────────────────────────────────────────┘

Both layers roll back in 30 seconds.
Both layers are versioned in git.
Both layers are tested in CI.
Neither layer knows or cares about the other's internals.

The bottom line: Infrastructure as Code is not about which YAML dialect you write. It is about whether your infrastructure is deterministic, reproducible, and version-controlled. kldload makes the base image deterministic. Packer makes the application image deterministic. Terraform makes the deployment deterministic. Git makes everything version-controlled. Together, they form a pipeline where git push is the only operation and the entire stack — from disk partitioning to application deployment — is defined by code.

That is IaC. Not "we have Terraform files."

Quick reference — the complete IaC pipeline

# 1. Build the base image (kldload — Stage 0)
cd /root/kldload-free
git pull && PROFILE=server ./deploy.sh build

# 2. Install and export golden image
# (unattended via answer file + seed disk)
kexport qcow2

# 3. Layer application packages (Packer — Stage 1)
cd /root/packer
packer build -var "kldload_image=golden-v43.qcow2" kldload-webapp.pkr.hcl

# 4. Deploy to infrastructure (Terraform — Stage 2)
cd /root/terraform/production
terraform apply -var "image_tag=v43"

# 5. Configure applications (Ansible — Stage 3)
cd /root/ansible
ansible-playbook -i inventory/production playbooks/deploy-webapp.yaml

# 6. Verify everything
./scripts/test-golden-image.sh golden-webapp-v43.qcow2
for host in $(terraform -chdir=/root/terraform/production output -json ips | jq -r '.[]'); do
  ssh admin@$host "zpool status -x && systemctl is-system-running"
done

# Rollback any stage independently:
# Stage 0-1: kbe activate golden-v42 && reboot     (30 seconds)
# Stage 2:   terraform apply -var "image_tag=v42"   (2 minutes)
# Stage 3:   zfs rollback -r rpool/srv/app@deploy-2.3.8-post  (instant)