Packer & IaC Masterclass
This guide covers the full image-factory pipeline: building golden images with kldload, automating that build with Packer, and deploying the results at scale with Terraform. If you have installed kldload and exported an image manually, this is the next step — making the entire process repeatable, versioned, and automated across every cloud and hypervisor you run.
What this page covers: the philosophy of image-based deployment, the kldload image pipeline and kexport tool, Packer templates for QEMU/AWS/GCP/Azure, Terraform configs for KVM/libvirt and all three major clouds, the golden image lifecycle, secrets injection, and a complete CI/CD pipeline that goes from git push to production fleet.
Prerequisites: a running kldload build environment, familiarity with the unattended install and export formats guides, and a basic understanding of how VMs and cloud instances work.
1. Images Are the Deployment Unit
The dominant infrastructure pattern of the last decade is simple to state and hard to fully internalize: you do not configure servers — you build images and deploy them. The image contains the OS, the application, the runtime configuration, the kernel tuning, the security baseline. Deploying a new server means booting a known-good image, not running a 200-line Ansible playbook against an unknown base.
This is not just a DevOps aesthetic. It solves a class of production problems that configuration management cannot. A server that has been patched in place twelve times is a unique snowflake. It has accumulated state, partial upgrades, leftover config from services that were removed, and subtle drift from every other server in the fleet. When it breaks at 2am you cannot reproduce the failure anywhere else. An image-based server is identical to every other server built from that image. When it breaks, you deploy the previous image version in under a minute.
The shift in one sentence: Packer builds the image. Terraform deploys it. kldload provides the ZFS-rooted base. Together they give you immutable infrastructure — servers that are replaced, never patched in place.
kldload is an image factory: it builds golden images on ZFS, exports them to qcow2, vmdk, vhd, ova, or raw, and deploys them anywhere — on-prem KVM, Proxmox, AWS, GCP, Azure, or bare metal. The ISO installer is the image builder. The darksites inside the ISO mean no internet access is required during the build. The entire pipeline runs in an air-gapped room if needed.
What an image contains
The complete OS filesystem, kernel, drivers, installed packages, configuration files, systemd units, compiled applications, and tuning parameters. Everything needed to boot a fully functional server with no further configuration.
What an image does NOT contain
Secrets. Hostnames. IP addresses. SSH host keys. Machine IDs. Anything that must be unique per instance is injected at deploy time via cloud-init or environment variables. The image is a template, not a server.
Immutable means replaceable
An immutable server is never modified after deployment. Configuration changes happen in the image pipeline, not on live servers. Upgrading means deploying a new image and destroying the old one. Rollback means deploying the previous image version.
The toolchain
Packer automates image creation from a declarative template. Terraform declares the infrastructure that runs those images. kldload provides the hardened ZFS base with the darksites for offline builds. Git is the source of truth for all of it.
2. The kldload Image Pipeline
Most image pipelines use Packer to boot an OS ISO, answer installer prompts via a preseed or kickstart file, wait while the installer downloads packages from the internet, then export the disk. kldload's approach is different at each step.
Build, seal, export
The kldload image pipeline has three phases:
- Build: Boot the kldload ISO in a VM. Run the unattended installer, which reads an answers file and installs the target distro to disk — ZFS on root, WireGuard, eBPF tools, and all selected packages. Because the darksites are baked into the ISO, no internet access is required. A full install completes in 3–5 minutes.
- Seal: Run kexport seal (or call k_seal_image_for_clone() directly). This clears the machine ID, removes SSH host keys, enables cloud-init with a multi-datasource config that auto-detects AWS, GCP, Azure, and NoCloud, and exports the ZFS pools. The system is now a template, not a server.
- Export: Run kexport convert. This calls qemu-img convert to produce the target format. Optionally SCP the image to a remote host, upload to an object store, or register as a cloud provider image.
Export formats
| Format | Target platforms | Notes |
|---|---|---|
| qcow2 | KVM, Proxmox, OpenStack | Native QEMU format, supports snapshots and thin provisioning |
| vmdk | VMware ESXi, vSphere, Fusion | Use streamOptimized subformat for OVA packaging |
| vhd / vhdx | Hyper-V, Azure | Azure requires fixed-size VHD; use --subformat fixed |
| ova | VMware, VirtualBox, generic | Self-contained archive: vmdk + OVF descriptor |
| raw | Bare metal, any hypervisor | dd directly to a disk; import into any platform with qemu-img convert |
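Since kexport convert wraps qemu-img convert, it can help to see what an equivalent manual conversion might look like for each format in the table. The sketch below is only an illustration: qemu_img_args and plan_conversions are hypothetical helper names, the input is assumed to be a raw disk image, and the options kexport actually passes may differ. OVA is left out because it is an archive (vmdk plus OVF descriptor), not a qemu-img output format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an export format name to qemu-img convert arguments.
# This mirrors the table above; it is not kexport's actual implementation.
qemu_img_args() {
  case "$1" in
    qcow2) echo "-O qcow2" ;;
    vmdk)  echo "-O vmdk -o subformat=streamOptimized" ;;
    vhd)   echo "-O vpc -o subformat=fixed,force_size" ;;  # fixed-size VHD for Azure
    vhdx)  echo "-O vhdx" ;;
    raw)   echo "-O raw" ;;
    *)     echo "unsupported format: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) one conversion command per requested format.
# Assumes the input is a raw disk image.
plan_conversions() {
  local input="$1" outdir="$2"; shift 2
  local fmt args
  for fmt in "$@"; do
    args="$(qemu_img_args "$fmt")" || return 1
    echo "qemu-img convert -f raw ${args} ${input} ${outdir}/image.${fmt}"
  done
}
```

Calling plan_conversions /images/disk.raw /images/out qcow2 vhd prints two commands that can be reviewed before anything is run.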
The kexport tool
# Seal the installed system for cloning (run on the installed target, before export)
kexport seal
# What kexport seal does:
# - Truncates /etc/machine-id (systemd regenerates on next boot)
# - Removes /etc/ssh/ssh_host_* (new keys generated on first boot)
# - Writes /etc/cloud/cloud.cfg.d/99-datasource.cfg with multi-datasource list
# - Exports all ZFS pools (zpool export -a)
# - Sets a firstboot flag so cloud-init runs on next boot
# Convert the disk image (run from the build host after the VM is shut down)
kexport convert --format qcow2 --input /dev/vda --output /images/kldload-server-v1.0.0.qcow2
# Export to multiple formats in one pass
kexport convert --format qcow2,vmdk,vhd \
--input /dev/vda \
--output-dir /images/kldload-server-v1.0.0/
# SCP to remote host after conversion
kexport convert --format qcow2 \
--input /dev/vda \
--output /images/kldload-server-v1.0.0.qcow2 \
--scp-target images@build.example.com:/exports/
3. Packer Basics
Packer is a tool from HashiCorp that automates the creation of machine images from a declarative template. You describe what you want — which ISO to boot, what commands to run, how to export the result — and Packer handles the VM lifecycle, the boot sequence, and the export. The same template can produce images for a dozen different platforms.
Builders
A builder is a Packer plugin that creates a VM on a specific platform, boots it, and waits for provisioners to run. Common builders: qemu (local KVM), proxmox (Proxmox VE API), amazon-ebs (build in EC2), googlecompute (build in GCE), azure-arm (build in Azure).
Provisioners
Provisioners run after the VM boots and before the image is exported. Common provisioners: shell (run bash scripts), file (upload files), ansible (run an Ansible playbook), salt-masterless (run Salt states). This is where you install software, configure services, and seal the image.
Post-processors
Post-processors run after the VM is shut down and the image is captured. Common post-processors: compress (gzip the image), checksum (sha256sum for verification), manifest (write a JSON manifest of all outputs), vagrant (package as a Vagrant box).
HCL2 template format
Modern Packer uses HCL2 (HashiCorp Configuration Language v2) — the same language as Terraform. Templates are .pkr.hcl files. Variables, locals, expressions, and loops are all supported. The old JSON format still works but HCL2 is the standard for new templates.
Install Packer
# On a kldload host (CentOS Stream 9 / RHEL / Rocky)
sudo dnf config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo
sudo dnf install -y packer
# On Debian / Ubuntu
wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | \
sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt-get update && sudo apt-get install -y packer
# Verify
packer --version
# Install required plugins (run in your template directory)
packer init .
4. Building a kldload Image with Packer (QEMU Builder)
The QEMU builder creates a local VM, boots the kldload ISO, runs the installer, and exports the disk. This is the foundational build — the same image produced here becomes the source for cloud uploads, so it is worth getting right.
Directory structure
packer/
  kldload-server/
    kldload-server.pkr.hcl      # main template
    variables.pkrvars.hcl       # default variable values
    profiles/
      server.pkrvars.hcl        # server profile overrides
      desktop.pkrvars.hcl       # desktop profile overrides
      k8s-node.pkrvars.hcl      # Kubernetes node profile
    scripts/
      post-install.sh           # runs inside the VM after install
      seal.sh                   # calls kexport seal
Main template: kldload-server.pkr.hcl
packer {
required_plugins {
qemu = {
version = ">= 1.0.9"
source = "github.com/hashicorp/qemu"
}
}
}
# ─── Variables ────────────────────────────────────────────────────────────────
variable "iso_url" {
type = string
default = "/images/kldload-desktop-1.0.2-x86_64.iso"
}
variable "iso_checksum" {
type = string
default = "file:/images/kldload-desktop-1.0.2-x86_64.iso.sha256"
}
variable "disk_size" {
type = string
default = "40960" # 40 GiB in MiB
}
variable "memory" {
type = number
default = 4096
}
variable "cpus" {
type = number
default = 4
}
variable "output_dir" {
type = string
default = "/images/output"
}
variable "image_name" {
type = string
default = "kldload-server"
}
variable "image_version" {
type = string
default = "1.0.0"
}
variable "target_distro" {
type = string
default = "centos" # centos | debian | ubuntu | fedora | rocky
}
variable "install_profile" {
type = string
default = "server" # server | desktop | core
}
variable "ssh_username" {
type = string
default = "root"
}
variable "ssh_password" {
type = string
default = "kldload"
sensitive = true
}
# ─── Locals ───────────────────────────────────────────────────────────────────
locals {
output_filename = "${var.image_name}-${var.image_version}-${var.target_distro}"
timestamp = formatdate("YYYYMMDD", timestamp())
}
# ─── Source: QEMU ─────────────────────────────────────────────────────────────
source "qemu" "kldload" {
# ISO to boot
iso_url = var.iso_url
iso_checksum = var.iso_checksum
# Disk
disk_size = var.disk_size
disk_interface = "virtio"
format = "qcow2"
output_directory = "${var.output_dir}/${local.output_filename}"
vm_name = "${local.output_filename}.qcow2"
# Machine
machine_type = "q35"
memory = var.memory
cpus = var.cpus
net_device = "virtio-net"
# UEFI boot (kldload requires UEFI)
efi_boot = true
efi_firmware_code = "/usr/share/edk2/ovmf/OVMF_CODE.fd"
efi_firmware_vars = "/usr/share/edk2/ovmf/OVMF_VARS.fd"
# Boot command: kldload live environment autologins as root
# We write an answers file and kick off the unattended installer
boot_wait = "15s"
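boot_command = [
# Wait for the live desktop/shell to come up, then write the answers file.
# Every typed line ends with <enter>; without it the shell never executes the line.
"",
"cat > /tmp/answers.env << 'EOF'<enter>",
"K_TARGET_DISTRO=${var.target_distro}<enter>",
"K_INSTALL_PROFILE=${var.install_profile}<enter>",
"K_DISK=vda<enter>",
"K_HOSTNAME=kldload-template<enter>",
"K_TIMEZONE=UTC<enter>",
# Keep the root password in sync with the SSH communicator password below
"K_ROOT_PASSWORD=${var.ssh_password}<enter>",
"K_INSTALL_USER=ops<enter>",
"K_INSTALL_USER_PASSWORD=kldload<enter>",
"EOF<enter>",
"",
# Launch the unattended installer
"kldload-install-target --answers /tmp/answers.env --unattended<enter>"
]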
# SSH connection (installer reboots into the target system)
communicator = "ssh"
ssh_username = var.ssh_username
ssh_password = var.ssh_password
ssh_timeout = "30m"
ssh_handshake_attempts = 30
# Shutdown
shutdown_command = "shutdown -h now"
shutdown_timeout = "5m"
# Headless build (no GUI window)
headless = true
# QEMU extra args for performance
qemuargs = [
["-cpu", "host"],
["-enable-kvm"]
]
}
# ─── Build ────────────────────────────────────────────────────────────────────
build {
name = "kldload-server"
sources = ["source.qemu.kldload"]
# Wait for the system to fully come up after install reboot
provisioner "shell" {
inline = ["echo 'System is up'", "uname -a", "zpool status"]
}
# Run post-install configuration
provisioner "shell" {
script = "scripts/post-install.sh"
environment_vars = [
"IMAGE_NAME=${var.image_name}",
"IMAGE_VERSION=${var.image_version}",
"TARGET_DISTRO=${var.target_distro}"
]
}
# Seal the image for cloning
provisioner "shell" {
script = "scripts/seal.sh"
}
# Write a manifest
post-processor "manifest" {
output = "${var.output_dir}/${local.output_filename}/manifest.json"
strip_path = false
}
# Checksum
post-processor "checksum" {
checksum_types = ["sha256"]
output = "${var.output_dir}/${local.output_filename}/${local.output_filename}.{{.ChecksumType}}sum"
}
}
Post-install script: scripts/post-install.sh
#!/bin/bash
set -euo pipefail
echo "=== Post-install configuration ==="
echo "Image: ${IMAGE_NAME} v${IMAGE_VERSION} (${TARGET_DISTRO})"
# Install additional packages specific to this image type
if command -v dnf &>/dev/null; then
dnf install -y htop tmux vim-enhanced nmap-ncat
elif command -v apt-get &>/dev/null; then
apt-get install -y -q htop tmux vim ncat
fi
# Configure sshd for image use
cat > /etc/ssh/sshd_config.d/99-image.conf << 'EOF'
PermitRootLogin prohibit-password
PasswordAuthentication no
ChallengeResponseAuthentication no
EOF
# Enable services that should start on first boot
systemctl enable cloud-init
systemctl enable cloud-init-local
systemctl enable cloud-config
systemctl enable cloud-final
# Write image metadata
mkdir -p /etc/kldload
cat > /etc/kldload/image-metadata.json << EOF
{
"image_name": "${IMAGE_NAME}",
"image_version": "${IMAGE_VERSION}",
"target_distro": "${TARGET_DISTRO}",
"build_timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"builder": "packer-qemu"
}
EOF
echo "=== Post-install complete ==="
Seal script: scripts/seal.sh
#!/bin/bash
set -euo pipefail
echo "=== Sealing image for cloning ==="
# Clear machine identity
truncate -s 0 /etc/machine-id
rm -f /var/lib/dbus/machine-id
# Remove SSH host keys (regenerated on first boot)
rm -f /etc/ssh/ssh_host_*
# Clear persistent network interface naming
rm -f /etc/udev/rules.d/70-persistent-net.rules
rm -f /etc/udev/rules.d/75-net-description.rules
# Clear bash history
unset HISTFILE
history -c
rm -f /root/.bash_history /home/*/.bash_history
# Remove cloud-init's "already ran" flag so it runs on first boot
rm -rf /var/lib/cloud/instances
rm -rf /var/lib/cloud/instance
cloud-init clean --logs
# Configure cloud-init multi-datasource (auto-detects AWS, GCP, Azure, NoCloud)
mkdir -p /etc/cloud/cloud.cfg.d
cat > /etc/cloud/cloud.cfg.d/99-datasource.cfg << 'EOF'
datasource_list:
- NoCloud
- ConfigDrive
- Ec2
- GCE
- Azure
- AltCloud
- OpenStack
- None
EOF
# Export ZFS pools so the image can be imported fresh on first boot
zpool export -a 2>/dev/null || true
echo "=== Image sealed ==="
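A sealed image is only useful if nothing unique survived the seal, so a quick check before export can catch mistakes early. The following is a minimal sketch, assuming the image's root filesystem is mounted (or that the check runs inside the VM with the default root of /). check_sealed is a hypothetical helper, not part of kexport, and it only checks the three items the seal script above is responsible for.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check_sealed [ROOT]
# Reports per-instance identity left under an image root.
# Returns 0 when the root looks sealed, 1 otherwise.
check_sealed() {
  local root="${1:-/}" failures=0 f

  # machine-id must be missing or empty so systemd regenerates it on first boot
  if [ -s "${root}/etc/machine-id" ]; then
    echo "FAIL: ${root}/etc/machine-id is not empty"
    failures=$((failures + 1))
  fi

  # No SSH host keys may survive; they are regenerated on first boot
  for f in "${root}"/etc/ssh/ssh_host_*; do
    [ -e "$f" ] || continue
    echo "FAIL: host key present: $f"
    failures=$((failures + 1))
  done

  # cloud-init's per-instance link must be gone so it runs again on first boot
  if [ -e "${root}/var/lib/cloud/instance" ] || [ -L "${root}/var/lib/cloud/instance" ]; then
    echo "FAIL: cloud-init instance state present"
    failures=$((failures + 1))
  fi

  if [ "$failures" -eq 0 ]; then
    echo "OK: image root looks sealed"
    return 0
  fi
  return 1
}
```

Running check_sealed /mnt/image against a mounted copy of the disk, or check_sealed as the last line of seal.sh, fails the build instead of shipping a template with a baked-in identity.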
Build commands
# Initialize plugins
cd packer/kldload-server
packer init .
# Validate the template
packer validate kldload-server.pkr.hcl
# Build with default variables (CentOS server profile)
packer build kldload-server.pkr.hcl
# Build with a specific profile
packer build -var-file=profiles/k8s-node.pkrvars.hcl kldload-server.pkr.hcl
# Build all profiles in parallel
packer build \
-var-file=profiles/server.pkrvars.hcl \
kldload-server.pkr.hcl &
packer build \
-var-file=profiles/k8s-node.pkrvars.hcl \
kldload-server.pkr.hcl &
wait
echo "All builds complete"
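One caveat with the parallel form above: a bare wait returns success even when a background build failed, so "All builds complete" can print after a broken build. A possible way to handle this is to wait on each PID individually. In this sketch wait_all and build_profiles are hypothetical helper names, and the profile file layout is assumed to match the directory structure shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait_all PID...
# Waits for every background job and returns non-zero if any of them failed.
wait_all() {
  local pid failed=0
  for pid in "$@"; do
    if ! wait "$pid"; then
      echo "job ${pid} failed" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Hypothetical wrapper: build_profiles TEMPLATE PROFILE...
# Starts one packer build per profile in the background, then checks them all.
build_profiles() {
  local template="$1"; shift
  local pids=() profile
  for profile in "$@"; do
    packer build -var-file="profiles/${profile}.pkrvars.hcl" "$template" &
    pids+=("$!")
  done
  wait_all "${pids[@]}"
}
```

For example, build_profiles kldload-server.pkr.hcl server k8s-node && echo "All builds complete" prints the message only when every build exited cleanly.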
5. Cloud-Specific Packer Builds
Once you have a local qcow2 image, you can upload it to any cloud and register it as a native image. Alternatively, you can build the image directly in the cloud using the cloud provider's Packer builder — this is faster for cloud-specific images because you skip the upload step and build in the same region where the image will run.
AWS AMI (amazon-ebs builder)
packer {
required_plugins {
amazon = {
version = ">= 1.2.8"
source = "github.com/hashicorp/amazon"
}
}
}
variable "aws_region" {
type = string
default = "us-east-1"
}
variable "aws_instance_type" {
type = string
default = "t3.medium"
}
variable "base_ami" {
type = string
description = "A recent CentOS Stream 9 or Rocky 9 AMI to use as the base"
default = "ami-0xxxxxxxxxxxxxxxxx" # find with: aws ec2 describe-images
}
source "amazon-ebs" "kldload" {
region = var.aws_region
instance_type = var.aws_instance_type
source_ami = var.base_ami
ssh_username = "ec2-user"
ami_name = "kldload-server-${formatdate("YYYYMMDD", timestamp())}"
ami_description = "kldload golden image — ZFS + WireGuard + cloud-init"
ami_regions = [
"us-east-1",
"us-west-2",
"eu-west-1"
]
tags = {
Name = "kldload-server"
Version = "1.0.0"
BuildDate = formatdate("YYYY-MM-DD", timestamp())
ManagedBy = "packer"
}
# Encrypt the AMI root volume
encrypt_boot = true
kms_key_id = "alias/kldload-images"
# Launch block device for the AMI
launch_block_device_mappings {
device_name = "/dev/xvda"
volume_size = 20
volume_type = "gp3"
iops = 3000
throughput = 125
delete_on_termination = true
}
}
build {
name = "kldload-aws"
sources = ["source.amazon-ebs.kldload"]
provisioner "shell" {
script = "scripts/post-install-aws.sh"
}
provisioner "shell" {
script = "scripts/seal.sh"
}
}
GCP image (googlecompute builder)
source "googlecompute" "kldload" {
project_id = "my-gcp-project"
source_image_family = "centos-stream-9"
zone = "us-central1-a"
machine_type = "n2-standard-2"
image_name = "kldload-server-${formatdate("YYYYMMDD", timestamp())}"
image_description = "kldload golden image — ZFS + WireGuard + cloud-init"
image_family = "kldload-server"
image_labels = {
managed_by = "packer"
version = "1-0-0"
}
disk_size = 20
disk_type = "pd-ssd"
ssh_username = "packer"
}
build {
name = "kldload-gcp"
sources = ["source.googlecompute.kldload"]
provisioner "shell" {
script = "scripts/post-install-gcp.sh"
}
provisioner "shell" {
script = "scripts/seal.sh"
}
}
Azure managed image (azure-arm builder)
source "azure-arm" "kldload" {
# Authentication — use a service principal or managed identity
# Set via environment: ARM_CLIENT_ID, ARM_CLIENT_SECRET, ARM_SUBSCRIPTION_ID, ARM_TENANT_ID
managed_image_name = "kldload-server-${formatdate("YYYYMMDD", timestamp())}"
managed_image_resource_group_name = "kldload-images-rg"
os_type = "Linux"
image_publisher = "OpenLogic"
image_offer = "CentOS"
image_sku = "8_5-gen2"
azure_tags = {
ManagedBy = "packer"
Version = "1.0.0"
}
location = "eastus"
vm_size = "Standard_D2s_v5"
os_disk_size_gb = 30
# Azure requires VHDs to be fixed-size
# Packer handles this automatically for azure-arm
communicator = "ssh"
ssh_username = "packer"
}
build {
name = "kldload-azure"
sources = ["source.azure-arm.kldload"]
provisioner "shell" {
script = "scripts/post-install-azure.sh"
}
provisioner "shell" {
script = "scripts/seal.sh"
}
}
Cloud-init multi-datasource configuration
kldload's seal script writes a cloud-init datasource config that auto-detects the cloud environment. On first boot, cloud-init reads instance metadata from whatever metadata service is available — AWS IMDSv2, GCP metadata server, Azure IMDS, or a local NoCloud seed — and configures the hostname, network, and injected SSH keys automatically.
# /etc/cloud/cloud.cfg.d/99-datasource.cfg (written by kexport seal)
# datasource_list in priority order — first match wins
datasource_list:
- NoCloud # local: seed from ISO or filesystem (KVM, VirtualBox)
- ConfigDrive # OpenStack
- Ec2 # AWS (also works for Exoscale, Outscale, etc.)
- GCE # Google Cloud
- Azure # Azure
- AltCloud # RHEV / vSphere (legacy); CloudStack has its own CloudStack datasource
- OpenStack # generic OpenStack
- None # fallback: no cloud-init, run with defaults
6. Terraform Basics
Terraform is an infrastructure-as-code tool that declares what infrastructure you want, then creates, modifies, or destroys resources to match that declaration. You describe VMs, networks, DNS records, storage buckets, and load balancers in .tf files. Terraform figures out what needs to change and executes it in the right order.
Resources
A resource is a thing Terraform manages — a VM, a network, a DNS record, a storage bucket. Each resource has a type (e.g. libvirt_domain, aws_instance) and a set of arguments. Terraform tracks resources in state and reconciles them on every apply.
Providers
A provider is a plugin that knows how to talk to a specific API. Common providers: dmacvicar/libvirt (KVM), telmate/proxmox (Proxmox VE), hashicorp/aws, hashicorp/google, hashicorp/azurerm. Providers are downloaded automatically on terraform init.
State
Terraform stores what it has created in a state file (terraform.tfstate). On every plan/apply, it compares the desired state (your .tf files) against the recorded state (the state file) and the real resources (as reported by the provider API). Without state, Terraform cannot know what it already created.
Remote state backends
For team use, store state in a remote backend: S3 + DynamoDB (AWS), GCS (GCP), Azure Blob, or Terraform Cloud. Remote state allows multiple team members to run Terraform without stomping on each other and enables state locking to prevent concurrent runs.
Basic Terraform workflow
# Initialize: download providers and set up backend
terraform init
# Preview changes without applying them
terraform plan
# Apply changes (creates/modifies/destroys resources)
terraform apply
# Destroy all resources managed by this workspace
terraform destroy
# Show current state
terraform show
# List all resources in state
terraform state list
# Remove a specific resource from state (without destroying it)
# (quote the address so the shell does not interpret the brackets and quotes)
terraform state rm 'libvirt_domain.kldload_vm["worker-1"]'
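In an automated pipeline it is often useful to tell "nothing to change" apart from "changes pending" without parsing plan output. terraform plan supports a -detailed-exitcode flag for this: exit code 0 means no changes, 1 means an error, and 2 means the plan contains changes. The sketch below wraps that convention; plan_outcome and plan_then_apply are hypothetical helper names, and tfplan is an assumed file name for the saved plan.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate a `terraform plan -detailed-exitcode` exit code.
plan_outcome() {
  case "$1" in
    0) echo "no-changes" ;;
    1) echo "error" ;;
    2) echo "changes-pending" ;;
    *) echo "unknown" ;;
  esac
}

# Hypothetical wrapper: plan, then apply only when the plan contains changes.
plan_then_apply() {
  local rc=0
  terraform plan -detailed-exitcode -out=tfplan || rc=$?
  case "$(plan_outcome "$rc")" in
    no-changes)      echo "infrastructure already matches the config" ;;
    changes-pending) terraform apply tfplan ;;
    *)               echo "terraform plan failed (exit ${rc})" >&2; return 1 ;;
  esac
}
```

Applying the saved tfplan file rather than re-planning means the changes that were reviewed are exactly the changes that run.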
To roll out a new image, update the image_path variable in your Terraform config and run terraform apply. Terraform destroys the old VMs and creates new ones from the new image. The new servers are identical. The state is clean. No drift, no snowflakes.
7. Deploying kldload Images with Terraform (KVM / libvirt)
On a kldload KVM host, the libvirt Terraform provider replaces manual virt-install commands. You describe a fleet of VMs in a .tf file and Terraform creates them all in parallel, each with unique hostname, IP, and cloud-init configuration derived from the same golden image.
Provider configuration
terraform {
required_providers {
libvirt = {
source = "dmacvicar/libvirt"
version = "~> 0.7"
}
}
}
provider "libvirt" {
uri = "qemu:///system"
# For remote KVM host:
# uri = "qemu+ssh://root@kvm-host.example.com/system"
}
Base image and network
# Pool where images are stored
resource "libvirt_pool" "kldload" {
name = "kldload"
type = "dir"
path = "/var/lib/libvirt/images/kldload"
}
# The golden image (built by Packer, uploaded once)
resource "libvirt_volume" "base_image" {
name = "kldload-server-1.0.0.qcow2"
pool = libvirt_pool.kldload.name
source = "/images/kldload-server-1.0.0.qcow2"
format = "qcow2"
}
# Isolated network for the fleet
resource "libvirt_network" "kldload_net" {
name = "kldload-fleet"
mode = "nat"
domain = "fleet.local"
addresses = ["10.100.0.0/24"]
dhcp { enabled = false } # we assign IPs via cloud-init
dns { enabled = true }
}
Fleet definition with cloud-init
# Variables
variable "fleet_nodes" {
description = "Map of node name to IP address"
type = map(string)
default = {
"kldload-web-1" = "10.100.0.11"
"kldload-web-2" = "10.100.0.12"
"kldload-app-1" = "10.100.0.21"
}
}
variable "ssh_public_key" {
type = string
default = "~/.ssh/id_ed25519.pub"
}
locals {
ssh_key = file(var.ssh_public_key)
}
# ─── Per-node disk (thin clone from golden image) ─────────────────────────────
resource "libvirt_volume" "node_disk" {
for_each = var.fleet_nodes
name = "${each.key}.qcow2"
pool = libvirt_pool.kldload.name
base_volume_id = libvirt_volume.base_image.id
format = "qcow2"
size = 42949672960 # 40 GiB
}
# ─── Per-node cloud-init ISO ──────────────────────────────────────────────────
resource "libvirt_cloudinit_disk" "node_init" {
for_each = var.fleet_nodes
name = "${each.key}-init.iso"
pool = libvirt_pool.kldload.name
user_data = <<-EOF
#cloud-config
hostname: ${each.key}
fqdn: ${each.key}.fleet.local
manage_etc_hosts: true
users:
  - name: ops
    groups: wheel
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    ssh_authorized_keys:
      - ${local.ssh_key}
ssh_pwauth: false
packages:
  - vim
  - tmux
runcmd:
  - systemctl enable --now zfs-import-cache
  - echo "Node ${each.key} is up" > /etc/motd
EOF
network_config = <<-EOF
version: 2
ethernets:
  eth0:
    addresses:
      - ${each.value}/24
    gateway4: 10.100.0.1
    nameservers:
      addresses: [10.100.0.1, 1.1.1.1]
EOF
}
# ─── VM definitions ───────────────────────────────────────────────────────────
resource "libvirt_domain" "fleet_node" {
for_each = var.fleet_nodes
name = each.key
memory = 2048
vcpu = 2
cpu { mode = "host-passthrough" }
disk {
volume_id = libvirt_volume.node_disk[each.key].id
}
cloudinit = libvirt_cloudinit_disk.node_init[each.key].id
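network_interface {
network_id = libvirt_network.kldload_net.id
hostname = each.key
# DHCP is disabled on this network and IPs are assigned statically by cloud-init,
# so there is no lease to wait for; waiting would time out
wait_for_lease = false
}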
console {
type = "pty"
target_type = "serial"
target_port = "0"
}
graphics {
type = "vnc"
listen_type = "address"
autoport = true
}
}
# ─── Outputs ──────────────────────────────────────────────────────────────────
output "fleet_ips" {
value = {
for name, ip in var.fleet_nodes : name => ip
}
}
output "ssh_commands" {
value = {
for name, ip in var.fleet_nodes : name => "ssh ops@${ip}"
}
}
# Deploy the fleet
terraform init
terraform plan
terraform apply
# View the deployed IPs
terraform output fleet_ips
# Destroy everything cleanly
terraform destroy
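The fleet_nodes map above is easy to maintain by hand for three nodes, but larger fleets are usually generated. One possible approach is to emit a JSON variables file, since Terraform automatically loads files named *.auto.tfvars.json from the working directory. In this sketch fleet_tfvars is a hypothetical helper, and it assumes sequential host addresses inside a single /24.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fleet_tfvars NAME_PREFIX COUNT NET_PREFIX FIRST_HOST
# Prints a JSON object that sets the fleet_nodes variable (name => IP).
fleet_tfvars() {
  local prefix="$1" count="$2" net="$3" first="$4" i sep=""
  printf '{"fleet_nodes":{'
  for ((i = 1; i <= count; i++)); do
    # Node N gets address NET_PREFIX.(FIRST_HOST + N - 1)
    printf '%s"%s-%d":"%s.%d"' "$sep" "$prefix" "$i" "$net" "$((first + i - 1))"
    sep=","
  done
  printf '}}\n'
}
```

For example, fleet_tfvars kldload-web 2 10.100.0 11 > fleet.auto.tfvars.json writes a file defining kldload-web-1 at 10.100.0.11 and kldload-web-2 at 10.100.0.12, which overrides the default map on the next terraform apply.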
8. Deploying to AWS with Terraform
The Packer build in section 5 produced a registered AMI. Now Terraform uses that AMI to deploy EC2 instances with the full kldload configuration — VPC, subnets, security groups, and a separate EBS volume for ZFS data.
AWS provider and variables
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
backend "s3" {
bucket = "kldload-terraform-state"
key = "aws/production/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "kldload-terraform-locks"
encrypt = true
}
}
provider "aws" {
region = var.aws_region
default_tags {
tags = {
ManagedBy = "terraform"
Project = "kldload"
Environment = var.environment
}
}
}
variable "aws_region" { default = "us-east-1" }
variable "environment" { default = "production" }
variable "kldload_ami" { description = "AMI ID from Packer build" }
variable "instance_type" { default = "t3.large" }
variable "node_count" { default = 3 }
variable "ssh_key_name" { description = "EC2 key pair name" }
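The kldload_ami variable has no default, so something has to carry the AMI ID from the Packer build into Terraform. One possible approach, assuming a manifest post-processor is added to the AWS Packer build (the AWS template in section 5 does not include one), is to read the artifact ID from the manifest. For the amazon builders that ID is a comma-separated list of region:ami pairs. ami_for_region below is a hypothetical helper that picks the AMI for one region.

```shell
#!/usr/bin/env bash
# Hypothetical helper: ami_for_region ARTIFACT_ID REGION
# ARTIFACT_ID looks like "us-east-1:ami-0aaa,us-west-2:ami-0bbb".
# Prints the AMI ID for REGION, or returns 1 if the region is not listed.
ami_for_region() {
  local artifact_id="$1" region="$2" pair
  local IFS=','
  for pair in $artifact_id; do
    if [ "${pair%%:*}" = "$region" ]; then
      echo "${pair#*:}"
      return 0
    fi
  done
  return 1
}

# Example wiring (assumes jq is installed and manifest.json exists):
#   artifact_id="$(jq -r '.builds[-1].artifact_id' manifest.json)"
#   terraform apply -var "kldload_ami=$(ami_for_region "$artifact_id" us-east-1)"
```

Passing the ID with -var keeps the AMI out of the committed .tf files; a CI job could equally write it to a .auto.tfvars.json file.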
VPC and networking
resource "aws_vpc" "kldload" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = { Name = "kldload-${var.environment}" }
}
resource "aws_subnet" "kldload_private" {
count = 3
vpc_id = aws_vpc.kldload.id
cidr_block = "10.0.${count.index + 1}.0/24"
availability_zone = data.aws_availability_zones.available.names[count.index]
tags = { Name = "kldload-private-${count.index + 1}" }
}
data "aws_availability_zones" "available" {
state = "available"
}
resource "aws_security_group" "kldload_nodes" {
name = "kldload-nodes-${var.environment}"
vpc_id = aws_vpc.kldload.id
ingress {
description = "SSH from VPC"
from_port = 22
to_port = 22
protocol = "tcp"
cidr_blocks = [aws_vpc.kldload.cidr_block]
}
ingress {
description = "WireGuard"
from_port = 51820
to_port = 51820
protocol = "udp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
EC2 instances with ZFS EBS data volumes
resource "aws_instance" "kldload_node" {
count = var.node_count
ami = var.kldload_ami
instance_type = var.instance_type
key_name = var.ssh_key_name
subnet_id = aws_subnet.kldload_private[count.index % 3].id
vpc_security_group_ids = [aws_security_group.kldload_nodes.id]
# Root volume: not ZFS. The amazon-ebs build in section 5 starts from a stock base AMI and keeps its root filesystem
root_block_device {
volume_type = "gp3"
volume_size = 20
iops = 3000
throughput = 125
encrypted = true
delete_on_termination = true
}
user_data = base64encode(<<-EOF
#cloud-config
hostname: kldload-node-${count.index + 1}
fqdn: kldload-node-${count.index + 1}.${var.environment}.internal
manage_etc_hosts: true
runcmd:
  # Import the ZFS data pool from the attached EBS volume, or create it on first use
  - |
    if ! zpool list data >/dev/null 2>&1; then
      # Wait for the EBS volume to appear
      while [ ! -b /dev/nvme1n1 ]; do sleep 1; done
      # Replacement instance: the pool already exists on the volume, so import it.
      # Brand-new volume: the import finds nothing, so create the pool.
      zpool import data 2>/dev/null || \
        zpool create -o ashift=12 \
          -O compression=lz4 \
          -O atime=off \
          -O mountpoint=/data \
          data /dev/nvme1n1
    fi
EOF
)
tags = { Name = "kldload-node-${count.index + 1}" }
}
# ZFS data volume — separate EBS volume, persistent across instance replacements
resource "aws_ebs_volume" "kldload_data" {
count = var.node_count
availability_zone = aws_instance.kldload_node[count.index].availability_zone
size = 100
type = "gp3"
iops = 3000
throughput = 125
encrypted = true
tags = { Name = "kldload-data-${count.index + 1}" }
}
resource "aws_volume_attachment" "kldload_data" {
count = var.node_count
device_name = "/dev/sdf"
volume_id = aws_ebs_volume.kldload_data[count.index].id
instance_id = aws_instance.kldload_node[count.index].id
}
output "node_private_ips" {
value = aws_instance.kldload_node[*].private_ip
}
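The cloud-init snippets in this guide wait for the data disk with an unbounded while loop, which hangs first boot forever if the volume never attaches. A bounded wait is one way to fail loudly instead. The sketch below is an assumption-laden illustration: wait_for_blockdev and import_or_create_pool are hypothetical helper names, and the 120-second default and the pool options simply mirror the values used in the user data above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait_for_blockdev DEVICE [TIMEOUT_SECONDS]
# Returns 0 once DEVICE exists as a block device, 1 after the timeout.
wait_for_blockdev() {
  local dev="$1" timeout="${2:-120}" waited=0
  while [ ! -b "$dev" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for ${dev}" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Hypothetical helper: import_or_create_pool POOL DEVICE
# Imports an existing pool from the device, or creates it on a blank volume.
import_or_create_pool() {
  local pool="$1" dev="$2"
  zpool list "$pool" >/dev/null 2>&1 && return 0   # already imported
  wait_for_blockdev "$dev" 120 || return 1
  zpool import "$pool" 2>/dev/null || \
    zpool create -o ashift=12 -O compression=lz4 -O atime=off \
      -O mountpoint="/${pool}" "$pool" "$dev"
}
```

Baking a script like this into the image and calling import_or_create_pool data /dev/nvme1n1 from runcmd keeps the user data short and lets the same logic serve every cloud, with only the device path changing.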
9. Deploying to GCP and Azure
GCP: Compute Engine instance from a kldload image
terraform {
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
}
provider "google" {
project = var.gcp_project
region = var.gcp_region
}
variable "gcp_project" { description = "GCP project ID" }
variable "gcp_region" { default = "us-central1" }
variable "gcp_zone" { default = "us-central1-a" }
variable "kldload_image" { description = "GCP image name from Packer build" }
variable "machine_type" { default = "n2-standard-2" }
variable "node_count" { default = 3 }
resource "google_compute_network" "kldload" {
name = "kldload-network"
auto_create_subnetworks = false
}
resource "google_compute_subnetwork" "kldload" {
name = "kldload-subnet"
network = google_compute_network.kldload.id
ip_cidr_range = "10.10.0.0/24"
region = var.gcp_region
}
resource "google_compute_firewall" "kldload_ssh" {
name = "kldload-allow-ssh"
network = google_compute_network.kldload.id
allow {
protocol = "tcp"
ports = ["22"]
}
source_ranges = ["35.235.240.0/20"] # Cloud IAP IP range for SSH tunneling
}
resource "google_compute_instance" "kldload_node" {
count = var.node_count
name = "kldload-node-${count.index + 1}"
machine_type = var.machine_type
zone = var.gcp_zone
boot_disk {
initialize_params {
image = "projects/${var.gcp_project}/global/images/${var.kldload_image}"
size = 20
type = "pd-ssd"
}
}
# Separate persistent disk for ZFS data
attached_disk {
source = google_compute_disk.kldload_data[count.index].self_link
device_name = "data"
mode = "READ_WRITE"
}
network_interface {
subnetwork = google_compute_subnetwork.kldload.id
# No external IP — use Cloud IAP for SSH access
}
metadata = {
user-data = <<-EOF
#cloud-config
hostname: kldload-node-${count.index + 1}
runcmd:
  - |
    if ! zpool list data >/dev/null 2>&1; then
      while [ ! -b /dev/disk/by-id/google-data ]; do sleep 1; done
      # Import an existing pool from the persistent disk, or create it on a blank disk
      zpool import data 2>/dev/null || \
        zpool create -o ashift=12 -O compression=lz4 -O atime=off \
          -O mountpoint=/data data /dev/disk/by-id/google-data
    fi
EOF
}
service_account {
scopes = ["cloud-platform"]
}
labels = {
managed_by = "terraform"
project = "kldload"
}
}
resource "google_compute_disk" "kldload_data" {
count = var.node_count
name = "kldload-data-${count.index + 1}"
type = "pd-ssd"
zone = var.gcp_zone
size = 100
labels = { managed_by = "terraform" }
}
Azure: VM from a managed image
terraform {
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = "~> 3.0"
}
}
}
provider "azurerm" {
features {}
}
variable "azure_location" { default = "eastus" }
variable "resource_group" { default = "kldload-production" }
variable "kldload_image" { description = "Managed image name from Packer build" }
variable "vm_size" { default = "Standard_D2s_v5" }
variable "node_count" { default = 3 }
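# The managed image lives in the resource group the Packer build wrote it to,
# not in the resource group this configuration creates
variable "image_resource_group" { default = "kldload-images-rg" }
data "azurerm_image" "kldload" {
name = var.kldload_image
resource_group_name = var.image_resource_group
}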
resource "azurerm_resource_group" "kldload" {
name = var.resource_group
location = var.azure_location
}
resource "azurerm_virtual_network" "kldload" {
name = "kldload-vnet"
address_space = ["10.20.0.0/16"]
location = azurerm_resource_group.kldload.location
resource_group_name = azurerm_resource_group.kldload.name
}
resource "azurerm_subnet" "kldload" {
name = "kldload-subnet"
resource_group_name = azurerm_resource_group.kldload.name
virtual_network_name = azurerm_virtual_network.kldload.name
address_prefixes = ["10.20.1.0/24"]
}
resource "azurerm_network_interface" "kldload_node" {
count = var.node_count
name = "kldload-nic-${count.index + 1}"
location = azurerm_resource_group.kldload.location
resource_group_name = azurerm_resource_group.kldload.name
ip_configuration {
name = "internal"
subnet_id = azurerm_subnet.kldload.id
private_ip_address_allocation = "Dynamic"
}
}
resource "azurerm_linux_virtual_machine" "kldload_node" {
count = var.node_count
name = "kldload-node-${count.index + 1}"
resource_group_name = azurerm_resource_group.kldload.name
location = azurerm_resource_group.kldload.location
size = var.vm_size
admin_username = "ops"
admin_ssh_key {
username = "ops"
public_key = file("~/.ssh/id_ed25519.pub")
}
network_interface_ids = [
azurerm_network_interface.kldload_node[count.index].id
]
source_image_id = data.azurerm_image.kldload.id
os_disk {
caching = "ReadWrite"
storage_account_type = "Premium_LRS"
disk_size_gb = 30
}
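# The ZFS data disk is declared after this resource:
# azurerm_linux_virtual_machine has no inline data_disk block.
custom_data = base64encode(<<-EOF
  #cloud-config
  hostname: kldload-node-${count.index + 1}
  runcmd:
    - |
      # LUN-based path created by the Azure udev rules (WALinuxAgent or azure-vm-utils);
      # this assumes the image ships them. /dev/sdX letters are not stable on Azure.
      DISK=/dev/disk/azure/scsi1/lun10
      if ! zpool status data >/dev/null 2>&1; then
        while [ ! -b "$DISK" ]; do sleep 1; done
        zpool import data 2>/dev/null || \
          zpool create -o ashift=12 -O compression=lz4 \
            -O atime=off -O mountpoint=/data data "$DISK"
      fi
  EOF
)
tags = { managed_by = "terraform" }
}
# ZFS data disk, one per node, attached at LUN 10
resource "azurerm_managed_disk" "kldload_data" {
  count                = var.node_count
  name                 = "kldload-data-${count.index + 1}"
  location             = azurerm_resource_group.kldload.location
  resource_group_name  = azurerm_resource_group.kldload.name
  storage_account_type = "Premium_LRS"
  create_option        = "Empty"
  disk_size_gb         = 100
}
resource "azurerm_virtual_machine_data_disk_attachment" "kldload_data" {
  count              = var.node_count
  managed_disk_id    = azurerm_managed_disk.kldload_data[count.index].id
  virtual_machine_id = azurerm_linux_virtual_machine.kldload_node[count.index].id
  lun                = 10
  caching            = "None"
}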
10. The Golden Image Lifecycle
A golden image is not a one-time artifact. It is a versioned, tested, promoted release that goes through the same pipeline as your application code. Every change to the image — a security patch, a new package, a configuration update — produces a new image version. The old version is not deleted until the new one is validated in production.
Image naming and versioning
# Consistent naming convention
kldload-server-YYYYMMDD # date-based (simple, chronological)
kldload-server-v1.2.3 # semantic version (structured, for releases)
kldload-server-main-abc1234 # git branch + commit hash (CI builds)
# Examples
kldload-server-20260402 # built April 2, 2026
kldload-k8s-node-v2.1.0 # k8s node image, version 2.1.0
kldload-server-main-7f3a8b2 # built from main branch commit 7f3a8b2
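The three conventions above are easy to mix up when names are typed by hand. A minimal sketch of a name generator follows; `image_name` is a hypothetical helper (not part of kldload or Packer), and the `branch:sha` argument format is an assumption made for this example.

```bash
#!/usr/bin/env bash
# Sketch: build image names for the three conventions shown above.
# image_name FAMILY SCHEME [VALUE]   (hypothetical helper)
image_name() {
  local family="$1" scheme="$2" value="${3:-}"
  case "$scheme" in
    date)   # VALUE is YYYYMMDD; defaults to today's UTC date
      printf '%s-%s\n' "$family" "${value:-$(date -u +%Y%m%d)}" ;;
    semver) # VALUE is 1.2.3 or v1.2.3; a leading "v" is normalised
      printf '%s-v%s\n' "$family" "${value#v}" ;;
    commit) # VALUE is "<branch>:<sha>"; the sha is shortened to 7 characters
      local branch="${value%%:*}" sha="${value#*:}"
      printf '%s-%s-%s\n' "$family" "$branch" "${sha:0:7}" ;;
    *)
      echo "unknown scheme: $scheme" >&2
      return 1 ;;
  esac
}

image_name kldload-server date 20260402
image_name kldload-k8s-node semver 2.1.0
image_name kldload-server commit main:7f3a8b2c9d
```

In a CI job the result could be passed straight to Packer, for example `-var "image_name=$(image_name kldload-server commit "main:${GITHUB_SHA}")"`, assuming the template exposes such a variable.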
Image rotation policy
#!/bin/bash
# rotate-images.sh — keep the last 5 images per family, delete older ones
set -euo pipefail
IMAGE_FAMILY="kldload-server"
KEEP_COUNT=5
REGION="us-east-1"
# List all AMIs for this family, sorted by creation date (oldest first)
mapfile -t AMIS < <(
  aws ec2 describe-images \
    --owners self \
    --filters "Name=name,Values=${IMAGE_FAMILY}-*" \
    --query 'sort_by(Images, &CreationDate)[].[ImageId,Name]' \
    --output text \
    --region "${REGION}" \
    | awk '{print $1}'
)
TOTAL=${#AMIS[@]}
DELETE_COUNT=$(( TOTAL - KEEP_COUNT ))
if (( DELETE_COUNT > 0 )); then
  echo "Keeping ${KEEP_COUNT} of ${TOTAL} images, deleting ${DELETE_COUNT} oldest"
  # The list is oldest first, so the first DELETE_COUNT entries are the ones to remove
  for AMI_ID in "${AMIS[@]:0:DELETE_COUNT}"; do
    # Look up the backing snapshots first: describe-images is not reliable
    # once the AMI has been deregistered
    SNAPSHOTS=$(aws ec2 describe-images --image-ids "${AMI_ID}" --region "${REGION}" \
      --query 'Images[0].BlockDeviceMappings[].Ebs.SnapshotId' --output text)
    echo "Deregistering ${AMI_ID}..."
    aws ec2 deregister-image --image-id "${AMI_ID}" --region "${REGION}"
    # Then delete every snapshot that backed the image
    for SNAPSHOT in ${SNAPSHOTS}; do
      aws ec2 delete-snapshot --snapshot-id "${SNAPSHOT}" --region "${REGION}"
    done
  done
else
  echo "Only ${TOTAL} images exist, nothing to delete"
fi
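Before letting a rotation script delete anything, it can help to test the selection logic on its own. The sketch below separates "which images would go" from the AWS calls; `images_to_delete` is a hypothetical helper, and it assumes its input is tab-separated `ImageId<TAB>CreationDate` lines with ISO 8601 timestamps, which sort correctly as plain strings.

```bash
#!/usr/bin/env bash
# Sketch: print the image IDs that fall outside the newest KEEP entries.
# images_to_delete KEEP   (reads "id<TAB>iso8601-date" lines on stdin)
images_to_delete() {
  local keep="$1"
  # Sort newest first on column 2, skip the first KEEP rows, print the IDs that remain
  LC_ALL=C sort -t $'\t' -k2,2r | tail -n +"$(( keep + 1 ))" | cut -f1
}

# Dry run with three fake images, keeping the newest two
printf 'ami-a\t2026-01-01T00:00:00.000Z\nami-b\t2026-03-01T00:00:00.000Z\nami-c\t2026-02-01T00:00:00.000Z\n' \
  | images_to_delete 2
```

The dry run prints only `ami-a`, the oldest of the three. Feeding the function from `aws ec2 describe-images --query 'Images[].[ImageId,CreationDate]' --output text` would give the same kind of input for real images.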
Test pipeline: build → validate → promote
# 1. Build the image
packer build \
-var "image_version=$(date +%Y%m%d)" \
kldload-server.pkr.hcl
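# 2. Deploy to a test VM (NEW_AMI_ID is the AMI ID reported by the Packer build)
terraform apply \
  -var "kldload_ami=${NEW_AMI_ID}" \
  -var "environment=test" \
  -target='aws_instance.kldload_node[0]' # quoted so the shell does not glob the brackets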
# 3. Run validation tests
ssh ops@"${TEST_NODE_IP}" 'bash -s' << 'EOF'
set -e
echo "=== Smoke tests ==="
systemctl is-active sshd || { echo "FAIL: sshd"; exit 1; }
zpool status || { echo "FAIL: ZFS"; exit 1; }
wg show wg0 || { echo "FAIL: WireGuard"; exit 1; }
df -h / # verify root filesystem
free -h # verify memory
uname -r # verify kernel
echo "=== All smoke tests passed ==="
EOF
# 4. If tests pass, update production variable file
echo "kldload_ami = \"${NEW_AMI_ID}\"" > environments/production.tfvars
# 5. Deploy to production (rolling, one node at a time)
for i in 0 1 2; do
terraform apply \
-var-file=environments/production.tfvars \
-target="aws_instance.kldload_node[${i}]"
sleep 30 # fixed pause only; it does not confirm the node is healthy
done
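The fixed `sleep 30` in step 5 gives each node time to boot but never checks that it did. One way to tighten the rolling deploy is a small polling helper; `wait_for` below is a hypothetical function written for this guide, not a Terraform or kldload feature.

```bash
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or a timeout expires.
# wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_for() {
  local timeout="$1" interval="$2"
  shift 2
  # SECONDS is bash's built-in counter of seconds since the shell started
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "timed out after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

Inside the loop, the sleep could then become something like `wait_for 300 10 ssh -o ConnectTimeout=5 ops@"${NODE_IP}" 'systemctl is-active sshd && zpool status'`, where `NODE_IP` is whatever address your Terraform outputs expose for that node. A non-zero return stops the rollout before the next node is replaced.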
11. Secrets and Configuration
The number one image pipeline mistake is baking secrets into the golden image. Every instance gets the same database password. Every instance has your WireGuard private key. If the image leaks — and images are large files that get copied, uploaded, and shared — everything leaks with it. Secrets are injected at deploy time, not baked at build time.
What goes in the image vs what is injected
| Bake into image (safe) | Inject at deploy time (required) |
|---|---|
| OS packages and kernel | SSH host keys (regenerated by cloud-init) |
| Application binaries | Machine ID (cleared by kexport seal) |
| systemd unit files | Hostname and IP address |
| Kernel tuning (sysctl) | WireGuard private key |
| ZFS tuning and datasets | Database passwords |
| Non-sensitive configuration | API tokens and certificates |
| User accounts (without passwords) | User SSH authorized keys |
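The split in the table can be enforced mechanically before an image is sealed. The following is a rough sketch, assuming the image root is mounted at some directory; `find_baked_secrets` is a hypothetical helper, and the patterns only cover PEM or OpenSSH private keys and WireGuard `PrivateKey` lines, not every kind of secret.

```bash
#!/usr/bin/env bash
# Sketch: list files under an image root that look like baked-in private keys.
# find_baked_secrets ROOT   (prints matching file paths, one per line)
find_baked_secrets() {
  local root="$1"
  # -r recurse, -l print file names only, -E extended regex.
  # Matches "-----BEGIN ... PRIVATE KEY-----" blocks and WireGuard "PrivateKey =" lines.
  grep -rlE -e '-----BEGIN [A-Z ]*PRIVATE KEY-----|^[[:space:]]*PrivateKey[[:space:]]*=' \
    "$root/etc" "$root/root" "$root/home" 2>/dev/null || true
}
```

A build step might run `[ -z "$(find_baked_secrets /mnt/image-root)" ] || exit 1` just before export, with `/mnt/image-root` standing in for wherever the image is mounted. SSH host keys match as well, which is the intent: they belong to the deploy-time column.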
Secrets tools: Vault, SOPS, age
# HashiCorp Vault — centralized secrets management
# Store WireGuard key in Vault
vault kv put secret/wireguard/node-1 private_key="$(wg genkey)"
# Read from Vault in a deployment script
WG_PRIVATE_KEY=$(vault kv get -field=private_key secret/wireguard/node-1)
# ─── SOPS + age — encrypted secrets in git ────────────────────────────────────
# SOPS encrypts specific values in YAML/JSON files, leaving keys readable
# age is a modern replacement for GPG
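# Generate an age key pair
mkdir -p ~/.config/sops/age
age-keygen -o ~/.config/sops/age/keys.txt
# Encrypt a secrets file
sops --encrypt --age age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx secrets.yaml > secrets.enc.yaml
# Decrypt at deploy time (key must be in ~/.config/sops/age/keys.txt).
# terraform -var-file only parses tfvars syntax, so piping decrypted YAML into it fails.
# sops exec-env exports each top-level key as an environment variable instead;
# name the keys TF_VAR_<variable> in secrets.yaml so Terraform picks them up.
sops exec-env secrets.enc.yaml 'terraform apply'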
Injecting WireGuard keys via cloud-init from Vault
# In Terraform: fetch the WireGuard key from Vault, inject it via cloud-init.
# Caveat: the key ends up in Terraform state and in EC2 user data, so encrypt
# the state backend and restrict who can read instance attributes.
data "vault_kv_secret_v2" "wireguard" {
  count = var.node_count
  mount = "secret"
  name  = "wireguard/node-${count.index + 1}"
}
resource "aws_instance" "kldload_node" {
  count         = var.node_count
  ami           = var.kldload_ami
  instance_type = var.instance_type
  user_data = base64encode(<<-EOF
    #cloud-config
    hostname: kldload-node-${count.index + 1}
    write_files:
      - path: /etc/wireguard/wg0.conf
        permissions: '0600'
        content: |
          [Interface]
          PrivateKey = ${data.vault_kv_secret_v2.wireguard[count.index].data["private_key"]}
          Address = 10.200.${count.index + 1}.1/24
          ListenPort = 51820
          [Peer]
          PublicKey = ${var.wireguard_server_pubkey}
          Endpoint = ${var.wireguard_server_endpoint}:51820
          AllowedIPs = 10.200.0.0/16
    runcmd:
      - systemctl enable --now wg-quick@wg0
    EOF
  )
}
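The same per-node layout can be rendered outside Terraform, which is handy for checking the addressing scheme or for a deploy script that pulls the key from Vault directly. This is an illustrative sketch only: `render_wg_conf` is a hypothetical helper that mirrors the template above, including its `10.200.N.1/24` addressing.

```bash
#!/usr/bin/env bash
# Sketch: render a wg0.conf for node N using the addressing from the template above.
# render_wg_conf NODE PRIVATE_KEY SERVER_PUBKEY SERVER_ENDPOINT
render_wg_conf() {
  local node="$1" private_key="$2" server_pubkey="$3" endpoint="$4"
  # The node number becomes the third octet, so it has to fit in one byte
  if (( node < 1 || node > 255 )); then
    echo "node number must be 1..255 to fit 10.200.N.1/24" >&2
    return 1
  fi
  cat <<EOF
[Interface]
PrivateKey = ${private_key}
Address = 10.200.${node}.1/24
ListenPort = 51820

[Peer]
PublicKey = ${server_pubkey}
Endpoint = ${endpoint}:51820
AllowedIPs = 10.200.0.0/16
EOF
}
```

On a node, something like `umask 077; render_wg_conf 1 "$(vault kv get -field=private_key secret/wireguard/node-1)" "$SERVER_PUBKEY" "$SERVER_ENDPOINT" > /etc/wireguard/wg0.conf` would write the file with restrictive permissions; the two shell variables are placeholders for values you supply.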
12. CI/CD Integration
The full pipeline: a git push triggers a Packer build, tests pass, Terraform deploys to staging, manual approval unlocks production, Terraform rolls out to production. The entire deployment history is your git history. Rollback is a git revert.
GitHub Actions pipeline
# .github/workflows/image-pipeline.yml
name: Image Pipeline
on:
  push:
    branches: [main]
    paths:
      - 'packer/**'
      - 'scripts/**'
env:
  AWS_REGION: us-east-1
  PACKER_LOG: 1
jobs:
  # ─── Build the golden image ─────────────────────────────────────────────────
  build-image:
    runs-on: [self-hosted, kldload-builder] # runs on a kldload KVM host
    permissions:
      id-token: write # lets configure-aws-credentials assume the role through GitHub OIDC
      contents: read
    outputs:
      ami_id: ${{ steps.packer.outputs.ami_id }}
      image_version: ${{ steps.version.outputs.version }}
    steps:
      - uses: actions/checkout@v4
      - name: Set image version
        id: version
        run: echo "version=$(date +%Y%m%d)-${GITHUB_SHA::8}" >> "$GITHUB_OUTPUT"
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ vars.AWS_ACCOUNT_ID }}:role/packer-builder
          aws-region: ${{ env.AWS_REGION }}
      - name: Setup Packer
        uses: hashicorp/setup-packer@main
        with:
          version: "latest"
      - name: Packer init
        run: packer init packer/kldload-server/
      - name: Packer validate
        run: |
          packer validate \
            -var "image_version=${{ steps.version.outputs.version }}" \
            packer/kldload-server/kldload-server.pkr.hcl
      - name: Packer build
        id: packer
        run: |
          set -o pipefail # without this, tee hides a failed packer build
          packer build \
            -var "image_version=${{ steps.version.outputs.version }}" \
            -machine-readable \
            packer/kldload-server/kldload-server.pkr.hcl \
            | tee /tmp/packer-output.txt
          AMI_ID=$(grep 'artifact,0,id' /tmp/packer-output.txt \
            | cut -d, -f6 | cut -d: -f2)
          [ -n "${AMI_ID}" ] || { echo "no AMI ID found in Packer output" >&2; exit 1; }
          echo "ami_id=${AMI_ID}" >> "$GITHUB_OUTPUT"
          echo "Built AMI: ${AMI_ID}"
  # ─── Deploy to staging and run smoke tests ──────────────────────────────────
  # Assumes staging and production keep separate Terraform state (for example one
  # workspace or backend key per environment); with shared state this apply would
  # resize production.
  deploy-staging:
    needs: build-image
    runs-on: [self-hosted, kldload-builder]
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
      - name: Terraform init
        run: terraform -chdir=terraform/aws init
      - name: Deploy to staging
        run: |
          terraform -chdir=terraform/aws apply -auto-approve \
            -var "kldload_ami=${{ needs.build-image.outputs.ami_id }}" \
            -var "environment=staging" \
            -var "node_count=1"
      - name: Get staging IP
        id: staging_ip
        run: |
          # node_private_ips is a list, and -raw only accepts strings, so read it as JSON (needs jq)
          IP=$(terraform -chdir=terraform/aws output -json node_private_ips | jq -r '.[0]')
          echo "ip=${IP}" >> "$GITHUB_OUTPUT"
      - name: Run smoke tests
        run: |
          # Fresh instances reuse private IPs, so do not check or store host keys here
          SSH_OPTS="-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ConnectTimeout=5"
          # Wait for SSH to be available
          for i in $(seq 1 30); do
            ssh ${SSH_OPTS} ops@${{ steps.staging_ip.outputs.ip }} 'echo ok' && break
            sleep 10
          done
          ssh ${SSH_OPTS} ops@${{ steps.staging_ip.outputs.ip }} 'bash -s' << 'TESTS'
          set -e
          systemctl is-active sshd
          zpool status
          wg show wg0
          echo "All smoke tests passed"
          TESTS
  # ─── Deploy to production (requires manual approval) ────────────────────────
  deploy-production:
    needs: [build-image, deploy-staging]
    runs-on: [self-hosted, kldload-builder]
    environment: production # GitHub environment with required reviewers
    permissions:
      contents: write # needed for the manifest commit pushed at the end
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
      - name: Terraform init
        run: terraform -chdir=terraform/aws init
      - name: Rolling deploy to production
        run: |
          NODE_COUNT=$(terraform -chdir=terraform/aws output -raw node_count)
          for i in $(seq 0 $(( NODE_COUNT - 1 ))); do
            echo "Deploying node ${i}..."
            terraform -chdir=terraform/aws apply -auto-approve \
              -var "kldload_ami=${{ needs.build-image.outputs.ami_id }}" \
              -var "environment=production" \
              -target="aws_instance.kldload_node[${i}]"
            echo "Node ${i} deployed, waiting 30s..."
            sleep 30
          done
      - name: Update image manifest
        run: |
          echo '{"ami_id": "${{ needs.build-image.outputs.ami_id }}", "version": "${{ needs.build-image.outputs.image_version }}", "deployed": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' \
            > deployed-image.json
          git config user.email "ci@kldload.com"
          git config user.name "kldload CI"
          git add deployed-image.json
          git commit -m "deploy: production image ${{ needs.build-image.outputs.image_version }}"
          git push
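The `grep | cut` extraction in the build job works for a single-region build but is easy to break. Below is a slightly more defensive sketch of the same parsing. It assumes Packer's machine-readable layout of `timestamp,target,type,index,key,value` lines, and that multi-region artifact IDs arrive with commas escaped as `%!(PACKER_COMMA)`; `ami_from_packer_log` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: pull the AMI ID for one region out of `packer build -machine-readable` output.
# ami_from_packer_log REGION   (reads the Packer output on stdin)
ami_from_packer_log() {
  local region="$1"
  awk -F, -v region="$region" '
    $3 == "artifact" && $5 == "id" {
      # Field 6 holds "region:ami" pairs; split on the escaped-comma token
      n = split($6, pairs, "%!\\(PACKER_COMMA\\)")
      for (i = 1; i <= n; i++) {
        split(pairs[i], kv, ":")
        if (kv[1] == region) { print kv[2]; found = 1; exit }
      }
    }
    END { exit (found ? 0 : 1) }
  '
}

# Example with a fabricated log line
printf '%s\n' '1712052000,amazon-ebs.kldload,artifact,0,id,us-east-1:ami-0123456789abcdef0' \
  | ami_from_packer_log us-east-1
```

The function exits non-zero when no AMI for the requested region is found, so a workflow step using `AMI_ID=$(ami_from_packer_log "$AWS_REGION" < /tmp/packer-output.txt)` fails instead of passing an empty ID downstream.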
GitLab CI alternative
# .gitlab-ci.yml
stages:
  - build
  - test
  - staging
  - production
build-image:
  stage: build
  tags: [kldload-builder]
  script:
    # GitLab does not run $(...) inside a variables: block, so compute the version here
    - export IMAGE_VERSION="$(date +%Y%m%d)-${CI_COMMIT_SHORT_SHA}"
    - packer init packer/kldload-server/
    - packer build
      -var "image_version=${IMAGE_VERSION}"
      packer/kldload-server/kldload-server.pkr.hcl
    # manifest.json is written by a manifest post-processor in the Packer template.
    # The colon is quoted because a bare "-d: " would be parsed as a YAML mapping.
    - AMI_ID=$(jq -r '.builds[-1].artifact_id' manifest.json | cut -d':' -f2)
    - echo "AMI_ID=${AMI_ID}" >> build.env
  artifacts:
    reports:
      dotenv: build.env
# Both deploy jobs assume separate Terraform state per environment
# (for example one workspace or backend key each).
deploy-staging:
  stage: staging
  tags: [kldload-builder]
  dependencies: [build-image]
  script:
    - terraform -chdir=terraform/aws init
    - terraform -chdir=terraform/aws apply -auto-approve
      -var "kldload_ami=${AMI_ID}"
      -var "environment=staging"
  environment:
    name: staging
deploy-production:
  stage: production
  tags: [kldload-builder]
  dependencies: [build-image]
  when: manual # requires manual click in GitLab UI
  script:
    - terraform -chdir=terraform/aws init
    - terraform -chdir=terraform/aws apply -auto-approve
      -var "kldload_ami=${AMI_ID}"
      -var "environment=production"
  environment:
    name: production
Related pages
- Unattended Install — the answers file format and all installer variables
- Export Formats — qcow2, vmdk, vhd, ova, raw — when to use which
- Cloud & Packer — quick-start cloud deployment guide
- IaC Quickstart — the fastest path from kldload ISO to deployed fleet
- WireGuard Masterclass — injecting WireGuard config at deploy time
- Kubernetes on KVM — deploying a k8s cluster from kldload images
- Automation — kldload automation tools and postinstallers