
Packer & IaC Masterclass

This guide covers the full image-factory pipeline: building golden images with kldload, automating that build with Packer, and deploying the results at scale with Terraform. If you have installed kldload and exported an image manually, this is the next step — making the entire process repeatable, versioned, and automated across every cloud and hypervisor you run.

What this page covers: the philosophy of image-based deployment, the kldload image pipeline and kexport tool, Packer templates for QEMU/AWS/GCP/Azure, Terraform configs for KVM/libvirt and all three major clouds, the golden image lifecycle, secrets injection, and a complete CI/CD pipeline that goes from git push to production fleet.

Prerequisites: a running kldload build environment, familiarity with the unattended install and export formats guides, and a basic understanding of how VMs and cloud instances work.


1. Images Are the Deployment Unit

The dominant infrastructure pattern of the last decade is simple to state and hard to fully internalize: you do not configure servers — you build images and deploy them. The image contains the OS, the application, the runtime configuration, the kernel tuning, the security baseline. Deploying a new server means booting a known-good image, not running a 200-line Ansible playbook against an unknown base.

This is not just a DevOps aesthetic. It solves a class of production problems that configuration management cannot. A server that has been patched in place twelve times is a unique snowflake. It has accumulated state, partial upgrades, leftover config from services that were removed, and subtle drift from every other server in the fleet. When it breaks at 2am you cannot reproduce the failure anywhere else. An image-based server is identical to every other server built from that image. When it breaks, you deploy the previous image version in under a minute.

The shift in one sentence: Packer builds the image. Terraform deploys it. kldload provides the ZFS-rooted base. Together they give you immutable infrastructure — servers that are replaced, never patched in place.

kldload is an image factory: it builds golden images on ZFS, exports them to qcow2, vmdk, vhd, ova, or raw, and deploys them anywhere — on-prem KVM, Proxmox, AWS, GCP, Azure, or bare metal. The ISO installer is the image builder. The darksites inside the ISO mean no internet access is required during the build. The entire pipeline runs in an air-gapped room if needed.

Adopting this pattern requires changing how you think about server state. Configuration management (Ansible, Puppet, Chef) treats servers as long-lived entities that accumulate changes over time. Image-based deployment treats servers as disposable artifacts that are replaced atomically. The former is a maintenance burden that grows with fleet size. The latter scales horizontally: deploying 1 server and deploying 1000 servers are the same operation. When something goes wrong, you roll back the image. When something goes right, you promote the image to production. The server itself is never the source of truth. The image pipeline is.

What an image contains

The complete OS filesystem, kernel, drivers, installed packages, configuration files, systemd units, compiled applications, and tuning parameters. Everything needed to boot a fully functional server with no further configuration.

// Think: a ZFS snapshot of a fully configured system
// Not: a blank OS + a pile of Ansible roles

What an image does NOT contain

Secrets. Hostnames. IP addresses. SSH host keys. Machine IDs. Anything that must be unique per instance is injected at deploy time via cloud-init or environment variables. The image is a template, not a server.

// cloud-init runs once on first boot to set identity
// The image itself is stateless and reusable
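One way to enforce the "template, not a server" rule is to inspect an unpacked image root before publishing it. The sketch below checks only two of the items listed above (machine ID and SSH host keys). `check_sealed` is a hypothetical helper name, not a kldload command, and the mount point in the usage comment is an assumption.

```shell
#!/bin/bash
# Sketch: report per-instance identity left behind in an unpacked image root.
# check_sealed is a hypothetical helper, not part of kldload.
check_sealed() {
    local root="$1" problems=0 key

    # machine-id must be absent or empty so systemd regenerates it on first boot
    if [ -s "$root/etc/machine-id" ]; then
        echo "machine-id is populated"
        problems=$((problems + 1))
    fi

    # SSH host keys must not be shared between clones
    for key in "$root"/etc/ssh/ssh_host_*; do
        if [ -e "$key" ]; then
            echo "host key present: ${key#"$root"}"
            problems=$((problems + 1))
        fi
    done

    # Succeed only when nothing was found
    [ "$problems" -eq 0 ]
}

# Usage (assumes the image root is mounted at /mnt/image):
#   check_sealed /mnt/image || echo "image is not safe to clone"
```

A check like this fits naturally as a gate in CI: a non-zero exit stops the image from being promoted.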

Immutable means replaceable

An immutable server is never modified after deployment. Configuration changes happen in the image pipeline, not on live servers. Upgrading means deploying a new image and destroying the old one. Rollback means deploying the previous image version.

// Blue/green deploy: bring up new image, drain old
// Never: ssh in and edit /etc/nginx/nginx.conf

The toolchain

Packer automates image creation from a declarative template. Terraform declares the infrastructure that runs those images. kldload provides the hardened ZFS base with the darksites for offline builds. Git is the source of truth for all of it.

// kldload ISO → Packer → golden image
// golden image → Terraform → running fleet

2. The kldload Image Pipeline

Most image pipelines use Packer to boot an OS ISO, answer installer prompts via a preseed or kickstart file, wait while the installer downloads packages from the internet, then export the disk. kldload's approach is different at each step.

Build, seal, export

The kldload image pipeline has three phases:

  1. Build: Boot the kldload ISO in a VM. Run the unattended installer, which reads an answers file and installs the target distro to disk — ZFS on root, WireGuard, eBPF tools, and all selected packages. Because the darksites are baked into the ISO, no internet access is required. A full install completes in 3–5 minutes.
  2. Seal: Run kexport seal (or call k_seal_image_for_clone() directly). This clears the machine ID, removes SSH host keys, enables cloud-init with a multi-datasource config that auto-detects AWS, GCP, Azure, and NoCloud, and exports the ZFS pools. The system is now a template, not a server.
  3. Export: Run kexport convert. This calls qemu-img convert to produce the target format. Optionally SCP the image to a remote host, upload to an object store, or register as a cloud provider image.

Export formats

Format     | Target platforms             | Notes
qcow2      | KVM, Proxmox, OpenStack      | Native QEMU format; supports snapshots and thin provisioning
vmdk       | VMware ESXi, vSphere, Fusion | Use the streamOptimized subformat for OVA packaging
vhd / vhdx | Hyper-V, Azure               | Azure requires fixed-size VHD; use --subformat fixed
ova        | VMware, VirtualBox, generic  | Self-contained archive: vmdk + OVF descriptor
raw        | Bare metal, any hypervisor   | dd directly to a disk; import into any platform with qemu-img convert
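Since the export phase is described as a wrapper around qemu-img convert, it can help to see what the underlying commands might look like for each format. The sketch below only prints the command. `export_command` is a hypothetical helper, and whether kexport passes exactly these options is an assumption. The qemu-img format names and subformat options themselves are real (qemu-img calls the VHD format `vpc`). OVA is left out because it is an archive around a vmdk, not a qemu-img output format.

```shell
#!/bin/bash
# Sketch: map an export format name to a plausible qemu-img invocation.
# export_command is hypothetical; kexport's real internals may differ.
export_command() {
    local format="$1" input="$2" output="$3"
    case "$format" in
        qcow2) echo "qemu-img convert -O qcow2 $input $output" ;;
        raw)   echo "qemu-img convert -O raw $input $output" ;;
        # streamOptimized is the vmdk subformat used inside OVA archives
        vmdk)  echo "qemu-img convert -O vmdk -o subformat=streamOptimized $input $output" ;;
        # qemu-img names the VHD format "vpc"; Azure needs the fixed subformat
        vhd)   echo "qemu-img convert -O vpc -o subformat=fixed $input $output" ;;
        vhdx)  echo "qemu-img convert -O vhdx $input $output" ;;
        *)     echo "unsupported format: $format" >&2; return 1 ;;
    esac
}

# Usage:
#   export_command vhd disk.raw kldload-server.vhd
```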

The kexport tool

# Seal the installed system for cloning (run on the installed target, before export)
kexport seal

# What kexport seal does:
#   - Truncates /etc/machine-id (systemd regenerates on next boot)
#   - Removes /etc/ssh/ssh_host_* (new keys generated on first boot)
#   - Writes /etc/cloud/cloud.cfg.d/99-datasource.cfg with multi-datasource list
#   - Exports all ZFS pools (zpool export -a)
#   - Sets a firstboot flag so cloud-init runs on next boot

# Convert the disk image (run from the build host after the VM is shut down)
kexport convert --format qcow2 --input /dev/vda --output /images/kldload-server-v1.0.0.qcow2

# Export to multiple formats in one pass
kexport convert --format qcow2,vmdk,vhd \
  --input /dev/vda \
  --output-dir /images/kldload-server-v1.0.0/

# SCP to remote host after conversion
kexport convert --format qcow2 \
  --input /dev/vda \
  --output /images/kldload-server-v1.0.0.qcow2 \
  --scp-target images@build.example.com:/exports/

Most image pipelines use Packer to boot an ISO, answer installer prompts with a preseed or kickstart, wait 20 minutes while packages download, then export. kldload's unattended install completes in 3–5 minutes because the darksites are baked into the ISO, so nothing is downloaded during install. The installer reads an answers file, formats the disk as ZFS, installs from the local mirror, and exits. The image is built offline, sealed, exported, and ready to deploy anywhere. This is how you build images in air-gapped environments, in datacenter build rooms with no public internet access, or on a laptop on an airplane.

3. Packer Basics

Packer is a tool from HashiCorp that automates the creation of machine images from a declarative template. You describe what you want — which ISO to boot, what commands to run, how to export the result — and Packer handles the VM lifecycle, the boot sequence, and the export. The same template can produce images for a dozen different platforms.

Builders

A builder is a Packer plugin that creates a VM on a specific platform, boots it, and waits for provisioners to run. Common builders: qemu (local KVM), proxmox (Proxmox VE API), amazon-ebs (build in EC2), googlecompute (build in GCE), azure-arm (build in Azure).

// builder = the VM host that runs your build
// One template can have multiple builders in parallel

Provisioners

Provisioners run after the VM boots and before the image is exported. Common provisioners: shell (run bash scripts), file (upload files), ansible (run an Ansible playbook), salt-masterless (run Salt states). This is where you install software, configure services, and seal the image.

// provisioner = what runs inside the VM after install
// Multiple provisioners run in sequence

Post-processors

Post-processors run after the VM is shut down and the image is captured. Common post-processors: compress (gzip the image), checksum (sha256sum for verification), manifest (write a JSON manifest of all outputs), vagrant (package as a Vagrant box).

// post-processor = what happens to the image file
// Runs on your build machine, not inside the VM
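A checksum is only useful if something verifies it before the image is deployed. The sketch below compares an image against a checksum file whose first field is the hex digest, which is the layout sha256sum writes. `verify_image` is a hypothetical helper, and it assumes GNU coreutils sha256sum is available on the build host.

```shell
#!/bin/bash
# Sketch: verify an image against a SHA-256 checksum file before deploying it.
# verify_image is a hypothetical helper name.
verify_image() {
    local image="$1" sumfile="$2" expected actual

    # The first whitespace-separated field of the checksum file is the digest
    expected=$(awk '{print $1; exit}' "$sumfile")
    actual=$(sha256sum "$image" | awk '{print $1}')

    # Fail on an empty checksum file as well as on a mismatch
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage:
#   verify_image kldload-server.qcow2 kldload-server.sha256sum \
#     || { echo "checksum mismatch, refusing to deploy" >&2; exit 1; }
```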

HCL2 template format

Modern Packer uses HCL2 (HashiCorp Configuration Language v2) — the same language as Terraform. Templates are .pkr.hcl files. Variables, locals, expressions, and loops are all supported. The old JSON format still works but HCL2 is the standard for new templates.

// .pkr.hcl = modern Packer template
// .json = legacy format, avoid for new work

Install Packer

# On a kldload host (CentOS Stream 9 / RHEL / Rocky)
sudo dnf install -y dnf-plugins-core   # provides "dnf config-manager"
sudo dnf config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo
sudo dnf install -y packer

# On Debian / Ubuntu
wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | \
  sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt-get update && sudo apt-get install -y packer

# Verify
packer --version

# Install required plugins (run in your template directory)
packer init .

Packer is the missing piece between "I built a kldload ISO" and "I have images in five clouds." You write one Packer template that boots the kldload ISO, runs the unattended installer, configures the system, and exports to every format you need. One template, many outputs.

The key mental model: Packer is a build automation tool, not a configuration management tool. It runs once to create the image. Terraform then deploys that image as many times as you need. The division of responsibility is clear: Packer is responsible for what is in the image. Terraform is responsible for where and how many instances run it.

4. Building a kldload Image with Packer (QEMU Builder)

The QEMU builder creates a local VM, boots the kldload ISO, runs the installer, and exports the disk. This is the foundational build — the same image produced here becomes the source for cloud uploads, so it is worth getting right.

Directory structure

packer/
  kldload-server/
    kldload-server.pkr.hcl      # main template
    variables.pkrvars.hcl       # default variable values
    profiles/
      server.pkrvars.hcl        # server profile overrides
      desktop.pkrvars.hcl       # desktop profile overrides
      k8s-node.pkrvars.hcl      # Kubernetes node profile
    scripts/
      post-install.sh           # runs inside the VM after install
      seal.sh                   # calls kexport seal

Main template: kldload-server.pkr.hcl

packer {
  required_plugins {
    qemu = {
      version = ">= 1.0.9"
      source  = "github.com/hashicorp/qemu"
    }
  }
}

# ─── Variables ────────────────────────────────────────────────────────────────

variable "iso_url" {
  type    = string
  default = "/images/kldload-desktop-1.0.2-x86_64.iso"
}

variable "iso_checksum" {
  type    = string
  default = "file:/images/kldload-desktop-1.0.2-x86_64.iso.sha256"
}

variable "disk_size" {
  type    = string
  default = "40960"  # 40 GiB in MiB
}

variable "memory" {
  type    = number
  default = 4096
}

variable "cpus" {
  type    = number
  default = 4
}

variable "output_dir" {
  type    = string
  default = "/images/output"
}

variable "image_name" {
  type    = string
  default = "kldload-server"
}

variable "image_version" {
  type    = string
  default = "1.0.0"
}

variable "target_distro" {
  type    = string
  default = "centos"  # centos | debian | ubuntu | fedora | rocky
}

variable "install_profile" {
  type    = string
  default = "server"  # server | desktop | core
}

variable "ssh_username" {
  type    = string
  default = "root"
}

variable "ssh_password" {
  type      = string
  default   = "kldload"
  sensitive = true
}

# ─── Locals ───────────────────────────────────────────────────────────────────

locals {
  output_filename = "${var.image_name}-${var.image_version}-${var.target_distro}"
  timestamp       = formatdate("YYYYMMDD", timestamp())
}

# ─── Source: QEMU ─────────────────────────────────────────────────────────────

source "qemu" "kldload" {
  # ISO to boot
  iso_url      = var.iso_url
  iso_checksum = var.iso_checksum

  # Disk
  disk_size         = var.disk_size
  disk_interface    = "virtio"
  format            = "qcow2"
  output_directory  = "${var.output_dir}/${local.output_filename}"
  vm_name           = "${local.output_filename}.qcow2"

  # Machine
  machine_type = "q35"
  memory       = var.memory
  cpus         = var.cpus
  net_device   = "virtio-net"

  # UEFI boot (kldload requires UEFI)
  efi_boot          = true
  efi_firmware_code = "/usr/share/edk2/ovmf/OVMF_CODE.fd"
  efi_firmware_vars = "/usr/share/edk2/ovmf/OVMF_VARS.fd"

  # Boot command: kldload live environment autologins as root
  # We write an answers file and kick off the unattended installer
  boot_wait = "15s"
  boot_command = [
    # Wait for the live desktop/shell to come up, then write the answers file.
    # Packer types these strings as keystrokes, so every line needs <enter>
    # to be submitted. The <wait> durations are placeholders; tune them.
    "<wait10>",
    "cat > /tmp/answers.env << 'EOF'<enter>",
    "K_TARGET_DISTRO=${var.target_distro}<enter>",
    "K_INSTALL_PROFILE=${var.install_profile}<enter>",
    "K_DISK=vda<enter>",
    "K_HOSTNAME=kldload-template<enter>",
    "K_TIMEZONE=UTC<enter>",
    "K_ROOT_PASSWORD=kldload<enter>",
    "K_INSTALL_USER=ops<enter>",
    "K_INSTALL_USER_PASSWORD=kldload<enter>",
    "EOF<enter>",
    "<wait5>",
    # Launch the unattended installer
    "kldload-install-target --answers /tmp/answers.env --unattended<enter>"
  ]

  # SSH connection (installer reboots into the target system)
  communicator           = "ssh"
  ssh_username           = var.ssh_username
  ssh_password           = var.ssh_password
  ssh_timeout            = "30m"
  ssh_handshake_attempts = 30

  # Shutdown
  shutdown_command = "shutdown -h now"
  shutdown_timeout = "5m"

  # Headless build (no GUI window)
  headless = true

  # QEMU extra args for performance
  qemuargs = [
    ["-cpu", "host"],
    ["-enable-kvm"]
  ]
}

# ─── Build ────────────────────────────────────────────────────────────────────

build {
  name    = "kldload-server"
  sources = ["source.qemu.kldload"]

  # Wait for the system to fully come up after install reboot
  provisioner "shell" {
    inline = ["echo 'System is up'", "uname -a", "zpool status"]
  }

  # Run post-install configuration
  provisioner "shell" {
    script = "scripts/post-install.sh"
    environment_vars = [
      "IMAGE_NAME=${var.image_name}",
      "IMAGE_VERSION=${var.image_version}",
      "TARGET_DISTRO=${var.target_distro}"
    ]
  }

  # Seal the image for cloning
  provisioner "shell" {
    script = "scripts/seal.sh"
  }

  # Write a manifest
  post-processor "manifest" {
    output     = "${var.output_dir}/${local.output_filename}/manifest.json"
    strip_path = false
  }

  # Checksum
  post-processor "checksum" {
    checksum_types = ["sha256"]
    output         = "${var.output_dir}/${local.output_filename}/${local.output_filename}.{{.ChecksumType}}sum"
  }
}
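
A failed install only shows up as an SSH timeout after 30 minutes, so it can be worth checking the answers before starting a build. The sketch below validates a standalone answers file against the values listed in the variable comments of the template (centos | debian | ubuntu | fedora | rocky, and server | desktop | core). `validate_answers` is a hypothetical helper, and the installer may accept values beyond the ones checked here.

```shell
#!/bin/bash
# Sketch: sanity-check an answers file before handing it to a build.
# validate_answers is hypothetical; allowed values are taken from the
# template's variable comments and may not be exhaustive.
validate_answers() {
    local file="$1" distro profile

    distro=$(sed -n 's/^K_TARGET_DISTRO=//p' "$file")
    profile=$(sed -n 's/^K_INSTALL_PROFILE=//p' "$file")

    case "$distro" in
        centos|debian|ubuntu|fedora|rocky) ;;
        *) echo "invalid K_TARGET_DISTRO: '$distro'"; return 1 ;;
    esac

    case "$profile" in
        server|desktop|core) ;;
        *) echo "invalid K_INSTALL_PROFILE: '$profile'"; return 1 ;;
    esac

    # K_DISK must be present and non-empty
    grep -q '^K_DISK=.' "$file" || { echo "K_DISK is not set"; return 1; }

    echo "answers file OK"
}

# Usage:
#   validate_answers answers.env && packer build kldload-server.pkr.hcl
```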

Post-install script: scripts/post-install.sh

#!/bin/bash
set -euo pipefail

echo "=== Post-install configuration ==="
echo "Image: ${IMAGE_NAME} v${IMAGE_VERSION} (${TARGET_DISTRO})"

# Install additional packages specific to this image type
if command -v dnf &>/dev/null; then
    dnf install -y htop tmux vim-enhanced nmap-ncat
elif command -v apt-get &>/dev/null; then
    apt-get install -y -q htop tmux vim ncat
fi

# Configure sshd for image use
cat > /etc/ssh/sshd_config.d/99-image.conf << 'EOF'
PermitRootLogin prohibit-password
PasswordAuthentication no
ChallengeResponseAuthentication no
EOF

# Enable services that should start on first boot
systemctl enable cloud-init
systemctl enable cloud-init-local
systemctl enable cloud-config
systemctl enable cloud-final

# Write image metadata
mkdir -p /etc/kldload
cat > /etc/kldload/image-metadata.json << EOF
{
  "image_name": "${IMAGE_NAME}",
  "image_version": "${IMAGE_VERSION}",
  "target_distro": "${TARGET_DISTRO}",
  "build_timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "builder": "packer-qemu"
}
EOF

echo "=== Post-install complete ==="

Seal script: scripts/seal.sh

#!/bin/bash
set -euo pipefail

echo "=== Sealing image for cloning ==="

# Clear machine identity
truncate -s 0 /etc/machine-id
rm -f /var/lib/dbus/machine-id

# Remove SSH host keys (regenerated on first boot)
rm -f /etc/ssh/ssh_host_*

# Clear persistent network interface naming
rm -f /etc/udev/rules.d/70-persistent-net.rules
rm -f /etc/udev/rules.d/75-net-description.rules

# Clear bash history
unset HISTFILE
history -c
rm -f /root/.bash_history /home/*/.bash_history

# Remove cloud-init's per-instance state so it runs again on first boot.
# instances/ is a directory, so rm needs -r; plain rm -f fails under set -e.
rm -rf /var/lib/cloud/instances
rm -rf /var/lib/cloud/instance
cloud-init clean --logs

# Configure cloud-init multi-datasource (auto-detects AWS, GCP, Azure, NoCloud)
mkdir -p /etc/cloud/cloud.cfg.d
cat > /etc/cloud/cloud.cfg.d/99-datasource.cfg << 'EOF'
datasource_list:
  - NoCloud
  - ConfigDrive
  - Ec2
  - GCE
  - Azure
  - AltCloud
  - OpenStack
  - None
EOF

# Export ZFS pools so the image can be imported fresh on first boot
zpool export -a 2>/dev/null || true

echo "=== Image sealed ==="

Build commands

# Initialize plugins
cd packer/kldload-server
packer init .

# Validate the template
packer validate kldload-server.pkr.hcl

# Build with default variables (CentOS server profile)
packer build kldload-server.pkr.hcl

# Build with a specific profile
packer build -var-file=profiles/k8s-node.pkrvars.hcl kldload-server.pkr.hcl

# Build all profiles in parallel
packer build \
  -var-file=profiles/server.pkrvars.hcl \
  kldload-server.pkr.hcl &

packer build \
  -var-file=profiles/k8s-node.pkrvars.hcl \
  kldload-server.pkr.hcl &

wait
echo "All builds complete"

The QEMU builder creates a VM locally, boots the ISO, runs the installer, and exports the disk image. No cloud account is needed, and no network access is needed during the build because the darksites are baked into the kldload ISO. The entire pipeline runs on your build machine. This is how you build images in air-gapped environments: burn a kldload ISO to a USB drive, boot a physical build machine from it, run Packer, and the image appears on a local disk. Copy the image off that disk and deploy it anywhere. The ISO is self-contained. The pipeline is self-contained. Nothing phones home.
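One caveat with the parallel build commands above: a bare `wait` returns success even if a background build failed, so "All builds complete" can be printed after a broken build. A sketch of one way to collect each exit status follows. `run_parallel` is a hypothetical helper that takes each build as a single command string.

```shell
#!/bin/bash
# Sketch: run several commands in parallel and fail if any of them fails.
# run_parallel is a hypothetical helper, not part of Packer or kldload.
run_parallel() {
    local pids=() failed=0 cmd pid

    # Start every command in the background and remember its PID
    for cmd in "$@"; do
        bash -c "$cmd" &
        pids+=("$!")
    done

    # wait <pid> returns that job's exit status, unlike a bare wait
    for pid in "${pids[@]}"; do
        wait "$pid" || failed=$((failed + 1))
    done

    echo "$failed build(s) failed"
    [ "$failed" -eq 0 ]
}

# Usage:
#   run_parallel \
#     "packer build -var-file=profiles/server.pkrvars.hcl kldload-server.pkr.hcl" \
#     "packer build -var-file=profiles/k8s-node.pkrvars.hcl kldload-server.pkr.hcl"
```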

5. Cloud-Specific Packer Builds

Once you have a local qcow2 image, you can upload it to any cloud and register it as a native image. Alternatively, you can build the image directly in the cloud using the cloud provider's Packer builder — this is faster for cloud-specific images because you skip the upload step and build in the same region where the image will run.

AWS AMI (amazon-ebs builder)

packer {
  required_plugins {
    amazon = {
      version = ">= 1.2.8"
      source  = "github.com/hashicorp/amazon"
    }
  }
}

variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "aws_instance_type" {
  type    = string
  default = "t3.medium"
}

variable "base_ami" {
  type        = string
  description = "A recent CentOS Stream 9 or Rocky 9 AMI to use as the base"
  default     = "ami-0xxxxxxxxxxxxxxxxx"  # find with: aws ec2 describe-images
}

source "amazon-ebs" "kldload" {
  region        = var.aws_region
  instance_type = var.aws_instance_type
  source_ami    = var.base_ami

  ssh_username = "ec2-user"

  ami_name        = "kldload-server-${formatdate("YYYYMMDD", timestamp())}"
  ami_description = "kldload golden image — ZFS + WireGuard + cloud-init"

  ami_regions = [
    "us-east-1",
    "us-west-2",
    "eu-west-1"
  ]

  tags = {
    Name        = "kldload-server"
    Version     = "1.0.0"
    BuildDate   = formatdate("YYYY-MM-DD", timestamp())
    ManagedBy   = "packer"
  }

  # Encrypt the AMI root volume
  encrypt_boot      = true
  kms_key_id        = "alias/kldload-images"

  # Launch block device for the AMI
  launch_block_device_mappings {
    device_name           = "/dev/xvda"
    volume_size           = 20
    volume_type           = "gp3"
    iops                  = 3000
    throughput            = 125
    delete_on_termination = true
  }
}

build {
  name    = "kldload-aws"
  sources = ["source.amazon-ebs.kldload"]

  provisioner "shell" {
    script = "scripts/post-install-aws.sh"
  }

  provisioner "shell" {
    script = "scripts/seal.sh"
  }
}

GCP image (googlecompute builder)

source "googlecompute" "kldload" {
  project_id          = "my-gcp-project"
  source_image_family = "centos-stream-9"
  zone                = "us-central1-a"
  machine_type        = "n2-standard-2"

  image_name        = "kldload-server-${formatdate("YYYYMMDD", timestamp())}"
  image_description = "kldload golden image — ZFS + WireGuard + cloud-init"
  image_family      = "kldload-server"

  image_labels = {
    managed_by = "packer"
    version    = "1-0-0"
  }

  disk_size = 20
  disk_type = "pd-ssd"

  ssh_username = "packer"
}

build {
  name    = "kldload-gcp"
  sources = ["source.googlecompute.kldload"]

  provisioner "shell" {
    script = "scripts/post-install-gcp.sh"
  }

  provisioner "shell" {
    script = "scripts/seal.sh"
  }
}

Azure managed image (azure-arm builder)

source "azure-arm" "kldload" {
  # Authentication — use a service principal or managed identity
  # Set via environment: ARM_CLIENT_ID, ARM_CLIENT_SECRET, ARM_SUBSCRIPTION_ID, ARM_TENANT_ID

  managed_image_name                = "kldload-server-${formatdate("YYYYMMDD", timestamp())}"
  managed_image_resource_group_name = "kldload-images-rg"

  os_type         = "Linux"
  image_publisher = "OpenLogic"
  image_offer     = "CentOS"
  image_sku       = "8_5-gen2"

  azure_tags = {
    ManagedBy = "packer"
    Version   = "1.0.0"
  }

  location         = "eastus"
  vm_size          = "Standard_D2s_v5"
  os_disk_size_gb  = 30

  # Azure requires VHDs to be fixed-size
  # Packer handles this automatically for azure-arm

  communicator = "ssh"
  ssh_username = "packer"
}

build {
  name    = "kldload-azure"
  sources = ["source.azure-arm.kldload"]

  provisioner "shell" {
    script = "scripts/post-install-azure.sh"
  }

  provisioner "shell" {
    script = "scripts/seal.sh"
  }
}

Cloud-init multi-datasource configuration

kldload's seal script writes a cloud-init datasource config that auto-detects the cloud environment. On first boot, cloud-init reads instance metadata from whatever metadata service is available — AWS IMDSv2, GCP metadata server, Azure IMDS, or a local NoCloud seed — and configures the hostname, network, and injected SSH keys automatically.

# /etc/cloud/cloud.cfg.d/99-datasource.cfg (written by kexport seal)
# datasource_list in priority order — first match wins
datasource_list:
  - NoCloud         # local: seed from ISO or filesystem (KVM, VirtualBox)
  - ConfigDrive     # OpenStack
  - Ec2             # AWS (also works for Exoscale, Outscale, etc.)
  - GCE             # Google Cloud
  - Azure           # Azure
  - AltCloud        # RHEV-M / vSphere
  - OpenStack       # generic OpenStack
  - None            # fallback: no cloud-init, run with defaults

The same kldload golden image runs on all three major clouds. The differences are: disk format (qcow2 vs raw vs VHD), metadata service (all use 169.254.169.254 but with different API paths and authentication), and boot firmware (UEFI vs BIOS, though all major clouds now support UEFI). cloud-init abstracts most of these differences. The multi-datasource config means a single sealed image boots correctly on AWS, GCP, Azure, and on a local KVM host: cloud-init queries each metadata service in order and uses whichever one responds. You build the image once. You deploy it everywhere. No per-cloud customization needed.
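On a local KVM host there is no metadata service, so the NoCloud datasource reads its configuration from a small seed volume labelled `cidata`. The sketch below writes the three seed files for one node. `write_nocloud_seed` is a hypothetical helper, and the interface name `eth0` and the /24 prefix are assumptions that must match your image and network.

```shell
#!/bin/bash
# Sketch: write a NoCloud seed directory (meta-data, user-data, network-config).
# write_nocloud_seed is hypothetical; eth0 and /24 are assumptions.
write_nocloud_seed() {
    local dir="$1" name="$2" ip="$3"
    mkdir -p "$dir"

    # meta-data: a new instance-id makes cloud-init treat this as a first boot
    printf 'instance-id: %s\nlocal-hostname: %s\n' "$name" "$name" > "$dir/meta-data"

    cat > "$dir/user-data" <<EOF
#cloud-config
hostname: $name
ssh_pwauth: false
EOF

    cat > "$dir/network-config" <<EOF
version: 2
ethernets:
  eth0:
    addresses:
      - $ip/24
EOF
}

# Package the directory as an ISO with the volume label cloud-init looks for:
#   cd seed && genisoimage -output ../seed.iso -volid cidata -joliet -rock \
#     user-data meta-data network-config
```

Attach the resulting ISO as a CD-ROM on first boot and cloud-init picks it up through the NoCloud entry at the top of the datasource list.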

6. Terraform Basics

Terraform is an infrastructure-as-code tool that declares what infrastructure you want, then creates, modifies, or destroys resources to match that declaration. You describe VMs, networks, DNS records, storage buckets, and load balancers in .tf files. Terraform figures out what needs to change and executes it in the right order.

Resources

A resource is a thing Terraform manages — a VM, a network, a DNS record, a storage bucket. Each resource has a type (e.g. libvirt_domain, aws_instance) and a set of arguments. Terraform tracks resources in state and reconciles them on every apply.

// resource "aws_instance" "web" { ... }
// This declares: one EC2 instance named "web"

Providers

A provider is a plugin that knows how to talk to a specific API. Common providers: dmacvicar/libvirt (KVM), telmate/proxmox (Proxmox VE), hashicorp/aws, hashicorp/google, hashicorp/azurerm. Providers are downloaded automatically on terraform init.

// provider = API client + resource definitions
// One workspace can use multiple providers

State

Terraform stores what it has created in a state file (terraform.tfstate). On every plan/apply, it compares three things: the desired state (your .tf files), the recorded state (the state file), and the real resources (queried through the provider's API). Without state, Terraform cannot know what it already created.

// state = Terraform's memory of what it built
// Never edit state manually; use terraform state commands

Remote state backends

For team use, store state in a remote backend: S3 + DynamoDB (AWS), GCS (GCP), Azure Blob, or Terraform Cloud. Remote state allows multiple team members to run Terraform without stomping on each other and enables state locking to prevent concurrent runs.

// local state: fine for solo use or CI
// remote state: required for team/production use

Basic Terraform workflow

# Initialize: download providers and set up backend
terraform init

# Preview changes without applying them
terraform plan

# Apply changes (creates/modifies/destroys resources)
terraform apply

# Destroy all resources managed by this workspace
terraform destroy

# Show current state
terraform show

# List all resources in state
terraform state list

# Remove a specific resource from state (without destroying it)
# (quote the address so the shell does not interpret the brackets and quotes)
terraform state rm 'libvirt_domain.kldload_vm["worker-1"]'

Terraform does not configure servers. It creates infrastructure (VMs, networks, DNS records, storage) and wires it together. Packer builds the image. Terraform deploys it. This separation is important: the image is immutable, and Terraform is the deployment mechanism. When you want to update software on your fleet, you do not update the running servers with Terraform. You build a new image with Packer, update the image_path variable in your Terraform config, and run terraform apply. Terraform destroys the old VMs and creates new ones from the new image. The new servers are identical. The state is clean. No drift, no snowflakes.

7. Deploying kldload Images with Terraform (KVM / libvirt)

On a kldload KVM host, the libvirt Terraform provider replaces manual virt-install commands. You describe a fleet of VMs in a .tf file and Terraform creates them all in parallel, each with unique hostname, IP, and cloud-init configuration derived from the same golden image.

Provider configuration

terraform {
  required_providers {
    libvirt = {
      source  = "dmacvicar/libvirt"
      version = "~> 0.7"
    }
  }
}

provider "libvirt" {
  uri = "qemu:///system"
  # For remote KVM host:
  # uri = "qemu+ssh://root@kvm-host.example.com/system"
}

Base image and network

# Pool where images are stored
resource "libvirt_pool" "kldload" {
  name = "kldload"
  type = "dir"
  path = "/var/lib/libvirt/images/kldload"
}

# The golden image (built by Packer, uploaded once)
resource "libvirt_volume" "base_image" {
  name   = "kldload-server-1.0.0.qcow2"
  pool   = libvirt_pool.kldload.name
  source = "/images/kldload-server-1.0.0.qcow2"
  format = "qcow2"
}

# Isolated network for the fleet
resource "libvirt_network" "kldload_net" {
  name      = "kldload-fleet"
  mode      = "nat"
  domain    = "fleet.local"
  addresses = ["10.100.0.0/24"]

  dhcp { enabled = false }  # we assign IPs via cloud-init

  dns { enabled = true }
}

Fleet definition with cloud-init

# Variables
variable "fleet_nodes" {
  description = "Map of node name to IP address"
  type        = map(string)
  default = {
    "kldload-web-1"  = "10.100.0.11"
    "kldload-web-2"  = "10.100.0.12"
    "kldload-app-1"  = "10.100.0.21"
  }
}

variable "ssh_public_key" {
  type    = string
  default = "~/.ssh/id_ed25519.pub"
}

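locals {
  # file() does not expand "~", so expand it first; trimspace drops the trailing newline
  ssh_key = trimspace(file(pathexpand(var.ssh_public_key)))
}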

# ─── Per-node disk (thin clone from golden image) ─────────────────────────────

resource "libvirt_volume" "node_disk" {
  for_each = var.fleet_nodes

  name           = "${each.key}.qcow2"
  pool           = libvirt_pool.kldload.name
  base_volume_id = libvirt_volume.base_image.id
  format         = "qcow2"
  size           = 42949672960  # 40 GiB
}

# ─── Per-node cloud-init ISO ──────────────────────────────────────────────────

resource "libvirt_cloudinit_disk" "node_init" {
  for_each = var.fleet_nodes

  name = "${each.key}-init.iso"
  pool = libvirt_pool.kldload.name

  user_data = <<-EOF
    #cloud-config
    hostname: ${each.key}
    fqdn: ${each.key}.fleet.local
    manage_etc_hosts: true
    users:
      - name: ops
        groups: wheel
        sudo: ALL=(ALL) NOPASSWD:ALL
        shell: /bin/bash
        ssh_authorized_keys:
          - ${local.ssh_key}
    ssh_pwauth: false
    packages:
      - vim
      - tmux
    runcmd:
      - systemctl enable --now zfs-import-cache
      - echo "Node ${each.key} is up" > /etc/motd
  EOF

  network_config = <<-EOF
    version: 2
    ethernets:
      eth0:
        addresses:
          - ${each.value}/24
        gateway4: 10.100.0.1
        nameservers:
          addresses: [10.100.0.1, 1.1.1.1]
  EOF
}

# ─── VM definitions ───────────────────────────────────────────────────────────

resource "libvirt_domain" "fleet_node" {
  for_each = var.fleet_nodes

  name   = each.key
  memory = 2048
  vcpu   = 2

  cpu { mode = "host-passthrough" }

  disk {
    volume_id = libvirt_volume.node_disk[each.key].id
  }

  cloudinit = libvirt_cloudinit_disk.node_init[each.key].id

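  network_interface {
    network_id     = libvirt_network.kldload_net.id
    hostname       = each.key
    # DHCP is disabled on this network and IPs are set statically by cloud-init,
    # so there is no lease to wait for; wait_for_lease = true would time out.
    wait_for_lease = false
  }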

  console {
    type        = "pty"
    target_type = "serial"
    target_port = "0"
  }

  graphics {
    type        = "vnc"
    listen_type = "address"
    autoport    = true
  }
}

# ─── Outputs ──────────────────────────────────────────────────────────────────

output "fleet_ips" {
  value = {
    for name, ip in var.fleet_nodes : name => ip
  }
}

output "ssh_commands" {
  value = {
    for name, ip in var.fleet_nodes : name => "ssh ops@${ip}"
  }
}

# Deploy the fleet
terraform init
terraform plan
terraform apply

# View the deployed IPs
terraform output fleet_ips

# Destroy everything cleanly
terraform destroy
On a kldload KVM host, Terraform with the libvirt provider replaces manual virt-install commands entirely. You describe 10 VMs in a .tf file, run terraform apply, and all 10 are created in parallel from the same golden image, each with a unique IP, hostname, and cloud-init configuration. terraform destroy removes them all cleanly, including their disk images. The key efficiency is the thin clone: each VM's disk is a qcow2 linked clone backed by the golden image, so creating 10 VMs does not copy 10 x 40 GB of virtual disk. It creates 10 overlay files that start nearly empty and store only the blocks that differ from the base. A fleet of 10 nodes built from a golden image that occupies 3 GB on disk therefore uses about 3 GB in total until the nodes write significant data.
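The thin-clone arithmetic can be checked with a few lines of shell. This is a rough sketch: the function names are invented for illustration, and the 2 GB of writes per node is an assumed example figure, not a measurement.

```shell
#!/usr/bin/env bash
# Back-of-envelope storage estimate for a fleet of qcow2 linked clones.
# All figures are illustrative inputs, not measurements.

# full_copy_gb NODES VIRTUAL_SIZE_GB: every node gets a full-size copy
full_copy_gb() { echo $(( $1 * $2 )); }

# linked_clone_gb NODES BASE_GB DELTA_GB_PER_NODE: one shared base plus overlays
linked_clone_gb() { echo $(( $2 + $1 * $3 )); }

echo "full copies:   $(full_copy_gb 10 40) GB"
echo "linked clones: $(linked_clone_gb 10 3 0) GB at creation"
echo "linked clones: $(linked_clone_gb 10 3 2) GB after about 2 GB of writes per node"
```

With these inputs the three lines report 400 GB, 3 GB, and 23 GB. The overlay cost grows only with what each node actually writes.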

8. Deploying to AWS with Terraform

The Packer build in section 5 produced a registered AMI. Now Terraform uses that AMI to deploy EC2 instances with the full kldload configuration — VPC, subnets, security groups, and a separate EBS volume for ZFS data.

AWS provider and variables

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "kldload-terraform-state"
    key            = "aws/production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "kldload-terraform-locks"
    encrypt        = true
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Project     = "kldload"
      Environment = var.environment
    }
  }
}

variable "aws_region"    { default = "us-east-1" }
variable "environment"   { default = "production" }
variable "kldload_ami"   { description = "AMI ID from Packer build" }
variable "instance_type" { default = "t3.large" }
variable "node_count"    { default = 3 }
variable "ssh_key_name"  { description = "EC2 key pair name" }

VPC and networking

resource "aws_vpc" "kldload" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags = { Name = "kldload-${var.environment}" }
}

resource "aws_subnet" "kldload_private" {
  count             = 3
  vpc_id            = aws_vpc.kldload.id
  cidr_block        = "10.0.${count.index + 1}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]
  tags = { Name = "kldload-private-${count.index + 1}" }
}

data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_security_group" "kldload_nodes" {
  name   = "kldload-nodes-${var.environment}"
  vpc_id = aws_vpc.kldload.id

  ingress {
    description = "SSH from VPC"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.kldload.cidr_block]
  }

  ingress {
    description = "WireGuard"
    from_port   = 51820
    to_port     = 51820
    protocol    = "udp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

EC2 instances with ZFS EBS data volumes

resource "aws_instance" "kldload_node" {
  count = var.node_count

  ami           = var.kldload_ami
  instance_type = var.instance_type
  key_name      = var.ssh_key_name
  subnet_id     = aws_subnet.kldload_private[count.index % 3].id

  vpc_security_group_ids = [aws_security_group.kldload_nodes.id]

  # Root volume: ext4 boot disk from the AMI; ZFS lives on the data volume
  root_block_device {
    volume_type           = "gp3"
    volume_size           = 20
    iops                  = 3000
    throughput            = 125
    encrypted             = true
    delete_on_termination = true
  }

  user_data = base64encode(<<-EOF
    #cloud-config
    hostname: kldload-node-${count.index + 1}
    fqdn: kldload-node-${count.index + 1}.${var.environment}.internal
    manage_etc_hosts: true
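    runcmd:
      # Import the ZFS data pool from the attached EBS volume, or create it
      # if the volume is new. runcmd runs under sh, so use POSIX redirection.
      - |
        # Wait for the EBS volume to appear. On Nitro instances /dev/sdf
        # usually shows up as nvme1n1; verify the name on your instance type.
        while [ ! -b /dev/nvme1n1 ]; do sleep 1; done
        if zpool list data >/dev/null 2>&1; then
          echo "pool 'data' already imported"
        elif zpool import data 2>/dev/null; then
          # Replacement instance: the volume already carries the pool
          echo "imported existing pool 'data'"
        else
          # First boot on a blank volume: create the pool
          zpool create -o ashift=12 \
            -O compression=lz4 \
            -O atime=off \
            -O mountpoint=/data \
            data /dev/nvme1n1
        fi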
  EOF
  )

  tags = { Name = "kldload-node-${count.index + 1}" }
}

# ZFS data volume — separate EBS volume, persistent across instance replacements
resource "aws_ebs_volume" "kldload_data" {
  count = var.node_count

  availability_zone = aws_instance.kldload_node[count.index].availability_zone
  size              = 100
  type              = "gp3"
  iops              = 3000
  throughput        = 125
  encrypted         = true

  tags = { Name = "kldload-data-${count.index + 1}" }
}

resource "aws_volume_attachment" "kldload_data" {
  count = var.node_count

  device_name = "/dev/sdf"
  volume_id   = aws_ebs_volume.kldload_data[count.index].id
  instance_id = aws_instance.kldload_node[count.index].id
}

output "node_private_ips" {
  value = aws_instance.kldload_node[*].private_ip
}
ZFS on AWS works by attaching a dedicated EBS volume and creating or importing a ZFS pool on it. In this layout the root volume (the AMI boot disk) is ext4 and ZFS is used only on the data volume. Treat that as a convention of this pipeline's AMI build rather than a ZFS limitation: EBS itself snapshots at the block level and does not inspect the filesystem. Your data, databases, application state, and working directories all live on the ZFS EBS volume. You get ZFS compression, snapshots, and zfs send/receive on cloud storage, with EBS providing the underlying IOPS. The kldload cloud-init firstboot script handles ZFS pool creation automatically on the first boot and pool import on subsequent reboots. Keeping the ZFS EBS volume separate from the instance means you can replace or resize the EC2 instance without losing data: detach the volume, terminate the instance, create a new instance from the latest kldload AMI, and reattach the volume. The pool imports with all your data intact.
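The detach, replace, and reattach sequence can be spelled out as a dry-run plan. The sketch below only prints commands and runs nothing. The function name, the instance and volume IDs, and the SSH target are placeholders, and the leading zpool export step is an added assumption (a clean export before detaching) rather than something the text above prescribes. When Terraform manages the instances, let terraform apply perform the replacement; this only illustrates the order of operations.

```shell
#!/usr/bin/env bash
# Print (not run) the AWS CLI steps for swapping the instance under a ZFS
# data volume. IDs and the SSH target are placeholders passed by the caller.

replace_node_plan() {
  local old_instance="$1" volume="$2" ssh_target="$3"
  cat <<PLAN
ssh ${ssh_target} sudo zpool export data
aws ec2 detach-volume --volume-id ${volume}
aws ec2 wait volume-available --volume-ids ${volume}
aws ec2 terminate-instances --instance-ids ${old_instance}
# launch the replacement from the latest kldload AMI, then:
aws ec2 attach-volume --volume-id ${volume} --instance-id <new-instance-id> --device /dev/sdf
PLAN
}

replace_node_plan i-0123456789abcdef0 vol-0123456789abcdef0 ops@10.0.1.10
```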

9. Deploying to GCP and Azure

GCP: Compute Engine instance from a kldload image

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

provider "google" {
  project = var.gcp_project
  region  = var.gcp_region
}

variable "gcp_project"      { description = "GCP project ID" }
variable "gcp_region"       { default     = "us-central1" }
variable "gcp_zone"         { default     = "us-central1-a" }
variable "kldload_image"    { description = "GCP image name from Packer build" }
variable "machine_type"     { default     = "n2-standard-2" }
variable "node_count"       { default     = 3 }

resource "google_compute_network" "kldload" {
  name                    = "kldload-network"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "kldload" {
  name          = "kldload-subnet"
  network       = google_compute_network.kldload.id
  ip_cidr_range = "10.10.0.0/24"
  region        = var.gcp_region
}

resource "google_compute_firewall" "kldload_ssh" {
  name    = "kldload-allow-ssh"
  network = google_compute_network.kldload.id

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }
  source_ranges = ["35.235.240.0/20"]  # Cloud IAP IP range for SSH tunneling
}

resource "google_compute_instance" "kldload_node" {
  count = var.node_count

  name         = "kldload-node-${count.index + 1}"
  machine_type = var.machine_type
  zone         = var.gcp_zone

  boot_disk {
    initialize_params {
      image = "projects/${var.gcp_project}/global/images/${var.kldload_image}"
      size  = 20
      type  = "pd-ssd"
    }
  }

  # Separate persistent disk for ZFS data
  attached_disk {
    source      = google_compute_disk.kldload_data[count.index].self_link
    device_name = "data"
    mode        = "READ_WRITE"
  }

  network_interface {
    subnetwork = google_compute_subnetwork.kldload.id
    # No external IP — use Cloud IAP for SSH access
  }

  metadata = {
    user-data = <<-EOF
      #cloud-config
      hostname: kldload-node-${count.index + 1}
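      runcmd:
        - |
          # Import the pool if the disk already carries one, else create it.
          while [ ! -b /dev/disk/by-id/google-data ]; do sleep 1; done
          if zpool list data >/dev/null 2>&1; then
            echo "pool 'data' already imported"
          elif zpool import data 2>/dev/null; then
            echo "imported existing pool 'data'"
          else
            zpool create -o ashift=12 -O compression=lz4 -O atime=off \
              -O mountpoint=/data data /dev/disk/by-id/google-data
          fi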
    EOF
  }

  service_account {
    scopes = ["cloud-platform"]
  }

  labels = {
    managed_by = "terraform"
    project    = "kldload"
  }
}

resource "google_compute_disk" "kldload_data" {
  count = var.node_count
  name  = "kldload-data-${count.index + 1}"
  type  = "pd-ssd"
  zone  = var.gcp_zone
  size  = 100

  labels = { managed_by = "terraform" }
}

Azure: VM from a managed image

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

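variable "azure_location"       { default = "eastus" }
variable "resource_group"       { default = "kldload-production" }
variable "kldload_image"        { description = "Managed image name from Packer build" }
# Assumption: Packer wrote the image to its own, pre-existing resource group.
# It cannot be the group this config creates below, because the data source
# is read before that group exists.
variable "image_resource_group" { description = "Resource group that holds the Packer-built image" }
variable "vm_size"              { default = "Standard_D2s_v5" }
variable "node_count"           { default = 3 }

data "azurerm_image" "kldload" {
  name                = var.kldload_image
  resource_group_name = var.image_resource_group
}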

resource "azurerm_resource_group" "kldload" {
  name     = var.resource_group
  location = var.azure_location
}

resource "azurerm_virtual_network" "kldload" {
  name                = "kldload-vnet"
  address_space       = ["10.20.0.0/16"]
  location            = azurerm_resource_group.kldload.location
  resource_group_name = azurerm_resource_group.kldload.name
}

resource "azurerm_subnet" "kldload" {
  name                 = "kldload-subnet"
  resource_group_name  = azurerm_resource_group.kldload.name
  virtual_network_name = azurerm_virtual_network.kldload.name
  address_prefixes     = ["10.20.1.0/24"]
}

resource "azurerm_network_interface" "kldload_node" {
  count               = var.node_count
  name                = "kldload-nic-${count.index + 1}"
  location            = azurerm_resource_group.kldload.location
  resource_group_name = azurerm_resource_group.kldload.name

  ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.kldload.id
    private_ip_address_allocation = "Dynamic"
  }
}

resource "azurerm_linux_virtual_machine" "kldload_node" {
  count               = var.node_count
  name                = "kldload-node-${count.index + 1}"
  resource_group_name = azurerm_resource_group.kldload.name
  location            = azurerm_resource_group.kldload.location
  size                = var.vm_size

  admin_username = "ops"

  admin_ssh_key {
    username   = "ops"
    public_key = file("~/.ssh/id_ed25519.pub")
  }

  network_interface_ids = [
    azurerm_network_interface.kldload_node[count.index].id
  ]

  source_image_id = data.azurerm_image.kldload.id

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Premium_LRS"
    disk_size_gb         = 30
  }

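  # The ZFS data disk is attached by the separate resources below:
  # azurerm_linux_virtual_machine has no inline data_disk block.

  custom_data = base64encode(<<-EOF
    #cloud-config
    hostname: kldload-node-${count.index + 1}
    runcmd:
      - |
        # Assumes the Azure udev rules that create /dev/disk/azure/scsi1/lunN
        # are present in the image; otherwise substitute the real device.
        DISK=/dev/disk/azure/scsi1/lun10
        while [ ! -b "$DISK" ]; do sleep 1; done
        if zpool list data >/dev/null 2>&1; then
          echo "pool 'data' already imported"
        elif zpool import data 2>/dev/null; then
          echo "imported existing pool 'data'"
        else
          zpool create -o ashift=12 -O compression=lz4 \
            -O atime=off -O mountpoint=/data data "$DISK"
        fi
  EOF
  )

  tags = { managed_by = "terraform" }
}

# ZFS data disk: a managed disk attached at LUN 10
resource "azurerm_managed_disk" "kldload_data" {
  count                = var.node_count
  name                 = "kldload-data-${count.index + 1}"
  location             = azurerm_resource_group.kldload.location
  resource_group_name  = azurerm_resource_group.kldload.name
  storage_account_type = "Premium_LRS"
  create_option        = "Empty"
  disk_size_gb         = 100
}

resource "azurerm_virtual_machine_data_disk_attachment" "kldload_data" {
  count              = var.node_count
  managed_disk_id    = azurerm_managed_disk.kldload_data[count.index].id
  virtual_machine_id = azurerm_linux_virtual_machine.kldload_node[count.index].id
  lun                = 10
  caching            = "None"
}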
GCP and Azure both support UEFI boot and cloud-init, which means the same sealed kldload image deploys on all three major clouds without modification. The differences are operational: GCP uses a metadata server at 169.254.169.254 with a different authentication header than AWS; Azure uses IMDS at 169.254.169.254 with a different API path; all three are handled by the multi-datasource cloud-init config baked into the sealed image. For a multi-cloud deployment, consider a shared Terraform module for the common parts (ZFS pool setup, WireGuard config injection, hostname assignment) and provider-specific modules that handle the VM creation and disk attachment for each cloud. Variables are shared; resources differ per provider.
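The metadata-service differences come down to one header and one path per cloud. As a hedged illustration, the sketch below prints the curl command that reads the instance ID on each provider; it executes nothing. The function name metadata_probe is invented for this example, and the Azure api-version value is one known published version, not necessarily the one cloud-init uses.

```shell
#!/usr/bin/env bash
# Print the curl command that reads the instance ID from each cloud's
# metadata service. Nothing is executed; run the printed command on an instance.

metadata_probe() {
  case "$1" in
    aws)   # IMDSv2: fetch a session token first, then present it
      echo 'TOKEN=$(curl -s -X PUT -H "X-aws-ec2-metadata-token-ttl-seconds: 60" http://169.254.169.254/latest/api/token) && curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id' ;;
    gcp)
      echo 'curl -s -H "Metadata-Flavor: Google" http://169.254.169.254/computeMetadata/v1/instance/id' ;;
    azure)
      echo 'curl -s -H "Metadata: true" "http://169.254.169.254/metadata/instance/compute/vmId?api-version=2021-02-01&format=text"' ;;
    *)
      echo "unknown cloud: $1" >&2; return 1 ;;
  esac
}

for cloud in aws gcp azure; do
  printf '%-6s %s\n' "$cloud" "$(metadata_probe "$cloud")"
done
```

Running the printed command from a shell on an instance is a quick way to confirm which datasource cloud-init should be detecting.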

10. The Golden Image Lifecycle

A golden image is not a one-time artifact. It is a versioned, tested, promoted release that goes through the same pipeline as your application code. Every change to the image — a security patch, a new package, a configuration update — produces a new image version. The old version is not deleted until the new one is validated in production.

Image naming and versioning

# Consistent naming convention
kldload-server-YYYYMMDD           # date-based (simple, chronological)
kldload-server-v1.2.3             # semantic version (structured, for releases)
kldload-server-main-abc1234       # git branch + commit hash (CI builds)

# Examples
kldload-server-20260402           # built April 2, 2026
kldload-k8s-node-v2.1.0          # k8s node image, version 2.1.0
kldload-server-main-7f3a8b2      # built from main branch commit 7f3a8b2
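
The three naming schemes listed above can be generated by small helper functions so that every build script produces identical names. This is a minimal sketch; the function names are hypothetical and not part of kldload or Packer.

```shell
#!/usr/bin/env bash
# Build image names for the date, semantic-version, and branch+commit schemes.

image_name_date()   { echo "$1-$(date +%Y%m%d)"; }   # args: family
image_name_semver() { echo "$1-v$2"; }                # args: family, version
image_name_commit() { echo "$1-$2-${3:0:7}"; }        # args: family, branch, full or short SHA

image_name_semver kldload-k8s-node 2.1.0
image_name_commit kldload-server main 7f3a8b2c9d0e1f
image_name_date   kldload-server
```

The first two calls print kldload-k8s-node-v2.1.0 and kldload-server-main-7f3a8b2; the third depends on the day it runs.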

Image rotation policy

#!/bin/bash
# rotate-images.sh — keep the last 5 images per family, delete older ones

IMAGE_FAMILY="kldload-server"
KEEP_COUNT=5
REGION="us-east-1"

# List all AMIs for this family, sorted by creation date (oldest first)
mapfile -t AMIS < <(
  aws ec2 describe-images \
    --owners self \
    --filters "Name=name,Values=${IMAGE_FAMILY}-*" \
    --query 'sort_by(Images, &CreationDate)[].[ImageId,Name]' \
    --output text \
    --region "${REGION}" \
  | awk '{print $1}'
)

TOTAL=${#AMIS[@]}
DELETE_COUNT=$(( TOTAL - KEEP_COUNT ))

if (( DELETE_COUNT > 0 )); then
  echo "Keeping ${KEEP_COUNT} of ${TOTAL} images, deleting ${DELETE_COUNT} oldest"
  for i in $(seq 0 $(( DELETE_COUNT - 1 ))); do
    AMI_ID="${AMIS[$i]}"
    # Look up the backing snapshot first: once the AMI is deregistered,
    # describe-images may stop returning it
    SNAPSHOT=$(aws ec2 describe-images --image-ids "${AMI_ID}" --region "${REGION}" \
      --query 'Images[0].BlockDeviceMappings[0].Ebs.SnapshotId' --output text)
    echo "Deregistering ${AMI_ID}..."
    aws ec2 deregister-image --image-id "${AMI_ID}" --region "${REGION}"
    # Then delete the associated snapshot
    aws ec2 delete-snapshot --snapshot-id "${SNAPSHOT}" --region "${REGION}"
  done
else
  echo "Only ${TOTAL} images exist, nothing to delete"
fi

Test pipeline: build → validate → promote

# 1. Build the image
packer build \
  -var "image_version=$(date +%Y%m%d)" \
  kldload-server.pkr.hcl

# 2. Deploy to a test VM in its own workspace, so the test apply
#    cannot modify production state (-or-create needs Terraform 1.4+)
terraform workspace select -or-create test
terraform apply \
  -var "kldload_ami=${NEW_AMI_ID}" \
  -var "environment=test" \
  -target='aws_instance.kldload_node[0]'

# 3. Run validation tests
ssh ops@"${TEST_NODE_IP}" 'bash -s' << 'EOF'
  set -e
  echo "=== Smoke tests ==="
  systemctl is-active sshd          || { echo "FAIL: sshd"; exit 1; }
  zpool status                       || { echo "FAIL: ZFS"; exit 1; }
  wg show                            || { echo "FAIL: WireGuard"; exit 1; }
  df -h /                            # verify root filesystem
  free -h                            # verify memory
  uname -r                           # verify kernel
  echo "=== All smoke tests passed ==="
EOF

# 4. If tests pass, update production variable file
echo "kldload_ami = \"${NEW_AMI_ID}\"" > environments/production.tfvars

# 5. Deploy to production (rolling, one node at a time)
#    Switch back to the production workspace before applying
terraform workspace select -or-create production
for i in 0 1 2; do
  terraform apply \
    -var-file=environments/production.tfvars \
    -target="aws_instance.kldload_node[${i}]"
  sleep 30  # wait for node to come up and pass health checks
done
Immutable infrastructure means you never SSH into a production server to fix something. You fix it in the image pipeline — update a package, change a config file, add a systemd unit — build a new image, deploy it, and destroy the old one. If the new image is broken, you deploy the previous version in under a minute. The server is disposable. The image pipeline is the source of truth. This sounds simple but requires a culture shift: engineers must resist the temptation to "just log in and fix it." Every fix that happens on a live server creates a snowflake that diverges from the image and cannot be reproduced. The next deploy reverts the fix. The discipline is: nothing changes on a running server. Everything changes in the pipeline, is tested, and is deployed as a new image.

11. Secrets and Configuration

The number one image pipeline mistake is baking secrets into the golden image. Every instance gets the same database password. Every instance has your WireGuard private key. If the image leaks — and images are large files that get copied, uploaded, and shared — everything leaks with it. Secrets are injected at deploy time, not baked at build time.

What goes in the image vs what is injected

| Bake into image (safe)            | Inject at deploy time (required)         |
|-----------------------------------|------------------------------------------|
| OS packages and kernel            | SSH host keys (regenerated by cloud-init) |
| Application binaries              | Machine ID (cleared by kexport seal)     |
| systemd unit files                | Hostname and IP address                  |
| Kernel tuning (sysctl)            | WireGuard private key                    |
| ZFS tuning and datasets           | Database passwords                       |
| Non-sensitive configuration       | API tokens and certificates              |
| User accounts (without passwords) | User SSH authorized keys                 |

Secrets tools: Vault, SOPS, age

# HashiCorp Vault — centralized secrets management
# Store WireGuard key in Vault
vault kv put secret/wireguard/node-1 private_key="$(wg genkey)"

# Read from Vault in a deployment script
WG_PRIVATE_KEY=$(vault kv get -field=private_key secret/wireguard/node-1)

# ─── SOPS + age — encrypted secrets in git ────────────────────────────────────
# SOPS encrypts specific values in YAML/JSON files, leaving keys readable
# age is a modern replacement for GPG

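# Generate an age key pair
mkdir -p ~/.config/sops/age
age-keygen -o ~/.config/sops/age/keys.txt

# Encrypt a secrets file
sops --encrypt --age age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx secrets.yaml > secrets.enc.yaml

# Decrypt at deploy time (key must be in ~/.config/sops/age/keys.txt).
# Terraform reads variable files only as HCL or JSON, chosen by file name,
# so convert the YAML to a .tfvars.json file and remove it afterwards.
sops --decrypt --output-type json secrets.enc.yaml > secrets.tfvars.json
terraform apply -var-file=secrets.tfvars.json
rm -f secrets.tfvars.json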

Injecting WireGuard keys via cloud-init from Vault

# In Terraform — fetch WireGuard key from Vault, inject via cloud-init

data "vault_kv_secret_v2" "wireguard" {
  count = var.node_count
  mount = "secret"
  name  = "wireguard/node-${count.index + 1}"
}

resource "aws_instance" "kldload_node" {
  count         = var.node_count
  ami           = var.kldload_ami
  instance_type = var.instance_type

  user_data = base64encode(<<-EOF
    #cloud-config
    hostname: kldload-node-${count.index + 1}
    write_files:
      - path: /etc/wireguard/wg0.conf
        permissions: '0600'
        content: |
          [Interface]
          PrivateKey = ${data.vault_kv_secret_v2.wireguard[count.index].data["private_key"]}
          Address = 10.200.${count.index + 1}.1/24
          ListenPort = 51820
          [Peer]
          PublicKey = ${var.wireguard_server_pubkey}
          Endpoint = ${var.wireguard_server_endpoint}:51820
          AllowedIPs = 10.200.0.0/16
    runcmd:
      - systemctl enable --now wg-quick@wg0
  EOF
  )
}
There are three common secrets injection patterns, in order of increasing security. First: Terraform reads secrets from a local file, environment variables, or a Vault data source and injects them into cloud-init user-data. This is simple and works everywhere, but the secrets appear in Terraform state and in the instance user-data, so encrypt the state backend and restrict who can read it. The Vault example above falls into this category. Second: Vault dynamic secrets, where Terraform calls the Vault API to generate a short-lived credential that is valid only for the duration of the instance's lifecycle. No long-lived secrets are kept in state. Third: instance-identity-based auth, where the EC2/GCP/Azure instance authenticates to Vault using its cloud provider identity (IAM role, service account) and the application fetches its own secrets at runtime. No secrets in cloud-init, no secrets in Terraform state, no secrets on disk. The right pattern depends on your threat model. The wrong pattern, baking secrets into the image, is never right.
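A runtime-fetch variant for WireGuard might look like the sketch below: the node renders wg0.conf itself after obtaining the key, so the key stays out of Terraform state and user-data. The function name render_wg0_conf and the example values are hypothetical. Because wg-quick reads its key from a file, this variant still writes the key to local disk (mode 0600), which is weaker than a fully in-memory fetch.

```shell
#!/usr/bin/env bash
# Render wg0.conf on the instance itself, so the private key never passes
# through Terraform state or cloud-init user-data.

render_wg0_conf() {
  local private_key="$1" address="$2" peer_pubkey="$3" endpoint="$4"
  cat <<CONF
[Interface]
PrivateKey = ${private_key}
Address = ${address}
ListenPort = 51820

[Peer]
PublicKey = ${peer_pubkey}
Endpoint = ${endpoint}:51820
AllowedIPs = 10.200.0.0/16
CONF
}

# On a real node the key would come from Vault after the instance has
# authenticated with its cloud identity, for example:
#   key=$(vault kv get -field=private_key secret/wireguard/node-1)
#   (umask 077; render_wg0_conf "$key" ... > /etc/wireguard/wg0.conf)
render_wg0_conf "EXAMPLE_PRIVATE_KEY" 10.200.1.1/24 "EXAMPLE_PEER_PUBKEY" vpn.example.com
```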

12. CI/CD Integration

The full pipeline: a git push triggers a Packer build, tests pass, Terraform deploys to staging, manual approval unlocks production, Terraform rolls out to production. The entire deployment history is your git history. Rollback is a git revert.

GitHub Actions pipeline

# .github/workflows/image-pipeline.yml

name: Image Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'packer/**'
      - 'scripts/**'

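env:
  AWS_REGION: us-east-1
  PACKER_LOG: 1

# id-token lets configure-aws-credentials assume the role through OIDC;
# contents: write lets the final job push deployed-image.json
permissions:
  id-token: write
  contents: write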

jobs:
  # ─── Build the golden image ─────────────────────────────────────────────────
  build-image:
    runs-on: [self-hosted, kldload-builder]  # runs on a kldload KVM host
    outputs:
      ami_id: ${{ steps.packer.outputs.ami_id }}
      image_version: ${{ steps.version.outputs.version }}
    steps:
      - uses: actions/checkout@v4

      - name: Set image version
        id: version
        run: echo "version=$(date +%Y%m%d)-${GITHUB_SHA::8}" >> "$GITHUB_OUTPUT"

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ vars.AWS_ACCOUNT_ID }}:role/packer-builder
          aws-region: ${{ env.AWS_REGION }}

      - name: Setup Packer
        uses: hashicorp/setup-packer@main
        with:
          version: "latest"

      - name: Packer init
        run: packer init packer/kldload-server/

      - name: Packer validate
        run: |
          packer validate \
            -var "image_version=${{ steps.version.outputs.version }}" \
            packer/kldload-server/kldload-server.pkr.hcl

      - name: Packer build
        id: packer
        run: |
          # Without pipefail, tee's exit status would hide a failed build
          set -o pipefail
          packer build \
            -var "image_version=${{ steps.version.outputs.version }}" \
            -machine-readable \
            packer/kldload-server/kldload-server.pkr.hcl \
          | tee /tmp/packer-output.txt

          AMI_ID=$(grep 'artifact,0,id' /tmp/packer-output.txt \
            | cut -d, -f6 | cut -d: -f2)
          echo "ami_id=${AMI_ID}" >> "$GITHUB_OUTPUT"
          echo "Built AMI: ${AMI_ID}"

  # ─── Deploy to staging and run smoke tests ──────────────────────────────────
  deploy-staging:
    needs: build-image
    runs-on: [self-hosted, kldload-builder]
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

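      - name: Terraform init
        run: terraform -chdir=terraform/aws init

      - name: Select staging workspace
        # Keeps staging state separate from production (-or-create needs Terraform 1.4+)
        run: terraform -chdir=terraform/aws workspace select -or-create staging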

      - name: Deploy to staging
        run: |
          terraform -chdir=terraform/aws apply -auto-approve \
            -var "kldload_ami=${{ needs.build-image.outputs.ami_id }}" \
            -var "environment=staging" \
            -var "node_count=1"

      - name: Get staging IP
        id: staging_ip
        run: |
          # node_private_ips is a list, which -raw cannot print; take the first element
          IP=$(terraform -chdir=terraform/aws output -json node_private_ips | jq -r '.[0]')
          echo "ip=${IP}" >> "$GITHUB_OUTPUT"

      - name: Run smoke tests
        run: |
          # Wait for SSH to be available
          for i in $(seq 1 30); do
            ssh -o StrictHostKeyChecking=no \
                -o ConnectTimeout=5 \
                ops@${{ steps.staging_ip.outputs.ip }} \
                'echo ok' && break
            sleep 10
          done

          ssh ops@${{ steps.staging_ip.outputs.ip }} 'bash -s' << 'TESTS'
            set -e
            systemctl is-active sshd
            zpool status
            wg show wg0
            echo "All smoke tests passed"
          TESTS

  # ─── Deploy to production (requires manual approval) ────────────────────────
  deploy-production:
    needs: [build-image, deploy-staging]
    runs-on: [self-hosted, kldload-builder]
    environment: production   # GitHub environment with required reviewers
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

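      - name: Terraform init
        run: terraform -chdir=terraform/aws init

      - name: Select production workspace
        run: terraform -chdir=terraform/aws workspace select -or-create production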

      - name: Rolling deploy to production
        run: |
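          # Count nodes from the node_private_ips output (no node_count output is defined)
          NODE_COUNT=$(terraform -chdir=terraform/aws output -json node_private_ips | jq 'length')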
          for i in $(seq 0 $(( NODE_COUNT - 1 ))); do
            echo "Deploying node ${i}..."
            terraform -chdir=terraform/aws apply -auto-approve \
              -var "kldload_ami=${{ needs.build-image.outputs.ami_id }}" \
              -var "environment=production" \
              -target="aws_instance.kldload_node[${i}]"
            echo "Node ${i} deployed, waiting 30s..."
            sleep 30
          done

      - name: Update image manifest
        run: |
          echo '{"ami_id": "${{ needs.build-image.outputs.ami_id }}", "version": "${{ needs.build-image.outputs.image_version }}", "deployed": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' \
            > deployed-image.json
          git config user.email "ci@kldload.com"
          git config user.name "kldload CI"
          git add deployed-image.json
          git commit -m "deploy: production image ${{ needs.build-image.outputs.image_version }}"
          git push
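
The AMI extraction in the build job can be tried offline against sample log lines. The sketch below assumes the machine-readable artifact line format that the grep and cut commands in the workflow rely on; the function name ami_from_packer_log and the sample lines are invented for illustration.

```shell
#!/usr/bin/env bash
# Extract the AMI ID from Packer's -machine-readable output on stdin.
# Artifact lines look like: <timestamp>,<builder>,artifact,0,id,<region>:<ami-id>

ami_from_packer_log() {
  awk -F, '$3 == "artifact" && $5 == "id" { split($6, a, ":"); id = a[2] }
           END { if (id == "") exit 1; print id }'
}

# Sample lines standing in for a real build log
printf '%s\n' \
  '1712052000,,ui,say,Build finished.' \
  '1712052000,amazon-ebs.kldload,artifact,0,id,us-east-1:ami-0123456789abcdef0' \
  | ami_from_packer_log
```

Unlike the grep pipeline, the function exits non-zero when no artifact line is present, so a build that produced no AMI fails the step instead of exporting an empty ami_id.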

GitLab CI alternative

# .gitlab-ci.yml

stages:
  - build
  - test
  - staging
  - production

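build-image:
  stage: build
  tags: [kldload-builder]
  script:
    # GitLab does not run $(...) inside variables:, so compute the version here
    - export IMAGE_VERSION="$(date +%Y%m%d)-${CI_COMMIT_SHORT_SHA}"
    - packer init packer/kldload-server/
    - packer build
        -var "image_version=${IMAGE_VERSION}"
        packer/kldload-server/kldload-server.pkr.hcl
    # Assumes the template has a manifest post-processor writing manifest.json.
    # A "cut -d: -f2" here would break YAML parsing (colon followed by a space).
    - AMI_ID=$(jq -r '.builds[-1].artifact_id | split(":")[1]' manifest.json)
    - echo "AMI_ID=${AMI_ID}" >> build.env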
  artifacts:
    reports:
      dotenv: build.env

deploy-staging:
  stage: staging
  tags: [kldload-builder]
  dependencies: [build-image]
  script:
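    - terraform -chdir=terraform/aws init
    # Separate workspace so staging never shares state with production
    - terraform -chdir=terraform/aws workspace select -or-create staging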
    - terraform -chdir=terraform/aws apply -auto-approve
        -var "kldload_ami=${AMI_ID}"
        -var "environment=staging"
  environment:
    name: staging

deploy-production:
  stage: production
  tags: [kldload-builder]
  dependencies: [build-image]
  when: manual   # requires manual click in GitLab UI
  script:
    - terraform -chdir=terraform/aws init
    - terraform -chdir=terraform/aws workspace select -or-create production
    - terraform -chdir=terraform/aws apply -auto-approve
        -var "kldload_ami=${AMI_ID}"
        -var "environment=production"
  environment:
    name: production
The full CI/CD pipeline makes infrastructure changes identical to software changes: both go through a git push, a build, an automated test, and a review-gated deploy. The git history IS the infrastructure history. Rollback is a git revert followed by a pipeline run — the previous AMI ID is restored to the Terraform config, and Terraform replaces the current instances with the previous image. No manual steps, no SSH sessions, no emergency patches. The constraint that enforces this discipline is: the only way to change infrastructure is through a git commit. Direct terraform apply from a local workstation should require MFA and leave an audit trail. Build machines have tightly scoped IAM roles that can only do what the pipeline needs. Humans do not have direct AWS console access to production. The pipeline is the only deployment mechanism.

Related pages