
CI/CD & GitOps Masterclass

This guide covers the complete CI/CD and GitOps stack on kldload: from GitOps principles and git-as-truth through Flux CD, ArgoCD, container image pipelines, artifact management, local package repositories, darksite customisation, postinstaller extensibility, Packer golden image pipelines, air-gapped deployment patterns, deployment strategies, rollback with ZFS snapshots and git revert, and DORA metrics monitoring. By the end you will know how to build, extend, and operate a complete delivery pipeline — connected or disconnected — on kldload infrastructure.

The premise: Most CI/CD fails not because teams lack automation, but because they have too much of it — pipelines layered on pipelines, each with its own state, its own secrets store, its own drift. GitOps inverts this. Git becomes the single source of truth. The cluster pulls desired state from git. Drift is detected automatically. Rollback is git revert. The entire system becomes auditable, reproducible, and boring — which is exactly what production infrastructure should be.

What this page covers: GitOps fundamentals and the pull-vs-push model. The kldload build pipeline (deploy.sh, 5 stages, how to extend it). Adding custom RPM and APT packages to the darksite. Extending the postinstaller with your own packages and configs. Building local package repositories with createrepo and reprepro. Container image registries and air-gapped mirroring. Flux CD and ArgoCD deep dives. CI pipeline design with Gitea Actions, GitLab CI, and GitHub Actions self-hosted runners. Golden image pipelines with Packer. Artifact management. Air-gapped deployment patterns. Deployment strategies (rolling, blue/green, canary). Rollback and recovery with git revert and ZFS snapshots. DORA metrics and pipeline monitoring.

Prerequisites: a running kldload system (desktop or server profile). The Kubernetes sections assume a cluster from the Kubernetes on KVM guide. The darksite sections assume familiarity with the Build Your Own workflow.

Why most CI/CD implementations fail

The reason most CI/CD implementations fail has nothing to do with tooling. Jenkins, GitLab CI, GitHub Actions, ArgoCD — they all work. The failure mode is complexity. Teams bolt on a CI server, then a CD server, then a secrets manager, then a policy engine, then an artifact scanner, then a deployment approval workflow — and within six months the pipeline itself becomes the most fragile, least-understood system in the organisation. The pipeline breaks more often than the application.

GitOps is not a tool — it is a constraint. By forcing all desired state into git, you eliminate half of those moving parts. The remaining half becomes auditable. Every change has a commit hash, an author, and a timestamp. Rollback is git revert. Disaster recovery is git clone on a new cluster. The constraint reduces failure, and that is the single most important idea on this page.

The teams that struggle most with CI/CD are not the ones with bad tools — they are the ones with too many good tools bolted together with no unifying model. GitOps provides that model.

1. GitOps Fundamentals

GitOps is an operational model where git is the single source of truth for declarative infrastructure and application definitions. A GitOps operator (Flux, ArgoCD) runs inside the cluster, watches a git repository, and continuously reconciles the live state to match the declared state. If someone makes a manual change to the cluster, the operator reverts it. If you want to change something, you commit to git. The cluster converges. This is the pull model — the cluster pulls state from git, rather than a CI server pushing deployments to the cluster.

Declarative vs Imperative

Imperative: "run these 14 commands in this order." Declarative: "here is the desired state — make it so." GitOps is declarative. You describe what you want (a Deployment with 3 replicas running image v2.1.0) and the operator figures out how to get there. If a pod crashes, the operator recreates it. If someone scales to 5 manually, the operator scales back to 3.

// Imperative is giving driving directions. Declarative is giving a destination. GPS figures out the route.

Pull vs Push

Push model: CI server has credentials to the cluster and runs kubectl apply after a build. Pull model: an operator inside the cluster watches git and applies changes itself. Pull is more secure (no cluster credentials in CI), more resilient (operator retries on failure), and more auditable (every change is a git commit).

// Push: the mailman walks into your house and puts mail on the table. Pull: you check the mailbox yourself.

Git as Source of Truth

Every infrastructure change is a git commit. Who changed what, when, and why is captured in the commit log. Rollback is git revert. Audit is git log. Disaster recovery is git clone on a new cluster. The entire infrastructure definition fits in a single repository — or a small set of repositories with clear ownership boundaries.

// Your git log is your change management database. No more spreadsheets or Jira tickets as audit trail.

Reconciliation Loop

The operator runs a continuous loop: read desired state from git, read actual state from the cluster, compute the diff, apply changes. The loop runs on a configurable interval, typically every 1 to 10 minutes, so drift is detected and corrected automatically. Barring a failed reconciliation, the cluster is out of sync for no longer than one interval before it converges back to what git declares.

// A thermostat for your infrastructure. Set the temperature. It maintains it.
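
The diff step of that loop can be sketched in a few lines of shell. This is a hypothetical illustration, not how Flux or ArgoCD are implemented: the "name=value" file format and the function name reconcile_plan are assumptions made for the example.

```shell
# Hypothetical sketch: neither the file format nor the function name comes
# from Flux or ArgoCD. Each input file holds one "name=value" pair per line.
reconcile_plan() {
    # $1 = desired state (what git declares), $2 = actual state (what is running)
    awk -F= '
        FILENAME == ARGV[1] { want[$1] = $2; next }   # first file: desired
        { have[$1] = $2 }                             # second file: actual
        END {
            for (k in want) {
                if (!(k in have)) {
                    print "create " k "=" want[k]
                } else if (have[k] != want[k]) {
                    print "update " k "=" want[k]
                }
            }
            for (k in have) {
                if (!(k in want)) {
                    print "prune " k
                }
            }
        }
    ' "$1" "$2" | sort
}
```

If git declares myapp.replicas=3 and the cluster reports myapp.replicas=5, the plan contains "update myapp.replicas=3". When both files match, the plan is empty, which is the steady state the loop keeps returning to.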

The four principles of GitOps

The OpenGitOps project (CNCF sandbox) defines four principles:

Declarative — the desired state of the system is expressed declaratively
Versioned and immutable — desired state is stored in a way that enforces immutability and versioning, retaining a complete history (git)
Pulled automatically — agents automatically pull the desired state from the source
Continuously reconciled — agents continuously observe actual state and attempt to apply the desired state

# The simplest GitOps workflow
git clone git@git.internal:infra/k8s-manifests.git
cd k8s-manifests

# Change a deployment image tag
sed -i 's|image: app:v1.2.0|image: app:v1.3.0|' apps/myapp/deployment.yaml
git add -A && git commit -m "promote myapp to v1.3.0"
git push

# Flux/ArgoCD detects the commit, applies the change, cluster converges
# No kubectl. No CI server touching the cluster. Just git.

2. The kldload Build Pipeline

The kldload build pipeline is orchestrated by deploy.sh and runs entirely inside containers (podman or docker, auto-detected). Understanding its five stages is the foundation for extending it with your own packages, darksites, and postinstallers.

The five build stages

Stage	Container	What it does	Output
1. Builder image	CentOS Stream 9	Installs lorax, squashfs-tools, xorriso, dracut, mtools	Builder container image
2. Debian darksite	debian:trixie-slim	Resolves + downloads all APT packages for offline Debian install	`live-build/darksite-debian-cache/`
3. Ubuntu darksite	ubuntu:noble	Same as Debian but with Ubuntu package sets (universe for ZFS)	`live-build/darksite-ubuntu-cache/`
4. RPM darksite	Builder (CentOS 9)	`dnf download --resolve --alldeps` for all RPM distros	RPM repo in build tree
5. ISO assembly	Builder (CentOS 9)	Bootstrap rootfs, build ZFS DKMS, embed darksites, create squashfs+EFI+ISO	`live-build/output/*.iso`

# Full rebuild from scratch
./deploy.sh clean
./deploy.sh builder-image
./deploy.sh build-debian-darksite   # slow first time, cached after
./deploy.sh build-ubuntu-darksite   # slow first time, cached after
PROFILE=desktop ./deploy.sh build

# Incremental rebuild (skips darksites if cache exists)
PROFILE=server ./deploy.sh build

# Deploy to KVM, Proxmox, or USB
./deploy.sh kvm-deploy
./deploy.sh proxmox-deploy
./deploy.sh burn                    # dd to /dev/sda; confirm with lsblk that this is the USB stick first

How to extend the pipeline

The pipeline is designed to be extended at three injection points: package sets (what gets downloaded into the darksite), build-iso.sh (what gets installed into the live ISO rootfs), and postinstallers (what runs on the target system after install). The rest of this masterclass covers each injection point in detail.

Build pipeline vs CI system

The kldload build pipeline is not a CI system; it is a build system that happens to be containerized. A CI system orchestrates triggers, parallelism, caching, and artifact promotion. The kldload pipeline is a fixed sequence: given the same inputs (package sets, cached darksites, configs, profile), it produces the same ISO contents. Without a warm darksite cache, package versions are resolved at download time and can differ between builds. This repeatability is what makes it suitable as a stage in a larger CI pipeline. You wrap deploy.sh in your CI, not the other way around.

Think of deploy.sh as a compiler. You would not replace GCC with Jenkins — you would have Jenkins call GCC. Same idea here.
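
A CI job that wraps deploy.sh can be as small as the sketch below. The stage names are the ones listed earlier in this section; the function name run_pipeline and the DEPLOY override are assumptions added so the wrapper can be dry-run against a stub.

```shell
# DEPLOY is the command to invoke. It defaults to the real script and can be
# pointed at a stub for a dry run. run_pipeline is a hypothetical name.
DEPLOY=${DEPLOY:-./deploy.sh}

run_pipeline() {
    profile=${1:-server}
    for stage in builder-image build-debian-darksite build-ubuntu-darksite build; do
        echo "==> $stage (PROFILE=$profile)"
        # Stop at the first failing stage so the CI job is marked as failed.
        if ! PROFILE=$profile $DEPLOY "$stage"; then
            echo "stage '$stage' failed" >&2
            return 1
        fi
    done
}
```

Your CI runner would call run_pipeline desktop and archive live-build/output/*.iso as the job artifact. Triggers, caching, and promotion stay in the CI system; the build logic stays in deploy.sh.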

3. Adding Packages to the Darksite

The darksite is kldload's offline package mirror — every package needed for installation is baked into the ISO so that the installer never needs internet access. Adding your own packages to the darksite means they will be available for offline installation on every system you build from that ISO.

RPM packages (CentOS, RHEL, Rocky, Fedora)

RPM package sets live in build/darksite/config/package-sets/. Each .txt file contains one package name per line. Dependencies resolve automatically via dnf download --resolve --alldeps.

# List existing package sets
ls build/darksite/config/package-sets/
# base.txt  desktop.txt  server.txt  zfs.txt  ...

# Add your custom packages — create a new file or append to an existing one
cat >> build/darksite/config/package-sets/custom.txt <<'EOF'
# Custom packages for our environment
htop
tmux
jq
yq
ansible-core
python3-pip
golang
rust
cargo
podman-compose
buildah
skopeo
EOF

# Dependencies resolve automatically. You only list top-level package names.
# Rebuild the darksite:
./deploy.sh build                   # incrementally rebuilds everything
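
Before a long rebuild it can help to see the merged package list. The sketch below is a pre-build lint, not part of kldload: the function name list_packages is hypothetical, and treating "#" as a comment marker is an assumption about the file format.

```shell
list_packages() {
    # $1 = directory of package-set files. Prints the unique package names,
    # ignoring blank lines and anything after a "#".
    cat "$1"/*.txt |
        sed -e 's/#.*//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' |
        sort -u
}
```

Running list_packages build/darksite/config/package-sets shows exactly which top-level names the darksite will be asked to resolve, and makes typos and accidental duplicates across files easy to spot.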

Debian APT packages

Debian package sets live in build/darksite-debian/config/package-sets/. The build runs inside a debian:trixie-slim container and uses apt-get download with full dependency resolution.

# Add Debian-specific packages
cat >> build/darksite-debian/config/package-sets/custom.txt <<'EOF'
htop
tmux
jq
ansible
python3-pip
golang-go
rustc
cargo
podman
buildah
skopeo
EOF

# Rebuild the Debian darksite (cached — only downloads new packages)
./deploy.sh build-debian-darksite

Ubuntu APT packages

Ubuntu package sets live in build/darksite-ubuntu/config/package-sets/. The build runs inside ubuntu:noble and requires the universe component for ZFS packages. The builder script is shared with Debian but uses Ubuntu-specific package sets.

# Add Ubuntu-specific packages
cat >> build/darksite-ubuntu/config/package-sets/custom.txt <<'EOF'
htop
tmux
jq
ansible
python3-pip
golang-go
rustc
cargo
podman
buildah
skopeo
EOF

# Rebuild the Ubuntu darksite
./deploy.sh build-ubuntu-darksite

Darksite architecture on the live ISO

When the ISO boots, the darksite is served differently depending on the target distro:

Protocol	Port	Path	Distros
HTTP (apt-cacher-ng style)	3142	`/root/darksite/debian/`	Debian
HTTP	3143	`/root/darksite/ubuntu/`	Ubuntu
file://	n/a	`/root/darksite/`	CentOS, RHEL, Rocky, Fedora

RPM distros use file:///root/darksite/ directly — no HTTP server needed. The installer configures the target system's package manager to point at the darksite during installation, then removes the darksite repo config after install completes.
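
The table can be read as a simple lookup from target distro to package source. The sketch below encodes it; the function name darksite_source is hypothetical, and using localhost as the HTTP host is an assumption, since the table gives only ports and paths.

```shell
darksite_source() {
    # $1 = target distro. Prints the package source the installer would use.
    case "$1" in
        debian)                   echo "http://localhost:3142/" ;;   # host is an assumption
        ubuntu)                   echo "http://localhost:3143/" ;;   # host is an assumption
        centos|rhel|rocky|fedora) echo "file:///root/darksite/" ;;
        *) echo "unknown distro: $1" >&2; return 1 ;;
    esac
}
```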

Why darksites matter for regulated environments

Most organisations in regulated environments — defence, healthcare, finance, critical infrastructure — cannot pull packages from the internet during provisioning. They end up maintaining fragile Satellite, Pulp, or Aptly servers that become single points of failure. The kldload approach is different: bake everything into the ISO. The ISO is the artifact. It is versioned, checksummed, and transportable. You can carry it on a USB stick through an air gap. No servers required. This is the pattern that scales to zero-trust, SCIF, and submarine environments.

The darksite is kldload's most underappreciated feature. It turns provisioning from a network-dependent process into a file-based one, which changes the entire operational model for disconnected environments.
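
Treating the ISO as a checksummed artifact can be sketched with sha256sum. The function names and the SHA256SUMS file name are assumptions for illustration; kldload may ship its own checksum tooling.

```shell
write_manifest() {
    # Record a SHA-256 checksum for every ISO in directory $1.
    ( cd "$1" && sha256sum ./*.iso > SHA256SUMS )
}

verify_manifest() {
    # Succeeds only if every ISO listed in SHA256SUMS is present and unmodified.
    ( cd "$1" && sha256sum -c SHA256SUMS >/dev/null 2>&1 )
}
```

Run write_manifest live-build/output on the build host, carry the ISO and SHA256SUMS across the air gap together, and run verify_manifest on the far side before anything boots from the image.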

4. Custom Postinstallers

Postinstallers run on the target system after the base OS is installed. They handle everything the base package install does not: service configuration, user setup, custom software, monitoring agents, compliance baselines. kldload's postinstaller system is designed to be extended with your own scripts.

How postinstallers work

The installer backend (kldload-install-target) sources nine bash libraries from usr/lib/kldload-installer/lib/. After the base OS packages are installed via dnf --installroot (RPM distros), debootstrap (Debian/Ubuntu), or pacstrap (Arch), the postinstaller phase runs. Profile-specific packages and configs are gated by the profiles.sh functions k_profile_packages and k_install_system_files.

Adding your own packages to the postinstaller

# The postinstaller installs packages into the target rootfs.
# To add your own packages, you have two approaches:

# Approach 1: Add to the darksite package sets (recommended)
# This ensures packages are available offline.
# Add RPM packages:
echo "my-custom-package" >> build/darksite/config/package-sets/custom.txt
# Add Debian packages:
echo "my-custom-package" >> build/darksite-debian/config/package-sets/custom.txt
# Then reference them in your postinstaller script.

# Approach 2: Add a custom postinstaller script
# Place it in the live ISO filesystem overlay:
mkdir -p live-build/config/includes.chroot/usr/lib/kldload-installer/lib/
cat > live-build/config/includes.chroot/usr/lib/kldload-installer/lib/99-custom.sh <<'SCRIPT'
#!/bin/bash
# Custom postinstaller — runs in the context of kldload-install-target
# $TARGET is the rootfs mount point (e.g., /mnt/target)

k_custom_postinstall() {
    echo "[custom] Installing additional packages..."

    # Install packages from the darksite into the target
    case "$K_DISTRO" in
        centos|rhel|rocky|fedora)
            dnf --installroot="$TARGET" install -y \
                htop tmux jq ansible-core
            ;;
        debian|ubuntu)
            chroot "$TARGET" apt-get install -y \
                htop tmux jq ansible
            ;;
        arch)
            arch-chroot "$TARGET" pacman -S --noconfirm \
                htop tmux jq ansible
            ;;
    esac

    echo "[custom] Deploying configuration files..."
    # Copy custom configs into the target
    cp -r /root/custom-configs/etc/* "$TARGET/etc/"

    echo "[custom] Enabling custom services..."
    chroot "$TARGET" systemctl enable my-monitoring-agent.service
}
SCRIPT
chmod +x live-build/config/includes.chroot/usr/lib/kldload-installer/lib/99-custom.sh

Adding custom configuration files

# Everything under includes.chroot/ mirrors into the live ISO root filesystem.
# Place files where they should appear on the live ISO:

# Custom monitoring agent config
mkdir -p live-build/config/includes.chroot/root/custom-configs/etc/monitoring/
cat > live-build/config/includes.chroot/root/custom-configs/etc/monitoring/agent.conf <<'EOF'
[agent]
server = monitoring.internal.example.com
port = 8443
tls_cert = /etc/pki/tls/certs/monitoring.pem
interval = 30
EOF

# Custom systemd unit
mkdir -p live-build/config/includes.chroot/root/custom-configs/etc/systemd/system/
cat > live-build/config/includes.chroot/root/custom-configs/etc/systemd/system/my-monitoring-agent.service <<'EOF'
[Unit]
Description=Custom Monitoring Agent
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/bin/monitoring-agent -c /etc/monitoring/agent.conf
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

Profile gating

The core profile is stripped — no k* tools, no web UI, no sanoid, no darksites. If your postinstaller should only run on desktop or server profiles, gate it with a profile check:

k_custom_postinstall() {
    if [[ "$K_PROFILE" == "core" ]]; then
        echo "[custom] Skipping custom postinstall for core profile"
        return 0
    fi

    # Only runs for desktop and server profiles
    echo "[custom] Installing monitoring stack..."
    # ...
}

5. Local Package Repositories

Beyond the darksite (which is embedded in the ISO), you may want to maintain a persistent local package repository for ongoing system updates, custom-built RPMs, and internal software distribution. This section covers building and hosting RPM and APT repositories on kldload systems.

RPM repository with createrepo

# Install createrepo on your kldload system
dnf install -y createrepo_c

# Create directory structure on a ZFS dataset (snapshottable!)
# Assumes rpool/data is mounted at /data; check with: zfs get mountpoint rpool/data
zfs create rpool/data/repos
zfs create rpool/data/repos/rpm
mkdir -p /data/repos/rpm/{base,updates,custom}

# Copy or download RPMs into the repo
cp /path/to/custom-packages/*.rpm /data/repos/rpm/custom/

# Generate repository metadata
createrepo_c /data/repos/rpm/custom/

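# After adding new packages, update metadata
createrepo_c --update /data/repos/rpm/custom/

# GPG-sign the repository metadata (writes repomd.xml.asc).
# Re-run after every metadata update, because repomd.xml changes each time.
gpg --yes --armor --detach-sign /data/repos/rpm/custom/repodata/repomd.xml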

Serving with nginx

# /etc/nginx/conf.d/repo.conf
server {
    listen 8080;
    server_name repo.internal.example.com;

    location /rpm/ {
        alias /data/repos/rpm/;
        autoindex on;
        autoindex_format json;
    }

    location /debian/ {
        alias /data/repos/debian/;
        autoindex on;
    }

    location /ubuntu/ {
        alias /data/repos/ubuntu/;
        autoindex on;
    }
}

# Client-side repo config (/etc/yum.repos.d/internal.repo)
[internal-custom]
name=Internal Custom Repository
baseurl=http://repo.internal.example.com:8080/rpm/custom/
enabled=1
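# gpgcheck verifies each RPM's own signature (sign packages with rpm --addsign);
# repo_gpgcheck verifies the detached signature on repomd.xml
gpgcheck=1
repo_gpgcheck=1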
gpgkey=http://repo.internal.example.com:8080/rpm/RPM-GPG-KEY-internal
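
With several repositories (base, updates, custom) it is easier to generate the client files than to hand-edit them. A minimal sketch follows; the function name render_repo is hypothetical, and repo_gpgcheck=1 is included on the assumption that repomd.xml carries a detached signature as in the signing step above.

```shell
render_repo() {
    # $1 = repo id, $2 = base URL of the repository, $3 = URL of the public key
    cat <<EOF
[$1]
name=Internal repository $1
baseurl=$2
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=$3
EOF
}
```

For example, render_repo internal-custom http://repo.internal.example.com:8080/rpm/custom/ http://repo.internal.example.com:8080/rpm/RPM-GPG-KEY-internal > /etc/yum.repos.d/internal.repo writes a client config equivalent to the one shown above.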

APT repository with reprepro

# Install reprepro
apt-get install -y reprepro gnupg

# Create directory structure
zfs create rpool/data/repos/debian
mkdir -p /data/repos/debian/{conf,incoming}

# Configure the repository
cat > /data/repos/debian/conf/distributions <<'EOF'
Origin: Internal
Label: Internal Debian Repository
Codename: trixie
Architectures: amd64
Components: main
Description: Internal packages for kldload deployments
SignWith: your-gpg-key-id
EOF

cat > /data/repos/debian/conf/options <<'EOF'
verbose
ask-passphrase
EOF

# Add packages
reprepro -b /data/repos/debian includedeb trixie /path/to/package.deb

# List packages in the repo
reprepro -b /data/repos/debian list trixie

# Client config (/etc/apt/sources.list.d/internal.list)
# deb http://repo.internal.example.com:8080/debian/ trixie main

APT repository with aptly (alternative)

# Aptly is more feature-rich: snapshots, publishing, mirroring
apt-get install -y aptly   # packaged for Debian/Ubuntu; on RPM distros use the upstream binary release

# Create a local repo
aptly repo create -distribution=noble -component=main internal-ubuntu

# Add packages
aptly repo add internal-ubuntu /path/to/packages/

# Take a snapshot (version your repo state)
aptly snapshot create internal-ubuntu-2026-04-05 from repo internal-ubuntu

# Publish the snapshot
aptly publish snapshot internal-ubuntu-2026-04-05

# Mirror an upstream repo for air-gapped use
aptly mirror create ubuntu-noble-main \
    http://archive.ubuntu.com/ubuntu noble main
aptly mirror update ubuntu-noble-main
aptly snapshot create ubuntu-noble-snap from mirror ubuntu-noble-main
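# Publish under its own prefix: distribution "noble" is already published
# at the default prefix by the internal-ubuntu snapshot above
aptly publish snapshot ubuntu-noble-snap upstream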

ZFS-backed repository storage

Storing repositories on ZFS datasets gives you atomic snapshots of your entire repository state. Before updating repository metadata, snapshot the dataset. If the update causes problems, rollback is instant.

Most organisations version their application artifacts meticulously (Docker tags, Helm chart versions, OCI digests) but treat their package repository as a mutable, unversioned blob. When a package update breaks something, they cannot roll back the repository; they can only roll forward by publishing a new version. ZFS snapshots give you near-instant repository rollback, and a snapshot only consumes space as the live data diverges from it. Combined with sanoid for automated snapshot management, you get a version history of your package repository with very little operational overhead.

# Snapshot before repo update
zfs snapshot rpool/data/repos/rpm@before-update-2026-04-05

# Update repo
createrepo_c --update /data/repos/rpm/custom/

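# If something breaks, instant rollback.
# zfs rollback only targets the newest snapshot; if sanoid has taken later ones,
# add -r, which destroys every snapshot newer than the target.
zfs rollback rpool/data/repos/rpm@before-update-2026-04-05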

# Schedule regular snapshots with sanoid
# /etc/sanoid/sanoid.conf
[rpool/data/repos]
    use_template = production
    recursive = yes

[template_production]
    hourly = 24
    daily = 30
    monthly = 12
    autosnap = yes
    autoprune = yes

Almost nobody versions their package repositories. ZFS snapshots make it free and automatic — there is no reason not to.
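
The snapshot, update, rollback-on-failure pattern can be wrapped in one helper. This is a sketch under stated assumptions: with_snapshot is a hypothetical name, the snapshot naming scheme is made up, and the ZFS variable exists only so the helper can be exercised against a stub.

```shell
# ZFS is the zfs binary to call; override it to dry-run the helper.
ZFS=${ZFS:-zfs}

with_snapshot() {
    # $1 = dataset, remaining arguments = command to run after the snapshot
    dataset=$1; shift
    snap="$dataset@pre-$(date +%Y%m%d-%H%M%S)"
    $ZFS snapshot "$snap" || return 1
    if "$@"; then
        echo "ok: kept $snap"
    else
        status=$?
        echo "command failed ($status), rolling back to $snap" >&2
        # -r also destroys any snapshot taken after $snap (for example by sanoid)
        $ZFS rollback -r "$snap"
        return "$status"
    fi
}
```

Usage: with_snapshot rpool/data/repos/rpm createrepo_c --update /data/repos/rpm/custom/. A failing metadata update leaves the dataset exactly as it was before the command started.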

6. Container Image Registries

In a GitOps workflow, container images are the primary deployment artifact. You need a local registry for air-gapped environments, image scanning, and supply chain control.

Distribution Registry (CNCF reference implementation)

# Deploy the CNCF distribution registry on a ZFS dataset
zfs create rpool/data/registry

podman run -d \
    --name registry \
    -p 5000:5000 \
    -v /data/registry:/var/lib/registry:Z \
    -e REGISTRY_STORAGE_DELETE_ENABLED=true \
    -e REGISTRY_HTTP_HEADERS_Access-Control-Allow-Origin='["*"]' \
    --restart always \
    docker.io/library/registry:2

# Push an image to the local registry
podman tag docker.io/library/nginx:1.27 registry.internal:5000/nginx:1.27
podman push registry.internal:5000/nginx:1.27

# Configure podman/CRI-O to use the local registry
# (containerd does not read this file; configure it under /etc/containerd/certs.d/ instead)
# /etc/containers/registries.conf.d/internal.conf
[[registry]]
location = "registry.internal:5000"
insecure = true   # use TLS in production — see TLS & PKI masterclass

Harbor (enterprise registry)

# Harbor provides: vulnerability scanning, RBAC, replication, OCI artifact support
# Install Harbor offline installer (air-gapped)
tar xzf harbor-offline-installer-v2.11.0.tgz
cd harbor

# Configure harbor.yml
hostname: harbor.internal.example.com
https:
  port: 443
  certificate: /etc/pki/tls/certs/harbor.pem
  private_key: /etc/pki/tls/private/harbor.key
harbor_admin_password: ChangeMeNow!
database:
  password: ChangeMeNow!
data_volume: /data/harbor    # ZFS dataset

# Install
./install.sh --with-trivy   # includes vulnerability scanner

# Replicate from upstream (when connected) for air-gapped use:
# Harbor UI → Administration → Registries → New Endpoint
# Harbor UI → Administration → Replications → New Rule
# Pull-based replication: Harbor pulls images from Docker Hub/GHCR
# on a schedule, then serves them locally offline

Air-gapped image mirroring

# Use skopeo to copy images for offline use
# On a connected machine:
skopeo copy --all \
    docker://docker.io/library/nginx:1.27 \
    dir:/tmp/nginx-1.27

# Transfer the directory to the air-gapped network (USB, SCP, etc.)
tar czf nginx-1.27.tar.gz -C /tmp nginx-1.27
# sneakernet the tarball across the air gap

# On the air-gapped machine:
tar xzf nginx-1.27.tar.gz -C /tmp
skopeo copy --all \
    dir:/tmp/nginx-1.27 \
    docker://registry.internal:5000/nginx:1.27

# Bulk mirror with a manifest file
cat > images-to-mirror.txt <<'EOF'
docker.io/library/nginx:1.27
docker.io/library/postgres:16
docker.io/library/redis:7
ghcr.io/fluxcd/flux-cli:v2.4.0
ghcr.io/fluxcd/source-controller:v1.4.1
ghcr.io/fluxcd/kustomize-controller:v1.4.0
ghcr.io/fluxcd/helm-controller:v1.1.0
ghcr.io/fluxcd/notification-controller:v1.4.0
quay.io/argoproj/argocd:v2.13.0
EOF

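# Mirror all images. Only the last path component is kept, so two images with
# the same name from different organisations would overwrite each other.
while IFS= read -r img; do
    [ -n "$img" ] || continue          # skip blank lines
    name=${img##*/}                    # ghcr.io/fluxcd/flux-cli:v2.4.0 -> flux-cli:v2.4.0
    skopeo copy --all "docker://$img" "docker://registry.internal:5000/$name"
done < images-to-mirror.txt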
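
The loop above keeps only the image name. A variant that preserves the upstream organisation path avoids name collisions and lets manifests be pointed at the mirror by swapping just the registry host. Both function names here are hypothetical, and mirror_plan only prints commands so you can review them before copying anything.

```shell
mirror_ref() {
    # $1 = upstream reference including its registry host, $2 = local registry
    # Strips the upstream host and keeps the rest of the path and the tag.
    printf '%s/%s\n' "$2" "${1#*/}"
}

mirror_plan() {
    # Read image references on stdin and print the copy commands for registry $1.
    while IFS= read -r img; do
        case "$img" in ''|'#'*) continue ;; esac   # skip blanks and comments
        printf 'skopeo copy --all docker://%s docker://%s\n' \
            "$img" "$(mirror_ref "$img" "$1")"
    done
}
```

Review the output of mirror_plan registry.internal:5000 < images-to-mirror.txt, then pipe it to sh once it looks right.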

Image signing with cosign

# Generate a signing keypair
cosign generate-key-pair

# Sign an image after pushing
cosign sign --key cosign.key registry.internal:5000/myapp:v1.2.0

# Verify on pull (enforce in admission controller)
cosign verify --key cosign.pub registry.internal:5000/myapp:v1.2.0

# Kubernetes admission enforcement (Kyverno policy)
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  rules:
  - name: verify-cosign
    match:
      resources:
        kinds: ["Pod"]
    verifyImages:
    - imageReferences: ["registry.internal:5000/*"]
      attestors:
      - entries:
        - keys:
            publicKeys: |-
              -----BEGIN PUBLIC KEY-----
              ... your cosign.pub contents ...
              -----END PUBLIC KEY-----

7. Flux CD

Flux is a CNCF graduated GitOps toolkit. It runs as a set of controllers in your Kubernetes cluster and continuously reconciles cluster state against git repositories, Helm charts, and OCI artifacts. Flux is the recommended GitOps operator for kldload Kubernetes deployments because of its mature multi-tenancy model and image automation.

Installation

# Install the Flux CLI
curl -s https://fluxcd.io/install.sh | bash

# For air-gapped: download the binary and container images
flux install --export > flux-system.yaml
# Extract image references and mirror them to your local registry
grep "image:" flux-system.yaml | awk '{print $2}' | sort -u

# Bootstrap Flux into your cluster
# This creates the flux-system namespace and connects to your git repo
flux bootstrap git \
    --url=ssh://git@git.internal:2222/infra/k8s-fleet.git \
    --branch=main \
    --path=clusters/production \
    --private-key-file=/root/.ssh/flux-deploy-key

# Verify installation
flux check
kubectl get pods -n flux-system

GitRepository source

# Define a git source that Flux watches
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: infra
  namespace: flux-system
spec:
  interval: 5m
  url: ssh://git@git.internal:2222/infra/k8s-fleet.git
  ref:
    branch: main
  secretRef:
    name: flux-ssh-key   # SSH deploy key

Kustomization (Flux reconciliation)

# Tell Flux to apply manifests from a path in the GitRepository
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: infra
  path: ./clusters/production/apps
  prune: true          # delete resources removed from git
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: myapp
    namespace: default
  timeout: 5m
  retryInterval: 2m

HelmRelease

# Deploy a Helm chart via GitOps
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: ingress-nginx
  namespace: ingress-system
spec:
  interval: 30m
  chart:
    spec:
      chart: ingress-nginx
      version: "4.11.*"    # semver range
      sourceRef:
        kind: HelmRepository
        name: ingress-nginx
        namespace: flux-system
  values:
    controller:
      replicaCount: 2
      service:
        type: LoadBalancer
      metrics:
        enabled: true

Image automation

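Flux's image automation is where GitOps and CI/CD merge into a single loop. Your CI pipeline builds a new container image and pushes it to the registry. Flux's image-reflector-controller detects the new tag. The image-automation-controller updates the git repository with the new tag. The kustomize-controller detects the git change and applies it to the cluster. Both image controllers are optional components: install them by passing --components-extra=image-reflector-controller,image-automation-controller to flux bootstrap, and mirror their images too in an air-gapped setup.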

The entire flow — from code commit to running in production — happens without any CI server touching the cluster. The only thing that touches the cluster is Flux itself, running inside it. This is the security model that makes GitOps compelling for regulated environments: the cluster's credentials never leave the cluster.

# Flux can automatically update image tags in git when new images appear
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: myapp
  namespace: flux-system
spec:
  image: registry.internal:5000/myapp
  interval: 5m

---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: myapp
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: myapp
  policy:
    semver:
      range: ">=1.0.0"

---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageUpdateAutomation
metadata:
  name: myapp-auto
  namespace: flux-system
spec:
  interval: 30m
  sourceRef:
    kind: GitRepository
    name: infra
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: flux-bot
        email: flux@internal
      messageTemplate: "chore: update myapp to{{ range .Changed.Changes }} {{ .NewValue }}{{ end }}"
    push:
      branch: main
  update:
    path: ./clusters/production/apps
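    strategy: Setters   # rewrites only fields tagged with a marker comment such as # {"$imagepolicy": "flux-system:myapp"}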

Image automation closes the loop that most CI/CD setups leave open. CI builds the artifact; Flux delivers it. No handoff, no credentials leak, no manual step.
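
To build intuition for what a semver policy such as ">=1.0.0" selects, here is a rough approximation in shell. It is not Flux's implementation: the function name latest_semver is hypothetical, it only understands a single minimum version, and it ignores pre-release tags entirely.

```shell
latest_semver() {
    # stdin: one tag per line. $1 = minimum version (MAJOR.MINOR.PATCH).
    # Keeps plain MAJOR.MINOR.PATCH tags (optional leading "v") that are at
    # least the minimum, and prints the highest one.
    grep -E '^v?[0-9]+\.[0-9]+\.[0-9]+$' |
    awk -v min="$1" '
        function key(v,    p) {
            sub(/^v/, "", v)
            split(v, p, /\./)
            return p[1] * 1000000000000 + p[2] * 1000000 + p[3]
        }
        key($0) >= key(min) && (best == "" || key($0) > key(best)) { best = $0 }
        END { if (best != "") print best }
    '
}
```

Note that the comparison is numeric per component, so 1.10.0 ranks above 1.9.3, and tags such as latest or 2.0.0-rc1 never win. That is the reason to tag images with real version numbers if you want automation to promote them.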

8. ArgoCD

ArgoCD is another CNCF graduated GitOps project. Where Flux is a set of modular controllers, ArgoCD is a monolithic application with a rich web UI, SSO integration, and an opinionated workflow. Both are excellent. Choose Flux if you want composability and CLI-first operation. Choose ArgoCD if you want a visual dashboard and RBAC tied to your identity provider.

Installation

# Standard installation
kubectl create namespace argocd
kubectl apply -n argocd -f \
    https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Air-gapped: download the manifests, mirror images, apply locally
curl -sLO https://raw.githubusercontent.com/argoproj/argo-cd/v2.13.0/manifests/install.yaml
grep "image:" install.yaml | awk '{print $2}' | sort -u
# Mirror those images to registry.internal:5000, then sed the manifest

# Get the initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret \
    -o jsonpath='{.data.password}' | base64 -d

# Install the CLI
curl -sLO https://github.com/argoproj/argo-cd/releases/download/v2.13.0/argocd-linux-amd64
chmod +x argocd-linux-amd64 && mv argocd-linux-amd64 /usr/local/bin/argocd

# Login
argocd login argocd.internal.example.com --grpc-web
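
The "sed the manifest" step can be sketched as a small filter. The function name rewrite_images is hypothetical, and the pattern assumes every image reference either starts with a registry host (quay.io/..., ghcr.io/...) or contains no slash at all; references without a slash are left untouched and need handling by hand.

```shell
rewrite_images() {
    # stdin: a Kubernetes manifest. $1 = local registry host[:port].
    # Replaces the registry host on every "image:" line and keeps the path and tag.
    sed -E "s|^([[:space:]]*(- )?image:[[:space:]]*)[^/[:space:]]+/|\1$1/|"
}
```

Usage: rewrite_images registry.internal:5000 < install.yaml > install-airgap.yaml, then kubectl apply -n argocd -f install-airgap.yaml. This pairs with mirroring that keeps the upstream organisation path, so quay.io/argoproj/argocd lands at registry.internal:5000/argoproj/argocd.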

Application CRD

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: ssh://git@git.internal:2222/apps/myapp.git
    targetRevision: main
    path: deploy/k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp
  syncPolicy:
    automated:
      prune: true          # delete resources removed from git
      selfHeal: true       # revert manual changes
    syncOptions:
    - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

ApplicationSets for multi-cluster

# Deploy the same app to multiple clusters from a single definition
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: myapp-fleet
  namespace: argocd
spec:
  generators:
  - clusters:
      selector:
        matchLabels:
          env: production
  template:
    metadata:
      name: 'myapp-{{name}}'
    spec:
      project: default
      source:
        repoURL: ssh://git@git.internal:2222/apps/myapp.git
        targetRevision: main
        path: deploy/k8s
      destination:
        server: '{{server}}'
        namespace: myapp
      syncPolicy:
        automated:
          prune: true
          selfHeal: true

RBAC with Keycloak

# ArgoCD OIDC config (argocd-cm ConfigMap)
data:
  url: https://argocd.internal.example.com
  oidc.config: |
    name: Keycloak
    issuer: https://keycloak.internal.example.com/realms/infrastructure
    clientID: argocd
    clientSecret: $oidc.keycloak.clientSecret
    requestedScopes: ["openid", "profile", "email", "groups"]

# RBAC policy (argocd-rbac-cm ConfigMap)
data:
  policy.csv: |
    # SRE team gets full access
    p, role:sre, applications, *, */*, allow
    p, role:sre, clusters, *, *, allow
    p, role:sre, repositories, *, *, allow

    # Developers can view and sync any application (*/*), but cannot delete
    p, role:developer, applications, get, */*, allow
    p, role:developer, applications, sync, */*, allow

    # Map Keycloak groups to ArgoCD roles
    g, sre-team, role:sre
    g, developers, role:developer
  policy.default: role:readonly
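
To make the effect of these policy lines concrete, here is a minimal sketch of how the application rules might be evaluated. It is a simplification under stated assumptions: glob matching is done with Python's `fnmatchcase`, the `policy.default` fallback role is not modelled, and `allowed` is a hypothetical helper, not part of ArgoCD.

```python
from fnmatch import fnmatchcase

# (role, resource, action, object, effect), copied from the application rules in policy.csv
POLICY = [
    ("role:sre", "applications", "*", "*/*", "allow"),
    ("role:developer", "applications", "get", "*/*", "allow"),
    ("role:developer", "applications", "sync", "*/*", "allow"),
]

# Keycloak group -> ArgoCD role (the "g" lines)
GROUPS = {"sre-team": "role:sre", "developers": "role:developer"}


def allowed(group, resource, action, obj):
    """Return True if some policy line grants the action to the group's role."""
    role = GROUPS.get(group)
    return any(
        r == role
        and res == resource
        and fnmatchcase(action, act)
        and fnmatchcase(obj, pattern)
        and effect == "allow"
        for r, res, act, pattern, effect in POLICY
    )


print(allowed("developers", "applications", "sync", "default/myapp"))
print(allowed("developers", "applications", "delete", "default/myapp"))
```

Because the developer rules use `*/*`, they apply to every project and application; restricting developers to their own apps would need a narrower object pattern.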

Flux vs ArgoCD comparison

Feature	Flux CD	ArgoCD
Architecture	Modular controllers	Monolithic with UI
Web UI	Weave GitOps (separate)	Built-in, feature-rich
Multi-cluster	Kustomization per cluster	ApplicationSets
Helm support	HelmRelease CRD	Native in Application
Image automation	Built-in controllers	Argo Image Updater (separate)
RBAC	Kubernetes RBAC	Own RBAC + OIDC
OCI artifacts	Native OCIRepository	Supported
Notification	Built-in providers	Notification controller
Air-gapped	Excellent (mirror images)	Excellent (mirror images)
Learning curve	Steeper (many CRDs)	Gentler (UI helps)

9. Pipeline Design

A CI/CD pipeline has three fundamental stages: build (compile, package), test (unit, integration, security scan), and deploy (push artifacts, trigger reconciliation). The choice of CI runner matters less than the pipeline design. Here are production-grade pipeline configurations for the three most common self-hosted runners.

Separating CI from CD

The most common mistake in pipeline design is putting deployment logic in CI. Your CI pipeline should build, test, and push artifacts. It should never run kubectl apply, helm install, or ssh into a server. Deployment is the GitOps operator's job.

When you put deployment in CI, you create three problems: a single point of failure (the CI server), a security risk (cluster credentials stored in CI), and a missing reconciliation loop (if the deploy fails, nothing retries it). Separate concerns: CI builds artifacts, GitOps deploys them.

If your CI server has kubectl credentials, you have already lost the security argument for GitOps. The whole point is that only the in-cluster operator touches the cluster.
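
One way to enforce this separation is a small lint step that rejects pipeline definitions containing deployment commands. The sketch below is illustrative only: the pattern list and the `find_deploy_commands` name are assumptions, and a real check would need tuning for your own tooling.

```python
import re

# Commands that suggest deployment logic has leaked into CI (illustrative list)
FORBIDDEN = [
    r"\bkubectl\s+apply\b",
    r"\bhelm\s+(install|upgrade)\b",
    r"\bssh\s",
]


def find_deploy_commands(pipeline_text):
    """Return (line_number, line) pairs that look like deployment steps."""
    hits = []
    for number, line in enumerate(pipeline_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # comments are allowed to mention these commands
        if any(re.search(pattern, stripped) for pattern in FORBIDDEN):
            hits.append((number, stripped))
    return hits


sample = """podman build -t registry.internal:5000/myapp:abc .
kubectl apply -f deploy/
# kubectl apply is mentioned in a comment only
helm upgrade myapp ./chart
"""
print(find_deploy_commands(sample))
```

Running a check like this against the workflow files in a pre-merge job keeps cluster credentials from creeping back into CI unnoticed.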

Gitea Actions (self-hosted, air-gap friendly)

# .gitea/workflows/build.yaml
name: Build and Push
on:
  push:
    branches: [main]
    tags: ['v*']

jobs:
  build:
    runs-on: ubuntu-latest    # self-hosted runner on kldload
    steps:
    - uses: actions/checkout@v4

    - name: Build container image
      run: |
        podman build -t registry.internal:5000/myapp:${{ github.sha }} .
        podman tag registry.internal:5000/myapp:${{ github.sha }} \
                    registry.internal:5000/myapp:latest

    - name: Run tests
      run: |
        podman run --rm registry.internal:5000/myapp:${{ github.sha }} \
            /app/run-tests.sh

    - name: Scan for vulnerabilities
      run: |
        # --exit-code 1 fails the job on findings; without it trivy only reports them
        trivy image --exit-code 1 --severity HIGH,CRITICAL \
            registry.internal:5000/myapp:${{ github.sha }}

    - name: Push to registry
      run: |
        podman push registry.internal:5000/myapp:${{ github.sha }}
        podman push registry.internal:5000/myapp:latest

    - name: Sign image
      run: |
        cosign sign --key /secrets/cosign.key \
            registry.internal:5000/myapp:${{ github.sha }}

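    # With Flux image automation configured, Flux picks up the new tag and commits it to git.
    # Image policies need sortable tags, so a bare commit SHA usually needs a timestamp or build number added.
    # No kubectl apply. No cluster credentials in CI.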

GitLab CI (self-hosted runner)

# .gitlab-ci.yml
stages:
  - build
  - test
  - scan
  - push

variables:
  REGISTRY: registry.internal:5000
  IMAGE: ${REGISTRY}/myapp

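build:
  stage: build
  script:
    - podman build -t ${IMAGE}:${CI_COMMIT_SHA} .
  # Later stages reuse this image from local podman storage, so every job must
  # run on the same runner host (for example a single shell-executor runner).
  tags:
    - kldload-runner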

test:
  stage: test
  script:
    - podman run --rm ${IMAGE}:${CI_COMMIT_SHA} /app/run-tests.sh
  tags:
    - kldload-runner

scan:
  stage: scan
  script:
    - trivy image --exit-code 1 --severity CRITICAL ${IMAGE}:${CI_COMMIT_SHA}
  allow_failure: false
  tags:
    - kldload-runner

push:
  stage: push
  script:
    - podman push ${IMAGE}:${CI_COMMIT_SHA}
    - |
      if [ -n "$CI_COMMIT_TAG" ]; then
        podman tag ${IMAGE}:${CI_COMMIT_SHA} ${IMAGE}:${CI_COMMIT_TAG}
        podman push ${IMAGE}:${CI_COMMIT_TAG}
      fi
  only:
    - main
    - tags
  tags:
    - kldload-runner

GitHub Actions (self-hosted runner on kldload)

# .github/workflows/build.yml
name: Build Pipeline
on:
  push:
    branches: [main]
    tags: ['v*']

jobs:
  build-test-push:
    runs-on: self-hosted   # kldload runner
    steps:
    - uses: actions/checkout@v4

    - name: Set image tag
      id: tag
      run: |
        if [[ "$GITHUB_REF" == refs/tags/* ]]; then
          echo "tag=${GITHUB_REF#refs/tags/}" >> "$GITHUB_OUTPUT"
        else
          echo "tag=${GITHUB_SHA::8}" >> "$GITHUB_OUTPUT"
        fi

    - name: Build
      run: podman build -t registry.internal:5000/myapp:${{ steps.tag.outputs.tag }} .

    - name: Test
      run: podman run --rm registry.internal:5000/myapp:${{ steps.tag.outputs.tag }} make test

    - name: Push
      run: podman push registry.internal:5000/myapp:${{ steps.tag.outputs.tag }}

# Self-hosted runner setup on kldload:
# mkdir -p /opt/actions-runner && cd /opt/actions-runner
# curl -sLO https://github.com/actions/runner/releases/download/v2.321.0/actions-runner-linux-x64-2.321.0.tar.gz
# tar xzf actions-runner-linux-x64-2.321.0.tar.gz
# ./config.sh --url https://github.com/your-org/your-repo --token YOUR_TOKEN
# sudo ./svc.sh install && sudo ./svc.sh start   # svc.sh needs root; config.sh must run as a normal user

10. Golden Image Pipeline

A golden image is a pre-configured, sealed OS image that serves as a template for new machines. kldload's export feature produces cloud-init-ready images (qcow2, vmdk, vhd, ova, raw) that can be imported into any hypervisor. Combined with Packer and git-triggered builds, you get a fully automated image pipeline.

Golden images solve the "works on my machine" problem for infrastructure. Instead of running an Ansible playbook against a fresh VM and hoping it converges to the right state, you start from a known-good image that was built deterministically and validated in staging. The kldload export feature takes this further: the image includes ZFS on root, WireGuard, eBPF tools, and cloud-init — everything your infrastructure needs. Packer makes the build repeatable. ZFS makes the storage efficient (clones share blocks with the parent snapshot). Git tags make the versioning auditable.

Packer + kldload integration

# packer/kldload-golden.pkr.hcl
packer {
  required_plugins {
    qemu = {
      version = ">= 1.1.0"
      source  = "github.com/hashicorp/qemu"
    }
  }
}

source "qemu" "kldload-golden" {
  iso_url           = "live-build/output/kldload-1.0.3-x86_64.iso"
  iso_checksum      = "file:live-build/output/kldload-1.0.3-x86_64.iso.sha256"
  output_directory  = "output/golden"
  disk_size         = "40G"
  format            = "qcow2"
  accelerator       = "kvm"
  cpus              = 4
  memory            = 8192
  net_device        = "virtio-net"
  disk_interface    = "virtio-scsi"
  boot_wait         = "10s"
  boot_command      = [""]   # auto-boot the ISO

  # kldload answers file for unattended install
  http_directory    = "packer/http"
  shutdown_command   = "poweroff"
  ssh_username      = "root"
  ssh_password      = "kldload"
  ssh_timeout       = "30m"
}

build {
  sources = ["source.qemu.kldload-golden"]

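  # Upload the answers file (assumed here to live at packer/http/answers.env)
  provisioner "file" {
    source      = "packer/http/answers.env"
    destination = "/tmp/answers.env"
  }

  # Run the kldload installer with an answers file
  provisioner "shell" {
    inline = [
      "kldload-install-target < /tmp/answers.env",
    ]
  }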

  # Seal the image for cloning
  provisioner "shell" {
    inline = [
      "truncate -s 0 /etc/machine-id",
      "rm -f /etc/ssh/ssh_host_*",
      "cloud-init clean --logs --seed",
    ]
  }

  post-processor "checksum" {
    checksum_types = ["sha256"]
    output         = "output/golden/kldload-golden-{{.BuildName}}.sha256"
  }
}

Version tagging and promotion

# Image promotion pipeline: dev → staging → production
# Each stage is a git branch or tag

# Build triggers on git tag
git tag -a golden/v1.2.0 -m "Golden image v1.2.0 — CentOS 9, ZFS 2.2.7"
git push origin golden/v1.2.0

# CI pipeline builds the image and pushes to artifact storage
# packer build packer/kldload-golden.pkr.hcl

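# Promotion is a git operation. Each promotion tag points at the commit that was
# built ("^{}" dereferences the annotated tag), not at whatever HEAD is today.
# 1. Tag passes CI tests → promoted to staging
git tag -a golden/v1.2.0-staging -m "Promoted to staging" 'golden/v1.2.0^{}'
git push origin golden/v1.2.0-staging

# 2. Staging validated → promoted to production
git tag -a golden/v1.2.0-prod -m "Promoted to production" 'golden/v1.2.0^{}'
git push origin golden/v1.2.0-prod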

# ZFS snapshot versioning for golden images stored as zvols
zfs snapshot rpool/images/golden-centos9@v1.2.0
zfs send rpool/images/golden-centos9@v1.2.0 | \
    zfs receive backup/images/golden-centos9@v1.2.0
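
Because promotion state lives entirely in tag names, it can be read back mechanically. The following is a minimal sketch that assumes the `golden/vX.Y.Z[-staging|-prod]` naming shown above; `promotion_status` is a hypothetical helper, and the tag list would normally come from `git tag --list 'golden/*'`.

```python
import re

TAG_RE = re.compile(r"^golden/v(\d+\.\d+\.\d+)(?:-(staging|prod))?$")
STAGE_ORDER = {"built": 0, "staging": 1, "prod": 2}


def promotion_status(tags):
    """Map each golden image version to the furthest stage it has reached."""
    status = {}
    for tag in tags:
        match = TAG_RE.match(tag)
        if not match:
            continue  # not a golden image tag
        version = match.group(1)
        stage = match.group(2) or "built"
        # Keep the most advanced stage seen so far for this version
        if STAGE_ORDER[stage] >= STAGE_ORDER[status.get(version, "built")]:
            status[version] = stage
    return status


tags = [
    "golden/v1.2.0",
    "golden/v1.2.0-staging",
    "golden/v1.2.0-prod",
    "golden/v1.3.0",
    "v9.9.9",
]
print(promotion_status(tags))
```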

ZFS snapshot versioning for VM templates

# Store golden images as ZFS zvols — snapshottable, cloneable, replicable
zfs create -V 40G rpool/images/golden-centos9

# After Packer builds the image, convert to zvol
qemu-img convert -f qcow2 -O raw \
    output/golden/kldload-golden.qcow2 \
    /dev/zvol/rpool/images/golden-centos9

# Snapshot the golden image
zfs snapshot rpool/images/golden-centos9@v1.2.0

# Clone a new VM from the golden image (instant, zero-copy)
zfs clone rpool/images/golden-centos9@v1.2.0 rpool/vms/webserver-01

# List all golden image versions
zfs list -t snapshot -r rpool/images -o name,creation,used   # no trailing slash: ZFS rejects it

Golden images are the infrastructure equivalent of compiled binaries. You do not debug your compiler on every deploy — you trust the artifact and ship it.

11. Artifact Management

Beyond container images, a production pipeline produces many artifacts: Helm charts, binary packages, OCI bundles, documentation, golden images, signed SBOMs. You need a strategy for storing, versioning, and distributing them.

OCI artifacts (universal packaging)

# OCI is not just for container images — it is a universal artifact format
# Push a Helm chart as an OCI artifact
helm package ./charts/myapp
helm push myapp-1.2.0.tgz oci://registry.internal:5000/charts

# Push arbitrary files as OCI artifacts (using ORAS)
oras push registry.internal:5000/artifacts/golden-image:v1.2.0 \
    --artifact-type application/vnd.kldload.golden-image \
    kldload-golden.qcow2:application/octet-stream

# Pull artifacts
oras pull registry.internal:5000/artifacts/golden-image:v1.2.0

# Flux can consume OCI artifacts as sources
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: OCIRepository
metadata:
  name: myapp-manifests
  namespace: flux-system
spec:
  interval: 5m
  url: oci://registry.internal:5000/manifests/myapp
  ref:
    semver: ">=1.0.0"

Helm chart repository

# Option 1: ChartMuseum (traditional Helm repo)
podman run -d \
    --name chartmuseum \
    -p 8082:8080 \
    -v /data/charts:/charts:Z \
    -e STORAGE=local \
    -e STORAGE_LOCAL_ROOTDIR=/charts \
    ghcr.io/helm/chartmuseum:v0.16.2

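# Push a chart
curl --data-binary "@myapp-1.2.0.tgz" \
    http://chartmuseum.internal:8082/api/charts

# Add the ChartMuseum repo to Helm clients
helm repo add internal http://chartmuseum.internal:8082
helm repo update

# Option 2: OCI registry (recommended: same registry for everything)
helm push myapp-1.2.0.tgz oci://registry.internal:5000/charts
# OCI charts need no "helm repo add"; reference them by URL:
# helm install myapp oci://registry.internal:5000/charts/myapp --version 1.2.0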

ZFS-backed artifact storage

# Store all artifacts on ZFS for snapshotting and replication
zfs create rpool/data/artifacts
zfs create rpool/data/artifacts/images     # golden images
zfs create rpool/data/artifacts/charts     # Helm charts
zfs create rpool/data/artifacts/packages   # RPM/DEB packages
zfs create rpool/data/artifacts/oci        # OCI bundles

# Enable compression (zstd is cheap; already-compressed artifacts such as .tgz or .rpm gain little)
zfs set compression=zstd rpool/data/artifacts

# Replicate artifacts to a backup site
zfs snapshot -r rpool/data/artifacts@daily-2026-04-05
zfs send -R rpool/data/artifacts@daily-2026-04-05 | \
    ssh backup-site zfs receive -F backup/artifacts

12. Air-Gapped Deployments

Air-gapped environments — disconnected from the internet by policy or physics — are where kldload's darksite pattern shines. The ISO itself is an air-gapped deployment artifact. But you need a complete strategy for ongoing operations: updates, new applications, security patches, and configuration changes.

The file-based artifact philosophy

Every artifact your environment needs should be transportable as a file. Container images become OCI tarballs. Helm charts become .tgz files. Git repos become bundles. RPM and APT packages live in the darksite directory. Golden images ship as qcow2 files. When you design your entire pipeline around file-based artifacts, air-gapped deployment becomes a logistics problem, not an engineering problem. You are moving files across a boundary, not re-engineering your workflow. This is why kldload builds everything into the ISO — the ISO is the ultimate file-based artifact.

The USB-sneakernet workflow

# On the CONNECTED side (build machine with internet)

# 1. Build a fresh kldload ISO with all darksites
./deploy.sh clean && ./deploy.sh builder-image
./deploy.sh build-debian-darksite
./deploy.sh build-ubuntu-darksite
PROFILE=server ./deploy.sh build

# 2. Mirror container images
mkdir -p /data/transfer/images
while IFS= read -r img; do
    name=$(echo "$img" | tr '/:' '_')
    skopeo copy "docker://$img" "oci-archive:/data/transfer/images/${name}.tar"
done < images-to-mirror.txt

# 3. Package Helm charts
mkdir -p /data/transfer/charts
helm pull oci://registry.internal:5000/charts/myapp --version 1.2.0 \
    -d /data/transfer/charts/

# 4. Bundle git repositories
mkdir -p /data/transfer/repos
git -C /path/to/k8s-fleet bundle create /data/transfer/repos/k8s-fleet.bundle --all
git -C /path/to/app-configs bundle create /data/transfer/repos/app-configs.bundle --all

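# 5. Create the transfer bundle
# Write it outside /data/transfer (here /data/outgoing, an example path) so the archive does not include itself
mkdir -p /data/outgoing
tar czf /data/outgoing/airgap-bundle-2026-04-05.tar.gz -C /data/transfer .
# Run sha256sum from the bundle directory so the checksum file records a relative filename
(cd /data/outgoing && sha256sum airgap-bundle-2026-04-05.tar.gz > airgap-bundle-2026-04-05.sha256)

# 6. Write to USB
# /dev/sdX is a placeholder: confirm the USB device with lsblk first. These commands destroy its contents.
dd if=/dev/zero of=/dev/sdX bs=1M count=100   # clear partition table
parted -s /dev/sdX mklabel gpt
parted -s /dev/sdX mkpart primary ext4 1MiB 100%
mkfs.ext4 /dev/sdX1
mkdir -p /mnt/usb
mount /dev/sdX1 /mnt/usb
cp /data/outgoing/airgap-bundle-2026-04-05.tar.gz /mnt/usb/
cp /data/outgoing/airgap-bundle-2026-04-05.sha256 /mnt/usb/
cp live-build/output/kldload-*.iso /mnt/usb/
umount /mnt/usb

# On the DISCONNECTED side (air-gapped network)

# 1. Verify the bundle (from the USB mount point, because the checksum file uses a relative filename)
(cd /mnt/usb && sha256sum -c airgap-bundle-2026-04-05.sha256)

# 2. Extract
mkdir -p /data/incoming
tar xzf /mnt/usb/airgap-bundle-2026-04-05.tar.gz -C /data/incoming/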

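# 3. Load container images into local registry
# tr '/:' '_' made the last "_" the tag separator; the rest stand for "/".
# (Assumes every reference had a tag and no name contains an underscore.)
for img in /data/incoming/images/*.tar; do
    name=$(basename "${img%.tar}")
    repo=${name%_*}
    tag=${name##*_}
    skopeo copy "oci-archive:${img}" \
        "docker://registry.internal:5000/${repo//_//}:${tag}"
done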

# 4. Load Helm charts
for chart in /data/incoming/charts/*.tgz; do
    helm push "$chart" oci://registry.internal:5000/charts
done

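# 5. Update git repositories
# (assumes /data/git/k8s-fleet is a bare repository; git refuses to fetch into a checked-out branch)
cd /data/git/k8s-fleet
git fetch /data/incoming/repos/k8s-fleet.bundle main:main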

# 6. Flux/ArgoCD detects the git update and reconciles
flux reconcile source git infra   # or wait for automatic detection

The darksite pattern is not just for initial installation — it is a philosophy. Once you think in terms of portable file-based artifacts, air-gapped deployment stops being special and starts being normal.
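
The `tr '/:' '_'` step in the workflow turns an image reference into a filename, but that mapping is lossy: once slashes, colons, and literal underscores all become `_`, the original reference cannot be recovered reliably. A reversible alternative, sketched here with percent-encoding, avoids the ambiguity. The function names are hypothetical and not part of kldload.

```python
from urllib.parse import quote, unquote


def ref_to_filename(image_ref):
    """Encode an image reference as a single filesystem-safe filename."""
    return quote(image_ref, safe="") + ".tar"


def filename_to_ref(filename):
    """Recover the original image reference from an encoded filename."""
    if not filename.endswith(".tar"):
        raise ValueError(f"not an image archive: {filename}")
    return unquote(filename[: -len(".tar")])


encoded = ref_to_filename("docker.io/library/nginx:1.27")
print(encoded)
print(filename_to_ref(encoded))
```

The disconnected side can then rebuild the destination reference from the filename alone, without a separate manifest of image names.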

Automated transfer station

# A "transfer station" automates the connected-side bundling
# Cron job on the connected build machine

#!/bin/bash
# /usr/local/bin/airgap-bundle.sh — runs weekly via cron
set -euo pipefail

BUNDLE_DATE=$(date +%Y-%m-%d)
BUNDLE_DIR="/data/transfer/${BUNDLE_DATE}"
mkdir -p "${BUNDLE_DIR}"/{images,charts,repos,packages}

# Sync container images
while IFS= read -r img; do
    name=$(echo "$img" | tr '/:' '_')
    skopeo copy --all "docker://$img" \
        "oci-archive:${BUNDLE_DIR}/images/${name}.tar"
done < /etc/airgap/images.txt

# Sync Helm charts
while IFS= read -r chart; do
    helm pull "$chart" -d "${BUNDLE_DIR}/charts/"
done < /etc/airgap/charts.txt

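# Bundle git repos (incremental: the receiving side must already hold the older history)
while IFS= read -r repo; do
    name=$(basename "$repo" .git)
    # git bundle exits non-zero when there are no new commits; do not abort the whole run
    git -C "$repo" bundle create "${BUNDLE_DIR}/repos/${name}.bundle" \
        --since="7 days ago" --all \
        || echo "[$(date)] ${name}: nothing to bundle, skipping"
done < /etc/airgap/repos.txt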

# Package it
tar czf "/data/transfer/bundle-${BUNDLE_DATE}.tar.gz" -C "${BUNDLE_DIR}" .
# Checksum from inside the directory so the file records a relative filename
(cd /data/transfer && sha256sum "bundle-${BUNDLE_DATE}.tar.gz" \
    > "bundle-${BUNDLE_DATE}.sha256")

echo "[$(date)] Bundle created: bundle-${BUNDLE_DATE}.tar.gz"
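
A single checksum over the tarball only tells you the whole bundle is intact. If you also want to know which file inside it was damaged, a per-file manifest helps. The sketch below is one possible approach: the `SHA256SUMS` filename and both function names are assumptions. It writes lines in the `<hash>  <relative path>` layout that `sha256sum -c` also accepts.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(directory, manifest_name="SHA256SUMS"):
    """Record a checksum line for every file below directory (relative paths)."""
    directory = Path(directory)
    lines = []
    for path in sorted(directory.rglob("*")):
        if path.is_file() and path.name != manifest_name:
            relative = path.relative_to(directory).as_posix()
            lines.append(f"{sha256_of(path)}  {relative}")
    (directory / manifest_name).write_text("".join(line + "\n" for line in lines))
    return lines


def verify_manifest(directory, manifest_name="SHA256SUMS"):
    """Return the relative paths that are missing or whose checksum differs."""
    directory = Path(directory)
    failures = []
    for line in (directory / manifest_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, relative = line.split("  ", 1)
        target = directory / relative
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative)
    return failures
```

On the connected side, call `write_manifest` on the bundle directory before running `tar`; on the disconnected side, call `verify_manifest` on the extracted directory and refuse to load anything if the returned list is not empty.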

13. Integrating Custom Software into the kldload Build Pipeline

When you need to ship proprietary software, internal tools, or custom-compiled binaries as part of your kldload ISO, you integrate directly with the build pipeline. There are four methods, ranging from simple file drops to full darksite integration.

Choosing the right method

When people ask "how do I add my own software to kldload," they usually mean one of two things. Either they want to add a standard package that happens not to be in the default package sets — in which case the answer is to add a line to a .txt file — or they want to integrate proprietary software that does not exist in any public repository. The four methods below cover the full spectrum. For most organisations, Method 1 (file overlay) handles 80% of cases. For organisations that need proper package management with versioning and dependency tracking, Methods 2 and 3 (custom RPM/DEB) are the right answer.

The key insight is that the kldload build pipeline is designed for extension — the includes.chroot directory and the package-sets text files are injection points, not hacks.

Method 1: File overlay (simplest)

# Everything under includes.chroot/ mirrors into the live ISO filesystem
# Place binaries, scripts, and configs where they should appear

# Custom binary
mkdir -p live-build/config/includes.chroot/usr/local/bin/
cp /path/to/my-custom-tool \
    live-build/config/includes.chroot/usr/local/bin/my-custom-tool
chmod +x live-build/config/includes.chroot/usr/local/bin/my-custom-tool

# Custom systemd service
mkdir -p live-build/config/includes.chroot/etc/systemd/system/
cp my-tool.service \
    live-build/config/includes.chroot/etc/systemd/system/

# Custom configuration
mkdir -p live-build/config/includes.chroot/etc/my-tool/
cp my-tool.conf \
    live-build/config/includes.chroot/etc/my-tool/

Method 2: Custom RPM packages in the darksite

# Build your own RPM and include it in the darksite
# 1. Create an RPM spec file
cat > ~/rpmbuild/SPECS/my-tool.spec <<'EOF'
Name:    my-tool
Version: 1.0.0
Release: 1%{?dist}
Summary: Internal monitoring tool
License: Proprietary
Source0: my-tool-1.0.0.tar.gz

%description
Internal monitoring and compliance tool for kldload deployments.

%prep
%setup -q

%build
make

%install
install -Dm755 my-tool %{buildroot}/usr/bin/my-tool
install -Dm644 my-tool.conf %{buildroot}/etc/my-tool/my-tool.conf
install -Dm644 my-tool.service %{buildroot}/usr/lib/systemd/system/my-tool.service

%files
/usr/bin/my-tool
%config(noreplace) /etc/my-tool/my-tool.conf
/usr/lib/systemd/system/my-tool.service
EOF

# 2. Build the RPM
rpmbuild -ba ~/rpmbuild/SPECS/my-tool.spec

# 3. Copy the RPM into the darksite build area
cp ~/rpmbuild/RPMS/x86_64/my-tool-1.0.0-1.el9.x86_64.rpm \
    build/darksite/custom-rpms/

# 4. Add to package set
echo "my-tool" >> build/darksite/config/package-sets/custom.txt

# 5. Rebuild the ISO
./deploy.sh build

Method 3: Custom .deb packages in the darksite

# Build a .deb package for Debian/Ubuntu darksite inclusion
mkdir -p my-tool-1.0.0/DEBIAN
mkdir -p my-tool-1.0.0/usr/bin
mkdir -p my-tool-1.0.0/etc/my-tool
mkdir -p my-tool-1.0.0/usr/lib/systemd/system

cp my-tool my-tool-1.0.0/usr/bin/
cp my-tool.conf my-tool-1.0.0/etc/my-tool/
cp my-tool.service my-tool-1.0.0/usr/lib/systemd/system/

cat > my-tool-1.0.0/DEBIAN/control <<'EOF'
Package: my-tool
Version: 1.0.0
Section: admin
Priority: optional
Architecture: amd64
Maintainer: SRE Team <sre@example.com>
Description: Internal monitoring tool
 Internal monitoring and compliance tool for kldload deployments.
EOF

# --root-owner-group records files as root:root instead of the building user
dpkg-deb --root-owner-group --build my-tool-1.0.0

# Copy to darksite build areas
cp my-tool-1.0.0.deb build/darksite-debian/custom-debs/
cp my-tool-1.0.0.deb build/darksite-ubuntu/custom-debs/

# Add to package sets
echo "my-tool" >> build/darksite-debian/config/package-sets/custom.txt
echo "my-tool" >> build/darksite-ubuntu/config/package-sets/custom.txt

Method 4: Embed container images in the ISO

# For containerised workloads, pre-load images into the ISO
mkdir -p live-build/config/includes.chroot/root/container-images/

# Save images as OCI archives (without --format, podman save writes a docker-archive)
podman save --format oci-archive \
    -o live-build/config/includes.chroot/root/container-images/nginx.tar \
    docker.io/library/nginx:1.27
podman save --format oci-archive \
    -o live-build/config/includes.chroot/root/container-images/postgres.tar \
    docker.io/library/postgres:16

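# In your postinstaller, load them on the target
# (assumes the images were copied to $TARGET/root/container-images, as the cleanup line implies)
k_custom_postinstall() {
    echo "[custom] Pre-loading container images..."
    for img in "$TARGET"/root/container-images/*.tar; do
        [ -e "$img" ] || continue
        # Inside the chroot, paths are relative to the target root
        chroot "$TARGET" podman load -i "/root/container-images/$(basename "$img")"
    done
    rm -rf "$TARGET/root/container-images"   # clean up after loading
}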

The includes.chroot directory and package-sets text files are designed as extension points. Use them confidently — they are not hacks, they are the intended API.

14. Deployment Strategies

How you roll out changes to production determines your blast radius when something goes wrong. The right strategy depends on your architecture, your risk tolerance, and your ability to observe the deployment in real time.

Rolling deployment

# Kubernetes rolling update (default strategy)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
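spec:
  replicas: 6
  selector:
    matchLabels:
      app: myapp           # required: must match the pod template labels
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1    # at most 1 pod down at a time
      maxSurge: 1          # at most 1 extra pod during rollout
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers: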
      - name: myapp
        image: registry.internal:5000/myapp:v1.3.0
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
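
As a quick check on what `maxUnavailable` and `maxSurge` imply for capacity during a rollout, the bounds can be worked out by hand. This sketch uses hypothetical helper names and follows the Kubernetes rule that percentage values round down for `maxUnavailable` and up for `maxSurge`.

```python
import math


def resolve(value, replicas, round_up):
    """Resolve an absolute count or a percentage string such as '25%'."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rollout_bounds(replicas, max_unavailable, max_surge):
    """Return the minimum ready pods and maximum total pods during a rollout."""
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    surge = resolve(max_surge, replicas, round_up=True)
    return {"min_available": replicas - unavailable, "max_total": replicas + surge}


# The manifest above: 6 replicas, maxUnavailable 1, maxSurge 1
print(rollout_bounds(6, 1, 1))
```

For the manifest above, at least five pods keep serving and at most seven exist at any moment, which is the headroom the cluster needs to have free during a rollout.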

Blue/green deployment

# Two complete environments. Switch traffic atomically.
# Blue is current production. Green is the new version.

# Deploy green alongside blue
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
  labels:
    app: myapp
    version: green
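spec:
  replicas: 6
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green   # the Service selector matches pod labels, not Deployment labels
    spec:
      containers: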
      - name: myapp
        image: registry.internal:5000/myapp:v1.3.0

---
# Service points to blue initially
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue    # ← change to "green" to switch

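# Cutover: update the service selector.
# In a GitOps setup, commit the selector change to git instead; a manual patch
# is reverted by selfHeal or the next reconciliation.
kubectl patch service myapp -p '{"spec":{"selector":{"version":"green"}}}'

# Rollback: switch back to blue
kubectl patch service myapp -p '{"spec":{"selector":{"version":"blue"}}}'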

Canary deployment with Cilium

# Cilium can split traffic by weight at L7
apiVersion: cilium.io/v2
kind: CiliumEnvoyConfig
metadata:
  name: myapp-canary
spec:
  services:
  - name: myapp
    namespace: default
  resources:
  - "@type": type.googleapis.com/envoy.config.route.v3.RouteConfiguration
    virtual_hosts:
    - name: myapp
      domains: ["*"]
      routes:
      - match:
          prefix: "/"
        route:
          weighted_clusters:
            clusters:
            - name: "default/myapp-stable"
              weight: 90    # 90% to stable
            - name: "default/myapp-canary"
              weight: 10    # 10% to canary

# Progressive: increase canary weight as metrics confirm health
# 10% → 25% → 50% → 100% over hours/days
# Automated with Flagger or Argo Rollouts
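
The progression in the comments above amounts to a loop with a health gate at each step. The sketch below shows that control flow only. `error_rate_at` is a hypothetical callback standing in for a real metrics query, and the 1% threshold is an assumed value; tools such as Flagger or Argo Rollouts implement this far more thoroughly.

```python
def run_canary(steps, error_rate_at, max_error_rate=0.01):
    """Walk through canary weights; roll back as soon as the error rate is too high."""
    history = []
    for weight in steps:
        rate = error_rate_at(weight)
        history.append((weight, rate))
        if rate > max_error_rate:
            # Send all traffic back to stable
            return {"result": "rolled_back", "final_weight": 0, "history": history}
    return {"result": "promoted", "final_weight": steps[-1], "history": history}


healthy = run_canary([10, 25, 50, 100], lambda weight: 0.001)
# A release that only misbehaves once it receives half of the traffic
failing = run_canary([10, 25, 50, 100], lambda weight: 0.001 if weight < 50 else 0.05)
print(healthy["result"], failing["result"])
```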

Feature flags

# Decouple deployment from release. Deploy dark, enable incrementally.
# Simple feature flag with environment variable:

# In your application
import os
ENABLE_NEW_CHECKOUT = os.getenv("FEATURE_NEW_CHECKOUT", "false") == "true"

# In Kubernetes, toggle via ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-features
data:
  FEATURE_NEW_CHECKOUT: "true"
  FEATURE_DARK_MODE: "false"

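# Change the ConfigMap in git → Flux applies it. Same image. Different behaviour.
# Note: pods read environment variables only at start, and Kubernetes does not restart
# pods when a ConfigMap changes. Trigger a rollout (for example with a kustomize
# configMapGenerator, which renames the ConfigMap on change) or have the app re-read flags at runtime.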
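
The one-line check above only accepts the exact string "true". A slightly more forgiving reader, sketched here with a hypothetical `flag_enabled` helper and an assumed `FEATURE_` prefix convention, tolerates case differences and common spellings and makes the default explicit.

```python
import os

TRUE_VALUES = {"1", "true", "yes", "on"}


def flag_enabled(name, default=False, env=None):
    """Read FEATURE_<name> from the environment and interpret it as a boolean."""
    env = os.environ if env is None else env
    raw = env.get(f"FEATURE_{name}")
    if raw is None:
        return default  # flag not set: fall back to the caller's default
    return raw.strip().lower() in TRUE_VALUES


# Values as they would arrive from the ConfigMap above
config = {"FEATURE_NEW_CHECKOUT": "true", "FEATURE_DARK_MODE": "false"}
print(flag_enabled("NEW_CHECKOUT", env=config), flag_enabled("DARK_MODE", env=config))
```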

Strategy comparison

Strategy	Blast radius	Rollback speed	Resource cost	Complexity
Rolling	Gradual (1 pod at a time)	Minutes (undo rollout)	Low (+1 pod)	Low
Blue/Green	All-or-nothing	Seconds (switch selector)	High (2x resources)	Medium
Canary	Percentage-based	Seconds (shift weight)	Medium (+canary pods)	High
Feature flag	Per-feature	Seconds (toggle flag)	None (same deployment)	Medium (code changes)

15. Rollback & Recovery

Rollback is where GitOps earns its keep. In a traditional pipeline, rollback means finding the previous artifact version, running the deployment script again, and hoping the state converges. In GitOps, rollback is git revert.

Git revert as rollback

# The last commit promoted myapp to v1.3.0 and it is broken.
# Rollback by reverting the commit:
git log --oneline -5
# a1b2c3d promote myapp to v1.3.0
# d4e5f6g update ingress annotations
# ...

git revert a1b2c3d
# Creates a new commit that undoes the v1.3.0 promotion
git push

# Flux/ArgoCD detects the revert commit, applies the previous state.
# The cluster returns to v1.2.0. The git log records exactly what happened:
# f7g8h9i Revert "promote myapp to v1.3.0"
# a1b2c3d promote myapp to v1.3.0
# Full audit trail. No mystery. No "who ran kubectl at 3am?"

Flux automatic rollback

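# Flux Kustomization with health checks. If the Deployment does not become healthy
# within the timeout, Flux marks the Kustomization as failed. It does not roll the
# cluster back by itself: recovery is a git revert (HelmRelease resources can be
# configured to remediate automatically).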
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: myapp
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: infra
  path: ./apps/myapp
  prune: true
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: myapp
    namespace: default
  timeout: 5m
  # If health checks fail, Flux marks the Kustomization as not ready and retries on
  # the next interval. Kustomizations that list this one under dependsOn are held
  # back until it is healthy again.

ArgoCD automatic rollback

# ArgoCD Application with sync retry and drift correction.
# ArgoCD retries failed syncs but does not roll back to an earlier revision by itself.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
  annotations:
    # Notify the "deployments" Slack channel when the application becomes Degraded
    notifications.argoproj.io/subscribe.on-degraded.slack: deployments
spec:
  syncPolicy:
    automated:
      selfHeal: true    # revert manual drift
    retry:
      limit: 3
      backoff:
        duration: 10s
        factor: 2

# Argo Rollouts for automated progressive delivery with rollback
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - setWeight: 30
      - pause: {duration: 5m}
      - setWeight: 60
      - pause: {duration: 5m}
      analysis:
        templates:
        - templateName: success-rate
      # If analysis fails at any step, automatic rollback to stable

ZFS snapshot rollback for VMs and bare metal

For non-Kubernetes workloads (VMs, bare metal), ZFS snapshots are your rollback mechanism. Kubernetes rollback works for Kubernetes workloads. Git revert works for GitOps-managed resources. But what about the node itself? A kernel update that panics, a ZFS module upgrade that fails, a firmware change that bricks the NIC — these are below the application layer.

ZFS boot environments give you an undo button for the entire operating system. Combined with GitOps for the application layer, you have rollback at every level of the stack — from individual feature flags all the way down to the root filesystem. This is defence in depth applied to change management.

# Before deploying a change, snapshot the system
zfs snapshot rpool/ROOT/centos@before-deploy-2026-04-05

# Deploy the change
# ... something breaks ...

# Instant rollback — reboot into the previous state
zfs rollback rpool/ROOT/centos@before-deploy-2026-04-05
reboot

# Or use boot environments (non-destructive: keeps both states)
# kldload's kbe tool manages boot environments
kbe create pre-deploy     # capture the current known-good state
# Deploy changes to the running system
# ... something breaks ...
kbe activate pre-deploy   # reboot into the known-good state

# List boot environments
kbe list
# NAME         ACTIVE   MOUNTPOINT   CREATION
# default      -        /            2026-03-01
# pre-deploy   R        /            2026-04-05

ZFS boot environments are the layer most teams forget. You can roll back a Kubernetes deployment in seconds, but a bad kernel update will still take you down without OS-level rollback.

16. Monitoring Pipelines — DORA Metrics

You cannot improve what you do not measure. The four DORA (DevOps Research and Assessment) metrics are the industry standard for measuring delivery performance. They correlate directly with organisational performance.

The four DORA metrics

Deployment Frequency

How often your team deploys to production. Elite teams deploy multiple times per day. Low performers deploy monthly or less. In a GitOps workflow, this is simply the frequency of commits to the production branch that change application versions.

// Measured by: count of git commits to prod branch that change image tags, per week.

Lead Time for Changes

Time from code commit to running in production. Elite teams: less than one hour. Low performers: more than six months. In GitOps, this is the time from merge to main to Flux/ArgoCD completing the sync.

// Measured by: timestamp of merge commit minus timestamp of sync-complete event.

Mean Time to Recovery (MTTR)

How long it takes to recover from a production failure. Elite teams: less than one hour. With GitOps, recovery is git revert + automatic reconciliation. With ZFS, it is snapshot rollback. Both are measured in minutes.

// Measured by: time between alert firing and service returning to healthy.

Change Failure Rate

Percentage of deployments that cause a failure in production. Elite teams: 0-15%. This is the ratio of rollbacks (git reverts, Flux failures, Argo sync failures) to total deployments.

// Measured by: count of reverts / count of deployments, per month.
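
Three of these definitions reduce to simple arithmetic over a list of deployment records. The sketch below assumes each record carries a commit time, a deploy time, and a failure flag; the field names and the `dora_summary` helper are hypothetical, and in practice the records would come from git history and Flux or ArgoCD events.

```python
from datetime import datetime


def dora_summary(deployments, window_days):
    """Compute deployment frequency, mean lead time and change failure rate."""
    total = len(deployments)
    if total == 0:
        return {"deploys_per_day": 0.0, "lead_time_hours": None, "change_failure_rate": None}
    lead_seconds = [(d["deployed"] - d["committed"]).total_seconds() for d in deployments]
    failures = sum(1 for d in deployments if d["failed"])
    return {
        "deploys_per_day": total / window_days,
        "lead_time_hours": sum(lead_seconds) / total / 3600,
        "change_failure_rate": failures / total,
    }


deploys = [
    {"committed": datetime(2026, 4, 1, 9), "deployed": datetime(2026, 4, 1, 10), "failed": False},
    {"committed": datetime(2026, 4, 1, 12), "deployed": datetime(2026, 4, 1, 13), "failed": False},
    {"committed": datetime(2026, 4, 2, 9), "deployed": datetime(2026, 4, 2, 11), "failed": True},
    {"committed": datetime(2026, 4, 2, 12), "deployed": datetime(2026, 4, 2, 16), "failed": False},
]
summary = dora_summary(deploys, window_days=2)
print(summary)
```

MTTR is left out here because it depends on incident start and end times rather than on deployment records.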

Why MTTR and failure rate matter most

Most organisations obsess over deployment frequency and lead time while ignoring change failure rate and MTTR. This is backwards. Deploying frequently is easy — just merge faster. Deploying frequently without breaking things is hard.

The organisations that achieve elite DORA scores do not move faster by cutting corners — they move faster because their recovery mechanisms are so reliable that the cost of a failure is low. GitOps combined with ZFS snapshots and canary deployments reduces the cost of failure to near zero. When failure is cheap, you can afford to deploy often. When failure is expensive, every deployment becomes a risk that slows you down.

If your MTTR is under five minutes, deploying ten times a day is safe. If your MTTR is four hours, deploying once a week feels reckless. Fix recovery first, then speed follows naturally.
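
The argument can be made concrete with a little arithmetic. The sketch below assumes a 10% change failure rate for both teams, a figure chosen purely for illustration, and compares expected downtime per week and per deployed change.

```python
def downtime_minutes(deploys_per_week, change_failure_rate, mttr_minutes):
    """Expected downtime per week and per deployed change, in minutes."""
    per_change = change_failure_rate * mttr_minutes
    return {"per_week": deploys_per_week * per_change, "per_change": per_change}


# Ten deploys a day with a five-minute MTTR
fast = downtime_minutes(deploys_per_week=70, change_failure_rate=0.10, mttr_minutes=5)
# One deploy a week with a four-hour MTTR
slow = downtime_minutes(deploys_per_week=1, change_failure_rate=0.10, mttr_minutes=240)
print(fast, slow)
```

Under these assumptions the fast team ships 70 changes for about 35 minutes of expected downtime, roughly half a minute per change, while the slow team risks about 24 minutes on a single change. Each individual failure also lasts five minutes instead of four hours.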

Prometheus metrics for pipeline monitoring

# Instrument your pipeline to emit DORA metrics
# Deploy Prometheus + Grafana on kldload (see Observability Masterclass)

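# Flux controllers expose metrics on :8080/metrics (container port "http-prom")
# Most Flux controllers have no Service in front of them, so scrape the pods with a PodMonitor
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: flux-system
  namespace: flux-system
spec:
  selector:
    matchExpressions:
    - key: app
      operator: In
      values:
      - source-controller
      - kustomize-controller
      - helm-controller
      - notification-controller
      - image-reflector-controller
      - image-automation-controller
  podMetricsEndpoints:
  - port: http-prom
    interval: 30s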

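# Key Flux metrics (check which series your version exposes: from Flux v2.1 the controllers
# no longer emit gotk_reconcile_condition or gotk_suspend_status, and resource state is
# exported through kube-state-metrics as gotk_resource_info instead):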
# gotk_reconcile_duration_seconds — how long reconciliation takes
# gotk_reconcile_condition — success/failure of each reconciliation
# gotk_suspend_status — whether a resource is suspended

# ArgoCD metrics (exposed on :8082/metrics)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-metrics
  namespace: argocd
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-metrics   # label on the argocd-metrics Service
  endpoints:
  - port: metrics
    interval: 30s

# Key ArgoCD metrics:
# argocd_app_sync_total: sync operations count
# argocd_app_info: per-application gauge; health_status and sync_status are labels
# argocd_app_reconcile_count: reconciliation operations (count series of the argocd_app_reconcile histogram)

Grafana dashboard for DORA metrics

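# PromQL queries for DORA metrics
# Note: gotk_reconcile_condition is a 0/1 gauge, not a counter, so the increase()
# queries below approximate state transitions. Treat the results as estimates.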

# Deployment Frequency (deploys per day)
sum(increase(gotk_reconcile_condition{type="Ready",status="True"}[24h]))

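# Lead Time proxy (average reconcile duration over the last hour)
# The metric is a histogram, so the average is _sum / _count.
# This covers only the reconcile step, not the full commit-to-production time.
sum(rate(gotk_reconcile_duration_seconds_sum{kind="Kustomization"}[1h]))
/
sum(rate(gotk_reconcile_duration_seconds_count{kind="Kustomization"}[1h]))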

# Change Failure Rate
sum(increase(gotk_reconcile_condition{type="Ready",status="False"}[7d]))
/
sum(increase(gotk_reconcile_condition{type="Ready"}[7d]))
* 100

# MTTR (time between failure and recovery)
# This requires custom recording rules — track state transitions:
# Record when a Kustomization goes from Ready=True to Ready=False (failure)
# Record when it goes from Ready=False to Ready=True (recovery)
# MTTR = average(recovery_time - failure_time)
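The MTTR arithmetic described above can be sketched outside Prometheus. This is a minimal sketch under stated assumptions, not a Flux feature: it assumes you have already exported the Ready-state transitions of one Kustomization as lines of `<unix_timestamp> <status>`, sorted by time. That input format and the function name `mttr_seconds` are hypothetical.

```shell
#!/bin/sh
# Sketch: mean time to recovery from a list of Ready-state transitions.
# Assumed input (hypothetical export, not a Flux format): one transition per
# line, "<unix_timestamp> <status>", sorted by time, status is True or False.

# mttr_seconds: reads transitions on stdin, prints the mean recovery time in seconds
mttr_seconds() {
    awk '
        # A failure starts at the first False seen while healthy.
        $2 == "False" && !failing { failing = 1; start = $1 }
        # A recovery is the first True seen while failing.
        $2 == "True" && failing   { failing = 0; total += $1 - start; n++ }
        END {
            if (n == 0) { print "n/a"; exit }   # no completed recovery in the input
            printf "%d\n", total / n
        }
    '
}
```

For example, `mttr_seconds < transitions.txt` with one failure lasting 300 seconds and another lasting 100 seconds prints 200. Repeated `False` lines inside one outage are ignored, and a failure that has not yet recovered is left out of the average.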

17. Troubleshooting Reference

Symptom	Cause	Resolution
Flux Kustomization stuck in "not ready"	Health check timeout — deployment not becoming ready	Check pod logs: `kubectl logs -n default deploy/myapp`. Check events: `kubectl get events -n default`. Increase timeout in Kustomization spec.
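ArgoCD shows "OutOfSync" but won't sync	RBAC: ArgoCD service account lacks permissions	Check ArgoCD logs: `kubectl logs -n argocd statefulset/argocd-application-controller` (use `deploy/argocd-application-controller` on older installs where the controller is a Deployment). Verify ClusterRole bindings.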
Image pull fails in air-gapped cluster	Image not mirrored to local registry	Verify image exists: `skopeo inspect docker://registry.internal:5000/image:tag`. Check containerd registry config.
Darksite package missing during install	Package not in package-sets .txt file	Add package name to the appropriate file in `build/darksite/config/package-sets/`. Rebuild ISO.
createrepo metadata stale	Forgot to run `createrepo_c --update`	Run `createrepo_c --update /path/to/repo/` after adding new RPMs.
reprepro "already registered"	Trying to add a package version that already exists	Use `reprepro remove trixie package-name` first, or bump the version.
cosign verify fails	Image signed with different key or signature not uploaded	Verify the public key matches. Check that `cosign sign` completed after push.
Helm chart OCI push fails	Registry does not support OCI artifacts	Use Harbor or distribution/registry v2.8+. On Helm older than 3.8, ensure `HELM_EXPERIMENTAL_OCI=1` is set; later versions enable OCI support by default.
Flux image automation not updating git	Image policy not matching any tags	Check `flux get image policy myapp`. Verify semver range matches published tags.
Git bundle fetch fails on air-gapped side	Bundle was created with `--since` but remote has no common ancestor	Create a full bundle: `git bundle create repo.bundle --all` (no --since).
Pipeline runs but cluster does not change	CI pushes artifacts but nothing triggers GitOps reconciliation	Verify Flux/ArgoCD is watching the correct branch and path. Run `flux reconcile source git infra` manually.
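ZFS rollback fails "dataset has children"	Snapshots newer than the target, or clones of them, still exist	Destroy the newer snapshots and their clones first, or let ZFS do it: `zfs rollback -r` destroys snapshots newer than the target, and `zfs rollback -R` also destroys clones of those snapshots.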
Postinstaller script not running	Script not sourced by kldload-install-target	Ensure script is in `usr/lib/kldload-installer/lib/` and follows the naming/function conventions.
Custom RPM not found during install	RPM in darksite directory but not in repo metadata	Ensure `createrepo_c` ran on the directory. Check that the package name in the .txt file matches the RPM name.
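
The name-matching check in the last row can be sketched as a quick pre-build test. This is a rough sketch under assumptions: it assumes the package-set `.txt` file lists one package name per line with optional `#` comments, and that RPM files follow the usual `name-version-release.arch.rpm` pattern with a version that starts with a digit. The function name `missing_rpms` is hypothetical and not part of kldload.

```shell
#!/bin/sh
# Sketch: report package-set entries with no matching RPM in a directory.
# Assumptions: the .txt file lists one package name per line ("#" comments
# allowed) and RPM files are named name-version-release.arch.rpm.

# missing_rpms LIST_FILE RPM_DIR -> prints each name with no matching RPM file
missing_rpms() {
    list=$1; dir=$2
    # Strip comments, surrounding whitespace, and blank lines.
    sed -e 's/#.*//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$list" |
    while IFS= read -r name; do
        found=no
        # Require a digit after "name-" so "zfs" does not match "zfs-dkms-...".
        for f in "$dir/$name"-[0-9]*.rpm; do
            [ -e "$f" ] && found=yes
        done
        [ "$found" = yes ] || printf '%s\n' "$name"
    done
}
```

An empty result means every listed name has a candidate file. This only compares file names; the package name recorded inside an RPM can be read with `rpm -qp --queryformat '%{NAME}\n' file.rpm`, and the repository metadata still has to be regenerated with `createrepo_c` for the installer to see the package.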

Build Your Own — Overview — understanding the kldload build pipeline
Postinstallers — reference for the postinstaller system
Package Management — RPM and APT package management on kldload
Docker & Podman on ZFS — container runtimes and ZFS storage drivers
Kubernetes on KVM — building a Kubernetes cluster on kldload
Cloud & Packer — golden image creation with Packer
Cluster & Blue/Green — deployment strategies tutorial
Packer & IaC Masterclass — infrastructure as code deep dive
Kubernetes Masterclass — cluster operations and workload management
Observability Masterclass — Prometheus, Grafana, and monitoring stack
Security Hardening Masterclass — supply chain security and image signing
Keycloak & SELinux Masterclass — RBAC and identity for ArgoCD/Flux
Backup & DR Masterclass — ZFS replication and disaster recovery
Blue/Green & SRE Masterclass — SRE practices and deployment patterns
