
Kubernetes on KVM with kldload

This guide builds a production-grade Kubernetes cluster on a kldload KVM hypervisor. You build one golden image with all the K8s prerequisites baked in, clone it instantly with ZFS for each node, and have a running cluster in minutes. Every node is a zvol — snapshotable, cloneable, replicatable to a DR site. Scale by cloning. Roll back a broken upgrade by reverting a snapshot. Tear down the whole cluster and rebuild it from the golden image in under five minutes.

This is where kldload's ZFS + KVM architecture pays off in a way that's hard to appreciate until you've done it the other way. On a traditional hypervisor, building a 6-node K8s cluster means creating 6 VMs (minutes each), installing an OS on each (20+ minutes each), configuring each one, and installing the K8s prerequisites on each. That's an afternoon. On kldload: build one golden image with everything baked in, kvm-clone it 6 times (2 seconds each), set hostnames, then run kubeadm init and kubeadm join. Total time from golden image to running cluster: under 15 minutes. And when something breaks, a kvm-snap rollback gets you back to the last known-good state in seconds instead of "reinstall and rejoin the cluster."

Prerequisites

A kldload system installed with the kvm profile (or any profile with KLDLOAD_ENABLE_KVM=1). This gives you KVM, libvirt, ZFS zvol storage, and the kvm-create/kvm-clone/kvm-snap tools.

Cluster size	Host RAM	Host disk	Notes
1 CP + 2 workers (dev)	16GB	100GB	Tight but functional
3 CP + 3 workers (prod)	32GB	200GB	HA control plane
3 CP + 6+ workers (scale)	64GB+	500GB+	Production workloads
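The sizing rows follow from simple arithmetic. A minimal sketch, assuming every node keeps the 4096 MB that the golden image is created with later in this guide, and that some RAM is held back for the hypervisor itself (the reserve figure is an illustrative assumption, not a kldload default):

```shell
# Total guest RAM in MB for a cluster.
# Usage: cluster_ram_mb <control-planes> <workers> <ram-per-node-mb>
cluster_ram_mb() {
  echo $(( ($1 + $2) * $3 ))
}

# Succeeds if the guests plus a host reserve fit in the host's RAM.
# Usage: fits_on_host <host-ram-mb> <guest-ram-mb> <host-reserve-mb>
fits_on_host() {
  [ $(( $2 + $3 )) -le "$1" ]
}
```

For example, fits_on_host 32768 "$(cluster_ram_mb 3 3 4096)" 4096 succeeds for the 3 CP + 3 worker row, while the dev row (3 nodes on a 16GB host) fits with nothing to spare, which is why the table calls it tight.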

Step 1 — ZFS layout (already done by the kvm profile)

The kvm profile creates rpool/vms with the right properties (compression=off, recordsize=64K, primarycache=metadata). VMs created with kvm-create are zvols under rpool/vms/<name> automatically. No manual dataset creation needed.

# Verify the layout exists
zfs list -r rpool/vms
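Listing the dataset shows that it exists, not that its properties are right. A small sketch that compares the three properties named above against zfs get output; the function name is made up for this example, and the expected values are simply the ones this guide states:

```shell
# Compare `zfs get` output (property<TAB>value per line, on stdin)
# against the values this guide says the kvm profile sets.
# Prints each mismatch and returns non-zero if any were found.
check_vms_props() {
  awk -F'\t' '
    BEGIN {
      bad = 0
      want["compression"]  = "off"
      want["recordsize"]   = "64K"
      want["primarycache"] = "metadata"
    }
    ($1 in want) {
      seen[$1] = 1
      if ($2 != want[$1]) {
        printf "%s: got %s, want %s\n", $1, $2, want[$1]
        bad = 1
      }
    }
    END {
      for (p in want) if (!(p in seen)) { printf "%s: missing\n", p; bad = 1 }
      exit bad
    }'
}
```

On the KVM host, feed it with: zfs get -H -o property,value compression,recordsize,primarycache rpool/vms | check_vms_props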

Step 2 — Build a golden image

The golden image is the most important artifact in this workflow. You build it once with all the K8s prerequisites — containerd, kubelet, kubeadm, kubectl, kernel modules, sysctl settings — and every node you create from it has everything pre-installed. No per-node setup. No per-node package downloads. Clone and join. The time you invest in getting the golden image right pays back every time you scale, rebuild, or recover a node.

Create the base VM

# Create a VM from the kldload ISO
kvm-create k8s-golden --ram 4096 --cpus 4 --disk 40 \
  --iso /var/lib/libvirt/isos/kldload-free-latest.iso \
  --os centos-stream9

# Connect via VNC to complete the kldload install
virsh vncdisplay k8s-golden
# Install with the "server" or "core" profile — minimal is best for K8s nodes

Bake K8s prerequisites into the golden image

SSH into the golden VM and install everything a K8s node needs:

Step 3 — Clone VMs instantly

# Clone 3 control plane nodes from the golden image
for i in 1 2 3; do
  kvm-clone k8s-golden k8s-cp-${i}
done

# Clone 3 worker nodes
for i in 1 2 3; do
  kvm-clone k8s-golden k8s-worker-${i}
done

# That's it. 6 VMs created in under 10 seconds total.
# Each is a ZFS zvol clone — zero bytes copied, shares blocks with the golden image.

# Start all nodes
for vm in k8s-cp-{1,2,3} k8s-worker-{1,2,3}; do
  virsh start ${vm}
done

Six VMs. Ten seconds. Each one has a 40GB disk that uses near-zero space because ZFS clones share all blocks with the golden image until they diverge. The total disk usage for 6 fresh clones is effectively 0GB — it only grows as each node writes unique data (K8s state, container images, logs). This is why ZFS zvol cloning is transformative for Kubernetes: you can tear down and rebuild the entire cluster from the golden image in minutes, not hours. Break something during an upgrade? kvm-snap rollback. Need to test a new K8s version? Clone the golden, upgrade the clone, test. Didn't work? Destroy the clone. Zero risk.
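You can check the near-zero claim yourself. The sketch below sums the space held by clones, reading the tab-separated output of zfs list -Hp. It assumes the clones carry no refreservation; if kvm-clone creates thick-provisioned zvols, the reservation will show up in the used column. The function name is hypothetical:

```shell
# Sum the `used` bytes of every clone listed on stdin.
# Input: name<TAB>used<TAB>origin, as printed by `zfs list -Hp`.
# A dataset is a clone when its origin column is not "-".
clone_used_bytes() {
  awk -F'\t' '$3 != "-" { total += $2 } END { printf "%.0f\n", total }'
}
```

On the KVM host: zfs list -Hp -r -t volume -o name,used,origin rpool/vms | clone_used_bytes. Run it right after cloning and again after the cluster has been up for a day to see how far the nodes have diverged from the golden image.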

Set unique hostnames

# SSH into each node and set its hostname
# (the golden image had its hostname cleared during sealing)
for i in 1 2 3; do
  ssh root@k8s-cp-${i} "hostnamectl set-hostname k8s-cp-${i}"
done
for i in 1 2 3; do
  ssh root@k8s-worker-${i} "hostnamectl set-hostname k8s-worker-${i}"
done

Step 4 — Initialize the cluster

Since the golden image has containerd, kubelet, kubeadm, and kubectl pre-installed, this step is just running kubeadm init on the first control plane node, then kubeadm join on everything else. No per-node package installs. No per-node configuration. The golden image did all of that once.

On ALL nodes — regenerate machine identity (clones only)

# Each cloned node needs a unique machine-id (systemd uses this for journald, D-Bus, etc.)
# If the golden image was sealed properly, this happens automatically on first boot.
# Verify on each node:
cat /etc/machine-id    # should be unique per node
hostnamectl            # should show the hostname you set
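Checking six nodes by eye is error-prone. A minimal sketch that flags any machine-id shared by more than one node; the function name is made up for this example, and it assumes root SSH to each node as used elsewhere in this guide:

```shell
# Print every machine-id that appears on more than one node.
# Input on stdin: "<node> <machine-id>" per line.
# Empty output means every node has a unique id.
duplicate_machine_ids() {
  awk '{ print $2 }' | sort | uniq -d
}
```

Collect the ids from the KVM host and pipe them in:
for n in k8s-cp-1 k8s-cp-2 k8s-cp-3 k8s-worker-1 k8s-worker-2 k8s-worker-3; do printf '%s %s\n' "$n" "$(ssh root@$n cat /etc/machine-id)"; done | duplicate_machine_ids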

On the first control plane node (k8s-cp-1)

# Initialize the cluster
kubeadm init \
  --control-plane-endpoint "k8s-cp-1:6443" \
  --pod-network-cidr 10.244.0.0/16 \
  --upload-certs

# Save the output — it contains join commands for other nodes

# Set up kubectl
mkdir -p ~/.kube
cp /etc/kubernetes/admin.conf ~/.kube/config

# Install Cilium as the CNI (eBPF-based — replaces kube-proxy entirely)
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
curl -L --fail --remote-name-all \
  https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-amd64.tar.gz
tar xzf cilium-linux-amd64.tar.gz -C /usr/local/bin
cilium install --set kubeProxyReplacement=true

Why Cilium over Flannel/Calico: Cilium uses eBPF to handle all pod networking and policy enforcement in the kernel. With kubeProxyReplacement=true it takes over service load balancing from kube-proxy, so services need no iptables rules and avoid the conntrack pressure and NAT overhead that come with them. On a kldload system that already has eBPF tooling (bcc, bpftrace) installed, Cilium fits naturally. It gives you eBPF-based load balancing (faster than iptables DNAT), transparent encryption between pods (WireGuard or IPsec), network policy enforcement at the kernel level, and Hubble for network observability. For a kldload cluster running on WireGuard-connected nodes, Cilium's WireGuard encryption mode means pod traffic is encrypted even between nodes on the same LAN: defense in depth at a small performance cost.

Join additional control plane nodes

# On k8s-cp-2 and k8s-cp-3, use the join command from kubeadm init output:
kubeadm join k8s-cp-1:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <cert-key>

Join worker nodes

# On each worker:
kubeadm join k8s-cp-1:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>

Verify

kubectl get nodes -o wide
cilium status
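Nodes usually take a short while to report Ready after the CNI comes up. To block until they do, kubectl wait --for=condition=Ready nodes --all --timeout=300s works. If it times out, a filter like the sketch below (the function name is hypothetical) shows which nodes to look at:

```shell
# Print the name of every node that is not Ready.
# Input on stdin: output of `kubectl get nodes --no-headers`
# (columns: NAME STATUS ROLES AGE VERSION). STATUS may carry extra
# comma-separated flags such as "Ready,SchedulingDisabled".
not_ready_nodes() {
  awk '{ split($2, s, ","); if (s[1] != "Ready") print $1 }'
}
```

Usage: kubectl get nodes --no-headers | not_ready_nodes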

What ZFS brings to Kubernetes

Kubernetes on traditional infrastructure has a weak spot: node state is fragile. If a node's disk corrupts, you rebuild from scratch. If an upgrade breaks kubelet, you debug for hours or reinstall. If you need to test a cluster upgrade, you build a separate test cluster. ZFS eliminates all of these problems because the node itself is a snapshotable, cloneable, replicatable object. Here's what that actually means in practice.

Pre-upgrade snapshots (atomic cluster rollback)

# Before upgrading K8s from 1.30 to 1.31:
# Snapshot every node from the KVM host
for vm in k8s-cp-{1,2,3} k8s-worker-{1,2,3}; do
  kvm-snap ${vm}
done

# Upgrade proceeds normally inside the VMs...
# kubeadm upgrade apply v1.31.0, etc.

# If the upgrade breaks something:
for vm in k8s-cp-{1,2,3} k8s-worker-{1,2,3}; do
  kvm-snap ${vm} rollback
done
# Entire cluster is back to pre-upgrade state in seconds.
# Not "restore from backup." Not "debug for 3 hours."
# Atomic rollback to the exact state before the upgrade.
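A rollback only helps if every node actually has a snapshot. Before starting the upgrade, a check along these lines can confirm it. This is a sketch: the function name is made up, and it assumes snapshots are named rpool/vms/<vm>@<timestamp> as in the kvm-snap output shown in this guide:

```shell
# Print each VM from the argument list that has no snapshot.
# Input on stdin: snapshot names, one per line, as printed by
#   zfs list -H -t snapshot -o name -r rpool/vms
# e.g. rpool/vms/k8s-cp-1@2026-04-02_143022
vms_without_snapshot() {
  snaps=$(cat)
  for vm in "$@"; do
    # The trailing "@" stops k8s-cp-1 from matching k8s-cp-10.
    printf '%s\n' "$snaps" | grep -q "^rpool/vms/${vm}@" || echo "$vm"
  done
}
```

Usage on the KVM host: zfs list -H -t snapshot -o name -r rpool/vms | vms_without_snapshot k8s-cp-1 k8s-cp-2 k8s-cp-3 k8s-worker-1 k8s-worker-2 k8s-worker-3. No output means every node is covered.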

Test clusters from production state

# Clone one control plane node and the workers into a test cluster
for vm in k8s-cp-1 k8s-worker-{1,2,3}; do
  kvm-clone ${vm} test-${vm}
done

# The test cluster is a byte-identical copy of production.
# Test your upgrade, migration, or config change on the clone.
# When done, destroy the test cluster:
for vm in test-k8s-cp-1 test-k8s-worker-{1,2,3}; do
  virsh destroy ${vm} 2>/dev/null
  virsh undefine ${vm} --nvram
  zfs destroy rpool/vms/${vm}
done
# Keep the clones on an isolated network: they boot with production's
# hostnames and cluster certificates. No disk cost until the clones diverge.

Node recovery in seconds

# Worker node is misbehaving — corrupted container runtime, broken kubelet config
# Don't debug. Just roll back.
kubectl drain k8s-worker-2 --ignore-daemonsets --delete-emptydir-data
kvm-snap k8s-worker-2 rollback
virsh start k8s-worker-2
# drain left the node cordoned; allow scheduling again
kubectl uncordon k8s-worker-2
# Worker rejoins the cluster in its last known-good state.
# Once uncordoned, Kubernetes schedules pods onto it again.

ZFS-backed persistent volumes (CSI driver)

# openebs-zfs CSI driver gives Kubernetes pods ZFS-backed persistent volumes
# Each PV is a ZFS dataset: snapshotable, cloneable, compressible
kubectl apply -f https://openebs.github.io/charts/zfs-operator.yaml

# Create a storage class
cat << 'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-sc
parameters:
  fstype: "zfs"           # back each PV with a ZFS dataset (other fstypes produce a formatted zvol)
  recordsize: "8k"        # tune per workload
  compression: "lz4"
  poolname: "rpool"
provisioner: zfs.csi.openebs.io
volumeBindingMode: WaitForFirstConsumer
EOF

# Now pods can request ZFS-backed storage:
# kind: PersistentVolumeClaim
# spec:
#   storageClassName: zfs-sc
#   resources:
#     requests:
#       storage: 50Gi

With the OpenEBS ZFS CSI driver, Kubernetes persistent volumes ARE ZFS datasets. A pod requests 50GB of storage, the CSI driver creates a ZFS dataset with recordsize=8k and compression=lz4 (from the StorageClass), and mounts it into the pod. You get ZFS snapshots of individual Kubernetes volumes. You can zfs send a PV to another cluster. You can set per-volume recordsize tuned to the workload — 8k for databases, 128k for general, 1M for media streaming. This is what "infrastructure-aware storage" means: the storage layer understands the workload because ZFS properties are set per-dataset, and each PV is its own dataset.
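The commented claim above leaves out the fields a real manifest needs. A small sketch that prints a complete PersistentVolumeClaim for the zfs-sc StorageClass; the function name and the example claim name pgdata are hypothetical, and ReadWriteOnce is assumed because the volume lives on a single node's pool:

```shell
# Print a PersistentVolumeClaim manifest bound to the zfs-sc StorageClass.
# Usage: zfs_pvc <name> <size>      e.g. zfs_pvc pgdata 50Gi
zfs_pvc() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  storageClassName: zfs-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: $2
EOF
}
```

Usage: zfs_pvc pgdata 50Gi | kubectl apply -f -. Because the StorageClass uses WaitForFirstConsumer, the claim stays Pending until a pod that mounts it is scheduled.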

# Disable swap (kldload doesn't create swap on ZFS, but be explicit)
swapoff -a
sed -i '/swap/d' /etc/fstab

# Load kernel modules required by K8s
cat > /etc/modules-load.d/k8s.conf << 'EOF'
overlay
br_netfilter
EOF
modprobe overlay
modprobe br_netfilter

# Sysctl settings for K8s networking
cat > /etc/sysctl.d/k8s.conf << 'EOF'
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sysctl --system

# Install containerd (container runtime)
dnf install -y containerd
containerd config default > /etc/containerd/config.toml
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
systemctl enable --now containerd

# Add Kubernetes repo (CentOS/RHEL; adjust for Debian if using a Debian golden image)
cat > /etc/yum.repos.d/kubernetes.repo << 'EOF'
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/repodata/repomd.xml.key
EOF

# Install K8s components
dnf install -y kubelet kubeadm kubectl
systemctl enable kubelet

# Pre-pull control plane images (saves time during kubeadm init)
kubeadm config images pull

Debian golden image alternative: If you're building on Debian, replace the dnf/yum commands with the Debian K8s repo setup (apt-get install -y kubelet kubeadm kubectl containerd). The rest is identical — kernel modules, sysctl, containerd config, kubeadm. kldload supports 8 distros as K8s node targets. CentOS Stream 9 and Debian 13 are the most common choices.

Seal and snapshot the golden image

# Inside the golden VM: clean up for cloning
rm -f /etc/machine-id
truncate -s 0 /etc/hostname
cloud-init clean 2>/dev/null || true
rm -f /etc/ssh/ssh_host_*     # new keys generated on first boot
history -c

# Shut down
poweroff

# On the KVM host: snapshot the golden zvol
kvm-snap k8s-golden
# ✔ Snapshot: rpool/vms/k8s-golden@2026-04-02_143022

The golden image now contains: a fresh kldload install, containerd configured with systemd cgroups, kubelet/kubeadm/kubectl installed and enabled, all control plane images pre-pulled, kernel modules and sysctl settings applied. Every clone of this image boots in 15 seconds with everything ready to kubeadm init or kubeadm join. No package downloads, no config, no waiting. If Kubernetes releases a new version, update the golden image once, re-snapshot, and future clones get the new version automatically. Existing nodes upgrade the normal Kubernetes way.
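Every clone inherits whatever identity the golden image still carries, so it is worth checking the seal before the poweroff and snapshot. A minimal sketch that tests for the three things the sealing commands remove; the function name is hypothetical, and it takes a root directory so it can also be pointed at a mounted image:

```shell
# Succeeds if the filesystem rooted at $1 looks sealed for cloning:
# no (or an empty) machine-id, an empty hostname file, no SSH host keys.
# Prints one line per problem found.
check_sealed() {
  root=$1
  rc=0
  if [ -s "$root/etc/machine-id" ]; then
    echo "machine-id is still set"; rc=1
  fi
  if [ -s "$root/etc/hostname" ]; then
    echo "hostname is still set"; rc=1
  fi
  for key in "$root"/etc/ssh/ssh_host_*; do
    # An unmatched glob stays literal, so test that the path exists.
    if [ -e "$key" ]; then
      echo "SSH host keys are still present"; rc=1
      break
    fi
  done
  return $rc
}
```

Inside the golden VM, run check_sealed "" just before poweroff (an empty argument makes the paths resolve against /).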

Scaling — add nodes on demand

# Clone a new worker from the golden image
kvm-clone k8s-golden k8s-worker-4
virsh start k8s-worker-4

# SSH in, set hostname, join the cluster
ssh root@k8s-worker-4 "hostnamectl set-hostname k8s-worker-4"
ssh root@k8s-worker-4 "kubeadm join k8s-cp-1:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>"

# To generate a fresh join token (they expire after 24h):
# On any control plane node:
kubeadm token create --print-join-command

Three steps to add a node: clone, set the hostname, join. The golden image has everything pre-installed. The ZFS clone takes 2 seconds regardless of disk size. The VM boots in 15 seconds. kubeadm join takes about 30 seconds. Total time from "I need another worker" to "node is Ready": under a minute. Try that with Terraform + cloud API + user-data scripts.
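The clone, hostname, and join sequence can be wrapped so the join token never has to be pasted by hand. The sketch below only prints the commands, so the plan can be reviewed before anything runs. The function name is made up, and the SSH options are assumptions: accept-new trusts the fresh host key each clone generates on first boot, and -n keeps ssh from reading the script's stdin.

```shell
# Print the commands that add worker number $1, one per line.
# Assumes root SSH to the nodes and that the new VM resolves by name.
add_worker_plan() {
  node="k8s-worker-$1"
  sshc="ssh -n -o StrictHostKeyChecking=accept-new"
  printf '%s\n' "kvm-clone k8s-golden $node"
  printf '%s\n' "virsh start $node"
  # Wait for the clone to boot and accept SSH connections.
  printf '%s\n' "until $sshc -o ConnectTimeout=2 root@$node true 2>/dev/null; do sleep 2; done"
  printf '%s\n' "$sshc root@$node hostnamectl set-hostname $node"
  # Ask a control plane node for a fresh join command, run it on the new node.
  printf '%s\n' "$sshc root@$node \"\$(ssh -n root@k8s-cp-1 kubeadm token create --print-join-command)\""
}
```

Usage on the KVM host: add_worker_plan 4 > add-worker-4.sh, read it, then sh add-worker-4.sh. Printing first is a deliberate choice: the plan touches both the hypervisor and the cluster, so a dry run is cheap insurance.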

Tear down a node

# Drain and remove from Kubernetes
kubectl drain k8s-worker-3 --ignore-daemonsets --delete-emptydir-data
kubectl delete node k8s-worker-3

# Destroy the VM and its zvol
virsh destroy k8s-worker-3
virsh undefine k8s-worker-3 --nvram
zfs destroy rpool/vms/k8s-worker-3

Using WireGuard for pod networking

If your kldload KVM hosts are connected via WireGuard (which they should be), you can run the entire Kubernetes cluster on the encrypted WireGuard fabric. The API server advertises on a WG address. Pod traffic (via Cilium or Flannel) traverses WireGuard tunnels. Nothing touches the physical LAN unencrypted. This is zero-trust Kubernetes networking without buying a service mesh: the transport is encrypted by WireGuard at the kernel level, the CNI policy is enforced by Cilium's eBPF programs, and both happen below the application layer. Pods don't know and don't care.

If your kldload nodes use the four-plane WireGuard mesh (see WireGuard Mesh & Multi-Site), use wg3 (storage plane, 10.80.0.0/16) for Kubernetes traffic:

# Initialize using the WireGuard address
kubeadm init \
  --apiserver-advertise-address 10.80.0.1 \
  --pod-network-cidr 10.244.0.0/16 \
  --upload-certs

# Configure Cilium to use the WG interface
cilium install \
  --set kubeProxyReplacement=true \
  --set devices=wg3

# Result: all K8s traffic — API server, pod-to-pod, etcd —
# runs through WireGuard tunnels. Encrypted, isolated, invisible
# to the physical LAN.
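Hard-coding 10.80.0.1 only works on the first node. A small sketch that reads the address from the interface instead, so the same kubeadm init line can be reused on any host; the function name is hypothetical and it assumes wg3 carries exactly one IPv4 address:

```shell
# Extract the IPv4 address (without prefix length) from the output of
#   ip -4 -o addr show dev wg3
# which looks like: "7: wg3    inet 10.80.0.1/16 scope global wg3 ..."
iface_ipv4() {
  awk '$3 == "inet" { split($4, a, "/"); print a[1]; exit }'
}
```

Usage: kubeadm init --apiserver-advertise-address "$(ip -4 -o addr show dev wg3 | iface_ipv4)" --pod-network-cidr 10.244.0.0/16 --upload-certs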

Multi-host Kubernetes across sites

# With WireGuard + BGP (see the Networking tutorial):
# - KVM hosts at Site A and Site B are connected via WG mesh
# - BGP announces K8s node routes between sites
# - Cilium handles pod networking across both sites
# - Result: pods on Site A can talk to pods on Site B
#   over encrypted WG tunnels with BGP-learned routes
#
# This is multi-cloud Kubernetes with no cloud provider lock-in.
