Kubernetes on KVM with kldload
This guide builds a production-grade Kubernetes cluster on a kldload KVM hypervisor. You build one golden image with all the K8s prerequisites baked in, clone it instantly with ZFS for each node, and have a running cluster in minutes. Every node is a zvol — snapshotable, cloneable, replicatable to a DR site. Scale by cloning. Roll back a broken upgrade by reverting a snapshot. Tear down the whole cluster and rebuild it from the golden image in under five minutes.
Build the golden image once, then kvm-clone it 5 times (2 seconds each), set hostnames, and run kubeadm init + kubeadm join. Total time from golden image to running cluster: under 15 minutes. And when something breaks, kvm-snap + rollback gets you back to the last known-good state in seconds, not "reinstall and rejoin the cluster."
Prerequisites
A kldload system installed with the kvm profile (or any profile with
KLDLOAD_ENABLE_KVM=1). This gives you KVM, libvirt, ZFS zvol storage,
and the kvm-create/kvm-clone/kvm-snap tools.
| Cluster size | RAM | Disk | Notes |
|---|---|---|---|
| 1 CP + 2 workers (dev) | 16GB | 100GB | Tight but functional |
| 3 CP + 3 workers (prod) | 32GB | 200GB | HA control plane |
| 3 CP + 6+ workers (scale) | 64GB+ | 500GB+ | Production workloads |
Step 1 — ZFS layout (already done by the kvm profile)
The kvm profile creates rpool/vms with the right properties
(compression=off, recordsize=64K, primarycache=metadata). VMs created with
kvm-create are zvols under rpool/vms/<name>
automatically. No manual dataset creation needed.
# Verify the layout exists
zfs list -r rpool/vms
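The listing confirms the dataset exists but not its properties. A minimal sketch of a property check, assuming the three values quoted above are what the kvm profile sets; check_vm_props is a hypothetical helper, not a kldload tool. It reads the tab-separated output of zfs get on stdin and reports any mismatch.

```shell
# check_vm_props: hypothetical helper, not part of kldload.
# Reads "property<TAB>value" lines on stdin (the format produced by
# `zfs get -H -o property,value ...`) and compares them with the
# values this guide says the kvm profile sets.
check_vm_props() {
    rc=0
    while IFS="$(printf '\t')" read -r prop value; do
        case "$prop" in
            compression)  want=off ;;
            recordsize)   want=64K ;;
            primarycache) want=metadata ;;
            *)            continue ;;
        esac
        if [ "$value" != "$want" ]; then
            echo "MISMATCH $prop: got $value, expected $want"
            rc=1
        fi
    done
    return "$rc"
}
```

On the KVM host this could be fed with: zfs get -H -o property,value compression,recordsize,primarycache rpool/vms | check_vm_props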
Step 2 — Build a golden image
Create the base VM
# Create a VM from the kldload ISO
kvm-create k8s-golden --ram 4096 --cpus 4 --disk 40 \
--iso /var/lib/libvirt/isos/kldload-free-latest.iso \
--os centos-stream9
# Connect via VNC to complete the kldload install
virsh vncdisplay k8s-golden
# Install with the "server" or "core" profile — minimal is best for K8s nodes
Bake K8s prerequisites into the golden image
SSH into the golden VM and install everything a K8s node needs:
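A minimal sketch of the kernel-module and sysctl part of that setup, assuming the settings from the standard kubeadm container-runtime prerequisites. The function name write_k8s_node_config and its ROOT argument are hypothetical; ROOT exists only so the files can be written to a scratch directory and inspected before touching the real /etc. Package installation, swap, and the containerd config are not covered here.

```shell
# write_k8s_node_config: hypothetical helper for illustration.
# Writes the kernel-module and sysctl settings a kubeadm node needs
# under ROOT (default: the real filesystem root).
write_k8s_node_config() {
    root="${1:-}"
    mkdir -p "$root/etc/modules-load.d" "$root/etc/sysctl.d"

    # Modules for overlay filesystems and bridged pod traffic
    printf '%s\n' overlay br_netfilter > "$root/etc/modules-load.d/k8s.conf"

    # Let iptables see bridged traffic and enable IPv4 forwarding
    cat > "$root/etc/sysctl.d/k8s.conf" <<'EOF'
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
}
```

Inside the golden VM, calling write_k8s_node_config with no argument and then running modprobe overlay, modprobe br_netfilter, and sysctl --system would apply the settings without a reboot.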
Step 3 — Clone VMs instantly
# Clone 3 control plane nodes from the golden image
for i in 1 2 3; do
kvm-clone k8s-golden k8s-cp-${i}
done
# Clone 3 worker nodes
for i in 1 2 3; do
kvm-clone k8s-golden k8s-worker-${i}
done
# That's it. 6 VMs created in under 10 seconds total.
# Each is a ZFS zvol clone — zero bytes copied, shares blocks with the golden image.
# Start all nodes
for vm in k8s-cp-{1,2,3} k8s-worker-{1,2,3}; do
virsh start ${vm}
done
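The same node list recurs in the clone, start, and snapshot loops. A small sketch that generates it from two counts, so that changing the cluster size means changing one call; k8s_nodes is a hypothetical helper that follows the k8s-cp-N / k8s-worker-N naming used above.

```shell
# k8s_nodes: hypothetical helper for illustration.
# Prints one VM name per line: CP_COUNT control planes, then
# WORKER_COUNT workers, using this guide's naming scheme.
k8s_nodes() {
    cp_count="$1"; worker_count="$2"
    i=1
    while [ "$i" -le "$cp_count" ]; do
        echo "k8s-cp-$i"; i=$((i + 1))
    done
    i=1
    while [ "$i" -le "$worker_count" ]; do
        echo "k8s-worker-$i"; i=$((i + 1))
    done
}
```

For example, for vm in $(k8s_nodes 3 3); do virsh start "$vm"; done starts the six nodes cloned above.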
Broke a node? Run kvm-snap rollback. Need to test a new K8s version? Clone the golden, upgrade the clone, test. Didn't work? Destroy the clone. Zero risk.
Set unique hostnames
# SSH into each node and set its hostname
# (the golden image had its hostname cleared during sealing)
for i in 1 2 3; do
ssh root@k8s-cp-${i} "hostnamectl set-hostname k8s-cp-${i}"
done
for i in 1 2 3; do
ssh root@k8s-worker-${i} "hostnamectl set-hostname k8s-worker-${i}"
done
Step 4 — Initialize the cluster
Run kubeadm init on the first control plane node, then kubeadm join on everything else. No per-node package installs. No per-node configuration. The golden image did all of that once.
On ALL nodes: regenerate machine identity (clones only)
# Each cloned node needs a unique machine-id (systemd uses this for journald, D-Bus, etc.)
# If the golden image was sealed properly, this happens automatically on first boot.
# Verify on each node:
cat /etc/machine-id # should be unique per node
hostnamectl # should show the hostname you set
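Comparing machine IDs by eye across six nodes is easy to get wrong. A minimal sketch of an automated check; find_duplicate_ids is a hypothetical helper that reads "hostname machine-id" pairs on stdin and prints any ID shared by more than one node.

```shell
# find_duplicate_ids: hypothetical helper for illustration.
# stdin: one "hostname machine-id" pair per line
# Prints every machine-id that appears on more than one node.
find_duplicate_ids() {
    awk '{ count[$2]++ } END { for (id in count) if (count[id] > 1) print id }'
}
```

The pairs could be collected from the KVM host with a loop such as: for n in k8s-cp-1 k8s-cp-2 k8s-worker-1; do echo "$n $(ssh root@$n cat /etc/machine-id)"; done | find_duplicate_ids. Empty output means every node has its own identity.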
On the first control plane node (k8s-cp-1)
# Initialize the cluster
kubeadm init \
--control-plane-endpoint "k8s-cp-1:6443" \
--pod-network-cidr 10.244.0.0/16 \
--upload-certs
# Save the output — it contains join commands for other nodes
# Set up kubectl
mkdir -p ~/.kube
cp /etc/kubernetes/admin.conf ~/.kube/config
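# Install Cilium as the CNI (eBPF-based; kubeProxyReplacement=true lets it take over kube-proxy's job)
# Note: the kubeadm init above still deploys kube-proxy. To leave it out entirely,
# kubeadm init must be run with --skip-phases=addon/kube-proxy.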
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
curl -L --fail --remote-name-all \
https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-amd64.tar.gz
tar xzf cilium-linux-amd64.tar.gz -C /usr/local/bin
cilium install --set kubeProxyReplacement=true
Join additional control plane nodes
# On k8s-cp-2 and k8s-cp-3, use the join command from kubeadm init output:
kubeadm join k8s-cp-1:6443 \
--token <token> \
--discovery-token-ca-cert-hash sha256:<hash> \
--control-plane \
--certificate-key <cert-key>
Join worker nodes
# On each worker:
kubeadm join k8s-cp-1:6443 \
--token <token> \
--discovery-token-ca-cert-hash sha256:<hash>
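A truncated token or hash makes kubeadm join fail only after it has contacted the API server. A sketch that validates both values before assembling the worker join command, assuming the standard bootstrap token format (six characters, a dot, sixteen characters) and a SHA-256 CA hash; build_join_cmd is a hypothetical helper, and the k8s-cp-1:6443 endpoint is the one used in this guide.

```shell
# build_join_cmd: hypothetical helper for illustration.
# Validates a bootstrap token and CA cert hash, then prints the
# worker join command for this guide's control-plane endpoint.
build_join_cmd() {
    token="$1"; hash="$2"
    if ! printf '%s\n' "$token" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'; then
        echo "invalid token format" >&2; return 1
    fi
    if ! printf '%s\n' "$hash" | grep -Eq '^sha256:[0-9a-f]{64}$'; then
        echo "invalid CA cert hash" >&2; return 1
    fi
    echo "kubeadm join k8s-cp-1:6443 --token $token --discovery-token-ca-cert-hash $hash"
}
```

The function only prints the command, so the result can be reviewed before it is run on a worker.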
Verify
kubectl get nodes -o wide
cilium status
What ZFS brings to Kubernetes
Pre-upgrade snapshots (atomic cluster rollback)
# Before upgrading K8s from 1.30 to 1.31:
# Snapshot every node from the KVM host
for vm in k8s-cp-{1,2,3} k8s-worker-{1,2,3}; do
kvm-snap ${vm}
done
# Upgrade proceeds normally inside the VMs...
# kubeadm upgrade apply v1.31.0, etc.
# If the upgrade breaks something:
for vm in k8s-cp-{1,2,3} k8s-worker-{1,2,3}; do
kvm-snap ${vm} rollback
done
# Entire cluster is back to pre-upgrade state in seconds.
# Not "restore from backup." Not "debug for 3 hours."
# Atomic rollback to the exact state before the upgrade.
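The rollback only works if every node was actually snapshotted before the upgrade started. A minimal sketch of a pre-upgrade gate, assuming snapshots are named rpool/vms/<vm>@<timestamp> as in the kvm-snap output shown in this guide; missing_snapshots is a hypothetical helper that reads a snapshot listing on stdin.

```shell
# missing_snapshots: hypothetical helper for illustration.
# stdin: snapshot names, one per line, as printed by
#        `zfs list -H -t snapshot -o name -r rpool/vms`
# args:  VM names that must each have at least one snapshot
# Prints the VMs with no snapshot; returns non-zero if any are missing.
missing_snapshots() {
    snaps=$(cat)
    rc=0
    for vm in "$@"; do
        if ! printf '%s\n' "$snaps" | grep -q "^rpool/vms/$vm@"; then
            echo "$vm"
            rc=1
        fi
    done
    return "$rc"
}
```

On the KVM host: zfs list -H -t snapshot -o name -r rpool/vms | missing_snapshots k8s-cp-1 k8s-cp-2 k8s-cp-3 k8s-worker-1 k8s-worker-2 k8s-worker-3. This checks only that a snapshot exists, not how recent it is.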
Test clusters from production state
# Clone every production node into a test cluster
for vm in k8s-cp-1 k8s-worker-{1,2,3}; do
kvm-clone ${vm} test-${vm}
done
# The test cluster is a byte-identical copy of production.
# Test your upgrade, migration, or config change on the clone.
# When done, destroy the test cluster:
for vm in test-k8s-cp-1 test-k8s-worker-{1,2,3}; do
virsh destroy ${vm} 2>/dev/null
virsh undefine ${vm} --nvram
zfs destroy rpool/vms/${vm}
done
# Zero risk to production. Zero cost until the clones diverge.
Node recovery in seconds
# Worker node is misbehaving — corrupted container runtime, broken kubelet config
# Don't debug. Just roll back.
kubectl drain k8s-worker-2 --ignore-daemonsets --delete-emptydir-data
kvm-snap k8s-worker-2 rollback
virsh start k8s-worker-2
# drain leaves the node cordoned; allow scheduling again once it is back
kubectl uncordon k8s-worker-2
# Worker rejoins the cluster with its last known-good state.
# Kubernetes schedules pods onto it again automatically.
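After a rollback it helps to confirm the node's state rather than assume it. Note that kubectl drain cordons the node, and the cordon lives in the API server, so a rolled-back node shows Ready,SchedulingDisabled until kubectl uncordon is run. A small sketch for reading the status; node_status is a hypothetical helper that parses kubectl get nodes output from stdin.

```shell
# node_status: hypothetical helper for illustration.
# stdin: output of `kubectl get nodes --no-headers`
# arg:   node name
# Prints the STATUS column for that node (e.g. Ready, NotReady,
# Ready,SchedulingDisabled); returns non-zero if the node is absent.
node_status() {
    awk -v node="$1" '$1 == node { print $2; found = 1 } END { exit !found }'
}
```

Usage: kubectl get nodes --no-headers | node_status k8s-worker-2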
ZFS-backed persistent volumes (CSI driver)
# openebs-zfs CSI driver gives Kubernetes pods ZFS-backed persistent volumes
# Each PV is a ZFS dataset: snapshotable, cloneable, compressible
kubectl apply -f https://openebs.github.io/charts/zfs-operator.yaml
# Create a storage class
cat << 'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-sc
parameters:
  fstype: "zfs"        # provision a ZFS dataset (recordsize applies to datasets only)
  recordsize: "8k"     # tune per workload
  compression: "lz4"
  poolname: "rpool"
provisioner: zfs.csi.openebs.io
volumeBindingMode: WaitForFirstConsumer
EOF
# Now pods can request ZFS-backed storage:
# kind: PersistentVolumeClaim
# spec:
#   storageClassName: zfs-sc
#   resources:
#     requests:
#       storage: 50Gi
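A StorageClass fixes one recordsize, so different workloads call for different classes. A sketch of the mapping this guide recommends (8k for databases, 128k for general use, 1M for media streaming); recordsize_for and the workload labels database, general, and media are hypothetical names chosen for illustration.

```shell
# recordsize_for: hypothetical helper for illustration.
# Maps a workload label to the recordsize recommended in this guide.
recordsize_for() {
    case "$1" in
        database) echo 8k ;;
        general)  echo 128k ;;
        media)    echo 1M ;;
        *) echo "unknown workload: $1" >&2; return 1 ;;
    esac
}
```

The result can be substituted into the recordsize parameter when generating one StorageClass per workload type.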
For each claim, the CSI driver creates a ZFS dataset with recordsize=8k and compression=lz4 (from the StorageClass), and mounts it into the pod. You get ZFS snapshots of individual Kubernetes volumes. You can zfs send a PV to another cluster. You can set per-volume recordsize tuned to the workload: 8k for databases, 128k for general, 1M for media streaming. This is what "infrastructure-aware storage" means: the storage layer understands the workload because ZFS properties are set per-dataset, and each PV is its own dataset.
On Debian-based nodes, the package install step uses apt (apt-get install -y kubelet kubeadm kubectl containerd). The rest is identical: kernel modules, sysctl, containerd config, kubeadm. kldload supports 8 distros as K8s node targets. CentOS Stream 9 and Debian 13 are the most common choices.
Seal and snapshot the golden image
# Inside the golden VM: clean up for cloning
rm -f /etc/machine-id
truncate -s 0 /etc/hostname
cloud-init clean 2>/dev/null || true
rm -f /etc/ssh/ssh_host_* # new keys generated on first boot
history -c
# Shut down
poweroff
# On the KVM host: snapshot the golden zvol
kvm-snap k8s-golden
# ✔ Snapshot: rpool/vms/k8s-golden@2026-04-02_143022
Every clone of this snapshot boots ready for kubeadm init or kubeadm join. No package downloads, no config, no waiting. If Kubernetes releases a new version, update the golden image once, re-snapshot, and future clones get the new version automatically. Existing nodes upgrade the normal Kubernetes way.
Scaling: add nodes on demand
# Clone a new worker from the golden image
kvm-clone k8s-golden k8s-worker-4
virsh start k8s-worker-4
# SSH in, set hostname, join the cluster
ssh root@k8s-worker-4 "hostnamectl set-hostname k8s-worker-4"
ssh root@k8s-worker-4 "kubeadm join k8s-cp-1:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>"
# To generate a fresh join token (they expire after 24h):
# On any control plane node:
kubeadm token create --print-join-command
kubeadm join takes about 30 seconds. Total time from "I need another worker" to "node is Ready": under a minute. Try that with Terraform + cloud API + user-data scripts.
Tear down a node
# Drain and remove from Kubernetes
kubectl drain k8s-worker-3 --ignore-daemonsets --delete-emptydir-data
kubectl delete node k8s-worker-3
# Destroy the VM and its zvol
virsh destroy k8s-worker-3
virsh undefine k8s-worker-3 --nvram
zfs destroy rpool/vms/k8s-worker-3
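zfs destroy cannot be undone, so a typo in the VM name is costly. A minimal sketch of a guard that builds the dataset path and refuses obviously wrong targets; vm_dataset is a hypothetical helper, and the decision to protect the k8s-golden name is an assumption based on the naming used in this guide.

```shell
# vm_dataset: hypothetical helper for illustration.
# Prints the zvol path for a node VM. Refuses names that are empty,
# contain a path separator or '@', or refer to the golden image.
vm_dataset() {
    case "$1" in
        ""|*/*|*@*|k8s-golden)
            echo "refusing to target '$1'" >&2; return 1 ;;
    esac
    echo "rpool/vms/$1"
}
```

Usage: ds=$(vm_dataset k8s-worker-3) && zfs destroy "$ds". The destroy runs only when the name passes the checks.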
Using WireGuard for pod networking
If your kldload nodes use the four-plane WireGuard mesh (see
WireGuard Mesh & Multi-Site),
use wg3 (storage plane, 10.80.0.0/16) for Kubernetes traffic:
# Initialize using the WireGuard address
kubeadm init \
--apiserver-advertise-address 10.80.0.1 \
--pod-network-cidr 10.244.0.0/16 \
--upload-certs
# Configure Cilium to use the WG interface
cilium install \
--set kubeProxyReplacement=true \
--set devices=wg3
# Result: all K8s traffic — API server, pod-to-pod, etcd —
# runs through WireGuard tunnels. Encrypted, isolated, invisible
# to the physical LAN.
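Passing a LAN address to --apiserver-advertise-address by mistake would silently put API traffic outside the tunnels. A small sketch of a pre-flight check that an address belongs to the wg3 storage plane (10.80.0.0/16, as stated above); in_storage_plane is a hypothetical helper.

```shell
# in_storage_plane: hypothetical helper for illustration.
# Succeeds when the argument is a dotted-quad IPv4 address inside
# 10.80.0.0/16, the wg3 storage plane described in this guide.
in_storage_plane() {
    printf '%s\n' "$1" | awk -F. '
        NF != 4 || $1 != 10 || $2 != 80 { exit 1 }
        $3 !~ /^[0-9]+$/ || $4 !~ /^[0-9]+$/ { exit 1 }
        $3 > 255 || $4 > 255 { exit 1 }
    '
}
```

Usage: in_storage_plane 10.80.0.1 && kubeadm init --apiserver-advertise-address 10.80.0.1 followed by the remaining flags shown above.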
Multi-host Kubernetes across sites
# With WireGuard + BGP (see the Networking tutorial):
# - KVM hosts at Site A and Site B are connected via WG mesh
# - BGP announces K8s node routes between sites
# - Cilium handles pod networking across both sites
# - Result: pods on Site A can talk to pods on Site B
# over encrypted WG tunnels with BGP-learned routes
#
# This is multi-cloud Kubernetes with no cloud provider lock-in.