Kubernetes Masterclass
This guide picks up where the Kubernetes on KVM
tutorial leaves off. You have a cluster: golden images built with the kvm profile,
nodes cloned with kvm-clone, kubeadm init done, Cilium installed. Now learn to
operate it — RBAC, Helm, storage classes backed by ZFS datasets, operators, upgrade
strategies that use ZFS snapshots as your safety net, and multi-cluster networking
over WireGuard. Zero to hero, on hardware you own.
What Kubernetes actually is: a container orchestrator. It schedules workloads across nodes, manages pod networking, handles persistent storage lifecycle, automates rolling deployments, and self-heals failed containers. On kldload, the nodes are ZFS zvols — snapshotable, cloneable, replicatable in seconds. The network is Cilium running eBPF programs in the kernel. This combination changes what is operationally possible.
What this masterclass covers: cluster architecture, kubeadm deep dive, RBAC, namespaces, ZFS-backed persistent volumes, Helm, ingress, workload types, operators, observability, upgrade strategies, multi-cluster, security hardening, and troubleshooting — all grounded in the kldload stack.
1. Kubernetes Is Infrastructure, Not Magic
Before going deep, fix the mental model. Kubernetes is a distributed system built
from components that talk over HTTPS. When you run kubectl apply, your manifest
travels to the API server, which validates it and stores it in etcd. Controllers
watching the API server create the pods the manifest calls for, the scheduler
assigns each unscheduled pod to a node, and the kubelet on that node pulls the
image and starts the container. Every step is auditable. Nothing is magic.
API Server
The front door. Every kubectl command, every controller action, every webhook
call goes through the API server. It validates manifests, enforces admission
policies, persists objects to etcd, and notifies watchers of changes via watch
streams. It is stateless — all state lives in etcd.
etcd
The distributed key-value store that holds all cluster state: every object, every secret, every config. etcd uses the Raft consensus algorithm — a write is not committed until a majority of etcd members agree. Lose etcd, lose the cluster. Back it up. On kldload, etcd data lives on a ZFS dataset; snapshot before every upgrade.
Scheduler
Watches for pods with no assigned node and picks one. Decisions are based on resource requests, node affinity/anti-affinity rules, taints and tolerations, topology spread constraints, and custom scoring plugins. It does not start containers — it just writes a node name into the pod spec.
Controller Manager
A collection of control loops, each watching the cluster state and reconciling it toward the desired state. The Deployment controller ensures replica counts match. The Node controller marks nodes NotReady when they stop heartbeating. The Endpoints controller keeps service endpoints in sync with pods.
Kubelet
The node agent. Runs on every node, including control plane nodes, where it runs the control plane components as static pods. It watches the API server for pods assigned to its node, calls the container runtime (containerd) to pull images and start containers, reports pod status back, manages volume mounts, and enforces resource limits via cgroups.
kube-proxy / Cilium
Handles service networking. In a kldload cluster with Cilium, kube-proxy is replaced entirely. Cilium's eBPF programs handle service IP translation in the kernel with O(1) hash map lookups. No iptables chains, no conntrack for east-west traffic, no performance cliff at 1000+ services.
kldload Makes Node Provisioning Instant
Traditional node provisioning: boot an OS, install packages, configure networking,
join the cluster — 15 to 30 minutes. On kldload, you build one golden VM with the
kvm profile (containerd, kubeadm, kubectl, kubelet pre-installed), take a ZFS snapshot,
and clone it with kvm-clone. Clone time is under 10 seconds regardless of disk size
because ZFS clones are copy-on-write — no data is copied at clone time.
# Snapshot the golden node
kvm-snap k8s-golden before-clone
# Clone 3 workers in parallel — each clone takes seconds
kvm-clone k8s-golden k8s-worker-1 &
kvm-clone k8s-golden k8s-worker-2 &
kvm-clone k8s-golden k8s-worker-3 &
wait
# Start the clones. Each one needs a unique hostname and IP before it joins.
for node in k8s-worker-{1,2,3}; do
  virsh start "$node"
done
Before any significant cluster operation — upgrade, CNI migration, etcd maintenance
— snapshot every node with kvm-snap. A failed operation is a 2-second rollback,
not a 2-hour rebuild.
2. kubeadm Deep Dive
kubeadm is the official tool for bootstrapping Kubernetes clusters. It handles certificate generation, etcd bootstrap, static pod manifests for the control plane components, and the join token workflow for adding nodes.
Init: First Control Plane Node
# kubeadm-config.yaml: explicit is better than implicit
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.31.0
controlPlaneEndpoint: "k8s-api.internal:6443"  # VIP or load balancer for HA
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"
etcd:
  local:
    dataDir: /var/lib/etcd  # put this on a ZFS dataset
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: "192.168.10.10"
  bindPort: 6443
# Snapshot before init
kvm-snap k8s-control-1 pre-init
kubeadm init --config kubeadm-config.yaml --upload-certs
# Copy kubeconfig
mkdir -p $HOME/.kube
cp /etc/kubernetes/admin.conf $HOME/.kube/config
Join: Workers and Additional Control Plane Nodes
# Print join command for workers
kubeadm token create --print-join-command
# Join command for additional control plane nodes (HA)
# --control-plane --certificate-key is included in the init output
kubeadm join k8s-api.internal:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <cert-key>
HA Control Plane: 3 Nodes, Stacked etcd
A highly available control plane runs three control plane nodes, each running the
API server, controller manager, scheduler, and etcd. A load balancer (or keepalived
VIP) fronts the three API servers at the controlPlaneEndpoint. Raft requires a
majority — three nodes tolerate one failure. Five nodes tolerate two.
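The majority rule can be checked with a little arithmetic. The sketch below is illustrative only; the function names `quorum` and `tolerated_failures` are made up for this example and are not part of etcd or kubeadm.

```shell
#!/usr/bin/env bash
# Raft quorum arithmetic for an etcd cluster with N voting members.
# quorum:             smallest number of members that forms a majority
# tolerated_failures: how many members can be lost while a majority remains
quorum() { echo $(( $1 / 2 + 1 )); }
tolerated_failures() { echo $(( ($1 - 1) / 2 )); }

for n in 1 2 3 4 5; do
  printf 'members=%d quorum=%d tolerates=%d\n' \
    "$n" "$(quorum "$n")" "$(tolerated_failures "$n")"
done
```

Note that four members tolerate no more failures than three, which is why control planes are normally sized at odd member counts.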
Stacked etcd
etcd runs on the same nodes as the API server. Simpler to operate. Three control plane nodes means three etcd members. This is the default and is appropriate for most deployments. The control plane nodes need enough resources to run etcd reliably — SSDs for etcd data, avoid noisy neighbours.
External etcd
etcd runs on its own dedicated cluster, separate from the API servers. Allows independent scaling and isolation of etcd failures from API server failures. More complex to operate: separate certificate management, separate backup targets. Appropriate for very large clusters or regulated environments requiring strict component isolation.
Certificate Management
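# Check certificate expiry; do this regularly
kubeadm certs check-expiration
# Rotate all certificates (do this annually, or before they expire)
# Snapshot every node first
kubeadm certs renew all
# Restart the control plane components to pick up the new certs.
# Renewal does not change the static pod manifests, so the pods do not restart
# on their own: temporarily move each manifest out of /etc/kubernetes/manifests,
# wait for the kubelet to stop the pod, then move it back.
systemctl restart kubelet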
kubeadm Upgrade Workflow
# Always upgrade one minor version at a time. Never skip.
# 1.29 → 1.30 → 1.31. Not 1.29 → 1.31.
# On the first control plane node:
# 1. Snapshot before anything
kvm-snap k8s-control-1 pre-upgrade-1-31
# 2. Upgrade kubeadm package
dnf install -y kubeadm-1.31.0  # or apt on Debian/Ubuntu nodes
# 3. Plan the upgrade (shows what will change)
kubeadm upgrade plan
# 4. Apply the upgrade
kubeadm upgrade apply v1.31.0
# 5. Drain node, upgrade kubelet and kubectl, uncordon
kubectl drain k8s-control-1 --ignore-daemonsets
dnf install -y kubelet-1.31.0 kubectl-1.31.0
systemctl daemon-reload && systemctl restart kubelet
kubectl uncordon k8s-control-1
# Repeat for each worker:
kubectl drain k8s-worker-1 --ignore-daemonsets
# (on the worker)
kubeadm upgrade node && dnf install ...
kubectl uncordon k8s-worker-1
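The one-minor-version-at-a-time rule is easy to guard in a wrapper script before anything touches the cluster. This is a minimal sketch; `is_supported_step`, `major_of`, and `minor_of` are hypothetical helper names, and the check only compares version strings you pass in.

```shell
#!/usr/bin/env bash
# Extract the major and minor components from a version such as v1.31.0 or 1.31.
major_of() { local v="${1#v}"; echo "${v%%.*}"; }
minor_of() { local v="${1#v}"; v="${v#*.}"; echo "${v%%.*}"; }

# Succeeds only for a patch upgrade or a single minor-version step
# within the same major version. Skips and downgrades are rejected.
is_supported_step() {
  local from="$1" to="$2"
  [ "$(major_of "$from")" = "$(major_of "$to")" ] || return 1
  local diff=$(( $(minor_of "$to") - $(minor_of "$from") ))
  [ "$diff" -eq 0 ] || [ "$diff" -eq 1 ]
}

if is_supported_step v1.30.2 v1.31.0; then
  echo "v1.30.2 -> v1.31.0: ok"
fi
if ! is_supported_step v1.29.0 v1.31.0; then
  echo "v1.29.0 -> v1.31.0: refused, upgrade to 1.30 first"
fi
```

In practice the current version would come from the cluster itself rather than a literal, but that lookup is left out here to keep the sketch self-contained.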
3. RBAC — Who Can Do What
Role-Based Access Control is Kubernetes's authorization system. Every request to the API server carries an identity — a user, a group, or a service account. RBAC rules say what that identity can do: which verbs (get, list, create, delete, patch) on which resources (pods, secrets, deployments) in which namespaces.
The Four Objects
Role
Defines a set of permissions within a single namespace. A Role named
pod-reader in namespace dev grants get, list, watch on pods
in the dev namespace only. It cannot span namespaces.
ClusterRole
Like a Role but cluster-scoped. Can grant permissions on namespaced resources
across all namespaces, or on non-namespaced resources like nodes, PersistentVolumes,
and StorageClasses. Built-in ClusterRoles: cluster-admin, admin, edit,
view.
RoleBinding
Binds a Role or ClusterRole to a subject (user, group, or service account) within a namespace. A RoleBinding can reference a ClusterRole — this grants the ClusterRole's permissions only within the binding's namespace. Useful for reusable role definitions with namespace-scoped grants.
ClusterRoleBinding
Binds a ClusterRole to a subject cluster-wide. A ClusterRoleBinding to
cluster-admin gives the subject full control over everything. Use sparingly.
Most humans should be bound to a ClusterRole via a RoleBinding in their
specific namespace, not a ClusterRoleBinding.
Concrete Examples
# Read-only viewer: can see pods, logs, events, and nothing else
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: viewer
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log", "events", "services", "endpoints"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
  verbs: ["get", "list", "watch"]
---
# Bind to a specific user in the monitoring namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alice-viewer
  namespace: monitoring
subjects:
- kind: User
  name: alice
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: viewer
  apiGroup: rbac.authorization.k8s.io
# CI/CD deployer: can create/update/delete deployments, services, configmaps
# but NOT secrets or RBAC objects
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployer
  namespace: production
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["services", "configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
---
# Service account for the CI runner pod
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-deployer
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deployer-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: ci-deployer
  namespace: production
roleRef:
  kind: Role
  name: deployer
  apiGroup: rbac.authorization.k8s.io
# Audit: who has cluster-admin?
kubectl get clusterrolebindings \
-o jsonpath='{range .items[?(@.roleRef.name=="cluster-admin")]}{.metadata.name}: {range .subjects[*]}{.kind}/{.name} {end}{"\n"}{end}'
# Check what a specific user can do
kubectl auth can-i --list --as=alice --namespace=production
# Check a specific permission
kubectl auth can-i delete pods --as=system:serviceaccount:production:ci-deployer -n production
4. Namespaces — Multi-Tenancy
Namespaces are Kubernetes's way of partitioning a cluster into virtual segments. Object names only need to be unique within a namespace, so each team or environment can reuse the same names without colliding. RBAC policies can be scoped to a namespace. ResourceQuotas limit total resource consumption per namespace. NetworkPolicies (enforced by Cilium) control which pods can talk to which.
Resource Quotas and Limit Ranges
# ResourceQuota: hard limits on total resource consumption in a namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
    persistentvolumeclaims: "10"
    services.loadbalancers: "0"  # no LoadBalancer services in dev
---
# LimitRange: default requests/limits for pods that don't specify
apiVersion: v1
kind: LimitRange
metadata:
  name: dev-limits
  namespace: dev
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: 256Mi
    defaultRequest:
      cpu: "100m"
      memory: 64Mi
    max:
      cpu: "2"
      memory: 4Gi
Dev / Staging / Production Namespace Pattern
# Create namespaces with environment labels
kubectl create namespace dev
kubectl create namespace staging
kubectl create namespace production
kubectl label namespace dev environment=dev
kubectl label namespace staging environment=staging
kubectl label namespace production environment=production
# Cilium NetworkPolicy: staging pods cannot reach production pods
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: no-staging-to-prod
  namespace: production
spec:
  endpointSelector: {}  # applies to all pods in production
  ingress:
  - fromEndpoints:
    - matchLabels:
        io.kubernetes.pod.namespace: production  # only from production namespace
5. Storage — ZFS-Backed Persistent Volumes
On kldload, a Kubernetes PersistentVolume is a ZFS dataset. Not a block device emulating a filesystem, not a thin-provisioned LUN — a real ZFS dataset with its own recordsize, compression algorithm, encryption key, snapshot schedule, and replication target. The OpenEBS ZFS CSI driver bridges Kubernetes storage primitives to ZFS operations.
zfs send | zfs receive. No other Kubernetes storage backend gives you this combination of features without a proprietary SAN.
Install OpenEBS ZFS CSI Driver
# ZFS must already exist on the nodes with a pool named 'rpool' (or your pool name)
zpool list  # verify pool exists on each worker node
# Install via Helm
helm repo add openebs-zfslocalpv https://openebs.github.io/zfs-localpv
helm repo update
helm install openebs-zfs openebs-zfslocalpv/zfs-localpv \
  --namespace openebs \
  --create-namespace
StorageClasses for Different Workloads
# General-purpose: lz4 compression, 128K recordsize (good for most workloads)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-general
provisioner: zfs.csi.openebs.io
parameters:
  poolname: rpool/k8s
  compression: lz4
  recordsize: "131072"  # 128K
  fstype: zfs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Database: 8K recordsize matches the PostgreSQL page size
# (MySQL/InnoDB defaults to 16K pages, so use 16384 for MySQL volumes)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-db
provisioner: zfs.csi.openebs.io
parameters:
  poolname: rpool/k8s
  compression: lz4
  recordsize: "8192"  # 8K: matches the PostgreSQL page size, avoids write amplification
  fstype: zfs
reclaimPolicy: Retain  # never auto-delete database volumes
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Logging / time-series: large records, high compression
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-logs
provisioner: zfs.csi.openebs.io
parameters:
  poolname: rpool/k8s
  compression: zstd
  recordsize: "1048576"  # 1M: sequential write workloads
  fstype: zfs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
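The StorageClasses above spell recordsize in bytes, while ZFS documentation and `zfs get` output usually use shorthand such as 8K or 1M. A small helper makes the conversion explicit. The name `to_bytes` is hypothetical, and the sketch handles only the K and M suffixes that appear in this section.

```shell
#!/usr/bin/env bash
# Convert ZFS-style size shorthand (8K, 128K, 1M) to a byte count.
# Values without a suffix are passed through unchanged.
to_bytes() {
  local n="${1%[KkMm]}"   # numeric part with any K/M suffix stripped
  case "$1" in
    *[Kk]) echo $(( n * 1024 )) ;;
    *[Mm]) echo $(( n * 1024 * 1024 )) ;;
    *)     echo "$1" ;;
  esac
}

for size in 8K 128K 1M; do
  printf 'recordsize %-5s = %s bytes\n' "$size" "$(to_bytes "$size")"
done
```

The three results correspond to the zfs-db, zfs-general, and zfs-logs classes respectively.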
Using PVCs
# Request a 50Gi database volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: production
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: zfs-db
  resources:
    requests:
      storage: 50Gi
VolumeSnapshots — Backup and Clone
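# Install the VolumeSnapshot CRDs (one-time). These manifests contain only the CRDs;
# the snapshot controller from the same repository must also be deployed.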
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
# ZFS snapshot class
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: zfs-snapshot
driver: zfs.csi.openebs.io
deletionPolicy: Delete
---
# Take a snapshot of the postgres volume
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snap-20260402
  namespace: production
spec:
  volumeSnapshotClassName: zfs-snapshot
  source:
    persistentVolumeClaimName: postgres-data
---
# Restore: create a new PVC from the snapshot
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-restore
  namespace: production
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: zfs-db
  resources:
    requests:
      storage: 50Gi
  dataSource:
    name: postgres-snap-20260402
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
6. Helm — Package Management
Helm is Kubernetes's package manager. A chart is a collection of templated Kubernetes manifests. A release is an installed instance of a chart with a specific set of values. Helm tracks releases, making upgrades and rollbacks first-class operations.
Basic Helm Workflow
# Add a chart repository
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo add jetstack https://charts.jetstack.io
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Search for a chart
helm search repo nginx
# Install with default values
helm install my-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace
# Install with custom values file
helm install my-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --values nginx-values.yaml
# List releases
helm list -A
# Upgrade a release
helm upgrade my-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --values nginx-values.yaml
# Roll back to revision 1 (omit the revision number to roll back to the previous release)
helm rollback my-nginx 1 --namespace ingress-nginx
# Uninstall
helm uninstall my-nginx --namespace ingress-nginx
Writing a Minimal Chart
# Scaffold a new chart
helm create myapp
# Chart structure:
# myapp/
#   Chart.yaml          name, version, description
#   values.yaml         default values
#   templates/
#     deployment.yaml   {{ .Values.image.repository }}:{{ .Values.image.tag }}
#     service.yaml
#     _helpers.tpl      named templates shared across files
# Lint and dry-run before installing
helm lint myapp/
helm install myapp ./myapp --dry-run --debug
# Package for distribution
helm package myapp/
7. Ingress & Load Balancing
Kubernetes Services expose pods inside the cluster. Getting traffic from outside the cluster to pods requires either a LoadBalancer service (needs an external load balancer or a bare-metal implementation like MetalLB or Cilium BGP), or an Ingress controller that routes HTTP/HTTPS traffic to services based on hostname and path.
nginx Ingress Controller
# values for bare-metal (no cloud load balancer)
# nginx-values.yaml
controller:
  service:
    type: LoadBalancer
    # On kldload with Cilium BGP, Cilium assigns the LoadBalancer IP
    # On bare metal without BGP, use NodePort or MetalLB
  nodeSelector:
    node-role.kubernetes.io/control-plane: ""
  tolerations:
  - key: node-role.kubernetes.io/control-plane
    effect: NoSchedule
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --values nginx-values.yaml
cert-manager + Let's Encrypt TLS
# Install cert-manager
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set crds.enabled=true
# ClusterIssuer: Let's Encrypt production
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
---
# Ingress with TLS: cert-manager provisions the cert automatically
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  namespace: production
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - app.example.com
    secretName: myapp-tls
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp
            port:
              number: 8080
Cilium BGP for LoadBalancer IPs
# Cilium can announce LoadBalancer IPs via BGP to your router.
# Result: no MetalLB needed. Services get real routable IPs.
# CiliumBGPPeeringPolicy: peer with your upstream router
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: rack-policy
spec:
  nodeSelector:
    matchLabels:
      rack: rack0
  virtualRouters:
  - localASN: 65001
    exportPodCIDR: true
    neighbors:
    - peerAddress: 192.168.1.1/32
      peerASN: 65000
    serviceSelector:
      matchExpressions:
      - key: somekey
        operator: NotIn
        values: ['never']  # select all services
---
# IP pool for LoadBalancer services
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: default-pool
spec:
  blocks:
  - cidr: 192.168.20.0/24  # your dedicated LB IP range
8. Deployments, StatefulSets, DaemonSets
Deployment
For stateless workloads. Manages a ReplicaSet which manages identical,
interchangeable pods. Rolling updates replace pods in batches sized by
maxSurge and maxUnavailable (both default to 25%). Roll back to any retained
revision with kubectl rollout undo. Use for web servers, API servers, workers.
StatefulSet
For stateful workloads where pod identity matters. Pods get stable network
names (pod-0, pod-1, pod-2) and stable PVC bindings that survive pod
deletion. Startup and shutdown are ordered. Use for databases, distributed
systems like Kafka or Zookeeper, anything with primary/replica topology.
DaemonSet
Ensures one pod runs on every node (or every node matching a selector).
As nodes are added, the DaemonSet automatically places a pod on them. As
nodes are removed, the pods are garbage-collected. Use for node-level agents:
node_exporter, Falco, log shippers, CNI plugins, device plugins.
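To make the Deployment's maxSurge and maxUnavailable settings concrete, the sketch below computes how many pods can exist during a rollout and how many must stay available. It assumes both settings are given as percentages and applies the documented rounding: maxSurge rounds up, maxUnavailable rounds down. The function name `rollout_bounds` is made up for this example.

```shell
#!/usr/bin/env bash
# Compute rolling-update bounds for a Deployment.
# Arguments: replicas, maxSurge percent, maxUnavailable percent.
rollout_bounds() {
  local replicas="$1" surge_pct="$2" unavail_pct="$3"
  local surge=$(( (replicas * surge_pct + 99) / 100 ))   # percentage rounds up
  local unavail=$(( replicas * unavail_pct / 100 ))      # percentage rounds down
  echo "max_total=$(( replicas + surge )) min_available=$(( replicas - unavail ))"
}

rollout_bounds 10 25 25   # prints: max_total=13 min_available=8
rollout_bounds 3 25 25    # prints: max_total=4 min_available=3
```

With the 25% defaults, a 10-replica Deployment may briefly run 13 pods and never drops below 8 available, so the rollout proceeds in batches rather than strictly one pod at a time.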
Rolling Update and Rollback
# Update the image; this triggers a rolling update
kubectl set image deployment/myapp app=myapp:v2 -n production
# Watch the rollout progress
kubectl rollout status deployment/myapp -n production
# View rollout history
kubectl rollout history deployment/myapp -n production
# Roll back to the previous version
kubectl rollout undo deployment/myapp -n production
# Roll back to a specific revision
kubectl rollout undo deployment/myapp --to-revision=3 -n production
StatefulSet with ZFS PVs
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: production
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:16
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
        resources:
          requests:
            cpu: "500m"
            memory: 1Gi
          limits:
            cpu: "2"
            memory: 4Gi
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: zfs-db  # 8K recordsize, lz4 compression
      resources:
        requests:
          storage: 100Gi
9. Operators & CRDs
An operator is a Kubernetes controller that manages a complex, stateful application. It extends the Kubernetes API with Custom Resource Definitions (CRDs) that represent the application in a Kubernetes-native way, then runs a controller loop that reconciles the actual state of the application to match the desired state expressed in the CRD.
CRD: Custom Resource Definition
Extends the Kubernetes API with new resource types. Once a CRD is installed,
you can kubectl apply objects of that type like any native Kubernetes
resource. The operator watches for these custom resources and reconciles the
actual application state to match.
Zalando Postgres Operator
Manages PostgreSQL clusters on Kubernetes. Declare a postgresql CRD with
the number of instances, storage size, and version. The operator creates
StatefulSets, Services, Secrets, and a Patroni-managed primary/replica
topology. Failover is automatic. Backups integrate with S3 or local storage.
Prometheus Operator
Manages Prometheus instances and scrape configuration via CRDs:
Prometheus, ServiceMonitor, PodMonitor, PrometheusRule. Add a
ServiceMonitor to tell Prometheus to scrape a new service without
restarting Prometheus or editing its config file manually.
Redis Operator (Redis Enterprise / Spotahome)
Manages Redis instances: standalone, Sentinel (HA), or cluster mode. The operator handles replica promotion, configuration tuning, and Kubernetes service management. Combined with a ZFS StorageClass, each Redis instance gets a ZFS dataset with appropriate recordsize.
Zalando PostgreSQL Operator Example
# Install the operator
helm repo add postgres-operator-charts https://opensource.zalando.com/postgres-operator/charts/postgres-operator
helm install postgres-operator postgres-operator-charts/postgres-operator \
--namespace postgres-operator \
--create-namespace
# Declare a PostgreSQL cluster
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: myapp-db
  namespace: production
spec:
  teamId: myapp
  volume:
    size: 100Gi
    storageClass: zfs-db  # ZFS dataset per instance
  numberOfInstances: 3  # 1 primary + 2 replicas
  postgresql:
    version: "16"
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 4Gi
10. Observability in Kubernetes
A Kubernetes cluster without observability is a black box. You need metrics for capacity planning and alerting, logs for debugging, traces for latency analysis, and network flow visibility for security and troubleshooting.
kube-prometheus-stack: Everything in One Helm Chart
# Installs: Prometheus, Alertmanager, Grafana, kube-state-metrics,
# node_exporter DaemonSet, Prometheus Operator, default dashboards
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# kube-prometheus-values.yaml
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: zfs-logs  # 1M recordsize for time-series data
          resources:
            requests:
              storage: 200Gi
    retention: 30d
grafana:
  persistence:
    enabled: true
    storageClassName: zfs-general
    size: 10Gi
  adminPassword: "changeme"
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: zfs-general
          resources:
            requests:
              storage: 5Gi
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values kube-prometheus-values.yaml
Loki + Promtail: Log Aggregation
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Loki (log storage + query)
helm install loki grafana/loki \
  --namespace monitoring \
  --set loki.storage.type=filesystem \
  --set loki.commonConfig.replication_factor=1
# Promtail (log shipper, runs as DaemonSet on every node)
helm install promtail grafana/promtail \
  --namespace monitoring \
  --set config.lokiAddress=http://loki:3100/loki/api/v1/push
Hubble: Cilium Network Observability
# Hubble is built into Cilium; enable it
cilium hubble enable --ui
# Real-time network flow viewer
hubble observe --namespace production --follow
# Filter to a specific pod
hubble observe --pod production/postgres-0 --follow
# Show dropped flows (policy denials)
hubble observe --verdict DROPPED --follow
# Hubble UI: browser-based network graph
cilium hubble ui  # opens port-forward to the UI automatically
11. Upgrade Strategies
Kubernetes upgrades are the operation most likely to go wrong and the operation most likely to be done under time pressure. On kldload, ZFS snapshots make upgrades reversible. A failed upgrade is not a disaster — it is a 2-second rollback.
kvm-snap rollback, not a 2-hour rebuild from scratch. This changes your risk calculus. On a cloud-managed cluster, a botched upgrade might mean calling support and waiting. On kldload, you snapshot, upgrade, test, and if anything is wrong, you roll back immediately and debug at leisure. The ZFS snapshot is your escape hatch. Always use it.
Standard kubeadm Upgrade (In-Place)
# Before anything: snapshot every VM
for node in k8s-control-{1,2,3} k8s-worker-{1,2,3,4}; do
kvm-snap $node pre-upgrade-1-31
done
# On control plane node 1:
dnf install -y kubeadm-1.31.0
kubeadm upgrade plan
kubeadm upgrade apply v1.31.0
kubectl drain k8s-control-1 --ignore-daemonsets
dnf install -y kubelet-1.31.0 kubectl-1.31.0
systemctl daemon-reload && systemctl restart kubelet
kubectl uncordon k8s-control-1
# Additional control plane nodes:
dnf install -y kubeadm-1.31.0
kubeadm upgrade node
kubectl drain k8s-control-2 --ignore-daemonsets
dnf install -y kubelet-1.31.0 kubectl-1.31.0
systemctl daemon-reload && systemctl restart kubelet
kubectl uncordon k8s-control-2
# Workers (repeat for each):
kubectl drain k8s-worker-1 --ignore-daemonsets --delete-emptydir-data
# (on the worker):
dnf install -y kubeadm-1.31.0 && kubeadm upgrade node
dnf install -y kubelet-1.31.0 kubectl-1.31.0
systemctl daemon-reload && systemctl restart kubelet
# (on control plane):
kubectl uncordon k8s-worker-1
# Emergency rollback (if anything goes wrong):
kvm-snap rollback k8s-control-1 pre-upgrade-1-31
Blue/Green Cluster Upgrade
For zero-downtime major upgrades or CNI migrations, build a second cluster alongside the first, migrate workloads to it, then decommission the old cluster. On kldload, the new cluster is cloned from the same golden image in under a minute.
# 1. Clone new nodes from golden image
kvm-clone k8s-golden k8s-blue-control-1
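# One clone per invocation: brace expansion would pass all three names to a single command
for i in 1 2 3; do
  kvm-clone k8s-golden "k8s-blue-worker-$i"
done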
# 2. Bootstrap blue cluster with new K8s version
kubeadm init --config kubeadm-config-v1-31.yaml
# 3. Deploy applications to blue cluster, point staging DNS to blue
# 4. Validate all workloads on blue
# 5. Migrate persistent data: ZFS send/receive or Velero restore
# 6. Swap production DNS from green to blue
# 7. Decommission green cluster after validation period
Canary Deployments Within a Cluster
# Run v1 (90% traffic) and v2 (10% traffic) simultaneously
# Cilium's traffic splitting or nginx weighted routing
# With Cilium traffic management:
apiVersion: cilium.io/v2
kind: CiliumEnvoyConfig
metadata:
  name: myapp-canary
  namespace: production
# ... traffic weight configuration
# Simpler: two Deployments, one Service, weighted by replica count
# v1: 9 replicas, v2: 1 replica = 10% to v2
kubectl scale deployment myapp-v2 --replicas=1
kubectl scale deployment myapp-v1 --replicas=9
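The replica-count approach only approximates a traffic split: it assumes one Service selects the pods of both Deployments and spreads connections evenly across them. Under that assumption the expected canary share is simple arithmetic. The helper name `canary_percent` is hypothetical.

```shell
#!/usr/bin/env bash
# Expected share of traffic (integer percent) reaching the canary when a single
# Service balances evenly across stable and canary pods.
# Arguments: stable replica count, canary replica count.
canary_percent() {
  local stable="$1" canary="$2"
  echo $(( canary * 100 / (stable + canary) ))
}

echo "9 stable + 1 canary -> $(canary_percent 9 1)% to canary"   # 10%
echo "3 stable + 1 canary -> $(canary_percent 3 1)% to canary"   # 25%
```

Because the granularity is tied to pod count, a 1% canary would need 99 stable replicas, which is where weighted routing in the ingress or mesh layer becomes the better tool.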
12. Multi-Cluster
Multi-cluster Kubernetes lets pods in one cluster communicate directly with pods in another, share services across cluster boundaries, and distribute workloads across multiple availability zones or sites. On kldload, the underlying transport is WireGuard — the same mesh you already have connecting your kldload nodes.
Cilium Cluster Mesh
Cilium Cluster Mesh extends Cilium's identity-based networking across cluster boundaries. Pods in cluster A can reach services in cluster B using the service's DNS name. Network policy applies across clusters using the same identity model.
# Enable cluster mesh on both clusters
# Cluster A (cluster-id=1)
cilium clustermesh enable --service-type LoadBalancer
cilium clustermesh status
# Cluster B (cluster-id=2)
cilium clustermesh enable --service-type LoadBalancer
# Connect cluster A to cluster B
cilium clustermesh connect --destination-context k8s-cluster-b
# Verify connectivity
cilium clustermesh status
cilium connectivity test
# Make a service available across clusters
# Add annotation to the service:
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: production
  annotations:
    service.cilium.io/global: "true"  # visible in all clusters
spec:
  ...
ZFS Replication of PVs Between Clusters
# Find the ZFS dataset backing a PV
kubectl get pv $(kubectl get pvc postgres-data -n production -o jsonpath='{.spec.volumeName}') \
-o jsonpath='{.spec.csi.volumeAttributes.poolname}'
# Replicate the dataset to DR site via WireGuard mesh
# (assumes WireGuard connectivity between sites)
# Compute the snapshot name once so both commands use the same name
SNAP=repl-$(date +%Y%m%d)
zfs snapshot rpool/k8s/pvc-abc123@$SNAP
# @previous stands for the last snapshot that already exists on dr-node
zfs send -i rpool/k8s/pvc-abc123@previous rpool/k8s/pvc-abc123@$SNAP \
  | ssh dr-node zfs receive rpool/k8s/pvc-abc123
# Sanoid + syncoid automate this with configurable schedules:
# /etc/sanoid/sanoid.conf — snapshot policy
# syncoid rpool/k8s/pvc-abc123 dr-node:rpool/k8s/pvc-abc123
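The first replication of a dataset has no common snapshot on the DR side, so it needs a full send; later runs can be incremental. A rough sketch of that decision follows, using the dataset and host names from the example above. The function names are hypothetical, and the script only prints the commands rather than running zfs or ssh.

```shell
#!/bin/sh
# zfs_send_cmd DATASET NEW_SNAP [PREV_SNAP]
# Prints an incremental send when PREV_SNAP is given, a full send otherwise.
zfs_send_cmd() {
    dataset=$1
    new=$2
    prev=${3:-}
    if [ -n "$prev" ]; then
        echo "zfs send -i ${dataset}@${prev} ${dataset}@${new}"
    else
        echo "zfs send ${dataset}@${new}"
    fi
}

# replicate_cmd DATASET HOST NEW_SNAP [PREV_SNAP]
# Prints the full pipeline for review before it is run by hand.
replicate_cmd() {
    echo "$(zfs_send_cmd "$1" "$3" "${4:-}") | ssh $2 zfs receive $1"
}

# First run (full), then a later run (incremental from the earlier snapshot)
replicate_cmd rpool/k8s/pvc-abc123 dr-node repl-20240101
replicate_cmd rpool/k8s/pvc-abc123 dr-node repl-20240102 repl-20240101
```

Printing the pipeline first is a deliberate choice: a wrong base snapshot makes zfs receive fail on the DR node, and it is easier to spot in a printed command than in a failed transfer.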
13. Security Hardening
Default Kubernetes is not secure. Pods can run as root. There are no network policies. Secrets are base64-encoded in etcd (not encrypted at rest by default). Service account tokens are auto-mounted. Harden deliberately.
Pod Security Standards
# Pod Security Standards enforce security profiles at the namespace level.
# Three profiles: privileged, baseline, restricted.
# Label namespace to enforce restricted mode:
kubectl label namespace production \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/audit=restricted \
pod-security.kubernetes.io/warn=restricted
# "restricted" requires:
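# - non-root user (runAsNonRoot: true)
# - no privilege escalation
# - drop ALL capabilities
# - seccomp profile RuntimeDefault or Localhost
# - no hostNetwork, hostPID, hostIPC (inherited from baseline)
# Recommended but not required by the profile: read-only root filesystem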
# Example compliant pod:
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
  namespace: production
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp:v1
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: [ALL]
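Before enforcing a profile on a namespace that already runs workloads, it helps to preview what would be rejected. The sketch below builds the three Pod Security labels for a given level; psa_labels is a hypothetical helper name. The preview relies on kubectl's server-side dry run, which reports warnings for existing pods that violate the level without changing the namespace.

```shell
#!/bin/sh
# psa_labels LEVEL
# Prints the enforce/audit/warn labels for LEVEL, one per line.
# Rejects anything other than the three defined profiles.
psa_labels() {
    case $1 in
        privileged|baseline|restricted) ;;
        *) echo "unknown Pod Security level: $1" >&2; return 1 ;;
    esac
    for mode in enforce audit warn; do
        printf 'pod-security.kubernetes.io/%s=%s\n' "$mode" "$1"
    done
}

# Preview command (printed, not executed, so this runs without a cluster)
echo "kubectl label --dry-run=server --overwrite namespace production" \
    $(psa_labels restricted)
```

Running the printed command against the cluster lists the pods that would violate the restricted profile, which makes it possible to fix workloads before the label is applied for real.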
Cilium Network Policy (L3/L4/L7)
# Default-deny ingress for a namespace — then allow explicitly
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes: [Ingress]
---
# Allow frontend → backend on port 8080 (L4), restricted to GET /api/v1/* (L7)
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: frontend-to-backend
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http: # L7 HTTP rule
        - method: GET
          path: /api/v1/.* # allow GET /api/v1/*
---
# Requests that match no L7 rule are denied even though L4 allows the port.
# Example: DELETE /api/v1/... is rejected because only GET is in the allow list.
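To reason about which requests the L7 rule above lets through, it can help to model the allow list locally. The sketch below mirrors the single rule (method GET, path /api/v1/.*) and assumes the regex must match the whole path. The function l7_allowed is a hypothetical illustration, not how Cilium evaluates policy internally.

```shell
#!/bin/sh
# l7_allowed METHOD PATH
# Succeeds only when the request matches the one rule in the policy:
# method GET and a path matching /api/v1/.* (anchored at both ends).
l7_allowed() {
    [ "$1" = "GET" ] || return 1
    printf '%s\n' "$2" | grep -Eq '^/api/v1/.*$'
}

# Show the verdict for a few sample requests
for req in "GET /api/v1/users" "DELETE /api/v1/users" "GET /admin"; do
    set -f                      # no globbing while splitting the request
    if l7_allowed $req; then
        echo "allow: $req"
    else
        echo "deny:  $req"
    fi
    set +f
done
```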
Secrets Management
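# Option 1: Sealed Secrets (encrypt secrets so they are safe to store in git)
# Install bitnami sealed-secrets controller
helm repo add sealed-secrets https://bitnami-labs.github.io/sealed-secrets
helm install sealed-secrets sealed-secrets/sealed-secrets \
  --namespace kube-system
# Encrypt a secret for git storage (output file name is an example).
# --controller-name matches the Helm release name used above.
kubectl create secret generic db-password \
  --from-literal=password=supersecret \
  --dry-run=client -o yaml \
  | kubeseal --controller-name sealed-secrets --controller-namespace kube-system \
      --format yaml > db-password-sealed.yaml
# Commit the sealed file, then apply it; the controller creates the real Secret
kubectl apply -f db-password-sealed.yaml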
# Option 2: External Secrets Operator (pull from Vault, AWS Secrets Manager, etc.)
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets \
  --namespace external-secrets \
  --create-namespace
# Option 3: etcd encryption at rest (built-in; point kube-apiserver at the
# file below with --encryption-provider-config)
# Secrets written before this is enabled stay unencrypted until rewritten.
# /etc/kubernetes/encryption-config.yaml:
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: [secrets]
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded-32-byte-key>
  - identity: {} # fallback for existing unencrypted secrets
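The aescbc provider needs a random 32-byte key, base64-encoded, in place of the placeholder above. One way to produce it is sketched here; the function name gen_encryption_key is a hypothetical wrapper around standard tools.

```shell
#!/bin/sh
# gen_encryption_key
# Reads 32 random bytes from the kernel and base64-encodes them.
# 32 bytes encode to 44 base64 characters.
gen_encryption_key() {
    head -c 32 /dev/urandom | base64
}

# Print a fresh key; paste it into the "secret:" field of the config
gen_encryption_key
```

Treat the output like any other credential: anyone holding this key and a copy of the etcd data can decrypt the stored secrets, so the config file should be readable by root only.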
Disable Auto-Mounted Service Account Tokens
# Most pods do not need to call the Kubernetes API.
# Auto-mounted tokens give every pod API server access by default.
# Disable at the namespace level (ServiceAccount default)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: production
automountServiceAccountToken: false
# Or per-pod:
spec:
  automountServiceAccountToken: false
14. Troubleshooting
Kubernetes troubleshooting is systematic. Work from the outside in: is the node healthy? Is the pod scheduled? Is the container starting? Is the application inside the container working? Each layer has its own diagnostic commands.
The Diagnostic Toolkit
# Node health: first thing to check
kubectl get nodes -o wide
kubectl describe node k8s-worker-1 # events, conditions, resource usage
# Pod status
kubectl get pods -n production -o wide
kubectl describe pod myapp-abc123 -n production # events are key
kubectl logs myapp-abc123 -n production
kubectl logs myapp-abc123 -n production --previous # logs from crashed container
# All events in a namespace, sorted by time
kubectl get events -n production --sort-by='.lastTimestamp'
# Exec into a running pod for debugging
kubectl exec -it myapp-abc123 -n production -- /bin/sh
# Run a debug container alongside a problem pod (K8s 1.25+)
kubectl debug -it myapp-abc123 -n production --image=nicolaka/netshoot --target=app
Common Failure Modes
CrashLoopBackOff
The container starts and immediately exits. Kubernetes backs off the restart
with exponential delay. Check logs with kubectl logs --previous to see the
exit output. Common causes: wrong entrypoint, missing environment variable,
bad config file mounted, OOM kill (check describe pod for OOMKilled).
ImagePullBackOff
The kubelet cannot pull the container image. Check kubectl describe pod events for the
specific error. Common causes: wrong image name or tag, private registry
without imagePullSecrets, registry unreachable from the node, rate limiting
(Docker Hub).
Pending Pods
The scheduler cannot place the pod. Check kubectl describe pod for the reason.
Common causes: insufficient CPU/memory on all nodes, no node matches
affinity/selector, PVC cannot be bound (wrong StorageClass, no capacity),
taint on all nodes with no matching toleration.
Node NotReady
Kubelet on the node stopped heartbeating. SSH to the node and check:
systemctl status kubelet, journalctl -u kubelet -n 50. Common causes:
kubelet crash (certificate expired, disk full, cgroup driver mismatch),
containerd hung, kernel panic (check dmesg), node out of disk or memory.
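The four failure modes above each point to a different first command. A small sketch of that mapping follows, assuming a hypothetical helper named triage that only prints the suggested command and never touches the cluster.

```shell
#!/bin/sh
# triage STATUS NAME [NAMESPACE]
# Prints the first diagnostic command for a pod status or node condition.
# NAME is the pod name, or the node name when STATUS is NotReady.
triage() {
    status=$1
    name=$2
    ns=${3:-default}
    case $status in
        CrashLoopBackOff)
            # Output of the crashed container is the fastest clue
            echo "kubectl logs $name -n $ns --previous" ;;
        ImagePullBackOff|ErrImagePull|Pending)
            # The Events section names the pull or scheduling error
            echo "kubectl describe pod $name -n $ns" ;;
        NotReady)
            echo "kubectl describe node $name" ;;
        *)
            echo "kubectl get events -n $ns --sort-by=.lastTimestamp" ;;
    esac
}

triage CrashLoopBackOff myapp-abc123 production
triage NotReady k8s-worker-1
```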
etcd Health
# Check etcd member health
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
# Check etcd cluster members
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list
# Backup etcd (do this before every upgrade)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# On kldload: also snapshot the ZFS dataset
zfs snapshot rpool/k8s-control/var-lib-etcd@pre-upgrade
Certificate Expiry
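# Check all certificate expiry dates
kubeadm certs check-expiration
# If certificates have expired, the API server will refuse connections.
# The symptom: kubectl commands fail with "certificate has expired" or
# "unable to connect to the server: x509"
# Renew all certificates
kubeadm certs renew all
# The control plane static pods (kube-apiserver, kube-controller-manager,
# kube-scheduler, etcd) must restart to load the renewed certificates,
# e.g. move their manifests out of /etc/kubernetes/manifests and back again
systemctl restart kubelet
# Rotate the admin kubeconfig after renewal
cp /etc/kubernetes/admin.conf ~/.kube/config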
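Expired certificates are easier to prevent than to repair, so a periodic check is worth scripting. The sketch below covers only the date arithmetic, working on Unix epoch seconds. The function names and the 30-day threshold are illustrative assumptions, not kubeadm behavior.

```shell
#!/bin/sh
# days_left NOW_EPOCH EXPIRY_EPOCH
# Whole days between two Unix timestamps (truncated toward zero).
days_left() {
    echo $(( ($2 - $1) / 86400 ))
}

# needs_renewal NOW_EPOCH EXPIRY_EPOCH THRESHOLD_DAYS
# Succeeds when fewer than THRESHOLD_DAYS remain before expiry.
needs_renewal() {
    [ "$(days_left "$1" "$2")" -lt "$3" ]
}

# Example with fixed timestamps: 10 days remain, threshold is 30 days
now=1700000000
expiry=$(( now + 10 * 86400 ))
if needs_renewal "$now" "$expiry" 30; then
    echo "renew soon: $(days_left "$now" "$expiry") days left"
fi
```

On a control plane node, the expiry timestamp could come from openssl x509 -enddate -noout -in /etc/kubernetes/pki/apiserver.crt, converted to epoch seconds with a date tool such as GNU date. The kubeadm certs check-expiration command remains the simplest way to see every certificate at once.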
The complete picture: a kldload Kubernetes cluster is a system where every component is understood and owned. The nodes are ZFS zvols — snapshot them before any operation, roll back in two seconds if anything goes wrong. The networking is Cilium eBPF — O(1) packet decisions, L7 policy without sidecars, network flow observability with Hubble. RBAC controls who can do what. PVs are ZFS datasets with per-volume tuning. Helm manages application lifecycle. Operators encode operational knowledge. The upgrade path is a ZFS snapshot away from being reversed.
This is infrastructure you understand end to end — from the kernel's eBPF hooks to the etcd Raft log to the ZFS dataset backing your database. No black boxes. No cloud console. No vendor lock-in. Just Linux, ZFS, and Kubernetes, on hardware you own.
Related pages
- Kubernetes on KVM — build the cluster this masterclass operates
- Cilium Masterclass — deep dive on eBPF networking, kube-proxy replacement, L7 policy
- ZFS Masterclass — ZFS internals powering your node disks and PVs
- WireGuard Masterclass — the transport layer for multi-cluster networking
- Observability Advanced — Prometheus, Loki, Grafana in depth
- Cluster & Blue/Green — cluster cloning and upgrade workflow on kldload
- AI for Kubernetes — AI-assisted cluster operations