
Kubernetes Masterclass

This guide picks up where the Kubernetes on KVM tutorial leaves off. You have a cluster: golden images built with the kvm profile, nodes cloned with kvm-clone, kubeadm init done, Cilium installed. Now learn to operate it — RBAC, Helm, storage classes backed by ZFS datasets, operators, upgrade strategies that use ZFS snapshots as your safety net, and multi-cluster networking over WireGuard. Zero to hero, on hardware you own.

What Kubernetes actually is: a container orchestrator. It schedules workloads across nodes, manages pod networking, handles persistent storage lifecycle, automates rolling deployments, and self-heals failed containers. On kldload, the nodes are VMs whose disks are ZFS zvols: snapshottable, cloneable, and replicable in seconds. The network is Cilium running eBPF programs in the kernel. This combination changes what is operationally possible.

What this masterclass covers: cluster architecture, kubeadm deep dive, RBAC, namespaces, ZFS-backed persistent volumes, Helm, ingress, workload types, operators, observability, upgrade strategies, multi-cluster, security hardening, and troubleshooting — all grounded in the kldload stack.

Most Kubernetes tutorials teach you to click buttons in a cloud console. You provision a managed cluster with a credit card, deploy a sample app, and call it done. This masterclass teaches you to build and operate a cluster on hardware you own, with ZFS underneath every node, WireGuard between sites, and Cilium handling all networking at eBPF speed. No cloud provider lock-in. No hidden control plane. No mystery. When something breaks, you have the kernel tools to understand exactly why.

1. Kubernetes Is Infrastructure, Not Magic

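Before going deep, fix the mental model. Kubernetes is a distributed system built from components that talk over HTTPS, and they coordinate only through the API server. When you run kubectl apply on a Deployment, the API server validates the manifest and stores it in etcd. The Deployment controller sees the new object and creates a ReplicaSet, whose controller creates Pod objects. The scheduler sees pods with no node assigned and writes a node name into each one. The kubelet on that node sees the pod bound to it, pulls the image, and starts the container. No component calls another directly; each watches the API server and acts on what it sees. Every step is auditable. Nothing is magic.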

API Server

The front door. Every kubectl command, every controller action, every webhook call goes through the API server. It validates manifests, enforces admission policies, persists objects to etcd, and notifies watchers of changes via watch streams. It is stateless — all state lives in etcd.

// kube-apiserver: the single source of truth interface
// Everything reads/writes through here. Everything.
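
Because everything goes through the API server, every object has a REST path. The sketch below shows how those paths are put together; api_path is a hypothetical helper written for this guide, not a kubectl feature, and the object names are only examples.

```shell
#!/bin/sh
# Build the REST path the API server exposes for a namespaced object.
# Core-group resources live under /api/<version>; every other group
# lives under /apis/<group>/<version>.
# api_path is a hypothetical helper for illustration.
api_path() {
  group=$1; version=$2; ns=$3; resource=$4; name=$5
  if [ -z "$group" ]; then
    prefix="/api/$version"
  else
    prefix="/apis/$group/$version"
  fi
  echo "$prefix/namespaces/$ns/$resource/$name"
}

api_path ""   v1 production pods        postgres-0
api_path apps v1 production deployments myapp
```

On a live cluster the result can be fetched directly with kubectl get --raw "$(api_path apps v1 production deployments myapp)".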

etcd

The distributed key-value store that holds all cluster state: every object, every secret, every config. etcd uses the Raft consensus algorithm — a write is not committed until a majority of etcd members agree. Lose etcd, lose the cluster. Back it up. On kldload, etcd data lives on a ZFS dataset; snapshot before every upgrade.

// etcd is the database. The API server is the ORM.
// Think: lose etcd = lose everything not in your git repo

Scheduler

Watches for pods with no assigned node and picks one. Decisions are based on resource requests, node affinity/anti-affinity rules, taints and tolerations, topology spread constraints, and custom scoring plugins. It does not start containers — it just writes a node name into the pod spec.

// Scheduler: "pod X belongs on node Y"
// Kubelet on Y: "ok, I will actually start it"

Controller Manager

A collection of control loops, each watching the cluster state and reconciling it toward the desired state. The Deployment controller ensures replica counts match. The Node controller marks nodes NotReady when they stop heartbeating. The Endpoints controller keeps service endpoints in sync with pods.

// Desired state: 3 replicas
// Actual state: 2 running
// Controller: starts 1 more. Always reconciling.
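
The reconcile pattern fits in a few lines of shell. This is a toy model under stated assumptions, not controller-manager code: reconcile is a hypothetical function that compares a desired and an actual replica count and prints the action a controller would take.

```shell
#!/bin/sh
# Toy reconcile step: compare desired vs. actual replicas and print the action.
# Hypothetical illustration only; real controllers watch the API server
# and run this comparison every time an object changes.
reconcile() {
  desired=$1
  actual=$2
  if [ "$actual" -lt "$desired" ]; then
    echo "create $((desired - actual))"
  elif [ "$actual" -gt "$desired" ]; then
    echo "delete $((actual - desired))"
  else
    echo "noop"
  fi
}

reconcile 3 2   # desired 3, actual 2
reconcile 3 3   # already converged
```

The function is stateless on purpose: run it again after a failure and it reaches the same answer, which is what makes control loops self-healing.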

Kubelet

The node agent. Runs on every node, control plane nodes included (kubeadm runs the control plane components there as static pods). It watches the API server for pods whose spec.nodeName matches its node, calls the container runtime (containerd) to pull images and start containers, reports pod status back, manages volume mounts, and enforces resource limits via cgroups.

// Kubelet is the only K8s component that actually runs containers
// Everything else is orchestration logic

kube-proxy / Cilium

Handles service networking. In a kldload cluster with Cilium, kube-proxy is replaced entirely. Cilium's eBPF programs handle service IP translation in the kernel with O(1) hash map lookups. No iptables chains, no netfilter conntrack for east-west traffic (Cilium keeps its own connection table in eBPF maps), no performance cliff at 1000+ services.

// kube-proxy (iptables mode): rule chains, O(n) walk for each new connection
// Cilium eBPF: hash map, O(1) lookup, every time

kldload Makes Node Provisioning Instant

Traditional node provisioning: boot an OS, install packages, configure networking, join the cluster — 15 to 30 minutes. On kldload, you build one golden VM with the kvm profile (containerd, kubeadm, kubectl, kubelet pre-installed), take a ZFS snapshot, and clone it with kvm-clone. Clone time is under 10 seconds regardless of disk size because ZFS clones are copy-on-write — no data is copied at clone time.

# Snapshot the golden node
kvm-snap k8s-golden before-clone

# Clone 3 workers in parallel — each clone takes seconds
kvm-clone k8s-golden k8s-worker-1 &
kvm-clone k8s-golden k8s-worker-2 &
kvm-clone k8s-golden k8s-worker-3 &
wait

# Start the clones. Give each a unique hostname and IP, then run kubeadm join on it.
for node in k8s-worker-{1,2,3}; do
  virsh start "$node"
done

Before any significant cluster operation — upgrade, CNI migration, etcd maintenance — snapshot every node with kvm-snap. A failed operation is a 2-second rollback, not a 2-hour rebuild.
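
Snapshotting every node is easy to forget under pressure, so it helps to wrap it in one function. A minimal sketch, assuming kvm-snap takes "<vm> <label>" as in the examples above; snapshot_all and the DRY_RUN switch are hypothetical names introduced here, and the node list is illustrative.

```shell
#!/bin/sh
# Snapshot every node before a risky operation.
# Assumption: kvm-snap takes "<vm> <label>", as in the examples above.
# DRY_RUN=1 (the default here) only prints the commands it would run.
DRY_RUN=${DRY_RUN:-1}

snapshot_all() {
  label=$1
  shift
  for node in "$@"; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "kvm-snap $node $label"
    else
      # Stop at the first failure: a partial snapshot set is not a safety net.
      kvm-snap "$node" "$label" || return 1
    fi
  done
}

snapshot_all pre-upgrade k8s-control-1 k8s-worker-1 k8s-worker-2
```

Run it with DRY_RUN=0 once the printed commands look right.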


2. kubeadm Deep Dive

kubeadm is the official tool for bootstrapping Kubernetes clusters. It handles certificate generation, etcd bootstrap, static pod manifests for the control plane components, and the join token workflow for adding nodes.

kubeadm is the "build it yourself" tool. Managed Kubernetes (EKS, GKE, AKS) hides everything kubeadm does behind an API call. Understanding kubeadm means understanding what those managed services abstract away — and what they take away from you. When you run your own control plane, you can audit every certificate, tune every flag, and access etcd directly. You also own the upgrades and the backups.

Init: First Control Plane Node

# kubeadm-config.yaml — explicit is better than implicit
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.31.0
controlPlaneEndpoint: "k8s-api.internal:6443"   # VIP or load balancer for HA
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"
etcd:
  local:
    dataDir: /var/lib/etcd                       # put this on a ZFS dataset
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: "192.168.10.10"
  bindPort: 6443

# Snapshot before init
kvm-snap k8s-control-1 pre-init

kubeadm init --config kubeadm-config.yaml --upload-certs

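# Copy kubeconfig (prefix cp and chown with sudo if you are not root)
mkdir -p "$HOME/.kube"
cp -i /etc/kubernetes/admin.conf "$HOME/.kube/config"
chown "$(id -u):$(id -g)" "$HOME/.kube/config"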

Join: Workers and Additional Control Plane Nodes

# Print join command for workers
kubeadm token create --print-join-command

# Join command for additional control plane nodes (HA)
# --control-plane --certificate-key is included in the init output
kubeadm join k8s-api.internal:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <cert-key>

HA Control Plane: 3 Nodes, Stacked etcd

A highly available control plane runs three control plane nodes, each running the API server, controller manager, scheduler, and etcd. A load balancer (or keepalived VIP) fronts the three API servers at the controlPlaneEndpoint. Raft requires a majority — three nodes tolerate one failure. Five nodes tolerate two.
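
The failure-tolerance numbers follow from majority arithmetic: a cluster of n members needs floor(n/2) + 1 votes to commit a write. A short sketch of that calculation; quorum and tolerance are helper names chosen for this example.

```shell
#!/bin/sh
# Raft quorum arithmetic for an etcd cluster of n members.
quorum()    { echo $(( $1 / 2 + 1 )); }           # votes needed to commit a write
tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }    # members that can fail safely

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(tolerance "$n")"
done
```

Even member counts buy nothing: four members tolerate one failure, the same as three, which is why etcd clusters are sized at three or five.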

Stacked etcd

etcd runs on the same nodes as the API server. Simpler to operate. Three control plane nodes means three etcd members. This is the default and is appropriate for most deployments. The control plane nodes need enough resources to run etcd reliably — SSDs for etcd data, avoid noisy neighbours.

// 3 control nodes = 3 etcd members = 1 failure tolerance
// Simple, fewer moving parts, works for <1000 nodes

External etcd

etcd runs on its own dedicated cluster, separate from the API servers. Allows independent scaling and isolation of etcd failures from API server failures. More complex to operate: separate certificate management, separate backup targets. Appropriate for very large clusters or regulated environments requiring strict component isolation.

// External etcd: 3 etcd VMs + 3 control plane VMs
// More nodes, more complexity, more failure surface to isolate

Certificate Management

# Check certificate expiry — do this regularly
kubeadm certs check-expiration

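# Renew all certificates. kubeadm issues them for one year and renews them
# automatically during kubeadm upgrade, so renew by hand only if a node goes
# a year without an upgrade.
# Snapshot every node first
kubeadm certs renew all

# Restart the control plane static pods so they load the new certs.
# Renewal does not touch the manifests, so the pods do not restart on their own,
# and restarting the kubelet alone does not recreate them.
# Move the manifests out, wait for the kubelet to stop the pods, move them back.
mkdir -p /etc/kubernetes/manifests.off
mv /etc/kubernetes/manifests/*.yaml /etc/kubernetes/manifests.off/
sleep 20
mv /etc/kubernetes/manifests.off/*.yaml /etc/kubernetes/manifests/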

kubeadm Upgrade Workflow

# Always upgrade one minor version at a time. Never skip.
# 1.29 → 1.30 → 1.31. Not 1.29 → 1.31.

# On the first control plane node:
# 1. Snapshot before anything
kvm-snap k8s-control-1 pre-upgrade-1-31

# 2. Upgrade kubeadm package
dnf install -y kubeadm-1.31.0          # or apt on Debian/Ubuntu nodes

# 3. Plan the upgrade (shows what will change)
kubeadm upgrade plan

# 4. Apply the upgrade
kubeadm upgrade apply v1.31.0

# 5. Drain node, upgrade kubelet and kubectl, uncordon
kubectl drain k8s-control-1 --ignore-daemonsets
dnf install -y kubelet-1.31.0 kubectl-1.31.0
systemctl daemon-reload && systemctl restart kubelet
kubectl uncordon k8s-control-1

# Repeat for each worker:
kubectl drain k8s-worker-1 --ignore-daemonsets
# (on the worker) kubeadm upgrade node && dnf install ...
kubectl uncordon k8s-worker-1
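
The one-minor-version rule can be checked mechanically before any package is installed. A minimal sketch; upgrade_ok, major_of, and minor_of are hypothetical helper names, not part of kubeadm, and kubeadm upgrade plan remains the authoritative check.

```shell
#!/bin/sh
# Check that an upgrade moves at most one minor version, e.g. 1.30 -> 1.31.
# Versions are MAJOR.MINOR or MAJOR.MINOR.PATCH, optionally prefixed with "v".
major_of() { echo "${1#v}" | cut -d. -f1; }
minor_of() { echo "${1#v}" | cut -d. -f2; }

upgrade_ok() {
  from=$1
  to=$2
  [ "$(major_of "$from")" = "$(major_of "$to")" ] || return 1
  step=$(( $(minor_of "$to") - $(minor_of "$from") ))
  # step 0 is a patch upgrade, step 1 is the next minor; anything else is refused
  [ "$step" -ge 0 ] && [ "$step" -le 1 ]
}

upgrade_ok 1.30.4 v1.31.0 && echo "1.30.4 -> v1.31.0: ok"
upgrade_ok 1.29.0 v1.31.0 || echo "1.29.0 -> v1.31.0: skips a minor version"
```

The current version can come from kubeadm version -o short on the node being upgraded.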

3. RBAC — Who Can Do What

Role-Based Access Control is Kubernetes's authorization system. Every request to the API server carries an identity — a user, a group, or a service account. RBAC rules say what that identity can do: which verbs (get, list, create, delete, patch) on which resources (pods, secrets, deployments) in which namespaces.

RBAC is the most skipped and most important Kubernetes security feature. Default cluster configurations often have overly permissive roles, service accounts with cluster-admin, or no RBAC policies at all. The blast radius of a compromised pod with a mounted service account token and cluster-admin binding is: the entire cluster. Every namespace. Every secret. Get RBAC right before you put anything sensitive in the cluster.

The Four Objects

Role

Defines a set of permissions within a single namespace. A Role named pod-reader in namespace dev grants get, list, watch on pods in the dev namespace only. It cannot span namespaces.

// Role: "here are the things you can do, in THIS namespace"

ClusterRole

Like a Role but cluster-scoped. Can grant permissions on namespaced resources across all namespaces, or on non-namespaced resources like nodes, PersistentVolumes, and StorageClasses. Built-in ClusterRoles: cluster-admin, admin, edit, view.

// ClusterRole: "here are the things you can do, everywhere"

RoleBinding

Binds a Role or ClusterRole to a subject (user, group, or service account) within a namespace. A RoleBinding can reference a ClusterRole — this grants the ClusterRole's permissions only within the binding's namespace. Useful for reusable role definitions with namespace-scoped grants.

// RoleBinding: "give subject X the permissions in role Y, in namespace Z"

ClusterRoleBinding

Binds a ClusterRole to a subject cluster-wide. A ClusterRoleBinding to cluster-admin gives the subject full control over everything. Use sparingly. Most humans should be bound to a ClusterRole via a RoleBinding in their specific namespace, not a ClusterRoleBinding.

// ClusterRoleBinding: "give subject X god mode, everywhere"
// Use only for cluster operators and automation that needs it

Concrete Examples

# Read-only viewer: can read pods, logs, events, services, endpoints, and workload objects; no secrets, no writes
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: viewer
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log", "events", "services", "endpoints"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
  verbs: ["get", "list", "watch"]
---
# Bind to a specific user in the monitoring namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alice-viewer
  namespace: monitoring
subjects:
- kind: User
  name: alice
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: viewer
  apiGroup: rbac.authorization.k8s.io

# CI/CD deployer: full control of deployments and replicasets; can create and update
# (but not delete) services and configmaps. No access to secrets or RBAC objects.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployer
  namespace: production
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["services", "configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
---
# Service account for the CI runner pod
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-deployer
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deployer-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: ci-deployer
  namespace: production
roleRef:
  kind: Role
  name: deployer
  apiGroup: rbac.authorization.k8s.io

# Audit: who has cluster-admin?
kubectl get clusterrolebindings \
  -o jsonpath='{range .items[?(@.roleRef.name=="cluster-admin")]}{.metadata.name}: {range .subjects[*]}{.kind}/{.name} {end}{"\n"}{end}'

# Check what a specific user can do
kubectl auth can-i --list --as=alice --namespace=production

# Check a specific permission
kubectl auth can-i delete pods --as=system:serviceaccount:production:ci-deployer -n production
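
Binding one ClusterRole through several RoleBindings is the reusable pattern described under RoleBinding. The sketch below generates those bindings; rolebinding is a hypothetical helper, and the user, namespace, and role names are illustrative.

```shell
#!/bin/sh
# Print a RoleBinding that grants a ClusterRole inside one namespace.
# rolebinding is a hypothetical helper; names passed to it are examples.
rolebinding() {
  user=$1; ns=$2; role=$3
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ${user}-${role}
  namespace: ${ns}
subjects:
- kind: User
  name: ${user}
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: ${role}
  apiGroup: rbac.authorization.k8s.io
---
EOF
}

# Same ClusterRole, three namespace-scoped grants.
for ns in dev staging monitoring; do
  rolebinding alice "$ns" viewer
done
```

Piping the output to kubectl apply -f - creates the bindings; alice can then read objects in those three namespaces and nowhere else.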

4. Namespaces — Multi-Tenancy

Namespaces are Kubernetes's way of partitioning a cluster into virtual segments. Object names are scoped to a namespace, so two teams can each run a Deployment called api without colliding. A namespace by itself isolates names only, not traffic or resources; isolation comes from what you attach to it. RBAC policies can be scoped to a namespace. ResourceQuotas limit total resource consumption per namespace. NetworkPolicies (enforced by Cilium) control which pods can talk to which.

Resource Quotas and Limit Ranges

# ResourceQuota: hard limits on total resource consumption in a namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
    persistentvolumeclaims: "10"
    services.loadbalancers: "0"      # no LoadBalancer services in dev
---
# LimitRange: default requests/limits for pods that don't specify
apiVersion: v1
kind: LimitRange
metadata:
  name: dev-limits
  namespace: dev
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: 256Mi
    defaultRequest:
      cpu: "100m"
      memory: 64Mi
    max:
      cpu: "2"
      memory: 4Gi
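
A quota and a LimitRange interact: every pod that omits requests and limits is charged the defaults, so the defaults decide how many pods fit. A worked sketch using the numbers from the two manifests above, assuming single-container pods that set nothing themselves; fit_count is a name chosen for this example.

```shell
#!/bin/sh
# How many single-container pods that rely on the dev-limits defaults fit in dev-quota?
# Values are copied from the two manifests above, in millicores and MiB.
fit_count() {
  by_req_cpu=$(( 4000 / 100 ))     # requests.cpu 4       / defaultRequest cpu 100m
  by_req_mem=$(( 8192 / 64 ))      # requests.memory 8Gi  / defaultRequest memory 64Mi
  by_lim_cpu=$(( 8000 / 500 ))     # limits.cpu 8         / default cpu 500m
  by_lim_mem=$(( 16384 / 256 ))    # limits.memory 16Gi   / default memory 256Mi
  by_pods=20                       # pods: "20"
  min=$by_pods
  for v in "$by_req_cpu" "$by_req_mem" "$by_lim_cpu" "$by_lim_mem"; do
    if [ "$v" -lt "$min" ]; then min=$v; fi
  done
  echo "$min"
}

echo "pods that fit: $(fit_count)"
```

The binding constraint is limits.cpu: sixteen pods at the default 500m limit consume all 8 CPUs of limit budget before the pods: "20" cap is reached.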

Dev / Staging / Production Namespace Pattern

# Create namespaces with environment labels
kubectl create namespace dev
kubectl create namespace staging
kubectl create namespace production

kubectl label namespace dev environment=dev
kubectl label namespace staging environment=staging
kubectl label namespace production environment=production

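# Cilium NetworkPolicy: production pods accept ingress only from the production namespace.
# Staging and dev are blocked, and so is every other namespace (including an ingress
# controller), so add explicit allow rules for anything that must reach production.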
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: no-staging-to-prod
  namespace: production
spec:
  endpointSelector: {}       # applies to all pods in production
  ingress:
  - fromEndpoints:
    - matchLabels:
        io.kubernetes.pod.namespace: production   # only from production namespace

5. Storage — ZFS-Backed Persistent Volumes

On kldload, a Kubernetes PersistentVolume is a ZFS dataset. Not a block device emulating a filesystem, not a thin-provisioned LUN — a real ZFS dataset with its own recordsize, compression algorithm, encryption key, snapshot schedule, and replication target. The OpenEBS ZFS CSI driver bridges Kubernetes storage primitives to ZFS operations.

On kldload, a PV is a ZFS dataset. This means you can snapshot it on demand for backups (VolumeSnapshots map to ZFS snapshots), clone it to provision a test environment with production data in seconds, set per-volume recordsize (8K for PostgreSQL, 128K for sequential workloads), enable per-volume compression independently, and replicate it to a DR site with zfs send | zfs receive. Few Kubernetes storage backends offer this combination of features without a proprietary SAN.

Install OpenEBS ZFS CSI Driver

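# ZFS must already exist on the nodes with a pool named 'rpool' (or your pool name).
# The StorageClasses below provision under rpool/k8s; create that parent dataset
# on each worker first if it does not exist (zfs create rpool/k8s).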
zpool list    # verify pool exists on each worker node

# Install via Helm
helm repo add openebs-zfslocalpv https://openebs.github.io/zfs-localpv
helm repo update
helm install openebs-zfs openebs-zfslocalpv/zfs-localpv \
  --namespace openebs \
  --create-namespace

StorageClasses for Different Workloads

# General-purpose: lz4 compression, 128K recordsize (good for most workloads)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-general
provisioner: zfs.csi.openebs.io
parameters:
  poolname: rpool/k8s
  compression: lz4
  recordsize: "131072"      # 128K
  fstype: zfs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Database: 8K recordsize matches the PostgreSQL page size (InnoDB defaults to 16K pages; use 16K for MySQL)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-db
provisioner: zfs.csi.openebs.io
parameters:
  poolname: rpool/k8s
  compression: lz4
  recordsize: "8192"        # 8K, matches the PostgreSQL page size and avoids write amplification
  fstype: zfs
reclaimPolicy: Retain       # never auto-delete database volumes
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Logging / time-series: large records, high compression
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-logs
provisioner: zfs.csi.openebs.io
parameters:
  poolname: rpool/k8s
  compression: zstd
  recordsize: "1048576"     # 1M — sequential write workloads
  fstype: zfs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
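
The recordsize values above are byte counts. A small sketch for converting the familiar ZFS shorthand into the numbers used in these StorageClasses; to_bytes is a hypothetical helper and handles only K and M suffixes.

```shell
#!/bin/sh
# Convert a ZFS-style size (8K, 128K, 1M) to bytes for the recordsize parameter.
# to_bytes is a hypothetical helper; it understands K and M suffixes only.
to_bytes() {
  num=${1%[KkMm]}                       # strip a trailing K or M, if present
  case $1 in
    *[Kk]) echo $(( num * 1024 )) ;;
    *[Mm]) echo $(( num * 1024 * 1024 )) ;;
    *)     echo "$1" ;;                 # already a plain byte count
  esac
}

for s in 8K 128K 1M; do
  echo "$s = $(to_bytes "$s") bytes"
done
```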

Using PVCs

# Request a 50Gi database volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: production
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: zfs-db
  resources:
    requests:
      storage: 50Gi

VolumeSnapshots — Backup and Clone

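# Install the VolumeSnapshot CRDs (one-time). These manifests contain only the CRDs;
# the snapshot controller under deploy/kubernetes/snapshot-controller in the same
# repository must be deployed too. For production, pin a release tag instead of main.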
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml

# ZFS snapshot class
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: zfs-snapshot
driver: zfs.csi.openebs.io
deletionPolicy: Delete
---
# Take a snapshot of the postgres volume
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snap-20260402
  namespace: production
spec:
  volumeSnapshotClassName: zfs-snapshot
  source:
    persistentVolumeClaimName: postgres-data
---
# Restore: create a new PVC from the snapshot
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-restore
  namespace: production
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: zfs-db
  resources:
    requests:
      storage: 50Gi
  dataSource:
    name: postgres-snap-20260402
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io

6. Helm — Package Management

Helm is Kubernetes's package manager. A chart is a collection of templated Kubernetes manifests. A release is an installed instance of a chart with a specific set of values. Helm tracks releases, making upgrades and rollbacks first-class operations.

Helm is apt/dnf for Kubernetes. The same way you would not manually configure every option in a PostgreSQL RPM by hand-editing config files, you do not manually write every Kubernetes manifest for a complex application. Helm charts encode community best practices: resource requests, RBAC, service accounts, probes, PodDisruptionBudgets. Use them. Customise via values files. Do not fork charts unless you absolutely must.

Basic Helm Workflow

# Add a chart repository
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo add jetstack https://charts.jetstack.io
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Search for a chart
helm search repo nginx

# Install with default values
helm install my-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace

# Install with custom values file
helm install my-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --values nginx-values.yaml

# List releases
helm list -A

# Upgrade a release
helm upgrade my-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --values nginx-values.yaml

# Roll back to the previous revision (append a revision number, e.g. "my-nginx 1", to pick one)
helm rollback my-nginx --namespace ingress-nginx

# Uninstall
helm uninstall my-nginx --namespace ingress-nginx

Writing a Minimal Chart

# Scaffold a new chart
helm create myapp

# Chart structure:
# myapp/
#   Chart.yaml          — name, version, description
#   values.yaml         — default values
#   templates/
#     deployment.yaml   — {{ .Values.image.repository }}:{{ .Values.image.tag }}
#     service.yaml
#     _helpers.tpl      — named templates shared across files

# Lint and dry-run before installing
helm lint myapp/
helm install myapp ./myapp --dry-run --debug

# Package for distribution
helm package myapp/

7. Ingress & Load Balancing

Kubernetes Services expose pods inside the cluster. Getting traffic from outside the cluster to pods requires either a LoadBalancer service (needs an external load balancer or a bare-metal implementation like MetalLB or Cilium BGP), or an Ingress controller that routes HTTP/HTTPS traffic to services based on hostname and path.

nginx Ingress Controller

# values for bare-metal (no cloud load balancer)
# nginx-values.yaml
controller:
  service:
    type: LoadBalancer
  # On kldload with Cilium BGP, Cilium assigns the LoadBalancer IP
  # On bare metal without BGP, use NodePort or MetalLB
  nodeSelector:
    node-role.kubernetes.io/control-plane: ""
  tolerations:
  - key: node-role.kubernetes.io/control-plane
    effect: NoSchedule

helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --values nginx-values.yaml

cert-manager + Let's Encrypt TLS

# Install cert-manager
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set crds.enabled=true

# ClusterIssuer: Let's Encrypt production
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
---
# Ingress with TLS — cert-manager provisions the cert automatically
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  namespace: production
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - app.example.com
    secretName: myapp-tls
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp
            port:
              number: 8080

Cilium BGP for LoadBalancer IPs

# Cilium can announce LoadBalancer IPs via BGP to your router.
# Result: no MetalLB needed. Services get real routable IPs.

# CiliumBGPPeeringPolicy: peer with your upstream router
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: rack-policy
spec:
  nodeSelector:
    matchLabels:
      rack: rack0
  virtualRouters:
  - localASN: 65001
    exportPodCIDR: true
    neighbors:
    - peerAddress: 192.168.1.1/32
      peerASN: 65000
    serviceSelector:
      matchExpressions:
      - key: somekey
        operator: NotIn
        values: ['never']    # select all services
---
# IP pool for LoadBalancer services
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: default-pool
spec:
  blocks:
  - cidr: 192.168.20.0/24   # your dedicated LB IP range

8. Deployments, StatefulSets, DaemonSets

Deployment

For stateless workloads. Manages a ReplicaSet which manages identical, interchangeable pods. Rolling updates replace pods in batches sized by maxSurge and maxUnavailable (both default to 25%). Rollback to any previous revision with kubectl rollout undo. Use for web servers, API servers, workers.

// Deployment → ReplicaSet → [pod, pod, pod]
// Pods are fungible. Lose one, get another identical one.

StatefulSet

For stateful workloads where pod identity matters. Pods get stable network names (pod-0, pod-1, pod-2) and stable PVC bindings that survive pod deletion. Startup and shutdown are ordered. Use for databases, distributed systems like Kafka or Zookeeper, anything with primary/replica topology.

// StatefulSet: postgres-0 is always postgres-0, whichever node it lands on.
// It gets its own PVC. Its hostname never changes.

DaemonSet

Ensures one pod runs on every node (or every node matching a selector). As nodes are added, the DaemonSet automatically places a pod on them. As nodes are removed, the pods are garbage-collected. Use for node-level agents: node_exporter, Falco, log shippers, CNI plugins, device plugins.

// DaemonSet: one pod per node, always.
// New node joins the cluster? Pod lands automatically.

Rolling Update and Rollback

# Update the image — triggers a rolling update
kubectl set image deployment/myapp app=myapp:v2 -n production

# Watch the rollout progress
kubectl rollout status deployment/myapp -n production

# View rollout history
kubectl rollout history deployment/myapp -n production

# Roll back to the previous version
kubectl rollout undo deployment/myapp -n production

# Roll back to a specific revision
kubectl rollout undo deployment/myapp --to-revision=3 -n production
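
How many pods exist during a rolling update depends on maxSurge and maxUnavailable. The sketch below works through the arithmetic for percentage values, where maxSurge rounds up and maxUnavailable rounds down; the function names are chosen for this example, and 25% is the Kubernetes default for both settings.

```shell
#!/bin/sh
# Pod-count bounds during a Deployment rolling update.
# Percentages: maxSurge rounds up, maxUnavailable rounds down.
surge()       { echo $(( ($1 * $2 + 99) / 100 )); }   # replicas, percent -> ceiling
unavailable() { echo $(( $1 * $2 / 100 )); }          # replicas, percent -> floor

rollout_bounds() {
  replicas=$1; surge_pct=${2:-25}; unavail_pct=${3:-25}
  s=$(surge "$replicas" "$surge_pct")
  u=$(unavailable "$replicas" "$unavail_pct")
  echo "replicas=$replicas max_total=$((replicas + s)) min_available=$((replicas - u))"
}

rollout_bounds 10
rollout_bounds 4
```

With ten replicas and the defaults, the rollout may run up to 13 pods at once and never drops below 8 available, so the nodes need headroom for the three extra pods.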

StatefulSet with ZFS PVs

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: production
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:16
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
        resources:
          requests:
            cpu: "500m"
            memory: 1Gi
          limits:
            cpu: "2"
            memory: 4Gi
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: zfs-db      # 8K recordsize, lz4 compression
      resources:
        requests:
          storage: 100Gi
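
Stable identity shows up as a stable DNS name per pod. The sketch below lists the names clients would use, assuming a headless Service named postgres exists (clusterIP: None, not shown in the manifest above) and the default cluster domain cluster.local; pod_dns is a hypothetical helper.

```shell
#!/bin/sh
# Stable per-pod DNS names for a StatefulSet behind a headless Service.
# Assumptions: the Service named in serviceName exists and is headless,
# and the cluster domain is the default "cluster.local".
pod_dns() {
  sts=$1; svc=$2; ns=$3; replicas=$4
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${sts}-${i}.${svc}.${ns}.svc.cluster.local"
    i=$((i + 1))
  done
}

pod_dns postgres postgres production 3
```

These names survive rescheduling, which is why replication configs can reference postgres-0 by name instead of by IP.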

9. Operators & CRDs

An operator is a Kubernetes controller that manages a complex, stateful application. It extends the Kubernetes API with Custom Resource Definitions (CRDs) that represent the application in a Kubernetes-native way, then runs a controller loop that reconciles the actual state of the application to match the desired state expressed in the CRD.

Operators encode operational knowledge in code. The Zalando PostgreSQL operator knows how to do automatic failover when the primary dies, take consistent backups, add replicas for read scaling, apply configuration changes without downtime, and do rolling minor version upgrades. This is knowledge that a human DBA would apply manually, now automated and reproducible. When you install the operator, you get the DBA's expertise in a loop that runs 24/7.

CRD: Custom Resource Definition

Extends the Kubernetes API with new resource types. Once a CRD is installed, you can kubectl apply objects of that type like any native Kubernetes resource. The operator watches for these custom resources and reconciles the actual application state to match.

// CRD: "PostgresCluster is now a valid Kubernetes type"
// Operator: "when I see a PostgresCluster, I run the database"

Zalando Postgres Operator

Manages PostgreSQL clusters on Kubernetes. Declare a postgresql CRD with the number of instances, storage size, and version. The operator creates StatefulSets, Services, Secrets, and a Patroni-managed primary/replica topology. Failover is automatic. Backups integrate with S3 or local storage.

// kubectl apply -f postgres-cluster.yaml
// → primary + 2 replicas, with automatic failover, in ~2 minutes

Prometheus Operator

Manages Prometheus instances and scrape configuration via CRDs: Prometheus, ServiceMonitor, PodMonitor, PrometheusRule. Add a ServiceMonitor to tell Prometheus to scrape a new service without restarting Prometheus or editing its config file manually.

// ServiceMonitor: declarative scrape config
// No prometheus.yml editing. No reloads. Just apply a YAML.

Redis Operator (Redis Enterprise / Spotahome)

Manages Redis instances: standalone, Sentinel (HA), or cluster mode. The operator handles replica promotion, configuration tuning, and Kubernetes service management. Combined with a ZFS StorageClass, each Redis instance gets a ZFS dataset with appropriate recordsize.

// RedisFailover CR → operator → Redis primary + replicas, monitored by Sentinel
// Failover handled automatically. No manual intervention.

Zalando PostgreSQL Operator Example

# Install the operator
helm repo add postgres-operator-charts https://opensource.zalando.com/postgres-operator/charts/postgres-operator
helm install postgres-operator postgres-operator-charts/postgres-operator \
  --namespace postgres-operator \
  --create-namespace

# Declare a PostgreSQL cluster
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: myapp-db
  namespace: production
spec:
  teamId: myapp
  volume:
    size: 100Gi
    storageClass: zfs-db         # ZFS dataset per instance
  numberOfInstances: 3           # 1 primary + 2 replicas
  postgresql:
    version: "16"
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 4Gi

10. Observability in Kubernetes

A Kubernetes cluster without observability is a black box. You need metrics for capacity planning and alerting, logs for debugging, traces for latency analysis, and network flow visibility for security and troubleshooting.

kube-prometheus-stack: Everything in One Helm Chart

# Installs: Prometheus, Alertmanager, Grafana, kube-state-metrics,
#           node_exporter DaemonSet, Prometheus Operator, default dashboards
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# kube-prometheus-values.yaml
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: zfs-logs   # 1M recordsize for time-series data
          resources:
            requests:
              storage: 200Gi
    retention: 30d

grafana:
  persistence:
    enabled: true
    storageClassName: zfs-general
    size: 10Gi
  adminPassword: "changeme"        # placeholder; set a real password before deploying

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: zfs-general
          resources:
            requests:
              storage: 5Gi

helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values kube-prometheus-values.yaml
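The 30d retention and the 200Gi claim above depend on each other. Prometheus documents a rule of thumb for sizing: disk is roughly retention seconds times ingested samples per second times bytes per sample (about 1 to 2 bytes). The sketch below applies that formula; the helper name `prom_disk_gib` and the 50,000 samples/s ingest rate are illustrative assumptions, not values from this cluster.

```shell
# Rough Prometheus disk estimate:
#   bytes = retention_seconds * samples_per_second * bytes_per_sample
# prom_disk_gib is a hypothetical helper, not part of any chart or CLI.
prom_disk_gib() {
  retention_days=$1
  samples_per_sec=$2
  bytes_per_sample=${3:-2}   # rule of thumb: roughly 1-2 bytes per sample
  bytes=$(( retention_days * 86400 * samples_per_sec * bytes_per_sample ))
  gib=$(( 1024 * 1024 * 1024 ))
  echo $(( (bytes + gib - 1) / gib ))   # round up to whole GiB
}

# Example: 30 days at an assumed 50,000 samples/s and 2 bytes per sample
prom_disk_gib 30 50000 2    # prints 242
```

To find the real ingest rate, query `rate(prometheus_tsdb_head_samples_appended_total[5m])` in Prometheus once the stack is running, then resize the claim or shorten retention accordingly.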

Loki + Promtail: Log Aggregation

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Loki (log storage + query)
helm install loki grafana/loki \
  --namespace monitoring \
  --set loki.storage.type=filesystem \
  --set loki.commonConfig.replication_factor=1

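# Promtail (log shipper, runs as DaemonSet on every node)
# Recent promtail charts take the push URL under config.clients;
# older chart versions used config.lokiAddress instead.
helm install promtail grafana/promtail \
  --namespace monitoring \
  --set "config.clients[0].url=http://loki:3100/loki/api/v1/push"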

Hubble: Cilium Network Observability

# Hubble is built into Cilium — enable it
cilium hubble enable --ui

# Real-time network flow viewer (the hubble CLI reaches Hubble Relay through a port-forward)
cilium hubble port-forward &
hubble observe --namespace production --follow

# Filter to a specific pod
hubble observe --pod production/postgres-0 --follow

# Show dropped flows (policy denials)
hubble observe --verdict DROPPED --follow

# Hubble UI — browser-based network graph
cilium hubble ui    # opens port-forward to the UI automatically

11. Upgrade Strategies

Kubernetes upgrades are the operation most likely to go wrong and the operation most likely to be done under time pressure. On kldload, ZFS snapshots make upgrades reversible. A failed upgrade is not a disaster — it is a 2-second rollback.

On kldload, a failed Kubernetes upgrade is a 2-second rollback with kvm-snap rollback, not a 2-hour rebuild from scratch. This changes your risk calculus. On a cloud-managed cluster, a botched upgrade might mean calling support and waiting. On kldload, you snapshot, upgrade, test, and if anything is wrong, you roll back immediately and debug at leisure. The ZFS snapshot is your escape hatch. Always use it.
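The snapshot, upgrade, test, roll back loop described above can be sketched as a small wrapper. This is a minimal sketch, not a kldload tool: `guarded_upgrade`, `upgrade_step`, and `health_check` are hypothetical names, and the only `kvm-snap` invocations used are the two forms shown in this guide. It is dry-run by default, so it prints the commands instead of running them.

```shell
# Dry-run by default: RUN=echo prints each command instead of executing it.
# Set RUN= (empty) to execute for real.
RUN=${RUN-echo}
SNAP=${SNAP-pre-upgrade-1-31}

# Hypothetical hooks: replace with your real upgrade and validation steps.
upgrade_step() { $RUN ssh k8s-control-1 kubeadm upgrade apply v1.31.0; }
health_check() { $RUN kubectl get nodes; }

guarded_upgrade() {
  # Snapshot every node first; refuse to upgrade if any snapshot fails.
  for node in "$@"; do
    $RUN kvm-snap "$node" "$SNAP" || return 1
  done
  if upgrade_step && health_check; then
    echo "upgrade OK; keep $SNAP until the cluster has proven stable"
    return 0
  fi
  # Roll back every node in the set so they all return to the same point in time.
  echo "upgrade failed; rolling back to $SNAP" >&2
  for node in "$@"; do
    $RUN kvm-snap rollback "$node" "$SNAP"
  done
  return 1
}

guarded_upgrade k8s-control-1 k8s-worker-1
```

The design choice worth noting is that rollback covers the same node list as the snapshot step. Rolling back only part of a cluster leaves nodes at different points in time.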

Standard kubeadm Upgrade (In-Place)

# Before anything: snapshot every VM
for node in k8s-control-{1,2,3} k8s-worker-{1,2,3,4}; do
  kvm-snap $node pre-upgrade-1-31
done

# On control plane node 1:
dnf install -y kubeadm-1.31.0
kubeadm upgrade plan
kubeadm upgrade apply v1.31.0
kubectl drain k8s-control-1 --ignore-daemonsets
dnf install -y kubelet-1.31.0 kubectl-1.31.0
systemctl daemon-reload && systemctl restart kubelet
kubectl uncordon k8s-control-1

# Additional control plane nodes:
dnf install -y kubeadm-1.31.0
kubeadm upgrade node
kubectl drain k8s-control-2 --ignore-daemonsets
dnf install -y kubelet-1.31.0 kubectl-1.31.0
systemctl daemon-reload && systemctl restart kubelet
kubectl uncordon k8s-control-2

# Workers (repeat for each):
kubectl drain k8s-worker-1 --ignore-daemonsets --delete-emptydir-data
# (on the worker):
dnf install -y kubeadm-1.31.0 && kubeadm upgrade node
dnf install -y kubelet-1.31.0 kubectl-1.31.0
systemctl daemon-reload && systemctl restart kubelet
# (on control plane):
kubectl uncordon k8s-worker-1

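# Emergency rollback (if anything goes wrong): roll back every node that was
# snapshotted, so etcd members and kubelets all return to the same point in time
for node in k8s-control-{1,2,3} k8s-worker-{1,2,3,4}; do
  kvm-snap rollback $node pre-upgrade-1-31
done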
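One check worth running before the procedure above: kubeadm only supports upgrading one minor version at a time (for example 1.30 to 1.31, not 1.29 to 1.31). A minimal sketch of that check follows; `minor_skew_ok` is a hypothetical helper name, and the version strings are examples.

```shell
# minor_skew_ok CURRENT TARGET
# Exit 0 if TARGET is the same minor version as CURRENT or exactly one ahead.
# Accepts versions with or without a leading "v" (e.g. v1.30.4 or 1.30.4).
minor_skew_ok() {
  cur=${1#v}
  new=${2#v}
  cur_major=${cur%%.*}; rest=${cur#*.}; cur_minor=${rest%%.*}
  new_major=${new%%.*}; rest=${new#*.}; new_minor=${rest%%.*}
  [ "$cur_major" -eq "$new_major" ] || return 1
  diff=$(( new_minor - cur_minor ))
  [ "$diff" -ge 0 ] && [ "$diff" -le 1 ]
}

minor_skew_ok 1.30.4 1.31.0 && echo "ok to upgrade"
minor_skew_ok 1.29.8 1.31.0 || echo "upgrade to 1.30 first"
```

On a live node, `kubeadm version -o short` prints the installed kubeadm version in a form this function accepts.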

Blue/Green Cluster Upgrade

For zero-downtime major upgrades or CNI migrations, build a second cluster alongside the first, migrate workloads to it, then decommission the old cluster. On kldload, the new cluster is cloned from the same golden image in under a minute.

# 1. Clone new nodes from golden image
kvm-clone k8s-golden k8s-blue-control-1
for i in 1 2 3; do
  kvm-clone k8s-golden k8s-blue-worker-$i
done

# 2. Bootstrap blue cluster with new K8s version
kubeadm init --config kubeadm-config-v1-31.yaml

# 3. Deploy applications to blue cluster, point staging DNS to blue
# 4. Validate all workloads on blue
# 5. Migrate persistent data: ZFS send/receive or Velero restore
# 6. Swap production DNS from green to blue
# 7. Decommission green cluster after validation period

Canary Deployments Within a Cluster

# Run v1 (90% traffic) and v2 (10% traffic) simultaneously
# Cilium's traffic splitting or nginx weighted routing

# With Cilium traffic management:
apiVersion: cilium.io/v2
kind: CiliumEnvoyConfig
metadata:
  name: myapp-canary
  namespace: production
# ... traffic weight configuration

# Simpler: two Deployments, one Service, weighted by replica count
# v1: 9 replicas, v2: 1 replica = 10% to v2
kubectl scale deployment myapp-v2 --replicas=1
kubectl scale deployment myapp-v1 --replicas=9
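The replica-count approach works because a Service spreads connections across all ready endpoints, so the traffic share follows the pod ratio only approximately. A small sketch for turning a target percentage into replica counts; `canary_split` is a hypothetical helper, and it assumes both Deployments carry the label the Service selects on.

```shell
# canary_split TOTAL PERCENT -> prints "<v1 replicas> <v2 replicas>"
canary_split() {
  total=$1
  pct=$2
  v2=$(( (total * pct + 50) / 100 ))   # round to the nearest whole replica
  if [ "$v2" -lt 1 ] && [ "$pct" -gt 0 ]; then
    v2=1                               # a non-zero canary needs at least one pod
  fi
  echo "$(( total - v2 )) $v2"
}

canary_split 10 10    # prints "9 1"
canary_split 20 25    # prints "15 5"

# Applying the result (not run here):
#   set -- $(canary_split 10 10)
#   kubectl scale deployment myapp-v1 --replicas="$1"
#   kubectl scale deployment myapp-v2 --replicas="$2"
```

The granularity is limited by the total replica count: with 10 pods the smallest canary is 10%. For finer control, use the weighted routing options mentioned above.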

12. Multi-Cluster

Multi-cluster Kubernetes lets pods in one cluster communicate directly with pods in another, share services across cluster boundaries, and distribute workloads across multiple availability zones or sites. On kldload, the underlying transport is WireGuard — the same mesh you already have connecting your kldload nodes.

Cilium Cluster Mesh

Cilium Cluster Mesh extends Cilium's identity-based networking across cluster boundaries. Pods in cluster A can reach services in cluster B using the service's DNS name. Network policy applies across clusters using the same identity model.

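# Enable cluster mesh on both clusters
# (--context selects the kubeconfig context; the context names here are examples)
# Cluster A (cluster-id=1)
cilium clustermesh enable --context k8s-cluster-a --service-type LoadBalancer
cilium clustermesh status --context k8s-cluster-a --wait

# Cluster B (cluster-id=2)
cilium clustermesh enable --context k8s-cluster-b --service-type LoadBalancer

# Connect cluster A to cluster B
cilium clustermesh connect --context k8s-cluster-a --destination-context k8s-cluster-b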

# Verify connectivity
cilium clustermesh status
cilium connectivity test

# Make a service available across clusters
# Add annotation to the service:
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: production
  annotations:
    service.cilium.io/global: "true"    # visible in all clusters
spec:
  ...

ZFS Replication of PVs Between Clusters

# Find the ZFS dataset backing a PV
kubectl get pv $(kubectl get pvc postgres-data -n production -o jsonpath='{.spec.volumeName}') \
  -o jsonpath='{.spec.csi.volumeAttributes.poolname}'

# Replicate the dataset to DR site via WireGuard mesh
# (assumes WireGuard connectivity between sites)
# @previous stands for the most recent snapshot that already exists on both sides.
SNAP=repl-$(date +%Y%m%d)    # compute once so both commands use the same name
zfs snapshot "rpool/k8s/pvc-abc123@$SNAP"
zfs send -i rpool/k8s/pvc-abc123@previous "rpool/k8s/pvc-abc123@$SNAP" \
  | ssh dr-node zfs receive rpool/k8s/pvc-abc123

# Sanoid + syncoid automate this with configurable schedules:
# /etc/sanoid/sanoid.conf — snapshot policy
# syncoid rpool/k8s/pvc-abc123 dr-node:rpool/k8s/pvc-abc123
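If you script the incremental send yourself rather than using syncoid, the fiddly part is choosing the base snapshot for `zfs send -i`. A minimal sketch, assuming the `repl-YYYYMMDD` naming used above and that the newest local `repl-` snapshot also exists on the DR side; `latest_repl_snapshot` is a hypothetical helper and the snapshot names fed to it here are sample data. Dry-run by default.

```shell
# latest_repl_snapshot: read snapshot names (oldest first) on stdin and
# print the newest one whose name contains "@repl-".
latest_repl_snapshot() {
  grep '@repl-' | tail -n 1
}

RUN=${RUN-echo}                          # dry-run: print instead of execute
DATASET=${DATASET-rpool/k8s/pvc-abc123}

# On a real host, feed the helper from ZFS, sorted by creation time:
#   prev=$(zfs list -H -d 1 -t snapshot -o name -s creation "$DATASET" | latest_repl_snapshot)
# Sample data is used here so the sketch runs anywhere:
prev=$(printf '%s\n' \
  "$DATASET@repl-20250101" \
  "$DATASET@manual" \
  "$DATASET@repl-20250102" | latest_repl_snapshot)
new="$DATASET@repl-20250103"

$RUN zfs snapshot "$new"
$RUN zfs send -i "$prev" "$new"    # for real: pipe into ssh dr-node zfs receive "$DATASET"
```

If the base snapshot is missing on the receiving side, the incremental receive fails. That is the case syncoid handles for you by finding the newest common snapshot.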

13. Security Hardening

Default Kubernetes is not secure. Pods can run as root. There are no network policies. Secrets are base64-encoded in etcd (not encrypted at rest by default). Service account tokens are auto-mounted. Harden deliberately.

Default Kubernetes is insecure in several ways at once: containers run as root unless the image or pod spec says otherwise, any pod can talk to any other pod on any port, Secrets stored in etcd are only base64-encoded (not encrypted), and a service account token is auto-mounted into every pod, handing it credentials for the API server. With RBAC enabled the default service account starts with almost no permissions, but it is a common target for overly broad RoleBindings, and every pod in the namespace inherits whatever it is granted. None of this is malicious design; it is the result of optimising for getting started quickly. Production clusters need explicit hardening on all of these fronts.

Pod Security Standards

# Pod Security Standards enforce security profiles at the namespace level.
# Three profiles: privileged, baseline, restricted.

# Label namespace to enforce restricted mode:
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/warn=restricted

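# "restricted" requires:
# - non-root user (runAsNonRoot: true)
# - no privilege escalation
# - drop ALL capabilities (only NET_BIND_SERVICE may be added back)
# - seccomp profile RuntimeDefault or Localhost
# - no hostNetwork, hostPID, hostIPC (inherited from baseline)
# Not required, but recommended: read-only root filesystem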

# Example compliant pod:
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
  namespace: production
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp:v1
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: [ALL]

Cilium Network Policy (L3/L4/L7)

# Default-deny ingress for a namespace — then allow explicitly
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes: [Ingress]
---
# Allow frontend → backend on port 8080 (L4), restricted at L7 to GET /api/v1/*
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: frontend-to-backend
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:                                  # L7 HTTP rule
        - method: GET
          path: /api/v1/.*                     # allow GET /api/v1/*
---
# No separate deny rule is needed at L7: once an http rule is present, any
# request that does not match it (e.g. method DELETE) is rejected, even though
# L4 allows the connection.

Secrets Management

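# Option 1: Sealed Secrets (encrypted Secrets that are safe to commit to git)
# Install bitnami sealed-secrets controller
helm repo add sealed-secrets https://bitnami-labs.github.io/sealed-secrets
helm install sealed-secrets sealed-secrets/sealed-secrets \
  --namespace kube-system

# Encrypt a secret for git storage. The Helm release above names the
# controller "sealed-secrets", so tell kubeseal where to find it.
kubectl create secret generic db-password \
  --namespace production \
  --from-literal=password=supersecret \
  --dry-run=client -o yaml \
  | kubeseal --controller-name sealed-secrets --controller-namespace kube-system \
      --format yaml > db-password-sealed.yaml

# Commit db-password-sealed.yaml, then apply it; the controller decrypts it in-cluster
kubectl apply -f db-password-sealed.yaml

# Option 2: External Secrets Operator (pull from Vault, AWS Secrets Manager, etc.)
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets \
  --namespace external-secrets \
  --create-namespace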

# Option 3: etcd encryption at rest (built-in; enable by passing
# --encryption-provider-config=/etc/kubernetes/encryption-config.yaml to kube-apiserver)
# /etc/kubernetes/encryption-config.yaml:
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: [secrets]
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded-32-byte-key>
  - identity: {}    # fallback for existing unencrypted secrets
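The `secret:` field above needs a random 32-byte key, base64-encoded. A minimal sketch for generating and sanity-checking one; `gen_encryption_key` is a hypothetical helper name.

```shell
# Generate a random 32-byte key, base64-encoded, for the aescbc provider.
gen_encryption_key() {
  head -c 32 /dev/urandom | base64
}

key=$(gen_encryption_key)

# Sanity check: the decoded key must be exactly 32 bytes.
printf '%s' "$key" | base64 --decode | wc -c

# Next steps (not run here):
#   1. Paste "$key" into the secret: field of encryption-config.yaml.
#   2. Restart kube-apiserver with --encryption-provider-config pointing at the file.
#   3. Rewrite existing Secrets so they are stored encrypted:
#        kubectl get secrets --all-namespaces -o json | kubectl replace -f -
```

Treat the config file like the key it contains: restrict it to root on the control plane nodes and keep it out of git.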

Disable Auto-Mounted Service Account Tokens

# Most pods do not need to call the Kubernetes API.
# Auto-mounted tokens give every pod API server access by default.

# Disable at the namespace level (ServiceAccount default)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: production
automountServiceAccountToken: false

# Or per-pod:
spec:
  automountServiceAccountToken: false

14. Troubleshooting

Kubernetes troubleshooting is systematic. Work from the outside in: is the node healthy? Is the pod scheduled? Is the container starting? Is the application inside the container working? Each layer has its own diagnostic commands.

The Diagnostic Toolkit

# Node health — first thing to check
kubectl get nodes -o wide
kubectl describe node k8s-worker-1    # events, conditions, resource usage

# Pod status
kubectl get pods -n production -o wide
kubectl describe pod myapp-abc123 -n production    # events are key
kubectl logs myapp-abc123 -n production
kubectl logs myapp-abc123 -n production --previous   # logs from crashed container

# All events in a namespace, sorted by time
kubectl get events -n production --sort-by='.lastTimestamp'

# Exec into a running pod for debugging
kubectl exec -it myapp-abc123 -n production -- /bin/sh

# Run a debug container alongside a problem pod (K8s 1.25+)
kubectl debug -it myapp-abc123 -n production --image=nicolaka/netshoot --target=app

Common Failure Modes

CrashLoopBackOff

The container starts and immediately exits. Kubernetes backs off the restart with exponential delay. Check logs with kubectl logs --previous to see the exit output. Common causes: wrong entrypoint, missing environment variable, bad config file mounted, OOM kill (check describe pod for OOMKilled).

kubectl logs pod/myapp --previous -n production
# Look for: exit code, error message, missing config

ImagePullBackOff

Kubelet cannot pull the container image. Check describe pod events for the specific error. Common causes: wrong image name or tag, private registry without imagePullSecrets, registry unreachable from the node, rate limiting (Docker Hub).

kubectl describe pod myapp -n production | grep -A 20 Events:
# Look for: 403, 404, connection refused, timeout

Pending Pods

Scheduler cannot place the pod. Check describe pod for the reason. Common causes: insufficient CPU/memory on all nodes, no node matches affinity/selector, PVC cannot be bound (wrong StorageClass, no capacity), taint on all nodes with no matching toleration.

kubectl describe pod myapp -n production | grep -A 5 "nodes are available"
# The message (e.g. "0/3 nodes are available: ...") tells you exactly why each node was rejected

Node NotReady

Kubelet on the node stopped heartbeating. SSH to the node and check: systemctl status kubelet, journalctl -u kubelet -n 50. Common causes: kubelet crash (certificate expired, disk full, cgroup driver mismatch), containerd hung, kernel panic (check dmesg), node out of disk or memory.

ssh k8s-worker-1
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago"
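The four failure modes above reduce to a small lookup from observed status to first diagnostic command. A sketch of that mapping as a shell function; `first_check` is a hypothetical helper, and POD, NODE, and NAMESPACE are placeholders to substitute.

```shell
# first_check STATUS -> prints the first diagnostic command to run.
# STATUS is the value shown by `kubectl get pods` or `kubectl get nodes`.
first_check() {
  case "$1" in
    CrashLoopBackOff)
      echo 'kubectl logs POD -n NAMESPACE --previous' ;;
    ImagePullBackOff|ErrImagePull)
      echo 'kubectl describe pod POD -n NAMESPACE' ;;
    Pending)
      echo 'kubectl describe pod POD -n NAMESPACE' ;;
    NotReady)
      echo 'ssh NODE journalctl -u kubelet -n 50' ;;
    *)
      # Unknown state: fall back to the namespace event stream.
      echo 'kubectl get events -n NAMESPACE --sort-by=.lastTimestamp' ;;
  esac
}

first_check CrashLoopBackOff    # prints: kubectl logs POD -n NAMESPACE --previous
```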

etcd Health

# Check etcd member health
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

# Check etcd cluster members
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list

# Backup etcd (do this before every upgrade)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# On kldload: also snapshot the ZFS dataset
zfs snapshot rpool/k8s-control/var-lib-etcd@pre-upgrade
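Every etcdctl call above repeats the same four connection flags. A small wrapper keeps them in one place; this is a sketch assuming the kubeadm default certificate paths shown above, and `etcd_args` and `etcdctl_k8s` are hypothetical helper names.

```shell
# Directory holding the etcd certificates (kubeadm default, as used above).
ETCD_PKI=${ETCD_PKI-/etc/kubernetes/pki/etcd}

# etcd_args: print the connection flags shared by every etcdctl call, one per line.
etcd_args() {
  printf '%s\n' \
    "--endpoints=https://127.0.0.1:2379" \
    "--cacert=$ETCD_PKI/ca.crt" \
    "--cert=$ETCD_PKI/server.crt" \
    "--key=$ETCD_PKI/server.key"
}

# etcdctl_k8s: run etcdctl with those flags plus whatever subcommand you pass.
# The unquoted $(etcd_args) is intentional: each flag becomes its own argument.
etcdctl_k8s() {
  ETCDCTL_API=3 etcdctl $(etcd_args) "$@"
}

etcd_args    # show the flags that will be used

# Usage on a control plane node (not run here):
#   etcdctl_k8s endpoint health
#   etcdctl_k8s member list
#   etcdctl_k8s snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db
```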

Certificate Expiry

# Check all certificate expiry dates
kubeadm certs check-expiration

# If certificates have expired, the API server will refuse connections.
# The symptom: kubectl commands fail with "certificate has expired" or
# "unable to connect to the server: x509"

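# Renew all certificates (run on every control plane node)
kubeadm certs renew all

# The control plane static pods only pick up renewed certs when they restart,
# and restarting kubelet alone does not restart them. Move the manifests out,
# wait for the pods to stop, then move them back:
mkdir -p /etc/kubernetes/manifests.off
mv /etc/kubernetes/manifests/*.yaml /etc/kubernetes/manifests.off/
sleep 20
mv /etc/kubernetes/manifests.off/*.yaml /etc/kubernetes/manifests/

# Refresh the admin kubeconfig after renewal (admin.conf embeds a renewed client cert)
cp /etc/kubernetes/admin.conf ~/.kube/config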
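`kubeadm certs check-expiration` is fine interactively, but expiry is easier to catch from a scheduled job. A minimal sketch using `openssl x509 -checkend`, which tests whether a certificate is still valid a given number of seconds from now; `expires_within` is a hypothetical helper, and the 30-day threshold is an assumption.

```shell
# expires_within CERT DAYS
# Exit 0 if CERT expires within DAYS days (or cannot be parsed),
# 1 if it is still valid after that window, 2 if the file cannot be read.
expires_within() {
  [ -r "$1" ] || return 2
  if openssl x509 -checkend "$(( $2 * 86400 ))" -noout -in "$1" >/dev/null 2>&1; then
    return 1    # -checkend exits 0 when the cert is still valid after the window
  fi
  return 0
}

# Warn about any certificate under the kubeadm PKI directory with under 30 days left.
for crt in /etc/kubernetes/pki/*.crt /etc/kubernetes/pki/etcd/*.crt; do
  if expires_within "$crt" 30; then
    echo "RENEW SOON: $crt"
  fi
done
```

On a machine without that directory the loop prints nothing. On a control plane node, any output is the cue to schedule `kubeadm certs renew all` before the API server starts refusing connections.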

The complete picture: a kldload Kubernetes cluster is a system where every component is understood and owned. The nodes are ZFS zvols — snapshot them before any operation, roll back in two seconds if anything goes wrong. The networking is Cilium eBPF — O(1) packet decisions, L7 policy without sidecars, network flow observability with Hubble. RBAC controls who can do what. PVs are ZFS datasets with per-volume tuning. Helm manages application lifecycle. Operators encode operational knowledge. The upgrade path is a ZFS snapshot away from being reversed.

This is infrastructure you understand end to end — from the kernel's eBPF hooks to the etcd Raft log to the ZFS dataset backing your database. No black boxes. No cloud console. No vendor lock-in. Just Linux, ZFS, and Kubernetes, on hardware you own.

Related pages