
AI for Kubernetes — voice-controlled cluster management.

Generic LLMs know that kubectl exists. This model knows your cluster topology, your running deployments, your pod health, your ingress rules, and your PVC layout. It reads kubectl get all before answering every question. It recommends ksnap before every destructive rollout. It understands Calico policies for your actual network, not a textbook network.

The Modelfile encodes deep Kubernetes knowledge. The context script feeds live cluster state into every query. The execute mode generates real kubectl commands, shows them to you, and waits for confirmation before touching anything.

1. The Kubernetes Modelfile

This is the complete system prompt. It encodes kubectl operations, resource management, Helm workflows, Calico networking, troubleshooting patterns, and ZFS-backed persistent storage. The model memorizes all of it.

Complete Kubernetes expert Modelfile

# /srv/ollama/Modelfile.k8s-expert
FROM llama3.1:8b

SYSTEM """
You are a Kubernetes operations expert for this kldload-based infrastructure.
You give precise kubectl commands, reference actual resource names from context,
and always recommend ZFS snapshots of PVC volumes before destructive operations.

=== KUBECTL CORE ===
List resources:         kubectl get pods,svc,deploy,ing -A
Describe resource:      kubectl describe pod/NAME -n NAMESPACE
Apply manifest:         kubectl apply -f manifest.yaml
Delete resource:        kubectl delete pod/NAME -n NAMESPACE
Scale deployment:       kubectl scale deploy/NAME --replicas=N -n NAMESPACE
Rollout status:         kubectl rollout status deploy/NAME -n NAMESPACE
Rollout history:        kubectl rollout history deploy/NAME -n NAMESPACE
Rollout undo:           kubectl rollout undo deploy/NAME -n NAMESPACE
Exec into pod:          kubectl exec -it pod/NAME -n NAMESPACE -- /bin/sh
Pod logs:               kubectl logs pod/NAME -n NAMESPACE --tail=100
Pod logs (previous):    kubectl logs pod/NAME -n NAMESPACE --previous
Port forward:           kubectl port-forward svc/NAME 8080:80 -n NAMESPACE
Watch resources:        kubectl get pods -w -n NAMESPACE
Top (resource usage):   kubectl top pods -n NAMESPACE --sort-by=memory
Events:                 kubectl get events --sort-by=.lastTimestamp -n NAMESPACE

=== RESOURCE TYPES ===
Pods:                   Smallest deployable unit. One or more containers sharing network/storage.
Deployments:            Declarative pod management. ReplicaSets under the hood.
Services:               Stable network endpoint: ClusterIP, NodePort, LoadBalancer.
Ingress:                HTTP/HTTPS routing. Requires an ingress controller (nginx, traefik).
ConfigMaps:             Non-sensitive configuration. Mount as files or inject as env vars.
Secrets:                Base64-encoded sensitive data. Use with caution — not encrypted at rest by default.
PVCs:                   Persistent Volume Claims. Request storage from a StorageClass.
Namespaces:             Logical isolation. Use for environments (dev/staging/prod) or teams.
DaemonSets:             One pod per node. Use for log collectors, monitoring agents.
StatefulSets:           Ordered, stable pod identities. Use for databases, ZooKeeper.
CronJobs:               Scheduled jobs. Cron syntax: "0 5 * * *"
ServiceAccounts:        Pod identity for RBAC. Bind to Roles via RoleBindings.

=== HELM ===
Install chart:          helm install RELEASE CHART --namespace NS --create-namespace -f values.yaml
Upgrade release:        helm upgrade RELEASE CHART --namespace NS -f values.yaml
List releases:          helm list -A
User-set values:        helm get values RELEASE -n NS
Chart default values:   helm show values CHART
History:                helm history RELEASE -n NS
Rollback:               helm rollback RELEASE REVISION -n NS
Uninstall:              helm uninstall RELEASE -n NS
Add repo:               helm repo add NAME URL && helm repo update
Search:                 helm search repo KEYWORD

=== CALICO / NETWORK POLICY ===
Default deny all:       apiVersion: networking.k8s.io/v1, kind: NetworkPolicy
                        spec.podSelector: {}, policyTypes: [Ingress, Egress]
Allow specific ingress: spec.ingress[].from[].podSelector.matchLabels: {app: frontend}
Allow namespace:        spec.ingress[].from[].namespaceSelector.matchLabels: {name: monitoring}
Allow port:             spec.ingress[].ports[].port: 8080, protocol: TCP
Calico status:          kubectl get pods -n calico-system
Calico policies:        kubectl get networkpolicies -A
GlobalNetworkPolicy:    Calico CRD — applies across all namespaces

=== TROUBLESHOOTING ===

CrashLoopBackOff:
  1. kubectl describe pod/NAME -n NS  (check Events section)
  2. kubectl logs pod/NAME -n NS --previous  (see crash output)
  3. Common causes: bad entrypoint, missing config, OOM, failed health check
  4. Fix: check image, configmap mounts, resource limits, liveness probe

ImagePullBackOff:
  1. kubectl describe pod/NAME -n NS  (look for "Failed to pull image")
  2. Check image name and tag — typos are the #1 cause
  3. Private registry: kubectl create secret docker-registry ...
  4. Check node can reach registry: kubectl exec debug-pod -- curl registry:5000/v2/

Pending pods:
  1. kubectl describe pod/NAME -n NS  (check Events for scheduling failures)
  2. Insufficient CPU/memory: kubectl describe nodes | grep -A5 Allocated
  3. Unschedulable: kubectl get nodes (check taints, cordoned nodes)
  4. PVC not bound: kubectl get pvc -n NS  (check StorageClass exists)

OOMKilled:
  1. kubectl describe pod/NAME -n NS  (Last State: OOMKilled)
  2. Increase memory limit in deployment spec: resources.limits.memory
  3. Check actual usage: kubectl top pod NAME -n NS
  4. Application leak: profile the app, don't just raise limits forever

=== ZFS-BACKED PVCs ===
StorageClass for ZFS:   provisioner: kubernetes.io/no-provisioner (local PV on ZFS dataset)
Create PV on ZFS:       zfs create -o mountpoint=/srv/k8s-volumes/pv-NAME rpool/k8s/pv-NAME
Snapshot before deploy: ksnap /srv/k8s-volumes  (snapshot all PVCs at once)
Clone PVC for testing:  kclone /srv/k8s-volumes/pv-data /srv/k8s-volumes/pv-data-test
Compression:            zfs set compression=zstd rpool/k8s  (inherited by all PVs)
Recordsize for DBs:     zfs set recordsize=16k rpool/k8s/pv-postgres
Monitor usage:          kdf  (shows all datasets including k8s PVs)
Replicate to DR:        syncoid rpool/k8s root@dr-node:tank/k8s

=== KLDLOAD K8S WORKFLOW ===
Before any destructive operation:
  1. ksnap /srv/k8s-volumes  (ZFS snapshot of all PVC data)
  2. kubectl get all -A > /tmp/cluster-state-$(date +%F).txt  (record current state)
  3. Then proceed with the operation
  4. If rollback needed: ksnap rollback + kubectl rollout undo

=== PHILOSOPHY ===
Snapshot PVC volumes before every deployment. ksnap is instant. Rollbacks are free.
kubectl describe is your stethoscope. Events tell you what happened and when.
Don't guess at resource limits — measure with kubectl top, then set.
Network policies are not optional in production. Default deny, then allow.
Helm values files belong in git. If it's not in git, it didn't happen.
ZFS-backed PVCs mean your persistent data has checksums, compression, and snapshots — for free.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 16384

# Build the Kubernetes expert model
ollama create k8s-expert -f /srv/ollama/Modelfile.k8s-expert

# Verify it
ollama run k8s-expert "How do I troubleshoot a pod stuck in CrashLoopBackOff?"
A mechanic memorizes engine codes. A K8s admin memorizes pod states. This model memorizes both — plus your deployment topology, your ingress rules, and every event that matters.
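The default-deny shorthand in the Modelfile's Calico section expands to a full manifest along these lines. This is a sketch: the staging namespace and the policy name are placeholders for your own.

```shell
# Write a default-deny NetworkPolicy for one namespace
# (sketch; "staging" and "default-deny-all" are placeholders)
cat > /tmp/default-deny.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: staging
spec:
  podSelector: {}       # empty selector = every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
EOF

# Review the file, then apply:
#   kubectl apply -f /tmp/default-deny.yaml
```

With this in place, nothing talks to anything in the namespace until you add explicit allow rules, which is exactly the "default deny, then allow" posture the philosophy section calls for.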

2. Live context script (kai-k8s)

The Modelfile is the AI's training. The context script is the patient chart. Every query includes fresh cluster state — running pods, node health, recent events, resource usage — so the model answers based on what is happening right now, not what the docs say should be happening.

The Kubernetes context builder

#!/bin/bash
# /usr/local/bin/kai-k8s — query the Kubernetes AI with live cluster context

build_k8s_context() {
    echo "=== LIVE KUBERNETES STATE ($(date -Iseconds)) ==="

    echo -e "\n--- kubectl get all -A ---"
    kubectl get all -A 2>/dev/null

    echo -e "\n--- Node status ---"
    kubectl get nodes -o wide 2>/dev/null

    echo -e "\n--- Node resource usage ---"
    kubectl top nodes 2>/dev/null

    echo -e "\n--- Pod status (non-Running) ---"
    kubectl get pods -A --field-selector='status.phase!=Running' 2>/dev/null

    echo -e "\n--- Pod resource usage ---"
    kubectl top pods -A --sort-by=memory 2>/dev/null | head -30

    echo -e "\n--- Recent events (last 30) ---"
    kubectl get events -A --sort-by=.lastTimestamp 2>/dev/null | tail -30

    echo -e "\n--- PVCs ---"
    kubectl get pvc -A 2>/dev/null

    echo -e "\n--- Ingress ---"
    kubectl get ingress -A 2>/dev/null

    echo -e "\n--- Helm releases ---"
    helm list -A 2>/dev/null

    echo -e "\n--- ZFS PVC volumes ---"
    zfs list -o name,used,avail,mountpoint 2>/dev/null | grep k8s

    echo -e "\n--- ZFS PVC snapshots ---"
    zfs list -t snapshot -o name,used,creation 2>/dev/null | grep k8s | tail -20
}

# Internal hook: kai-k8s-do reuses the context builder via this flag
if [ "$1" = "__build_context" ]; then
    build_k8s_context
    exit 0
fi

QUESTION="$*"
if [ -z "$QUESTION" ]; then
    echo "Usage: kai-k8s <question>"
    echo ""
    echo "Examples:"
    echo "  kai-k8s 'why is my pod crashing?'"
    echo "  kai-k8s 'scale nginx to 5 replicas'"
    echo "  kai-k8s 'create a deployment for redis with a ZFS PVC'"
    echo "  kai-k8s 'show me pods using more than 500MB RAM'"
    echo "  kai-k8s 'set up an ingress for my web app'"
    echo "  kai-k8s 'rollback the last deployment'"
    exit 1
fi

CONTEXT=$(build_k8s_context)

echo -e "${CONTEXT}\n\n=== QUESTION ===\n${QUESTION}" | ollama run k8s-expert
You don't diagnose a cluster by reading the Kubernetes docs. You diagnose it by reading pod events and logs. This script makes sure the AI always reads your cluster state before it opens its mouth.

3. Example queries

Every query below hits the model with fresh cluster data. The AI sees your actual pods, your actual node capacity, your actual events. It doesn't guess — it reads.

"Scale nginx to 5 replicas"

The AI reads your current deployment, checks available node capacity via kubectl top nodes, and gives you the exact command. If your nodes are near capacity, it warns you before you scale into OOM territory.

kai-k8s "scale nginx to 5 replicas"
# AI output: kubectl scale deploy/nginx --replicas=5 -n default

"Why is pod X in CrashLoopBackOff"

The AI reads pod events, pulls the previous container logs, checks for OOMKilled signals, missing configmaps, failed health probes, and image pull errors. It tells you which thing broke and how to fix it — not a generic troubleshooting flowchart.

kai-k8s "pod redis-0 is in CrashLoopBackOff — what happened?"
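The same triage sequence can be run by hand. This helper is a sketch that prints the commands rather than executing them, so you can review before pasting anything against the cluster; the pod and namespace names are whatever you pass in.

```shell
# crashloop_triage POD [NAMESPACE] — print the standard CrashLoopBackOff
# triage commands for a pod (a sketch: prints, does not execute)
crashloop_triage() {
    local pod="$1" ns="${2:-default}"
    printf '%s\n' \
        "kubectl describe pod/${pod} -n ${ns}" \
        "kubectl logs pod/${pod} -n ${ns} --previous" \
        "kubectl get events -n ${ns} --field-selector involvedObject.name=${pod}"
}

crashloop_triage redis-0 default
```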

"Create a deployment for redis with 3 replicas and a ZFS PVC"

The AI generates the full manifest: StatefulSet with 3 replicas, a PVC template pointing at the ZFS StorageClass, recordsize=16k for the backing dataset, and the ksnap command to snapshot the volume before first write.

kai-k8s "deploy redis with 3 replicas, ZFS-backed persistent storage"
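A sketch of the kind of manifest the model is expected to produce. The image tag, storage size, and the zfs-local StorageClass name are placeholders for your setup.

```shell
# Redis StatefulSet with one ZFS-backed PVC per replica (sketch; names
# and the "zfs-local" StorageClass are assumptions about your cluster)
cat > /tmp/redis-statefulset.yaml <<'EOF'
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:       # one PVC per replica, bound to local ZFS PVs
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: zfs-local
        resources:
          requests:
            storage: 10Gi
EOF

# Review, snapshot, then apply:
#   ksnap /srv/k8s-volumes
#   kubectl apply -f /tmp/redis-statefulset.yaml
```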

"Show me all pods using more than 500MB RAM"

The AI parses kubectl top pods output from the live context and lists every pod exceeding the threshold. It flags any without memory limits set — because a pod without limits is a pod waiting to get OOMKilled.

kai-k8s "which pods are using more than 500MB RAM?"
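The parsing itself is one awk line. This sketch runs against canned sample output so you can see the logic; in practice you pipe the live kubectl top pods -A output in. It assumes Mi units, which is what kubectl top normally prints.

```shell
# Filter kubectl top output for pods over 500Mi (sketch; canned sample
# input stands in for the live "kubectl top pods -A" output)
sample='NAMESPACE   NAME         CPU(cores)   MEMORY(bytes)
default     postgres-0   120m         812Mi
default     nginx-abc    5m           64Mi
kube-system coredns-x    3m           70Mi'

echo "$sample" | awk 'NR>1 { mem=$4; sub(/Mi$/,"",mem); if (mem+0 > 500) print $2, $4 }'
# → postgres-0 812Mi
```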

"Set up an ingress for my web app"

The AI checks your existing ingress controller, reads your services, and generates an Ingress manifest with the correct annotations for your setup. TLS, path routing, host rules — all based on what's actually running, not a copy-paste from Stack Overflow.

kai-k8s "create an ingress for my-webapp service on port 8080"
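The result looks roughly like this. A sketch: the hostname, ingress class, and TLS secret name are placeholders the model would fill in from your actual controller and certificates.

```shell
# Ingress for the my-webapp service on port 8080 (sketch; host, class,
# and TLS secret are placeholders for your environment)
cat > /tmp/my-webapp-ingress.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-webapp
spec:
  ingressClassName: nginx
  tls:
    - hosts: [webapp.example.internal]
      secretName: my-webapp-tls
  rules:
    - host: webapp.example.internal
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-webapp
                port:
                  number: 8080
EOF

# Review, then:  kubectl apply -f /tmp/my-webapp-ingress.yaml
```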

"Rollback the last deployment"

The AI reads rollout history, identifies the previous revision, shows you the diff between current and previous, and gives you the undo command. It recommends ksnap of PVC volumes before the rollback — because data changes don't undo themselves.

kai-k8s "rollback the nginx deployment to the previous version"
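The full sequence, snapshot first, is short. This sketch builds the plan as text rather than executing it, using the example's nginx deployment as the name.

```shell
# The rollback sequence in order (sketch: printed, not executed,
# so you can review it before touching the cluster)
ROLLBACK_PLAN=$(printf '%s\n' \
    "ksnap /srv/k8s-volumes" \
    "kubectl rollout history deploy/nginx -n default" \
    "kubectl rollout undo deploy/nginx -n default")
echo "$ROLLBACK_PLAN"
```

To land on a specific revision instead of the previous one, kubectl rollout undo accepts --to-revision=N.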

4. kai-k8s-do — execute mode

Reading cluster state is safe. Changing it is not. kai-k8s-do follows the same pattern as kai-do for Proxmox: the AI generates the kubectl commands, shows them to you, and waits for confirmation before executing anything.

Execute mode with confirmation

#!/bin/bash
# /usr/local/bin/kai-k8s-do — AI generates kubectl commands, you confirm

QUESTION="$*"
if [ -z "$QUESTION" ]; then
    echo "Usage: kai-k8s-do <instruction>"
    echo ""
    echo "Examples:"
    echo "  kai-k8s-do 'scale nginx to 5 replicas'"
    echo "  kai-k8s-do 'create a redis deployment with ZFS PVC'"
    echo "  kai-k8s-do 'delete all completed jobs'"
    echo "  kai-k8s-do 'apply network policy to deny all ingress in staging'"
    exit 1
fi

# Snapshot PVC volumes before any changes
echo "=== Pre-flight: snapshotting ZFS PVC volumes ==="
ksnap /srv/k8s-volumes 2>/dev/null && echo "Snapshot complete." || echo "(no ZFS k8s volumes found)"
echo ""

# Gather cluster state
CONTEXT=$(/usr/local/bin/kai-k8s __build_context 2>/dev/null)

# Ask the AI to generate commands (not execute)
COMMANDS=$(echo "${CONTEXT}

=== INSTRUCTION ===
${QUESTION}

Generate the exact kubectl/helm commands to accomplish this.
Output ONLY the commands, one per line, prefixed with CMD:
Do not explain. Do not add commentary. Just CMD: lines." | \
    ollama run k8s-expert)

echo "=== AI-generated commands ==="
echo "$COMMANDS" | grep '^CMD:' | sed 's/^CMD: *//'
echo ""
echo "=== Review the commands above ==="
read -p "Execute? [y/N] " confirm

if [[ "$confirm" =~ ^[Yy]$ ]]; then
    echo "$COMMANDS" | grep '^CMD:' | sed 's/^CMD: *//' | while IFS= read -r cmd; do
        echo ">>> $cmd"
        eval "$cmd"
        echo ""
    done
    echo "Done."
else
    echo "Aborted. No changes made."
fi
A surgeon doesn't let the anesthesiologist also hold the scalpel. kai-k8s reads the patient chart. kai-k8s-do holds the scalpel. You decide when to cut.

5. ZFS integration

Kubernetes persistent volumes are just directories. On kldload, those directories live on ZFS datasets. That means your PVCs get checksums, compression, snapshots, clones, and replication — for free. The AI knows this and uses it.
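The wiring is a StorageClass with no dynamic provisioner plus a local PersistentVolume whose path is the ZFS dataset's mountpoint. A sketch, with the node name, dataset, and sizes as placeholders:

```shell
# StorageClass + local PV on a ZFS dataset (sketch; "node1" and the
# pv-postgres dataset are placeholders for your topology)
cat > /tmp/zfs-local-pv.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-local
provisioner: kubernetes.io/no-provisioner   # static provisioning only
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-postgres
spec:
  capacity:
    storage: 50Gi
  accessModes: ["ReadWriteOnce"]
  storageClassName: zfs-local
  local:
    path: /srv/k8s-volumes/pv-postgres      # the ZFS dataset mountpoint
  nodeAffinity:                             # local PVs must pin to a node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: [node1]
EOF

# Create the dataset first, then apply:
#   zfs create -o mountpoint=/srv/k8s-volumes/pv-postgres rpool/k8s/pv-postgres
#   kubectl apply -f /tmp/zfs-local-pv.yaml
```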

Snapshot PVCs before deployments

Every kai-k8s-do command starts with a ZFS snapshot. If the deployment goes sideways, your data is exactly where it was 30 seconds ago.

# Snapshot all PVC volumes at once
ksnap /srv/k8s-volumes

# Snapshot a specific PVC
ksnap /srv/k8s-volumes/pv-postgres

# Rollback after a bad deployment
ksnap rollback /srv/k8s-volumes/pv-postgres
kubectl rollout undo deploy/postgres
Kubernetes can roll back a deployment. It cannot roll back the data that deployment wrote to disk. ZFS can.

Clone PVCs for testing

Need a copy of production data for your staging cluster? kclone creates an instant, zero-cost copy. No cp. No rsync. No waiting.

# Clone production postgres data for staging
kclone /srv/k8s-volumes/pv-postgres /srv/k8s-volumes/pv-postgres-staging

# Point staging PV at the clone
kubectl apply -f staging-pv.yaml
Photocopying a 500GB database takes hours. ZFS cloning takes milliseconds — it's copy-on-write. Only the bytes that change consume space.
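Under the hood this is a snapshot plus a clone. The raw ZFS commands below are an assumption about what the kclone wrapper does, printed rather than executed so nothing touches a pool:

```shell
# Roughly what kclone does in raw ZFS terms (an assumption about the
# wrapper; sketch: printed, not executed)
src=rpool/k8s/pv-postgres
dst=rpool/k8s/pv-postgres-staging
snap="${src}@clone-$(date +%F)"

printf '%s\n' \
    "zfs snapshot ${snap}" \
    "zfs clone -o mountpoint=/srv/k8s-volumes/pv-postgres-staging ${snap} ${dst}"
```

The clone shares every unmodified block with the snapshot it came from, which is why it is instant and initially free.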

Replicate volumes to DR cluster

Your Kubernetes data follows the same replication path as everything else on kldload: syncoid over WireGuard. Incremental. Encrypted in flight. Checksummed end to end.

# Replicate all k8s volumes to DR site
syncoid rpool/k8s root@dr-node:tank/k8s

# Replicate a specific PVC
syncoid rpool/k8s/pv-postgres root@dr-node:tank/k8s/pv-postgres

# Automate with cron
cat > /etc/cron.d/k8s-replicate <<'EOF'
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
*/15 * * * * root syncoid --no-sync-snap rpool/k8s root@dr-node:tank/k8s
EOF
Kubernetes has no opinion about disaster recovery for persistent data. ZFS does. syncoid sends only the changed blocks. Your RPO is 15 minutes, not "whenever we remember to run a backup."

6. Fleet management

One AI. Multiple clusters. The model lives on ZFS. It replicates via syncoid over WireGuard. Every cluster gets the same Kubernetes expertise, injected with its own live state.

Multi-cluster AI over WireGuard

#!/bin/bash
# replicate-k8s-expert.sh — push the K8s model to all clusters

CLUSTERS="cluster-prod cluster-staging cluster-dev"

# Snapshot the trained model
zfs snapshot rpool/srv/ollama@k8s-expert-$(date +%F)

# Replicate to every cluster's control plane
for cluster in $CLUSTERS; do
    echo "--- Syncing K8s expert to $cluster ---"
    syncoid --no-sync-snap rpool/srv/ollama "root@${cluster}:rpool/srv/ollama"
    ssh "root@${cluster}" "systemctl restart ollama"
    echo "$cluster: done"
done

# Deploy the kai-k8s scripts to every cluster
for cluster in $CLUSTERS; do
    scp /usr/local/bin/kai-k8s "root@${cluster}:/usr/local/bin/kai-k8s"
    scp /usr/local/bin/kai-k8s-do "root@${cluster}:/usr/local/bin/kai-k8s-do"
    ssh "root@${cluster}" "chmod +x /usr/local/bin/kai-k8s /usr/local/bin/kai-k8s-do"
done

echo "All clusters updated at $(date)"
Same flight manual, different aircraft. Every cluster runs the same K8s expert but feeds it its own pod events. Production asks about scaling. Staging asks about broken configs. Same expertise. Different patients.

Kubernetes gives you the primitives. Pods, deployments, services, ingress, PVCs, namespaces. These are not abstractions — they are building blocks. The AI doesn't replace your understanding of them. It amplifies it. It reads your pod events at 3 AM. It catches the CrashLoopBackOff before your users notice. It remembers the Helm values you set three months ago and why.

ZFS gives your persistent volumes something Kubernetes never will: checksums, snapshots, and replication that works the same way whether you're running one pod or a thousand.

Learn the primitives. Then teach them to a machine that never sleeps.