
Train AI on Your Infrastructure — turn a generic LLM into YOUR sysadmin.

The AI Admin Assistant page showed you how to install Ollama and create a basic infrastructure model. This page goes deeper. You will scrape your entire knowledge base — docs, configs, man pages, tool output — into a single context corpus, build a comprehensive Modelfile that encodes everything about your environment, inject live system state into every query, generate daily health reports, and replicate the trained model across your fleet.

The goal: a local LLM that doesn't just know Linux — it knows your Linux. Your pool layout. Your dataset hierarchy. Your WireGuard topology. Your tool flags. Generic models give generic answers. Trained models give your answers.

1. Build the knowledge base

Before the model can know your infrastructure, you need to collect everything it should know into plain text. Docs, tool usage, man pages, current state — all of it.

Scrape the kldload documentation

Extract text from every HTML doc on the system. Strip tags. Keep structure.

#!/bin/bash
# build-knowledge-base.sh — collect everything the AI needs to know

KB="/srv/ollama/knowledge-base"
mkdir -p "$KB"

# --- kldload HTML docs ---
# If you have local docs (from the ISO or the website), extract them
echo "=== KLDLOAD DOCUMENTATION ===" > "$KB/docs.txt"
# Unmatched globs stay literal; the -f test below skips them
for f in /usr/local/share/kldload-webui/free/*.html \
         /usr/share/doc/kldload/*.html; do
    [ -f "$f" ] || continue
    echo -e "\n--- $(basename "$f") ---"
    # Strip HTML tags, collapse whitespace, keep meaningful text
    sed 's/<[^>]*>//g; s/&mdash;/—/g; s/&amp;/\&/g; s/&lt;/</g; s/&gt;/>/g' "$f" \
        | tr -s '[:space:]' ' ' \
        | fold -s -w 120
done >> "$KB/docs.txt"

echo "Docs: $(wc -l < "$KB/docs.txt") lines"

Capture every kldload tool's usage

The model needs to know what each tool does, what flags it accepts, and what output to expect.

# --- Tool help output ---
echo "=== KLDLOAD TOOL REFERENCE ===" > "$KB/tools.txt"

for tool in kst ksnap kbe kdf kdir kpkg kupgrade krecovery kexport kvpn kfw; do
    if command -v "$tool" &>/dev/null; then
        echo -e "\n=== $tool ==="
        echo "--- $tool --help ---"
        $tool --help 2>&1 || true
        echo ""
    fi
done >> "$KB/tools.txt"

# Capture a live kst output as an example of what "healthy" looks like
echo -e "\n=== EXAMPLE: kst output on a healthy system ===" >> "$KB/tools.txt"
kst >> "$KB/tools.txt" 2>&1 || true

echo "Tools: $(wc -l < "$KB/tools.txt") lines"
You're building a reference manual the AI will memorize. Every flag. Every output format. Every error message it might encounter.

Dump ZFS man page summaries

# --- ZFS reference ---
echo "=== ZFS REFERENCE ===" > "$KB/zfs.txt"

# Core man pages — extract the SYNOPSIS and DESCRIPTION sections
for page in zfs zpool zfs-send zfs-recv zfs-snapshot zfs-clone zfs-destroy \
            zfs-set zfs-mount zfs-share zpoolprops zfsprops; do
    if man -w "$page" &>/dev/null; then
        echo -e "\n=== man $page ==="
        man "$page" 2>/dev/null | col -bx | \
            sed -n '/^NAME/,/^[A-Z]/p; /^SYNOPSIS/,/^[A-Z]/p; /^DESCRIPTION/,/^[A-Z]/p' | \
            head -80
    fi
done >> "$KB/zfs.txt"

# ZFS properties quick reference
echo -e "\n=== ZFS DATASET PROPERTIES ===" >> "$KB/zfs.txt"
zfs get all rpool 2>/dev/null | head -50 >> "$KB/zfs.txt"

# Pool layout
echo -e "\n=== POOL LAYOUT ===" >> "$KB/zfs.txt"
zpool status 2>/dev/null >> "$KB/zfs.txt"
zfs list -o name,used,avail,refer,mountpoint,compression,compressratio 2>/dev/null >> "$KB/zfs.txt"

echo "ZFS: $(wc -l < "$KB/zfs.txt") lines"

Capture current system state as baseline

# --- System state snapshot ---
echo "=== SYSTEM BASELINE ===" > "$KB/system.txt"

echo -e "\n--- OS ---" >> "$KB/system.txt"
cat /etc/os-release >> "$KB/system.txt" 2>/dev/null

echo -e "\n--- Kernel ---" >> "$KB/system.txt"
uname -a >> "$KB/system.txt"

echo -e "\n--- Network interfaces ---" >> "$KB/system.txt"
ip -br addr >> "$KB/system.txt" 2>/dev/null

echo -e "\n--- WireGuard tunnels ---" >> "$KB/system.txt"
wg show 2>/dev/null >> "$KB/system.txt" || echo "(no WireGuard tunnels active)" >> "$KB/system.txt"

echo -e "\n--- Listening services ---" >> "$KB/system.txt"
ss -tlnp >> "$KB/system.txt" 2>/dev/null

echo -e "\n--- Systemd failed units ---" >> "$KB/system.txt"
systemctl --failed --no-pager >> "$KB/system.txt" 2>/dev/null

echo -e "\n--- Installed kldload packages ---" >> "$KB/system.txt"
rpm -qa 2>/dev/null | grep -i kldload >> "$KB/system.txt" || true

echo -e "\n--- Firewall rules ---" >> "$KB/system.txt"
nft list ruleset 2>/dev/null | head -40 >> "$KB/system.txt"

echo "System: $(wc -l < "$KB/system.txt") lines"

# --- Assemble the full corpus ---
cat "$KB/docs.txt" "$KB/tools.txt" "$KB/zfs.txt" "$KB/system.txt" > "$KB/full-corpus.txt"
echo ""
echo "Total knowledge base: $(wc -l < "$KB/full-corpus.txt") lines, $(du -h "$KB/full-corpus.txt" | cut -f1)"
Think of this as the AI's first day on the job. You hand it the entire operations manual, the tool inventory, and a walkthrough of the current environment. Now it's ready to work.
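Before building the Modelfile, it is worth sanity-checking the corpus: a missing section usually means a scrape step failed silently. A minimal sketch, assuming the section headers and paths written by the build script above:

```shell
#!/bin/bash
# check-corpus.sh - verify the assembled knowledge base looks complete

check_corpus() {
    local corpus="$1"
    local missing=0
    # Every section the build script writes should be present
    for section in "KLDLOAD DOCUMENTATION" "KLDLOAD TOOL REFERENCE" \
                   "ZFS REFERENCE" "SYSTEM BASELINE"; do
        if ! grep -q "=== ${section} ===" "$corpus" 2>/dev/null; then
            echo "MISSING: ${section}"
            missing=1
        fi
    done
    # Leftover HTML entities mean the tag-stripping sed pass needs work
    grep -oE '&[a-z]+;' "$corpus" 2>/dev/null | sort | uniq -c | sort -rn | head -5
    return "$missing"
}

check_corpus /srv/ollama/knowledge-base/full-corpus.txt && echo "corpus looks complete"
```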

2. Create a comprehensive Modelfile

The basic Modelfile from the AI Admin page is a starting point. This one encodes the full knowledge base — every tool, every pattern, every troubleshooting flow your AI needs to know by heart.

The complete infrastructure Modelfile

# /srv/ollama/Modelfile.infra-trained
FROM llama3.1:8b

SYSTEM """
You are the infrastructure expert for this specific kldload-based system.
You have been trained on its documentation, tool reference, ZFS layout,
and network topology. You give precise answers with exact commands.

=== KLDLOAD CLI TOOLS ===
kst             — system status dashboard (pools, datasets, services, memory, ARC)
ksnap           — create/list/rollback ZFS snapshots (wraps zfs snapshot)
ksnap rollback  — rollback a dataset to a previous snapshot
kbe             — ZFSBootMenu boot environment manager
kdf             — disk usage per dataset, sorted, human-readable
kdir            — create ZFS dataset with sane defaults (compression, mountpoint)
kpkg            — package operations from local darksite (offline repo)
kupgrade        — system upgrade with automatic pre-upgrade snapshot
krecovery       — boot into recovery, repair grub/ZFSBootMenu, chroot
kexport         — export VMs/datasets as OVA, QCOW2, raw, or ZFS stream
kvpn            — WireGuard tunnel manager (add peer, generate configs)
kfw             — nftables firewall manager (open/close ports, list rules)

=== ZFS OPERATIONS QUICK REFERENCE ===
Create pool:       zpool create -o ashift=12 rpool mirror /dev/disk/by-id/X /dev/disk/by-id/Y
Create dataset:    kdir -o recordsize=128k -o compression=zstd /srv/data
Snapshot:          ksnap /srv/data          (or: zfs snapshot rpool/srv/data@$(date +%F))
Rollback:          ksnap rollback /srv/data  (or: zfs rollback rpool/srv/data@name)
Send/recv:         zfs send -Rw rpool/srv/data@snap | ssh node2 zfs recv rpool/srv/data
Scrub:             zpool scrub rpool
ARC stats:         cat /proc/spl/kstat/zfs/arcstats | grep -E 'size|hits|misses'
Tune ARC:          echo SIZE > /sys/module/zfs/parameters/zfs_arc_max

=== WIREGUARD PATTERNS ===
Generate keys:     wg genkey | tee /etc/wireguard/private.key | wg pubkey > /etc/wireguard/public.key
Interface up:      wg-quick up wg0
Show status:       wg show
Config location:   /etc/wireguard/wg0.conf
Add peer:          kvpn add-peer --name node2 --endpoint 10.0.0.2:51820
Hub-and-spoke:     one server with AllowedIPs = 10.100.0.0/24, nodes route through it

=== TROUBLESHOOTING FLOWS ===

Pool DEGRADED:
  1. zpool status -v  (identify the faulted device)
  2. ksnap /srv       (snapshot everything first)
  3. zpool online rpool DEVICE  (if transient)
  4. zpool replace rpool OLD_DEVICE NEW_DEVICE  (if hardware failure)
  5. zpool scrub rpool  (verify after replace)

High ARC miss rate:
  1. cat /proc/spl/kstat/zfs/arcstats | grep -E 'hits|misses'
  2. Calculate: hits / (hits + misses) * 100
  3. If below 85%, increase zfs_arc_max
  4. echo $((RAM_BYTES / 2)) > /sys/module/zfs/parameters/zfs_arc_max
  5. Persist: add zfs_arc_max=N to /etc/modprobe.d/zfs.conf

Service won't start:
  1. systemctl status UNIT
  2. journalctl -u UNIT --since '10 min ago' --no-pager
  3. Check config syntax if applicable
  4. systemctl daemon-reload && systemctl restart UNIT

Disk full:
  1. kdf  (find the largest datasets)
  2. ksnap  (check for old snapshots holding space)
  3. zfs list -t snapshot -o name,used -s used  (sort by space used)
  4. zfs destroy rpool/path@old-snapshot  (reclaim space)

Boot failure:
  1. Boot into ZFSBootMenu recovery shell
  2. krecovery  (guided repair)
  3. Or manually: zpool import -fN rpool && zfs mount -a

=== PHILOSOPHY ===
Learn the primitives. ZFS, systemd, nftables, WireGuard — these are the building blocks.
kldload tools are convenience wrappers, not abstractions. Understand what they do underneath.
Always snapshot before changes. Always check 'zpool status' first. Always read the error message.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 16384

# Build the trained model
ollama create infra-trained -f /srv/ollama/Modelfile.infra-trained

# Verify it works
ollama run infra-trained "What does ksnap do and how do I rollback a dataset?"
A generic LLM knows what ZFS is. This model knows YOUR pool is called rpool, YOUR tools start with k, and YOUR rollback command is 'ksnap rollback'. That's the difference between a textbook and a colleague.
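One way to see that difference concretely is to put the same question to the stock model and the trained one, side by side. A small sketch (model names match the ones used above):

```shell
#!/bin/bash
# compare-models.sh - same question to the generic and the trained model

compare_models() {
    local q="$*"
    echo "--- llama3.1:8b (generic) ---"
    ollama run llama3.1:8b "$q"
    echo "--- infra-trained ---"
    ollama run infra-trained "$q"
}

# Usage:
#   compare_models "how do I roll back a dataset?"
```

The generic model describes zfs rollback in the abstract; the trained one should reach for ksnap rollback and your actual dataset names.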

Embedding the full corpus into the system prompt

For larger knowledge bases, generate the Modelfile dynamically so the corpus is always current:

#!/bin/bash
# rebuild-model.sh — regenerate Modelfile with latest knowledge base

KB="/srv/ollama/knowledge-base/full-corpus.txt"

# Truncate to fit the context window: with num_ctx 16384, roughly 24k
# characters (~6k tokens) of system prompt leaves room for question and answer
CORPUS=$(head -c 24000 "$KB")

cat > /srv/ollama/Modelfile.infra-trained <<MODELFILE
FROM llama3.1:8b

SYSTEM """
You are the infrastructure expert for this system. Below is the complete
reference for this environment — docs, tools, ZFS layout, and system state.
Use this to give precise, system-specific answers.

${CORPUS}

When answering:
- Give exact commands, not pseudocode
- Reference the specific pool names, dataset paths, and IPs from the context above
- Always recommend ksnap before destructive operations
- If you don't know something specific, say so — don't guess
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 16384
MODELFILE

# Snapshot before rebuilding (in case the new model is worse)
ksnap /srv/ollama

# Build it
ollama create infra-trained -f /srv/ollama/Modelfile.infra-trained

echo "Model rebuilt at $(date) with $(wc -c < "$KB") bytes of context"
Run this weekly or after major changes. The knowledge base updates, the model updates, and your AI always reflects the current state of the infrastructure.

3. Live context injection

The Modelfile gives the AI permanent knowledge. Live context injection gives it right-now knowledge. Every query includes fresh system data so the model answers based on what's happening this second, not what was true last Tuesday.

The context builder

#!/bin/bash
# /usr/local/bin/kai — query the AI with live system context

build_context() {
    echo "=== LIVE SYSTEM STATE ($(date -Iseconds)) ==="

    echo -e "\n--- kst ---"
    kst 2>/dev/null

    echo -e "\n--- zpool status ---"
    zpool status 2>/dev/null

    echo -e "\n--- ARC stats ---"
    awk '/^size/{print "ARC size: "$3} /^hits/{print "ARC hits: "$3} /^misses/{print "ARC misses: "$3}' \
        /proc/spl/kstat/zfs/arcstats 2>/dev/null

    echo -e "\n--- Memory ---"
    free -h 2>/dev/null

    echo -e "\n--- Journal errors (last hour) ---"
    journalctl -p err --since "1 hour ago" --no-pager -q 2>/dev/null | tail -15

    echo -e "\n--- Failed units ---"
    systemctl --failed --no-pager --no-legend 2>/dev/null

    echo -e "\n--- ZFS dataset usage (top 10) ---"
    zfs list -o name,used,avail,refer -s used 2>/dev/null | tail -10

    echo -e "\n--- WireGuard ---"
    wg show 2>/dev/null | grep -E 'interface|peer|latest handshake|transfer' || echo "(no tunnels)"
}

QUESTION="$*"
if [ -z "$QUESTION" ]; then
    echo "Usage: kai <question>"
    echo "  kai 'is my pool healthy?'"
    echo "  kai 'why is memory usage high?'"
    echo "  kai 'what should I tune?'"
    exit 1
fi

CONTEXT=$(build_context)

echo -e "${CONTEXT}\n\n=== QUESTION ===\n${QUESTION}" | ollama run infra-trained

# Usage — every query sees live data
kai "is my pool healthy?"
kai "my ARC hit rate seems low, what should I change?"
kai "which datasets are using the most space?"
kai "any errors I should worry about?"
kai "generate a WireGuard config for a new peer at 10.0.0.5"
The system prompt is the AI's education. The live context is the patient chart. It doesn't just know medicine — it's looking at YOUR vitals right now.

Targeted context for specific queries

Don't always send everything. For focused questions, send focused context:

# ZFS-specific query — deep pool context
kai-zfs() {
    local CTX=$(zpool status -v 2>/dev/null; echo "---"; \
                zfs list -o name,used,avail,compression,compressratio 2>/dev/null; echo "---"; \
                zpool iostat -v 2>/dev/null)
    echo -e "ZFS context:\n${CTX}\n\nQuestion: $*" | ollama run infra-trained
}

# Network-specific query — WireGuard + firewall context
kai-net() {
    local CTX=$(ip -br addr 2>/dev/null; echo "---"; \
                wg show 2>/dev/null; echo "---"; \
                nft list ruleset 2>/dev/null | head -50; echo "---"; \
                ss -tlnp 2>/dev/null)
    echo -e "Network context:\n${CTX}\n\nQuestion: $*" | ollama run infra-trained
}

# Usage
kai-zfs "should I add an L2ARC device?"
kai-net "is my firewall blocking anything it shouldn't?"
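The same pattern extends to any subsystem. A hedged sketch of a service-focused helper (the unit name and time window are arbitrary choices, not existing kldload conventions):

```shell
# Service-specific query — unit status plus recent journal entries
kai-svc() {
    local UNIT="$1"; shift
    local CTX=$(systemctl status "$UNIT" --no-pager 2>/dev/null; echo "---"; \
                journalctl -u "$UNIT" --since '30 min ago' --no-pager -q 2>/dev/null | tail -30)
    echo -e "Service context for ${UNIT}:\n${CTX}\n\nQuestion: $*" | ollama run infra-trained
}

# Usage:
#   kai-svc nginx "why did this unit restart twice today?"
```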

4. Periodic health reports

A cron job runs the AI against your system state every day. It reads the same data you would, finds the same patterns you would — but it does it at 6 AM while you are still asleep.

Daily AI health report

#!/bin/bash
# /usr/local/bin/kai-report — daily AI infrastructure health report

REPORT_DIR="/var/log/kai-reports"
mkdir -p "$REPORT_DIR"
REPORT="$REPORT_DIR/$(date +%F).txt"

# Build comprehensive system snapshot
SNAPSHOT=$(cat <<SNAP
=== DAILY HEALTH CHECK — $(date) ===
=== HOSTNAME: $(hostname) ===

--- ZFS Pool Status ---
$(zpool status -v 2>/dev/null)

--- ZFS Pool I/O ---
$(zpool iostat -v 2>/dev/null)

--- Dataset Usage ---
$(zfs list -o name,used,avail,refer,compressratio -s used 2>/dev/null)

--- Snapshot Inventory ---
$(zfs list -t snapshot -o name,used,creation -s creation 2>/dev/null | tail -20)

--- ARC Statistics ---
$(awk '/^size/{printf "Size: %d MB\n",$3/1048576}
      /^hits/{h=$3} /^misses/{m=$3}
      END{if(h+m>0) printf "Hit rate: %.1f%%\n",h/(h+m)*100}' \
    /proc/spl/kstat/zfs/arcstats 2>/dev/null)

--- Memory ---
$(free -h 2>/dev/null)

--- Disk I/O (last hour average) ---
$(iostat -xh 1 1 2>/dev/null | tail -20)

--- Journal Errors (last 24h) ---
$(journalctl -p err --since "24 hours ago" --no-pager -q 2>/dev/null | tail -30)

--- Failed Systemd Units ---
$(systemctl --failed --no-pager 2>/dev/null)

--- Last Scrub ---
$(zpool status 2>/dev/null | grep -A2 'scan:')

--- WireGuard Peers ---
$(wg show 2>/dev/null | grep -E 'peer|latest handshake|transfer')

--- Sanoid Snapshot Status ---
$(sanoid --monitor-snapshots 2>/dev/null || echo "(sanoid not installed)")
SNAP
)

# Ask the AI for analysis
ANALYSIS=$(echo "${SNAPSHOT}

Analyze this infrastructure health check. Report:
1. CRITICAL — anything that needs immediate attention
2. WARNINGS — things to watch or address this week
3. TUNING — performance optimizations worth considering
4. STATUS — one-line overall health summary

Be specific. Reference actual values from the data. Give exact commands for any recommended actions." | \
    ollama run infra-trained)

# Write the report
{
    echo "=== AI INFRASTRUCTURE HEALTH REPORT ==="
    echo "=== $(hostname) — $(date) ==="
    echo ""
    echo "$ANALYSIS"
    echo ""
    echo "=== RAW DATA ==="
    echo "$SNAPSHOT"
} > "$REPORT"

# Optional: email the report
if command -v mail &>/dev/null; then
    head -50 "$REPORT" | mail -s "[$(hostname)] AI Health Report — $(date +%F)" root
fi

# Optional: log to systemd journal
echo "$ANALYSIS" | head -5 | logger -t kai-report

echo "Report saved: $REPORT"

Schedule it

# Run every morning at 6 AM
cat > /etc/cron.d/kai-report <<'EOF'
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
0 6 * * * root /usr/local/bin/kai-report
EOF

# Or use a systemd timer for better logging
cat > /etc/systemd/system/kai-report.service <<EOF
[Unit]
Description=AI Infrastructure Health Report

[Service]
Type=oneshot
ExecStart=/usr/local/bin/kai-report
EOF

cat > /etc/systemd/system/kai-report.timer <<EOF
[Unit]
Description=Daily AI Health Report

[Timer]
OnCalendar=*-*-* 06:00:00
Persistent=true

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now kai-report.timer

# Check the last report
cat /var/log/kai-reports/$(date +%F).txt
Your infrastructure gets a daily physical. The AI reads every metric, checks every pool, reviews every error — then leaves a report on your desk before you wake up.
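Once a few weeks of reports accumulate, recurring findings matter more than any single day's. A small sketch that counts repeated CRITICAL lines across reports (assuming the report directory used above):

```shell
#!/bin/bash
# kai-report-trend - surface findings the AI keeps flagging day after day

report_trend() {
    local dir="$1"
    # Identical CRITICAL lines across multiple daily reports are the
    # problems that are not going away on their own
    grep -h "CRITICAL" "$dir"/*.txt 2>/dev/null | sort | uniq -c | sort -rn | head -10
}

report_trend /var/log/kai-reports
```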

5. Fleet training — replicate to every node

One machine builds and trains the model. ZFS replicates it to every node in the fleet. Every server gets the same expert assistant. No repeated setup. No drift.

The master trains, the fleet inherits

#!/bin/bash
# train-and-replicate.sh — build model on master, push to all nodes

MASTER_DATASET="rpool/srv/ollama"
NODES="node-2 node-3 node-4 node-5"

# Step 1: Rebuild the knowledge base and model on the master
/usr/local/bin/build-knowledge-base.sh
/usr/local/bin/rebuild-model.sh

# Step 2: Snapshot the trained state
SNAP="${MASTER_DATASET}@trained-$(date +%F)"
zfs snapshot "$SNAP"
echo "Created snapshot: $SNAP"

# Step 3: Replicate to every node
for node in $NODES; do
    echo "--- Replicating to $node ---"

    # syncoid handles incremental sends automatically
    # Only changed blocks transfer — not the full 8GB model every time
    syncoid --no-sync-snap "$MASTER_DATASET" "root@${node}:${MASTER_DATASET}"

    # Restart ollama on the remote node to pick up the new model
    ssh "root@${node}" "systemctl restart ollama"

    echo "$node: done"
done

echo "Fleet updated at $(date)"
Train once, deploy everywhere. The first sync sends the full model. Every sync after that sends only the delta. Changed 200KB of system prompt? That's a 200KB transfer, not 8GB.
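If you want to see the delta before committing to a sync, zfs send has a dry-run mode. A sketch with illustrative snapshot names (substitute your two most recent @trained-* snapshots):

```shell
# Dry run: report the incremental stream size without sending anything
# Snapshot names below are illustrative
zfs send -nvP -i rpool/srv/ollama@trained-2025-01-05 \
                 rpool/srv/ollama@trained-2025-01-12 2>/dev/null \
    | awk '/^size/ {printf "incremental delta: %.1f MB\n", $2/1048576}'
```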

Per-node context with shared knowledge

The trained model is the same everywhere. But each node injects its own live context:

# The model knows kldload tools, ZFS patterns, and troubleshooting flows (shared)
# The live context shows THIS node's pools, errors, and state (per-node)
# Result: same expert, different patient

# On node-2:
kai "is my pool healthy?"
# → reads node-2's zpool status, node-2's errors, gives node-2's answer

# On node-5:
kai "is my pool healthy?"
# → reads node-5's zpool status, node-5's errors, gives node-5's answer

# Same model. Same expertise. Different data. Different answers.
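From the master, you can also fan the same question out to every node and compare answers in one pass. A sketch, assuming root SSH access and the node names used above:

```shell
# Ask every node the same question; each answers from its own live state
kai-fleet() {
    local q="$*"
    for node in node-2 node-3 node-4 node-5; do
        echo "=== $node ==="
        ssh "root@${node}" kai "$q"
    done
}

# Usage:
#   kai-fleet "is my pool healthy?"
```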

Automate the whole cycle

# Weekly: rebuild knowledge base, retrain model, replicate to fleet
cat > /etc/cron.d/kai-fleet-train <<'EOF'
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
0 3 * * 0 root /usr/local/bin/train-and-replicate.sh >> /var/log/kai-fleet-train.log 2>&1
EOF

# Sunday 3 AM: knowledge base rebuilds, model retrains, fleet syncs
# Monday 6 AM: every node generates its own health report with the latest model
The master node is the teacher. Every Sunday night it learns everything new, then hands updated textbooks to every student Monday morning. The students apply the knowledge to their own homework.

6. Security — everything stays local

No data leaves the machine

Ollama runs the model locally. Your configs, logs, pool layouts, WireGuard keys, error messages — none of it touches an API endpoint. None of it crosses a network boundary. The AI lives on the same box it's monitoring.

Air-gap compatible

Download the model once. Transfer it via USB or zfs send. The trained model works entirely offline. No internet connection required after initial setup. Perfect for classified environments, lab networks, or remote sites.
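One way to do the USB transfer is to serialize the whole ollama dataset to a compressed file. A hedged sketch (dataset and mount paths assume the layout used on this page; gzip chosen only because it is everywhere):

```shell
# On the connected machine: serialize the dataset, raw stream, recursively
export_model() {
    local snap="rpool/srv/ollama@airgap-$(date +%F)"
    zfs snapshot "$snap"
    zfs send -Rw "$snap" | gzip > "$1"
}

# On the air-gapped machine: restore the stream and restart ollama
import_model() {
    gzip -dc "$1" | zfs recv -F rpool/srv/ollama
    systemctl restart ollama
}

# Usage:
#   export_model /mnt/usb/ollama-airgap.zfs.gz    # connected side
#   import_model /mnt/usb/ollama-airgap.zfs.gz    # air-gapped side
```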

Audit everything

Every query and response can be logged locally. /var/log/kai-reports/ holds every health report. /var/log/ai-actions.log tracks any automated actions. Full accountability. Full traceability. Your data, your logs, your control.
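Query logging can be a thin wrapper around kai. A sketch (the KAI_LOG override is a convenience for testing, not part of any existing tool):

```shell
# Log every question and the first line of every answer, then show the answer
kai_logged() {
    local q="$*"
    local log="${KAI_LOG:-/var/log/ai-actions.log}"
    printf '%s QUERY %s\n' "$(date -Iseconds)" "$q" >> "$log"
    local answer
    answer=$(kai "$q")
    printf '%s ANSWER %s\n' "$(date -Iseconds)" "$(echo "$answer" | head -1)" >> "$log"
    echo "$answer"
}
```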

ZFS encryption at rest

Store the model on an encrypted dataset: kdir -o encryption=on -o keyformat=passphrase /srv/ollama. The AI's knowledge base and model weights are encrypted on disk. Power off the machine and the data is unreadable.
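One operational wrinkle: after a reboot the dataset stays locked until its key is loaded, and ollama cannot start without its model files. A sketch of the post-reboot unlock sequence (dataset name matches the layout used here):

```shell
# Run once after boot, before any kai queries
unlock_ollama() {
    # Each step must succeed before the next: no key, no mount, no model
    zfs load-key rpool/srv/ollama && \
    zfs mount rpool/srv/ollama && \
    systemctl start ollama
}

# Usage:
#   unlock_ollama        # prompts for the passphrase
```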

The point is not to replace you. The point is to give you a colleague that has read every man page, memorized every tool flag, and looked at your pool status before you finished pouring your coffee. It's your knowledge, systematized. Your runbooks, automated. Your infrastructure, understood.

Learn the primitives. Then teach them to a machine.