Train AI on Your Infrastructure — turn a generic LLM into YOUR sysadmin.
The AI Admin Assistant page showed you how to install Ollama and create a basic infrastructure model. This page goes deeper. You will scrape your entire knowledge base — docs, configs, man pages, tool output — into a single context corpus, build a comprehensive Modelfile that encodes everything about your environment, inject live system state into every query, generate daily health reports, and replicate the trained model across your fleet.
The goal: a local LLM that doesn't just know Linux — it knows your Linux. Your pool layout. Your dataset hierarchy. Your WireGuard topology. Your tool flags. Generic models give generic answers. Trained models give your answers.
1. Build the knowledge base
Before the model can know your infrastructure, you need to collect everything it should know into plain text. Docs, tool usage, man pages, current state — all of it.
Scrape the kldload documentation
Extract text from every HTML doc on the system. Strip tags. Keep structure.
#!/bin/bash
# build-knowledge-base.sh — collect everything the AI needs to know
KB="/srv/ollama/knowledge-base"
mkdir -p "$KB"
# --- kldload HTML docs ---
# If you have local docs (from the ISO or the website), extract them
echo "=== KLDLOAD DOCUMENTATION ===" > "$KB/docs.txt"
for f in /usr/local/share/kldload-webui/free/*.html \
/usr/share/doc/kldload/*.html; do
[ -f "$f" ] || continue
echo -e "\n--- $(basename "$f") ---"
# Strip HTML tags, collapse whitespace, keep meaningful text
sed 's/<[^>]*>//g; s/&lt;/</g; s/&gt;/>/g; s/&mdash;/—/g; s/&amp;/\&/g' "$f" \
| tr -s '[:space:]' ' ' \
| fold -s -w 120
done >> "$KB/docs.txt"
echo "Docs: $(wc -l < "$KB/docs.txt") lines"
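As a sanity check, the tag-stripping pipeline behaves like this on a small sample. This is a sketch: strip_html is a demo helper, not a kldload tool, and the entity rules are ordered so that &amp;amp; is decoded last (otherwise "&amp;amp;lt;" would wrongly collapse all the way to "<"):

```shell
# Strip tags, then decode common entities; &amp; is handled last so
# double-encoded sequences like "&amp;lt;" decode to a literal "&lt;".
strip_html() {
  sed 's/<[^>]*>//g; s/&lt;/</g; s/&gt;/>/g; s/&amp;/\&/g' \
    | tr -s '[:space:]' ' '
}
printf '<p>zfs &amp; zpool</p>' | strip_html
# → zfs & zpool
```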
Capture every kldload tool's usage
The model needs to know what each tool does, what flags it accepts, and what output to expect.
# --- Tool help output ---
echo "=== KLDLOAD TOOL REFERENCE ===" > "$KB/tools.txt"
for tool in kst ksnap kbe kdf kdir kpkg kupgrade krecovery kexport kvpn kfw; do
if command -v "$tool" &>/dev/null; then
echo -e "\n=== $tool ==="
echo "--- $tool --help ---"
$tool --help 2>&1 || true
echo ""
fi
done >> "$KB/tools.txt"
# Capture a live kst output as an example of what "healthy" looks like
echo -e "\n=== EXAMPLE: kst output on a healthy system ===" >> "$KB/tools.txt"
kst >> "$KB/tools.txt" 2>&1 || true
echo "Tools: $(wc -l < "$KB/tools.txt") lines"
Dump ZFS man page summaries
# --- ZFS reference ---
echo "=== ZFS REFERENCE ===" > "$KB/zfs.txt"
# Core man pages — extract the SYNOPSIS and DESCRIPTION sections
for page in zfs zpool zfs-send zfs-recv zfs-snapshot zfs-clone zfs-destroy \
zfs-set zfs-mount zfs-share zpoolprops zfsprops; do
if man -w "$page" &>/dev/null; then
echo -e "\n=== man $page ==="
man "$page" 2>/dev/null | col -bx | \
sed -n '/^NAME/,/^[A-Z]/p; /^SYNOPSIS/,/^[A-Z]/p; /^DESCRIPTION/,/^[A-Z]/p' | \
head -80
fi
done >> "$KB/zfs.txt"
# ZFS properties quick reference
echo -e "\n=== ZFS DATASET PROPERTIES ===" >> "$KB/zfs.txt"
zfs get all rpool 2>/dev/null | head -50 >> "$KB/zfs.txt"
# Pool layout
echo -e "\n=== POOL LAYOUT ===" >> "$KB/zfs.txt"
zpool status 2>/dev/null >> "$KB/zfs.txt"
zfs list -o name,used,avail,refer,mountpoint,compression,compressratio 2>/dev/null >> "$KB/zfs.txt"
echo "ZFS: $(wc -l < "$KB/zfs.txt") lines"
Capture current system state as baseline
# --- System state snapshot ---
echo "=== SYSTEM BASELINE ===" > "$KB/system.txt"
echo -e "\n--- OS ---" >> "$KB/system.txt"
cat /etc/os-release >> "$KB/system.txt" 2>/dev/null
echo -e "\n--- Kernel ---" >> "$KB/system.txt"
uname -a >> "$KB/system.txt"
echo -e "\n--- Network interfaces ---" >> "$KB/system.txt"
ip -br addr >> "$KB/system.txt" 2>/dev/null
echo -e "\n--- WireGuard tunnels ---" >> "$KB/system.txt"
wg show 2>/dev/null >> "$KB/system.txt" || echo "(no WireGuard tunnels active)" >> "$KB/system.txt"
echo -e "\n--- Listening services ---" >> "$KB/system.txt"
ss -tlnp >> "$KB/system.txt" 2>/dev/null
echo -e "\n--- Systemd failed units ---" >> "$KB/system.txt"
systemctl --failed --no-pager >> "$KB/system.txt" 2>/dev/null
echo -e "\n--- Installed kldload packages ---" >> "$KB/system.txt"
rpm -qa 2>/dev/null | grep -i kldload >> "$KB/system.txt" || true
echo -e "\n--- Firewall rules ---" >> "$KB/system.txt"
nft list ruleset 2>/dev/null | head -40 >> "$KB/system.txt"
echo "System: $(wc -l < "$KB/system.txt") lines"
# --- Assemble the full corpus ---
cat "$KB/docs.txt" "$KB/tools.txt" "$KB/zfs.txt" "$KB/system.txt" > "$KB/full-corpus.txt"
echo ""
echo "Total knowledge base: $(wc -l < "$KB/full-corpus.txt") lines, $(du -h "$KB/full-corpus.txt" | cut -f1)"
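Line and byte counts only tell half the story: what matters is whether the corpus fits the model's context window. A rough estimate of bytes/4 tokens (a common heuristic for English prose, and an approximation only) is enough to decide. corpus_tokens below is a demo helper, not a kldload tool:

```shell
# Estimate tokens as bytes/4, a rough heuristic for English text.
corpus_tokens() {
  echo $(( $(wc -c < "$1") / 4 ))
}
# Example with a tiny 25-byte sample file:
printf '%s' "zpool status -v; zfs list" > /tmp/sample-corpus.txt
corpus_tokens /tmp/sample-corpus.txt
# → 6
```

Run it against "$KB/full-corpus.txt" and compare the result to the num_ctx value you plan to set in the Modelfile.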
2. Create a comprehensive Modelfile
The basic Modelfile from the AI Admin page is a starting point. This one encodes the full knowledge base — every tool, every pattern, every troubleshooting flow your AI needs to know by heart.
The complete infrastructure Modelfile
# /srv/ollama/Modelfile.infra-trained
FROM llama3.1:8b
SYSTEM """
You are the infrastructure expert for this specific kldload-based system.
You have been trained on its documentation, tool reference, ZFS layout,
and network topology. You give precise answers with exact commands.
=== KLDLOAD CLI TOOLS ===
kst — system status dashboard (pools, datasets, services, memory, ARC)
ksnap — create/list/rollback ZFS snapshots (wraps zfs snapshot)
ksnap rollback — rollback a dataset to a previous snapshot
kbe — ZFSBootMenu boot environment manager
kdf — disk usage per dataset, sorted, human-readable
kdir — create ZFS dataset with sane defaults (compression, mountpoint)
kpkg — package operations from local darksite (offline repo)
kupgrade — system upgrade with automatic pre-upgrade snapshot
krecovery — boot into recovery, repair grub/ZFSBootMenu, chroot
kexport — export VMs/datasets as OVA, QCOW2, raw, or ZFS stream
kvpn — WireGuard tunnel manager (add peer, generate configs)
kfw — nftables firewall manager (open/close ports, list rules)
=== ZFS OPERATIONS QUICK REFERENCE ===
Create pool: zpool create -o ashift=12 rpool mirror /dev/disk/by-id/X /dev/disk/by-id/Y
Create dataset: kdir -o recordsize=128k -o compression=zstd /srv/data
Snapshot: ksnap /srv/data (or: zfs snapshot rpool/srv/data@$(date +%F))
Rollback: ksnap rollback /srv/data (or: zfs rollback rpool/srv/data@name)
Send/recv: zfs send -Rw rpool/srv/data@snap | ssh node2 zfs recv rpool/srv/data
Scrub: zpool scrub rpool
ARC stats: cat /proc/spl/kstat/zfs/arcstats | grep -E 'size|hits|misses'
Tune ARC: echo SIZE > /sys/module/zfs/parameters/zfs_arc_max
=== WIREGUARD PATTERNS ===
Generate keys: wg genkey | tee /etc/wireguard/private.key | wg pubkey > /etc/wireguard/public.key
Interface up: wg-quick up wg0
Show status: wg show
Config location: /etc/wireguard/wg0.conf
Add peer: kvpn add-peer --name node2 --endpoint 10.0.0.2:51820
Hub-and-spoke: one server with AllowedIPs = 10.100.0.0/24, nodes route through it
=== TROUBLESHOOTING FLOWS ===
Pool DEGRADED:
1. zpool status -v (identify the faulted device)
2. ksnap /srv (snapshot everything first)
3. zpool online rpool DEVICE (if transient)
4. zpool replace rpool OLD_DEVICE NEW_DEVICE (if hardware failure)
5. zpool scrub rpool (verify after replace)
High ARC miss rate:
1. cat /proc/spl/kstat/zfs/arcstats | grep -E 'hits|misses'
2. Calculate: hits / (hits + misses) * 100
3. If below 85%, increase zfs_arc_max
4. echo $((RAM_BYTES / 2)) > /sys/module/zfs/parameters/zfs_arc_max
5. Persist: add zfs_arc_max=N to /etc/modprobe.d/zfs.conf
Service won't start:
1. systemctl status UNIT
2. journalctl -u UNIT --since '10 min ago' --no-pager
3. Check config syntax if applicable
4. systemctl daemon-reload && systemctl restart UNIT
Disk full:
1. kdf (find the largest datasets)
2. ksnap (check for old snapshots holding space)
3. zfs list -t snapshot -o name,used -s used (sort by space used)
4. zfs destroy rpool/path@old-snapshot (reclaim space)
Boot failure:
1. Boot into ZFSBootMenu recovery shell
2. krecovery (guided repair)
3. Or manually: zpool import -fN rpool && zfs mount -a
=== PHILOSOPHY ===
Learn the primitives. ZFS, systemd, nftables, WireGuard — these are the building blocks.
kldload tools are convenience wrappers, not abstractions. Understand what they do underneath.
Always snapshot before changes. Always check 'zpool status' first. Always read the error message.
"""
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
# Build the trained model
ollama create infra-trained -f /srv/ollama/Modelfile.infra-trained
# Verify it works
ollama run infra-trained "What does ksnap do and how do I rollback a dataset?"
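The hit-rate arithmetic from the "High ARC miss rate" flow above can be checked with a one-liner. arc_hit_rate is a demo helper that reads arcstats-formatted lines (name, type, value columns) from stdin:

```shell
# ARC hit rate = hits / (hits + misses) * 100, from arcstats columns.
arc_hit_rate() {
  awk '$1=="hits"{h=$3} $1=="misses"{m=$3}
       END{ if (h+m > 0) printf "%.1f\n", h/(h+m)*100 }'
}
# On a live system: arc_hit_rate < /proc/spl/kstat/zfs/arcstats
printf 'hits 4 900\nmisses 4 100\n' | arc_hit_rate
# → 90.0
```

Anything below the 85% threshold from the troubleshooting flow is your cue to look at zfs_arc_max.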
Embedding the full corpus into the system prompt
For larger knowledge bases, generate the Modelfile dynamically so the corpus is always current:
#!/bin/bash
# rebuild-model.sh — regenerate Modelfile with latest knowledge base
KB="/srv/ollama/knowledge-base/full-corpus.txt"
# Truncate to fit the context window (num_ctx 16384; ~24k chars of system
# prompt is roughly 6k tokens, leaving headroom for the question and live context)
CORPUS=$(head -c 24000 "$KB")
{
cat <<'HEADER'
FROM llama3.1:8b
SYSTEM """
You are the infrastructure expert for this system. Below is the complete
reference for this environment — docs, tools, ZFS layout, and system state.
Use this to give precise, system-specific answers.
HEADER
# printf keeps the corpus literal; an unquoted heredoc would expand any
# stray $ or backticks that scraped command output happens to contain
printf '%s\n' "$CORPUS"
cat <<'FOOTER'
When answering:
- Give exact commands, not pseudocode
- Reference the specific pool names, dataset paths, and IPs from the context above
- Always recommend ksnap before destructive operations
- If you don't know something specific, say so — don't guess
"""
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
FOOTER
} > /srv/ollama/Modelfile.infra-trained
# Snapshot before rebuilding (in case the new model is worse)
ksnap /srv/ollama
# Build it
ollama create infra-trained -f /srv/ollama/Modelfile.infra-trained
echo "Model rebuilt at $(date) with $(wc -c < "$KB") bytes of context"
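One caveat: head -c can cut the corpus mid-line (or mid-multibyte-character). If that matters, truncate on a line boundary instead. truncate_lines below is a sketch, not part of the scripts above:

```shell
# Print whole lines until the byte budget would be exceeded
# (length() counts characters; close enough for mostly-ASCII corpora).
truncate_lines() {
  awk -v max="$1" '{ n += length($0) + 1; if (n > max) exit; print }'
}
# With an 8-byte budget, "three" would push past the limit and is dropped:
printf 'one\ntwo\nthree\n' | truncate_lines 8
```

Swap it in for head -c with something like `CORPUS=$(truncate_lines 24000 < "$KB")`.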
3. Live context injection
The Modelfile gives the AI permanent knowledge. Live context injection gives it right-now knowledge. Every query includes fresh system data, so the model answers based on what is happening this second, not what was true last Tuesday.
The context builder
#!/bin/bash
# /usr/local/bin/kai — query the AI with live system context
build_context() {
echo "=== LIVE SYSTEM STATE ($(date -Iseconds)) ==="
echo -e "\n--- kst ---"
kst 2>/dev/null
echo -e "\n--- zpool status ---"
zpool status 2>/dev/null
echo -e "\n--- ARC stats ---"
awk '/^size/{print "ARC size: "$3} /^hits/{print "ARC hits: "$3} /^misses/{print "ARC misses: "$3}' \
/proc/spl/kstat/zfs/arcstats 2>/dev/null
echo -e "\n--- Memory ---"
free -h 2>/dev/null
echo -e "\n--- Journal errors (last hour) ---"
journalctl -p err --since "1 hour ago" --no-pager -q 2>/dev/null | tail -15
echo -e "\n--- Failed units ---"
systemctl --failed --no-pager --no-legend 2>/dev/null
echo -e "\n--- ZFS dataset usage (top 10) ---"
zfs list -o name,used,avail,refer -s used 2>/dev/null | tail -10
echo -e "\n--- WireGuard ---"
wg show 2>/dev/null | grep -E 'interface|peer|latest handshake|transfer' || echo "(no tunnels)"
}
QUESTION="$*"
if [ -z "$QUESTION" ]; then
echo "Usage: kai <question>"
echo " kai 'is my pool healthy?'"
echo " kai 'why is memory usage high?'"
echo " kai 'what should I tune?'"
exit 1
fi
CONTEXT=$(build_context)
echo -e "${CONTEXT}\n\n=== QUESTION ===\n${QUESTION}" | ollama run infra-trained
# Usage — every query sees live data
kai "is my pool healthy?"
kai "my ARC hit rate seems low, what should I change?"
kai "which datasets are using the most space?"
kai "any errors I should worry about?"
kai "generate a WireGuard config for a new peer at 10.0.0.5"
Targeted context for specific queries
Don't always send everything. For focused questions, send focused context:
# ZFS-specific query — deep pool context
kai-zfs() {
local CTX=$(zpool status -v 2>/dev/null; echo "---"; \
zfs list -o name,used,avail,compression,compressratio 2>/dev/null; echo "---"; \
zpool iostat -v 2>/dev/null)
echo -e "ZFS context:\n${CTX}\n\nQuestion: $*" | ollama run infra-trained
}
# Network-specific query — WireGuard + firewall context
kai-net() {
local CTX=$(ip -br addr 2>/dev/null; echo "---"; \
wg show 2>/dev/null; echo "---"; \
nft list ruleset 2>/dev/null | head -50; echo "---"; \
ss -tlnp 2>/dev/null)
echo -e "Network context:\n${CTX}\n\nQuestion: $*" | ollama run infra-trained
}
# Usage
kai-zfs "should I add an L2ARC device?"
kai-net "is my firewall blocking anything it shouldn't?"
4. Periodic health reports
A cron job runs the AI against your system state every day. It reads the same data you would, finds the same patterns you would — but it does it at 6 AM while you are still asleep.
Daily AI health report
#!/bin/bash
# /usr/local/bin/kai-report — daily AI infrastructure health report
REPORT_DIR="/var/log/kai-reports"
mkdir -p "$REPORT_DIR"
REPORT="$REPORT_DIR/$(date +%F).txt"
# Build comprehensive system snapshot
SNAPSHOT=$(cat <<SNAP
=== DAILY HEALTH CHECK — $(date) ===
=== HOSTNAME: $(hostname) ===
--- ZFS Pool Status ---
$(zpool status -v 2>/dev/null)
--- ZFS Pool I/O ---
$(zpool iostat -v 2>/dev/null)
--- Dataset Usage ---
$(zfs list -o name,used,avail,refer,compressratio -s used 2>/dev/null)
--- Snapshot Inventory ---
$(zfs list -t snapshot -o name,used,creation -s creation 2>/dev/null | tail -20)
--- ARC Statistics ---
$(awk '/^size/{printf "Size: %d MB\n",$3/1048576}
/^hits/{h=$3} /^misses/{m=$3}
END{if(h+m>0) printf "Hit rate: %.1f%%\n",h/(h+m)*100}' \
/proc/spl/kstat/zfs/arcstats 2>/dev/null)
--- Memory ---
$(free -h 2>/dev/null)
--- Disk I/O (averages since boot) ---
$(iostat -xh 1 1 2>/dev/null | tail -20)
--- Journal Errors (last 24h) ---
$(journalctl -p err --since "24 hours ago" --no-pager -q 2>/dev/null | tail -30)
--- Failed Systemd Units ---
$(systemctl --failed --no-pager 2>/dev/null)
--- Last Scrub ---
$(zpool status 2>/dev/null | grep -A2 'scan:')
--- WireGuard Peers ---
$(wg show 2>/dev/null | grep -E 'peer|latest handshake|transfer')
--- Sanoid Snapshot Status ---
$(sanoid --monitor-snapshots 2>/dev/null || echo "(sanoid not installed)")
SNAP
)
# Ask the AI for analysis
ANALYSIS=$(echo "${SNAPSHOT}
Analyze this infrastructure health check. Report:
1. CRITICAL — anything that needs immediate attention
2. WARNINGS — things to watch or address this week
3. TUNING — performance optimizations worth considering
4. STATUS — one-line overall health summary
Be specific. Reference actual values from the data. Give exact commands for any recommended actions." | \
ollama run infra-trained)
# Write the report
{
echo "=== AI INFRASTRUCTURE HEALTH REPORT ==="
echo "=== $(hostname) — $(date) ==="
echo ""
echo "$ANALYSIS"
echo ""
echo "=== RAW DATA ==="
echo "$SNAPSHOT"
} > "$REPORT"
# Optional: email the report
if command -v mail &>/dev/null; then
head -50 "$REPORT" | mail -s "[$(hostname)] AI Health Report — $(date +%F)" root
fi
# Optional: log to systemd journal
echo "$ANALYSIS" | head -5 | logger -t kai-report
echo "Report saved: $REPORT"
Schedule it
# Run every morning at 6 AM
cat > /etc/cron.d/kai-report <<'EOF'
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
0 6 * * * root /usr/local/bin/kai-report
EOF
# Or use a systemd timer for better logging
cat > /etc/systemd/system/kai-report.service <<EOF
[Unit]
Description=AI Infrastructure Health Report
[Service]
Type=oneshot
ExecStart=/usr/local/bin/kai-report
EOF
cat > /etc/systemd/system/kai-report.timer <<EOF
[Unit]
Description=Daily AI Health Report
[Timer]
OnCalendar=*-*-* 06:00:00
Persistent=true
[Install]
WantedBy=timers.target
EOF
systemctl daemon-reload
systemctl enable --now kai-report.timer
# Check the last report
cat /var/log/kai-reports/$(date +%F).txt
5. Fleet training — replicate to every node
One machine builds and trains the model. ZFS replicates it to every node in the fleet. Every server gets the same expert assistant. No repeated setup. No drift.
The master trains, the fleet inherits
#!/bin/bash
# train-and-replicate.sh — build model on master, push to all nodes
MASTER_DATASET="rpool/srv/ollama"
NODES="node-2 node-3 node-4 node-5"
# Step 1: Rebuild the knowledge base and model on the master
/usr/local/bin/build-knowledge-base.sh
/usr/local/bin/rebuild-model.sh
# Step 2: Snapshot the trained state
SNAP="${MASTER_DATASET}@trained-$(date +%F)"
zfs snapshot "$SNAP"
echo "Created snapshot: $SNAP"
# Step 3: Replicate to every node
for node in $NODES; do
echo "--- Replicating to $node ---"
# syncoid handles incremental sends automatically
# Only changed blocks transfer — not the full 8GB model every time
syncoid --no-sync-snap "$MASTER_DATASET" "root@${node}:${MASTER_DATASET}"
# Restart ollama on the remote node to pick up the new model
ssh "root@${node}" "systemctl restart ollama"
echo "$node: done"
done
echo "Fleet updated at $(date)"
Per-node context with shared knowledge
The trained model is the same everywhere. But each node injects its own live context:
# The model knows kldload tools, ZFS patterns, and troubleshooting flows (shared)
# The live context shows THIS node's pools, errors, and state (per-node)
# Result: same expert, different patient
# On node-2:
kai "is my pool healthy?"
# → reads node-2's zpool status, node-2's errors, gives node-2's answer
# On node-5:
kai "is my pool healthy?"
# → reads node-5's zpool status, node-5's errors, gives node-5's answer
# Same model. Same expertise. Different data. Different answers.
Automate the whole cycle
# Weekly: rebuild knowledge base, retrain model, replicate to fleet
cat > /etc/cron.d/kai-fleet-train <<'EOF'
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
0 3 * * 0 root /usr/local/bin/train-and-replicate.sh >> /var/log/kai-fleet-train.log 2>&1
EOF
# Sunday 3 AM: knowledge base rebuilds, model retrains, fleet syncs
# Monday 6 AM: every node generates its own health report with the latest model
6. Security — everything stays local
No data leaves the machine
Ollama runs the model locally. Your configs, logs, pool layouts, WireGuard keys, error messages — none of it touches an API endpoint. None of it crosses a network boundary. The AI lives on the same box it's monitoring.
Air-gap compatible
Download the model once. Transfer it via USB or zfs send.
The trained model works entirely offline.
No internet connection required after initial setup.
Perfect for classified environments, lab networks, or remote sites.
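The USB transfer can be sketched with a plain archive. The paths below are demo values under /tmp; on a real system, a raw compressed send of the dataset (`zfs send -cw rpool/srv/ollama@snap > /mnt/usb/model.zfs`) preserves properties and encryption, but needs a ZFS pool on both ends, which is why a tar fallback is worth knowing:

```shell
# Demo: pack a model directory for sneakernet transfer (paths are examples).
SRC="/tmp/demo-ollama"
mkdir -p "$SRC" && echo "weights" > "$SRC/model.bin"
tar -czf /tmp/demo-ollama.tar.gz -C /tmp demo-ollama
# On the air-gapped machine: tar -xzf demo-ollama.tar.gz -C /srv/
tar -tzf /tmp/demo-ollama.tar.gz
```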
Audit everything
Every query and response can be logged locally.
/var/log/kai-reports/ holds every health report.
/var/log/ai-actions.log tracks any automated actions.
Full accountability. Full traceability. Your data, your logs, your control.
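A thin wrapper is enough to get that query log. The log path and function name below are illustrative, and the actual ollama call is omitted; in the kai script, call log_query right before the pipe into ollama run:

```shell
KAI_LOG="/tmp/kai-queries.log"   # use /var/log/kai-queries.log in production
log_query() {
  # One line per query: ISO timestamp, separator, question text.
  printf '%s | %s\n' "$(date -Iseconds)" "$*" >> "$KAI_LOG"
}
log_query "is my pool healthy?"
tail -1 "$KAI_LOG"
```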
ZFS encryption at rest
Store the model on an encrypted dataset:
kdir -o encryption=on -o keyformat=passphrase /srv/ollama
The AI's knowledge base and model weights are encrypted on disk.
Power off the machine and the data is unreadable.
The point is not to replace you. The point is to give you a colleague that has read every man page, memorized every tool flag, and looked at your pool status before you finished pouring your coffee. It's your knowledge, systematized. Your runbooks, automated. Your infrastructure, understood.
Learn the primitives. Then teach them to a machine.