AI Voice & Vision — talk to your infrastructure.
Voice control is built into the AI profile and works out of the box.
The stack: whisper.cpp for speech-to-text, Ollama for AI reasoning,
all running locally. No cloud APIs, no subscriptions, no data leaving your network.
Two commands do the work: kai-voice records from your mic, transcribes, and sends to the AI.
kai-do goes further — it generates real commands from your voice and executes them on remote hosts.
How it actually works — real demo
You say "create a new VM". Here is exactly what happens:
kai-do in action: voice to Proxmox VM
# You run kai-do pointed at your Proxmox host
$ kai-do pve1.lab
# You speak: "Create a new VM with 4 cores and 8 gigs of RAM"
# whisper.cpp transcribes → "create a new VM with 4 cores and 8 gigs of RAM"
# Ollama generates the command:
qm create 200 --name ai-vm --cores 4 --memory 8192 \
    --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-single \
    --scsi0 local-zfs:32 --boot order=scsi0
# kai-do shows you the command and asks: execute? [y/n]
# You confirm → it runs via SSH on pve1.lab
# Done. VM 200 created.
kai-voice — ask questions, get answers
# Start kai-voice — records from mic, transcribes, sends to AI
$ kai-voice
# You speak: "What's my pool status?"
# whisper transcribes → AI checks context → responds:
# "Pool rpool is ONLINE, 42% capacity, last scrub 3 days ago, no errors."
# kai-do — generate and execute commands from voice on remote hosts
$ kai-do proxmox-node
$ kai-do webserver.internal
$ kai-do 192.168.1.50
# You speak: "Show me the ZFS snapshots older than 30 days"
# AI generates: zfs list -t snapshot -o name,creation -s creation | ...
# Executes on the remote host after your confirmation
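kai-do's core loop is the kai-voice pipeline plus a generate, confirm, execute step. A minimal sketch of that step, assuming hypothetical helper names and the kai model (the real script may differ):

```shell
#!/bin/bash
# Sketch of kai-do's generate -> confirm -> execute step.
# Helper names below are assumptions, not the real implementation.
HOST="${1:-}"

# Models often wrap commands in markdown fences; keep only the raw command.
strip_fences() { sed -e 's/^```.*$//' -e '/^[[:space:]]*$/d'; }

# Ask the model for exactly one shell command for the transcribed request.
gen_command() {
    printf 'Reply with exactly one shell command, nothing else: %s\n' "$1" |
        ollama run kai 2>/dev/null | strip_fences | head -n 1
}

# Show the command, ask for confirmation, then run it over SSH.
run_remote() {
    printf 'Command: %s\nExecute on %s? [y/n] ' "$1" "$HOST"
    read -r ans
    [ "$ans" = "y" ] && ssh "$HOST" "$1"
}
```

The confirmation prompt is the safety valve: nothing the model generates touches a remote host until you approve it.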
Walk into a server room. Say "what's my pool status?" out loud. Hear the answer through the speaker. No keyboard, no screen, no SSH session. Whisper converts your speech to text. Ollama processes it. Piper speaks the answer back. The entire pipeline runs on the machine in front of you. Nothing leaves the network.
Then there's vision. Point a camera at a server screen. Send the photo to LLaVA. Get a diagnosis. No OCR hacks, no screen-sharing tools — an actual vision model that reads what it sees and reasons about it.
1. Install Whisper (speech-to-text)
Whisper is OpenAI's speech recognition model. whisper.cpp is the C/C++ port
that runs on CPU without Python, without PyTorch, without 40 GB of VRAM. It loads a
model file, takes audio in, and outputs text. That's it.
Build whisper.cpp from source
# Clone and build
cd /opt
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release -j$(nproc)
# Download the base.en model — fast, accurate enough for commands
# This is what kai-voice uses by default
bash models/download-ggml-model.sh base.en
# Optional: medium.en for better accuracy on longer speech
bash models/download-ggml-model.sh medium.en
# Test it — record 5 seconds from your mic and transcribe
arecord -d 5 -f S16_LE -r 16000 -c 1 /tmp/test.wav
./build/bin/whisper-cli -m models/ggml-base.en.bin -f /tmp/test.wav
# That's it. kai-voice and kai-do use this binary + model directly.
# For offline/air-gapped machines, copy the model file:
mkdir -p /srv/whisper/models
cp models/ggml-base.en.bin /srv/whisper/models/
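If you want to call Whisper from other scripts, a tiny wrapper keeps the flags in one place. This transcribe helper is my own sketch, not something whisper.cpp ships; paths match the layout above:

```shell
#!/bin/bash
# transcribe <wav> — print the transcript of a 16 kHz mono WAV.
# Override the paths via environment if your layout differs.
WHISPER_BIN="${WHISPER_BIN:-/opt/whisper.cpp/build/bin/whisper-cli}"
WHISPER_MODEL="${WHISPER_MODEL:-/srv/whisper/models/ggml-base.en.bin}"

transcribe() {
    # -nt drops timestamps; sed trims the leading whitespace whisper emits
    "$WHISPER_BIN" -m "$WHISPER_MODEL" -f "$1" -nt 2>/dev/null |
        sed 's/^[[:space:]]*//'
}

if [ $# -ge 1 ]; then transcribe "$1"; fi
```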
2. Install Piper (text-to-speech)
Piper converts text to spoken audio. It runs locally, produces natural-sounding speech, and processes in real time on CPU. Download a voice model, pipe text in, get WAV out.
Set up Piper TTS
# Install Piper
curl -L https://github.com/rhasspy/piper/releases/latest/download/piper_linux_x86_64.tar.gz | \
tar xz -C /opt/
# Download a voice (en_US-lessac-medium is clear and natural)
mkdir -p /srv/piper/voices
cd /srv/piper/voices
curl -LO https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
curl -LO https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json
# Test it
echo "Your ZFS pool is healthy. ARC hit rate is 94 percent." | \
/opt/piper/piper --model /srv/piper/voices/en_US-lessac-medium.onnx --output_file /tmp/test.wav
aplay /tmp/test.wav
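A small wrapper makes Piper usable from any script. kai-say here is my own naming, not something Piper ships; paths match the setup above:

```shell
#!/bin/bash
# kai-say — speak its arguments through Piper.
PIPER_BIN="${PIPER_BIN:-/opt/piper/piper}"
PIPER_VOICE="${PIPER_VOICE:-/srv/piper/voices/en_US-lessac-medium.onnx}"

tts() {  # text on stdin -> WAV file named by $1
    "$PIPER_BIN" --model "$PIPER_VOICE" --output_file "$1" 2>/dev/null
}

say() {
    local wav
    wav=$(mktemp --suffix=.wav)
    printf '%s\n' "$*" | tts "$wav"
    aplay -q "$wav" 2>/dev/null
    rm -f "$wav"
}

if [ $# -ge 1 ]; then say "$@"; fi
```

Something like kai-say "Backup complete." then fits naturally at the end of cron jobs and alert hooks.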
3. PipeWire audio capture
PipeWire replaced PulseAudio on modern Linux. It handles audio routing — mic input, speaker output, and everything in between. You need it for hands-free operation.
Configure PipeWire for voice pipeline
# Verify PipeWire is running
systemctl --user status pipewire pipewire-pulse wireplumber
# List audio sources (find your mic)
pw-cli list-objects | grep -A2 'node.name.*input'
pactl list sources short
# Record from the default mic (no --target needed; PipeWire routes to the default source)
pw-record --format=s16 --rate=16000 --channels=1 /tmp/mic.wav
# For headless servers, use a USB audio adapter
# PipeWire auto-detects USB audio devices
lsusb | grep -i audio
pw-cli list-objects | grep -A4 'media.class.*Audio'
# Set default source (mic) and sink (speaker)
wpctl set-default <SOURCE_ID> # for mic input
wpctl set-default <SINK_ID> # for speaker output
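On a headless box you usually want the USB mic to become the default source automatically at boot. A sketch; the text parsing is an assumption about wpctl's output format, so check `wpctl status` on your system first:

```shell
#!/bin/bash
# Pick the first USB capture device in `wpctl status` and make it the
# default source. The grep/sed patterns below are assumptions about
# wpctl's human-readable output, not a stable interface.

extract_node_id() {  # pull the leading "NN." node id out of a wpctl status line
    sed -n 's/[^0-9]*\([0-9]\{1,\}\)\..*/\1/p' | head -n 1
}

usb_source_id() {
    wpctl status | sed -n '/Sources:/,/Filters:/p' | grep -i usb | extract_node_id
}

id=$(usb_source_id)
if [ -n "$id" ]; then
    wpctl set-default "$id"
    echo "Default source set to node $id"
else
    echo "No USB capture device found" >&2
fi
```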
4. The full voice pipeline: kai-voice
This is the script that ties everything together. Mic → Whisper → Ollama → Piper → Speaker. Hands-free, continuous loop. Say something, hear the answer, say something else. It captures audio when you speak, transcribes with Whisper, sends the text to kai (your kldload AI), then speaks the response through Piper.
kai-voice — hands-free AI loop
#!/bin/bash
# /usr/local/bin/kai-voice — talk to your infrastructure
WHISPER_BIN="/opt/whisper.cpp/build/bin/whisper-cli"
WHISPER_MODEL="/srv/whisper/models/ggml-base.en.bin"   # swap in medium.en for better accuracy
PIPER_BIN="/opt/piper/piper"
PIPER_VOICE="/srv/piper/voices/en_US-lessac-medium.onnx"
OLLAMA_MODEL="kai"        # your kldload AI model
RECORD_SEC="${1:-5}"      # seconds to listen (default 5)

TMPDIR=$(mktemp -d /tmp/kai-voice.XXXXXX)
cleanup() { rm -rf "$TMPDIR"; }
trap cleanup EXIT

speak() {
    echo "$1" | "$PIPER_BIN" --model "$PIPER_VOICE" --output_file "$TMPDIR/response.wav" 2>/dev/null
    aplay -q "$TMPDIR/response.wav" 2>/dev/null
}

echo "kai-voice: listening. Speak after the beep. Say 'exit' to quit."

while true; do
    # Beep to indicate ready
    echo -ne '\a'

    # Record from the default mic
    pw-record --format=s16 --rate=16000 --channels=1 \
        "$TMPDIR/input.wav" &
    REC_PID=$!
    sleep "$RECORD_SEC"
    kill -INT "$REC_PID" 2>/dev/null   # SIGINT lets pw-record finalize the WAV
    wait "$REC_PID" 2>/dev/null

    # Transcribe with Whisper (-nt = no timestamps)
    TRANSCRIPT=$("$WHISPER_BIN" -m "$WHISPER_MODEL" -f "$TMPDIR/input.wav" \
        -nt 2>/dev/null | sed 's/^[[:space:]]*//')
    if [ -z "$TRANSCRIPT" ]; then
        echo "(silence — listening again...)"
        continue
    fi
    echo "You said: $TRANSCRIPT"

    # Exit command
    if echo "$TRANSCRIPT" | grep -qi '^exit\|^quit\|^stop\|^goodbye'; then
        speak "Goodbye."
        exit 0
    fi

    # Send to Ollama
    RESPONSE=$(echo "$TRANSCRIPT" | ollama run "$OLLAMA_MODEL" 2>/dev/null)
    echo "kai: $RESPONSE"

    # Speak the response
    speak "$RESPONSE"
done
Use it
# Start the voice loop (default: 5 second listening window)
kai-voice
# Longer listening window for complex questions
kai-voice 10
# Pair with the ZFS expert model specifically
OLLAMA_MODEL=zfs-expert kai-voice
# Example session:
# [beep]
# You: "What's my pool status?"
# kai: "Pool rpool is ONLINE, 42% capacity, last scrub 3 days ago, no errors."
# [beep]
# You: "Are there any snapshots older than 30 days?"
# kai: "Yes, rpool/srv/data@auto-2026-02-15 is 36 days old, using 2.1 GB..."
5. Vision: LLaVA for visual diagnosis
LLaVA is a vision-language model. It looks at images and reasons about them. In Ollama, it's one pull command. Send it a photo of a server screen, a dashboard, an error message, a network diagram — and it tells you what it sees.
Set up LLaVA in Ollama
# Pull the LLaVA model
ollama pull llava
# Test with a screenshot (ollama run has no --images flag; you embed
# the image path in the prompt and it is picked up automatically)
ollama run llava "Describe what you see in this image: /tmp/screenshot.png"
# Diagnose a server screen photo
ollama run llava "This is a photo of a server console screen: /tmp/server-screen.jpg \
What errors or warnings do you see? What should I do?"
# Read a dashboard
ollama run llava "This is a Grafana dashboard screenshot: /tmp/dashboard.png \
Summarize the metrics and flag anything concerning."
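The same model is reachable over Ollama's REST API, which takes base64-encoded images in an images array. A sketch using curl and jq (both assumed installed; 11434 is Ollama's default port):

```shell
#!/bin/bash
# Query LLaVA through Ollama's HTTP API with a base64-encoded image.

build_payload() {  # $1 = prompt, $2 = image file
    jq -n --arg p "$1" --arg img "$(base64 -w0 "$2")" \
        '{model: "llava", prompt: $p, stream: false, images: [$img]}'
}

ask_llava() {  # $1 = prompt, $2 = image file
    curl -s http://localhost:11434/api/generate \
        -d "$(build_payload "$1" "$2")" | jq -r .response
}

if [ $# -ge 1 ]; then
    ask_llava "${2:-What do you see in this image?}" "$1"
fi
```

The API route is what you would wire into monitoring: a webhook hands the script an image, the script hands back a plain-text diagnosis.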
kai-vision — camera-to-diagnosis script
#!/bin/bash
# /usr/local/bin/kai-vision — photograph and diagnose
IMAGE="${1:?Usage: kai-vision <image-path> [question]}"
QUESTION="${2:-Analyze this image. If it shows a server screen, terminal, dashboard, \
or error message, diagnose the issue and suggest next steps.}"
if [ ! -f "$IMAGE" ]; then
echo "Error: $IMAGE not found"
exit 1
fi
echo "Analyzing: $IMAGE"
echo "Question: $QUESTION"
echo "---"
ollama run llava "$QUESTION $IMAGE"   # ollama detects the image path inside the prompt
6. Use cases
Voice and vision are not novelties. They solve real operational problems where keyboards and SSH sessions are impractical or unavailable.
Hands-free server room
You're swapping a drive with both hands full. Ask "what's the serial number of the faulted disk in bay 3?" and hear the answer. No laptop, no phone, no awkward one-handed typing.
Remote support via photo
A remote site sends you a phone photo of a server's console screen showing a kernel panic.
Feed it to kai-vision. Get a diagnosis and remediation steps without
squinting at a blurry JPEG trying to read error codes.
Accessibility
Operators with repetitive strain injuries or vision impairment can manage infrastructure by voice. The AI reads the screen state and speaks it back. Not a screen reader — a context-aware infrastructure reader.
Voice command: pool status
"What's my pool status?" → The AI runs context, reads zpool status, and speaks: "Pool rpool is online, 42% capacity, last scrub was Tuesday, no errors. Pool tank is degraded — disk sdb has 3 checksum errors. I'd recommend running a scrub and watching the error count."
Dashboard triage
Screenshot your Grafana dashboard. Send it to LLaVA. It reads the graphs, identifies the spike at 3 AM, correlates it with the CPU panel, and suggests you check the cron job that runs at that time.
Offline operation
Air-gapped lab? No internet? Every component — Whisper, Ollama, Piper, LLaVA — runs entirely local. Pre-download the models, copy them to the target machine, and the voice pipeline works with zero network connectivity.
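A sketch of a bundler that gathers everything for the offline copy. The Ollama store path is the systemd-install default; user installs keep models under ~/.ollama/models instead:

```shell
#!/bin/bash
# Gather every model file the voice/vision pipeline needs into one
# directory you can copy to the air-gapped machine. Paths match the
# layout used earlier in this post; adjust for yours.
WHISPER_MODEL="${WHISPER_MODEL:-/srv/whisper/models/ggml-base.en.bin}"
PIPER_VOICE="${PIPER_VOICE:-/srv/piper/voices/en_US-lessac-medium.onnx}"
OLLAMA_STORE="${OLLAMA_STORE:-/usr/share/ollama/.ollama/models}"

bundle() {
    local dst="$1"
    mkdir -p "$dst"
    cp "$WHISPER_MODEL" "$dst"/
    cp "$PIPER_VOICE" "$PIPER_VOICE.json" "$dst"/
    # Ollama keeps pulled models (kai, llava) in its blob store
    tar czf "$dst/ollama-models.tgz" \
        -C "$(dirname "$OLLAMA_STORE")" "$(basename "$OLLAMA_STORE")"
}

if [ $# -ge 1 ]; then bundle "$1"; fi
```

On the target machine, unpack the tarball into the same Ollama store path and drop the Whisper and Piper files into the directories the scripts expect.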
Your voice is an interface. Your camera is a sensor. Cloud voice assistants listen to everything and send it to someone else's server. This pipeline listens to exactly what you say, processes it on your hardware, and forgets it when you're done. No telemetry. No transcripts in someone's S3 bucket. No "we may use your data to improve our services."
Talk to your servers. Photograph the problem. Get answers. All local. All yours.