
AI Voice & Vision — talk to your infrastructure.

Voice control is built into the AI profile and working.

The stack: whisper.cpp for speech-to-text, Ollama for AI reasoning, all running locally. No cloud APIs, no subscriptions, no data leaving your network. Two commands do the work: kai-voice records from your mic, transcribes, and sends to the AI. kai-do goes further — it generates real commands from your voice and executes them on remote hosts.

How it actually works — real demo

You say "create a new VM". Here is exactly what happens:

kai-do in action: voice to Proxmox VM

# You run kai-do pointed at your Proxmox host
$ kai-do pve1.lab

# You speak: "Create a new VM with 4 cores and 8 gigs of RAM"
# whisper.cpp transcribes → "create a new VM with 4 cores and 8 gigs of RAM"
# Ollama generates the command:

  qm create 200 --name ai-vm --cores 4 --memory 8192 \
    --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-single \
    --scsi0 local-zfs:32 --boot order=scsi0

# kai-do shows you the command and asks: execute? [y/n]
# You confirm → it runs via SSH on pve1.lab
# Done. VM 200 created.

Voice in, transcription, AI reasoning, command generation, confirmation, execution. The whole loop runs on your hardware. The remote host just sees an SSH command.
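The confirm-then-execute step is the safety hinge of that loop. A minimal sketch of the idea — the `confirm_and_run` helper and the pluggable `RUNNER` variable are this guide's illustration, not the actual kai-do source:

```shell
# Sketch of kai-do's confirmation gate (illustrative, not the real script).
# RUNNER defaults to ssh; a dry run can substitute something harmless.
RUNNER="${RUNNER:-ssh}"

confirm_and_run() {  # confirm_and_run <host> <command>
    host="$1"; cmd="$2"
    # Show the generated command and prompt on stderr, so stdout
    # carries only the remote command's output
    printf 'Generated: %s\nexecute? [y/n] ' "$cmd" >&2
    read -r answer
    if [ "$answer" = "y" ]; then
        "$RUNNER" "$host" "$cmd"    # run on the remote host
    else
        echo "aborted"
        return 1
    fi
}
```

Nothing executes until you type `y` — the model proposes, you dispose.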

kai-voice — ask questions, get answers

# Start kai-voice — records from mic, transcribes, sends to AI
$ kai-voice

# You speak: "What's my pool status?"
# whisper transcribes → AI checks context → responds:
#   "Pool rpool is ONLINE, 42% capacity, last scrub 3 days ago, no errors."

# kai-do — generate and execute commands from voice on remote hosts
$ kai-do proxmox-node
$ kai-do webserver.internal
$ kai-do 192.168.1.50

# You speak: "Show me the ZFS snapshots older than 30 days"
# AI generates: zfs list -t snapshot -o name,creation -s creation | ...
# Executes on the remote host after your confirmation

Walk into a server room. Say "what's my pool status?" out loud. Hear the answer through the speaker. No keyboard, no screen, no SSH session. Whisper converts your speech to text. Ollama processes it. Piper speaks the answer back. The entire pipeline runs on the machine in front of you. Nothing leaves the network.

Then there's vision. Point a camera at a server screen. Send the photo to LLaVA. Get a diagnosis. No OCR hacks, no screen-sharing tools — an actual vision model that reads what it sees and reasons about it.

1. Install Whisper (speech-to-text)

Whisper is OpenAI's speech recognition model. whisper.cpp is the C/C++ port that runs on CPU without Python, without PyTorch, without 40 GB of VRAM. It loads a model file, takes audio in, and outputs text. That's it.

Build whisper.cpp from source

# Clone and build
cd /opt
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release -j$(nproc)

# Download the base.en model — fast, accurate enough for commands
# This is what kai-voice uses by default
bash models/download-ggml-model.sh base.en

# Optional: medium.en for better accuracy on longer speech
bash models/download-ggml-model.sh medium.en

# Test it — record 5 seconds from your mic and transcribe
arecord -d 5 -f S16_LE -r 16000 -c 1 /tmp/test.wav
./build/bin/whisper-cli -m models/ggml-base.en.bin -f /tmp/test.wav

# That's it. kai-voice and kai-do use this binary + model directly.
# For offline/air-gapped machines, copy the model file:
mkdir -p /srv/whisper/models
cp models/ggml-base.en.bin /srv/whisper/models/

whisper.cpp is to speech what Ollama is to text. A single binary, a model file, no cloud dependency. Record audio, get text. This is the engine behind kai-voice and kai-do.
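If you find yourself transcribing often, the binary and model can be wrapped in a small helper. The `transcribe` name and the environment-variable overrides are this guide's convention, not part of whisper.cpp; the paths match the build above:

```shell
# Minimal wrapper around the whisper-cli binary built above.
# Paths match this guide; override via environment if yours differ.
WHISPER_BIN="${WHISPER_BIN:-/opt/whisper.cpp/build/bin/whisper-cli}"
WHISPER_MODEL="${WHISPER_MODEL:-/srv/whisper/models/ggml-base.en.bin}"

transcribe() {  # transcribe <wav-file> -> plain text on stdout
    "$WHISPER_BIN" -m "$WHISPER_MODEL" -f "$1" --no-timestamps 2>/dev/null \
        | sed 's/^[[:space:]]*//'
}

# transcribe /tmp/test.wav
```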

2. Install Piper (text-to-speech)

Piper converts text to spoken audio. It runs locally, produces natural-sounding speech, and processes in real time on CPU. Download a voice model, pipe text in, get WAV out.

Set up Piper TTS

# Install Piper
curl -L https://github.com/rhasspy/piper/releases/latest/download/piper_linux_x86_64.tar.gz | \
    tar xz -C /opt/

# Download a voice (en_US-lessac-medium is clear and natural)
mkdir -p /srv/piper/voices
cd /srv/piper/voices
curl -LO https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
curl -LO https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json

# Test it
echo "Your ZFS pool is healthy. ARC hit rate is 94 percent." | \
    /opt/piper/piper --model /srv/piper/voices/en_US-lessac-medium.onnx --output_file /tmp/test.wav
aplay /tmp/test.wav

Piper is to TTS what whisper.cpp is to STT. No cloud, no API key, no network. Text goes in, audio comes out. Run it on the same machine that runs your pools.

3. PipeWire audio capture

PipeWire replaced PulseAudio on modern Linux. It handles audio routing — mic input, speaker output, and everything in between. You need it for hands-free operation.

Configure PipeWire for voice pipeline

# Verify PipeWire is running
systemctl --user status pipewire pipewire-pulse wireplumber

# List audio sources (find your mic)
pw-cli list-objects | grep -A2 'node.name.*input'
pactl list sources short

# Record from default mic (PipeWire handles routing)
pw-record --target=0 --format=s16 --rate=16000 --channels=1 /tmp/mic.wav

# For headless servers, use a USB audio adapter
# PipeWire auto-detects USB audio devices
lsusb | grep -i audio
pw-cli list-objects | grep -A4 'media.class.*Audio'

# Set default source (mic) and sink (speaker)
wpctl set-default <SOURCE_ID>    # for mic input
wpctl set-default <SINK_ID>      # for speaker output

PipeWire is the switchboard operator. It connects your mic to whisper and Piper's output to your speaker. You configure it once and forget it.

4. The full voice pipeline: kai-voice

This is the script that ties everything together. Mic → Whisper → Ollama → Piper → Speaker. Hands-free, continuous loop. Say something, hear the answer, say something else. It captures audio when you speak, transcribes with Whisper, sends the text to kai (your kldload AI), then speaks the response through Piper.

kai-voice — hands-free AI loop

#!/bin/bash
# /usr/local/bin/kai-voice — talk to your infrastructure

WHISPER_BIN="/opt/whisper.cpp/build/bin/whisper-cli"
WHISPER_MODEL="/srv/whisper/models/ggml-base.en.bin"
PIPER_BIN="/opt/piper/piper"
PIPER_VOICE="/srv/piper/voices/en_US-lessac-medium.onnx"
OLLAMA_MODEL="kai"          # your kldload AI model
RECORD_SEC="${1:-5}"         # seconds to listen (default 5)
TMPDIR=$(mktemp -d /tmp/kai-voice.XXXXXX)

cleanup() { rm -rf "$TMPDIR"; }
trap cleanup EXIT

speak() {
    echo "$1" | "$PIPER_BIN" --model "$PIPER_VOICE" --output_file "$TMPDIR/response.wav" 2>/dev/null
    aplay -q "$TMPDIR/response.wav" 2>/dev/null
}

echo "kai-voice: listening. Speak after the beep. Say 'exit' to quit."

while true; do
    # Beep to indicate ready
    echo -ne '\a'

    # Record from mic
    pw-record --target=0 --format=s16 --rate=16000 --channels=1 \
        "$TMPDIR/input.wav" &
    REC_PID=$!
    sleep "$RECORD_SEC"
    kill "$REC_PID" 2>/dev/null
    wait "$REC_PID" 2>/dev/null

    # Transcribe with Whisper
    TRANSCRIPT=$("$WHISPER_BIN" -m "$WHISPER_MODEL" -f "$TMPDIR/input.wav" \
        --no-timestamps 2>/dev/null | sed 's/^[[:space:]]*//')

    if [ -z "$TRANSCRIPT" ]; then
        echo "(silence — listening again...)"
        continue
    fi

    echo "You said: $TRANSCRIPT"

    # Exit command
    if echo "$TRANSCRIPT" | grep -qiE '^(exit|quit|stop|goodbye)'; then
        speak "Goodbye."
        exit 0
    fi

    # Send to Ollama
    RESPONSE=$(echo "$TRANSCRIPT" | ollama run "$OLLAMA_MODEL" 2>/dev/null)

    echo "kai: $RESPONSE"

    # Speak the response
    speak "$RESPONSE"
done

A walkie-talkie for your servers. Press-to-talk replaced by listen-and-respond. You speak, the machine listens, thinks, and speaks back. All on the same box, all offline.

Use it

# Start the voice loop (default: 5 second listening window)
kai-voice

# Longer listening window for complex questions
kai-voice 10

# Pair with the ZFS expert model specifically
OLLAMA_MODEL=zfs-expert kai-voice

# Example session:
# [beep]
# You: "What's my pool status?"
# kai: "Pool rpool is ONLINE, 42% capacity, last scrub 3 days ago, no errors."
# [beep]
# You: "Are there any snapshots older than 30 days?"
# kai: "Yes, rpool/srv/data@auto-2026-02-15 is 36 days old, using 2.1 GB..."

5. Vision: LLaVA for visual diagnosis

LLaVA is a vision-language model. It looks at images and reasons about them. In Ollama, it's one pull command. Send it a photo of a server screen, a dashboard, an error message, a network diagram — and it tells you what it sees.

Set up LLaVA in Ollama

# Pull the LLaVA model
ollama pull llava

# Test with a screenshot — the CLI reads image paths included in the prompt
ollama run llava "Describe what you see in this image: /tmp/screenshot.png"

# Diagnose a server screen photo
ollama run llava "This is a photo of a server console screen. \
What errors or warnings do you see? What should I do? /tmp/server-screen.jpg"

# Read a dashboard
ollama run llava "This is a Grafana dashboard screenshot. \
Summarize the metrics and flag anything concerning. /tmp/dashboard.png"

You photograph a whiteboard and send it to a colleague. Same idea, except the colleague is a local LLM that never leaks your infrastructure screenshots to the cloud.
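For scripting, the same model is also reachable over Ollama's HTTP API, where the documented `images` field of `/api/generate` takes base64-encoded image data. A sketch — the `build_llava_request` helper is our own, and the quoting is deliberately naive (keep the prompt free of double quotes):

```shell
# Build a JSON body for Ollama's /api/generate with an inline image.
# "images" (an array of base64 strings) is Ollama's documented multimodal
# field; this helper is just a sketch with naive prompt quoting.
build_llava_request() {  # build_llava_request <image-file> <prompt>
    b64=$(base64 "$1" | tr -d '\n')     # strip wrapping for a single JSON string
    printf '{"model":"llava","prompt":"%s","images":["%s"],"stream":false}' \
        "$2" "$b64"
}

# Send it to a locally running Ollama:
# build_llava_request /tmp/server-screen.jpg "What errors do you see?" | \
#     curl -s http://localhost:11434/api/generate -d @-
```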

kai-vision — camera-to-diagnosis script

#!/bin/bash
# /usr/local/bin/kai-vision — photograph and diagnose

IMAGE="${1:?Usage: kai-vision <image-path> [question]}"
QUESTION="${2:-Analyze this image. If it shows a server screen, terminal, dashboard, \
or error message, diagnose the issue and suggest next steps.}"

if [ ! -f "$IMAGE" ]; then
    echo "Error: $IMAGE not found"
    exit 1
fi

echo "Analyzing: $IMAGE"
echo "Question: $QUESTION"
echo "---"

ollama run llava "$QUESTION $IMAGE"

6. Use cases

Voice and vision are not novelties. They solve real operational problems where keyboards and SSH sessions are impractical or unavailable.

Hands-free server room

You're swapping a drive with both hands full. Ask "what's the serial number of the faulted disk in bay 3?" and hear the answer. No laptop, no phone, no awkward one-handed typing.

Remote support via photo

A remote site sends you a phone photo of a server's console screen showing a kernel panic. Feed it to kai-vision. Get a diagnosis and remediation steps without squinting at a blurry JPEG trying to read error codes.

Accessibility

Operators with repetitive strain injuries or vision impairment can manage infrastructure by voice. The AI reads the screen state and speaks it back. Not a screen reader — a context-aware infrastructure reader.

Voice command: pool status

"What's my pool status?" → The AI gathers system context, reads zpool status, and speaks: "Pool rpool is online, 42% capacity, last scrub was Tuesday, no errors. Pool tank is degraded — disk sdb has 3 checksum errors. I'd recommend running a scrub and watching the error count."

Dashboard triage

Screenshot your Grafana dashboard. Send it to LLaVA. It reads the graphs, identifies the spike at 3 AM, correlates it with the CPU panel, and suggests you check the cron job that runs at that time.

Offline operation

Air-gapped lab? No internet? Every component — Whisper, Ollama, Piper, LLaVA — runs entirely local. Pre-download the models, copy them to the target machine, and the voice pipeline works with zero network connectivity.
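Staging the models for that sneakernet transfer can be one small helper. The `stage_models` name and tarball layout are our choice; the example paths are the ones used earlier in this guide:

```shell
# Bundle model files into a tarball for transfer to an air-gapped host.
# The helper is generic; pass it whichever model files you downloaded.
stage_models() {  # stage_models <dest-dir> <model-file>...
    dest="$1"; shift
    mkdir -p "$dest"
    cp -- "$@" "$dest/"                 # collect the files in one directory
    tar czf "$dest.tar.gz" -C "$(dirname "$dest")" "$(basename "$dest")"
}

# stage_models /tmp/kai-models \
#     /srv/whisper/models/ggml-base.en.bin \
#     /srv/piper/voices/en_US-lessac-medium.onnx \
#     /srv/piper/voices/en_US-lessac-medium.onnx.json
```

Copy the tarball over, unpack it into the same paths, and the pipeline runs without ever touching the network.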

Your voice is an interface. Your camera is a sensor. Cloud voice assistants listen to everything and send it to someone else's server. This pipeline listens to exactly what you say, processes it on your hardware, and forgets it when you're done. No telemetry. No transcripts in someone's S3 bucket. No "we may use your data to improve our services."

Talk to your servers. Photograph the problem. Get answers. All local. All yours.