Grafana, Prometheus & Alerting on ZFS — observe everything, lose nothing.
Monitoring data is the most undervalued data on a server. You only realize you needed last month's CPU graph after the outage. Prometheus stores metrics as time series. Grafana renders them into dashboards. Alertmanager wakes you up when something goes wrong. On kldloadOS, all of this sits on ZFS datasets — snapshottable, compressible, replicable. Your monitoring history is as durable as your production data.
kst is the terminal view. Grafana is the browser view. Same infrastructure, same metrics,
different interface. Use kst when you're SSH'd in at 2 AM. Use Grafana when you're reviewing
trends over coffee. They complement each other — they don't compete.
1. ZFS datasets for the monitoring stack
Every component gets its own dataset. Separate datasets mean separate recordsizes, separate snapshot policies, and separate compression ratios. Prometheus TSDB writes are append-heavy with periodic compactions — 128k recordsize and zstd compression are ideal. Grafana's SQLite database is tiny but critical — snapshot it frequently so you never lose a dashboard.
Create dedicated datasets
# Prometheus TSDB — append-heavy, compresses well
zfs create -o recordsize=128k -o compression=zstd -o atime=off \
-o mountpoint=/srv/prometheus rpool/srv/prometheus
# Grafana config and SQLite DB — small but precious
zfs create -o recordsize=32k -o compression=zstd -o atime=off \
-o mountpoint=/srv/grafana rpool/srv/grafana
# Alertmanager data — silence history, notification log
zfs create -o recordsize=32k -o compression=zstd -o atime=off \
-o mountpoint=/srv/alertmanager rpool/srv/alertmanager
# Check them
kdf
Sanoid snapshot policy for monitoring data
# /etc/sanoid/sanoid.conf — monitoring datasets
[rpool/srv/prometheus]
use_template = monitoring
recursive = no
[rpool/srv/grafana]
use_template = monitoring-critical
recursive = no
[rpool/srv/alertmanager]
use_template = monitoring
recursive = no
[template_monitoring]
hourly = 48
daily = 30
monthly = 6
yearly = 0
autosnap = yes
autoprune = yes
[template_monitoring-critical]
# Grafana dashboards are hand-crafted — keep more history
hourly = 72
daily = 90
monthly = 12
yearly = 1
autosnap = yes
autoprune = yes
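Once sanoid has had an hour or two to run, it is worth confirming that snapshots are actually landing on all three datasets. A minimal sketch; `count_snaps` is a hypothetical helper, not a kldloadOS command, and the dataset names are the ones created above.

```shell
#!/bin/sh
# Spot-check that sanoid is taking snapshots of the monitoring datasets.
# count_snaps (hypothetical helper): reads "dataset@snapshot" names on
# stdin and prints one "dataset count" line per dataset, sorted.
count_snaps() {
  awk -F@ '{c[$1]++} END {for (d in c) print d, c[d]}' | sort
}

zfs list -H -t snapshot -o name 2>/dev/null \
  | grep -E '^rpool/srv/(prometheus|grafana|alertmanager)@' \
  | count_snaps
```

If any of the three datasets is missing from the output, check `systemctl status sanoid.timer` before trusting your retention policy.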
2. Docker deployment — the full stack
Grafana, Prometheus, and Alertmanager all run as containers. Each container's data volume is a ZFS dataset. The containers are stateless and disposable. The data underneath is durable and versioned. Blow away a container, start a new one, point it at the same dataset — nothing lost.
docker-compose.yml for the monitoring stack
# /srv/monitoring/docker-compose.yml
version: "3.8"   # note: the top-level version key is obsolete under Compose v2; harmless, can be dropped
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - /srv/prometheus:/prometheus
      - /srv/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - /srv/monitoring/alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=90d"
      - "--storage.tsdb.retention.size=50GB"
      - "--web.enable-lifecycle"

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - /srv/grafana:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme   # change this before first login
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=http://localhost:3000

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - /srv/alertmanager:/alertmanager
      - /srv/monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
      - /srv/monitoring/zfs-collector:/var/lib/node_exporter/textfile:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
      - "--collector.textfile.directory=/var/lib/node_exporter/textfile"
# Launch it
cd /srv/monitoring && docker compose up -d
# Snapshot the entire stack before upgrading images
ksnap /srv/prometheus && ksnap /srv/grafana && ksnap /srv/alertmanager
docker compose pull && docker compose up -d
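If a freshly pulled image misbehaves, the pre-upgrade snapshots make rollback mechanical. A sketch for Grafana; `newest_snap` is a hypothetical helper, and stopping the container first matters because rolling back under a live SQLite database invites corruption.

```shell
#!/bin/sh
# Roll Grafana back to its most recent snapshot after a bad upgrade.
# newest_snap (hypothetical helper): picks the last line of
# creation-sorted input, i.e. the newest snapshot.
newest_snap() { tail -n 1; }

snap=$(zfs list -H -t snapshot -o name -s creation rpool/srv/grafana 2>/dev/null | newest_snap)
if [ -n "$snap" ]; then
  docker stop grafana             # quiesce the SQLite DB before rolling back
  zfs rollback -r "$snap"         # discard everything written after the snapshot
  docker start grafana
fi
```

The same pattern works for the Prometheus and Alertmanager datasets; only the dataset and container names change.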
3. Prometheus configuration — scraping everything
Prometheus pulls metrics from exporters. Node exporter gives you CPU, RAM, disk, and network. The ZFS custom collector (below) gives you ARC stats, pool health, scrub status, and snapshot ages. If you run WireGuard, you get peer status, bandwidth, and handshake ages too.
prometheus.yml
# /srv/monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert-rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  # This node
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]
        labels:
          instance: "kldload-01"

  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Fleet nodes over WireGuard
  - job_name: "fleet"
    static_configs:
      - targets:
          - "10.10.0.2:9100" # node-02
          - "10.10.0.3:9100" # node-03
          - "10.10.0.4:9100" # node-04
        labels:
          env: "production"

  # WireGuard exporter (if running)
  - job_name: "wireguard"
    static_configs:
      - targets: ["localhost:9586"]
    scrape_interval: 30s
4. ZFS custom collector for node_exporter
Node exporter's textfile collector reads .prom files from a directory. A cron job writes
fresh ZFS metrics every minute. This gives Prometheus everything ZFS-specific: ARC hit rate, pool capacity,
scrub age, snapshot counts, and dataset compression ratios. No special exporter binary needed —
just a shell script and the textfile collector you already have.
ZFS metrics collector script
#!/bin/bash
# /usr/local/bin/zfs-prom-collector — write ZFS metrics for node_exporter
OUT="/srv/monitoring/zfs-collector/zfs.prom"
TMP="${OUT}.tmp"
mkdir -p "$(dirname "$OUT")"   # ensure the textfile directory exists
{
# ARC stats
if [ -f /proc/spl/kstat/zfs/arcstats ]; then
awk '
/^size / { printf "zfs_arc_size_bytes %s\n", $3 }
/^c_max / { printf "zfs_arc_max_bytes %s\n", $3 }
/^hits / { h=$3 }
/^misses / { m=$3 }
END {
if (h+m > 0) printf "zfs_arc_hit_ratio %.4f\n", h/(h+m)
printf "zfs_arc_hits_total %s\n", h
printf "zfs_arc_misses_total %s\n", m
}
' /proc/spl/kstat/zfs/arcstats
fi
# Pool health: 0=ONLINE, 1=DEGRADED, 2=FAULTED, 3=UNAVAIL
zpool list -H -o name,health 2>/dev/null | while IFS=$'\t' read -r pool health; do
case "$health" in
ONLINE) val=0 ;;
DEGRADED) val=1 ;;
FAULTED) val=2 ;;
*) val=3 ;;
esac
echo "zfs_pool_health{pool=\"${pool}\"} ${val}"
done
# Pool capacity
zpool list -Hp -o name,size,alloc,free,frag,cap 2>/dev/null | \
while IFS=$'\t' read -r pool size alloc free frag cap; do
echo "zfs_pool_size_bytes{pool=\"${pool}\"} ${size}"
echo "zfs_pool_allocated_bytes{pool=\"${pool}\"} ${alloc}"
echo "zfs_pool_free_bytes{pool=\"${pool}\"} ${free}"
echo "zfs_pool_fragmentation_percent{pool=\"${pool}\"} ${frag}"
echo "zfs_pool_capacity_percent{pool=\"${pool}\"} ${cap}"
done
# Scrub age in seconds
zpool list -H -o name 2>/dev/null | while read -r pool; do
last_scrub=$(zpool status "$pool" 2>/dev/null | \
grep 'scan: scrub repaired' | \
grep -oP '\w+ \w+ +\d+ \d+:\d+:\d+ \d+' | head -1)
if [ -n "$last_scrub" ]; then
scrub_epoch=$(date -d "$last_scrub" +%s 2>/dev/null)
if [ -n "$scrub_epoch" ]; then
age=$(( $(date +%s) - scrub_epoch ))
echo "zfs_scrub_age_seconds{pool=\"${pool}\"} ${age}"
fi
fi
done
# Snapshot count per dataset
zfs list -H -t snapshot -o name 2>/dev/null | \
awk -F@ '{count[$1]++} END {for(ds in count) printf "zfs_snapshot_count{dataset=\"%s\"} %d\n", ds, count[ds]}'
# Oldest snapshot age per dataset (seconds)
zfs list -H -t snapshot -o name,creation -s creation 2>/dev/null | \
awk -F@ 'NR>0 {
split($2, parts, "\t")
ds=$1
if (!(ds in seen)) {
seen[ds]=1
print ds "\t" parts[2]
}
}' | while IFS=$'\t' read -r ds creation; do
if [ -n "$creation" ]; then
snap_epoch=$(date -d "$creation" +%s 2>/dev/null)
if [ -n "$snap_epoch" ]; then
age=$(( $(date +%s) - snap_epoch ))
echo "zfs_oldest_snapshot_age_seconds{dataset=\"${ds}\"} ${age}"
fi
fi
done
} > "$TMP"
mv "$TMP" "$OUT"
# Install it
chmod +x /usr/local/bin/zfs-prom-collector
mkdir -p /srv/monitoring/zfs-collector
# Run every minute via cron
cat > /etc/cron.d/zfs-prom-collector <<'EOF'
* * * * * root /usr/local/bin/zfs-prom-collector
EOF
# Test it
/usr/local/bin/zfs-prom-collector && cat /srv/monitoring/zfs-collector/zfs.prom
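node_exporter silently skips a textfile that fails to parse, so a malformed line means silently missing metrics. Before trusting the cron job, lint its output. A sketch; `check_prom` is a hypothetical helper that accepts only comments and `name[{labels}] value` lines, which covers everything the collector above emits.

```shell
#!/bin/sh
# Lint the Prometheus exposition format before node_exporter reads it.
# check_prom (hypothetical helper): returns 0 iff every non-blank line is
# a comment or "metric_name[{labels}] numeric-value"; offending lines are
# printed so you can see what broke.
check_prom() {
  grep -Ev '^$' \
    | grep -Ev '^(#.*|[a-zA-Z_:][a-zA-Z0-9_:]*(\{[^}]*\})? -?[0-9.eE+-]+)$' \
    && return 1 || return 0
}

f=/srv/monitoring/zfs-collector/zfs.prom
if [ -r "$f" ]; then
  check_prom < "$f" && echo "zfs.prom parses cleanly" \
                    || echo "zfs.prom has malformed lines"
fi
```

If promtool is installed on the host, `promtool check metrics < zfs.prom` performs a stricter version of the same check.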
5. WireGuard monitoring
If your fleet talks over WireGuard, you need to know when a peer drops off, when a handshake goes stale,
and how much bandwidth each tunnel is consuming. The prometheus-wireguard-exporter scrapes
wg show and exposes it as Prometheus metrics.
WireGuard exporter
# Run the WireGuard exporter as a container
docker run -d --name wg-exporter \
--restart unless-stopped \
--cap-add NET_ADMIN \
--network host \
-v /run/wireguard:/run/wireguard:ro \
mindflavor/prometheus-wireguard-exporter:latest \
-p 9586
# Metrics available at http://localhost:9586/metrics
# Key metrics:
# wireguard_sent_bytes_total{peer="..."} — bytes sent per peer
# wireguard_received_bytes_total{peer="..."} — bytes received per peer
# wireguard_latest_handshake_seconds{peer="..."} — last handshake timestamp
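The same staleness check works without the exporter, straight from wg(8): `wg show <interface> latest-handshakes` prints one tab-separated `pubkey epoch` line per peer, with 0 meaning never. A sketch; `stale_peers` is a hypothetical helper and `wg0` is an assumed interface name.

```shell
#!/bin/sh
# Print public keys of peers whose last handshake is older than 300s.
# stale_peers NOW (hypothetical helper): reads "pubkey<TAB>epoch" lines
# and prints the pubkey when the handshake is missing or stale.
stale_peers() {
  awk -v now="$1" -F'\t' '$2 == 0 || now - $2 > 300 {print $1}'
}

wg show wg0 latest-handshakes 2>/dev/null | stale_peers "$(date +%s)"
```

Handy in a cron job on nodes that do not run the exporter: empty output means all tunnels are healthy.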
6. Alert rules — know before your users do
Alerts are not optional. They are the difference between "we caught it at 3 AM" and "the CEO called at 9 AM." These rules cover pool health, ARC efficiency, scrub schedules, snapshot freshness, disk errors, and WireGuard peer status.
alert-rules.yml
# /srv/monitoring/alert-rules.yml
groups:
  - name: zfs
    rules:
      - alert: ZfsPoolDegraded
        expr: zfs_pool_health != 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool {{ $labels.pool }} is not ONLINE"
          description: "Pool {{ $labels.pool }} health value is {{ $value }} (1=DEGRADED, 2=FAULTED). Check zpool status immediately."

      - alert: ZfsArcHitRateLow
        expr: zfs_arc_hit_ratio < 0.80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ZFS ARC hit rate below 80%"
          description: "ARC hit ratio is {{ $value | humanizePercentage }}. Consider increasing zfs_arc_max or adding L2ARC."

      - alert: ZfsScrubOverdue
        expr: zfs_scrub_age_seconds > 2592000
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "ZFS scrub overdue on pool {{ $labels.pool }}"
          description: "Last scrub was {{ $value | humanizeDuration }} ago. Run: zpool scrub {{ $labels.pool }}"

      - alert: ZfsSnapshotTooOld
        expr: zfs_oldest_snapshot_age_seconds > 7776000
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Oldest snapshot on {{ $labels.dataset }} is over 90 days old"
          description: "Review snapshot retention. Old snapshots pin disk space that deleting data cannot reclaim."

      - alert: ZfsPoolCapacityHigh
        expr: zfs_pool_capacity_percent > 80
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "ZFS pool {{ $labels.pool }} over 80% capacity"
          description: "Pool {{ $labels.pool }} is at {{ $value }}%. ZFS performance degrades above 80%. Free space or expand the pool."

      - alert: ZfsPoolCapacityCritical
        expr: zfs_pool_capacity_percent > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool {{ $labels.pool }} over 90% — act now"
          description: "Pool {{ $labels.pool }} is at {{ $value }}%. Delete snapshots, remove data, or add vdevs."

  - name: wireguard
    rules:
      - alert: WireguardPeerDown
        expr: time() - wireguard_latest_handshake_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WireGuard peer {{ $labels.peer }} last handshake over 5 minutes ago"
          description: "Peer may be offline or unreachable. Check network connectivity."

  - name: system
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 85% on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"
          description: "Note: ZFS ARC uses RAM intentionally. Check arc_summary before panicking."

      - alert: HighDiskIoLatency
        expr: rate(node_disk_io_time_weighted_seconds_total[5m]) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O latency on {{ $labels.instance }}"
          description: "Weighted I/O time is climbing. This catches saturated or failing disks; check iostat and zpool status."
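Alert expressions deserve tests of their own. Prometheus ships promtool, which can replay synthetic series against a rules file. Below is a sketch of one unit test for the pool-health alert; the file name, series values, and timings are illustrative.

```yaml
# /srv/monitoring/alert-tests.yml — run with: promtool test rules alert-tests.yml
rule_files:
  - alert-rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'zfs_pool_health{pool="rpool"}'
        values: '0 0 1 1 1'   # pool degrades at t=2m
    alert_rule_test:
      - eval_time: 4m        # pending at 2m, firing after the 1m "for"
        alertname: ZfsPoolDegraded
        exp_alerts:
          - exp_labels:
              severity: critical
              pool: rpool
            exp_annotations:
              summary: "ZFS pool rpool is not ONLINE"
              description: "Pool rpool health value is 1 (1=DEGRADED, 2=FAULTED). Check zpool status immediately."
```

`promtool check rules alert-rules.yml` catches syntax errors; this catches logic errors, which are the expensive kind.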
alertmanager.yml
# /srv/monitoring/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: "default"
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - receiver: "critical"
      matchers:            # the older "match:" form is deprecated
        - severity = "critical"
      repeat_interval: 1h

receivers:
  - name: "default"
    webhook_configs:
      - url: "http://localhost:8065/hooks/your-webhook-id"
        # Replace with your webhook — Mattermost, Slack, Gotify, ntfy, whatever
  - name: "critical"
    webhook_configs:
      - url: "http://localhost:8065/hooks/your-webhook-id"
        # Add email, PagerDuty, or SMS for critical alerts
7. Pre-built Grafana dashboard
Import this JSON into Grafana and you get a single-pane view of your kldloadOS node: pool health, ARC statistics, disk I/O, CPU/RAM, and service status. Every panel uses the metrics from node_exporter and the ZFS custom collector above.
kldloadOS dashboard JSON (import via Grafana UI)
{
"dashboard": {
"title": "kldloadOS — Node Overview",
"tags": ["kldload", "zfs"],
"timezone": "browser",
"panels": [
{
"title": "ZFS Pool Health",
"type": "stat",
"targets": [{"expr": "zfs_pool_health"}],
"fieldConfig": {
"defaults": {
"mappings": [
{"type": "value", "options": {"0": {"text": "ONLINE", "color": "green"}}},
{"type": "value", "options": {"1": {"text": "DEGRADED", "color": "orange"}}},
{"type": "value", "options": {"2": {"text": "FAULTED", "color": "red"}}}
]
}
},
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 0}
},
{
"title": "ARC Hit Rate",
"type": "gauge",
"targets": [{"expr": "zfs_arc_hit_ratio * 100"}],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0, "max": 100,
"thresholds": {
"steps": [
{"color": "red", "value": 0},
{"color": "orange", "value": 70},
{"color": "yellow", "value": 80},
{"color": "green", "value": 90}
]
}
}
},
"gridPos": {"h": 4, "w": 6, "x": 6, "y": 0}
},
{
"title": "ARC Size",
"type": "timeseries",
"targets": [
{"expr": "zfs_arc_size_bytes", "legendFormat": "ARC Current"},
{"expr": "zfs_arc_max_bytes", "legendFormat": "ARC Max"}
],
"fieldConfig": {"defaults": {"unit": "bytes"}},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
},
{
"title": "Pool Capacity",
"type": "bargauge",
"targets": [{"expr": "zfs_pool_capacity_percent", "legendFormat": "{{ pool }}"}],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 100,
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 70},
{"color": "red", "value": 85}
]
}
}
},
"gridPos": {"h": 8, "w": 6, "x": 0, "y": 4}
},
{
"title": "Scrub Age (days)",
"type": "stat",
"targets": [{"expr": "zfs_scrub_age_seconds / 86400", "legendFormat": "{{ pool }}"}],
"fieldConfig": {
"defaults": {
"unit": "d",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 14},
{"color": "red", "value": 30}
]
}
}
},
"gridPos": {"h": 8, "w": 6, "x": 6, "y": 4}
},
{
"title": "CPU Usage",
"type": "timeseries",
"targets": [
{"expr": "100 - (avg(rate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)", "legendFormat": "CPU %"}
],
"fieldConfig": {"defaults": {"unit": "percent", "min": 0, "max": 100}},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 12}
},
{
"title": "Memory Usage",
"type": "timeseries",
"targets": [
{"expr": "node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes", "legendFormat": "Used"},
{"expr": "node_memory_MemTotal_bytes", "legendFormat": "Total"}
],
"fieldConfig": {"defaults": {"unit": "bytes"}},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 12}
},
{
"title": "Disk I/O",
"type": "timeseries",
"targets": [
{"expr": "rate(node_disk_read_bytes_total[5m])", "legendFormat": "Read {{ device }}"},
{"expr": "rate(node_disk_written_bytes_total[5m])", "legendFormat": "Write {{ device }}"}
],
"fieldConfig": {"defaults": {"unit": "Bps"}},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 20}
},
{
"title": "Snapshot Count per Dataset",
"type": "bargauge",
"targets": [{"expr": "zfs_snapshot_count", "legendFormat": "{{ dataset }}"}],
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 20}
}
],
"time": {"from": "now-24h", "to": "now"},
"refresh": "30s"
}
}
# Import via Grafana API
curl -s -X POST http://localhost:3000/api/dashboards/db \
-H "Content-Type: application/json" \
-u admin:changeme \
-d @/srv/monitoring/kldload-dashboard.json
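A quick way to confirm the import took is to ask the search API for the dashboard and pull the titles out of the response. A sketch; `extract_titles` is a hypothetical helper, and the credentials match the compose file above.

```shell
#!/bin/sh
# List dashboard titles Grafana knows about after the import.
# extract_titles (hypothetical helper): pulls every "title":"..." value
# out of the JSON search response without needing jq.
extract_titles() {
  grep -o '"title":"[^"]*"' | sed 's/"title":"\(.*\)"/\1/'
}

curl -s --max-time 5 -u admin:changeme \
  "http://localhost:3000/api/search?query=kldloadOS" \
  | extract_titles
```

If the title does not appear, re-run the import with `"overwrite": true` in the payload or check the Grafana container logs.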
8. Federation — central Grafana, fleet nodes
One Grafana instance. Many Prometheus instances. Each node in the fleet runs its own Prometheus and node_exporter behind WireGuard. The central Prometheus federates from all of them, or you add each node's Prometheus as a Grafana datasource directly. WireGuard makes this secure without certificates or VPN appliances.
Option A: Prometheus federation
Central Prometheus scrapes /federate from each node's Prometheus.
One TSDB with all fleet data. Good for up to ~20 nodes.
# In central prometheus.yml
  - job_name: "federate-fleet"
    honor_labels: true
    metrics_path: "/federate"
    params:
      'match[]':
        - '{job="node"}'
        - '{__name__=~"zfs_.*"}'
    static_configs:
      - targets:
          - "10.10.0.2:9090"
          - "10.10.0.3:9090"
          - "10.10.0.4:9090"
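Before wiring the job in, spot-check that a node's /federate endpoint answers. The match[] selector is full of braces and quotes, so let curl's --data-urlencode do the encoding. A sketch; `federate_url` is a hypothetical helper for tools that cannot encode parameters themselves.

```shell
#!/bin/sh
# Spot-check a fleet node's /federate endpoint over WireGuard.
# federate_url HOST (hypothetical helper): builds the federate URL with
# the match[]={job="node"} selector already percent-encoded.
federate_url() {
  printf 'http://%s/federate?match[]=%%7Bjob%%3D%%22node%%22%%7D' "$1"
}

curl -sG --max-time 5 "http://10.10.0.2:9090/federate" \
  --data-urlencode 'match[]={job="node"}' | head -n 5
```

A few lines of metrics back means federation will work; an empty response usually means the selector matched nothing or WireGuard is down.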
Option B: Multiple Grafana datasources
Each node's Prometheus is a separate Grafana datasource. No central TSDB needed. Dashboards use datasource variables to switch between nodes.
# Add each node as a datasource in Grafana
# Settings > Data Sources > Add data source
# Type: Prometheus
# URL: http://10.10.0.2:9090
# Name: node-02
# In your dashboard, add a variable:
# Name: datasource
# Type: Datasource
# Filter: Prometheus
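Clicking through the UI for every node gets old. Grafana can also provision datasources from a YAML file read at startup. A sketch; the file path inside the container and the node list are assumptions.

```yaml
# /etc/grafana/provisioning/datasources/fleet.yml
# Mount it into the grafana container, e.g. add to its volumes:
#   - /srv/monitoring/grafana-datasources.yml:/etc/grafana/provisioning/datasources/fleet.yml:ro
apiVersion: 1
datasources:
  - name: node-02
    type: prometheus
    access: proxy
    url: http://10.10.0.2:9090
  - name: node-03
    type: prometheus
    access: proxy
    url: http://10.10.0.3:9090
```

Provisioned datasources survive container rebuilds without touching the Grafana database, which keeps the /srv/grafana dataset boring, and boring is what you want from state.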
9. kst vs Grafana — terminal and browser, same data
kst is the terminal view. Grafana is the browser view. They are not competitors.
Use kst when you're on the box over SSH, diagnosing a live issue. Use Grafana when you're
reviewing trends, building dashboards, or showing a colleague what happened yesterday.
Same node_exporter. Same ZFS metrics. Different rendering engine.
kst — the terminal view
kst reads system state directly and renders it in the terminal. No browser needed.
Pool health, ARC hit rate, service status, disk I/O — all in a single glance.
Perfect for quick checks, automation scripts, and headless servers.
# Quick health check via SSH
ssh node-04 kst
# Watch live stats (refreshes every 2s)
watch -n 2 kst
Grafana — the browser view
Grafana reads from Prometheus and renders time-series graphs. Historical trends, drill-downs, annotations, alerting visualizations. Perfect for post-mortems, capacity planning, and showing stakeholders what the infrastructure looks like over time.
# Open in browser
http://kldload-01:3000
# Or over WireGuard
http://10.10.0.1:3000
Monitoring data belongs on ZFS. Prometheus metrics, Grafana dashboards, Alertmanager state —
this data tells you the history of your infrastructure. Without it, every incident starts from zero.
With ZFS underneath, your monitoring history is checksummed, compressed, snapshotted, and replicable
to a backup node with a single syncoid command.
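That single syncoid command looks like this in practice. A sketch; the `backup` host and `tank/monitoring` target pool are assumptions, and `target_for` is a hypothetical helper.

```shell
#!/bin/sh
# Replicate the monitoring datasets to a backup node over SSH.
# target_for DATASET (hypothetical helper): maps a local dataset to its
# destination on the backup host.
target_for() {
  printf 'root@backup:tank/monitoring/%s' "${1##*/}"
}

for ds in rpool/srv/prometheus rpool/srv/grafana rpool/srv/alertmanager; do
  syncoid "$ds" "$(target_for "$ds")" 2>/dev/null \
    || echo "replication of $ds failed"
done
```

Run it from cron after sanoid's snapshots land and the backup node carries your full monitoring history, incrementally and checksummed.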
You don't need a SaaS vendor to observe your own servers. You need Prometheus, Grafana, a shell script, and a filesystem that won't let your data rot. You already have all four.