Monitoring and Observability — the complete stack.
This is the complete observability stack for kldload systems. Every component — node_exporter, ZFS exporter, WireGuard exporter, Prometheus, Grafana, Alertmanager, Loki — is open source, runs on ZFS, communicates over WireGuard, and requires zero cloud dependencies. All examples work on CentOS/RHEL/Rocky, Fedora, Debian, and Ubuntu. Every config file on this page is production-ready. Every PromQL query has been tested on real kldload deployments.
The thesis: Observability is not a product you buy. It is not Datadog. It is not New Relic. It is not Splunk. Observability is Prometheus scraping exporters, Grafana rendering dashboards, Alertmanager routing alerts, and Loki indexing logs. Four open-source binaries. Four systemd units. Zero license fees. Zero egress charges. Zero per-host pricing.
The entire commercial observability industry exists because people do not know these four tools. Once you do, you will never pay $23/host/month for metrics again. Your data stays on your infrastructure, on ZFS, compressed, checksummed, snapshotted, and replicated — just like everything else in the kldload stack.
Stack architecture
The kldload observability stack has six components. Every component is a single binary with a single config file. There are no databases to manage (Prometheus has its own TSDB; Loki has its own index). There are no message queues. There are no Kafka clusters. The architecture is deliberately simple because simple systems are reliable systems.
                        +-----------------+
                        |     Grafana     |  :3000 (dashboards + log viewer)
                        +--------+--------+
                                 |
                    +------------+------------+
                    |                         |
           +--------+--------+      +---------+---------+
           |   Prometheus    |      |       Loki        |  :3100 (log aggregation)
           |      :9090      |      +---------+---------+
           +--------+--------+                ^
                    |                         |
      +-------------+-------------+  +--------+--------+
      |             |             |  |    Promtail     |  (log shipper on every node)
      v             v             v  +--------+--------+
+-----------+ +-----------+ +-----------+     |
| node_exp  | | zfs_exp   | | wg_exp    |     |
|   :9100   | |   :9134   | |   :9586   |     |
+-----------+ +-----------+ +-----------+     |
      ^             ^             ^           |
      |             |             |           |
   [every kldload node: exporters + promtail on each]

           +-------------------+
           |   Alertmanager    |  :9093 (alert routing + silencing)
           +-------------------+
                     ^
                     |
      Prometheus (fires alerts via alert rules)
Data flow: Exporters expose metrics on HTTP endpoints. Prometheus scrapes them every 15 seconds and stores time-series data in its local TSDB on ZFS. Grafana queries Prometheus for dashboards. Alertmanager receives firing alerts from Prometheus and routes them to Slack, PagerDuty, email, or webhooks. Promtail tails log files and ships them to Loki. Grafana queries Loki for log correlation. That is the entire architecture.
Exporters — the data sources
Small HTTP servers that expose metrics in Prometheus format. Each exporter knows one domain: node_exporter knows Linux, zfs_exporter knows ZFS, wireguard_exporter knows WireGuard. They run on every monitored host. They use almost no resources — typically 10-20MB RAM.
Prometheus — the brain
Pull-based time-series database. Scrapes exporters on a schedule, stores data locally, evaluates alert rules, serves PromQL queries. One binary, one config file, one data directory. Handles millions of time series on modest hardware.
Grafana — the eyes
Dashboard and visualization layer. Queries Prometheus for metrics, Loki for logs. Provisioned entirely via YAML and JSON — no clicking through a GUI to configure. Dashboards are code, stored in git, deployed with the system.
Alertmanager — the voice
Receives alerts from Prometheus, deduplicates them, groups them, routes them to the right receiver. Supports silencing, inhibition, and escalation. Runs separately so Prometheus can be restarted without losing alert state.
Loki — the memory
Log aggregation system designed to be cost-effective. Does not index log content — only indexes labels (like Prometheus). Uses the same label model as Prometheus, so you can correlate metrics and logs by the same set of labels. Massively cheaper than Elasticsearch.
Promtail — the courier
Tails log files on each host and ships them to Loki. Discovers logs via systemd journal or filesystem paths. Attaches labels automatically from filename, systemd unit, or hostname. Runs on every node alongside the exporters.
Quick health check with kst
Before deploying the full stack, every kldload system includes kst — a one-command
health dashboard built into the platform:
kst
Shows: ZFS pool health, root usage, compression ratio, snapshot count, boot environments, memory,
CPU, uptime, and service status. This is your quick-glance tool for interactive troubleshooting.
The full Prometheus stack gives you history, alerting, and multi-host views that kst
cannot provide.
node_exporter — per-host metrics
node_exporter is the foundation. It runs on every host and exposes Linux system metrics — CPU, memory, disk, network, filesystem, systemd units, processes, and ZFS. Install it on every node you want to monitor.
Installation
CentOS / RHEL / Rocky / Fedora
# Download the latest release
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xzf node_exporter-1.8.2.linux-amd64.tar.gz
cp node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/
chmod 755 /usr/local/bin/node_exporter
Debian / Ubuntu
# The distro package is fine for basic use, but we want the latest for ZFS collectors
# Install from binary for consistency across all distros
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xzf node_exporter-1.8.2.linux-amd64.tar.gz
cp node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/
chmod 755 /usr/local/bin/node_exporter
Hardened systemd unit
The systemd unit below is production-hardened. It runs as a dedicated unprivileged user, enables the collectors that matter for kldload systems, sets up the textfile collector directory, and applies security restrictions. This is not the minimal example from the README — this is what you actually run in production.
# Create the service user
useradd --no-create-home --shell /sbin/nologin --system node_exporter
# Create the textfile collector directory
mkdir -p /var/lib/node_exporter/textfile
chown node_exporter:node_exporter /var/lib/node_exporter/textfile
cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Prometheus Node Exporter
Documentation=https://github.com/prometheus/node_exporter
After=network-online.target
Wants=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
--collector.zfs \
--collector.systemd \
--collector.processes \
--collector.tcpstat \
--collector.ntp \
--collector.textfile \
--collector.textfile.directory=/var/lib/node_exporter/textfile \
--no-collector.infiniband \
--no-collector.wifi \
--no-collector.fibrechannel \
--web.listen-address=:9100
Restart=always
RestartSec=5
# Security hardening
ProtectSystem=strict
ProtectHome=yes
NoNewPrivileges=yes
ProtectKernelTunables=yes
ProtectControlGroups=yes
ReadOnlyPaths=/
ReadWritePaths=/var/lib/node_exporter/textfile
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now node_exporter
Important collectors for kldload systems
node_exporter ships with dozens of collectors. Most are enabled by default. These are the ones that matter on kldload infrastructure:
collector.zfs
ARC hits, misses, size, pool state. The most important collector on a kldload system.
Reads from /proc/spl/kstat/zfs/. No privileges needed — the files are
world-readable.
collector.systemd
Service states (active, failed, inactive) for all systemd units. Lets you alert on
node_systemd_unit_state{name="zfs-mount.service",state="failed"} == 1.
collector.processes
Process counts by state (running, sleeping, zombie). Useful for detecting fork bombs or runaway process creation.
collector.tcpstat
TCP connection states (established, time-wait, close-wait). Essential for monitoring WireGuard-transported services and detecting connection leaks.
collector.textfile
Reads .prom files from a directory and exposes them as metrics. This is how
you add custom metrics — ZFS snapshot counts, replication lag, scrub progress —
without writing a full exporter.
collector.ntp
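NTP clock offset. Clock drift skews every query that compares time() with a timestamp reported by a host (handshake age, for example), and a Prometheus server whose own clock steps backwards rejects the resulting out-of-order samples. Alert if offset exceeds 100ms.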
Verify
curl -s http://localhost:9100/metrics | head -30
Expected output (abbreviated):
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_cpu_seconds_total{cpu="0",mode="system"} 4567.89
node_cpu_seconds_total{cpu="0",mode="user"} 8901.23
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 3.4089984e+10
# HELP node_zfs_arc_hits kstat.zfs.misc.arcstats.hits
# TYPE node_zfs_arc_hits untyped
node_zfs_arc_hits 1.28934567e+08
# HELP node_zfs_arc_misses kstat.zfs.misc.arcstats.misses
# TYPE node_zfs_arc_misses untyped
node_zfs_arc_misses 4.56789e+06
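The exposition format is plain text, so it is easy to inspect by hand or from a script. The sketch below is a minimal Python parser, not a replacement for a real client library: the function name parse_metrics is hypothetical, and it assumes one sample per line with no trailing timestamps.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: value}.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    Simplification: assumes samples carry no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


SAMPLE = """\
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 3.4089984e+10
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
"""

metrics = parse_metrics(SAMPLE)
print(metrics["node_memory_MemTotal_bytes"] / 2**30)  # total RAM in GiB
```

Feed it the output of the curl command above to check a single value without opening Prometheus.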
Textfile collector — custom metrics
The textfile collector is the escape hatch. Any metric you cannot get from a standard collector,
you write to a .prom file and node_exporter picks it up. This is how you expose
ZFS snapshot counts, sanoid replication lag, scrub status, and anything else that requires
running a ZFS command.
cat > /usr/local/bin/zfs-textfile-metrics.sh << 'SCRIPT'
#!/bin/bash
# Custom ZFS metrics for node_exporter textfile collector
# Runs every 5 minutes via cron or systemd timer
set -euo pipefail
OUTPUT="/var/lib/node_exporter/textfile/zfs_custom.prom"
TMPFILE="${OUTPUT}.tmp"
{
# Snapshot count per dataset
echo "# HELP zfs_snapshot_count Number of ZFS snapshots per dataset"
echo "# TYPE zfs_snapshot_count gauge"
zfs list -t snapshot -H -o name 2>/dev/null | \
awk -F'@' '{count[$1]++} END {for (ds in count) printf "zfs_snapshot_count{dataset=\"%s\"} %d\n", ds, count[ds]}'
# Total snapshot count
echo "# HELP zfs_snapshot_count_total Total number of ZFS snapshots"
echo "# TYPE zfs_snapshot_count_total gauge"
echo "zfs_snapshot_count_total $(zfs list -t snapshot -H 2>/dev/null | wc -l)"
# Pool usage percentage
echo "# HELP zfs_pool_usage_percent ZFS pool usage percentage"
echo "# TYPE zfs_pool_usage_percent gauge"
zpool list -H -o name,capacity 2>/dev/null | while read pool cap; do
cap="${cap%\%}"
echo "zfs_pool_usage_percent{pool=\"${pool}\"} ${cap}"
done
# Scrub status (0=none, 1=scrubbing, 2=completed)
echo "# HELP zfs_scrub_state ZFS scrub state (0=none, 1=active, 2=completed)"
echo "# TYPE zfs_scrub_state gauge"
zpool list -H -o name 2>/dev/null | while read pool; do
status=$(zpool status "$pool" 2>/dev/null)
if echo "$status" | grep -q "scrub in progress"; then
echo "zfs_scrub_state{pool=\"${pool}\"} 1"
elif echo "$status" | grep -q "scrub repaired"; then
echo "zfs_scrub_state{pool=\"${pool}\"} 2"
else
echo "zfs_scrub_state{pool=\"${pool}\"} 0"
fi
done
# Scrub errors (parsed from "... with N errors on ..." in the scan line)
echo "# HELP zfs_scrub_errors_total ZFS scrub errors count"
echo "# TYPE zfs_scrub_errors_total gauge"
zpool list -H -o name 2>/dev/null | while read -r pool; do
errors=$(zpool status "$pool" 2>/dev/null | grep -oP 'with \K\d+(?= errors)' | head -n1 || true)
echo "zfs_scrub_errors_total{pool=\"${pool}\"} ${errors:-0}"
done
# Dataset compression ratio
echo "# HELP zfs_compression_ratio ZFS dataset compression ratio"
echo "# TYPE zfs_compression_ratio gauge"
zfs list -H -o name,compressratio -t filesystem 2>/dev/null | while read ds ratio; do
ratio="${ratio%x}"
echo "zfs_compression_ratio{dataset=\"${ds}\"} ${ratio}"
done
# ARC target size (the "c" row in arcstats); the "size" row is the actual ARC size
echo "# HELP zfs_arc_target_bytes ZFS ARC target size in bytes"
echo "# TYPE zfs_arc_target_bytes gauge"
arc_c=$(awk '$1 == "c" {print $3}' /proc/spl/kstat/zfs/arcstats 2>/dev/null || true)
echo "zfs_arc_target_bytes ${arc_c:-0}"
} > "${TMPFILE}"
# Atomic move so node_exporter never reads a partial file
mv "${TMPFILE}" "${OUTPUT}"
SCRIPT
chmod 755 /usr/local/bin/zfs-textfile-metrics.sh
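The write-then-rename step is the part of the script that matters most, because node_exporter may read the directory at any moment. Here is the same pattern as a small Python sketch for anyone generating .prom files from a program instead of a shell script. The helper name write_prom_file and the sample metric value are illustrative assumptions.

```python
import os
import tempfile


def write_prom_file(directory, name, lines):
    """Write metric lines to <directory>/<name> via a temp file plus rename."""
    # The temp file lives in the same directory so the rename stays on one
    # filesystem; the .tmp suffix keeps node_exporter from reading it early.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("\n".join(lines) + "\n")
        os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the exporter must read it
        os.replace(tmp_path, os.path.join(directory, name))  # atomic rename
    except BaseException:
        os.unlink(tmp_path)
        raise


# Demo against a throwaway directory instead of /var/lib/node_exporter/textfile
demo_dir = tempfile.mkdtemp()
write_prom_file(demo_dir, "zfs_custom.prom", [
    "# TYPE zfs_snapshot_count_total gauge",
    "zfs_snapshot_count_total 42",
])
print(os.listdir(demo_dir))
```

The reader either sees the previous complete file or the new complete file, never a half-written one.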
Run it on a timer — a systemd timer is cleaner than cron:
cat > /etc/systemd/system/zfs-textfile-metrics.service << 'EOF'
[Unit]
Description=Generate ZFS metrics for node_exporter textfile collector
[Service]
Type=oneshot
ExecStart=/usr/local/bin/zfs-textfile-metrics.sh
User=root
EOF
cat > /etc/systemd/system/zfs-textfile-metrics.timer << 'EOF'
[Unit]
Description=Run ZFS textfile metrics every 5 minutes
[Timer]
OnBootSec=30
OnUnitActiveSec=5min
AccuracySec=30s
[Install]
WantedBy=timers.target
EOF
systemctl daemon-reload
systemctl enable --now zfs-textfile-metrics.timer
# Run once immediately to populate the file
/usr/local/bin/zfs-textfile-metrics.sh
Filtering metrics at scrape time
node_exporter exposes ~800 metrics by default. If you only need a subset, you can filter at the
Prometheus scrape config level using metric_relabel_configs:
# In prometheus.yml scrape_configs
- job_name: "node"
  metric_relabel_configs:
    # Drop high-cardinality metrics you don't need
    - source_labels: [__name__]
      regex: 'node_scrape_collector_duration_seconds'
      action: drop
    # Drop the CPU modes you don't chart. Use "drop" here: a "keep" rule would
    # also discard every metric that is not node_cpu_seconds_total.
    - source_labels: [__name__, mode]
      regex: 'node_cpu_seconds_total;(irq|nice|softirq|steal)'
      action: drop
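Keep and drop rules are easy to get backwards, so it helps to model them before touching a live scrape config. The sketch below imitates the matching step in Python under two assumptions: the function name relabel is made up for illustration, and Python's re module stands in for the RE2 engine Prometheus uses (the two agree on simple patterns like these). Source label values are joined with ";" and the regex must match the whole joined string.

```python
import re


def relabel(series, source_labels, regex, action):
    """Apply one keep/drop rule to a list of label dicts."""
    pattern = re.compile(regex)
    kept = []
    for labels in series:
        # Missing labels contribute an empty string, as in Prometheus.
        joined = ";".join(labels.get(name, "") for name in source_labels)
        matched = pattern.fullmatch(joined) is not None
        if (action == "keep" and matched) or (action == "drop" and not matched):
            kept.append(labels)
    return kept


series = [
    {"__name__": "node_cpu_seconds_total", "mode": "idle"},
    {"__name__": "node_cpu_seconds_total", "mode": "steal"},
    {"__name__": "node_memory_MemTotal_bytes"},
]

after_keep = relabel(series, ["__name__", "mode"],
                     r"node_cpu_seconds_total;(idle|iowait|system|user)", "keep")
after_drop = relabel(series, ["__name__", "mode"],
                     r"node_cpu_seconds_total;(irq|nice|softirq|steal)", "drop")
print([s["__name__"] for s in after_keep])
print([s["__name__"] for s in after_drop])
```

A keep rule anchored on one metric name removes every other metric from the job, including the memory series here. A drop rule listing the unwanted modes leaves unrelated metrics alone.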
ZFS exporter — dedicated ZFS metrics
While node_exporter provides basic ZFS metrics (ARC stats, pool state), a dedicated ZFS exporter gives you deeper visibility: per-dataset usage, per-pool I/O, scrub progress percentages, replication lag, and individual pool member (vdev) health. The pdf/zfs_exporter is the standard choice.
Installation
# Download
curl -LO https://github.com/pdf/zfs_exporter/releases/download/v2.3.5/zfs_exporter-2.3.5.linux-amd64.tar.gz
tar xzf zfs_exporter-2.3.5.linux-amd64.tar.gz
cp zfs_exporter-2.3.5.linux-amd64/zfs_exporter /usr/local/bin/
chmod 755 /usr/local/bin/zfs_exporter
Systemd unit
cat > /etc/systemd/system/zfs_exporter.service << 'EOF'
[Unit]
Description=Prometheus ZFS Exporter
Documentation=https://github.com/pdf/zfs_exporter
After=zfs-mount.service
Requires=zfs-mount.service
[Service]
Type=simple
ExecStart=/usr/local/bin/zfs_exporter \
--web.listen-address=:9134 \
--collector.dataset-snapshot \
--collector.pool
Restart=always
RestartSec=5
ProtectSystem=strict
ProtectHome=yes
NoNewPrivileges=yes
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now zfs_exporter
Key metrics exposed
# Pool health
zfs_pool_health{pool="rpool",state="online"} 1
zfs_pool_health{pool="rpool",state="degraded"} 0
# Pool I/O operations
zfs_pool_read_ops_total{pool="rpool"} 45678901
zfs_pool_write_ops_total{pool="rpool"} 23456789
zfs_pool_read_bytes_total{pool="rpool"} 1.234e+12
zfs_pool_write_bytes_total{pool="rpool"} 5.678e+11
# Dataset usage
zfs_dataset_used_bytes{dataset="rpool/ROOT/os",type="filesystem"} 8.5e+09
zfs_dataset_available_bytes{dataset="rpool/ROOT/os",type="filesystem"} 9.2e+10
zfs_dataset_referenced_bytes{dataset="rpool/ROOT/os",type="filesystem"} 7.8e+09
# Snapshot metrics
zfs_dataset_snapshot_count{dataset="rpool/ROOT/os"} 42
zfs_dataset_snapshot_used_bytes{dataset="rpool/ROOT/os"} 2.1e+09
# ARC detailed stats
zfs_arc_size_bytes 1.7179869e+10
zfs_arc_hits_total 1.28934567e+08
zfs_arc_misses_total 4.567890e+06
zfs_arc_l2_hits_total 0
zfs_arc_l2_misses_total 0
zfs_arc_mfu_size_bytes 8.589934e+09
zfs_arc_mru_size_bytes 6.442450e+09
# Scrub progress
zfs_pool_scrub_progress{pool="rpool"} 0.73
zfs_pool_scrub_errors_total{pool="rpool"} 0
zfs_pool_scrub_duration_seconds{pool="rpool"} 3456
Verify
curl -s http://localhost:9134/metrics | grep zfs_pool_health
# zfs_pool_health{pool="rpool",state="online"} 1
WireGuard exporter — per-peer metrics
If you run a WireGuard mesh (and on kldload you probably do), you need visibility into peer connectivity, handshake recency, and data transfer. The WireGuard exporter reads from the kernel's WireGuard interface and exposes per-peer metrics.
Installation
# Download prometheus-wireguard-exporter
curl -LO https://github.com/MindFlavor/prometheus_wireguard_exporter/releases/download/3.6.6/prometheus_wireguard_exporter-3.6.6-x86_64-unknown-linux-musl.tar.gz
tar xzf prometheus_wireguard_exporter-3.6.6-x86_64-unknown-linux-musl.tar.gz
cp prometheus_wireguard_exporter /usr/local/bin/
chmod 755 /usr/local/bin/prometheus_wireguard_exporter
Systemd unit
cat > /etc/systemd/system/wireguard_exporter.service << 'EOF'
[Unit]
Description=Prometheus WireGuard Exporter
Documentation=https://github.com/MindFlavor/prometheus_wireguard_exporter
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
# Needs CAP_NET_ADMIN to read WireGuard interface data
ExecStart=/usr/local/bin/prometheus_wireguard_exporter \
-p 9586 \
-n /etc/wireguard/wg0.conf
Restart=always
RestartSec=5
AmbientCapabilities=CAP_NET_ADMIN
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
ReadOnlyPaths=/etc/wireguard
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now wireguard_exporter
Key metrics
# Per-peer metrics (one set per peer per interface)
wireguard_sent_bytes_total{interface="wg0",public_key="abc123...",friendly_name="node-1"} 1.234e+09
wireguard_received_bytes_total{interface="wg0",public_key="abc123...",friendly_name="node-1"} 5.678e+08
wireguard_latest_handshake_seconds{interface="wg0",public_key="abc123...",friendly_name="node-1"} 1.71e+09
# Derived: time since last handshake (use in PromQL)
# time() - wireguard_latest_handshake_seconds
# A peer with no handshake in >180s is likely down
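The handshake-age arithmetic is simple enough to check offline. This Python sketch applies the same subtraction and the 180-second threshold from the comment above to a dictionary of timestamps. The function name stale_peers and the sample timestamps are illustrative assumptions.

```python
import time


def stale_peers(handshakes, now=None, max_age=180):
    """Return {peer: age_seconds} for peers whose last handshake is too old.

    handshakes maps a peer name to the Unix timestamp reported by
    wireguard_latest_handshake_seconds.
    """
    now = time.time() if now is None else now
    return {
        peer: now - timestamp
        for peer, timestamp in handshakes.items()
        if now - timestamp > max_age
    }


# Fixed "now" so the example is reproducible
result = stale_peers(
    {"node-1": 1_700_000_450, "node-2": 1_700_000_000},
    now=1_700_000_500,
)
print(result)
```

Both clocks matter here: the subtraction is only meaningful if the host that reported the handshake and the host evaluating the query agree on the time.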
Friendly names from config comments
The exporter reads the WireGuard config file passed with -n and maps public keys to friendly
names using a comment inside each [Peer] block, written as # friendly_name = <name>:
# /etc/wireguard/wg0.conf
[Interface]
PrivateKey = ...
Address = 10.78.0.1/24
ListenPort = 51820
[Peer]
# friendly_name = node-1 (web server)
PublicKey = abc123...
AllowedIPs = 10.78.1.1/32
Endpoint = 203.0.113.10:51820
[Peer]
# friendly_name = node-2 (database)
PublicKey = def456...
AllowedIPs = 10.78.2.1/32
Endpoint = 203.0.113.20:51820
The value after friendly_name = becomes the friendly_name label. This makes dashboards readable:
you see "node-1 (web server)" instead of a base64 public key.
Verify
curl -s http://localhost:9586/metrics | grep wireguard_latest_handshake
# wireguard_latest_handshake_seconds{interface="wg0",public_key="abc123...",friendly_name="node-1 (web server)"} 1.712345678e+09
Prometheus — the metrics server
Prometheus is the center of the stack. It scrapes every exporter, stores time-series data, evaluates alert rules, and serves PromQL queries to Grafana. On a kldload system, Prometheus stores its TSDB on ZFS for compression, checksumming, and snapshots.
Installation
# Download
curl -LO https://github.com/prometheus/prometheus/releases/download/v2.54.1/prometheus-2.54.1.linux-amd64.tar.gz
tar xzf prometheus-2.54.1.linux-amd64.tar.gz
cp prometheus-2.54.1.linux-amd64/{prometheus,promtool} /usr/local/bin/
chmod 755 /usr/local/bin/{prometheus,promtool}
mkdir -p /etc/prometheus /var/lib/prometheus
Storage on ZFS
Prometheus TSDB writes 2-hour blocks, then compacts them. The access pattern is: sequential writes for ingestion, random reads for queries. The optimal ZFS configuration:
# Create a dedicated dataset for Prometheus
zfs create -o mountpoint=/var/lib/prometheus \
-o compression=zstd \
-o recordsize=128k \
-o atime=off \
-o xattr=sa \
-o dnodesize=auto \
-o primarycache=all \
rpool/prometheus
# Set ownership
useradd --no-create-home --shell /sbin/nologin --system prometheus
chown -R prometheus:prometheus /var/lib/prometheus /etc/prometheus
Why these settings: recordsize=128k matches Prometheus' large sequential
writes. zstd compresses time-series data at 3-5x, so a 50GB TSDB might use only
12GB on disk. atime=off avoids a metadata write on every read. The dataset inherits
checksumming from the pool, so corruption of your metrics database is detected on read
instead of passing silently, and it is repaired automatically when the pool has redundancy.
Retention sizing
Prometheus stores data locally. How much space you need depends on the number of time series, scrape interval, and retention period. The formula:
# Space per sample: ~1-2 bytes (Prometheus is very efficient)
# Formula: series_count * samples_per_day * bytes_per_sample * retention_days
# Example: 10 nodes, ~800 series each, 15s scrape interval, 30d retention
# 8,000 series * 5,760 samples/day * 1.5 bytes * 30 days = ~2GB uncompressed
# With zstd compression on ZFS: ~500MB actual disk
# Example: 100 nodes, 30d retention
# 80,000 series * 5,760 * 1.5 * 30 = ~20GB uncompressed = ~5GB on ZFS
# Example: 100 nodes, 1 year retention
# 80,000 * 5,760 * 1.5 * 365 = ~250GB uncompressed = ~60GB on ZFS
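The formula is worth scripting so you can plug in your own node counts. This Python sketch reproduces the examples above. The function name tsdb_bytes is hypothetical, and the default of 1.5 bytes per sample is simply the midpoint of the 1-2 byte range quoted in the comments, not a measured value.

```python
def tsdb_bytes(series, scrape_interval_s, retention_days, bytes_per_sample=1.5):
    """Estimate uncompressed TSDB size in bytes."""
    samples_per_day = 86400 / scrape_interval_s  # 5,760 at a 15s interval
    return series * samples_per_day * bytes_per_sample * retention_days


small = tsdb_bytes(series=8_000, scrape_interval_s=15, retention_days=30)
large = tsdb_bytes(series=80_000, scrape_interval_s=15, retention_days=365)
print(f"10 nodes, 30d:   {small / 1e9:.1f} GB")
print(f"100 nodes, 365d: {large / 1e9:.1f} GB")
```

Halving the scrape interval doubles the estimate, so the interval is usually the cheapest knob to turn when disk is tight.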
Complete prometheus.yml
This is a complete, production-ready Prometheus configuration for a kldload cluster. It scrapes all exporters, loads alert rules, connects to Alertmanager, and includes recording rules for pre-computed queries.
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
  # External labels for federation and remote write
  external_labels:
    cluster: "prod-east"
    environment: "production"

# Alert rules and recording rules
rule_files:
  - "alerts.yml"
  - "recording_rules.yml"

# Alertmanager connection
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"

# Scrape configurations
scrape_configs:
  # Prometheus monitors itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # node_exporter on every host
  - job_name: "node"
    static_configs:
      - targets:
          - "10.78.0.1:9100"  # hub (monitoring node)
          - "10.78.1.1:9100"  # node-1 (web)
          - "10.78.2.1:9100"  # node-2 (database)
          - "10.78.3.1:9100"  # node-3 (app server)
          - "10.78.4.1:9100"  # node-4 (build runner)
    relabel_configs:
      - source_labels: [__address__]
        regex: '10\.78\.(\d+)\.\d+:\d+'
        target_label: node_id
      - source_labels: [__address__]
        regex: '(.+):\d+'
        target_label: instance

  # ZFS exporter on every host
  - job_name: "zfs"
    static_configs:
      - targets:
          - "10.78.0.1:9134"
          - "10.78.1.1:9134"
          - "10.78.2.1:9134"
          - "10.78.3.1:9134"
          - "10.78.4.1:9134"
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):\d+'
        target_label: instance

  # WireGuard exporter on every host
  - job_name: "wireguard"
    static_configs:
      - targets:
          - "10.78.0.1:9586"
          - "10.78.1.1:9586"
          - "10.78.2.1:9586"
          - "10.78.3.1:9586"
          - "10.78.4.1:9586"
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):\d+'
        target_label: instance

  # Grafana health
  - job_name: "grafana"
    static_configs:
      - targets: ["localhost:3000"]

  # Alertmanager health
  - job_name: "alertmanager"
    static_configs:
      - targets: ["localhost:9093"]

  # libvirt exporter (KVM hosts only)
  - job_name: "libvirt"
    static_configs:
      - targets:
          - "10.78.0.1:9177"
          - "10.78.3.1:9177"
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):\d+'
        target_label: instance
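The relabel rules in this config rely on capture groups, and the regex is matched against the whole __address__ value. The Python sketch below shows what each rule extracts. The helper name extract_label is hypothetical, and Python writes the first capture group as \1 where Prometheus writes $1 (its default replacement).

```python
import re


def extract_label(address, regex, replacement=r"\1"):
    """Return the relabeled value, or None when the regex does not match.

    Prometheus anchors relabel regexes at both ends, hence fullmatch.
    """
    match = re.fullmatch(regex, address)
    return match.expand(replacement) if match else None


node_id = extract_label("10.78.3.1:9100", r"10\.78\.(\d+)\.\d+:\d+")
instance = extract_label("10.78.3.1:9100", r"(.+):\d+")
outside = extract_label("192.168.1.5:9100", r"10\.78\.(\d+)\.\d+:\d+")
print(node_id, instance, outside)
```

When the regex does not match, Prometheus skips the rule and leaves the target's labels as they were, which the sketch represents as None.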
Systemd service
cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus Time Series Database
Documentation=https://prometheus.io/docs/
After=network-online.target zfs-mount.service
Wants=network-online.target
Requires=zfs-mount.service
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--storage.tsdb.retention.time=90d \
--storage.tsdb.retention.size=50GB \
--storage.tsdb.wal-compression \
--web.listen-address=:9090 \
--web.enable-lifecycle \
--web.enable-admin-api
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=5
ProtectSystem=strict
ProtectHome=yes
NoNewPrivileges=yes
ReadWritePaths=/var/lib/prometheus
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now prometheus
Key flags: --storage.tsdb.retention.time=90d keeps 90 days of data.
--storage.tsdb.retention.size=50GB caps disk usage. Whichever limit is hit first
triggers eviction of the oldest blocks. --storage.tsdb.wal-compression compresses the
write-ahead log, saving 50% WAL space. --web.enable-lifecycle allows hot-reloading
config via curl -X POST http://localhost:9090/-/reload.
Recording rules — pre-compute expensive queries
Recording rules run a PromQL query on a schedule and store the result as a new time series. Use them for dashboard queries that would be too expensive to compute on every page load.
cat > /etc/prometheus/recording_rules.yml << 'EOF'
groups:
  - name: node_recording_rules
    interval: 60s
    rules:
      # CPU utilization as a 0-1 ratio (pre-computed for dashboards)
      - record: instance:node_cpu_utilization:ratio
        expr: |
          1 - avg without(cpu, mode) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )
      # Memory utilization as a 0-1 ratio
      - record: instance:node_memory_utilization:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            / node_memory_MemTotal_bytes
          )
      # Disk I/O utilization
      - record: instance:node_disk_io_utilization:ratio
        expr: |
          rate(node_disk_io_time_seconds_total[5m])

  - name: zfs_recording_rules
    interval: 60s
    rules:
      # ARC hit rate (node_exporter exposes these counters without a _total suffix)
      - record: instance:zfs_arc_hit_ratio:ratio
        expr: |
          rate(node_zfs_arc_hits[5m])
          / (
            rate(node_zfs_arc_hits[5m])
            + rate(node_zfs_arc_misses[5m])
          )
      # Usage ratio per filesystem dataset (from zfs_exporter)
      - record: instance:zfs_pool_usage:ratio
        expr: |
          zfs_dataset_used_bytes{type="filesystem"}
          / (
            zfs_dataset_used_bytes{type="filesystem"}
            + zfs_dataset_available_bytes{type="filesystem"}
          )

  - name: wireguard_recording_rules
    interval: 60s
    rules:
      # WireGuard peer handshake staleness (seconds since last handshake)
      - record: instance:wireguard_peer_handshake_age:seconds
        expr: |
          time() - wireguard_latest_handshake_seconds
      # WireGuard throughput per peer
      - record: instance:wireguard_peer_sent_rate:bytes_per_second
        expr: |
          rate(wireguard_sent_bytes_total[5m])
EOF
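The ARC hit-rate rule divides the rate of hits by the combined rate of hits and misses. Because both rates cover the same window, the window length cancels and the ratio reduces to a ratio of counter deltas. This Python sketch shows that arithmetic on two readings of each counter. The function name hit_ratio is hypothetical, and unlike rate() it does not handle counter resets.

```python
def hit_ratio(hits_start, misses_start, hits_end, misses_end):
    """ARC hit ratio over a window, from two readings of each counter.

    Returns None when there was no cache traffic in the window, the case
    where the PromQL division would yield NaN.
    """
    hits = hits_end - hits_start
    misses = misses_end - misses_start
    total = hits + misses
    return hits / total if total else None


busy = hit_ratio(hits_start=1000, misses_start=100, hits_end=1900, misses_end=200)
idle = hit_ratio(hits_start=5, misses_start=5, hits_end=5, misses_end=5)
print(busy, idle)
```

An idle host produces no ratio at all, so dashboards and alerts built on this rule should expect gaps on quiet machines.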
Federation — multi-cluster Prometheus
If you have multiple kldload clusters (e.g., prod-east and prod-west), a federated Prometheus on a central node can scrape the recording rules from each cluster's Prometheus:
# On the central/global Prometheus
scrape_configs:
  - job_name: "federate-prod-east"
    honor_labels: true
    metrics_path: "/federate"
    params:
      'match[]':
        - '{__name__=~"instance:.*"}'  # All recording rules
        - '{__name__=~"job:.*"}'       # Job-level aggregates
        - 'up'                         # Target health
    static_configs:
      - targets:
          - "10.79.0.1:9090"  # prod-east Prometheus over wg2

  - job_name: "federate-prod-west"
    honor_labels: true
    metrics_path: "/federate"
    params:
      'match[]':
        - '{__name__=~"instance:.*"}'
        - '{__name__=~"job:.*"}'
        - 'up'
    static_configs:
      - targets:
          - "10.79.10.1:9090"  # prod-west Prometheus over wg2
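Prometheus builds the /federate request from metrics_path and params, repeating match[] once per selector. To test a selector by hand before adding it to the config, you can build the same URL yourself. This Python sketch does that with the standard library; the helper name federate_url is an assumption, and the address is the prod-east example from the config.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base, selectors):
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base}/federate?{query}"


selectors = ['{__name__=~"instance:.*"}', "up"]
url = federate_url("http://10.79.0.1:9090", selectors)
print(url)

# Decoding the query string returns the original selectors
decoded = parse_qs(urlsplit(url).query)["match[]"]
```

Passing the printed URL to curl on the central node shows exactly which series the federation job would ingest.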
Remote write — Thanos or Mimir for long-term storage
For retention beyond what local ZFS can hold, Prometheus can remote-write to Thanos or Grafana Mimir, which store data in object storage (S3, MinIO). Add to prometheus.yml:
# Remote write to Thanos receive or Mimir
remote_write:
  - url: "http://thanos-receive.internal:19291/api/v1/receive"
    queue_config:
      max_samples_per_send: 5000
      batch_send_deadline: 5s
      max_shards: 10
    write_relabel_configs:
      # Only send recording rules to long-term storage (reduce volume)
      - source_labels: [__name__]
        regex: 'instance:.*|job:.*'
        action: keep
Service discovery
For large deployments, static configs become unwieldy. Prometheus supports file-based service discovery — drop JSON or YAML files into a directory and Prometheus picks up new targets automatically:
# In prometheus.yml
scrape_configs:
  - job_name: "node"
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/nodes/*.yml"
        refresh_interval: 30s

# /etc/prometheus/targets/nodes/prod.yml
# Add/remove hosts by editing this file; no Prometheus restart needed
- targets:
    - "10.78.0.1:9100"
    - "10.78.1.1:9100"
    - "10.78.2.1:9100"
  labels:
    environment: "production"
    site: "east"
- targets:
    - "10.78.10.1:9100"
    - "10.78.11.1:9100"
  labels:
    environment: "production"
    site: "west"
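Target files are usually generated from an inventory, not typed by hand. File-based discovery also accepts JSON, which the Python standard library can emit without extra packages. The sketch below builds the same two target groups. It assumes you add a "*.json" glob to file_sd_configs, and the function name file_sd_groups and the inventory dictionary are illustrative.

```python
import json


def file_sd_groups(hosts_by_site, port=9100, environment="production"):
    """Build file_sd target groups, one per site."""
    groups = []
    for site, hosts in sorted(hosts_by_site.items()):
        groups.append({
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": {"environment": environment, "site": site},
        })
    return groups


inventory = {
    "east": ["10.78.0.1", "10.78.1.1", "10.78.2.1"],
    "west": ["10.78.10.1", "10.78.11.1"],
}
groups = file_sd_groups(inventory)
payload = json.dumps(groups, indent=2)
print(payload)
```

Write the payload to a temporary file and rename it into the targets directory so Prometheus never reads a partial file.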
Verify
# Check config syntax
promtool check config /etc/prometheus/prometheus.yml
# Checking /etc/prometheus/prometheus.yml
# SUCCESS: 2 rule files found
# SUCCESS: /etc/prometheus/prometheus.yml is valid prometheus config file
# Check that the server is healthy
curl -s http://localhost:9090/-/healthy
# Prometheus Server is Healthy.
# Query the API
curl -s 'http://localhost:9090/api/v1/targets' | python3 -m json.tool | head -20
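Piping the targets API through json.tool shows everything. In practice you usually want only the targets that are failing. This Python sketch filters the response down to those. The function name unhealthy_targets is hypothetical and the sample payload is hand-written for illustration; it uses only the activeTargets, labels, health and lastError fields of the real response.

```python
import json


def unhealthy_targets(payload):
    """Return (job, instance, lastError) for every active target that is not up."""
    data = json.loads(payload)["data"]
    return [
        (t["labels"].get("job"), t["labels"].get("instance"), t.get("lastError", ""))
        for t in data["activeTargets"]
        if t["health"] != "up"
    ]


# Hand-written sample in the shape of /api/v1/targets
SAMPLE = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "node", "instance": "10.78.1.1"},
         "health": "up", "lastError": ""},
        {"labels": {"job": "zfs", "instance": "10.78.2.1"},
         "health": "down", "lastError": "connection refused"},
    ]},
})

failing = unhealthy_targets(SAMPLE)
print(failing)
```

An empty list means every configured target was scraped successfully on its last attempt.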
Grafana — dashboards and visualization
Grafana is the visualization layer. On kldload systems, we provision Grafana entirely via YAML and JSON — data sources, dashboards, and alert notification channels are all defined as files, deployed with the system, and version-controlled in git. No clicking through a web UI to configure things that should be code.
Installation
CentOS / RHEL / Rocky / Fedora
cat > /etc/yum.repos.d/grafana.repo << 'EOF'
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF
dnf install -y grafana
Debian / Ubuntu
apt install -y apt-transport-https software-properties-common
curl -fsSL https://apt.grafana.com/gpg.key | gpg --dearmor -o /usr/share/keyrings/grafana.gpg
echo "deb [signed-by=/usr/share/keyrings/grafana.gpg] https://apt.grafana.com stable main" \
> /etc/apt/sources.list.d/grafana.list
apt update && apt install -y grafana
systemctl enable --now grafana-server
Open http://<monitoring-node>:3000 — default login is
admin/admin. You will be prompted to change the password.
Provisioning via YAML — no GUI clicking
Grafana reads provisioning files from /etc/grafana/provisioning/. This is how you
define data sources and dashboard directories as code:
# Data source provisioning
cat > /etc/grafana/provisioning/datasources/prometheus.yml << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "15s"
      httpMethod: POST
  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100
    editable: false
    jsonData:
      maxLines: 1000
EOF
# Dashboard provisioning
mkdir -p /var/lib/grafana/dashboards
cat > /etc/grafana/provisioning/dashboards/kldload.yml << 'EOF'
apiVersion: 1
providers:
  - name: 'kldload'
    orgId: 1
    folder: 'kldload'
    type: file
    disableDeletion: true
    updateIntervalSeconds: 30
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false
EOF
Dashboard: Host Overview
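This dashboard JSON gives you CPU, memory, network, and disk for every kldload node. The
{"dashboard": ..., "overwrite": true} envelope shown here is the payload format of Grafana's
HTTP API (POST /api/dashboards/db). For file provisioning, save only the inner "dashboard"
object, with title, uid, and panels at the top level, as
/var/lib/grafana/dashboards/host-overview.json: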
{
"dashboard": {
"title": "kldload Host Overview",
"uid": "kldload-host-overview",
"timezone": "browser",
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"templating": {
"list": [
{
"name": "instance",
"type": "query",
"query": "label_values(up{job=\"node\"}, instance)",
"datasource": "Prometheus",
"refresh": 2,
"includeAll": true,
"multi": true
}
]
},
"panels": [
{
"title": "CPU Usage",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
"targets": [
{
"expr": "instance:node_cpu_utilization:ratio{instance=~\"$instance\"} * 100",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": { "unit": "percent", "max": 100, "min": 0 }
}
},
{
"title": "Memory Usage",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
"targets": [
{
"expr": "instance:node_memory_utilization:ratio{instance=~\"$instance\"} * 100",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": { "unit": "percent", "max": 100, "min": 0 }
}
},
{
"title": "Network Receive",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
"targets": [
{
"expr": "rate(node_network_receive_bytes_total{device!~\"lo|veth.*|br.*|docker.*\",instance=~\"$instance\"}[5m])",
"legendFormat": "{{instance}} - {{device}}"
}
],
"fieldConfig": {
"defaults": { "unit": "Bps" }
}
},
{
"title": "Disk I/O",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
"targets": [
{
"expr": "rate(node_disk_read_bytes_total{instance=~\"$instance\"}[5m])",
"legendFormat": "{{instance}} read - {{device}}"
},
{
"expr": "rate(node_disk_written_bytes_total{instance=~\"$instance\"}[5m])",
"legendFormat": "{{instance}} write - {{device}}"
}
],
"fieldConfig": {
"defaults": { "unit": "Bps" }
}
}
]
},
"overwrite": true
}
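The difference between the API payload and the provisioning file can be sketched in a few lines of Python. This is a minimal illustration, not part of the kldload tooling: the function name to_provisioning_model is hypothetical, and it assumes the file provisioner wants the inner dashboard object with no instance-specific numeric id.

```python
import json


def to_provisioning_model(payload: dict) -> dict:
    """Return the bare dashboard model from an HTTP-API-style payload.

    Accepts either {"dashboard": {...}, "overwrite": true} or an
    already-bare model, so it is safe to run twice.
    """
    model = payload.get("dashboard", payload)
    if not model.get("title"):
        raise ValueError("dashboard model has no title")
    # The numeric "id" is specific to one Grafana instance; drop it and
    # rely on "uid" to identify the dashboard.
    return {k: v for k, v in model.items() if k != "id"}


api_payload = {
    "dashboard": {
        "title": "kldload Host Overview",
        "uid": "kldload-host-overview",
        "panels": [],
    },
    "overwrite": True,
}
model = to_provisioning_model(api_payload)
print(json.dumps(model, indent=2))
```

Writing the printed result to /var/lib/grafana/dashboards/host-overview.json gives the provisioner a file it can load without the API wrapper.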
Dashboard: ZFS Health
The ZFS dashboard shows pool state, usage, ARC performance, scrub progress, and snapshot counts. These are the PromQL queries for each panel:
# Panel: Pool Health (stat panel, green/red)
zfs_pool_health{state="online",instance=~"$instance"}
# Panel: Pool Usage Percentage (gauge)
zfs_pool_usage_percent{instance=~"$instance"}
# Panel: ARC Hit Rate (gauge, threshold: green >90, yellow >80, red <80)
instance:zfs_arc_hit_ratio:ratio{instance=~"$instance"} * 100
# Panel: ARC Size vs Target (timeseries)
node_zfs_arc_size{instance=~"$instance"} # actual ARC size
zfs_arc_target_bytes{instance=~"$instance"} # target (arc_c)
node_memory_MemTotal_bytes{instance=~"$instance"} * 0.5 # 50% of RAM reference line
# Panel: Scrub Progress (bar gauge)
zfs_pool_scrub_progress{instance=~"$instance"} * 100
# Panel: Scrub Errors (stat, threshold: 0=green, >0=red)
zfs_scrub_errors_total{instance=~"$instance"}
# Panel: Snapshot Count per Dataset (table)
zfs_snapshot_count{instance=~"$instance"}
# Panel: Dataset Usage (bar chart)
zfs_dataset_used_bytes{instance=~"$instance",type="filesystem"}
# Panel: Pool I/O (timeseries)
rate(zfs_pool_read_ops_total{instance=~"$instance"}[5m])
rate(zfs_pool_write_ops_total{instance=~"$instance"}[5m])
# Panel: Pool Throughput (timeseries)
rate(zfs_pool_read_bytes_total{instance=~"$instance"}[5m])
rate(zfs_pool_write_bytes_total{instance=~"$instance"}[5m])
# Panel: Compression Ratio (table)
zfs_compression_ratio{instance=~"$instance"}
Dashboard: ARC Performance
The ARC (Adaptive Replacement Cache) is ZFS's read cache. If the ARC hit rate drops below 85%, you are leaving performance on the table. This dashboard helps you tune ARC sizing.
# Panel: ARC Hit Rate over Time (timeseries)
rate(node_zfs_arc_hits_total{instance=~"$instance"}[5m])
/ (rate(node_zfs_arc_hits_total{instance=~"$instance"}[5m])
+ rate(node_zfs_arc_misses_total{instance=~"$instance"}[5m])) * 100
# Panel: ARC MFU vs MRU (stacked area)
# MFU = Most Frequently Used, MRU = Most Recently Used
zfs_arc_mfu_size_bytes{instance=~"$instance"}
zfs_arc_mru_size_bytes{instance=~"$instance"}
# Panel: ARC Evictions (timeseries — high evictions = ARC too small)
rate(node_zfs_arc_evict_skip_total{instance=~"$instance"}[5m])
# Panel: ARC Demand vs Prefetch Hits (timeseries)
rate(node_zfs_arc_demand_hits_total{instance=~"$instance"}[5m])
rate(node_zfs_arc_prefetch_hits_total{instance=~"$instance"}[5m])
# Panel: L2ARC Hit Rate (if you have L2ARC configured)
rate(zfs_arc_l2_hits_total{instance=~"$instance"}[5m])
/ (rate(zfs_arc_l2_hits_total{instance=~"$instance"}[5m])
+ rate(zfs_arc_l2_misses_total{instance=~"$instance"}[5m])) * 100
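The hit-rate panels all use the same arithmetic: hits divided by hits plus misses, computed on counter increases over the window. A minimal Python sketch of that calculation follows; the function name and the sample counter values are made up for illustration.

```python
def hit_rate_percent(hits_then, hits_now, misses_then, misses_now):
    """Hit rate over a window, from two samples of each counter.

    Mirrors rate(hits) / (rate(hits) + rate(misses)) * 100: the window
    length cancels out, so only the counter increases matter.
    """
    d_hits = hits_now - hits_then
    d_misses = misses_now - misses_then
    total = d_hits + d_misses
    if total <= 0:
        return None  # no cache lookups in the window (or a counter reset)
    return 100.0 * d_hits / total


# 900 new hits and 100 new misses during the window
print(hit_rate_percent(1000, 1900, 100, 200))
```

Returning None for an idle window matches what the PromQL does in that case: a division by zero yields NaN rather than a misleading 0%.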
Dashboard: WireGuard Mesh
# Panel: Peer Status (stat, per peer — green if handshake <180s ago)
time() - wireguard_latest_handshake_seconds{instance=~"$instance"}
# Panel: Peer Throughput (timeseries)
rate(wireguard_sent_bytes_total{instance=~"$instance"}[5m])
rate(wireguard_received_bytes_total{instance=~"$instance"}[5m])
# Panel: Handshake Age (table — sort by age, flag stale peers)
sort_desc(
time() - wireguard_latest_handshake_seconds{instance=~"$instance"}
)
# Panel: Total Mesh Traffic (single stat)
sum(rate(wireguard_sent_bytes_total[5m])) + sum(rate(wireguard_received_bytes_total[5m]))
Dashboard: KVM Virtual Machines
# Panel: VM CPU Usage (timeseries, requires libvirt_exporter)
rate(libvirt_domain_info_cpu_time_seconds_total{instance=~"$instance"}[5m])
# Panel: VM Memory Usage (gauge)
libvirt_domain_info_memory_usage_bytes{instance=~"$instance"}
/ libvirt_domain_info_maximum_memory_bytes{instance=~"$instance"} * 100
# Panel: VM Disk Read/Write (timeseries)
rate(libvirt_domain_block_stats_read_bytes_total{instance=~"$instance"}[5m])
rate(libvirt_domain_block_stats_write_bytes_total{instance=~"$instance"}[5m])
# Panel: VM Network I/O (timeseries)
rate(libvirt_domain_interface_stats_receive_bytes_total{instance=~"$instance"}[5m])
rate(libvirt_domain_interface_stats_transmit_bytes_total{instance=~"$instance"}[5m])
# Panel: VM State (stat — running=green, shutoff=grey, paused=yellow)
libvirt_domain_info_state{instance=~"$instance"}
Dashboard: eBPF Metrics
# Panel: Syscall Latency p99 (requires eBPF exporter or textfile metrics)
histogram_quantile(0.99, rate(ebpf_syscall_latency_seconds_bucket[5m]))
# Panel: TCP Retransmits per Second
rate(node_netstat_Tcp_RetransSegs[5m])
# Panel: Block I/O Latency p99 (bcc/bpftrace textfile metrics)
ebpf_bio_latency_seconds{quantile="0.99",instance=~"$instance"}
# Panel: TCP Connection Rate
rate(node_netstat_Tcp_ActiveOpens[5m])
rate(node_netstat_Tcp_PassiveOpens[5m])
Alertmanager — alert routing and notification
Alertmanager receives alerts from Prometheus, deduplicates them, groups related alerts, applies silences and inhibitions, and routes them to the correct receiver. It runs as a separate process so Prometheus can be restarted without losing alert state.
Installation
curl -LO https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar xzf alertmanager-0.27.0.linux-amd64.tar.gz
cp alertmanager-0.27.0.linux-amd64/{alertmanager,amtool} /usr/local/bin/
chmod 755 /usr/local/bin/{alertmanager,amtool}
mkdir -p /etc/alertmanager /var/lib/alertmanager
Complete alertmanager.yml
# /etc/alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
smtp_from: 'alertmanager@kldload.local'
smtp_smarthost: 'smtp.example.com:587'
smtp_auth_username: 'alerts@example.com'
smtp_auth_password: 'smtp-password-here'
smtp_require_tls: true
slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
# Inhibition rules: suppress lower-severity alerts when critical fires
inhibit_rules:
# If a node is down, suppress all other alerts for that node
- source_matchers:
- alertname = NodeDown
target_matchers:
- severity =~ "warning|info"
equal: ['instance']
# If a pool is faulted, suppress degraded alerts
- source_matchers:
- alertname = ZFSPoolFaulted
target_matchers:
- alertname = ZFSPoolDegraded
equal: ['instance', 'pool']
# Routing tree
route:
receiver: 'default-slack'
group_by: ['alertname', 'cluster', 'instance']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
# Critical alerts go to PagerDuty AND Slack immediately
- matchers:
- severity = critical
receiver: 'critical-pagerduty'
group_wait: 10s
repeat_interval: 1h
continue: true # Also send to next matching route
- matchers:
- severity = critical
receiver: 'critical-slack'
group_wait: 10s
# Non-critical ZFS alerts go to the storage channel (critical ones stop at critical-slack above)
- matchers:
- alertname =~ "ZFS.*|ARC.*|Scrub.*|Pool.*"
receiver: 'storage-slack'
group_by: ['alertname', 'pool']
# WireGuard alerts go to the network channel
- matchers:
- alertname =~ "WireGuard.*|Peer.*"
receiver: 'network-slack'
# Receivers
receivers:
- name: 'default-slack'
slack_configs:
- channel: '#monitoring'
send_resolved: true
title: '{{ .Status | toUpper }} {{ .CommonLabels.alertname }}'
text: >-
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Instance:* {{ .Labels.instance }}
*Severity:* {{ .Labels.severity }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
{{ end }}
- name: 'critical-pagerduty'
pagerduty_configs:
- service_key: 'your-pagerduty-service-key'
send_resolved: true
- name: 'critical-slack'
slack_configs:
- channel: '#critical-alerts'
send_resolved: true
color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
title: '{{ .Status | toUpper }} {{ .CommonLabels.alertname }}'
text: >-
{{ range .Alerts }}
*{{ .Labels.alertname }}* on {{ .Labels.instance }}
{{ .Annotations.summary }}
{{ end }}
- name: 'storage-slack'
slack_configs:
- channel: '#storage-alerts'
send_resolved: true
title: 'ZFS {{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
text: >-
{{ range .Alerts }}
*Pool:* {{ if .Labels.pool }}{{ .Labels.pool }}{{ else }}n/a{{ end }}
*Instance:* {{ .Labels.instance }}
{{ .Annotations.summary }}
{{ end }}
- name: 'network-slack'
slack_configs:
- channel: '#network-alerts'
send_resolved: true
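To see which receivers an alert actually reaches, the routing tree above can be modeled in a few lines. This is a simplified sketch of the matching behavior as I understand it (sibling routes checked in order, regex matchers fully anchored, evaluation stops at the first match unless continue is set, and the root receiver is the fallback); it ignores grouping and timing entirely.

```python
import re

# (label, anchored regex, receiver, continue) in the same order as the config
ROUTES = [
    ("severity", r"critical", "critical-pagerduty", True),
    ("severity", r"critical", "critical-slack", False),
    ("alertname", r"ZFS.*|ARC.*|Scrub.*|Pool.*", "storage-slack", False),
    ("alertname", r"WireGuard.*|Peer.*", "network-slack", False),
]
DEFAULT_RECEIVER = "default-slack"


def receivers_for(labels):
    """Return the receivers an alert with these labels is routed to."""
    matched = []
    for label, pattern, receiver, cont in ROUTES:
        # A missing label is treated as an empty string.
        if re.fullmatch(pattern, labels.get(label, "")):
            matched.append(receiver)
            if not cont:
                break  # first match wins unless continue: true
    return matched or [DEFAULT_RECEIVER]


for alert in (
    {"alertname": "ZFSPoolFaulted", "severity": "critical"},
    {"alertname": "ZFSPoolUsageHigh", "severity": "warning"},
    {"alertname": "WireGuardPeerStale", "severity": "warning"},
    {"alertname": "HighCPU", "severity": "warning"},
):
    print(alert["alertname"], "->", receivers_for(alert))
```

One consequence worth noticing: a critical ZFS alert is delivered to PagerDuty and #critical-alerts but never to #storage-alerts, because the second critical route does not set continue. Add continue: true there if the storage channel should also see critical pool alerts.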
Systemd unit
cat > /etc/systemd/system/alertmanager.service << 'EOF'
[Unit]
Description=Prometheus Alertmanager
Documentation=https://prometheus.io/docs/alerting/latest/alertmanager/
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager \
--web.listen-address=:9093 \
--cluster.listen-address=""
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=5
ProtectSystem=strict
ProtectHome=yes
NoNewPrivileges=yes
ReadWritePaths=/var/lib/alertmanager
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now alertmanager
Complete alert rules
These are the alert rules that cover every critical dimension of a kldload system. Drop this
into /etc/prometheus/alerts.yml:
cat > /etc/prometheus/alerts.yml << 'EOF'
groups:
# ── Node health ────────────────────────────────────────
- name: node_health
rules:
- alert: NodeDown
expr: up{job="node"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.instance }} is unreachable"
description: "Prometheus has been unable to scrape {{ $labels.instance }} for 2 minutes."
- alert: HighCPU
expr: instance:node_cpu_utilization:ratio > 0.90
for: 10m
labels:
severity: warning
annotations:
summary: "CPU usage >90% on {{ $labels.instance }}"
description: "CPU has been above 90% for 10 minutes. Current: {{ $value | humanizePercentage }}"
- alert: HighMemory
expr: instance:node_memory_utilization:ratio > 0.90
for: 5m
labels:
severity: warning
annotations:
summary: "Memory usage >90% on {{ $labels.instance }}"
description: "Memory usage is {{ $value | humanizePercentage }}. Consider lowering the ARC limit (zfs_arc_max) or adding RAM."
- alert: HighMemoryCritical
expr: instance:node_memory_utilization:ratio > 0.95
for: 2m
labels:
severity: critical
annotations:
summary: "Memory usage >95% on {{ $labels.instance }} — OOM risk"
description: "Memory usage is {{ $value | humanizePercentage }}. OOM killer may activate."
- alert: DiskSpaceLow
expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "Less than 10% disk space on {{ $labels.instance }}"
- alert: DiskSpaceCritical
expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Less than 5% disk space on {{ $labels.instance }} — risk of data loss"
- alert: ClockDrift
expr: abs(node_ntp_offset_seconds) > 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "NTP clock drift >100ms on {{ $labels.instance }}"
description: "Clock offset: {{ $value }}s. This can cause Prometheus sample ordering issues."
- alert: SystemdUnitFailed
expr: node_systemd_unit_state{state="failed"} == 1
for: 5m
labels:
severity: warning
annotations:
summary: "systemd unit {{ $labels.name }} failed on {{ $labels.instance }}"
# ── ZFS health ─────────────────────────────────────────
- name: zfs_health
rules:
- alert: ZFSPoolDegraded
expr: zfs_pool_health{state="degraded"} == 1
for: 1m
labels:
severity: critical
annotations:
summary: "ZFS pool {{ $labels.pool }} DEGRADED on {{ $labels.instance }}"
description: "A vdev in pool {{ $labels.pool }} has failed. Data is still accessible but redundancy is lost. Replace the failed device immediately."
- alert: ZFSPoolFaulted
expr: zfs_pool_health{state="faulted"} == 1
for: 0m
labels:
severity: critical
annotations:
summary: "ZFS pool {{ $labels.pool }} FAULTED on {{ $labels.instance }}"
description: "Pool {{ $labels.pool }} has experienced an unrecoverable error. DATA MAY BE INACCESSIBLE."
- alert: ZFSPoolUsageHigh
expr: zfs_pool_usage_percent > 80
for: 30m
labels:
severity: warning
annotations:
summary: "ZFS pool usage >80% on {{ $labels.instance }}"
description: "Pool {{ $labels.pool }} is {{ $value }}% full. ZFS performance degrades significantly above 80% capacity. Add storage or delete snapshots."
- alert: ZFSPoolUsageCritical
expr: zfs_pool_usage_percent > 90
for: 10m
labels:
severity: critical
annotations:
summary: "ZFS pool usage >90% on {{ $labels.instance }} — critical"
description: "Pool {{ $labels.pool }} is {{ $value }}% full. Pool may become read-only at 100%."
- alert: ARCHitRateLow
expr: instance:zfs_arc_hit_ratio:ratio < 0.85
for: 30m
labels:
severity: warning
annotations:
summary: "ZFS ARC hit rate below 85% on {{ $labels.instance }}"
description: "ARC hit rate is {{ $value | humanizePercentage }}. Consider increasing ARC max size or investigating the workload. Below 85% means significant I/O is hitting disk."
- alert: ARCHitRateCritical
expr: instance:zfs_arc_hit_ratio:ratio < 0.70
for: 15m
labels:
severity: critical
annotations:
summary: "ZFS ARC hit rate below 70% on {{ $labels.instance }} — severe cache pressure"
description: "ARC hit rate is {{ $value | humanizePercentage }}. The working set exceeds ARC capacity. Add RAM or reduce dataset count."
- alert: ScrubErrors
expr: zfs_scrub_errors_total > 0
for: 0m
labels:
severity: critical
annotations:
summary: "ZFS scrub found errors on {{ $labels.instance }}"
description: "Pool {{ $labels.pool }} scrub detected {{ $value }} errors. Check `zpool status {{ $labels.pool }}` immediately."
- alert: ScrubOverdue
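# NOTE: this expression only makes sense if the metric holds the Unix timestamp of the last completed scrub, not an elapsed duration
expr: (time() - zfs_pool_scrub_duration_seconds) > (8 * 24 * 3600)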
for: 1h
labels:
severity: warning
annotations:
summary: "ZFS scrub overdue on {{ $labels.instance }}"
description: "Pool {{ $labels.pool }} has not been scrubbed in over 8 days."
# ── WireGuard health ───────────────────────────────────
- name: wireguard_health
rules:
- alert: WireGuardPeerStale
expr: instance:wireguard_peer_handshake_age:seconds > 300
for: 5m
labels:
severity: warning
annotations:
summary: "WireGuard peer {{ $labels.friendly_name }} stale on {{ $labels.instance }}"
description: "No handshake in {{ $value | humanizeDuration }}. Peer may be unreachable."
- alert: WireGuardPeerDown
expr: instance:wireguard_peer_handshake_age:seconds > 900
for: 5m
labels:
severity: critical
annotations:
summary: "WireGuard peer {{ $labels.friendly_name }} DOWN on {{ $labels.instance }}"
description: "No handshake in {{ $value | humanizeDuration }}. The peer is not reachable."
- alert: WireGuardNoTraffic
expr: rate(wireguard_received_bytes_total[15m]) == 0
for: 15m
labels:
severity: warning
annotations:
summary: "No WireGuard traffic from peer {{ $labels.friendly_name }} for 15m"
# ── Latency and performance ────────────────────────────
- name: performance
rules:
- alert: HighP99Latency
expr: histogram_quantile(0.99, rate(ebpf_syscall_latency_seconds_bucket[5m])) > 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "p99 syscall latency >100ms on {{ $labels.instance }}"
- alert: HighTCPRetransmits
expr: rate(node_netstat_Tcp_RetransSegs[5m]) > 50
for: 10m
labels:
severity: warning
annotations:
summary: "High TCP retransmit rate on {{ $labels.instance }}"
description: "{{ $value }} retransmits/sec. Check network path for congestion or packet loss."
- alert: HighDiskIOUtilization
expr: rate(node_disk_io_time_seconds_total[5m]) > 0.90
for: 15m
labels:
severity: warning
annotations:
summary: "Disk I/O utilization >90% on {{ $labels.instance }} device {{ $labels.device }}"
# ── Monitoring stack health ────────────────────────────
- name: monitoring_health
rules:
- alert: PrometheusTargetDown
expr: up == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus target {{ $labels.job }}/{{ $labels.instance }} is down"
- alert: PrometheusStorageFull
expr: prometheus_tsdb_storage_blocks_bytes / (1024*1024*1024) > 45
for: 30m
labels:
severity: warning
annotations:
summary: "Prometheus storage >45GB — approaching retention limit"
- alert: AlertmanagerDown
expr: up{job="alertmanager"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Alertmanager is down — alerts will not be delivered"
EOF
# Validate the rules
promtool check rules /etc/prometheus/alerts.yml
# Reload Prometheus (the HTTP reload endpoint only works when Prometheus runs with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
Silencing alerts during maintenance
# Silence all alerts for node-2 for 2 hours (maintenance window)
amtool silence add \
--alertmanager.url=http://localhost:9093 \
--author="todd" \
--comment="scheduled maintenance on node-2" \
--duration=2h \
instance="10.78.2.1"
# List active silences
amtool silence query --alertmanager.url=http://localhost:9093
# Expire a silence early (SILENCE_ID is the ID printed by "silence add" or "silence query")
amtool silence expire --alertmanager.url=http://localhost:9093 "$SILENCE_ID"
# View currently firing alerts
amtool alert query --alertmanager.url=http://localhost:9093
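The same maintenance silence can be created programmatically through the Alertmanager v2 HTTP API, which is handy in deployment scripts. The sketch below is an illustration under assumptions: the helper names are hypothetical, the base URL and instance label are the example values used above, and post_silence is defined but not called so the snippet runs without a live Alertmanager.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def silence_payload(matchers, author, comment, hours, now=None):
    """Build the JSON body for POST /api/v2/silences."""
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }


def post_silence(payload, base_url="http://localhost:9093"):
    """Send the silence to Alertmanager and return the new silence ID."""
    req = urllib.request.Request(
        base_url + "/api/v2/silences",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["silenceID"]


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
payload = silence_payload(
    {"instance": "10.78.2.1"}, "todd", "scheduled maintenance on node-2", 2, now=start
)
print(json.dumps(payload, indent=2))
```

Calling post_silence(payload) against a running Alertmanager returns the ID you would later pass to amtool silence expire.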
Loki — log aggregation
Loki is "Prometheus for logs." It stores log data with the same label model as Prometheus, so you can jump from a metric spike to the corresponding log lines in a single Grafana click. Unlike Elasticsearch or Splunk, Loki does not index log content — it only indexes labels. This makes it dramatically cheaper to operate: less CPU, less storage, less RAM.
Installation
# Loki server (on the monitoring node)
curl -LO https://github.com/grafana/loki/releases/download/v3.1.1/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
chmod 755 loki-linux-amd64
mv loki-linux-amd64 /usr/local/bin/loki
# Promtail (on every node)
curl -LO https://github.com/grafana/loki/releases/download/v3.1.1/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
chmod 755 promtail-linux-amd64
mv promtail-linux-amd64 /usr/local/bin/promtail
Loki storage on ZFS
zfs create -o mountpoint=/var/lib/loki \
-o compression=zstd \
-o recordsize=128k \
-o atime=off \
rpool/loki
mkdir -p /var/lib/loki/{chunks,index,wal,ruler}
useradd --no-create-home --shell /sbin/nologin --system loki
chown -R loki:loki /var/lib/loki
Loki configuration
mkdir -p /etc/loki
cat > /etc/loki/loki.yml << 'EOF'
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
log_level: warn
common:
path_prefix: /var/lib/loki
replication_factor: 1
ring:
kvstore:
store: inmemory
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
storage_config:
filesystem:
directory: /var/lib/loki/chunks
tsdb_shipper:
active_index_directory: /var/lib/loki/index
cache_location: /var/lib/loki/cache
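compactor:
working_directory: /var/lib/loki/compactor
# retention_period in limits_config is only enforced when the compactor has retention enabled
retention_enabled: true
delete_request_store: filesystem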
limits_config:
retention_period: 30d
max_query_series: 500
max_query_parallelism: 2
ingestion_rate_mb: 10
ingestion_burst_size_mb: 20
chunk_store_config:
chunk_cache_config:
embedded_cache:
enabled: true
max_size_mb: 256
query_range:
align_queries_with_step: true
cache_results: true
analytics:
reporting_enabled: false
EOF
Loki systemd unit
cat > /etc/systemd/system/loki.service << 'EOF'
[Unit]
Description=Grafana Loki Log Aggregation
Documentation=https://grafana.com/docs/loki/latest/
After=network-online.target zfs-mount.service
[Service]
User=loki
Group=loki
Type=simple
ExecStart=/usr/local/bin/loki \
-config.file=/etc/loki/loki.yml
Restart=always
RestartSec=5
ProtectSystem=strict
ProtectHome=yes
NoNewPrivileges=yes
ReadWritePaths=/var/lib/loki
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now loki
Promtail configuration (on every node)
mkdir -p /etc/promtail
cat > /etc/promtail/promtail.yml << 'EOF'
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /var/lib/promtail/positions.yml
clients:
- url: http://10.78.0.1:3100/loki/api/v1/push
scrape_configs:
# systemd journal — captures all systemd service logs
- job_name: journal
journal:
max_age: 12h
labels:
job: systemd-journal
relabel_configs:
- source_labels: ['__journal__systemd_unit']
target_label: 'unit'
- source_labels: ['__journal__hostname']
target_label: 'hostname'
- source_labels: ['__journal_priority_keyword']
target_label: 'level'
# Syslog and auth logs
- job_name: syslog
static_configs:
- targets: [localhost]
labels:
job: syslog
__path__: /var/log/syslog
- targets: [localhost]
labels:
job: auth
__path__: /var/log/auth.log
# ZFS event logs
- job_name: zfs
static_configs:
- targets: [localhost]
labels:
job: zfs
__path__: /var/log/zfs*.log
pipeline_stages:
- regex:
expression: '^(?P<timestamp>\S+ \S+) (?P<level>\w+) (?P<message>.*)$'
- labels:
level:
# Kernel logs (dmesg)
- job_name: kernel
static_configs:
- targets: [localhost]
labels:
job: kernel
__path__: /var/log/kern.log
# WireGuard logs (from journal)
- job_name: wireguard
journal:
max_age: 12h
labels:
job: wireguard
relabel_configs:
- source_labels: ['__journal__systemd_unit']
regex: 'wg-quick@.*\.service'
action: keep
- source_labels: ['__journal__systemd_unit']
target_label: 'unit'
# libvirt/KVM logs
- job_name: libvirt
static_configs:
- targets: [localhost]
labels:
job: libvirt
__path__: /var/log/libvirt/qemu/*.log
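# __path__ still holds the glob at relabel time, so extract the VM name
# from the per-file "filename" label in a pipeline stage instead
pipeline_stages:
- regex:
source: filename
expression: '.*/(?P<vm_name>[^/]+)\.log$'
- labels:
vm_name: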
EOF
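Before deploying a pipeline regex it helps to try it against a sample line. The Python check below is a sketch under assumptions: it presumes the ZFS job's three capture groups are named timestamp, level, and message (only level is confirmed by the labels stage), and the sample log line is invented, so substitute a real line from /var/log/zfs*.log.

```python
import re

# Same pattern shape as the zfs job's regex stage; group names other than
# "level" are assumptions made for this example.
ZFS_LINE = re.compile(r"^(?P<timestamp>\S+ \S+) (?P<level>\w+) (?P<message>.*)$")


def parse_zfs_line(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = ZFS_LINE.match(line)
    return match.groupdict() if match else None


sample = "2024-05-01 12:00:00 ERROR pool rpool: checksum mismatch"
print(parse_zfs_line(sample))
print(parse_zfs_line("unstructured"))
```

Lines that do not match are still shipped by Promtail; they simply arrive without a level label, which is why the second call returning None is not an error.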
Promtail systemd unit
mkdir -p /var/lib/promtail
cat > /etc/systemd/system/promtail.service << 'EOF'
[Unit]
Description=Grafana Promtail Log Shipper
Documentation=https://grafana.com/docs/loki/latest/clients/promtail/
After=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/promtail \
-config.file=/etc/promtail/promtail.yml
Restart=always
RestartSec=5
# Promtail needs read access to log files
ProtectSystem=strict
ProtectHome=yes
NoNewPrivileges=yes
ReadOnlyPaths=/var/log
ReadWritePaths=/var/lib/promtail
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now promtail
LogQL — querying logs
LogQL is Loki's query language. It looks like PromQL but operates on log streams. In Grafana, select the Loki data source and use these queries:
# All logs from a specific host
{hostname="node-1"}
# All error-level logs across all hosts
{level=~"err|crit|alert|emerg"}
# ZFS errors in kernel logs
{job="kernel"} |= "zfs" |= "error"
# WireGuard handshake failures
{job="wireguard"} |= "handshake"
# SSH authentication failures
{job="auth"} |= "Failed password"
# All logs from a specific systemd unit
{unit="prometheus.service"}
# Count errors per host over time (metric from logs)
sum by (hostname) (count_over_time({level="err"}[5m]))
# Top 10 noisiest systemd units
topk(10, sum by(unit) (count_over_time({job="systemd-journal"}[1h])))
# OOM killer events
{job="kernel"} |~ "Out of memory|oom-kill"
# ZFS scrub completions
{job="kernel"} |= "scan: scrub repaired"
# libvirt VM state changes
{job="libvirt"} |~ "domain.*state"
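The same queries can be run outside Grafana through Loki's HTTP API, which is useful for scripts and quick checks. The sketch below only builds the request URL so it runs without a Loki server; the helper name is hypothetical, the base URL is the monitoring node address used in the Promtail config, and it assumes start and end are passed as nanosecond Unix timestamps.

```python
from urllib.parse import urlencode


def loki_query_range_url(base, query, start_s, end_s, limit=100):
    """Build a /loki/api/v1/query_range URL for a LogQL query.

    start_s and end_s are integer Unix seconds; they are converted to
    nanoseconds for the API.
    """
    params = {
        "query": query,
        "start": start_s * 1_000_000_000,
        "end": end_s * 1_000_000_000,
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return f"{base}/loki/api/v1/query_range?{urlencode(params)}"


url = loki_query_range_url(
    "http://10.78.0.1:3100",
    '{job="auth"} |= "Failed password"',
    1700000000,
    1700003600,
)
print(url)
```

Fetching the printed URL with curl or urllib.request returns JSON whose data.result list holds the matching streams.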
Correlating logs with metrics
In Grafana, you can link from a metric panel to the corresponding logs. When you see a CPU spike, click the time range and jump to Loki to see what was running at that moment. This requires matching labels between Prometheus and Loki:
Both Promtail and node_exporter should produce a hostname or instance
label that matches. The Promtail journal source automatically uses
__journal__hostname. Prometheus uses the target address. Use relabel_configs to
normalize them to the same label so Grafana can correlate across data sources.
SLOs and error budgets
Service Level Objectives (SLOs) define how reliable your infrastructure must be. An error budget is the allowed amount of unreliability. If your SLO is 99.9% availability (43.8 minutes downtime/month), your error budget is 0.1%. Once you burn through it, you stop deploying and fix reliability. This is the SRE discipline that makes infrastructure sustainable.
Define SLOs for kldload infrastructure
SLO: Host availability 99.9%
SLI: avg_over_time(up{job="node"}[30d]). Target: >0.999.
Error budget: 43.8 minutes/month of allowed downtime per host.
SLO: ZFS pool online 99.99%
SLI: avg_over_time(zfs_pool_health{state="online"}[30d]). Target: >0.9999.
Error budget: 4.38 minutes/month. A pool going degraded or offline eats budget fast.
SLO: ARC hit rate >90%
SLI: instance:zfs_arc_hit_ratio:ratio. Target: >0.90.
When this SLO breaks, disk I/O increases and application latency rises. Add RAM.
SLO: WireGuard mesh connectivity 99.9%
SLI: fraction of peers with handshake <300s. Target: >0.999. A stale peer means a node is isolated from the mesh and cannot be managed.
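The error budget figures above follow directly from the SLO percentage and the window length. A short worked calculation, with a hypothetical helper name: the 43.8 and 4.38 minute figures correspond to an average calendar month (365.25 / 12 days), while a strict 30-day window, as used in the SLI queries, gives 43.2 and 4.32 minutes.

```python
def error_budget_minutes(slo, window_days=30.0):
    """Allowed downtime in minutes for a given SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


AVERAGE_MONTH_DAYS = 365.25 / 12  # about 30.44 days

for slo in (0.999, 0.9999):
    print(
        f"SLO {slo:.2%}: "
        f"{error_budget_minutes(slo):.2f} min per 30 days, "
        f"{error_budget_minutes(slo, AVERAGE_MONTH_DAYS):.2f} min per average month"
    )
```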
Recording rules for SLIs
# /etc/prometheus/recording_rules.yml (append to existing)
groups:
- name: slo_recording_rules
interval: 60s
rules:
# Host availability SLI (1 = up, 0 = down)
- record: slo:host_availability:ratio
expr: avg_over_time(up{job="node"}[30d])
# ZFS pool online SLI
- record: slo:zfs_pool_online:ratio
expr: avg_over_time(zfs_pool_health{state="online"}[30d])
# Error budget remaining (1 = full budget, 0 = budget exhausted)
- record: slo:host_availability:error_budget_remaining
expr: |
1 - (
(1 - slo:host_availability:ratio)
/ (1 - 0.999)
)
- record: slo:zfs_pool_online:error_budget_remaining
expr: |
1 - (
(1 - slo:zfs_pool_online:ratio)
/ (1 - 0.9999)
)
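The error_budget_remaining rules apply one formula: one minus the observed unavailability divided by the allowed unavailability. A small Python illustration with made-up availability values shows how the result behaves, including going negative once the budget is overspent.

```python
def error_budget_remaining(availability, slo):
    """Fraction of the error budget left: 1 = untouched, 0 = exhausted.

    Same arithmetic as the recording rule:
    1 - (1 - availability) / (1 - slo)
    """
    return 1.0 - (1.0 - availability) / (1.0 - slo)


# Measured 30-day availability vs. a 99.9% SLO (sample values)
for availability in (1.0, 0.9995, 0.999, 0.998):
    remaining = error_budget_remaining(availability, 0.999)
    print(f"availability {availability:.4f} -> budget remaining {remaining:+.2f}")
```

A value of 0.5 means half the allowed downtime has been used; a negative value means the SLO has already been missed for the window.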
Burn rate alerting
Instead of alerting on raw thresholds, burn rate alerting asks: "at the current error rate, how fast are we consuming the error budget?" This avoids alert fatigue from brief blips while catching sustained problems early.
# Burn rate alerts
groups:
- name: slo_burn_rate
rules:
# Fast burn: consuming budget at 14.4x rate over 1h (pages immediately)
- alert: HostAvailabilityBudgetFastBurn
expr: |
(1 - avg_over_time(up{job="node"}[1h])) > (14.4 * 0.001)
and
(1 - avg_over_time(up{job="node"}[5m])) > (14.4 * 0.001)
for: 2m
labels:
severity: critical
slo: host_availability
annotations:
summary: "Host availability error budget burning fast"
description: "At a sustained 14.4x burn rate, the 30-day error budget is exhausted in about 2 days (30 / 14.4)."
# Slow burn: consuming budget at 3x rate over 6h (warns early)
- alert: HostAvailabilityBudgetSlowBurn
expr: |
(1 - avg_over_time(up{job="node"}[6h])) > (3 * 0.001)
and
(1 - avg_over_time(up{job="node"}[30m])) > (3 * 0.001)
for: 5m
labels:
severity: warning
slo: host_availability
annotations:
summary: "Host availability error budget burning slowly"
description: "At a sustained 3x burn rate, the 30-day error budget is exhausted in 10 days (30 / 3)."
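The thresholds in these rules come from two small calculations: the error-rate threshold is the burn rate times the allowed error fraction, and the time to exhaustion is the window divided by the burn rate. The Python sketch below works through both and mimics the two-window condition; the function names and sample error rates are illustrative only.

```python
def burn_rate_threshold(burn_rate, slo):
    """Error fraction that corresponds to a given burn rate."""
    return burn_rate * (1.0 - slo)


def days_to_exhaustion(burn_rate, window_days=30.0):
    """Days until the whole budget is gone at a sustained burn rate."""
    return window_days / burn_rate


def burn_alert_fires(err_long, err_short, burn_rate, slo):
    """Both the long and the short window must exceed the threshold."""
    threshold = burn_rate_threshold(burn_rate, slo)
    return err_long > threshold and err_short > threshold


print(burn_rate_threshold(14.4, 0.999))   # fast-burn threshold used in the rule
print(days_to_exhaustion(14.4))           # roughly two days
print(days_to_exhaustion(3))              # ten days
# An outage that is already over: long window still high, short window clean
print(burn_alert_fires(0.02, 0.0, 14.4, 0.999))
```

The short window is what lets the alert resolve quickly after recovery: once the last few minutes are clean, the condition is false even though the long window still carries the outage.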
Multi-host monitoring over WireGuard
On a kldload mesh, every node runs exporters. The monitoring node runs Prometheus, Grafana, Alertmanager, and Loki. All scraping and log shipping happens over the WireGuard mesh — no public ports, no firewall exceptions, no VPN appliances.
Deployment topology
# Monitoring node (10.78.0.1 / wg0)
# Runs: Prometheus, Grafana, Alertmanager, Loki
# Runs: node_exporter, zfs_exporter, wireguard_exporter, promtail
# All other nodes (10.78.x.1 / wg0)
# Run: node_exporter, zfs_exporter, wireguard_exporter, promtail
# That's it. Four lightweight daemons. ~80MB RAM total.
# Dedicated metrics plane (optional — wg2 / 10.79.x.x)
# Separate WireGuard interface for monitoring traffic
# Allows different MTU, different firewall rules, different routing
# Prometheus scrapes 10.79.x.x:9100 instead of 10.78.x.x:9100
Secure transport
WireGuard encrypts all traffic by default. This means:
No TLS configuration needed on exporters. Prometheus scrapes over HTTP, but the HTTP runs inside the WireGuard tunnel, which provides authenticated encryption. This eliminates the complexity of managing TLS certificates for every exporter on every node. The mesh is the security boundary.
# On each node, ensure exporter ports are only reachable on WireGuard interfaces
# This prevents accidental exposure on public interfaces
# nftables rule (add to your existing ruleset)
nft add rule inet filter input iifname != "wg0" tcp dport { 9100, 9134, 9586, 9080 } drop
# Or with firewalld (CentOS/RHEL)
firewall-cmd --zone=public --remove-port=9100/tcp --permanent
firewall-cmd --zone=trusted --add-interface=wg0 --permanent
firewall-cmd --reload
Adding a new node to monitoring
# On the new node: install exporters + promtail (same steps as above)
# Then on the monitoring node:
# 1. Add to Prometheus targets (if using file_sd)
cat >> /etc/prometheus/targets/nodes/prod.yml << 'EOF'
- targets:
- "10.78.5.1:9100"
labels:
environment: "production"
role: "app-server"
EOF
# 2. Add ZFS and WireGuard exporter targets similarly
# 3. Prometheus auto-discovers via file_sd — no restart needed
# Verify the new target appears:
curl -s http://localhost:9090/api/v1/targets | python3 -c "
import json, sys
targets = json.load(sys.stdin)['data']['activeTargets']
for t in targets:
print(f\"{t['labels'].get('instance','?'):20s} {t['health']:8s} {t['lastScrape'][:19]}\")"
Monitoring ZFS replication
If you use sanoid/syncoid for ZFS snapshot management and replication, you need to know when replication falls behind or fails. A missed replication window means your DR target is stale. These textfile metrics make replication lag visible in Prometheus.
Replication metrics script
cat > /usr/local/bin/replication-metrics.sh << 'SCRIPT'
#!/bin/bash
# Replication lag and snapshot age metrics for sanoid/syncoid
set -euo pipefail
OUTPUT="/var/lib/node_exporter/textfile/replication.prom"
TMPFILE="${OUTPUT}.tmp"
{
echo "# HELP zfs_replication_lag_seconds Seconds since last successful syncoid replication"
echo "# TYPE zfs_replication_lag_seconds gauge"
echo "# HELP zfs_latest_snapshot_age_seconds Age of the newest snapshot in seconds"
echo "# TYPE zfs_latest_snapshot_age_seconds gauge"
echo "# HELP zfs_sanoid_snapshot_count Number of snapshots managed by sanoid"
echo "# TYPE zfs_sanoid_snapshot_count gauge"
NOW=$(date +%s)
# Check replication lag by looking at the newest syncoid snapshot on the target
# This assumes syncoid snapshot names contain "syncoid_"
# -p prints creation as a Unix epoch, so no date parsing is needed
# "|| true" keeps set -e / pipefail from aborting when grep finds nothing
for ds in $(zfs list -H -o name -t filesystem 2>/dev/null); do
newest_sync=$(zfs list -t snapshot -Hp -o name,creation -S creation "$ds" 2>/dev/null \
| grep "syncoid_" | head -1 || true)
if [[ -n "$newest_sync" ]]; then
snap_epoch=$(echo "$newest_sync" | awk '{print $2}')
lag=$((NOW - snap_epoch))
echo "zfs_replication_lag_seconds{dataset=\"${ds}\"} ${lag}"
fi
done
# Latest snapshot age per dataset (any type, not just syncoid)
for ds in $(zfs list -H -o name -t filesystem 2>/dev/null); do
newest=$(zfs list -t snapshot -Hp -o name,creation -S creation "$ds" 2>/dev/null | head -1 || true)
if [[ -n "$newest" ]]; then
snap_epoch=$(echo "$newest" | awk '{print $2}')
age=$((NOW - snap_epoch))
echo "zfs_latest_snapshot_age_seconds{dataset=\"${ds}\"} ${age}"
fi
done
# Sanoid snapshot count per dataset
# grep -c already prints 0 when nothing matches; "|| true" only absorbs its exit status
for ds in $(zfs list -H -o name -t filesystem 2>/dev/null); do
count=$(zfs list -t snapshot -H -o name "$ds" 2>/dev/null | grep -c "autosnap" || true)
echo "zfs_sanoid_snapshot_count{dataset=\"${ds}\"} ${count}"
done
} > "${TMPFILE}"
mv "${TMPFILE}" "${OUTPUT}"
SCRIPT
chmod 755 /usr/local/bin/replication-metrics.sh
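The lag calculation is easy to verify offline against captured zfs output. The Python sketch below parses text in the tab-separated form produced by zfs list -t snapshot -Hp -o name,creation (creation as a Unix epoch) and emits the same metric lines; the function names and the sample snapshot names are invented for the example.

```python
def newest_snapshot_epochs(zfs_output, marker=""):
    """Map each dataset to the creation epoch of its newest matching snapshot.

    zfs_output: lines of "dataset@snapshot<TAB>epoch".
    marker: only snapshots whose name contains this string are considered.
    """
    newest = {}
    for line in zfs_output.splitlines():
        if not line.strip():
            continue
        name, creation = line.split("\t")
        dataset, _, snapshot = name.partition("@")
        if marker and marker not in snapshot:
            continue
        newest[dataset] = max(newest.get(dataset, 0), int(creation))
    return newest


def lag_metrics(zfs_output, now, marker="syncoid_"):
    """Render zfs_replication_lag_seconds lines in Prometheus text format."""
    epochs = newest_snapshot_epochs(zfs_output, marker)
    return [
        f'zfs_replication_lag_seconds{{dataset="{ds}"}} {now - epoch}'
        for ds, epoch in sorted(epochs.items())
    ]


sample = (
    "rpool/data@syncoid_node1_2024\t1700000000\n"
    "rpool/data@autosnap_hourly\t1700000300\n"
    "rpool/home@autosnap_hourly\t1700000100\n"
)
for line in lag_metrics(sample, now=1700000600):
    print(line)
```

Datasets with no syncoid snapshot produce no lag series at all, so an alert on absent data may be worth adding if every dataset is expected to replicate.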
Systemd timer for replication metrics
cat > /etc/systemd/system/replication-metrics.service << 'EOF'
[Unit]
Description=Generate ZFS replication metrics for node_exporter
[Service]
Type=oneshot
ExecStart=/usr/local/bin/replication-metrics.sh
User=root
EOF
cat > /etc/systemd/system/replication-metrics.timer << 'EOF'
[Unit]
Description=Run replication metrics every 5 minutes
[Timer]
OnBootSec=60
OnUnitActiveSec=5min
[Install]
WantedBy=timers.target
EOF
systemctl daemon-reload
systemctl enable --now replication-metrics.timer
Replication alert rules
# Add to /etc/prometheus/alerts.yml
groups:
  - name: replication_health
    rules:
      - alert: ReplicationLagHigh
        expr: zfs_replication_lag_seconds > 3600
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "ZFS replication lag >1h on {{ $labels.instance }} dataset {{ $labels.dataset }}"
          description: "Last syncoid snapshot is {{ $value | humanizeDuration }} old."
      - alert: ReplicationLagCritical
        expr: zfs_replication_lag_seconds > 86400
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "ZFS replication lag >24h on {{ $labels.instance }}"
          description: "Dataset {{ $labels.dataset }} has not replicated in {{ $value | humanizeDuration }}. DR target is dangerously stale."
      - alert: SnapshotAgeHigh
        expr: zfs_latest_snapshot_age_seconds > 7200
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "No new snapshots for >2h on {{ $labels.instance }} dataset {{ $labels.dataset }}"
          description: "Sanoid may have stopped or the cron/timer is failing."
      - alert: SanoidNotRunning
        expr: node_systemd_unit_state{name="sanoid.timer",state="active"} != 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "sanoid timer not active on {{ $labels.instance }}"
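Rules like these can be exercised with promtool's unit-test format before they reach production. The sketch below writes a test file for ReplicationLagHigh. The helper name, the instance and dataset label values, and the synthetic series are all assumptions for illustration. The expected annotation strings were derived by hand from the rule templates above, so adjust them if your rules differ.

```shell
# Hypothetical helper: write a promtool unit test for ReplicationLagHigh.
# Args: path to the rules file, path of the test file to create.
write_replication_alert_test() {
    local rules="$1" dest="$2"
    cat > "$dest" << EOF
rule_files:
  - ${rules}
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Synthetic series: lag starts at 2h and grows by 60s every minute.
      - series: 'zfs_replication_lag_seconds{instance="node1", dataset="rpool/data"}'
        values: '7200+60x60'
    alert_rule_test:
      - eval_time: 45m
        alertname: ReplicationLagHigh
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: node1
              dataset: rpool/data
            exp_annotations:
              summary: "ZFS replication lag >1h on node1 dataset rpool/data"
              description: "Last syncoid snapshot is 2h 45m 0s old."
EOF
}

# Example (paths are assumptions; not executed here):
#   mkdir -p /etc/prometheus/tests
#   write_replication_alert_test /etc/prometheus/alerts.yml /etc/prometheus/tests/alert_tests.yml
#   promtool test rules /etc/prometheus/tests/alert_tests.yml
```

The series sits above the 3600-second threshold from the first sample, so the 30m "for" clause has elapsed by the 45-minute evaluation point.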
Monitoring KVM virtual machines
If you run KVM VMs on ZFS zvols, you want visibility into per-VM CPU, memory, disk I/O, and network I/O from the host side (without agents inside the guest). The libvirt exporter provides this.
libvirt exporter installation
curl -LO https://github.com/prometheus-community/libvirt_exporter/releases/download/v0.4.0/libvirt_exporter-0.4.0.linux-amd64.tar.gz
tar xzf libvirt_exporter-0.4.0.linux-amd64.tar.gz
cp libvirt_exporter-0.4.0.linux-amd64/libvirt_exporter /usr/local/bin/
chmod 755 /usr/local/bin/libvirt_exporter
cat > /etc/systemd/system/libvirt_exporter.service << 'EOF'
[Unit]
Description=Prometheus Libvirt Exporter
After=libvirtd.service
Requires=libvirtd.service
[Service]
Type=simple
ExecStart=/usr/local/bin/libvirt_exporter \
--web.listen-address=:9177 \
--libvirt.uri="qemu:///system"
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now libvirt_exporter
Key metrics
# VM state (1=running, 3=paused, 5=shutoff)
libvirt_domain_info_state{domain="webserver"} 1
# CPU time in seconds (a counter; rate() over it gives vCPU cores in use)
libvirt_domain_info_cpu_time_seconds_total{domain="webserver"} 45678.9
# Memory
libvirt_domain_info_memory_usage_bytes{domain="webserver"} 4.294967296e+09
libvirt_domain_info_maximum_memory_bytes{domain="webserver"} 8.589934592e+09
# Block device I/O (per-disk, typically the zvol)
libvirt_domain_block_stats_read_bytes_total{domain="webserver",target_device="vda"} 1.234e+11
libvirt_domain_block_stats_write_bytes_total{domain="webserver",target_device="vda"} 5.678e+10
libvirt_domain_block_stats_read_requests_total{domain="webserver",target_device="vda"} 4567890
libvirt_domain_block_stats_write_requests_total{domain="webserver",target_device="vda"} 2345678
# Network I/O (per-interface)
libvirt_domain_interface_stats_receive_bytes_total{domain="webserver",target_device="vnet0"} 8.9e+09
libvirt_domain_interface_stats_transmit_bytes_total{domain="webserver",target_device="vnet0"} 3.4e+09
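The numeric state values follow libvirt's virDomainState enum. Here is a small sketch for turning raw exporter output into a readable summary from the shell. Both function names are hypothetical, and the parser assumes the metric name and the domain label shown above.

```shell
# Map a libvirt domain state code to a readable name (virDomainState enum).
domain_state_name() {
    case "$1" in
        0) echo "nostate" ;;
        1) echo "running" ;;
        2) echo "blocked" ;;
        3) echo "paused" ;;
        4) echo "shutdown" ;;
        5) echo "shutoff" ;;
        6) echo "crashed" ;;
        7) echo "pmsuspended" ;;
        *) echo "unknown" ;;
    esac
}

# Read exporter output on stdin and print one "domain state" pair per VM.
summarize_domain_states() {
    local line
    local re='^libvirt_domain_info_state\{.*domain="([^"]+)".*\} ([0-9]+)'
    while IFS= read -r line; do
        [[ "$line" =~ $re ]] || continue      # skip comments and other metrics
        echo "${BASH_REMATCH[1]} $(domain_state_name "${BASH_REMATCH[2]}")"
    done
}

# Example (not executed here):
#   curl -s http://localhost:9177/metrics | summarize_domain_states
```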
Per-VM zvol I/O correlation
To correlate ZFS zvol I/O with VM workload, match the zvol block device to the VM's disk.
On a kldload system with VM zvols under rpool/vms/:
# Find which block device a zvol uses
ls -la /dev/zvol/rpool/vms/webserver
# lrwxrwxrwx 1 root root 10 Jan 15 10:00 /dev/zvol/rpool/vms/webserver -> ../../zd0
# The libvirt exporter reports I/O for the domain's "vda" device
# The node_exporter reports I/O for the "zd0" block device
# Correlate them: libvirt_domain_block_stats_*{domain="webserver"} ←→ node_disk_*{device="zd0"}
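The symlink lookup can be scripted so that each VM's zvol maps to the device label node_exporter uses. This is a minimal sketch, assuming GNU readlink and the rpool/vms/ layout described above. The helper name zvol_to_blockdev is hypothetical.

```shell
# Hypothetical helper: print the kernel block device name (e.g. zd0)
# behind a /dev/zvol/... symlink.
zvol_to_blockdev() {
    local link="$1" target
    target=$(readlink -f -- "$link") || return 1   # resolve the symlink chain
    [[ -e "$target" ]] || return 1                 # dangling link: zvol not present
    basename -- "$target"
}

# Example: list every VM zvol with its device (not executed here):
#   for z in /dev/zvol/rpool/vms/*; do
#       echo "$(basename "$z") -> $(zvol_to_blockdev "$z")"
#   done
```

The printed name is the value to use in the device label, for example node_disk_written_bytes_total{device="zd0"}.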
VM alert rules
groups:
  - name: kvm_health
    rules:
      - alert: VMDown
        expr: libvirt_domain_info_state != 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "VM {{ $labels.domain }} is not running on {{ $labels.instance }}"
      - alert: VMHighCPU
        # rate() of CPU seconds is cores in use; divide by the vCPU count
        # so the threshold is a fraction of the VM's allocated capacity
        expr: |
          rate(libvirt_domain_info_cpu_time_seconds_total[5m])
            / libvirt_domain_info_virtual_cpus > 0.90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "VM {{ $labels.domain }} CPU usage >90% of allocated vCPUs on {{ $labels.instance }}"
      - alert: VMHighMemory
        expr: |
          libvirt_domain_info_memory_usage_bytes
            / libvirt_domain_info_maximum_memory_bytes > 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "VM {{ $labels.domain }} memory usage >95% on {{ $labels.instance }}"
      - alert: VMHighDiskIO
        expr: |
          (rate(libvirt_domain_block_stats_read_bytes_total[5m])
            + rate(libvirt_domain_block_stats_write_bytes_total[5m])) > 500e6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "VM {{ $labels.domain }} disk I/O >500MB/s sustained for 15m"
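The CPU counter is in seconds, so its rate is "cores in use" and can exceed 1.0 on a multi-vCPU guest. Dividing by the vCPU count gives a 0 to 1 fraction of allocated capacity. The arithmetic can be checked by hand with a small sketch like this one. The helper name and the sample numbers are made up for illustration.

```shell
# Hypothetical helper: fraction (0..1) of allocated vCPU capacity used
# between two samples of the CPU-seconds counter.
# Args: cpu_seconds_at_t0 cpu_seconds_at_t1 interval_seconds vcpu_count
vm_cpu_fraction() {
    awk -v t0="$1" -v t1="$2" -v dt="$3" -v n="$4" \
        'BEGIN { printf "%.2f\n", (t1 - t0) / dt / n }'
}

# 540 CPU-seconds consumed over a 300 s window is 1.8 cores in use.
vm_cpu_fraction 1000 1540 300 2    # 2 vCPUs -> 0.90
vm_cpu_fraction 1000 1540 300 4    # 4 vCPUs -> 0.45
```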
Quick reference
All ports
Port  Service             Purpose
────  ───────             ───────
3000  Grafana             Dashboards and log viewer
3100  Loki                Log aggregation API
9080  Promtail            Log shipper status
9090  Prometheus          Metrics TSDB and PromQL API
9093  Alertmanager        Alert routing and silencing
9100  node_exporter       Linux system metrics
9134  zfs_exporter        ZFS pool/dataset/ARC metrics
9177  libvirt_exporter    KVM VM metrics
9586  wireguard_exporter  WireGuard peer metrics
All URLs
http://<monitoring-node>:9090            Prometheus UI (query, targets, alerts)
http://<monitoring-node>:9090/targets    Prometheus target health
http://<monitoring-node>:9090/-/healthy  Prometheus health check
http://<monitoring-node>:3000            Grafana dashboards
http://<monitoring-node>:9093            Alertmanager UI (alerts, silences)
http://<monitoring-node>:3100/ready      Loki readiness check
http://<any-node>:9100/metrics           node_exporter metrics endpoint
http://<any-node>:9134/metrics           zfs_exporter metrics endpoint
All config paths
/etc/prometheus/prometheus.yml          Prometheus main config
/etc/prometheus/alerts.yml              Alert rules
/etc/prometheus/recording_rules.yml     Recording rules
/etc/prometheus/targets/nodes/*.yml     File-based service discovery
/etc/alertmanager/alertmanager.yml      Alertmanager config
/etc/loki/loki.yml                      Loki server config
/etc/promtail/promtail.yml              Promtail log shipper config
/etc/grafana/provisioning/datasources/  Grafana data source YAML
/etc/grafana/provisioning/dashboards/   Grafana dashboard provider YAML
/var/lib/grafana/dashboards/            Dashboard JSON files
/var/lib/node_exporter/textfile/        Custom metrics (.prom files)
/var/lib/prometheus/                    Prometheus TSDB (on ZFS)
/var/lib/loki/                          Loki chunks and index (on ZFS)
/var/lib/alertmanager/                  Alertmanager state
All systemctl commands
# Status check — run on the monitoring node
systemctl status prometheus grafana-server alertmanager loki
# Status check — run on every node
systemctl status node_exporter zfs_exporter wireguard_exporter promtail
# Restart the full stack
systemctl restart prometheus grafana-server alertmanager loki
# Hot-reload Prometheus config (no restart, no data loss)
curl -X POST http://localhost:9090/-/reload
# Hot-reload Alertmanager config
curl -X POST http://localhost:9093/-/reload
# Check Prometheus config syntax before applying
promtool check config /etc/prometheus/prometheus.yml
# Check alert rules syntax
promtool check rules /etc/prometheus/alerts.yml
# Validate Alertmanager config
amtool check-config /etc/alertmanager/alertmanager.yml
# View Prometheus targets from CLI
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool
# View firing alerts from CLI
amtool alert query --alertmanager.url=http://localhost:9093
# Unit-test alert rules against the synthetic series defined in the test file
promtool test rules /etc/prometheus/tests/alert_tests.yml
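The check and reload commands above combine naturally into one guard, so that a broken config never reaches the running server. This is a sketch only: the function name safe_reload_prometheus is hypothetical, the default paths are the ones listed on this page, and the /-/reload endpoint only responds when Prometheus runs with --web.enable-lifecycle.

```shell
# Hypothetical wrapper: validate first, reload only if every check passes.
# Args (all optional): config path, rules path, Prometheus base URL
safe_reload_prometheus() {
    local config="${1:-/etc/prometheus/prometheus.yml}"
    local rules="${2:-/etc/prometheus/alerts.yml}"
    local url="${3:-http://localhost:9090}"

    promtool check config "$config" \
        || { echo "config check failed; not reloading" >&2; return 1; }
    promtool check rules "$rules" \
        || { echo "rules check failed; not reloading" >&2; return 1; }
    # -f makes curl exit non-zero on an HTTP error status
    curl -sf -X POST "${url}/-/reload" \
        || { echo "reload request failed" >&2; return 1; }
    echo "reloaded"
}

# Example (not executed here):
#   safe_reload_prometheus
```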
ZFS datasets for the monitoring stack
# Create all ZFS datasets for the monitoring stack
zfs create -o mountpoint=/var/lib/prometheus -o compression=zstd -o recordsize=128k -o atime=off rpool/prometheus
zfs create -o mountpoint=/var/lib/loki -o compression=zstd -o recordsize=128k -o atime=off rpool/loki
zfs create -o mountpoint=/var/lib/grafana -o compression=zstd -o recordsize=16k -o atime=off rpool/grafana
zfs create -o mountpoint=/var/lib/alertmanager -o compression=zstd -o atime=off rpool/alertmanager
# Snapshot the monitoring stack daily
# Add to sanoid.conf:
# [rpool/prometheus]
# use_template = monitoring
# autosnap = yes
# [rpool/loki]
# use_template = monitoring
# autosnap = yes
# [template_monitoring]
# daily = 7
# weekly = 4
# monthly = 3
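After creating the datasets, a quick property check can confirm they match the intended settings. The helper below is a hypothetical sketch built on zfs get -H -o value. It assumes the human-readable value forms, for example 128K rather than 131072 for recordsize.

```shell
# Hypothetical helper: compare one property of a dataset with the expected value.
# Returns 0 on match, 1 on mismatch, 2 if zfs get itself fails.
check_zfs_prop() {
    local ds="$1" prop="$2" want="$3" got
    got=$(zfs get -H -o value "$prop" "$ds") || return 2
    if [[ "$got" == "$want" ]]; then
        echo "ok   ${ds} ${prop}=${got}"
    else
        echo "DIFF ${ds} ${prop}=${got} (expected ${want})"
        return 1
    fi
}

# Example (not executed here):
#   check_zfs_prop rpool/prometheus compression zstd
#   check_zfs_prop rpool/prometheus recordsize 128K
#   check_zfs_prop rpool/grafana recordsize 16K
```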
Deploy all exporters to a new node — single script
#!/bin/bash
# deploy-exporters.sh — install all exporters + promtail on a kldload node
# Usage: bash deploy-exporters.sh [loki-host]   (defaults to 10.78.0.1)
set -euo pipefail
LOKI_HOST="${1:-10.78.0.1}"
echo "=== Installing node_exporter ==="
curl -sLO https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xzf node_exporter-1.8.2.linux-amd64.tar.gz
cp node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/
useradd --no-create-home --shell /sbin/nologin --system node_exporter 2>/dev/null || true
mkdir -p /var/lib/node_exporter/textfile
chown node_exporter:node_exporter /var/lib/node_exporter/textfile
echo "=== Installing zfs_exporter ==="
curl -sLO https://github.com/pdf/zfs_exporter/releases/download/v2.3.5/zfs_exporter-2.3.5.linux-amd64.tar.gz
tar xzf zfs_exporter-2.3.5.linux-amd64.tar.gz
cp zfs_exporter-2.3.5.linux-amd64/zfs_exporter /usr/local/bin/
echo "=== Installing wireguard_exporter ==="
curl -sLO https://github.com/MindFlavor/prometheus_wireguard_exporter/releases/download/3.6.6/prometheus_wireguard_exporter-3.6.6-x86_64-unknown-linux-musl.tar.gz
tar xzf prometheus_wireguard_exporter-3.6.6-x86_64-unknown-linux-musl.tar.gz
cp prometheus_wireguard_exporter /usr/local/bin/
echo "=== Installing promtail ==="
curl -sLO https://github.com/grafana/loki/releases/download/v3.1.1/promtail-linux-amd64.zip
unzip -o promtail-linux-amd64.zip
mv promtail-linux-amd64 /usr/local/bin/promtail
chmod 755 /usr/local/bin/promtail
mkdir -p /var/lib/promtail /etc/promtail
# Generate promtail config pointing to the Loki host
cat > /etc/promtail/promtail.yml << PROMEOF
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /var/lib/promtail/positions.yml
clients:
  - url: http://${LOKI_HOST}:3100/loki/api/v1/push
scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
      - source_labels: ['__journal__hostname']
        target_label: 'hostname'
PROMEOF
echo "=== Creating systemd units ==="
# [unit files created here — same as shown in individual sections above]
echo "=== Starting services ==="
systemctl daemon-reload
systemctl enable --now node_exporter zfs_exporter wireguard_exporter promtail
echo "=== Verifying ==="
for port in 9100 9134 9586 9080; do
if curl -sf "http://localhost:${port}/metrics" > /dev/null 2>&1 || \
curl -sf "http://localhost:${port}/ready" > /dev/null 2>&1; then
echo " Port ${port}: OK"
else
echo " Port ${port}: FAILED"
fi
done
echo "=== Done. Add this node to Prometheus targets. ==="
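# Remove only what this script downloaded or extracted, not every archive in the current directory
rm -rf node_exporter-1.8.2.linux-amd64 zfs_exporter-2.3.5.linux-amd64
rm -f node_exporter-1.8.2.linux-amd64.tar.gz zfs_exporter-2.3.5.linux-amd64.tar.gz \
prometheus_wireguard_exporter-3.6.6-x86_64-unknown-linux-musl.tar.gz \
prometheus_wireguard_exporter promtail-linux-amd64.zip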
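The final step, adding the node to Prometheus, can be a one-file drop on the monitoring host when file-based service discovery is in use. The sketch below assumes prometheus.yml has a file_sd_configs entry that globs /etc/prometheus/targets/nodes/*.yml for the node_exporter job. The helper name and the hostname label are illustrative choices, not something this page defines.

```shell
# Hypothetical helper: write a file_sd target file for one node.
# Args: target directory, node hostname, node WireGuard IP
write_node_target() {
    local dir="$1" host="$2" ip="$3"
    mkdir -p "$dir"
    cat > "${dir}/${host}.yml" << EOF
- targets:
    - "${ip}:9100"
  labels:
    hostname: "${host}"
EOF
}

# Example, run on the monitoring node (values are placeholders):
#   write_node_target /etc/prometheus/targets/nodes web1 10.78.0.5
```

Prometheus watches file_sd files for changes, so a new target file is normally picked up without a reload.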