Monitoring and Observability
Setting up Prometheus, Grafana, and node_exporter on kldload systems. All examples work on CentOS/RHEL and Debian.
Quick health check with kst
Every kldload system includes kst, a one-command health dashboard:
kst
Shows: ZFS pool health, root usage, compression ratio, snapshot count, boot environments, memory, CPU, uptime, and service status.
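For scripted checks (cron, CI, remote probes), the same signals kst rolls up can be read with standard tooling. A minimal sketch, assuming stock ZFS userland and a POSIX shell; the function names are invented here:

```shell
#!/bin/sh
# Scripted spot-check of two signals kst summarizes: pool health and root usage.
# Relies on the standard `zpool status -x` one-liner ("all pools are healthy").

pool_ok() {
    # $1: output of `zpool status -x`
    [ "$1" = "all pools are healthy" ]
}

root_usage_pct() {
    # Root filesystem usage as an integer percentage, from df.
    df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

if pool_ok "$(zpool status -x 2>/dev/null)"; then
    echo "pools: OK"
else
    echo "pools: CHECK"
fi
echo "root usage: $(root_usage_pct)%"
```

Both helpers signal via exit status and plain numbers, so the script drops straight into cron or a health endpoint.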
node_exporter (per-host metrics)
Install on every node you want to monitor:
CentOS/RHEL
# Download the prebuilt node_exporter binary (no Go toolchain needed)
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xzf node_exporter-1.8.2.linux-amd64.tar.gz
cp node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/
Debian
apt install -y prometheus-node-exporter
systemctl enable --now prometheus-node-exporter
# Done — skip the systemd unit creation below
Systemd service (CentOS, or if installed from binary)
useradd --no-create-home --shell /sbin/nologin node_exporter
cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
    --collector.zfs \
    --collector.systemd \
    --collector.processes
Restart=always

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now node_exporter
Verify
curl -s http://localhost:9100/metrics | head -20
# Should show metric lines like node_cpu_seconds_total, node_memory_MemTotal_bytes, etc.
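Beyond eyeballing the first 20 lines, a single metric can be pulled out of the exposition output for scripted checks. A sketch; `metric_value` is a helper invented here, and it only handles the bare (unlabelled) form of a series:

```shell
# Extract one metric's value from Prometheus exposition-format text on stdin.
# Matches only unlabelled series, e.g. `node_load1 0.42`.
metric_value() {
    awk -v m="$1" '$1 == m { print $2; exit }'
}

# Against a live exporter:
# curl -s http://localhost:9100/metrics | metric_value node_memory_MemTotal_bytes
```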
ZFS-specific metrics
node_exporter’s --collector.zfs exposes:
node_zfs_arc_hits_total
node_zfs_arc_misses_total
node_zfs_arc_size
node_zfs_pool_state{pool="rpool",state="online"} # 1 for the pool's current state; the "degraded" series going to 1 signals trouble
Prometheus (metrics server)
Install on your monitoring node:
# Download
curl -LO https://github.com/prometheus/prometheus/releases/download/v2.54.1/prometheus-2.54.1.linux-amd64.tar.gz
tar xzf prometheus-2.54.1.linux-amd64.tar.gz
cp prometheus-2.54.1.linux-amd64/{prometheus,promtool} /usr/local/bin/
mkdir -p /etc/prometheus /var/lib/prometheus
Configuration
cat > /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "kldload-nodes"
    static_configs:
      - targets:
          - "10.78.0.1:9100"   # node 0 (hub)
          - "10.78.1.1:9100"   # node 1
          - "10.78.2.1:9100"   # node 2
          - "10.78.3.1:9100"   # node 3
          # ... add all your nodes
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):.*'
        target_label: instance
EOF
If using the WireGuard mesh, use wg2 addresses (10.79.x.x); that's the metrics plane.
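Typing one target line per node gets tedious past a few machines. A small generator for the `static_configs` target lines, assuming the 10.78.<node>.1 addressing used above (swap the prefix for 10.79 on the wg2 metrics plane); `gen_targets` is a name invented here:

```shell
# Print YAML target lines for N nodes, following the 10.78.<node>.1:9100 scheme.
gen_targets() {
    n=0
    while [ "$n" -lt "$1" ]; do
        printf '          - "10.78.%d.1:9100"   # node %d\n' "$n" "$n"
        n=$((n + 1))
    done
}

gen_targets 4
```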
Systemd service
useradd --no-create-home --shell /sbin/nologin prometheus
chown -R prometheus:prometheus /var/lib/prometheus /etc/prometheus
cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus \
    --storage.tsdb.retention.time=30d
Restart=always

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now prometheus
Store Prometheus data on ZFS
# Stop Prometheus first: the new dataset mounts over /var/lib/prometheus and
# shadows anything already written there (move existing data if you need it).
systemctl stop prometheus
zfs create -o mountpoint=/var/lib/prometheus \
    -o compression=zstd \
    -o recordsize=128k \
    rpool/prometheus
chown prometheus:prometheus /var/lib/prometheus
systemctl start prometheus
zstd compression works well on time-series data; expect a 3-5x compression ratio.
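The ratio can be checked once data accumulates; `zfs get -H -o value compressratio` prints it in ZFS's "3.47x" format, and a tiny parser makes it usable in scripts. A sketch; `ratio_num` is a name invented here:

```shell
# Strip the trailing "x" from ZFS's compressratio format so the value can be
# compared numerically, e.g. in an alerting script.
ratio_num() {
    printf '%s\n' "$1" | sed 's/x$//'
}

# zfs get -H -o value compressratio rpool/prometheus   # e.g. "3.47x"
ratio_num "3.47x"
```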
Verify
Open http://<monitoring-node>:9090 in a browser.
Try a query:
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
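The same ratio can be sanity-checked on the node itself, straight from /proc/meminfo (the file node_exporter reads for these gauges). A sketch; `mem_avail_pct` is a name invented here:

```shell
# Available memory as a percentage of total, mirroring the PromQL query above.
mem_avail_pct() {
    awk '/^MemTotal:/ { t = $2 } /^MemAvailable:/ { a = $2 }
         END { if (t > 0) printf "%.1f\n", a * 100 / t }' /proc/meminfo
}

mem_avail_pct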
Grafana (dashboards)
# CentOS/RHEL
cat > /etc/yum.repos.d/grafana.repo << 'EOF'
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
EOF
dnf install -y grafana
# Debian
apt install -y apt-transport-https software-properties-common
curl -fsSL https://apt.grafana.com/gpg.key | gpg --dearmor -o /usr/share/keyrings/grafana.gpg
echo "deb [signed-by=/usr/share/keyrings/grafana.gpg] https://apt.grafana.com stable main" \
> /etc/apt/sources.list.d/grafana.list
apt update
apt install -y grafana
systemctl enable --now grafana-server
Open http://<monitoring-node>:3000; the default login is admin/admin.
Add Prometheus as a data source
- Settings → Data Sources → Add data source → Prometheus
- URL: http://localhost:9090
- Save & Test
Import the Node Exporter dashboard
- Dashboards → Import
- Enter ID: 1860 (Node Exporter Full)
- Select your Prometheus data source
- Import
This gives you CPU, memory, disk, network, and ZFS metrics for every node.
ZFS-specific dashboard
Create a custom panel in Grafana with these queries:
ARC hit rate
rate(node_zfs_arc_hits_total[5m]) /
(rate(node_zfs_arc_hits_total[5m]) + rate(node_zfs_arc_misses_total[5m])) * 100
A healthy ARC hit rate is >90%. Below 80% means the ARC is too small.
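The rate() expression reduces to plain counter deltas. A worked sketch of the arithmetic on two samples of the hit/miss counters; `arc_hit_rate` is a name invented here:

```shell
# Hit rate from two samples of the ARC counters:
# delta(hits) / (delta(hits) + delta(misses)), as a percentage.
arc_hit_rate() {
    # $1 $2: hits at t0 and t1; $3 $4: misses at t0 and t1
    awk -v h0="$1" -v h1="$2" -v m0="$3" -v m1="$4" 'BEGIN {
        dh = h1 - h0; dm = m1 - m0
        printf "%.1f\n", dh * 100 / (dh + dm)
    }'
}

arc_hit_rate 1000 1950 100 150   # 950 hits vs 50 misses in the window: 95.0
```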
ARC size
node_zfs_arc_size
Root filesystem free space (on a ZFS root, this tracks pool free space)
node_filesystem_avail_bytes{mountpoint="/"}
Snapshot count over time
node_exporter has no snapshot metric; run a small script from cron and expose it via the textfile collector (or write a custom exporter):
#!/bin/sh
# /usr/local/bin/zfs-metrics.sh, read by the textfile collector
echo "# HELP zfs_snapshot_count Number of ZFS snapshots"
echo "# TYPE zfs_snapshot_count gauge"
echo "zfs_snapshot_count $(zfs list -t snapshot -H | wc -l)"
Configure node_exporter to read it:
# Add to the node_exporter ExecStart line:
#   --collector.textfile.directory=/var/lib/node_exporter/textfile

# Create the directory and the cron job. Write to a temp file, then rename,
# so node_exporter never scrapes a half-written file.
mkdir -p /var/lib/node_exporter/textfile
echo '*/5 * * * * root /usr/local/bin/zfs-metrics.sh > /var/lib/node_exporter/textfile/zfs.prom.tmp && mv /var/lib/node_exporter/textfile/zfs.prom.tmp /var/lib/node_exporter/textfile/zfs.prom' \
  >> /etc/crontab
Alerting
Add alert rules to Prometheus:
cat > /etc/prometheus/alerts.yml << 'EOF'
groups:
  - name: kldload
    rules:
      - alert: ZFSPoolDegraded
        expr: node_zfs_pool_state{state="degraded"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool degraded on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% disk space on {{ $labels.instance }}"
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage >90% on {{ $labels.instance }}"
      - alert: ARCHitRateLow
        expr: rate(node_zfs_arc_hits_total[5m]) / (rate(node_zfs_arc_hits_total[5m]) + rate(node_zfs_arc_misses_total[5m])) < 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ZFS ARC hit rate below 80% on {{ $labels.instance }}"
EOF
Reference it in prometheus.yml, then restart:
rule_files:
  - "alerts.yml"
systemctl restart prometheus