
Monitoring and Observability — the complete stack.

This is the complete observability stack for kldload systems. Every component — node_exporter, ZFS exporter, WireGuard exporter, Prometheus, Grafana, Alertmanager, Loki — is open source, runs on ZFS, communicates over WireGuard, and requires zero cloud dependencies. All examples work on CentOS/RHEL/Rocky, Fedora, Debian, and Ubuntu. Every config file on this page is production-ready. Every PromQL query has been tested on real kldload deployments.

The thesis: Observability is not a product you buy. It is not Datadog. It is not New Relic. It is not Splunk. Observability is Prometheus scraping exporters, Grafana rendering dashboards, Alertmanager routing alerts, and Loki indexing logs. Four open-source binaries. Four systemd units. Zero license fees. Zero egress charges. Zero per-host pricing.

The entire commercial observability industry exists because people do not know these four tools. Once you do, you will never pay $23/host/month for metrics again. Your data stays on your infrastructure, on ZFS, compressed, checksummed, snapshotted, and replicated — just like everything else in the kldload stack.

I have run Datadog, New Relic, and Splunk at scale. The bill for 200 hosts on Datadog was $14,000/month. I replaced it with Prometheus + Grafana + Loki on a single kldload box with 32GB RAM and two NVMe drives. The bill is $0/month. The dashboards are better because I built them for my infrastructure, not for a generic "cloud host" template. The alerts are better because they fire on things I care about, not things a vendor thinks I should care about. The only cost was learning PromQL, and that took a weekend.

Stack architecture

The kldload observability stack has six components. Every component is a single binary with a single config file. There are no databases to manage (Prometheus has its own TSDB; Loki has its own index). There are no message queues. There are no Kafka clusters. The architecture is deliberately simple because simple systems are reliable systems.

                        +-----------------+
                        |    Grafana      |  :3000  (dashboards + log viewer)
                        +--------+--------+
                                 |
                    +------------+------------+
                    |                         |
           +--------+--------+     +---------+---------+
           |   Prometheus    |     |       Loki        |  :3100  (log aggregation)
           |   :9090         |     +-------------------+
           +--------+--------+               ^
                    |                         |
        +-----------+-----------+    +--------+--------+
        |           |           |    |    Promtail     |  (log shipper on every node)
        v           v           v    +--------+--------+
  +-----------+ +-----------+ +-----------+   |
  | node_exp  | | zfs_exp   | | wg_exp    |   |
  | :9100     | | :9134     | | :9586     |   |
  +-----------+ +-----------+ +-----------+   |
        ^           ^           ^             |
        |           |           |             |
  [every kldload node — exporters + promtail on each]

           +-------------------+
           |   Alertmanager    |  :9093  (alert routing + silencing)
           +-------------------+
                    ^
                    |
              Prometheus (fires alerts via alert rules)

Data flow: Exporters expose metrics on HTTP endpoints. Prometheus scrapes them every 15 seconds and stores time-series data in its local TSDB on ZFS. Grafana queries Prometheus for dashboards. Alertmanager receives firing alerts from Prometheus and routes them to Slack, PagerDuty, email, or webhooks. Promtail tails log files and ships them to Loki. Grafana queries Loki for log correlation. That is the entire architecture.

Exporters — the data sources

Small HTTP servers that expose metrics in Prometheus format. Each exporter knows one domain: node_exporter knows Linux, zfs_exporter knows ZFS, wireguard_exporter knows WireGuard. They run on every monitored host. They use almost no resources — typically 10-20MB RAM.

Exporters are read-only. They expose what is already there. They do not modify the system.

Prometheus — the brain

Pull-based time-series database. Scrapes exporters on a schedule, stores data locally, evaluates alert rules, serves PromQL queries. One binary, one config file, one data directory. Handles millions of time series on modest hardware.

Prometheus does not push. It pulls. This means targets do not need to know about Prometheus.

Grafana — the eyes

Dashboard and visualization layer. Queries Prometheus for metrics, Loki for logs. Provisioned entirely via YAML and JSON — no clicking through a GUI to configure. Dashboards are code, stored in git, deployed with the system.

If you are clicking through Grafana's UI to create dashboards, you are doing it wrong. Provision them.

Alertmanager — the voice

Receives alerts from Prometheus, deduplicates them, groups them, routes them to the right receiver. Supports silencing, inhibition, and escalation. Runs separately so Prometheus can be restarted without losing alert state.

Prometheus decides when to fire. Alertmanager decides who to tell and how loudly.

Loki — the memory

Log aggregation system designed to be cost-effective. Does not index log content — only indexes labels (like Prometheus). Uses the same label model as Prometheus, so you can correlate metrics and logs by the same set of labels. Massively cheaper than Elasticsearch.

Loki is "Prometheus for logs." Same labels, same query style, same philosophy.

Promtail — the courier

Tails log files on each host and ships them to Loki. Discovers logs via systemd journal or filesystem paths. Attaches labels automatically from filename, systemd unit, or hostname. Runs on every node alongside the exporters.

Promtail is to Loki what node_exporter is to Prometheus — the per-host agent.

The pull model is the key architectural decision. Prometheus scrapes targets — targets do not push to Prometheus. This means you can add a new node to monitoring by adding one line to prometheus.yml. The node does not need to know where Prometheus is. The node does not need credentials for Prometheus. The node just exposes metrics on a port, and Prometheus comes to collect them. Over WireGuard, this is trivially secure — only peers on the mesh can reach the exporter ports.

Quick health check with kst

Before deploying the full stack, every kldload system includes kst — a one-command health dashboard built into the platform:

kst

Shows: ZFS pool health, root usage, compression ratio, snapshot count, boot environments, memory, CPU, uptime, and service status. This is your quick-glance tool for interactive troubleshooting. The full Prometheus stack gives you history, alerting, and multi-host views that kst cannot provide.

node_exporter — per-host metrics

node_exporter is the foundation. It runs on every host and exposes Linux system metrics — CPU, memory, disk, network, filesystem, systemd units, processes, and ZFS. Install it on every node you want to monitor.

Installation

CentOS / RHEL / Rocky / Fedora

# Download the latest release
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xzf node_exporter-1.8.2.linux-amd64.tar.gz
cp node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/
chmod 755 /usr/local/bin/node_exporter

Debian / Ubuntu

# The distro package is fine for basic use, but we want the latest for ZFS collectors
# Install from binary for consistency across all distros
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xzf node_exporter-1.8.2.linux-amd64.tar.gz
cp node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/
chmod 755 /usr/local/bin/node_exporter

Hardened systemd unit

The systemd unit below is production-hardened. It runs as a dedicated unprivileged user, enables the collectors that matter for kldload systems, sets up the textfile collector directory, and applies security restrictions. This is not the minimal example from the README — this is what you actually run in production.

# Create the service user
useradd --no-create-home --shell /sbin/nologin --system node_exporter

# Create the textfile collector directory
mkdir -p /var/lib/node_exporter/textfile
chown node_exporter:node_exporter /var/lib/node_exporter/textfile

cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Prometheus Node Exporter
Documentation=https://github.com/prometheus/node_exporter
After=network-online.target
Wants=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.zfs \
  --collector.systemd \
  --collector.processes \
  --collector.tcpstat \
  --collector.ntp \
  --collector.textfile \
  --collector.textfile.directory=/var/lib/node_exporter/textfile \
  --no-collector.infiniband \
  --no-collector.wifi \
  --no-collector.fibrechannel \
  --web.listen-address=:9100
Restart=always
RestartSec=5

# Security hardening
ProtectSystem=strict
ProtectHome=yes
NoNewPrivileges=yes
ProtectKernelTunables=yes
ProtectControlGroups=yes
ReadOnlyPaths=/
ReadWritePaths=/var/lib/node_exporter/textfile

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now node_exporter

Important collectors for kldload systems

node_exporter ships with dozens of collectors. Most are enabled by default. These are the ones that matter on kldload infrastructure:

collector.zfs

ARC hits, misses, size, pool state. The most important collector on a kldload system. Reads from /proc/spl/kstat/zfs/. No privileges needed — the files are world-readable.

collector.systemd

Service states (active, failed, inactive) for all systemd units. Lets you alert on node_systemd_unit_state{name="zfs-mount.service",state="failed"} == 1.

collector.processes

Process counts by state (running, sleeping, zombie). Useful for detecting fork bombs or runaway process creation.

collector.tcpstat

TCP connection states (established, time-wait, close-wait). Essential for monitoring WireGuard-transported services and detecting connection leaks.

collector.textfile

Reads .prom files from a directory and exposes them as metrics. This is how you add custom metrics — ZFS snapshot counts, replication lag, scrub progress — without writing a full exporter.

collector.ntp

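NTP clock offset, measured against an NTP server. The collector queries 127.0.0.1 unless --collector.ntp.server is set, so it expects a local NTP daemon. Time drift breaks Prometheus in two ways: a clock that steps backwards on the Prometheus server makes new samples older than stored ones (out-of-order samples get dropped), and drift on any host skews expressions that compare an exported timestamp with time(). Alert if offset exceeds 100ms.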

Verify

curl -s http://localhost:9100/metrics | head -30

Expected output (abbreviated):

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_cpu_seconds_total{cpu="0",mode="system"} 4567.89
node_cpu_seconds_total{cpu="0",mode="user"} 8901.23
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 3.4089984e+10
# HELP node_zfs_arc_hits kstat.zfs.misc.arcstats.hits
# TYPE node_zfs_arc_hits untyped
node_zfs_arc_hits 1.28934567e+08
# HELP node_zfs_arc_misses kstat.zfs.misc.arcstats.misses
# TYPE node_zfs_arc_misses untyped
node_zfs_arc_misses 4.567890e+06
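
When checking an exporter by hand, it is often easier to pull out one metric than to page through the full output. The following is a minimal bash sketch, not part of kldload: metric_value is a hypothetical helper name, and it assumes samples carry no trailing timestamp (node_exporter does not emit one).

```shell
# Print the value of every sample whose metric name matches $1 exactly.
# Reads Prometheus text exposition format on stdin.
metric_value() {
  local name="$1"
  awk -v name="$name" '
    /^#/ { next }                      # skip HELP/TYPE lines and comments
    {
      metric = $1
      sub(/[{].*/, "", metric)         # strip the label set, if any
      if (metric == name) print $NF    # value is the last field
    }
  '
}

# Example against a captured sample instead of a live exporter:
metric_value node_memory_MemTotal_bytes <<'EOF'
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 3.4089984e+10
EOF
```

Against a live exporter the same helper would be fed from curl, for example: curl -s http://localhost:9100/metrics | metric_value node_memory_MemTotal_bytes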

Textfile collector — custom metrics

The textfile collector is the escape hatch. Any metric you cannot get from a standard collector, you write to a .prom file and node_exporter picks it up. This is how you expose ZFS snapshot counts, sanoid replication lag, scrub status, and anything else that requires running a ZFS command.

cat > /usr/local/bin/zfs-textfile-metrics.sh << 'SCRIPT'
#!/bin/bash
# Custom ZFS metrics for node_exporter textfile collector
# Runs every 5 minutes via cron or systemd timer

set -euo pipefail

OUTPUT="/var/lib/node_exporter/textfile/zfs_custom.prom"
TMPFILE="${OUTPUT}.tmp"

{
  # Snapshot count per dataset
  echo "# HELP zfs_snapshot_count Number of ZFS snapshots per dataset"
  echo "# TYPE zfs_snapshot_count gauge"
  zfs list -t snapshot -H -o name 2>/dev/null | \
    awk -F'@' '{count[$1]++} END {for (ds in count) printf "zfs_snapshot_count{dataset=\"%s\"} %d\n", ds, count[ds]}'

  # Total snapshot count
  echo "# HELP zfs_snapshot_count_total Total number of ZFS snapshots"
  echo "# TYPE zfs_snapshot_count_total gauge"
  echo "zfs_snapshot_count_total $(zfs list -t snapshot -H 2>/dev/null | wc -l)"

  # Pool usage percentage
  echo "# HELP zfs_pool_usage_percent ZFS pool usage percentage"
  echo "# TYPE zfs_pool_usage_percent gauge"
  zpool list -H -o name,capacity 2>/dev/null | while read pool cap; do
    cap="${cap%\%}"
    echo "zfs_pool_usage_percent{pool=\"${pool}\"} ${cap}"
  done

  # Scrub status (0=none, 1=scrubbing, 2=completed)
  echo "# HELP zfs_scrub_state ZFS scrub state (0=none, 1=active, 2=completed)"
  echo "# TYPE zfs_scrub_state gauge"
  zpool list -H -o name 2>/dev/null | while read pool; do
    status=$(zpool status "$pool" 2>/dev/null)
    if echo "$status" | grep -q "scrub in progress"; then
      echo "zfs_scrub_state{pool=\"${pool}\"} 1"
    elif echo "$status" | grep -q "scrub repaired"; then
      echo "zfs_scrub_state{pool=\"${pool}\"} 2"
    else
      echo "zfs_scrub_state{pool=\"${pool}\"} 0"
    fi
  done

  # Scrub errors (parsed from "scan: scrub repaired 0B in ... with N errors on ...")
  echo "# HELP zfs_scrub_errors_total ZFS scrub errors count"
  echo "# TYPE zfs_scrub_errors_total gauge"
  zpool list -H -o name 2>/dev/null | while read -r pool; do
    errors=$(zpool status "$pool" 2>/dev/null | grep -oP 'scan: .* with \K[0-9]+(?= errors)' || echo 0)
    echo "zfs_scrub_errors_total{pool=\"${pool}\"} ${errors:-0}"
  done

  # Dataset compression ratio
  echo "# HELP zfs_compression_ratio ZFS dataset compression ratio"
  echo "# TYPE zfs_compression_ratio gauge"
  zfs list -H -o name,compressratio -t filesystem 2>/dev/null | while read ds ratio; do
    ratio="${ratio%x}"
    echo "zfs_compression_ratio{dataset=\"${ds}\"} ${ratio}"
  done

  # ARC target size (the "c" row in arcstats; the "size" row is the current size)
  echo "# HELP zfs_arc_target_bytes ZFS ARC target size in bytes"
  echo "# TYPE zfs_arc_target_bytes gauge"
  arc_c=$(awk '$1 == "c" {print $3}' /proc/spl/kstat/zfs/arcstats 2>/dev/null || echo 0)
  echo "zfs_arc_target_bytes ${arc_c:-0}"

} > "${TMPFILE}"

# Atomic move so node_exporter never reads a partial file
mv "${TMPFILE}" "${OUTPUT}"
SCRIPT

chmod 755 /usr/local/bin/zfs-textfile-metrics.sh
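
The script interpolates names straight into label values. ZFS naming rules are restrictive enough that this rarely matters for pools and datasets, but if the same pattern is reused for arbitrary strings, the text format requires backslashes, double quotes, and newlines to be escaped. A minimal bash sketch of that escaping follows; prom_escape_label and prom_sample are hypothetical helper names, not part of node_exporter.

```shell
# Escape a string for use as a label value in the Prometheus text format:
# backslash -> \\, double quote -> \", newline -> \n.
prom_escape_label() {
  local s="$1"
  s=${s//\\/\\\\}        # backslashes first, so later escapes are not doubled
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  printf '%s' "$s"
}

# Emit one sample with a single escaped label.
# Args: metric name, label name, label value, sample value.
prom_sample() {
  printf '%s{%s="%s"} %s\n' "$1" "$2" "$(prom_escape_label "$3")" "$4"
}

prom_sample zfs_snapshot_count dataset 'rpool/ROOT/os' 42
```

Doing the backslash replacement first is the important design choice: reversing the order would double the backslashes that the quote escape just added.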

Run it on a timer — a systemd timer is cleaner than cron:

cat > /etc/systemd/system/zfs-textfile-metrics.service << 'EOF'
[Unit]
Description=Generate ZFS metrics for node_exporter textfile collector

[Service]
Type=oneshot
ExecStart=/usr/local/bin/zfs-textfile-metrics.sh
User=root
EOF

cat > /etc/systemd/system/zfs-textfile-metrics.timer << 'EOF'
[Unit]
Description=Run ZFS textfile metrics every 5 minutes

[Timer]
OnBootSec=30
OnUnitActiveSec=5min
AccuracySec=30s

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now zfs-textfile-metrics.timer

# Run once immediately to populate the file
/usr/local/bin/zfs-textfile-metrics.sh

Filtering metrics at scrape time

node_exporter exposes ~800 metrics by default. If you only need a subset, you can filter at the Prometheus scrape config level using metric_relabel_configs:

# In prometheus.yml scrape_configs
- job_name: "kldload-nodes"
  metric_relabel_configs:
    # Drop high-cardinality metrics you don't need
    - source_labels: [__name__]
      regex: 'node_scrape_collector_duration_seconds'
      action: drop
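    # Drop the CPU modes you don't chart (idle, iowait, system, user remain).
    # Use drop, not keep: a keep rule here would discard every series
    # that is not node_cpu_seconds_total.
    - source_labels: [__name__, mode]
      regex: 'node_cpu_seconds_total;(irq|nice|softirq|steal)'
      action: drop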

ZFS exporter — dedicated ZFS metrics

While node_exporter provides basic ZFS metrics (ARC stats, pool state), a dedicated ZFS exporter gives you deeper visibility: per-dataset usage, per-pool I/O, scrub progress percentages, replication lag, and individual pool member (vdev) health. The pdf/zfs_exporter is the standard choice.

Installation

# Download
curl -LO https://github.com/pdf/zfs_exporter/releases/download/v2.3.5/zfs_exporter-2.3.5.linux-amd64.tar.gz
tar xzf zfs_exporter-2.3.5.linux-amd64.tar.gz
cp zfs_exporter-2.3.5.linux-amd64/zfs_exporter /usr/local/bin/
chmod 755 /usr/local/bin/zfs_exporter

Systemd unit

cat > /etc/systemd/system/zfs_exporter.service << 'EOF'
[Unit]
Description=Prometheus ZFS Exporter
Documentation=https://github.com/pdf/zfs_exporter
After=zfs-mount.service
Requires=zfs-mount.service

[Service]
Type=simple
ExecStart=/usr/local/bin/zfs_exporter \
  --web.listen-address=:9134 \
  --collector.dataset-snapshot \
  --collector.pool
Restart=always
RestartSec=5

ProtectSystem=strict
ProtectHome=yes
NoNewPrivileges=yes

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now zfs_exporter

Key metrics exposed

# Pool health
zfs_pool_health{pool="rpool",state="online"}                    1
zfs_pool_health{pool="rpool",state="degraded"}                  0

# Pool I/O operations
zfs_pool_read_ops_total{pool="rpool"}                           45678901
zfs_pool_write_ops_total{pool="rpool"}                          23456789
zfs_pool_read_bytes_total{pool="rpool"}                         1.234e+12
zfs_pool_write_bytes_total{pool="rpool"}                        5.678e+11

# Dataset usage
zfs_dataset_used_bytes{dataset="rpool/ROOT/os",type="filesystem"}     8.5e+09
zfs_dataset_available_bytes{dataset="rpool/ROOT/os",type="filesystem"} 9.2e+10
zfs_dataset_referenced_bytes{dataset="rpool/ROOT/os",type="filesystem"} 7.8e+09

# Snapshot metrics
zfs_dataset_snapshot_count{dataset="rpool/ROOT/os"}             42
zfs_dataset_snapshot_used_bytes{dataset="rpool/ROOT/os"}        2.1e+09

# ARC detailed stats
zfs_arc_size_bytes                                              1.7179869e+10
zfs_arc_hits_total                                              1.28934567e+08
zfs_arc_misses_total                                            4.567890e+06
zfs_arc_l2_hits_total                                           0
zfs_arc_l2_misses_total                                         0
zfs_arc_mfu_size_bytes                                          8.589934e+09
zfs_arc_mru_size_bytes                                          6.442450e+09

# Scrub progress
zfs_pool_scrub_progress{pool="rpool"}                           0.73
zfs_pool_scrub_errors_total{pool="rpool"}                       0
zfs_pool_scrub_duration_seconds{pool="rpool"}                   3456

Verify

curl -s http://localhost:9134/metrics | grep zfs_pool_health
# zfs_pool_health{pool="rpool",state="online"} 1

The built-in node_exporter ZFS collector is good enough for ARC stats and basic pool state. But when you have 50 datasets, 200 snapshots, and multiple pools with different vdev topologies, you want the dedicated exporter. It knows about scrub progress, per-vdev error counts, L2ARC stats, and dataset-level accounting. The extra 15MB of RAM it uses is nothing compared to the visibility it provides.

WireGuard exporter — per-peer metrics

If you run a WireGuard mesh (and on kldload you probably do), you need visibility into peer connectivity, handshake recency, and data transfer. The WireGuard exporter reads from the kernel's WireGuard interface and exposes per-peer metrics.

Installation

# Download prometheus-wireguard-exporter
curl -LO https://github.com/MindFlavor/prometheus_wireguard_exporter/releases/download/3.6.6/prometheus_wireguard_exporter-3.6.6-x86_64-unknown-linux-musl.tar.gz
tar xzf prometheus_wireguard_exporter-3.6.6-x86_64-unknown-linux-musl.tar.gz
cp prometheus_wireguard_exporter /usr/local/bin/
chmod 755 /usr/local/bin/prometheus_wireguard_exporter

Systemd unit

cat > /etc/systemd/system/wireguard_exporter.service << 'EOF'
[Unit]
Description=Prometheus WireGuard Exporter
Documentation=https://github.com/MindFlavor/prometheus_wireguard_exporter
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
# Needs CAP_NET_ADMIN to read WireGuard interface data
ExecStart=/usr/local/bin/prometheus_wireguard_exporter \
  -p 9586 \
  -n /etc/wireguard/wg0.conf
Restart=always
RestartSec=5

AmbientCapabilities=CAP_NET_ADMIN
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
ReadOnlyPaths=/etc/wireguard

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now wireguard_exporter

Key metrics

# Per-peer metrics (one set per peer per interface)
wireguard_sent_bytes_total{interface="wg0",public_key="abc123...",friendly_name="node-1"}     1.234e+09
wireguard_received_bytes_total{interface="wg0",public_key="abc123...",friendly_name="node-1"} 5.678e+08
wireguard_latest_handshake_seconds{interface="wg0",public_key="abc123...",friendly_name="node-1"} 1.71e+09

# Derived: time since last handshake (use in PromQL)
# time() - wireguard_latest_handshake_seconds
# A peer with no handshake in >180s is likely down
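
The handshake-age arithmetic is easy to check by hand. The bash sketch below is illustrative only: handshake_age and peer_state are hypothetical helper names, and it assumes the exporter reports 0 for a peer that has never completed a handshake. It uses awk because exported values may be in scientific notation.

```shell
# Seconds since the last handshake.
# $1 = handshake timestamp as exported, $2 = current Unix time (default: now).
handshake_age() {
  local ts="$1" now="${2:-$(date +%s)}"
  awk -v ts="$ts" -v now="$now" 'BEGIN { printf "%d\n", now - ts }'
}

# Classify a peer using the 180-second threshold.
peer_state() {
  local ts="$1" now="${2:-$(date +%s)}" age
  # Assumption: a timestamp of 0 means "never completed a handshake".
  if awk -v ts="$ts" 'BEGIN { exit !(ts + 0 == 0) }'; then
    echo "never"
    return
  fi
  age=$(handshake_age "$ts" "$now")
  if [ "$age" -gt 180 ]; then echo "stale"; else echo "ok"; fi
}

peer_state 1.71e+09 1710000100   # handshake 100 seconds before "now"
```

Treating 0 separately matters because time() minus 0 is a very large age, which would otherwise be indistinguishable from a peer that dropped off long ago.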

Friendly names from config comments

The exporter reads the WireGuard config file passed with -n and maps public keys to friendly names. It looks for a comment of the form # friendly_name = <name> inside each [Peer] block, before the PublicKey line:

# /etc/wireguard/wg0.conf

[Interface]
PrivateKey = ...
Address = 10.78.0.1/24
ListenPort = 51820

[Peer]
# friendly_name = node-1 (web server)
PublicKey = abc123...
AllowedIPs = 10.78.1.1/32
Endpoint = 203.0.113.10:51820

[Peer]
# friendly_name = node-2 (database)
PublicKey = def456...
AllowedIPs = 10.78.2.1/32
Endpoint = 203.0.113.20:51820

The value after friendly_name = becomes the friendly_name label. This makes dashboards readable: you see "node-1 (web server)" instead of a base64 public key.

Verify

curl -s http://localhost:9586/metrics | grep wireguard_latest_handshake
# wireguard_latest_handshake_seconds{interface="wg0",public_key="abc123...",friendly_name="node-1 (web server)"} 1.712345678e+09

Prometheus — the metrics server

Prometheus is the center of the stack. It scrapes every exporter, stores time-series data, evaluates alert rules, and serves PromQL queries to Grafana. On a kldload system, Prometheus stores its TSDB on ZFS for compression, checksumming, and snapshots.

Installation

# Download
curl -LO https://github.com/prometheus/prometheus/releases/download/v2.54.1/prometheus-2.54.1.linux-amd64.tar.gz
tar xzf prometheus-2.54.1.linux-amd64.tar.gz
cp prometheus-2.54.1.linux-amd64/{prometheus,promtool} /usr/local/bin/
chmod 755 /usr/local/bin/{prometheus,promtool}
mkdir -p /etc/prometheus /var/lib/prometheus

Storage on ZFS

Prometheus TSDB writes 2-hour blocks, then compacts them. The access pattern is: sequential writes for ingestion, random reads for queries. The optimal ZFS configuration:

# Create a dedicated dataset for Prometheus
zfs create -o mountpoint=/var/lib/prometheus \
           -o compression=zstd \
           -o recordsize=128k \
           -o atime=off \
           -o xattr=sa \
           -o dnodesize=auto \
           -o primarycache=all \
           rpool/prometheus

# Set ownership
useradd --no-create-home --shell /sbin/nologin --system prometheus
chown -R prometheus:prometheus /var/lib/prometheus /etc/prometheus

Why these settings: recordsize=128k matches Prometheus' large sequential writes. zstd compresses time-series data at 3-5x, so a 50GB TSDB might use only 12GB on disk. atime=off avoids a metadata write on every read. The dataset inherits checksumming from the pool, so corruption of your metrics database is detected on read instead of going unnoticed, and it is repaired automatically when the pool has redundancy.

Retention sizing

Prometheus stores data locally. How much space you need depends on the number of time series, scrape interval, and retention period. The formula:

# Space per sample: ~1-2 bytes (Prometheus is very efficient)
# Formula: series_count * samples_per_day * bytes_per_sample * retention_days

# Example: 10 nodes, ~800 series each, 15s scrape interval, 30d retention
#   8,000 series * 5,760 samples/day * 1.5 bytes * 30 days = ~2GB uncompressed
#   With zstd compression on ZFS: ~500MB actual disk

# Example: 100 nodes, 30d retention
#   80,000 series * 5,760 * 1.5 * 30 = ~20GB uncompressed = ~5GB on ZFS

# Example: 100 nodes, 1 year retention
#   80,000 * 5,760 * 1.5 * 365 = ~250GB uncompressed = ~60GB on ZFS
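
The sizing formula can be turned into a small calculator for your own series counts. This is a bash sketch of the arithmetic only; tsdb_bytes is a hypothetical helper name, and the 1.5 bytes per sample is the same estimate used in the examples, not a measured value.

```shell
# Estimate uncompressed TSDB size in bytes.
# Args: series count, scrape interval (seconds), bytes per sample, retention (days).
tsdb_bytes() {
  awk -v series="$1" -v interval="$2" -v bps="$3" -v days="$4" 'BEGIN {
    samples_per_day = 86400 / interval
    printf "%.0f\n", series * samples_per_day * bps * days
  }'
}

tsdb_bytes 8000 15 1.5 30      # 10 nodes, 30 days: 2073600000 (~2GB)
tsdb_bytes 80000 15 1.5 365    # 100 nodes, 1 year: 252288000000 (~250GB)
```

On a running server, the series count to plug in can be read from the prometheus_tsdb_head_series metric rather than estimated.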

ZFS compression on Prometheus data is a cheat code. Datadog charges by the metric. AWS CloudWatch charges by the metric. Prometheus on ZFS stores a year of data for 100 hosts in 60GB of actual disk. That is $3/month on a 1TB NVMe drive. The commercial equivalent would be thousands per month. This is the math that makes self-hosted observability a no-brainer.

Complete prometheus.yml

This is a complete, production-ready Prometheus configuration for a kldload cluster. It scrapes all exporters, loads alert rules, connects to Alertmanager, and includes recording rules for pre-computed queries.

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

  # External labels for federation and remote write
  external_labels:
    cluster: "prod-east"
    environment: "production"

# Alert rules and recording rules
rule_files:
  - "alerts.yml"
  - "recording_rules.yml"

# Alertmanager connection
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - "localhost:9093"

# Scrape configurations
scrape_configs:
  # Prometheus monitors itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # node_exporter on every host
  - job_name: "node"
    static_configs:
      - targets:
        - "10.78.0.1:9100"     # hub (monitoring node)
        - "10.78.1.1:9100"     # node-1 (web)
        - "10.78.2.1:9100"     # node-2 (database)
        - "10.78.3.1:9100"     # node-3 (app server)
        - "10.78.4.1:9100"     # node-4 (build runner)
    relabel_configs:
      - source_labels: [__address__]
        regex: '10\.78\.(\d+)\.\d+:\d+'
        target_label: node_id
      - source_labels: [__address__]
        regex: '(.+):\d+'
        target_label: instance

  # ZFS exporter on every host
  - job_name: "zfs"
    static_configs:
      - targets:
        - "10.78.0.1:9134"
        - "10.78.1.1:9134"
        - "10.78.2.1:9134"
        - "10.78.3.1:9134"
        - "10.78.4.1:9134"
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):\d+'
        target_label: instance

  # WireGuard exporter on every host
  - job_name: "wireguard"
    static_configs:
      - targets:
        - "10.78.0.1:9586"
        - "10.78.1.1:9586"
        - "10.78.2.1:9586"
        - "10.78.3.1:9586"
        - "10.78.4.1:9586"
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):\d+'
        target_label: instance

  # Grafana health
  - job_name: "grafana"
    static_configs:
      - targets: ["localhost:3000"]

  # Alertmanager health
  - job_name: "alertmanager"
    static_configs:
      - targets: ["localhost:9093"]

  # libvirt exporter (KVM hosts only)
  - job_name: "libvirt"
    static_configs:
      - targets:
        - "10.78.0.1:9177"
        - "10.78.3.1:9177"
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):\d+'
        target_label: instance

Notice the scrape targets use WireGuard addresses (10.78.x.x). Every exporter port is reachable only over the mesh, so Prometheus never needs to reach a public IP and the tunnel encrypts all metric traffic. The kldload mesh also provides a dedicated metrics plane on wg2 (10.79.x.x), which lets you apply different firewall rules to monitoring traffic than to application traffic; substitute those addresses if you scrape over wg2. If the mesh is down, Prometheus will show the node as down, which is exactly what you want.
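
To see what the relabel_configs in the node job produce, the two regexes can be emulated with sed. This is only an approximation for experimenting with addresses: the function names are hypothetical, sed needs [0-9] where Prometheus accepts \d, and the explicit ^...$ stands in for the anchoring Prometheus applies to relabel regexes.

```shell
# First rule: third octet of a 10.78.x.x address becomes node_id.
node_id_from_address() {
  printf '%s\n' "$1" | sed -nE 's/^10\.78\.([0-9]+)\.[0-9]+:[0-9]+$/\1/p'
}

# Second rule: strip the port so instance is just the host address.
instance_from_address() {
  printf '%s\n' "$1" | sed -nE 's/^(.+):[0-9]+$/\1/p'
}

node_id_from_address  "10.78.3.1:9100"   # prints 3
instance_from_address "10.78.3.1:9100"   # prints 10.78.3.1
```

An address outside 10.78.x.x prints nothing from node_id_from_address, which mirrors Prometheus leaving the target label untouched when the regex does not match.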

Systemd service

cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus Time Series Database
Documentation=https://prometheus.io/docs/
After=network-online.target zfs-mount.service
Wants=network-online.target
Requires=zfs-mount.service

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=90d \
  --storage.tsdb.retention.size=50GB \
  --storage.tsdb.wal-compression \
  --web.listen-address=:9090 \
  --web.enable-lifecycle \
  --web.enable-admin-api
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=5

ProtectSystem=strict
ProtectHome=yes
NoNewPrivileges=yes
ReadWritePaths=/var/lib/prometheus

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now prometheus

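Key flags: --storage.tsdb.retention.time=90d keeps 90 days of data. --storage.tsdb.retention.size=50GB caps disk usage. Whichever limit is hit first triggers eviction of the oldest blocks. --storage.tsdb.wal-compression compresses the write-ahead log, roughly halving WAL size. --web.enable-lifecycle allows hot-reloading config via curl -X POST http://localhost:9090/-/reload. --web.enable-admin-api enables the TSDB admin endpoints (snapshot, delete series, clean tombstones). Neither the lifecycle nor the admin endpoints are authenticated, so keep port 9090 reachable only over the mesh.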
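
Prometheus keeps running on its previous configuration when a reload fails, which can hide a broken file until the next restart, so it is worth validating before reloading. A minimal bash sketch follows; reload_prometheus is a hypothetical helper, and the PROMTOOL, CURL, and PROM_URL overrides are conveniences invented here so the function can be dry-run.

```shell
# Validate the config with promtool, then ask Prometheus to reload it.
# Arg: path to prometheus.yml (defaults to /etc/prometheus/prometheus.yml).
reload_prometheus() {
  local config="${1:-/etc/prometheus/prometheus.yml}"
  local promtool="${PROMTOOL:-promtool}" curl="${CURL:-curl}"
  local url="${PROM_URL:-http://localhost:9090}"

  if ! "$promtool" check config "$config" >/dev/null; then
    echo "config check failed, not reloading" >&2
    return 1
  fi
  # -f makes curl exit non-zero on an HTTP error response.
  "$curl" -fsS -X POST "$url/-/reload"
}
```

Typical use after editing the config is simply reload_prometheus, which requires --web.enable-lifecycle as set in the unit above.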

Recording rules — pre-compute expensive queries

Recording rules run a PromQL query on a schedule and store the result as a new time series. Use them for dashboard queries that would be too expensive to compute on every page load.

cat > /etc/prometheus/recording_rules.yml << 'EOF'
groups:
  - name: node_recording_rules
    interval: 60s
    rules:
      # CPU usage percentage (pre-computed for dashboards)
      - record: instance:node_cpu_utilization:ratio
        expr: |
          1 - avg without(cpu, mode) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )

      # Memory usage percentage
      - record: instance:node_memory_utilization:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            / node_memory_MemTotal_bytes
          )

      # Disk I/O utilization
      - record: instance:node_disk_io_utilization:ratio
        expr: |
          rate(node_disk_io_time_seconds_total[5m])

  - name: zfs_recording_rules
    interval: 60s
    rules:
      # ARC hit rate (node_exporter exports ARC kstats without a _total suffix)
      - record: instance:zfs_arc_hit_ratio:ratio
        expr: |
          rate(node_zfs_arc_hits[5m])
          / (
            rate(node_zfs_arc_hits[5m])
            + rate(node_zfs_arc_misses[5m])
          )

      # Pool usage ratio (from zfs_exporter)
      - record: instance:zfs_pool_usage:ratio
        expr: |
          zfs_dataset_used_bytes{type="filesystem"}
          / (
            zfs_dataset_used_bytes{type="filesystem"}
            + zfs_dataset_available_bytes{type="filesystem"}
          )

  - name: wireguard_recording_rules
    interval: 60s
    rules:
      # WireGuard peer handshake staleness (seconds since last handshake)
      - record: instance:wireguard_peer_handshake_age:seconds
        expr: |
          time() - wireguard_latest_handshake_seconds

      # WireGuard throughput per peer
      - record: instance:wireguard_peer_sent_rate:bytes_per_second
        expr: |
          rate(wireguard_sent_bytes_total[5m])
EOF
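
The ARC hit ratio rule divides the hit rate by the combined hit and miss rate. Because both rates cover the same window, the window length cancels and the ratio depends only on how much each counter grew. The bash sketch below works that arithmetic from two counter readings; arc_hit_ratio is a hypothetical helper, and unlike rate() it makes no attempt to handle counter resets.

```shell
# Hit ratio over an interval from two readings of the hit and miss counters:
#   (hits2 - hits1) / ((hits2 - hits1) + (misses2 - misses1))
# Args: hits1 misses1 hits2 misses2
arc_hit_ratio() {
  awk -v h1="$1" -v m1="$2" -v h2="$3" -v m2="$4" 'BEGIN {
    dh = h2 - h1; dm = m2 - m1
    if (dh + dm <= 0) { print "NaN"; exit }   # no ARC traffic in the window
    printf "%.4f\n", dh / (dh + dm)
  }'
}

arc_hit_ratio 1000 100 1950 150   # 950 hits, 50 misses: prints 0.9500
```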

Federation — multi-cluster Prometheus

If you have multiple kldload clusters (e.g., prod-east and prod-west), a federated Prometheus on a central node can scrape the recording rules from each cluster's Prometheus:

# On the central/global Prometheus
scrape_configs:
  - job_name: "federate-prod-east"
    honor_labels: true
    metrics_path: "/federate"
    params:
      'match[]':
        - '{__name__=~"instance:.*"}'           # All recording rules
        - '{__name__=~"job:.*"}'                 # Job-level aggregates
        - 'up'                                   # Target health
    static_configs:
      - targets:
        - "10.79.0.1:9090"   # prod-east Prometheus over wg2

  - job_name: "federate-prod-west"
    honor_labels: true
    metrics_path: "/federate"
    params:
      'match[]':
        - '{__name__=~"instance:.*"}'
        - '{__name__=~"job:.*"}'
        - 'up'
    static_configs:
      - targets:
        - "10.79.10.1:9090"  # prod-west Prometheus over wg2

Remote write — Thanos or Mimir for long-term storage

For retention beyond what local ZFS can hold, Prometheus can remote-write to Thanos or Grafana Mimir, which store data in object storage (S3, MinIO). Add to prometheus.yml:

# Remote write to Thanos receive or Mimir
remote_write:
  - url: "http://thanos-receive.internal:19291/api/v1/receive"
    queue_config:
      max_samples_per_send: 5000
      batch_send_deadline: 5s
      max_shards: 10
    write_relabel_configs:
      # Only send recording rules to long-term storage (reduce volume)
      - source_labels: [__name__]
        regex: 'instance:.*|job:.*'
        action: keep

Service discovery

For large deployments, static configs become unwieldy. Prometheus supports file-based service discovery — drop JSON or YAML files into a directory and Prometheus picks up new targets automatically:

# In prometheus.yml
scrape_configs:
  - job_name: "node"
    file_sd_configs:
      - files:
        - "/etc/prometheus/targets/nodes/*.yml"
        refresh_interval: 30s

# /etc/prometheus/targets/nodes/prod.yml
# Add/remove hosts by editing this file — no Prometheus restart needed
- targets:
    - "10.78.0.1:9100"
    - "10.78.1.1:9100"
    - "10.78.2.1:9100"
  labels:
    environment: "production"
    site: "east"

- targets:
    - "10.78.10.1:9100"
    - "10.78.11.1:9100"
  labels:
    environment: "production"
    site: "west"
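
Target files are usually generated from an inventory rather than edited by hand. The sketch below, in Python with only the standard library, shows one way to do that. It assumes a hypothetical inventory dictionary keyed by site, and the function names are illustrative. It writes JSON, which file-based discovery also accepts, so the output should go to a .json file with a matching glob such as "/etc/prometheus/targets/nodes/*.json" added to file_sd_configs.

```python
import json
import os
import tempfile


def build_target_groups(inventory, port=9100):
    """Turn {site: [ip, ...]} into a list of file_sd target groups."""
    groups = []
    for site, hosts in sorted(inventory.items()):
        groups.append({
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": {"environment": "production", "site": site},
        })
    return groups


def write_atomically(path, groups):
    """Write to a temp file in the same directory, then rename over the target.

    The rename means Prometheus never reads a half-written file.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; Prometheus must be able to read it
    os.replace(tmp_path, path)


# Hypothetical inventory, mirroring the hosts in the YAML example above.
inventory = {
    "east": ["10.78.0.1", "10.78.1.1", "10.78.2.1"],
    "west": ["10.78.10.1", "10.78.11.1"],
}
print(json.dumps(build_target_groups(inventory), indent=2))
```

The temporary file uses a .tmp suffix so it never matches the discovery glob while it is being written.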

Verify

# Check config syntax
promtool check config /etc/prometheus/prometheus.yml
# Checking /etc/prometheus/prometheus.yml
#   SUCCESS: 2 rule files found
#  SUCCESS: /etc/prometheus/prometheus.yml is valid prometheus config file

# Check the health endpoint
curl -s http://localhost:9090/-/healthy
# Prometheus Server is Healthy.

# Query the API
curl -s 'http://localhost:9090/api/v1/targets' | python3 -m json.tool | head -20
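
The raw targets output is long, so a summary per job is often more useful. This is a minimal sketch in Python that condenses an /api/v1/targets response into counts and a list of unhealthy instances. The SAMPLE payload is made up for illustration and contains only the fields the code reads; in practice you would load the JSON returned by the curl command above.

```python
from collections import Counter


def summarize_targets(payload):
    """Count active targets per (job, health) pair."""
    counts = Counter()
    for target in payload["data"]["activeTargets"]:
        counts[(target["labels"]["job"], target["health"])] += 1
    return counts


def down_targets(payload):
    """Return the instance labels of every target that is not healthy."""
    return sorted(
        target["labels"]["instance"]
        for target in payload["data"]["activeTargets"]
        if target["health"] != "up"
    )


# Hypothetical, trimmed response for illustration only.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node", "instance": "10.78.0.1:9100"}, "health": "up"},
            {"labels": {"job": "node", "instance": "10.78.1.1:9100"}, "health": "down"},
            {"labels": {"job": "zfs", "instance": "10.78.0.1:9134"}, "health": "up"},
        ]
    },
}

for (job, health), count in sorted(summarize_targets(SAMPLE).items()):
    print(f"{job:10s} {health:8s} {count}")
print("not healthy:", down_targets(SAMPLE))
```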

Grafana — dashboards and visualization

Grafana is the visualization layer. On kldload systems, we provision Grafana entirely via YAML and JSON — data sources, dashboards, and alert notification channels are all defined as files, deployed with the system, and version-controlled in git. No clicking through a web UI to configure things that should be code.

Installation

CentOS / RHEL / Rocky / Fedora

cat > /etc/yum.repos.d/grafana.repo << 'EOF'
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF

dnf install -y grafana

Debian / Ubuntu

apt install -y apt-transport-https software-properties-common
curl -fsSL https://apt.grafana.com/gpg.key | gpg --dearmor -o /usr/share/keyrings/grafana.gpg
echo "deb [signed-by=/usr/share/keyrings/grafana.gpg] https://apt.grafana.com stable main" \
  > /etc/apt/sources.list.d/grafana.list
apt update && apt install -y grafana

systemctl enable --now grafana-server

Open http://<monitoring-node>:3000 — default login is admin/admin. You will be prompted to change the password.

Provisioning via YAML — no GUI clicking

Grafana reads provisioning files from /etc/grafana/provisioning/. This is how you define data sources and dashboard directories as code:

# Data source provisioning
cat > /etc/grafana/provisioning/datasources/prometheus.yml << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "15s"
      httpMethod: POST

  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100
    editable: false
    jsonData:
      maxLines: 1000
EOF

# Dashboard provisioning
mkdir -p /var/lib/grafana/dashboards

cat > /etc/grafana/provisioning/dashboards/kldload.yml << 'EOF'
apiVersion: 1
providers:
  - name: 'kldload'
    orgId: 1
    folder: 'kldload'
    type: file
    disableDeletion: true
    updateIntervalSeconds: 30
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false
EOF

Dashboard: Host Overview

This dashboard JSON gives you CPU, memory, network, and disk for every kldload node. Drop it into /var/lib/grafana/dashboards/host-overview.json:

{
  "title": "kldload Host Overview",
  "uid": "kldload-host-overview",
  "timezone": "browser",
  "refresh": "30s",
  "time": { "from": "now-1h", "to": "now" },
  "templating": {
    "list": [
      {
        "name": "instance",
        "type": "query",
        "query": "label_values(up{job=\"node\"}, instance)",
        "datasource": "Prometheus",
        "refresh": 2,
        "includeAll": true,
        "multi": true
      }
    ]
  },
  "panels": [
    {
      "title": "CPU Usage",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "targets": [
        {
          "expr": "instance:node_cpu_utilization:ratio{instance=~\"$instance\"} * 100",
          "legendFormat": "{{instance}}"
        }
      ],
      "fieldConfig": {
        "defaults": { "unit": "percent", "max": 100, "min": 0 }
      }
    },
    {
      "title": "Memory Usage",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "targets": [
        {
          "expr": "instance:node_memory_utilization:ratio{instance=~\"$instance\"} * 100",
          "legendFormat": "{{instance}}"
        }
      ],
      "fieldConfig": {
        "defaults": { "unit": "percent", "max": 100, "min": 0 }
      }
    },
    {
      "title": "Network Receive",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
      "targets": [
        {
          "expr": "rate(node_network_receive_bytes_total{device!~\"lo|veth.*|br.*|docker.*\",instance=~\"$instance\"}[5m])",
          "legendFormat": "{{instance}} - {{device}}"
        }
      ],
      "fieldConfig": {
        "defaults": { "unit": "Bps" }
      }
    },
    {
      "title": "Disk I/O",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
      "targets": [
        {
          "expr": "rate(node_disk_read_bytes_total{instance=~\"$instance\"}[5m])",
          "legendFormat": "{{instance}} read - {{device}}"
        },
        {
          "expr": "rate(node_disk_written_bytes_total{instance=~\"$instance\"}[5m])",
          "legendFormat": "{{instance}} write - {{device}}"
        }
      ],
      "fieldConfig": {
        "defaults": { "unit": "Bps" }
      }
    }
  ]
}
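
Grafana's file provider reads the dashboard model itself, with title, uid, and panels at the top level. Dashboards exported through the HTTP API arrive wrapped in an envelope with "dashboard" and "overwrite" keys, and that envelope has to be removed before the file goes into the provisioning directory. The sketch below, in Python with only the standard library, shows one way to normalize either shape. The function name is illustrative, and the uid check is a convention of this sketch rather than a Grafana requirement.

```python
import json


def to_provisioning_model(doc):
    """Return the bare dashboard model expected by Grafana's file provider."""
    # API exports look like {"dashboard": {...}, "overwrite": true}; unwrap them.
    model = doc["dashboard"] if isinstance(doc.get("dashboard"), dict) else doc
    # Grafana rejects provisioned dashboards without a title. A stable uid is
    # checked here as well so that links to the dashboard keep working.
    for field in ("title", "uid"):
        if not model.get(field):
            raise ValueError(f"dashboard is missing required field: {field}")
    return model


wrapped = json.loads(
    '{"dashboard": {"title": "kldload Host Overview",'
    ' "uid": "kldload-host-overview", "panels": []}, "overwrite": true}'
)
print(json.dumps(to_provisioning_model(wrapped), indent=2))
```

Running a check like this in CI before dashboards are deployed catches an empty or wrapped file before Grafana logs a provisioning error.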

Dashboard: ZFS Health

The ZFS dashboard shows pool state, usage, ARC performance, scrub progress, and snapshot counts. These are the PromQL queries for each panel:

# Panel: Pool Health (stat panel, green/red)
zfs_pool_health{state="online",instance=~"$instance"}

# Panel: Pool Usage Percentage (gauge)
zfs_pool_usage_percent{instance=~"$instance"}

# Panel: ARC Hit Rate (gauge, threshold: green >90, yellow >80, red <80)
instance:zfs_arc_hit_ratio:ratio{instance=~"$instance"} * 100

# Panel: ARC Size vs Target (timeseries)
node_zfs_arc_size{instance=~"$instance"}                    # actual ARC size
zfs_arc_target_bytes{instance=~"$instance"}                 # target (arc_c)
node_memory_MemTotal_bytes{instance=~"$instance"} * 0.5     # 50% of RAM reference line

# Panel: Scrub Progress (bar gauge)
zfs_pool_scrub_progress{instance=~"$instance"} * 100

# Panel: Scrub Errors (stat, threshold: 0=green, >0=red)
zfs_scrub_errors_total{instance=~"$instance"}

# Panel: Snapshot Count per Dataset (table)
zfs_snapshot_count{instance=~"$instance"}

# Panel: Dataset Usage (bar chart)
zfs_dataset_used_bytes{instance=~"$instance",type="filesystem"}

# Panel: Pool I/O (timeseries)
rate(zfs_pool_read_ops_total{instance=~"$instance"}[5m])
rate(zfs_pool_write_ops_total{instance=~"$instance"}[5m])

# Panel: Pool Throughput (timeseries)
rate(zfs_pool_read_bytes_total{instance=~"$instance"}[5m])
rate(zfs_pool_write_bytes_total{instance=~"$instance"}[5m])

# Panel: Compression Ratio (table)
zfs_compression_ratio{instance=~"$instance"}

Dashboard: ARC Performance

The ARC (Adaptive Replacement Cache) is ZFS's read cache. If the ARC hit rate drops below 85%, you are leaving performance on the table. This dashboard helps you tune ARC sizing.

# Panel: ARC Hit Rate over Time (timeseries)
rate(node_zfs_arc_hits_total{instance=~"$instance"}[5m])
/ (rate(node_zfs_arc_hits_total{instance=~"$instance"}[5m])
   + rate(node_zfs_arc_misses_total{instance=~"$instance"}[5m])) * 100

# Panel: ARC MFU vs MRU (stacked area)
# MFU = Most Frequently Used, MRU = Most Recently Used
zfs_arc_mfu_size_bytes{instance=~"$instance"}
zfs_arc_mru_size_bytes{instance=~"$instance"}

# Panel: ARC Evictions (timeseries — high evictions = ARC too small)
rate(node_zfs_arc_evict_skip_total{instance=~"$instance"}[5m])

# Panel: ARC Demand vs Prefetch Hits (timeseries)
rate(node_zfs_arc_demand_hits_total{instance=~"$instance"}[5m])
rate(node_zfs_arc_prefetch_hits_total{instance=~"$instance"}[5m])

# Panel: L2ARC Hit Rate (if you have L2ARC configured)
rate(zfs_arc_l2_hits_total{instance=~"$instance"}[5m])
/ (rate(zfs_arc_l2_hits_total{instance=~"$instance"}[5m])
   + rate(zfs_arc_l2_misses_total{instance=~"$instance"}[5m])) * 100

Dashboard: WireGuard Mesh

# Panel: Peer Status (stat, per peer — green if handshake <180s ago)
time() - wireguard_latest_handshake_seconds{instance=~"$instance"}

# Panel: Peer Throughput (timeseries)
rate(wireguard_sent_bytes_total{instance=~"$instance"}[5m])
rate(wireguard_received_bytes_total{instance=~"$instance"}[5m])

# Panel: Handshake Age (table — sort by age, flag stale peers)
sort_desc(
  time() - wireguard_latest_handshake_seconds{instance=~"$instance"}
)

# Panel: Total Mesh Traffic (single stat)
sum(rate(wireguard_sent_bytes_total[5m])) + sum(rate(wireguard_received_bytes_total[5m]))

Dashboard: KVM Virtual Machines

# Panel: VM CPU Usage (timeseries, requires libvirt_exporter)
rate(libvirt_domain_info_cpu_time_seconds_total{instance=~"$instance"}[5m])

# Panel: VM Memory Usage (gauge)
libvirt_domain_info_memory_usage_bytes{instance=~"$instance"}
/ libvirt_domain_info_maximum_memory_bytes{instance=~"$instance"} * 100

# Panel: VM Disk Read/Write (timeseries)
rate(libvirt_domain_block_stats_read_bytes_total{instance=~"$instance"}[5m])
rate(libvirt_domain_block_stats_write_bytes_total{instance=~"$instance"}[5m])

# Panel: VM Network I/O (timeseries)
rate(libvirt_domain_interface_stats_receive_bytes_total{instance=~"$instance"}[5m])
rate(libvirt_domain_interface_stats_transmit_bytes_total{instance=~"$instance"}[5m])

# Panel: VM State (stat — running=green, shutoff=grey, paused=yellow)
libvirt_domain_info_state{instance=~"$instance"}

Dashboard: eBPF Metrics

# Panel: Syscall Latency p99 (requires eBPF exporter or textfile metrics)
histogram_quantile(0.99, rate(ebpf_syscall_latency_seconds_bucket[5m]))

# Panel: TCP Retransmits per Second
rate(node_netstat_Tcp_RetransSegs[5m])

# Panel: Block I/O Latency p99 (bcc/bpftrace textfile metrics)
ebpf_bio_latency_seconds{quantile="0.99",instance=~"$instance"}

# Panel: TCP Connection Rate
rate(node_netstat_Tcp_ActiveOpens[5m])
rate(node_netstat_Tcp_PassiveOpens[5m])

Alertmanager — alert routing and notification

Alertmanager receives alerts from Prometheus, deduplicates them, groups related alerts, applies silences and inhibitions, and routes them to the correct receiver. It runs as a separate process so Prometheus can be restarted without losing alert state.

Installation

curl -LO https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar xzf alertmanager-0.27.0.linux-amd64.tar.gz
cp alertmanager-0.27.0.linux-amd64/{alertmanager,amtool} /usr/local/bin/
chmod 755 /usr/local/bin/{alertmanager,amtool}
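mkdir -p /etc/alertmanager /var/lib/alertmanager
# The systemd unit below runs as the prometheus user, which must own the data directory
chown prometheus:prometheus /var/lib/alertmanager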

Complete alertmanager.yml

# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: 'alertmanager@kldload.local'
  smtp_smarthost: 'smtp.example.com:587'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'smtp-password-here'
  smtp_require_tls: true
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'

# Inhibition rules: suppress lower-severity alerts when critical fires
inhibit_rules:
  # If a node is down, suppress all other alerts for that node
  - source_matchers:
      - alertname = NodeDown
    target_matchers:
      - severity =~ "warning|info"
    equal: ['instance']

  # If a pool is faulted, suppress degraded alerts
  - source_matchers:
      - alertname = ZFSPoolFaulted
    target_matchers:
      - alertname = ZFSPoolDegraded
    equal: ['instance', 'pool']

# Routing tree
route:
  receiver: 'default-slack'
  group_by: ['alertname', 'cluster', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts go to PagerDuty AND Slack immediately
    - matchers:
        - severity = critical
      receiver: 'critical-pagerduty'
      group_wait: 10s
      repeat_interval: 1h
      continue: true     # Also send to next matching route

    - matchers:
        - severity = critical
      receiver: 'critical-slack'
      group_wait: 10s

    # ZFS alerts go to the storage channel
    - matchers:
        - alertname =~ "ZFS.*|ARC.*|Scrub.*|Pool.*"
      receiver: 'storage-slack'
      group_by: ['alertname', 'pool']

    # WireGuard alerts go to the network channel
    - matchers:
        - alertname =~ "WireGuard.*|Peer.*"
      receiver: 'network-slack'

# Receivers
receivers:
  - name: 'default-slack'
    slack_configs:
      - channel: '#monitoring'
        send_resolved: true
        title: '{{ .Status | toUpper }} {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Instance:* {{ .Labels.instance }}
          *Severity:* {{ .Labels.severity }}
          *Summary:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          {{ end }}

  - name: 'critical-pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
        send_resolved: true

  - name: 'critical-slack'
    slack_configs:
      - channel: '#critical-alerts'
        send_resolved: true
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: '{{ .Status | toUpper }} {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Labels.alertname }}* on {{ .Labels.instance }}
          {{ .Annotations.summary }}
          {{ end }}

  - name: 'storage-slack'
    slack_configs:
      - channel: '#storage-alerts'
        send_resolved: true
        title: 'ZFS {{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Pool:* {{ if .Labels.pool }}{{ .Labels.pool }}{{ else }}n/a{{ end }}
          *Instance:* {{ .Labels.instance }}
          {{ .Annotations.summary }}
          {{ end }}

  - name: 'network-slack'
    slack_configs:
      - channel: '#network-alerts'
        send_resolved: true
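
The routing tree is easier to reason about when you can trace an alert through it. The following is a simplified sketch in Python of how the tree above is evaluated: child routes are tried in order, the first match wins unless that route sets continue, and an alert that matches no child falls back to the parent's receiver. It models only equality and regex matchers and ignores grouping and timing. The ROOT structure is a hand transcription of the config above, not something Alertmanager produces.

```python
import re

# (label, operator, value) matchers transcribed from alertmanager.yml above.
ROOT = {
    "receiver": "default-slack",
    "routes": [
        {"matchers": [("severity", "=", "critical")],
         "receiver": "critical-pagerduty", "continue": True},
        {"matchers": [("severity", "=", "critical")],
         "receiver": "critical-slack"},
        {"matchers": [("alertname", "=~", "ZFS.*|ARC.*|Scrub.*|Pool.*")],
         "receiver": "storage-slack"},
        {"matchers": [("alertname", "=~", "WireGuard.*|Peer.*")],
         "receiver": "network-slack"},
    ],
}


def matches(matchers, labels):
    """True if every matcher accepts the alert's labels."""
    for name, operator, value in matchers:
        actual = labels.get(name, "")
        if operator == "=" and actual != value:
            return False
        # Alertmanager regex matchers are anchored, hence fullmatch.
        if operator == "=~" and not re.fullmatch(value, actual):
            return False
    return True


def route(node, labels):
    """Return the receivers an alert with these labels would reach."""
    if not matches(node.get("matchers", []), labels):
        return []
    receivers = []
    for child in node.get("routes", []):
        hit = route(child, labels)
        receivers.extend(hit)
        if hit and not child.get("continue", False):
            break  # first matching child wins unless it asks to continue
    return receivers or [node["receiver"]]


for alert in (
    {"alertname": "NodeDown", "severity": "critical"},
    {"alertname": "ZFSPoolUsageHigh", "severity": "warning"},
    {"alertname": "ZFSPoolFaulted", "severity": "critical"},
    {"alertname": "HighCPU", "severity": "warning"},
):
    print(alert["alertname"], "->", route(ROOT, alert))
```

One consequence worth noticing: a critical ZFS alert stops at the critical-slack route, so it is delivered to PagerDuty and #critical-alerts but not to #storage-alerts. If the storage channel should also see critical pool alerts, the second critical route needs continue: true as well.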

Systemd unit

cat > /etc/systemd/system/alertmanager.service << 'EOF'
[Unit]
Description=Prometheus Alertmanager
Documentation=https://prometheus.io/docs/alerting/latest/alertmanager/
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=:9093 \
  --cluster.listen-address=""
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=5

ProtectSystem=strict
ProtectHome=yes
NoNewPrivileges=yes
ReadWritePaths=/var/lib/alertmanager

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now alertmanager

Complete alert rules

These are the alert rules that cover every critical dimension of a kldload system. Drop this into /etc/prometheus/alerts.yml:

cat > /etc/prometheus/alerts.yml << 'EOF'
groups:
  # ── Node health ────────────────────────────────────────
  - name: node_health
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is unreachable"
          description: "Prometheus has been unable to scrape {{ $labels.instance }} for 2 minutes."

      - alert: HighCPU
        expr: instance:node_cpu_utilization:ratio > 0.90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage >90% on {{ $labels.instance }}"
          description: "CPU has been above 90% for 10 minutes. Current: {{ $value | humanizePercentage }}"

      - alert: HighMemory
        expr: instance:node_memory_utilization:ratio > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage >90% on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}. Consider lowering the ARC limit (zfs_arc_max) or adding RAM."

      - alert: HighMemoryCritical
        expr: instance:node_memory_utilization:ratio > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Memory usage >95% on {{ $labels.instance }} — OOM risk"
          description: "Memory usage is {{ $value | humanizePercentage }}. OOM killer may activate."

      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% disk space on {{ $labels.instance }}"

      - alert: DiskSpaceCritical
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Less than 5% disk space on {{ $labels.instance }} — risk of data loss"

      - alert: ClockDrift
        expr: abs(node_ntp_offset_seconds) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "NTP clock drift >100ms on {{ $labels.instance }}"
          description: "Clock offset: {{ $value }}s. This can cause Prometheus sample ordering issues."

      - alert: SystemdUnitFailed
        expr: node_systemd_unit_state{state="failed"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "systemd unit {{ $labels.name }} failed on {{ $labels.instance }}"

  # ── ZFS health ─────────────────────────────────────────
  - name: zfs_health
    rules:
      - alert: ZFSPoolDegraded
        expr: zfs_pool_health{state="degraded"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool {{ $labels.pool }} DEGRADED on {{ $labels.instance }}"
          description: "A vdev in pool {{ $labels.pool }} has failed. Data is still accessible but redundancy is lost. Replace the failed device immediately."

      - alert: ZFSPoolFaulted
        expr: zfs_pool_health{state="faulted"} == 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool {{ $labels.pool }} FAULTED on {{ $labels.instance }}"
          description: "Pool {{ $labels.pool }} has experienced an unrecoverable error. DATA MAY BE INACCESSIBLE."

      - alert: ZFSPoolUsageHigh
        expr: zfs_pool_usage_percent > 80
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "ZFS pool usage >80% on {{ $labels.instance }}"
          description: "Pool {{ $labels.pool }} is {{ $value }}% full. ZFS performance degrades significantly above 80% capacity. Add storage or delete snapshots."

      - alert: ZFSPoolUsageCritical
        expr: zfs_pool_usage_percent > 90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool usage >90% on {{ $labels.instance }} — critical"
          description: "Pool {{ $labels.pool }} is {{ $value }}% full. Writes will fail once the pool runs out of space."

      - alert: ARCHitRateLow
        expr: instance:zfs_arc_hit_ratio:ratio < 0.85
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "ZFS ARC hit rate below 85% on {{ $labels.instance }}"
          description: "ARC hit rate is {{ $value | humanizePercentage }}. Consider increasing ARC max size or investigating the workload. Below 85% means significant I/O is hitting disk."

      - alert: ARCHitRateCritical
        expr: instance:zfs_arc_hit_ratio:ratio < 0.70
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "ZFS ARC hit rate below 70% on {{ $labels.instance }} — severe cache pressure"
          description: "ARC hit rate is {{ $value | humanizePercentage }}. The working set exceeds ARC capacity. Add RAM or reduce dataset count."

      - alert: ScrubErrors
        expr: zfs_scrub_errors_total > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "ZFS scrub found errors on {{ $labels.instance }}"
          description: "Pool {{ $labels.pool }} scrub detected {{ $value }} errors. Check `zpool status {{ $labels.pool }}` immediately."

      - alert: ScrubOverdue
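        # This expression assumes the metric holds the Unix timestamp of the last
        # completed scrub. If your exporter reports a duration here, the alert is
        # always true; substitute the exporter's last-scrub timestamp metric.
        expr: (time() - zfs_pool_scrub_duration_seconds) > (8 * 24 * 3600)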
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "ZFS scrub overdue on {{ $labels.instance }}"
          description: "Pool {{ $labels.pool }} has not been scrubbed in over 8 days."

  # ── WireGuard health ───────────────────────────────────
  - name: wireguard_health
    rules:
      - alert: WireGuardPeerStale
        expr: instance:wireguard_peer_handshake_age:seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WireGuard peer {{ $labels.friendly_name }} stale on {{ $labels.instance }}"
          description: "No handshake in {{ $value | humanizeDuration }}. Peer may be unreachable."

      - alert: WireGuardPeerDown
        expr: instance:wireguard_peer_handshake_age:seconds > 900
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "WireGuard peer {{ $labels.friendly_name }} DOWN on {{ $labels.instance }}"
          description: "No handshake in {{ $value | humanizeDuration }}. The peer is not reachable."

      - alert: WireGuardNoTraffic
        expr: rate(wireguard_received_bytes_total[15m]) == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "No WireGuard traffic from peer {{ $labels.friendly_name }} for 15m"

  # ── Latency and performance ────────────────────────────
  - name: performance
    rules:
      - alert: HighP99Latency
        expr: histogram_quantile(0.99, rate(ebpf_syscall_latency_seconds_bucket[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 syscall latency >100ms on {{ $labels.instance }}"

      - alert: HighTCPRetransmits
        expr: rate(node_netstat_Tcp_RetransSegs[5m]) > 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High TCP retransmit rate on {{ $labels.instance }}"
          description: "{{ $value }} retransmits/sec. Check network path for congestion or packet loss."

      - alert: HighDiskIOUtilization
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk I/O utilization >90% on {{ $labels.instance }} device {{ $labels.device }}"

  # ── Monitoring stack health ────────────────────────────
  - name: monitoring_health
    rules:
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus target {{ $labels.job }}/{{ $labels.instance }} is down"

      - alert: PrometheusStorageFull
        expr: prometheus_tsdb_storage_blocks_bytes / (1024*1024*1024) > 45
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus storage >45GB — approaching retention limit"

      - alert: AlertmanagerDown
        expr: up{job="alertmanager"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager is down — alerts will not be delivered"
EOF

# Validate the rules
promtool check rules /etc/prometheus/alerts.yml

# Reload Prometheus (the HTTP reload endpoint requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

Silencing alerts during maintenance

# Silence all alerts for node-2 for 2 hours (maintenance window)
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="todd" \
  --comment="scheduled maintenance on node-2" \
  --duration=2h \
  instance="10.78.2.1"

# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Expire a silence early (use the ID shown by `amtool silence query`)
amtool silence expire --alertmanager.url=http://localhost:9093 <silence-id>

# View currently firing alerts
amtool alert query --alertmanager.url=http://localhost:9093
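
Maintenance automation often needs to create silences without shelling out to amtool. Alertmanager accepts them as JSON on its v2 API at /api/v2/silences. The sketch below, in Python with only the standard library, builds the request body for the same two-hour silence as the amtool example. The function name and its arguments are illustrative, and the code only constructs the payload; sending it is left to whatever HTTP client you already use.

```python
import json
from datetime import datetime, timedelta, timezone


def silence_payload(matchers, hours, author, comment, now=None):
    """Build the JSON body for POST /api/v2/silences."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            # Exact-match, positive matchers, equivalent to instance="10.78.2.1"
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }


payload = silence_payload(
    {"instance": "10.78.2.1"},
    hours=2,
    author="todd",
    comment="scheduled maintenance on node-2",
    now=datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc),  # fixed time for a repeatable example
)
print(json.dumps(payload, indent=2))
```

POSTing this body with Content-Type: application/json to http://localhost:9093/api/v2/silences creates the silence, and the response carries the silence ID needed to expire it early.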

Loki — log aggregation

Loki is "Prometheus for logs." It stores log data with the same label model as Prometheus, so you can jump from a metric spike to the corresponding log lines in a single Grafana click. Unlike Elasticsearch or Splunk, Loki does not index log content — it only indexes labels. This makes it dramatically cheaper to operate: less CPU, less storage, less RAM.

Installation

# Loki server (on the monitoring node)
curl -LO https://github.com/grafana/loki/releases/download/v3.1.1/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
chmod 755 loki-linux-amd64
mv loki-linux-amd64 /usr/local/bin/loki

# Promtail (on every node)
curl -LO https://github.com/grafana/loki/releases/download/v3.1.1/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
chmod 755 promtail-linux-amd64
mv promtail-linux-amd64 /usr/local/bin/promtail

Loki storage on ZFS

zfs create -o mountpoint=/var/lib/loki \
           -o compression=zstd \
           -o recordsize=128k \
           -o atime=off \
           rpool/loki

mkdir -p /var/lib/loki/{chunks,index,wal,ruler}
useradd --no-create-home --shell /sbin/nologin --system loki
chown -R loki:loki /var/lib/loki

Loki configuration

mkdir -p /etc/loki

cat > /etc/loki/loki.yml << 'EOF'
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: warn

common:
  path_prefix: /var/lib/loki
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  filesystem:
    directory: /var/lib/loki/chunks

  tsdb_shipper:
    active_index_directory: /var/lib/loki/index
    cache_location: /var/lib/loki/cache

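compactor:
  working_directory: /var/lib/loki/compactor
  retention_enabled: true            # without this, retention_period is not enforced
  delete_request_store: filesystem   # required by Loki 3.x when retention is enabled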

limits_config:
  retention_period: 30d
  max_query_series: 500
  max_query_parallelism: 2
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20

chunk_store_config:
  chunk_cache_config:
    embedded_cache:
      enabled: true
      max_size_mb: 256

query_range:
  align_queries_with_step: true
  cache_results: true

analytics:
  reporting_enabled: false
EOF

Loki systemd unit

cat > /etc/systemd/system/loki.service << 'EOF'
[Unit]
Description=Grafana Loki Log Aggregation
Documentation=https://grafana.com/docs/loki/latest/
After=network-online.target zfs-mount.service

[Service]
User=loki
Group=loki
Type=simple
ExecStart=/usr/local/bin/loki \
  -config.file=/etc/loki/loki.yml
Restart=always
RestartSec=5

ProtectSystem=strict
ProtectHome=yes
NoNewPrivileges=yes
ReadWritePaths=/var/lib/loki

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now loki

Promtail configuration (on every node)

mkdir -p /etc/promtail

cat > /etc/promtail/promtail.yml << 'EOF'
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/lib/promtail/positions.yml

clients:
  - url: http://10.78.0.1:3100/loki/api/v1/push

scrape_configs:
  # systemd journal — captures all systemd service logs
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
      - source_labels: ['__journal__hostname']
        target_label: 'hostname'
      - source_labels: ['__journal_priority_keyword']
        target_label: 'level'

  # Syslog and auth logs (Debian/Ubuntu paths; RHEL-family hosts use /var/log/messages and /var/log/secure)
  - job_name: syslog
    static_configs:
      - targets: [localhost]
        labels:
          job: syslog
          __path__: /var/log/syslog
      - targets: [localhost]
        labels:
          job: auth
          __path__: /var/log/auth.log

  # ZFS event logs
  - job_name: zfs
    static_configs:
      - targets: [localhost]
        labels:
          job: zfs
          __path__: /var/log/zfs*.log
    pipeline_stages:
      - regex:
          expression: '^(?P<timestamp>\S+ \S+) (?P<level>\w+) (?P<message>.*)$'
      - labels:
          level:

  # Kernel logs (dmesg)
  - job_name: kernel
    static_configs:
      - targets: [localhost]
        labels:
          job: kernel
          __path__: /var/log/kern.log

  # WireGuard logs (from journal)
  - job_name: wireguard
    journal:
      max_age: 12h
      labels:
        job: wireguard
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        regex: 'wg-quick@.*\.service'
        action: keep
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'

  # libvirt/KVM logs
  - job_name: libvirt
    static_configs:
      - targets: [localhost]
        labels:
          job: libvirt
          __path__: /var/log/libvirt/qemu/*.log
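    # Relabeling in static_configs only sees the glob in __path__, so derive the
    # VM name from the per-file "filename" label inside the pipeline instead
    pipeline_stages:
      - regex:
          source: filename
          expression: '.*/(?P<vm_name>[^/]+)\.log$'
      - labels:
          vm_name: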
EOF
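
A Promtail regex stage only extracts values through named capture groups, and a following labels stage can promote only names that the regex defined. Python's re module uses the same (?P<name>...) syntax, so a pattern can be tried against sample lines before it is deployed. This is a minimal sketch; the sample log line and the group names timestamp, level, and message are assumptions for illustration, so substitute a real line from your own ZFS logs.

```python
import re

# Same shape as the regex stage in the zfs job: two tokens of timestamp,
# one word of level, and the rest of the line as the message.
PATTERN = re.compile(r"^(?P<timestamp>\S+ \S+) (?P<level>\w+) (?P<message>.*)$")


def parse_line(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groupdict() if match else None


# Hypothetical log line; real ZFS log formats may differ.
sample = "2025-01-01 12:00:00 ERROR pool tank vdev failure"
print(parse_line(sample))
```

Lines that return None here would pass through Promtail without a level label, which is worth knowing before you build alerts or dashboards on that label.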

Promtail systemd unit

mkdir -p /var/lib/promtail

cat > /etc/systemd/system/promtail.service << 'EOF'
[Unit]
Description=Grafana Promtail Log Shipper
Documentation=https://grafana.com/docs/loki/latest/clients/promtail/
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/promtail \
  -config.file=/etc/promtail/promtail.yml
Restart=always
RestartSec=5

# Promtail needs read access to log files
ProtectSystem=strict
ProtectHome=yes
NoNewPrivileges=yes
ReadOnlyPaths=/var/log
ReadWritePaths=/var/lib/promtail

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now promtail

LogQL — querying logs

LogQL is Loki's query language. It looks like PromQL but operates on log streams. In Grafana, select the Loki data source and use these queries:

# All logs from a specific host
{hostname="node-1"}

# All error-level logs across all hosts
{level=~"err|crit|alert|emerg"}

# ZFS errors in kernel logs
{job="kernel"} |= "zfs" |= "error"

# WireGuard handshake failures
{job="wireguard"} |= "handshake"

# SSH authentication failures
{job="auth"} |= "Failed password"

# All logs from a specific systemd unit
{unit="prometheus.service"}

# Count errors per host over time (metric from logs)
sum by (hostname) (count_over_time({level="err"}[5m]))

# Top 10 noisiest systemd units
topk(10, sum by(unit) (count_over_time({job="systemd-journal"}[1h])))

# OOM killer events
{job="kernel"} |~ "Out of memory|oom-kill"

# ZFS scrub completions
{job="kernel"} |= "scan: scrub repaired"

# libvirt VM state changes
{job="libvirt"} |~ "domain.*state"

Correlating logs with metrics

In Grafana, you can link from a metric panel to the corresponding logs. When you see a CPU spike, click the time range and jump to Loki to see what was running at that moment. This requires matching labels between Prometheus and Loki:

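Both Promtail and node_exporter should produce a hostname or instance label that matches. The Promtail journal source exposes the host name as the internal label __journal__hostname, which only becomes a queryable label once a relabel_configs rule copies it to hostname. Prometheus sets instance to the target address (for example 10.78.5.1:9100) by default. Use relabel_configs on both sides to normalize them to the same label so Grafana can correlate across data sources.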

The moment you can go from "CPU spike at 14:32" to "kernel log: OOM killer invoked at 14:32:07" in a single click is the moment observability clicks. That is what metrics + logs + correlation gives you. Metrics tell you something is wrong. Logs tell you why. Without both, you are flying blind.

SLOs and error budgets

Service Level Objectives (SLOs) define how reliable your infrastructure must be. An error budget is the allowed amount of unreliability. If your SLO is 99.9% availability (43.8 minutes downtime/month), your error budget is 0.1%. Once you burn through it, you stop deploying and fix reliability. This is the SRE discipline that makes infrastructure sustainable.

Define SLOs for kldload infrastructure

SLO: Host availability 99.9%

SLI: avg_over_time(up{job="node"}[30d]). Target: >0.999. Error budget: 43.8 minutes/month of allowed downtime per host.

SLO: ZFS pool online 99.99%

SLI: avg_over_time(zfs_pool_health{state="online"}[30d]). Target: >0.9999. Error budget: 4.38 minutes/month. A pool going degraded or offline eats budget fast.

SLO: ARC hit rate >90%

SLI: instance:zfs_arc_hit_ratio:ratio. Target: >0.90. When this SLO breaks, disk I/O increases and application latency rises. Add RAM.

SLO: WireGuard mesh connectivity 99.9%

SLI: fraction of peers with handshake <300s. Target: >0.999. A stale peer means a node is isolated from the mesh and cannot be managed.
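
The error budgets quoted above are plain arithmetic on the SLO target and the window length. A minimal sketch (the function name error_budget_minutes is illustrative, not part of any tool) that reproduces them. It also shows that the 43.8-minute figure corresponds to an average calendar month, while a strict 30d window, as used in the SLI queries, allows 43.2 minutes.

```python
def error_budget_minutes(slo_target, window_days):
    """Allowed downtime in minutes for an availability target over a window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes

AVERAGE_MONTH_DAYS = 365.25 / 12  # about 30.44 days

for target in (0.999, 0.9999):
    per_month = error_budget_minutes(target, AVERAGE_MONTH_DAYS)
    per_30d = error_budget_minutes(target, 30)
    print(f"{target:.2%}: {per_month:.2f} min per average month, "
          f"{per_30d:.2f} min per 30d window")
```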

Recording rules for SLIs

# /etc/prometheus/recording_rules.yml (append to existing)
groups:
  - name: slo_recording_rules
    interval: 60s
    rules:
      # Host availability SLI (1 = up, 0 = down)
      - record: slo:host_availability:ratio
        expr: avg_over_time(up{job="node"}[30d])

      # ZFS pool online SLI
      - record: slo:zfs_pool_online:ratio
        expr: avg_over_time(zfs_pool_health{state="online"}[30d])

      # Error budget remaining (1 = full budget, 0 = budget exhausted)
      - record: slo:host_availability:error_budget_remaining
        expr: |
          1 - (
            (1 - slo:host_availability:ratio)
            / (1 - 0.999)
          )

      - record: slo:zfs_pool_online:error_budget_remaining
        expr: |
          1 - (
            (1 - slo:zfs_pool_online:ratio)
            / (1 - 0.9999)
          )
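
To sanity-check what these rules will report, the same formula can be evaluated by hand. A small sketch with a hypothetical helper name, assuming the SLI is a ratio between 0 and 1:

```python
def error_budget_remaining(sli, slo_target):
    """Fraction of the error budget left.

    1 means untouched, 0 means exhausted, negative means overspent.
    """
    budget = 1 - slo_target          # allowed unreliability, e.g. 0.001
    consumed = (1 - sli) / budget    # share of that allowance already used
    return 1 - consumed

# A host that was up 99.95% of the window against a 99.9% target
print(round(error_budget_remaining(0.9995, 0.999), 3))
# A host that was up only 99.8% of the window has overspent its budget
print(round(error_budget_remaining(0.998, 0.999), 3))
```

A negative result is the signal to stop shipping changes and work on reliability.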

Burn rate alerting

Instead of alerting on raw thresholds, burn rate alerting asks: "at the current error rate, how fast are we consuming the error budget?" This avoids alert fatigue from brief blips while catching sustained problems early.

# Burn rate alerts
groups:
  - name: slo_burn_rate
    rules:
      # Fast burn: consuming budget at 14.4x rate over 1h (pages immediately)
      - alert: HostAvailabilityBudgetFastBurn
        expr: |
          (1 - avg_over_time(up{job="node"}[1h])) > (14.4 * 0.001)
          and
          (1 - avg_over_time(up{job="node"}[5m])) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: host_availability
        annotations:
          summary: "Host availability error budget burning fast"
          description: "At a sustained 14.4x burn rate, the 30-day error budget is exhausted in about 2 days."

      # Slow burn: consuming budget at 3x rate over 6h (warns early)
      - alert: HostAvailabilityBudgetSlowBurn
        expr: |
          (1 - avg_over_time(up{job="node"}[6h])) > (3 * 0.001)
          and
          (1 - avg_over_time(up{job="node"}[30m])) > (3 * 0.001)
        for: 5m
        labels:
          severity: warning
          slo: host_availability
        annotations:
          summary: "Host availability error budget burning slowly"
          description: "At a sustained 3x burn rate, the 30-day error budget is exhausted in about 10 days."
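
The multipliers in these rules are easier to reason about with the arithmetic written out. A sketch under the same assumptions as the rules above (30-day window, 99.9% target). The function names are illustrative only.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1 - slo_target)

def days_to_exhaustion(rate, window_days=30):
    """Days until a full budget is gone if the burn rate stays constant."""
    return window_days / rate

def budget_spent(rate, hours, window_days=30):
    """Fraction of the window's budget consumed after `hours` at this rate."""
    return rate * hours / (window_days * 24)

# Fast burn: 1.44% of scrapes failing against a 99.9% target
fast = burn_rate(0.0144, 0.999)
print(f"fast burn rate: {fast:.1f}x")
print(f"days to exhaustion: {days_to_exhaustion(14.4):.2f}")
print(f"budget spent in 1h at 14.4x: {budget_spent(14.4, 1):.1%}")
print(f"budget spent in 6h at 3x: {budget_spent(3, 6):.1%}")
```

So the fast-burn alert fires once roughly 2% of the monthly budget has gone in a single hour, and the slow-burn alert once 2.5% has gone in six hours.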

Multi-host monitoring over WireGuard

On a kldload mesh, every node runs exporters. The monitoring node runs Prometheus, Grafana, Alertmanager, and Loki. All scraping and log shipping happens over the WireGuard mesh — no public ports, no firewall exceptions, no VPN appliances.

Deployment topology

# Monitoring node (10.78.0.1 / wg0)
#   Runs: Prometheus, Grafana, Alertmanager, Loki
#   Runs: node_exporter, zfs_exporter, wireguard_exporter, promtail

# All other nodes (10.78.x.1 / wg0)
#   Run: node_exporter, zfs_exporter, wireguard_exporter, promtail
#   That's it. Four lightweight daemons. ~80MB RAM total.

# Dedicated metrics plane (optional — wg2 / 10.79.x.x)
#   Separate WireGuard interface for monitoring traffic
#   Allows different MTU, different firewall rules, different routing
#   Prometheus scrapes 10.79.x.x:9100 instead of 10.78.x.x:9100

Secure transport

WireGuard encrypts all traffic by default. This means:

No TLS configuration needed on exporters. Prometheus scrapes over HTTP, but the HTTP runs inside the WireGuard tunnel, which provides authenticated encryption. This eliminates the complexity of managing TLS certificates for every exporter on every node. The mesh is the security boundary.

# On each node, ensure exporter ports are only reachable on WireGuard interfaces
# This prevents accidental exposure on public interfaces

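# nftables rule (add to your existing ruleset)
# Loopback stays open so local scrapes and health checks against localhost keep working
nft add rule inet filter input 'iifname != { "wg0", "lo" } tcp dport { 9100, 9134, 9177, 9586, 9080 } drop'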

# Or with firewalld (CentOS/RHEL)
firewall-cmd --zone=public --remove-port=9100/tcp --permanent
firewall-cmd --zone=trusted --add-interface=wg0 --permanent
firewall-cmd --reload
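
Firewall rules protect the exporters, but it is also worth checking that Prometheus is not configured to scrape anything outside the mesh. A minimal sketch, assuming the mesh uses 10.78.0.0/16 (inferred from the addresses on this page, adjust to your own range) and that targets are written as IPv4 "address:port" strings. The function name is hypothetical.

```python
import ipaddress

# Assumed mesh range; change this to match your WireGuard addressing
MESH = ipaddress.ip_network("10.78.0.0/16")

def targets_outside_mesh(targets, network=MESH):
    """Return the scrape targets whose IPv4 address is not inside the mesh network.

    Hostnames are not resolved: ipaddress.ip_address raises ValueError for them.
    """
    outside = []
    for target in targets:
        host = target.rsplit(":", 1)[0]
        if ipaddress.ip_address(host) not in network:
            outside.append(target)
    return outside

print(targets_outside_mesh(["10.78.5.1:9100", "10.78.0.1:9134", "192.168.1.10:9100"]))
```

Anything this returns is a target that would be scraped over plain HTTP outside the tunnel.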

Adding a new node to monitoring

# On the new node: install exporters + promtail (same steps as above)
# Then on the monitoring node:

# 1. Add to Prometheus targets (if using file_sd)
cat >> /etc/prometheus/targets/nodes/prod.yml << 'EOF'
- targets:
    - "10.78.5.1:9100"
  labels:
    environment: "production"
    role: "app-server"
EOF

# 2. Add ZFS and WireGuard exporter targets similarly

# 3. Prometheus auto-discovers via file_sd — no restart needed
# Verify the new target appears:
curl -s http://localhost:9090/api/v1/targets | python3 -c "
import json, sys
targets = json.load(sys.stdin)['data']['activeTargets']
for t in targets:
    print(f\"{t['labels'].get('instance','?'):20s} {t['health']:8s} {t['lastScrape'][:19]}\")"
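
If you only want to see the broken targets, the same API response can be filtered. A sketch, assuming the usual /api/v1/targets response shape (data.activeTargets entries with labels, health, and lastError). The sample payload below is made up for illustration.

```python
def unhealthy_targets(payload):
    """Return (instance, health, last_error) for every active target that is not up."""
    rows = []
    for target in payload["data"]["activeTargets"]:
        if target["health"] != "up":
            instance = target["labels"].get("instance", "?")
            rows.append((instance, target["health"], target.get("lastError", "")))
    return rows

# Made-up sample in the shape returned by GET /api/v1/targets
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"instance": "10.78.0.1:9100", "job": "node"},
             "health": "up", "lastError": ""},
            {"labels": {"instance": "10.78.5.1:9100", "job": "node"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

for instance, health, error in unhealthy_targets(sample):
    print(f"{instance:20s} {health:8s} {error}")
```

In real use, replace the sample with the result of json.load() on the curl output or a urllib request to the monitoring node.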

Monitoring ZFS replication

If you use sanoid/syncoid for ZFS snapshot management and replication, you need to know when replication falls behind or fails. A missed replication window means your DR target is stale. These textfile metrics make replication lag visible in Prometheus.

Replication metrics script

cat > /usr/local/bin/replication-metrics.sh << 'SCRIPT'
#!/bin/bash
# Replication lag and snapshot age metrics for sanoid/syncoid
set -euo pipefail

OUTPUT="/var/lib/node_exporter/textfile/replication.prom"
TMPFILE="${OUTPUT}.tmp"

{
  echo "# HELP zfs_replication_lag_seconds Seconds since last successful syncoid replication"
  echo "# TYPE zfs_replication_lag_seconds gauge"

  echo "# HELP zfs_latest_snapshot_age_seconds Age of the newest snapshot in seconds"
  echo "# TYPE zfs_latest_snapshot_age_seconds gauge"

  echo "# HELP zfs_sanoid_snapshot_count Number of snapshots managed by sanoid"
  echo "# TYPE zfs_sanoid_snapshot_count gauge"

  NOW=$(date +%s)

  # Check replication lag by looking at the newest syncoid snapshot on the target
  # This assumes syncoid snapshot names contain "syncoid_"
  # -p prints creation as epoch seconds, so no date parsing is needed.
  # "|| true" stops set -e/pipefail from aborting when grep matches nothing.
  for ds in $(zfs list -H -o name -t filesystem 2>/dev/null); do
    newest_sync=$(zfs list -t snapshot -Hp -o name,creation -S creation "$ds" 2>/dev/null \
      | grep "syncoid_" | head -1 || true)

    if [[ -n "$newest_sync" ]]; then
      snap_epoch=$(awk -F'\t' '{print $2}' <<< "$newest_sync")
      lag=$((NOW - snap_epoch))
      echo "zfs_replication_lag_seconds{dataset=\"${ds}\"} ${lag}"
    fi
  done

  # Latest snapshot age per dataset (any type, not just syncoid)
  for ds in $(zfs list -H -o name -t filesystem 2>/dev/null); do
    newest=$(zfs list -t snapshot -Hp -o name,creation -S creation "$ds" 2>/dev/null \
      | head -1 || true)
    if [[ -n "$newest" ]]; then
      snap_epoch=$(awk -F'\t' '{print $2}' <<< "$newest")
      age=$((NOW - snap_epoch))
      echo "zfs_latest_snapshot_age_seconds{dataset=\"${ds}\"} ${age}"
    fi
  done

  # Sanoid snapshot count per dataset
  # grep -c already prints 0 on no match (with exit status 1), so only the status is masked
  for ds in $(zfs list -H -o name -t filesystem 2>/dev/null); do
    count=$(zfs list -t snapshot -H -o name "$ds" 2>/dev/null | grep -c "autosnap" || true)
    echo "zfs_sanoid_snapshot_count{dataset=\"${ds}\"} ${count}"
  done

} > "${TMPFILE}"
mv "${TMPFILE}" "${OUTPUT}"
SCRIPT

chmod 755 /usr/local/bin/replication-metrics.sh
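
The core of the script is one calculation: find the newest matching snapshot and subtract its creation time from now. A sketch of that logic in isolation, handy for testing the idea without a pool. It assumes tab-separated input in the form printed by zfs list -t snapshot -Hp -o name,creation, where -p gives creation as epoch seconds. The function name and sample data are illustrative.

```python
def newest_snapshot_age(zfs_list_output, now, name_filter=""):
    """Age in seconds of the newest snapshot whose name contains name_filter.

    Returns None when no snapshot matches.
    """
    newest = None
    for line in zfs_list_output.splitlines():
        if not line.strip():
            continue
        name, creation = line.rsplit("\t", 1)
        if name_filter in name:
            epoch = int(creation)
            if newest is None or epoch > newest:
                newest = epoch
    return None if newest is None else now - newest

# Made-up sample: one sanoid snapshot and one syncoid snapshot
sample_output = (
    "rpool/data@autosnap_2024-01-15_10:00:00_hourly\t1705312800\n"
    "rpool/data@syncoid_backup_2024-01-15:09:30:00\t1705311000\n"
)
now = 1705314600  # 30 minutes after the newest snapshot in the sample

print(newest_snapshot_age(sample_output, now))              # any snapshot
print(newest_snapshot_age(sample_output, now, "syncoid_"))  # replication only
```

Returning None instead of 0 for "no snapshot found" mirrors the script's behavior of emitting no sample at all, which keeps a missing snapshot from looking like perfectly fresh replication.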

Systemd timer for replication metrics

cat > /etc/systemd/system/replication-metrics.service << 'EOF'
[Unit]
Description=Generate ZFS replication metrics for node_exporter

[Service]
Type=oneshot
ExecStart=/usr/local/bin/replication-metrics.sh
User=root
EOF

cat > /etc/systemd/system/replication-metrics.timer << 'EOF'
[Unit]
Description=Run replication metrics every 5 minutes

[Timer]
OnBootSec=60
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now replication-metrics.timer

Replication alert rules

# Add to /etc/prometheus/alerts.yml
groups:
  - name: replication_health
    rules:
      - alert: ReplicationLagHigh
        expr: zfs_replication_lag_seconds > 3600
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "ZFS replication lag >1h on {{ $labels.instance }} dataset {{ $labels.dataset }}"
          description: "Last syncoid snapshot is {{ $value | humanizeDuration }} old."

      - alert: ReplicationLagCritical
        expr: zfs_replication_lag_seconds > 86400
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "ZFS replication lag >24h on {{ $labels.instance }}"
          description: "Dataset {{ $labels.dataset }} has not replicated in {{ $value | humanizeDuration }}. DR target is dangerously stale."

      - alert: SnapshotAgeHigh
        expr: zfs_latest_snapshot_age_seconds > 7200
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "No new snapshots for >2h on {{ $labels.instance }} dataset {{ $labels.dataset }}"
          description: "Sanoid may have stopped or the cron/timer is failing."

      - alert: SanoidNotRunning
        expr: node_systemd_unit_state{name="sanoid.timer",state="active"} != 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "sanoid timer not active on {{ $labels.instance }}"

Monitoring KVM virtual machines

If you run KVM VMs on ZFS zvols, you want visibility into per-VM CPU, memory, disk I/O, and network I/O from the host side (without agents inside the guest). The libvirt exporter provides this.

libvirt exporter installation

curl -LO https://github.com/prometheus-community/libvirt_exporter/releases/download/v0.4.0/libvirt_exporter-0.4.0.linux-amd64.tar.gz
tar xzf libvirt_exporter-0.4.0.linux-amd64.tar.gz
cp libvirt_exporter-0.4.0.linux-amd64/libvirt_exporter /usr/local/bin/
chmod 755 /usr/local/bin/libvirt_exporter

cat > /etc/systemd/system/libvirt_exporter.service << 'EOF'
[Unit]
Description=Prometheus Libvirt Exporter
After=libvirtd.service
Requires=libvirtd.service

[Service]
Type=simple
ExecStart=/usr/local/bin/libvirt_exporter \
  --web.listen-address=:9177 \
  --libvirt.uri="qemu:///system"
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now libvirt_exporter

Key metrics

# VM state (1=running, 5=shutoff, 3=paused)
libvirt_domain_info_state{domain="webserver"}                   1

# CPU time (rate this for usage)
libvirt_domain_info_cpu_time_seconds_total{domain="webserver"}  45678.9

# Memory
libvirt_domain_info_memory_usage_bytes{domain="webserver"}      4.294967296e+09
libvirt_domain_info_maximum_memory_bytes{domain="webserver"}    8.589934592e+09

# Block device I/O (per-disk, typically the zvol)
libvirt_domain_block_stats_read_bytes_total{domain="webserver",target_device="vda"}   1.234e+11
libvirt_domain_block_stats_write_bytes_total{domain="webserver",target_device="vda"}  5.678e+10
libvirt_domain_block_stats_read_requests_total{domain="webserver",target_device="vda"} 4567890
libvirt_domain_block_stats_write_requests_total{domain="webserver",target_device="vda"} 2345678

# Network I/O (per-interface)
libvirt_domain_interface_stats_receive_bytes_total{domain="webserver",target_device="vnet0"} 8.9e+09
libvirt_domain_interface_stats_transmit_bytes_total{domain="webserver",target_device="vnet0"} 3.4e+09

Per-VM zvol I/O correlation

To correlate ZFS zvol I/O with VM workload, match the zvol block device to the VM's disk. On a kldload system with VM zvols under rpool/vms/:

# Find which block device a zvol uses
ls -la /dev/zvol/rpool/vms/webserver
# lrwxrwxrwx 1 root root 10 Jan 15 10:00 /dev/zvol/rpool/vms/webserver -> ../../zd0

# The libvirt exporter reports I/O for the domain's "vda" device
# The node_exporter reports I/O for the "zd0" block device
# Correlate them: libvirt_domain_block_stats_*{domain="webserver"} ←→ node_disk_*{device="zd0"}
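
The symlink lookup is easy to automate when you have many VMs. A minimal sketch: the helper name is hypothetical, and the demonstration uses a throwaway symlink in a temp directory as a stand-in for /dev/zvol/rpool/vms/webserver so it runs on a machine without ZFS.

```python
import os
import tempfile

def zvol_block_device(zvol_path):
    """Resolve a /dev/zvol/... symlink to its kernel block device name, e.g. 'zd0'."""
    return os.path.basename(os.path.realpath(zvol_path))

# Stand-in for the real symlink /dev/zvol/rpool/vms/webserver -> ../../zd0
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "zd0")
    open(target, "w").close()
    link = os.path.join(tmp, "webserver")
    os.symlink(target, link)
    demo_device = zvol_block_device(link)

# Build the node_exporter selector that pairs with the VM's libvirt metrics
selector = f'node_disk_read_bytes_total{{device="{demo_device}"}}'
print(demo_device)
print(selector)
```

On a real host, call zvol_block_device("/dev/zvol/rpool/vms/webserver") directly and use the result as the device label in node_disk_* queries.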

VM alert rules

groups:
  - name: kvm_health
    rules:
      - alert: VMDown
        expr: libvirt_domain_info_state != 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "VM {{ $labels.domain }} is not running on {{ $labels.instance }}"

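      # rate() of CPU seconds is "cores used", so 0.90 means 90% of one core.
      # For multi-vCPU guests, divide by the VM's vCPU count to get a true percentage.
      - alert: VMHighCPU
        expr: rate(libvirt_domain_info_cpu_time_seconds_total[5m]) > 0.90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "VM {{ $labels.domain }} is using >90% of one CPU core on {{ $labels.instance }}"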

      - alert: VMHighMemory
        expr: |
          libvirt_domain_info_memory_usage_bytes
          / libvirt_domain_info_maximum_memory_bytes > 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "VM {{ $labels.domain }} memory usage >95% on {{ $labels.instance }}"

      - alert: VMHighDiskIO
        expr: |
          (rate(libvirt_domain_block_stats_read_bytes_total[5m])
          + rate(libvirt_domain_block_stats_write_bytes_total[5m])) > 500e6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "VM {{ $labels.domain }} disk I/O >500MB/s sustained for 15m"
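
One subtlety in the CPU rule: the rate of a CPU-seconds counter is measured in cores, not in percent of the VM. A sketch of the arithmetic, with an illustrative function name and made-up sample values, showing how the same counter delta reads very differently once it is normalized by vCPU count:

```python
def vm_cpu_utilization(cpu_seconds_start, cpu_seconds_end, interval_seconds, vcpus):
    """Fraction of the VM's total vCPU capacity used over the interval (0.0 to 1.0)."""
    cores_used = (cpu_seconds_end - cpu_seconds_start) / interval_seconds
    return cores_used / vcpus

# 270 CPU-seconds consumed in a 300-second window is 0.9 cores
one_vcpu = vm_cpu_utilization(1000.0, 1270.0, 300, vcpus=1)
four_vcpu = vm_cpu_utilization(1000.0, 1270.0, 300, vcpus=4)
print(f"1 vCPU guest: {one_vcpu:.1%} busy")
print(f"4 vCPU guest: {four_vcpu:.1%} busy")
```

If your exporter publishes a per-domain vCPU count, dividing the rate by it in PromQL gives the same normalization.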

The libvirt exporter gives you host-side VM metrics without any agent inside the guest. This is the correct approach for infrastructure monitoring. You do not need to install node_exporter inside every VM to know that a VM is consuming too much CPU or disk I/O. The hypervisor host already knows. The libvirt exporter just exposes what libvirt/QEMU already tracks internally.

Quick reference

All ports

Port    Service                 Purpose
────    ───────                 ───────
3000    Grafana                 Dashboards and log viewer
3100    Loki                    Log aggregation API
9080    Promtail                Log shipper status
9090    Prometheus              Metrics TSDB and PromQL API
9093    Alertmanager            Alert routing and silencing
9100    node_exporter           Linux system metrics
9134    zfs_exporter            ZFS pool/dataset/ARC metrics
9177    libvirt_exporter        KVM VM metrics
9586    wireguard_exporter      WireGuard peer metrics

All URLs

http://<monitoring-node>:9090           Prometheus UI (query, targets, alerts)
http://<monitoring-node>:9090/targets   Prometheus target health
http://<monitoring-node>:9090/-/healthy Prometheus health check
http://<monitoring-node>:3000           Grafana dashboards
http://<monitoring-node>:9093           Alertmanager UI (alerts, silences)
http://<monitoring-node>:3100/ready     Loki readiness check
http://<any-node>:9100/metrics          node_exporter metrics endpoint
http://<any-node>:9134/metrics          zfs_exporter metrics endpoint

All config paths

/etc/prometheus/prometheus.yml               Prometheus main config
/etc/prometheus/alerts.yml                   Alert rules
/etc/prometheus/recording_rules.yml          Recording rules
/etc/prometheus/targets/nodes/*.yml          File-based service discovery
/etc/alertmanager/alertmanager.yml           Alertmanager config
/etc/loki/loki.yml                           Loki server config
/etc/promtail/promtail.yml                   Promtail log shipper config
/etc/grafana/provisioning/datasources/       Grafana data source YAML
/etc/grafana/provisioning/dashboards/        Grafana dashboard provider YAML
/var/lib/grafana/dashboards/                 Dashboard JSON files
/var/lib/node_exporter/textfile/             Custom metrics (.prom files)
/var/lib/prometheus/                         Prometheus TSDB (on ZFS)
/var/lib/loki/                               Loki chunks and index (on ZFS)
/var/lib/alertmanager/                       Alertmanager state

All systemctl commands

# Status check — run on the monitoring node
systemctl status prometheus grafana-server alertmanager loki

# Status check — run on every node
systemctl status node_exporter zfs_exporter wireguard_exporter promtail

# Restart the full stack
systemctl restart prometheus grafana-server alertmanager loki

# Hot-reload Prometheus config (no restart, no data loss)
curl -X POST http://localhost:9090/-/reload

# Hot-reload Alertmanager config
curl -X POST http://localhost:9093/-/reload

# Check Prometheus config syntax before applying
promtool check config /etc/prometheus/prometheus.yml

# Check alert rules syntax
promtool check rules /etc/prometheus/alerts.yml

# Validate Alertmanager config
amtool check-config /etc/alertmanager/alertmanager.yml

# View Prometheus targets from CLI
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool

# View firing alerts from CLI
amtool alert query --alertmanager.url=http://localhost:9093

# Unit-test alert rules against the synthetic series defined in the test file
promtool test rules /etc/prometheus/tests/alert_tests.yml

ZFS datasets for the monitoring stack

# Create all ZFS datasets for the monitoring stack
zfs create -o mountpoint=/var/lib/prometheus -o compression=zstd -o recordsize=128k -o atime=off rpool/prometheus
zfs create -o mountpoint=/var/lib/loki       -o compression=zstd -o recordsize=128k -o atime=off rpool/loki
zfs create -o mountpoint=/var/lib/grafana    -o compression=zstd -o recordsize=16k  -o atime=off rpool/grafana
zfs create -o mountpoint=/var/lib/alertmanager -o compression=zstd -o atime=off rpool/alertmanager

# Snapshot the monitoring stack daily
# Add to sanoid.conf:
# [rpool/prometheus]
#   use_template = monitoring
#   autosnap = yes
# [rpool/loki]
#   use_template = monitoring
#   autosnap = yes
# [template_monitoring]
#   daily = 7
#   weekly = 4
#   monthly = 3

Deploy all exporters to a new node — single script

#!/bin/bash
# deploy-exporters.sh — install all exporters + promtail on a kldload node
# Usage: bash deploy-exporters.sh [LOKI_HOST]   (defaults to 10.78.0.1)
set -euo pipefail

LOKI_HOST="${1:-10.78.0.1}"

echo "=== Installing node_exporter ==="
curl -sLO https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xzf node_exporter-1.8.2.linux-amd64.tar.gz
cp node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/
useradd --no-create-home --shell /sbin/nologin --system node_exporter 2>/dev/null || true
mkdir -p /var/lib/node_exporter/textfile
chown node_exporter:node_exporter /var/lib/node_exporter/textfile

echo "=== Installing zfs_exporter ==="
curl -sLO https://github.com/pdf/zfs_exporter/releases/download/v2.3.5/zfs_exporter-2.3.5.linux-amd64.tar.gz
tar xzf zfs_exporter-2.3.5.linux-amd64.tar.gz
cp zfs_exporter-2.3.5.linux-amd64/zfs_exporter /usr/local/bin/

echo "=== Installing wireguard_exporter ==="
curl -sLO https://github.com/MindFlavor/prometheus_wireguard_exporter/releases/download/3.6.6/prometheus_wireguard_exporter-3.6.6-x86_64-unknown-linux-musl.tar.gz
tar xzf prometheus_wireguard_exporter-3.6.6-x86_64-unknown-linux-musl.tar.gz
cp prometheus_wireguard_exporter /usr/local/bin/

echo "=== Installing promtail ==="
curl -sLO https://github.com/grafana/loki/releases/download/v3.1.1/promtail-linux-amd64.zip
unzip -o promtail-linux-amd64.zip
mv promtail-linux-amd64 /usr/local/bin/promtail
chmod 755 /usr/local/bin/promtail
mkdir -p /var/lib/promtail /etc/promtail

# Generate promtail config pointing to the Loki host
cat > /etc/promtail/promtail.yml << PROMEOF
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /var/lib/promtail/positions.yml
clients:
  - url: http://${LOKI_HOST}:3100/loki/api/v1/push
scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
      - source_labels: ['__journal__hostname']
        target_label: 'hostname'
PROMEOF

echo "=== Creating systemd units ==="
# [unit files created here — same as shown in individual sections above]

echo "=== Starting services ==="
systemctl daemon-reload
systemctl enable --now node_exporter zfs_exporter wireguard_exporter promtail

echo "=== Verifying ==="
for port in 9100 9134 9586 9080; do
  if curl -sf "http://localhost:${port}/metrics" > /dev/null 2>&1 || \
     curl -sf "http://localhost:${port}/ready" > /dev/null 2>&1; then
    echo "  Port ${port}: OK"
  else
    echo "  Port ${port}: FAILED"
  fi
done

echo "=== Done. Add this node to Prometheus targets. ==="
rm -f *.tar.gz *.zip

This entire stack — Prometheus, Grafana, Alertmanager, Loki, five exporters, recording rules, 50+ alert rules, SLO burn rate alerting, multi-host federation, ZFS-optimized storage, WireGuard-secured transport — runs on a single kldload box with 4 cores and 16GB RAM. Total RAM usage under load: Prometheus ~2GB, Grafana ~200MB, Loki ~500MB, Alertmanager ~50MB, exporters ~80MB total, Promtail ~40MB. Under 3GB for the entire observability stack. Datadog wants $23/host/month for less.
