
Grafana, Prometheus & Alerting on ZFS — observe everything, lose nothing.

Monitoring data is the most undervalued data on a server. You only realize you needed last month's CPU graph after the outage. Prometheus stores metrics as time series. Grafana renders them into dashboards. Alertmanager wakes you up when something goes wrong. On kldloadOS, all of this sits on ZFS datasets — snapshottable, compressible, replicable. Your monitoring history is as durable as your production data.

kst is the terminal view. Grafana is the browser view. Same infrastructure, same metrics, different interface. Use kst when you're SSH'd in at 2 AM. Use Grafana when you're reviewing trends over coffee. They complement each other — they don't compete.

1. ZFS datasets for the monitoring stack

Every component gets its own dataset. Separate datasets mean separate recordsizes, separate snapshot policies, and separate compression ratios. Prometheus TSDB writes are append-heavy with periodic compactions — 128k recordsize and zstd compression are ideal. Grafana's SQLite database is tiny but critical — snapshot it frequently so you never lose a dashboard.

Create dedicated datasets

# Prometheus TSDB — append-heavy, compresses well
zfs create -o recordsize=128k -o compression=zstd -o atime=off \
    -o mountpoint=/srv/prometheus rpool/srv/prometheus

# Grafana config and SQLite DB — small but precious
zfs create -o recordsize=32k -o compression=zstd -o atime=off \
    -o mountpoint=/srv/grafana rpool/srv/grafana

# Alertmanager data — silence history, notification log
zfs create -o recordsize=32k -o compression=zstd -o atime=off \
    -o mountpoint=/srv/alertmanager rpool/srv/alertmanager

# Check them
kdf
A filing cabinet with one big drawer is a junk drawer. Separate drawers for invoices, contracts, and receipts means you can lock, copy, or move each one independently. Datasets are drawers.

Sanoid snapshot policy for monitoring data

# /etc/sanoid/sanoid.conf — monitoring datasets

[rpool/srv/prometheus]
    use_template = monitoring
    recursive = no

[rpool/srv/grafana]
    use_template = monitoring-critical
    recursive = no

[rpool/srv/alertmanager]
    use_template = monitoring
    recursive = no

[template_monitoring]
    hourly = 48
    daily = 30
    monthly = 6
    yearly = 0
    autosnap = yes
    autoprune = yes

[template_monitoring-critical]
    # Grafana dashboards are hand-crafted — keep more history
    hourly = 72
    daily = 90
    monthly = 12
    yearly = 1
    autosnap = yes
    autoprune = yes
You back up your database every night. Why wouldn't you back up the dashboards that tell you whether the database is healthy? Grafana configs are infrastructure too.
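Those snapshots are only useful if restoring from them is routine. A minimal restore sketch, assuming the Grafana container name from the compose file below; the helper name and the snapshot name in the usage line are illustrative (list real ones with zfs list -t snapshot rpool/srv/grafana):

```shell
#!/bin/bash
# restore_grafana — roll /srv/grafana back to a snapshot (hypothetical helper)
restore_grafana() {
    local snap="$1"
    docker stop grafana        # quiesce the SQLite DB before rolling back
    zfs rollback -r "$snap"    # -r also destroys any snapshots newer than $snap
    docker start grafana
}

# usage: restore_grafana rpool/srv/grafana@autosnap_2025-01-01_00:00:02_daily
```

Stopping the container first matters: rolling back under a live SQLite database risks a torn write on restart.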

2. Docker deployment — the full stack

Grafana, Prometheus, and Alertmanager all run as containers. Each container's data volume is a ZFS dataset. The containers are stateless and disposable. The data underneath is durable and versioned. Blow away a container, start a new one, point it at the same dataset — nothing lost.

docker-compose.yml for the monitoring stack

# /srv/monitoring/docker-compose.yml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - /srv/prometheus:/prometheus
      - /srv/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - /srv/monitoring/alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=90d"
      - "--storage.tsdb.retention.size=50GB"
      - "--web.enable-lifecycle"

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - /srv/grafana:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=http://localhost:3000

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - /srv/alertmanager:/alertmanager
      - /srv/monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
      - /srv/monitoring/zfs-collector:/var/lib/node_exporter/textfile:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
      - "--collector.textfile.directory=/var/lib/node_exporter/textfile"
# Launch it
cd /srv/monitoring && docker compose up -d

# Snapshot the entire stack before upgrading images
ksnap /srv/prometheus && ksnap /srv/grafana && ksnap /srv/alertmanager
docker compose pull && docker compose up -d
Containers are like tents — easy to set up, easy to tear down. ZFS datasets are the land underneath. The tent blows away in the wind. The land stays.
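After an upgrade, confirm every component answers before walking away. A smoke-test sketch, assuming the ports from the compose file above; Prometheus and Alertmanager expose a /-/healthy endpoint and Grafana exposes /api/health:

```shell
#!/bin/bash
# check_stack — hit each component's health endpoint; non-zero exit on any failure
check_stack() {
    local pair name url rc=0
    for pair in "prometheus http://localhost:9090/-/healthy" \
                "alertmanager http://localhost:9093/-/healthy" \
                "grafana http://localhost:3000/api/health"; do
        read -r name url <<<"$pair"
        if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
            echo "$name: ok"
        else
            echo "$name: FAILED"
            rc=1
        fi
    done
    return $rc
}
```

Run check_stack right after docker compose up -d; because it returns non-zero on failure, it also slots into deploy scripts.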

3. Prometheus configuration — scraping everything

Prometheus pulls metrics from exporters. Node exporter gives you CPU, RAM, disk, and network. The ZFS custom collector (below) gives you ARC stats, pool health, scrub status, and snapshot ages. If you run WireGuard, you get peer status, bandwidth, and handshake ages too.

prometheus.yml

# /srv/monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert-rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  # This node
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]
        labels:
          instance: "kldload-01"

  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Fleet nodes over WireGuard
  - job_name: "fleet"
    static_configs:
      - targets:
          - "10.10.0.2:9100"   # node-02
          - "10.10.0.3:9100"   # node-03
          - "10.10.0.4:9100"   # node-04
        labels:
          env: "production"

  # WireGuard exporter (if running)
  - job_name: "wireguard"
    static_configs:
      - targets: ["localhost:9586"]
    scrape_interval: 30s
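Once Prometheus is scraping, verify it from the real /api/v1/targets API rather than trusting the config. A quick sketch that needs only curl and grep (the helper name is ours, not a Prometheus tool):

```shell
#!/bin/bash
# count_up_targets — how many scrape targets currently report health "up"
count_up_targets() {
    curl -s http://localhost:9090/api/v1/targets \
        | grep -o '"health":"up"' \
        | wc -l
}
```

If the count is lower than the number of targets in prometheus.yml, open http://localhost:9090/targets in a browser to see which ones are down and why.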

4. ZFS custom collector for node_exporter

Node exporter's textfile collector reads .prom files from a directory. A cron job writes fresh ZFS metrics every minute. This gives Prometheus everything ZFS-specific: ARC hit rate, pool capacity, scrub age, snapshot counts, and dataset compression ratios. No special exporter binary needed — just a shell script and the textfile collector you already have.

ZFS metrics collector script

#!/bin/bash
# /usr/local/bin/zfs-prom-collector — write ZFS metrics for node_exporter

OUT="/srv/monitoring/zfs-collector/zfs.prom"
TMP="${OUT}.tmp"

{
    # ARC stats
    if [ -f /proc/spl/kstat/zfs/arcstats ]; then
        awk '
            /^size /   { printf "zfs_arc_size_bytes %s\n", $3 }
            /^c_max /  { printf "zfs_arc_max_bytes %s\n", $3 }
            /^hits /   { h=$3 }
            /^misses / { m=$3 }
            END {
                if (h+m > 0) printf "zfs_arc_hit_ratio %.4f\n", h/(h+m)
                printf "zfs_arc_hits_total %s\n", h
                printf "zfs_arc_misses_total %s\n", m
            }
        ' /proc/spl/kstat/zfs/arcstats
    fi

    # Pool health: 0=ONLINE, 1=DEGRADED, 2=FAULTED, 3=UNAVAIL
    zpool list -H -o name,health 2>/dev/null | while IFS=$'\t' read -r pool health; do
        case "$health" in
            ONLINE)   val=0 ;;
            DEGRADED) val=1 ;;
            FAULTED)  val=2 ;;
            *)        val=3 ;;
        esac
        echo "zfs_pool_health{pool=\"${pool}\"} ${val}"
    done

    # Pool capacity
    zpool list -Hp -o name,size,alloc,free,frag,cap 2>/dev/null | \
        while IFS=$'\t' read -r pool size alloc free frag cap; do
            echo "zfs_pool_size_bytes{pool=\"${pool}\"} ${size}"
            echo "zfs_pool_allocated_bytes{pool=\"${pool}\"} ${alloc}"
            echo "zfs_pool_free_bytes{pool=\"${pool}\"} ${free}"
            echo "zfs_pool_fragmentation_percent{pool=\"${pool}\"} ${frag}"
            echo "zfs_pool_capacity_percent{pool=\"${pool}\"} ${cap}"
        done

    # Scrub age in seconds
    zpool list -H -o name 2>/dev/null | while read -r pool; do
        last_scrub=$(zpool status "$pool" 2>/dev/null | \
            grep 'scan: scrub repaired' | \
            grep -oP '\w+ \w+ +\d+ \d+:\d+:\d+ \d+' | head -1)
        if [ -n "$last_scrub" ]; then
            scrub_epoch=$(date -d "$last_scrub" +%s 2>/dev/null)
            if [ -n "$scrub_epoch" ]; then
                age=$(( $(date +%s) - scrub_epoch ))
                echo "zfs_scrub_age_seconds{pool=\"${pool}\"} ${age}"
            fi
        fi
    done

    # Snapshot count per dataset
    zfs list -H -t snapshot -o name 2>/dev/null | \
        awk -F@ '{count[$1]++} END {for(ds in count) printf "zfs_snapshot_count{dataset=\"%s\"} %d\n", ds, count[ds]}'

    # Oldest snapshot age per dataset (seconds)
    zfs list -H -t snapshot -o name,creation -s creation 2>/dev/null | \
        awk -F@ 'NR>0 {
            split($2, parts, "\t")
            ds=$1
            if (!(ds in seen)) {
                seen[ds]=1
                print ds "\t" parts[2]
            }
        }' | while IFS=$'\t' read -r ds creation; do
            if [ -n "$creation" ]; then
                snap_epoch=$(date -d "$creation" +%s 2>/dev/null)
                if [ -n "$snap_epoch" ]; then
                    age=$(( $(date +%s) - snap_epoch ))
                    echo "zfs_oldest_snapshot_age_seconds{dataset=\"${ds}\"} ${age}"
                fi
            fi
        done

} > "$TMP"
mv "$TMP" "$OUT"
# Install it
chmod +x /usr/local/bin/zfs-prom-collector
mkdir -p /srv/monitoring/zfs-collector

# Run every minute via cron
cat > /etc/cron.d/zfs-prom-collector <<'EOF'
* * * * * root /usr/local/bin/zfs-prom-collector
EOF

# Test it
/usr/local/bin/zfs-prom-collector && cat /srv/monitoring/zfs-collector/zfs.prom
Node exporter is a reporter who reads whatever documents you leave on the desk. The ZFS collector writes a fresh status report every 60 seconds. The reporter picks it up and hands it to Prometheus. No custom software. Just files.
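Before trusting the pipeline, it helps to lint the .prom file itself: node_exporter silently skips malformed lines. A small validator sketch (the helper name is ours); it relies on the fact that none of the collector's labels contain spaces, so every valid sample line has exactly two fields:

```shell
#!/bin/bash
# check_prom_file — every non-comment, non-blank line must be "metric value"
# (labels like {pool="rpool"} contain no spaces, so valid lines have 2 fields)
check_prom_file() {
    awk '!/^#/ && NF > 0 && NF != 2 { bad++; print "bad line " NR ": " $0 }
         END { exit bad > 0 }' "$1"
}

# e.g.: check_prom_file /srv/monitoring/zfs-collector/zfs.prom && echo "format OK"
```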

5. WireGuard monitoring

If your fleet talks over WireGuard, you need to know when a peer drops off, when a handshake goes stale, and how much bandwidth each tunnel is consuming. The prometheus-wireguard-exporter reads wg show output and exposes it as Prometheus metrics.

WireGuard exporter

# Run the WireGuard exporter as a container
docker run -d --name wg-exporter \
    --restart unless-stopped \
    --cap-add NET_ADMIN \
    --network host \
    -v /run/wireguard:/run/wireguard:ro \
    mindflavor/prometheus-wireguard-exporter:latest \
    -p 9586

# Metrics available at http://localhost:9586/metrics
# Key metrics:
#   wireguard_sent_bytes_total{peer="..."}        — bytes sent per peer
#   wireguard_received_bytes_total{peer="..."}    — bytes received per peer
#   wireguard_latest_handshake_seconds{peer="..."}  — last handshake timestamp
A VPN tunnel looks fine until someone tries to use it. Handshake age tells you if the tunnel is alive. If the last handshake was 5 minutes ago, the tunnel is up. If it was 3 hours ago, something is wrong.
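You can check handshake age on the box itself, no exporter required. A sketch around wg show, which prints one pubkey and epoch pair per peer (the helper name is ours; an epoch of 0 means the peer has never completed a handshake):

```shell
#!/bin/bash
# wg_handshake_ages — seconds since each peer's last handshake, straight from wg(8)
wg_handshake_ages() {
    local iface="${1:-wg0}" now peer epoch
    now=$(date +%s)
    wg show "$iface" latest-handshakes | while IFS=$'\t' read -r peer epoch; do
        if [ "$epoch" -eq 0 ]; then
            echo "$peer: never connected"
        else
            echo "$peer: $(( now - epoch ))s ago"
        fi
    done
}
```

Handy over SSH when Grafana is unreachable precisely because the tunnel is down.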

6. Alert rules — know before your users do

Alerts are not optional. They are the difference between "we caught it at 3 AM" and "the CEO called at 9 AM." These rules cover pool health, ARC efficiency, scrub schedules, snapshot freshness, disk I/O latency, and WireGuard peer status.

alert-rules.yml

# /srv/monitoring/alert-rules.yml
groups:
  - name: zfs
    rules:
      - alert: ZfsPoolDegraded
        expr: zfs_pool_health != 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool {{ $labels.pool }} is not ONLINE"
          description: "Pool {{ $labels.pool }} health value is {{ $value }} (1=DEGRADED, 2=FAULTED, 3=UNAVAIL). Check zpool status immediately."

      - alert: ZfsArcHitRateLow
        expr: zfs_arc_hit_ratio < 0.80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ZFS ARC hit rate below 80%"
          description: "ARC hit ratio is {{ $value | humanizePercentage }}. Consider increasing zfs_arc_max or adding L2ARC."

      - alert: ZfsScrubOverdue
        expr: zfs_scrub_age_seconds > 2592000
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "ZFS scrub overdue on pool {{ $labels.pool }}"
          description: "Last scrub was {{ $value | humanizeDuration }} ago. Run: zpool scrub {{ $labels.pool }}"

      - alert: ZfsSnapshotTooOld
        expr: zfs_oldest_snapshot_age_seconds > 7776000
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Oldest snapshot on {{ $labels.dataset }} is over 90 days old"
          description: "Consider reviewing snapshot retention. Old snapshots pin space that is not reclaimed until they are destroyed."

      - alert: ZfsPoolCapacityHigh
        expr: zfs_pool_capacity_percent > 80
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "ZFS pool {{ $labels.pool }} over 80% capacity"
          description: "Pool {{ $labels.pool }} is at {{ $value }}%. ZFS performance degrades above 80%. Free space or expand the pool."

      - alert: ZfsPoolCapacityCritical
        expr: zfs_pool_capacity_percent > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool {{ $labels.pool }} over 90% — act now"
          description: "Pool {{ $labels.pool }} is at {{ $value }}%. Delete snapshots, remove data, or add vdevs."

  - name: wireguard
    rules:
      - alert: WireguardPeerDown
        expr: time() - wireguard_latest_handshake_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WireGuard peer {{ $labels.peer }} last handshake over 5 minutes ago"
          description: "Peer may be offline or unreachable. Check network connectivity."

  - name: system
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 85% on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"
          description: "Note: ZFS ARC uses RAM intentionally. Check arc_summary before panicking."

      - alert: HighDiskIoLatency
        expr: rate(node_disk_io_time_weighted_seconds_total[5m]) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O latency on {{ $labels.instance }}"
Smoke detectors don't put out fires. But they wake you up before the house burns down. Alert rules are smoke detectors for infrastructure.
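A syntax error in alert-rules.yml can take Prometheus down on restart, so lint rules before loading them. A sketch using the promtool binary shipped inside the prom/prometheus image, followed by a hot reload (the /-/reload endpoint works because the compose file passes --web.enable-lifecycle):

```shell
#!/bin/bash
# validate_and_reload — lint alert rules with promtool, then hot-reload Prometheus
validate_and_reload() {
    docker run --rm -v /srv/monitoring:/work:ro --entrypoint promtool \
        prom/prometheus:latest check rules /work/alert-rules.yml \
    && curl -fsS -X POST http://localhost:9090/-/reload \
    && echo "rules reloaded"
}
```

If promtool rejects the file, the reload never fires and the running config stays intact.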

alertmanager.yml

# /srv/monitoring/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: "default"
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - receiver: "critical"
      match:
        severity: critical
      repeat_interval: 1h

receivers:
  - name: "default"
    webhook_configs:
      - url: "http://localhost:8065/hooks/your-webhook-id"
        # Replace with your webhook — Mattermost, Slack, Gotify, ntfy, whatever

  - name: "critical"
    webhook_configs:
      - url: "http://localhost:8065/hooks/your-webhook-id"
    # Add email, PagerDuty, or SMS for critical alerts
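A webhook is easy to sleep through. A sketch of adding an email channel to the critical receiver using Alertmanager's standard email_configs; the SMTP host, addresses, and password are placeholders you must replace:

```yaml
# Hypothetical email channel for the "critical" receiver — all values are placeholders
receivers:
  - name: "critical"
    webhook_configs:
      - url: "http://localhost:8065/hooks/your-webhook-id"
    email_configs:
      - to: "oncall@example.com"
        from: "alertmanager@example.com"
        smarthost: "smtp.example.com:587"
        auth_username: "alertmanager@example.com"
        auth_password: "app-password-here"
```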

7. Pre-built Grafana dashboard

Import this JSON into Grafana and you get a single-pane view of your kldloadOS node: pool health, ARC statistics, disk I/O, CPU/RAM, and service status. Every panel uses the metrics from node_exporter and the ZFS custom collector above.

kldloadOS dashboard JSON (import via Grafana UI)

{
  "dashboard": {
    "title": "kldloadOS — Node Overview",
    "tags": ["kldload", "zfs"],
    "timezone": "browser",
    "panels": [
      {
        "title": "ZFS Pool Health",
        "type": "stat",
        "targets": [{"expr": "zfs_pool_health"}],
        "fieldConfig": {
          "defaults": {
            "mappings": [
              {"type": "value", "options": {"0": {"text": "ONLINE", "color": "green"}}},
              {"type": "value", "options": {"1": {"text": "DEGRADED", "color": "orange"}}},
              {"type": "value", "options": {"2": {"text": "FAULTED", "color": "red"}}}
            ]
          }
        },
        "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0}
      },
      {
        "title": "ARC Hit Rate",
        "type": "gauge",
        "targets": [{"expr": "zfs_arc_hit_ratio * 100"}],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0, "max": 100,
            "thresholds": {
              "steps": [
                {"color": "red", "value": 0},
                {"color": "orange", "value": 70},
                {"color": "yellow", "value": 80},
                {"color": "green", "value": 90}
              ]
            }
          }
        },
        "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0}
      },
      {
        "title": "ARC Size",
        "type": "timeseries",
        "targets": [
          {"expr": "zfs_arc_size_bytes", "legendFormat": "ARC Current"},
          {"expr": "zfs_arc_max_bytes", "legendFormat": "ARC Max"}
        ],
        "fieldConfig": {"defaults": {"unit": "bytes"}},
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
      },
      {
        "title": "Pool Capacity",
        "type": "bargauge",
        "targets": [{"expr": "zfs_pool_capacity_percent", "legendFormat": "{{ pool }}"}],
        "fieldConfig": {
          "defaults": {
            "unit": "percent", "min": 0, "max": 100,
            "thresholds": {
              "steps": [
                {"color": "green", "value": 0},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 85}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 0, "y": 4}
      },
      {
        "title": "Scrub Age (days)",
        "type": "stat",
        "targets": [{"expr": "zfs_scrub_age_seconds / 86400", "legendFormat": "{{ pool }}"}],
        "fieldConfig": {
          "defaults": {
            "unit": "d",
            "thresholds": {
              "steps": [
                {"color": "green", "value": 0},
                {"color": "yellow", "value": 14},
                {"color": "red", "value": 30}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 6, "y": 4}
      },
      {
        "title": "CPU Usage",
        "type": "timeseries",
        "targets": [
          {"expr": "100 - (avg(rate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)", "legendFormat": "CPU %"}
        ],
        "fieldConfig": {"defaults": {"unit": "percent", "min": 0, "max": 100}},
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 12}
      },
      {
        "title": "Memory Usage",
        "type": "timeseries",
        "targets": [
          {"expr": "node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes", "legendFormat": "Used"},
          {"expr": "node_memory_MemTotal_bytes", "legendFormat": "Total"}
        ],
        "fieldConfig": {"defaults": {"unit": "bytes"}},
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 12}
      },
      {
        "title": "Disk I/O",
        "type": "timeseries",
        "targets": [
          {"expr": "rate(node_disk_read_bytes_total[5m])", "legendFormat": "Read {{ device }}"},
          {"expr": "rate(node_disk_written_bytes_total[5m])", "legendFormat": "Write {{ device }}"}
        ],
        "fieldConfig": {"defaults": {"unit": "Bps"}},
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 20}
      },
      {
        "title": "Snapshot Count per Dataset",
        "type": "bargauge",
        "targets": [{"expr": "zfs_snapshot_count", "legendFormat": "{{ dataset }}"}],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 20}
      }
    ],
    "time": {"from": "now-24h", "to": "now"},
    "refresh": "30s"
  }
}
# Import via Grafana API
curl -s -X POST http://localhost:3000/api/dashboards/db \
    -H "Content-Type: application/json" \
    -u admin:changeme \
    -d @/srv/monitoring/kldload-dashboard.json

8. Federation — central Grafana, fleet nodes

One Grafana instance. Many Prometheus instances. Each node in the fleet runs its own Prometheus and node_exporter behind WireGuard. The central Prometheus federates from all of them, or you add each node's Prometheus as a Grafana datasource directly. WireGuard makes this secure without certificates or VPN appliances.

Option A: Prometheus federation

Central Prometheus scrapes /federate from each node's Prometheus. One TSDB with all fleet data. Good for up to ~20 nodes.

# In central prometheus.yml
- job_name: "federate-fleet"
  honor_labels: true
  metrics_path: "/federate"
  params:
    'match[]':
      - '{job="node"}'
      - '{__name__=~"zfs_.*"}'
  static_configs:
    - targets:
        - "10.10.0.2:9090"
        - "10.10.0.3:9090"
        - "10.10.0.4:9090"
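Before wiring all nodes into the central config, spot-check one node's /federate endpoint by hand. A sketch from the hub (the helper name is ours; the match[] selector mirrors the federation config above):

```shell
#!/bin/bash
# federate_check — print the first few federated series from a fleet node
federate_check() {
    curl -sG "http://${1}:9090/federate" \
        --data-urlencode 'match[]={__name__=~"zfs_.*"}' | head -n 5
}

# e.g.: federate_check 10.10.0.2
```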

Option B: Multiple Grafana datasources

Each node's Prometheus is a separate Grafana datasource. No central TSDB needed. Dashboards use datasource variables to switch between nodes.

# Add each node as a datasource in Grafana
# Settings > Data Sources > Add data source
# Type: Prometheus
# URL: http://10.10.0.2:9090
# Name: node-02

# In your dashboard, add a variable:
# Name: datasource
# Type: Datasource
# Filter: Prometheus
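Clicking through the UI for every node gets old. Grafana can also provision datasources from YAML at startup; a sketch, assuming you mount a file like this into the grafana container under /etc/grafana/provisioning/datasources/ (filename and node list are illustrative):

```yaml
# fleet-datasources.yaml — hypothetical provisioning file
apiVersion: 1
datasources:
  - name: node-02
    type: prometheus
    access: proxy
    url: http://10.10.0.2:9090
  - name: node-03
    type: prometheus
    access: proxy
    url: http://10.10.0.3:9090
```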

9. kst vs Grafana — terminal and browser, same data

kst is the terminal view. Grafana is the browser view. They are not competitors. Use kst when you're on the box over SSH, diagnosing a live issue. Use Grafana when you're reviewing trends, building dashboards, or showing a colleague what happened yesterday. Same node_exporter. Same ZFS metrics. Different rendering engine.

kst — the terminal view

kst reads system state directly and renders it in the terminal. No browser needed. Pool health, ARC hit rate, service status, disk I/O — all in a single glance. Perfect for quick checks, automation scripts, and headless servers.

# Quick health check via SSH
ssh node-04 kst

# Watch live stats (refreshes every 2s)
watch -n 2 kst

Grafana — the browser view

Grafana reads from Prometheus and renders time-series graphs. Historical trends, drill-downs, annotations, alerting visualizations. Perfect for post-mortems, capacity planning, and showing stakeholders what the infrastructure looks like over time.

# Open in browser
http://kldload-01:3000

# Or over WireGuard
http://10.10.0.1:3000

Monitoring data belongs on ZFS. Prometheus metrics, Grafana dashboards, Alertmanager state — this data tells you the history of your infrastructure. Without it, every incident starts from zero. With ZFS underneath, your monitoring history is checksummed, compressed, snapshotted, and replicable to a backup node with a single syncoid command.
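That syncoid replication can be a one-liner per dataset. A sketch, assuming sanoid/syncoid is installed and SSH key auth to the backup node is in place; the target host and parent dataset names are illustrative:

```shell
#!/bin/bash
# replicate_monitoring — push the monitoring datasets to a backup node via syncoid
replicate_monitoring() {
    local target="$1"   # e.g. root@backup-01
    local ds
    for ds in prometheus grafana alertmanager; do
        syncoid "rpool/srv/${ds}" "${target}:backup/srv/${ds}"
    done
}

# e.g.: replicate_monitoring root@backup-01
```

syncoid sends incremental ZFS streams, so after the first run each replication only ships the blocks written since the last snapshot.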

You don't need a SaaS vendor to observe your own servers. You need Prometheus, Grafana, a shell script, and a filesystem that won't let your data rot. You already have all four.