
Observability Masterclass

This guide goes deep on observability for kldload systems — Prometheus, Grafana, node_exporter, eBPF tracing, ZFS metrics, Alertmanager, and the incident response workflow that ties it all together. If you have a kldload node running and want to understand what it is doing, why it is slow, or what broke at 3am, this is the guide for you.

What this page covers: the difference between monitoring and observability, the full kldload observability stack, Prometheus setup and PromQL, node_exporter and ZFS metrics, Grafana dashboards, eBPF drill-down with bcc and bpftrace, log aggregation, Alertmanager alerting rules, distributed tracing, fleet architecture, and a concrete incident response workflow.

Prerequisites: a running kldload node (any distro, any profile). The monitoring profile installs Prometheus + Grafana + Alertmanager automatically. On any other profile, node_exporter is pre-installed and ready to scrape.


1. Observability Is Not Monitoring

Monitoring and observability are not the same thing, and conflating them leads to dashboards that look impressive during demos but fail you during incidents.

Monitoring tells you something is broken. It checks predefined conditions — CPU > 90%, disk < 10% free, service not responding — and fires an alert when a threshold is crossed. Monitoring requires you to know what can go wrong before it goes wrong. You write the rules in advance. If the failure mode is novel, your monitoring misses it.

Observability tells you why. An observable system lets you ask arbitrary questions about its behavior after the fact, using the data you already collect, without needing to predict the failure mode in advance. The three pillars: metrics (what happened, in aggregate over time), logs (what happened, event by event), and traces (what happened, end-to-end across services). Together they let you reconstruct any system state, past or present.

The practical difference: monitoring tells you "disk latency is high." Observability lets you answer "which process, which file, on which dataset, since when, and why." Monitoring requires a predefined rule. Observability requires instrumented data and a query language.

Most "observability" tutorials teach you to install Grafana and make dashboards. That is monitoring. Real observability means you can debug a problem you have never seen before using the data you already collect. The tools on this page — Prometheus, Grafana, node_exporter, eBPF tracing, ZFS metrics — give you that capability. A dashboard shows you the what. eBPF shows you the why. Logs show you the when. You need all three to debug anything nontrivial. Build the full stack, not just the pretty graphs.

Metrics — the what

Numeric measurements aggregated over time. CPU usage, request rate, memory consumed, bytes written. Efficient to store (one float per sample), fast to query, excellent for trending and alerting. Bad at preserving detail — you know disk latency was high at 14:23, not which file caused it.

// Prometheus: pull-based time-series database
// node_exporter: ~1000 metrics per node, every 15 seconds

Logs — the when

Discrete events with timestamps. Application errors, authentication attempts, systemd unit state changes, kernel messages. Detailed but expensive to store and search at scale. Essential for reconstructing exactly what happened at a specific moment.

// journald: structured logs, every unit, since boot
// journalctl -u nginx --since "14:20" --until "14:30"

Traces — the why

End-to-end records of a single request as it travels through multiple services. Each span records the operation, duration, and outcome. Essential for microservices where a slow request might traverse 10 services before it times out.

// OpenTelemetry → Jaeger/Tempo
// Only needed when you have distributed services

eBPF — the microscope

Kernel-level instrumentation that answers questions no other tool can. Which syscall is slow? Which process is hammering the disk? What does the TCP retransmit pattern look like for this specific connection? eBPF gives you per-process, per-file, per-call granularity — live, in production.

// bcc/bpftrace: ad-hoc kernel tracing
// No agents. No restarts. No overhead at rest.

2. The kldload Observability Stack

kldload ships the foundation. You add the collection and visualization layer on top.

What ships with kldload (every profile except core)

  • node_exporter — systemd service, listening on port 9100, scrape-ready out of the box
  • bcc-tools — the BPF Compiler Collection: tcpconnect, biolatency, runqlat, execsnoop, and 70+ others
  • bpftrace — high-level tracing language for writing custom eBPF programs in one line
  • bpftool — inspect loaded eBPF programs and maps in the kernel

What you add

  • Prometheus — the scraper and time-series database; scrapes node_exporter on all your nodes
  • Grafana — visualization; connects to Prometheus as a data source; where your dashboards live
  • Alertmanager — alert routing; Prometheus sends alerts here; Alertmanager routes to PagerDuty, Slack, email, etc.

Architecture (text diagram)

kldload nodes (all profiles)
  └── node_exporter :9100
       └── exposes ~1000 metrics: CPU, memory, disk, ZFS, network, systemd

Prometheus server (monitoring profile, or dedicated node)
  ├── scrapes node_exporter on all nodes every 15s
  ├── stores time-series in local TSDB (15 days default)
  ├── evaluates alerting rules
  └── sends fired alerts → Alertmanager

Grafana (same node as Prometheus, or separate)
  ├── queries Prometheus via PromQL
  ├── renders dashboards in your browser
  └── can also query Loki (logs), Tempo (traces)

Alertmanager
  ├── receives alerts from Prometheus
  ├── deduplicates, groups, routes
  └── sends to: PagerDuty / Slack / email / webhook

Your browser → Grafana :3000

kldload's monitoring profile installs the full Prometheus + Grafana + Alertmanager stack automatically during install. On all other profiles (desktop, server, core-server), node_exporter is pre-installed and enabled so you can scrape it from a central Prometheus server without touching the target node. The design principle: every kldload node is observable from day one, even if you haven't decided where your Prometheus lives yet.

3. Prometheus — the Metrics Engine

Prometheus is a pull-based metrics scraper and time-series database. It periodically fetches metrics from HTTP endpoints (called targets), stores them with timestamps, and provides PromQL — a functional query language — for slicing, aggregating, and alerting on that data.

Install on kldload

# CentOS Stream 9 / RHEL / Rocky
dnf install -y prometheus2

# Debian 13 / Ubuntu 24.04
apt install -y prometheus

# Or install from upstream binary (latest version)
PROM_VER=2.51.2
curl -Lo /tmp/prometheus.tar.gz \
  https://github.com/prometheus/prometheus/releases/download/v${PROM_VER}/prometheus-${PROM_VER}.linux-amd64.tar.gz
tar -xzf /tmp/prometheus.tar.gz -C /opt/
mv /opt/prometheus-${PROM_VER}.linux-amd64 /opt/prometheus

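useradd -r -s /sbin/nologin prometheus
# The unit below reads /etc/prometheus/prometheus.yml and the config loads rules
# from /etc/prometheus/rules, so create both directories up front
mkdir -p /etc/prometheus/rules /var/lib/prometheus
chown prometheus:prometheus /var/lib/prometheus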

cat > /etc/systemd/system/prometheus.service <<'EOF'
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
ExecStart=/opt/prometheus/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --web.listen-address=0.0.0.0:9090
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now prometheus

Prometheus configuration — scraping a kldload fleet

# /etc/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - '/etc/prometheus/rules/*.yml'

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # kldload nodes — scrape node_exporter on each
  - job_name: 'kldload-nodes'
    scrape_interval: 15s
    static_configs:
      - targets:
          - 'node01.wg2:9100'   # WireGuard monitoring plane
          - 'node02.wg2:9100'
          - 'node03.wg2:9100'
          - 'node04.wg2:9100'
          - 'node05.wg2:9100'
        labels:
          env: 'production'
      - targets:
          - 'lab01.wg2:9100'
          - 'lab02.wg2:9100'
        labels:
          env: 'lab'

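  # ZFS custom metrics (textfile collector) need no separate job: node_exporter
  # serves them on the same :9100/metrics endpoint, so the 'kldload-nodes' job
  # above already collects them. A second job with the same targets would store
  # every series twice under a different job label.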
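
As the fleet grows, editing static_configs by hand gets tedious. One option is Prometheus file-based service discovery (file_sd_configs), which re-reads target files without a restart. The sketch below generates such a file from a plain host list. The function name gen_file_sd, the host-per-line input format, and the fixed port 9100 are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch: turn a plain host list into a Prometheus file_sd target file.
# Assumptions (not from the guide): hosts arrive one per line on stdin,
# and every host runs node_exporter on port 9100.

gen_file_sd() {
  local env="$1"   # value for the "env" label, e.g. production or lab
  awk -v env="$env" '
    NF == 0 || /^#/ { next }                 # skip blank lines and comments
    { hosts[n++] = $1 }
    END {
      printf "[\n  {\n    \"targets\": ["
      for (i = 0; i < n; i++)
        printf "%s\"%s:9100\"", (i ? ", " : ""), hosts[i]
      printf "],\n    \"labels\": {\"env\": \"%s\"}\n  }\n]\n", env
    }'
}

# Example: two hosts labelled as lab nodes.
printf 'lab01.wg2\nlab02.wg2\n' | gen_file_sd lab
```

To use the result, write it to a directory of your choosing (for example a hypothetical /etc/prometheus/targets/lab.json) and point a scrape job at it with a file_sd_configs entry whose files list matches that path.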

PromQL basics

rate() — per-second rate over a window

Converts a counter into a rate. Use this for any metric that only goes up (bytes received, requests total, errors total). Always wrap counters in rate() before using them.

rate(node_network_receive_bytes_total[5m]) # bytes/sec received, averaged over 5 minutes

increase() — total increase over a window

How much did a counter increase in the last N minutes? Useful for "requests in the last hour" or "errors in the last 15 minutes."

increase(node_disk_io_time_seconds_total[1h]) # seconds of disk I/O in the last hour

histogram_quantile() — percentile from histogram

Calculates the Nth percentile from a histogram metric. Essential for latency — P99 tells you the worst case for 99% of requests.

histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) # P99 request latency over 5 minutes

avg_over_time() — average of a gauge over time

Smooths out spikes in a gauge metric. Useful for "average CPU over the last hour" or "average ARC size over the last day."

avg_over_time(node_zfs_arc_size[1h]) # average ZFS ARC size over the last hour
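
To make rate() less abstract, here is the arithmetic on raw samples. This is a deliberately simplified sketch: it divides the difference between the first and last sample by the elapsed time, whereas the real rate() also compensates for counter resets and extrapolates to the window edges. The function name counter_rate and the "timestamp value" input format are assumptions for illustration.

```shell
#!/bin/bash
# Simplified rate(): (last - first) / elapsed seconds.
# Input on stdin: one "unix_timestamp counter_value" pair per line.
counter_rate() {
  awk '
    NR == 1 { t0 = $1; v0 = $2 }     # remember the first sample
    { t1 = $1; v1 = $2 }             # keep overwriting with the latest sample
    END {
      if (NR < 2 || t1 == t0) { print "NaN"; exit }
      printf "%.2f\n", (v1 - v0) / (t1 - t0)
    }'
}

# Three scrapes 15 s apart of a byte counter: 90000 bytes in 30 s.
printf '1000 500000\n1015 530000\n1030 590000\n' | counter_rate   # prints 3000.00
```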

10 essential PromQL queries for kldload servers

# 1. CPU utilization % per node (all modes except idle)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# 2. Available memory bytes
node_memory_MemAvailable_bytes

# 3. Disk space used % per filesystem
(1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
     / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100

# 4. Disk I/O utilization % (time busy)
rate(node_disk_io_time_seconds_total[5m]) * 100

# 5. Network throughput in bytes/sec (receive)
rate(node_network_receive_bytes_total{device!~"lo|wg.*"}[5m])

# 6. ZFS ARC size in bytes
node_zfs_arc_size

# 7. ZFS ARC hit ratio (cache effectiveness)
rate(node_zfs_arc_hits[5m])
  / (rate(node_zfs_arc_hits[5m]) + rate(node_zfs_arc_misses[5m]))

# 8. Failed systemd units (needs node_exporter's systemd collector, which upstream disables by default)
node_systemd_unit_state{state="failed"} == 1

# 9. Node load average (1-minute)
node_load1

# 10. Disk read latency (ms) — average across all devices
rate(node_disk_read_time_seconds_total[5m])
  / rate(node_disk_reads_completed_total[5m]) * 1000

Prometheus is pull-based: it scrapes targets on a schedule; they do not push to it. This design has real operational benefits. Monitoring keeps working when targets are partially down (a failed scrape is recorded as up == 0 for that target, which is itself something you can alert on), metrics are never lost to a flooded push pipeline, and adding a new target is one line in the scrape config. The pull model also makes local testing trivial: run node_exporter on your laptop, add it to the config, reload, and Prometheus scrapes it on the next interval. This is a large part of why Prometheus became the default choice for metrics. Push-based systems (StatsD, InfluxDB push mode) feel simpler until you have to debug why metrics stopped arriving.

4. node_exporter — System Metrics

node_exporter is the Prometheus exporter for hardware and OS metrics. It runs as a systemd service, exposes an HTTP endpoint at :9100/metrics, and provides approximately 1000 metrics per node covering every major subsystem.

node_exporter is pre-installed on kldload

# Verify it is running
systemctl status node_exporter

# View all metrics it exposes (pipe to less)
curl -s http://localhost:9100/metrics | less

# Count the number of distinct metric families
curl -s http://localhost:9100/metrics | grep '^# HELP' | wc -l
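
A raw count of metric families says little about where they come from. The sketch below groups families by their first two name components (node_cpu, node_zfs, and so on) using the "# TYPE" lines of the exposition format. The function name metric_families is hypothetical, and the sample input stands in for a live endpoint so the block runs anywhere.

```shell
#!/bin/bash
# Count metric families per subsystem prefix from Prometheus exposition text.
metric_families() {
  awk '$1 == "#" && $2 == "TYPE" {
         split($3, p, "_")              # node_zfs_arc_size -> node, zfs, ...
         count[p[1] "_" p[2]]++
       }
       END { for (k in count) print count[k], k }' |
    sort -k1,1nr -k2,2                  # most families first, then by name
}

# Sample input; on a real node use:
#   curl -s http://localhost:9100/metrics | metric_families
metric_families <<'SAMPLE'
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1.5
# TYPE node_zfs_arc_size untyped
# TYPE node_zfs_arc_hits untyped
SAMPLE
```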

Key metric categories

| Category | Prefix | What it covers |
|---|---|---|
| CPU | node_cpu_seconds_total | Per-core time in user, system, iowait, idle, steal, nice, irq modes |
| Memory | node_memory_* | MemTotal, MemFree, MemAvailable, Buffers, Cached, SwapTotal, SwapFree, HugePages |
| Disk | node_disk_* | Reads/writes completed, bytes, I/O time, queue depth per device |
| Filesystem | node_filesystem_* | Available, free, size, files (inodes) per mountpoint and fstype |
| Network | node_network_* | Bytes/packets in/out, errors, drops, carrier changes per interface |
| ZFS | node_zfs_* | ARC size, hits, misses, pool state, dataset stats |
| systemd | node_systemd_unit_state | State (active/inactive/failed/activating) for every systemd unit |
| System | node_load*, node_boot_time_seconds | Load averages (1/5/15 min), boot time (for uptime) |

ZFS-specific metrics

# ARC size (how much RAM ZFS is using for cache)
node_zfs_arc_size

# ARC maximum (zfs_arc_max tunable)
node_zfs_arc_c_max

# ARC hits and misses (for hit ratio calculation)
node_zfs_arc_hits
node_zfs_arc_misses

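# Pool health: one series per possible state (online, degraded, faulted, offline,
# unavail, removed, suspended); the series for the current state has value 1
node_zfs_zpool_state{zpool="rpool",state="online"}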

# Dataset used bytes (requires textfile collector — see section 6)
zfs_dataset_used_bytes{dataset="rpool/data"}

# Scrub errors (via the section 6 textfile collector)
zfs_scrub_errors{pool="rpool"}

WireGuard metrics via textfile collector

The WireGuard Masterclass covers the textfile exporter setup. Once running, these metrics appear in Prometheus:

# Bytes transferred per WireGuard peer
wireguard_sent_bytes_total{interface="wg0",public_key="..."}
wireguard_received_bytes_total{interface="wg0",public_key="..."}

# Unix timestamp of the last handshake (peer is stale if time() minus this value exceeds 180)
wireguard_latest_handshake_seconds{interface="wg0",public_key="..."}

node_exporter gives you roughly 1000 metrics per node. The 10 that matter most:

  • node_cpu_seconds_total: CPU usage by mode; iowait tells you if the CPU is waiting on disk
  • node_memory_MemAvailable_bytes: memory actually available to applications (not MemFree, which ignores reclaimable page cache)
  • node_filesystem_avail_bytes: disk space remaining
  • node_disk_io_time_seconds_total: disk busy percentage; near 100% means the disk is saturated
  • node_network_receive_bytes_total: network throughput
  • node_zfs_arc_size: how much RAM ZFS is using for ARC
  • node_zfs_arc_hits / node_zfs_arc_misses: ARC hit ratio; below 80% is a problem
  • node_systemd_unit_state{state="failed"}: service health; a value of 1 means that unit is in the failed state
  • node_load1: 1-minute load average
  • node_boot_time_seconds: boot time; uptime is time() - node_boot_time_seconds

If you dashboard nothing else, dashboard these.

5. Grafana — Visualization

Grafana is the visualization layer. It connects to Prometheus as a data source, executes PromQL queries, and renders the results as time-series graphs, gauges, stat panels, heatmaps, and tables.

Install on kldload

# CentOS Stream 9 / RHEL / Rocky — add Grafana repo
cat > /etc/yum.repos.d/grafana.repo <<'EOF'
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF
dnf install -y grafana
systemctl enable --now grafana-server

# Debian 13 / Ubuntu 24.04
apt install -y apt-transport-https gnupg wget
# apt expects a binary keyring at a .gpg path, so dearmor the downloaded key
wget -q -O - https://apt.grafana.com/gpg.key \
  | gpg --dearmor -o /usr/share/keyrings/grafana.gpg
echo "deb [signed-by=/usr/share/keyrings/grafana.gpg] https://apt.grafana.com stable main" \
  > /etc/apt/sources.list.d/grafana.list
apt update && apt install -y grafana
systemctl enable --now grafana-server

Add Prometheus as a data source

# Grafana listens on :3000 by default
# Open http://your-node:3000 — default login: admin/admin

# Via API (scripted setup)
curl -s -u admin:admin \
  -H "Content-Type: application/json" \
  -X POST http://localhost:3000/api/datasources \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true
  }'
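
The API call works, but it has to be repeated on every fresh Grafana install. An alternative is Grafana's file-based provisioning, where a YAML file in the provisioning directory declares the data source at startup. The sketch below writes such a file. The function name write_prometheus_datasource and the scratch directory are assumptions for illustration; on a packaged install the directory is typically /etc/grafana/provisioning/datasources.

```shell
#!/bin/bash
# Write a Grafana data source provisioning file for Prometheus.
# $1 = provisioning directory, $2 = Prometheus base URL
write_prometheus_datasource() {
  local dir="$1" url="$2"
  mkdir -p "$dir"
  cat > "$dir/prometheus.yaml" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${url}
    isDefault: true
EOF
}

# Example run against a scratch directory so nothing under /etc is touched.
demo_dir=$(mktemp -d)
write_prometheus_datasource "$demo_dir" "http://localhost:9090"
cat "$demo_dir/prometheus.yaml"
```

Grafana reads provisioning files at startup, so restart grafana-server after placing the file in the real directory.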

kldload host dashboard — panels to build

CPU Usage (stacked area)

Stack user, system, iowait, and steal by mode. iowait spikes tell you disk is the bottleneck. steal spikes on a VM tell you the hypervisor is CPU-constrained.

sum by(mode) (rate(node_cpu_seconds_total{instance="$node",mode!="idle"}[5m]))

Memory (gauge + time series)

Show MemAvailable as a gauge (green/yellow/red thresholds) and a time series. Track Buffers + Cached separately so you can see if the kernel is holding onto page cache aggressively.

node_memory_MemAvailable_bytes{instance="$node"}
node_memory_Buffers_bytes{instance="$node"} + node_memory_Cached_bytes{instance="$node"}

ZFS ARC (time series)

Plot ARC size, ARC max, and ARC hit ratio on the same panel with two Y-axes. You want ARC size near ARC max (ZFS is using all available memory for cache) and hit ratio above 80%.

node_zfs_arc_size{instance="$node"}
node_zfs_arc_c_max{instance="$node"}
rate(node_zfs_arc_hits{instance="$node"}[5m]) / (rate(node_zfs_arc_hits{instance="$node"}[5m]) + rate(node_zfs_arc_misses{instance="$node"}[5m]))

Disk I/O (time series)

Plot read and write bytes/sec per device, plus disk utilization %. A flat line at 100% utilization means your disk is saturated — any further load queues up.

rate(node_disk_read_bytes_total{instance="$node"}[5m])
rate(node_disk_written_bytes_total{instance="$node"}[5m])
rate(node_disk_io_time_seconds_total{instance="$node"}[5m]) * 100

Network throughput (time series)

Plot receive and transmit bytes/sec per interface. Exclude loopback. On a WireGuard mesh, plot wg0/wg1/wg2 separately so you can see traffic on each plane.

rate(node_network_receive_bytes_total{instance="$node",device!="lo"}[5m])
rate(node_network_transmit_bytes_total{instance="$node",device!="lo"}[5m])

Service health (stat panel)

A stat panel showing the count of failed systemd units. Should always be 0. Any non-zero value turns the panel red and tells you something has crashed.

sum(node_systemd_unit_state{instance="$node",state="failed"})

Dashboard JSON export for fleet standardization

# Export a dashboard to JSON (numeric id cleared so another instance accepts it)
curl -s -u admin:admin \
  http://localhost:3000/api/dashboards/uid/kldload-host \
  | jq '.dashboard | .id = null' > kldload-host-dashboard.json

# Import to another Grafana instance; the uid is kept, so re-running updates in place
jq '{dashboard: ., overwrite: true}' kldload-host-dashboard.json \
  | curl -s -u admin:admin \
      -H "Content-Type: application/json" \
      -X POST http://grafana2:3000/api/dashboards/db \
      -d @-

A dashboard is only useful if you know what normal looks like. Install Grafana, build your dashboard, then look at it for a week before your first incident. Learn the baseline patterns: what does CPU look like during your nightly backup? What does ARC do when you run a ZFS scrub? What is the normal disk throughput for your workload? When something breaks, you will see the anomaly instantly because you know the baseline. A dashboard you have never looked at during normal operations is useless during an incident. The patterns live in your head, not in the software.

6. ZFS-Specific Monitoring

ZFS generates more operationally useful metrics than any other filesystem. node_exporter surfaces many of them automatically. For the rest — snapshot counts, replication lag, compression ratios — you write a textfile collector script that runs on a schedule and outputs Prometheus exposition format.

Custom ZFS textfile collector

cat > /usr/local/bin/zfs-metrics.sh <<'EOF'
#!/bin/bash
# Writes ZFS metrics in Prometheus text format for the node_exporter
# textfile collector. Run via cron or a systemd timer.

OUTDIR=/var/lib/node_exporter/textfile_collector
OUTFILE=$OUTDIR/zfs.prom
# Temp file lives in the target directory so the final mv is an atomic rename
TMPFILE=$(mktemp "$OUTDIR/.zfs.prom.XXXXXX") || exit 1
trap 'rm -f "$TMPFILE"' EXIT

for pool in $(zpool list -H -o name); do
  # Pool health: 0=ONLINE, 1=DEGRADED, 2=anything else (FAULTED, UNAVAIL, ...)
  health=$(zpool list -H -o health "$pool")
  case "$health" in
    ONLINE)   hval=0 ;;
    DEGRADED) hval=1 ;;
    *)        hval=2 ;;
  esac
  echo "zfs_pool_health{pool=\"$pool\"} $hval" >> "$TMPFILE"

  # Error count from the last scrub ("scan: ... with N errors"); 0 if never scrubbed
  scrub_errs=$(zpool status "$pool" | grep 'scan:' | grep -oE '[0-9]+ errors' | grep -oE '[0-9]+')
  echo "zfs_scrub_errors{pool=\"$pool\"} ${scrub_errs:-0}" >> "$TMPFILE"

  # Snapshot count per pool (-r limits the listing to this pool, so "rpool"
  # does not also count snapshots of "rpool2")
  snap_count=$(zfs list -H -t snapshot -o name -r "$pool" | wc -l)
  echo "zfs_snapshot_count{pool=\"$pool\"} $snap_count" >> "$TMPFILE"

  # Compression ratio (pool-wide), trailing "x" stripped
  compratio=$(zfs get -H -o value compressratio "$pool" | tr -d 'x')
  echo "zfs_compression_ratio{pool=\"$pool\"} $compratio" >> "$TMPFILE"
done

# Dataset sizes; -p prints exact byte counts instead of human-readable units
zfs list -Hp -o name,used,available | while IFS=$'\t' read -r ds used_b avail_b; do
  echo "zfs_dataset_used_bytes{dataset=\"$ds\"} $used_b" >> "$TMPFILE"
  echo "zfs_dataset_available_bytes{dataset=\"$ds\"} $avail_b" >> "$TMPFILE"
done

# mktemp creates the file with mode 0600; node_exporter runs as its own user
chmod 644 "$TMPFILE"
mv "$TMPFILE" "$OUTFILE"
EOF
chmod +x /usr/local/bin/zfs-metrics.sh

# Configure node_exporter to read the textfile collector directory
# Add to node_exporter systemd unit:
# ExecStart=... --collector.textfile.directory=/var/lib/node_exporter/textfile_collector

mkdir -p /var/lib/node_exporter/textfile_collector

# Run every 60 seconds via systemd timer
cat > /etc/systemd/system/zfs-metrics.service <<'EOF'
[Unit]
Description=ZFS Prometheus metrics collector

[Service]
Type=oneshot
ExecStart=/usr/local/bin/zfs-metrics.sh
EOF

cat > /etc/systemd/system/zfs-metrics.timer <<'EOF'
[Unit]
Description=Run ZFS metrics collector every 60 seconds

[Timer]
OnBootSec=30s
OnUnitActiveSec=60s
# Timers default to one-minute accuracy; tighten it so runs stay close to 60s apart
AccuracySec=1s

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now zfs-metrics.timer
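
A malformed line in a .prom file (for example an empty value because a zfs command failed) can cause node_exporter to reject the file. A quick format check before trusting a new collector script can save a debugging session. The sketch below is a rough approximation of the exposition format, not a full parser; the function name check_prom_lines is hypothetical, and label values containing a closing brace are not handled.

```shell
#!/bin/bash
# Rough sanity check for Prometheus text-format sample lines on stdin.
# Prints offending lines and returns non-zero if any are found.
check_prom_lines() {
  awk '
    /^#/ || NF == 0 { next }      # comments (HELP/TYPE) and blank lines are fine
    !/^[a-zA-Z_:][a-zA-Z0-9_:]*([{][^}]*[}])? +[-+]?([0-9.]+([eE][-+]?[0-9]+)?|NaN|Inf)$/ {
      print "bad line " NR ": " $0
      bad = 1
    }
    END { exit (bad ? 1 : 0) }'
}

# Example: the second line has no value, as happens when a command fails.
printf 'zfs_pool_health{pool="rpool"} 0\nzfs_compression_ratio{pool="rpool"} \n' \
  | check_prom_lines || echo "textfile has malformed lines"
```

On a real node, run it as check_prom_lines < /var/lib/node_exporter/textfile_collector/zfs.prom. If promtool is installed, promtool check metrics reading the same file on stdin performs a stricter check.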

Prometheus alerting rules for ZFS

# /etc/prometheus/rules/zfs.yml
groups:
  - name: zfs
    rules:

      # Pool degraded: a device has failed and redundancy is reduced
      - alert: ZFSPoolDegraded
        expr: zfs_pool_health > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool degraded on {{ $labels.instance }}"
          description: "Pool {{ $labels.pool }} is {{ $value }} (1=DEGRADED, 2=FAULTED). Replace the failed drive immediately."

      # ARC hit ratio below 80% — working set exceeds ARC
      - alert: ZFSARCHitRatioLow
        expr: |
          rate(node_zfs_arc_hits[15m])
            / (rate(node_zfs_arc_hits[15m]) + rate(node_zfs_arc_misses[15m])) < 0.80
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "ZFS ARC hit ratio low on {{ $labels.instance }}"
          description: "ARC hit ratio is {{ $value | humanizePercentage }}. Increase RAM, raise zfs_arc_max, or reduce working set."

      # Scrub errors — silent data corruption detected
      - alert: ZFSScrubErrors
        expr: zfs_scrub_errors > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "ZFS scrub errors on {{ $labels.instance }}"
          description: "Pool {{ $labels.pool }} has {{ $value }} scrub error(s). Run 'zpool status {{ $labels.pool }}' immediately."

      # Replication lag (if you track last-sync timestamps via textfile collector)
      - alert: ZFSReplicationLag
        expr: time() - zfs_last_replication_timestamp > 7200
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ZFS replication lagging on {{ $labels.instance }}"
          description: "Last successful replication was {{ $value | humanizeDuration }} ago. Check sanoid/syncoid."
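
The ZFSReplicationLag rule depends on a zfs_last_replication_timestamp metric that nothing shown so far emits. One way to produce it, sketched below, is to write a timestamp file into the textfile collector directory after each successful replication run. The function name write_replication_stamp, the dataset label, and the per-dataset file naming are assumptions for illustration.

```shell
#!/bin/bash
# Record "last successful replication" as a Unix timestamp for node_exporter.
# $1 = textfile collector directory, $2 = dataset name, $3 = timestamp (default: now)
write_replication_stamp() {
  local outdir="$1" dataset="$2" now="${3:-$(date +%s)}"
  local tmp
  tmp=$(mktemp "$outdir/.replication.XXXXXX") || return 1
  {
    echo "# HELP zfs_last_replication_timestamp Unix time of the last successful replication"
    echo "# TYPE zfs_last_replication_timestamp gauge"
    echo "zfs_last_replication_timestamp{dataset=\"$dataset\"} $now"
  } > "$tmp"
  chmod 644 "$tmp"                                        # mktemp defaults to 0600
  mv "$tmp" "$outdir/replication_${dataset//\//_}.prom"   # atomic rename, one file per dataset
}

# Example against a scratch directory with a fixed timestamp.
demo=$(mktemp -d)
write_replication_stamp "$demo" "rpool/data" 1700000000
cat "$demo/replication_rpool_data.prom"
```

Call it only when the replication command succeeds, for example by chaining it with && after your syncoid invocation, so a failed run leaves the old timestamp in place and the lag alert can fire.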

The single most important ZFS metric is the ARC hit ratio. ZFS uses RAM as a read cache (the ARC). If the hit ratio is above 95%, your working set fits in the ARC and nearly all reads are served from RAM at memory speed. If it drops below 80%, your working set does not fit: every cache miss goes to disk, and disk is 100-1000x slower than RAM. The fix depends on the cause. If your workload's working set genuinely grew, add RAM or increase zfs_arc_max. If a single runaway process is polluting the ARC, use eBPF to identify it. The second most important metric is pool health. A DEGRADED pool has a failed device and is running with less redundancy than its vdev was designed for. Depending on the layout (a two-way mirror or raidz1, for example), one more drive failure can lose the pool. Alert on this with zero tolerance for delay: it should page you immediately, any time of day.
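
For a quick check on a node without going through Prometheus, the ARC hit ratio can be computed directly from the kernel's arcstats file, which lists one counter per line as "name type value". The sketch below gives the ratio since boot, which differs from the rate-based PromQL expression that covers only a recent window. The function name arc_hit_ratio is hypothetical.

```shell
#!/bin/bash
# Since-boot ARC hit ratio (percent) from arcstats-formatted text on stdin.
arc_hit_ratio() {
  awk '$1 == "hits"   { h = $3 }
       $1 == "misses" { m = $3 }
       END {
         if (h + m == 0) print "NaN"
         else printf "%.2f\n", 100 * h / (h + m)
       }'
}

if [ -r /proc/spl/kstat/zfs/arcstats ]; then
  arc_hit_ratio < /proc/spl/kstat/zfs/arcstats
else
  # No ZFS on this machine: feed a small sample instead (950 hits, 50 misses).
  printf 'hits 4 950\nmisses 4 50\n' | arc_hit_ratio   # prints 95.00
fi
```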

7. eBPF Observability — Beyond node_exporter

node_exporter gives you system-wide aggregates. eBPF gives you per-process, per-file, per-syscall granularity. When Prometheus tells you "disk latency is high," eBPF tells you which process, which file, which syscall, and exactly how long each one took.

When to reach for eBPF

  • node_exporter shows high disk I/O but you don't know which process is responsible
  • CPU iowait is elevated and you need to find the bottleneck at the process level
  • A specific application is making unexpected network connections
  • You need syscall-level latency breakdown for a performance investigation
  • Something is spawning processes you didn't expect

bcc-tools quick reference

# Network connections: which process is connecting where
tcpconnect             # all processes (no PID filter)
tcpconnect -p 1234     # PID 1234 only

# Disk I/O latency histogram — where is disk time going
biolatency -D          # per-device breakdown
biolatency 5 1         # 5-second interval, once

# Run queue latency — are processes waiting for CPU
runqlat                # how long do processes wait in the run queue

# Process execution — what is spawning
execsnoop              # every exec() call: who, what, args

# File open events — which files is a process accessing
opensnoop -p 1234

# Block I/O tracer per process
biotop                 # top-like view of disk I/O by process

# TCP retransmit events
tcpretrans             # dropped/retransmitted packets with source/dest

# Syscall latency by syscall type
syscount -L -p 1234    # count and latency of each syscall for PID 1234

bpftrace one-liners for custom metrics

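# Disk I/O latency histogram per process name
# (block tracepoints; the name is captured at issue time because completion
#  runs in interrupt context, and async writeback shows up as kernel threads)
bpftrace -e '
tracepoint:block:block_rq_issue
{
  @start[args->dev, args->sector] = nsecs;
  @name[args->dev, args->sector] = comm;
}
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/
{
  @usecs[@name[args->dev, args->sector]] = hist((nsecs - @start[args->dev, args->sector]) / 1000);
  delete(@start[args->dev, args->sector]);
  delete(@name[args->dev, args->sector]);
}'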

# Count read() syscalls by process, 10-second window
bpftrace -e '
tracepoint:syscalls:sys_enter_read { @[comm] = count(); }
interval:s:10 { print(@); clear(@); exit(); }'

# Time spent inside tcp_v4_connect() per process name
# (this covers only the SYN send path; for full handshake latency use bcc's tcpconnlat)
bpftrace -e '
kprobe:tcp_v4_connect { @start[tid] = nsecs; }
kretprobe:tcp_v4_connect /@start[tid]/
{
  @us[comm] = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'

# ZFS read latency (kprobe on zfs_read; the zfs kernel module must be loaded so the symbol is visible)
bpftrace -e '
kprobe:zfs_read { @start[tid] = nsecs; }
kretprobe:zfs_read /@start[tid]/
{
  @us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'

Exporting eBPF data to Prometheus — the textfile pattern

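# Example: export established TCP connection counts per peer (destination) port
# Run this script on a timer; node_exporter's textfile collector picks up the output.
# This example reads socket state with ss; eBPF-derived numbers use the same file-drop pattern.

cat > /usr/local/bin/tcp-metrics.sh <<'EOF'
#!/bin/bash
OUTDIR=/var/lib/node_exporter/textfile_collector
OUTFILE=$OUTDIR/tcp.prom
# Temp file in the target directory keeps the final mv atomic
TMPFILE=$(mktemp "$OUTDIR/.tcp.prom.XXXXXX") || exit 1

echo "# HELP tcp_connection_count Established TCP connections by peer port" > "$TMPFILE"
echo "# TYPE tcp_connection_count gauge" >> "$TMPFILE"

# -H drops the header; column 4 is peer address:port. Taking the field after
# the last colon also handles IPv6 peers such as [::1]:5432
ss -Htn state established | \
  awk '{n = split($4, a, ":"); print a[n]}' | sort | uniq -c | \
  while read -r count port; do
    echo "tcp_connection_count{port=\"$port\"} $count" >> "$TMPFILE"
  done

chmod 644 "$TMPFILE"   # mktemp defaults to 0600; node_exporter must be able to read it
mv "$TMPFILE" "$OUTFILE"
EOF
chmod +x /usr/local/bin/tcp-metrics.sh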
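
The same textfile pattern carries real eBPF output. bpftrace prints count maps as lines of the form @[key]: value, so a small filter can turn the read-syscall one-liner above into Prometheus samples. This is a sketch under assumptions: the function name bpftrace_counts_to_prom and the metric name are made up for illustration, and process names containing double quotes are not escaped.

```shell
#!/bin/bash
# Convert bpftrace count-map output ("@[comm]: N") into Prometheus samples.
# $1 = metric name to emit; bpftrace output arrives on stdin.
bpftrace_counts_to_prom() {
  local metric="$1"
  echo "# TYPE $metric gauge"
  # Lines that are not map entries (banner, blank lines) are dropped by -n ... p
  sed -n "s/^@[A-Za-z0-9_]*\[\(.*\)\]: \([0-9][0-9]*\)\$/${metric}{comm=\"\1\"} \2/p"
}

# Sample bpftrace output; on a real node pipe the one-liner's output in instead.
printf 'Attaching 2 probes...\n@[sshd]: 12\n@[postgres]: 340\n' \
  | bpftrace_counts_to_prom read_syscalls_10s
```

Redirect the result to a temporary file in the textfile collector directory and rename it into place, as the scripts above do. Keep the label set small: one series per process name is manageable, one per PID is not.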

The eBPF workflow has a clear mental model. Prometheus is your early warning system: it alerts when something crosses a threshold. Grafana is your map: it shows you the shape of the problem in time. eBPF is your microscope: it shows you the root cause at the kernel level. When a disk latency alert fires, you open Grafana to see which node and when. Then you SSH in and run biolatency to see the latency distribution, biotop to see which process, and opensnoop to see which file. The entire investigation takes under 5 minutes once you have the workflow internalized. Practice it on a non-incident first so it is muscle memory when the real alert fires at 3am.

8. Log Aggregation

Every kldload system runs journald. It is structured, indexed, queryable, and persistent across reboots. For most deployments, journald on each node with journalctl is sufficient — no log shipping pipeline required.

journalctl patterns for daily operations

# All logs from the last hour
journalctl --since "1 hour ago"

# Logs from a specific unit, last 100 lines
journalctl -u nginx -n 100

# Follow logs in real time (like tail -f)
journalctl -u postgresql -f

# Errors and above only (priority: emerg, alert, crit, err)
journalctl -p err --since "today"

# Kernel messages only (dmesg equivalent, but structured)
journalctl -k --since "1 hour ago"

# Boot messages for the previous boot (post-crash)
journalctl -b -1

# All logs between two timestamps
journalctl --since "2026-04-02 14:00:00" --until "2026-04-02 14:30:00"

# JSON output for scripted log parsing
journalctl -u sshd -o json | jq '.MESSAGE'

# Check disk space used by journal
journalctl --disk-usage

# Vacuum old journals to free space
journalctl --vacuum-time=30d
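
During an incident the first log question is often "which unit is making the noise". journalctl's export output writes one FIELD=value pair per line, which makes a per-unit tally easy without extra tooling. The sketch below assumes that format; the function name count_by_unit is hypothetical, and the sample input stands in for a live journal.

```shell
#!/bin/bash
# Tally journal entries per systemd unit from "journalctl -o export" text on stdin.
count_by_unit() {
  grep '^_SYSTEMD_UNIT=' | cut -d= -f2- | sort | uniq -c |
    sort -k1,1nr -k2,2 |           # busiest unit first, ties by name
    awk '{ print $1, $2 }'         # normalise uniq's padded output
}

# Sample input; on a real node use:
#   journalctl -p err --since today -o export | count_by_unit
count_by_unit <<'SAMPLE'
_SYSTEMD_UNIT=nginx.service
MESSAGE=upstream timed out

_SYSTEMD_UNIT=sshd.service
MESSAGE=error: kex_exchange_identification

_SYSTEMD_UNIT=nginx.service
MESSAGE=upstream timed out
SAMPLE
```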

Forward to a central log server (Loki)

# Install Loki (on the monitoring node)
# Loki is Grafana's log aggregation engine — same label model as Prometheus

# Install promtail on each kldload node (the log shipper)
# CentOS / RHEL / Rocky
dnf install -y promtail   # or install from Grafana RPM repo

# /etc/promtail/config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  # Keep this off /tmp so read positions survive a reboot (the directory must exist)
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://monitoring.wg2:3100/loki/api/v1/push

scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
        host: __HOSTNAME__   # placeholder: substitute this node's hostname when deploying the file
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
      - source_labels: ['__journal_priority_keyword']
        target_label: 'level'

systemctl enable --now promtail

Do not build a log pipeline until you actually need one. journald on each node with journalctl is enough for deployments under 10 nodes. You can SSH to the relevant node and query structured logs in seconds: no pipeline to maintain, no Loki to operate, no storage costs for shipping logs across the network. When you outgrow it (more than 10 nodes, compliance requirements that need central audit logs, cross-node log correlation for debugging distributed problems), add Loki. Loki uses the same label model as Prometheus (job, instance, env) so your LogQL queries feel familiar if you know PromQL. But resist the urge to add it early. Every tool you operate is a tool that can break at 3am.

9. Alerting with Alertmanager

Alertmanager receives alerts from Prometheus, deduplicates and groups them, applies routing rules to send the right alert to the right person, and handles silences and inhibitions. It is the last mile between a firing Prometheus rule and a human being woken up.

Install and configure Alertmanager

# CentOS / RHEL / Rocky
dnf install -y alertmanager

# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Critical alerts (ZFS and node-level alike) go to on-call immediately
    - match:
        severity: critical
      receiver: 'oncall'
      group_wait: 0s

    # Warnings go to Slack only
    - match:
        severity: warning
      receiver: 'slack-only'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'oncall'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
    slack_configs:
      - channel: '#alerts-critical'

  - name: 'slack-only'
    slack_configs:
      - channel: '#alerts'

systemctl enable --now alertmanager
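
Before relying on the routing tree, it may be worth validating the file and asking which receiver a given label set would reach. The sketch below assumes amtool (shipped with Alertmanager) is on the PATH and that the config lives at the path used above. The check helper is a hypothetical wrapper written for this example, not part of any tool; it only exists so the snippet degrades gracefully on a node where amtool is not installed.

```shell
#!/bin/sh
# check: run a tool with its arguments, or report that it is missing.
# (Hypothetical helper for this sketch; not part of amtool.)
check() {
  tool=$1; shift
  if command -v "$tool" >/dev/null 2>&1; then
    "$tool" "$@"
  else
    echo "skip: $tool not installed"
  fi
}

CONFIG=/etc/alertmanager/alertmanager.yml

# Validate the config file before (re)starting the service.
check amtool check-config "$CONFIG"

# Ask the routing tree where a label set would land.
# With the routes above, severity=critical should resolve to the oncall receiver
# and severity=warning to slack-only.
check amtool config routes test --config.file="$CONFIG" severity=critical
check amtool config routes test --config.file="$CONFIG" severity=warning
```

Running the route test after every edit to the routes section is a cheap way to catch a matcher typo before it silently sends a critical alert to the default receiver.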

Essential alerting rules for kldload deployments

# /etc/prometheus/rules/kldload.yml
groups:
  - name: kldload-nodes
    rules:

      # Node unreachable — node_exporter stopped responding
      - alert: NodeDown
        expr: up{job="kldload-nodes"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "node_exporter on {{ $labels.instance }} has been unreachable for 2 minutes."

      # Disk full prediction — will fill in 4 hours at current rate
      - alert: DiskFillingSoon
        expr: |
          predict_linear(node_filesystem_avail_bytes
            {fstype!~"tmpfs|overlay"}[1h], 4 * 3600) < 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk filling on {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} will be full in under 4 hours."

      # High CPU for sustained period
      - alert: HighCPU
        expr: |
          100 - (avg by(instance)
            (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage has been above 90% for 15 minutes."

      # WireGuard peer stale (no handshake in 3 minutes)
      - alert: WireGuardPeerStale
        expr: |
          (time() - wireguard_latest_handshake_seconds) > 180
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WireGuard peer stale on {{ $labels.instance }}"
          description: "Peer {{ $labels.public_key }} on {{ $labels.interface }} has not completed a handshake in over 3 minutes."

      # Memory critically low
      - alert: MemoryLow
        expr: |
          node_memory_MemAvailable_bytes
            / node_memory_MemTotal_bytes < 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Memory critically low on {{ $labels.instance }}"
          description: "Available memory is below 5% of total."
The most common alerting mistake is alerting on symptoms instead of impact. "CPU is 95%" is a symptom. "Request latency is over 500ms" is impact. "Disk is 85% full" is a symptom. "Disk will be full in 4 hours" is actionable. Alert on what actually affects users or risks data loss. Investigate symptoms. If CPU is 95% but latency is normal and no requests are failing, nothing is actually wrong — the system is just working hard. A CPU alert that fires every deployment because your app recompiles caches on startup trains your team to ignore alerts. Alert fatigue is the silent killer of on-call rotations. Every alert that fires should require a human decision. If the right decision is always "ignore it," delete the alert.
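
To make "disk will be full in 4 hours" concrete, here is a rough sketch of what the DiskFillingSoon rule computes: a least-squares line through the recent samples, extrapolated forward. It is an illustration in awk, not Prometheus code. The sample values are invented, and the sketch extrapolates from the last sample's timestamp, whereas Prometheus extrapolates from the rule evaluation time.

```shell
#!/bin/sh
# predict_linear: least-squares fit over "<seconds> <bytes_available>" lines on stdin,
# then extrapolate $1 seconds past the last sample. Needs at least two samples.
predict_linear() {
  awk -v ahead="$1" '
    { n++; sx += $1; sy += $2; sxx += $1 * $1; sxy += $1 * $2; last = $1 }
    END {
      slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
      intercept = (sy - slope * sx) / n
      printf "%.0f\n", intercept + slope * (last + ahead)
    }'
}

# Hypothetical samples: 1 GiB free, draining 100 MiB every 15 minutes for one hour.
samples='0 1073741824
900 968884224
1800 864026624
2700 759169024
3600 654311424'

# Four hours ahead the prediction is negative, so the alert expression (< 0) is true.
printf '%s\n' "$samples" | predict_linear 14400
# One hour ahead the prediction is still positive, so a 1-hour horizon would not fire.
printf '%s\n' "$samples" | predict_linear 3600
```

The same filesystem at the same fill level can be urgent or harmless depending on the slope, which is why the rule alerts on the prediction rather than on a fixed percentage.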

10. Distributed Tracing — for Microservices

Distributed tracing records end-to-end request flows across multiple services. Each service adds a span to the trace with its start time, duration, and outcome. The trace assembles into a waterfall diagram that shows exactly where a slow request spent its time.

OpenTelemetry — the collection standard

# OpenTelemetry Collector — receives traces from applications, exports to Jaeger/Tempo
# Install on the monitoring node

mkdir -p /etc/otel   # the heredoc below fails if this directory does not exist
cat > /etc/otel/config.yaml <<'EOF'
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

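exporters:
  # The dedicated jaeger exporter was removed in Collector v0.86.0.
  # On newer releases, use the otlp exporter aimed at Jaeger's OTLP port instead.
  jaeger: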
    endpoint: localhost:14250
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [jaeger]
EOF

# Instrument your application (Go example)
# import "go.opentelemetry.io/otel"
# tracer := otel.Tracer("my-service")
# ctx, span := tracer.Start(ctx, "database-query")
# defer span.End()

Jaeger — trace storage and UI

# Run Jaeger all-in-one (development/small deployments)
podman run -d \
  --name jaeger \
  -p 16686:16686 \
  -p 14250:14250 \
  jaegertracing/all-in-one:latest

# Access Jaeger UI at http://your-node:16686
You do not need distributed tracing unless you have microservices. A single kldload server running PostgreSQL, nginx, and your application does not need tracing — logs and metrics are enough, and adding tracing requires instrumenting your application code, running a collector, running a trace storage backend, and learning a new query language. That is real operational cost for zero benefit on a monolith. The rule: if a slow request involves only one process, use logs and eBPF. If a slow request spans multiple services (API gateway → auth service → user service → database), tracing helps you find which service is slow without guessing. Ten microservices behind a load balancer? Add tracing. One app server? Do not.

11. The Complete Monitoring Stack for a kldload Fleet

A production kldload fleet has one monitoring node and N target nodes. The monitoring node runs Prometheus, Grafana, and Alertmanager. Every other node runs node_exporter (pre-installed by kldload). All scraping happens over the WireGuard monitoring plane (wg2), so the metrics traffic is encrypted and isolated from production traffic.

Architecture overview

# Fleet topology (10 nodes + 1 monitoring node)
#
# monitoring.wg2 (10.2.0.1)
#   prometheus :9090      ← scrapes all wg2 addresses
#   grafana    :3000      ← queries prometheus
#   alertmanager :9093    ← receives alerts from prometheus
#
# node01.wg2 (10.2.0.2) — node_exporter :9100
# node02.wg2 (10.2.0.3) — node_exporter :9100
# ...
# node10.wg2 (10.2.0.11) — node_exporter :9100
#
# Prometheus scrapes over WireGuard wg2 plane (10.2.0.0/24)
# Grafana is only accessible from the WireGuard network
# No monitoring ports are exposed on the LAN or internet

Complete Prometheus scrape config for a 10-node fleet

# /etc/prometheus/prometheus.yml (on monitoring node)
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'kldload-prod'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - '/etc/prometheus/rules/*.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'kldload-nodes'
    scrape_interval: 15s
    static_configs:
      - targets:
          - '10.2.0.2:9100'
          - '10.2.0.3:9100'
          - '10.2.0.4:9100'
          - '10.2.0.5:9100'
          - '10.2.0.6:9100'
          - '10.2.0.7:9100'
          - '10.2.0.8:9100'
          - '10.2.0.9:9100'
          - '10.2.0.10:9100'
          - '10.2.0.11:9100'
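    # node_id is built from the last octet, so 10.2.0.2 becomes "node2".
    # That is offset by one from the hostnames in the topology above (node01 = 10.2.0.2).
    relabel_configs: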
      - source_labels: [__address__]
        regex: '10\.2\.0\.(\d+):9100'
        target_label: node_id
        replacement: 'node${1}'

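  # Textfile-collector metrics are served from the same :9100/metrics endpoint,
  # so the two jobs below re-scrape the full node_exporter output under a second job label.
  - job_name: 'zfs-textfile'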
    scrape_interval: 60s
    static_configs:
      - targets:
          - '10.2.0.2:9100'
          - '10.2.0.3:9100'

  - job_name: 'wireguard-textfile'
    scrape_interval: 30s
    static_configs:
      - targets:
          - '10.2.0.2:9100'
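
The ten-entry target list for kldload-nodes follows a regular pattern, so it could be generated rather than typed. A minimal sketch, assuming the wg2 addressing shown in the topology (10.2.0.2 through 10.2.0.11) and the default node_exporter port; fleet_targets is a hypothetical helper name.

```shell
#!/bin/sh
# fleet_targets: print one YAML list item per node on the wg2 plane.
# Arguments are the first and last host octet (defaults: 2 and 11).
fleet_targets() {
  first=${1:-2}
  last=${2:-11}
  i=$first
  while [ "$i" -le "$last" ]; do
    printf "          - '10.2.0.%d:9100'\n" "$i"
    i=$((i + 1))
  done
}

# Prints ten lines, indented to paste directly under "- targets:".
fleet_targets
```

When the fleet grows, changing the second argument regenerates the list, which avoids the skipped or duplicated address that hand editing tends to produce.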

Recommended Grafana dashboards for a kldload fleet

  • Fleet Overview — one row per node, showing CPU%, memory%, disk%, ARC hit ratio, and service health. The "wall of green" dashboard — any red cell requires investigation.
  • Node Detail — templated with a $node variable, showing all metrics for a single selected node. Every panel drills into one host.
  • ZFS Pool Health — pool status for every node, ARC size, hit ratio, scrub state, snapshot counts, and compression ratio.
  • WireGuard Mesh — peer connectivity matrix, bytes transferred per tunnel, last handshake age for each peer.
  • Alert History — timeline of all fired alerts over the last 30 days. Use this for capacity planning and post-mortems.

12. Incident Response Workflow

Observability tools are only useful if you have a response workflow. The best monitoring setup in the world does not help if you don't know what to do when it fires.

The workflow

# Step 1: Alert fires (PagerDuty / Slack)
# Read the alert. Note: node, alert name, threshold crossed, time.

# Step 2: Check the Fleet Overview dashboard
# Confirm the affected node. Look for correlated anomalies on neighboring nodes.
# Is this one node, or a cluster-wide pattern?

# Step 3: Open Node Detail dashboard for the affected node
# Look at the time window before and after the alert.
# What changed? CPU spike? Memory drop? Disk saturation? Network blip?

# Step 4: SSH to the affected node
ssh node04.wg2

# Step 5: Confirm with local tools
top               # current CPU/memory picture
iostat -x 1 5    # disk utilization per device
ss -s             # socket summary — connections, established, time_wait

# Step 6: eBPF drill-down based on what you saw
# Disk latency alert:
biolatency -D 10 1   # I/O latency histogram per device, 10-second window
biotop               # top-like disk I/O by process
# CPU alert:
runqlat              # how long are processes waiting for CPU
profile -F 99 30     # sample on-CPU stacks at 99 Hz for 30 seconds (add -f for folded output to feed a flame graph)
# Network alert:
tcpconnect           # outbound TCP connections: which process connects where (tcpaccept covers inbound)
tcpretrans           # TCP retransmits indicate packet loss

Concrete example: disk latency alert

# Alert fires: DiskLatencyHigh on node04 at 02:17

# Check Grafana: disk I/O on node04 shows sustained 100% utilization starting 02:14
# Correlates with: ZFS scrub_state changed from idle to scanning at 02:14

# SSH to node04:
zpool status rpool
# Output:
#   scan: scrub in progress since Wed Apr  2 02:14:01 2026
#         16.3G scanned out of 847G at 112M/s, 2h6m to go
#         0 errors, 0 repaired

# Root cause: scheduled ZFS scrub started at 02:14, saturating disk bandwidth.
# This is expected behavior. Non-incident.

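# Resolution: add an inhibition rule to Alertmanager (alertmanager.yml, not Prometheus).
# It only suppresses DiskLatencyHigh while a ZFSScrubRunning alert is firing on the same instance: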
#   inhibit_rules:
#     - source_match:
#         alertname: ZFSScrubRunning
#       target_match:
#         alertname: DiskLatencyHigh
#       equal: ['instance']

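# Or: schedule scrubs during off-peak hours. zpool has no scrub-schedule property;
# scrubs are started by cron or a systemd timer. Example /etc/cron.d entry (zpool path may differ):
# 0 4 * * 0  root  /usr/sbin/zpool scrub rpool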
The best monitoring setup in the world is useless without a response workflow. Before your first real incident, write a runbook for each alert: what does this alert mean, what is the first thing to check, which bcc tool is relevant, what does a non-incident look like (like a ZFS scrub), what does a real incident look like. Store the runbooks with the alerting rules. Practice the workflow on a lab node before you need it at 3am. The ZFS scrub example above is real: disk latency alerts during a scheduled scrub are not incidents, but they will page you if you don't handle them explicitly. Every false positive you silence with a documented inhibition rule makes your alerts more trustworthy. Trustworthy alerts get responded to. Noisy alerts get ignored.
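
A runbook entry for the disk latency alert could start with a check like the one below. It is a minimal sketch: scrub_running and triage_disk_latency are hypothetical helper names, and the match relies on the "scrub in progress" text that zpool status prints in the example above.

```shell
#!/bin/sh
# scrub_running: read `zpool status` text on stdin; succeed if a scrub is in progress.
scrub_running() {
  grep -q 'scrub in progress'
}

# triage_disk_latency: classify a disk latency alert from `zpool status` text on stdin.
triage_disk_latency() {
  if scrub_running; then
    echo "scrub active: expected I/O saturation, likely a non-incident"
  else
    echo "no scrub running: continue with biolatency / biotop"
  fi
}

# On a real node: zpool status rpool | triage_disk_latency
# Here the sample output from the example above stands in for the live command.
printf '%s\n' \
  '  scan: scrub in progress since Wed Apr  2 02:14:01 2026' \
  '        16.3G scanned out of 847G at 112M/s, 2h6m to go' |
  triage_disk_latency
```

Keeping the check next to the alerting rule means the first responder runs the same test every time instead of rediscovering the scrub explanation at 3am.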

Related pages