eBPF Metrics & Exporters — Kernel-Grade Observability for Prometheus
You have two ways to understand what your system is doing. You can poll /proc every 15 seconds and get a counter that went up by some amount. Or you can attach an eBPF program to a kernel function and get every single event — with latency, with call stacks, with the process that caused it, in real time. One gives you a speedometer reading. The other gives you a dashcam with GPS.
The thesis: /proc/diskstats tells you your disk did 1,247 reads in the last scrape interval. eBPF tells you that 1,200 of those reads completed in under 100μs (page cache), 40 took 2-5ms (SSD), and 7 took over 50ms (queue depth saturation at 14:23:07 caused by PID 8821 running pg_dump). The counter went up the same amount. The story is completely different.
Why eBPF metrics are fundamentally different
/proc polling gives you the rate of change between two sample points. You compute deltas. You lose the distribution of everything that happened between samples. A burst of 200ms I/O latencies that lasted 3 seconds and resolved before your next scrape? Averaged away. As far as your monitoring is concerned, it never happened.
eBPF tracing gives you per-event data. Every disk I/O, every syscall, every TCP retransmit fires your program. You aggregate in-kernel into histograms, counters, and gauges — then export the aggregates. Nothing is lost. The 200ms stall shows up as a spike in your p99 latency histogram, pointing at the exact device and the exact process.
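To make the difference concrete, here is a small simulation using the read counts from the thesis above. The latency values (0.08ms, 3ms, 60ms) are illustrative assumptions, and `percentile` is a hypothetical helper, not part of any exporter on this page.

```python
import math

# One scrape interval with 1,247 reads: 1,200 page cache hits, 40 SSD reads,
# 7 reads stuck behind a saturated queue. Latencies in ms are assumed values.
latencies_ms = [0.08] * 1200 + [3.0] * 40 + [60.0] * 7


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# What a polled counter gives you: an event count and, at best, an average.
count = len(latencies_ms)
mean_ms = sum(latencies_ms) / count

# What per-event data gives you: the shape of the tail.
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
p999_ms = percentile(latencies_ms, 99.9)

print(f"count={count} mean={mean_ms:.2f}ms")
print(f"p50={p50_ms}ms p99={p99_ms}ms p99.9={p999_ms}ms")
```

The mean lands at roughly half a millisecond, which looks healthy. Only the tail percentiles reveal the seven reads that took 60ms.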
This page is about building production Prometheus exporters powered by eBPF. Not toy examples. Full working code you can drop onto a kldloadOS box and have kernel-grade metrics in Grafana within 10 minutes. Every exporter includes the eBPF program, the Python/C glue, the systemd unit, the Prometheus scrape config, and the Grafana query.
The eBPF Metrics Pipeline
Before building anything, understand the data flow. Every eBPF metric follows the same path from kernel event to dashboard panel:
1. Kernel Event
Something happens: a block I/O completes, a syscall fires, a TCP packet is retransmitted, a ZFS ARC lookup hits or misses. The kernel has a hook point — a tracepoint, kprobe, or fentry — at each of these locations.
2. eBPF Program
Your program, loaded into the kernel via BCC/libbpf/bpftrace, runs on every event. It extracts fields (latency, device, PID, return code), does minimal computation (bucket into histogram, increment counter), and writes to a BPF map. Runs in nanoseconds. No context switch.
3. BPF Map
An in-kernel data structure shared between the eBPF program (writer) and your userspace process (reader). Map types include hash maps, arrays, per-CPU arrays, and ring buffers; a BCC histogram is a convenience layer built on an array or hash map. The map is the boundary between kernel and userspace.
4. Userspace Exporter
A Python (BCC + prometheus_client), Go (cilium/ebpf + promhttp), or C (libbpf + libmicrohttpd) process that reads BPF maps on each Prometheus scrape and formats them as Prometheus text exposition. Exposes an HTTP endpoint on a configured port.
5. Prometheus Scrape
Prometheus hits /metrics on your exporter every 15s (or whatever your scrape interval is). Gets fresh histogram buckets, counters, and gauges. Stores in TSDB. The exporter resets or re-reads BPF maps as needed.
6. Grafana + Alertmanager
Grafana queries PromQL against the Prometheus TSDB. Heatmaps for latency distributions. Rate graphs for throughput. Alertmanager fires when p99 latency exceeds your threshold or ARC hit rate drops below 90%.
                    KERNEL SPACE                   |              USERSPACE
                                                   |
disk_io_complete ──► eBPF prog ──► BPF histogram   | ──► Python exporter ──► :9101/metrics
tcp_retransmit   ──► eBPF prog ──► BPF counter     | ──► Python exporter ──► :9102/metrics
zfs_arc_hit      ──► eBPF prog ──► BPF map         | ──► Python exporter ──► :9103/metrics
sched_switch     ──► eBPF prog ──► BPF histogram   | ──► Python exporter ──► :9104/metrics
                                                   |
                                                   |                 ▼
                                                   |   Prometheus (scrapes every 15s)
                                                   |                 ▼
                                                   |   Grafana dashboards + Alertmanager
The critical insight: aggregation happens in the kernel. Your eBPF program doesn't send every event to userspace — that would be millions of events per second on a busy system. Instead, it maintains histograms, counters, and maps in-kernel. Userspace only reads the aggregated result on each scrape. This is why eBPF exporters have near-zero overhead even on systems doing 500K IOPS.
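A plain-Python sketch of that in-kernel aggregation, assuming a fixed array of log2 slots. The names `log2_slot` and `record` are illustrative; the slot rule mirrors the shift loop in the libbpf scheduler example later on this page, and BCC's `bpf_log2l` may number its slots differently.

```python
MAX_SLOTS = 32            # fixed number of buckets, like a BPF array map
hist = [0] * MAX_SLOTS    # stands in for the in-kernel histogram


def log2_slot(value_us):
    """Slot 0 holds 0; slot s holds values in [2**(s-1), 2**s)."""
    return min(value_us.bit_length(), MAX_SLOTS - 1)


def record(latency_ns):
    """What the eBPF program does per event: one division, one increment."""
    hist[log2_slot(latency_ns // 1000)] += 1


# 100,000 simulated events collapse into at most 32 integers.
for i in range(100_000):
    record((i % 5000) * 1000)   # latencies of 0..4999 microseconds

nonzero = {slot: n for slot, n in enumerate(hist) if n}
print(nonzero)
```

However many events fire, userspace only ever reads `MAX_SLOTS` integers per scrape, which is why the cost of the exporter does not grow with event rate.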
Prerequisites
kldloadOS desktop and server profiles install all of this automatically. If you're on a stock distro, here's what you need:
# CentOS Stream 9 / RHEL 9 / Rocky 9
dnf install -y bcc-tools bpftrace python3-bcc python3-pip kernel-devel
pip3 install prometheus_client
# Debian 13 / Ubuntu 24.04
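# A system-wide pip3 install is refused on these releases (externally managed
# environment), so take prometheus_client from the distro package instead
apt install -y bpfcc-tools bpftrace python3-bpfcc python3-prometheus-client linux-headers-$(uname -r)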
# Fedora 41
dnf install -y bcc-tools bpftrace python3-bcc python3-pip kernel-devel
pip3 install prometheus_client
# Arch Linux
pacman -S bcc bpftrace python-bcc python-prometheus_client
# Verify BPF is working
bpftrace -e 'BEGIN { printf("BPF works\n"); exit(); }'
Output:
Attaching 1 probe...
BPF works
Pattern 1: BCC Python + prometheus_client
This is the most practical pattern for production exporters. BCC compiles your eBPF C code at runtime, loads it into the kernel, and gives you Python objects to read the BPF maps. prometheus_client handles the HTTP endpoint and text exposition format. You end up with a single Python file that is a complete Prometheus exporter.
When to use BCC Python
Use this when you need a long-running daemon that continuously exports metrics. It handles the full lifecycle: compile eBPF, attach probes, read maps, serve HTTP. The overhead is the Python process itself (~20MB RSS) plus negligible kernel-side cost. Good for the 80% case.
Complete disk I/O latency exporter
This exporter traces every block I/O completion, records the latency in a histogram keyed by device and operation type, and exposes it as a Prometheus histogram. Drop this file on any kldloadOS box and you have per-device, per-operation I/O latency distributions.
cat > /usr/local/bin/ebpf-diskio-exporter << 'PYEOF'
#!/usr/bin/env python3
"""eBPF Disk I/O Latency Exporter for Prometheus.
Traces block I/O completions via the block:block_rq_complete tracepoint.
Records latency histograms per device and operation type (read/write).
Serves Prometheus metrics on :9101/metrics.
"""
from bcc import BPF
from prometheus_client import start_http_server, Histogram, Info
import ctypes
import time
import os
import signal
import sys
# eBPF program — runs in the kernel on every block I/O request
BPF_PROGRAM = r"""
#include <uapi/linux/ptrace.h>
#include <linux/blk-mq.h>
typedef struct {
    u32 dev;
    u8 rwflag;
    u64 ts;
} start_key_t;

typedef struct {
    u32 dev;
    u8 rwflag;
    u64 slot;   // log2 latency bucket index
} hist_key_t;

BPF_HASH(start, struct request *, start_key_t);
BPF_HISTOGRAM(hist, hist_key_t);

// Trace block I/O request insertion (start time).
// Kernels before 5.11 pass the request as args[1] rather than args[0].
RAW_TRACEPOINT_PROBE(block_rq_insert) {
    struct request *req = (struct request *)ctx->args[0];
    start_key_t sk = {};
    sk.ts = bpf_ktime_get_ns();

    // Pack the first four characters of the disk name into a u32, so
    // "nvme0n1" is reported as "nvme". Kernels before 5.16 expose the
    // disk as req->rq_disk instead of req->q->disk.
    struct gendisk *disk = req->q->disk;
    char name[4] = {};
    bpf_probe_read_kernel(&name, sizeof(name), disk->disk_name);
    sk.dev = (u32)name[0]
           | ((u32)name[1] << 8)
           | ((u32)name[2] << 16)
           | ((u32)name[3] << 24);

    sk.rwflag = !!((req->cmd_flags & REQ_OP_MASK) == REQ_OP_WRITE);
    start.update(&req, &sk);
    return 0;
}

// Trace block I/O completion (calculate latency)
RAW_TRACEPOINT_PROBE(block_rq_complete) {
    struct request *req = (struct request *)ctx->args[0];
    start_key_t *skp = start.lookup(&req);
    if (!skp) return 0;
    u64 delta = bpf_ktime_get_ns() - skp->ts;
    hist_key_t hk = {};
    hk.dev = skp->dev;
    hk.rwflag = skp->rwflag;
    hk.slot = bpf_log2l(delta / 1000);  // log2 microseconds
    hist.increment(hk);                 // one count per completed I/O
    start.delete(&req);
    return 0;
}
"""

# Prometheus histogram buckets (microseconds, matching log2 ranges)
LATENCY_BUCKETS = (1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                   1024, 2048, 4096, 8192, 16384, 32768, 65536,
                   131072, 262144, 524288, 1048576)

disk_io_latency = Histogram(
    'ebpf_disk_io_latency_microseconds',
    'Block I/O latency in microseconds, per device and operation',
    ['device', 'operation'],
    buckets=LATENCY_BUCKETS
)

exporter_info = Info('ebpf_diskio_exporter', 'eBPF disk I/O latency exporter')
exporter_info.info({
    'version': '1.0.0',
    'source': 'block:block_rq_insert + block:block_rq_complete'
})

def dev_name(dev_int):
    """Convert packed device int back to name."""
    chars = []
    for i in range(4):
        c = (dev_int >> (i * 8)) & 0xFF
        if c == 0:
            break
        chars.append(chr(c))
    return ''.join(chars) if chars else 'unknown'

def collect_and_export(bpf_obj):
    """Read BPF histogram, push to Prometheus, clear."""
    hist = bpf_obj["hist"]
    for k, v in hist.items():
        device = dev_name(k.dev)
        op = "write" if k.rwflag else "read"
        # k.slot is the log2(us) bucket index; v.value is the number of I/Os in it.
        # Observing the bucket's upper bound once per I/O keeps the bucket counts
        # right, but makes _sum an approximation.
        bucket_us = 1 << k.slot if k.slot < 20 else 1048576
        child = disk_io_latency.labels(device=device, operation=op)
        for _ in range(v.value):
            child.observe(bucket_us)
    hist.clear()
def shutdown(signum, frame):
    print("Shutting down eBPF disk I/O exporter...")
    sys.exit(0)

if __name__ == '__main__':
    signal.signal(signal.SIGTERM, shutdown)
    signal.signal(signal.SIGINT, shutdown)
    print("Loading eBPF program...")
    b = BPF(text=BPF_PROGRAM)
    print("eBPF program loaded. Tracing block I/O...")
    start_http_server(9101)
    print("Prometheus metrics available at :9101/metrics")
    while True:
        try:
            collect_and_export(b)
            time.sleep(5)  # Collect every 5s, Prometheus scrapes every 15s
        except KeyboardInterrupt:
            break
    print("Exiting.")
PYEOF
chmod +x /usr/local/bin/ebpf-diskio-exporter
systemd unit
cat > /etc/systemd/system/ebpf-diskio-exporter.service << 'EOF'
[Unit]
Description=eBPF Disk I/O Latency Exporter for Prometheus
After=network.target
Wants=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/ebpf-diskio-exporter
Restart=on-failure
RestartSec=5
# BPF requires CAP_SYS_ADMIN or CAP_BPF (kernel 5.8+)
AmbientCapabilities=CAP_SYS_ADMIN CAP_BPF CAP_PERFMON
NoNewPrivileges=no
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
ReadOnlyPaths=/
# Allow access to kernel debug/trace filesystem
ReadWritePaths=/sys/kernel/debug /sys/kernel/tracing
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now ebpf-diskio-exporter
Prometheus scrape config
# Add to /etc/prometheus/prometheus.yml under scrape_configs:
scrape_configs:
  - job_name: 'ebpf-diskio'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9101']
        labels:
          instance: 'kldload-node-01'
Grafana query
# Heatmap panel — disk I/O latency distribution over time
histogram_quantile(0.99, rate(ebpf_disk_io_latency_microseconds_bucket{device="sda"}[5m]))
# Per-operation p50 / p99
histogram_quantile(0.50, rate(ebpf_disk_io_latency_microseconds_bucket{operation="read"}[5m]))
histogram_quantile(0.99, rate(ebpf_disk_io_latency_microseconds_bucket{operation="write"}[5m]))
# I/O rate by device (operations per second)
rate(ebpf_disk_io_latency_microseconds_count{device="sda"}[5m])
BCC compiles the embedded C to BPF bytecode at startup, using LLVM in userspace, which means the target machine needs kernel headers installed. On kldloadOS this is handled automatically. On a stock distro you need kernel-devel (RPM) or linux-headers-$(uname -r) (APT). If you can't install headers, use the bpftrace + textfile collector pattern below.
Pattern 2: bpftrace + node_exporter textfile collector
Sometimes you don't need a full daemon. You just want one metric — say, TCP retransmit rate — and you already have node_exporter running. The textfile collector pattern is perfect: run bpftrace on a timer, write Prometheus-formatted text to a file, and node_exporter picks it up automatically.
When to use bpftrace + textfile
Use this when you want a single metric or small set of metrics without running another daemon. bpftrace runs periodically (via systemd timer or cron), traces for a short window, writes results to /var/lib/node_exporter/textfile_collector/, and exits. node_exporter serves the file contents as if they were its own metrics. Zero additional ports. Zero additional scrape targets.
TCP retransmit rate exporter
# Create the textfile collector directory (node_exporter reads from here)
mkdir -p /var/lib/node_exporter/textfile_collector
# Write the bpftrace script
cat > /usr/local/bin/ebpf-tcp-retransmit-collector << 'BPFEOF'
#!/bin/bash
# Trace TCP retransmits for 10 seconds, output Prometheus metrics.
# Run via systemd timer every 60 seconds.
OUTFILE="/var/lib/node_exporter/textfile_collector/tcp_retransmit.prom"
TMPFILE="${OUTFILE}.tmp"
# bpftrace: count retransmits by destination address for 10 seconds
RESULT=$(bpftrace -e '
tracepoint:tcp:tcp_retransmit_skb {
@retransmits[ntop(args->daddr)] = count();
@total = count();
}
interval:s:10 { exit(); }
END {
printf("# HELP ebpf_tcp_retransmits_total TCP retransmits observed in sampling window\n");
printf("# TYPE ebpf_tcp_retransmits_total counter\n");
print(@total);
printf("# HELP ebpf_tcp_retransmits_by_dest TCP retransmits by destination IP\n");
printf("# TYPE ebpf_tcp_retransmits_by_dest counter\n");
print(@retransmits);
clear(@retransmits);
clear(@total);
}
' 2>/dev/null)
# Parse bpftrace output into Prometheus format
{
echo "# HELP ebpf_tcp_retransmits_total TCP retransmits in last sampling window"
echo "# TYPE ebpf_tcp_retransmits_total gauge"
TOTAL=$(echo "$RESULT" | grep '@total:' | awk '{print $2}')
[ -z "$TOTAL" ] && TOTAL=0
echo "ebpf_tcp_retransmits_total ${TOTAL}"
echo "# HELP ebpf_tcp_retransmits_by_dest TCP retransmits by destination address"
echo "# TYPE ebpf_tcp_retransmits_by_dest gauge"
echo "$RESULT" | grep '@retransmits\[' | while read -r line; do
DEST=$(echo "$line" | sed 's/.*\[\(.*\)\].*/\1/')
COUNT=$(echo "$line" | awk '{print $NF}')
echo "ebpf_tcp_retransmits_by_dest{destination=\"${DEST}\"} ${COUNT}"
done
} > "$TMPFILE"
# Atomic rename so node_exporter never reads a partial file
mv "$TMPFILE" "$OUTFILE"
BPFEOF
chmod +x /usr/local/bin/ebpf-tcp-retransmit-collector
systemd timer
# The service that runs the collector
cat > /etc/systemd/system/ebpf-tcp-retransmit.service << 'EOF'
[Unit]
Description=eBPF TCP Retransmit Collector (textfile for node_exporter)
[Service]
Type=oneshot
ExecStart=/usr/local/bin/ebpf-tcp-retransmit-collector
EOF
# The timer that fires every 60 seconds
cat > /etc/systemd/system/ebpf-tcp-retransmit.timer << 'EOF'
[Unit]
Description=Run eBPF TCP retransmit collector every 60s
[Timer]
OnBootSec=30
OnUnitActiveSec=60
AccuracySec=5
[Install]
WantedBy=timers.target
EOF
systemctl daemon-reload
systemctl enable --now ebpf-tcp-retransmit.timer
node_exporter configuration
# node_exporter enables the textfile collector by default, but it exports
# nothing from it until a directory is configured
# If running manually:
node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
# In the systemd unit, add to ExecStart:
# --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
Now curl localhost:9100/metrics | grep ebpf_tcp shows your eBPF-collected retransmit metrics alongside all the standard node_exporter metrics. No additional Prometheus scrape target needed.
$ curl -s localhost:9100/metrics | grep ebpf_tcp
# HELP ebpf_tcp_retransmits_total TCP retransmits in last sampling window
# TYPE ebpf_tcp_retransmits_total gauge
ebpf_tcp_retransmits_total 17
# HELP ebpf_tcp_retransmits_by_dest TCP retransmits by destination address
# TYPE ebpf_tcp_retransmits_by_dest gauge
ebpf_tcp_retransmits_by_dest{destination="10.0.0.5"} 12
ebpf_tcp_retransmits_by_dest{destination="10.0.0.9"} 5
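The parse-and-rename step of the collector can also be sketched in Python, which is easier to unit test than the shell pipeline. This is an illustration under assumptions: `to_prometheus` and `write_atomic` are hypothetical helpers, and the input format is the `@name[key]: value` map output that the shell script above already relies on.

```python
import os
import re
import tempfile

# Matches bpftrace map output such as "@total: 17" or "@retransmits[10.0.0.5]: 12".
MAP_LINE = re.compile(r"^@(\w+)(?:\[(.+)\])?:\s+(\d+)$")


def to_prometheus(bpftrace_output):
    """Convert printed bpftrace maps into Prometheus text exposition."""
    total = 0
    by_dest = {}
    for line in bpftrace_output.splitlines():
        match = MAP_LINE.match(line.strip())
        if not match:
            continue  # skip "Attaching N probes..." and blank lines
        name, key, value = match.group(1), match.group(2), int(match.group(3))
        if name == "total":
            total = value
        elif name == "retransmits" and key:
            by_dest[key] = value
    lines = [
        "# HELP ebpf_tcp_retransmits_total TCP retransmits in last sampling window",
        "# TYPE ebpf_tcp_retransmits_total gauge",
        f"ebpf_tcp_retransmits_total {total}",
        "# HELP ebpf_tcp_retransmits_by_dest TCP retransmits by destination address",
        "# TYPE ebpf_tcp_retransmits_by_dest gauge",
    ]
    for dest, count in sorted(by_dest.items()):
        lines.append(f'ebpf_tcp_retransmits_by_dest{{destination="{dest}"}} {count}')
    return "\n".join(lines) + "\n"


def write_atomic(path, text):
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp_path, 0o644)   # mkstemp creates 0600; node_exporter must read it
    os.replace(tmp_path, path)  # atomic on the same filesystem


sample = (
    "Attaching 3 probes...\n"
    "@total: 17\n"
    "@retransmits[10.0.0.5]: 12\n"
    "@retransmits[10.0.0.9]: 5\n"
)
print(to_prometheus(sample), end="")
```

The temp file must live in the same directory as the target so the rename stays on one filesystem, and it must not end in `.prom` or node_exporter could pick up a half-written file.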
Pattern 3: libbpf C + Custom HTTP Server
When you need maximum performance — sub-microsecond overhead per event, zero runtime compilation, minimal memory — go native. libbpf loads pre-compiled eBPF bytecode (no kernel headers needed at runtime), and you write a small C program that reads BPF maps and serves HTTP.
When to use libbpf C
Use this when you're building a distributable binary that must run across kernels without headers installed, or when you need the absolute minimum overhead. The eBPF program is compiled once with clang -target bpf into a .o file, and libbpf relocates it against the running kernel's BTF (BPF Type Format) at load time. No runtime compilation. No Python. Just one binary with the BPF object embedded through the skeleton header.
Architecture
┌─────────────────────────────────────────────────────────┐
│  Build time (your dev machine)                          │
│                                                         │
│  scheduler_latency.bpf.c ──clang -target bpf──► scheduler_latency.bpf.o
│  scheduler_latency.bpf.o ──bpftool gen skeleton──► scheduler_latency.skel.h
│  scheduler_latency.c + scheduler_latency.skel.h ──gcc──► scheduler_latency (binary)
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│  Runtime (target machine, no headers needed)            │
│                                                         │
│  scheduler_latency (binary)                             │
│  ├── loads the embedded BPF bytecode via libbpf         │
│  ├── attaches to sched_wakeup and sched_switch          │
│  ├── reads the BPF histogram map on each scrape         │
│  └── serves :9104/metrics (libmicrohttpd)               │
└─────────────────────────────────────────────────────────┘
eBPF kernel program (scheduler_latency.bpf.c)
cat > scheduler_latency.bpf.c << 'EOF'
// SPDX-License-Identifier: GPL-2.0
// Trace scheduler run queue latency — time between task wakeup and getting a CPU.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#define MAX_SLOTS 32
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 65536);
__type(key, u32); // pid
__type(value, u64); // wakeup timestamp
} start SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__uint(max_entries, MAX_SLOTS);
__type(key, u32);
__type(value, u64);
} hist SEC(".maps");
SEC("tp_btf/sched_wakeup")
int BPF_PROG(sched_wakeup, struct task_struct *p) {
u32 pid = p->pid;
u64 ts = bpf_ktime_get_ns();
bpf_map_update_elem(&start, &pid, &ts, BPF_ANY);
return 0;
}
SEC("tp_btf/sched_switch")
int BPF_PROG(sched_switch, bool preempt,
struct task_struct *prev, struct task_struct *next) {
u32 pid = next->pid;
u64 *tsp = bpf_map_lookup_elem(&start, &pid);
if (!tsp) return 0;
u64 delta = bpf_ktime_get_ns() - *tsp;
bpf_map_delete_elem(&start, &pid);
// Log2 histogram in microseconds
u64 us = delta / 1000;
u32 slot = 0;
while (us > 0 && slot < MAX_SLOTS - 1) {
us >>= 1;
slot++;
}
u64 *count = bpf_map_lookup_elem(&hist, &slot);
if (count)
__sync_fetch_and_add(count, 1);
return 0;
}
char LICENSE[] SEC("license") = "GPL";
EOF
Userspace loader + HTTP server (scheduler_latency.c)
cat > scheduler_latency.c << 'EOF'
/* Scheduler latency exporter — reads BPF histogram, serves Prometheus metrics.
* Compile: gcc -o scheduler_latency scheduler_latency.c -lbpf -lmicrohttpd
* Requires: libbpf-devel libmicrohttpd-devel
*/
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <signal.h>
#include <unistd.h>
#include <bpf/libbpf.h>
#include <bpf/bpf.h>
#include <microhttpd.h>
#include "scheduler_latency.skel.h"
#define PORT 9104
#define MAX_SLOTS 32
static volatile int running = 1;
static struct scheduler_latency_bpf *skel;
static void sig_handler(int sig) { running = 0; }
static enum MHD_Result metrics_handler(void *cls,
struct MHD_Connection *conn, const char *url,
const char *method, const char *version,
const char *upload_data, size_t *upload_data_size,
void **con_cls) {
char buf[8192];
int off = 0;
int hist_fd = bpf_map__fd(skel->maps.hist);
off += snprintf(buf + off, sizeof(buf) - off,
"# HELP ebpf_sched_runqueue_latency_us Scheduler run queue latency histogram\n"
"# TYPE ebpf_sched_runqueue_latency_us histogram\n");
uint64_t total_count = 0;
uint64_t total_sum = 0;
for (uint32_t i = 0; i < MAX_SLOTS; i++) {
uint64_t count = 0;
bpf_map_lookup_elem(hist_fd, &i, &count);
total_count += count;
uint64_t bucket_us = (uint64_t)1 << i;
total_sum += count * bucket_us;
off += snprintf(buf + off, sizeof(buf) - off,
"ebpf_sched_runqueue_latency_us_bucket{le=\"%lu\"} %lu\n",
bucket_us, total_count);
}
off += snprintf(buf + off, sizeof(buf) - off,
"ebpf_sched_runqueue_latency_us_bucket{le=\"+Inf\"} %lu\n"
"ebpf_sched_runqueue_latency_us_sum %lu\n"
"ebpf_sched_runqueue_latency_us_count %lu\n",
total_count, total_sum, total_count);
struct MHD_Response *resp = MHD_create_response_from_buffer(
off, buf, MHD_RESPMEM_MUST_COPY);
MHD_add_response_header(resp, "Content-Type",
"text/plain; version=0.0.4; charset=utf-8");
enum MHD_Result ret = MHD_queue_response(conn, MHD_HTTP_OK, resp);
MHD_destroy_response(resp);
return ret;
}
int main(int argc, char **argv) {
signal(SIGINT, sig_handler);
signal(SIGTERM, sig_handler);
skel = scheduler_latency_bpf__open_and_load();
if (!skel) {
fprintf(stderr, "Failed to load BPF skeleton\n");
return 1;
}
if (scheduler_latency_bpf__attach(skel)) {
fprintf(stderr, "Failed to attach BPF programs\n");
scheduler_latency_bpf__destroy(skel);
return 1;
}
struct MHD_Daemon *httpd = MHD_start_daemon(
MHD_USE_INTERNAL_POLLING_THREAD, PORT, NULL, NULL,
&metrics_handler, NULL, MHD_OPTION_END);
if (!httpd) {
fprintf(stderr, "Failed to start HTTP server on port %d\n", PORT);
scheduler_latency_bpf__destroy(skel);
return 1;
}
printf("Scheduler latency exporter running on :%d/metrics\n", PORT);
while (running) sleep(1);
MHD_stop_daemon(httpd);
scheduler_latency_bpf__destroy(skel);
return 0;
}
EOF
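The bucket loop in the handler above follows the Prometheus rule that histogram buckets are cumulative: each `le` line counts every observation at or below that bound. The same logic, sketched in Python for readability. `render_histogram` is a hypothetical helper, and `_sum` is an approximation built from bucket upper bounds, as in the C code.

```python
def render_histogram(name, slot_counts):
    """Render per-slot log2 counts as cumulative Prometheus histogram text."""
    lines = [f"# TYPE {name} histogram"]
    cumulative = 0
    approx_sum = 0
    for slot, count in enumerate(slot_counts):
        upper = 1 << slot              # upper bound of this log2 slot
        cumulative += count            # buckets are running totals
        approx_sum += count * upper    # true values are at or below upper
        lines.append(f'{name}_bucket{{le="{upper}"}} {cumulative}')
    lines.append(f'{name}_bucket{{le="+Inf"}} {cumulative}')
    lines.append(f"{name}_sum {approx_sum}")
    lines.append(f"{name}_count {cumulative}")
    return "\n".join(lines) + "\n"


# Four slots: 1 event under 1us, 2 under 2us, none under 4us, 3 under 8us.
print(render_histogram("demo_latency_us", [1, 2, 0, 3]), end="")
```

Because the map is never cleared, the counts only grow, which is exactly what Prometheus expects from histogram series.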
Build commands
# Generate vmlinux.h from running kernel's BTF
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
# Compile eBPF program to BPF bytecode
clang -g -O2 -target bpf -D__TARGET_ARCH_x86 \
-c scheduler_latency.bpf.c -o scheduler_latency.bpf.o
# Generate skeleton header
bpftool gen skeleton scheduler_latency.bpf.o > scheduler_latency.skel.h
# Compile userspace program
gcc -Wall -O2 -o scheduler_latency scheduler_latency.c \
-lbpf -lmicrohttpd -lelf -lz
The libbpf + CO-RE (Compile Once, Run Everywhere) approach means your binary runs without recompilation on any kernel that ships BTF and supports the program types you use; the tp_btf attachments in this example need kernel 5.5 or newer. The skeleton header embeds the BPF bytecode directly into the userspace binary. No external .o file to ship. This is the same approach the libbpf-tools collection in the BCC repository uses.
Key Metrics — Full Working Exporters
The following are complete, working BCC Python exporters for the metrics that matter most in production. Each one is a standalone file you can drop into /usr/local/bin/ and run.
Syscall rate by type
Which syscalls are hot on your system? Is it all read and write? Lots of futex (lock contention)? Unexpected clone storms (fork bombs)? This exporter counts every syscall by number and exposes one counter per syscall; use topk(20, ...) in PromQL to see the top 20.
cat > /usr/local/bin/ebpf-syscall-exporter << 'PYEOF'
#!/usr/bin/env python3
"""eBPF Syscall Rate Exporter — counts syscalls by type, per second."""
from bcc import BPF
from prometheus_client import start_http_server, Counter, Info
import time, signal, sys
BPF_PROGRAM = r"""
#include <uapi/linux/ptrace.h>
BPF_HASH(syscall_count, u32, u64);
TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
u32 id = args->id;
u64 *count = syscall_count.lookup_or_try_init(&id, &(u64){0});
if (count)
__sync_fetch_and_add(count, 1);
return 0;
}
"""
# Map of common syscall numbers to names (x86_64)
SYSCALL_NAMES = {
0: "read", 1: "write", 2: "open", 3: "close", 4: "stat",
5: "fstat", 6: "lstat", 7: "poll", 8: "lseek", 9: "mmap",
10: "mprotect", 11: "munmap", 12: "brk", 13: "rt_sigaction",
14: "rt_sigprocmask", 16: "ioctl", 17: "pread64", 18: "pwrite64",
19: "readv", 20: "writev", 21: "access", 23: "select",
24: "sched_yield", 28: "madvise", 32: "dup", 33: "dup2",
35: "nanosleep", 39: "getpid", 41: "socket", 42: "connect",
43: "accept", 44: "sendto", 45: "recvfrom", 46: "sendmsg",
47: "recvmsg", 56: "clone", 57: "fork", 59: "execve",
60: "exit", 62: "kill", 72: "fcntl", 78: "getdents",
79: "getcwd", 80: "chdir", 89: "readlink", 202: "futex",
228: "clock_gettime", 230: "clock_nanosleep", 257: "openat",
262: "newfstatat", 281: "epoll_pwait", 288: "accept4",
290: "eventfd2", 291: "epoll_create1", 292: "dup3", 302: "prlimit64",
318: "getrandom", 332: "statx", 435: "clone3",
439: "faccessat2", 441: "epoll_pwait2",
}
syscall_counter = Counter(
'ebpf_syscalls_total',
'Total syscall count by syscall name',
['syscall']
)
def collect(bpf_obj):
    sc = bpf_obj["syscall_count"]
    for k, v in sc.items():
        name = SYSCALL_NAMES.get(k.value, f"sys_{k.value}")
        syscall_counter.labels(syscall=name).inc(v.value)
    sc.clear()

if __name__ == '__main__':
    signal.signal(signal.SIGTERM, lambda s, f: sys.exit(0))
    b = BPF(text=BPF_PROGRAM)
    start_http_server(9105)
    print("Syscall rate exporter on :9105/metrics")
    while True:
        collect(b)
        time.sleep(5)
PYEOF
chmod +x /usr/local/bin/ebpf-syscall-exporter
Sample output:
$ curl -s localhost:9105/metrics | grep ebpf_syscalls_total | head -15
ebpf_syscalls_total{syscall="read"} 487293
ebpf_syscalls_total{syscall="write"} 312847
ebpf_syscalls_total{syscall="futex"} 198432
ebpf_syscalls_total{syscall="epoll_pwait"} 156721
ebpf_syscalls_total{syscall="openat"} 89234
ebpf_syscalls_total{syscall="close"} 87112
ebpf_syscalls_total{syscall="newfstatat"} 45623
ebpf_syscalls_total{syscall="mmap"} 34521
ebpf_syscalls_total{syscall="clock_gettime"} 29877
ebpf_syscalls_total{syscall="recvfrom"} 23456
ebpf_syscalls_total{syscall="sendto"} 21098
ebpf_syscalls_total{syscall="ioctl"} 18234
ebpf_syscalls_total{syscall="mprotect"} 12876
ebpf_syscalls_total{syscall="brk"} 8943
ebpf_syscalls_total{syscall="clone3"} 234
What to look for
futex dominating: your workload is lock-contended. Look at the processes doing the most futex calls and check for mutex bottlenecks.
clone/clone3 spikes: something is forking aggressively. Could be a fork bomb, a misbehaving cron job, or a PHP-FPM pool that's too small and keeps spawning workers.
openat/close churn: a process is opening and closing files in a tight loop. Common with log rotation gone wrong or misconfigured inotify watches.
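To spot these patterns you compare counter snapshots rather than raw totals. A minimal sketch of that arithmetic, roughly what `rate()` and `topk()` do in PromQL; the function names and the sample numbers are made up for illustration.

```python
def syscall_rates(prev, cur, interval_s):
    """Per-second rate for each syscall between two counter snapshots."""
    rates = {}
    for name, value in cur.items():
        delta = value - prev.get(name, 0)
        if delta < 0:          # counter went backwards: exporter restarted
            delta = value
        rates[name] = delta / interval_s
    return rates


def top_shares(rates, n=3):
    """Top n syscalls with their share of the total rate."""
    total = sum(rates.values())
    ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)[:n]
    return [(name, rate, rate / total if total else 0.0) for name, rate in ranked]


# Two hypothetical scrapes taken 15 seconds apart.
prev = {"read": 1000, "write": 800, "futex": 500, "clone3": 10}
cur = {"read": 2500, "write": 1700, "futex": 8000, "clone3": 12}

for name, rate, share in top_shares(syscall_rates(prev, cur, 15)):
    print(f"{name:8s} {rate:8.1f}/s {share:6.1%}")
```

In this made-up sample futex accounts for about three quarters of all syscalls, which is the "futex dominating" pattern described above.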
Page cache hit rate (cachestat equivalent)
The page cache is the single most important performance feature in Linux. Every file read goes through it. A 99% hit rate means 99% of your reads come from RAM at memory speed. A 90% hit rate means 10x more disk I/O. This exporter gives you real-time page cache hit/miss rates.
cat > /usr/local/bin/ebpf-cachestat-exporter << 'PYEOF'
#!/usr/bin/env python3
"""eBPF Page Cache Hit Rate Exporter — tracks add/hit/miss/dirty events."""
from bcc import BPF
from prometheus_client import start_http_server, Counter, Gauge
import time, signal, sys
BPF_PROGRAM = r"""
#include <uapi/linux/ptrace.h>
BPF_ARRAY(stats, u64, 4); // 0=hits, 1=misses, 2=adds, 3=dirties
static __always_inline void inc_stat(int idx) {
u32 key = idx;
u64 *val = stats.lookup(&key);
if (val)
__sync_fetch_and_add(val, 1);
}
// Page was found in cache (hit)
int trace_hit(struct pt_regs *ctx) {
    inc_stat(0);
    return 0;
}

// Page added to cache (a miss that triggered a read)
int trace_miss(struct pt_regs *ctx) {
    inc_stat(1);
    inc_stat(2);
    return 0;
}

// Page was dirtied (write)
int trace_dirty(struct pt_regs *ctx) {
    inc_stat(3);
    return 0;
}
"""

# Kernel functions to probe for each handler, newest name first. Folio-based
# names replaced the page-based ones around kernel 5.16, and the old names may
# survive as wrappers, so attach only to the first candidate that exists.
# Handlers are not named kprobe__*, because BCC auto-attaches those and fails
# at load time when the target function is missing on the running kernel.
PROBES = {
    b"trace_hit": [b"folio_mark_accessed", b"mark_page_accessed"],
    b"trace_miss": [b"filemap_add_folio", b"add_to_page_cache_lru"],
    b"trace_dirty": [b"folio_account_dirtied", b"account_page_dirtied"],
}

cache_hits = Counter('ebpf_page_cache_hits_total', 'Page cache hits')
cache_misses = Counter('ebpf_page_cache_misses_total', 'Page cache misses')
cache_adds = Counter('ebpf_page_cache_adds_total', 'Pages added to cache')
cache_dirties = Counter('ebpf_page_cache_dirties_total', 'Pages dirtied')
cache_hit_rate = Gauge('ebpf_page_cache_hit_ratio', 'Page cache hit ratio (0.0-1.0)')

prev = [0, 0, 0, 0]

def attach_probes(bpf_obj):
    """Attach each handler to the first matching function on this kernel."""
    for fn_name, candidates in PROBES.items():
        for event in candidates:
            if BPF.get_kprobe_functions(b"^" + event + b"$"):
                bpf_obj.attach_kprobe(event=event, fn_name=fn_name)
                break
        else:
            print(f"No probe target found for {fn_name.decode()}; metric stays at 0")

def collect(bpf_obj):
    global prev
    stats = bpf_obj["stats"]
    cur = [stats[i].value for i in range(4)]
    deltas = [cur[i] - prev[i] for i in range(4)]
    prev = cur[:]
    if deltas[0] > 0:
        cache_hits.inc(deltas[0])
    if deltas[1] > 0:
        cache_misses.inc(deltas[1])
    if deltas[2] > 0:
        cache_adds.inc(deltas[2])
    if deltas[3] > 0:
        cache_dirties.inc(deltas[3])
    total = deltas[0] + deltas[1]
    if total > 0:
        cache_hit_rate.set(deltas[0] / total)
    else:
        cache_hit_rate.set(1.0)

if __name__ == '__main__':
    signal.signal(signal.SIGTERM, lambda s, f: sys.exit(0))
    b = BPF(text=BPF_PROGRAM)
    attach_probes(b)
    start_http_server(9106)
    print("Page cache hit rate exporter on :9106/metrics")
    while True:
        collect(b)
        time.sleep(5)
PYEOF
chmod +x /usr/local/bin/ebpf-cachestat-exporter
Sample output:
$ curl -s localhost:9106/metrics | grep ebpf_page_cache
# HELP ebpf_page_cache_hits_total Page cache hits
# TYPE ebpf_page_cache_hits_total counter
ebpf_page_cache_hits_total 1.247832e+06
# HELP ebpf_page_cache_misses_total Page cache misses
# TYPE ebpf_page_cache_misses_total counter
ebpf_page_cache_misses_total 12847.0
# HELP ebpf_page_cache_hit_ratio Page cache hit ratio (0.0-1.0)
# TYPE ebpf_page_cache_hit_ratio gauge
ebpf_page_cache_hit_ratio 0.9897
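The hit ratio arithmetic is simple enough to check by hand. A short sketch with hypothetical helper names, showing why the drop from 99% to 90% means ten times the disk reads for the same number of lookups:

```python
def hit_ratio(hits, misses):
    """Fraction of lookups served from the page cache; 1.0 when idle."""
    total = hits + misses
    return hits / total if total else 1.0


def misses_for(lookups, ratio):
    """Lookups that fall through to disk at a given hit ratio."""
    return round(lookups * (1 - ratio))


lookups = 1_000_000
at_99 = misses_for(lookups, 0.99)
at_90 = misses_for(lookups, 0.90)
print(at_99, at_90, at_90 / at_99)
```

The idle case returns 1.0 rather than dividing by zero, matching the exporter's choice to report a full hit ratio when no lookups happened in the interval.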
Scheduler run queue latency
Are your CPUs saturated? The answer isn't CPU utilization — a box at 70% utilization with 50ms run queue latency is overloaded, while a box at 95% with 200μs latency is fine. Run queue latency measures how long a task waits between being woken up and actually getting a CPU. This is the metric that tells you whether you need more cores.
cat > /usr/local/bin/ebpf-schedlat-exporter << 'PYEOF'
#!/usr/bin/env python3
"""eBPF Scheduler Run Queue Latency Exporter."""
from bcc import BPF
from prometheus_client import start_http_server, Histogram
import time, signal, sys
BPF_PROGRAM = r"""
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
BPF_HASH(start_ts, u32, u64);
BPF_HISTOGRAM(runq_lat);
RAW_TRACEPOINT_PROBE(sched_wakeup) {
struct task_struct *p = (struct task_struct *)ctx->args[0];
u32 pid = p->pid;
u64 ts = bpf_ktime_get_ns();
start_ts.update(&pid, &ts);
return 0;
}
RAW_TRACEPOINT_PROBE(sched_wakeup_new) {
struct task_struct *p = (struct task_struct *)ctx->args[0];
u32 pid = p->pid;
u64 ts = bpf_ktime_get_ns();
start_ts.update(&pid, &ts);
return 0;
}
RAW_TRACEPOINT_PROBE(sched_switch) {
// args: [bool preempt, struct task_struct *prev, struct task_struct *next]
struct task_struct *next = (struct task_struct *)ctx->args[2];
u32 pid = next->pid;
u64 *tsp = start_ts.lookup(&pid);
if (!tsp) return 0;
u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
runq_lat.increment(bpf_log2l(delta_us));
start_ts.delete(&pid);
return 0;
}
"""
BUCKETS = tuple(2**i for i in range(21)) # 1us to ~1s
sched_latency = Histogram(
'ebpf_sched_runqueue_latency_microseconds',
'Scheduler run queue latency in microseconds',
buckets=BUCKETS
)
def collect(bpf_obj):
    hist = bpf_obj["runq_lat"]
    for k, v in hist.items():
        bucket_us = 1 << k.value
        for _ in range(v.value):
            sched_latency.observe(bucket_us)
    hist.clear()

if __name__ == '__main__':
    signal.signal(signal.SIGTERM, lambda s, f: sys.exit(0))
    b = BPF(text=BPF_PROGRAM)
    start_http_server(9107)
    print("Scheduler latency exporter on :9107/metrics")
    while True:
        collect(b)
        time.sleep(5)
PYEOF
chmod +x /usr/local/bin/ebpf-schedlat-exporter
Grafana query for run queue latency heatmap:
# p50 / p99 / p999 scheduler latency
histogram_quantile(0.50, rate(ebpf_sched_runqueue_latency_microseconds_bucket[5m]))
histogram_quantile(0.99, rate(ebpf_sched_runqueue_latency_microseconds_bucket[5m]))
histogram_quantile(0.999, rate(ebpf_sched_runqueue_latency_microseconds_bucket[5m]))
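# Percentage of scheduling events slower than ~8ms (overload signal).
# The le value must match an exported bucket bound, and the division
# needs ignoring(le) because _count carries no le label.
(
  1 - (
    rate(ebpf_sched_runqueue_latency_microseconds_bucket{le="8192.0"}[5m])
    / ignoring(le) rate(ebpf_sched_runqueue_latency_microseconds_count[5m])
  )
) * 100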
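For intuition about what histogram_quantile does with these buckets, here is a simplified sketch of the interpolation. It assumes cumulative `(upper_bound, count)` pairs and linear interpolation inside the matching bucket; `quantile_from_buckets` is an illustrative name, and Prometheus handles more edge cases than this.

```python
import math


def quantile_from_buckets(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) pairs.

    buckets must be sorted by bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return lower_bound      # cannot interpolate into +Inf
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (bound - lower_bound) * fraction
        lower_bound, lower_count = bound, count
    return lower_bound


# Hypothetical run queue latency buckets in microseconds.
buckets = [(1, 0), (2, 50), (4, 90), (8, 100), (math.inf, 100)]
print(quantile_from_buckets(0.50, buckets))
print(quantile_from_buckets(0.99, buckets))
```

With log2 buckets the estimate can be off by up to a factor of two inside a bucket, which is worth remembering when you set alert thresholds on p99.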
Memory allocation rate and page faults
cat > /usr/local/bin/ebpf-memory-exporter << 'PYEOF'
#!/usr/bin/env python3
"""eBPF Memory Allocation & Page Fault Rate Exporter."""
from bcc import BPF
from prometheus_client import start_http_server, Counter, Histogram
import time, signal, sys
BPF_PROGRAM = r"""
#include <uapi/linux/ptrace.h>
BPF_ARRAY(fault_stats, u64, 2); // 0=all user faults, 1=major faults
BPF_HISTOGRAM(alloc_sizes);

// Track user-space page faults. This tracepoint is x86-specific and fires
// for minor and major faults alike, so slot 0 holds the total.
TRACEPOINT_PROBE(exceptions, page_fault_user) {
    u32 key = 0;  // all user faults
    u64 *val = fault_stats.lookup(&key);
    if (val) __sync_fetch_and_add(val, 1);
    return 0;
}

// Track mm_page_alloc (every physical page allocation), bucketed by order
TRACEPOINT_PROBE(kmem, mm_page_alloc) {
    unsigned int order = args->order;
    alloc_sizes.increment(order);
    return 0;
}

// Track major faults (disk reads for paged-out memory) from the return
// value of handle_mm_fault; no entry probe is needed.
int kretprobe__handle_mm_fault(struct pt_regs *ctx) {
int ret = PT_REGS_RC(ctx);
// VM_FAULT_MAJOR = 0x0004
if (ret & 0x0004) {
u32 key = 1;
u64 *val = fault_stats.lookup(&key);
if (val) __sync_fetch_and_add(val, 1);
}
return 0;
}
"""
minor_faults = Counter('ebpf_page_faults_minor_total', 'Minor page faults')
major_faults = Counter('ebpf_page_faults_major_total', 'Major page faults (disk I/O)')
page_allocs = Counter('ebpf_page_allocations_total', 'Physical page allocations')
alloc_order_hist = Histogram(
    'ebpf_page_alloc_order',
    'Page allocation order (log2 of pages)',
    buckets=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
prev = [0, 0]

def collect(bpf_obj):
    global prev
    stats = bpf_obj["fault_stats"]
    cur = [stats[i].value for i in range(2)]
    deltas = [cur[i] - prev[i] for i in range(2)]
    prev = cur[:]
    if deltas[0] > 0:
        minor_faults.inc(deltas[0])
    if deltas[1] > 0:
        major_faults.inc(deltas[1])
    hist = bpf_obj["alloc_sizes"]
    for k, v in hist.items():
        for _ in range(v.value):
            alloc_order_hist.observe(k.value)
        page_allocs.inc(v.value)
    hist.clear()

if __name__ == '__main__':
    signal.signal(signal.SIGTERM, lambda s, f: sys.exit(0))
    b = BPF(text=BPF_PROGRAM)
    start_http_server(9108)
    print("Memory allocation exporter on :9108/metrics")
    while True:
        collect(b)
        time.sleep(5)
PYEOF
chmod +x /usr/local/bin/ebpf-memory-exporter
Sample output:
$ curl -s localhost:9108/metrics | grep ebpf_page
ebpf_page_faults_minor_total 2.34721e+06
ebpf_page_faults_major_total 127.0
ebpf_page_allocations_total 1.87234e+06
ebpf_page_alloc_order_bucket{le="0.0"} 1654321
ebpf_page_alloc_order_bucket{le="1.0"} 1698234
ebpf_page_alloc_order_bucket{le="2.0"} 1712098
Why major faults matter
A minor fault means the kernel could resolve the fault from memory: it maps a page that is already resident (or zero-fills a fresh one) and updates the page table, with no disk I/O. A major fault means the kernel had to read data from disk because the page was swapped out or belonged to a memory-mapped file that wasn't cached. Major faults are disk I/O in disguise. If you're seeing major faults on a system that shouldn't be swapping, you've found the performance bug.
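The major/minor split in the exporter comes down to one bit in the value returned by handle_mm_fault. A small sketch of that check in plain Python, assuming the VM_FAULT_MAJOR value of 0x0004 used in the probe above; the function names are hypothetical:

```python
VM_FAULT_MAJOR = 0x0004  # same constant the kretprobe tests


def classify_fault(ret):
    """Return 'major' if the fault needed disk I/O, else 'minor'.

    Error bits in the return value are not distinguished here; anything
    without the major bit is treated as minor, as in the exporter.
    """
    return "major" if ret & VM_FAULT_MAJOR else "minor"


def major_fault_share(returns):
    """Fraction of fault return values that carry the major bit."""
    if not returns:
        return 0.0
    majors = sum(1 for r in returns if classify_fault(r) == "major")
    return majors / len(returns)
```

On a healthy, non-swapping box the share should sit close to zero; a sustained rise is the signal worth alerting on.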
File open/read/write rates by process
cat > /usr/local/bin/ebpf-fileops-exporter << 'PYEOF'
#!/usr/bin/env python3
"""eBPF File Operations Rate Exporter — tracks open/read/write by process."""
from bcc import BPF
from prometheus_client import start_http_server, Counter
import time, signal, sys
BPF_PROGRAM = r"""
#include
typedef struct {
char comm[16];
u8 op; // 0=open, 1=read, 2=write
} key_t;
BPF_HASH(fileops, key_t, u64);
static __always_inline void record_op(u8 op) {
key_t k = {};
bpf_get_current_comm(&k.comm, sizeof(k.comm));
k.op = op;
u64 *val = fileops.lookup_or_try_init(&k, &(u64){0});
if (val) __sync_fetch_and_add(val, 1);
}
TRACEPOINT_PROBE(syscalls, sys_enter_openat) { record_op(0); return 0; }
TRACEPOINT_PROBE(syscalls, sys_enter_read) { record_op(1); return 0; }
TRACEPOINT_PROBE(syscalls, sys_enter_write) { record_op(2); return 0; }
TRACEPOINT_PROBE(syscalls, sys_enter_pread64) { record_op(1); return 0; }
TRACEPOINT_PROBE(syscalls, sys_enter_pwrite64) { record_op(2); return 0; }
TRACEPOINT_PROBE(syscalls, sys_enter_readv) { record_op(1); return 0; }
TRACEPOINT_PROBE(syscalls, sys_enter_writev) { record_op(2); return 0; }
"""
OPS = {0: "open", 1: "read", 2: "write"}
file_ops = Counter(
    'ebpf_file_ops_total',
    'File operations by process and type',
    ['process', 'operation']
)

def collect(bpf_obj):
    fops = bpf_obj["fileops"]
    for k, v in fops.items():
        comm = k.comm.decode('utf-8', errors='replace').rstrip('\x00')
        op = OPS.get(k.op, "unknown")
        file_ops.labels(process=comm, operation=op).inc(v.value)
    fops.clear()

if __name__ == '__main__':
    signal.signal(signal.SIGTERM, lambda s, f: sys.exit(0))
    b = BPF(text=BPF_PROGRAM)
    start_http_server(9109)
    print("File operations exporter on :9109/metrics")
    while True:
        collect(b)
        time.sleep(5)
PYEOF
chmod +x /usr/local/bin/ebpf-fileops-exporter
Sample output:
$ curl -s localhost:9109/metrics | grep ebpf_file_ops_total | head -12
ebpf_file_ops_total{process="postgres",operation="read"} 892341
ebpf_file_ops_total{process="postgres",operation="write"} 234567
ebpf_file_ops_total{process="postgres",operation="open"} 12345
ebpf_file_ops_total{process="nginx",operation="read"} 567890
ebpf_file_ops_total{process="nginx",operation="write"} 456789
ebpf_file_ops_total{process="nginx",operation="open"} 8901
ebpf_file_ops_total{process="node",operation="read"} 123456
ebpf_file_ops_total{process="node",operation="write"} 98765
ebpf_file_ops_total{process="zfs_arc_evict",operation="read"} 45678
ebpf_file_ops_total{process="prometheus",operation="write"} 34567
ebpf_file_ops_total{process="grafana",operation="read"} 23456
ebpf_file_ops_total{process="sshd",operation="read"} 12345
The process name is limited to 15 characters plus a terminating NUL (the kernel's TASK_COMM_LEN is 16). "prometheus" fits and shows up intact, but "containerd-shim-runc-v2" is truncated to "containerd-shim". For containers, you'll want to also capture the cgroup ID to map back to container names.
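To see what the label will look like before deploying, the truncation can be reproduced in a few lines. This is a sketch that assumes the 16-byte comm buffer holds up to 15 bytes of name plus a terminating NUL; both helper names are hypothetical:

```python
TASK_COMM_LEN = 16  # kernel buffer size, including the trailing NUL


def kernel_comm(name):
    """Approximate the comm string the kernel would keep for a task name."""
    raw = name.encode("utf-8")[: TASK_COMM_LEN - 1]
    # A multi-byte character cut in half is dropped rather than garbled.
    return raw.decode("utf-8", errors="ignore")


def decode_comm(raw):
    """Decode a fixed-size comm buffer as read from a BPF map key."""
    return raw.split(b"\x00", 1)[0].decode("utf-8", errors="replace")


print(kernel_comm("prometheus"))
print(kernel_comm("containerd-shim-runc-v2"))
```

Because several distinct binaries can collapse to the same truncated name, treat the process label as a hint, not a unique identifier.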
ZFS-Specific Exporters
ZFS has its own internal statistics in /proc/spl/kstat/zfs/, but those are polled counters with the same limitations as any /proc file. eBPF lets you trace ZFS at the function level — ARC lookups, ZIO completions, TXG syncs — and build distributions instead of rates.
ARC hit rate, size, and eviction rate
cat > /usr/local/bin/ebpf-zfs-arc-exporter << 'PYEOF'
#!/usr/bin/env python3
"""eBPF ZFS ARC Exporter — hit/miss rate, size, eviction rate via kprobes."""
from bcc import BPF
from prometheus_client import start_http_server, Counter, Gauge
import time, signal, sys
BPF_PROGRAM = r"""
#include
BPF_ARRAY(arc_stats, u64, 4);
// 0 = arc_read calls that returned 0 (exported as hits)
// 1 = arc_read calls that returned an error (exported as misses)
// 2 = evictions (arc_evict called)
// 3 = async read completions (arc_read_done called)
static __always_inline void inc(int idx) {
    u32 key = idx;
    u64 *val = arc_stats.lookup(&key);
    if (val) __sync_fetch_and_add(val, 1);
}
// arc_read: the main entry point for all ARC lookups.
// Caveat: the return value is 0 on success and an errno on failure. A miss
// that is then read from disk can also return 0, so this split undercounts
// misses. Cross-check against hits/misses in /proc/spl/kstat/zfs/arcstats.
int kretprobe__arc_read(struct pt_regs *ctx) {
    int ret = PT_REGS_RC(ctx);
    if (ret == 0) {
        inc(0); // counted as hit
    } else {
        inc(1); // counted as miss
    }
    return 0;
}
// arc_evict: called when ARC needs to free memory
int kprobe__arc_evict(struct pt_regs *ctx) {
    inc(2);
    return 0;
}
// arc_read_done: callback when an async ARC read completes (confirms miss + I/O)
int kprobe__arc_read_done(struct pt_regs *ctx) {
    inc(3);
    return 0;
}
"""
arc_hits = Counter('ebpf_zfs_arc_hits_total', 'ZFS ARC cache hits')
arc_misses = Counter('ebpf_zfs_arc_misses_total', 'ZFS ARC cache misses')
arc_evictions = Counter('ebpf_zfs_arc_evictions_total', 'ZFS ARC eviction operations')
arc_async_reads = Counter('ebpf_zfs_arc_async_reads_total', 'ZFS ARC async read completions (confirmed misses)')
arc_hit_ratio = Gauge('ebpf_zfs_arc_hit_ratio', 'ZFS ARC hit ratio (0.0-1.0)')
arc_size_bytes = Gauge('ebpf_zfs_arc_size_bytes', 'Current ZFS ARC size in bytes')
arc_target_bytes = Gauge('ebpf_zfs_arc_target_bytes', 'ZFS ARC target size (c) in bytes')
prev = [0, 0, 0, 0]

def read_proc_arc():
    """Read ARC size from /proc/spl/kstat/zfs/arcstats (supplement eBPF data)."""
    try:
        with open('/proc/spl/kstat/zfs/arcstats') as f:
            for line in f:
                parts = line.split()
                if len(parts) == 3:
                    if parts[0] == 'size':
                        arc_size_bytes.set(int(parts[2]))
                    elif parts[0] == 'c':
                        arc_target_bytes.set(int(parts[2]))
    except (FileNotFoundError, ValueError):
        pass

def collect(bpf_obj):
    global prev
    stats = bpf_obj["arc_stats"]
    cur = [stats[i].value for i in range(4)]
    deltas = [cur[i] - prev[i] for i in range(4)]
    prev = cur[:]
    if deltas[0] > 0:
        arc_hits.inc(deltas[0])
    if deltas[1] > 0:
        arc_misses.inc(deltas[1])
    if deltas[2] > 0:
        arc_evictions.inc(deltas[2])
    if deltas[3] > 0:
        arc_async_reads.inc(deltas[3])
    total = deltas[0] + deltas[1]
    if total > 0:
        arc_hit_ratio.set(deltas[0] / total)
    read_proc_arc()

if __name__ == '__main__':
    signal.signal(signal.SIGTERM, lambda s, f: sys.exit(0))
    b = BPF(text=BPF_PROGRAM)
    start_http_server(9110)
    print("ZFS ARC exporter on :9110/metrics")
    while True:
        collect(b)
        time.sleep(5)
PYEOF
chmod +x /usr/local/bin/ebpf-zfs-arc-exporter
Sample output:
$ curl -s localhost:9110/metrics | grep ebpf_zfs_arc
ebpf_zfs_arc_hits_total 8.923471e+06
ebpf_zfs_arc_misses_total 234567.0
ebpf_zfs_arc_evictions_total 12345.0
ebpf_zfs_arc_hit_ratio 0.9744
ebpf_zfs_arc_size_bytes 6.87194624e+09
ebpf_zfs_arc_target_bytes 8.589934592e+09
ARC hit ratio tells you everything
An ARC hit ratio above 95% means your working set fits in RAM. Below 90% means ZFS is constantly evicting and re-reading data from disk. If the ratio drops and arc_size_bytes is near arc_target_bytes (the c parameter), your pool's working set exceeds the memory the ARC is allowed to use. Either add RAM, raise zfs_arc_max, reduce memory pressure from other applications, or accept slower reads.
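As a cross-check on the eBPF-derived ratio, the same number can be computed from the kstat counters. The sketch below assumes arcstats data rows have three whitespace-separated columns (name, type, value) with rows named hits and misses; the sample text is hand-written for illustration, not captured from a real system:

```python
def arc_hit_ratio(arcstats_text):
    """Compute hits / (hits + misses) from arcstats-style text."""
    values = {}
    for line in arcstats_text.splitlines():
        parts = line.split()
        # Data rows: name, type, value. Header rows fail the digit check.
        if len(parts) == 3 and parts[2].isdigit():
            values[parts[0]] = int(parts[2])
    hits = values.get("hits", 0)
    total = hits + values.get("misses", 0)
    return hits / total if total else None


# Illustrative sample, reusing the counter values from the output above.
SAMPLE = """\
name type data
hits 4 8923471
misses 4 234567
size 4 6871946240
c 4 8589934592
"""
print(arc_hit_ratio(SAMPLE))
```

Note that the kstat counters are totals since module load, while the exporter's gauge covers only the last collection interval, so the two will agree only on a steady workload.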
ZIO latency per operation type
ZIO is ZFS's internal I/O pipeline. Every read, write, free, and claim operation passes through it. Tracing ZIO gives you latency distributions for each operation type — much more useful than raw block device stats because ZIO sees the logical operations, not just the physical I/O.
cat > /usr/local/bin/ebpf-zfs-zio-exporter << 'PYEOF'
#!/usr/bin/env python3
"""eBPF ZFS ZIO Latency Exporter — per-operation-type latency histograms."""
from bcc import BPF
from prometheus_client import start_http_server, Histogram
import time, signal, sys
BPF_PROGRAM = r"""
#include
typedef struct {
u64 ts;
u8 type;
} start_t;
BPF_HASH(zio_start, u64, start_t); // key = zio pointer
BPF_HISTOGRAM(zio_lat_read);
BPF_HISTOGRAM(zio_lat_write);
BPF_HISTOGRAM(zio_lat_free);
// zio_execute — entry point for ZIO pipeline execution
// arg0 = zio_t *zio, zio->io_type at offset depends on kernel
int kprobe__zio_execute(struct pt_regs *ctx) {
u64 zio_ptr = PT_REGS_PARM1(ctx);
start_t s = {};
s.ts = bpf_ktime_get_ns();
// Read io_type from zio struct (offset varies by ZFS version)
// ZFS 2.2+ io_type is typically at offset 72
u8 type = 0;
bpf_probe_read_kernel(&type, sizeof(type), (void *)(zio_ptr + 72));
s.type = type;
zio_start.update(&zio_ptr, &s);
return 0;
}
// zio_done — ZIO completion
int kprobe__zio_done(struct pt_regs *ctx) {
u64 zio_ptr = PT_REGS_PARM1(ctx);
start_t *sp = zio_start.lookup(&zio_ptr);
if (!sp) return 0;
u64 delta_us = (bpf_ktime_get_ns() - sp->ts) / 1000;
u32 slot = bpf_log2l(delta_us);
// ZIO types: 0=null, 1=read, 2=write, 3=free, 4=claim, 5=ioctl
if (sp->type == 1) {
zio_lat_read.increment(slot);
} else if (sp->type == 2) {
zio_lat_write.increment(slot);
} else if (sp->type == 3) {
zio_lat_free.increment(slot);
}
zio_start.delete(&zio_ptr);
return 0;
}
"""
BUCKETS = tuple(2**i for i in range(21))
zio_read_lat = Histogram(
    'ebpf_zfs_zio_read_latency_microseconds',
    'ZFS ZIO read latency',
    buckets=BUCKETS
)
zio_write_lat = Histogram(
    'ebpf_zfs_zio_write_latency_microseconds',
    'ZFS ZIO write latency',
    buckets=BUCKETS
)
zio_free_lat = Histogram(
    'ebpf_zfs_zio_free_latency_microseconds',
    'ZFS ZIO free latency',
    buckets=BUCKETS
)

def collect(bpf_obj):
    for name, prom_hist in [
        ("zio_lat_read", zio_read_lat),
        ("zio_lat_write", zio_write_lat),
        ("zio_lat_free", zio_free_lat),
    ]:
        hist = bpf_obj[name]
        for k, v in hist.items():
            bucket_us = 1 << k.value
            for _ in range(v.value):
                prom_hist.observe(bucket_us)
        hist.clear()

if __name__ == '__main__':
    signal.signal(signal.SIGTERM, lambda s, f: sys.exit(0))
    b = BPF(text=BPF_PROGRAM)
    start_http_server(9111)
    print("ZFS ZIO latency exporter on :9111/metrics")
    while True:
        collect(b)
        time.sleep(5)
PYEOF
chmod +x /usr/local/bin/ebpf-zfs-zio-exporter
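The latency collectors on this page all turn a BPF log2 slot into a value with `1 << k.value` before observing it. The sketch below shows what that implies for bucket placement, assuming bcc's log2 convention in which slot i covers the values from 2^(i-1) to 2^i - 1 (the same ranges bcc prints for a log2 histogram); the helper names are hypothetical:

```python
def slot_bounds(slot):
    """Inclusive (low, high) range of values that land in a log2 slot."""
    low = (1 << slot) >> 1
    high = (1 << slot) - 1
    return low, high


def observed_value(slot):
    """The value the collectors hand to Prometheus for a slot."""
    return 1 << slot


for slot in (1, 3, 10):
    low, high = slot_bounds(slot)
    print(f"slot {slot}: {low}..{high} us, observed as {observed_value(slot)} us")
```

Under that assumption every event is recorded at the upper edge of its range, so the exported histograms slightly overstate latency (by less than a factor of two) and never understate it.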
TXG sync time
ZFS batches writes into transaction groups (TXGs). By default, ZFS flushes the current TXG to disk at least every 5 seconds (zfs_txg_timeout), and sooner when enough dirty data accumulates. The time this takes is a critical metric: if TXG sync time approaches the TXG timeout (5s default), ZFS is I/O bound and writes will start to stall.
cat > /usr/local/bin/ebpf-zfs-txg-exporter << 'PYEOF'
#!/usr/bin/env python3
"""eBPF ZFS TXG Sync Time Exporter."""
from bcc import BPF
from prometheus_client import start_http_server, Histogram, Counter, Gauge
import time, signal, sys
BPF_PROGRAM = r"""
#include
BPF_HASH(txg_start_ts, u32, u64); // key = thread ID running spa_sync
BPF_HISTOGRAM(txg_sync_hist);
BPF_ARRAY(txg_stats, u64, 2); // 0 = total syncs, 1 = last sync time us
// spa_sync: called to sync a transaction group.
// Entry and return are matched by thread ID: the argument registers are not
// guaranteed to still hold the spa pointer when the kretprobe fires.
int kprobe__spa_sync(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    txg_start_ts.update(&tid, &ts);
    return 0;
}
int kretprobe__spa_sync(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = txg_start_ts.lookup(&tid);
    if (!tsp) return 0;
    u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
    txg_sync_hist.increment(bpf_log2l(delta_us));
    u32 key0 = 0, key1 = 1;
    u64 *total = txg_stats.lookup(&key0);
    if (total) __sync_fetch_and_add(total, 1);
    u64 *last = txg_stats.lookup(&key1);
    if (last) *last = delta_us;
    txg_start_ts.delete(&tid);
    return 0;
}
"""
BUCKETS = tuple(2**i for i in range(24))  # 1us to ~8 seconds
txg_sync_time = Histogram(
    'ebpf_zfs_txg_sync_microseconds',
    'ZFS transaction group sync duration',
    buckets=BUCKETS
)
txg_sync_count = Counter('ebpf_zfs_txg_syncs_total', 'Total TXG syncs')
txg_last_sync = Gauge('ebpf_zfs_txg_last_sync_microseconds', 'Last TXG sync duration')
prev_total = 0

def collect(bpf_obj):
    global prev_total
    hist = bpf_obj["txg_sync_hist"]
    for k, v in hist.items():
        bucket_us = 1 << k.value
        for _ in range(v.value):
            txg_sync_time.observe(bucket_us)
    hist.clear()
    stats = bpf_obj["txg_stats"]
    cur_total = stats[0].value
    if cur_total > prev_total:
        txg_sync_count.inc(cur_total - prev_total)
    prev_total = cur_total
    txg_last_sync.set(stats[1].value)

if __name__ == '__main__':
    signal.signal(signal.SIGTERM, lambda s, f: sys.exit(0))
    b = BPF(text=BPF_PROGRAM)
    start_http_server(9112)
    print("ZFS TXG sync exporter on :9112/metrics")
    while True:
        collect(b)
        time.sleep(5)
PYEOF
chmod +x /usr/local/bin/ebpf-zfs-txg-exporter
Sample output:
$ curl -s localhost:9112/metrics | grep ebpf_zfs_txg
ebpf_zfs_txg_syncs_total 847.0
ebpf_zfs_txg_last_sync_microseconds 234567.0
ebpf_zfs_txg_sync_microseconds_bucket{le="1024.0"} 12
ebpf_zfs_txg_sync_microseconds_bucket{le="2048.0"} 45
ebpf_zfs_txg_sync_microseconds_bucket{le="131072.0"} 623
ebpf_zfs_txg_sync_microseconds_bucket{le="262144.0"} 789
ebpf_zfs_txg_sync_microseconds_bucket{le="524288.0"} 834
ebpf_zfs_txg_sync_microseconds_bucket{le="1048576.0"} 845
ebpf_zfs_txg_sync_microseconds_bucket{le="+Inf"} 847
TXG sync time is your write health indicator
A healthy pool shows TXG syncs completing in 100-500ms. If syncs regularly exceed 2 seconds, your write pipeline is saturated. If they approach 5 seconds (the default zfs_txg_timeout), ZFS will start throttling incoming writes. This is the metric that predicts write stalls before they happen.
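Those thresholds translate directly into a small classifier, which is handy for alert rules or a status panel. This is a sketch using the numbers from the paragraph above (500ms, 2s, and the 5s default timeout); the function name and the state labels are hypothetical:

```python
TXG_TIMEOUT_US = 5_000_000  # default zfs_txg_timeout of 5 s, in microseconds


def txg_health(sync_us, timeout_us=TXG_TIMEOUT_US):
    """Map a TXG sync duration in microseconds to a coarse health state."""
    if sync_us >= timeout_us:
        return "stalling"   # sync takes as long as the timeout itself
    if sync_us > 2_000_000:
        return "saturated"  # regularly above 2 s: write pipeline is full
    if sync_us <= 500_000:
        return "healthy"    # the 100-500 ms range, or faster
    return "elevated"       # between 500 ms and 2 s: worth watching


print(txg_health(234_567))
```

Feeding it the ebpf_zfs_txg_last_sync_microseconds gauge gives a point-in-time state; for alerting, a p99 over several minutes is less noisy than a single sync.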
Metaslab allocation latency
Metaslab allocation is how ZFS finds free space on disk. As a pool fills up and fragments, allocation takes longer. Tracing metaslab_alloc gives you a leading indicator of pool fragmentation — allocation latency rises well before you start seeing performance degradation in application I/O.
cat > /usr/local/bin/ebpf-zfs-metaslab-exporter << 'PYEOF'
#!/usr/bin/env python3
"""eBPF ZFS Metaslab Allocation Latency Exporter."""
from bcc import BPF
from prometheus_client import start_http_server, Histogram, Counter
import time, signal, sys
BPF_PROGRAM = r"""
#include
BPF_HASH(alloc_start, u32, u64); // key = tid
BPF_HISTOGRAM(alloc_lat);
BPF_ARRAY(alloc_counts, u64, 2); // 0 = success, 1 = failure
int kprobe__metaslab_alloc(struct pt_regs *ctx) {
u32 tid = bpf_get_current_pid_tgid();
u64 ts = bpf_ktime_get_ns();
alloc_start.update(&tid, &ts);
return 0;
}
int kretprobe__metaslab_alloc(struct pt_regs *ctx) {
u32 tid = bpf_get_current_pid_tgid();
u64 *tsp = alloc_start.lookup(&tid);
if (!tsp) return 0;
u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
alloc_lat.increment(bpf_log2l(delta_us));
int ret = PT_REGS_RC(ctx);
u32 key = (ret == 0) ? 0 : 1;
u64 *count = alloc_counts.lookup(&key);
if (count) __sync_fetch_and_add(count, 1);
alloc_start.delete(&tid);
return 0;
}
"""
BUCKETS = tuple(2**i for i in range(21))
metaslab_latency = Histogram(
    'ebpf_zfs_metaslab_alloc_microseconds',
    'ZFS metaslab allocation latency',
    buckets=BUCKETS
)
metaslab_success = Counter('ebpf_zfs_metaslab_alloc_success_total', 'Successful metaslab allocations')
metaslab_failure = Counter('ebpf_zfs_metaslab_alloc_failure_total', 'Failed metaslab allocations')
prev = [0, 0]

def collect(bpf_obj):
    global prev
    hist = bpf_obj["alloc_lat"]
    for k, v in hist.items():
        bucket_us = 1 << k.value
        for _ in range(v.value):
            metaslab_latency.observe(bucket_us)
    hist.clear()
    counts = bpf_obj["alloc_counts"]
    cur = [counts[i].value for i in range(2)]
    deltas = [cur[i] - prev[i] for i in range(2)]
    prev = cur[:]
    if deltas[0] > 0:
        metaslab_success.inc(deltas[0])
    if deltas[1] > 0:
        metaslab_failure.inc(deltas[1])

if __name__ == '__main__':
    signal.signal(signal.SIGTERM, lambda s, f: sys.exit(0))
    b = BPF(text=BPF_PROGRAM)
    start_http_server(9113)
    print("ZFS metaslab exporter on :9113/metrics")
    while True:
        collect(b)
        time.sleep(5)
PYEOF
chmod +x /usr/local/bin/ebpf-zfs-metaslab-exporter
Metaslab allocation failure doesn't mean your pool is out of space. It means a specific metaslab (a slice of a vdev) couldn't satisfy the allocation, so ZFS tries the next one. Repeated failures mean fragmentation is bad enough that ZFS has to scan multiple metaslabs to find contiguous free space. The failure rate is a fragmentation proxy.
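The failure rate described above is simply failed allocations over all allocations within a collection interval. A minimal sketch of that calculation, including the delta step the exporter performs on its running counters; both function names are hypothetical:

```python
def counter_delta(prev, cur):
    """Increase since the last read.

    A value smaller than the previous one means the BPF map was reset
    (for example, the exporter restarted), so the new value is the delta.
    """
    return cur - prev if cur >= prev else cur


def failure_ratio(success_delta, failure_delta):
    """Share of metaslab allocations that failed in one interval."""
    total = success_delta + failure_delta
    return failure_delta / total if total else 0.0


print(failure_ratio(counter_delta(1000, 1090), counter_delta(40, 50)))
```

Tracking the ratio rather than the raw failure count keeps the signal comparable between quiet and busy periods.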
Per-dataset I/O via DMU kprobes
ZFS's Data Management Unit (DMU) sits above ZIO and handles object-level I/O per dataset. By tracing dmu_read and dmu_write, you can see I/O broken down by dataset — which is what you actually want when you have 50 datasets and need to know which one is hammering the pool.
cat > /usr/local/bin/ebpf-zfs-dmu-exporter << 'PYEOF'
#!/usr/bin/env python3
"""eBPF ZFS DMU Per-Dataset I/O Exporter."""
from bcc import BPF
from prometheus_client import start_http_server, Counter, Histogram
import time, signal, sys
BPF_PROGRAM = r"""
#include
typedef struct {
u64 objset_id;
u8 op; // 0=read, 1=write
} key_t;
BPF_HASH(dmu_ops, key_t, u64);
BPF_HASH(dmu_bytes, key_t, u64);
// dmu_read(objset_t *os, uint64_t object, uint64_t offset, uint64_t size, ...)
int kprobe__dmu_read(struct pt_regs *ctx) {
u64 os_ptr = PT_REGS_PARM1(ctx);
u64 size = PT_REGS_PARM4(ctx);
// Read os_id from objset_t (offset varies — typically 0 or 8)
u64 os_id = 0;
bpf_probe_read_kernel(&os_id, sizeof(os_id), (void *)os_ptr);
key_t k = {};
k.objset_id = os_id;
k.op = 0;
u64 *ops = dmu_ops.lookup_or_try_init(&k, &(u64){0});
if (ops) __sync_fetch_and_add(ops, 1);
u64 *bytes = dmu_bytes.lookup_or_try_init(&k, &(u64){0});
if (bytes) __sync_fetch_and_add(bytes, size);
return 0;
}
// dmu_write(objset_t *os, uint64_t object, uint64_t offset, uint64_t size, ...)
int kprobe__dmu_write(struct pt_regs *ctx) {
u64 os_ptr = PT_REGS_PARM1(ctx);
u64 size = PT_REGS_PARM4(ctx);
u64 os_id = 0;
bpf_probe_read_kernel(&os_id, sizeof(os_id), (void *)os_ptr);
key_t k = {};
k.objset_id = os_id;
k.op = 1;
u64 *ops = dmu_ops.lookup_or_try_init(&k, &(u64){0});
if (ops) __sync_fetch_and_add(ops, 1);
u64 *bytes = dmu_bytes.lookup_or_try_init(&k, &(u64){0});
if (bytes) __sync_fetch_and_add(bytes, size);
return 0;
}
"""
OPS = {0: "read", 1: "write"}
dmu_op_count = Counter(
    'ebpf_zfs_dmu_ops_total',
    'ZFS DMU operations by objset and type',
    ['objset_id', 'operation']
)
dmu_byte_count = Counter(
    'ebpf_zfs_dmu_bytes_total',
    'ZFS DMU bytes by objset and type',
    ['objset_id', 'operation']
)

def collect(bpf_obj):
    ops = bpf_obj["dmu_ops"]
    for k, v in ops.items():
        op = OPS.get(k.op, "unknown")
        dmu_op_count.labels(objset_id=str(k.objset_id), operation=op).inc(v.value)
    ops.clear()
    bts = bpf_obj["dmu_bytes"]
    for k, v in bts.items():
        op = OPS.get(k.op, "unknown")
        dmu_byte_count.labels(objset_id=str(k.objset_id), operation=op).inc(v.value)
    bts.clear()

if __name__ == '__main__':
    signal.signal(signal.SIGTERM, lambda s, f: sys.exit(0))
    b = BPF(text=BPF_PROGRAM)
    start_http_server(9114)
    print("ZFS DMU per-dataset exporter on :9114/metrics")
    while True:
        collect(b)
        time.sleep(5)
PYEOF
chmod +x /usr/local/bin/ebpf-zfs-dmu-exporter
Map objset IDs to dataset names with zfs list -o name,objsetid:
$ zfs list -o name,objsetid
NAME OBJSETID
rpool 54
rpool/ROOT 260
rpool/ROOT/default 518
rpool/home 776
rpool/var 1034
rpool/docker 1292
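If you would rather resolve the IDs in code than by eye, the listing is easy to parse. The sketch below works on the two-column output shown above and assumes dataset names contain no whitespace; the function name is hypothetical:

```python
def parse_objset_map(zfs_list_output):
    """Build {objset_id: dataset_name} from `zfs list -o name,objsetid` text."""
    mapping = {}
    for line in zfs_list_output.splitlines():
        parts = line.split()
        # Skip the header row and anything that is not "name id".
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        mapping[int(parts[1])] = parts[0]
    return mapping


SAMPLE = """\
NAME OBJSETID
rpool 54
rpool/ROOT 260
rpool/home 776
rpool/docker 1292
"""
names = parse_objset_map(SAMPLE)
print(names.get(776, "unknown"))
```

An exporter could refresh this map every few minutes and attach the dataset name as a label, falling back to the raw ID for objsets that appear between refreshes.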
Scrub progress and error rate
cat > /usr/local/bin/ebpf-zfs-scrub-collector << 'BPFEOF'
#!/bin/bash
# ZFS scrub metrics via bpftrace + textfile collector.
# Traces dsl_scan_scrub_cb for scrub I/O, counts errors.
# Run via systemd timer every 60 seconds.
OUTFILE="/var/lib/node_exporter/textfile_collector/zfs_scrub.prom"
TMPFILE="${OUTFILE}.tmp"
# Check if a scrub is actually running
SCRUB_ACTIVE=$(zpool status | grep -c "scan: scrub in progress")
if [ "$SCRUB_ACTIVE" -eq 0 ]; then
{
echo "# HELP ebpf_zfs_scrub_active Whether a ZFS scrub is currently running"
echo "# TYPE ebpf_zfs_scrub_active gauge"
echo "ebpf_zfs_scrub_active 0"
} > "$TMPFILE"
mv "$TMPFILE" "$OUTFILE"
exit 0
fi
# Trace scrub I/O for 10 seconds
RESULT=$(bpftrace -e '
kprobe:dsl_scan_scrub_cb {
@scrub_ios = count();
}
kprobe:zio_done {
// Check for errors in ZIO completion
@total_zios = count();
}
kretprobe:zio_wait {
$ret = retval;
if ($ret != 0) {
@scrub_errors = count();
}
}
interval:s:10 { exit(); }
' 2>/dev/null)
SCRUB_IOS=$(echo "$RESULT" | grep '@scrub_ios:' | awk '{print $2}')
SCRUB_ERRORS=$(echo "$RESULT" | grep '@scrub_errors:' | awk '{print $2}')
[ -z "$SCRUB_IOS" ] && SCRUB_IOS=0
[ -z "$SCRUB_ERRORS" ] && SCRUB_ERRORS=0
# Also get progress from zpool status
SCRUB_PCT=$(zpool status | grep "done" | grep -oP '[\d.]+(?=%)' | head -1)
[ -z "$SCRUB_PCT" ] && SCRUB_PCT=0
{
echo "# HELP ebpf_zfs_scrub_active Whether a ZFS scrub is currently running"
echo "# TYPE ebpf_zfs_scrub_active gauge"
echo "ebpf_zfs_scrub_active 1"
echo "# HELP ebpf_zfs_scrub_ios_sampled Scrub I/Os observed in 10s sample window"
echo "# TYPE ebpf_zfs_scrub_ios_sampled gauge"
echo "ebpf_zfs_scrub_ios_sampled ${SCRUB_IOS}"
echo "# HELP ebpf_zfs_scrub_errors_sampled Scrub errors observed in 10s sample window"
echo "# TYPE ebpf_zfs_scrub_errors_sampled gauge"
echo "ebpf_zfs_scrub_errors_sampled ${SCRUB_ERRORS}"
echo "# HELP ebpf_zfs_scrub_progress_percent Scrub progress percentage"
echo "# TYPE ebpf_zfs_scrub_progress_percent gauge"
echo "ebpf_zfs_scrub_progress_percent ${SCRUB_PCT}"
} > "$TMPFILE"
mv "$TMPFILE" "$OUTFILE"
BPFEOF
chmod +x /usr/local/bin/ebpf-zfs-scrub-collector
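The collector writes to a temporary file and then renames it so that node_exporter never reads a half-written file. The same pattern in Python, as a sketch: the function name and the tuple layout are hypothetical, and every metric is written as a gauge to match the script above.

```python
import os
import tempfile


def write_prom_atomically(path, metrics):
    """Write (name, help_text, value) gauges to path via rename."""
    lines = []
    for name, help_text, value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    directory = os.path.dirname(path) or "."
    # The temp file must live in the same directory so the rename
    # stays on one filesystem and is atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    os.chmod(tmp, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp, path)
```

A call such as `write_prom_atomically(outfile, [("ebpf_zfs_scrub_active", "Whether a ZFS scrub is currently running", 0)])` reproduces the idle-case output of the shell script.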
WireGuard-Specific Exporters
WireGuard's kernel module handles encryption and tunneling, but it exposes almost no runtime metrics. wg show gives you transfer counters and last handshake time — that's it. No per-peer throughput rate, no handshake failure counts, no packet loss visibility. eBPF fixes all of that.
Per-peer throughput
cat > /usr/local/bin/ebpf-wireguard-exporter << 'PYEOF'
#!/usr/bin/env python3
"""eBPF WireGuard Per-Peer Throughput Exporter."""
from bcc import BPF
from prometheus_client import start_http_server, Counter, Gauge
import time, signal, sys, subprocess, re
BPF_PROGRAM = r"""
#include
#include
typedef struct {
u32 peer_idx;
u8 direction; // 0=rx, 1=tx
} key_t;
BPF_HASH(peer_bytes, key_t, u64);
BPF_HASH(peer_packets, key_t, u64);
// wg_packet_receive — called when a WireGuard packet arrives
int kprobe__wg_packet_receive(struct pt_regs *ctx) {
struct sk_buff *skb = (struct sk_buff *)PT_REGS_PARM2(ctx);
u32 len = 0;
bpf_probe_read_kernel(&len, sizeof(len), &skb->len);
key_t k = {};
k.peer_idx = 0; // We'll differentiate by source in userspace
k.direction = 0;
u64 *b = peer_bytes.lookup_or_try_init(&k, &(u64){0});
if (b) __sync_fetch_and_add(b, len);
u64 *p = peer_packets.lookup_or_try_init(&k, &(u64){0});
if (p) __sync_fetch_and_add(p, 1);
return 0;
}
// wg_packet_send — called when a WireGuard packet is transmitted
int kprobe__wg_packet_send(struct pt_regs *ctx) {
struct sk_buff *skb = (struct sk_buff *)PT_REGS_PARM2(ctx);
u32 len = 0;
bpf_probe_read_kernel(&len, sizeof(len), &skb->len);
key_t k = {};
k.peer_idx = 0;
k.direction = 1;
u64 *b = peer_bytes.lookup_or_try_init(&k, &(u64){0});
if (b) __sync_fetch_and_add(b, len);
u64 *p = peer_packets.lookup_or_try_init(&k, &(u64){0});
if (p) __sync_fetch_and_add(p, 1);
return 0;
}
"""
wg_bytes = Counter(
    'ebpf_wireguard_bytes_total',
    'WireGuard bytes transferred',
    ['interface', 'peer', 'direction']
)
wg_packets = Counter(
    'ebpf_wireguard_packets_total',
    'WireGuard packets transferred',
    ['interface', 'peer', 'direction']
)
wg_handshake_age = Gauge(
    'ebpf_wireguard_handshake_age_seconds',
    'Seconds since last successful handshake',
    ['interface', 'peer']
)
DIRS = {0: "rx", 1: "tx"}

def get_wg_peers():
    """Parse wg show output for peer info and handshake times."""
    peers = {}
    try:
        out = subprocess.check_output(['wg', 'show', 'all', 'dump'],
                                      text=True, timeout=5)
        for line in out.strip().split('\n'):
            parts = line.split('\t')
            # Peer lines: interface, public key, preshared key, endpoint,
            # allowed IPs, latest handshake, transfer-rx, transfer-tx,
            # persistent keepalive. Interface lines are shorter and skipped.
            if len(parts) >= 8:
                iface = parts[0]
                pubkey = parts[1][:12] + '...'  # truncate for label safety
                endpoint = parts[3] if parts[3] != '(none)' else 'none'
                last_handshake = int(parts[5])
                rx_bytes = int(parts[6])  # transfer-rx comes first
                tx_bytes = int(parts[7])
                peers[pubkey] = {
                    'interface': iface,
                    'endpoint': endpoint,
                    'last_handshake': last_handshake,
                    'tx_bytes': tx_bytes,
                    'rx_bytes': rx_bytes,
                }
    except (subprocess.SubprocessError, FileNotFoundError):
        pass
    return peers

def collect(bpf_obj):
    # eBPF aggregate bytes/packets
    bytes_map = bpf_obj["peer_bytes"]
    pkts_map = bpf_obj["peer_packets"]
    for k, v in bytes_map.items():
        direction = DIRS.get(k.direction, "unknown")
        wg_bytes.labels(interface="wg0", peer="aggregate", direction=direction).inc(v.value)
    bytes_map.clear()
    for k, v in pkts_map.items():
        direction = DIRS.get(k.direction, "unknown")
        wg_packets.labels(interface="wg0", peer="aggregate", direction=direction).inc(v.value)
    pkts_map.clear()
    # Supplement with per-peer data from wg show
    now = int(time.time())
    for peer, info in get_wg_peers().items():
        if info['last_handshake'] > 0:
            age = now - info['last_handshake']
            wg_handshake_age.labels(
                interface=info['interface'],
                peer=peer
            ).set(age)

if __name__ == '__main__':
    signal.signal(signal.SIGTERM, lambda s, f: sys.exit(0))
    b = BPF(text=BPF_PROGRAM)
    start_http_server(9115)
    print("WireGuard exporter on :9115/metrics")
    while True:
        collect(b)
        time.sleep(5)
PYEOF
chmod +x /usr/local/bin/ebpf-wireguard-exporter
Sample output:
$ curl -s localhost:9115/metrics | grep ebpf_wireguard
ebpf_wireguard_bytes_total{interface="wg0",peer="aggregate",direction="rx"} 1.234567e+09
ebpf_wireguard_bytes_total{interface="wg0",peer="aggregate",direction="tx"} 9.87654e+08
ebpf_wireguard_packets_total{interface="wg0",peer="aggregate",direction="rx"} 8234567
ebpf_wireguard_packets_total{interface="wg0",peer="aggregate",direction="tx"} 7654321
ebpf_wireguard_handshake_age_seconds{interface="wg0",peer="a1b2c3d4e5f6..."} 87
ebpf_wireguard_handshake_age_seconds{interface="wg0",peer="f6e5d4c3b2a1..."} 12
ebpf_wireguard_handshake_age_seconds{interface="wg0",peer="1234abcd5678..."} 345
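The handshake age comes from the tab-separated output of `wg show all dump`. The sketch below parses one peer line in isolation, assuming the field order documented in the wg(8) man page with the interface name prepended (public key, preshared key, endpoint, allowed IPs, latest handshake, transfer-rx, transfer-tx, persistent keepalive). The keys, addresses, and counters in the sample are placeholders, and the function name is hypothetical:

```python
def parse_wg_dump(text):
    """Return one dict per peer line of `wg show all dump` output."""
    peers = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 9:
            continue  # interface lines carry fewer fields
        peers.append({
            "interface": parts[0],
            "public_key": parts[1],
            "endpoint": parts[3],
            "latest_handshake": int(parts[5]),
            "rx_bytes": int(parts[6]),
            "tx_bytes": int(parts[7]),
        })
    return peers


# Placeholder values, not real keys or traffic.
SAMPLE = "\t".join([
    "wg0", "PEERPUBKEYPLACEHOLDER=", "(none)", "203.0.113.7:51820",
    "10.0.0.2/32", "1700000000", "1234", "5678", "off",
]) + "\n" + "\t".join(["wg0", "PRIVKEY=", "PUBKEY=", "51820", "off"])

peer = parse_wg_dump(SAMPLE)[0]
print(peer["rx_bytes"], peer["tx_bytes"])
```

Getting the rx/tx order right matters: the two columns are adjacent integers, so a swap produces plausible-looking but inverted graphs.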
Handshake timing and failure rate
cat > /usr/local/bin/ebpf-wg-handshake-collector << 'BPFEOF'
#!/bin/bash
# WireGuard handshake metrics via bpftrace + textfile collector.
# Traces wg_noise_handshake_* functions for timing and failures.
OUTFILE="/var/lib/node_exporter/textfile_collector/wg_handshake.prom"
TMPFILE="${OUTFILE}.tmp"
RESULT=$(bpftrace -e '
kprobe:wg_noise_handshake_begin {
@hs_start[tid] = nsecs;
@hs_attempts = count();
}
kretprobe:wg_noise_handshake_begin {
$start = @hs_start[tid];
if ($start > 0) {
$delta_us = (nsecs - $start) / 1000;
@hs_latency_us = hist($delta_us);
if (retval == 0) {
@hs_success = count();
} else {
@hs_failure = count();
}
delete(@hs_start[tid]);
}
}
kprobe:wg_noise_handshake_consume_response {
@hs_responses = count();
}
interval:s:30 { exit(); }
' 2>/dev/null)
ATTEMPTS=$(echo "$RESULT" | grep '@hs_attempts:' | awk '{print $2}')
SUCCESS=$(echo "$RESULT" | grep '@hs_success:' | awk '{print $2}')
FAILURE=$(echo "$RESULT" | grep '@hs_failure:' | awk '{print $2}')
RESPONSES=$(echo "$RESULT" | grep '@hs_responses:' | awk '{print $2}')
[ -z "$ATTEMPTS" ] && ATTEMPTS=0
[ -z "$SUCCESS" ] && SUCCESS=0
[ -z "$FAILURE" ] && FAILURE=0
[ -z "$RESPONSES" ] && RESPONSES=0
{
echo "# HELP ebpf_wireguard_handshake_attempts_sampled Handshake attempts in 30s window"
echo "# TYPE ebpf_wireguard_handshake_attempts_sampled gauge"
echo "ebpf_wireguard_handshake_attempts_sampled ${ATTEMPTS}"
echo "# HELP ebpf_wireguard_handshake_success_sampled Successful handshakes in 30s window"
echo "# TYPE ebpf_wireguard_handshake_success_sampled gauge"
echo "ebpf_wireguard_handshake_success_sampled ${SUCCESS}"
echo "# HELP ebpf_wireguard_handshake_failure_sampled Failed handshakes in 30s window"
echo "# TYPE ebpf_wireguard_handshake_failure_sampled gauge"
echo "ebpf_wireguard_handshake_failure_sampled ${FAILURE}"
echo "# HELP ebpf_wireguard_handshake_responses_sampled Handshake responses consumed in 30s window"
echo "# TYPE ebpf_wireguard_handshake_responses_sampled gauge"
echo "ebpf_wireguard_handshake_responses_sampled ${RESPONSES}"
} > "$TMPFILE"
mv "$TMPFILE" "$OUTFILE"
BPFEOF
chmod +x /usr/local/bin/ebpf-wg-handshake-collector
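The grep and awk pipeline in the collector depends on bpftrace printing each count map as `@name: value` on exit. The same extraction in Python, as a sketch under that assumption; the function name is hypothetical and the sample text is hand-written:

```python
import re

# Matches scalar maps such as "@hs_attempts: 5". Histogram maps print a
# bare "@name:" line followed by bucket rows, which this ignores.
_COUNT_RE = re.compile(r"^@(\w+):\s+(\d+)$")


def parse_counts(output, names):
    """Return {name: count} for the requested maps, defaulting to 0."""
    found = {}
    for line in output.splitlines():
        match = _COUNT_RE.match(line.strip())
        if match:
            found[match.group(1)] = int(match.group(2))
    return {name: found.get(name, 0) for name in names}


SAMPLE = """\
@hs_attempts: 5
@hs_latency_us:
[16, 32)               4 |@@@@@@@@|
@hs_success: 4
"""
print(parse_counts(SAMPLE, ["hs_attempts", "hs_success", "hs_failure"]))
```

Defaulting missing maps to zero mirrors the `[ -z ... ] && VAR=0` lines in the shell script: bpftrace prints nothing for a map that was never touched.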
WireGuard packet loss detection
cat > /usr/local/bin/ebpf-wg-packetloss-collector << 'BPFEOF'
#!/bin/bash
# WireGuard packet loss detection via bpftrace + textfile collector.
# Detects drops by tracing wg receive path and comparing to skb frees.
OUTFILE="/var/lib/node_exporter/textfile_collector/wg_packetloss.prom"
TMPFILE="${OUTFILE}.tmp"
RESULT=$(bpftrace -e '
// Count packets entering the WireGuard receive path
kprobe:wg_packet_receive {
@wg_rx_enter = count();
}
// Count packets successfully decrypted and delivered
kprobe:wg_packet_rx_poll {
@wg_rx_delivered = count();
}
// Count packets dropped due to decryption failure or replay
kprobe:kfree_skb_reason {
$reason = arg1;
// SKB_DROP_REASON_NOT_SPECIFIED = 1, track all drops on wg interface
@skb_drops = count();
}
// Count WireGuard-specific drops (invalid packets)
kprobe:wg_packet_consume_data_done {
@wg_decrypt_done = count();
}
interval:s:15 { exit(); }
' 2>/dev/null)
RX_ENTER=$(echo "$RESULT" | grep '@wg_rx_enter:' | awk '{print $2}')
RX_DELIVERED=$(echo "$RESULT" | grep '@wg_rx_delivered:' | awk '{print $2}')
DECRYPT_DONE=$(echo "$RESULT" | grep '@wg_decrypt_done:' | awk '{print $2}')
SKB_DROPS=$(echo "$RESULT" | grep '@skb_drops:' | awk '{print $2}')
[ -z "$RX_ENTER" ] && RX_ENTER=0
[ -z "$RX_DELIVERED" ] && RX_DELIVERED=0
[ -z "$DECRYPT_DONE" ] && DECRYPT_DONE=0
[ -z "$SKB_DROPS" ] && SKB_DROPS=0
{
echo "# HELP ebpf_wireguard_rx_enter_sampled Packets entering WG receive path (15s window)"
echo "# TYPE ebpf_wireguard_rx_enter_sampled gauge"
echo "ebpf_wireguard_rx_enter_sampled ${RX_ENTER}"
echo "# HELP ebpf_wireguard_rx_delivered_sampled Packets delivered after decryption (15s window)"
echo "# TYPE ebpf_wireguard_rx_delivered_sampled gauge"
echo "ebpf_wireguard_rx_delivered_sampled ${RX_DELIVERED}"
echo "# HELP ebpf_wireguard_decrypt_done_sampled Decryption completions (15s window)"
echo "# TYPE ebpf_wireguard_decrypt_done_sampled gauge"
echo "ebpf_wireguard_decrypt_done_sampled ${DECRYPT_DONE}"
echo "# HELP ebpf_wireguard_skb_drops_sampled Total SKB drops system-wide (15s window)"
echo "# TYPE ebpf_wireguard_skb_drops_sampled gauge"
echo "ebpf_wireguard_skb_drops_sampled ${SKB_DROPS}"
} > "$TMPFILE"
mv "$TMPFILE" "$OUTFILE"
BPFEOF
chmod +x /usr/local/bin/ebpf-wg-packetloss-collector
WireGuard's kernel module is intentionally minimal, which means the kprobe surface is small. The packet loss collector above catches drops at the SKB level, not just WireGuard-specific drops. For production, filter by network namespace or interface index to isolate WireGuard traffic from everything else.
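Once the sampled counters are in hand, a loss estimate is the gap between packets that entered the receive path and packets that finished decryption. This is a rough sketch that assumes both counters count individual packets over the same window, which the collector above does not guarantee; the function name is hypothetical:

```python
def rx_loss_percent(rx_enter, rx_done):
    """Estimated share of received packets that never completed decryption."""
    if rx_enter <= 0:
        return 0.0
    # Packets still in flight at the window edge can make rx_done exceed
    # rx_enter briefly, so clamp at zero instead of reporting negative loss.
    lost = max(rx_enter - rx_done, 0)
    return lost / rx_enter * 100


print(rx_loss_percent(1000, 990))
```

Because of the window-edge effect, a single non-zero sample means little; look for a loss estimate that stays elevated across several consecutive windows.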
Grafana Dashboard Templates
Below is a complete Grafana dashboard JSON that creates panels for all the exporters above. Import it via Grafana UI (Dashboards > Import > paste JSON) or provision it as a file.
Dashboard layout
The dashboard is organized into four rows:
Row 1: System Health
Panel 1: Disk I/O Latency Heatmap — x-axis is time, y-axis is latency buckets (1μs to 1s), color intensity is request count. Shows the full distribution, not just averages. Hot spots in the high-latency buckets mean trouble.
Panel 2: Scheduler Run Queue Latency p50/p99 — two time-series lines. p99 above 10ms means CPU saturation.
Panel 3: Page Cache Hit Ratio — single stat with thresholds: green above 95%, yellow 90-95%, red below 90%.
Row 2: ZFS Health
Panel 4: ARC Hit Ratio — gauge, same thresholds as page cache. Should track close to page cache ratio on a ZFS-only system.
Panel 5: TXG Sync Time p50/p99 — time series. Alert threshold at 3 seconds.
Panel 6: ZIO Latency by Type — stacked time series, read/write/free separated. Writes should be faster than reads (async).
Panel 7: Metaslab Allocation Latency — p99 line graph. Rising trend = increasing fragmentation.
Row 3: Network
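Panel 8: TCP Retransmit Rate — retransmit count per sample window, as written by the textfile collector. A sustained climb is network trouble. To express it as a percentage (for example, "above 1%"), divide by segments sent, such as node_exporter's node_netstat_Tcp_OutSegs.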
Panel 9: WireGuard Throughput — rx/tx bytes per second, stacked area graph.
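Panel 10: WireGuard Handshake Age — per-peer bar gauge. A peer with active traffic or a persistent keepalive re-handshakes about every two minutes, so more than 180s without a handshake suggests it may be unreachable. Idle peers without keepalive can legitimately show older handshakes.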
Row 4: Application
Panel 11: Syscall Rate Top 10 — stacked time series of the 10 hottest syscalls. Shows workload character.
Panel 12: File Ops by Process — table panel, sorted by total ops. Identifies I/O-heavy processes.
Panel 13: Memory — minor vs major fault rates, page allocation order histogram.
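Several of these panels plot p50/p99 values derived from cumulative histogram buckets. The sketch below shows the linear-interpolation idea behind PromQL's `histogram_quantile` on made-up bucket counts; it is a simplified illustration that skips edge cases the real function handles, and the bucket boundaries are assumptions rather than the exporters' actual layout.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch of the interpolation used for Prometheus histograms.
    """
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the overflow bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up window: 1200 reads under 100us, 40 under 5ms, 7 slower than 50ms.
disk_buckets = [(100, 1200), (5000, 1240), (50000, 1240), (math.inf, 1247)]
for q in (0.50, 0.99, 0.999):
    print(q, histogram_quantile(q, disk_buckets))
```

With these numbers p50 lands near 52 µs and p99 near 4.3 ms, while the seven slowest reads only appear at p99.9. The estimate is only as precise as the bucket boundaries, which is worth keeping in mind when choosing alert thresholds.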
Grafana provisioning JSON
{
"dashboard": {
"title": "eBPF Kernel Metrics",
"tags": ["ebpf", "kernel", "zfs", "wireguard"],
"timezone": "browser",
"refresh": "30s",
"panels": [
{
"title": "Disk I/O Latency Distribution",
"type": "heatmap",
"datasource": "Prometheus",
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 0 },
"targets": [{
"expr": "rate(ebpf_disk_io_latency_microseconds_bucket[5m])",
"format": "heatmap",
"legendFormat": "{{le}}"
}],
"options": {
"calculate": false,
"color": { "scheme": "Spectral", "reverse": true },
"yAxis": { "unit": "us" }
}
},
{
"title": "Scheduler Run Queue Latency",
"type": "timeseries",
"datasource": "Prometheus",
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 0 },
"targets": [
{
"expr": "histogram_quantile(0.50, rate(ebpf_sched_runqueue_latency_microseconds_bucket[5m]))",
"legendFormat": "p50"
},
{
"expr": "histogram_quantile(0.99, rate(ebpf_sched_runqueue_latency_microseconds_bucket[5m]))",
"legendFormat": "p99"
}
],
"fieldConfig": {
"defaults": { "unit": "us" }
}
},
{
"title": "Page Cache Hit Ratio",
"type": "stat",
"datasource": "Prometheus",
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 0 },
"targets": [{
"expr": "ebpf_page_cache_hit_ratio"
}],
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"thresholds": {
"steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 0.90 },
{ "color": "green", "value": 0.95 }
]
}
}
}
},
{
"title": "ZFS ARC Hit Ratio",
"type": "gauge",
"datasource": "Prometheus",
"gridPos": { "h": 8, "w": 6, "x": 0, "y": 8 },
"targets": [{
"expr": "ebpf_zfs_arc_hit_ratio"
}],
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"min": 0, "max": 1,
"thresholds": {
"steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 0.90 },
{ "color": "green", "value": 0.95 }
]
}
}
}
},
{
"title": "TXG Sync Time",
"type": "timeseries",
"datasource": "Prometheus",
"gridPos": { "h": 8, "w": 6, "x": 6, "y": 8 },
"targets": [
{
"expr": "histogram_quantile(0.50, rate(ebpf_zfs_txg_sync_microseconds_bucket[5m]))",
"legendFormat": "p50"
},
{
"expr": "histogram_quantile(0.99, rate(ebpf_zfs_txg_sync_microseconds_bucket[5m]))",
"legendFormat": "p99"
},
{
"expr": "ebpf_zfs_txg_last_sync_microseconds",
"legendFormat": "last sync"
}
],
"fieldConfig": { "defaults": { "unit": "us" } }
},
{
"title": "ZIO Latency by Type (p99)",
"type": "timeseries",
"datasource": "Prometheus",
"gridPos": { "h": 8, "w": 6, "x": 12, "y": 8 },
"targets": [
{
"expr": "histogram_quantile(0.99, rate(ebpf_zfs_zio_read_latency_microseconds_bucket[5m]))",
"legendFormat": "read p99"
},
{
"expr": "histogram_quantile(0.99, rate(ebpf_zfs_zio_write_latency_microseconds_bucket[5m]))",
"legendFormat": "write p99"
},
{
"expr": "histogram_quantile(0.99, rate(ebpf_zfs_zio_free_latency_microseconds_bucket[5m]))",
"legendFormat": "free p99"
}
],
"fieldConfig": { "defaults": { "unit": "us" } }
},
{
"title": "Metaslab Allocation Latency (p99)",
"type": "timeseries",
"datasource": "Prometheus",
"gridPos": { "h": 8, "w": 6, "x": 18, "y": 8 },
"targets": [{
"expr": "histogram_quantile(0.99, rate(ebpf_zfs_metaslab_alloc_microseconds_bucket[5m]))",
"legendFormat": "p99"
}],
"fieldConfig": { "defaults": { "unit": "us" } }
},
{
"title": "TCP Retransmit Rate",
"type": "timeseries",
"datasource": "Prometheus",
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 16 },
"targets": [{
"expr": "ebpf_tcp_retransmits_total",
"legendFormat": "retransmits / sample window"
}]
},
{
"title": "WireGuard Throughput",
"type": "timeseries",
"datasource": "Prometheus",
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 16 },
"targets": [
{
"expr": "rate(ebpf_wireguard_bytes_total{direction='rx'}[5m])",
"legendFormat": "RX bytes/s"
},
{
"expr": "rate(ebpf_wireguard_bytes_total{direction='tx'}[5m])",
"legendFormat": "TX bytes/s"
}
],
"fieldConfig": { "defaults": { "unit": "Bps" } }
},
{
"title": "WireGuard Handshake Age",
"type": "bargauge",
"datasource": "Prometheus",
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 16 },
"targets": [{
"expr": "ebpf_wireguard_handshake_age_seconds",
"legendFormat": "{{peer}}"
}],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 120 },
{ "color": "red", "value": 180 }
]
}
}
}
},
{
"title": "Syscall Rate (Top 10)",
"type": "timeseries",
"datasource": "Prometheus",
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 24 },
"targets": [{
"expr": "topk(10, rate(ebpf_syscalls_total[5m]))",
"legendFormat": "{{syscall}}"
}]
},
{
"title": "File Ops by Process",
"type": "table",
"datasource": "Prometheus",
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 24 },
"targets": [{
"expr": "topk(15, rate(ebpf_file_ops_total[5m]))",
"format": "table",
"instant": true
}]
},
{
"title": "Page Fault Rate",
"type": "timeseries",
"datasource": "Prometheus",
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 24 },
"targets": [
{
"expr": "rate(ebpf_page_faults_minor_total[5m])",
"legendFormat": "minor faults/s"
},
{
"expr": "rate(ebpf_page_faults_major_total[5m])",
"legendFormat": "major faults/s"
}
]
}
]
}
}
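Before importing or provisioning, it can help to sanity-check the file. The sketch below is an assumption-based helper, not a Grafana tool: Grafana's HTTP API wraps the model in a `dashboard` key, whereas file-based provisioning generally expects the bare model, so `unwrap_dashboard` strips the wrapper when present. `check_layout` flags panels that overflow Grafana's 24-column grid or overlap each other. All function names are hypothetical.

```python
import json

GRID_COLUMNS = 24  # Grafana's dashboard grid is 24 columns wide


def unwrap_dashboard(doc: dict) -> dict:
    """Return the bare dashboard model, dropping an API-style wrapper if present."""
    return doc["dashboard"] if "dashboard" in doc else doc


def check_layout(dashboard: dict) -> list:
    """Return human-readable placement problems (an empty list means OK)."""
    problems = []
    owner = {}  # (col, row) -> title of the panel occupying that grid cell
    for panel in dashboard.get("panels", []):
        pos, title = panel["gridPos"], panel.get("title", "<untitled>")
        if pos["x"] + pos["w"] > GRID_COLUMNS:
            problems.append(f"{title}: overflows the {GRID_COLUMNS}-column grid")
        clashes = set()
        for col in range(pos["x"], pos["x"] + pos["w"]):
            for row in range(pos["y"], pos["y"] + pos["h"]):
                other = owner.setdefault((col, row), title)
                if other != title:
                    clashes.add(other)
        problems.extend(f"{title}: overlaps {other}" for other in sorted(clashes))
    return problems


def to_provisioning_json(doc: dict) -> str:
    """Serialize the bare model, ready to write to a provisioning directory."""
    return json.dumps(unwrap_dashboard(doc), indent=2)


example = {"dashboard": {"title": "eBPF Kernel Metrics", "panels": [
    {"title": "A", "gridPos": {"h": 8, "w": 8, "x": 0, "y": 0}},
    {"title": "B", "gridPos": {"h": 8, "w": 8, "x": 8, "y": 0}},
]}}
print(check_layout(unwrap_dashboard(example)))
```

If an import rejects the file with a missing-title error, the wrapper is the first thing to check.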
Alertmanager Rules
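Drop this into /etc/prometheus/rules/ebpf-alerts.yml and reload Prometheus. The file is evaluated by Prometheus itself (Alertmanager only routes the resulting alerts), so make sure rule_files in prometheus.yml includes /etc/prometheus/rules/*.yml. These rules fire when eBPF metrics indicate real problems, not noisy approximations.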
cat > /etc/prometheus/rules/ebpf-alerts.yml << 'EOF'
groups:
  - name: ebpf_disk
    rules:
      - alert: DiskLatencyP99High
        expr: histogram_quantile(0.99, rate(ebpf_disk_io_latency_microseconds_bucket[5m])) > 50000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk I/O p99 latency exceeds 50ms"
          description: >
            The 99th percentile disk I/O latency on {{ $labels.instance }}
            device {{ $labels.device }} has been above 50ms for 5 minutes.
            Current value: {{ printf "%.0f" $value }} microseconds.
      - alert: DiskLatencyP99Critical
        expr: histogram_quantile(0.99, rate(ebpf_disk_io_latency_microseconds_bucket[5m])) > 200000
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Disk I/O p99 latency exceeds 200ms"
          description: >
            The 99th percentile disk I/O latency on {{ $labels.instance }}
            is above 200ms. Investigate queue depth and device health.
  - name: ebpf_zfs
    rules:
      - alert: ZfsArcHitRateLow
        expr: ebpf_zfs_arc_hit_ratio < 0.90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ZFS ARC hit rate below 90%"
          description: >
            ARC hit ratio on {{ $labels.instance }} is {{ $value | humanizePercentage }}.
            Working set exceeds available memory. Consider adding RAM or
            investigating which workload is thrashing the cache.
      - alert: ZfsArcHitRateCritical
        expr: ebpf_zfs_arc_hit_ratio < 0.80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ZFS ARC hit rate below 80%"
          description: >
            ARC hit ratio on {{ $labels.instance }} is critically low at
            {{ $value | humanizePercentage }}. Expect significant I/O performance
            degradation. Immediate investigation required.
      - alert: ZfsTxgSyncTimeSlow
        expr: histogram_quantile(0.99, rate(ebpf_zfs_txg_sync_microseconds_bucket[5m])) > 3000000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ZFS TXG sync time p99 exceeds 3 seconds"
          description: >
            Transaction group syncs on {{ $labels.instance }} are taking over
            3 seconds (p99). The default txg timeout is 5 seconds — approaching
            that limit will cause write throttling.
      - alert: ZfsScrubErrors
        expr: ebpf_zfs_scrub_errors_sampled > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ZFS scrub detected errors"
          description: >
            ZFS scrub on {{ $labels.instance }} found {{ $value }} errors in
            the last sampling window. Run 'zpool status' immediately to check
            for data corruption and vdev health.
      - alert: ZfsMetaslabAllocSlow
        expr: histogram_quantile(0.99, rate(ebpf_zfs_metaslab_alloc_microseconds_bucket[5m])) > 10000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ZFS metaslab allocation latency elevated"
          description: >
            Metaslab allocation p99 on {{ $labels.instance }} exceeds 10ms.
            This usually indicates pool fragmentation or a nearly full pool.
            Check 'zpool list -o name,capacity,fragmentation' and consider
            freeing space or adding capacity.
  - name: ebpf_network
    rules:
      - alert: TcpRetransmitRateHigh
        expr: ebpf_tcp_retransmits_total > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TCP retransmit rate elevated"
          description: >
            {{ $labels.instance }} observed {{ $value }} TCP retransmits in the
            last sampling window. Check for network congestion, bad cables,
            or overloaded interfaces.
      - alert: WireguardHandshakeStale
        expr: ebpf_wireguard_handshake_age_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WireGuard peer handshake stale"
          description: >
            Peer {{ $labels.peer }} on {{ $labels.interface }} has not completed
            a handshake in {{ $value | humanizeDuration }}. The peer may be
            unreachable or the tunnel may be down.
  - name: ebpf_system
    rules:
      - alert: SchedulerLatencyHigh
        expr: histogram_quantile(0.99, rate(ebpf_sched_runqueue_latency_microseconds_bucket[5m])) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU scheduler run queue latency p99 exceeds 10ms"
          description: >
            Tasks on {{ $labels.instance }} are waiting over 10ms in the run
            queue before getting CPU time. The system is CPU-saturated.
            Consider reducing workload or adding CPU capacity.
      - alert: SchedulerLatencyCritical
        expr: histogram_quantile(0.99, rate(ebpf_sched_runqueue_latency_microseconds_bucket[5m])) > 50000
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "CPU scheduler latency p99 exceeds 50ms"
          description: >
            Run queue latency on {{ $labels.instance }} is critically high.
            Applications will experience visible delays. This indicates severe
            CPU contention.
      - alert: PageCacheHitRateLow
        expr: ebpf_page_cache_hit_ratio < 0.90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Page cache hit rate below 90%"
          description: >
            Page cache hit ratio on {{ $labels.instance }} is
            {{ $value | humanizePercentage }}. The working set exceeds
            available memory, causing increased disk I/O.
      - alert: MajorPageFaultsHigh
        expr: rate(ebpf_page_faults_major_total[5m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Major page fault rate elevated"
          description: >
            {{ $labels.instance }} is experiencing {{ $value }} major page
            faults per second. This means the kernel is reading data from disk
            that should be in memory. Check for memory pressure or swap usage.
EOF
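Each rule above pairs a threshold with a `for:` duration, which is what keeps a single bad sample from paging anyone. The sketch below simulates that pending-then-firing behaviour for one series. It is a simplified model, assuming one evaluation per minute; `alert_states` is a hypothetical helper and the sample values are made up.

```python
def alert_states(samples, threshold, for_seconds, eval_interval=60):
    """Yield 'inactive', 'pending', or 'firing' for each rule evaluation.

    samples: metric value at each evaluation, oldest first.
    A rule turns 'pending' when the expression first holds and 'firing'
    once it has held continuously for at least for_seconds.
    """
    active_since = None
    for i, value in enumerate(samples):
        now = i * eval_interval
        if value > threshold:
            if active_since is None:
                active_since = now
            yield "firing" if now - active_since >= for_seconds else "pending"
        else:
            # Any evaluation below the threshold resets the timer.
            active_since = None
            yield "inactive"


# p99 disk latency in microseconds, one value per minute (made-up numbers).
p99_us = [20000, 60000, 70000, 30000, 55000, 60000, 65000, 70000, 80000, 90000]
for minute, state in enumerate(alert_states(p99_us, threshold=50000, for_seconds=300)):
    print(minute, p99_us[minute], state)
```

In this run the two-minute spike early on never fires, while the later sustained breach fires once it has lasted five minutes. That is the trade-off behind the shorter `for:` values on the critical rules.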
Reload Prometheus to pick up the rules
# Validate the rules file first
promtool check rules /etc/prometheus/rules/ebpf-alerts.yml
# Output:
# Checking /etc/prometheus/rules/ebpf-alerts.yml
# SUCCESS: 13 rules found
# Reload Prometheus (assumes systemd)
systemctl reload prometheus
# Or send SIGHUP
kill -HUP $(pidof prometheus)
Running All Exporters
For production, create a single systemd target that starts all eBPF exporters together:
cat > /etc/systemd/system/ebpf-exporters.target << 'EOF'
[Unit]
Description=All eBPF Prometheus Exporters
Wants=ebpf-diskio-exporter.service
Wants=ebpf-syscall-exporter.service
Wants=ebpf-cachestat-exporter.service
Wants=ebpf-schedlat-exporter.service
Wants=ebpf-memory-exporter.service
Wants=ebpf-fileops-exporter.service
Wants=ebpf-zfs-arc-exporter.service
Wants=ebpf-zfs-zio-exporter.service
Wants=ebpf-zfs-txg-exporter.service
Wants=ebpf-zfs-metaslab-exporter.service
Wants=ebpf-zfs-dmu-exporter.service
Wants=ebpf-wireguard-exporter.service
[Install]
WantedBy=multi-user.target
EOF
# Create systemd units for each exporter (same pattern as diskio above)
for exporter in syscall cachestat schedlat memory fileops; do
cat > /etc/systemd/system/ebpf-${exporter}-exporter.service << UNIT
[Unit]
Description=eBPF ${exporter} Exporter for Prometheus
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/ebpf-${exporter}-exporter
Restart=on-failure
RestartSec=5
AmbientCapabilities=CAP_SYS_ADMIN CAP_BPF CAP_PERFMON
ProtectSystem=strict
ProtectHome=yes
[Install]
WantedBy=ebpf-exporters.target
UNIT
done
# Same for ZFS exporters
for exporter in zfs-arc zfs-zio zfs-txg zfs-metaslab zfs-dmu; do
cat > /etc/systemd/system/ebpf-${exporter}-exporter.service << UNIT
[Unit]
Description=eBPF ${exporter} Exporter for Prometheus
After=network.target zfs.target
[Service]
Type=simple
ExecStart=/usr/local/bin/ebpf-${exporter}-exporter
Restart=on-failure
RestartSec=5
AmbientCapabilities=CAP_SYS_ADMIN CAP_BPF CAP_PERFMON
[Install]
WantedBy=ebpf-exporters.target
UNIT
done
# WireGuard exporter
cat > /etc/systemd/system/ebpf-wireguard-exporter.service << 'UNIT'
[Unit]
Description=eBPF WireGuard Exporter for Prometheus
After=network.target wg-quick@wg0.service
[Service]
Type=simple
ExecStart=/usr/local/bin/ebpf-wireguard-exporter
Restart=on-failure
RestartSec=5
AmbientCapabilities=CAP_SYS_ADMIN CAP_BPF CAP_PERFMON CAP_NET_ADMIN
[Install]
WantedBy=ebpf-exporters.target
UNIT
systemctl daemon-reload
systemctl enable --now ebpf-exporters.target
Complete Prometheus scrape configuration
# Add all exporters to /etc/prometheus/prometheus.yml.
# NOTE: this heredoc appends a top-level scrape_configs: key. If the file
# already has one, paste the job entries under the existing key instead;
# a duplicate key makes Prometheus reject the config.
cat >> /etc/prometheus/prometheus.yml << 'EOF'
scrape_configs:
  # eBPF Exporters — kernel metrics
  - job_name: 'ebpf-diskio'
    static_configs:
      - targets: ['localhost:9101']
  - job_name: 'ebpf-syscall'
    static_configs:
      - targets: ['localhost:9105']
  - job_name: 'ebpf-cachestat'
    static_configs:
      - targets: ['localhost:9106']
  - job_name: 'ebpf-schedlat'
    static_configs:
      - targets: ['localhost:9107']
  - job_name: 'ebpf-memory'
    static_configs:
      - targets: ['localhost:9108']
  - job_name: 'ebpf-fileops'
    static_configs:
      - targets: ['localhost:9109']
  # eBPF Exporters — ZFS metrics
  - job_name: 'ebpf-zfs-arc'
    static_configs:
      - targets: ['localhost:9110']
  - job_name: 'ebpf-zfs-zio'
    static_configs:
      - targets: ['localhost:9111']
  - job_name: 'ebpf-zfs-txg'
    static_configs:
      - targets: ['localhost:9112']
  - job_name: 'ebpf-zfs-metaslab'
    static_configs:
      - targets: ['localhost:9113']
  - job_name: 'ebpf-zfs-dmu'
    static_configs:
      - targets: ['localhost:9114']
  # eBPF Exporters — WireGuard metrics
  - job_name: 'ebpf-wireguard'
    static_configs:
      - targets: ['localhost:9115']
  # node_exporter serves the bpftrace textfile collectors
  # (requires --collector.textfile.directory to point at their output)
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
EOF
# Validate the merged config before reloading
promtool check config /etc/prometheus/prometheus.yml
Exporter Port Reference
| Port | Exporter | Metrics | Pattern |
|---|---|---|---|
| 9100 | node_exporter | Standard + textfile collectors (TCP retransmit, WG handshake, scrub, packet loss) | textfile |
| 9101 | ebpf-diskio-exporter | Block I/O latency histograms per device and operation | BCC Python |
| 9105 | ebpf-syscall-exporter | Syscall counts by type | BCC Python |
| 9106 | ebpf-cachestat-exporter | Page cache hit/miss/dirty rates and hit ratio | BCC Python |
| 9107 | ebpf-schedlat-exporter | Scheduler run queue latency histogram | BCC Python |
| 9108 | ebpf-memory-exporter | Page faults (minor/major), allocation order | BCC Python |
| 9109 | ebpf-fileops-exporter | File open/read/write counts by process | BCC Python |
| 9110 | ebpf-zfs-arc-exporter | ARC hit/miss/eviction, size, hit ratio | BCC Python |
| 9111 | ebpf-zfs-zio-exporter | ZIO latency by operation type (read/write/free) | BCC Python |
| 9112 | ebpf-zfs-txg-exporter | TXG sync time histogram, sync count | BCC Python |
| 9113 | ebpf-zfs-metaslab-exporter | Metaslab allocation latency, success/failure rate | BCC Python |
| 9114 | ebpf-zfs-dmu-exporter | Per-dataset (objset) I/O ops and bytes | BCC Python |
| 9115 | ebpf-wireguard-exporter | Per-peer throughput, handshake age | BCC Python |
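After starting the target, a quick probe of every port in the table confirms that each exporter is actually serving metrics. The sketch below is a minimal check using only the standard library. It assumes each exporter listens on localhost and serves the conventional `/metrics` path; the function names are hypothetical.

```python
import http.client
import urllib.error
import urllib.request

# Ports copied from the reference table above.
EXPORTER_PORTS = {
    "node_exporter": 9100,
    "ebpf-diskio-exporter": 9101,
    "ebpf-syscall-exporter": 9105,
    "ebpf-cachestat-exporter": 9106,
    "ebpf-schedlat-exporter": 9107,
    "ebpf-memory-exporter": 9108,
    "ebpf-fileops-exporter": 9109,
    "ebpf-zfs-arc-exporter": 9110,
    "ebpf-zfs-zio-exporter": 9111,
    "ebpf-zfs-txg-exporter": 9112,
    "ebpf-zfs-metaslab-exporter": 9113,
    "ebpf-zfs-dmu-exporter": 9114,
    "ebpf-wireguard-exporter": 9115,
}


def metrics_url(port: int, host: str = "localhost") -> str:
    """Build the scrape URL (assumes the default /metrics path)."""
    return f"http://{host}:{port}/metrics"


def is_up(port: int, host: str = "localhost", timeout: float = 2.0) -> bool:
    """True if the exporter answers its metrics endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(metrics_url(port, host), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, HTTP error, or malformed response.
        return False


def report(host: str = "localhost") -> dict:
    """Map each exporter name to True/False reachability."""
    return {name: is_up(port, host) for name, port in EXPORTER_PORTS.items()}
```

Calling `report()` on the monitored host returns a name-to-boolean map. Anything reported `False` is worth checking with `systemctl status` before looking at Prometheus targets.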
Comparison: eBPF Metrics vs Traditional Approaches
| Dimension | eBPF Exporters | /proc Polling (node_exporter) | SNMP | Agent-Based (Datadog, etc.) |
|---|---|---|---|---|
| Resolution | Per-event (nanosecond timestamps) | Scrape interval (15s typical) | Poll interval (60s typical) | Agent interval (10-15s) |
| Data type | Distributions, histograms, per-event | Counters and gauges only | Counters and gauges | Counters, gauges, some histograms |
| Causality | Process, device, stack trace per event | System-wide aggregates only | Interface-level only | Process-level, no stack traces |
| Overhead | Typically <1% CPU with in-kernel aggregation; grows with event rate on hot probes | <0.1% CPU | Negligible | 1-5% CPU, 200-500MB RAM |
| Kernel visibility | Any kernel function, tracepoint, or kprobe | Only what /proc exposes | None | Userspace only (some kernel via /proc) |
| Custom metrics | Write any eBPF program for any kernel event | Fixed set of kernel counters | Fixed MIBs | Custom checks, not kernel-level |
| ZFS-specific | ARC, ZIO, TXG, metaslab, DMU — full internals | /proc/spl/kstat/zfs — counters only | None | None (or basic zpool status parsing) |
| WireGuard-specific | Packet-level tracing, handshake timing | None (wg show output only) | None | None (or wg show parsing) |
| Deployment | Requires kernel 4.15+ (5.8+ for CAP_BPF and CAP_PERFMON), BCC or libbpf | Single binary, no dependencies | Built into most network gear | Agent install + SaaS subscription |
| Cost | Free, open source | Free, open source | Free (protocol), gear costs vary | $15-30/host/month (SaaS) |
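| Misses short events | Rarely: every event is counted in-kernel while the probe is attached; the windowed bpftrace collectors see only their sampling window | Yes — anything between scrapes is invisible | Yes — worse than /proc due to longer intervals | Yes — same as /proc polling |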
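The last row of the table is easy to demonstrate with a toy model. The sketch below compares what a polled counter and a per-event histogram report for two made-up scrape windows. The latencies are invented for illustration, and the power-of-two bucketing only approximates the style of histogram that bpftrace and BCC produce.

```python
from collections import Counter


def pow2_bucket(latency_us: int) -> int:
    """Smallest power of two (in microseconds) that is >= latency_us."""
    return 1 << max(latency_us - 1, 0).bit_length()


def summarize(latencies_us):
    """Return (counter_delta, histogram) for one scrape window."""
    histogram = Counter(pow2_bucket(v) for v in latencies_us)
    return len(latencies_us), dict(sorted(histogram.items()))


# Two made-up 15 s windows with the same number of reads.
healthy = [80] * 1247
stalled = [80] * 1200 + [3000] * 40 + [60000] * 7

for name, window in (("healthy", healthy), ("stalled", stalled)):
    delta, hist = summarize(window)
    print(f"{name}: counter delta={delta}, histogram={hist}")
```

Both windows report a counter delta of 1247, so a /proc-style poll cannot tell them apart. Only the histogram shows the 40 reads in the 4096 µs bucket and the 7 reads in the 65536 µs bucket.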
The honest answer is you want both. Use node_exporter for the basics — CPU, memory, disk space, network counters. It's stable, lightweight, and covers 80% of monitoring needs. Layer eBPF exporters on top for the 20% that matters most: latency distributions, cache hit rates, per-process breakdowns, and ZFS internals. Don't replace node_exporter. Supplement it.
The production recommendation
Start with three eBPF exporters: disk I/O latency (catches storage problems before they cascade), scheduler run queue latency (catches CPU saturation that utilization metrics miss), and ZFS ARC hit rate (catches memory pressure on ZFS systems). These three metrics, combined with standard node_exporter, give you more operational insight than any $30/host SaaS monitoring product.