eBPF for Performance — Diagnose Anything
When something is slow, you need to know where the time goes. Is it disk? CPU? Network? Memory pressure? eBPF instruments the kernel itself, giving you latency histograms, I/O breakdowns, cache hit rates, and scheduler metrics — all with low overhead. The tracing tools count every event rather than sampling, so there are no sampling artifacts. No guessing.
The approach: traditional performance tools (top, iostat, vmstat) show you averages and aggregates. They tell you the disk is busy, but not which process is responsible or how long each individual I/O takes. eBPF traces individual events at the kernel level. You see the full distribution — not just the mean, but the outliers that actually cause the pain.
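A toy illustration of why the distribution matters more than the mean (the latencies here are invented):

```python
# Averages hide tails: 99 fast cache hits plus one 50 ms outlier
# still produce a mean that looks acceptable.
lat_us = [50] * 99 + [50_000]
mean = sum(lat_us) / len(lat_us)
p99 = sorted(lat_us)[int(0.99 * len(lat_us))]  # crude nearest-rank p99
print(f"mean = {mean:.0f} us, p99 = {p99} us")
```

The mean comes out around half a millisecond, which looks fine on a dashboard, while the p99 shows the 50 ms outlier that users actually feel.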
biolatency — disk I/O latency histograms
biolatency traces every block device I/O and shows how long each one took, as a histogram. This is the single most important storage performance tool.
# Show disk I/O latency histogram (press Ctrl+C to see results)
biolatency
Output:
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 12 |** |
8 -> 15 : 45 |********* |
16 -> 31 : 123 |************************* |
32 -> 63 : 189 |****************************************|
64 -> 127 : 87 |****************** |
128 -> 255 : 34 |******* |
256 -> 511 : 8 |* |
512 -> 1023 : 2 | |
1024 -> 2047 : 1 | |
How to read this
Healthy NVMe: the bulk of I/O completes in 4-127 microseconds. Anything above 1ms is rare.
Healthy spinning disk: expect the peak around 2-16ms (2048-16384 usecs). That is seek time + rotational latency.
Degraded: if you see a bimodal distribution — a peak at 32us AND another peak at 50ms — some I/O is hitting cache and some is hitting a dying disk or saturated controller.
Catastrophic: if the histogram peaks above 100ms, your storage is actively destroying application performance.
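If you log histograms like the one above, you can turn the power-of-two buckets into rough percentiles. A minimal sketch, assuming uniform spread inside each bucket (the counts are copied from the example output):

```python
# Estimate percentiles from a log2 histogram (biolatency-style buckets).
# Each bucket is (low_us, high_us, count).
def percentile(buckets, p):
    total = sum(c for _, _, c in buckets)
    target = total * p / 100.0
    seen = 0
    for low, high, count in buckets:
        if count and seen + count >= target:
            frac = (target - seen) / count   # interpolate inside the bucket
            return low + frac * (high - low)
        seen += count
    return buckets[-1][1]

# counts copied from the example histogram above
hist = [(4, 7, 12), (8, 15, 45), (16, 31, 123), (32, 63, 189),
        (64, 127, 87), (128, 255, 34), (256, 511, 8),
        (512, 1023, 2), (1024, 2047, 1)]
print(f"p50 ~ {percentile(hist, 50):.0f} us, p99 ~ {percentile(hist, 99):.0f} us")
```

For the example histogram this puts p50 in the 32-63us bucket and p99 in the 256-511us bucket, which matches a quick visual read of the distribution.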
# Break down latency by disk device
biolatency -D
# Show latency in milliseconds instead of microseconds
biolatency -m
# Print one 10-second summary, then exit
biolatency 10 1
# Show latency by disk and include queue time
biolatency -DQ
biotop — I/O by process
biotop shows real-time block device I/O by process. It answers "who is killing my disk?" in seconds.
# Real-time disk I/O by process (updates every second)
biotop
Output:
PID COMM D MAJ MIN DISK I/O Kbytes AVGms
18234 postgres W 8 0 sda 847 13552 2.41
18190 postgres R 8 0 sda 312 4992 1.87
1456 jbd2/sda1-8 W 8 0 sda 89 712 0.34
2891 rsync R 8 0 sda 67 8576 3.12
Postgres is doing 847 writes per second averaging 2.4ms each. That is your disk hog. rsync is doing fewer I/Os but with higher average latency (3.1ms) because it is reading large sequential blocks that sometimes miss cache.
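You can read more out of those columns than latency. Kbytes divided by I/O count gives the average I/O size, which separates small random writes from large sequential reads:

```python
# Average I/O size (Kbytes / I/O count); numbers come from
# the example biotop output above.
rows = {"postgres (W)": (847, 13552), "rsync (R)": (67, 8576)}
avg_kb = {name: kbytes / ios for name, (ios, kbytes) in rows.items()}
for name, kb in avg_kb.items():
    print(f"{name}: avg {kb:.0f} KB per I/O")
```

Postgres averages 16 KB per write (random page writes), while rsync averages 128 KB per read — the large sequential pattern the paragraph above describes.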
# Don't clear the screen between updates (useful for logging)
biotop -C
# Show at most 10 rows
biotop -r 10
# One 30-second summary, then exit
biotop 30 1
cachestat — page cache hit/miss
The page cache is Linux's primary caching layer — it keeps frequently accessed file data in RAM. cachestat shows hit/miss ratios in real time. If your hit rate drops, you are either running out of RAM or your working set changed.
# Page cache statistics every second
cachestat
Output:
HITS MISSES DIRTIES BUFFERS_MB CACHED_MB HIT%
12847 234 89 128 3841 98.2%
13102 187 92 128 3842 98.6%
11893 1847 312 128 3798 86.6%
9234 4521 567 128 3612 67.1%
8012 6234 891 127 3401 56.2%
Reading the numbers
98% hit rate: your working set fits in RAM. Everything is fast.
86% and dropping: something is pushing your working set out of cache. Maybe a backup job, maybe a new query pattern, maybe ZFS ARC is competing for RAM.
Below 70%: serious cache pressure. The system is reading from disk for data that should be in RAM. Performance is degrading.
# One-second samples for 60 seconds, then exit
cachestat 1 60
# Update every 5 seconds instead
cachestat 5
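The HIT% column is simply HITS / (HITS + MISSES). When trending across logged intervals, you can recompute it yourself:

```python
# HIT% as cachestat computes it: HITS / (HITS + MISSES).
def hit_rate(hits, misses):
    total = hits + misses
    return 100.0 * hits / total if total else 100.0

# (HITS, MISSES) pairs taken from the example output above
for hits, misses in [(12847, 234), (9234, 4521)]:
    print(f"{hit_rate(hits, misses):.1f}%")
```

These reproduce the 98.2% and 67.1% rows from the sample output.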
runqlat — CPU scheduler latency
runqlat measures how long processes wait in the CPU scheduler's run queue before actually getting CPU time. This detects CPU saturation that top's "load average" hides.
# CPU scheduler latency histogram
runqlat
Output:
usecs : count distribution
0 -> 1 : 1423 |****************************************|
2 -> 3 : 892 |************************* |
4 -> 7 : 345 |********* |
8 -> 15 : 123 |*** |
16 -> 31 : 45 |* |
32 -> 63 : 12 | |
64 -> 127 : 3 | |
128 -> 255 : 1 | |
Most tasks wait 0-3 microseconds. That is healthy. If you see a significant cluster above 1ms (1024 usecs), processes are queuing for CPU time. Above 10ms and your users notice.
# Break down by PID
runqlat -P
# Show in milliseconds
runqlat -m
# One 10-second summary, then exit
runqlat 10 1
CPU saturation vs. CPU utilization
A CPU at 100% utilization is not necessarily a problem — it might be doing useful work efficiently. CPU saturation means tasks are waiting. runqlat measures the waiting. A server at 95% CPU with 2us runqlat is fine. A server at 80% CPU with 50ms runqlat has a scheduling problem (probably too many threads competing for too few cores).
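A back-of-the-envelope illustration of that point — queue wait scales with runnable threads per core, not with utilization (the thread count and timeslice below are assumed, not measured):

```python
# Not a scheduler model, just arithmetic: if 32 threads are runnable
# on 8 cores, each thread waits behind the others sharing its core.
cores = 8
runnable = 32            # threads wanting CPU at the same time (assumed)
timeslice_ms = 4         # assumed scheduler slice
wait_ms = (runnable / cores - 1) * timeslice_ms  # slices ahead of you per round
print(f"~{wait_ms:.0f} ms of queue wait per scheduling round")
```

Even with cores idle part of the time, four runnable threads per core means every thread spends most of each scheduling round waiting — which is exactly what runqlat measures.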
funccount — count kernel function calls
funccount counts how many times a kernel function is called. Use it to understand workload patterns at the kernel level.
# How many times are ext4 functions called? (trace for 10 seconds)
funccount -d 10 'ext4_*'
Output:
FUNC COUNT
ext4_file_read_iter 8234
ext4_readdir 1234
ext4_file_write_iter 892
ext4_lookup 567
ext4_getattr 345
ext4_file_open 234
# Count ZFS function calls
funccount -d 10 'zfs_*'
# Count TCP functions (network activity)
funccount -d 10 'tcp_*'
# Count a specific function
funccount -d 10 'vfs_read'
profile — CPU flame graphs
profile samples the call stack across all CPUs at a fixed rate. The output becomes a flame graph — the most powerful visualization of where CPU time goes.
# Sample at 99Hz for 30 seconds, folded output (-f) for flame graphs
profile -F 99 -af 30 > /tmp/profile.out
# If you have Brendan Gregg's FlameGraph tools:
git clone https://github.com/brendangregg/FlameGraph /opt/FlameGraph
# profile -f output is already stack-collapsed, so feed it straight in
/opt/FlameGraph/flamegraph.pl /tmp/profile.out > /tmp/flamegraph.svg
Open flamegraph.svg in a browser. The x-axis is not time — it is the percentage of samples. The wider a frame, the more CPU time that function consumed. Click frames to zoom.
# Profile only a specific process (-p takes a single PID; pgrep -o picks the oldest match)
profile -F 99 -p $(pgrep -o postgres) -af 30 > /tmp/pg-profile.out
# Sample user stacks only (skip kernel)
profile -U -af 30 > /tmp/user-profile.out
# Sample kernel stacks only
profile -K -af 30 > /tmp/kernel-profile.out
Reading flame graphs
Wide plateau at top: a single function consuming most of the CPU. Optimize or fix that function.
Wide base, narrow towers: CPU time is distributed across many functions. This is normal for a balanced workload.
Deep narrow stacks: recursive or deeply nested calls. Common in parsers and tree-walking code.
"Ice cream cone" shape: one hot code path with many layers. Typical of a single slow query or hot loop.
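The same "widest frame" question can be answered in plain text from `profile -f` folded output, where each line is `frame1;frame2;...;leaf count`. A sketch with made-up postgres stacks:

```python
# Sum folded-stack samples by leaf frame: the biggest leaf corresponds
# to the widest plateau at the top of the flame graph.
from collections import Counter

folded = [
    "postgres;ExecScan;heap_getnext 120",   # hypothetical stacks
    "postgres;ExecScan;heap_getnext 80",
    "postgres;PortalRun;ExecSort 40",
]
by_leaf = Counter()
for line in folded:
    stack, count = line.rsplit(" ", 1)
    by_leaf[stack.split(";")[-1]] += int(count)

for func, samples in by_leaf.most_common():
    print(func, samples)
```

This is handy on headless servers where opening an SVG is a hassle: the top of the list is the function to optimize first.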
bpftrace scripts for performance
Syscall latency by process
bpftrace -e '
tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
@usecs[comm] = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}'
This shows a latency histogram for every syscall, grouped by process name. If postgres has a tail of 10ms+ syscalls, something is blocking it.
TCP retransmit analysis
bpftrace -e '
kprobe:tcp_retransmit_skb {
@retransmits[comm] = count();
}
interval:s:10 {
print(@retransmits);
clear(@retransmits);
}'
TCP retransmits mean packet loss. If one process accounts for most retransmits, the problem is likely that process's network path, not system-wide congestion.
Memory allocation tracking
bpftrace -e '
tracepoint:kmem:kmalloc {
@bytes[comm] = sum(args.bytes_alloc);
@count[comm] = count();
}
interval:s:10 {
printf("\n--- Top allocators (10s window) ---\n");
print(@bytes, 10);
print(@count, 10);
clear(@bytes);
clear(@count);
}'
Filesystem operation timing
bpftrace -e '
kprobe:vfs_read { @start_read[tid] = nsecs; }
kretprobe:vfs_read /@start_read[tid]/ {
@read_us[comm] = hist((nsecs - @start_read[tid]) / 1000);
delete(@start_read[tid]);
}
kprobe:vfs_write { @start_write[tid] = nsecs; }
kretprobe:vfs_write /@start_write[tid]/ {
@write_us[comm] = hist((nsecs - @start_write[tid]) / 1000);
delete(@start_write[tid]);
}'
ZFS-specific eBPF tracing
ZFS has its own caching layer (ARC), its own I/O pipeline (ZIO), and its own transaction system (TXG). eBPF can trace all of it.
ZFS ARC hit/miss ratio
bpftrace -e '
kprobe:arc_read {
    @arc_reads++;
}
kprobe:arc_read_done {
    @arc_misses++;
}
interval:s:5 {
    printf("ARC reads: %d  misses (went to disk): %d\n", @arc_reads, @arc_misses);
    clear(@arc_reads);
    clear(@arc_misses);
}'
arc_read fires for every ARC lookup; arc_read_done is the completion callback for reads that missed the cache and had to hit the pool, so hits are roughly reads minus misses. (The ++ form is used instead of count() because printf cannot format count() maps.)
You can also check ARC stats directly, but eBPF gives you the real-time flow:
# Quick ARC stats from /proc (for comparison)
cat /proc/spl/kstat/zfs/arcstats | grep -E '^hits|^misses'
ZFS ZIO operation tracing
bpftrace -e '
kprobe:zio_execute {
    @zio_started++;
}
kprobe:zio_done {
    @zio_done++;
}
interval:s:5 {
    printf("ZIO started: %d completed: %d\n", @zio_started, @zio_done);
    clear(@zio_started);
    clear(@zio_done);
}'
TXG sync timing
txg_sync_thread is a long-running loop that never returns, so a kretprobe on it would never fire. spa_sync is called once per transaction group, which makes it the right function to time:
bpftrace -e '
kprobe:spa_sync {
    @txg_start[tid] = nsecs;
}
kretprobe:spa_sync /@txg_start[tid]/ {
    $dur_ms = (nsecs - @txg_start[tid]) / 1000000;
    printf("TXG sync took %d ms\n", $dur_ms);
    @txg_duration = hist($dur_ms);
    delete(@txg_start[tid]);
}'
ZFS performance hierarchy
ARC hit: data served from RAM. Microseconds. This is the fast path.
L2ARC hit: data served from SSD cache. Hundreds of microseconds. Still fast.
Pool read: data read from spinning disk or primary SSD. Milliseconds.
TXG sync: ZFS flushes dirty data to disk periodically. Long syncs (>5 seconds) cause write latency spikes. Trace them to find the root cause (too much dirty data, slow disks, fragmentation).
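To sanity-check the bpftrace numbers against the kernel's own counters, you can compute the ARC hit rate from arcstats. A sketch that parses arcstats-style `name type data` lines (the counter values below are invented):

```python
# Parse "name type data" lines as found in /proc/spl/kstat/zfs/arcstats
# and compute the ARC hit rate. Values here are invented for the example.
sample = """\
hits                            4    812345678
misses                          4    12345678
"""

stats = {}
for line in sample.splitlines():
    parts = line.split()
    if len(parts) == 3:
        stats[parts[0]] = int(parts[2])

rate = 100.0 * stats["hits"] / (stats["hits"] + stats["misses"])
print(f"ARC hit rate: {rate:.1f}%")
```

Note that these counters are cumulative since module load; for a live rate, sample twice and subtract.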
Metaslab allocation tracking
bpftrace -e '
kprobe:metaslab_alloc {
@alloc_start[tid] = nsecs;
}
kretprobe:metaslab_alloc /@alloc_start[tid]/ {
@alloc_us = hist((nsecs - @alloc_start[tid]) / 1000);
delete(@alloc_start[tid]);
}
interval:s:30 {
printf("\n--- Metaslab allocation latency (30s) ---\n");
print(@alloc_us);
clear(@alloc_us);
}'
Slow metaslab allocations mean ZFS is struggling to find free space. This happens on highly fragmented pools or pools above 80% capacity. If metaslab allocations regularly exceed 10ms, it is time to add capacity or rebalance.
Real example: "my database is slow"
Step-by-step diagnosis with eBPF. Someone reports that Postgres queries are taking 5x longer than yesterday.
Step 1: Is it CPU?
# Scheduler latency for the postmaster (-p takes a single PID), one 10s interval
runqlat -p $(pgrep -o postgres) 10 1
If the histogram shows most waits under 10us, CPU is not the bottleneck. Move on.
Step 2: Is it disk?
# Disk latency histogram, one 10-second interval
biolatency 10 1
If you see a bimodal distribution with a peak at 50ms+, storage is the problem. Drill deeper:
# Which process is doing the I/O? One 10-second summary
biotop 10 1
If postgres dominates with high average latency, the query is reading data that is not in cache.
Step 3: Is it cache?
# Check page cache hit rate
cachestat 5
If hit rate dropped from 98% to 70%, something evicted the database's working set from cache. Check what:
# What's allocating memory? (might be a backup, a cron job, etc.)
bpftrace -e '
tracepoint:kmem:mm_page_alloc {
@pages[comm] = count();
}
interval:s:10 {
print(@pages, 10);
clear(@pages);
}'
Step 4: Is it the network?
# Watch live TCP retransmits (Ctrl+C to stop); each event shows the
# address:port pair, so you can spot the postgres connections
tcpretrans
If retransmits correlate with slow queries, the network between the app server and the database is dropping packets.
Verdict
In this real-world case, cachestat revealed the hit rate dropped because a nightly backup job (rsync) ran 2 hours late and overlapped with peak query time, evicting postgres pages from cache. The fix: adjust the backup schedule. Total diagnosis time with eBPF: 5 minutes.
Real example: "network latency spikes every 5 minutes"
The monitoring dashboard shows TCP latency between two services spikes to 200ms every 5 minutes, then drops back to 2ms.
Step 1: Correlate with disk
# Run biolatency with timestamps, watch for 15 minutes
biolatency -T 1 > /tmp/bio-timeline.log &
# Simultaneously watch TCP session durations
tcplife -T > /tmp/tcp-timeline.log &
# Wait 15 minutes, then kill both
sleep 900 && kill %1 %2
Step 2: Find the pattern
# Look for biolatency spikes in the log
grep -E '(1024|2048|4096|8192|16384|32768|65536) ->' /tmp/bio-timeline.log
If disk latency spikes at the same 5-minute interval as network latency, something is saturating the I/O bus and blocking the network stack.
Step 3: What is doing I/O at those times?
bpftrace -e '
// blk_account_io_start was removed in newer kernels; this tracepoint is stable
tracepoint:block:block_rq_issue {
@io[comm] = count();
}
interval:s:30 {
time("%H:%M:%S ");
print(@io, 5);
clear(@io);
}'
If txg_sync or z_wr_iss (ZFS write issuer) appears in bursts every 5 minutes, ZFS TXG syncs are flooding the I/O subsystem. The default TXG timeout is 5 seconds, but if dirty data accumulates for 5 minutes due to zfs_txg_timeout tuning, the flush can be massive.
The fix
# Check current TXG timeout
cat /sys/module/zfs/parameters/zfs_txg_timeout
# If it's set too high, reduce it to flush more frequently with smaller batches
echo 5 > /sys/module/zfs/parameters/zfs_txg_timeout
# Also limit dirty data to prevent massive flushes
echo 4294967296 > /sys/module/zfs/parameters/zfs_dirty_data_max # 4GB cap
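A sanity check for that cap: the worst-case TXG flush time is dirty_data_max divided by sustained write throughput. Back-of-the-envelope, with an assumed 200 MB/s pool:

```python
# Worst-case TXG flush duration = dirty_data_max / sustained throughput.
# 200 MB/s is an assumed figure; measure your own pool's streaming writes.
dirty_max_bytes = 4 * 1024**3        # the 4 GB cap set above
throughput_bps = 200 * 1024**2       # ~200 MB/s, assumed
flush_seconds = dirty_max_bytes / throughput_bps
print(f"worst-case flush: ~{flush_seconds:.0f} s")
```

If that number is far longer than the latency spike you can tolerate, lower the cap (or verify the pool can actually sustain the assumed throughput).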