eBPF for Performance — Diagnose Anything

When something is slow, you need to know where the time goes. Is it disk? CPU? Network? Memory pressure? eBPF instruments the kernel itself, giving you latency histograms, I/O breakdowns, cache hit rates, and scheduler metrics — all with near-zero overhead. No sampling artifacts. No guessing.

The approach: traditional performance tools (top, iostat, vmstat) show you averages and aggregates. They tell you the disk is busy, but not which process is doing it or how long each individual I/O takes. eBPF traces individual events at the kernel level. You see the full distribution — not just the mean, but the outliers that actually cause the pain.
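A tiny arithmetic sketch makes the point. The latencies below are invented; nine fast I/Os and one outlier:

```shell
# Ten hypothetical I/O latencies in ms: nine quick ones and one outlier
latencies="10 10 10 10 10 10 10 10 10 300"

# The mean looks harmless...
echo $latencies | tr ' ' '\n' | awk '{ s += $1 } END { printf "mean: %d ms\n", s/NR }'

# ...but the tail is where the pain lives
echo $latencies | tr ' ' '\n' | sort -n | tail -1 | awk '{ printf "worst: %d ms\n", $1 }'
```

The mean comes out at 39 ms, which looks fine; the worst case is 300 ms, which is what users actually feel. Histograms keep both visible.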


biolatency — disk I/O latency histograms

biolatency traces every block device I/O and shows how long each one took, as a histogram. This is the single most important storage performance tool.

# Show disk I/O latency histogram (press Ctrl+C to see results)
biolatency

Output:

     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 12       |**                                      |
         8 -> 15         : 45       |*********                               |
        16 -> 31         : 123      |*************************               |
        32 -> 63         : 189      |****************************************|
        64 -> 127        : 87       |******************                      |
       128 -> 255        : 34       |*******                                 |
       256 -> 511        : 8        |*                                       |
       512 -> 1023       : 2        |                                        |
      1024 -> 2047       : 1        |                                        |

How to read this

Healthy NVMe: the bulk of I/O completes in 4-127 microseconds. Anything above 1ms is rare.
Healthy spinning disk: expect the peak around 2-16ms (2048-16384 usecs). That is seek time + rotational latency.
Degraded: if you see a bimodal distribution — a peak at 32us AND another peak at 50ms — some I/O is hitting cache and some is hitting a dying disk or saturated controller.
Catastrophic: if the histogram peaks above 100ms, your storage is actively destroying application performance.

Analogy: iostat tells you the average commute time is 30 minutes. biolatency shows you that 90% of commutes are 10 minutes but 5% take 3 hours because of a bridge closure. The average hid the real problem.
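Reading the tail off the histogram can be automated with a few lines of awk. This parses the sample output above (bucket lines have the shape "LOW -> HIGH : COUNT |...|", so $1 is the bucket floor and $5 the count):

```shell
# What fraction of I/Os took 1ms (1024us) or longer?
awk '$2 == "->" { total += $5; if ($1 + 0 >= 1024) slow += $5 }
     END { printf "slow (>=1ms): %.2f%%\n", 100 * slow / total }' <<'EOF'
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 12       |**                                      |
         8 -> 15         : 45       |*********                               |
        16 -> 31         : 123      |*************************               |
        32 -> 63         : 189      |****************************************|
        64 -> 127        : 87       |******************                      |
       128 -> 255        : 34       |*******                                 |
       256 -> 511        : 8        |*                                       |
       512 -> 1023       : 2        |                                        |
      1024 -> 2047       : 1        |                                        |
EOF
```

On this sample it reports 0.20%: a tiny but nonzero tail. Pipe live biolatency output through the same awk to watch that percentage over time.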

# Break down latency by disk device
biolatency -D

# Show latency in milliseconds instead of microseconds
biolatency -m

# Print a single 10-second histogram, then exit (args are: interval count)
biolatency 10 1

# Show latency by disk and include queue time
biolatency -DQ

biotop — I/O by process

biotop shows real-time block device I/O by process. It answers "who is killing my disk?" in seconds.

# Real-time disk I/O by process (updates every second)
biotop

Output:

PID    COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms
18234  postgres         W 8   0   sda        847  13552   2.41
18190  postgres         R 8   0   sda        312  4992    1.87
1456   jbd2/sda1-8      W 8   0   sda        89   712     0.34
2891   rsync            R 8   0   sda        67   8576    3.12

Postgres is doing 847 writes per second averaging 2.4ms each. That is your disk hog. rsync is doing fewer I/Os but with higher average latency (3.1ms) because it is reading large sequential blocks that sometimes miss cache.

# Show only reads
biotop -R

# Show only writes
biotop -W

# Print one 30-second summary, then exit (args are: interval count)
biotop 30 1

cachestat — page cache hit/miss

The page cache is Linux's primary caching layer — it keeps frequently accessed file data in RAM. cachestat shows hit/miss ratios in real time. If your hit rate drops, you are either running out of RAM or your working set changed.

# Page cache statistics every second
cachestat

Output:

    HITS   MISSES  DIRTIES  BUFFERS_MB  CACHED_MB  HIT%
   12847      234       89        128       3841   98.2%
   13102      187       92        128       3842   98.6%
   11893     1847      312        128       3798   86.6%
    9234     4521      567        128       3612   67.1%
    8012     6234      891        127       3401   56.2%

Reading the numbers

98% hit rate: your working set fits in RAM. Everything is fast.
86% and dropping: something is pushing your working set out of cache. Maybe a backup job, maybe a new query pattern, maybe ZFS ARC is competing for RAM.
Below 70%: serious cache pressure. The system is reading from disk for data that should be in RAM. Performance is degrading.

Analogy: the page cache is a desk. When everything you need is on the desk (98% hit), work is fast. When half your documents are in a filing cabinet across the room (50% hit), you spend half your time walking back and forth.
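The HIT% column is just HITS / (HITS + MISSES). Recomputing it from the first and last sample rows above confirms the math:

```shell
# Columns: HITS MISSES (values taken from the sample output above)
awk '{ printf "%.1f%%\n", 100 * $1 / ($1 + $2) }' <<'EOF'
12847 234
8012 6234
EOF
```

This prints 98.2% and 56.2%, matching the first and last rows of the sample.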

# Take 60 one-second samples, then exit (args are: interval count)
cachestat 1 60

# Sample every 5 seconds instead
cachestat 5

runqlat — CPU scheduler latency

runqlat measures how long processes wait in the CPU scheduler's run queue before actually getting CPU time. This detects CPU saturation that top's "load average" hides.

# CPU scheduler latency histogram
runqlat

Output:

     usecs               : count    distribution
         0 -> 1          : 1423     |****************************************|
         2 -> 3          : 892      |*************************               |
         4 -> 7          : 345      |*********                               |
         8 -> 15         : 123      |***                                     |
        16 -> 31         : 45       |*                                       |
        32 -> 63         : 12       |                                        |
        64 -> 127        : 3        |                                        |
       128 -> 255        : 1        |                                        |

Most tasks wait 0-3 microseconds. That is healthy. If you see a significant cluster above 1ms (1024 usecs), processes are queuing for CPU time. Above 10ms and your users notice.

# Break down by PID
runqlat -P

# Show in milliseconds
runqlat -m

# Print a single 10-second histogram, then exit
runqlat 10 1

CPU saturation vs. CPU utilization

A CPU at 100% utilization is not necessarily a problem — it might be doing useful work efficiently. CPU saturation means tasks are waiting. runqlat measures the waiting. A server at 95% CPU with 2us runqlat is fine. A server at 80% CPU with 50ms runqlat has a scheduling problem (probably too many threads competing for too few cores).

Analogy: utilization is how busy the checkout lanes are. Saturation is the length of the queue. Full lanes with no queue means efficient. Half the lanes open with 20-person queues means a staffing problem.
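A crude version of this check is available even without eBPF: the fourth field of /proc/loadavg is runnable-tasks/total-tasks at the instant of the read. This is a rough point-in-time sketch, not a substitute for runqlat's histogram:

```shell
# The 4th field of /proc/loadavg looks like "3/845": runnable/total tasks
read -r _ _ _ ratio _ < /proc/loadavg
runnable=${ratio%%/*}
cpus=$(nproc)
echo "runnable=$runnable cpus=$cpus"
# more runnable tasks than cores at this instant suggests queuing
[ "$runnable" -gt "$cpus" ] && echo "possible CPU saturation" || echo "no queue right now"
```

It only samples one instant, which is exactly why runqlat's per-event tracing is the real tool: a queue that forms for 50ms bursts will usually be invisible here.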

funccount — count kernel function calls

funccount counts how many times a kernel function is called. Use it to understand workload patterns at the kernel level.

# How many times are ext4 functions called? (trace for 10 seconds)
funccount -d 10 'ext4_*'

Output:

FUNC                          COUNT
ext4_file_read_iter            8234
ext4_readdir                   1234
ext4_file_write_iter            892
ext4_lookup                     567
ext4_getattr                    345
ext4_file_open                  234

# Count ZFS function calls
funccount -d 10 'zfs_*'

# Count TCP functions (network activity)
funccount -d 10 'tcp_*'

# Count a specific function
funccount -d 10 'vfs_read'

profile — CPU flame graphs

profile samples the call stack across all CPUs at a fixed rate. The output becomes a flame graph — the most powerful visualization of where CPU time goes.

# Sample at 99Hz for 30 seconds, emit folded stacks for a flame graph
profile -F 99 -af 30 > /tmp/profile.out

# If you have Brendan Gregg's FlameGraph tools (-f already produces
# folded output, so it goes straight into flamegraph.pl):
git clone https://github.com/brendangregg/FlameGraph /opt/FlameGraph
/opt/FlameGraph/flamegraph.pl /tmp/profile.out > /tmp/flamegraph.svg

Open flamegraph.svg in a browser. The x-axis is not time — it is the percentage of samples. The wider a frame, the more CPU time that function consumed. Click frames to zoom.
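For reference, the folded format that flamegraph.pl consumes is one line per unique stack: semicolon-separated frames, then a sample count. Merging duplicate stacks and summing counts is essentially all a stack collapser does (the frame names below are made up):

```shell
# Merge duplicate stacks and sum their sample counts, as a collapser would
cat <<'EOF' | awk '{ n[$1] += $2 } END { for (s in n) print s, n[s] }' | sort
main;do_work;parse 40
main;do_work;parse 35
main;do_io 25
EOF
```

Frame width in the rendered SVG is proportional to these summed counts: here main;do_work;parse ends up three times as wide as main;do_io.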

# Profile only a specific process (-p takes a single PID; -o picks the oldest match)
profile -p $(pgrep -o postgres) -af 30 > /tmp/pg-profile.out

# Sample user stacks only (skip kernel)
profile -U -af 30 > /tmp/user-profile.out

# Sample kernel stacks only
profile -K -af 30 > /tmp/kernel-profile.out

Reading flame graphs

Wide plateau at top: a single function consuming most of the CPU. Optimize or fix that function.
Wide base, narrow towers: CPU time is distributed across many functions. This is normal for a balanced workload.
Deep narrow stacks: recursive or deeply nested calls. Common in parsers and tree-walking code.
"Ice cream cone" shape: one hot code path with many layers. Typical of a single slow query or hot loop.

Analogy: a flame graph is an X-ray of your CPU's workday. Instead of asking "are you busy?", you see exactly which tasks consumed each hour.

bpftrace scripts for performance

Syscall latency by process

bpftrace -e '
tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
  @usecs[comm] = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'

This shows a latency histogram for every syscall, grouped by process name. If postgres has a tail of 10ms+ syscalls, something is blocking it.

TCP retransmit analysis

bpftrace -e '
kprobe:tcp_retransmit_skb {
  @retransmits[comm] = count();
}

interval:s:10 {
  print(@retransmits);
  clear(@retransmits);
}'

TCP retransmits mean packet loss. If one process accounts for most retransmits, the problem is likely that process's network path, not system-wide congestion.
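The kernel also keeps a global retransmit counter worth sanity-checking against: RetransSegs in /proc/net/snmp. The awk below finds the column by name; the heredoc holds abbreviated sample lines (a real Tcp line has more fields), so on a live system point the same awk at /proc/net/snmp instead:

```shell
# The first Tcp: line names the fields, the second carries the values
awk '/^Tcp:/ { for (i = 1; i <= NF; i++) if ($i == "RetransSegs") col = i;
               val = $col }
     END { print "RetransSegs:", val }' <<'EOF'
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens RetransSegs OutRsts
Tcp: 1 200 120000 -1 3412 57 12
EOF
```

The bpftrace script tells you which process is retransmitting; this counter tells you the system-wide total to compare against.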

Memory allocation tracking

bpftrace -e '
tracepoint:kmem:kmalloc {
  @bytes[comm] = sum(args.bytes_alloc);
  @count[comm] = count();
}

interval:s:10 {
  printf("\n--- Top allocators (10s window) ---\n");
  print(@bytes, 10);
  print(@count, 10);
  clear(@bytes);
  clear(@count);
}'

Filesystem operation timing

bpftrace -e '
kprobe:vfs_read { @start_read[tid] = nsecs; }
kretprobe:vfs_read /@start_read[tid]/ {
  @read_us[comm] = hist((nsecs - @start_read[tid]) / 1000);
  delete(@start_read[tid]);
}

kprobe:vfs_write { @start_write[tid] = nsecs; }
kretprobe:vfs_write /@start_write[tid]/ {
  @write_us[comm] = hist((nsecs - @start_write[tid]) / 1000);
  delete(@start_write[tid]);
}'

ZFS-specific eBPF tracing

ZFS has its own caching layer (ARC), its own I/O pipeline (ZIO), and its own transaction system (TXG). eBPF can trace all of it.

ZFS ARC hit/miss ratio

bpftrace -e '
kprobe:arc_read { @arc_reads++; }

// arc_read_done fires when a read that missed the ARC completes from disk,
// so it approximates misses; hits are roughly reads minus misses
kprobe:arc_read_done { @arc_misses++; }

interval:s:5 {
  printf("ARC reads: %d  misses (disk reads): %d\n", @arc_reads, @arc_misses);
  clear(@arc_reads);
  clear(@arc_misses);
}'

You can also check ARC stats directly, but eBPF gives you the real-time flow:

# Quick ARC stats from /proc (for comparison)
cat /proc/spl/kstat/zfs/arcstats | grep -E '^hits|^misses'
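The hit ratio itself is just hits / (hits + misses) over those counters. The heredoc below carries sample arcstats lines (name, type, value; the numbers are invented), so on a live box substitute /proc/spl/kstat/zfs/arcstats for the heredoc:

```shell
# ARC hit ratio from the arcstats counters
awk '$1 == "hits"   { h = $3 }
     $1 == "misses" { m = $3 }
     END { printf "ARC hit ratio: %.1f%%\n", 100 * h / (h + m) }' <<'EOF'
hits                            4    960000
misses                          4    40000
EOF
```

Note these counters are cumulative since module load; for a ratio over a window, take two snapshots and diff them (or use the bpftrace script above).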

ZFS ZIO operation tracing

bpftrace -e '
kprobe:zio_execute { @zio_started++; }

kprobe:zio_done { @zio_completed++; }

interval:s:5 {
  printf("ZIO started: %d  completed: %d\n", @zio_started, @zio_completed);
  clear(@zio_started);
  clear(@zio_completed);
}'

TXG sync timing

bpftrace -e '
// spa_sync() runs once per TXG and returns when that sync completes.
// (txg_sync_thread is a long-lived kthread, so timing its entry/return
// would measure thread lifetime, not sync duration.)
kprobe:spa_sync {
  @txg_start[tid] = nsecs;
}

kretprobe:spa_sync /@txg_start[tid]/ {
  $dur_ms = (nsecs - @txg_start[tid]) / 1000000;
  printf("TXG sync took %d ms\n", $dur_ms);
  @txg_duration = hist($dur_ms);
  delete(@txg_start[tid]);
}'

ZFS performance hierarchy

ARC hit: data served from RAM. Microseconds. This is the fast path.
L2ARC hit: data served from SSD cache. Hundreds of microseconds. Still fast.
Pool read: data read from spinning disk or primary SSD. Milliseconds.
TXG sync: ZFS flushes dirty data to disk periodically. Long syncs (>5 seconds) cause write latency spikes. Trace them to find the root cause (too much dirty data, slow disks, fragmentation).

Analogy: ARC is your desk (instant). L2ARC is the filing cabinet in your office (quick walk). Pool read is the storage room down the hall (real trip). TXG sync is cleaning your desk at the end of the day — if you let too much pile up, it takes forever.

Metaslab allocation tracking

bpftrace -e '
kprobe:metaslab_alloc {
  @alloc_start[tid] = nsecs;
}

kretprobe:metaslab_alloc /@alloc_start[tid]/ {
  @alloc_us = hist((nsecs - @alloc_start[tid]) / 1000);
  delete(@alloc_start[tid]);
}

interval:s:30 {
  printf("\n--- Metaslab allocation latency (30s) ---\n");
  print(@alloc_us);
  clear(@alloc_us);
}'

Slow metaslab allocations mean ZFS is struggling to find free space. This happens on highly fragmented pools or pools above 80% capacity. If metaslab allocations regularly exceed 10ms, it is time to add capacity or rebalance.
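A quick capacity screen pairs well with the metaslab histogram. The heredoc stands in for `zpool list -H -o name,capacity` output (the pool names and numbers are invented):

```shell
# Flag pools above the ~80% threshold where allocation starts to slow down
awk '{ cap = $2; sub(/%/, "", cap)
       if (cap + 0 > 80) print $1 ": over 80% full, expect slow allocations" }' <<'EOF'
tank 86%
scratch 41%
EOF
```

On a live system, pipe the real `zpool list` output into the same awk; any pool it flags is a candidate for the metaslab tracing above.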


Real example: "my database is slow"

Step-by-step diagnosis with eBPF. Someone reports that Postgres queries are taking 5x longer than yesterday.

Step 1: Is it CPU?

# Check scheduler latency with one histogram per PID; find the postgres PIDs
runqlat -P 10 1

If the histogram shows most waits under 10us, CPU is not the bottleneck. Move on.

Step 2: Is it disk?

# Check disk latency histogram
biolatency 10

If you see a bimodal distribution with a peak at 50ms+, storage is the problem. Drill deeper:

# Which process is doing the I/O?
biotop 10

If postgres dominates with high average latency, the query is reading data that is not in cache.

Step 3: Is it cache?

# Check page cache hit rate
cachestat 5

If hit rate dropped from 98% to 70%, something evicted the database's working set from cache. Check what:

# What's allocating memory? (might be a backup, a cron job, etc.)
bpftrace -e '
tracepoint:kmem:mm_page_alloc {
  @pages[comm] = count();
}
interval:s:10 {
  print(@pages, 10);
  clear(@pages);
}'

Step 4: Is it the network?

# Check for TCP retransmits (tcpretrans traces until interrupted, so bound it)
timeout 10 tcpretrans

If retransmits correlate with slow queries, the network between the app server and the database is dropping packets.

Verdict

In this real-world case, cachestat revealed the hit rate dropped because a nightly backup job (rsync) ran 2 hours late and overlapped with peak query time, evicting postgres pages from cache. The fix: adjust the backup schedule. Total diagnosis time with eBPF: 5 minutes.


Real example: "network latency spikes every 5 minutes"

The monitoring dashboard shows TCP latency between two services spikes to 200ms every 5 minutes, then drops back to 2ms.

Step 1: Correlate with disk

# Run biolatency with timestamps, watch for 15 minutes
biolatency -T 1 > /tmp/bio-timeline.log &

# Simultaneously watch TCP session durations
tcplife -T > /tmp/tcp-timeline.log &

# Wait 15 minutes, then kill both
sleep 900 && kill %1 %2

Step 2: Find the pattern

# Look for biolatency spikes in the log
grep -E '(1024|2048|4096|8192|16384|32768|65536) ->' /tmp/bio-timeline.log

If disk latency spikes at the same 5-minute interval as network latency, something is saturating the I/O bus and blocking the network stack.
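One way to line the two timelines up, once each log has been reduced to a sorted list of minutes (the timestamps and temp file names below are illustrative; in practice you would grep the HH:MM prefix out of the Step 1 logs first):

```shell
# Each file: minutes during which that tool reported spikes, one per line, sorted
printf '14:00\n14:05\n14:10\n' > /tmp/bio-minutes.txt
printf '14:05\n14:10\n14:15\n' > /tmp/tcp-minutes.txt

# comm -12 keeps only lines present in both files: the overlapping minutes
comm -12 /tmp/bio-minutes.txt /tmp/tcp-minutes.txt
```

If the overlap covers nearly every spike, the disk and network problems are one problem.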

Step 3: What is doing I/O at those times?

bpftrace -e '
// on newer kernels where blk_account_io_start is unavailable as a kprobe,
// tracepoint:block:block_rq_issue is a stable substitute
kprobe:blk_account_io_start {
  @io[comm] = count();
}

interval:s:30 {
  time("%H:%M:%S ");
  print(@io, 5);
  clear(@io);
}'

If txg_sync or z_wr_iss (ZFS write issuer) appears in bursts every 5 minutes, ZFS TXG syncs are flooding the I/O subsystem. The default TXG timeout is 5 seconds, but if dirty data accumulates for 5 minutes due to zfs_txg_timeout tuning, the flush can be massive.

The fix

# Check current TXG timeout
cat /sys/module/zfs/parameters/zfs_txg_timeout

# If it's set too high, reduce it to flush more frequently with smaller batches
echo 5 > /sys/module/zfs/parameters/zfs_txg_timeout

# Also limit dirty data to prevent massive flushes
echo 4294967296 > /sys/module/zfs/parameters/zfs_dirty_data_max  # 4GB cap