Documentation

eBPF for Performance — Diagnose Anything

When something is slow, you need to know where the time goes. Is it disk? CPU? Network? Memory pressure? eBPF instruments the kernel itself, giving you latency histograms, I/O breakdowns, cache hit rates, and scheduler metrics — all with near-zero overhead. No sampling artifacts. No guessing.

The approach: traditional performance tools (top, iostat, vmstat) show you averages and aggregates. They tell you the disk is busy but not which process is doing it or how latent each I/O is. eBPF traces individual events at the kernel level. You see the full distribution — not just the mean, but the outliers that actually cause the pain.

Every performance investigation I have ever done follows the same pattern: somebody says "it is slow", you spend 20 minutes guessing with top and iostat, and then you spend 5 minutes with eBPF finding the actual answer. This page is every tool and technique I reach for, in the order I reach for them. Brendan Gregg's work is the foundation of everything here. If you have not read "Systems Performance" or "BPF Performance Tools", stop reading this page and go buy those books. Then come back.

The 60-Second eBPF Checklist

Brendan Gregg's famous "60-second checklist" for Linux performance, fully implemented with eBPF tools. Run these in order. Within 60 seconds you will know which resource is the bottleneck. Then drill into the deep-dive section for that resource.

1. CPU scheduler latency (5 seconds)

# Are processes waiting for CPU time?
runqlat 5 2>&1 | tail -20

Output (healthy):

     usecs               : count    distribution
         0 -> 1          : 2847     |****************************************|
         2 -> 3          : 1293     |******************                      |
         4 -> 7          : 412      |*****                                   |
         8 -> 15         : 89       |*                                       |
        16 -> 31         : 12       |                                        |
        32 -> 63         : 3        |                                        |

If peaks above 1024us: CPU saturated. Go to CPU deep dive.

2. Disk I/O latency (5 seconds)

# How long are disk I/Os taking?
biolatency -D 5

Output (healthy NVMe):

disk = nvme0n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         4 -> 7          : 23       |***                                     |
         8 -> 15         : 89       |***********                             |
        16 -> 31         : 312      |****************************************|
        32 -> 63         : 187      |************************                |
        64 -> 127        : 45       |*****                                   |
       128 -> 255        : 8        |*                                       |

If bimodal or peaks above 4096us: storage is the problem. Go to Disk I/O deep dive.

3. Page cache efficiency (5 seconds)

# Is the page cache working?
cachestat 1 5

Output (healthy):

    HITS   MISSES  DIRTIES  BUFFERS_MB  CACHED_MB  HIT%
   14523      112       67        128       7841   99.2%
   13891      134       89        128       7842   99.0%
   14102      98        45        128       7843   99.3%
   13567      156       78        128       7841   98.9%
   14234      87        56        128       7842   99.4%

If hit rate drops below 90%: memory pressure. Go to Memory deep dive.

4. TCP retransmits (5 seconds)

# Are packets being lost?
tcpretrans 5

Output (problem detected):

TIME     PID    LADDR:LPORT          RADDR:RPORT          STATE
14:23:01 18234  10.0.1.5:5432        10.0.1.20:48912      ESTABLISHED
14:23:01 18234  10.0.1.5:5432        10.0.1.20:48912      ESTABLISHED
14:23:03 2891   10.0.1.5:22          10.0.2.100:55123     ESTABLISHED

If retransmits are frequent: network problem. Go to Network deep dive.

5. Who is doing I/O? (5 seconds)

# Top disk consumers right now
biotop 5

6. New processes (5 seconds)

# What is being spawned?
execsnoop 5

If hundreds of short-lived processes: something is fork-bombing or a cron job is misbehaving.

7. TCP connection lifecycle (5 seconds)

# Who is connecting and disconnecting?
tcplife 5

8. Filesystem latency snapshot (5 seconds)

# VFS operation latency
bpftrace -e '
kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
  @read_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}
interval:s:5 { exit(); }'

The 60-second philosophy

You are not solving the problem in 60 seconds. You are classifying it. After these 8 checks, you know whether the bottleneck is CPU, disk, memory, network, or something else entirely. Then you use the deep-dive tools below to find the root cause. Most engineers skip this step and spend hours chasing the wrong resource.

Analogy: a doctor does not start with an MRI. They check pulse, temperature, blood pressure, and breathing first. Those four vitals tell them which specialist to call. The 60-second checklist is your vital signs for a Linux system.

The USE Method with eBPF

The USE method (Brendan Gregg, 2012) checks every resource for Utilization, Saturation, and Errors. For each resource in the system, ask three questions: How busy is it? Is work queuing? Are there errors? eBPF gives you direct answers to all three — no guessing, no inferring from indirect metrics.

Why USE works: it is exhaustive by design. If you check U, S, and E for every resource, you will find the bottleneck. The most common performance analysis failure is checking the wrong resource. USE forces you to check them all.

CPU — Utilization / Saturation / Errors

Utilization

What: percentage of time CPUs are not idle.
eBPF tool: mpstat gives you per-CPU breakdown, but eBPF can trace exactly which functions consume time.

# Per-CPU utilization with on-CPU time by function
profile -af 10 > /tmp/cpu-profile.out
# Count: each stack frame's width = % of CPU time

Saturation

What: processes waiting for CPU (run queue depth).
eBPF tool: runqlat measures queue wait time. runqlen measures queue depth.

# Run queue latency histogram
runqlat 10

# Run queue length over time
runqlen 10

Output (saturated):

     cpus               : count    distribution
         0              : 12       |***                                     |
         1              : 89       |***********************                 |
         2              : 156      |****************************************|
         3              : 123      |*******************************         |
         4              : 67       |*****************                       |
         5              : 23       |*****                                   |
         6              : 8        |**                                      |

If the queue length regularly exceeds the number of CPUs, you are saturated.

Errors

What: CPU errors are rare but include machine check exceptions (MCEs) and thermal throttling.
eBPF tool: trace the MCE handler and thermal throttle events.

# Trace machine check exceptions
bpftrace -e '
tracepoint:mce:mce_record {
  printf("MCE: cpu=%d bank=%d status=0x%lx\n",
    args->extcpu, args->bank, args->status);
}'

# Trace CPU frequency changes (thermal throttling)
bpftrace -e '
tracepoint:power:cpu_frequency {
  printf("CPU %d freq -> %d kHz\n",
    args->cpu_id, args->state);
}'

Memory — Utilization / Saturation / Errors

Utilization

What: how much physical memory is in use.
eBPF approach: track allocation rates to see who is consuming memory and how fast.

# Kernel memory allocation rate by process
bpftrace -e '
tracepoint:kmem:kmalloc {
  @alloc_bytes[comm] = sum(args.bytes_alloc);
}
interval:s:5 {
  print(@alloc_bytes, 10);
  clear(@alloc_bytes);
}'

Saturation

What: swapping, page scanning, OOM kills — the kernel is desperate for memory.
eBPF tool: trace page faults, swap activity, and the OOM killer.

# Major page faults (disk reads for pages)
bpftrace -e '
tracepoint:exceptions:page_fault_user {
  @faults[comm] = count();
}
interval:s:10 {
  print(@faults, 10);
  clear(@faults);
}'

# OOM killer invocations
bpftrace -e '
kprobe:oom_kill_process {
  printf("OOM kill triggered!\n");
}'

Errors

What: failed allocations, ECC errors, NUMA misses.
eBPF approach: trace allocation failures and NUMA fallbacks.

# Failed memory allocations
bpftrace -e '
tracepoint:kmem:kmalloc {
  if (args->ptr == 0) {
    printf("kmalloc FAILED: %d bytes by %s\n",
      args->bytes_req, comm);
  }
}'

# NUMA remote access (cross-node memory)
bpftrace -e '
tracepoint:migrate:mm_migrate_pages {
  printf("NUMA migrate: %d pages by %s\n",
    args->succeeded, comm);
}'

Disk — Utilization / Saturation / Errors

Utilization

What: percentage of time the device is servicing I/O.
eBPF approach: biolatency and biotop show I/O volume and latency per device.

# I/O count and bytes per device per second
biotop 1

Saturation

What: I/O requests queuing at the device.
eBPF tool: biolatency -Q includes queue time. Compare with and without -Q.

# Latency WITH queue time
biolatency -Q 10

# Latency WITHOUT queue time (device service only)
biolatency 10

# If -Q is significantly higher, I/O is queuing

Errors

What: I/O errors, failed requests, device timeouts.
eBPF approach: trace block layer error completions.

# Block I/O errors
bpftrace -e '
tracepoint:block:block_rq_complete {
  if (args->error != 0) {
    printf("BLK ERROR: dev=%d,%d error=%d\n",
      args->dev >> 20, args->dev & 0xfffff,
      args->error);
  }
}'

Network — Utilization / Saturation / Errors

Utilization

What: bandwidth consumed relative to link capacity.
eBPF approach: trace bytes sent/received per interface.

# Bytes transmitted per second by interface
bpftrace -e '
tracepoint:net:net_dev_xmit {
  @tx_bytes[args->name] = sum(args->len);
}
tracepoint:net:netif_receive_skb {
  @rx_bytes[str(args->name)] = sum(args->len);
}
interval:s:1 {
  print(@tx_bytes); print(@rx_bytes);
  clear(@tx_bytes); clear(@rx_bytes);
}'

Saturation

What: socket buffer overflows, TCP backlog drops, netdev backlog full.
eBPF tool: trace packet drops with reason codes.

# Packet drops with kernel reason
bpftrace -e '
tracepoint:skb:kfree_skb {
  @drops[args->reason] = count();
}
interval:s:10 {
  print(@drops);
  clear(@drops);
}'

Errors

What: TCP retransmits, connection resets, ICMP errors.
eBPF tool: tcpretrans traces retransmits. Trace RSTs separately.

# TCP retransmits with addresses
tcpretrans

# TCP connection resets
bpftrace -e '
kprobe:tcp_send_active_reset {
  printf("RST sent by %s pid %d\n", comm, pid);
}
kprobe:tcp_reset {
  printf("RST received by %s pid %d\n", comm, pid);
}'

Filesystem — Utilization / Saturation / Errors

Utilization

What: VFS operation rates, page cache usage, inode cache pressure.
eBPF approach: count VFS operations per second.

# VFS operation rates
funccount 'vfs_*' 10

Output:

FUNC                          COUNT
vfs_read                      45123
vfs_write                     23456
vfs_open                       8901
vfs_getattr                    6789
vfs_statx                      4567
vfs_lock_file                  1234

Saturation

What: filesystem operations blocking on locks, inode exhaustion, dentry cache evictions.
eBPF approach: measure VFS operation latency — high latency means contention.

# VFS read latency histogram
bpftrace -e '
kprobe:vfs_read { @s[tid] = nsecs; }
kretprobe:vfs_read /@s[tid]/ {
  @lat = hist((nsecs - @s[tid]) / 1000);
  delete(@s[tid]);
}
interval:s:10 { exit(); }'

Errors

What: failed opens, permission denials, stale NFS handles.
eBPF approach: trace VFS return codes.

# Failed file opens with error codes
bpftrace -e '
kretprobe:vfs_open {
  if (retval < 0) {
    printf("vfs_open FAILED: %d by %s\n",
      retval, comm);
  }
}'

The USE method is mechanical on purpose. You do not need to be clever. You do not need to guess. Check every resource, three questions each. The bottleneck reveals itself. I have seen senior engineers waste days because they "had a feeling" it was the network when the answer was memory pressure the entire time. USE forces you to check everything.

CPU Deep Dive

CPU performance analysis goes far beyond "is it busy?" eBPF can trace on-CPU time, off-CPU time (where threads are blocked), wakeup chains, scheduler decisions, CPU migrations between cores, IRQ handling time, and context switch overhead. Each of these tells a different story about why your application is slow.

On-CPU profiling

On-CPU analysis answers: "when my process has the CPU, what is it doing?" This is what flame graphs show. The wider a frame, the more CPU cycles it consumed.

# Sample at 99Hz for 30 seconds, generate a flame graph
profile -af 30 > /tmp/profile.out

# If you have Brendan Gregg's FlameGraph tools:
git clone https://github.com/brendangregg/FlameGraph /opt/FlameGraph
/opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/profile.out | \
  /opt/FlameGraph/flamegraph.pl > /tmp/flamegraph.svg

Open flamegraph.svg in a browser. The x-axis is not time — it is the percentage of samples. The wider a frame, the more CPU time that function consumed. Click frames to zoom.

# Profile only a specific process
profile -p $(pgrep postgres) -af 30 > /tmp/pg-profile.out

# Sample user stacks only (skip kernel)
profile -U -af 30 > /tmp/user-profile.out

# Sample kernel stacks only
profile -K -af 30 > /tmp/kernel-profile.out

Reading flame graphs

Wide plateau at top: a single function consuming most of the CPU. Optimize or fix that function.
Wide base, narrow towers: CPU time is distributed across many functions. This is normal for a balanced workload.
Deep narrow stacks: recursive or deeply nested calls. Common in parsers and tree-walking code.
"Ice cream cone" shape: one hot code path with many layers. Typical of a single slow query or hot loop.

Analogy: a flame graph is an X-ray of your CPU's workday. Instead of asking "are you busy?", you see exactly which tasks consumed each hour.

Off-CPU analysis

Off-CPU analysis is the complement: "when my process does NOT have the CPU, what is it waiting for?" This catches blocking I/O, lock contention, sleep calls, and page faults. A process can have 5% CPU utilization and still be slow because it spends 95% of its time blocked on a mutex.

# Off-CPU time by stack trace for a specific process
# This shows WHERE the process blocks
bpftrace -e '
tracepoint:sched:sched_switch /args->prev_pid == '$TARGET_PID'/ {
  @off_start[args->prev_pid] = nsecs;
}
tracepoint:sched:sched_switch /args->next_pid == '$TARGET_PID' && @off_start[args->next_pid]/ {
  @off_us[kstack] = hist((nsecs - @off_start[args->next_pid]) / 1000);
  delete(@off_start[args->next_pid]);
}' -c "sleep 30"

Replace $TARGET_PID with the actual PID. The output shows a histogram of off-CPU durations for each unique kernel stack trace. The stack trace tells you why the process was off-CPU — was it waiting for disk (you will see io_schedule), a mutex (mutex_lock), a futex (futex_wait), or a network socket (sk_wait_data)?

# Simpler: offcputime from bcc-tools, which does the same thing
offcputime -p $(pgrep postgres) 30 > /tmp/offcpu.out

# Generate an off-CPU flame graph (inverted: "icicle graph")
/opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/offcpu.out | \
  /opt/FlameGraph/flamegraph.pl --color=io --countname=us > /tmp/offcpu.svg

On-CPU + Off-CPU = the full picture

On-CPU flame graphs show you where CPU time goes. Off-CPU flame graphs show you where wall clock time goes. If a request takes 500ms but only uses 5ms of CPU, the on-CPU flame graph is nearly useless — you need the off-CPU graph to find the 495ms of blocking. Always generate both.

Analogy: on-CPU is how long you spend typing. Off-CPU is how long you spend waiting for the printer. If printing is slow, analyzing your typing speed is irrelevant.

Wakeup tracing

When a thread unblocks, something wakes it up. Tracing wakeups tells you the dependency chain: "postgres was sleeping, then the disk interrupt handler woke it up." This is critical for understanding latency in multi-threaded systems.

# Trace what wakes up a specific process
bpftrace -e '
tracepoint:sched:sched_wakeup /args->pid == '$TARGET_PID'/ {
  printf("Woke PID %d, waker: %s (PID %d)\n",
    args->pid, comm, pid);
}'

Output:

Woke PID 18234, waker: ksoftirqd/3 (PID 42)
Woke PID 18234, waker: postgres (PID 18190)
Woke PID 18234, waker: kworker/2:1 (PID 891)
Woke PID 18234, waker: ksoftirqd/3 (PID 42)

If ksoftirqd is the primary waker, the process is mostly waiting for network or disk interrupts. If another thread of the same application is the waker, there is a producer-consumer dependency you can optimize.

Scheduler latency deep dive

runqlat gives you the histogram, but sometimes you need per-process detail — especially when only one process has bad scheduler latency while others are fine.

# Scheduler latency histogram
runqlat

Output:

     usecs               : count    distribution
         0 -> 1          : 1423     |****************************************|
         2 -> 3          : 892      |*************************               |
         4 -> 7          : 345      |*********                               |
         8 -> 15         : 123      |***                                     |
        16 -> 31         : 45       |*                                       |
        32 -> 63         : 12       |                                        |
        64 -> 127        : 3        |                                        |
       128 -> 255        : 1        |                                        |

# Per-PID scheduler latency
runqlat -P 10

Output:

pid = 18234 (postgres)
     usecs               : count    distribution
         0 -> 1          : 12       |***                                     |
         2 -> 3          : 45       |************                            |
         4 -> 7          : 156      |****************************************|
         8 -> 15         : 89       |**********************                  |
        16 -> 31         : 23       |*****                                   |
        32 -> 63         : 3        |                                        |

pid = 2891 (nginx)
     usecs               : count    distribution
         0 -> 1          : 892      |****************************************|
         2 -> 3          : 123      |*****                                   |
         4 -> 7          : 12       |                                        |

nginx gets scheduled almost instantly (0-1us). postgres waits 4-15us regularly. That 15us per scheduler event adds up when postgres handles 10,000 queries/second.

# Detailed scheduler event tracing with timestamps
bpftrace -e '
tracepoint:sched:sched_switch {
  if (args->prev_comm == "postgres") {
    @off[args->prev_pid] = nsecs;
  }
  if (args->next_comm == "postgres" && @off[args->next_pid]) {
    $lat_us = (nsecs - @off[args->next_pid]) / 1000;
    if ($lat_us > 100) {
      printf("postgres PID %d waited %d us, preempted by: %s\n",
        args->next_pid, $lat_us, args->prev_comm);
    }
    delete(@off[args->next_pid]);
  }
}'

Output:

postgres PID 18234 waited 1234 us, preempted by: java
postgres PID 18234 waited 567 us, preempted by: java
postgres PID 18190 waited 2341 us, preempted by: kworker/3:1
postgres PID 18234 waited 891 us, preempted by: java

Java is stealing postgres's CPU time. Either pin them to separate cores with taskset or adjust cgroup CPU shares.

CPU migration tracking

When the scheduler moves a process from one CPU to another, the L1/L2 caches on the new CPU are cold. This "migration penalty" can cost 10-100us per event. High migration rates hurt latency-sensitive workloads.

# Count CPU migrations per process
bpftrace -e '
tracepoint:sched:sched_migrate_task {
  @migrations[args->comm] = count();
  @from_to[args->orig_cpu, args->dest_cpu] = count();
}
interval:s:10 {
  printf("\n--- CPU migrations (10s) ---\n");
  print(@migrations, 10);
  printf("\n--- Migration paths ---\n");
  print(@from_to, 10);
  clear(@migrations);
  clear(@from_to);
}'

Output:

--- CPU migrations (10s) ---
@migrations[postgres]: 234
@migrations[nginx]: 12
@migrations[java]: 567

--- Migration paths ---
@from_to[0, 3]: 89
@from_to[1, 2]: 67
@from_to[3, 0]: 78

Java is migrating 567 times in 10 seconds. That is 56 migrations per second, each costing cache invalidation. Fix: taskset -c 4-7 java_process to pin it to specific cores.

IRQ profiling

Hardware interrupts steal CPU time from application code. Network-heavy servers can spend significant CPU time in IRQ handlers, especially with high packet rates.

# IRQ handler duration
bpftrace -e '
tracepoint:irq:irq_handler_entry {
  @irq_start[args->irq] = nsecs;
}
tracepoint:irq:irq_handler_exit /@irq_start[args->irq]/ {
  $dur_us = (nsecs - @irq_start[args->irq]) / 1000;
  @irq_dur[args->name] = hist($dur_us);
  @irq_count[args->name] = count();
  delete(@irq_start[args->irq]);
}
interval:s:10 {
  printf("\n--- IRQ handler counts (10s) ---\n");
  print(@irq_count);
  printf("\n--- IRQ handler latency ---\n");
  print(@irq_dur);
  clear(@irq_count); clear(@irq_dur);
}'

Output:

--- IRQ handler counts (10s) ---
@irq_count[nvme0q1]: 45123
@irq_count[enp3s0]: 23456
@irq_count[ahci[0000:00:1f.2]]: 8901

--- IRQ handler latency ---
@irq_dur[nvme0q1]:
[0]                   34521  |****************************************|
[1]                    8923  |**********                              |
[2, 4)                 1245  |*                                       |

@irq_dur[enp3s0]:
[0]                   12345  |****************************************|
[1]                    6789  |**********************                  |
[2, 4)                 3456  |***********                             |
[4, 8)                  812  |**                                      |

NVMe IRQs are fast (all under 4us). Network IRQs are slower and more numerous — if this becomes a problem, enable IRQ coalescing or RSS (Receive Side Scaling) to distribute across CPUs.

Context switch analysis

# Context switch rate by process, with voluntary vs involuntary
bpftrace -e '
tracepoint:sched:sched_switch {
  if (args->prev_state == 0) {
    @involuntary[args->prev_comm] = count();
  } else {
    @voluntary[args->prev_comm] = count();
  }
}
interval:s:10 {
  printf("\n--- Voluntary context switches (10s) ---\n");
  print(@voluntary, 10);
  printf("\n--- Involuntary context switches (10s) ---\n");
  print(@involuntary, 10);
  clear(@voluntary); clear(@involuntary);
}'

Voluntary context switches happen when a process yields the CPU (waiting for I/O, sleeping, blocking on a lock). These are normal. Involuntary context switches happen when the scheduler forcibly preempts a process because it ran out of its time slice. High involuntary switch rates mean too many compute-bound threads fighting for too few cores.

CPU saturation vs. CPU utilization

A CPU at 100% utilization is not necessarily a problem — it might be doing useful work efficiently. CPU saturation means tasks are waiting. runqlat measures the waiting. A server at 95% CPU with 2us runqlat is fine. A server at 80% CPU with 50ms runqlat has a scheduling problem (probably too many threads competing for too few cores).

Analogy: utilization is how busy the checkout lanes are. Saturation is the length of the queue. Full lanes with no queue means efficient. Half the lanes open with 20-person queues means a staffing problem.

Memory Deep Dive

Memory performance is not just "how much RAM is free." It is page faults, slab allocator efficiency, NUMA locality, hugepage utilization, and the constant battle between the page cache and application memory. eBPF lets you trace every allocation, every fault, every reclaim event.

Page fault analysis (major vs. minor)

A minor page fault means the kernel mapped a page that was already in RAM (zero-fill or copy-on-write). Cost: ~1us. A major page fault means the kernel had to read the page from disk (swap or memory-mapped file). Cost: milliseconds. Major faults destroy latency.

# Major and minor page faults by process
bpftrace -e '
software:major-faults:1 {
  @major[comm, pid] = count();
}
software:minor-faults:100 {
  @minor[comm] = count();
}
interval:s:10 {
  printf("\n--- Major faults (disk reads) ---\n");
  print(@major, 10);
  printf("\n--- Minor faults (sampled 1:100) ---\n");
  print(@minor, 10);
  clear(@major); clear(@minor);
}'

Output:

--- Major faults (disk reads) ---
@major[postgres, 18234]: 1234
@major[java, 5678]: 89
@major[nginx, 2891]: 3

--- Minor faults (sampled 1:100) ---
@minor[java]: 89123
@minor[postgres]: 45678
@minor[nginx]: 12345

Postgres is hitting 1,234 major page faults in 10 seconds. That is 123 disk reads per second just for page faults. Either it is swapping, or it is accessing memory-mapped files that are not in the page cache. Check /proc/18234/status for VmSwap to distinguish.

Slab allocator tracking

The kernel's slab allocator manages fixed-size object caches (inodes, dentries, task structs, network buffers). Slab leaks are a common cause of gradual memory exhaustion where free shows memory disappearing but no process seems to own it.

# Track slab allocation by cache name
bpftrace -e '
tracepoint:kmem:kmem_cache_alloc {
  @slab_alloc[str(args->name)] = count();
}
tracepoint:kmem:kmem_cache_free {
  @slab_free[str(args->name)] = count();
}
interval:s:30 {
  printf("\n--- Slab allocations (30s) ---\n");
  print(@slab_alloc, 15);
  printf("\n--- Slab frees (30s) ---\n");
  print(@slab_free, 15);
  clear(@slab_alloc); clear(@slab_free);
}'

Output:

--- Slab allocations (30s) ---
@slab_alloc[dentry]: 234567
@slab_alloc[inode_cache]: 123456
@slab_alloc[skbuff_head_cache]: 89012
@slab_alloc[task_struct]: 4567
@slab_alloc[radix_tree_node]: 2345

--- Slab frees (30s) ---
@slab_free[dentry]: 234123
@slab_free[inode_cache]: 123400
@slab_free[skbuff_head_cache]: 89001
@slab_free[task_struct]: 4567
@slab_free[radix_tree_node]: 2300

Notice radix_tree_node allocations exceed frees by 45 in 30 seconds. Over hours, this adds up. That is a potential slab leak. Cross-reference with slabtop to confirm.

kmalloc/kfree balance

# Track kernel malloc sizes and who is allocating
bpftrace -e '
tracepoint:kmem:kmalloc {
  @alloc_size = hist(args.bytes_alloc);
  @alloc_by[comm] = sum(args.bytes_alloc);
  @alloc_count[comm] = count();
}
tracepoint:kmem:kfree {
  @frees = count();
}
interval:s:10 {
  printf("\n--- Allocation size histogram ---\n");
  print(@alloc_size);
  printf("\n--- Top allocators by bytes ---\n");
  print(@alloc_by, 10);
  printf("\n--- Total frees: ---\n");
  print(@frees);
  clear(@alloc_size); clear(@alloc_by);
  clear(@alloc_count); clear(@frees);
}'

Output:

--- Allocation size histogram ---
@alloc_size:
[1]                      234  |*                                       |
[2, 4)                   567  |***                                     |
[4, 8)                  1234  |*******                                 |
[8, 16)                 3456  |********************                    |
[16, 32)               2345   |*************                           |
[32, 64)               6789   |****************************************|
[64, 128)              4567   |***************************             |
[128, 256)             2345   |*************                           |
[256, 512)              891   |*****                                   |
[512, 1K)               234   |*                                       |
[1K, 2K)                 89   |                                        |
[2K, 4K)                 12   |                                        |

--- Top allocators by bytes ---
@alloc_by[postgres]: 12345678
@alloc_by[nginx]: 4567890
@alloc_by[kworker/3:1]: 2345678

OOM investigation

When the OOM killer fires, you need to know why memory was exhausted and what the allocation pattern looked like in the minutes before. eBPF can catch the OOM event and log the state.

# Trace OOM killer with victim details
bpftrace -e '
kprobe:oom_kill_process {
  printf("=== OOM KILL ===\n");
  printf("Killer: %s (PID %d)\n", comm, pid);
  printf("Time: "); time("%H:%M:%S\n");
}
kprobe:out_of_memory {
  printf("out_of_memory() called by %s\n", comm);
}'

# Continuous memory pressure monitoring (run in background)
bpftrace -e '
tracepoint:vmscan:mm_vmscan_direct_reclaim_begin {
  @reclaim_start[tid] = nsecs;
  @reclaim_count = count();
}
tracepoint:vmscan:mm_vmscan_direct_reclaim_end /@reclaim_start[tid]/ {
  @reclaim_us = hist((nsecs - @reclaim_start[tid]) / 1000);
  delete(@reclaim_start[tid]);
}
interval:s:30 {
  printf("\n--- Direct reclaim events (30s) ---\n");
  print(@reclaim_count);
  printf("--- Reclaim latency ---\n");
  print(@reclaim_us);
  clear(@reclaim_count); clear(@reclaim_us);
}'

Direct reclaim means the kernel is desperately freeing pages synchronously because a memory allocation could not be satisfied from free lists. If you see thousands of direct reclaim events per 30-second window, the system is under severe memory pressure and an OOM kill is likely imminent.

NUMA miss tracking

On multi-socket servers, accessing memory on the remote NUMA node costs 1.5-3x more latency than local memory. eBPF can trace when the kernel allocates pages on the wrong node.

# NUMA allocation misses by process
bpftrace -e '
tracepoint:kmem:mm_page_alloc {
  @alloc_node[comm, args->node] = count();
}
interval:s:10 {
  print(@alloc_node, 20);
  clear(@alloc_node);
}'

Output:

@alloc_node[postgres, 0]: 45678
@alloc_node[postgres, 1]: 12345
@alloc_node[nginx, 0]: 23456
@alloc_node[nginx, 1]: 234

Postgres is allocating 21% of pages on node 1, but its threads run on node 0 CPUs. That means 21% of memory accesses pay the NUMA penalty. Fix: numactl --membind=0 --cpunodebind=0 postgres.

Hugepage monitoring

# Track transparent hugepage allocations and collapses
bpftrace -e '
tracepoint:huge_memory:mm_khugepaged_scan_pmd {
  @thp_scan = count();
}
kprobe:collapse_huge_page {
  @thp_collapse[comm] = count();
}
kprobe:__alloc_pages /arg1 == 9/ {
  @hugepage_alloc[comm] = count();
}
interval:s:30 {
  printf("\n--- THP activity (30s) ---\n");
  print(@thp_scan); print(@thp_collapse);
  print(@hugepage_alloc);
  clear(@thp_scan); clear(@thp_collapse);
  clear(@hugepage_alloc);
}'

If THP collapses are frequent, the kernel is spending CPU time compacting memory into 2MB pages. This can cause latency spikes for latency-sensitive applications. Consider pre-allocating hugepages: echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages.

Disk I/O Deep Dive

Disk performance analysis with eBPF goes far beyond biolatency. You can trace queue depths, I/O merging, sequential vs. random patterns, and for ZFS specifically: the ZIO pipeline stages, vdev queue depths, and metaslab allocation pressure.

biolatency — the foundation

biolatency traces every block device I/O and shows how long each one took, as a histogram. This is the single most important storage performance tool.

# Show disk I/O latency histogram (press Ctrl+C to see results)
biolatency

Output:

     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 12       |**                                      |
         8 -> 15         : 45       |*********                               |
        16 -> 31         : 123      |*************************               |
        32 -> 63         : 189      |****************************************|
        64 -> 127        : 87       |******************                      |
       128 -> 255        : 34       |*******                                 |
       256 -> 511        : 8        |*                                       |
       512 -> 1023       : 2        |                                        |
      1024 -> 2047       : 1        |                                        |

How to read this

Healthy NVMe: the bulk of I/O completes in 4-127 microseconds. Anything above 1ms is rare.
Healthy spinning disk: expect the peak around 2-16ms (2048-16384 usecs). That is seek time + rotational latency.
Degraded: if you see a bimodal distribution — a peak at 32us AND another peak at 50ms — some I/O is hitting cache and some is hitting a dying disk or saturated controller.
Catastrophic: if the histogram peaks above 100ms, your storage is actively destroying application performance.

Analogy: iostat tells you the average commute time is 30 minutes. biolatency shows you that 90% of commutes are 10 minutes but 5% take 3 hours because of a bridge closure. The average hid the real problem.

# Break down latency by disk device
biolatency -D

# Show latency in milliseconds instead of microseconds
biolatency -m

# Run for 10 seconds then exit
biolatency 10

# Show latency by disk and include queue time
biolatency -DQ

biotop — I/O by process

biotop shows real-time block device I/O by process. It answers "who is killing my disk?" in seconds.

# Real-time disk I/O by process (updates every second)
biotop

Output:

PID    COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms
18234  postgres         W 8   0   sda        847  13552   2.41
18190  postgres         R 8   0   sda        312  4992    1.87
1456   jbd2/sda1-8      W 8   0   sda        89   712     0.34
2891   rsync            R 8   0   sda        67   8576    3.12

Postgres is doing 847 writes per second averaging 2.4ms each. That is your disk hog. rsync is doing fewer I/Os but with higher average latency (3.1ms) because it is reading large sequential blocks that sometimes miss cache.

Queue depth analysis

The I/O queue depth tells you how many requests are in-flight to the device simultaneously. NVMe drives handle 64K+ concurrent requests. SATA drives struggle above 32. Queue depth directly affects throughput and latency.

# I/O queue depth over time
bpftrace -e '
tracepoint:block:block_rq_issue {
  @inflight[args->dev] = count();
}
tracepoint:block:block_rq_complete {
  @inflight[args->dev] = count();
}
interval:s:1 {
  print(@inflight);
}'

# More precise: track concurrent in-flight I/Os
bpftrace -e '
BEGIN { @depth = 0; }
tracepoint:block:block_rq_issue {
  @depth++;
  @queue_depth = hist(@depth);
}
tracepoint:block:block_rq_complete {
  @depth--;
}
interval:s:10 {
  printf("\n--- Queue depth distribution (10s) ---\n");
  print(@queue_depth);
  clear(@queue_depth);
}'

Output:

--- Queue depth distribution (10s) ---
@queue_depth:
[0]                    1234  |*****                                   |
[1]                    5678  |***********************                 |
[2, 4)                 8901  |****************************************|
[4, 8)                 3456  |***************                         |
[8, 16)                1234  |*****                                   |
[16, 32)                456  |**                                      |
[32, 64)                 23  |                                        |

Most I/Os arrive when 2-4 requests are already in-flight. Peaks at 32+ are rare. This is healthy. If the distribution shifts right (regularly at 128+), the device cannot keep up.

I/O merging analysis

The block layer merges adjacent I/O requests into single operations. High merge rates mean sequential workload (good). Low merge rates mean random I/O (bad for spinning disks, fine for NVMe).

# Track I/O merging
bpftrace -e '
tracepoint:block:block_bio_merge {
  @merges[comm] = count();
}
tracepoint:block:block_rq_issue {
  @issued[comm] = count();
}
interval:s:10 {
  printf("\n--- I/O merges vs issues (10s) ---\n");
  print(@merges, 10);
  print(@issued, 10);
  clear(@merges); clear(@issued);
}'

Sequential vs. random I/O

# Track I/O patterns: sector offsets reveal sequential vs random
bpftrace -e '
tracepoint:block:block_rq_issue {
  @sector[comm] = hist(args->sector);
  @size[comm] = hist(args->nr_sector * 512);
}
interval:s:10 {
  printf("\n--- I/O sizes ---\n");
  print(@size, 5);
  clear(@sector); clear(@size);
}'

Output:

--- I/O sizes ---
@size[postgres]:
[512, 1K)               23  |*                                       |
[1K, 2K)                45  |**                                      |
[2K, 4K)               123  |*******                                 |
[4K, 8K)               891  |****************************************|
[8K, 16K)              567  |*************************               |
[16K, 32K)             234  |**********                              |
[32K, 64K)              89  |****                                    |

@size[rsync]:
[64K, 128K)            234  |***********                             |
[128K, 256K)           891  |****************************************|
[256K, 512K)           567  |*************************               |

Postgres does mostly 4-8KB random I/Os (typical for index lookups). rsync does 128-256KB sequential I/Os (bulk file transfer). On spinning disks, postgres suffers because each random I/O costs a full seek. On NVMe, both patterns perform similarly.

ZFS: ZIO pipeline tracing

ZFS has its own I/O pipeline called ZIO. Every read and write passes through multiple stages: open, wait for children, checksum, compress, vdev I/O, done. eBPF can trace each stage.

# ZIO pipeline stage timing
bpftrace -e '
kprobe:zio_execute {
  @zio_start[tid] = nsecs;
  @zio_ops = count();
}
kprobe:zio_done /@zio_start[tid]/ {
  @zio_latency = hist((nsecs - @zio_start[tid]) / 1000);
  @zio_done = count();
  delete(@zio_start[tid]);
}
interval:s:10 {
  printf("\n--- ZIO ops: started=%d completed=%d ---\n",
    @zio_ops, @zio_done);
  printf("--- ZIO latency (us) ---\n");
  print(@zio_latency);
  clear(@zio_ops); clear(@zio_done); clear(@zio_latency);
}'

Output:

--- ZIO ops: started=12345 completed=12340 ---
--- ZIO latency (us) ---
@zio_latency:
[1]                    2345  |*********                               |
[2, 4)                 4567  |*******************                     |
[4, 8)                 8901  |****************************************|
[8, 16)                3456  |***************                         |
[16, 32)               1234  |*****                                   |
[32, 64)                456  |**                                      |
[64, 128)               123  |                                        |
[128, 256)               45  |                                        |
[256, 512)               12  |                                        |

Most ZIO operations complete in 4-8us (ARC hits). The tail at 64-512us represents actual disk reads. The 5-operation gap between started and completed indicates 5 ZIOs were in-flight at snapshot time — normal.

ZFS: vdev queue tracking

# Track I/O to individual vdevs
bpftrace -e '
kprobe:vdev_queue_io {
  @vdev_issue[tid] = nsecs;
  @vdev_count = count();
}
kprobe:vdev_queue_io_done /@vdev_issue[tid]/ {
  @vdev_latency = hist((nsecs - @vdev_issue[tid]) / 1000);
  delete(@vdev_issue[tid]);
}
interval:s:10 {
  printf("\n--- vdev queue I/Os: %d ---\n", @vdev_count);
  print(@vdev_latency);
  clear(@vdev_count); clear(@vdev_latency);
}'

ZFS: metaslab allocation pressure

# Metaslab allocation latency — detects fragmentation
bpftrace -e '
kprobe:metaslab_alloc {
  @alloc_start[tid] = nsecs;
}

kretprobe:metaslab_alloc /@alloc_start[tid]/ {
  @alloc_us = hist((nsecs - @alloc_start[tid]) / 1000);
  delete(@alloc_start[tid]);
}

interval:s:30 {
  printf("\n--- Metaslab allocation latency (30s) ---\n");
  print(@alloc_us);
  clear(@alloc_us);
}'

Slow metaslab allocations mean ZFS is struggling to find free space. This happens on highly fragmented pools or pools above 80% capacity. If metaslab allocations regularly exceed 10ms, it is time to add capacity or rebalance.

ZFS: TXG sync timing

# TXG sync duration — long syncs cause write stalls
bpftrace -e '
kprobe:txg_sync_thread {
  @txg_start[tid] = nsecs;
}

kretprobe:txg_sync_thread /@txg_start[tid]/ {
  $dur_ms = (nsecs - @txg_start[tid]) / 1000000;
  printf("TXG sync took %d ms\n", $dur_ms);
  @txg_duration = hist($dur_ms);
  delete(@txg_start[tid]);
}'

ZFS performance hierarchy

ARC hit: data served from RAM. Microseconds. This is the fast path.
L2ARC hit: data served from SSD cache. Hundreds of microseconds. Still fast.
Pool read: data read from spinning disk or primary SSD. Milliseconds.
TXG sync: ZFS flushes dirty data to disk periodically. Long syncs (>5 seconds) cause write latency spikes. Trace them to find the root cause (too much dirty data, slow disks, fragmentation).

Analogy: ARC is your desk (instant). L2ARC is the filing cabinet in your office (quick walk). Pool read is the storage room down the hall (real trip). TXG sync is cleaning your desk at the end of the day — if you let too much pile up, it takes forever.

I have seen ZFS pools that looked fine in zpool iostat but had terrible write latency. The problem was always metaslab fragmentation or TXG sync storms. You cannot see either one without tracing. biolatency shows you the block device is slow. ZIO tracing shows you WHY it is slow — was it the checksum, the compression, the vdev queue, or the metaslab allocator? That distinction is the difference between "add more disks" and "defragment the pool."

Network Deep Dive

Network performance analysis with eBPF goes beyond packet counts. You can trace TCP retransmits with full connection details, connection lifecycle events, socket buffer pressure, packet drops with kernel-provided reasons, and the TCP congestion window over time. Every packet drop has a story — eBPF tells it.

TCP retransmit analysis

# TCP retransmits with full connection details
tcpretrans

Output:

TIME     PID    LADDR:LPORT          RADDR:RPORT          STATE
14:23:01 18234  10.0.1.5:5432        10.0.1.20:48912      ESTABLISHED
14:23:01 18234  10.0.1.5:5432        10.0.1.20:48913      ESTABLISHED
14:23:03 2891   10.0.1.5:80          10.0.2.100:55123     ESTABLISHED
14:23:04 18234  10.0.1.5:5432        10.0.1.20:48912      ESTABLISHED

Three retransmits to 10.0.1.20 in 3 seconds on the postgres port. The network path between these two hosts is dropping packets.

# Retransmits per remote host (aggregate)
bpftrace -e '
kprobe:tcp_retransmit_skb {
  $sk = (struct sock *)arg0;
  @retrans[ntop($sk->__sk_common.skc_daddr)] = count();
}
interval:s:30 {
  printf("\n--- Retransmits by remote host (30s) ---\n");
  print(@retrans, 10);
  clear(@retrans);
}'

Output:

--- Retransmits by remote host (30s) ---
@retrans[10.0.1.20]: 89
@retrans[10.0.2.100]: 3
@retrans[10.0.3.50]: 1

89 retransmits in 30 seconds to 10.0.1.20. That one host has a bad network path. Check the switch, the cable, the NIC — something between here and there is dropping packets.

TCP connection lifecycle

# Full TCP connection lifecycle: connect, duration, bytes
tcplife

Output:

PID   COMM       LADDR           LPORT RADDR           RPORT TX_KB RX_KB MS
18234 postgres   10.0.1.5        5432  10.0.1.20       48912  1234  5678  45123
2891  nginx      10.0.1.5        80    10.0.2.100      55123    89   234    512
5678  curl       10.0.1.5        44567 93.184.216.34   443      2    12    234
18234 postgres   10.0.1.5        5432  10.0.1.20       48913   567  2345  23456

That first postgres connection lasted 45 seconds and transferred 6.9MB. If this is a query connection, a 45-second query is probably the "slow database" everyone is complaining about.

# Connection rate: new connections per second
bpftrace -e '
kprobe:tcp_v4_connect { @connect[comm] = count(); }
kprobe:inet_csk_accept { @accept[comm] = count(); }
interval:s:1 {
  printf("--- Connects/Accepts per second ---\n");
  print(@connect, 5); print(@accept, 5);
  clear(@connect); clear(@accept);
}'

Output:

--- Connects/Accepts per second ---
@connect[curl]: 234
@connect[python3]: 89
@accept[nginx]: 312
@accept[postgres]: 45

234 new curl connections per second. Someone is hitting an API without connection pooling. Each TCP connect costs a 3-way handshake (1.5 RTT). At 50ms RTT, that is 75ms per request just for the handshake.

Socket buffer pressure

# Track socket buffer fullness — detect application backpressure
bpftrace -e '
kprobe:sock_rcvbuf_full {
  @rcvbuf_full[comm] = count();
}
kprobe:sock_sndbuf_full {
  @sndbuf_full[comm] = count();
}
interval:s:10 {
  printf("\n--- Socket buffer full events (10s) ---\n");
  printf("Receive buffers:\n");
  print(@rcvbuf_full, 10);
  printf("Send buffers:\n");
  print(@sndbuf_full, 10);
  clear(@rcvbuf_full); clear(@sndbuf_full);
}'

If receive buffers are frequently full, the application is not reading from the socket fast enough. If send buffers are full, the remote end is not acknowledging data fast enough (network congestion or slow receiver).

Packet drop tracing with reasons

# Packet drops with kernel-provided reason codes
bpftrace -e '
tracepoint:skb:kfree_skb {
  @drops[args->reason, args->protocol] = count();
  @drop_stack[kstack] = count();
}
interval:s:30 {
  printf("\n--- Packet drops by reason (30s) ---\n");
  print(@drops, 20);
  printf("\n--- Drop call stacks ---\n");
  print(@drop_stack, 5);
  clear(@drops); clear(@drop_stack);
}'

Output:

--- Packet drops by reason (30s) ---
@drops[2, 2048]: 1234
@drops[5, 2048]: 89
@drops[16, 2048]: 23

Reason codes map to kernel constants. Reason 2 is SKB_DROP_REASON_NOT_SPECIFIED (generic), reason 5 is SKB_DROP_REASON_TCP_OLD_DATA (retransmit of already-received data — harmless), reason 16 is SKB_DROP_REASON_NETFILTER_DROP (firewall). The call stack tells you exactly which kernel function dropped the packet.

TCP congestion window tracking

# Track TCP congestion window size over time for a specific connection
bpftrace -e '
kprobe:tcp_cong_avoid_ai /
  ((struct sock *)arg0)->__sk_common.skc_daddr == 0x140A000A/ {
  $tp = (struct tcp_sock *)arg0;
  printf("%lld cwnd=%d ssthresh=%d\n",
    nsecs / 1000000, $tp->snd_cwnd, $tp->snd_ssthresh);
}'

The congestion window (cwnd) determines how many bytes TCP can send before waiting for an ACK. After a retransmit, cwnd drops dramatically (halved or worse). Watching cwnd over time shows you the TCP flow's "heartbeat" — steady growth means healthy, sudden drops mean packet loss.

# Simpler: track cwnd changes for all connections
bpftrace -e '
kprobe:tcp_cong_avoid_ai {
  $tp = (struct tcp_sock *)arg0;
  @cwnd = hist($tp->snd_cwnd);
}
interval:s:10 {
  printf("\n--- Congestion window distribution ---\n");
  print(@cwnd);
  clear(@cwnd);
}'

Output:

--- Congestion window distribution ---
@cwnd:
[1]                      23  |*                                       |
[2, 4)                   89  |****                                    |
[4, 8)                  234  |***********                             |
[8, 16)                 567  |***************************             |
[16, 32)               891   |****************************************|
[32, 64)               456   |********************                    |
[64, 128)              123   |*****                                   |
[128, 256)              45   |**                                      |

Most connections have cwnd of 16-32 segments. Connections stuck at cwnd=1 are recovering from loss. Connections at 128+ are running on fast, clean paths. If you see most connections clustered at 1-4, there is widespread packet loss on the network.

Network performance is about tail latency

Average network latency is meaningless. A network that is 2ms 99% of the time and 200ms 1% of the time has "2ms average latency" but will destroy any application that makes 100+ requests per page load. eBPF lets you see the full distribution. tcplife shows per-connection duration. tcpretrans shows exactly which connections suffer. Aggregates lie. Distributions tell the truth.

Analogy: an airline that is on time 99% of flights but delays 1% by 24 hours has "great average performance." But if you fly weekly, you will lose a day of your life every two years to that 1%. That is what p99 network latency does to your application.

Filesystem Analysis

The filesystem sits between your application and the block device. It adds its own caching, its own locking, its own latency. eBPF can trace VFS operations, measure per-file latency, monitor cache hit rates, and for ZFS specifically: ARC behavior per dataset, DMU object access patterns, and ZPL (ZFS POSIX Layer) timing.

VFS operation timing

# VFS read/write latency by process
bpftrace -e '
kprobe:vfs_read { @start_read[tid] = nsecs; }
kretprobe:vfs_read /@start_read[tid]/ {
  @read_us[comm] = hist((nsecs - @start_read[tid]) / 1000);
  delete(@start_read[tid]);
}

kprobe:vfs_write { @start_write[tid] = nsecs; }
kretprobe:vfs_write /@start_write[tid]/ {
  @write_us[comm] = hist((nsecs - @start_write[tid]) / 1000);
  delete(@start_write[tid]);
}'

Output (after Ctrl+C):

@read_us[postgres]:
[0]                    5678  |*********************                   |
[1]                    8901  |*********************************       |
[2, 4)                10234  |****************************************|
[4, 8)                 3456  |*************                           |
[8, 16)                1234  |****                                    |
[16, 32)                456  |*                                       |
[32, 64)                 89  |                                        |
[64, 128)                12  |                                        |
[128, 256)                3  |                                        |

@write_us[postgres]:
[1]                    3456  |**************                          |
[2, 4)                 8901  |****************************************|
[4, 8)                 6789  |******************************          |
[8, 16)                2345  |**********                              |
[16, 32)                891  |****                                    |

Most reads complete in 0-4us (page cache hits). The tail at 64-256us represents cache misses that hit disk. Writes are slightly slower because they involve page dirtying and potential journal commits.

Per-file latency tracking

# Track which specific files have the highest read latency
bpftrace -e '
kprobe:vfs_read {
  @start[tid] = nsecs;
  @fd[tid] = arg0;
}
kretprobe:vfs_read /@start[tid]/ {
  $lat_us = (nsecs - @start[tid]) / 1000;
  if ($lat_us > 1000) {
    printf("SLOW READ: %d us by %s (pid %d)\n",
      $lat_us, comm, pid);
  }
  @latency[comm] = hist($lat_us);
  delete(@start[tid]); delete(@fd[tid]);
}'

Output:

SLOW READ: 12345 us by postgres (pid 18234)
SLOW READ: 8901 us by postgres (pid 18234)
SLOW READ: 5678 us by rsync (pid 2891)
SLOW READ: 23456 us by postgres (pid 18190)
SLOW READ: 1234 us by nginx (pid 3456)

postgres PID 18234 has reads taking 12ms and 23ms. Those are definitely not in cache. Use strace -p 18234 -e read -T to correlate with file descriptors, or use /proc/18234/fdinfo/ to map FDs to files.

Page cache hit rate analysis

# Page cache statistics every second
cachestat

Output:

    HITS   MISSES  DIRTIES  BUFFERS_MB  CACHED_MB  HIT%
   12847      234       89        128       3841   98.2%
   13102      187       92        128       3842   98.6%
   11893     1847      312        128       3798   86.6%
    9234     4521      567        128       3612   67.1%
    8012     6234      891        127       3401   56.2%

Reading the numbers

98% hit rate: your working set fits in RAM. Everything is fast.
86% and dropping: something is pushing your working set out of cache. Maybe a backup job, maybe a new query pattern, maybe ZFS ARC is competing for RAM.
Below 70%: serious cache pressure. The system is reading from disk for data that should be in RAM. Performance is degrading.

Analogy: the page cache is a desk. When everything you need is on the desk (98% hit), work is fast. When half your documents are in a filing cabinet across the room (50% hit), you spend half your time walking back and forth.

# Per-process cache miss tracking
bpftrace -e '
tracepoint:filemap:mm_filemap_add_to_page_cache {
  @cache_miss[comm] = count();
}
interval:s:10 {
  printf("\n--- Page cache misses by process (10s) ---\n");
  print(@cache_miss, 10);
  clear(@cache_miss);
}'

ZFS: ARC analysis by dataset

# ZFS ARC hit/miss ratio in real time
bpftrace -e '
kprobe:arc_read {
  @arc_reads = count();
}

kprobe:arc_read_done {
  @arc_hits = count();
}

interval:s:5 {
  printf("ARC reads: %d  hits: %d\n", @arc_reads, @arc_hits);
  clear(@arc_reads);
  clear(@arc_hits);
}'

# Quick ARC stats from /proc (for comparison)
cat /proc/spl/kstat/zfs/arcstats | grep -E '^hits|^misses'

# ARC eviction tracking — what gets kicked out and why
bpftrace -e '
kprobe:arc_evict {
  @arc_evict = count();
  @arc_evict_by[comm] = count();
}
kprobe:arc_evict_hdr {
  @hdr_evict = count();
}
interval:s:10 {
  printf("\n--- ARC evictions (10s) ---\n");
  print(@arc_evict); print(@arc_evict_by, 5);
  print(@hdr_evict);
  clear(@arc_evict); clear(@arc_evict_by);
  clear(@hdr_evict);
}'

High ARC eviction rates mean the working set does not fit in the ARC. Either increase zfs_arc_max or add an L2ARC device. If one process dominates evictions, that process's dataset is thrashing the cache for everyone else. Consider setting secondarycache=none on less important datasets.

ZFS: DMU object access

# DMU (Data Management Unit) read tracking
bpftrace -e '
kprobe:dmu_read {
  @dmu_start[tid] = nsecs;
  @dmu_reads = count();
}
kretprobe:dmu_read /@dmu_start[tid]/ {
  @dmu_latency = hist((nsecs - @dmu_start[tid]) / 1000);
  delete(@dmu_start[tid]);
}
interval:s:10 {
  printf("\n--- DMU reads: %d ---\n", @dmu_reads);
  printf("--- DMU read latency (us) ---\n");
  print(@dmu_latency);
  clear(@dmu_reads); clear(@dmu_latency);
}'

ZFS: ZPL (POSIX Layer) timing

# ZPL operation timing — the layer between POSIX syscalls and ZFS
bpftrace -e '
kprobe:zfs_read { @zr_start[tid] = nsecs; }
kretprobe:zfs_read /@zr_start[tid]/ {
  @zpl_read = hist((nsecs - @zr_start[tid]) / 1000);
  delete(@zr_start[tid]);
}
kprobe:zfs_write { @zw_start[tid] = nsecs; }
kretprobe:zfs_write /@zw_start[tid]/ {
  @zpl_write = hist((nsecs - @zw_start[tid]) / 1000);
  delete(@zw_start[tid]);
}
kprobe:zfs_lookup { @zl_start[tid] = nsecs; }
kretprobe:zfs_lookup /@zl_start[tid]/ {
  @zpl_lookup = hist((nsecs - @zl_start[tid]) / 1000);
  delete(@zl_start[tid]);
}
interval:s:10 {
  printf("\n--- ZPL read latency (us) ---\n");
  print(@zpl_read);
  printf("\n--- ZPL write latency (us) ---\n");
  print(@zpl_write);
  printf("\n--- ZPL lookup latency (us) ---\n");
  print(@zpl_lookup);
  clear(@zpl_read); clear(@zpl_write); clear(@zpl_lookup);
}'

Output:

--- ZPL read latency (us) ---
@zpl_read:
[0]                    3456  |***************                         |
[1]                    8901  |****************************************|
[2, 4)                 6789  |******************************          |
[4, 8)                 2345  |**********                              |
[8, 16)                 891  |****                                    |
[16, 32)                234  |*                                       |
[32, 64)                 45  |                                        |

--- ZPL lookup latency (us) ---
@zpl_lookup:
[0]                    5678  |****************************************|
[1]                    3456  |************************                |
[2, 4)                 1234  |********                                |
[4, 8)                  456  |***                                     |

ZPL reads are mostly 0-4us (ARC cached). Lookups are even faster because they hit the dnode cache. If lookup latency has a tail extending to milliseconds, the directory structure is not cached and ZFS is reading dnode blocks from disk.

Application Profiling with uprobes and USDT

eBPF is not just for kernel tracing. uprobes attach to any user-space function by symbol name. USDT (User Statically Defined Tracing) probes are pre-placed tracepoints in applications like PostgreSQL, MySQL, Node.js, Python, and the JVM. These let you trace application-level events with kernel-level efficiency.

PostgreSQL query tracing

# Trace PostgreSQL query execution (requires debug symbols or USDT)
# PostgreSQL has USDT probes when compiled with --enable-dtrace

# Trace query start/end with USDT
bpftrace -e '
usdt:/usr/pgsql-16/bin/postgres:postgresql:query__start {
  @query_start[tid] = nsecs;
  printf("QUERY START: %s\n", str(arg0));
}
usdt:/usr/pgsql-16/bin/postgres:postgresql:query__done {
  if (@query_start[tid]) {
    $lat_ms = (nsecs - @query_start[tid]) / 1000000;
    printf("QUERY DONE: %d ms\n", $lat_ms);
    @query_latency = hist($lat_ms);
    delete(@query_start[tid]);
  }
}'

Output:

QUERY START: SELECT * FROM orders WHERE customer_id = $1
QUERY DONE: 2 ms
QUERY START: SELECT * FROM orders WHERE customer_id = $1
QUERY DONE: 1 ms
QUERY START: UPDATE inventory SET quantity = quantity - $1 WHERE product_id = $2
QUERY DONE: 45 ms
QUERY START: SELECT o.*, p.name FROM orders o JOIN products p ON o.product_id = p.id
QUERY DONE: 234 ms

That JOIN query is taking 234ms. The simple SELECTs are 1-2ms. You now know exactly which query pattern to optimize.

# Without USDT: trace postgres via uprobe on the query executor
bpftrace -e '
uprobe:/usr/pgsql-16/bin/postgres:exec_simple_query {
  @q_start[tid] = nsecs;
  @queries = count();
}
uretprobe:/usr/pgsql-16/bin/postgres:exec_simple_query /@q_start[tid]/ {
  @q_latency = hist((nsecs - @q_start[tid]) / 1000000);
  delete(@q_start[tid]);
}
interval:s:30 {
  printf("\n--- Postgres queries: %d in 30s ---\n", @queries);
  printf("--- Query latency (ms) ---\n");
  print(@q_latency);
  clear(@queries); clear(@q_latency);
}'

nginx request tracing

# Trace nginx request processing time via uprobe
bpftrace -e '
uprobe:/usr/sbin/nginx:ngx_http_process_request {
  @req_start[tid] = nsecs;
}
uretprobe:/usr/sbin/nginx:ngx_http_finalize_request /@req_start[tid]/ {
  $lat_us = (nsecs - @req_start[tid]) / 1000;
  @request_latency = hist($lat_us);
  if ($lat_us > 10000) {
    printf("SLOW REQUEST: %d us (pid %d)\n", $lat_us, pid);
  }
  delete(@req_start[tid]);
}
interval:s:30 {
  printf("\n--- nginx request latency (us, 30s) ---\n");
  print(@request_latency);
  clear(@request_latency);
}'

Python function tracing

# Python 3 has USDT probes for function entry/return
# Requires: python3 built with --with-dtrace

# Trace Python function calls with latency
bpftrace -e '
usdt:/usr/bin/python3:python:function__entry {
  @py_start[tid, str(arg0), str(arg1)] = nsecs;
}
usdt:/usr/bin/python3:python:function__return {
  $key = (tid, str(arg0), str(arg1));
  if (@py_start[$key]) {
    $lat_us = (nsecs - @py_start[$key]) / 1000;
    if ($lat_us > 1000) {
      printf("SLOW: %s.%s took %d us\n",
        str(arg0), str(arg1), $lat_us);
    }
    @py_latency[str(arg0), str(arg1)] = hist($lat_us);
    delete(@py_start[$key]);
  }
}'

Output:

SLOW: app/views.py.get_orders took 45678 us
SLOW: app/models.py.serialize_order took 12345 us
SLOW: app/views.py.get_orders took 34567 us

Java / JVM tracing

# JVM has USDT probes for GC, method compilation, thread events
# Trace JVM garbage collection pauses

bpftrace -e '
usdt:/usr/lib/jvm/java-17/lib/server/libjvm.so:hotspot:gc__begin {
  @gc_start[tid] = nsecs;
  printf("GC BEGIN (type: %s)\n", str(arg0));
}
usdt:/usr/lib/jvm/java-17/lib/server/libjvm.so:hotspot:gc__end {
  if (@gc_start[tid]) {
    $dur_ms = (nsecs - @gc_start[tid]) / 1000000;
    printf("GC END: %d ms\n", $dur_ms);
    @gc_duration = hist($dur_ms);
    delete(@gc_start[tid]);
  }
}'

Output:

GC BEGIN (type: G1 Young Generation)
GC END: 12 ms
GC BEGIN (type: G1 Young Generation)
GC END: 8 ms
GC BEGIN (type: G1 Old Generation)
GC END: 456 ms
GC BEGIN (type: G1 Young Generation)
GC END: 15 ms

That 456ms old-gen GC pause is a stop-the-world event. Every request in flight during that pause waited 456ms. This is the classic Java latency spike. Tune -XX:MaxGCPauseMillis or switch to ZGC for sub-millisecond pauses.

Go runtime tracing

# Trace Go goroutine scheduling via uprobe
# Go binaries are statically linked — use the binary path directly
bpftrace -e '
uprobe:/usr/local/bin/mygoapp:runtime.newproc1 {
  @goroutine_create = count();
}
uprobe:/usr/local/bin/mygoapp:runtime.goexit1 {
  @goroutine_exit = count();
}
interval:s:10 {
  printf("Goroutines created: %d  exited: %d\n",
    @goroutine_create, @goroutine_exit);
  clear(@goroutine_create); clear(@goroutine_exit);
}'

# Trace Go GC pauses
bpftrace -e '
uprobe:/usr/local/bin/mygoapp:runtime.gcStart {
  @gc_start[tid] = nsecs;
}
uprobe:/usr/local/bin/mygoapp:runtime.gcMarkDone /@gc_start[tid]/ {
  $dur_us = (nsecs - @gc_start[tid]) / 1000;
  printf("Go GC: %d us\n", $dur_us);
  @go_gc = hist($dur_us);
  delete(@gc_start[tid]);
}'

uprobe vs USDT

uprobes can attach to any function symbol in a binary. They work on any application with debug symbols (or even without, if you know the offset). Downside: they are fragile across binary versions because function names and offsets change.
USDT probes are stable, documented tracepoints placed by the application developer. They survive binary upgrades. Downside: the application must be compiled with USDT support (e.g., --enable-dtrace for PostgreSQL, --with-dtrace for Python).
Rule of thumb: use USDT when available. Fall back to uprobes when it is not. Both have near-zero overhead when not actively tracing.

Application-level eBPF tracing is the bridge between "the system is slow" and "this specific function in this specific code path is slow." I have seen teams spend weeks optimizing database queries when the actual bottleneck was a Python serializer that called json.dumps() on a 50MB object. uprobe on the serializer, 5 minutes, found it. The kernel tools get you to the process. The application probes get you to the line of code.

Latency Analysis Patterns — the drill-down

The most powerful eBPF technique is the latency drill-down: start at the application, measure each layer, and find where the time goes. A request passes through: application code, system call interface, filesystem layer, block device layer, and physical device. eBPF can measure the latency at each boundary.

Full-stack latency breakdown

# Layer 1: Application (uprobe) — total request time
bpftrace -e '
uprobe:/usr/sbin/nginx:ngx_http_process_request {
  @app_start[tid] = nsecs;
}
uretprobe:/usr/sbin/nginx:ngx_http_finalize_request /@app_start[tid]/ {
  @app_latency = hist((nsecs - @app_start[tid]) / 1000);
  delete(@app_start[tid]);
}
interval:s:30 { printf("APP layer:\n"); print(@app_latency); clear(@app_latency); }'

# Layer 2: Syscall — time spent in kernel on behalf of the process
bpftrace -e '
tracepoint:raw_syscalls:sys_enter /comm == "nginx"/ {
  @sys_start[tid] = nsecs;
}
tracepoint:raw_syscalls:sys_exit /comm == "nginx" && @sys_start[tid]/ {
  @syscall_latency = hist((nsecs - @sys_start[tid]) / 1000);
  delete(@sys_start[tid]);
}
interval:s:30 { printf("SYSCALL layer:\n"); print(@syscall_latency); clear(@syscall_latency); }'

# Layer 3: Filesystem — VFS operation time
bpftrace -e '
kprobe:vfs_read /comm == "nginx"/ { @vfs_start[tid] = nsecs; }
kretprobe:vfs_read /comm == "nginx" && @vfs_start[tid]/ {
  @vfs_latency = hist((nsecs - @vfs_start[tid]) / 1000);
  delete(@vfs_start[tid]);
}
interval:s:30 { printf("VFS layer:\n"); print(@vfs_latency); clear(@vfs_latency); }'

# Layer 4: Block device — actual disk I/O time
biolatency 30

Compare the histograms. If the app shows 50ms but syscalls are 48ms, the time is in the kernel. If syscalls are 48ms but VFS is 45ms, the time is in the filesystem. If VFS is 45ms but block I/O is 40ms, the time is on the disk. If block I/O is only 2ms but VFS is 45ms, the filesystem is doing something expensive (lock contention, journaling, checksum computation) that is not I/O-bound.

Latency waterfall for a single request

# Trace a single request through every layer with timestamps
bpftrace -e '
uprobe:/usr/sbin/nginx:ngx_http_process_request {
  @t0[tid] = nsecs;
  printf("[%d] APP START\n", tid);
}
tracepoint:raw_syscalls:sys_enter /comm == "nginx" && @t0[tid]/ {
  printf("[%d] SYSCALL %d at +%d us\n",
    tid, args->id, (nsecs - @t0[tid]) / 1000);
}
kprobe:vfs_read /comm == "nginx" && @t0[tid]/ {
  printf("[%d] VFS READ at +%d us\n",
    tid, (nsecs - @t0[tid]) / 1000);
}
tracepoint:block:block_rq_issue /comm == "nginx" && @t0[tid]/ {
  printf("[%d] BLK ISSUE at +%d us\n",
    tid, (nsecs - @t0[tid]) / 1000);
}
tracepoint:block:block_rq_complete /comm == "nginx" && @t0[tid]/ {
  printf("[%d] BLK DONE at +%d us\n",
    tid, (nsecs - @t0[tid]) / 1000);
}
uretprobe:/usr/sbin/nginx:ngx_http_finalize_request /@t0[tid]/ {
  printf("[%d] APP DONE at +%d us\n",
    tid, (nsecs - @t0[tid]) / 1000);
  delete(@t0[tid]);
}'

Output:

[18234] APP START
[18234] SYSCALL 0 at +12 us
[18234] VFS READ at +15 us
[18234] BLK ISSUE at +23 us
[18234] BLK DONE at +156 us
[18234] SYSCALL 1 at +167 us
[18234] VFS READ at +170 us
[18234] APP DONE at +189 us

This single request took 189us total. The first VFS read caused a block I/O that took 133us (23 to 156). The second VFS read was a cache hit (170 to ~180us, no block I/O). You can see exactly where the time went, for one request, through every kernel layer.

The drill-down principle

Never optimize a layer until you have measured all layers. If 95% of latency is in the block device, optimizing the filesystem code path is wasted effort. If 95% of latency is in application code (serialization, parsing, computation), adding faster disks will not help. Measure first. Always.

Analogy: if your commute is 60 minutes and 55 of them are sitting in traffic, buying a faster car will not help. You need a different route. The drill-down shows you which "road" is congested.

funccount — count kernel function calls

funccount counts how many times a kernel function is called. Use it to understand workload patterns at the kernel level.

# How many times are ext4 functions called? (10 seconds)
funccount 'ext4_*' 10

Output:

FUNC                          COUNT
ext4_file_read_iter            8234
ext4_readdir                   1234
ext4_file_write_iter            892
ext4_lookup                     567
ext4_getattr                    345
ext4_file_open                  234

# Count ZFS function calls
funccount 'zfs_*' 10

# Count TCP functions (network activity)
funccount 'tcp_*' 10

# Count a specific function
funccount 'vfs_read' 10

Syscall latency by process

bpftrace -e '
tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
  @usecs[comm] = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'

This shows a latency histogram for every syscall, grouped by process name. If postgres has a tail of 10ms+ syscalls, something is blocking it.

Real-World Investigation Examples

Theory is useful. But you learn eBPF by watching real investigations from start to finish. Here are five real-world scenarios, each with the full diagnostic path, the tools used, the output that revealed the answer, and the fix.

Example 1: "My database is slow"

Step-by-step diagnosis with eBPF. Someone reports that Postgres queries are taking 5x longer than yesterday.

Step 1: Is it CPU?

# Check scheduler latency for postgres processes
runqlat -p $(pgrep -d, postgres) 10

If the histogram shows most waits under 10us, CPU is not the bottleneck. Move on.

Step 2: Is it disk?

# Check disk latency histogram
biolatency 10

If you see a bimodal distribution with a peak at 50ms+, storage is the problem. Drill deeper:

# Which process is doing the I/O?
biotop 10

If postgres dominates with high average latency, the query is reading data that is not in cache.

Step 3: Is it cache?

# Check page cache hit rate
cachestat 5

If hit rate dropped from 98% to 70%, something evicted the database's working set from cache. Check what:

# What's allocating memory? (might be a backup, a cron job, etc.)
bpftrace -e '
tracepoint:kmem:mm_page_alloc {
  @pages[comm] = count();
}
interval:s:10 {
  print(@pages, 10);
  clear(@pages);
}'

Step 4: Is it the network?

# Check for TCP retransmits on the postgres port
tcpretrans 10

If retransmits correlate with slow queries, the network between the app server and the database is dropping packets.

Verdict

In this real-world case, cachestat revealed the hit rate dropped because a nightly backup job (rsync) ran 2 hours late and overlapped with peak query time, evicting postgres pages from cache. The fix: adjust the backup schedule. Total diagnosis time with eBPF: 5 minutes.

Example 2: "p99 latency spikes every 30 seconds"

The monitoring dashboard shows the application's p99 response time spikes from 5ms to 200ms every 30 seconds, like clockwork. p50 stays at 2ms. Something periodic is causing the tail.

Step 1: Correlate with system events

# Record timestamps of disk latency spikes
bpftrace -e '
tracepoint:block:block_rq_complete {
  $lat_us = args->nr_sector;  // approximate
  if ($lat_us > 10000) {
    time("%H:%M:%S "); printf("HIGH DISK LAT\n");
  }
}
interval:s:1 {
  time("%H:%M:%S ");
  printf("tick\n");
}' > /tmp/disk-events.log 2>&1 &

# Simultaneously record which processes are doing I/O
bpftrace -e '
tracepoint:block:block_rq_issue {
  time("%H:%M:%S ");
  printf("IO: %s dev=%d sectors=%d\n",
    comm, args->dev, args->nr_sector);
}' > /tmp/io-events.log 2>&1 &

Step 2: Check for periodic kernel activity

# What fires every 30 seconds?
bpftrace -e '
kprobe:txg_sync_thread {
  time("%H:%M:%S "); printf("TXG SYNC START\n");
  @txg_start[tid] = nsecs;
}
kretprobe:txg_sync_thread /@txg_start[tid]/ {
  $ms = (nsecs - @txg_start[tid]) / 1000000;
  time("%H:%M:%S "); printf("TXG SYNC DONE: %d ms\n", $ms);
  delete(@txg_start[tid]);
}'

Output:

14:23:00 TXG SYNC START
14:23:00 TXG SYNC DONE: 189 ms
14:23:30 TXG SYNC START
14:23:30 TXG SYNC DONE: 201 ms
14:24:00 TXG SYNC START
14:24:00 TXG SYNC DONE: 178 ms

TXG syncs every 30 seconds, taking ~200ms each. During the sync, ZFS flushes all dirty data to disk, temporarily consuming all I/O bandwidth. Any application read that happens during the sync gets queued behind the flush.

Step 3: Confirm with biolatency during the spike

# Run biolatency for a few cycles to capture during and between syncs
biolatency -T 5

Output (during TXG sync):

14:23:00
     usecs               : count    distribution
        16 -> 31         : 12       |*                                       |
        32 -> 63         : 34       |****                                    |
        64 -> 127        : 89       |***********                             |
       128 -> 255        : 234      |******************************          |
       256 -> 511        : 312      |****************************************|
       512 -> 1023       : 189      |************************                |
      1024 -> 2047       : 67       |********                                |

Output (between syncs):

14:23:05
     usecs               : count    distribution
         4 -> 7          : 34       |****                                    |
         8 -> 15         : 123      |***************                         |
        16 -> 31         : 312      |****************************************|
        32 -> 63         : 189      |************************                |
        64 -> 127        : 45       |*****                                   |

During syncs, disk latency jumps from 16-63us to 128-2047us. That is a 20x increase. Application reads that land during this window experience the p99 spike.

The fix

# Reduce TXG timeout to flush smaller batches more frequently
echo 5 > /sys/module/zfs/parameters/zfs_txg_timeout

# Cap dirty data to prevent massive flushes
echo 2147483648 > /sys/module/zfs/parameters/zfs_dirty_data_max  # 2GB

# Persistent:
echo 'options zfs zfs_txg_timeout=5 zfs_dirty_data_max=2147483648' \
  >> /etc/modprobe.d/zfs.conf

After the fix, TXG syncs happen every 5 seconds with much less data to flush. Each sync takes 20-40ms instead of 200ms. The p99 spike disappears.

Example 3: "Network throughput dropped"

A service that normally transfers 800 Mbps between two hosts dropped to 200 Mbps overnight. Nothing changed in the application. The network team says the links are fine.

Step 1: Check for retransmits

# TCP retransmits with rate
bpftrace -e '
kprobe:tcp_retransmit_skb {
  @retrans = count();
}
interval:s:5 {
  printf("Retransmits in 5s: %d\n", @retrans);
  clear(@retrans);
}'

Output:

Retransmits in 5s: 234
Retransmits in 5s: 189
Retransmits in 5s: 267
Retransmits in 5s: 201

200+ retransmits every 5 seconds. That is significant packet loss. Each retransmit halves the congestion window, throttling throughput.

Step 2: Which direction is losing packets?

# Track retransmit direction
bpftrace -e '
kprobe:tcp_retransmit_skb {
  $sk = (struct sock *)arg0;
  @retrans_to[ntop($sk->__sk_common.skc_daddr)] = count();
}
interval:s:10 {
  print(@retrans_to);
  clear(@retrans_to);
}'

Output:

@retrans_to[10.0.1.20]: 412
@retrans_to[10.0.2.100]: 3

All retransmits are to 10.0.1.20. The path to that specific host is the problem.

Step 3: Check congestion window collapse

# Congestion window for connections to the affected host
bpftrace -e '
kprobe:tcp_cong_avoid_ai {
  $sk = (struct sock *)arg0;
  if ($sk->__sk_common.skc_daddr == 0x140A000A) {
    $tp = (struct tcp_sock *)$sk;
    @cwnd = hist($tp->snd_cwnd);
  }
}
interval:s:10 {
  print(@cwnd); clear(@cwnd);
}'

Output:

@cwnd:
[1]                     89  |****************************************|
[2, 4)                  67  |******************************          |
[4, 8)                  45  |********************                    |
[8, 16)                 23  |**********                              |
[16, 32)                12  |*****                                   |
[32, 64)                 3  |*                                       |

Most of the time, cwnd is at 1-4 segments. Healthy connections should be at 32+ for 800 Mbps throughput. The congestion window keeps getting smashed by retransmits, then slowly grows, then gets smashed again.

Step 4: Trace packet drops on the local host

# Are packets being dropped locally?
bpftrace -e '
tracepoint:skb:kfree_skb /args->protocol == 0x0800/ {
  @local_drops[kstack(3)] = count();
}
interval:s:10 {
  print(@local_drops, 5);
  clear(@local_drops);
}'

If local drops are near zero, the loss is on the network. In this case, it was a failing SFP+ transceiver on the upstream switch port. The transceiver was not dead enough to trigger a link-down event, but was corrupting ~0.1% of frames. Replace the transceiver, throughput returns to 800 Mbps.

Example 4: "VMs slower on this host"

Four KVM guests run on two identical hosts. Guests on host-B consistently show 30% higher p99 latency than guests on host-A, running the same workload. Same hardware, same OS, same configuration.

Step 1: Compare CPU scheduling on both hosts

# On host-A
runqlat -P 10 2>&1 | head -30

# On host-B (same command)
runqlat -P 10 2>&1 | head -30

Host-A output:

pid = 0 (all)
     usecs               : count    distribution
         0 -> 1          : 4567     |****************************************|
         2 -> 3          : 2345     |********************                    |
         4 -> 7          : 891      |*******                                 |
         8 -> 15         : 234      |**                                      |

Host-B output:

pid = 0 (all)
     usecs               : count    distribution
         0 -> 1          : 1234     |***********                             |
         2 -> 3          : 2345     |*********************                   |
         4 -> 7          : 3456     |*******************************         |
         8 -> 15         : 4567     |****************************************|
        16 -> 31         : 2345     |********************                    |
        32 -> 63         : 891      |*******                                 |
        64 -> 127        : 234      |**                                      |

Host-B has dramatically worse scheduler latency. Most tasks wait 8-15us instead of 0-1us. Something on host-B is consuming CPU that the VMs need.

Step 2: What is consuming CPU on host-B?

# Profile host-B to find the CPU hog
profile -af 10 > /tmp/hostb-profile.out

# Summarize top consumers
bpftrace -e '
profile:hz:99 {
  @cpu[comm] = count();
}
interval:s:10 {
  print(@cpu, 10);
  clear(@cpu);
}'

Output:

@cpu[qemu-system-x86]: 456
@cpu[irqbalance]: 234
@cpu[tuned]: 189
@cpu[kcompactd0]: 156
@cpu[ksoftirqd/2]: 123
@cpu[khugepaged]: 89

kcompactd0 and khugepaged are consuming significant CPU. These kernel threads compact memory for transparent hugepages. Host-B has THP enabled; host-A has it disabled.

Step 3: Confirm THP compaction is the problem

# Trace compaction events
bpftrace -e '
kprobe:compact_zone {
  @compact_start[tid] = nsecs;
}
kretprobe:compact_zone /@compact_start[tid]/ {
  $dur_us = (nsecs - @compact_start[tid]) / 1000;
  @compact_duration = hist($dur_us);
  if ($dur_us > 1000) {
    printf("COMPACT: %d us\n", $dur_us);
  }
  delete(@compact_start[tid]);
}
interval:s:30 {
  print(@compact_duration);
  clear(@compact_duration);
}'

Output:

@compact_duration:
[64, 128)               45  |****                                    |
[128, 256)             234  |***********************                 |
[256, 512)             567  |****************************************|
[512, 1K)              891  |*****************************           |
[1K, 2K)               345  |***********                             |
[2K, 4K)               123  |****                                    |
[4K, 8K)                45  |*                                       |
[8K, 16K)               12  |                                        |

The fix

# Disable transparent hugepages on host-B
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Persistent via kernel command line:
# transparent_hugepage=never

After disabling THP on host-B, scheduler latency drops to match host-A. VM p99 latency equalizes. The kcompactd and khugepaged threads stop stealing CPU from qemu vCPU threads.

Example 5: "App OOM-killed but memory looks fine"

A Java application gets OOM-killed every few days. But free -h shows 20GB free at the time of the kill. The sysadmin says "there was plenty of memory." The developer says "the OOM killer is broken." Neither is right.

Step 1: Understand the OOM context

# Check which cgroup the OOM happened in
dmesg | grep -A 20 "Out of memory"

Output:

Out of memory: Killed process 18234 (java) total-vm:12345678kB,
  anon-rss:8388608kB, file-rss:1048576kB, shmem-rss:0kB,
  UID:1000 pgtables:16384kB oom_score_adj:0
Memory cgroup out of memory: Killed process 18234 (java)
  Memory cgroup: /system.slice/myapp.service

"Memory cgroup out of memory" — the OOM was triggered by a cgroup memory limit, not system-wide memory exhaustion. The system had 20GB free, but the cgroup limit was 8GB. The Java process hit the cgroup wall.

Step 2: Trace memory growth leading up to the OOM

# Monitor cgroup memory usage over time
bpftrace -e '
kprobe:mem_cgroup_charge {
  @cgroup_charge[comm] = count();
  @cgroup_bytes[comm] = sum(4096);  // PAGE_SIZE
}
interval:s:60 {
  printf("\n--- Cgroup charges (60s) ---\n");
  print(@cgroup_charge, 10);
  printf("--- Bytes charged ---\n");
  print(@cgroup_bytes, 10);
  clear(@cgroup_charge); clear(@cgroup_bytes);
}'

Step 3: Find the memory leak

# Track page allocations attributed to the java process
bpftrace -e '
tracepoint:kmem:mm_page_alloc /comm == "java"/ {
  @java_pages = count();
  @java_order = hist(args->order);
}
interval:s:300 {
  printf("\n--- Java page allocations (5min) ---\n");
  print(@java_pages);
  print(@java_order);
  clear(@java_pages); clear(@java_order);
}'

# Also check: is the JVM growing its heap, or is it off-heap allocations?
bpftrace -e '
uprobe:/usr/lib/jvm/java-17/lib/server/libjvm.so:os::malloc {
  @jvm_malloc = sum(arg0);
  @jvm_malloc_count = count();
}
interval:s:60 {
  printf("JVM native malloc: %d bytes in %d calls\n",
    @jvm_malloc, @jvm_malloc_count);
  clear(@jvm_malloc); clear(@jvm_malloc_count);
}'

Output (over several hours):

JVM native malloc: 234567890 bytes in 12345 calls
JVM native malloc: 245678901 bytes in 12456 calls
JVM native malloc: 256789012 bytes in 12567 calls
JVM native malloc: 267890123 bytes in 12678 calls

Native (off-heap) allocations growing by ~10MB every minute. The JVM heap is fine, but a JNI library or direct ByteBuffer pool is leaking memory outside of Java's garbage collector. The GC never sees these allocations, so it never frees them. The cgroup counter sees them.

The fix

# Find which JNI library is leaking
# Trace malloc calls from the java process with call stacks
bpftrace -e '
uprobe:/usr/lib/jvm/java-17/lib/server/libjvm.so:os::malloc {
  if (arg0 > 1048576) {  // Only track allocations > 1MB
    printf("LARGE ALLOC: %d bytes\n", arg0);
    print(ustack(5));
  }
}'

The stack trace revealed a Netty direct buffer pool that was not releasing buffers on HTTP/2 connection close. The application team fixed the leak. The cgroup memory limit stayed at 8GB. No more OOM kills.

The "OOM but memory looks fine" pattern catches people constantly. System-wide free memory is irrelevant when the process is in a cgroup with a memory limit. Kubernetes, systemd, Docker — they all use cgroups. The real question is never "how much RAM does the system have?" It is "how much RAM is the cgroup allowed to use, and how much has it used?" eBPF sees the cgroup charges in real time.

Example 6: "Network latency spikes every 5 minutes"

The monitoring dashboard shows TCP latency between two services spikes to 200ms every 5 minutes, then drops back to 2ms.

Step 1: Correlate with disk

# Run biolatency with timestamps, watch for 15 minutes
biolatency -T 1 > /tmp/bio-timeline.log &

# Simultaneously watch TCP session durations
tcplife -T > /tmp/tcp-timeline.log &

# Wait 15 minutes, then kill both
sleep 900 && kill %1 %2

Step 2: Find the pattern

# Look for biolatency spikes in the log
grep -E '(1024|2048|4096|8192|16384|32768|65536) ->' /tmp/bio-timeline.log

If disk latency spikes at the same 5-minute interval as network latency, something is saturating the I/O bus and blocking the network stack.

Step 3: What is doing I/O at those times?

bpftrace -e '
kprobe:blk_account_io_start {
  @io[comm] = count();
}

interval:s:30 {
  time("%H:%M:%S ");
  print(@io, 5);
  clear(@io);
}'

If txg_sync or z_wr_iss (ZFS write issuer) appears in bursts every 5 minutes, ZFS TXG syncs are flooding the I/O subsystem. The default TXG timeout is 5 seconds, but if dirty data accumulates for 5 minutes due to zfs_txg_timeout tuning, the flush can be massive.

The fix

# Check current TXG timeout
cat /sys/module/zfs/parameters/zfs_txg_timeout

# If it's set too high, reduce it to flush more frequently with smaller batches
echo 5 > /sys/module/zfs/parameters/zfs_txg_timeout

# Also limit dirty data to prevent massive flushes
echo 4294967296 > /sys/module/zfs/parameters/zfs_dirty_data_max  # 4GB cap

Putting It All Together — the Performance Investigation Workflow

Every performance investigation follows the same pattern. Here is the workflow, from "it is slow" to "here is the fix."

Step 1: Classify. Run the 60-second checklist. Within one minute, you know which resource is the bottleneck: CPU, disk, memory, or network.

Step 2: Quantify. Use the USE method on the bottleneck resource. Measure utilization, saturation, and errors. Now you know HOW bad it is.

Step 3: Drill down. Use the deep-dive section for that resource (CPU, Disk, Memory, Network, Filesystem). Now you know WHERE in the stack the time goes.

Step 4: Identify. Use application profiling or the latency drill-down to trace the specific code path or configuration causing the problem.

Step 5: Fix and verify. Make the change. Re-run the same eBPF tools to confirm the problem is gone. If the histogram shifted, you won.

Quick reference: which tool for which question

CPU questions

"Is the CPU busy?" — profile
"Are tasks queuing?" — runqlat, runqlen
"What is my process waiting for?" — offcputime
"Who preempted my process?" — scheduler tracing
"Is my process bouncing between CPUs?" — migration tracing
"How much time in IRQ handlers?" — IRQ tracing

Disk questions

"How slow is my disk?" — biolatency
"Who is doing I/O?" — biotop
"Is I/O queuing?" — biolatency -Q
"Random or sequential?" — I/O size histogram
"ZFS pipeline slow?" — ZIO tracing
"ZFS pool fragmented?" — metaslab tracing

Memory questions

"Who is allocating memory?" — kmalloc tracing
"Major page faults?" — fault tracing
"Slab leak?" — slab alloc/free tracking
"NUMA locality?" — node allocation tracing
"Near OOM?" — direct reclaim tracing
"THP compaction cost?" — compact_zone tracing

Network questions

"Packet loss?" — tcpretrans
"Connection rate?" — tcplife
"Socket buffer full?" — buffer tracing
"Why dropped?" — kfree_skb reason tracing
"Congestion window?" — cwnd tracing
"Bandwidth per interface?" — net_dev_xmit tracing

Installation: getting the tools

# CentOS / RHEL / Rocky / Fedora
dnf install -y bcc-tools bpftrace

# Debian / Ubuntu
apt install -y bpfcc-tools bpftrace

# Arch
pacman -S bcc bpftrace

# Tools are installed to /usr/share/bcc/tools/ (bcc)
# and available as commands (bpftrace)

# Verify everything works
bpftrace -e 'BEGIN { printf("eBPF works!\n"); exit(); }'

# kldload installs all eBPF tools automatically on desktop and server profiles
# Just run the tools — they are already in PATH

# Full FlameGraph toolkit
git clone https://github.com/brendangregg/FlameGraph /opt/FlameGraph

I spent the first decade of my career using top, iostat, and vmstat to diagnose performance problems. I was wrong more often than I was right. I would see high CPU and blame the application when it was actually lock contention. I would see high disk utilization and add more disks when the real problem was a misconfigured cache. eBPF changed everything. It does not guess. It measures. The histogram does not lie. The stack trace does not lie. Once you learn these tools, you will wonder how you ever debugged anything without them.

← execsnoop — process execution monitor The bpftrace language →