
eBPF Cookbook — 50+ Copy-Paste Recipes by Symptom

This page is organized the way you actually think during an incident: by symptom. You don't sit down and say "I'd like to trace kprobes today." You say "the disk is slow" or "something is eating all the memory" or "a process is doing something weird." Every recipe here starts with that feeling and gives you the exact command to answer it.

How to use this page: Find your symptom. Copy the command. Paste it. Read the output. Every recipe follows the same structure: problem → one-liner or script → what the output means → when to reach for this. All commands assume BCC tools and bpftrace are installed — kldload desktop and server profiles include both out of the box.

Most "eBPF documentation" is organized by tool name. That's useless at 3 AM when your pager fires. Nobody remembers whether they need biolatency or biosnoop or biotop. They remember "the database is slow and I think it's the disk." This page is the one you actually bookmark.

"My disk is slow"

Disk I/O problems are the most common performance complaint and the hardest to pin down with traditional tools. iostat gives you averages. eBPF gives you distributions — you can see the 99th percentile latency that's killing your tail latency while the average looks fine.

Disk I/O latency histogram

The problem

Your application is slow but iostat shows reasonable average latency. You suspect outlier I/O operations are dragging things down — the average hides the pain.

# Show disk I/O latency as a histogram (microseconds), per-disk
biolatency -D

Output:

disk = sda
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 12       |**                                      |
         8 -> 15         : 87       |**************                          |
        16 -> 31         : 243      |****************************************|
        32 -> 63         : 198      |********************************        |
        64 -> 127        : 42       |******                                  |
       128 -> 255        : 18       |**                                      |
       256 -> 511        : 3        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 1        |                                        |

What the output means: Most I/O completes in 16-63 microseconds (fast SSD territory). But those 3 operations at 256-511us and 1 at 1-2ms? Those are your outliers. If this histogram shows a bimodal distribution — a cluster at fast latency AND a cluster at slow — you likely have a mix of cached and uncached I/O, or your SSD's write-back cache is occasionally flushing.

When to use it: First thing you run when anyone says "disk is slow." The histogram shape tells you more in 5 seconds than an hour of staring at iostat.
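To put a number on the tail, the bucket counts can be summed until they cross a chosen percentile. The sketch below assumes a single histogram is piped in (no -D, so only one disk's rows); hist_percentile is a hypothetical helper name, not part of BCC.

```shell
# hist_percentile (hypothetical helper, not a BCC tool): read biolatency-style
# histogram rows on stdin and report which bucket holds the given percentile.
hist_percentile() {
  pct="${1:-99}"
  awk -v pct="$pct" '
    BEGIN { n = 0; total = 0 }
    # Data rows look like: "16 -> 31 : 243 |*****|"
    $2 == "->" && $4 == ":" {
      lo[n] = $1; hi[n] = $3; cnt[n] = $5
      total += $5; n++
    }
    END {
      if (total == 0) { print "no samples"; exit 1 }
      target = total * pct / 100
      running = 0
      for (i = 0; i < n; i++) {
        running += cnt[i]
        if (running >= target) {
          printf "p%s is in %d-%d usecs\n", pct, lo[i], hi[i]
          exit 0
        }
      }
    }'
}

# Example: biolatency 10 1 | hist_percentile 99
```

The result is only as precise as the power-of-two buckets, but it is usually enough to tell "p99 is tens of microseconds" from "p99 is milliseconds."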

Top I/O processes by throughput

The problem

Something is hammering the disk but iotop only shows a snapshot. You need continuous tracking of which processes are doing the most I/O, with byte counts.

# Top processes by disk I/O, refreshing every 1 second
biotop 1

Output:

PID    COMM             D MAJ MIN  DISK   I/O  Kbytes  AVGms
14501  postgres         W 8   0    sda    856  13696    0.31
14502  postgres         R 8   0    sda    342   5472    0.08
1823   jbd2/sda1-8      W 8   0    sda    128   1024    0.52
9012   rsync            R 8   0    sda     94   3008    0.12

What the output means: Postgres is writing 13MB/s and reading 5MB/s. The jbd2 process is the ext4 journal — if this is high, you're generating a lot of metadata writes. The AVGms column shows average latency per I/O — the journal writes (0.52ms) are slower because they're synchronous.

When to use it: When you know the disk is busy but don't know who's responsible. biotop is top for block I/O.
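Dividing Kbytes by I/O count gives the average request size per process, which hints at whether a process is batching. A minimal sketch, assuming the column order shown in the sample output above and process names without spaces; avg_io_size is a hypothetical helper name.

```shell
# avg_io_size (hypothetical helper): read biotop rows on stdin and print the
# average request size per row. Assumes the column order shown above:
# PID COMM D MAJ MIN DISK I/O Kbytes AVGms
avg_io_size() {
  awk '
    # Skip the header and any row without a numeric, non-zero I/O count.
    $1 ~ /^[0-9]+$/ && $7 > 0 {
      printf "%s %s %.1f KB/io\n", $2, $3, $8 / $7
    }'
}

# Example: biotop 1 1 | avg_io_size
```

On the sample above this gives 16 KB per request for both postgres rows and 8 KB for the journal thread.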

I/O size distribution

The problem

Your throughput numbers look fine but IOPS are through the roof. You suspect something is doing tiny I/O operations instead of batching — a classic database misconfiguration.

# Histogram of I/O request sizes in bytes
bpftrace -e 'tracepoint:block:block_rq_issue {
  @size = hist(args.bytes);
}'

Output:

@size:
[512, 1K)              182 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |
[1K, 2K)                23 |@@@                                        |
[2K, 4K)              1847 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)               312 |@@@@@@@                                    |
[8K, 16K)               89 |@@                                         |
[16K, 32K)              14 |                                           |
[128K, 256K)            67 |@                                          |

What the output means: The distribution has three clusters. A group of 512-byte to 1K I/Os is probably filesystem metadata or small random reads. The dominant peak sits in the 2K-4K bucket, just under the page size; requests of exactly 4096 bytes are counted in [4K, 8K). The 128K-256K cluster is likely sequential reads or writes. If you see thousands of tiny I/Os, the application is probably issuing small synchronous or O_DIRECT requests instead of batching them.

When to use it: When IOPS are high but throughput is low. The shape of this histogram instantly tells you whether the workload is doing tiny random or large sequential I/O.

Random vs sequential I/O

The problem

You need to know whether your workload is random or sequential. This matters enormously for HDD (100x difference) and still matters for SSD (queue depth, wear leveling).

# Track sector offsets to see access pattern
bpftrace -e 'tracepoint:block:block_rq_issue {
  @sectors[args.dev] = hist(args.sector);
}'

# More direct: classify each request as sequential or random (bpftrace)
bpftrace -e '
tracepoint:block:block_rq_issue {
  if (@prev[args.dev] != 0) {
    // Distance between this request and the end of the previous one on this device
    $delta = (int64)(args.sector) - (int64)(@prev[args.dev]);
    if ($delta < 0) { $delta = -$delta; }
    if ($delta <= 8) {
      @pattern["sequential"] = count();
    } else {
      @pattern["random"] = count();
    }
  }
  @prev[args.dev] = args.sector + args.nr_sector;
}'

What the output means: If @pattern["random"] dominates, your workload is random I/O, which is where SSDs shine and HDDs die. If sequential dominates, you're doing streaming reads/writes and HDD throughput should be fine. A 50/50 mix is still bad news for HDDs: every random request forces a seek that interrupts the sequential streams, so throughput ends up much closer to the random case than the sequential one.

When to use it: When choosing storage hardware for a workload, or when diagnosing why an HDD-based system is slow (the answer is almost always "random I/O").
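The classification rule is easy to check offline. The sketch below applies the same "within 8 sectors of where the previous request ended" test to a list of requests; the input format (start sector and sector count per line, in issue order, for a single device) and the classify_io name are assumptions made for illustration.

```shell
# classify_io (hypothetical helper): read "start_sector nr_sectors" pairs, one
# request per line in issue order, and apply the same rule as the bpftrace
# script: a request is sequential if it starts within 8 sectors of where the
# previous request ended.
classify_io() {
  awk '
    BEGIN { seq = 0; rnd = 0; have_prev = 0 }
    NF >= 2 {
      if (have_prev) {
        delta = $1 - expected
        if (delta < 0) delta = -delta
        if (delta <= 8) seq++; else rnd++
      }
      expected = $1 + $2   # first sector after this request
      have_prev = 1
    }
    END { printf "sequential=%d random=%d\n", seq, rnd }'
}

# Three contiguous requests, a jump, one more contiguous request, a jump back.
printf '0 8\n8 8\n16 8\n1000 8\n1008 8\n50 8\n' | classify_io
```

The first request is never classified because there is nothing to compare it with, which matches the behavior of the tracing script.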

I/O queue depth

The problem

You want to know if your disk is saturated. Queue depth — how many I/O requests are in-flight simultaneously — is the true saturation metric, not utilization percentage.

# Track I/O queue depth over time (approximate: the counter is shared across CPUs)
bpftrace -e '
tracepoint:block:block_rq_issue { @inflight++; @q = hist(@inflight); }
tracepoint:block:block_rq_complete /@inflight > 0/ { @inflight--; }'

# Related BCC view: latency histogram that includes time spent queued in the OS
biolatency -Q

What the output means: NVMe drives handle queue depths of 64+ easily. SATA SSDs start struggling above 32. HDDs fall apart above 4. If your queue depth regularly exceeds your device's capability, you need faster storage or you need to reduce I/O concurrency.

When to use it: When iostat shows high %util but you're not sure if the device is actually saturated. Queue depth is the real answer.
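As a cross-check on the traced histogram, Little's law gives the average number of requests in flight as arrival rate times mean latency. A minimal sketch; avg_inflight is a hypothetical helper, and the IOPS and latency figures are made-up inputs you would take from iostat or biolatency.

```shell
# avg_inflight (hypothetical helper): estimate mean queue depth from IOPS and
# mean latency in milliseconds using Little's law (L = rate * time).
avg_inflight() {
  awk -v iops="$1" -v lat_ms="$2" 'BEGIN { printf "%.1f\n", iops * lat_ms / 1000 }'
}

avg_inflight 20000 0.2    # 20k IOPS at 0.2 ms each, prints 4.0
```

If the traced queue depth is far above this estimate, latency is probably spiking in bursts that the averages hide.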

I/O merge rate

The problem

You're seeing high IOPS but want to know if the I/O scheduler is merging adjacent requests. Low merge rates on sequential workloads means something is preventing the scheduler from doing its job.

# Count I/O merges by type
bpftrace -e '
tracepoint:block:block_bio_backmerge { @merges["back"] = count(); }
tracepoint:block:block_bio_frontmerge { @merges["front"] = count(); }
tracepoint:block:block_rq_issue { @issued = count(); }
interval:s:5 { print(@merges); print(@issued); clear(@merges); clear(@issued); }'

What the output means: If merges are zero but your workload is supposedly sequential, something is wrong — either the I/O scheduler is disabled (common with NVMe, which is fine), or your application is issuing I/O with gaps. Compare @merges to @issued: a healthy sequential workload on HDD should show merges equal to or exceeding raw issues.

When to use it: When tuning I/O schedulers or investigating why sequential throughput is lower than expected on rotating storage.

ZFS ZIO latency

The problem

Your ZFS pool is slow, but biolatency shows fine disk latency. The problem might be inside ZFS itself — the ZFS I/O pipeline (ZIO) adds its own latency for checksums, compression, encryption, and gang block handling.

# Time spent in each zio_execute() call (one pass through the ZIO pipeline stages).
# Keyed by thread ID because a kretprobe cannot read the function arguments.
bpftrace -e '
kprobe:zio_execute { @start[tid] = nsecs; }
kretprobe:zio_execute /@start[tid]/ {
  @zio_latency_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'

# Read vs write breakdown. Note: zio_read/zio_write build the ZIO and return
# before the I/O completes, so these histograms show setup cost only.
bpftrace -e '
kprobe:zio_read { @start_r[tid] = nsecs; }
kretprobe:zio_read /@start_r[tid]/ {
  @read_us = hist((nsecs - @start_r[tid]) / 1000);
  delete(@start_r[tid]);
}
kprobe:zio_write { @start_w[tid] = nsecs; }
kretprobe:zio_write /@start_w[tid]/ {
  @write_us = hist((nsecs - @start_w[tid]) / 1000);
  delete(@start_w[tid]);
}'

What the output means: If ZIO latency is much higher than block device latency, ZFS is spending time on internal operations. Common causes: compression (especially zstd at high levels), encryption (aes-256-gcm adds ~5us per block), or gang blocks (when the pool is so fragmented that a block has to be split into smaller pieces). If write latency is much higher than read, check if the ZIL is on the same device as the pool.

When to use it: When ZFS feels slower than the underlying disks justify. This separates "slow disk" from "ZFS overhead."

Metaslab allocation latency

The problem

ZFS write latency is spiking, but the disks aren't slow. On heavily fragmented pools (above 80% capacity), metaslab allocation — finding free space on disk — becomes the bottleneck.

# Trace metaslab allocation time
bpftrace -e '
kprobe:metaslab_alloc { @start[tid] = nsecs; }
kretprobe:metaslab_alloc /@start[tid]/ {
  @alloc_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'

What the output means: Healthy pools allocate in single-digit microseconds. If you see allocations taking milliseconds, your pool is fragmented. As free space shrinks and fragments, the allocator has to examine more metaslabs and falls back from fast first-fit to slower best-fit searching inside each one. The fix: don't fill pools past 80%, or add a special allocation class vdev (metadata vdev) for small blocks.

When to use it: When ZFS write performance degrades over time as the pool fills up. This proves whether fragmentation is the culprit.

Metaslab allocation latency is one of those things you can only see with eBPF. There is no zpool iostat column for it. No ZFS counter exposes it. You either trace it or you guess. Most people guess and buy faster disks that don't help.

I/O traced back to filename

The problem

You can see block I/O but need to know which files are being read or written. Block-level tools only show sector numbers — useless for debugging applications.

# Show files being read, with per-read latency
bpftrace -e '
kprobe:vfs_read { @start[tid] = nsecs; @file[tid] = str(((struct file *)arg0)->f_path.dentry->d_name.name); }
kretprobe:vfs_read /@start[tid]/ {
  printf("%-6d %-16s R %8d us  %s\n", pid, comm, (nsecs - @start[tid]) / 1000, @file[tid]);
  delete(@start[tid]); delete(@file[tid]);
}'

What the output means: You see exact filenames, process names, and latency per read. This is invaluable for figuring out which config file, log file, or database file is causing I/O. Note: this traces VFS reads, which includes cached reads — for disk-only I/O, combine with biosnoop PID matching.

When to use it: When biotop tells you which process is doing I/O but you need to know which specific files.

"My network is broken"

Network debugging is where eBPF shines brightest. Traditional tools give you packet counts. eBPF gives you why packets are being dropped, which kernel function dropped them, and the exact stack trace that led there. The difference between "5% packet loss" and "netfilter is dropping packets in nf_hook_slow because your conntrack table is full" is the difference between guessing and knowing.

TCP retransmit rate

The problem

Your application has intermittent slowness. You suspect packet loss but netstat -s only gives cumulative counters, not per-connection or per-second rates.

# Watch TCP retransmits in real-time with source, destination, and state
tcpretrans

Output:

TIME     PID    LADDR:LPORT         RADDR:RPORT         STATE
14:32:01 14501  10.0.1.5:5432       10.0.1.20:48892     ESTABLISHED
14:32:01 14501  10.0.1.5:5432       10.0.1.20:48892     ESTABLISHED
14:32:03 9282   10.0.1.5:443        203.0.113.5:61234   ESTABLISHED
14:32:07 14501  10.0.1.5:5432       10.0.1.21:49012     ESTABLISHED

What the output means: Postgres (port 5432) is retransmitting to two clients on 10.0.1.20 and 10.0.1.21. Multiple retransmits to the same host in seconds means either the network path to that host has loss, or the receiver's buffer is full. The HTTPS retransmit (port 443) to an external IP is separate — probably internet path loss. Focus on the internal retransmits first.

When to use it: First thing to run when applications report timeouts or slow responses over the network. Retransmits are the number one cause of TCP performance problems.
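Grouping the rows by remote address makes a lossy path stand out quickly. A minimal sketch, assuming the simplified column layout shown above, where the remote endpoint is the fourth field; the real tool may print extra columns, so adjust the field number. retrans_by_remote is a hypothetical helper name.

```shell
# retrans_by_remote (hypothetical helper): count tcpretrans rows per remote
# address. Assumes the simplified column layout shown above, where the
# remote endpoint is the fourth field (RADDR:RPORT).
retrans_by_remote() {
  awk '
    $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
      host = $4
      sub(/:[0-9]+$/, "", host)   # strip the port, keep the address
      hits[host]++
    }
    END { for (h in hits) print hits[h], h }' | sort -k1,1nr -k2,2
}

# Example: save a minute of tcpretrans output to a file, then:
#   retrans_by_remote < /tmp/retrans.log
```

Hosts that appear at the top repeatedly across runs are the ones whose network path deserves a closer look.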

TCP connection lifecycle

The problem

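Connections are failing, timing out, or being refused. You need a record of every TCP session: who opened it, how long it lived, and how much data it moved. (For individual state transitions such as SYN_SENT, ESTABLISHED, and CLOSE_WAIT, BCC's tcpstates shows each change with its duration.)

# Summarize every TCP session when it closes: endpoints, bytes, lifetime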
tcplife

Output:

PID   COMM       LADDR           LPORT RADDR           RPORT TX_KB RX_KB MS
14501 postgres   10.0.1.5        5432  10.0.1.20       48892  1204   867 34523
9282  nginx      10.0.1.5        443   203.0.113.5     61234    42    12   851
7201  curl       10.0.1.5        52340 10.0.2.100      8080      0     0 30012

What the output means: Every closed TCP connection with duration, bytes transferred, and process info. That curl connection at the bottom transferred zero bytes but lasted 30 seconds — it timed out trying to connect. The postgres connection was healthy: 34 seconds, transferred data in both directions. Use this to find connections that connect but never transfer data (firewall issue) or transfer data but take too long (application issue).

When to use it: When you need the big picture of TCP connection health. This is your "who's talking to whom, for how long, and how much data" view.
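The "connected but never transferred data" case can be filtered out of a long capture mechanically. A minimal sketch, assuming the column order in the sample above; stalled_conns and its default threshold of 1000 ms are illustrative choices, not part of tcplife.

```shell
# stalled_conns (hypothetical helper): list tcplife rows that moved no data
# but stayed open at least min_ms milliseconds (default 1000).
# Assumes columns: PID COMM LADDR LPORT RADDR RPORT TX_KB RX_KB MS
stalled_conns() {
  awk -v min_ms="${1:-1000}" '
    $1 ~ /^[0-9]+$/ && $7 == 0 && $8 == 0 && $9 >= min_ms {
      printf "%s pid=%s -> %s:%s after %s ms\n", $2, $1, $5, $6, $9
    }'
}

# Example: stalled_conns 5000 < /tmp/tcplife.log
```

Because the byte columns are in kilobytes, a zero can also mean "less than 1 KB", so treat the matches as candidates to investigate.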

DNS resolution latency

The problem

Applications are slow to start connections. You suspect DNS resolution is the bottleneck — a common problem when the local resolver is overloaded or the upstream DNS is slow.

# Trace DNS lookups with latency (traces getaddrinfo and gethostbyname)
gethostlatency

Output:

TIME      PID    COMM          LATms HOST
14:35:01  9282   nginx         0.42  api.internal.corp
14:35:01  9283   nginx         0.38  cdn.internal.corp
14:35:02  7301   curl          127.4 broken-dns.example.com
14:35:02  14501  postgres      0.11  replica-2.db.internal

What the output means: Three lookups resolved in under 1ms (cached or fast local resolver). That curl lookup took 127ms — either the DNS server is slow, the record doesn't exist (NXDOMAIN after timeout), or you're hitting a recursive lookup that's chasing CNAME chains. Anything over 10ms for an internal name is suspicious.

When to use it: When TCP connections are slow to establish but fast once connected. DNS latency hides in "connection time" and nobody thinks to check it.

Kernel packet drops with reasons

The problem

Packets are being dropped somewhere in the kernel networking stack. netstat -s tells you how many but not where or why. You need the exact kernel function and reason code.

# Watch kernel packet drops, with drop locations resolved to kernel symbols
dropwatch -l kas

# Or with bpftrace for more detail
bpftrace -e '
tracepoint:skb:kfree_skb {
  @drops[ksym(args.location)] = count();
}
interval:s:5 { print(@drops); clear(@drops); }'

Output:

@drops[nf_hook_slow]: 847
@drops[tcp_v4_rcv]: 12
@drops[__udp4_lib_rcv]: 3

What the output means: nf_hook_slow is netfilter (iptables/nftables) — 847 drops means your firewall rules are dropping packets. tcp_v4_rcv drops mean TCP received packets for a connection that doesn't exist (port scans, SYN floods, or stale connections). __udp4_lib_rcv drops mean UDP packets arrived for a port nobody is listening on. The function name tells you exactly which subsystem to investigate.

When to use it: When you know packets are being lost but don't know where. This is the single most useful eBPF network diagnostic — it replaces hours of tcpdump analysis with one line that tells you the exact kernel function responsible.

Before eBPF, finding where the kernel drops packets required reading kernel source code and adding printk() statements. Now it's one bpftrace command. This single recipe has probably saved more engineering hours than any other on this page.

Socket buffer overflow detection

The problem

Your application can't read data fast enough and the kernel's socket receive buffer fills up. Packets get dropped silently — no error, no log, just gone.

# Watch for socket buffer overflows (rcvbuf full)
bpftrace -e '
tracepoint:sock:sock_rcvqueue_full {
  printf("%s pid=%d rmem_alloc=%d rcvbuf=%d\n", comm, pid, args.rmem_alloc, args.sk_rcvbuf);
  @overflow[comm] = count();
}'

# Also check for UDP receive buffer errors specifically
bpftrace -e '
kprobe:__udp4_lib_rcv {
  @udp_in = count();
}
tracepoint:udp:udp_fail_queue_rcv_skb {
  @udp_drop = count();
}
interval:s:5 { printf("UDP in: %lld  dropped: %lld\n", @udp_in, @udp_drop); clear(@udp_in); clear(@udp_drop); }'

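What the output means: If a process name shows up in @overflow, that process is probably not reading from its socket fast enough. The tracepoint usually fires in softirq context, so comm can be whichever task happened to be on the CPU; trust a name that repeats across runs, not a single hit. For UDP, the kernel just drops the packet. For TCP, the kernel stops advertising window space (zero window) and the sender backs off. Fix: increase net.core.rmem_max and SO_RCVBUF, or make the application read faster.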

When to use it: When you see packet loss on a receive-heavy service (syslog, DNS, metrics collectors) that disappears when you reduce the sender rate.

TCP zero window events

The problem

TCP throughput drops to zero intermittently. A zero window means the receiver told the sender "stop sending, my buffer is full." This is TCP backpressure in action and usually means the receiving application is slow.

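# Detect TCP zero window advertisements
# (if tcp_select_window is inlined on your kernel, try kprobe:__tcp_select_window)
bpftrace -e '
kprobe:tcp_select_window {
  $sk = (struct sock *)arg0;
  // Approximate free receive space: buffer size minus backlog and queued bytes
  $win = $sk->sk_rcvbuf - $sk->sk_backlog.len - $sk->sk_rmem_alloc.counter;
  if ($win <= 0) {
    // skc_dport is stored in network byte order, so swap the two bytes
    $dport = $sk->__sk_common.skc_dport;
    printf("ZERO WINDOW: pid=%d comm=%s sport=%d dport=%d\n",
           pid, comm, $sk->__sk_common.skc_num,
           (($dport >> 8) | (($dport & 0xff) << 8)));
    @zerowin[comm] = count();
  }
}'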

What the output means: The process name in the output is the slow receiver. If it's your application, it's not calling read() or recv() fast enough — probably blocked doing computation or waiting on disk. If it's a proxy like nginx, its upstream is slow and backpressure is propagating backwards through the connection chain.

When to use it: When tcpdump shows zero-window packets and you need to know which process on which connection is the slow reader.

WireGuard handshake failures

The problem

WireGuard peers can't establish or maintain tunnels. Handshakes are timing out. WireGuard logs nothing by default: it stays silent on the wire by design, and its kernel debug messages have to be switched on through dynamic debug. The quickest way to see what is happening is to trace the handshake at the kernel level.

# Trace WireGuard handshake initiation and response
bpftrace -e '
kprobe:wg_packet_handshake_send_worker {
  printf("%s: handshake SEND pid=%d\n", strftime("%H:%M:%S", nsecs), pid);
  @handshake_sent = count();
}
kprobe:wg_packet_receive {
  @packets_in = count();
}
interval:s:10 {
  printf("--- sent=%lld  received=%lld ---\n", @handshake_sent, @packets_in);
  clear(@handshake_sent); clear(@packets_in);
}'

What the output means: If you see handshake SENDs but zero receives, your outbound UDP packets are being blocked (firewall, NAT, wrong endpoint). If you see receives but the tunnel still doesn't come up, the handshake is being rejected: wrong public key or preshared key, or a handshake timestamp that is not newer than the last one the peer accepted (WireGuard uses TAI64N timestamps for replay protection, so an initiator whose clock jumped backwards gets rejected).

When to use it: When wg show shows "latest handshake: never" or a handshake timestamp that keeps resetting. This is the quickest way to see WireGuard handshake activity without enabling the module's dynamic debug output.

Network throughput by process

The problem

Your network interface is saturated but you don't know which process is responsible. iftop shows connections, not processes. nethogs misses short-lived connections.

# Network bytes sent/received per process, refreshing every second
bpftrace -e '
kprobe:tcp_sendmsg { @send[comm] = sum(arg2); }
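// arg2 of tcp_recvmsg is the buffer size, so sum the bytes actually returned
kretprobe:tcp_recvmsg /(int32)retval > 0/ { @recv[comm] = sum((int32)retval); }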
interval:s:1 {
  printf("\n--- TCP bytes/sec ---\n");
  print(@send); print(@recv);
  clear(@send); clear(@recv);
}'

What the output means: Raw byte counts per process per second for TCP traffic. If one process dominates, you've found your bandwidth hog. This traces at the socket layer so it includes all TCP traffic regardless of interface or routing.

When to use it: When the NIC is pegged and you need to know who's responsible, fast. Unlike nethogs which samples, this traces every byte.

"What's eating my CPU"

CPU problems come in two flavors: something is using all the CPU (you need to find it), or nothing looks busy but everything is slow (you need to find what's blocking). eBPF handles both — it can profile on-CPU time and measure off-CPU/scheduler latency that no traditional tool exposes.

On-CPU flame graph

The problem

CPU is at 100% and you need to know exactly which functions are responsible — not just the process, but the full call stack down to the kernel function.

# Profile all CPUs for 30 seconds, generate flame graph
profile -af 30 > /tmp/profile.stacks
# Convert to flame graph (requires FlameGraph tools)
flamegraph.pl /tmp/profile.stacks > /tmp/cpu.svg

# Quick alternative: just see the top functions without flame graph
profile -f 10 | head -40

What the output means: The flame graph shows stack depth on the Y axis. The X axis is not time: it is the sample population, sorted alphabetically, so the width of a frame is the share of samples in which it appeared. Wide plateaus at the top are where time is actually spent. If you see do_page_fault dominating, you're thrashing memory. If you see copy_user_enhanced_fast_string, you're spending time copying data between kernel and userspace (lots of read/write syscalls). If it's all in your application code, the problem is algorithmic.

When to use it: This is the gold standard for CPU profiling. Run this before anything else when CPU is high. The flame graph will tell you in seconds what might take hours of guessing.

Top kernel functions

The problem

System CPU (%sy in top) is high. The kernel is busy but you don't know which subsystem — networking? filesystem? scheduling?

# Count kernel function calls (call counts, not CPU time), printed every 5 seconds.
# Keep the pattern narrow: every matching function gets its own kprobe.
funccount -i 5 'vfs_*'

# More targeted: functions in a specific subsystem
funccount -i 5 'tcp_*'     # TCP stack
funccount -i 5 'ext4_*'    # ext4 filesystem
funccount -i 5 'zfs_*'     # ZFS
funccount -i 5 'nf_*'      # netfilter/firewall

What the output means: High counts in tcp_* functions means the network stack is busy (lots of connections or high throughput). High counts in filesystem functions means I/O-heavy workload. High counts in nf_* means your firewall rules are doing a lot of work — consider optimizing rule order or using ipset/nftables sets.

When to use it: When %sy is high and you want a quick breakdown of which kernel subsystem is consuming the CPU before diving deeper.

Syscall frequency and latency

The problem

An application is slow and you suspect it's making too many syscalls. Each syscall has overhead: a transition into the kernel, the work itself, and a transition back. Thousands of tiny syscalls can dominate CPU time.

# Count syscalls by type for a specific process
# (pgrep -o picks the oldest match so that -p receives a single PID)
syscount -p $(pgrep -ox postgres) -i 5

Output:

SYSCALL                   COUNT
epoll_wait                48201
read                      34821
write                     29102
futex                      8923
sendto                     4201
recvfrom                   4198

# Syscall latency histogram for a specific call
bpftrace -e '
tracepoint:syscalls:sys_enter_read /comm == "postgres"/ { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /comm == "postgres" && @start[tid]/ {
  @read_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'

What the output means: The count tells you how chatty the application is. 48K epoll_wait calls in 5 seconds is normal for an event-driven server (it's the event loop). But 34K read calls might mean it's reading tiny amounts instead of batching. The latency histogram tells you if the reads are fast (cached) or slow (disk).

When to use it: When an application uses more CPU than expected. The syscall breakdown instantly reveals whether it's doing useful work or burning CPU on overhead.

Scheduler run queue latency

The problem

Your application is slow but its CPU usage isn't that high. The problem might be that it's waiting to get scheduled — it's runnable but stuck in the scheduler queue behind other tasks.

# Histogram of scheduler run queue latency (how long tasks wait to get a CPU)
runqlat

Output:

     usecs               : count     distribution
         0 -> 1          : 1423     |**********                              |
         2 -> 3          : 5621     |****************************************|
         4 -> 7          : 3842     |***************************             |
         8 -> 15         : 1204     |********                                |
        16 -> 31         : 423      |***                                     |
        32 -> 63         : 87       |                                        |
        64 -> 127        : 12       |                                        |
       128 -> 255        : 3        |                                        |

What the output means: Most tasks are scheduled within 2-7 microseconds (healthy). If you see a significant tail at 128us+ or milliseconds, your CPUs are oversubscribed — more runnable tasks than CPU cores. The fix: reduce concurrency, pin critical tasks to dedicated cores, or add CPUs. On VMs, check if the hypervisor is overcommitting CPU.

When to use it: When applications are slow but no single process shows high CPU. Run queue latency is invisible to top and htop — it's the time between "ready to run" and "actually running."

Involuntary context switches

The problem

A latency-sensitive application (database, real-time audio) is experiencing jitter. Involuntary context switches mean the kernel is preempting your process to run something else — causing unpredictable pauses.

# Count involuntary context switches per process
bpftrace -e '
tracepoint:sched:sched_switch /args.prev_state == 0/ {
  @ivcsw[args.prev_comm] = count();
}
interval:s:5 { print(@ivcsw); clear(@ivcsw); }'

What the output means: prev_state == 0 means the task was still RUNNING (runnable) when it was switched out — that's involuntary. High counts for your critical process mean it's getting preempted frequently. Check what's preempting it: if it's kernel workers or interrupt handlers, consider CPU pinning with taskset or isolcpus.

When to use it: When you need deterministic latency. Real-time workloads, database query latency, audio processing — anywhere jitter matters more than throughput.

IRQ time per CPU

The problem

One CPU core is pegged at 100% while others are idle. You suspect interrupt processing (IRQ) is pinned to a single core — common with NICs and storage controllers that haven't been configured for multi-queue.

# Total time spent in each hard IRQ handler
hardirqs

# Detailed: IRQ time distribution per vector
bpftrace -e '
tracepoint:irq:irq_handler_entry { @start[cpu] = nsecs; @name[cpu] = str(args.name); }
tracepoint:irq:irq_handler_exit /@start[cpu]/ {
  @irq_us[@name[cpu], cpu] = hist((nsecs - @start[cpu]) / 1000);
  delete(@start[cpu]); delete(@name[cpu]);
}'

What the output means: If one CPU handles all interrupts for a NIC (e.g., eth0), all network processing goes through that core. The fix: enable RSS (Receive Side Scaling) with ethtool -L eth0 combined 8 and use irqbalance or manual IRQ affinity via /proc/irq/*/smp_affinity.

When to use it: When mpstat shows one CPU at 100% %irq or %soft while others are idle. This is extremely common on high-throughput network servers.

CPU cache miss rate

The problem

An application uses lots of CPU but perf counters show low IPC (instructions per cycle). You suspect poor cache utilization — the CPU is stalled waiting for data from main memory.

# LLC (Last Level Cache) miss rate using hardware perf counters
bpftrace -e '
hardware:cache-misses:1000000 { @misses[comm] = count(); }
hardware:cache-references:1000000 { @refs[comm] = count(); }
interval:s:5 { print(@misses); print(@refs); clear(@misses); clear(@refs); }'

# Alternative with perf and BPF
llcstat 10

What the output means: A miss rate above 5% is concerning. Above 20% means the working set doesn't fit in cache — the CPU spends more time waiting for memory than executing instructions. Common causes: large hash tables with random access patterns, pointer-chasing data structures (linked lists, trees), or NUMA-remote memory access. The fix is algorithmic: better data locality, smaller working sets, or NUMA-aware allocation.

When to use it: When CPU utilization is high but throughput is low. This separates "doing work" from "waiting for memory" — a distinction that top can't make.
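The miss rate is just misses divided by references for the same process over the same interval. A minimal sketch of the arithmetic; llc_miss_pct is a hypothetical helper, and the inputs are the per-process counts printed by the bpftrace script (both sampled with the same period, so the ratio is meaningful).

```shell
# llc_miss_pct (hypothetical helper): miss rate in percent from the sampled
# miss and reference counts for one process (same sample period for both).
llc_miss_pct() {
  awk -v m="$1" -v r="$2" 'BEGIN {
    if (r <= 0) { print "n/a"; exit }
    printf "%.1f%%\n", 100 * m / r
  }'
}

llc_miss_pct 120 800    # 120 miss samples out of 800 reference samples: 15.0%
```

With only a handful of samples the ratio is noisy, so lengthen the interval before drawing conclusions from a small process.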

CPU wakeup reasons

The problem

Your idle system keeps waking up. You want power efficiency (laptops, edge servers) and need to find what's interrupting the CPU idle state.

# Track what's waking up CPUs
bpftrace -e '
tracepoint:sched:sched_wakeup {
  @wakeup[args.comm, kstack] = count();
}
interval:s:10 { print(@wakeup, 10); clear(@wakeup); }'

What the output means: The stack trace tells you exactly what triggered the wakeup. Timer-based wakeups (hrtimer_wakeup) mean something set a timer — check for unnecessary polling. Network wakeups (tcp_data_ready) mean incoming packets. Disk wakeups (blk_mq_complete_request) mean I/O completion. Reduce wakeups by increasing polling intervals, batching work, and using event-driven rather than poll-driven I/O.

When to use it: When optimizing power consumption or when an idle system has unexpectedly high CPU usage due to constant wakeups.

"Memory is disappearing"

Memory problems are the sneakiest. The OOM killer fires hours after the real leak started. free -h tells you memory is used but not who's using it or how fast it's growing. eBPF lets you watch allocations in real-time and catch the leak as it happens, not after the corpse is cold.

Page fault rate by process

The problem

An application is consuming more memory over time. You want to see which processes are faulting in new pages — the actual moment memory consumption increases.

# Count major and minor page faults per process
bpftrace -e '
software:page-faults:1 { @faults[comm] = count(); }
interval:s:5 { print(@faults, 10); clear(@faults); }'

# Separate major (disk) from minor (memory) faults
bpftrace -e '
software:major-faults:1 { @major[comm] = count(); }
// Sampled 1-in-100 to keep overhead low: multiply @minor by 100 before comparing with @major
software:minor-faults:100 { @minor[comm] = count(); }
interval:s:5 {
  printf("--- Major faults (disk reads) ---\n"); print(@major, 5);
  printf("--- Minor faults (memory maps) ---\n"); print(@minor, 5);
  clear(@major); clear(@minor);
}'

What the output means: Minor faults are normal: they happen when a process touches a page for the first time and the kernel maps it on demand. A process with a steadily increasing minor fault rate is growing its memory footprint. Major faults mean the kernel had to read from disk, either from swap or from a memory-mapped file. A burst of major faults at startup is just file-backed pages loading; sustained major faults under memory pressure mean you're swapping and performance is about to crater.

When to use it: When you want to catch a memory leak as it happens rather than after the OOM killer fires.
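To cross-check the sampled bpftrace counts, the kernel also keeps cumulative per-process totals in /proc/PID/stat (minflt is field 10 and majflt is field 12, per proc(5)). The following is a minimal sketch rather than part of the recipe; the function name parse_faults and the TARGET_PID variable in the usage comment are hypothetical.

```shell
#!/bin/sh
# parse_faults: print "minflt majflt" from one line of /proc/PID/stat.
# The comm field (2nd) may contain spaces or parentheses, so strip
# everything through the last ") " before splitting on whitespace.
parse_faults() {
  rest=${1##*') '}
  # After the strip, $1 is the state field (stat field 3), so
  # minflt (field 10) is $8 and majflt (field 12) is ${10}.
  set -- $rest
  echo "$8 ${10}"
}

# Usage (TARGET_PID is a placeholder):
#   parse_faults "$(cat /proc/$TARGET_PID/stat)"
```

Sampling this twice a few seconds apart and subtracting gives a fault rate for one process without attaching any probes.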

Slab top consumers

The problem

Kernel memory (slab) is growing. /proc/meminfo shows Slab is large but doesn't tell you which slab cache is the culprit. Common offenders: dentry cache (too many files), inode cache, connection tracking.

# Watch slab allocation rates in real-time
slabratetop

# Track allocations for a specific slab cache
bpftrace -e '
tracepoint:kmem:kmem_cache_alloc /str(args.name) == "dentry"/ {
  @dentry_allocs = count();
  @dentry_stacks[kstack] = count();
}
interval:s:10 {
  printf("dentry allocs: %lld\n", @dentry_allocs);
  print(@dentry_stacks, 3);
  clear(@dentry_allocs); clear(@dentry_stacks);
}'

What the output means: If dentry allocations are high, something is creating or accessing lots of files/directories (find commands, recursive directory walks, container image layer resolution). If nf_conntrack is high, you have lots of network connections and conntrack table pressure. The kernel stack trace in @dentry_stacks tells you exactly which code path is allocating.

When to use it: When Slab: in /proc/meminfo is unexpectedly large. This is kernel memory that doesn't show up in any process's RSS.

Memory allocation rate

The problem

You want to measure how fast a process is allocating memory — not how much it has (RSS), but the rate of new allocations. A process can have stable RSS but churn through allocations if it's allocating and freeing constantly.

# Track memory allocation rates per process (via brk and mmap)
bpftrace -e '
tracepoint:syscalls:sys_enter_mmap { @mmap_bytes[comm] = sum(args.len); }
tracepoint:syscalls:sys_enter_brk { @brk[comm] = count(); }
interval:s:5 {
  printf("--- mmap bytes ---\n"); print(@mmap_bytes, 10);
  printf("--- brk calls ---\n"); print(@brk, 10);
  clear(@mmap_bytes); clear(@brk);
}'

What the output means: mmap bytes show large allocations (typically from malloc for sizes > 128KB). brk calls show small heap expansions. A process with high mmap bytes but stable RSS is allocating and freeing large buffers — this is expensive due to page table operations and TLB flushes. Consider using a memory pool.

When to use it: When you suspect allocation churn (allocate, use, free, repeat) is causing performance problems through page table overhead and TLB pressure.

OOM score tracking

The problem

The OOM killer keeps killing your application. You want to see which processes are closest to being killed before it happens, and watch the scores change over time.

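# Watch OOM killer events with victim details
# (pid/comm are the task that triggered the OOM; the victim is in struct oom_control.
#  Needs kernel BTF or headers so bpftrace can resolve the struct.)
bpftrace -e '
kprobe:oom_kill_process {
  $oc = (struct oom_control *)arg0;
  printf("OOM KILL: victim pid=%d comm=%s (triggered by pid=%d comm=%s)\n",
         $oc->chosen->pid, $oc->chosen->comm, pid, comm);
  @oom_victims[$oc->chosen->comm] = count();
}'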

# Proactive: monitor which processes are closest to OOM
#!/bin/bash
while true; do
  echo "--- $(date) ---"
  for p in /proc/[0-9]*/oom_score; do
    score=$(cat "$p" 2>/dev/null)
    if [ "$score" -gt 100 ] 2>/dev/null; then
      pid=$(echo "$p" | cut -d/ -f3)
      comm=$(cat "/proc/$pid/comm" 2>/dev/null)
      echo "$score $pid $comm"
    fi
  done | sort -rn | head -10
  sleep 5
done

What the output means: OOM score ranges from 0 (never kill) to 1000 (kill first). Processes using the most memory relative to total memory get the highest scores. If your critical application has a high score, protect it with echo -1000 > /proc/PID/oom_score_adj. But be careful — protecting everything means the OOM killer has no good targets and may kill something worse.

When to use it: After any OOM kill event to understand why that process was chosen, or proactively to set up OOM score adjustments before production traffic.

Swap activity

The problem

The system is swapping and you need to know which processes are being swapped out and how much swap I/O is happening.

# Watch swap-in (page read from swap) events per process
bpftrace -e '
kprobe:swap_readpage {
  @swapin[comm] = count();
}
kprobe:swap_writepage {
  @swapout[comm] = count();
}
interval:s:5 {
  printf("--- swap in ---\n"); print(@swapin, 10);
  printf("--- swap out ---\n"); print(@swapout, 10);
  clear(@swapin); clear(@swapout);
}'

What the output means: Swap-out means the kernel is evicting pages from memory to make room; it is counted against whichever task ran the reclaim (often kswapd), not the task that owns the pages. Swap-in means a process accessed a page that was previously evicted, which is the painful part (disk read in the critical path). A process with a persistently high swap-in count while swap-out activity continues is thrashing: its pages are being evicted and immediately faulted back in. The fix is more RAM or less memory usage.

When to use it: When vmstat shows si and so activity and you need to know which processes are affected.

ZFS ARC size tracking

The problem

ZFS ARC is consuming memory and you're not sure if it's giving it back when the system needs it. The ARC is supposed to shrink under memory pressure, but sometimes it doesn't shrink fast enough.

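# Watch ARC size and target in real-time
# (symbol names vary by OpenZFS version; if kaddr() cannot resolve one,
#  read size, c_max, and c_min from /proc/spl/kstat/zfs/arcstats instead)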
bpftrace -e '
interval:s:1 {
  $arc_size = *kaddr("zfs_arc_size");
  $arc_max = *kaddr("zfs_arc_max");
  $arc_min = *kaddr("zfs_arc_min");
  printf("ARC: size=%lld MB  max=%lld MB  min=%lld MB\n",
         $arc_size / 1048576, $arc_max / 1048576, $arc_min / 1048576);
}'

# Track ARC shrink events (memory pressure)
bpftrace -e '
kprobe:arc_shrink { @shrink_calls = count(); @shrink_stack[kstack] = count(); }
interval:s:10 { printf("ARC shrink calls: %lld\n", @shrink_calls); clear(@shrink_calls); }'

What the output means: If ARC size equals max and shrink calls are zero, the system has plenty of memory and ARC is using its full allocation. If you see frequent shrink calls but ARC size stays high, the ARC is fighting with application memory pressure. Set zfs_arc_max lower to leave headroom. On a dedicated file server, let it have 80% of RAM. On a database server, limit it to 25-30%.

When to use it: When tuning ZFS memory usage, or when applications compete with ARC for memory.
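The same numbers are available without probes from the arcstats kstat, which is useful as a fallback or sanity check. This is a minimal sketch, assuming the OpenZFS-on-Linux kstat layout of "name type data" rows; the helper name arcstat_mb is hypothetical.

```shell
#!/bin/sh
# arcstat_mb: read kstat-formatted text on stdin and print the named
# counter converted from bytes to whole MiB. Rows look like
# "name  type  data", so the value is the third column.
arcstat_mb() {
  awk -v key="$1" '$1 == key { printf "%d\n", $3 / 1048576 }'
}

# Usage on a live system:
#   for k in size c c_min c_max; do
#     printf "%s=%s MB\n" "$k" "$(arcstat_mb "$k" < /proc/spl/kstat/zfs/arcstats)"
#   done
```

Here size is the current ARC size, c is the target the ARC is steering toward, and c_min and c_max are the effective limits.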

Hugepage allocation failures

The problem

Your application requests transparent hugepages (THP) but the kernel can't satisfy the request because memory is fragmented. It falls back to regular 4K pages, causing TLB pressure and performance loss.

# Track hugepage allocation attempts and failures
bpftrace -e '
tracepoint:huge_memory:mm_khugepaged_scan_pmd {
  @scan = count();
}
kprobe:__alloc_pages /arg1 >= 9/ {
  @huge_attempt = count();
  @inflight[tid] = 1;
}
kretprobe:__alloc_pages /@inflight[tid]/ {
  if (retval == 0) { @huge_fail = count(); }
  delete(@inflight[tid]);
}
interval:s:10 {
  printf("huge attempts: %lld  failures: %lld\n", @huge_attempt, @huge_fail);
  clear(@huge_attempt); clear(@huge_fail); clear(@scan);
}'

What the output means: If failures are a significant fraction of attempts, the system is too fragmented for hugepages. Options: preallocate hugepages at boot with hugepages=1024 kernel parameter, use madvise instead of always for THP (echo madvise > /sys/kernel/mm/transparent_hugepage/enabled), or just accept 4K pages and stop the kernel from wasting time trying to compact memory.

When to use it: When a hugepage-aware application (JVM, databases) performs worse than expected and you suspect THP compaction stalls.

kswapd activity

The problem

kswapd is using CPU. This kernel thread reclaims memory pages when free memory drops below watermarks. High kswapd activity means the system is constantly under memory pressure.

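# Track kswapd wake-ups and how long it runs
# (kswapd() is a long-lived thread function that never returns, so time the
#  wake/sleep tracepoints instead of probing the function itself)
bpftrace -e '
tracepoint:vmscan:mm_vmscan_kswapd_wake { @start[tid] = nsecs; @wakeups = count(); }
tracepoint:vmscan:mm_vmscan_kswapd_sleep /@start[tid]/ {
  @runtime_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}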
interval:s:10 {
  printf("kswapd wakeups: %lld\n", @wakeups);
  print(@runtime_us);
  clear(@wakeups); clear(@runtime_us);
}'

What the output means: Occasional kswapd runs (a few per minute) are normal. Constant kswapd activity (multiple per second) means the system is in perpetual memory reclaim. If kswapd runtime is in milliseconds, it's scanning lots of pages to find reclaimable ones — this steals CPU from your applications. The fix: add RAM, reduce memory usage, or tune vm.min_free_kbytes to keep a larger free memory buffer.

When to use it: When system CPU is unexpectedly high and kswapd shows up in top. This quantifies exactly how much time the kernel spends on memory reclaim.

"A process is misbehaving"

When you know which process is the problem but not what it's doing, eBPF lets you observe every interaction that process has with the kernel. Every file it opens, every signal it receives, every thread it creates, every lock it contends on. Like attaching a flight recorder to a single process.

All syscalls with timing

The problem

A process is slow and you don't know why. You need a complete timeline of everything it asks the kernel to do, with timing for each operation.

# Trace all syscalls for a specific PID with latency
bpftrace -e '
tracepoint:raw_syscalls:sys_enter /pid == $1/ { @start[tid] = nsecs; @id[tid] = args.id; }
tracepoint:raw_syscalls:sys_exit /pid == $1 && @start[tid]/ {
  $lat = (nsecs - @start[tid]) / 1000;
  if ($lat > 100) {
    printf("%-6d %-4d %8d us  syscall=%d\n", pid, tid, $lat, @id[tid]);
  }
  delete(@start[tid]); delete(@id[tid]);
}' $TARGET_PID

What the output means: Every syscall taking over 100 microseconds gets logged with its duration. Look for patterns: if futex calls dominate, you have lock contention. If read/write calls are slow, you have I/O issues. If nanosleep shows up, the application is sleeping on purpose (check why). The syscall number maps to your arch's syscall table — ausyscall --dump on RHEL/CentOS gives you the translation.

When to use it: First step in process-level debugging. This gives you the complete picture before you narrow down to specific syscall types.

File opens with path

The problem

You need to know every file a process opens — config files, libraries, temp files, device files. Useful for debugging "file not found" errors, understanding application behavior, or auditing file access.

# Trace all file opens for a specific process
opensnoop -p $(pgrep -x myapp)

# With bpftrace for more detail (open flags; the return value is only visible at sys_exit_openat)
bpftrace -e '
tracepoint:syscalls:sys_enter_openat /pid == $1/ {
  printf("%-6d %-16s flags=0x%x %s\n", pid, comm, args.flags, str(args.filename));
}' $TARGET_PID

What the output means: You see every file the process touches. Failed opens (opensnoop prints FD -1 with the errno in the ERR column) often indicate missing config files or wrong paths. Watch for opens to /dev/urandom (crypto/TLS initialization), /etc/resolv.conf (DNS configuration), and /proc/self/ (process introspection). A process that opens thousands of files at startup might be loading a huge classpath or scanning a plugin directory.

When to use it: When an application fails silently (it tried to open a file, got ENOENT, and continued with wrong defaults) or when you want to understand all files an application depends on.

Signal delivery

The problem

A process keeps dying or restarting and you suspect it's receiving signals. kill -l lists signal names but you need to see who is sending what signal to your process in real-time.

# Trace every signal as it is delivered (system-wide; add pid == <PID> to the filter for one process)
bpftrace -e '
tracepoint:signal:signal_deliver /args.sig != 0/ {
  printf("%s: sig=%d (%s) -> pid=%d (%s)\n",
         strftime("%H:%M:%S", nsecs), args.sig,
         args.sig == 9 ? "SIGKILL" : args.sig == 15 ? "SIGTERM" :
         args.sig == 11 ? "SIGSEGV" : args.sig == 6 ? "SIGABRT" : "other",
         pid, comm);
}'

# Find WHO sent the signal (the sender)
bpftrace -e '
tracepoint:signal:signal_generate {
  printf("%-6d %-16s sent sig=%d to pid=%d\n", pid, comm, args.sig, args.pid);
}'

What the output means: SIGKILL (9) whose sender is an unrelated process, or a kernel thread, usually means the OOM killer struck: the kill is issued in the context of whichever task triggered the failing allocation, so confirm with the OOM killer trace or dmesg. SIGTERM (15) from systemd means a service stop was requested. SIGSEGV (11) means a segfault: the process accessed invalid memory. SIGABRT (6) usually means assert() failed or abort() was called. The sender PID and command tell you exactly who requested the kill.

When to use it: When a process dies mysteriously. Between this and the OOM killer trace, you can always find out what killed your process and who asked for it.

Thread creation rate

The problem

A process is creating threads faster than it should. Thread-per-request architectures can exhaust system resources if requests spike. You need to see the creation rate and who's creating them.

# Track thread creation with parent process info
bpftrace -e '
tracepoint:sched:sched_process_fork {
  printf("%-6d %-16s -> child %-6d\n", args.parent_pid, args.parent_comm, args.child_pid);
  @forks[args.parent_comm] = count();
}
interval:s:5 { print(@forks, 10); clear(@forks); }'

What the output means: A Java application creating hundreds of threads per second is probably using a thread-per-connection model under load; consider switching to NIO or virtual threads. A Go application creating threads rapidly is unusual (the Go runtime multiplexes goroutines onto a small pool of OS threads) and usually means many goroutines are blocked in syscalls or cgo calls at once. Shell scripts that fork rapidly (for every pipeline element) can create thousands of processes per second.

When to use it: When PID numbers are increasing rapidly or /proc/sys/kernel/threads-max is being approached. Also useful for finding fork bombs.

Library loading

The problem

An application is slow to start and you suspect it's loading lots of shared libraries. Or you want to verify that a specific library version is being loaded (dependency hell debugging).

# Trace shared library loads (dlopen and the dynamic linker)
# (bpftrace string comparison is exact, not a glob, so filter for .so in userspace)
bpftrace -e '
uprobe:/lib64/ld-linux-x86-64.so.2:_dl_open {
  printf("%-6d %-16s loading library\n", pid, comm);
}
tracepoint:syscalls:sys_enter_openat /comm == "myapp"/ {
  printf("%-6d open: %s\n", pid, str(args.filename));
}' | grep -E 'loading library|\.so'

# Simpler: just trace all .so file opens
opensnoop -p $(pgrep -x myapp) 2>&1 | grep '\.so'

What the output means: You see every shared library the process loads, in order. If it's loading hundreds of libraries, startup time suffers because each library requires mmap, relocation, and constructor execution. Libraries loaded from NFS or slow storage will be especially painful. Consider static linking for critical libraries.

When to use it: When application startup is slow, when debugging "wrong version of libssl loaded" problems, or when auditing what libraries a process actually uses.

Process start with full arguments

The problem

You need to see every process that starts, with its complete command-line arguments. ps only shows a snapshot. You need the complete timeline of process execution.

# Every process start with full arguments and parent
execsnoop -T

# With UID for multi-user systems
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
  printf("%-6d %-6d %-16s ", pid, uid, comm);
  join(args.argv);
}'

What the output means: Every binary execution on the system. Watch for processes started with suspicious arguments (encoded strings, download URLs, -e flags for code injection), processes running from unusual paths (/tmp, /dev/shm, hidden directories), and processes spawned by unexpected parents (web server spawning bash).

When to use it: Always. Seriously. On production systems, pipe this to a log file. It's the single most valuable audit trail on a Linux system.

Process exit codes

The problem

A process keeps failing but you don't know its exit code. systemd just says "failed" and the application doesn't log. The exit code tells you the failure mode.

# Trace process exits with exit code
bpftrace -e '
tracepoint:sched:sched_process_exit {
  $code = args.pid == 0 ? 0 : (uint32)(curtask->exit_code >> 8);
  $signal = (uint32)(curtask->exit_code & 0x7f);
  if ($code != 0 || $signal != 0) {
    printf("%-6d %-16s exit_code=%d signal=%d\n", pid, comm, $code, $signal);
  }
}'

What the output means: Exit code 0 is success (filtered out above). Exit code 1 is generic failure. Exit code 2 is usually "misuse of command" (wrong arguments). Exit code 126 is "permission denied" (can't execute). Exit code 127 is "command not found." If the signal field is non-zero, the process was killed by that signal and exit_code prints as 0: signal=9 is SIGKILL (often the OOM killer) and signal=11 is a segfault. Shells and container runtimes report those same deaths as 137 (128 + 9) and 139 (128 + 11).

When to use it: When processes fail silently. Combine with execsnoop to get the full picture: what launched, with what args, and how it exited.
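The shifts and masks in the recipe follow the wait-status layout the kernel uses for exit_code: bits 8-15 hold the exit status, bits 0-6 the terminating signal, and bit 7 the core-dump flag. A minimal sketch of the same decoding in shell, with hypothetical helper names decode_exit_code and shell_status:

```shell
#!/bin/sh
# decode_exit_code: split a raw kernel exit_code (wait-status layout)
# into the exit status (bits 8-15), the terminating signal (bits 0-6),
# and the core-dump flag (bit 7).
decode_exit_code() {
  raw=$(( $1 ))
  echo "exit=$(( (raw >> 8) & 0xff )) signal=$(( raw & 0x7f )) core=$(( (raw >> 7) & 1 ))"
}

# shell_status: the value a shell reports in $? for the same raw code:
# the exit status, or 128 + signal number when killed by a signal.
shell_status() {
  sig=$(( $1 & 0x7f ))
  if [ "$sig" -ne 0 ]; then
    echo $(( 128 + sig ))
  else
    echo $(( ($1 >> 8) & 0xff ))
  fi
}
```

For example, a raw value of 9 decodes to signal 9 with exit status 0, which a shell would report as 137.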

Lock contention (futex)

The problem

A multi-threaded application is slow but each individual thread shows low CPU. The threads are spending time waiting on locks — pthread_mutex, std::mutex, Go mutexes — all eventually call futex in the kernel.

# Trace futex wait latency (lock contention) for a specific process
# (mask off FUTEX_PRIVATE_FLAG and FUTEX_CLOCK_REALTIME; command 0 = FUTEX_WAIT)
bpftrace -e '
tracepoint:syscalls:sys_enter_futex /pid == $1 && (args.op & 0x7f) == 0/ {
  @start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_futex /pid == $1 && @start[tid]/ {
  @futex_wait_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}' $TARGET_PID

# Find the most contended lock addresses
# (sys_exit_futex only exposes the return value, so remember uaddr at entry)
bpftrace -e '
tracepoint:syscalls:sys_enter_futex /pid == $1 && (args.op & 0x7f) == 0/ {
  @contended_addr[args.uaddr] = count();
  @start[tid] = nsecs;
  @addr[tid] = args.uaddr;
}
tracepoint:syscalls:sys_exit_futex /pid == $1 && @start[tid]/ {
  @wait_time[@addr[tid]] = sum((nsecs - @start[tid]) / 1000);
  delete(@start[tid]); delete(@addr[tid]);
}' $TARGET_PID

What the output means: The futex wait histogram shows how long threads spend waiting for locks. If the histogram has a tail into milliseconds, you have serious contention. The address-based tracking shows which specific lock is the bottleneck — map the address back to your code using /proc/PID/maps to find whether it's a heap lock, a global lock, or a library lock. High contention on a single address means a single "hot lock" that all threads fight over.

When to use it: When a multi-threaded application doesn't scale — adding more threads doesn't improve throughput. Lock contention is the usual reason.
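Mapping a contended address back to a memory region means finding the /proc/PID/maps line whose range contains it. A minimal sketch of that lookup, assuming the standard maps format of "start-end perms offset dev inode path" with hex addresses; the function name find_mapping is hypothetical.

```shell
#!/bin/sh
# find_mapping: read /proc/PID/maps-formatted text on stdin and print the
# line whose [start, end) range contains the given address.
# The address may be decimal or 0x-prefixed hex.
find_mapping() {
  addr=$(( $1 ))
  while read -r range rest; do
    start=$(( 0x${range%-*} ))
    end=$(( 0x${range#*-} ))
    if [ "$addr" -ge "$start" ] && [ "$addr" -lt "$end" ]; then
      echo "$range $rest"
      return 0
    fi
  done
  return 1
}

# Usage (TARGET_PID and the address are placeholders):
#   find_mapping 0x55d0c1c00010 < /proc/$TARGET_PID/maps
```

A hit in [heap] or an anonymous mapping points at a lock in dynamically allocated memory; a hit in a named library or binary points at a global or static lock in that object.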

Lock contention is one of the hardest performance problems to diagnose without eBPF. Traditional profilers show CPU time, but lock contention is off-CPU time: the thread is waiting, using zero CPU, but still blocking your application. eBPF is one of the few tools that can measure time spent doing nothing.

"Audit who did what"

Every production system needs an audit trail that can't be tampered with from userspace. eBPF traces run in the kernel — an attacker with root can't delete the trace output if you're streaming it to a remote syslog. These recipes give you a complete audit trail of every significant action on the system.

All command executions with user

The problem

You need a tamper-resistant log of every command executed on the system, by which user, at what time. bash history is trivially deletable. auditd is complex to configure. eBPF just works.

# Log every command execution with UID, PID, and full arguments
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
  printf("%s uid=%-5d pid=%-6d ppid=%-6d ", strftime("%Y-%m-%d %H:%M:%S", nsecs), uid, pid, curtask->parent->pid);
  join(args.argv);
}'

What the output means: A complete, timestamped audit log of every binary execution. The UID tells you which user ran it (0 = root). The PPID lets you trace the chain of execution — was this command run from an SSH session (ppid -> sshd), a cron job (ppid -> crond), or a web shell (ppid -> apache/nginx)? Pipe to tee /var/log/execlog.log and forward to remote syslog.

When to use it: On every production system, always. This is your forensic evidence after a security incident.

SSH login attempts

The problem

You want to trace SSH authentication attempts at the kernel level — not from sshd logs (which can be truncated) but from the actual PAM/syscall activity.

# Trace sshd process forks (each SSH connection gets a forked sshd)
bpftrace -e '
tracepoint:sched:sched_process_fork /args.parent_comm == "sshd"/ {
  printf("%s: SSH connection: sshd forked child pid=%d\n",
         strftime("%H:%M:%S", nsecs), args.child_pid);
  @ssh_connections = count();
}'

# Trace SSH auth by watching PAM
bpftrace -e '
uprobe:/lib64/libpam.so.0:pam_authenticate {
  printf("%s: PAM auth attempt by pid=%d (%s)\n",
         strftime("%H:%M:%S", nsecs), pid, comm);
}
uretprobe:/lib64/libpam.so.0:pam_authenticate {
  printf("  result: %s\n", retval == 0 ? "SUCCESS" : "FAILURE");
  @auth[retval == 0 ? "success" : "failure"] = count();
}'

What the output means: The fork trace shows every SSH connection attempt (even before authentication). The PAM trace shows success/failure of each authentication. If you see hundreds of failed PAM auths from sshd, you're being brute-forced. The kernel-level trace persists even if the attacker clears /var/log/secure.

When to use it: On any internet-facing SSH server. Combine with the command execution trace to get a full picture: who connected, whether they authenticated, and what they ran.

/etc/ file modifications

The problem

Configuration files in /etc/ are being modified and you need to know who, when, and what file. This catches unauthorized configuration changes, backdoor insertion, and configuration drift.

# Watch all writes to /etc/
bpftrace -e '
tracepoint:syscalls:sys_enter_openat
/str(args.filename, 5) == "/etc/" && (args.flags & 1 || args.flags & 2)/ {
  printf("%s uid=%-5d pid=%-6d %-16s WRITE %s\n",
         strftime("%H:%M:%S", nsecs), uid, pid, comm, str(args.filename));
}'

# More specific: watch critical auth files
bpftrace -e '
tracepoint:syscalls:sys_enter_openat
/(str(args.filename) == "/etc/passwd" ||
  str(args.filename) == "/etc/shadow" ||
  str(args.filename) == "/etc/sudoers" ||
  str(args.filename) == "/etc/ssh/sshd_config" ||
  str(args.filename) == "/etc/crontab") &&
 (args.flags & 1 || args.flags & 2)/ {
  printf("ALERT %s uid=%d pid=%d %s -> %s\n",
         strftime("%H:%M:%S", nsecs), uid, pid, comm, str(args.filename));
}'

What the output means: Any write to /etc/ is logged with the user, process, and filename. Watch for: unexpected writes to /etc/passwd or /etc/shadow (account creation), writes to /etc/crontab or /etc/cron.d/ (persistence mechanisms), writes to /etc/ssh/sshd_config (weakening SSH security), and writes to /etc/ld.so.preload (library injection).

When to use it: Always on production systems. Configuration file modification is the most common persistence technique after initial compromise.
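The flags test in these recipes relies on the access mode living in the low two bits of the openat() flags. A minimal sketch of that decoding, assuming the Linux values O_RDONLY = 0, O_WRONLY = 1, O_RDWR = 2; the function name open_mode is hypothetical.

```shell
#!/bin/sh
# open_mode: classify an openat() flags value by its access mode.
# On Linux the low two bits (O_ACCMODE = 3) hold the mode:
# 0 = O_RDONLY, 1 = O_WRONLY, 2 = O_RDWR.
open_mode() {
  case $(( $1 & 3 )) in
    0) echo read-only ;;
    1) echo write-only ;;
    2) echo read-write ;;
    *) echo invalid ;;
  esac
}
```

Feeding it the flags=0x... value printed by the openat trace shows whether a given open could modify the file. Note that tools which replace a file by writing a temporary copy and renaming it never open the final path for writing, so an exact-path filter can miss them while the /etc/ prefix filter still catches the temporary file.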

sudo usage tracking

The problem

You want to track every sudo invocation — who used it, what command they ran with elevated privileges, and whether it succeeded.

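# Track every sudo execution with the command being elevated
# (at execve entry comm is still the caller, e.g. bash, so match on the binary path;
#  adjust /usr/bin/sudo if sudo lives elsewhere)
bpftrace -e '
tracepoint:syscalls:sys_enter_execve /str(args.filename) == "/usr/bin/sudo"/ {
  printf("%s uid=%-5d pid=%-6d SUDO: ", strftime("%H:%M:%S", nsecs), uid, pid);
  join(args.argv);
}'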

# Track sudo's setuid transitions
bpftrace -e '
tracepoint:syscalls:sys_enter_setuid /comm == "sudo"/ {
  printf("%s: sudo pid=%d changing to uid=%d\n",
         strftime("%H:%M:%S", nsecs), pid, args.uid);
}'

What the output means: Every sudo command with the original UID (who ran it) and the target UID (usually 0/root). If someone runs sudo su - you'll see the sudo exec followed by the su exec followed by a shell. If someone runs sudo -u postgres psql you'll see the UID change to the postgres user. This is more reliable than /var/log/sudo.log because it can't be edited after the fact.

When to use it: On any multi-user system where privilege escalation needs to be audited. Compliance frameworks (SOC2, HIPAA, PCI) typically require this.

Kernel module loads

The problem

Unauthorized kernel modules being loaded is one of the most dangerous attack vectors — a malicious kernel module has full system access. You need to know every module load event.

# Trace kernel module loading
bpftrace -e '
kprobe:do_init_module {
  printf("MODULE LOAD: pid=%d uid=%d comm=%s module=%s\n",
         pid, uid, comm, str(((struct module *)arg0)->name));
}'

# Also trace module unloading
bpftrace -e '
tracepoint:module:module_load {
  printf("LOAD   %s: %s\n", strftime("%H:%M:%S", nsecs), str(args.name));
}
tracepoint:module:module_free {
  printf("UNLOAD %s: %s\n", strftime("%H:%M:%S", nsecs), str(args.name));
}'

What the output means: Every kernel module load and unload. Expected modules: filesystem drivers (zfs, ext4), network drivers (e1000, ixgbe), USB drivers at boot time. Unexpected: any module loaded after boot is suspicious. Rootkits install as kernel modules. If you see a module load event at 3 AM from a process that isn't modprobe or systemd-modules-load, investigate immediately.

When to use it: On hardened systems where kernel module loading should be rare after boot. Consider combining with kernel.modules_disabled=1 sysctl (locks out all future module loads) after the system is fully booted.

Network socket opens

The problem

You need to know every network connection a process opens — especially outbound connections from servers that shouldn't be making them (data exfiltration, C2 callbacks, unauthorized API calls).

# Trace all outbound TCP connections with destination
tcpconnect

# With filtering for unexpected destinations
bpftrace -e '
tracepoint:syscalls:sys_enter_connect /args.uservaddr->sa_family == 2/ {
  $addr = (struct sockaddr_in *)args.uservaddr;
  printf("%s pid=%-6d %-16s -> %d.%d.%d.%d:%d\n",
         strftime("%H:%M:%S", nsecs), pid, comm,
         ($addr->sin_addr.s_addr) & 0xff,
         ($addr->sin_addr.s_addr >> 8) & 0xff,
         ($addr->sin_addr.s_addr >> 16) & 0xff,
         ($addr->sin_addr.s_addr >> 24) & 0xff,
         ($addr->sin_port >> 8) | (($addr->sin_port & 0xff) << 8));
}'

What the output means: Every outbound TCP connection with process name and destination IP:port. On a web server, outbound connections should only go to your database, cache, and maybe an API. Any connection to unknown IPs — especially on ports 443 (HTTPS C2), 4444 (metasploit default), or 53 (DNS tunneling) — warrants investigation.

When to use it: On any server that shouldn't be making outbound connections. This is your network exfiltration detector.
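The shifts in the bpftrace program undo network byte order: on a little-endian host the first octet of the address sits in the lowest byte of s_addr, and the two bytes of sin_port are swapped. A minimal sketch of the same arithmetic, assuming a little-endian host such as x86-64; both function names are hypothetical.

```shell
#!/bin/sh
# ipv4_from_s_addr: format a sin_addr.s_addr value, as read on a
# little-endian host, in dotted-quad notation (lowest byte first).
ipv4_from_s_addr() {
  a=$(( $1 ))
  echo "$(( a & 0xff )).$(( (a >> 8) & 0xff )).$(( (a >> 16) & 0xff )).$(( (a >> 24) & 0xff ))"
}

# port_from_sin_port: swap the two bytes of a network-order sin_port.
port_from_sin_port() {
  p=$(( $1 ))
  echo $(( ( (p >> 8) & 0xff ) | ( (p & 0xff) << 8 ) ))
}
```

This is handy when a trace prints the raw integers instead of a formatted address, for example when post-processing logged values.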

"ZFS is weird"

ZFS has excellent built-in observability through zpool iostat, arcstat, and /proc/spl/kstat/zfs/. But when you need to go deeper — individual transaction group timing, scrub I/O patterns, DMU-level latency — eBPF is the only option. These recipes trace ZFS internals that no zpool command exposes.

ARC hit/miss ratio in real-time

The problem

arcstat gives you periodic snapshots but you want to see ARC efficiency change in real-time, especially during a workload shift.

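# Real-time ARC hit/miss ratio
# (approximation: arc_read() also returns 0 for a miss that was issued successfully;
#  the hits and misses counters in /proc/spl/kstat/zfs/arcstats are authoritative)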
bpftrace -e '
kprobe:arc_read {
  @arc_reads = count();
}
kretprobe:arc_read /retval == 0/ {
  @arc_hits = count();
}
interval:s:1 {
  $total = @arc_reads;
  $hits = @arc_hits;
  if ($total > 0) {
    printf("ARC: %lld reads, %lld hits (%lld%% hit rate)\n",
           $total, $hits, $hits * 100 / $total);
  }
  clear(@arc_reads); clear(@arc_hits);
}'

What the output means: A healthy ARC has 90%+ hit rate for most workloads. Below 80% means your working set is larger than the ARC: either increase zfs_arc_max or accept the miss rate. A hit rate that drops suddenly means a new workload is thrashing the cache (full table scan, backup job reading cold data). Consider setting primarycache=metadata on the datasets the cold job reads, or adding an L2ARC device.

When to use it: When ZFS read performance varies and you suspect ARC thrashing, or when tuning ARC size for a new workload.
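A hit rate can also be derived from two samples of the cumulative hits and misses counters in arcstats, which gives a reference figure to compare the probe-based number against. A minimal sketch, assuming the "name type data" kstat row layout; the helper names arc_counter and arc_hit_pct are hypothetical.

```shell
#!/bin/sh
# arc_counter: read kstat text on stdin, print the value of one counter.
arc_counter() {
  awk -v key="$1" '$1 == key { print $3 }'
}

# arc_hit_pct HITS0 MISSES0 HITS1 MISSES1: integer hit percentage between
# two samples of the cumulative counters; prints "n/a" when no reads
# happened in the interval.
arc_hit_pct() {
  hits=$(( $3 - $1 ))
  misses=$(( $4 - $2 ))
  total=$(( hits + misses ))
  if [ "$total" -le 0 ]; then
    echo "n/a"
  else
    echo $(( hits * 100 / total ))
  fi
}

# Usage on a live system:
#   f=/proc/spl/kstat/zfs/arcstats
#   h0=$(arc_counter hits < $f); m0=$(arc_counter misses < $f); sleep 1
#   h1=$(arc_counter hits < $f); m1=$(arc_counter misses < $f)
#   echo "ARC hit rate: $(arc_hit_pct "$h0" "$m0" "$h1" "$m1")%"
```

Working from deltas rather than the raw totals matters because the counters accumulate since module load, so the lifetime ratio hides a recent drop.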

ARC eviction rate

The problem

The ARC is at its max size and you want to see how aggressively it's evicting cached data. High eviction rates mean the cache is churning and many reads miss.

# Track ARC eviction events and bytes evicted
bpftrace -e '
kprobe:arc_evict {
  @evict_calls = count();
}
kprobe:arc_evict_hdr {
  @evict_hdrs = count();
}
interval:s:5 {
  printf("ARC evictions: %lld calls, %lld headers evicted\n",
         @evict_calls, @evict_hdrs);
  clear(@evict_calls); clear(@evict_hdrs);
}'

What the output means: Eviction calls happen when the ARC needs to free space for new data. If you see thousands of headers evicted per second, the ARC is churning — new data pushes out old data as fast as it arrives. This means the working set is much larger than the ARC. Either the ARC is too small, or a background job (scrub, backup, send/receive) is polluting the cache with cold data.

When to use it: When ARC hit rate is low and you want to quantify how bad the eviction pressure is.

TXG sync time

The problem

ZFS batches writes into transaction groups (TXGs) that sync to disk every 5 seconds (by default). If a TXG sync takes too long, write latency spikes. You need to measure individual TXG sync durations.

# Trace TXG sync latency
# (kretprobes cannot read argN, so key the entry state by thread ID)
bpftrace -e '
kprobe:spa_sync {
  @start[tid] = nsecs;
  @txg[tid] = arg1;
}
kretprobe:spa_sync /@start[tid]/ {
  $ms = (nsecs - @start[tid]) / 1000000;
  printf("TXG %lld synced in %lld ms\n", @txg[tid], $ms);
  @sync_ms = hist($ms);
  delete(@start[tid]); delete(@txg[tid]);
}'

What the output means: TXG syncs should complete in well under 5 seconds. If syncs take 5+ seconds, the next TXG has already accumulated and writes start to stall. Common causes: slow or overloaded pool vdevs, too many dirty bytes (zfs_dirty_data_max too high for your disk speed), or a pool with degraded vdevs doing resilver writes alongside application writes. If sync times are bimodal (fast and slow), check for periodic background writes (snapshots, send/receive).

When to use it: When ZFS write latency is inconsistent. TXG sync time is the heartbeat of ZFS write performance.

Scrub I/O patterns

The problem

A ZFS scrub is running and you want to know how much I/O it's generating and whether it's interfering with application I/O.

# Track scrub I/O separately from application I/O
bpftrace -e '
kprobe:dsl_scan_scrub_cb {
  @scrub_ios = count();
}
kprobe:zio_read { @total_reads = count(); }
kprobe:zio_write { @total_writes = count(); }
interval:s:5 {
  printf("scrub IOs: %lld  total reads: %lld  total writes: %lld\n",
         @scrub_ios, @total_reads, @total_writes);
  clear(@scrub_ios); clear(@total_reads); clear(@total_writes);
}'

What the output means: Compare scrub I/Os to total reads. If scrub is dominating (90%+ of reads), it's competing heavily with application I/O. Reduce scrub speed: on ZFS releases before 0.8, echo 50 > /sys/module/zfs/parameters/zfs_scrub_delay (default 4, higher = slower scrub). OpenZFS 0.8 and later removed that tunable, so lower zfs_vdev_scrub_max_active instead. On production systems, schedule scrubs during low-traffic windows and use these parameters to throttle them.

When to use it: When application performance degrades during a scrub and you need to quantify the interference.

Snapshot creation and deletion events

The problem

Snapshots are being created or deleted and you want a kernel-level audit trail — independent of what zfs list -t snapshot shows (which only shows current state, not history).

# Trace snapshot operations
bpftrace -e '
kprobe:dsl_dataset_snapshot {
  printf("%s: SNAPSHOT CREATE pid=%d uid=%d %s\n",
         strftime("%H:%M:%S", nsecs), pid, uid, comm);
  @snap_create = count();
}
kprobe:dsl_destroy_snapshot {
  printf("%s: SNAPSHOT DESTROY pid=%d uid=%d %s\n",
         strftime("%H:%M:%S", nsecs), pid, uid, comm);
  @snap_destroy = count();
}'

What the output means: Every snapshot creation and deletion with who did it. If you see unexpected snapshot deletions, someone (or an automated tool like sanoid/syncoid) is removing your backup points. If you see rapid snapshot creation, check if sanoid is running with too-frequent intervals. The UID field tells you if it was root (0) or another user.

When to use it: When you want an audit trail of ZFS snapshot operations, or when snapshots appear/disappear unexpectedly.

Pool import time breakdown

The problem

ZFS pool import is slow at boot. On pools with many datasets, thousands of snapshots, or degraded vdevs, import can take minutes. You need to know which phase is slow.

# Trace pool import phases
bpftrace -e '
kprobe:spa_load {
  @import_start = nsecs;
  printf("Pool import started\n");
}
kprobe:spa_load_verify {
  printf("  verify phase at +%lld ms\n", (nsecs - @import_start) / 1000000);
}
kprobe:vdev_open {
  @vdev_opens = count();
}
kretprobe:spa_load /@import_start/ {
  printf("Pool import complete: %lld ms, %lld vdev opens\n",
         (nsecs - @import_start) / 1000000, @vdev_opens);
  delete(@import_start);
}'

What the output means: spa_load_verify walks recent pool metadata to confirm the pool is consistent, so a slow verify phase points at metadata reads (large pools, slow or degraded vdevs) rather than ZIL replay. Outstanding intent-log blocks are claimed during import but replayed later, when each dataset is mounted. If vdev opens take long, disk enumeration is slow (multipath, enclosures, or USB devices). A large number of vdev opens indicates a pool with many vdevs. The total time tells you your boot-to-ZFS-ready window.

When to use it: When pool import at boot takes too long and you need to know whether the time goes to metadata verification, device enumeration, or the rest of the load.

DMU read/write latency

The problem

You want to measure ZFS latency at the DMU (Data Management Unit) layer — above the block device but below the POSIX interface. This separates ZFS overhead from disk overhead.

# DMU read and write latency histograms
bpftrace -e '
kprobe:dmu_read { @read_start[tid] = nsecs; }
kretprobe:dmu_read /@read_start[tid]/ {
  @dmu_read_us = hist((nsecs - @read_start[tid]) / 1000);
  delete(@read_start[tid]);
}
kprobe:dmu_write { @write_start[tid] = nsecs; }
kretprobe:dmu_write /@write_start[tid]/ {
  @dmu_write_us = hist((nsecs - @write_start[tid]) / 1000);
  delete(@write_start[tid]);
}'

What the output means: DMU read latency includes ARC lookup, decompression, and (on miss) disk read. If DMU reads are fast but POSIX reads are slow, the overhead is in the VFS/POSIX layer (file locking, permission checks). If DMU reads are slow but disk I/O is fast, ZFS is spending time on decompression or checksum verification. Compare these histograms with biolatency to isolate the layer where latency lives.

When to use it: When you need to pinpoint whether slow ZFS performance is in the ZFS layer or the disk layer.

ZIL commit latency

The problem

Synchronous write performance is poor. ZFS synchronous writes go through the ZIL (ZFS Intent Log) — if the ZIL device is slow, every fsync() and O_SYNC write stalls.

# Measure ZIL commit latency
bpftrace -e '
kprobe:zil_commit {
  @start[tid] = nsecs;
}
kretprobe:zil_commit /@start[tid]/ {
  @zil_commit_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}
interval:s:10 { print(@zil_commit_us); clear(@zil_commit_us); }'

What the output means: ZIL commits should complete in under 1ms on a good SLOG (separate log device, ideally Optane or high-endurance NVMe). If you see multi-millisecond commits, the SLOG is slow or you don't have one (ZIL falls back to the main pool). Databases (PostgreSQL, MySQL with InnoDB) call fsync() constantly — ZIL commit latency directly determines their write transaction speed.

When to use it: When database write performance is poor on ZFS. The ZIL is almost always the bottleneck for sync-heavy workloads.
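
To turn the histogram into a single number you can track, the sketch below parses bpftrace hist() output and reports the share of samples at or above a threshold (for example 1000 µs for the 1ms SLOG target). The function names are hypothetical, and the row format is assumed to match what bpftrace prints for power-of-two buckets such as [16, 32) or [1K, 2K).

```python
import re

# Matches bpftrace hist() rows such as "[16, 32)   243 |@@@@   |" or "[1K, 2K)   3 |   |"
_ROW = re.compile(r"^\[(\d+)([KMG]?)(?:, (\d+)([KMG]?))?[\)\]]\s+(\d+)\s+\|")
_SUFFIX = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}


def parse_hist(text):
    """Return a list of (lower_bound, count) tuples from bpftrace hist() output."""
    rows = []
    for line in text.splitlines():
        match = _ROW.match(line.strip())
        if match:
            lower = int(match.group(1)) * _SUFFIX[match.group(2)]
            rows.append((lower, int(match.group(5))))
    return rows


def share_at_or_above(rows, threshold):
    """Fraction of samples in buckets whose lower bound is >= threshold."""
    total = sum(count for _, count in rows)
    if total == 0:
        return 0.0
    slow = sum(count for lower, count in rows if lower >= threshold)
    return slow / total
```

Because the buckets are powers of two, a threshold of 1000 effectively starts counting at the [1K, 2K) bucket, so the result slightly undercounts samples between 1000 and 1023 µs.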

If you only add one ZFS-related piece of hardware after reading this page, make it a SLOG device. A $50 Optane M10 16GB as a dedicated ZIL can take your database from "unusable on ZFS" to "faster than ext4 on the same disk." The ZIL commit latency trace proves it — before and after, the histogram shift is dramatic.

"WireGuard is slow"

WireGuard is deliberately quiet: it logs nothing by default, exposes only cumulative per-peer counters through wg show, and its only debug output is kernel dynamic-debug messages. This is great for security but terrible for troubleshooting. eBPF lets you get inside the WireGuard kernel module without modifying it.

Per-peer byte counters

The problem

wg show gives cumulative byte counts per peer but no rate information. You need bytes-per-second per peer to find which tunnel is saturated.

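# Track WireGuard TX/RX bytes at kernel level (all peers combined)
# Symbol names vary by kernel version; list them with: bpftrace -l 'kprobe:wg_*'
# Reading skb->len needs BTF or kernel headers for struct sk_buff
bpftrace -e '
kprobe:wg_xmit { @wg_tx = sum(((struct sk_buff *)arg0)->len); @wg_tx_pkts = count(); }
kprobe:wg_packet_receive { @wg_rx = sum(((struct sk_buff *)arg1)->len); @wg_rx_pkts = count(); }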
interval:s:1 {
  printf("WG TX: %lld bytes (%lld pkts)  RX: %lld bytes (%lld pkts)\n",
         @wg_tx, @wg_tx_pkts, @wg_rx, @wg_rx_pkts);
  clear(@wg_tx); clear(@wg_tx_pkts); clear(@wg_rx); clear(@wg_rx_pkts);
}'

What the output means: Per-second throughput for all WireGuard tunnels combined, in bytes (multiply by 8 for bits). Compare TX and RX: if one direction is near zero, the tunnel might be half-broken (one-way traffic, usually a routing or firewall issue). If throughput plateaus at ~1 Gbps despite a 10G NIC, ChaCha20 encryption is probably CPU-bound on a single core; check whether the WireGuard softirq is pinned to one CPU.

When to use it: When WireGuard throughput is lower than expected or when you need per-second rate data that wg show can't provide.

Handshake timing

The problem

WireGuard tunnels take a long time to establish, or they drop and re-handshake frequently. You need to measure handshake duration and frequency.

# Measure time between handshake initiation and completion
bpftrace -e '
kprobe:wg_noise_handshake_create_initiation {
  @hs_start = nsecs;
  @hs_initiated = count();
  printf("%s: Handshake INITIATED\n", strftime("%H:%M:%S", nsecs));
}
kprobe:wg_noise_handshake_consume_response /@hs_start/ {
  $ms = (nsecs - @hs_start) / 1000000;
  printf("%s: Handshake COMPLETE in %lld ms\n", strftime("%H:%M:%S", nsecs), $ms);
  @hs_duration_ms = hist($ms);
  delete(@hs_start);
  @hs_completed = count();
}
interval:s:30 {
  printf("--- initiated: %lld  completed: %lld ---\n", @hs_initiated, @hs_completed);
}'

What the output means: A handshake should complete in a few milliseconds on a LAN and in 50-200ms over the internet (one RTT plus crypto). If handshakes take seconds, UDP packets are being dropped or the peer is unresponsive. If initiated is much higher than completed, handshakes are failing; check firewall rules on UDP port 51820 (or your configured port). An active tunnel rekeys about every 2 minutes, so that cadence is normal. Handshakes more frequent than that suggest the connection is unstable or a peer keeps restarting.

When to use it: When WireGuard tunnels are flapping or slow to establish.

Encapsulation overhead

The problem

You want to measure the actual CPU overhead of WireGuard encryption/decryption per packet. On high-throughput links, crypto overhead can be significant.

# Measure per-packet encryption and decryption time
# Data packets go through the _sg_inplace variants; the plain functions only cover handshake messages
bpftrace -e '
kprobe:chacha20poly1305_encrypt_sg_inplace { @enc_start[tid] = nsecs; }
kretprobe:chacha20poly1305_encrypt_sg_inplace /@enc_start[tid]/ {
  @encrypt_ns = hist(nsecs - @enc_start[tid]);
  delete(@enc_start[tid]);
}
kprobe:chacha20poly1305_decrypt_sg_inplace { @dec_start[tid] = nsecs; }
kretprobe:chacha20poly1305_decrypt_sg_inplace /@dec_start[tid]/ {
  @decrypt_ns = hist(nsecs - @dec_start[tid]);
  delete(@dec_start[tid]);
}'

What the output means: ChaCha20-Poly1305 typically takes 200-500 nanoseconds per packet on modern x86 with AVX2. If you see microseconds, AVX2 is not being used (check lscpu for the avx2 flag and dmesg | grep -i chacha for the implementation in use). ARM systems without NEON will be slower. The kprobe/kretprobe pair adds its own overhead to every sample, so treat the absolute values as an upper bound and compare runs against each other. The histogram shape tells you if there are outlier packets that take much longer (likely due to interrupt coalescing or CPU migration).

When to use it: When benchmarking WireGuard throughput or comparing encryption overhead across hardware platforms.
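
As a rough way to read the histogram, per-packet crypto time can be converted into a single-core throughput ceiling. The sketch below is back-of-the-envelope arithmetic only: the function name is hypothetical, the 1420-byte payload is an assumption (a common WireGuard MTU), and everything except encryption cost is ignored.

```python
def crypto_ceiling_gbps(ns_per_packet, payload_bytes=1420):
    """Upper bound on single-core throughput if crypto were the only cost."""
    packets_per_second = 1e9 / ns_per_packet
    return packets_per_second * payload_bytes * 8 / 1e9
```

At 500 ns per packet this gives about 22.7 Gbit/s per core, and at 5 µs per packet about 2.3 Gbit/s, which is why a histogram centered in the microsecond range is worth investigating on a 10G link.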

WireGuard routing lookup latency

The problem

With many WireGuard peers (mesh networks, hub-and-spoke with hundreds of sites), the allowed-IPs routing lookup might become a bottleneck.

# Measure WireGuard AllowedIPs routing table lookup time
bpftrace -e '
kprobe:wg_allowedips_lookup_dst {
  @start[tid] = nsecs;
}
kretprobe:wg_allowedips_lookup_dst /@start[tid]/ {
  @lookup_ns = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}
interval:s:10 { print(@lookup_ns); clear(@lookup_ns); }'

What the output means: WireGuard uses a trie (radix tree) for AllowedIPs lookup, so performance is O(prefix_length), not O(number_of_peers). Even with 10,000 peers, lookups should be sub-microsecond. If you see lookups of a microsecond or more, something is wrong, most likely cache or TLB misses because the routing table spans many memory pages. This is rare but possible with very large deployments.

When to use it: When scaling WireGuard to hundreds or thousands of peers and you want to verify the routing lookup isn't a bottleneck.
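
To see why lookup cost tracks prefix length rather than peer count, here is a toy binary trie doing longest-prefix match. It is an illustration of the data structure only, not WireGuard's implementation: the class name is hypothetical, it handles IPv4 only, and it stores one node per prefix bit without path compression.

```python
import ipaddress


class PrefixTrie:
    """Toy binary trie keyed on IPv4 address bits, one node per prefix bit."""

    def __init__(self):
        self.root = {}

    def insert(self, cidr, peer):
        net = ipaddress.ip_network(cidr)
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            bit = (bits >> (31 - i)) & 1
            node = node.setdefault(bit, {})
        node["peer"] = peer

    def lookup(self, addr):
        """Return (peer, steps); steps never exceeds 32, whatever the peer count."""
        bits = int(ipaddress.ip_address(addr))
        node = self.root
        best = node.get("peer")
        steps = 0
        for i in range(32):
            bit = (bits >> (31 - i)) & 1
            if bit not in node:
                break
            node = node[bit]
            steps += 1
            best = node.get("peer", best)  # remember the longest match so far
        return best, steps
```

Adding more peers makes the trie wider, not deeper, so the walk for any one address is still bounded by the address length.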

WireGuard UDP drops

The problem

WireGuard packets are being dropped before they reach the WireGuard module. The kernel's UDP layer might be dropping them due to buffer overflows or rate limits.

# Trace packet drops by kernel location, plus UDP receive-queue overflows (any UDP socket)
bpftrace -e '
tracepoint:skb:kfree_skb {
  @drops[ksym(args.location)] = count();
}
tracepoint:udp:udp_fail_queue_rcv_skb {
  printf("%s: UDP receive queue FULL, packet dropped\n",
         strftime("%H:%M:%S", nsecs));
  @udp_drops = count();
}
interval:s:5 { print(@drops, 5); print(@udp_drops); clear(@drops); clear(@udp_drops); }'

What the output means: If udp_fail_queue_rcv_skb fires, the UDP receive buffer is full and packets are being dropped before WireGuard ever sees them. Fix: increase net.core.rmem_max and net.core.rmem_default. If drops happen in nf_hook_slow, a firewall rule is dropping UDP packets. If drops happen in __udp4_lib_rcv, the destination port doesn't match any socket (WireGuard interface not up).

When to use it: When WireGuard throughput is lower than expected and you suspect packet loss below the WireGuard layer.

"Container consuming resources"

Containers share a kernel. When one container misbehaves, traditional tools show the process but not the container it belongs to. eBPF can trace by cgroup — the kernel's own container boundary — giving you per-container breakdowns of every resource type.

Syscalls by cgroup (container)

The problem

You have dozens of containers and need to know which one is making the most syscalls — the noisiest neighbor.

# Count syscalls per cgroup
bpftrace -e '
tracepoint:raw_syscalls:sys_enter {
  @syscalls[cgroup] = count();
}
interval:s:5 { print(@syscalls, 10); clear(@syscalls); }'

# Map cgroup IDs to container names
#!/bin/bash
# Run alongside the bpftrace to resolve cgroup IDs
# On cgroup v2 the ID bpftrace reports is the inode number of the cgroup directory
for cg in /sys/fs/cgroup/system.slice/docker-*.scope; do
  [ -d "$cg" ] || continue
  id=$(stat -c %i "$cg")
  name=$(basename "$cg" | sed 's/docker-//;s/\.scope//' | head -c 12)
  echo "$id -> $name"
done

What the output means: The cgroup ID maps to a container. The container with the highest syscall count is the noisiest. This doesn't mean it's misbehaving — a web server handling traffic will have high syscall counts. But a container doing 10x more syscalls than expected for its workload needs investigation.

When to use it: When the host is under CPU pressure and you need to identify which container is the heaviest kernel user.
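
If the Docker-specific path layout does not match your host, a more general lookup walks the whole cgroup tree. This sketch assumes cgroup v2, where the ID reported by bpftrace's cgroup builtin is the inode number of the cgroup directory; the function name is hypothetical.

```python
import os


def cgroup_ids(root="/sys/fs/cgroup"):
    """Map cgroup ID (directory inode number) to its path relative to root."""
    mapping = {}
    for dirpath, _dirnames, _filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        mapping[os.stat(dirpath).st_ino] = "/" if rel == "." else "/" + rel
    return mapping
```

Look up each key from the @syscalls map in the returned dictionary; the path usually contains the container ID or the systemd unit name.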

Disk I/O by cgroup

The problem

Disk I/O is high and you need to know which container is responsible. docker stats shows blkio but with poor granularity and no latency data.

# Block I/O bytes and IOPS per cgroup
bpftrace -e '
tracepoint:block:block_rq_issue {
  @io_bytes[cgroup] = sum(args.bytes);
  @io_ops[cgroup] = count();
}
interval:s:5 {
  printf("--- I/O bytes by container ---\n"); print(@io_bytes, 10);
  printf("--- I/O ops by container ---\n"); print(@io_ops, 10);
  clear(@io_bytes); clear(@io_ops);
}'

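What the output means: Per-container I/O throughput and IOPS. A container doing unexpected disk I/O might be logging excessively, running backups, or experiencing a memory leak that causes swapping. Combine with biotop to find the specific process within the container. Caveat: the cgroup is that of the task issuing the request, so buffered writes flushed by kernel writeback threads show up under the root cgroup rather than the container that dirtied the pages.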

When to use it: When disk I/O is saturated and you need per-container attribution. This is essential for noisy-neighbor diagnosis on shared hosts.

Network traffic by cgroup

The problem

Network bandwidth is saturated and you need to attribute it to specific containers.

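# TCP bytes sent/received per cgroup
bpftrace -e '
kprobe:tcp_sendmsg {
  @net_tx[cgroup] = sum(arg2);   // arg2 = bytes the caller asked to send
}
// For receive, arg2 is only the buffer size; the return value is the byte count
kretprobe:tcp_recvmsg /(int32)retval > 0/ {
  @net_rx[cgroup] = sum((int32)retval);
}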
interval:s:5 {
  printf("--- TX bytes ---\n"); print(@net_tx, 10);
  printf("--- RX bytes ---\n"); print(@net_rx, 10);
  clear(@net_tx); clear(@net_rx);
}'

What the output means: Per-container network throughput. A container with unexpectedly high TX might be exfiltrating data, serving as a relay, or just logging verbosely to a remote syslog. A container with high RX but low TX might be silently consuming messages from a queue without processing them (backpressure issue).

When to use it: When the NIC is saturated and docker stats network numbers are too aggregated to be useful.

CPU time by cgroup

The problem

docker stats shows CPU percentage but not breakdown by user vs system time, or which kernel subsystem the container is spending time in.

# On-CPU time per cgroup, measured at context switches (idle task excluded)
bpftrace -e '
tracepoint:sched:sched_switch {
  if (args.prev_pid != 0 && @start[args.prev_pid]) {
    $delta = nsecs - @start[args.prev_pid];
    @cpu_ns[cgroup] = sum($delta);
    delete(@start[args.prev_pid]);
  }
  @start[args.next_pid] = nsecs;
}
interval:s:5 {
  printf("--- CPU nanoseconds by container ---\n");
  print(@cpu_ns, 10);
  clear(@cpu_ns);
}'

What the output means: On-CPU nanoseconds consumed per container per interval, summed across all CPUs. This is finer-grained than docker stats because it measures run time directly at each context switch, but it does not split user from kernel time. High CPU in a container that shouldn't be busy indicates a runaway process, infinite loop, or resource exhaustion causing spin-waits.

When to use it: When you need precise CPU accounting per container for capacity planning or chargeback.
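
For capacity planning or chargeback, raw nanoseconds are easier to reason about as "cores kept busy". A minimal conversion, assuming the 5-second interval used in the script above (the function names are hypothetical):

```python
def cores_used(cpu_ns, interval_s=5):
    """Average number of CPU cores kept busy over the interval."""
    return cpu_ns / (interval_s * 1_000_000_000)


def host_share(cpu_ns, n_cpus, interval_s=5):
    """Fraction of the whole host's CPU capacity used over the interval."""
    return cores_used(cpu_ns, interval_s) / n_cpus
```

A container reporting 7,500,000,000 ns in one 5-second window averaged 1.5 cores, which is 37.5% of a 4-CPU host.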

OOM kills by cgroup

The problem

Containers are being OOM-killed and you need to know which container, which process inside it, and how much memory it was using at the time.

# Trace OOM kills with cgroup (container) attribution
bpftrace -e '
kprobe:oom_kill_process {
  printf("OOM KILL: cgroup=%lld triggered by pid=%d comm=%s\n", cgroup, pid, comm);
  @oom_kills[cgroup] = count();
}
kprobe:mem_cgroup_out_of_memory {
  printf("CGROUP OOM: cgroup=%lld\n", cgroup);
  @cgroup_oom[cgroup] = count();
}
tracepoint:oom:mark_victim {
  printf("OOM VICTIM: pid=%d\n", args.pid);
}'

What the output means: mem_cgroup_out_of_memory fires when a container's memory cgroup limit is hit. oom_kill_process fires when the kernel selects a victim; the pid and comm on that line belong to the task whose allocation triggered the OOM, which is in the same cgroup for a container OOM but is not necessarily the process that dies. The mark_victim line gives the PID that was actually killed (often the main application, but sometimes a helper or sidecar). If the same cgroup keeps hitting OOM, its memory limit is too low or the application has a memory leak.

When to use it: When containers restart unexpectedly. docker inspect shows OOMKilled=true but not the timeline or the specific process that was killed.

"Custom Grafana metric"

The recipes above are for interactive debugging, but the same eBPF data can feed Grafana dashboards for continuous monitoring. The pattern: an eBPF program collects data and exports it as a Prometheus metric, Prometheus scrapes it, and Grafana queries Prometheus. Here are templates for the most common patterns.

The gap between "I can trace this interactively" and "I can see this in Grafana" is shockingly small. A bpftrace one-liner becomes a Prometheus metric with about 20 lines of wrapper script. Once you see how, you'll wonder why all your metrics aren't eBPF-backed.

bpftrace to Prometheus node_exporter textfile

The problem

You want a custom eBPF metric in Grafana but don't want to write a full BCC program. The simplest path: bpftrace writes a textfile, node_exporter picks it up.

#!/bin/bash
# /usr/local/bin/ebpf-disk-latency-exporter.sh
# Exports average disk read latency to Prometheus via node_exporter textfile collector

TEXTFILE_DIR="/var/lib/node_exporter/textfile_collector"
mkdir -p "$TEXTFILE_DIR"

# system() is only available when bpftrace runs with --unsafe
# In /proc/diskstats, field 7 is ms spent reading and field 4 is reads completed
bpftrace --unsafe -e '
interval:s:15 {
  system("cat /proc/diskstats | awk '\''{ if ($4 > 0) print $3, $7/$4 }'\'' | while read dev avg; do
    echo \"ebpf_disk_avg_latency_ms{device=\\\"$dev\\\"} $avg\"
  done > /var/lib/node_exporter/textfile_collector/disk_latency.prom.$$
  mv /var/lib/node_exporter/textfile_collector/disk_latency.prom.$$ /var/lib/node_exporter/textfile_collector/disk_latency.prom");
}'

# Simpler: export a single gauge every 15 seconds
#!/bin/bash
PROM_FILE="/var/lib/node_exporter/textfile_collector/zfs_arc.prom"
while true; do
  ARC_SIZE=$(awk '/^size/ {print $3}' /proc/spl/kstat/zfs/arcstats)
  ARC_HITS=$(awk '/^hits/ {print $3}' /proc/spl/kstat/zfs/arcstats)
  ARC_MISS=$(awk '/^misses/ {print $3}' /proc/spl/kstat/zfs/arcstats)
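  # Metric names are illustrative; rename them to fit your conventions
  cat > "${PROM_FILE}.tmp" <<EOF
# TYPE zfs_arc_size_bytes gauge
zfs_arc_size_bytes ${ARC_SIZE:-0}
# TYPE zfs_arc_hits_total counter
zfs_arc_hits_total ${ARC_HITS:-0}
# TYPE zfs_arc_misses_total counter
zfs_arc_misses_total ${ARC_MISS:-0}
EOF
  mv "${PROM_FILE}.tmp" "$PROM_FILE"   # atomic rename, so a partial file is never read
  sleep 15
done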

What the output means: node_exporter's textfile collector reads .prom files from the configured directory and serves them as Prometheus metrics. The mv pattern (write to temp, then atomic rename) prevents the collector from reading a half-written file. Add # TYPE and # HELP lines so Prometheus records each metric's type and description.

When to use it: For simple metrics where bpftrace or a shell script can produce the value. Zero dependencies beyond node_exporter. Start here before reaching for BCC.
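
The write-to-temp-then-rename pattern can also be sketched in Python, which makes label quoting easier to get right than nested shell escapes. The function names and the example metric are hypothetical; the output follows the Prometheus text exposition format that the textfile collector reads.

```python
import os
import tempfile


def render_metric(name, mtype, help_text, samples):
    """Render one metric family; samples is a list of (labels_dict, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the collector must be able to read it
    os.replace(tmp, path)  # atomic on the same filesystem
```

The temp file has to live in the same directory as the target so the rename stays on one filesystem, and its .tmp suffix keeps the collector from picking it up, since only *.prom files are read.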


BCC program with prometheus_client


The problem

You need a more complex eBPF metric (histograms, per-label breakdowns, or custom aggregation) that a shell script can't handle. BCC + Python's prometheus_client gives you a proper Prometheus exporter.


#!/usr/bin/env python3
# /usr/local/bin/ebpf-exporter.py
# Custom eBPF Prometheus exporter: TCP retransmit rate per remote IP

from bcc import BPF
from prometheus_client import start_http_server, Counter, Histogram
import time

bpf_text = """
#include <uapi/linux/ptrace.h>
#include <net/sock.h>

BPF_HASH(retransmits, u32, u64);  // key: daddr, value: count

int trace_retransmit(struct pt_regs *ctx, struct sock *sk) {
    u32 daddr = sk->__sk_common.skc_daddr;
    u64 *count = retransmits.lookup(&daddr);
    if (count) {
        (*count)++;
    } else {
        u64 one = 1;
        retransmits.update(&daddr, &one);
    }
    return 0;
}
"""

# Prometheus metrics
RETRANSMITS = Counter('tcp_retransmits_total', 'TCP retransmit count', ['remote_ip'])

b = BPF(text=bpf_text)
b.attach_kprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit")

# Start Prometheus HTTP server on port 9101
start_http_server(9101)

print("eBPF exporter running on :9101/metrics")
while True:
    time.sleep(15)
    retransmits = b["retransmits"]
    for k, v in retransmits.items():
        ip = "{}.{}.{}.{}".format(k.value & 0xff, (k.value >> 8) & 0xff,
                                   (k.value >> 16) & 0xff, (k.value >> 24) & 0xff)
        RETRANSMITS.labels(remote_ip=ip).inc(v.value)
    retransmits.clear()

What the output means: This creates a proper Prometheus exporter on port 9101. Add it to your prometheus.yml scrape config. The tcp_retransmits_total counter will appear in Grafana with per-IP labels. You can build dashboards showing retransmit rates per destination, alerting on spikes.

When to use it: When you need proper Prometheus metric types (histograms with quantiles, counters with labels) that a textfile can't express well. This is the production-grade approach.


Histogram metric template


The problem

You want to export an eBPF latency distribution as a Prometheus histogram so you can compute percentiles (p50, p95, p99) in Grafana.


#!/usr/bin/env python3
# Template: eBPF histogram -> Prometheus histogram
# Customize the BPF program and bucket boundaries for your use case

from bcc import BPF
from prometheus_client import start_http_server, Histogram
import time

bpf_text = """
BPF_HISTOGRAM(latency, int);

// Customize this tracepoint for your metric
int trace_entry(struct pt_regs *ctx) {
    // ... start timing ...
    return 0;
}
int trace_return(struct pt_regs *ctx) {
    // ... compute latency, store in histogram ...
    // latency.increment(bpf_log2l(delta_us));
    return 0;
}
"""

# Prometheus histogram with custom buckets (microseconds)
LATENCY = Histogram(
    'my_operation_latency_microseconds',
    'Latency of my operation in microseconds',
    ['operation'],
    buckets=[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536]
)

# b = BPF(text=bpf_text)
# b.attach_kprobe(...)
start_http_server(9102)

while True:
    time.sleep(15)
    # Read BPF histogram, convert log2 buckets to linear, observe into Prometheus
    # for k, v in b["latency"].items():
    #     bucket_us = 1 << k.value  # log2 -> linear
    #     for _ in range(v.value):
    #         LATENCY.labels(operation="read").observe(bucket_us)
    # b["latency"].clear()
    pass

What the output means: BPF histograms use log2 buckets (power-of-2 ranges). Prometheus histograms use explicit cumulative upper bounds (the le label), so the template declares power-of-two boundaries and observes each BPF slot at its upper bound. In Grafana, use histogram_quantile(0.99, rate(my_operation_latency_microseconds_bucket[5m])) to get p99 latency.

When to use it: Any time you need percentile latency data in Grafana. Histograms are the right Prometheus type for latency — not gauges, not counters.
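
The slot-to-bucket conversion can be checked without loading any BPF program. The sketch below takes a plain {slot: count} dictionary, as you would build from b["latency"].items(), and produces cumulative (upper_bound, count) pairs. It assumes slot k only holds values below 2^k, which is how BCC's log2 histograms are laid out; the function name is hypothetical.

```python
def log2_slots_to_cumulative(slots, max_slot=16):
    """Convert {slot: count} from a BPF log2 histogram into cumulative
    (upper_bound, count) pairs, the shape Prometheus exposes as le buckets."""
    buckets = []
    running = 0
    for slot in range(max_slot + 1):
        running += slots.get(slot, 0)
        buckets.append((1 << slot, running))
    # Anything beyond the last finite boundary only counts toward +Inf
    overflow = sum(count for slot, count in slots.items() if slot > max_slot)
    buckets.append((float("inf"), running + overflow))
    return buckets
```

With max_slot=16 the finite boundaries run from 1 to 65536, matching the buckets list in the template above.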


Counter metric template


The problem

You want to count events (errors, retransmits, drops, faults) and export them as a Prometheus counter with labels.


#!/bin/bash
# Simple counter exporter: count events from bpftrace and write to textfile
# Runs as a systemd service

PROM_DIR="/var/lib/node_exporter/textfile_collector"
PROM_FILE="$PROM_DIR/ebpf_events.prom"

bpftrace -e '
tracepoint:skb:kfree_skb { @packet_drops = count(); }
tracepoint:sched:sched_process_exit { @process_exits = count(); }
tracepoint:signal:signal_deliver /args.sig == 9/ { @sigkills = count(); }
interval:s:15 {
  printf("FLUSH %lld %lld %lld\n", @packet_drops, @process_exits, @sigkills);
  clear(@packet_drops); clear(@process_exits); clear(@sigkills);
}' 2>/dev/null | while IFS=' ' read -r tag drops exits kills; do
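  [ "$tag" = "FLUSH" ] || continue
  # bpftrace reports per-interval deltas; keep running totals so the counters only go up
  total_drops=$((total_drops + drops))
  total_exits=$((total_exits + exits))
  total_kills=$((total_kills + kills))
  # ebpf_process_exits_total is an assumed name; rename it to fit your conventions
  cat > "${PROM_FILE}.tmp" <<EOF
# TYPE ebpf_packet_drops_total counter
ebpf_packet_drops_total $total_drops
# TYPE ebpf_process_exits_total counter
ebpf_process_exits_total $total_exits
# TYPE ebpf_sigkills_total counter
ebpf_sigkills_total $total_kills
EOF
  mv "${PROM_FILE}.tmp" "$PROM_FILE"
done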

What the output means: Three kernel-level counters exported to Prometheus every 15 seconds. In Grafana, use rate(ebpf_packet_drops_total[5m]) to see drops per second. Alert on rate(ebpf_sigkills_total[5m]) > 0 to catch OOM kills or forced process termination.

When to use it: When you want simple event counting in Grafana with minimal setup. This pattern works for any kernel event you can trace with bpftrace.
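
One detail matters for rate() to work: a Prometheus counter must only ever increase, while the bpftrace script clears its maps every interval and therefore prints deltas. A sketch of the accumulation step, reading the FLUSH lines the script prints (the class name is hypothetical, and ebpf_process_exits_total is an assumed metric name):

```python
class CounterAccumulator:
    """Turn per-interval deltas into monotonically increasing totals."""

    def __init__(self, names):
        self.totals = {name: 0 for name in names}

    def feed(self, line):
        """Consume one output line; return True if it was a FLUSH record."""
        parts = line.split()
        if len(parts) != len(self.totals) + 1 or parts[0] != "FLUSH":
            return False
        for name, delta in zip(self.totals, parts[1:]):
            self.totals[name] += int(delta)
        return True

    def render(self):
        """Render the totals in Prometheus text exposition format."""
        out = []
        for name, total in self.totals.items():
            out.append(f"# TYPE {name} counter")
            out.append(f"{name} {total}")
        return "\n".join(out) + "\n"
```

If the exporter restarts, the totals reset to zero. Prometheus treats that as a counter reset and rate() handles it.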


Gauge metric template


The problem

You want to export a point-in-time value (current ARC size, current run queue depth, current connection count) as a Prometheus gauge.


#!/bin/bash
# Gauge exporter: export current system state as Prometheus gauges
PROM_DIR="/var/lib/node_exporter/textfile_collector"
PROM_FILE="$PROM_DIR/ebpf_gauges.prom"

while true; do
  # ARC size from /proc
  ARC_SIZE=$(awk '/^size/ {print $3}' /proc/spl/kstat/zfs/arcstats 2>/dev/null || echo 0)

  # Current TCP connections
  TCP_ESTABLISHED=$(ss -t state established | tail -n +2 | wc -l)

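  # 1-minute load average (a rough proxy for run queue pressure)
  RUNQ=$(awk '{print $1}' /proc/loadavg)

  # ZFS pool free space (bytes)
  POOL_FREE=$(zpool list -Hpo free 2>/dev/null | head -1 || echo 0)

  # Metric names are illustrative; rename them to fit your conventions
  cat > "${PROM_FILE}.tmp" <<EOF
# TYPE ebpf_zfs_arc_size_bytes gauge
ebpf_zfs_arc_size_bytes ${ARC_SIZE:-0}
# TYPE ebpf_tcp_established gauge
ebpf_tcp_established ${TCP_ESTABLISHED:-0}
# TYPE ebpf_load1 gauge
ebpf_load1 ${RUNQ:-0}
# TYPE ebpf_zfs_pool_free_bytes gauge
ebpf_zfs_pool_free_bytes ${POOL_FREE:-0}
EOF
  mv "${PROM_FILE}.tmp" "$PROM_FILE"
  sleep 15
done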

What the output means: Gauges represent current values that can go up and down. Use them for "how much right now" metrics (memory usage, connection count, queue depth). Don't use gauges for events — use counters. In Grafana, gauge metrics can be displayed directly without rate().

When to use it: For any metric that represents a current state rather than a cumulative count. ARC size, pool free space, connection count, buffer utilization.
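
If you want the instantaneous number of runnable tasks rather than the 1-minute average, the fourth field of /proc/loadavg has it in the form runnable/total. A small parser, with a hypothetical function name:

```python
def parse_loadavg(text):
    """Return (load1, runnable, total) from the contents of /proc/loadavg."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return float(fields[0]), int(runnable), int(total)
```

Call it as parse_loadavg(open("/proc/loadavg").read()) and export the runnable count as its own gauge next to the load average.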



Putting it all together

These recipes cover the vast majority of production debugging scenarios. The pattern is always the same: the symptom tells you which section to look at, you copy the command, read the output, and act on what it tells you. You don't need to understand eBPF internals. You don't need to write BPF C code. You just need to know which recipe to grab.


The kldload approach: all BCC tools and bpftrace are pre-installed on desktop and server profiles. No package installation, no kernel header compilation, no framework setup. Boot the ISO, run the command. That's it.

On the kldload core profile, install the tools yourself: dnf install bcc-tools bpftrace (RHEL/CentOS/Rocky/Fedora), apt install bpfcc-tools bpftrace (Debian/Ubuntu).



Quick reference: which recipe for which symptom

Disk slow: Start with biolatency, then biotop, then ZIO latency for ZFS.
Network broken: Start with packet drops, then retransmits, then DNS latency.
CPU eaten: Start with flame graph, then runqlat, then syscalls.
Memory gone: Start with page faults, then slab top, then ARC size for ZFS.
Process weird: Start with all syscalls, then file opens, then futex for threading.
Security audit: Start with executions, then /etc writes, then socket opens.
ZFS weird: Start with ARC hit/miss, then TXG sync, then ZIL commit.
WireGuard slow: Start with per-peer bytes, then handshake timing, then UDP drops.
Container noisy: Start with CPU by cgroup, then disk by cgroup, then OOM by cgroup.
Grafana metric: Start with textfile exporter, upgrade to BCC exporter when you need labels.