eBPF Reference

eBPF — The Definitive Reference

eBPF is the most significant change to Linux observability and networking since the kernel itself. It lets you attach small, verified programs to any kernel event — syscalls, function entries, network packets, scheduler decisions, security hooks — and get answers about your running system without modifying the kernel, loading kernel modules, or rebooting. It is safe enough for production, fast enough for line-rate networking, and expressive enough to replace entire kernel subsystems.

This page is a comprehensive reference. It covers the eBPF virtual machine, every program type, the verifier, map types, the complete toolchain, and practical commands you can run today on any kldload system.

What eBPF actually is

Before eBPF, if you wanted to understand what your kernel was doing — why a process was slow, where packets were being dropped, what was causing latency spikes at 3 AM — you had three choices: add printk statements to the kernel source and recompile, load a kernel module that hooks into internal functions (risking a panic if you get it wrong), or use coarse tools like strace that slow your application to a crawl. None of these are safe for production. None of them let you ask arbitrary questions about a running system without risk.

eBPF changes the equation entirely. It provides a sandboxed virtual machine inside the kernel that runs small programs you write in user space. The kernel’s verifier mathematically proves your program is safe before it runs: it cannot crash the kernel, it cannot access arbitrary memory, it cannot loop forever, it cannot corrupt data structures. Once verified, the program is JIT-compiled to native machine code and runs at near-native speed. You attach it to a kernel event — a function entry, a tracepoint, a network hook, a security check — and it fires every time that event occurs.

The paradigm shift is this: eBPF turns the kernel into a programmable platform. Instead of the kernel being a fixed binary you can only observe from the outside, it becomes something you can instrument from the inside, safely, in production, at any time. You can ask any question about your running system and get an answer in real time. "Which process is causing disk latency spikes?" "Why is this TCP connection being reset?" "What path does this packet take through the network stack?" These used to be multi-day debugging exercises. With eBPF, they are one-liners.

The paradigm shift

Traditional kernel instrumentation is static: you build tracing into the kernel at compile time, or you load a module that hooks specific functions. If you want to trace something new, you need to recompile or write a new module. eBPF is dynamic: you attach programs at runtime to any of the thousands of hooks the kernel exposes. No reboot, no recompile, no risk.

Safe

The verifier proves your program terminates, doesn’t access invalid memory, and doesn’t corrupt kernel state. A verified eBPF program cannot crash the kernel. This is what makes eBPF production-safe — you can instrument a system serving millions of requests per second.

Dynamic

Attach programs at runtime to any of 100,000+ kernel functions, tracepoints, network interfaces, cgroup hooks, or security checkpoints. Detach them when you’re done. No reboot, no downtime, no kernel rebuild.

Fast

eBPF programs are JIT-compiled to native x86_64/ARM64 instructions. Overhead is typically under 100 nanoseconds per event. XDP programs process packets at line rate — millions of packets per second — before they even reach the network stack.

Expressive

eBPF programs can read kernel data structures, aggregate statistics into maps, communicate with user space, modify packets, enforce security policies, and even replace kernel TCP congestion algorithms — all from user space.

A brief history

BPF (Berkeley Packet Filter) was created in 1992 for tcpdump — a tiny virtual machine that filtered network packets in the kernel instead of copying every packet to user space. It had two 32-bit registers, no loops, and could only say "accept" or "reject" a packet. In 2014, Alexei Starovoitov rewrote BPF from scratch for Linux 3.18: 64-bit registers, a proper instruction set, maps for shared state, helper functions for kernel interaction, and a verifier that proved programs safe. The "e" in eBPF stands for "extended," but the original BPF and eBPF are so different that the "e" is essentially meaningless — eBPF is a new system that happens to share a name. By Linux 5.x, eBPF had grown far beyond packet filtering into a general-purpose kernel programming framework used by every major tech company in production.

The eBPF machine

The eBPF virtual machine is a register-based RISC machine with a deliberately constrained design. Understanding the machine model explains why eBPF programs have the constraints they do — the 512-byte stack limit, the bounded loops, the helper function calling convention.

Registers

eBPF has 11 registers, all 64-bit. The calling convention mirrors x86_64:

Register	Purpose
`r0`	Return value. Function return value and eBPF program exit code. For XDP programs: XDP_PASS, XDP_DROP, etc. For kprobes: ignored. For helper calls: the return value of the helper.
`r1`–`r5`	Function arguments. Used to pass arguments to helper functions and at program entry to pass the context pointer (`r1`). `r1`–`r5` are caller-saved — helpers may clobber them.
`r6`–`r9`	Callee-saved. Preserved across helper function calls. Use these to hold values you need after a helper returns.
`r10`	Frame pointer (read-only). Points to the top of the current stack frame; stack slots are addressed with negative offsets from it (for example `r10 - 8`). You cannot write to `r10`: it exists only for stack-relative addressing.

At program entry, r1 contains a pointer to the context: the data structure specific to the program type. For a kprobe, it’s a struct pt_regs * (CPU register state). For XDP, it’s a struct xdp_md * (packet metadata). For tracepoints, it’s the tracepoint arguments struct. Your program reads from the context to inspect the event that triggered it.

Instruction set

eBPF uses fixed-width 64-bit instructions (8 bytes each, with one exception: 128-bit load-immediate for 64-bit constants). The instruction set includes:

ALU operations

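add, sub, mul, div, mod, or, and, xor, lsh, rsh, arsh, neg, plus mov and byte-swap operations. Both 32-bit and 64-bit variants. The destination is always a register; the source is either a register or a 32-bit immediate.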

Memory access

ldx (load from memory), stx (store register to memory), st (store immediate). Supports 8, 16, 32, and 64-bit widths. Every access must target memory the verifier can reason about, such as the stack, map values, the context, or bounds-checked packet data; pointer arithmetic that takes a pointer outside its known bounds is rejected.

Branch instructions

jeq, jne, jgt, jge, jlt, jle, jsgt, jsge, jslt, jsle, jset (bitwise AND non-zero). Plus unconditional ja (jump always). All jumps are forward-only in classic eBPF; bounded loops were added in kernel 5.3+.

Function calls

call invokes a kernel helper function by ID. exit terminates the program. BPF-to-BPF function calls (subprograms) are supported since kernel 4.16, enabling code reuse without duplicating instructions.
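
To make the fixed-width encoding concrete, here is a minimal user-space sketch that packs and unpacks single 8-byte instructions. It assumes a little-endian host (the register nibble order differs on big-endian), and the helper names encode_insn and decode_insn are invented for illustration; only the two opcode values come from the kernel's instruction encoding.

```python
import struct

# Opcode values from the kernel's BPF instruction encoding.
BPF_ALU64_MOV_IMM = 0xB7  # BPF_ALU64 | BPF_MOV | BPF_K
BPF_EXIT = 0x95           # BPF_JMP | BPF_EXIT


def encode_insn(opcode, dst=0, src=0, off=0, imm=0):
    """Pack one 8-byte instruction: opcode, regs, 16-bit offset, 32-bit immediate."""
    regs = (src << 4) | dst  # dst in the low nibble, src in the high nibble
    return struct.pack("<BBhi", opcode, regs, off, imm)


def decode_insn(raw):
    """Unpack an 8-byte instruction back into its fields."""
    opcode, regs, off, imm = struct.unpack("<BBhi", raw)
    return {"opcode": opcode, "dst": regs & 0x0F, "src": regs >> 4,
            "off": off, "imm": imm}


# "r0 = 2; exit". For an XDP program, a return value of 2 is XDP_PASS.
program = encode_insn(BPF_ALU64_MOV_IMM, dst=0, imm=2) + encode_insn(BPF_EXIT)

if __name__ == "__main__":
    for i in range(0, len(program), 8):
        print(program[i:i + 8].hex(), decode_insn(program[i:i + 8]))
```

The two-instruction program is 16 bytes long. Real loaders hand the kernel exactly this kind of byte array through the bpf() syscall, along with the program type and license.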

The 512-byte stack

Every eBPF program gets a 512-byte stack. That’s it. No heap, no malloc, no dynamic allocation. This is intentional: the kernel stack is already limited (typically 8KB–16KB), and eBPF programs run inside kernel code paths. If your program needs more working space, use a map (per-CPU array maps are the common pattern for scratch buffers). BPF-to-BPF function calls can nest up to 8 frames deep, but they do not multiply the budget: the verifier adds up the stack used by every frame in a call chain and rejects the program if the combined total exceeds 512 bytes.

Maps — shared state between kernel and user space

Maps are the mechanism for persistent state in eBPF. A map is a key-value data structure that lives in kernel memory and can be accessed by both eBPF programs (from the kernel side) and user-space applications (via the bpf() syscall or file descriptors). Maps survive across eBPF program invocations — they are created separately and can outlive the programs that use them.

Maps are how eBPF programs aggregate data (count events, build histograms), share state between multiple eBPF programs, and communicate results to user space. Without maps, eBPF programs could only return a single integer. With maps, they can build complex data structures accessible from both sides of the kernel/user boundary.
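
The lookup-then-update pattern behind "count events per key" can be modeled in user space. The sketch below is a toy stand-in, not the kernel implementation: the class ToyHashMap and the function count_event are invented names, while the flag values (BPF_ANY, BPF_NOEXIST, BPF_EXIST) and the negative errno return convention follow the kernel's hash map behavior.

```python
import errno

# Update flags, with the same values as the kernel's uapi constants.
BPF_ANY, BPF_NOEXIST, BPF_EXIST = 0, 1, 2


class ToyHashMap:
    """User-space model of a fixed-capacity BPF hash map (not the real thing)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = {}

    def lookup(self, key):
        # A real lookup returns a pointer or NULL; None plays the role of NULL.
        return self._data.get(key)

    def update(self, key, value, flags=BPF_ANY):
        exists = key in self._data
        if flags == BPF_NOEXIST and exists:
            return -errno.EEXIST
        if flags == BPF_EXIST and not exists:
            return -errno.ENOENT
        if not exists and len(self._data) >= self.max_entries:
            return -errno.E2BIG
        self._data[key] = value
        return 0

    def delete(self, key):
        if key not in self._data:
            return -errno.ENOENT
        del self._data[key]
        return 0


def count_event(counts, pid):
    """The usual counting pattern: lookup, NULL-check, then update."""
    current = counts.lookup(pid)
    if current is None:
        return counts.update(pid, 1, BPF_NOEXIST)
    return counts.update(pid, current + 1, BPF_EXIST)
```

The fixed max_entries is the important design point: a map's capacity is declared when it is created, so a program that counts per-PID events has to decide up front how many keys it is willing to track.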

The verifier

The verifier is what makes eBPF safe. Every eBPF program passes through the verifier before it can execute. The verifier performs static analysis of every possible execution path:

Termination

The program must terminate. Before kernel 5.3, this meant no backward jumps at all (no loops). Since 5.3, bounded loops are allowed — the verifier must be able to prove the loop has a fixed upper bound on iterations.

Memory safety

Every memory access must be provably safe. The verifier tracks the type and bounds of every register. Accessing beyond the stack, reading past the end of a packet, or dereferencing a null pointer all result in rejection.

No uninitialized reads

Every register and stack slot must be written before it is read. The verifier tracks initialization state per-register and per-stack-byte across all branches.

Valid helper calls

Only helpers permitted for the program type can be called. Arguments must have the correct type. Return values must be checked before use (e.g., map lookups can return NULL and must be NULL-checked).
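
As a rough illustration of the termination rule, the following toy check rejects any backward jump, in the spirit of the pre-5.3 behavior described above. It is a deliberately simplified sketch: the tuple-based instruction format, the function name check_forward_only, and the message strings are all invented here, and the real verifier does far more (register state tracking, path exploration, bounded-loop analysis).

```python
JUMPS = {"ja", "jeq", "jne", "jgt", "jge", "jlt", "jle"}


def check_forward_only(program):
    """Return None if the program passes, else a rejection message.

    `program` is a list of (name, off) tuples. For jump instructions `off`
    is relative to the next instruction, as in eBPF: target = pc + 1 + off.
    """
    for pc, (name, off) in enumerate(program):
        if name not in JUMPS:
            continue
        target = pc + 1 + off
        if off < 0:
            return f"insn {pc}: back-edge to insn {target}"
        if target >= len(program):
            return f"insn {pc}: jump out of range to insn {target}"
    if not program or program[-1][0] != "exit":
        return "last insn is not an exit"
    return None


if __name__ == "__main__":
    straight_line = [("mov", 0), ("jeq", 1), ("mov", 0), ("exit", 0)]
    loop = [("mov", 0), ("add", 0), ("jne", -2), ("exit", 0)]
    print(check_forward_only(straight_line))
    print(check_forward_only(loop))
```

With only forward jumps and a finite instruction count, every path must reach the end of the program, which is why the original rule was enough to guarantee termination.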

JIT compilation

After the verifier approves a program, the kernel’s JIT compiler translates the eBPF bytecode into native machine instructions for the host architecture (x86_64, ARM64, etc.). JIT-compiled eBPF runs at near-native speed because no interpreter sits between the bytecode and the CPU. The JIT is enabled by default on most modern distribution kernels, and kernels built with CONFIG_BPF_JIT_ALWAYS_ON do not allow it to be turned off. You can verify it:

# Check JIT status (1 = enabled)
cat /proc/sys/net/core/bpf_jit_enable
1

# Enable if disabled
echo 1 > /proc/sys/net/core/bpf_jit_enable
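
The same check can be scripted. This is a small sketch that reads the sysctl file shown above and translates its value; the function names describe_jit and read_jit_status are illustrative, and the meaning of the values 0, 1, and 2 follows the kernel's sysctl documentation.

```python
from pathlib import Path

JIT_SYSCTL = Path("/proc/sys/net/core/bpf_jit_enable")

JIT_MODES = {
    0: "disabled (interpreter only)",
    1: "enabled",
    2: "enabled, with debug output of compiled code to the kernel log",
}


def describe_jit(raw):
    """Translate the sysctl's text value into a human-readable status."""
    value = int(raw.strip())
    return JIT_MODES.get(value, f"unknown value {value}")


def read_jit_status(path=JIT_SYSCTL):
    """Read the sysctl if it is readable; return None otherwise."""
    try:
        return describe_jit(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print(read_jit_status())
```

Returning None instead of raising keeps the helper usable in fleet-wide inventory scripts, where some hosts may not expose the file at all.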

The combination of verifier + JIT is what makes eBPF fundamentally different from every tracing tool that came before it. DTrace has a verifier but no JIT on Linux. SystemTap compiles to kernel modules with no safety guarantee. ftrace is safe but can only trace pre-defined hooks. eBPF is the first system that is simultaneously safe, fast, and can attach to anything.

Program types

An eBPF program’s type determines what it can do: what context it receives, which helpers it can call, where it can attach, and what its return value means. There are over 30 program types in modern kernels. These are the ones that matter for infrastructure work.

kprobe / kretprobe

Attach to kernel function entry (kprobe) or return (kretprobe). This is the most flexible attach point: most of the 50,000+ functions in the kernel can be probed, with the exception of inlined functions and those the kernel marks as off-limits (notrace or the kprobe blacklist). The context is struct pt_regs * containing the CPU register state, from which you can extract function arguments. kretprobes let you capture the return value.

When to use: tracing internal kernel behavior that isn’t exposed via tracepoints. Debugging specific kernel functions. Understanding code paths during development. Caveat: kprobes attach to internal kernel functions that can change between kernel versions. Your probe may break on a kernel upgrade. Use tracepoints when available — they are stable ABI.

# Trace every call to tcp_connect with the destination address
# (skc_dport is stored in network byte order, so swap it before printing)
bpftrace -e 'kprobe:tcp_connect {
  $sk = (struct sock *)arg0;
  $dport = $sk->__sk_common.skc_dport;
  printf("connect to %s:%d\n",
    ntop($sk->__sk_common.skc_daddr),
    bswap($dport));
}'

# Trace vfs_read return values (bytes read, or a negative errno on failure)
bpftrace -e 'kretprobe:vfs_read { @bytes = hist(retval); }'

Tracepoints

Attach to stable, well-defined kernel instrumentation points. Tracepoints are placed by kernel developers at important locations in the code and have a stable ABI — their arguments don’t change between kernel versions (within a major version). They are the preferred attach point for production tracing.

When to use: production tracing, syscall monitoring, scheduler analysis, network event tracking. Always prefer tracepoints over kprobes when both are available.

# List all available tracepoints
bpftrace -l 'tracepoint:*' | head -20
tracepoint:syscalls:sys_enter_read
tracepoint:syscalls:sys_exit_read
tracepoint:syscalls:sys_enter_write
tracepoint:syscalls:sys_exit_write
tracepoint:sched:sched_switch
tracepoint:sched:sched_wakeup
tracepoint:net:net_dev_xmit
tracepoint:block:block_rq_issue
tracepoint:block:block_rq_complete
...

# Trace context switches: outgoing task -> incoming task
bpftrace -e 'tracepoint:sched:sched_switch {
  printf("%-16s (pid %d) -> %-16s (pid %d)\n",
    args.prev_comm, args.prev_pid,
    args.next_comm, args.next_pid);
}'

Raw tracepoints

Like tracepoints but with raw, unprocessed arguments. Regular tracepoints copy arguments into a stable struct, which adds overhead. Raw tracepoints pass the original kernel pointers directly, saving the copy but requiring you to read struct fields yourself with BTF. Lower overhead, less stable across versions.

When to use: high-frequency tracepoints where the copy overhead of regular tracepoints is measurable (e.g., sched_switch can fire hundreds of thousands of times per second on a busy multi-core system).

fentry / fexit

Modern replacement for kprobe/kretprobe. Introduced in kernel 5.5, fentry/fexit attach to kernel function entry and exit with lower overhead than kprobes (no breakpoint trap — the program is called directly via a trampoline). fexit has a major advantage over kretprobe: it receives both the function arguments and the return value, so you can correlate input with output in a single program.

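When to use: any situation where you would use kprobe/kretprobe on kernel 5.5+, provided the kernel has BTF and your architecture supports BPF trampolines. fentry/fexit is usually the better choice: lower overhead, type-safe arguments via BTF, and fexit gives you args + return value together. One limitation: they attach only at function entry and exit, whereas a kprobe can also target an offset inside a function.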

# fentry/fexit (kernel 5.5+, requires BTF)
# bpftrace exposes these as kfunc/kretfunc probes (named fentry/fexit in newer releases).
# With libbpf or the Cilium eBPF library, the C skeleton looks like:
#
# SEC("fentry/tcp_connect")
# int BPF_PROG(trace_tcp_connect, struct sock *sk) {
#     // sk is typed — no need to cast from pt_regs
#     return 0;
# }
#
# SEC("fexit/tcp_sendmsg")
# int BPF_PROG(trace_tcp_sendmsg_exit, struct sock *sk,
#              struct msghdr *msg, size_t size, int ret) {
#     // Both arguments AND return value available
#     return 0;
# }

XDP (eXpress Data Path)

Process packets at the earliest possible point — before the kernel network stack touches them. XDP programs run in the NIC driver (or in a generic hook for drivers that don’t support native XDP). They receive raw packet data and can XDP_PASS (continue to the stack), XDP_DROP (discard), XDP_TX (bounce back out the same interface), or XDP_REDIRECT (send to a different interface, CPU, or AF_XDP socket).

When to use: DDoS mitigation, load balancing, packet filtering at line rate, high-performance networking. XDP can process millions of packets per second per core because it runs before any socket buffer allocation, before any protocol processing, before the packet even has an sk_buff.

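# Simple XDP program to drop all UDP traffic to destination port 9999
# Save as drop_udp.c, compile with: clang -O2 -target bpf -c drop_udp.c -o drop_udp.o
#
# #include <linux/bpf.h>
# #include <linux/if_ether.h>
# #include <linux/in.h>
# #include <linux/ip.h>
# #include <linux/udp.h>
# #include <bpf/bpf_helpers.h>
# #include <bpf/bpf_endian.h>
#
# SEC("xdp")
# int drop_udp_9999(struct xdp_md *ctx) {
#     void *data     = (void *)(long)ctx->data;
#     void *data_end = (void *)(long)ctx->data_end;
#     struct ethhdr *eth = data;
#     // Every header must be bounds-checked against data_end before it is read
#     if ((void *)(eth + 1) > data_end) return XDP_PASS;
#     if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS;
#     struct iphdr *ip = (void *)(eth + 1);
#     if ((void *)(ip + 1) > data_end) return XDP_PASS;
#     if (ip->protocol != IPPROTO_UDP) return XDP_PASS;
#     if (ip->ihl < 5) return XDP_PASS;   // malformed header length
#     struct udphdr *udp = (void *)ip + ip->ihl * 4;
#     if ((void *)(udp + 1) > data_end) return XDP_PASS;
#     if (udp->dest == bpf_htons(9999)) return XDP_DROP;
#     return XDP_PASS;
# }
#
# char LICENSE[] SEC("license") = "GPL";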

# Attach an XDP program to eth0
ip link set dev eth0 xdpgeneric obj drop_udp.o sec xdp

# View XDP program attached to interfaces
ip link show dev eth0 | grep xdp

# Remove XDP program
ip link set dev eth0 xdpgeneric off
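
The header walk in the XDP program above can be prototyped in user space before any BPF code is written. The sketch below mirrors the same check-then-read sequence on a raw byte string. It is a model only: xdp_verdict and make_udp_packet are hypothetical names, and the verdict constants reuse the numeric values of the kernel's enum xdp_action.

```python
import struct

XDP_DROP, XDP_PASS = 1, 2       # values from the kernel's enum xdp_action
ETH_P_IP, IPPROTO_UDP = 0x0800, 17
ETH_HLEN, UDP_HLEN, MIN_IP_HLEN = 14, 8, 20


def xdp_verdict(pkt, blocked_port=9999):
    """Mirror the header walk: bounds-check, then read, at every step."""
    if len(pkt) < ETH_HLEN:
        return XDP_PASS
    (eth_proto,) = struct.unpack_from("!H", pkt, 12)
    if eth_proto != ETH_P_IP:
        return XDP_PASS

    ip_off = ETH_HLEN
    if len(pkt) < ip_off + MIN_IP_HLEN:
        return XDP_PASS
    ihl = (pkt[ip_off] & 0x0F) * 4          # IPv4 header length in bytes
    if ihl < MIN_IP_HLEN or pkt[ip_off + 9] != IPPROTO_UDP:
        return XDP_PASS

    udp_off = ip_off + ihl
    if len(pkt) < udp_off + UDP_HLEN:
        return XDP_PASS
    (dport,) = struct.unpack_from("!H", pkt, udp_off + 2)
    return XDP_DROP if dport == blocked_port else XDP_PASS


def make_udp_packet(dport):
    """Build a minimal Ethernet + IPv4 + UDP frame (checksums left as zero)."""
    eth = bytes(12) + struct.pack("!H", ETH_P_IP)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 64, IPPROTO_UDP, 0,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    udp = struct.pack("!HHHH", 12345, dport, 8, 0)
    return eth + ip + udp


if __name__ == "__main__":
    print(xdp_verdict(make_udp_packet(9999)), xdp_verdict(make_udp_packet(53)))
```

Truncated or non-UDP packets fall through to XDP_PASS, matching the C program's choice to let the regular stack deal with anything it does not fully understand.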

TC (Traffic Control / cls_bpf)

Attach eBPF programs to the Linux traffic control layer. TC programs run on both ingress and egress, after the kernel has created an sk_buff (unlike XDP which runs before). This means TC programs can access full socket and connection information, modify packet headers, redirect between interfaces, and apply traffic shaping.

When to use: Kubernetes networking (Cilium uses TC extensively), container network policies, NAT, packet mangling on egress, anything that needs socket context. TC is the workhorse of eBPF-based networking when you need more context than XDP provides.

# Attach a TC eBPF program to ingress of eth0
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress bpf da obj my_filter.o sec tc_ingress

# List TC programs on an interface
tc filter show dev eth0 ingress

# Remove
tc filter del dev eth0 ingress

cgroup programs

Attach to cgroup hooks to control network behavior, device access, and sysctl values per container or process group. cgroup eBPF programs are used for socket-level policy and load balancing in Kubernetes networking (Cilium, for example), device allow-listing in containers on cgroup v2, and per-cgroup sysctl overrides.

cgroup/sock (BPF_CGROUP_INET_SOCK_CREATE)

Fires when a socket is created. Can allow or deny socket creation per cgroup. Used to prevent containers from creating raw sockets or sockets on disallowed address families.

cgroup/connect4, cgroup/connect6

Fires on connect(). Can rewrite the destination address/port — this is how transparent service mesh proxying works. The application connects to IP A, but the eBPF program silently redirects to IP B.

cgroup/bind4, cgroup/bind6

Fires on bind(). Can rewrite the bind address or reject it. Used for port-level access control per cgroup.

cgroup/sysctl

Intercepts sysctl reads/writes for processes in the cgroup. Allows per-container sysctl overrides without giving the container actual sysctl access.

cgroup/device

Controls which device files (/dev/*) processes in the cgroup can access. Replaces the legacy device cgroup controller with programmable logic.

LSM (Linux Security Modules)

Attach eBPF programs to LSM hooks — the same hooks used by SELinux and AppArmor. BPF LSM programs (kernel 5.7+) can implement custom security policies without writing a kernel module. They can allow, deny, or audit any operation that goes through the LSM framework: file access, socket operations, process creation, module loading, mount operations, and hundreds more.

When to use: custom security policies that go beyond what SELinux/AppArmor profiles can express. Runtime security monitoring (detecting suspicious behavior patterns). Container security enforcement. Audit logging of security-sensitive operations.

struct_ops

Replace kernel subsystem implementations with eBPF programs. struct_ops lets you implement an entire kernel operations struct (vtable) in eBPF. The most important use case: TCP congestion control algorithms. You can write a custom congestion control algorithm in eBPF and load it at runtime without recompiling the kernel.

When to use: custom TCP congestion control (e.g., data center-specific algorithms), custom HID drivers, scheduler extensions (sched_ext in kernel 6.12+). struct_ops is the mechanism that makes the kernel truly programmable — not just observable, but replaceable.

uprobe / uretprobe

Attach to any function in any userspace binary. uprobes work the same way as kprobes but for userspace: the kernel inserts a breakpoint at the function entry point in the target process. When the function is called, the eBPF program fires. uretprobes capture the return value. This works on any ELF binary — compiled C, Go, Rust, even Python or Node.js native extensions.

When to use: tracing application-level behavior without modifying the application. Measuring latency of specific library functions. Debugging third-party binaries. Tracing TLS/SSL handshakes by attaching to OpenSSL functions.

# Histogram of malloc request sizes for every process using this libc (add -p PID to limit it to one process)
bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc { @bytes = hist(arg0); }'

# Trace OpenSSL handshakes (see what's doing TLS)
bpftrace -e 'uprobe:/usr/lib64/libssl.so:SSL_do_handshake { printf("TLS handshake: pid=%d comm=%s\n", pid, comm); }'

# Trace readline in bash (see every command typed)
bpftrace -e 'uretprobe:/bin/bash:readline { printf("cmd: %s\n", str(retval)); }'

USDT (Userspace Statically Defined Tracing)

Pre-defined probe points embedded in userspace applications by their developers. Think of USDT as tracepoints for userspace. Applications like PostgreSQL, MySQL, Node.js, Python, and the JVM can include USDT probes at important points: query start, query complete, GC start, GC end, connection accept, etc. The probes are only present when the binary was built with USDT (DTrace) support enabled, so list them before relying on them. These are more stable than uprobes because the probe location and arguments are declared by the application developer.

# List USDT probes in PostgreSQL
bpftrace -l 'usdt:/usr/bin/postgres:*'
usdt:/usr/bin/postgres:postgresql:query__start
usdt:/usr/bin/postgres:postgresql:query__done
usdt:/usr/bin/postgres:postgresql:transaction__start
usdt:/usr/bin/postgres:postgresql:transaction__commit
usdt:/usr/bin/postgres:postgresql:transaction__abort

# Trace PostgreSQL query execution with latency
# (bpftrace has no floating-point support, so report whole microseconds)
bpftrace -e '
usdt:/usr/bin/postgres:postgresql:query__start { @start[tid] = nsecs; @query[tid] = str(arg0); }
usdt:/usr/bin/postgres:postgresql:query__done /@start[tid]/ {
    printf("%-6d %8d us  %s\n", pid, (nsecs - @start[tid]) / 1000, @query[tid]);
    delete(@start[tid]); delete(@query[tid]);
}'

# List USDT probes in Python
bpftrace -l 'usdt:/usr/bin/python3:*'

# List USDT probes in Node.js
bpftrace -l 'usdt:/usr/bin/node:*'

perf_event

Attach to hardware and software performance counters: CPU cycles, cache misses, branch mispredictions, context switches, page faults. This is the foundation of CPU profiling and flame graph generation. Typically used at a sampling frequency (e.g., 99 Hz) to capture stack traces with very low overhead.

# Sample kernel stack traces at 99 Hz for flame graph generation
bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'

# Sample both kernel and userspace stacks
bpftrace -e 'profile:hz:99 { @[kstack, ustack, comm] = count(); }'

# Trace hardware cache misses
bpftrace -e 'hardware:cache-misses:1000 { @[kstack] = count(); }'
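
Flame graph tooling generally wants "folded" stacks: one line per unique stack, frames joined with semicolons from root to leaf, followed by the sample count. The sketch below converts the text that the kernel-stack one-liner prints into that form. It assumes the map is printed as blocks of the shape "@[", one frame per line with the leaf first, then "]: count"; the function name fold_stacks is invented, and it handles only the single-key @[kstack] case.

```python
import re

OFFSET = re.compile(r"\+(0x[0-9a-fA-F]+|\d+)$")


def fold_stacks(text):
    """Convert bpftrace '@[ ...frames... ]: N' blocks into folded-stack lines."""
    folded = {}
    frames = None
    for line in text.splitlines():
        line = line.strip()
        if line == "@[":
            frames = []
        elif frames is not None and line.startswith("]:"):
            count = int(line[2:].strip())
            # The leaf frame is printed first; folded format wants root first.
            key = ";".join(reversed(frames))
            folded[key] = folded.get(key, 0) + count
            frames = None
        elif frames is not None and line:
            # Drop the "+offset" suffix so samples in one function merge.
            frames.append(OFFSET.sub("", line))
    return [f"{stack} {count}" for stack, count in sorted(folded.items())]


if __name__ == "__main__":
    import sys
    print("\n".join(fold_stacks(sys.stdin.read())))
```

Stripping the instruction offset is a deliberate choice: two samples that land at different offsets in the same function should widen the same flame graph box rather than create two.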

Socket filter

The original BPF use case: filter packets on a socket. Attach to a raw or packet socket to receive only packets that match your filter. This is what tcpdump uses under the hood. Modern eBPF socket filters can do much more than classic BPF — they can access maps, call helpers, and make complex decisions.

sk_msg / sk_skb

Intercept and redirect messages between sockets. sk_msg programs attach to a sockmap or sockhash and can inspect, modify, or redirect data flowing between sockets. This enables kernel-level proxying: data from socket A is redirected to socket B without ever reaching user space. Cilium uses this for service mesh acceleration — bypassing the TCP/IP stack entirely for pod-to-pod communication on the same node.

If this list feels overwhelming, here’s the practical hierarchy: start with tracepoints for observability, kprobes when tracepoints don’t cover what you need, XDP for packet processing, TC for container networking, and LSM for security. Everything else is specialized.

Attach points

An attach point is the specific location in the kernel where your eBPF program runs. Understanding what attach points exist and how to find them is the key to using eBPF effectively.

Finding attach points

# Count ALL available attach points (listing them outputs a lot)
bpftrace -l | wc -l
247831

# Count all tracepoints
bpftrace -l 'tracepoint:*' | wc -l
2194

# List syscall tracepoints (entry and exit for every syscall)
bpftrace -l 'tracepoint:syscalls:*' | head -10
tracepoint:syscalls:sys_enter_accept
tracepoint:syscalls:sys_enter_accept4
tracepoint:syscalls:sys_enter_access
tracepoint:syscalls:sys_enter_acct
tracepoint:syscalls:sys_enter_add_key
tracepoint:syscalls:sys_enter_adjtimex
tracepoint:syscalls:sys_enter_alarm
tracepoint:syscalls:sys_enter_arch_prctl
tracepoint:syscalls:sys_enter_bind
tracepoint:syscalls:sys_enter_bpf

# List kprobes for a specific subsystem (e.g., TCP)
bpftrace -l 'kprobe:tcp_*' | head -10
kprobe:tcp_abort
kprobe:tcp_check_req
kprobe:tcp_close
kprobe:tcp_connect
kprobe:tcp_conn_request
kprobe:tcp_disconnect
kprobe:tcp_done
kprobe:tcp_fin
kprobe:tcp_get_info
kprobe:tcp_getsockopt

# List kprobes for ZFS
bpftrace -l 'kprobe:zfs_*'
bpftrace -l 'kprobe:spa_*'
bpftrace -l 'kprobe:dmu_*'

# Show the arguments for a tracepoint
bpftrace -lv 'tracepoint:syscalls:sys_enter_openat'
tracepoint:syscalls:sys_enter_openat
    int __syscall_nr
    int dfd
    const char * filename
    int flags
    umode_t mode

# Show struct layouts with BTF
bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep -A 20 'struct sock_common {'

How programs attach

kprobe / kretprobe

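Uses the kernel’s kprobe infrastructure (software breakpoints). The kernel patches the probed instruction with an INT3 (x86) or BRK (ARM64). When it is reached, the CPU traps, the eBPF program runs, then the original instruction executes. On kernels where kprobes can use ftrace, a probe at function entry goes through the ftrace hook instead of a trap, which is cheaper. Expect overhead on the order of 100ns or more per call, depending on the mechanism and the CPU.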

fentry / fexit

Uses BPF trampolines. The kernel patches the ftrace call site at the start of the function (a patchable nop emitted by compiler options such as -pg -mfentry on x86_64, or -fpatchable-function-entry on architectures that use it) to jump to a trampoline that calls the eBPF program. Much lower overhead than kprobes: no trap, no interrupt, just a direct call.

Tracepoints

Uses static instrumentation points compiled into the kernel. Each tracepoint is a trace_* function call that is normally a no-op (static key disabled). When you attach an eBPF program, the static key is enabled and the tracepoint fires. Near-zero overhead when no program is attached.

XDP

Attached directly to the network device driver’s receive path via netlink or ip link set. The driver calls the eBPF program for every received packet before allocating an sk_buff. Three modes: native (driver support required), generic (works everywhere, slower), and offload (runs on NIC hardware, supported by Netronome/nfp).

cgroup

Attached to a cgroup via the bpf() syscall. The kernel checks for attached eBPF programs at each cgroup hook point (socket create, connect, bind, etc.). Programs inherit down the cgroup hierarchy unless overridden.

LSM

Attached to LSM hooks via BPF link. Requires CONFIG_BPF_LSM=y and lsm=...,bpf on the kernel command line. The BPF LSM programs run alongside any other loaded LSM (SELinux, AppArmor).
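
Whether the BPF LSM is actually active on a host can be checked from the list of enabled LSMs that securityfs exposes at /sys/kernel/security/lsm. The following is a small sketch of that check; the function names bpf_lsm_active and check_host are illustrative.

```python
from pathlib import Path

LSM_LIST = Path("/sys/kernel/security/lsm")


def bpf_lsm_active(lsm_text):
    """True if 'bpf' appears in the comma-separated list of active LSMs."""
    return "bpf" in [name.strip() for name in lsm_text.split(",")]


def check_host(path=LSM_LIST):
    """Return True/False for this host, or None if the file is not readable."""
    try:
        return bpf_lsm_active(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print(check_host())
```

Matching whole names rather than substrings matters here, since another entry in the list could merely contain the letters "bpf".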

eBPF vs DTrace vs SystemTap vs ftrace vs perf

If you’ve used other tracing tools, here’s how eBPF compares. The short version: eBPF is the only tool that is simultaneously safe, low-overhead, dynamically attachable, and production-ready on Linux.

	eBPF	DTrace	SystemTap	ftrace	perf
Availability	Linux 4.x+ (practical: 5.x+)	Solaris, macOS, FreeBSD. Linux port exists but incomplete	Linux (RHEL-focused)	Linux 2.6.27+	Linux 2.6.31+
Safety model	Kernel verifier proves program safe before execution. Cannot crash kernel.	Safe interpreter with privilege checks	Compiles to kernel module. Can crash kernel.	Safe — only uses pre-built kernel hooks	Safe — read-only sampling and counters
Language	C (libbpf), bpftrace (awk-like), Python (BCC), Go (Cilium), Rust (Aya)	D language (DTrace-specific)	SystemTap scripting language, or C (guru mode)	Shell (trace-cmd), or direct sysfs writes	Command-line tool, no programmability
Overhead	~50–100ns per probe. JIT-compiled native code.	Low (interpreted, but optimized)	Low when compiled, but kernel module compilation adds startup latency	Very low (function tracer adds ~15ns)	Sampling-based, near-zero unless recording
Production safe	Yes. Used in production at Meta, Google, Netflix, Cloudflare, all major cloud providers.	Yes (on Solaris/FreeBSD)	Risky. Guru mode can crash. Not widely used in production.	Yes, but limited scope	Yes
Dynamic attachment	Any kernel function, tracepoint, network hook, cgroup, LSM hook	Any function with probes defined	Any kernel function (via kprobe)	Pre-defined kernel trace events only	Pre-defined PMU events and tracepoints
Networking	XDP, TC, socket filter, sk_msg. Can replace entire network stacks.	Packet inspection only	Limited network support	Network tracepoints only	Network tracepoints only
Security	LSM hooks, cgroup device control. Can implement security policies. (seccomp filters use classic BPF, not eBPF.)	No	No	No	No
State / aggregation	Maps (hash, array, ring buffer, etc.) shared between kernel and user space	Aggregations built into language	Associative arrays	Histograms via trace events	In-kernel aggregation for some events
Kernel version coupling	CO-RE + BTF: compile once, run on any kernel that exposes BTF	Stable probes (less coupling)	Strongly coupled. Scripts break on kernel upgrades.	Stable (uses trace events)	Stable

DTrace was revolutionary in 2005 and eBPF learned from it. But DTrace never got a real Linux implementation, and eBPF has now surpassed it in every dimension — networking, security, programmability, and ecosystem. If you’re on Linux, eBPF is the answer. Period.

BTF and CO-RE

BPF Type Format (BTF)

BTF is a compact metadata format that describes every type in the kernel: structs, unions, enums, typedefs, function prototypes. The running kernel exposes its own BTF at /sys/kernel/btf/vmlinux, and eBPF programs carry BTF for their own types. BTF is what allows tools such as bpftrace to resolve an expression like $sk->__sk_common.skc_daddr without kernel headers: it records the struct layout, field offsets, and types.

Without BTF, eBPF programs that access kernel structs must hardcode field offsets, which change between kernel versions. With BTF, the loader can relocate field accesses at load time, reading the offset from the running kernel’s BTF data. This is the foundation of CO-RE.

# Check BTF availability
ls -la /sys/kernel/btf/vmlinux
-r--r--r--. 1 root root 5765272 Apr  4 12:00 /sys/kernel/btf/vmlinux

# List kernel modules with BTF
bpftool btf list
1: name [vmlinux]  size 5765272B
2: name [openzfs]  size 183441B  map_ids 3,7

# Dump all types in the kernel (generates vmlinux.h)
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
wc -l vmlinux.h
189432 vmlinux.h

# Search for a specific struct
bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep -A 15 'struct task_struct {'

# Dump BTF for a specific kernel module
bpftool btf dump file /sys/kernel/btf/openzfs format c > openzfs.h

Compile Once, Run Everywhere (CO-RE)

CO-RE solves the biggest historical problem with eBPF development: kernel version coupling. Before CO-RE, if your eBPF program accessed task->pid and the offset of pid within struct task_struct changed between kernel 5.15 and 6.1, your program would read garbage on the new kernel. You had to compile your program on every target kernel, or use BCC which compiles at runtime (slow, requires compiler toolchain on every host).

CO-RE eliminates this. When you compile a CO-RE program with clang, the compiler records relocation records in the ELF binary: "I’m reading field pid from type struct task_struct." When the program is loaded, libbpf reads the running kernel’s BTF, finds the actual offset of pid, and patches the instruction. The same compiled binary runs correctly on any kernel that has BTF, regardless of struct layout changes.

This is why BTF matters so much: it is the prerequisite for CO-RE, and CO-RE is what makes it practical to distribute pre-compiled eBPF programs. Without CO-RE, every eBPF tool would need to ship kernel headers and a compiler. With CO-RE, you ship a single binary.
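
The relocation step can be illustrated with a toy model. In the sketch below, two invented "kernel versions" place the same field at different offsets, and a loader function patches the unresolved offset from the target's type information before the read runs. Everything here is hypothetical: the layouts are not real struct task_struct layouts, and relocate, run, and make_task are illustrative names rather than libbpf APIs.

```python
import struct

# Two invented "kernel versions" that place the same field at different offsets.
# These layouts are illustrative, not real struct task_struct layouts.
KERNEL_A_BTF = {"task_struct": {"state": 0, "pid": 8}}
KERNEL_B_BTF = {"task_struct": {"state": 0, "flags": 8, "pid": 16}}

# What the compiler records: "read a 4-byte field named pid from task_struct".
# The offset is left unresolved until load time.
PROGRAM = [{"op": "load_u32", "type": "task_struct", "field": "pid", "offset": None}]


def relocate(program, kernel_btf):
    """Loader step: fill in each field offset from the target's type info."""
    patched = []
    for insn in program:
        fields = kernel_btf[insn["type"]]
        if insn["field"] not in fields:
            raise LookupError(f"{insn['type']}.{insn['field']} not in target BTF")
        patched.append({**insn, "offset": fields[insn["field"]]})
    return patched


def run(program, memory):
    """Execute the patched loads against a raw byte image of the struct."""
    return [struct.unpack_from("<I", memory, insn["offset"])[0] for insn in program]


def make_task(kernel_btf, pid):
    """Build a fake struct image with pid stored at that kernel's offset."""
    image = bytearray(24)
    struct.pack_into("<I", image, kernel_btf["task_struct"]["pid"], pid)
    return bytes(image)
```

Relocating the same PROGRAM against each layout yields the correct pid on both, while applying kernel A's offsets to kernel B's memory image reads the wrong bytes. That second case is the pre-CO-RE failure mode described above.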

# The CO-RE workflow:
# 1. Generate vmlinux.h from your dev kernel (or use a pre-made one)
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

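# 2. Write your eBPF program using vmlinux.h types
#    #include "vmlinux.h"
#    #include <bpf/bpf_helpers.h>
#    #include <bpf/bpf_core_read.h>
#
#    SEC("tp/sched/sched_process_exec")
#    int trace_exec(struct trace_event_raw_sched_process_exec *ctx) {
#        struct task_struct *task = (struct task_struct *)bpf_get_current_task();
#        // This read is relocated at load time using the running kernel's BTF
#        pid_t ppid = BPF_CORE_READ(task, real_parent, tgid);
#        bpf_printk("exec pid=%d ppid=%d", ctx->pid, ppid);
#        return 0;
#    }
#
#    char LICENSE[] SEC("license") = "GPL";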

# 3. Compile with clang (once)
clang -O2 -g -target bpf -D__TARGET_ARCH_x86 -c trace_exec.bpf.c -o trace_exec.bpf.o

# 4. The .o file runs on any kernel with BTF — no recompilation needed

CO-RE is the reason eBPF tools like Cilium, Falco, and Tetragon can ship as single binaries. Without it, every customer would need kernel headers and LLVM installed. CO-RE made eBPF commercially viable.

The toolchain

There are five major eBPF development environments. They serve different audiences and use cases. Here’s when to use each one.

bpftrace — the one-liner tool

What: A high-level tracing language inspired by awk and DTrace. One-liners or short scripts. Compiles to eBPF bytecode behind the scenes.
Language: bpftrace scripting language (awk-like syntax with kernel awareness).
When to use: Ad-hoc investigation. Production debugging. Quick answers to "what is happening right now?" questions. Prototyping before writing a full program.
When NOT to use: Long-running daemons, complex logic, networking programs (XDP/TC), anything that needs to be distributed as a binary.

# One-liner: who is opening outbound connections? (traces connect(), skipping sshd)
bpftrace -e 'tracepoint:syscalls:sys_enter_connect /comm != "sshd"/ { printf("%s pid=%d\n", comm, pid); }'

# One-liner: histogram of syscall latency for a specific process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter /pid == 1234/ { @start[tid] = nsecs; }
             tracepoint:raw_syscalls:sys_exit /pid == 1234 && @start[tid]/ { @ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'

# Script: trace file I/O by process with latency
bpftrace -e '
kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
  @us[comm] = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'

BCC — Python + C

What: BPF Compiler Collection. Write the kernel-side eBPF program in C, the user-space frontend in Python (or Lua/C++). BCC compiles the C code at runtime using LLVM.
Language: C (eBPF side) + Python (user space side).
When to use: The 100+ pre-built tools (opensnoop, execsnoop, tcplife, biolatency, etc.) are invaluable. Also good for prototyping tools that need a user-space component.
When NOT to use: New tool development. BCC requires LLVM + kernel headers on every target machine (compiles at runtime). Modern projects use libbpf + CO-RE instead.

# BCC tool locations:
#   Debian/Ubuntu: /usr/sbin/ (e.g., /usr/sbin/opensnoop-bpfcc)
#   CentOS/RHEL:   /usr/share/bcc/tools/ (e.g., /usr/share/bcc/tools/opensnoop)

# The essential BCC tools every SRE should know:
opensnoop          # trace file opens system-wide
execsnoop          # trace process execution
tcpconnect         # trace outbound TCP connections
tcplife            # trace TCP sessions with duration + bytes
biolatency         # block I/O latency histogram
runqlat            # CPU scheduler queue latency
cachestat          # page cache hit/miss ratio
memleak            # trace memory allocations (find leaks)
funccount          # count kernel function calls
profile            # CPU profiling via sampling

libbpf — the modern C library

What: The standard C library for loading, attaching, and managing eBPF programs. Used with CO-RE for portable, pre-compiled eBPF binaries.
Language: C (both sides).
When to use: Building production tools, daemons, agents. Anything that needs to ship as a self-contained binary without requiring LLVM or kernel headers on the target. This is the recommended approach for new projects.
Workflow: Write eBPF C code, compile with clang to BPF object file, use bpftool gen skeleton to generate a C header, link against libbpf in your user-space program.

# The libbpf development workflow:
# 1. Write the eBPF program (runs in kernel)
#    trace_exec.bpf.c

# 2. Compile to BPF object file
clang -O2 -g -target bpf -c trace_exec.bpf.c -o trace_exec.bpf.o

# 3. Generate the skeleton header
bpftool gen skeleton trace_exec.bpf.o > trace_exec.skel.h

# 4. Write user-space loader that includes the skeleton
#    trace_exec.c — calls trace_exec_bpf__open(), __load(), __attach()

# 5. Compile user-space program, link against libbpf
gcc -O2 -o trace_exec trace_exec.c -lbpf -lelf -lz

# Result: single binary, runs on any kernel with BTF

Cilium eBPF library — Go

What: A pure Go library for working with eBPF programs and maps. Used by Cilium, Tetragon, Hubble, and many Go-based infrastructure tools.
Language: Go (user space) + C (eBPF side, compiled with clang).
When to use: Building eBPF tools in Go. If your infrastructure is Go-based (as most modern cloud-native tooling is), this is the natural choice. Excellent documentation and active community.

Aya — Rust

What: A Rust library for eBPF that writes both the kernel-side and user-space code in Rust. No dependency on libbpf or clang — it has its own BPF linker and relocator.
Language: Rust (both sides).
When to use: Rust-based infrastructure. Projects where memory safety in the user-space component matters as much as in the kernel component. Aya is newer but growing rapidly.

Decision tree

Quick investigation? bpftrace.
Pre-built tool exists? BCC.
Building a production C tool? libbpf + CO-RE.
Building a production Go tool? Cilium eBPF library.
Building a production Rust tool? Aya.

What kldload pre-installs

kldload installs a complete eBPF toolchain out of the box. No post-install setup required — you boot the system and start tracing immediately.

Package	Debian/Ubuntu name	CentOS/RHEL name	What it provides
bpftrace	`bpftrace`	`bpftrace`	High-level tracing language. One-liners and scripts for production debugging.
BCC tools	`bpfcc-tools`	`bcc-tools`	100+ ready-made tools: opensnoop, execsnoop, tcplife, biolatency, runqlat, etc.
bpftool	`bpftool`	`bpftool`	Low-level BPF program/map management. BTF introspection. Skeleton generation.
Kernel headers	`linux-headers-$(uname -r)`	`kernel-devel`	Required for BCC (compiles at runtime). Not needed for CO-RE programs.
BTF	Built into kernel	Built into kernel	`/sys/kernel/btf/vmlinux` — type information for CO-RE and bpftrace struct access.
perf	`linux-perf`	`perf`	Performance counters, CPU profiling, flame graphs. Complements eBPF for sampling-based analysis.

Why all of this is pre-installed: when you’re debugging a production issue at 3 AM, the last thing you want is to discover you need to install packages. kldload ensures every system ships with a complete observability toolkit. You boot, you trace, you find the problem.

On CentOS/RHEL, set KLDLOAD_ENABLE_EBPF=1 in the answers file to include eBPF tools in the initial install, or install afterward with dnf install -y bpftool bcc-tools bpftrace perf. On Debian/Ubuntu, they’re included in every desktop and server profile by default.

Map types deep dive

Maps are the data structures of eBPF. They persist across program invocations, are accessible from both kernel and user space, and come in over 30 specialized types. Here are the ones you will actually use.

Hash map (BPF_MAP_TYPE_HASH)

The general-purpose key-value store. Arbitrary keys, arbitrary values, O(1) lookup. Use it for: tracking per-PID state, building lookup tables, counting events by key.

# bpftrace hash map: count syscalls by process name
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Output (Ctrl+C to print):
# @[systemd]: 142
# @[sshd]: 87
# @[bash]: 1203
# @[postgres]: 34891

# Inspect a hash map with bpftool
bpftool map dump id 42

Array (BPF_MAP_TYPE_ARRAY)

Fixed-size array indexed by integer key (0 to max_entries-1). O(1) lookup, pre-allocated. Use it for: global configuration, indexed counters, lookup tables with integer keys. Array elements cannot be deleted — they exist for the lifetime of the map.

Per-CPU variants (BPF_MAP_TYPE_PERCPU_HASH, BPF_MAP_TYPE_PERCPU_ARRAY)

Same as hash and array, but each CPU gets its own copy of every value. No locking needed, no contention between CPUs. User space reads all per-CPU copies and aggregates them. Use it for: high-frequency counters where lock contention would be a bottleneck. Packet counters, syscall counters, latency tracking on busy systems.

# bpftrace uses per-CPU maps automatically for count() and hist() aggregations
# Under the hood, each CPU increments its own counter independently

# View per-CPU map contents with bpftool
bpftool map dump id 15
key: 00 00 00 00  value (CPU 00): 01 00 00 00 00 00 00 00
                  value (CPU 01): 03 00 00 00 00 00 00 00
                  value (CPU 02): 00 00 00 00 00 00 00 00
                  value (CPU 03): 02 00 00 00 00 00 00 00
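
Because each CPU owns its own slot, user space has to combine the slots itself. The sketch below decodes the 8-byte little-endian values as they appear in the dump above and sums them. The function names le64_decode() and percpu_sum() are illustrative, not part of bpftool or libbpf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one 8-byte little-endian value as printed by `bpftool map dump`. */
static uint64_t le64_decode(const uint8_t b[8])
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | b[i];
    return v;
}

/* Sum the per-CPU copies of a single map value. */
static uint64_t percpu_sum(const uint8_t (*slots)[8], size_t ncpus)
{
    uint64_t total = 0;
    assert(slots != NULL || ncpus == 0);
    for (size_t cpu = 0; cpu < ncpus; cpu++)
        total += le64_decode(slots[cpu]);
    return total;
}
```

For the four CPUs in the dump (1, 3, 0, 2) the aggregate is 6. A real loader would obtain the slots from a user-space map lookup, which returns one value per possible CPU.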

LRU hash (BPF_MAP_TYPE_LRU_HASH)

A hash map with a fixed maximum size that automatically evicts the least-recently-used entry when full. Use it for: connection tracking tables, flow caches, any map where you can’t predict the number of entries and need bounded memory. Per-CPU variant available (BPF_MAP_TYPE_LRU_PERCPU_HASH).
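
The eviction behavior can be sketched in plain C. This is a toy model assuming exact LRU ordering and a capacity of three entries; the kernel implementation is more elaborate and its eviction order is only approximately LRU. All names here (lru_map, lru_lookup, lru_update, LRU_CAP) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LRU_CAP 3  /* hypothetical max_entries */

struct lru_entry { uint32_t key; uint64_t value; uint64_t last_used; int in_use; };
struct lru_map   { struct lru_entry slot[LRU_CAP]; uint64_t clock; };

/* Return the slot index holding key, or -1 if absent. */
static int lru_find(const struct lru_map *m, uint32_t key)
{
    for (int i = 0; i < LRU_CAP; i++)
        if (m->slot[i].in_use && m->slot[i].key == key)
            return i;
    return -1;
}

/* A lookup refreshes recency. Returns 1 on hit (value stored in *out), 0 on miss. */
static int lru_lookup(struct lru_map *m, uint32_t key, uint64_t *out)
{
    int i = lru_find(m, key);
    if (i < 0)
        return 0;
    m->slot[i].last_used = ++m->clock;
    *out = m->slot[i].value;
    return 1;
}

/* Insert or update. When the map is full, the least recently used slot is reused. */
static void lru_update(struct lru_map *m, uint32_t key, uint64_t value)
{
    int victim = lru_find(m, key);
    if (victim < 0) {
        victim = 0;
        for (int i = 0; i < LRU_CAP; i++) {
            if (!m->slot[i].in_use) { victim = i; break; }
            if (m->slot[i].last_used < m->slot[victim].last_used)
                victim = i;
        }
    }
    assert(victim >= 0 && victim < LRU_CAP);
    m->slot[victim] = (struct lru_entry){ key, value, ++m->clock, 1 };
}
```

Inserting keys 1, 2, 3, touching key 1, then inserting key 4 evicts key 2, because it is the entry that has gone longest without being used. Memory stays bounded no matter how many distinct keys arrive.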

Ring buffer (BPF_MAP_TYPE_RINGBUF)

A single shared ring buffer for streaming events from kernel to user space. Introduced in kernel 5.8 as a replacement for perf buffers. Advantages over perf buffer: single buffer shared across all CPUs (no per-CPU allocation waste), supports variable-length records, allows reserving space before writing (no double-copy), and preserves event ordering across CPUs.

When to use: any time you need to stream events to user space. The ring buffer is the recommended choice for new programs — it is more memory-efficient and has better ordering guarantees than perf buffers.

# In libbpf C:
# struct {
#     __uint(type, BPF_MAP_TYPE_RINGBUF);
#     __uint(max_entries, 256 * 1024);  /* 256 KB */
# } events SEC(".maps");
#
# SEC("tp/sched/sched_process_exec")
# int trace_exec(void *ctx) {
#     struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
#     if (!e) return 0;
#     e->pid = bpf_get_current_pid_tgid() >> 32;
#     bpf_get_current_comm(&e->comm, sizeof(e->comm));
#     bpf_ringbuf_submit(e, 0);
#     return 0;
# }

Perf buffer (BPF_MAP_TYPE_PERF_EVENT_ARRAY)

The original mechanism for streaming events to user space. One ring buffer per CPU. Still widely used in BCC tools. For new programs, prefer ring buffers (above) — but you’ll encounter perf buffers constantly in existing tools.

Bloom filter (BPF_MAP_TYPE_BLOOM_FILTER)

A probabilistic data structure that answers "is this key in the set?" with no false negatives and a configurable false positive rate. Use it for: fast pre-filtering before an expensive hash lookup. Example: checking if an IP address is in a blocklist of millions of entries. Kernel 5.16+.
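
To make the "no false negatives" property concrete, here is a minimal user-space Bloom filter. It is a sketch, not the kernel's implementation: the filter size, the number of hashes, and the seeded FNV-1a hash are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOOM_BITS   1024u  /* hypothetical filter size in bits */
#define BLOOM_HASHES 3u     /* hypothetical number of hash functions */

struct bloom { uint8_t bits[BLOOM_BITS / 8]; };

/* FNV-1a over the four key bytes, varied by a per-hash seed. */
static uint32_t bloom_hash(uint32_t key, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    for (int i = 0; i < 4; i++) {
        h ^= (key >> (8 * i)) & 0xffu;
        h *= 16777619u;
    }
    return h % BLOOM_BITS;
}

/* Set one bit per hash function. */
static void bloom_add(struct bloom *b, uint32_t key)
{
    assert(b != NULL);
    for (uint32_t s = 0; s < BLOOM_HASHES; s++) {
        uint32_t bit = bloom_hash(key, s);
        b->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Returns 0 if the key is definitely absent, 1 if it may be present. */
static int bloom_maybe_contains(const struct bloom *b, uint32_t key)
{
    for (uint32_t s = 0; s < BLOOM_HASHES; s++) {
        uint32_t bit = bloom_hash(key, s);
        if (!(b->bits[bit / 8] & (1u << (bit % 8))))
            return 0;
    }
    return 1;
}
```

A key that was added always finds all of its bits set, so a 0 answer is definitive. A 1 answer only means "possibly present", which is why the filter is used as a cheap pre-check in front of an exact lookup.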

Queue and Stack (BPF_MAP_TYPE_QUEUE, BPF_MAP_TYPE_STACK)

FIFO queue and LIFO stack. No keys — just push and pop values. Use them for: work queues between eBPF programs, ordered event collection, breadth-first or depth-first traversal state.

Sockmap / Sockhash

Maps that hold references to kernel sockets. Used with sk_msg/sk_skb programs to redirect data between sockets at kernel level. This is the mechanism behind Cilium’s socket-level load balancing and service mesh acceleration — data flows from socket A to socket B without ever leaving the kernel.

Program array (BPF_MAP_TYPE_PROG_ARRAY)

A map that holds references to other eBPF programs. Used for tail calls: one eBPF program jumps to another program in the array. This works around the instruction count limit by chaining programs together, and enables runtime-selectable behavior (swap a program in the array to change behavior without reloading).
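
The dispatch pattern is easiest to see as a table of function pointers. The following user-space sketch only models the control flow: dispatch(), prog_array, and the two handlers are hypothetical names, and a real tail call replaces the running program rather than calling a function and returning.

```c
#include <assert.h>
#include <stddef.h>

#define PROG_ARRAY_SIZE 4  /* hypothetical max_entries */
static_assert(PROG_ARRAY_SIZE > 0, "program array needs at least one slot");

typedef int (*prog_fn)(int ctx);

/* Stand-in for a BPF_MAP_TYPE_PROG_ARRAY: slots may be empty (NULL). */
static prog_fn prog_array[PROG_ARRAY_SIZE];

/* Models a tail call: if the slot is populated, that program produces the
 * final result; if it is empty or out of range, execution falls through. */
static int dispatch(int ctx, size_t index, int fallthrough_result)
{
    if (index < PROG_ARRAY_SIZE && prog_array[index])
        return prog_array[index](ctx);
    return fallthrough_result;
}

/* Two interchangeable handlers that can occupy the same slot. */
static int handle_v1(int ctx) { return ctx + 1; }
static int handle_v2(int ctx) { return ctx * 2; }
```

Storing handle_v1 in slot 0 and later overwriting it with handle_v2 changes what dispatch() does without touching the caller. That mirrors how swapping an entry in a program array changes behavior without reloading the entry program.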

The map type you’ll use 90% of the time is the hash map. Reach for a per-CPU hash for high-frequency counters and a ring buffer for streaming events to user space. Everything else is for specialized use cases, but knowing they exist means you won’t reinvent them poorly.

The verifier in detail

The verifier is the gatekeeper. Every eBPF program must pass through it, and when your program gets rejected, the error messages can be cryptic. Understanding what the verifier checks and why helps you write programs that pass on the first try — and debug the ones that don’t.

DAG analysis

The verifier simulates every possible execution path through your program’s control-flow graph. Before bounded loops arrived in kernel 5.3, that graph had to be a directed acyclic graph (DAG). At each instruction, it tracks the state of every register: its type (scalar, pointer to map value, pointer to stack, pointer to packet, etc.), its value range (if known), and whether it has been initialized. At a conditional branch it follows both outcomes, narrowing the known value ranges on each side. When a path arrives at an instruction in a state already covered by one it has verified, the verifier prunes that path instead of walking it again. If any path can reach an unsafe operation, the program is rejected.

Complexity limit

The verifier has a hard limit on the number of instructions it will analyze: 1 million verified instructions (as of kernel 5.2+; it was 128K before). This is not the number of instructions in your program — it’s the number of instructions the verifier visits across all paths. A 200-instruction program with many branches can exceed the limit because the verifier explores every combination. If you hit the limit, your program is rejected with BPF program is too large.

# Dump the translated (post-verifier) instructions of a loaded program
bpftool prog dump xlated id 42

# Get the verbose verifier log when loading a program
# In libbpf: set kernel_log_level (and optionally kernel_log_buf) in bpf_object_open_opts
# In bpftrace: bpftrace -v prints the verifier log when a program fails to load

Bounded loops

Before kernel 5.3, eBPF had no loops at all — every backward jump was rejected. You had to unroll loops manually with #pragma unroll. Since 5.3, the verifier allows loops if it can prove they terminate. The verifier tracks the loop variable and its bounds:

// This works (kernel 5.3+): bounded loop
for (int i = 0; i < 10; i++) {
    // verifier knows: i ranges [0, 9], loop iterates exactly 10 times
    buf[i] = 0;
}

// This fails: verifier can’t prove termination
int n = get_some_value();
for (int i = 0; i < n; i++) {
    // n is unknown — loop bound is unprovable
    buf[i] = 0;
}

// Fix: clamp the variable
int n = get_some_value();
if (n > 10) n = 10;  // now verifier knows n <= 10
for (int i = 0; i < n; i++) {
    buf[i] = 0;
}
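
Why the clamp helps can be shown with a toy version of range tracking. This is a deliberately simplified model: the real verifier tracks signed and unsigned bounds plus known bits, and it checks bounded loops by simulating iterations. The struct and function names below are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Closed interval of values a register might hold. */
struct range { long long min, max; };

/* State after the branch `if (n > limit) n = limit;`. */
static struct range clamp_upper(struct range r, long long limit)
{
    if (r.max > limit) r.max = limit;
    if (r.min > limit) r.min = limit;
    return r;
}

/* Possible values of i inside `for (i = 0; i < n; i++)`: [0, n.max - 1].
 * The result is empty (min > max) when the loop can never run. */
static struct range loop_index_range(struct range n)
{
    struct range i = { 0, n.max - 1 };
    return i;
}

/* buf[i] is provably in bounds only if every possible i lies in [0, len). */
static int access_is_safe(struct range i, long long len)
{
    assert(len > 0);
    if (i.min > i.max)
        return 1;  /* loop body never executes */
    return i.min >= 0 && i.max < len;
}
```

With n unknown, the index range extends to INT_MAX - 1 and the access into a 10-element buffer cannot be proven safe. After clamping n to at most 10, the index range shrinks to [0, 9] and the same access is accepted.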

Since kernel 5.17, bpf_loop() helper provides another way: you pass a callback function and a maximum iteration count, and the kernel handles the loop. This avoids the verifier needing to analyze the loop body on every iteration.

Stack limits

The verifier tracks every byte of the 512-byte stack. It ensures you don’t read uninitialized stack memory, don’t write past the stack bounds, and don’t pass uninitialized stack buffers to helpers. If your program needs more than 512 bytes of working space, use a per-CPU array map as a scratch buffer:

// Per-CPU array as scratch buffer (avoids 512-byte stack limit)
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __type(key, u32);
    __type(value, struct big_buffer);  // can be up to 32KB
    __uint(max_entries, 1);
} scratch SEC(".maps");

SEC("kprobe/some_func")
int my_prog(struct pt_regs *ctx) {
    u32 key = 0;
    struct big_buffer *buf = bpf_map_lookup_elem(&scratch, &key);
    if (!buf) return 0;  // always succeeds for per-cpu array, but verifier requires the check
    // use buf->data[...] for up to 32KB of scratch space
    return 0;
}

Helper functions

eBPF programs cannot call arbitrary kernel functions. They can only call helper functions provided by the kernel. Helpers are the API surface between eBPF programs and the kernel. Each program type has access to a specific subset of helpers. The verifier enforces this.

Common helpers every eBPF developer uses:

Helper	What it does
`bpf_map_lookup_elem()`	Look up a key in a map. Returns pointer to value or NULL.
`bpf_map_update_elem()`	Insert or update a key-value pair in a map.
`bpf_map_delete_elem()`	Delete a key from a map.
`bpf_get_current_pid_tgid()`	Returns (tgid << 32 | pid). tgid = process ID, pid = thread ID.
`bpf_get_current_comm()`	Copies the current task’s command name (up to 16 bytes).
`bpf_ktime_get_ns()`	Returns monotonic clock in nanoseconds. For latency measurement.
`bpf_probe_read_kernel()`	Safely read from a kernel address. Returns 0 on success.
`bpf_probe_read_user()`	Safely read from a user-space address.
`bpf_ringbuf_reserve()`	Reserve space in a ring buffer for writing.
`bpf_ringbuf_submit()`	Submit a reserved ring buffer entry to user space.
`bpf_printk()`	Debug print to `/sys/kernel/debug/tracing/trace_pipe`. Use for debugging only — very slow.
`bpf_get_stackid()`	Capture a stack trace into a stack trace map. For profiling.
`bpf_redirect()`	Redirect a packet to another interface (XDP/TC).
`bpf_skb_store_bytes()`	Modify packet contents (TC programs).
`bpf_loop()`	Execute a callback function up to N times (kernel 5.17+). Avoids verifier loop analysis.
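
The packed return value of bpf_get_current_pid_tgid() is a frequent source of confusion, so here is the bit arithmetic on its own. The helper names split_pid_tgid() and pack_pid_tgid() are illustrative user-space functions, not kernel helpers.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) == 2 * sizeof(uint32_t),
              "two 32-bit IDs must fit in one 64-bit value");

/* tgid is what user space calls the PID; tid is the thread ID. */
struct task_ids { uint32_t tgid; uint32_t tid; };

/* Split the 64-bit value: upper 32 bits = tgid, lower 32 bits = thread ID. */
static struct task_ids split_pid_tgid(uint64_t pid_tgid)
{
    struct task_ids ids;
    ids.tgid = (uint32_t)(pid_tgid >> 32);
    ids.tid  = (uint32_t)pid_tgid;
    return ids;
}

/* Inverse operation, handy for building map keys or test inputs. */
static uint64_t pack_pid_tgid(uint32_t tgid, uint32_t tid)
{
    return ((uint64_t)tgid << 32) | tid;
}
```

This is why the examples on this page write bpf_get_current_pid_tgid() >> 32 to obtain the process ID: the shift discards the thread ID in the lower half.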

Common rejections and how to fix them

"R1 type=scalar expected=map_value"

You passed a raw integer to a function expecting a map value pointer. Fix: call bpf_map_lookup_elem() first and pass the returned pointer. Always NULL-check the return value before dereferencing.

"invalid mem access ‘scalar’"

You’re trying to dereference something the verifier thinks is a number, not a pointer. Fix: make sure you’re casting correctly and that the pointer source is valid (map lookup, context access, or stack address).

"R0 invalid mem access ‘map_value_or_null’"

You used a map lookup result without checking for NULL. Fix: add if (!val) return 0; after every bpf_map_lookup_elem().

"back-edge from insn X to Y"

You have a backward jump (loop) on a kernel older than 5.3, or the verifier can’t prove your loop terminates. Fix: unroll the loop with #pragma unroll, use bpf_loop(), or add an explicit bound the verifier can track.

"BPF program is too large"

The verifier hit the 1M instruction complexity limit. Fix: split into multiple programs connected by tail calls, reduce branching, use bpf_loop() instead of inline loops, or move complex logic to user space.

"invalid access to packet"

You’re reading past the end of a packet without a bounds check. Fix: always check if ((void *)(hdr + 1) > data_end) return XDP_PASS; before accessing each protocol header. The verifier needs these checks at every layer.
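
The layered bounds checks follow the same shape in any C parser. The sketch below applies the pattern in user space to an Ethernet + IPv4 frame. The struct definitions and the function ipv4_protocol() are simplified assumptions for illustration (no VLAN tags, no IP options) and return -1 where an XDP program would return XDP_PASS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-array fields keep both structs free of padding. */
struct eth_hdr  { uint8_t dst[6]; uint8_t src[6]; uint8_t proto[2]; };
struct ipv4_hdr { uint8_t ver_ihl, tos; uint8_t tot_len[2]; uint8_t id[2];
                  uint8_t frag[2]; uint8_t ttl, protocol; uint8_t csum[2];
                  uint8_t saddr[4]; uint8_t daddr[4]; };

/* Returns the IPv4 protocol number, or -1 if the frame is truncated or
 * not IPv4. Each header is bounds-checked before any field is read. */
static int ipv4_protocol(const uint8_t *data, const uint8_t *data_end)
{
    assert(data <= data_end);

    const struct eth_hdr *eth = (const struct eth_hdr *)data;
    if ((const uint8_t *)(eth + 1) > data_end)
        return -1;                      /* Ethernet header incomplete */
    if (eth->proto[0] != 0x08 || eth->proto[1] != 0x00)
        return -1;                      /* EtherType is not IPv4 (0x0800) */

    const struct ipv4_hdr *ip = (const struct ipv4_hdr *)(eth + 1);
    if ((const uint8_t *)(ip + 1) > data_end)
        return -1;                      /* IPv4 header incomplete */
    return ip->protocol;
}
```

Each check compares the end of the header about to be read against data_end, and only then touches its fields. Skipping the second check would be exactly the kind of access the verifier rejects.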

"cannot pass map_value to helper"

Some helpers need a pointer to the map, not a pointer to a value in the map. Fix: pass &my_map (the map itself) rather than the result of a lookup.

"helper call is not allowed in probe"

You called a helper that isn’t available for your program type. For example, bpf_redirect() in a kprobe program. Fix: check the helper availability table for your program type. Use bpftool feature probe to see what’s available on your kernel.

# See which helpers are available for each program type on your kernel
bpftool feature probe | grep -A 5 'eBPF helpers'

# See all available program types
bpftool feature probe | grep 'program_type'

# See map types
bpftool feature probe | grep 'map_type'

Quick reference — 25 essential commands

These commands work on any kldload system immediately after boot. All require root.

Tracing processes

# 1. Trace every process execution system-wide
execsnoop

# Output:
# PCOMM   PID    PPID   RET ARGS
# bash    18401  18400    0 /bin/bash
# ls      18402  18401    0 /usr/bin/ls --color=auto
# grep    18403  18401    0 /usr/bin/grep -i error /var/log/messages

# 2. Trace every file open system-wide
opensnoop

# Output:
# PID    COMM       FD ERR PATH
# 18401  bash        3   0 /etc/profile
# 18402  ls          3   0 /etc/ld.so.cache
# 1842   sshd        4   0 /etc/ssh/sshd_config

# 3. Trace signals sent to processes
bpftrace -e 'tracepoint:signal:signal_generate { printf("%s (pid %d) sent signal %d to pid %d\n", comm, pid, args.sig, args.pid); }'

# 4. Count syscalls by process (top talkers)
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# 5. Trace process creation (fork)
bpftrace -e 'tracepoint:sched:sched_process_fork { printf("fork: %s (pid %d) -> child pid %d\n", comm, pid, args.child_pid); }'

Disk and filesystem

# 6. Block I/O latency histogram
biolatency

# Output:
#      usecs       : count    distribution
#        0 -> 1    : 0        |                                        |
#        2 -> 3    : 0        |                                        |
#        4 -> 7    : 15       |****                                    |
#        8 -> 15   : 42       |*************                           |
#       16 -> 31   : 128      |****************************************|
#       32 -> 63   : 91       |****************************            |
#       64 -> 127  : 23       |*******                                 |
#      128 -> 255  : 4        |*                                       |

# 7. Block I/O by device and process (top-like)
biotop

# 8. Slow filesystem operations (>10ms)
fileslower 10

# Output:
# TIME(s)  COMM           TID    D BYTES    LAT(ms) FILENAME
# 0.250    postgres       1842   R 8192       14.32 base/16384/16385
# 1.102    rsync          2103   W 131072     22.50 backup.tar.gz

# 9. ZFS operations slower than 1ms
zfsslower 1

# 10. Trace disk I/O size distribution
bpftrace -e 'tracepoint:block:block_rq_complete { @bytes = hist(args.nr_sector * 512); }'

Network

# 11. Trace new TCP connections (outbound)
tcpconnect

# Output:
# PID    COMM         IP SADDR            DADDR            DPORT
# 18501  curl         4  10.0.0.5         93.184.216.34    443
# 18502  python3      4  10.0.0.5         10.0.0.10        5432

# 12. Trace TCP sessions with duration and bytes
tcplife

# Output:
# PID   COMM       LADDR           LPORT RADDR           RPORT TX_KB RX_KB MS
# 18501 curl       10.0.0.5        42916 93.184.216.34   443       1    15 230.45
# 18502 python3    10.0.0.5        54112 10.0.0.10       5432      0     0   2.31

# 13. Trace TCP retransmits (sign of network problems)
tcpretrans

# 14. Trace DNS lookups with latency
gethostlatency

# Output:
# TIME      PID    COMM          LATms HOST
# 14:02:31  18501  curl           2.91 example.com
# 14:02:31  18502  python3        0.12 db.internal

# 15. Count transmitted packets per process per second
bpftrace -e 'tracepoint:net:net_dev_xmit { @[comm] = count(); } interval:s:1 { print(@); clear(@); }'

CPU and scheduler

# 16. CPU scheduler queue latency (how long tasks wait to run)
runqlat

# Output:
#      usecs       : count    distribution
#        0 -> 1    : 234      |********                                |
#        2 -> 3    : 1042     |****************************************|
#        4 -> 7    : 823      |*******************************         |
#        8 -> 15   : 412      |***************                         |
#       16 -> 31   : 98       |***                                     |
#       32 -> 63   : 12       |                                        |

# 17. CPU profiling (sample stack traces)
profile -af 30 > /tmp/profile.out

# 18. Show off-CPU time (why processes are blocked)
offcputime 5

# 19. Context switches per process
bpftrace -e 'tracepoint:sched:sched_switch { @[args.prev_comm] = count(); }'

Memory

# 20. Page cache hit/miss ratio
cachestat

# Output:
#     HITS   MISSES  DIRTIES HITRATIO   BUFFERS_MB  CACHED_MB
#     1523       12       34   99.22%          142       3847

# 21. Trace memory allocations (find leaks)
memleak -p 1842

# 22. Page faults by process
bpftrace -e 'software:page-faults:1 { @[comm] = count(); }'

System inspection

# 23. List all loaded eBPF programs
bpftool prog list

# Output:
# 6: cgroup_device  tag a04f...  gpl
#    loaded_at 2026-04-04T10:23:17+0000  uid 0
#    xlated 504B  jited 309B  memlock 4096B  map_ids 2
# 42: tracepoint  name trace_exec  tag b2e9...  gpl
#    loaded_at 2026-04-04T14:01:33+0000  uid 0
#    xlated 1832B  jited 1104B  memlock 4096B  map_ids 15,16

# 24. List all eBPF maps
bpftool map list

# 25. Check kernel eBPF feature support
bpftool feature probe kernel

These 25 commands cover 95% of production debugging. Memorize the first 15 — they become muscle memory after a few incidents. The rest you’ll reach for when the situation calls for it. Every one of them is pre-installed on kldload.

Practical examples on kldload

Trace ZFS internals

# ZFS operations slower than 1 ms (per-operation latency)
zfsslower 1

# Output:
# TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
# 14:23:01 postgres       1842   R 8192    16384      1.23  16385
# 14:23:01 cp             2041   W 131072  0          3.45  backup.tar

# Count ZFS ARC read requests by process (does not distinguish hits from misses)
bpftrace -e 'kprobe:arc_read { @[comm] = count(); }'

# Trace ZFS transaction group syncs
bpftrace -e 'kprobe:txg_sync_thread { printf("txg sync: %s\n", comm); }'

# Trace zpool import/export
bpftrace -e '
kprobe:spa_open { printf("pool open: pid=%d comm=%s\n", pid, comm); }
kretprobe:spa_open { printf("pool open returned: %d\n", retval); }'

# Monitor ZFS scrub I/O
bpftrace -e 'kprobe:dsl_scan_scrub_cb { @scrub_ios = count(); }'

Trace WireGuard

# Packets going through WireGuard tunnel
bpftrace -e 'kprobe:wg_xmit { @tx[comm] = count(); }'
bpftrace -e 'kprobe:wg_receive { @rx = count(); }'

# WireGuard handshake events
bpftrace -e 'kprobe:wg_noise_handshake_create_initiation { printf("WG handshake init: pid=%d\n", pid); }'

Trace the kldload installer

# During a kldload install, trace every file being created
opensnoop -f O_CREAT

# Watch the installer’s process tree unfold
execsnoop

# Trace all disk I/O during install (see which processes and devices are busy)
biotop

# Trace files opened by dnf/debootstrap during package installation
bpftrace -e 'tracepoint:syscalls:sys_enter_openat /comm == "dnf" || comm == "debootstrap"/ {
  printf("%s: %s\n", comm, str(args.filename));
}'

Security auditing

# Track every process execution with full command line (security audit trail)
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("%llu uid=%-5d pid=%-6d ppid=%-6d %s -> %s\n",
        nsecs / 1000000000, uid, pid,
        curtask->real_parent->tgid,
        comm, str(args.filename));
}'

# Output:
# 1712234521 uid=0     pid=18401  ppid=18400  bash -> /usr/bin/ls
# 1712234521 uid=1000  pid=18402  ppid=18401  crond -> /usr/sbin/logrotate
# 1712234522 uid=0     pid=18403  ppid=1      sshd -> /usr/sbin/sshd

# Detect privilege escalation (setuid calls)
bpftrace -e '
tracepoint:syscalls:sys_enter_setuid {
    printf("SETUID: pid=%d comm=%s uid=%d -> target_uid=%d\n",
        pid, comm, uid, args.uid);
}'

# Monitor access to sensitive files
bpftrace -e '
tracepoint:syscalls:sys_enter_openat
/str(args.filename) == "/etc/shadow" ||
 str(args.filename) == "/etc/sudoers"/ {
    printf("SENSITIVE FILE: %s (pid=%d uid=%d) opened %s\n",
        comm, pid, uid, str(args.filename));
}'

# Detect kernel module loads (potential rootkit insertion)
bpftrace -e '
kprobe:do_init_module {
    printf("MODULE LOADED: pid=%d uid=%d comm=%s\n", pid, uid, comm);
    print(kstack);
}'

# Trace container escape indicators (mount namespace changes)
bpftrace -e '
tracepoint:syscalls:sys_enter_mount {
    printf("mount: pid=%d comm=%s source=%s target=%s\n",
        pid, comm, str(args.dev_name), str(args.dir_name));
}'

The security use cases are where eBPF separates itself from everything that came before it. Traditional audit frameworks (auditd) write events to a log file and you grep them afterward. eBPF lets you react in real time — detect a suspicious pattern and alert within microseconds. Tools like Tetragon and Falco build entire runtime security platforms on this foundation. On a kldload system, you have the building blocks to do the same thing with bpftrace one-liners.

Generate flame graphs

# CPU flame graph (what is using CPU time?)
perf record -g -a sleep 30
perf script > /tmp/out.perf

# If FlameGraph tools are installed:
stackcollapse-perf.pl /tmp/out.perf | flamegraph.pl > /tmp/cpu-flame.svg

# Off-CPU flame graph (what is blocking?)
offcputime -f 30 > /tmp/offcpu.out
flamegraph.pl --color=io /tmp/offcpu.out > /tmp/offcpu-flame.svg

# Alternatively, use bpftrace for targeted profiling:
bpftrace -e 'profile:hz:99 { @[kstack] = count(); } interval:s:30 { exit(); }' > /tmp/stacks.out

Kernel requirements

eBPF features were added incrementally across kernel versions. Here’s what shipped when, and what kldload kernels support.

Feature	Minimum kernel	kldload CentOS (5.14)	kldload Debian (6.x)
Basic eBPF (maps, helpers)	3.18	Yes	Yes
kprobe/kretprobe programs	4.1	Yes	Yes
Tracepoint programs	4.7	Yes	Yes
XDP	4.8	Yes	Yes
BPF-to-BPF calls	4.16	Yes	Yes
BTF	4.18	Yes	Yes
Bounded loops	5.3	Yes	Yes
fentry/fexit	5.5	Yes	Yes
BPF LSM	5.7	Yes	Yes
Ring buffer	5.8	Yes	Yes
Bloom filter map	5.16	Yes	Yes
bpf_loop() helper	5.17	Backported	Yes
User ring buffer	6.1	No	Yes
sched_ext (BPF scheduler)	6.12	No	Depends on version

Required kernel config options (all enabled in kldload kernels):

# Core eBPF
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_HAVE_EBPF_JIT=y

# BTF (required for CO-RE and bpftrace struct access)
CONFIG_DEBUG_INFO_BTF=y
CONFIG_DEBUG_INFO_BTF_MODULES=y

# kprobe support
CONFIG_KPROBES=y
CONFIG_KPROBE_EVENTS=y

# Tracepoints
CONFIG_TRACING=y
CONFIG_FTRACE=y

# BPF LSM (for security programs)
CONFIG_BPF_LSM=y

# AF_XDP sockets (used alongside XDP programs)
CONFIG_XDP_SOCKETS=y

# Verify on your kernel:
zcat /proc/config.gz | grep CONFIG_BPF
# or
grep CONFIG_BPF /boot/config-$(uname -r)

Permissions: eBPF requires root or specific capabilities. Most distributions disable unprivileged eBPF by default (kernel.unprivileged_bpf_disabled set to 1 or 2). On kernel 5.8+, which introduced CAP_BPF, non-root users can be granted CAP_BPF + CAP_PERFMON (for tracing) or CAP_BPF + CAP_NET_ADMIN (for networking programs).

# Check BTF availability
ls /sys/kernel/btf/vmlinux && echo "BTF available" || echo "No BTF"

# Check JIT status
cat /proc/sys/net/core/bpf_jit_enable

# Check if unprivileged BPF is disabled (should be 1 or 2)
cat /proc/sys/kernel/unprivileged_bpf_disabled
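
The three possible values of that sysctl mean different things, which a small helper can spell out. This is a sketch for interpreting text already read from the file; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

enum unpriv_bpf {
    UNPRIV_UNKNOWN         = -1,
    UNPRIV_ALLOWED         = 0,  /* unprivileged bpf() calls are permitted */
    UNPRIV_DISABLED_LOCKED = 1,  /* disabled; cannot be re-enabled until reboot */
    UNPRIV_DISABLED_ADMIN  = 2,  /* disabled; an administrator may change it */
};

/* Parse the contents of /proc/sys/kernel/unprivileged_bpf_disabled. */
static enum unpriv_bpf parse_unpriv_bpf(const char *text)
{
    char *end;
    long v;

    assert(text != NULL);
    v = strtol(text, &end, 10);
    if (end == text || v < 0 || v > 2)
        return UNPRIV_UNKNOWN;
    return (enum unpriv_bpf)v;
}

/* Human-readable summary of each setting. */
static const char *describe_unpriv_bpf(enum unpriv_bpf mode)
{
    switch (mode) {
    case UNPRIV_ALLOWED:         return "unprivileged eBPF allowed";
    case UNPRIV_DISABLED_LOCKED: return "disabled until reboot";
    case UNPRIV_DISABLED_ADMIN:  return "disabled, admin can change";
    default:                     return "unrecognized value";
    }
}
```

Passing the file contents (for example "2\n") to parse_unpriv_bpf() gives the mode, and describe_unpriv_bpf() turns it into a message suitable for a preflight check in a loader.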

Writing custom eBPF programs with libbpf

When bpftrace one-liners are not enough and you need a production-grade tool that ships as a single binary, use libbpf with CO-RE. Here is the complete workflow: write the eBPF kernel program, write the userspace loader, compile, and run.

Step 1: The kernel-side eBPF program

/* trace_open.bpf.c — traces every openat() syscall */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Event structure shared between kernel and user space */
struct event {
    u32 pid;
    u32 uid;
    char comm[16];
    char filename[256];
};

/* Ring buffer for streaming events to user space */
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);  /* 256 KB */
} events SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_openat")
int trace_openat(struct trace_event_raw_sys_enter *ctx) {
    struct event *e;

    /* Reserve space in the ring buffer */
    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e) return 0;  /* ring buffer full, drop event */

    /* Fill in event fields */
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->uid = bpf_get_current_uid_gid() & 0xffffffff;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    bpf_probe_read_user_str(&e->filename, sizeof(e->filename),
                            (const char *)ctx->args[1]);

    /* Submit event to user space */
    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

Step 2: The userspace loader

/* trace_open.c — loads and manages the eBPF program */
#include <stdio.h>
#include <signal.h>
#include <bpf/libbpf.h>
#include "trace_open.skel.h"  /* auto-generated by bpftool gen skeleton */

struct event {
    __u32 pid;
    __u32 uid;
    char comm[16];
    char filename[256];
};

/* sig_atomic_t is the type that is safe to write from a signal handler */
static volatile sig_atomic_t running = 1;
static void sig_handler(int sig) { (void)sig; running = 0; }

static int handle_event(void *ctx, void *data, size_t len) {
    struct event *e = data;
    printf("%-8u %-8u %-16s %s\n", e->pid, e->uid, e->comm, e->filename);
    return 0;
}

int main(void) {
    struct trace_open_bpf *skel;
    struct ring_buffer *rb;

    signal(SIGINT, sig_handler);

    /* Open, load, and verify the eBPF program */
    skel = trace_open_bpf__open_and_load();
    if (!skel) { fprintf(stderr, "Failed to load BPF program\n"); return 1; }

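    /* Attach to the tracepoint; a non-zero return means the attach failed */
    if (trace_open_bpf__attach(skel)) {
        fprintf(stderr, "Failed to attach BPF program\n");
        trace_open_bpf__destroy(skel);
        return 1;
    }

    /* Set up ring buffer polling */
    rb = ring_buffer__new(bpf_map__fd(skel->maps.events), handle_event, NULL, NULL);
    if (!rb) {
        fprintf(stderr, "Failed to create ring buffer\n");
        trace_open_bpf__destroy(skel);
        return 1;
    }

    printf("%-8s %-8s %-16s %s\n", "PID", "UID", "COMM", "FILENAME");
    while (running)
        ring_buffer__poll(rb, 100 /* timeout ms */);

    /* Cleanup */
    ring_buffer__free(rb);
    trace_open_bpf__destroy(skel);
    return 0;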
}

Step 3: Compile and run

# 1. Generate vmlinux.h (once per kernel version)
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

# 2. Compile the eBPF program to BPF bytecode
clang -g -O2 -target bpf -D__TARGET_ARCH_x86 \
    -c trace_open.bpf.c -o trace_open.bpf.o

# 3. Generate the skeleton header (auto-creates open/load/attach functions)
bpftool gen skeleton trace_open.bpf.o > trace_open.skel.h

# 4. Compile the userspace loader
clang -g -O2 -Wall trace_open.c -lbpf -lelf -lz -o trace_open

# 5. Run it
sudo ./trace_open
PID      UID      COMM             FILENAME
18401    1000     bash             /etc/profile
18402    0        sshd             /etc/ssh/sshd_config
18403    33       nginx            /var/log/nginx/access.log
18404    0        postgres         /var/lib/pgsql/data/base/16384/16385
^C

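The resulting trace_open binary embeds the compiled eBPF bytecode and runs on other kernels without recompilation, regardless of the kernel version it was compiled on, provided the target kernel exposes BTF and supports the features the program uses (BPF_MAP_TYPE_RINGBUF requires Linux 5.8 or later). As built above it still links dynamically against libbpf, libelf, and zlib; link them statically if you need a single-file deployment. This is CO-RE in action: compile once, ship everywhere.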
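
Because the binary depends on the target kernel exposing BTF, a loader can check for it up front and fail with a clear message instead of a generic load error. The sketch below is one way to do that, assuming the same /sys/kernel/btf/vmlinux path used to generate vmlinux.h. The helper names `file_readable` and `kernel_btf_available` are hypothetical, and the check says nothing about whether the kernel supports ring buffers or any other feature.

```c
#include <unistd.h>

/* Same path that bpftool reads when generating vmlinux.h. */
#define VMLINUX_BTF_PATH "/sys/kernel/btf/vmlinux"

/* Hypothetical helper: returns 1 if the calling process can read
 * the file at path, 0 otherwise (missing file, no permission, ...). */
static int file_readable(const char *path)
{
    return access(path, R_OK) == 0;
}

/* Hypothetical helper: returns 1 if the running kernel exposes its
 * BTF type information at the standard location, 0 otherwise. */
static int kernel_btf_available(void)
{
    return file_readable(VMLINUX_BTF_PATH);
}
```

A loader could call kernel_btf_available() before trace_open_bpf__open_and_load() and print something like "kernel BTF not found" when it returns 0.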

This is the pattern libbpf-based production tools use: kernel-side C compiled to BPF, skeleton header generated by bpftool, userspace loader linked against libbpf. libbpf-tools and Falco's modern eBPF probe follow this workflow. Go-based projects such as Cilium and Tetragon keep the kernel-side C but load it through the cilium/ebpf library instead of a libbpf skeleton. The skeleton header does the heavy lifting: it generates type-safe C functions for opening, loading, attaching, and accessing maps, so you never call the raw bpf() syscall yourself.

Writing bpftrace scripts

bpftrace scripts follow a consistent pattern: probe /filter/ { action }. A script can combine multiple probes, built-in variables, maps, and printf-style output. Here’s a complete script demonstrating the key features.

#!/usr/bin/env bpftrace
/*
 * trace-io-latency.bt: track block I/O latency per device and
 * print a summary every 5 seconds.
 */

BEGIN {
  printf("Tracing block I/O latency... Hit Ctrl+C to stop.\n");
}

/* Record timestamp when I/O is issued */
tracepoint:block:block_rq_issue {
  @start[args.dev, args.sector] = nsecs;
}

/* Calculate latency when I/O completes */
tracepoint:block:block_rq_complete
/@start[args.dev, args.sector]/ {
  $lat_us = (nsecs - @start[args.dev, args.sector]) / 1000;

  /* Per-device latency histogram */
  @lat_hist[args.dev] = hist($lat_us);

  /* Count I/Os per device */
  @io_count[args.dev] = count();

  /* Track max latency per device */
  @max_lat[args.dev] = max($lat_us);

  delete(@start[args.dev, args.sector]);
}

/* Print interval summary */
interval:s:5 {
  printf("\n--- I/O Summary (last 5s) ---\n");
  print(@io_count);
  print(@max_lat);
  clear(@io_count);
  clear(@max_lat);
}

END {
  printf("\n--- Final Latency Histograms ---\n");
  print(@lat_hist);
  clear(@start);
}

Run it:

chmod +x trace-io-latency.bt
sudo ./trace-io-latency.bt

Key bpftrace built-in variables:

Variable	Meaning
`pid`	Process ID (thread group ID)
`tid`	Thread ID
`uid`	User ID
`comm`	Process name (up to 15 characters; the 16-byte buffer includes the NUL terminator)
`nsecs`	Nanosecond timestamp (monotonic)
`kstack`	Kernel stack trace
`ustack`	User-space stack trace
`args`	Tracepoint arguments struct
`retval`	Return value (kretprobe, uretprobe, fexit)
`curtask`	Pointer to current `struct task_struct`
`cpu`	Current CPU number

Key bpftrace aggregation functions:

Function	What it produces
`count()`	Event count
`sum(x)`	Running sum
`avg(x)`	Average value
`min(x)`	Minimum value seen
`max(x)`	Maximum value seen
`hist(x)`	Power-of-2 histogram
`lhist(x, min, max, step)`	Linear histogram
`stats(x)`	Count, average, and total
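
To make "power-of-2 histogram" concrete: each value is counted in a bucket whose range doubles with every step, so a handful of buckets covers everything from microseconds to seconds. The sketch below shows that bucketing in plain C. The function names `log2_bucket` and `bucket_low` are hypothetical, and this is a simplified illustration of the idea rather than bpftrace's own implementation, whose bucket layout (for zero or negative values, for instance) may differ in detail.

```c
#include <stdint.h>

/* Hypothetical helper: index of the power-of-2 bucket that holds v.
 * Bucket 0 holds 0, bucket 1 holds 1, bucket 2 holds [2, 4),
 * bucket 3 holds [4, 8), and so on. */
static unsigned log2_bucket(uint64_t v)
{
    unsigned bucket = 0;

    /* Count how many bits are needed to represent v. */
    while (v) {
        bucket++;
        v >>= 1;
    }
    return bucket;
}

/* Hypothetical helper: inclusive lower bound of a bucket's range. */
static uint64_t bucket_low(unsigned bucket)
{
    return bucket == 0 ? 0 : (uint64_t)1 << (bucket - 1);
}
```

Under this scheme a latency of 1000 microseconds lands in the bucket covering [512, 1024), which is why hist() output stays compact even when values span several orders of magnitude.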