eBPF — The Definitive Reference
eBPF is the most significant change to Linux observability and networking since the kernel itself. It lets you attach small, verified programs to any kernel event — syscalls, function entries, network packets, scheduler decisions, security hooks — and get answers about your running system without modifying the kernel, loading kernel modules, or rebooting. It is safe enough for production, fast enough for line-rate networking, and expressive enough to replace entire kernel subsystems.
This page is a comprehensive reference. It covers the eBPF virtual machine, every program type, the verifier, map types, the complete toolchain, and practical commands you can run today on any kldload system.
What eBPF actually is
Before eBPF, if you wanted to understand what your kernel was doing — why a process was slow,
where packets were being dropped, what was causing latency spikes at 3 AM — you had three choices:
add printk statements to the kernel source and recompile, load a kernel module that hooks
into internal functions (risking a panic if you get it wrong), or use coarse tools like strace
that slow your application to a crawl. None of these are safe for production. None of them let you
ask arbitrary questions about a running system without risk.
eBPF changes the equation entirely. It provides a sandboxed virtual machine inside the kernel that runs small programs you write in user space. The kernel’s verifier mathematically proves your program is safe before it runs: it cannot crash the kernel, it cannot access arbitrary memory, it cannot loop forever, it cannot corrupt data structures. Once verified, the program is JIT-compiled to native machine code and runs at near-native speed. You attach it to a kernel event — a function entry, a tracepoint, a network hook, a security check — and it fires every time that event occurs.
The paradigm shift
Traditional kernel instrumentation is static: you build tracing into the kernel at compile time, or you load a module that hooks specific functions. If you want to trace something new, you need to recompile or write a new module. eBPF is dynamic: you attach programs at runtime to any of the thousands of hooks the kernel exposes. No reboot and no recompile, and the verifier rejects unsafe programs before they ever run.
Safe
The verifier proves your program terminates, doesn’t access invalid memory, and doesn’t corrupt kernel state. A verified eBPF program cannot crash the kernel. This is what makes eBPF production-safe — you can instrument a system serving millions of requests per second.
Dynamic
Attach programs at runtime to any of 100,000+ kernel functions, tracepoints, network interfaces, cgroup hooks, or security checkpoints. Detach them when you’re done. No reboot, no downtime, no kernel rebuild.
Fast
eBPF programs are JIT-compiled to native x86_64/ARM64 instructions. Overhead is typically under 100 nanoseconds per event. XDP programs process packets at line rate — millions of packets per second — before they even reach the network stack.
Expressive
eBPF programs can read kernel data structures, aggregate statistics into maps, communicate with user space, modify packets, enforce security policies, and even replace kernel TCP congestion algorithms — all from user space.
A brief history
BPF (Berkeley Packet Filter) was created in 1992 for tcpdump — a tiny virtual machine
that filtered network packets in the kernel instead of copying every packet to user space. It had two
32-bit registers, no loops, and could only say "accept" or "reject" a packet. In 2014, Alexei Starovoitov
rewrote BPF from scratch for Linux 3.18: 64-bit registers, a proper instruction set, maps for shared state,
helper functions for kernel interaction, and a verifier that proved programs safe. The "e" in eBPF stands
for "extended," but the original BPF and eBPF are so different that the "e" is essentially meaningless —
eBPF is a new system that happens to share a name. By Linux 5.x, eBPF had grown far beyond packet filtering
into a general-purpose kernel programming framework used by every major tech company in production.
The eBPF machine
The eBPF virtual machine is a register-based RISC machine with a deliberately constrained design. Understanding the machine model explains why eBPF programs have the constraints they do — the 512-byte stack limit, the bounded loops, the helper function calling convention.
Registers
eBPF has 11 registers, all 64-bit. The calling convention mirrors x86_64:
| Register | Purpose |
|---|---|
| r0 | Return value. Function return value and eBPF program exit code. For XDP programs: XDP_PASS, XDP_DROP, etc. For kprobes: ignored. For helper calls: the return value of the helper. |
| r1–r5 | Function arguments. Used to pass arguments to helper functions and at program entry to pass the context pointer (r1). r1–r5 are caller-saved: helpers may clobber them. |
| r6–r9 | Callee-saved. Preserved across helper function calls. Use these to hold values you need after a helper returns. |
| r10 | Frame pointer (read-only). Points to the top of the 512-byte stack; the stack grows down, so locals are addressed at negative offsets from r10. You cannot write to r10. |
At program entry, r1 contains a pointer to the context — the data structure
specific to the program type. For a kprobe, it’s a struct pt_regs * (CPU register state).
For XDP, it’s a struct xdp_md * (packet metadata). For tracepoints, it’s the tracepoint
arguments struct. Your program reads from the context to inspect the event that triggered it.
Instruction set
eBPF uses fixed-width 64-bit instructions (8 bytes each, with one exception: 128-bit load-immediate for 64-bit constants). The instruction set includes:
ALU operations
add, sub, mul, div, mod,
or, and, xor, lsh, rsh,
arsh, neg. Both 32-bit and 64-bit variants. The second operand is either a register or a 32-bit immediate.
Memory access
ldx (load from memory), stx (store to memory), st (store immediate).
Supports 8, 16, 32, and 64-bit widths. All memory access must be to the stack, map values, or
the context — the verifier rejects arbitrary pointer arithmetic.
Branch instructions
jeq, jne, jgt, jge, jlt,
jle, jsgt, jsge, jslt, jsle,
jset (bitwise AND non-zero). Plus unconditional ja (jump always).
Before kernel 5.3 the verifier rejected every backward jump, so loops had to be unrolled; since 5.3 it accepts loops it can prove are bounded.
Function calls
call invokes a kernel helper function by ID. exit terminates the program.
BPF-to-BPF function calls (subprograms) are supported since kernel 4.16, enabling code reuse
without duplicating instructions.
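To make the 8-byte encoding concrete, here is a minimal sketch that hand-assembles a two-instruction program (`r0 = 2; exit`). The struct mirrors the field layout of the kernel's `struct bpf_insn`, but the names `ebpf_insn`, `mov64_imm`, `exit_insn`, and `build_return_2` are illustrative, and the opcode constants are written out by hand rather than taken from `<linux/bpf.h>`. The 4-bit `uint8_t` bit-fields rely on GCC/Clang behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Same field layout as the kernel's struct bpf_insn: 8 bytes in total. */
struct ebpf_insn {
    uint8_t code;        /* opcode: instruction class | operation | source */
    uint8_t dst_reg : 4; /* destination register */
    uint8_t src_reg : 4; /* source register */
    int16_t off;         /* signed offset for jumps and memory access */
    int32_t imm;         /* signed 32-bit immediate */
};

enum {
    OP_MOV64_IMM = 0xb7, /* BPF_ALU64 | BPF_MOV | BPF_K */
    OP_EXIT      = 0x95  /* BPF_JMP | BPF_EXIT */
};

/* dst = imm (64-bit move of a sign-extended 32-bit immediate). */
struct ebpf_insn mov64_imm(uint8_t dst, int32_t imm)
{
    assert(dst <= 9); /* r10 is read-only, so valid destinations are r0..r9 */
    struct ebpf_insn insn = { .code = OP_MOV64_IMM, .dst_reg = dst, .imm = imm };
    return insn;
}

/* Terminate the program; r0 holds the return value. */
struct ebpf_insn exit_insn(void)
{
    struct ebpf_insn insn = { .code = OP_EXIT };
    return insn;
}

/* Fill prog[0..1] with "r0 = 2; exit" and return the instruction count. */
int build_return_2(struct ebpf_insn prog[2])
{
    prog[0] = mov64_imm(0, 2);
    prog[1] = exit_insn();
    return 2;
}
```

Loaded as an XDP program, a return value of 2 corresponds to XDP_PASS. Real programs are produced by clang rather than assembled by hand; the sketch only shows what the compiler emits at the byte level.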
The 512-byte stack
Every eBPF program gets a 512-byte stack. That’s it. No heap, no malloc, no dynamic allocation. This is intentional: the kernel stack is already limited (typically 8KB–16KB), and eBPF programs run inside kernel code paths. If your program needs more working space, use a map (per-CPU array maps are the common pattern for scratch buffers). BPF-to-BPF calls can nest up to 8 frames deep, but they do not multiply the budget: the verifier adds up the stack use of every frame in a call chain and rejects the program if the combined total exceeds 512 bytes.
Maps — shared state between kernel and user space
Maps are the mechanism for persistent state in eBPF. A map is a key-value data structure
that lives in kernel memory and can be accessed by both eBPF programs (from the kernel side) and
user-space applications (via the bpf() syscall or file descriptors). Maps survive
across eBPF program invocations — they are created separately and can outlive the programs that use them.
Maps are how eBPF programs aggregate data (count events, build histograms), share state between multiple eBPF programs, and communicate results to user space. Without maps, eBPF programs could only return a single integer. With maps, they can build complex data structures accessible from both sides of the kernel/user boundary.
The verifier
The verifier is what makes eBPF safe. Every eBPF program passes through the verifier before it can execute. The verifier performs static analysis of every possible execution path:
Termination
The program must terminate. Before kernel 5.3, this meant no backward jumps at all (no loops). Since 5.3, bounded loops are allowed — the verifier must be able to prove the loop has a fixed upper bound on iterations.
Memory safety
Every memory access must be provably safe. The verifier tracks the type and bounds of every register. Accessing beyond the stack, reading past the end of a packet, or dereferencing a null pointer all result in rejection.
No uninitialized reads
Every register and stack slot must be written before it is read. The verifier tracks initialization state per-register and per-stack-byte across all branches.
Valid helper calls
Only helpers permitted for the program type can be called. Arguments must have the correct type. Return values must be checked before use (e.g., map lookups can return NULL and must be NULL-checked).
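The termination and memory-safety rules shape how loops get written. Below is a minimal user-space sketch of the usual idiom, in plain C rather than as a real eBPF program: a loop with a compile-time iteration bound and a bounds check before every access. `MAX_SCAN` and `bounded_strlen` are illustrative names, and whether a particular kernel's verifier accepts an equivalent eBPF loop depends on the kernel version.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SCAN 64 /* constant upper bound on loop iterations */

/*
 * Length of the NUL-terminated string in buf, examining at most
 * min(cap, MAX_SCAN) bytes. Returns -1 if no terminator is found in range.
 */
int bounded_strlen(const char *buf, size_t cap)
{
    assert(buf != NULL);

    for (int i = 0; i < MAX_SCAN; i++) { /* termination: fixed bound */
        if ((size_t)i >= cap)            /* memory safety: check before access */
            return -1;
        if (buf[i] == '\0')
            return i;
    }
    return -1; /* bound reached without finding a terminator */
}
```

The loop never runs more than `MAX_SCAN` times and never reads past `cap`, which are the two properties the verifier has to be able to prove for the eBPF equivalent.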
JIT compilation
After the verifier approves a program, the kernel’s JIT compiler translates the eBPF bytecode into native machine instructions for the host architecture (x86_64, ARM64, etc.). JIT-compiled eBPF runs at near-native speed — there is no interpreter overhead. The JIT is enabled by default on all modern kernels. You can verify it:
# Check JIT status (1 = enabled)
cat /proc/sys/net/core/bpf_jit_enable
1
# Enable if disabled
echo 1 > /proc/sys/net/core/bpf_jit_enable
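The same check can be made from a program. A minimal sketch in C, assuming the sysctl path shown above; `jit_mode_name` and `read_jit_enable` are illustrative names. Besides 0 and 1, the kernel also accepts 2, which enables the JIT and writes debug output to the kernel log.

```c
#include <assert.h>
#include <stdio.h>

/* Map the integer stored in bpf_jit_enable to a human-readable mode. */
const char *jit_mode_name(int value)
{
    switch (value) {
    case 0:  return "disabled (interpreter)";
    case 1:  return "enabled";
    case 2:  return "enabled with debug output";
    default: return "unknown";
    }
}

/* Read the sysctl file; returns its value, or -1 if it cannot be read. */
int read_jit_enable(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    int value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

A caller would pass `/proc/sys/net/core/bpf_jit_enable` to `read_jit_enable` and print `jit_mode_name` of the result; a return of -1 means the file was missing or unreadable.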
Program types
An eBPF program’s type determines what it can do: what context it receives, which helpers it can call, where it can attach, and what its return value means. There are over 30 program types in modern kernels. These are the ones that matter for infrastructure work.
kprobe / kretprobe
Attach to kernel function entry (kprobe) or return (kretprobe). This is the most
flexible attach point: almost any of the tens of thousands of non-inlined kernel functions can be probed
(functions that were inlined or are on the kprobe blacklist cannot). The context
is struct pt_regs * containing the CPU register state, from which you can extract
function arguments. kretprobes let you capture the return value.
When to use: tracing internal kernel behavior that isn’t exposed via tracepoints. Debugging specific kernel functions. Understanding code paths during development. Caveat: kprobes attach to internal kernel functions that can change between kernel versions. Your probe may break on a kernel upgrade. Use tracepoints when available — they are stable ABI.
# Trace every call to tcp_connect with the destination address
bpftrace -e 'kprobe:tcp_connect {
  $sk = (struct sock *)arg0;
  $dport = $sk->__sk_common.skc_dport;
  // skc_dport is in network byte order; swap the two bytes for display
  printf("connect to %s:%d\n",
    ntop($sk->__sk_common.skc_daddr),
    (($dport & 0xff) << 8) | ($dport >> 8));
}'
# Trace vfs_read return values (bytes read)
bpftrace -e 'kretprobe:vfs_read { @bytes = hist(retval); }'
Tracepoints
Attach to stable, well-defined kernel instrumentation points. Tracepoints are placed by kernel developers at important locations in the code and have a stable ABI — their arguments don’t change between kernel versions (within a major version). They are the preferred attach point for production tracing.
When to use: production tracing, syscall monitoring, scheduler analysis, network event tracking. Always prefer tracepoints over kprobes when both are available.
# List all available tracepoints
bpftrace -l 'tracepoint:*' | head -20
tracepoint:syscalls:sys_enter_read
tracepoint:syscalls:sys_exit_read
tracepoint:syscalls:sys_enter_write
tracepoint:syscalls:sys_exit_write
tracepoint:sched:sched_switch
tracepoint:sched:sched_wakeup
tracepoint:net:net_dev_xmit
tracepoint:block:block_rq_issue
tracepoint:block:block_rq_complete
...
# Trace context switches: outgoing task -> incoming task
bpftrace -e 'tracepoint:sched:sched_switch {
  printf("%-16s (pid %d) -> %-16s (pid %d)\n",
    args.prev_comm, args.prev_pid,
    args.next_comm, args.next_pid);
}'
Raw tracepoints
Like tracepoints but with raw, unprocessed arguments. Regular tracepoints copy arguments into a stable struct, which adds overhead. Raw tracepoints pass the original kernel pointers directly, saving the copy but requiring you to read struct fields yourself with BTF. Lower overhead, less stable across versions.
When to use: high-frequency tracepoints where the copy overhead of regular
tracepoints is measurable (e.g., sched_switch fires millions of times per second
on a busy system).
fentry / fexit
Modern replacement for kprobe/kretprobe. Introduced in kernel 5.5, fentry/fexit attach to kernel function entry and exit with lower overhead than kprobes (no breakpoint trap — the program is called directly via a trampoline). fexit has a major advantage over kretprobe: it receives both the function arguments and the return value, so you can correlate input with output in a single program.
When to use: any situation where you would use kprobe/kretprobe on kernel 5.5+. fentry/fexit is strictly better — lower overhead, type-safe arguments via BTF, and fexit gives you args + return value together.
# fentry/fexit example (kernel 5.5+)
# bpftrace exposes these probes as kfunc/kretfunc (named fentry/fexit in newer releases)
# With libbpf, the C skeleton looks like:
#
# SEC("fentry/tcp_connect")
# int BPF_PROG(trace_tcp_connect, struct sock *sk) {
# // sk is typed — no need to cast from pt_regs
# return 0;
# }
#
# SEC("fexit/tcp_sendmsg")
# int BPF_PROG(trace_tcp_sendmsg_exit, struct sock *sk,
# struct msghdr *msg, size_t size, int ret) {
# // Both arguments AND return value available
# return 0;
# }
XDP (eXpress Data Path)
Process packets at the earliest possible point — before the kernel network stack touches them.
XDP programs run in the NIC driver (or in a generic hook for drivers that don’t support native XDP).
They receive raw packet data and can XDP_PASS (continue to the stack), XDP_DROP
(discard), XDP_TX (bounce back out the same interface), or XDP_REDIRECT
(send to a different interface, CPU, or AF_XDP socket).
When to use: DDoS mitigation, load balancing, packet filtering at line rate,
high-performance networking. XDP can process millions of packets per second per core
because it runs before any socket buffer allocation, before any protocol processing, before the
packet even has an sk_buff.
# Simple XDP program to drop all UDP traffic on port 9999
# Save as drop_udp.c (headers and license line omitted here), then compile with:
#   clang -O2 -g -target bpf -c drop_udp.c -o drop_udp.o
#
# SEC("xdp")
# int drop_udp_9999(struct xdp_md *ctx) {
# void *data = (void *)(long)ctx->data;
# void *data_end = (void *)(long)ctx->data_end;
# struct ethhdr *eth = data;
# if ((void *)(eth + 1) > data_end) return XDP_PASS;
# if (eth->h_proto != htons(ETH_P_IP)) return XDP_PASS;
# struct iphdr *ip = (void *)(eth + 1);
# if ((void *)(ip + 1) > data_end) return XDP_PASS;
# if (ip->protocol != IPPROTO_UDP) return XDP_PASS;
# struct udphdr *udp = (void *)ip + ip->ihl * 4;
# if ((void *)(udp + 1) > data_end) return XDP_PASS;
# if (udp->dest == htons(9999)) return XDP_DROP;
# return XDP_PASS;
# }
# Attach an XDP program to eth0
ip link set dev eth0 xdpgeneric obj drop_udp.o sec xdp
# View XDP program attached to interfaces
ip link show dev eth0 | grep xdp
# Remove XDP program
ip link set dev eth0 xdpgeneric off
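The bounds-check-then-read pattern in the XDP program can be exercised in user space before touching a NIC. The sketch below is a plain C model of the same decision logic, not an eBPF program: it takes a byte buffer and a length instead of `struct xdp_md`, reads header fields by offset, and additionally rejects an IPv4 header length below 20 bytes. `classify_packet`, `be16`, and the constant names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same numeric values as XDP_DROP and XDP_PASS. */
enum verdict { VERDICT_DROP = 1, VERDICT_PASS = 2 };

#define ETH_HDR_LEN    14
#define ETHERTYPE_IPV4 0x0800
#define PROTO_UDP      17
#define BLOCKED_PORT   9999

/* Read a big-endian 16-bit field. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

enum verdict classify_packet(const uint8_t *data, size_t len)
{
    assert(data != NULL);

    /* Ethernet header must be fully present before any field is read. */
    if (len < ETH_HDR_LEN)
        return VERDICT_PASS;
    if (be16(data + 12) != ETHERTYPE_IPV4)
        return VERDICT_PASS;

    /* Minimal IPv4 header is 20 bytes. */
    const uint8_t *ip = data + ETH_HDR_LEN;
    size_t ip_room = len - ETH_HDR_LEN;
    if (ip_room < 20)
        return VERDICT_PASS;
    if (ip[9] != PROTO_UDP)
        return VERDICT_PASS;

    /* IHL counts 32-bit words and exceeds 5 when IP options are present. */
    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;
    if (ihl < 20 || ip_room < ihl + 8) /* 8 = UDP header length */
        return VERDICT_PASS;

    const uint8_t *udp = ip + ihl;
    return be16(udp + 2) == BLOCKED_PORT ? VERDICT_DROP : VERDICT_PASS;
}
```

Every read is preceded by a length check against the remaining buffer, which is the property the verifier enforces on the real program through the `data_end` comparisons.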
TC (Traffic Control / cls_bpf)
Attach eBPF programs to the Linux traffic control layer. TC programs run on both
ingress and egress, after the kernel has created an sk_buff (unlike XDP which
runs before). This means TC programs can access full socket and connection information, modify
packet headers, redirect between interfaces, and apply traffic shaping.
When to use: Kubernetes networking (Cilium uses TC extensively), container network policies, NAT, packet mangling on egress, anything that needs socket context. TC is the workhorse of eBPF-based networking when you need more context than XDP provides.
# Attach a TC eBPF program to ingress of eth0
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress bpf da obj my_filter.o sec tc_ingress
# List TC programs on an interface
tc filter show dev eth0 ingress
# Remove
tc filter del dev eth0 ingress
cgroup programs
Attach to cgroup hooks to control network behavior, device access, and sysctl values per container or process group. Container runtimes and CNI plugins build on these hooks for connection-level policy and service load balancing, device allowlisting in containers, and per-cgroup sysctl overrides.
cgroup/sock (BPF_CGROUP_INET_SOCK_CREATE)
Fires when a socket is created. Can allow or deny socket creation per cgroup. Used to prevent containers from creating raw sockets or sockets on disallowed address families.
cgroup/connect4, cgroup/connect6
Fires on connect(). Can rewrite the destination address/port — this is how
transparent service mesh proxying works. The application connects to IP A, but the eBPF program
silently redirects to IP B.
cgroup/bind4, cgroup/bind6
Fires on bind(). Can rewrite the bind address or reject it. Used for port-level
access control per cgroup.
cgroup/sysctl
Intercepts sysctl reads/writes for processes in the cgroup. Allows per-container sysctl overrides without giving the container actual sysctl access.
cgroup/device
Controls which device files (/dev/*) processes in the cgroup can access.
Replaces the legacy device cgroup controller with programmable logic.
LSM (Linux Security Modules)
Attach eBPF programs to LSM hooks — the same hooks used by SELinux and AppArmor. BPF LSM programs (kernel 5.7+) can implement custom security policies without writing a kernel module. They can allow, deny, or audit any operation that goes through the LSM framework: file access, socket operations, process creation, module loading, mount operations, and hundreds more.
When to use: custom security policies that go beyond what SELinux/AppArmor profiles can express. Runtime security monitoring (detecting suspicious behavior patterns). Container security enforcement. Audit logging of security-sensitive operations.
struct_ops
Replace kernel subsystem implementations with eBPF programs. struct_ops lets you implement an entire kernel operations struct (vtable) in eBPF. The most important use case: TCP congestion control algorithms. You can write a custom congestion control algorithm in eBPF and load it at runtime without recompiling the kernel.
When to use: custom TCP congestion control (e.g., data center-specific algorithms), custom HID drivers, scheduler extensions (sched_ext in kernel 6.12+). struct_ops is the mechanism that makes the kernel truly programmable — not just observable, but replaceable.
uprobe / uretprobe
Attach to any function in any userspace binary. uprobes work the same way as kprobes but for userspace: the kernel inserts a breakpoint at the function entry point in the target process. When the function is called, the eBPF program fires. uretprobes capture the return value. This works on any ELF binary — compiled C, Go, Rust, even Python or Node.js native extensions.
When to use: tracing application-level behavior without modifying the application. Measuring latency of specific library functions. Debugging third-party binaries. Tracing TLS/SSL handshakes by attaching to OpenSSL functions.
# Histogram of malloc() request sizes across every process using this libc (add a /pid == 1234/ filter to limit it to one process)
bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc { @bytes = hist(arg0); }'
# Trace OpenSSL handshakes (see what's doing TLS)
bpftrace -e 'uprobe:/usr/lib64/libssl.so:SSL_do_handshake { printf("TLS handshake: pid=%d comm=%s\n", pid, comm); }'
# Trace readline in bash (see every command typed)
bpftrace -e 'uretprobe:/bin/bash:readline { printf("cmd: %s\n", str(retval)); }'
USDT (Userspace Statically Defined Tracing)
Pre-defined probe points embedded in userspace applications by their developers. Think of USDT as tracepoints for userspace. Applications like PostgreSQL, MySQL, Node.js, Python, and the JVM include USDT probes at important points: query start, query complete, GC start, GC end, connection accept, etc. These are more stable than uprobes because the probe location and arguments are declared by the application developer.
# List USDT probes in PostgreSQL
bpftrace -l 'usdt:/usr/bin/postgres:*'
usdt:/usr/bin/postgres:postgresql:query__start
usdt:/usr/bin/postgres:postgresql:query__done
usdt:/usr/bin/postgres:postgresql:transaction__start
usdt:/usr/bin/postgres:postgresql:transaction__commit
usdt:/usr/bin/postgres:postgresql:transaction__abort
# Trace PostgreSQL query execution with latency
bpftrace -e '
usdt:/usr/bin/postgres:postgresql:query__start { @start[tid] = nsecs; @query[tid] = str(arg0); }
usdt:/usr/bin/postgres:postgresql:query__done /@start[tid]/ {
  // bpftrace arithmetic is integer-only, so report whole microseconds
  printf("%-6d %8d us %s\n", pid, (nsecs - @start[tid]) / 1000, @query[tid]);
  delete(@start[tid]); delete(@query[tid]);
}'
# List USDT probes in Python
bpftrace -l 'usdt:/usr/bin/python3:*'
# List USDT probes in Node.js
bpftrace -l 'usdt:/usr/bin/node:*'
perf_event
Attach to hardware and software performance counters. CPU cycles, cache misses, branch mispredictions, context switches, page faults. This is the foundation of CPU profiling and flame graph generation. Typically used at a sampling frequency (e.g., 99 Hz) to capture stack traces without measurable overhead.
# Sample kernel stack traces at 99 Hz for flame graph generation
bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'
# Sample both kernel and userspace stacks
bpftrace -e 'profile:hz:99 { @[kstack, ustack, comm] = count(); }'
# Trace hardware cache misses
bpftrace -e 'hardware:cache-misses:1000 { @[kstack] = count(); }'
Socket filter
The original BPF use case: filter packets on a socket. Attach to a raw or packet socket
to receive only packets that match your filter. This is what tcpdump uses under the hood.
Modern eBPF socket filters can do much more than classic BPF — they can access maps, call helpers,
and make complex decisions.
sk_msg / sk_skb
Intercept and redirect messages between sockets. sk_msg programs attach to a sockmap or sockhash and can inspect, modify, or redirect data flowing between sockets. This enables kernel-level proxying: data from socket A is redirected to socket B without ever reaching user space. Cilium uses this for service mesh acceleration — bypassing the TCP/IP stack entirely for pod-to-pod communication on the same node.
Attach points
An attach point is the specific location in the kernel where your eBPF program runs. Understanding what attach points exist and how to find them is the key to using eBPF effectively.
Finding attach points
# Count ALL available attach points (listing them outputs a lot)
bpftrace -l | wc -l
247831
# List all tracepoints
bpftrace -l 'tracepoint:*' | wc -l
2194
# List syscall tracepoints (entry and exit for every syscall)
bpftrace -l 'tracepoint:syscalls:*' | head -20
tracepoint:syscalls:sys_enter_accept
tracepoint:syscalls:sys_enter_accept4
tracepoint:syscalls:sys_enter_access
tracepoint:syscalls:sys_enter_acct
tracepoint:syscalls:sys_enter_add_key
tracepoint:syscalls:sys_enter_adjtimex
tracepoint:syscalls:sys_enter_alarm
tracepoint:syscalls:sys_enter_arch_prctl
tracepoint:syscalls:sys_enter_bind
tracepoint:syscalls:sys_enter_bpf
# List kprobes for a specific subsystem (e.g., TCP)
bpftrace -l 'kprobe:tcp_*' | head -20
kprobe:tcp_abort
kprobe:tcp_check_req
kprobe:tcp_close
kprobe:tcp_connect
kprobe:tcp_conn_request
kprobe:tcp_disconnect
kprobe:tcp_done
kprobe:tcp_fin
kprobe:tcp_get_info
kprobe:tcp_getsockopt
# List kprobes for ZFS
bpftrace -l 'kprobe:zfs_*'
bpftrace -l 'kprobe:spa_*'
bpftrace -l 'kprobe:dmu_*'
# Show the arguments for a tracepoint
bpftrace -lv 'tracepoint:syscalls:sys_enter_openat'
tracepoint:syscalls:sys_enter_openat
int __syscall_nr
int dfd
const char * filename
int flags
umode_t mode
# Show struct layouts with BTF
bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep -A 20 'struct sock_common {'
How programs attach
kprobe / kretprobe
Uses the kernel’s kprobe infrastructure (software breakpoints). The kernel patches the first instruction of the target function with an INT3 (x86) or BRK (ARM64). When the function is called, the CPU traps, the eBPF program runs, then the original instruction executes. This adds ~100ns overhead per call.
fentry / fexit
Uses BPF trampolines. The kernel patches the nop that ftrace reserves at each function’s entry
(emitted by -mfentry on x86_64 or -fpatchable-function-entry on ARM64) to call a trampoline that runs
the eBPF program. Much lower overhead than kprobes: no trap, no interrupt, just a direct call.
Tracepoints
Uses static instrumentation points compiled into the kernel. Each tracepoint is a trace_*
function call that is normally a no-op (static key disabled). When you attach an eBPF program, the
static key is enabled and the tracepoint fires. Near-zero overhead when no program is attached.
XDP
Attached directly to the network device driver’s receive path via netlink or
ip link set. The driver calls the eBPF program for every received packet before
allocating an sk_buff. Three modes: native (driver support required), generic
(works everywhere, slower), and offload (runs on NIC hardware, supported by Netronome/nfp).
cgroup
Attached to a cgroup via the bpf() syscall. The kernel checks for attached eBPF programs
at each cgroup hook point (socket create, connect, bind, etc.). Programs inherit down the cgroup hierarchy
unless overridden.
LSM
Attached to LSM hooks via BPF link. Requires CONFIG_BPF_LSM=y and
lsm=...,bpf on the kernel command line. The BPF LSM programs run alongside any
other loaded LSM (SELinux, AppArmor).
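Whether the BPF LSM is actually active can be checked by looking for `bpf` in the kernel's comma-separated LSM list, which is exposed at `/sys/kernel/security/lsm` when securityfs is mounted. A minimal sketch of that membership test follows; `lsm_list_has` is an illustrative name, and reading the file into a string is left to the caller.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if name appears as a whole entry in a comma-separated LSM list. */
int lsm_list_has(const char *list, const char *name)
{
    assert(list != NULL && name != NULL);

    size_t name_len = strlen(name);
    if (name_len == 0)
        return 0;

    const char *p = list;
    while (*p) {
        /* Length of the current entry, up to the next comma or newline. */
        size_t entry_len = strcspn(p, ",\n");
        if (entry_len == name_len && strncmp(p, name, name_len) == 0)
            return 1;
        p += entry_len;
        if (*p)
            p++; /* skip the separator */
    }
    return 0;
}
```

Matching whole entries rather than substrings matters here: a plain `strstr` for `bpf` would also match an unrelated entry that merely contains those letters.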
eBPF vs DTrace vs SystemTap vs ftrace vs perf
If you’ve used other tracing tools, here’s how eBPF compares. The short version: eBPF is the only tool that is simultaneously safe, low-overhead, dynamically attachable, and production-ready on Linux.
| | eBPF | DTrace | SystemTap | ftrace | perf |
|---|---|---|---|---|---|
| Availability | Linux 4.x+ (practical: 5.x+) | Solaris, macOS, FreeBSD. Linux port exists but incomplete | Linux (RHEL-focused) | Linux 2.6.27+ | Linux 2.6.31+ |
| Safety model | Kernel verifier proves program safe before execution. Cannot crash kernel. | Safe interpreter with privilege checks | Compiles to kernel module. Can crash kernel. | Safe — only uses pre-built kernel hooks | Safe — read-only sampling and counters |
| Language | C (libbpf), bpftrace (awk-like), Python (BCC), Go (Cilium), Rust (Aya) | D language (DTrace-specific) | SystemTap scripting language, or C (guru mode) | Shell (trace-cmd), or direct sysfs writes | Command-line tool, no programmability |
| Overhead | ~50–100ns per probe. JIT-compiled native code. | Low (interpreted, but optimized) | Low when compiled, but kernel module compilation adds startup latency | Very low (function tracer adds ~15ns) | Sampling-based, near-zero unless recording |
| Production safe | Yes. Used in production at Meta, Google, Netflix, Cloudflare, all major cloud providers. | Yes (on Solaris/FreeBSD) | Risky. Guru mode can crash. Not widely used in production. | Yes, but limited scope | Yes |
| Dynamic attachment | Any kernel function, tracepoint, network hook, cgroup, LSM hook | Any function with probes defined | Any kernel function (via kprobe) | Trace events, the function tracer, and kprobe/uprobe events | PMU events, tracepoints, and dynamic probes added with perf probe |
| Networking | XDP, TC, socket filter, sk_msg. Can replace entire network stacks. | Packet inspection only | Limited network support | Network tracepoints only | Network tracepoints only |
| Security | LSM hooks and cgroup device control can implement security policies. (seccomp filters use classic BPF, not eBPF.) | No | No | No | No |
| State / aggregation | Maps (hash, array, ring buffer, etc.) shared between kernel and user space | Aggregations built into language | Associative arrays | Histograms via trace events | In-kernel aggregation for some events |
| Kernel version coupling | CO-RE + BTF: compile once, run on any kernel that ships BTF | Stable probes (less coupling) | Strongly coupled. Scripts break on kernel upgrades. | Stable (uses trace events) | Stable |
BTF and CO-RE
BPF Type Format (BTF)
BTF is a compact metadata format that describes every type in the kernel:
structs, unions, enums, typedefs, function prototypes. The running kernel exposes its own BTF at
/sys/kernel/btf/vmlinux, and eBPF object files carry BTF for the program’s types. BTF is what lets
bpftrace resolve a field access such as $sk->__sk_common.skc_daddr without kernel headers installed:
it knows the struct layout, field offsets, and types.
Without BTF, eBPF programs that access kernel structs must hardcode field offsets, which change between kernel versions. With BTF, the loader can relocate field accesses at load time, reading the offset from the running kernel’s BTF data. This is the foundation of CO-RE.
# Check BTF availability
ls -la /sys/kernel/btf/vmlinux
-r--r--r--. 1 root root 5765272 Apr 4 12:00 /sys/kernel/btf/vmlinux
# List kernel modules with BTF
bpftool btf list
1: name [vmlinux] size 5765272B
2: name [openzfs] size 183441B map_ids 3,7
# Dump all types in the kernel (generates vmlinux.h)
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
wc -l vmlinux.h
189432 vmlinux.h
# Search for a specific struct
bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep -A 15 'struct task_struct {'
# Dump BTF for a specific kernel module
bpftool btf dump file /sys/kernel/btf/openzfs format c > openzfs.h
Compile Once, Run Everywhere (CO-RE)
CO-RE solves the biggest historical problem with eBPF development: kernel version coupling.
Before CO-RE, if your eBPF program accessed task->pid and the offset of pid
within struct task_struct changed between kernel 5.15 and 6.1, your program would read
garbage on the new kernel. You had to compile your program on every target kernel, or use BCC which
compiles at runtime (slow, requires compiler toolchain on every host).
CO-RE eliminates this. When you compile a CO-RE program with clang, the compiler records
relocation records in the ELF binary: "I’m reading field pid from type
struct task_struct." When the program is loaded, libbpf reads the running kernel’s BTF,
finds the actual offset of pid, and patches the instruction. The same compiled binary
runs correctly on any kernel that has BTF, regardless of struct layout changes.
This is why BTF matters so much: it is the prerequisite for CO-RE, and CO-RE is what makes it practical to distribute pre-compiled eBPF programs. Without CO-RE, every eBPF tool would need to ship kernel headers and a compiler. With CO-RE, you ship a single binary.
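The relocation step can be pictured with a toy model. The sketch below is not libbpf's implementation, which matches types structurally and handles far more cases; it only shows the idea of patching a recorded field access with the offset found in the target kernel's type information. All names (`field_info`, `load_insn`, `find_offset`, `relocate`) are illustrative, and any offsets passed in are made-up numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry of a toy type description: field name and byte offset. */
struct field_info {
    const char *name;
    int offset;
};

/* A toy load instruction that remembers which field it wants to read. */
struct load_insn {
    const char *field;
    int offset; /* offset baked in at compile time, patched at load time */
};

/* Look up a field in the target's type info; -1 if it does not exist. */
int find_offset(const struct field_info *btf, size_t n, const char *field)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(btf[i].name, field) == 0)
            return btf[i].offset;
    return -1;
}

/*
 * Patch every instruction with the offset from the target's type info.
 * Returns 0 on success, -1 if a field is missing on the target.
 */
int relocate(struct load_insn *insns, size_t n_insns,
             const struct field_info *btf, size_t n_fields)
{
    assert(insns != NULL && btf != NULL);

    for (size_t i = 0; i < n_insns; i++) {
        int off = find_offset(btf, n_fields, insns[i].field);
        if (off < 0)
            return -1;
        insns[i].offset = off;
    }
    return 0;
}
```

Running `relocate` against two different field tables leaves the same "program" with two different offsets, which is the behaviour that lets one compiled object work across kernels whose struct layouts differ.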
# The CO-RE workflow:
# 1. Generate vmlinux.h from your dev kernel (or use a pre-made one)
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
# 2. Write your eBPF program using vmlinux.h types
# #include "vmlinux.h"
# #include <bpf/bpf_helpers.h>
# #include <bpf/bpf_core_read.h>
#
# SEC("tp/sched/sched_process_exec")
# int trace_exec(struct trace_event_raw_sched_process_exec *ctx) {
# pid_t pid = BPF_CORE_READ(ctx, pid);
# // This read will be relocated at load time
# return 0;
# }
# 3. Compile with clang (once)
clang -O2 -g -target bpf -D__TARGET_ARCH_x86 -c trace_exec.bpf.c -o trace_exec.bpf.o
# 4. The .o file runs on any kernel with BTF — no recompilation needed
The toolchain
There are five major eBPF development environments. They serve different audiences and use cases. Here’s when to use each one.
bpftrace — the one-liner tool
What: A high-level tracing language inspired by awk and DTrace. One-liners or short scripts.
Compiles to eBPF bytecode behind the scenes.
Language: bpftrace scripting language (awk-like syntax with kernel awareness).
When to use: Ad-hoc investigation. Production debugging. Quick answers to "what is happening
right now?" questions. Prototyping before writing a full program.
When NOT to use: Long-running daemons, complex logic, networking programs (XDP/TC),
anything that needs to be distributed as a binary.
# One-liner: who is calling connect() (excluding sshd)?
bpftrace -e 'tracepoint:syscalls:sys_enter_connect /comm != "sshd"/ { printf("%s pid=%d\n", comm, pid); }'
# One-liner: histogram of syscall latency for a specific process (timestamps keyed per thread)
bpftrace -e 'tracepoint:raw_syscalls:sys_enter /pid == 1234/ { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /pid == 1234 && @start[tid]/ { @ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'
# Script: trace file I/O by process with latency
bpftrace -e '
kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
  @us[comm] = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'
BCC — Python + C
What: BPF Compiler Collection. Write the kernel-side eBPF program in C, the user-space
frontend in Python (or Lua/C++). BCC compiles the C code at runtime using LLVM.
Language: C (eBPF side) + Python (user space side).
When to use: The 100+ pre-built tools (opensnoop, execsnoop, tcplife, biolatency, etc.)
are invaluable. Also good for prototyping tools that need a user-space component.
When NOT to use: New tool development. BCC requires LLVM + kernel headers on every target
machine (compiles at runtime). Modern projects use libbpf + CO-RE instead.
# BCC tool locations:
# Debian/Ubuntu: /usr/sbin/ (e.g., /usr/sbin/opensnoop-bpfcc)
# CentOS/RHEL: /usr/share/bcc/tools/ (e.g., /usr/share/bcc/tools/opensnoop)
# The essential BCC tools every SRE should know:
opensnoop # trace file opens system-wide
execsnoop # trace process execution
tcpconnect # trace outbound TCP connections
tcplife # trace TCP sessions with duration + bytes
biolatency # block I/O latency histogram
runqlat # CPU scheduler queue latency
cachestat # page cache hit/miss ratio
memleak # trace memory allocations (find leaks)
funccount # count kernel function calls
profile # CPU profiling via sampling
libbpf — the modern C library
What: The standard C library for loading, attaching, and managing eBPF programs.
Used with CO-RE for portable, pre-compiled eBPF binaries.
Language: C (both sides).
When to use: Building production tools, daemons, agents. Anything that needs to ship as a
self-contained binary without requiring LLVM or kernel headers on the target. This is the recommended
approach for new projects.
Workflow: Write eBPF C code, compile with clang to BPF object file, use
bpftool gen skeleton to generate a C header, link against libbpf in your user-space program.
# The libbpf development workflow:
# 1. Write the eBPF program (runs in kernel)
# trace_exec.bpf.c
# 2. Compile to BPF object file
clang -O2 -g -target bpf -c trace_exec.bpf.c -o trace_exec.bpf.o
# 3. Generate the skeleton header
bpftool gen skeleton trace_exec.bpf.o > trace_exec.skel.h
# 4. Write user-space loader that includes the skeleton
# trace_exec.c — calls trace_exec_bpf__open(), __load(), __attach()
# 5. Compile user-space program, link against libbpf
gcc -O2 -o trace_exec trace_exec.c -lbpf -lelf -lz
# Result: single binary, runs on any kernel with BTF
Cilium eBPF library — Go
What: A pure Go library for working with eBPF programs and maps.
Used by Cilium, Tetragon, Hubble, and many Go-based infrastructure tools.
Language: Go (user space) + C (eBPF side, compiled with clang).
When to use: Building eBPF tools in Go. If your infrastructure is Go-based
(as most modern cloud-native tooling is), this is the natural choice. Excellent documentation
and active community.
Aya — Rust
What: A Rust library for eBPF that writes both the kernel-side and user-space code in Rust.
No dependency on libbpf or clang — it has its own BPF linker and relocator.
Language: Rust (both sides).
When to use: Rust-based infrastructure. Projects where memory safety in the user-space
component matters as much as in the kernel component. Aya is newer but growing rapidly.
Decision tree
Quick investigation? bpftrace.
Pre-built tool exists? BCC.
Building a production C tool? libbpf + CO-RE.
Building a production Go tool? Cilium eBPF library.
Building a production Rust tool? Aya.
What kldload pre-installs
kldload installs a complete eBPF toolchain out of the box. No post-install setup required — you boot the system and start tracing immediately.
| Package | Debian/Ubuntu name | CentOS/RHEL name | What it provides |
|---|---|---|---|
| bpftrace | bpftrace | bpftrace | High-level tracing language. One-liners and scripts for production debugging. |
| BCC tools | bpfcc-tools | bcc-tools | 100+ ready-made tools: opensnoop, execsnoop, tcplife, biolatency, runqlat, etc. |
| bpftool | bpftool | bpftool | Low-level BPF program/map management. BTF introspection. Skeleton generation. |
| Kernel headers | linux-headers-$(uname -r) | kernel-devel | Required for BCC (compiles at runtime). Not needed for CO-RE programs. |
| BTF | Built into kernel | Built into kernel | /sys/kernel/btf/vmlinux — type information for CO-RE and bpftrace struct access. |
| perf | linux-perf | perf | Performance counters, CPU profiling, flame graphs. Complements eBPF for sampling-based analysis. |
Why all of this is pre-installed: when you’re debugging a production issue at 3 AM, the last thing you want is to discover you need to install packages. kldload ensures every system ships with a complete observability toolkit. You boot, you trace, you find the problem.
On CentOS/RHEL, set KLDLOAD_ENABLE_EBPF=1 in the answers file to include eBPF tools
in the initial install, or install afterward with dnf install -y bpftool bcc-tools bpftrace perf.
On Debian/Ubuntu, they’re included in every desktop and server profile by default.
Map types deep dive
Maps are the data structures of eBPF. They persist across program invocations, are accessible from both kernel and user space, and come in over 30 specialized types. Here are the ones you will actually use.
Hash map (BPF_MAP_TYPE_HASH)
The general-purpose key-value store. Arbitrary keys, arbitrary values, O(1) lookup. Use it for: tracking per-PID state, building lookup tables, counting events by key.
# bpftrace hash map: count syscalls by process name
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
# Output (Ctrl+C to print):
# @[systemd]: 142
# @[sshd]: 87
# @[bash]: 1203
# @[postgres]: 34891
# Inspect a hash map with bpftool
bpftool map dump id 42
Array (BPF_MAP_TYPE_ARRAY)
Fixed-size array indexed by integer key (0 to max_entries-1). O(1) lookup, pre-allocated. Use it for: global configuration, indexed counters, lookup tables with integer keys. Array elements cannot be deleted — they exist for the lifetime of the map.
Per-CPU variants (BPF_MAP_TYPE_PERCPU_HASH, BPF_MAP_TYPE_PERCPU_ARRAY)
Same as hash and array, but each CPU gets its own copy of every value. No locking needed, no contention between CPUs. User space reads all per-CPU copies and aggregates them. Use it for: high-frequency counters where lock contention would be a bottleneck. Packet counters, syscall counters, latency tracking on busy systems.
# bpftrace uses per-CPU maps automatically for count() and hist()
# Under the hood, each CPU increments its own counter independently
# View per-CPU map contents with bpftool
bpftool map dump id 15
# Output:
# key: 00 00 00 00 value (CPU 00): 01 00 00 00 00 00 00 00
# value (CPU 01): 03 00 00 00 00 00 00 00
# value (CPU 02): 00 00 00 00 00 00 00 00
# value (CPU 03): 02 00 00 00 00 00 00 00
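The user-space aggregation step can be sketched in plain C. This is a minimal illustration, assuming 64-bit counters and a buffer that already holds one value per possible CPU in CPU order (with libbpf, such a buffer would typically be filled by a user-space bpf_map_lookup_elem() call and sized with libbpf_num_possible_cpus()). The function name sum_percpu_u64 is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum one key's per-CPU slots into a single total.
 * `values` holds one 64-bit counter per possible CPU, in CPU order. */
uint64_t sum_percpu_u64(const uint64_t *values, size_t ncpus)
{
    uint64_t total = 0;

    for (size_t cpu = 0; cpu < ncpus; cpu++)
        total += values[cpu];   /* no locking: each slot has a single writer CPU */
    return total;
}
```

Applied to the four per-CPU values in the bpftool dump (1, 3, 0, 2), the function returns 6.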
LRU hash (BPF_MAP_TYPE_LRU_HASH)
A hash map with a fixed maximum size that automatically evicts the least-recently-used
entry when full. Use it for: connection tracking tables, flow caches, any map where you can’t predict
the number of entries and need bounded memory. Per-CPU variant available
(BPF_MAP_TYPE_LRU_PERCPU_HASH).
Ring buffer (BPF_MAP_TYPE_RINGBUF)
A single shared ring buffer for streaming events from kernel to user space. Introduced in kernel 5.8 as a replacement for perf buffers. Advantages over perf buffer: single buffer shared across all CPUs (no per-CPU allocation waste), supports variable-length records, allows reserving space before writing (no double-copy), and preserves event ordering across CPUs.
When to use: any time you need to stream events to user space. The ring buffer is the recommended choice for new programs — it is more memory-efficient and has better ordering guarantees than perf buffers.
# In libbpf C:
# struct {
# __uint(type, BPF_MAP_TYPE_RINGBUF);
# __uint(max_entries, 256 * 1024); /* 256 KB */
# } events SEC(".maps");
#
# SEC("tp/sched/sched_process_exec")
# int trace_exec(void *ctx) {
# struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
# if (!e) return 0;
# e->pid = bpf_get_current_pid_tgid() >> 32;
# bpf_get_current_comm(&e->comm, sizeof(e->comm));
# bpf_ringbuf_submit(e, 0);
# return 0;
# }
Perf buffer (BPF_MAP_TYPE_PERF_EVENT_ARRAY)
The original mechanism for streaming events to user space. One ring buffer per CPU. Still widely used in BCC tools. For new programs, prefer ring buffers (above) — but you’ll encounter perf buffers constantly in existing tools.
Bloom filter (BPF_MAP_TYPE_BLOOM_FILTER)
A probabilistic data structure that answers "is this key in the set?" with no false negatives and a configurable false positive rate. Use it for: fast pre-filtering before an expensive hash lookup. Example: checking if an IP address is in a blocklist of millions of entries. Kernel 5.16+.
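To make the "no false negatives" property concrete, here is a small user-space sketch of the idea in C. It is not the kernel's implementation: the filter size, the number of hashes, the seeded FNV-1a hash, and all function names are assumptions chosen for brevity.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   1024u  /* hypothetical filter size in bits */
#define BLOOM_HASHES 3u     /* hypothetical number of hash functions */

struct bloom {
    uint8_t bits[BLOOM_BITS / 8];
};

/* FNV-1a over the four key bytes, with a seed mixed in per hash function. */
static uint32_t bloom_hash(uint32_t key, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;

    for (int i = 0; i < 4; i++) {
        h ^= (key >> (8 * i)) & 0xffu;
        h *= 16777619u;
    }
    return h % BLOOM_BITS;
}

void bloom_init(struct bloom *b)
{
    memset(b->bits, 0, sizeof(b->bits));
}

/* Adding a key sets one bit per hash function. Bits are never cleared. */
void bloom_add(struct bloom *b, uint32_t key)
{
    for (uint32_t s = 0; s < BLOOM_HASHES; s++) {
        uint32_t bit = bloom_hash(key, s);
        b->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* false: the key was definitely never added.
 * true:  the key was probably added (other keys may have set the same bits). */
bool bloom_maybe_contains(const struct bloom *b, uint32_t key)
{
    for (uint32_t s = 0; s < BLOOM_HASHES; s++) {
        uint32_t bit = bloom_hash(key, s);
        if (!(b->bits[bit / 8] & (1u << (bit % 8))))
            return false;
    }
    return true;
}
```

Because bits are only ever set, a key that was added always tests as present. A "true" answer still needs confirmation with the exact lookup, which is why the filter works as a pre-filter rather than a replacement for the hash map.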
Queue and Stack (BPF_MAP_TYPE_QUEUE, BPF_MAP_TYPE_STACK)
FIFO queue and LIFO stack. No keys — just push and pop values. Use them for: work queues between eBPF programs, ordered event collection, breadth-first or depth-first traversal state.
Sockmap / Sockhash
Maps that hold references to kernel sockets. Used with sk_msg/sk_skb programs to redirect data between sockets at kernel level. This is the mechanism behind Cilium’s socket-level load balancing and service mesh acceleration — data flows from socket A to socket B without ever leaving the kernel.
Program array (BPF_MAP_TYPE_PROG_ARRAY)
A map that holds references to other eBPF programs. Used for tail calls: one eBPF program jumps to another program in the array. This works around the instruction count limit by chaining programs together, and enables runtime-selectable behavior (swap a program in the array to change behavior without reloading).
The verifier in detail
The verifier is the gatekeeper. Every eBPF program must pass through it, and when your program gets rejected, the error messages can be cryptic. Understanding what the verifier checks and why helps you write programs that pass on the first try — and debug the ones that don’t.
DAG analysis
The verifier first checks that the program's control-flow graph is well formed (historically a directed acyclic graph, or DAG), then simulates every possible execution path through your program. At each instruction, it tracks the state of every register: its type (scalar, pointer to map value, pointer to stack, pointer to packet, etc.), its value range (if known), and whether it has been initialized. Where paths rejoin after a conditional branch, the verifier does not merge them. It keeps following each path separately and prunes a path only when its state is already covered by a state it has proven safe at that instruction. If any path can reach an unsafe operation, the program is rejected.
Complexity limit
The verifier has a hard limit on the number of instructions it will analyze: 1 million verified
instructions (as of kernel 5.2+; it was 128K before). This is not the number of instructions in
your program — it’s the number of instructions the verifier visits across all paths. A 200-instruction
program with many branches can exceed the limit because the verifier explores every combination. If you
hit the limit, your program is rejected with BPF program is too large.
# Dump the translated (post-verifier) instructions of a loaded program
bpftool prog dump xlated id 42
# Get verbose verifier log when loading a program
# In libbpf: set kernel_log_level in bpf_object_open_opts
# In bpftrace: bpftrace -v prints the verifier log on failure
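The gap between program size and visited instructions can be illustrated with a toy model. The sketch below assumes a program made of sequential if/else blocks, each arm the same length, and assumes no state pruning at all, so it gives a pessimistic upper bound rather than what the real verifier would do. The function name and the model itself are illustrative assumptions.

```c
#include <stdint.h>

/* Toy model: `branches` sequential if/else blocks, `insns_per_arm`
 * instructions in each arm, every path walked with no pruning.
 * The arms of block i are walked 2^i times in total, so the number of
 * arm walks is 2^1 + 2^2 + ... + 2^branches = 2^(branches + 1) - 2. */
uint64_t worst_case_visited(unsigned branches, uint64_t insns_per_arm)
{
    uint64_t arm_walks;

    if (branches >= 63)
        return UINT64_MAX;                     /* saturate instead of overflowing the shift */
    arm_walks = (UINT64_C(1) << (branches + 1)) - 2;
    if (insns_per_arm != 0 && arm_walks > UINT64_MAX / insns_per_arm)
        return UINT64_MAX;                     /* saturate the multiplication too */
    return arm_walks * insns_per_arm;
}
```

Under this model, 16 blocks with 8 instructions per arm is only 256 instructions of program text, yet the walk visits 1,048,560 instructions and crosses the 1M limit. With 15 blocks it visits 524,272. In practice state pruning removes much of this work, which is why reducing the amount of distinct state carried across branches helps.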
Bounded loops
Before kernel 5.3, eBPF had no loops at all — every backward jump was rejected.
You had to unroll loops manually with #pragma unroll. Since 5.3, the verifier allows loops
if it can prove they terminate. The verifier tracks the loop variable and its bounds:
// This works (kernel 5.3+): bounded loop
for (int i = 0; i < 10; i++) {
// verifier knows: i ranges [0, 9], loop iterates exactly 10 times
buf[i] = 0;
}
// This fails: verifier can’t prove termination
int n = get_some_value();
for (int i = 0; i < n; i++) {
// n is unknown — loop bound is unprovable
buf[i] = 0;
}
// Fix: clamp the variable
int n = get_some_value();
if (n > 10) n = 10; // now verifier knows n <= 10
for (int i = 0; i < n; i++) {
buf[i] = 0;
}
Since kernel 5.17, the bpf_loop() helper provides another way: you pass a callback function
and a maximum iteration count, and the kernel handles the loop. This avoids the verifier needing to
analyze the loop body on every iteration.
Stack limits
The verifier tracks every byte of the 512-byte stack. It ensures you don’t read uninitialized stack memory, don’t write past the stack bounds, and don’t pass uninitialized stack buffers to helpers. If your program needs more than 512 bytes of working space, use a per-CPU array map as a scratch buffer:
// Per-CPU array as scratch buffer (avoids 512-byte stack limit)
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__type(key, u32);
__type(value, struct big_buffer); // can be up to 32KB
__uint(max_entries, 1);
} scratch SEC(".maps");
SEC("kprobe/some_func")
int my_prog(struct pt_regs *ctx) {
u32 key = 0;
struct big_buffer *buf = bpf_map_lookup_elem(&scratch, &key);
if (!buf) return 0; // always succeeds for per-cpu array, but verifier requires the check
// use buf->data[...] for up to 32KB of scratch space
return 0;
}
Helper functions
eBPF programs cannot call arbitrary kernel functions. They can only call helper functions provided by the kernel. Helpers are the API surface between eBPF programs and the kernel. Each program type has access to a specific subset of helpers. The verifier enforces this.
Common helpers every eBPF developer uses:
| Helper | What it does |
|---|---|
| bpf_map_lookup_elem() | Look up a key in a map. Returns pointer to value or NULL. |
| bpf_map_update_elem() | Insert or update a key-value pair in a map. |
| bpf_map_delete_elem() | Delete a key from a map. |
| bpf_get_current_pid_tgid() | Returns (tgid << 32 \| pid). tgid = process ID, pid = thread ID. |
| bpf_get_current_comm() | Copies the current task’s command name (up to 16 bytes). |
| bpf_ktime_get_ns() | Returns monotonic clock in nanoseconds. For latency measurement. |
| bpf_probe_read_kernel() | Safely read from a kernel address. Returns 0 on success. |
| bpf_probe_read_user() | Safely read from a user-space address. |
| bpf_ringbuf_reserve() | Reserve space in a ring buffer for writing. |
| bpf_ringbuf_submit() | Submit a reserved ring buffer entry to user space. |
| bpf_printk() | Debug print to /sys/kernel/debug/tracing/trace_pipe. Use for debugging only: very slow. |
| bpf_get_stackid() | Capture a stack trace into a stack trace map. For profiling. |
| bpf_redirect() | Redirect a packet to another interface (XDP/TC). |
| bpf_skb_store_bytes() | Modify packet contents (TC programs). |
| bpf_loop() | Execute a callback function up to N times (kernel 5.17+). Avoids verifier loop analysis. |
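The packing used by bpf_get_current_pid_tgid() is easy to get backwards, so here is a small user-space sketch of how the 64-bit value splits. The struct and function names are made up for this example; only the bit layout comes from the table above.

```c
#include <stdint.h>

/* The two halves of the value returned by bpf_get_current_pid_tgid(). */
struct task_ids {
    uint32_t tgid;  /* upper 32 bits: process ID as user space sees it */
    uint32_t pid;   /* lower 32 bits: thread ID */
};

struct task_ids split_pid_tgid(uint64_t pid_tgid)
{
    struct task_ids ids = {
        .tgid = (uint32_t)(pid_tgid >> 32),
        .pid  = (uint32_t)(pid_tgid & 0xffffffffu),
    };
    return ids;
}
```

This is why the examples on this page shift right by 32 when they want the process ID: the value a tool like ps calls the PID is the kernel's tgid.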
Common rejections and how to fix them
"R1 type=scalar expected=map_value"
You passed a raw integer to a function expecting a map value pointer.
Fix: call bpf_map_lookup_elem() first and pass the returned pointer.
Always NULL-check the return value before dereferencing.
"invalid mem access ‘scalar’"
You’re trying to dereference something the verifier thinks is a number, not a pointer. Fix: make sure you’re casting correctly and that the pointer source is valid (map lookup, context access, or stack address).
"R0 invalid mem access ‘map_value_or_null’"
You used a map lookup result without checking for NULL.
Fix: add if (!val) return 0; after every bpf_map_lookup_elem().
"back-edge from insn X to Y"
You have a backward jump (loop) on a kernel older than 5.3, or the verifier can’t prove your loop terminates.
Fix: unroll the loop with #pragma unroll, use bpf_loop(),
or add an explicit bound the verifier can track.
"BPF program is too large"
The verifier hit the 1M instruction complexity limit.
Fix: split into multiple programs connected by tail calls, reduce branching,
use bpf_loop() instead of inline loops, or move complex logic to user space.
"invalid access to packet"
You’re reading past the end of a packet without a bounds check.
Fix: always check if ((void *)(hdr + 1) > data_end) return XDP_PASS;
before accessing each protocol header. The verifier needs these checks at every layer.
"cannot pass map_value to helper"
Some helpers need a pointer to the map, not a pointer to a value in the map.
Fix: pass &my_map (the map itself) rather than the result of a lookup.
"helper call is not allowed in probe"
You called a helper that isn’t available for your program type. For example, bpf_redirect()
in a kprobe program.
Fix: check the helper availability table for your program type. Use bpftool feature
to see what’s available on your kernel.
# See which helpers are available for each program type on your kernel
bpftool feature probe | grep -A 5 'eBPF helpers'
# See all available program types
bpftool feature probe | grep 'program_type'
# See map types
bpftool feature probe | grep 'map_type'
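The bounds check recommended for "invalid access to packet" has a fixed shape: prove the whole header fits before reading any field. The sketch below shows that shape in ordinary user-space C so it can be compiled and tried outside the kernel. The eth_hdr struct and the read_ethertype function are hypothetical stand-ins, not kernel definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 14-byte Ethernet header. Byte arrays avoid padding and
 * make the big-endian EtherType explicit. */
struct eth_hdr {
    uint8_t dst[6];
    uint8_t src[6];
    uint8_t proto[2];
};

/* Reads the EtherType only after proving the header lies inside
 * [data, data_end). Returns false for a truncated packet. */
bool read_ethertype(const void *data, const void *data_end, uint16_t *ethertype)
{
    const struct eth_hdr *eth = data;

    /* Same comparison as in an XDP program: one past the header must
     * not exceed the end of the packet. */
    if ((const void *)(eth + 1) > data_end)
        return false;

    *ethertype = (uint16_t)((eth->proto[0] << 8) | eth->proto[1]);
    return true;
}
```

In a real XDP or TC program the same comparison has to be repeated for the IP header, the TCP or UDP header, and so on, because the verifier only trusts the bytes covered by a check it has seen on the current path.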
Quick reference — 25 essential commands
These commands work on any kldload system immediately after boot. All require root.
Tracing processes
# 1. Trace every process execution system-wide
execsnoop
# Output:
# PCOMM PID PPID RET ARGS
# bash 18401 18400 0 /bin/bash
# ls 18402 18401 0 /usr/bin/ls --color=auto
# grep 18403 18401 0 /usr/bin/grep -i error /var/log/messages
# 2. Trace every file open system-wide
opensnoop
# Output:
# PID COMM FD ERR PATH
# 18401 bash 3 0 /etc/profile
# 18402 ls 3 0 /etc/ld.so.cache
# 1842 sshd 4 0 /etc/ssh/sshd_config
# 3. Trace signals sent to processes
bpftrace -e 'tracepoint:signal:signal_generate { printf("%s (pid %d) sent signal %d to pid %d\n", comm, pid, args.sig, args.pid); }'
# 4. Count syscalls by process (top talkers)
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
# 5. Trace process creation (fork events; the sched_process_exec and sched_process_exit tracepoints cover the rest of the lifecycle)
bpftrace -e 'tracepoint:sched:sched_process_fork { printf("fork: %s (pid %d) -> child pid %d\n", comm, pid, args.child_pid); }'
Disk and filesystem
# 6. Block I/O latency histogram
biolatency
# Output:
# usecs : count distribution
# 0 -> 1 : 0 | |
# 2 -> 3 : 0 | |
# 4 -> 7 : 15 |**** |
# 8 -> 15 : 42 |************* |
# 16 -> 31 : 128 |****************************************|
# 32 -> 63 : 91 |**************************** |
# 64 -> 127 : 23 |******* |
# 128 -> 255 : 4 |* |
# 7. Block I/O by device and process (top-like)
biotop
# 8. Slow filesystem operations (>10ms)
fileslower 10
# Output:
# TIME(s) COMM TID D BYTES LAT(ms) FILENAME
# 0.250 postgres 1842 R 8192 14.32 base/16384/16385
# 1.102 rsync 2103 W 131072 22.50 backup.tar.gz
# 9. ZFS operations slower than 1ms
zfsslower 1
# 10. Trace disk I/O size distribution
bpftrace -e 'tracepoint:block:block_rq_complete { @bytes = hist(args.nr_sector * 512); }'
Network
# 11. Trace new TCP connections (outbound)
tcpconnect
# Output:
# PID COMM IP SADDR DADDR DPORT
# 18501 curl 4 10.0.0.5 93.184.216.34 443
# 18502 python3 4 10.0.0.5 10.0.0.10 5432
# 12. Trace TCP sessions with duration and bytes
tcplife
# Output:
# PID COMM LADDR LPORT RADDR RPORT TX_KB RX_KB MS
# 18501 curl 10.0.0.5 42916 93.184.216.34 443 1 15 230.45
# 18502 python3 10.0.0.5 54112 10.0.0.10 5432 0 0 2.31
# 13. Trace TCP retransmits (sign of network problems)
tcpretrans
# 14. Trace DNS lookups with latency
gethostlatency
# Output:
# TIME PID COMM LATms HOST
# 14:02:31 18501 curl 2.91 example.com
# 14:02:31 18502 python3 0.12 db.internal
# 15. Count transmitted packets per process, printed every second
bpftrace -e 'tracepoint:net:net_dev_xmit { @[comm] = count(); } interval:s:1 { print(@); clear(@); }'
CPU and scheduler
# 16. CPU scheduler queue latency (how long tasks wait to run)
runqlat
# Output:
# usecs : count distribution
# 0 -> 1 : 234 |******** |
# 2 -> 3 : 1042 |****************************************|
# 4 -> 7 : 823 |******************************* |
# 8 -> 15 : 412 |*************** |
# 16 -> 31 : 98 |*** |
# 32 -> 63 : 12 | |
# 17. CPU profiling (sample stack traces)
profile -af 30 > /tmp/profile.out
# 18. Show off-CPU time (why processes are blocked)
offcputime 5
# 19. Context switches per process
bpftrace -e 'tracepoint:sched:sched_switch { @[args.prev_comm] = count(); }'
Memory
# 20. Page cache hit/miss ratio
cachestat
# Output:
# HITS MISSES DIRTIES HITRATIO BUFFERS_MB CACHED_MB
# 1523 12 34 99.22% 142 3847
# 21. Trace memory allocations (find leaks)
memleak -p 1842
# 22. Page faults by process
bpftrace -e 'software:page-faults:1 { @[comm] = count(); }'
System inspection
# 23. List all loaded eBPF programs
bpftool prog list
# Output:
# 6: cgroup_device tag a4f... gpl
# loaded_at 2026-04-04T10:23:17+0000 uid 0
# xlated 504B jited 309B memlock 4096B map_ids 2
# 42: tracepoint name trace_exec tag b2e9... gpl
# loaded_at 2026-04-04T14:01:33+0000 uid 0
# xlated 1832B jited 1104B memlock 4096B map_ids 15,16
# 24. List all eBPF maps
bpftool map list
# 25. Check kernel eBPF feature support
bpftool feature probe kernel
Practical examples on kldload
Trace ZFS internals
# ZFS operations slower than 1 ms (one line per operation)
zfsslower 1
# Output:
# TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
# 14:23:01 postgres 1842 R 8192 16384 1.23 16385
# 14:23:01 cp 2041 W 131072 0 3.45 backup.tar
# Count ZFS ARC read requests by process (arc_read is called for hits and misses alike)
bpftrace -e 'kprobe:arc_read { @[comm] = count(); }'
# Trace ZFS transaction group syncs
bpftrace -e 'kprobe:txg_sync_thread { printf("txg sync: %s\n", comm); }'
# Trace zpool import/export
bpftrace -e '
kprobe:spa_open { printf("pool open: pid=%d comm=%s\n", pid, comm); }
kretprobe:spa_open { printf("pool open returned: %d\n", retval); }'
# Monitor ZFS scrub I/O
bpftrace -e 'kprobe:dsl_scan_scrub_cb { @scrub_ios = count(); }'
Trace WireGuard
# Packets going through WireGuard tunnel
bpftrace -e 'kprobe:wg_xmit { @tx[comm] = count(); }'
bpftrace -e 'kprobe:wg_receive { @rx = count(); }'
# WireGuard handshake events
bpftrace -e 'kprobe:wg_noise_handshake_create_initiation { printf("WG handshake init: pid=%d\n", pid); }'
Trace the kldload installer
# During a kldload install, trace every file being created
opensnoop -f O_CREAT
# Watch the installer’s process tree unfold
execsnoop
# Trace all disk I/O during install (see which files are being written)
biotop
# Trace every file dnf/debootstrap opens during package installation
bpftrace -e 'tracepoint:syscalls:sys_enter_openat /comm == "dnf" || comm == "debootstrap"/ {
printf("%s: %s\n", comm, str(args.filename));
}'
Security auditing
# Track every process execution with full command line (security audit trail)
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
printf("%llu uid=%-5d pid=%-6d ppid=%-6d %s -> %s\n",
nsecs / 1000000000, uid, pid,
curtask->real_parent->tgid,
comm, str(args.filename));
}'
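# Output (illustrative; nsecs is a monotonic clock, so on a live system the first column is seconds since boot, not a Unix timestamp):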
# 1712234521 uid=0 pid=18401 ppid=18400 bash -> /usr/bin/ls
# 1712234521 uid=1000 pid=18402 ppid=18401 crond -> /usr/sbin/logrotate
# 1712234522 uid=0 pid=18403 ppid=1 sshd -> /usr/sbin/sshd
# Detect privilege escalation (setuid calls)
bpftrace -e '
tracepoint:syscalls:sys_enter_setuid {
printf("SETUID: pid=%d comm=%s uid=%d -> target_uid=%d\n",
pid, comm, uid, args.uid);
}'
# Monitor access to sensitive files
bpftrace -e '
tracepoint:syscalls:sys_enter_openat
/str(args.filename) == "/etc/shadow" ||
str(args.filename) == "/etc/sudoers"/ {
printf("SENSITIVE FILE: %s (pid=%d uid=%d) opened %s\n",
comm, pid, uid, str(args.filename));
}'
# Detect kernel module loads (potential rootkit insertion)
bpftrace -e '
kprobe:do_init_module {
printf("MODULE LOADED: pid=%d uid=%d comm=%s\n", pid, uid, comm);
print(kstack);
}'
# Trace container escape indicators (mount namespace changes)
bpftrace -e '
tracepoint:syscalls:sys_enter_mount {
printf("mount: pid=%d comm=%s source=%s target=%s\n",
pid, comm, str(args.dev_name), str(args.dir_name));
}'
Generate flame graphs
# CPU flame graph (what is using CPU time?)
perf record -g -a sleep 30
perf script > /tmp/out.perf
# If FlameGraph tools are installed:
stackcollapse-perf.pl /tmp/out.perf | flamegraph.pl > /tmp/cpu-flame.svg
# Off-CPU flame graph (what is blocking?)
offcputime -f 30 > /tmp/offcpu.out
flamegraph.pl --color=io /tmp/offcpu.out > /tmp/offcpu-flame.svg
# Alternatively, use bpftrace for targeted profiling (samples for 30 seconds, then exits):
bpftrace -e 'profile:hz:99 { @[kstack] = count(); } interval:s:30 { exit(); }' > /tmp/stacks.out
Kernel requirements
eBPF features were added incrementally across kernel versions. Here’s what shipped when, and what kldload kernels support.
| Feature | Minimum kernel | kldload CentOS (5.14) | kldload Debian (6.x) |
|---|---|---|---|
| Basic eBPF (maps, helpers) | 3.18 | Yes | Yes |
| kprobe/kretprobe programs | 4.1 | Yes | Yes |
| Tracepoint programs | 4.7 | Yes | Yes |
| XDP | 4.8 | Yes | Yes |
| BPF-to-BPF calls | 4.16 | Yes | Yes |
| BTF | 4.18 | Yes | Yes |
| Bounded loops | 5.3 | Yes | Yes |
| fentry/fexit | 5.5 | Yes | Yes |
| BPF LSM | 5.7 | Yes | Yes |
| Ring buffer | 5.8 | Yes | Yes |
| Bloom filter map | 5.16 | Yes | Yes |
| bpf_loop() helper | 5.17 | Backported | Yes |
| User ring buffer | 6.1 | No | Yes |
| sched_ext (BPF scheduler) | 6.12 | No | Depends on version |
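A tool that wants to warn early about a missing feature can compare the running kernel's release string against the minimum versions in the table. The sketch below is one way to do that in C, assuming the release string has the usual "major.minor..." form reported by uname -r. The function name is made up, and because distribution kernels backport features (as the bpf_loop() row shows), a failed version check is only a hint; probing with bpftool feature probe is the reliable test.

```c
#include <stdbool.h>
#include <stdio.h>

/* Returns true when `release` (for example the output of `uname -r`)
 * is at least need_major.need_minor. Unparseable strings return false. */
bool release_at_least(const char *release, unsigned need_major, unsigned need_minor)
{
    unsigned major = 0, minor = 0;

    if (sscanf(release, "%u.%u", &major, &minor) != 2)
        return false;
    return major > need_major ||
           (major == need_major && minor >= need_minor);
}
```

For a hypothetical release string such as "5.14.0-362.el9.x86_64", the check passes for the ring buffer (5.8) and fails for bpf_loop() (5.17), even though the table lists bpf_loop() as backported on that kernel line.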
Required kernel config options (all enabled in kldload kernels):
# Core eBPF
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_HAVE_EBPF_JIT=y
# BTF (required for CO-RE and bpftrace struct access)
CONFIG_DEBUG_INFO_BTF=y
CONFIG_DEBUG_INFO_BTF_MODULES=y
# kprobe support
CONFIG_KPROBES=y
CONFIG_KPROBE_EVENTS=y
# Tracepoints
CONFIG_TRACING=y
CONFIG_FTRACE=y
# BPF LSM (for security programs)
CONFIG_BPF_LSM=y
# XDP (AF_XDP socket support)
CONFIG_XDP_SOCKETS=y
# Verify on your kernel:
zcat /proc/config.gz | grep CONFIG_BPF
# or
grep CONFIG_BPF /boot/config-$(uname -r)
Permissions: eBPF requires root or specific capabilities.
Most distributions disable unprivileged eBPF by default (kernel.unprivileged_bpf_disabled set to 1 or 2).
For non-root users on kernel 5.8+, grant CAP_BPF + CAP_PERFMON (for tracing) or
CAP_BPF + CAP_NET_ADMIN (for networking programs).
# Check BTF availability
ls /sys/kernel/btf/vmlinux && echo "BTF available" || echo "No BTF"
# Check JIT status
cat /proc/sys/net/core/bpf_jit_enable
# Check if unprivileged BPF is disabled (should be 1 or 2)
cat /proc/sys/kernel/unprivileged_bpf_disabled
Writing custom eBPF programs with libbpf
When bpftrace one-liners are not enough and you need a production-grade tool that ships as a single binary, use libbpf with CO-RE. Here is the complete workflow: write the eBPF kernel program, write the userspace loader, compile, and run.
Step 1: The kernel-side eBPF program
/* trace_open.bpf.c — traces every openat() syscall */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
/* Event structure shared between kernel and user space */
struct event {
u32 pid;
u32 uid;
char comm[16];
char filename[256];
};
/* Ring buffer for streaming events to user space */
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024); /* 256 KB */
} events SEC(".maps");
SEC("tracepoint/syscalls/sys_enter_openat")
int trace_openat(struct trace_event_raw_sys_enter *ctx) {
struct event *e;
/* Reserve space in the ring buffer */
e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
if (!e) return 0; /* ring buffer full, drop event */
/* Fill in event fields */
e->pid = bpf_get_current_pid_tgid() >> 32;
e->uid = bpf_get_current_uid_gid() & 0xffffffff;
bpf_get_current_comm(&e->comm, sizeof(e->comm));
bpf_probe_read_user_str(&e->filename, sizeof(e->filename),
(const char *)ctx->args[1]);
/* Submit event to user space */
bpf_ringbuf_submit(e, 0);
return 0;
}
char LICENSE[] SEC("license") = "GPL";
Step 2: The userspace loader
/* trace_open.c: loads and manages the eBPF program */
#include <stdio.h>
#include <signal.h>
#include <bpf/libbpf.h>
#include "trace_open.skel.h" /* auto-generated by bpftool gen skeleton */
/* Must match struct event in trace_open.bpf.c */
struct event {
__u32 pid;
__u32 uid;
char comm[16];
char filename[256];
};
static volatile sig_atomic_t running = 1;
static void sig_handler(int sig) { running = 0; }
/* Called once per record consumed from the ring buffer */
static int handle_event(void *ctx, void *data, size_t len) {
const struct event *e = data;
printf("%-8u %-8u %-16s %s\n", e->pid, e->uid, e->comm, e->filename);
return 0;
}
int main(void) {
struct trace_open_bpf *skel;
struct ring_buffer *rb = NULL;
int err = 1;
signal(SIGINT, sig_handler);
/* Open, load, and verify the eBPF program */
skel = trace_open_bpf__open_and_load();
if (!skel) { fprintf(stderr, "Failed to load BPF program\n"); return 1; }
/* Attach to the tracepoint */
if (trace_open_bpf__attach(skel)) { fprintf(stderr, "Failed to attach BPF program\n"); goto cleanup; }
/* Set up ring buffer polling */
rb = ring_buffer__new(bpf_map__fd(skel->maps.events), handle_event, NULL, NULL);
if (!rb) { fprintf(stderr, "Failed to create ring buffer\n"); goto cleanup; }
printf("%-8s %-8s %-16s %s\n", "PID", "UID", "COMM", "FILENAME");
while (running)
ring_buffer__poll(rb, 100 /* timeout ms */);
err = 0;
cleanup:
/* Cleanup (ring_buffer__free accepts NULL) */
ring_buffer__free(rb);
trace_open_bpf__destroy(skel);
return err;
}
Step 3: Compile and run
# 1. Generate vmlinux.h (once per kernel version)
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
# 2. Compile the eBPF program to BPF bytecode
clang -g -O2 -target bpf -D__TARGET_ARCH_x86 \
-c trace_open.bpf.c -o trace_open.bpf.o
# 3. Generate the skeleton header (auto-creates open/load/attach functions)
bpftool gen skeleton trace_open.bpf.o > trace_open.skel.h
# 4. Compile the userspace loader
clang -g -O2 -Wall trace_open.c -lbpf -lelf -lz -o trace_open
# 5. Run it
sudo ./trace_open
PID UID COMM FILENAME
18401 1000 bash /etc/profile
18402 0 sshd /etc/ssh/sshd_config
18403 33 nginx /var/log/nginx/access.log
18404 0 postgres /var/lib/pgsql/data/base/16384/16385
^C
The resulting trace_open binary embeds the compiled eBPF bytecode, so the target machine
needs neither clang nor kernel headers. It can run on any Linux kernel with BTF enabled, regardless of the
kernel version it was compiled on (link libbpf statically if the target may lack the shared library).
This is CO-RE in action: compile once, ship everywhere.
Writing bpftrace scripts
bpftrace scripts follow a consistent pattern: probe /filter/ { action }.
A script can combine multiple probes, built-in variables, maps, and printf-style output.
Here’s a complete script demonstrating the key features.
#!/usr/bin/env bpftrace
/*
 * trace-io-latency.bt: track block I/O latency per device,
 * print a summary every 5 seconds.
*/
BEGIN {
printf("Tracing block I/O latency... Hit Ctrl+C to stop.\n");
}
/* Record timestamp when I/O is issued */
tracepoint:block:block_rq_issue {
@start[args.dev, args.sector] = nsecs;
}
/* Calculate latency when I/O completes */
tracepoint:block:block_rq_complete
/@start[args.dev, args.sector]/ {
$lat_us = (nsecs - @start[args.dev, args.sector]) / 1000;
/* Per-device latency histogram */
@lat_hist[args.dev] = hist($lat_us);
/* Count I/Os per device */
@io_count[args.dev] = count();
/* Track max latency per device */
@max_lat[args.dev] = max($lat_us);
delete(@start[args.dev, args.sector]);
}
/* Print interval summary */
interval:s:5 {
printf("\n--- I/O Summary (last 5s) ---\n");
print(@io_count);
print(@max_lat);
clear(@io_count);
clear(@max_lat);
}
END {
printf("\n--- Final Latency Histograms ---\n");
print(@lat_hist);
clear(@start);
}
Run it:
chmod +x trace-io-latency.bt
bpftrace trace-io-latency.bt
Key bpftrace built-in variables:
| Variable | Meaning |
|---|---|
| pid | Process ID (thread group ID) |
| tid | Thread ID |
| uid | User ID |
| comm | Process name (16 chars max) |
| nsecs | Nanosecond timestamp (monotonic) |
| kstack | Kernel stack trace |
| ustack | User-space stack trace |
| args | Tracepoint arguments struct |
| retval | Return value (kretprobe/fexit) |
| curtask | Pointer to current struct task_struct |
| cpu | Current CPU number |
Key bpftrace aggregation functions:
| Function | What it produces |
|---|---|
| count() | Event count |
| sum(x) | Running sum |
| avg(x) | Average value |
| min(x) | Minimum value seen |
| max(x) | Maximum value seen |
| hist(x) | Power-of-2 histogram |
| lhist(x, min, max, step) | Linear histogram |
| stats(x) | Count, average, and total |
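The power-of-2 bucketing behind hist() explains the row labels seen in the histograms on this page (4 -> 7, 8 -> 15, and so on). The sketch below shows the idea in user-space C. It is a simplified illustration, not bpftrace's actual implementation: the function name is made up, and zero is folded into the first bucket for brevity.

```c
#include <stdint.h>

/* Returns k such that `value` falls in the bucket [2^k, 2^(k+1)).
 * Simplification: 0 and 1 both land in bucket 0. */
unsigned log2_bucket(uint64_t value)
{
    unsigned k = 0;

    while (value > 1) {
        value >>= 1;    /* each halving moves one bucket up */
        k++;
    }
    return k;
}
```

A latency of 5 microseconds lands in bucket 2 (the 4 -> 7 row) and 1023 lands in bucket 9 (512 -> 1023). Because bucket width doubles each time, a handful of counters covers nanoseconds through seconds, at the cost of coarse resolution at the high end. lhist() trades that range for fixed-width buckets.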
Further reading
On kldload.com
eBPF Security — using eBPF for runtime security monitoring and enforcement.
eBPF Performance — deep-dive into performance analysis with eBPF.
Custom eBPF Programs — writing and deploying your own eBPF programs.
eBPF Masterclass — advanced eBPF for infrastructure engineers.
Advanced Observability — integrating eBPF into your monitoring stack.
External resources
Books: "BPF Performance Tools" by Brendan Gregg (the definitive reference),
"Learning eBPF" by Liz Rice (excellent introduction).
Sites: ebpf.io (official eBPF project site),
brendangregg.com/ebpf.html (tools and methodology),
docs.kernel.org/bpf/ (kernel documentation).
Code: github.com/iovisor/bcc (BCC tools),
github.com/bpftrace/bpftrace (bpftrace),
github.com/libbpf/libbpf (libbpf),
github.com/cilium/ebpf (Go library),
github.com/aya-rs/aya (Rust library).