eBPF Masterclass
This guide goes deep on eBPF — the kernel extension mechanism that powers Cilium, Falco, every modern performance profiler, and most of what makes Linux observability fast and precise. If you have read the Networking tutorial and seen eBPF in the context of packet filtering, this is the next step: understanding what eBPF actually is, how it works, and how to use the full toolkit that ships with kldload to answer any question about what your kernel is doing.
What this page covers: the eBPF execution model, program types, maps, the verifier, bcc-tools, bpftrace one-liners, XDP packet processing, TC programmable policy, socket-level interception, map internals, writing production programs with CO-RE and BTF, security considerations, and a complete reference to the eBPF toolkit on kldload.
Prerequisites: familiarity with the Linux kernel/userland split (see Kernel vs Userland). You do not need to have written C before, but you will see C code snippets. The goal is understanding, not compilation.
1. What eBPF Actually Is
eBPF is a programming environment inside the Linux kernel. You write small programs in a restricted subset of C, compile them to eBPF bytecode, and load them into the kernel where they attach to events — a packet arriving at a NIC, a syscall being called, a kernel function being entered or returned from. When that event fires, your program runs. In the kernel. At kernel speed. Without a context switch, without copying data to userland, without a daemon in the way.
The manifesto: eBPF is not a networking technology. It is not an observability technology. It is a general-purpose kernel extension mechanism that happens to be incredible at both. Cilium is an eBPF application. bpftrace is an eBPF application. Falco is an eBPF application. The Linux perf subsystem increasingly uses eBPF internally. Understanding that eBPF is the mechanism — not the use case — matters because it means you can use eBPF for anything that touches the kernel, not just the use cases someone already built a tool for. If you can express your question as "when kernel event X happens, record Y," eBPF can answer it.
How it works, in one paragraph
You write a C function annotated with the event it should attach to. The clang compiler (with the BPF target) compiles it to eBPF bytecode — a RISC-like instruction set with 11 registers and 512 bytes of stack. You call bpf(BPF_PROG_LOAD) to hand the bytecode to the kernel. The kernel verifier checks it (see below). If it passes, the bytecode is JIT-compiled to native machine code and attached to the hook. The next time that event fires, your native machine code runs in the kernel.
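To make "bytecode" concrete, here is a minimal sketch of how the two-instruction program "r0 = 2; exit" is encoded. The struct mirrors the 8-byte instruction layout of the kernel's struct bpf_insn; the names ebpf_insn, the EBPF_* constants, and build_return_const are ours, chosen for illustration. The opcode values are the standard ones from the kernel UAPI headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of the kernel's 8-byte eBPF instruction layout
 * (struct bpf_insn in <linux/bpf.h>). The name ebpf_insn is hypothetical. */
struct ebpf_insn {
    uint8_t code;        /* opcode: class | operation | source */
    uint8_t dst_reg : 4; /* destination register, r0..r10 */
    uint8_t src_reg : 4; /* source register, r0..r10 */
    int16_t off;         /* signed offset for jumps and memory access */
    int32_t imm;         /* signed 32-bit immediate */
};

/* Opcode building blocks, with the values used by the kernel UAPI. */
enum {
    EBPF_CLASS_JMP   = 0x05,
    EBPF_CLASS_ALU64 = 0x07,
    EBPF_OP_EXIT     = 0x90,
    EBPF_OP_MOV      = 0xb0,
    EBPF_SRC_IMM     = 0x00,
};

/* Fill prog[] with "r0 = ret; exit" and return the instruction count. */
size_t build_return_const(struct ebpf_insn prog[2], int32_t ret)
{
    prog[0] = (struct ebpf_insn){
        .code = EBPF_CLASS_ALU64 | EBPF_OP_MOV | EBPF_SRC_IMM,
        .dst_reg = 0, /* r0 holds the program's return value */
        .imm = ret,
    };
    prog[1] = (struct ebpf_insn){
        .code = EBPF_CLASS_JMP | EBPF_OP_EXIT,
    };
    return 2;
}
```

For an XDP program, a return value of 2 corresponds to XDP_PASS, so these 16 bytes are roughly what clang emits for a function that only says "return XDP_PASS;".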
The verifier — why eBPF can't crash the kernel
Before any eBPF program runs, the kernel's verifier performs static analysis on the bytecode. It checks that the program terminates (bounded loops only, no infinite loops), never accesses memory out of bounds, never dereferences uninitialized pointers, calls only an allowlist of BPF helper functions rather than arbitrary kernel functions, and always returns an appropriate value. A program that fails verification is rejected with an error message and never loads. A program that passes is safe by construction: short of a bug in the verifier or JIT itself, it cannot crash the kernel.
Kernel modules vs eBPF
Kernel modules run arbitrary C code in the kernel — they can do anything, including corrupting memory and causing panics. They require a matching kernel version and recompilation when the kernel changes. eBPF programs run a verified, sandboxed bytecode. They cannot crash the kernel. With CO-RE (see section 10), they compile once and run on any kernel that has BTF enabled — which includes every kldload kernel.
What "eXtended" means
The original BPF (Berkeley Packet Filter, 1992) was a 32-bit, 2-register, packet-only filter language — tcpdump still uses it. eBPF (extended BPF, Linux 3.18, 2014) is a completely different machine: 64-bit, 11 registers, maps for persistent state, helper functions for kernel services, and attachment points across the entire kernel — not just packets. The "e" in eBPF is significant: this is a general-purpose kernel programming environment, not a packet filter with extra steps.
2. The eBPF Execution Model
To use eBPF effectively you need a mental model of how programs run, how they communicate with userland, and what constraints they operate under. None of this is complicated once you see the pieces.
Program Types
Every eBPF program has a type that determines where it attaches and what context it receives. The type is declared at load time and constrains which helper functions are available.
XDP — eXpress Data Path
Attaches at the NIC driver, before the kernel allocates an sk_buff. Receives raw packet bytes. Can return XDP_PASS (continue to the kernel stack), XDP_DROP (discard immediately), XDP_TX (send the packet back out the interface it arrived on), XDP_REDIRECT (send to another interface, CPU, or AF_XDP socket), or XDP_ABORTED (drop and fire an error tracepoint). This is the earliest software hook in the receive path: it runs before any per-packet memory allocation.
TC — Traffic Control
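Attaches to the TC ingress or egress hook. Has access to the full sk_buff: header offsets, device indexes, mark, priority, and (on egress) the owning socket. On egress the routing decision has already been made; on ingress the program runs before netfilter and routing, so kernel conntrack state is not yet attached to the packet. Can modify packets, redirect, or drop. Used by Cilium for pod-level network policy. Slower than XDP (sk_buff allocation has happened) but far more capable.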
kprobe / kretprobe
Attaches to almost any kernel function entry (kprobe) or return (kretprobe); inlined and blacklisted functions cannot be probed. Receives function arguments or return values. Powerful but fragile, because kernel function names and signatures change between versions. bpftrace and bcc both use kprobes for dynamic tracing. Used for debugging and dynamic tracing of arbitrary kernel behavior.
Tracepoint
Attaches to kernel tracepoints: static instrumentation points that kernel developers place deliberately and try to keep stable across releases. More portable than kprobes, because tracepoint names and arguments change far less often than internal function signatures. Covers syscalls, scheduler events, block I/O, network events. bcc tools prefer tracepoints over kprobes for stability.
Socket programs — sk_msg, sockops, sk_lookup
sockops fires on socket operations (connect, accept, etc.) and can set socket options. sk_msg fires on every message sent through a socket. sk_lookup intercepts socket lookup and can redirect to a different socket. Together these allow Cilium to short-circuit the network stack for same-node pod communication.
cgroup programs
Attaches to a cgroup and applies to all processes in that cgroup. Types include cgroup_skb (filter the cgroup's ingress and egress socket traffic), cgroup_sock (intercept socket creation), cgroup_device (control device access). Kubernetes uses cgroup v2; Cilium uses cgroup programs to enforce network policy at the cgroup level rather than the network namespace level.
Maps — Shared State Between Kernel and Userland
eBPF programs are stateless by themselves — each invocation starts fresh. Maps are
the persistent store. A map is a key-value data structure that lives in kernel
memory but is accessible from both eBPF programs and userland. An eBPF program
updates a counter in a map; your Go monitoring daemon reads that counter via
bpf(BPF_MAP_LOOKUP_ELEM). This is the bridge between kernel speed and human
readability. Map types are covered in detail in section 9.
How maps work
Maps are created by calling bpf(BPF_MAP_CREATE) with a type, key size, value size, and max entries. The kernel returns a file descriptor. eBPF programs access maps through a file descriptor reference embedded in the bytecode. Userland programs access maps through the same fd or by pinning the map to the BPF filesystem at /sys/fs/bpf/.
Pinning maps to /sys/fs/bpf/
Maps are reference-counted — they persist as long as something holds a reference (a loaded program or an open fd). Pinning a map to the BPF filesystem creates a persistent file that survives the loading process exiting. Multiple programs and tools can then open the pinned map by path. This is how bpftool inspects maps created by Cilium or other long-running daemons.
3. eBPF on kldload
kldload ships the complete eBPF toolkit out of the box. No dependency hunting, no kernel header mismatches, no pip install chains. Everything is present on the live ISO and installed to every target system that uses the desktop or server profile.
| Tool | What it gives you | How to use it |
|---|---|---|
| bcc-tools | 80+ ready-made eBPF programs for networking, disk, CPU, memory | /usr/share/bcc/tools/ (each is a standalone command) |
| bpftrace | One-liner kernel tracing language built on eBPF | bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }' |
| bpftool | Low-level inspection: list programs, dump maps, show BTF | bpftool prog list, bpftool map dump id N |
| kernel headers | Required to compile eBPF programs against kernel types | /usr/include/linux/, installed via kernel-headers package |
| libbpf | C library for loading and managing eBPF programs from userland | Link against it in your C/Go/Rust programs |
| BTF support | BPF Type Format, which enables CO-RE portability across kernels | /sys/kernel/btf/vmlinux exists on all kldload kernels |
The kernel version matters. Per-CPU map types arrived in 4.6, XDP in 4.8, kernel BTF in 5.2, bounded loops (required for complex programs) in 5.3, and bpf_ringbuf (low-overhead event streaming) in 5.8. kldload ships kernel 5.14+ on CentOS Stream 9 and 6.1+ on Debian 13, so all of these features are available and the bcc tools work without extra setup.
4. bcc-tools — 80+ Ready-Made eBPF Programs
The BCC (BPF Compiler Collection) ships a library of production-grade eBPF tools
that cover the most common operational questions across networking, disk, CPU, and
memory. Each tool is a Python or C program that compiles an eBPF program on demand,
loads it, and presents formatted output. They are in /usr/share/bcc/tools/ or
accessible directly by name if the package installs symlinks.
Category overview
Networking tools
tcpconnect: every outbound TCP connection with PID, comm, source and destination IP, and destination port (connect latency is reported by the separate tcpconnlat tool). tcpaccept: every inbound accepted connection. tcpretrans: TCP retransmits with full 4-tuple and retransmit state. tcpdrop: TCP drops at the kernel level with reason codes. tcplife: TCP session lifetimes with bytes transferred.
Disk I/O tools
biolatency — block I/O latency histogram (power-of-2 buckets, microsecond resolution). biosnoop — per-I/O tracing with PID, comm, disk, offset, size, and latency. ext4slower — ext4 operations slower than a threshold (also: xfsslower, btrfsslower, zfsslower). fileslower — any filesystem read/write above a latency threshold.
CPU and scheduler tools
profile — CPU profiler using sampling, produces flamegraph-compatible stacks. offcputime — time spent off-CPU (sleeping, waiting, blocked on I/O or locks). runqlat — CPU run queue latency histogram: how long did tasks wait before getting a CPU? cpudist — on-CPU time distribution per wakeup. softirqs / hardirqs — interrupt handler time histograms.
Memory tools
memleak — allocations without corresponding frees, grouped by call stack. oomkill — OOM kill events with the victim process, its memory stats, and what triggered the kill. shmsnoop — shared memory operations. drsnoop — direct reclaim events (kernel trying to free memory synchronously, causing latency spikes).
Ten tools you should know immediately
tcpconnect
Shows every outbound TCP connection as it happens. You see the process name, PID, source IP, destination IP, and port — in real time, before the connection completes. Useful for auditing what your services are connecting to, catching unexpected outbound connections from compromised processes, and debugging connection failures.
biolatency
Block I/O latency histogram. Runs for N seconds and prints a power-of-2 histogram of I/O completion times. Instantly shows whether disk latency is predictable (tight histogram) or has outliers (long tail). The -D flag breaks it down per disk device.
runqlat
CPU scheduler run queue latency. Shows how long tasks wait in the run queue before getting scheduled onto a CPU. A healthy system has nearly all tasks scheduled within 10 microseconds. Latency spikes above 1ms indicate CPU contention — too many runnable threads, too few CPUs, or scheduler interference from realtime tasks.
profile
CPU profiler. Samples stack traces at a specified frequency (default 49 Hz) across all CPUs. After the sampling window, prints a summary of hot stack frames sorted by count. Combine with flamegraph.pl to produce a flamegraph SVG. Works for both kernel and userland stacks, but user-space stack walking relies on frame pointers: binaries built without them (for example with -fomit-frame-pointer) produce truncated or missing user stacks.
tcpretrans
TCP retransmit events with full 4-tuple and the retransmit state (ESTABLISHED, CLOSE_WAIT, etc.). This catches packet loss that is invisible to application logs — the kernel retransmits silently unless you are watching. Useful for diagnosing intermittent slowness that correlates with network congestion or faulty hardware.
offcputime
Time spent off-CPU — blocked, sleeping, waiting for I/O, waiting for a lock. While profile shows what your CPUs are doing, offcputime shows what your threads are waiting on. Essential for diagnosing slow services that are not CPU-bound: database queries, lock contention, slow syscalls, NFS stalls.
ext4slower / xfsslower / zfsslower
Filesystem operations slower than a threshold (default 10ms). Shows the process, operation type (read/write/open/fsync), filename, size, and latency. The filesystem-specific variants parse the filesystem's internal function calls, not just VFS, so they capture cases where latency comes from filesystem-internal work (checksumming, journaling, copy-on-write).
memleak
Memory leak detector. Traces malloc/free (and kernel allocations) and reports allocations with no corresponding free, grouped by call stack. Does not require recompilation or instrumentation of the target process — it attaches to the running process via uprobes. Run it for a few minutes, then look at what is growing.
opensnoop
Traces open() syscalls system-wide or for a specific PID. Shows which files every process is opening, when, and whether the open succeeded or failed. Useful for auditing file access, debugging "file not found" errors in containerized applications, and understanding the access patterns of unfamiliar processes.
execsnoop
Traces every execve() call system-wide — every process launched, with its arguments. Indispensable for security auditing (what processes are running?), debugging cascading failure (what is this service spawning?), and understanding build systems or CI pipelines at a granular level.
5. bpftrace — One-Liner Kernel Tracing
bpftrace is a high-level tracing language that compiles to eBPF on the fly. The syntax is inspired by awk and DTrace. You write a probe specification, an optional filter, and an action — bpftrace compiles the action to an eBPF program, attaches it to the probe, and streams output. It is the fastest path from a question to an answer in the kernel.
Syntax overview
probe_type:probe_name / filter / {
action
}
// probe_type: tracepoint, kprobe, kretprobe, uprobe, usdt, profile, interval, BEGIN, END
// filter: optional boolean expression (comm == "nginx", pid == 1234, etc.)
// action: bpftrace statements — printf, @maps, hist, count, etc.
// Built-in variables:
// pid, tid, uid, gid, comm (process name), cpu, nsecs, args (tracepoint args)
// retval (kretprobe return value), func (function name), curtask
// Map types:
// @x = count() — counter
// @x = hist(val) — power-of-2 histogram
// @x[key] = val — associative map
// @x = lhist(val, min, max, step) — linear histogram
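The power-of-2 histograms mentioned above (hist() here, and tools such as biolatency) all rest on the same idea: a value is filed under the bucket given by its bit length. A minimal userland sketch of that bucketing follows; log2_bucket and hist_add are hypothetical names, and this is not bpftrace's actual implementation.

```c
#include <stdint.h>

/* Index of the power-of-2 bucket holding v: 0 for v == 0, otherwise
 * floor(log2(v)) + 1, so bucket k (k >= 1) covers [2^(k-1), 2^k). */
unsigned log2_bucket(uint64_t v)
{
    unsigned bucket = 0;
    while (v != 0) {
        bucket++;
        v >>= 1;
    }
    return bucket;
}

/* Record one sample into a histogram with nbuckets slots. */
void hist_add(uint64_t *hist, unsigned nbuckets, uint64_t value)
{
    unsigned b = log2_bucket(value);
    if (b >= nbuckets)
        b = nbuckets - 1; /* clamp outliers into the last bucket */
    hist[b]++;
}
```

Because the bucket index needs only shifts and a bounded loop, the same computation is cheap enough to run in the kernel on every event, which is why these histograms can be aggregated in a map instead of streaming each sample to userland.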
15 essential one-liners
Syscall counts per process
Count every syscall entry, grouped by process name. Print on Ctrl-C. Shows the system call profile of every running process in real time — which processes are making the most syscalls, and which syscalls.
Files being opened
Print every filename passed to openat(), with the process name. System-wide file access tracer in one line. Add a filter like / comm == "nginx" / to scope it to a single process.
DNS queries
Trace outbound DNS queries at the socket level — every process, every query, no pcap. Uses the sendmsg tracepoint scoped to UDP on port 53. Prints process name and the raw bytes of the DNS packet (parse with a DNS library if needed).
TCP retransmits
Count TCP retransmit events per destination IP. The kernel fires tcp_retransmit_skb on every retransmit. This one-liner gives you a per-IP retransmit count that updates in real time — run it during a suspected network issue to find the problematic destination.
New process exec
Print every process execution with its arguments. Built using the execve tracepoint — fires the moment execve is called, before the new process runs. Useful for security monitoring, debugging unexpected process spawning, or understanding what a build system does.
Block I/O latency histogram
Measure time between block I/O issue and completion. Store issue timestamps in a map keyed by request pointer, then measure elapsed time on completion. Produces a power-of-2 histogram of latencies in microseconds — same as biolatency but in one bpftrace line.
Page faults per process
Count page faults (both minor and major) per process name. A burst of major page faults means the kernel is reading data from disk into memory — working set is larger than physical RAM. A burst of minor faults is normal (anonymous memory allocation), but unusually high counts can indicate memory pressure.
Context switches per process
Count voluntary and involuntary context switches per process. High involuntary context switch rates indicate the process is being preempted — it wants to run but is being forced off the CPU. High voluntary context switch rates indicate the process is frequently sleeping (I/O bound, lock waiting).
Cache miss rate
Sample CPU cache misses using hardware performance counters. The hardware:cache-misses probe fires on LLC (last-level cache) miss events. High cache miss rates in hot code paths indicate poor data locality — a target for optimization or a sign of working set overflow.
Read/write syscall latency histogram
Measure the latency of read() and write() syscalls per process. This captures I/O latency as seen by the application — including time waiting for the kernel, filesystem, and disk. Useful for correlating application-level slowness with kernel I/O overhead.
Network interface packet rate
Count packets transmitted per network interface per second. Uses the net:net_dev_xmit tracepoint. Run it during a suspected traffic spike to identify which interface and which process is generating the traffic — before you even open tcpdump.
Kernel function call counts
Count calls to any kernel function by name using kprobe. Useful for understanding how often a specific kernel code path is taken — e.g., how often the kernel drops into slow path for memory allocation, or how frequently a specific driver function is called during I/O.
malloc size distribution
Trace userland malloc() calls and build a histogram of allocation sizes for a specific process. Uses uprobes on libc. Shows whether a process is making many small allocations (fragmentation risk) or a few large ones (OOM risk), and what call stacks are doing the allocating.
Signal delivery
Trace signal delivery events system-wide. Shows which process sent which signal to which target. Useful for debugging unexpected process termination (who sent SIGKILL?), OOM events (the OOM killer sends SIGKILL), and misbehaving signal handlers.
Scheduler wakeup latency
Measure time between a task being woken up (made runnable) and actually running on a CPU. The gap is time spent in the run queue. Equivalent to runqlat but as a raw bpftrace one-liner that you can scope to a specific process or cgroup.
6. XDP — Packet Processing at the NIC Driver
XDP (eXpress Data Path) is the earliest kernel hook in the packet receive path.
It runs before the kernel allocates an sk_buff — the large data structure that
represents a packet in the normal Linux network stack. Because there is no sk_buff
allocation, no routing lookup, no conntrack state check, XDP programs run at
near-line-rate on modern NICs and can process millions of packets per second on a
single CPU core.
XDP attachment modes
Native XDP: the driver itself calls the eBPF program before sk_buff allocation. Requires driver support (most modern NICs: Intel ixgbe/i40e, Mellanox mlx5, Broadcom bnxt, virtio-net). This is the fastest software mode, because processing happens in the driver's receive (NAPI poll) path. Generic XDP: a fallback for drivers without native support; it runs after sk_buff allocation, so it loses the performance advantage but keeps the same API.
XDP return codes
XDP_PASS: let the packet continue to the normal kernel network stack. XDP_DROP: discard the packet immediately, with no sk_buff allocated, no conntrack entry, and no routing lookup, so dropped packets cost almost nothing. XDP_TX: send the packet back out the interface it arrived on (useful for bounceback/reflection). XDP_REDIRECT: send to a different interface, CPU queue, or AF_XDP socket. XDP_ABORTED: drop with a tracepoint fired (debugging).
Concrete program: DDoS mitigation blocklist
// xdp_blocklist.c: drop packets whose source IP is in a blocklist map
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>   // bpf_htons()

// Map: source IP (u32, network byte order) -> drop flag (u8). Updated from userland.
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 1000000);
    __type(key, __u32);
    __type(value, __u8);
} blocklist SEC(".maps");

SEC("xdp")
int xdp_filter(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Each header access needs a bounds check first, or the verifier rejects the program.
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end) return XDP_PASS;

    __u32 saddr = ip->saddr;   // copy the key to the stack before the lookup
    __u8 *blocked = bpf_map_lookup_elem(&blocklist, &saddr);
    if (blocked && *blocked) return XDP_DROP;
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
# Load the program (requires clang + llvm + kernel headers)
clang -O2 -target bpf -c xdp_blocklist.c -o xdp_blocklist.o
ip link set dev eth0 xdp obj xdp_blocklist.o sec xdp
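# Add an IP to the blocklist from userland. This assumes the map has been
# pinned at /sys/fs/bpf/blocklist, for example with:
#   bpftool map pin name blocklist /sys/fs/bpf/blocklist
# Key bytes are 192.168.1.5 in network byte order, matching ip->saddr.
bpftool map update pinned /sys/fs/bpf/blocklist \
    key hex c0 a8 01 05 \
    value hex 01    # 1 = blocked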
# Remove a program:
ip link set dev eth0 xdp off
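The key passed to bpftool has to match the bytes of ip->saddr exactly as they appear in the packet, which is network byte order. A small userland sketch, assuming POSIX inet_pton(), shows how a dotted-quad address turns into those four key bytes. The function names ipv4_to_map_key and format_bpftool_key are hypothetical helpers, not part of any library.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Convert dotted-quad text into the 4 key bytes exactly as they sit in
 * iphdr.saddr (network byte order). Returns 0 on success, -1 on bad input. */
int ipv4_to_map_key(const char *text, uint8_t key[4])
{
    struct in_addr addr;
    if (inet_pton(AF_INET, text, &addr) != 1)
        return -1;
    memcpy(key, &addr.s_addr, 4); /* s_addr is already in network order */
    return 0;
}

/* Format the key as the space-separated hex bytes that follow "key hex".
 * Returns the number of characters written, as snprintf does. */
int format_bpftool_key(const uint8_t key[4], char *out, size_t outlen)
{
    return snprintf(out, outlen, "%02x %02x %02x %02x",
                    key[0], key[1], key[2], key[3]);
}
```

For "192.168.1.5" this yields the bytes c0 a8 01 05 on both little-endian and big-endian hosts, because the bytes are copied as stored rather than interpreted as an integer.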
Concrete program: XDP load balancer (REDIRECT to backend)
// Minimal XDP load balancer: hash src IP to one of N backends
// Full implementation needs MAC rewrite + ARP handling; this shows the structure.
// Includes are the same as in xdp_blocklist.c.
#define MAX_BACKENDS 8

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, MAX_BACKENDS);   // up to 8 backends
    __type(key, __u32);
    __type(value, __u32);                // backend interface ifindex
} backends SEC(".maps");

SEC("xdp")
int xdp_lb(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS;   // only balance IPv4

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end) return XDP_PASS;

    __u32 backend_idx = bpf_ntohl(ip->saddr) % MAX_BACKENDS;
    __u32 *ifindex = bpf_map_lookup_elem(&backends, &backend_idx);
    // Array slots are zero-initialized, so 0 means "no backend configured".
    if (!ifindex || *ifindex == 0) return XDP_PASS;
    return bpf_redirect(*ifindex, 0);
}
Performance numbers: On a single core, XDP_DROP can achieve 10–14 million packets per second on a 10G NIC with native XDP driver support. An XDP load balancer with REDIRECT runs at 6–8 Mpps. Compare to iptables DROP at roughly 1–1.5 Mpps. Under a volumetric DDoS, this difference determines whether your server falls over or keeps serving legitimate traffic.
7. TC — Programmable Per-Packet Policy
TC (Traffic Control) eBPF programs attach to the Linux traffic control subsystem, the same qdisc layer that tc uses for bandwidth shaping and QoS. Unlike XDP, TC programs run after sk_buff allocation, which means they have access to the packet metadata the sk_buff carries: device indexes, mark, priority, tunnel keys, and on egress the routing decision plus the owning socket and its cgroup. Security identities and connection state can be layered on top through maps, which is how Cilium implements them. This makes TC the right choice for policy enforcement that needs context XDP cannot provide.
TC vs XDP: when to use which
Use XDP when you need maximum performance and the decision can be made from raw packet bytes alone — DDoS mitigation, simple stateless filtering, hardware offload. Use TC when you need socket context, routing info, cgroup membership, or conntrack state — policy enforcement, traffic accounting, marking packets with metadata for downstream processing.
Attaching TC programs
TC eBPF programs attach to a qdisc's ingress or egress hook using tc filter add. The clsact qdisc must be created first — it provides a direct-action hook that lets the eBPF program return a verdict without a separate classifier. Load with tc qdisc add dev eth0 clsact, then tc filter add dev eth0 ingress bpf da obj prog.o sec classifier.
Concrete program: per-source-IP packet counter
// tc_counter.c: count packets per source IP, readable from userland
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>   // bpf_htons()

struct pkt_count {
    __u64 packets;
    __u64 bytes;
};

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);   // source IP, network byte order
    __type(value, struct pkt_count);
} ip_counts SEC(".maps");

SEC("classifier")
int tc_ingress(struct __sk_buff *skb) {
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return TC_ACT_OK;
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return TC_ACT_OK;   // count IPv4 only

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end) return TC_ACT_OK;

    __u32 saddr = ip->saddr;
    struct pkt_count *cnt = bpf_map_lookup_elem(&ip_counts, &saddr);
    if (cnt) {
        // The value is shared across CPUs, so use atomic adds.
        __sync_fetch_and_add(&cnt->packets, 1);
        __sync_fetch_and_add(&cnt->bytes, skb->len);
    } else {
        struct pkt_count first = { .packets = 1, .bytes = skb->len };
        bpf_map_update_elem(&ip_counts, &saddr, &first, BPF_ANY);
    }
    return TC_ACT_OK;   // pass the packet through
}

char _license[] SEC("license") = "GPL";
# Read the map from userland after some traffic (assumes the map is pinned at this path):
bpftool map dump pinned /sys/fs/bpf/ip_counts
# key: c0 a8 01 05 (192.168.1.5 in network byte order)
# value: packets=18423 bytes=27634500
Traffic marking and classification
TC programs can set skb->mark or skb->priority for downstream processing by the normal qdisc tree. This is how you build eBPF-aware QoS: classify traffic in eBPF (full context, arbitrary logic), set a mark or priority (or rewrite the DSCP bits in the IP header), and let the standard HTB or FQ_CoDel qdisc do the actual shaping. The eBPF program handles the classification logic; the existing TC infrastructure handles the queue management.
8. Socket-Level eBPF — Application-Layer Interception
Socket-level eBPF programs operate at the socket layer — above the network stack
but below the application. They can intercept socket operations, redirect data
between sockets, and modify socket behavior without any changes to the application.
The three relevant program types are sockops, sk_msg, and sk_lookup.
sockops — socket lifecycle events
Fires on socket state transitions: BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB (connection established), BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB (accepted connection established), BPF_SOCK_OPS_STATE_CB (any state change). Can call bpf_sock_ops_cb_flags_set() to request additional callbacks, and bpf_setsockopt() to configure socket options like TCP congestion control or buffer sizes.
sk_msg — message-level redirection
Fires on every message sent through a socket that is enrolled in a BPF_MAP_TYPE_SOCKMAP or SOCKHASH. Can redirect the message to a different socket in the same map using bpf_msg_redirect_map(). The data moves directly from the sender's socket to the receiver's socket queue without traversing the TCP/IP stack.
sk_lookup — socket selection
Fires when the kernel looks up which socket should receive an incoming packet. The default behavior is to match destination IP and port. sk_lookup can override this — redirect to a different socket regardless of the destination port, implement wildcard listeners, or distribute traffic across multiple sockets. Used for transparent proxy implementations and advanced load balancing.
The performance implication is significant. Normal pod-to-pod communication on the same node traverses: socket send buffer, TCP stack, veth pair (virtual Ethernet), bridge (with MAC lookup), routing table, conntrack, TCP stack again, socket receive buffer. With socket-level eBPF redirection, the path is: socket send buffer, kernel copy, socket receive buffer. The entire network stack is bypassed. Cilium's per-node measurements show 40–60% reduction in latency for same-node pod communication, and 2–3x throughput improvement for high-bandwidth workloads.
9. eBPF Maps Deep Dive
Maps are the data layer of eBPF. Every interesting eBPF application uses maps extensively — for state, for communication with userland, for configuration, for metrics. Choosing the right map type for your use case matters for performance and correctness.
BPF_MAP_TYPE_HASH
General-purpose hash map. O(1) average lookup. Key and value are fixed-size byte arrays. Supports concurrent access from multiple CPUs (with a spinlock per bucket). bpf_map_lookup_elem, bpf_map_update_elem, bpf_map_delete_elem. Good for: connection tracking, per-IP counters, blocklists with arbitrary keys.
BPF_MAP_TYPE_ARRAY
Fixed-size array indexed by u32. O(1) lookup by array index — slightly faster than hash for integer keys. Values are zero-initialized. Cannot delete entries. Good for: configuration (indexed by config enum), per-CPU statistics (use PERCPU_ARRAY for lock-free updates), global state with a small known key space.
BPF_MAP_TYPE_LRU_HASH
LRU hash map — when the map is full, the least-recently-used entry is evicted. Bounded memory, self-managing. Essential for tracking ephemeral state (TCP connections, DNS queries) where you do not want to manage eviction yourself. Cilium uses LRU maps for its connection tracking tables.
BPF_MAP_TYPE_PERCPU_HASH / PERCPU_ARRAY
Per-CPU variants of hash and array. Each CPU core has its own copy of the value. eBPF programs access their CPU's copy without locks — zero contention. Userland reads all CPU copies and sums/aggregates them. Best for high-frequency counters (packet counts, byte counts) where lock contention on a shared counter would be the bottleneck.
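The userland half of a per-CPU counter is just a sum. A lookup on a per-CPU map hands userland an array with one value slot per possible CPU, and the reader adds them up. The sketch below assumes a value struct of two 64-bit counters; pkt_stats and sum_percpu are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed value type of the per-CPU map: two 64-bit counters. */
struct pkt_stats {
    uint64_t packets;
    uint64_t bytes;
};

/* Aggregate the per-CPU copies returned by one map lookup.
 * percpu points at ncpus consecutive values, one per possible CPU. */
struct pkt_stats sum_percpu(const struct pkt_stats *percpu, size_t ncpus)
{
    struct pkt_stats total = {0, 0};
    for (size_t i = 0; i < ncpus; i++) {
        total.packets += percpu[i].packets;
        total.bytes   += percpu[i].bytes;
    }
    return total;
}
```

The trade-off is that the total is only as fresh as the last read, and each CPU's slot may be updated while the loop runs, so the sum is a close approximation rather than an atomic snapshot.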
BPF_MAP_TYPE_RINGBUF
Ring buffer for streaming events from eBPF programs to userland. Replaces the older BPF_MAP_TYPE_PERF_EVENT_ARRAY (which required per-CPU buffers). Single ring buffer, shared across CPUs, with ordering guarantees. bpf_ringbuf_output() copies event data to the ring. Userland polls with epoll and reads events via ring_buffer__poll() (libbpf API). Low-overhead, high-throughput event streaming.
BPF_MAP_TYPE_LPM_TRIE
Longest-prefix-match trie — the data structure used in IP routing tables. Key is a prefix (network address + prefix length). Lookup finds the most specific matching prefix. Perfect for: IP blocklists with CIDR ranges, routing policy tables, geo-blocking by prefix block.
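The lookup semantics are easier to see in plain code. The sketch below uses a key laid out in the style of the kernel's LPM trie key (a prefix length in bits followed by the address bytes) and a deliberately naive linear scan in place of the trie. The types and function names are hypothetical; only the matching rule is the point.

```c
#include <stddef.h>
#include <stdint.h>

/* Key in the style of the kernel's LPM trie key: prefix length in bits,
 * then the IPv4 address bytes in network order. */
struct lpm_v4_key {
    uint32_t prefixlen;
    uint8_t  addr[4];
};

struct lpm_v4_entry {
    struct lpm_v4_key key;
    int value;
};

/* Return 1 when the first prefixlen bits of addr equal the prefix. */
int prefix_matches(const struct lpm_v4_key *prefix, const uint8_t addr[4])
{
    uint32_t bits = prefix->prefixlen;
    if (bits > 32)
        return 0;
    for (size_t i = 0; i < 4 && bits > 0; i++) {
        /* Full byte, or only the top `bits` bits of the last partial byte. */
        uint8_t mask = bits >= 8 ? 0xff : (uint8_t)(0xff << (8 - bits));
        if ((prefix->addr[i] & mask) != (addr[i] & mask))
            return 0;
        bits = bits >= 8 ? bits - 8 : 0;
    }
    return 1;
}

/* Linear scan that returns the most specific matching entry, or NULL. */
const struct lpm_v4_entry *lpm_lookup(const struct lpm_v4_entry *table,
                                      size_t n, const uint8_t addr[4])
{
    const struct lpm_v4_entry *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (!prefix_matches(&table[i].key, addr))
            continue;
        if (best == NULL || table[i].key.prefixlen > best->key.prefixlen)
            best = &table[i];
    }
    return best;
}
```

With entries for 10.0.0.0/8 and 10.1.0.0/16, the address 10.1.2.3 matches both and resolves to the /16 entry. The kernel's trie reaches the same answer without scanning every entry.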
Reading maps from userland with bpftool
bpftool is the Swiss Army knife for eBPF inspection. It can list all loaded programs, dump map contents, show BTF type information, and pin/unpin maps to the BPF filesystem. You do not need source code to inspect a running Cilium or Falco installation — bpftool reads their maps directly.
Concrete: rate limiter map
A per-IP rate limiter uses a hash map with a per-IP token bucket. The eBPF program checks the bucket on each packet, refills based on elapsed time (using bpf_ktime_get_ns()), and returns DROP or PASS. Userland can set per-IP rate limits by updating the map. No kernel module, no iptables hashlimit, no userland daemon in the packet path.
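The token bucket logic described above can be sketched in ordinary C. In a real eBPF program the bucket would be the map value and now_ns would come from bpf_ktime_get_ns(); here both are plain arguments so the logic can be read and tested in userland. The struct layout and the name bucket_allow are assumptions made for illustration, and the arithmetic assumes rates and burst sizes well below 10^9.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-IP map value. */
struct token_bucket {
    uint64_t tokens;       /* tokens currently available */
    uint64_t last_ns;      /* time of the last refill */
    uint64_t rate_per_sec; /* tokens added per second */
    uint64_t burst;        /* bucket capacity */
};

/* Return 1 if the packet may pass, 0 if it should be dropped. */
int bucket_allow(struct token_bucket *b, uint64_t now_ns)
{
    if (b->rate_per_sec > 0 && now_ns > b->last_ns) {
        uint64_t elapsed = now_ns - b->last_ns;
        uint64_t refill;

        if (elapsed / NSEC_PER_SEC >= b->burst / b->rate_per_sec + 1)
            refill = b->burst; /* idle long enough to fill the bucket */
        else
            refill = elapsed * b->rate_per_sec / NSEC_PER_SEC;

        if (refill > 0) {
            uint64_t filled = b->tokens + refill;
            b->tokens = filled > b->burst ? b->burst : filled;
            /* Sub-token remainders are discarded for simplicity. */
            b->last_ns = now_ns;
        }
    }
    if (b->tokens == 0)
        return 0;
    b->tokens--;
    return 1;
}
```

In the XDP or TC version, the return value maps onto XDP_PASS/XDP_DROP or TC_ACT_OK/TC_ACT_SHOT, and userland changes a limit by writing new rate_per_sec and burst values into the map entry. Concurrent updates from several CPUs would need atomics or a bpf_spin_lock, which this sketch leaves out.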
10. Writing Production eBPF Programs
Early eBPF programs were tied to a specific kernel version. You compiled against kernel headers from one specific kernel build, and the program broke if the kernel was updated because internal struct layouts changed. CO-RE (Compile Once — Run Everywhere) changed this. Combined with BTF, CO-RE programs compile once and adapt themselves to whatever kernel they load on.
BTF — BPF Type Format
BTF is a compact binary format that encodes the type information (struct layouts,
function signatures, enum values) of every kernel type. It is stored in
/sys/kernel/btf/vmlinux on BTF-enabled kernels. Every kldload kernel has BTF
enabled. When an eBPF program uses CO-RE relocations, the loader (libbpf) reads the
kernel's BTF at load time and patches the program's field offsets to match the
running kernel's actual struct layout. If a struct field moved between kernel
versions, libbpf fixes it up before loading.
CO-RE in practice
Instead of task->mm->pgd (a brittle direct field access), a CO-RE program uses BPF_CORE_READ(task, mm, pgd). This macro generates a relocation record in the eBPF object. When libbpf loads the program, it looks up the mm and pgd field offsets in the running kernel's BTF and patches the field access to use the correct offset. The same binary runs on kernel 5.14, 6.0, 6.8 without recompilation.
vmlinux.h — the kernel header replacement
Normally you include <linux/sched.h> and 50 other kernel headers. With BTF, you can generate a single vmlinux.h that contains all kernel type definitions, extracted directly from the running kernel's BTF. This eliminates the kernel headers dependency entirely. Generate it with: bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h. Include it in your eBPF C program. Done.
The skeleton pattern
bpftool can generate a libbpf skeleton: a C header that makes it easy to load and
manage an eBPF program from a C or C++ loader. The workflow: write the eBPF C
program, compile it to an object file, generate the skeleton header with
bpftool gen skeleton, and include the skeleton in your loader. The skeleton handles
opening and loading the program (the kernel verifies it at load time), attaching
it to the right hook, exposing maps as named struct members, and cleaning up on exit.
# 1. Write your eBPF C program: counter.bpf.c
# 2. Compile to an eBPF object (-g emits the BTF that CO-RE relies on):
clang -O2 -g -target bpf -D__TARGET_ARCH_x86 \
  -I/usr/include/$(uname -m)-linux-gnu \
  -c counter.bpf.c -o counter.bpf.o
# 3. Generate the skeleton header:
bpftool gen skeleton counter.bpf.o > counter.skel.h
# 4. In your C loader, include counter.skel.h and drive the program:
#    struct counter_bpf *skel = counter_bpf__open_and_load();
#    if (!skel) return 1;                     // open or load failed
#    if (counter_bpf__attach(skel)) return 1; // non-zero means attach failed
#    // access maps: skel->maps.ip_counts
#    counter_bpf__destroy(skel);
For Go programs, the cilium/ebpf library provides the same skeleton pattern with
a Go code generator. For Rust, aya is the idiomatic choice with native Rust eBPF
programs and a Rust loader. For Python, libbcc provides a higher-level interface
(used by all bcc tools).
11. eBPF Security Considerations
eBPF is simultaneously the best security monitoring tool available for Linux and a meaningful attack surface if access is not controlled. Understanding both sides is essential.
What eBPF can see
An eBPF program with the right program type can observe everything that passes through the kernel: every syscall with its arguments and return values, every network packet with full payload access (at TC level), every file open, read, and write, every process exec with its argument list, every memory allocation, every inter-process signal. A malicious eBPF program could log every keystroke (intercept read() on terminal file descriptors), exfiltrate every network payload (TC program), or silently drop connections (XDP program). This is not hypothetical — rootkits built on eBPF exist and are used in the wild.
CAP_BPF and CAP_PERFMON
Linux 5.8 split the previous requirement for CAP_SYS_ADMIN to load eBPF programs into finer-grained capabilities. CAP_BPF allows loading eBPF programs and creating maps. CAP_PERFMON allows attaching to performance events and tracepoints. Networking program types such as XDP and TC additionally require CAP_NET_ADMIN. Together, CAP_BPF and CAP_PERFMON allow running observability tools without full root, but either capability on a compromised process is still very powerful. On kldload, only root can load eBPF programs by default.
eBPF for security monitoring: Falco
Falco uses an eBPF probe (or kernel module) to capture syscall events and evaluate rules against them. The rules are things like: "a process in a container spawned a shell," "a process opened /etc/shadow," "a binary wrote to /usr/bin." The eBPF probe captures events at kernel speed; Falco evaluates rules in userland. The combination gives you real-time security alerting with minimal overhead.
Tetragon — kernel-enforced security policy
Cilium project
Tetragon goes further than Falco: it can enforce policy in the kernel, not just observe it. A Tetragon TracingPolicy can attach a SIGKILL action to a specific syscall pattern — the eBPF program sends the signal before the syscall completes. No userland round-trip, no race condition. If a process tries to exec a binary outside an allowed set, the kernel kills it before it runs.
kldload default security posture
kldload sets kernel.unprivileged_bpf_disabled=1 — unprivileged users cannot load eBPF programs or create maps. bcc tools and bpftrace require root. eBPF programs loaded by Cilium run as root under the Cilium agent's service account. The BPF filesystem (/sys/fs/bpf/) is accessible to root only. These defaults mean an attacker who does not have root cannot deploy a stealthy eBPF rootkit.
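A minimal sketch of checking that sysctl from C follows. The function names are hypothetical. The path a caller would pass on a real system is /proc/sys/kernel/unprivileged_bpf_disabled, and the interpretation of the values follows the kernel's sysctl documentation: 0 means unprivileged bpf() is allowed, 1 means disabled until reboot, 2 means disabled but changeable by an administrator.

```c
#include <stdio.h>

/* Read the integer stored in a sysctl file.
 * Returns the value, or -1 if the file cannot be opened or parsed. */
int read_unprivileged_bpf_disabled(const char *path)
{
    FILE *f = fopen(path, "r");
    int value;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* 1 if the setting blocks unprivileged bpf(), 0 otherwise.
 * An unreadable value (-1) is treated as "not confirmed disabled". */
int unprivileged_bpf_is_disabled(int value)
{
    return value == 1 || value == 2;
}
```

A monitoring check could call `read_unprivileged_bpf_disabled("/proc/sys/kernel/unprivileged_bpf_disabled")` and raise an alert whenever `unprivileged_bpf_is_disabled()` returns 0.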
A malicious eBPF program can hide its activity from tools such as ps, netstat, and lsof. This is why CAP_BPF exists and why only root should load eBPF programs. On kldload, the default security posture is: only root loads eBPF, bcc and bpftrace require root, and unprivileged BPF is disabled. If you are running a multi-tenant system where untrusted code runs, verify these settings are enforced. The good news: if you are already using eBPF for security monitoring (Falco, Tetragon), you have visibility into any attempt to load additional eBPF programs. Tetragon, for example, can alert on bpf(BPF_PROG_LOAD) syscalls from unexpected processes. Fighting eBPF with eBPF, from a position of first-mover advantage.
12. The kldload eBPF Toolkit
kldload is the only Linux distribution that ships eBPF development tools, runtime libraries, and a BTF-enabled kernel as a unified pre-configured stack on a bootable ISO. The live environment and every installed target system (desktop and server profiles) include:
| Component | What it is | Covered in |
|---|---|---|
| bcc-tools | 80+ production-ready eBPF programs for networking, disk, CPU, memory | Section 4 |
| bpftrace | One-liner eBPF tracing language: awk for the kernel | Section 5 |
| bpftool | Low-level eBPF inspection: programs, maps, BTF, skeleton generation | Sections 9, 10 |
| kernel headers | Required for compiling eBPF programs against kernel types | Section 3 |
| libbpf | C library for loading eBPF programs, managing maps, CO-RE support | Section 10 |
| BTF support (/sys/kernel/btf/vmlinux) | Enables CO-RE: compile-once eBPF programs that work across kernel versions | Section 10 |
| Kernel 5.14+ / 6.1+ | All modern eBPF features: bounded loops, ring buffer, BTF, CO-RE, sockmap | Section 3 |
The complete picture: a kldload server is an eBPF-native host. The kernel ships with BTF enabled, so CO-RE programs from Cilium, Falco, and Tetragon work out of the box. bcc-tools and bpftrace are installed, so any operational question can be answered from the command line in seconds. libbpf and kernel headers are installed, so you can write and compile new eBPF programs on the host without a separate development environment. bpftool is installed, so you can inspect any running eBPF program or map — whether it was loaded by Cilium, a bcc tool, or your own code — without source access.
eBPF changes the economics of kernel observability. Questions that previously required days of strace analysis, packet captures, and log correlation now take seconds — one bpftrace one-liner, one bcc tool invocation. Security monitoring that previously required expensive APM agents or kernel modules now runs as lightweight eBPF programs with sub-1% CPU overhead at production traffic rates. Network policy that previously required complex iptables rule management now lives in kernel-compiled eBPF programs that update atomically and perform at line rate. On kldload, all of this is available the moment the OS boots.
Related pages
- eBPF Reference: bcc and bpftrace quick reference card
- eBPF Security: Falco, Tetragon, and kernel-level security programs
- eBPF Performance: profiling and performance analysis with eBPF tools
- Custom eBPF Programs: writing, compiling, and deploying production eBPF programs
- Tracepoints & Probes
- Metrics & Exporters
- Core Dumps & Stacks
- XDP & TC Datapath
- eBPF Cookbook
- Networking tutorial: eBPF in the context of XDP and TC networking
- Cilium Masterclass: eBPF-native Kubernetes networking built on everything in this guide
- AI for eBPF: using AI to write and debug eBPF programs