XDP & TC Network Datapath — Packet Processing at Wire Speed
iptables evaluates each packet at up to five Netfilter hooks, and only after the kernel has allocated an sk_buff for it, filled in its metadata, and handed it up the stack. At each hook it walks rule chains whose cost grows linearly with rule count. At 10 Gbps, that overhead is measurable. At 100 Gbps, it is fatal. XDP programs run before the kernel allocates the socket buffer and TC eBPF programs run immediately after, so both intercept packets at the earliest possible points in the network stack. The result: millions of packets per second on commodity hardware, with programmable logic you write yourself.
The premise: every packet that enters your NIC has to pass through a decision point. With traditional firewalling, that decision happens late, after the kernel has already built metadata structures, walked linked lists of rules, and burned CPU cycles on packets you are going to drop anyway. XDP moves the decision to the driver level, before the kernel builds anything. TC eBPF sits one layer above: on ingress it runs after sk_buff allocation but before routing, and it also hooks egress, just before the packet is handed to the driver. Together, they can replace iptables, IPVS, tc-filter, and even userspace proxies.
What XDP Actually Is
XDP — eXpress Data Path — is a hook point in the Linux kernel that lets you attach an eBPF program to a network interface at the driver level. When a packet arrives, the NIC driver hands the raw packet data to your XDP program before allocating an sk_buff (the kernel's internal packet representation). Your program inspects the packet headers, makes a decision, and returns a single action code. The entire path from NIC to decision to action happens in the driver's NAPI poll loop — no memory allocation, no lock contention, no traversal of Netfilter chains.
Why sk_buff allocation matters
Every packet the Linux kernel processes normally gets wrapped in a struct sk_buff — a 256-byte metadata structure that tracks the packet through every layer of the network stack. Allocating and initializing this structure costs roughly 100-200 nanoseconds per packet. At 10 million packets per second, that is 1-2 seconds of CPU time per second just for bookkeeping. XDP skips this entirely for packets you drop or redirect — the packet data stays in the DMA ring buffer and your program reads it directly.
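The arithmetic behind that figure is easy to check. The following is a minimal sketch in plain userspace C; the function names are illustrative and not part of any library, the per-packet cost is the 100-200 ns estimate quoted above, and the 20 bytes of per-frame wire overhead is standard Ethernet framing (preamble, start delimiter, inter-frame gap).

```c
#include <stdint.h>

/* CPU-seconds consumed per wall-clock second when every packet costs
 * `ns_per_packet` nanoseconds of bookkeeping. A result of 1.0 means one
 * core is fully occupied by the bookkeeping alone. */
double cpu_seconds_per_second(uint64_t packets_per_sec, uint64_t ns_per_packet)
{
    return (double)packets_per_sec * (double)ns_per_packet / 1e9;
}

/* Maximum frame rate for a given link speed and Ethernet frame size.
 * Each frame occupies 20 extra bytes on the wire:
 * 7-byte preamble + 1-byte start delimiter + 12-byte inter-frame gap. */
double line_rate_pps(double bits_per_sec, unsigned frame_bytes)
{
    return bits_per_sec / ((frame_bytes + 20u) * 8.0);
}
```

Calling cpu_seconds_per_second(10000000, 100) gives 1.0 and cpu_seconds_per_second(10000000, 200) gives 2.0, matching the 1-2 seconds quoted above. line_rate_pps(10e9, 64) comes to roughly 14.88 million, which is why 64-byte packets are the usual worst-case benchmark on a 10 Gbps link.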
XDP programs receive a pointer to the raw packet data (from L2 — the Ethernet header) and must return one of five action codes. That is the entire API. No complex framework, no callback registration, no state machine — just a C function that takes packet bytes and returns an integer.
The Five XDP Actions
Every XDP program returns exactly one of these values. There are no other options. This simplicity is deliberate — it keeps the fast path minimal and predictable.
XDP_PASS
Continue normal processing. The packet proceeds to sk_buff allocation and enters the regular kernel network stack. This is the default — if your program does not match the packet, return XDP_PASS and the kernel handles it normally.
XDP_DROP
Drop the packet immediately. No sk_buff is allocated. No counter is incremented (unless your program increments one). No ICMP unreachable is sent. The packet simply vanishes. This is the action you use for DDoS mitigation — dropping millions of packets per second costs almost nothing.
XDP_TX
Transmit the packet back out the same interface it arrived on. You typically modify the packet first — swap source and destination MAC addresses, rewrite IP headers — then bounce it back. This is how you build an XDP-based load balancer that sits inline: packet arrives, headers get rewritten, packet goes back out the wire toward the real server.
XDP_REDIRECT
Send the packet to a different interface, a different CPU, or an AF_XDP socket. This is the most powerful action — it enables cross-NIC forwarding, CPU load distribution, and userspace packet processing. Used with bpf_redirect_map() to specify the target.
XDP_ABORTED
Drop the packet and trigger a tracepoint (xdp:xdp_exception). This signals an error in your program — it is the eBPF equivalent of a panic. Use it for cases that should never happen (e.g., a packet shorter than an Ethernet header). Monitor the tracepoint in production to catch bugs.
# Monitor XDP_ABORTED events (errors in your XDP programs)
perf record -e xdp:xdp_exception -a
perf script
Output when an XDP program returns ABORTED:
swapper 0 [003] 12345.678: xdp:xdp_exception: prog_id=42 action=ABORTED ifindex=2
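When you log or export these return codes from userspace, it helps to have the numeric values to hand. The sketch below mirrors the values of enum xdp_action in <linux/bpf.h>; the helper name xdp_action_name is hypothetical and exists only for this illustration.

```c
#include <stddef.h>

/* Numeric values match enum xdp_action in <linux/bpf.h>:
 * XDP_ABORTED = 0, XDP_DROP = 1, XDP_PASS = 2, XDP_TX = 3, XDP_REDIRECT = 4. */
const char *xdp_action_name(unsigned int action)
{
    static const char *const names[] = {
        "XDP_ABORTED", "XDP_DROP", "XDP_PASS", "XDP_TX", "XDP_REDIRECT",
    };

    /* Anything outside the five defined codes is not a valid action. */
    if (action >= sizeof(names) / sizeof(names[0]))
        return "UNKNOWN";
    return names[action];
}
```

Note that XDP_ABORTED is zero, so a program that falls through and returns 0 by accident drops traffic and fires the exception tracepoint rather than passing packets.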
XDP Modes — Native, Generic, and Offloaded
Not all XDP is created equal. The mode determines where your program runs and how fast it executes. The difference between native and generic XDP can be 10x in throughput.
Native XDP (driver mode)
The XDP program runs inside the NIC driver's NAPI poll loop, before sk_buff allocation. This is the fast path — 10-20 million pps on a modern server. Requires driver support. Most modern drivers support it: i40e (Intel X710/XL710), mlx5 (Mellanox ConnectX-4/5/6), ixgbe (Intel 10G), ice (Intel E800), bnxt (Broadcom), virtio_net, veth.
Generic XDP (skb mode)
The XDP program runs after sk_buff allocation, inside netif_receive_skb(). Works on every NIC — no driver support needed. But it does not skip the allocation, so you lose the primary performance benefit. Useful for development and testing, not for production DDoS mitigation. Throughput: roughly equivalent to TC eBPF.
Offloaded XDP
The XDP program runs on the NIC's onboard processor. Zero host CPU usage for matched packets. Currently supported only by Netronome SmartNICs (nfp driver). The eBPF instruction set is compiled to NIC firmware. Throughput: line rate at 40/100 Gbps with zero CPU cost. The trade-off: limited eBPF feature support (no tail calls, limited map types).
# Check if your driver supports native XDP
ethtool -i eth0 | grep driver
# driver: mlx5_core <-- native XDP supported
# Attach in native mode (default when driver supports it)
ip link set dev eth0 xdp obj xdp_prog.o sec xdp
# Force generic mode (for testing on unsupported drivers)
ip link set dev eth0 xdpgeneric obj xdp_prog.o sec xdp
# Force offloaded mode (SmartNIC only)
ip link set dev eth0 xdpoffload obj xdp_prog.o sec xdp
# Verify what mode is active
ip link show eth0
Output showing native XDP attached:
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP
link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff
prog/xdp id 47 tag a3b4c5d6e7f8a9b0
If ip link show reports prog/xdp, you are running native. If it reports prog/xdpgeneric, you are on the slow path and paying for an sk_buff allocation on every packet. On virtio_net (KVM guests), native XDP has been supported since kernel 4.10, so you have no excuse.
TC eBPF — The Egress Counterpart
XDP only operates on ingress — packets arriving at the NIC. If you need to process packets leaving the machine, you need TC (Traffic Control) eBPF. TC eBPF programs attach to the clsact qdisc and run at the cls_bpf classifier hook, which fires on both ingress and egress. TC programs operate on sk_buff (not raw packet data), so they have access to more metadata — routing decisions, socket information, connection tracking — at the cost of running later in the stack than XDP.
TC actions vs XDP actions
TC eBPF programs return TC_ACT_OK (accept and continue), TC_ACT_SHOT (drop), TC_ACT_REDIRECT (send elsewhere), TC_ACT_PIPE (continue to the next action), TC_ACT_UNSPEC (fall through to the next filter), or TC_ACT_STOLEN (consume without further processing). The semantics differ from XDP because TC works on a fully built sk_buff. On ingress it runs before routing and Netfilter; on egress it runs after routing and Netfilter, just before the packet is handed to the driver.
# Attach a TC eBPF program to ingress
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress bpf da obj tc_prog.o sec tc
# Attach a TC eBPF program to egress
tc filter add dev eth0 egress bpf da obj tc_prog.o sec tc
# List attached TC programs
tc filter show dev eth0 ingress
tc filter show dev eth0 egress
# Remove TC programs
tc filter del dev eth0 ingress
tc qdisc del dev eth0 clsact
Output listing attached TC filters:
filter protocol all pref 49152 bpf chain 0
filter protocol all pref 49152 bpf chain 0 handle 0x1 tc_prog.o:[tc] direct-action not_in_hw id 53 tag 1a2b3c4d5e6f7890
The da (direct-action) flag is critical — it tells the kernel that your BPF program returns a TC action code directly, rather than a classid. Without da, the return value is interpreted as a class identifier for the qdisc, which is almost never what you want.
XDP vs TC vs iptables vs nftables
This table puts the four packet processing frameworks side by side. The performance numbers are from real benchmarks on a dual-socket Xeon with a Mellanox ConnectX-5 100G NIC, single core, 64-byte packets.
| Feature | XDP (native) | TC eBPF | nftables | iptables |
|---|---|---|---|---|
| Hook point | Driver RX (pre-sk_buff) | clsact qdisc (post-sk_buff) | Netfilter hooks | Netfilter hooks |
| Ingress | Yes | Yes | Yes | Yes |
| Egress | No | Yes | Yes | Yes |
| sk_buff access | No (raw xdp_buff) | Full sk_buff | Full sk_buff | Full sk_buff |
| Conntrack access | Kernel 5.18+ (bpf_xdp_ct_lookup kfunc) | Kernel 5.18+ (bpf_skb_ct_lookup kfunc) | Yes (native) | Yes (native) |
| Drop throughput (single core, 64B) | ~24 Mpps | ~8 Mpps | ~3 Mpps | ~2 Mpps |
| Forwarding throughput (single core) | ~14 Mpps | ~5 Mpps | ~2 Mpps | ~1.5 Mpps |
| Programmable logic | Full eBPF (C/Rust) | Full eBPF (C/Rust) | Rule-based DSL | Fixed match/action |
| Map support (hashmaps, LPM, etc.) | All BPF maps | All BPF maps | Sets only | ipset only |
| Per-CPU counters | Native | Native | Via nft counters | Via -c flag |
| Atomic hot-reload | Yes (replace prog) | Yes (replace filter) | Yes (nft -f) | No (flush + reload) |
| Rule count scaling | O(1) map lookups | O(1) map lookups | O(n) chains | O(n) linear scan |
| Hardware offload | SmartNIC offload | Limited | Limited (ethtool) | No |
Writing XDP Programs with libbpf — Complete Walkthrough
This section walks through building a working XDP program from scratch using libbpf (the canonical BPF loading library). We will build an IP blocklist that drops packets from a configurable set of source addresses — the foundational building block for DDoS mitigation, geo-blocking, and abuse prevention.
Prerequisites
# Install build dependencies (CentOS Stream 9 / RHEL 9 / Rocky 9)
dnf install -y clang llvm libbpf-devel bpftool \
kernel-devel kernel-headers elfutils-libelf-devel
# Install build dependencies (Debian 13 / Ubuntu 24.04)
apt install -y clang llvm libbpf-dev bpftool \
linux-headers-$(uname -r) libelf-dev
# Verify BPF support in your kernel
bpftool feature probe kernel | grep -i xdp
Output confirming XDP support:
eBPF helpers supported for program type xdp:
- bpf_map_lookup_elem
- bpf_map_update_elem
- bpf_map_delete_elem
- bpf_redirect
- bpf_redirect_map
- bpf_xdp_adjust_head
- bpf_xdp_adjust_tail
- bpf_fib_lookup
...
Step 1: The XDP program (C)
Create xdp_blocklist.bpf.c. This program parses the Ethernet and IP headers, looks up the source IP in a BPF hashmap, and drops the packet if found. Note the bounds checking: the eBPF verifier rejects any program that reads packet memory without first proving, by a comparison against data_end, that the access stays inside the packet.
cat <<'XDPEOF' > xdp_blocklist.bpf.c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
/* Map: blocked source IPs -> drop count */
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 100000);
__type(key, __u32); /* IPv4 address (network byte order) */
__type(value, __u64); /* packet drop counter */
} blocklist SEC(".maps");
/* Per-CPU counters for stats */
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__uint(max_entries, 2); /* 0 = passed, 1 = dropped */
__type(key, __u32);
__type(value, __u64);
} stats SEC(".maps");
SEC("xdp")
int xdp_block(struct xdp_md *ctx)
{
void *data = (void *)(long)ctx->data;
void *data_end = (void *)(long)ctx->data_end;
/* --- Parse Ethernet header --- */
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
return XDP_ABORTED;
/* Only process IPv4 */
if (eth->h_proto != bpf_htons(ETH_P_IP))
return XDP_PASS;
/* --- Parse IP header --- */
struct iphdr *ip = (void *)(eth + 1);
if ((void *)(ip + 1) > data_end)
return XDP_ABORTED;
/* Look up source IP in blocklist */
__u32 src_ip = ip->saddr;
__u64 *drop_cnt = bpf_map_lookup_elem(&blocklist, &src_ip);
if (drop_cnt) {
/* IP is blocked — increment counter and drop */
__sync_fetch_and_add(drop_cnt, 1);
__u32 key = 1; /* dropped */
__u64 *val = bpf_map_lookup_elem(&stats, &key);
if (val)
__sync_fetch_and_add(val, 1);
return XDP_DROP;
}
/* Not blocked — pass to kernel */
__u32 key = 0; /* passed */
__u64 *val = bpf_map_lookup_elem(&stats, &key);
if (val)
__sync_fetch_and_add(val, 1);
return XDP_PASS;
}
char _license[] SEC("license") = "GPL";
XDPEOF
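The data / data_end pattern in the listing above can be exercised outside the kernel. The following is a userspace sketch, not BPF code: it applies the same two bounds checks to an ordinary byte buffer so the logic can be unit-tested. The function name and return convention are assumptions made for this illustration.

```c
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN    14      /* dst MAC (6) + src MAC (6) + EtherType (2) */
#define IPV4_MIN_HLEN  20      /* IPv4 header without options */
#define ETHERTYPE_IPV4 0x0800

/* Returns 0 and stores the source address (network byte order) on success,
 * -1 if the frame is truncated, 1 if it is not IPv4. */
int parse_ipv4_saddr(const uint8_t *data, const uint8_t *data_end,
                     uint32_t *saddr)
{
    /* Check bounds before touching the Ethernet header. */
    if (data + ETH_HDR_LEN > data_end)
        return -1;

    uint16_t ethertype = (uint16_t)((data[12] << 8) | data[13]);
    if (ethertype != ETHERTYPE_IPV4)
        return 1;

    /* Check bounds again before touching the IPv4 header. */
    const uint8_t *ip = data + ETH_HDR_LEN;
    if (ip + IPV4_MIN_HLEN > data_end)
        return -1;

    /* The source address sits at byte offset 12 of the IPv4 header. */
    memcpy(saddr, ip + 12, sizeof(*saddr));
    return 0;
}
```

Every header gets its own check because a frame can end anywhere. In the BPF version the verifier enforces this at load time; in userspace the same discipline simply prevents out-of-bounds reads.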
Step 2: Compile with clang
# Compile to BPF bytecode
clang -O2 -g -target bpf \
-D__TARGET_ARCH_x86 \
-c xdp_blocklist.bpf.c \
-o xdp_blocklist.bpf.o
# Inspect the generated BPF instructions
llvm-objdump -d xdp_blocklist.bpf.o
Output showing BPF instructions:
xdp_blocklist.bpf.o: file format elf64-bpf
Disassembly of section xdp:
0000000000000000 <xdp_block>:
0: r6 = r1
1: r2 = *(u32 *)(r6 + 0) ; data
2: r3 = *(u32 *)(r6 + 4) ; data_end
3: r1 = r2
4: r1 += 14 ; sizeof(ethhdr)
5: if r1 > r3 goto +42 ; bounds check
...
Step 3: The userspace loader (C)
Create xdp_blocklist_loader.c. This program loads the compiled BPF object, attaches it to the specified interface, and populates the blocklist map from the command line. It also handles graceful detachment on Ctrl-C.
cat <<'LOADEREOF' > xdp_blocklist_loader.c
// SPDX-License-Identifier: GPL-2.0
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <net/if.h>
#include <arpa/inet.h>
#include <bpf/libbpf.h>
#include <bpf/bpf.h>
static int ifindex;
static struct bpf_object *obj;
static void cleanup(int sig)
{
printf("\nDetaching XDP program from ifindex %d...\n", ifindex);
bpf_xdp_detach(ifindex, 0, NULL);
bpf_object__close(obj);
exit(0);
}
int main(int argc, char **argv)
{
if (argc < 3) {
fprintf(stderr, "Usage: %s <ifname> <blocked_ip> [blocked_ip ...]\n", argv[0]);
return 1;
}
const char *ifname = argv[1];
ifindex = if_nametoindex(ifname);
if (!ifindex) {
fprintf(stderr, "Interface %s not found\n", ifname);
return 1;
}
/* Open and load BPF object */
obj = bpf_object__open_file("xdp_blocklist.bpf.o", NULL);
if (libbpf_get_error(obj)) {
fprintf(stderr, "Failed to open BPF object\n");
return 1;
}
if (bpf_object__load(obj)) {
fprintf(stderr, "Failed to load BPF object\n");
return 1;
}
/* Find the XDP program */
struct bpf_program *prog = bpf_object__find_program_by_name(obj, "xdp_block");
if (!prog) {
fprintf(stderr, "Failed to find xdp_block program\n");
return 1;
}
/* Attach XDP program to interface */
int prog_fd = bpf_program__fd(prog);
if (bpf_xdp_attach(ifindex, prog_fd, 0, NULL)) {
fprintf(stderr, "Failed to attach XDP to %s\n", ifname);
return 1;
}
printf("XDP program attached to %s (ifindex %d)\n", ifname, ifindex);
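/* Populate blocklist map */
struct bpf_map *map = bpf_object__find_map_by_name(obj, "blocklist");
if (!map) {
fprintf(stderr, "Failed to find blocklist map\n");
bpf_xdp_detach(ifindex, 0, NULL);
return 1;
}
int map_fd = bpf_map__fd(map);
for (int i = 2; i < argc; i++) {
__u32 ip;
if (inet_pton(AF_INET, argv[i], &ip) != 1) {
fprintf(stderr, "Invalid IP: %s\n", argv[i]);
continue;
}
__u64 counter = 0;
/* Only report the address as blocked if the map update succeeded */
if (bpf_map_update_elem(map_fd, &ip, &counter, BPF_ANY)) {
fprintf(stderr, "Failed to block %s\n", argv[i]);
continue;
}
printf("Blocked: %s\n", argv[i]);
}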
/* Handle Ctrl-C for clean detach */
signal(SIGINT, cleanup);
signal(SIGTERM, cleanup);
printf("Running... press Ctrl-C to detach\n\n");
/* Print stats every 2 seconds */
struct bpf_map *stats_map = bpf_object__find_map_by_name(obj, "stats");
int stats_fd = bpf_map__fd(stats_map);
int ncpus = libbpf_num_possible_cpus();
__u64 *values = calloc(ncpus, sizeof(__u64));
while (1) {
sleep(2);
__u64 total_passed = 0, total_dropped = 0;
__u32 key = 0;
if (bpf_map_lookup_elem(stats_fd, &key, values) == 0)
for (int i = 0; i < ncpus; i++)
total_passed += values[i];
key = 1;
if (bpf_map_lookup_elem(stats_fd, &key, values) == 0)
for (int i = 0; i < ncpus; i++)
total_dropped += values[i];
printf("Passed: %llu | Dropped: %llu\r",
(unsigned long long)total_passed,
(unsigned long long)total_dropped);
fflush(stdout);
}
free(values);
return 0;
}
LOADEREOF
Step 4: Compile and run
# Compile the loader
gcc -Wall -O2 xdp_blocklist_loader.c -o xdp_blocklist -lbpf -lelf -lz
# Run: block two IPs on eth0
sudo ./xdp_blocklist eth0 192.168.1.100 10.0.0.55
Output:
XDP program attached to eth0 (ifindex 2)
Blocked: 192.168.1.100
Blocked: 10.0.0.55
Running... press Ctrl-C to detach
Passed: 4821033 | Dropped: 1293847
Step 5: Manage the blocklist at runtime
The blocklist map persists as long as the XDP program is attached. You can add and remove IPs without reloading the program using bpftool:
# Find the map ID
bpftool map show | grep blocklist
Output:
14: hash name blocklist flags 0x0
key 4B value 8B max_entries 100000 memlock 4800000B
# Add an IP to the blocklist (key = IP bytes in network order, value = counter starting at 0)
# 203.0.113.50 = cb 00 71 32 in network byte order, which is how ip->saddr sits in memory
bpftool map update id 14 key 0xcb 0x00 0x71 0x32 value 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
# Alternative: pin the map once, then address it by path instead of by ID
bpftool map pin id 14 /sys/fs/bpf/blocklist
bpftool map update pinned /sys/fs/bpf/blocklist \
key hex cb 00 71 32 \
value hex 00 00 00 00 00 00 00 00
# Dump the entire blocklist with drop counts
# (counter bytes are in host order, little-endian on x86)
bpftool map dump id 14
Output:
key: c0 a8 01 64 value: 2b 34 0d 00 00 00 00 00 (192.168.1.100 -> 865323 drops)
key: 0a 00 00 37 value: c3 a1 04 00 00 00 00 00 (10.0.0.55 -> 303555 drops)
key: cb 00 71 32 value: 00 00 00 00 00 00 00 00 (203.0.113.50 -> 0 drops)
Found 3 elements
# Delete an IP from the blocklist
bpftool map delete id 14 key 0xcb 0x00 0x71 0x32
# Detach XDP program manually
ip link set dev eth0 xdp off
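Getting the key bytes wrong is the easiest mistake to make here, because bpftool takes the key in memory order and the XDP program stores ip->saddr in network byte order. The following is a small userspace sketch that prints the key bytes for a dotted-quad address; the helper name bpftool_key_for_ipv4 is hypothetical, while inet_pton is the standard POSIX call.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Formats the bpftool key bytes for an IPv4 address, e.g.
 * "203.0.113.50" becomes "0xcb 0x00 0x71 0x32".
 * Returns 0 on success, -1 on a bad address or a too-small buffer. */
int bpftool_key_for_ipv4(const char *dotted, char *out, size_t out_len)
{
    struct in_addr addr;
    uint8_t bytes[4];

    if (inet_pton(AF_INET, dotted, &addr) != 1)
        return -1;

    /* s_addr is already in network byte order, so its in-memory bytes
     * are exactly what the BPF program compares against. */
    memcpy(bytes, &addr.s_addr, sizeof(bytes));

    int n = snprintf(out, out_len, "0x%02x 0x%02x 0x%02x 0x%02x",
                     bytes[0], bytes[1], bytes[2], bytes[3]);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

The bytes come out in the same order as the dotted quad reads, most significant octet first, regardless of the host's endianness.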
DDoS Mitigation with XDP
The blocklist above handles known-bad IPs. Real DDoS mitigation also needs rate limiting: detecting and throttling sources that exceed a packet-per-second threshold, even if they are not on the blocklist yet. This catches SYN floods and other floods from a limited set of real source addresses. A per-source limit does not help when sources are spoofed, or when a botnet spreads the load so that each IP stays below the threshold; those cases need an aggregate limit on top.
SYN flood rate limiter
This XDP program tracks TCP SYN packets per source IP using a fixed one-second window counter. When a source exceeds the configured threshold (e.g., 100 SYNs per second), further SYN packets from that source are dropped until the window expires. Non-SYN packets pass through, which prevents collateral damage to established connections.
cat <<'SYNEOF' > xdp_synlimit.bpf.c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
#define SYN_RATE_LIMIT 100 /* max SYNs per source per second */
#define WINDOW_NS 1000000000ULL /* 1 second in nanoseconds */
struct rate_info {
__u64 window_start; /* nanosecond timestamp of current window */
__u32 syn_count; /* SYNs seen in current window */
__u32 blocked; /* 1 if currently rate-limited */
};
struct {
__uint(type, BPF_MAP_TYPE_LRU_HASH);
__uint(max_entries, 1000000); /* track up to 1M sources */
__type(key, __u32); /* source IPv4 */
__type(value, struct rate_info);
} syn_tracker SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__uint(max_entries, 3); /* 0=pass, 1=drop_rate, 2=drop_blocked */
__type(key, __u32);
__type(value, __u64);
} syn_stats SEC(".maps");
static __always_inline void bump_stat(__u32 idx)
{
__u64 *val = bpf_map_lookup_elem(&syn_stats, &idx);
if (val) __sync_fetch_and_add(val, 1);
}
SEC("xdp")
int xdp_syn_rate_limit(struct xdp_md *ctx)
{
void *data = (void *)(long)ctx->data;
void *data_end = (void *)(long)ctx->data_end;
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
return XDP_PASS;
if (eth->h_proto != bpf_htons(ETH_P_IP))
return XDP_PASS;
struct iphdr *ip = (void *)(eth + 1);
if ((void *)(ip + 1) > data_end)
return XDP_PASS;
if (ip->protocol != IPPROTO_TCP)
return XDP_PASS;
/* Calculate IP header length (IHL field) */
__u32 ip_hlen = ip->ihl * 4;
if (ip_hlen < 20)
return XDP_PASS;
struct tcphdr *tcp = (void *)ip + ip_hlen;
if ((void *)(tcp + 1) > data_end)
return XDP_PASS;
/* Only rate-limit SYN packets (SYN=1, ACK=0) */
if (!(tcp->syn && !tcp->ack))
return XDP_PASS;
__u32 src_ip = ip->saddr;
__u64 now = bpf_ktime_get_ns();
struct rate_info *info = bpf_map_lookup_elem(&syn_tracker, &src_ip);
if (!info) {
/* First SYN from this source — start tracking */
struct rate_info new_info = {
.window_start = now,
.syn_count = 1,
.blocked = 0,
};
bpf_map_update_elem(&syn_tracker, &src_ip, &new_info, BPF_ANY);
bump_stat(0);
return XDP_PASS;
}
/* Check if we are in a new window */
if (now - info->window_start > WINDOW_NS) {
info->window_start = now;
info->syn_count = 1;
info->blocked = 0;
bump_stat(0);
return XDP_PASS;
}
info->syn_count++;
if (info->syn_count > SYN_RATE_LIMIT) {
info->blocked = 1;
bump_stat(1);
return XDP_DROP;
}
bump_stat(0);
return XDP_PASS;
}
char _license[] SEC("license") = "GPL";
SYNEOF
# Compile and attach
clang -O2 -g -target bpf -c xdp_synlimit.bpf.c -o xdp_synlimit.bpf.o
ip link set dev eth0 xdp obj xdp_synlimit.bpf.o sec xdp
# Verify it is running (-A 2 keeps the two detail lines that follow the match)
bpftool prog list | grep -A 2 xdp_syn
Output:
58: xdp name xdp_syn_rate_l tag 9a8b7c6d5e4f3a21 gpl
loaded_at 2026-04-04T10:15:00+0000 uid 0
xlated 896B jited 512B memlock 4096B map_ids 22,23
# Test with an hping3 SYN flood from a single source
# (--rand-source would spoof a new address per packet and never trip a per-source limit)
hping3 -S -p 80 --flood <target_ip>
# Monitor drop rate
bpftool map dump name syn_stats
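The windowing logic is easier to reason about when it is pulled out of the packet path. The following userspace sketch reproduces the same fixed-window decision as the XDP listing, with the timestamp passed in so it can be tested deterministically. The function names are made up for this illustration, and the blocked flag from the listing is left out because it does not affect the decision.

```c
#include <stdbool.h>
#include <stdint.h>

#define SYN_RATE_LIMIT 100            /* max SYNs per source per window */
#define WINDOW_NS      1000000000ULL  /* 1 second in nanoseconds */

struct rate_info {
    uint64_t window_start;  /* timestamp of the current window */
    uint32_t syn_count;     /* SYNs seen in the current window */
};

/* State for a source whose first SYN arrives at now_ns (that SYN passes). */
struct rate_info rate_info_start(uint64_t now_ns)
{
    struct rate_info info = { .window_start = now_ns, .syn_count = 1 };
    return info;
}

/* Returns true if a later SYN arriving at now_ns should pass. */
bool syn_allowed(struct rate_info *info, uint64_t now_ns)
{
    /* Window expired: start a new one and count this SYN as its first. */
    if (now_ns - info->window_start > WINDOW_NS) {
        info->window_start = now_ns;
        info->syn_count = 1;
        return true;
    }

    info->syn_count++;
    return info->syn_count <= SYN_RATE_LIMIT;
}
```

One property of fixed windows is worth knowing: a source can send 100 SYNs at the end of one window and 100 more at the start of the next, so the short-term burst can reach twice the nominal limit. If that matters, a sliding window or token bucket is the usual refinement.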
L4 Load Balancing with XDP
XDP load balancers work by rewriting packet headers at the driver level. A packet arrives destined for a virtual IP (VIP). The XDP program selects a backend server, rewrites the destination MAC and IP, and returns XDP_TX to send the packet back out the same interface, all without the kernel network stack ever seeing the packet. Facebook's Katran and Cilium's XDP load balancer are built on the same driver-level pattern, although Katran forwards by IP-in-IP encapsulation rather than by rewriting the destination address.
How XDP L4 dispatch works
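The load balancer machine sits inline (or as a router hop). It has one or more VIPs configured. When a packet arrives for a VIP, the XDP program hashes the source IP and source port to select a backend from a map. It rewrites the destination IP to the backend's IP, updates the IP checksum incrementally, sets the Ethernet destination MAC to the backend's MAC, and returns XDP_TX. Two caveats apply to this simplified listing. First, TCP and UDP checksums cover the destination address through the pseudo-header, so a complete implementation must adjust the L4 checksum as well. Second, because the destination IP is rewritten, this is destination NAT: replies carry the backend's address and must return through the load balancer for reverse translation. True DSR (Direct Server Return), where return traffic bypasses the load balancer, requires leaving the VIP as the destination (rewriting only the MAC, or encapsulating) and configuring the VIP on each backend.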
cat <<'LBEOF' > xdp_lb.bpf.c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
#define MAX_BACKENDS 64
struct backend {
__u32 ip; /* backend IP (network byte order) */
unsigned char mac[6]; /* backend MAC */
__u16 _pad;
};
struct vip_key {
__u32 ip; /* VIP address */
__u16 port; /* VIP port */
__u8 proto; /* IPPROTO_TCP or IPPROTO_UDP */
__u8 _pad;
};
struct vip_meta {
__u32 num_backends; /* number of active backends for this VIP */
};
/* VIP -> metadata (number of backends) */
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 256);
__type(key, struct vip_key);
__type(value, struct vip_meta);
} vip_map SEC(".maps");
/* (VIP, index) -> backend */
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 256 * MAX_BACKENDS);
__type(key, __u64); /* vip_hash << 32 | backend_index */
__type(value, struct backend);
} backend_map SEC(".maps");
static __always_inline __u16 csum_fold(__u32 csum)
{
csum = (csum & 0xffff) + (csum >> 16);
csum = (csum & 0xffff) + (csum >> 16);
return (__u16)~csum;
}
static __always_inline void update_ip_csum(struct iphdr *ip,
__u32 old_addr, __u32 new_addr)
{
__u32 csum = ~((__u32)ip->check) & 0xffff;
csum += ~old_addr & 0xffff;
csum += ~(old_addr >> 16) & 0xffff;
csum += new_addr & 0xffff;
csum += (new_addr >> 16) & 0xffff;
ip->check = csum_fold(csum);
}
SEC("xdp")
int xdp_l4_lb(struct xdp_md *ctx)
{
void *data = (void *)(long)ctx->data;
void *data_end = (void *)(long)ctx->data_end;
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
return XDP_PASS;
if (eth->h_proto != bpf_htons(ETH_P_IP))
return XDP_PASS;
struct iphdr *ip = (void *)(eth + 1);
if ((void *)(ip + 1) > data_end)
return XDP_PASS;
/* Extract L4 port */
__u16 dst_port = 0;
__u16 src_port = 0;
if (ip->protocol == IPPROTO_TCP) {
struct tcphdr *tcp = (void *)ip + (ip->ihl * 4);
if ((void *)(tcp + 1) > data_end)
return XDP_PASS;
dst_port = tcp->dest;
src_port = tcp->source;
} else if (ip->protocol == IPPROTO_UDP) {
struct udphdr *udp = (void *)ip + (ip->ihl * 4);
if ((void *)(udp + 1) > data_end)
return XDP_PASS;
dst_port = udp->dest;
src_port = udp->source;
} else {
return XDP_PASS;
}
/* Look up VIP */
struct vip_key vk = {
.ip = ip->daddr,
.port = dst_port,
.proto = ip->protocol,
};
struct vip_meta *meta = bpf_map_lookup_elem(&vip_map, &vk);
if (!meta || meta->num_backends == 0)
return XDP_PASS; /* Not a VIP — pass to kernel */
/* Simple flow hash: select backend (this is not consistent hashing) */
__u32 hash = src_port;
hash ^= ip->saddr;
hash ^= (hash >> 16);
hash ^= (hash >> 8);
__u32 idx = hash % meta->num_backends;
/* Pack lookup key: VIP IP in high 32 bits, index in low 32 */
__u64 bk = ((__u64)ip->daddr << 32) | idx;
struct backend *be = bpf_map_lookup_elem(&backend_map, &bk);
if (!be)
return XDP_PASS; /* No backend at this index */
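/* Rewrite destination IP and patch the IP header checksum.
* NOTE: the TCP/UDP checksum also covers daddr via the pseudo-header;
* this simplified listing does not update it. */
__u32 old_daddr = ip->daddr;
ip->daddr = be->ip;
update_ip_csum(ip, old_daddr, be->ip);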
/* Rewrite Ethernet destination MAC */
__builtin_memcpy(eth->h_dest, be->mac, ETH_ALEN);
/* Swap source MAC to our MAC (so the switch learns the path) */
/* In production, read this from a map or hardcode your LB's MAC */
unsigned char our_mac[6] = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55};
__builtin_memcpy(eth->h_source, our_mac, ETH_ALEN);
return XDP_TX; /* Send back out the same interface */
}
char _license[] SEC("license") = "GPL";
LBEOF
# Compile
clang -O2 -g -target bpf -c xdp_lb.bpf.c -o xdp_lb.bpf.o
# Attach to the interface facing clients
ip link set dev eth0 xdp obj xdp_lb.bpf.o sec xdp
# Populate VIP and backends via bpftool
# (In production, use a control plane daemon that manages the maps)
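The update_ip_csum helper in the listing relies on the incremental checksum update from RFC 1624: when one 16-bit word changes from m to m', the new checksum is HC' = ~(~HC + ~m + m') in one's-complement arithmetic. The sketch below is a userspace check of that identity against a full recomputation. The function names are illustrative, and the header is modelled as an array of ten 16-bit words with the checksum at index 5.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator to 16 bits and complement it. */
static uint16_t csum_fold32(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Full Internet checksum over n 16-bit words. The checksum field itself
 * must be zero in the input. Suitable for short headers (n < 65536). */
uint16_t csum_full(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += words[i];
    return csum_fold32(sum);
}

/* RFC 1624 eq. 3: new checksum after one word changes old_word -> new_word. */
uint16_t csum_update16(uint16_t old_csum, uint16_t old_word, uint16_t new_word)
{
    uint32_t sum = (uint16_t)~old_csum;
    sum += (uint16_t)~old_word;
    sum += new_word;
    return csum_fold32(sum);
}
```

A 32-bit address is two 16-bit words, so rewriting the destination IP means calling csum_update16 twice. The same adjustment applies to the TCP or UDP checksum, because the pseudo-header includes both addresses.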
The hash function above is intentionally simple for clarity. Production load balancers use Maglev consistent hashing (Google's algorithm) to minimize connection disruption when backends are added or removed. Facebook's Katran uses a Maglev-style lookup table: the control plane computes the table and the BPF program only indexes into it.
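For readers who have not seen Maglev, the core of it is a table-population step. Each backend gets a permutation of table slots defined by an offset and a skip, and backends take turns claiming their next preferred free slot until the table is full. The following is a minimal sketch of that step only, with a deliberately tiny table; in a real deployment the offset and skip are derived from two hashes of the backend's name and the table has thousands of entries. All names and sizes here are assumptions made for this illustration, not Katran's or Google's code.

```c
#include <stdint.h>

#define MAGLEV_M            7   /* lookup table size; must be prime */
#define MAGLEV_MAX_BACKENDS 8

/* Fills table[] with backend indices 0..n-1.
 * offset[i] must be in [0, M) and skip[i] in [1, M).
 * Returns 0 on success, -1 on invalid arguments. */
int maglev_populate(const uint32_t *offset, const uint32_t *skip, int n,
                    int table[MAGLEV_M])
{
    uint32_t next[MAGLEV_MAX_BACKENDS] = {0};
    int filled = 0;

    if (n <= 0 || n > MAGLEV_MAX_BACKENDS)
        return -1;
    for (int i = 0; i < n; i++)
        if (offset[i] >= MAGLEV_M || skip[i] == 0 || skip[i] >= MAGLEV_M)
            return -1;

    for (int j = 0; j < MAGLEV_M; j++)
        table[j] = -1;

    for (;;) {
        /* Round-robin: each backend claims its next preferred free slot. */
        for (int i = 0; i < n; i++) {
            uint32_t slot;
            do {
                slot = (offset[i] + next[i] * skip[i]) % MAGLEV_M;
                next[i]++;
            } while (table[slot] >= 0);

            table[slot] = i;
            if (++filled == MAGLEV_M)
                return 0;
        }
    }
}
```

Because M is prime, every skip value walks all M slots, so the loop always terminates. Because backends take turns, their slot counts differ by at most one. The datapath then selects a backend with table[flow_hash % M], and removing one backend only reassigns a small share of the slots.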
Traffic Mirroring and Sampling
Mirroring needs a copy of the packet, and XDP has no helper that makes one: an XDP redirect moves the original. The helper that clones, bpf_clone_redirect(), operates on an sk_buff and is available to TC programs, so sampling and mirroring belong at the TC ingress hook. From there you can build a network tap, a packet capture appliance, or a traffic sampling system for analytics, without span ports and without running tcpdump on the production interface.
cat <<'MIRROREOF' > tc_mirror.bpf.c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
/* Map holding the ifindex of the mirror destination */
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__uint(max_entries, 1);
__type(key, __u32);
__type(value, __u32); /* target ifindex */
} mirror_map SEC(".maps");
/* Sampling rate: mirror 1 in N packets */
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__uint(max_entries, 1);
__type(key, __u32);
__type(value, __u32); /* N (sample rate) */
} sample_rate SEC(".maps");
SEC("tc")
int tc_mirror_sample(struct __sk_buff *skb)
{
__u32 key = 0;
__u32 *rate = bpf_map_lookup_elem(&sample_rate, &key);
__u32 *ifindex = bpf_map_lookup_elem(&mirror_map, &key);
/* Mirroring disabled or not configured: leave traffic alone */
if (!rate || *rate == 0 || !ifindex || *ifindex == 0)
return TC_ACT_OK;
/* Use a pseudo-random check for sampling */
if ((bpf_get_prandom_u32() % *rate) != 0)
return TC_ACT_OK;
/* Clone the packet and transmit the clone on the mirror interface */
bpf_clone_redirect(skb, *ifindex, 0);
/* Original packet continues normally */
return TC_ACT_OK;
}
char _license[] SEC("license") = "GPL";
MIRROREOF
# Compile
clang -O2 -g -target bpf -c tc_mirror.bpf.c -o tc_mirror.bpf.o
# Create a veth pair for the mirror destination
ip link add mirror0 type veth peer name mirror0-cap
ip link set mirror0 up
ip link set mirror0-cap up
# Attach the program at TC ingress
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress bpf da obj tc_mirror.bpf.o sec tc
# Point the mirror at mirror0 (the value is the ifindex as 4 bytes;
# this form assumes ifindex < 256 on a little-endian host)
bpftool map update name mirror_map key 0 0 0 0 value $(cat /sys/class/net/mirror0/ifindex) 0 0 0
# Set sampling rate to 1:100 (1% of packets)
bpftool map update name sample_rate key 0 0 0 0 value 100 0 0 0
# Capture mirrored packets
tcpdump -i mirror0-cap -w /tmp/sampled.pcap
The bpf_clone_redirect helper copies the packet and sends the copy to the specified interface. The original packet continues with TC_ACT_OK. The clone adds overhead, but only for sampled packets: because the sampling decision is made in BPF before any copy happens, the cost scales with the sampling rate rather than with total traffic.
AF_XDP — Userspace Packet Processing
AF_XDP is a socket type that lets userspace programs receive and send raw packets through the XDP path, bypassing the entire kernel network stack. It is Linux's answer to DPDK — but without requiring a dedicated driver, without taking the NIC away from the kernel, and without rewriting your application to use a vendor SDK. Any program that can open a socket can use AF_XDP.
How AF_XDP works
An AF_XDP socket registers a memory region (UMEM) that is shared between the kernel and your userspace program. The XDP program returns XDP_REDIRECT to send packets into an XSKMAP (XDP socket map). Your program posts free UMEM frames on the fill ring; the kernel places received packet data into those frames (with no copy when the driver supports zero-copy mode, one copy otherwise) and publishes descriptors on the RX ring. Your program polls the RX ring, processes the packets, and optionally sends responses through the TX ring, reclaiming transmitted frames from the completion ring. Throughput: 10-20 million pps per core on a 100G NIC, depending on packet size.
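AF_XDP uses four rings (fill, RX, TX, completion), and each is a single-producer, single-consumer queue of descriptors with separate producer and consumer indices. The following is a generic sketch of that kind of ring using C11 atomics. It illustrates the idea only: the struct layout, names, and sizes are assumptions made for this example and do not match the kernel's actual ring ABI, which libxdp wraps for you.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u   /* number of slots; must be a power of two */

struct spsc_ring {
    _Atomic uint32_t producer;  /* advanced only by the producer */
    _Atomic uint32_t consumer;  /* advanced only by the consumer */
    uint64_t desc[RING_SIZE];   /* descriptors, e.g. UMEM frame offsets */
};

/* Producer side: returns false if the ring is full. */
bool ring_produce(struct spsc_ring *r, uint64_t d)
{
    uint32_t prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
    uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_acquire);

    if (prod - cons == RING_SIZE)
        return false;

    r->desc[prod & (RING_SIZE - 1)] = d;
    /* Release: the descriptor is visible before the index moves. */
    atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
bool ring_consume(struct spsc_ring *r, uint64_t *d)
{
    uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
    uint32_t prod = atomic_load_explicit(&r->producer, memory_order_acquire);

    if (cons == prod)
        return false;

    *d = r->desc[cons & (RING_SIZE - 1)];
    /* Release: the slot is free for reuse only after it has been read. */
    atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
    return true;
}
```

The indices are free-running counters that are masked on access, so prod - cons is the fill level even after the counters wrap. On the fill and TX rings userspace is the producer and the kernel the consumer; on the RX and completion rings the roles are reversed.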
# The XDP program that feeds AF_XDP sockets
cat <<'AFXDPEOF' > xdp_afxdp.bpf.c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
/* XSK (XDP Socket) map — one entry per RX queue */
struct {
__uint(type, BPF_MAP_TYPE_XSKMAP);
__uint(max_entries, 64);
__type(key, __u32);
__type(value, __u32);
} xsks SEC(".maps");
SEC("xdp")
int xdp_sock_prog(struct xdp_md *ctx)
{
/* Redirect to the AF_XDP socket bound to this RX queue */
__u32 index = ctx->rx_queue_index;
if (bpf_map_lookup_elem(&xsks, &index))
return bpf_redirect_map(&xsks, index, XDP_PASS);
/* No AF_XDP socket on this queue — pass to kernel */
return XDP_PASS;
}
char _license[] SEC("license") = "GPL";
AFXDPEOF
# Compile
clang -O2 -g -target bpf -c xdp_afxdp.bpf.c -o xdp_afxdp.bpf.o
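# The userspace side uses libxdp or the raw AF_XDP socket API.
# A minimal receiver is xdpsock. It began as a kernel sample (samples/bpf);
# newer kernels no longer ship it, and it is maintained in the xdp-project
# bpf-examples repository. Build it from whichever source matches your kernel:
cd /usr/src/kernels/$(uname -r)/samples/bpf/
make xdpsock
# Run: receive packets on eth0 queue 0, zero-copy mode
./xdpsock -i eth0 -q 0 -r -z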
Output:
sock0@eth0:0 rxdrop 14.2 Mpps (14209813 pkts)
sock0@eth0:0 rxdrop 14.1 Mpps (14138402 pkts)
sock0@eth0:0 rxdrop 14.3 Mpps (14298100 pkts)
14 million packets per second received in userspace on a single core, zero copy. That is what kernel bypass gives you without leaving the Linux ecosystem. DPDK achieves similar numbers but requires you to surrender the NIC to a userspace driver — no more ip link, no more tcpdump, no more kernel routing. AF_XDP gives you the speed while keeping the NIC under kernel control.
| Feature | AF_XDP | DPDK | Raw socket |
|---|---|---|---|
| Throughput (64B, single core) | 14 Mpps | 15 Mpps | 0.5 Mpps |
| Zero copy | Yes (driver support) | Yes | No |
| Kernel NIC control retained | Yes | No (takes over NIC) | Yes |
| tcpdump works alongside | Yes | No | Yes |
| Requires dedicated driver | No (standard NIC driver) | Yes (PMD) | No |
| API complexity | Medium (ring buffers) | High (vendor SDK) | Low (recvfrom) |
| Hugepages required | No | Yes | No |
TC eBPF Programs — Egress Processing
XDP handles ingress. TC handles everything else. If you need to shape, mark, rewrite, or drop packets on the egress path — or if you need access to connection tracking state on ingress — TC eBPF is your tool.
Egress rate limiter
This TC program enforces a per-destination bandwidth limit on outgoing traffic. It tracks bytes sent to each destination IP per time window and drops excess packets. This is useful for preventing a single tenant or service from saturating the uplink.
cat <<'TCRLEOF' > tc_egress_ratelimit.bpf.c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
#define RATE_LIMIT_BPS (100ULL * 1000 * 1000) /* 100 Mbps (bits per second) per destination */
#define WINDOW_NS 1000000000ULL /* 1 second */
struct rate_state {
__u64 window_start;
__u64 bytes_sent;
};
struct {
__uint(type, BPF_MAP_TYPE_LRU_HASH);
__uint(max_entries, 100000);
__type(key, __u32); /* destination IPv4 */
__type(value, struct rate_state);
} egress_rate SEC(".maps");
SEC("tc")
int tc_rate_limit(struct __sk_buff *skb)
{
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
return TC_ACT_OK;
if (eth->h_proto != bpf_htons(ETH_P_IP))
return TC_ACT_OK;
struct iphdr *ip = (void *)(eth + 1);
if ((void *)(ip + 1) > data_end)
return TC_ACT_OK;
__u32 dst_ip = ip->daddr;
__u64 now = bpf_ktime_get_ns();
__u64 pkt_len = skb->len;
struct rate_state *state = bpf_map_lookup_elem(&egress_rate, &dst_ip);
if (!state) {
struct rate_state new_state = {
.window_start = now,
.bytes_sent = pkt_len,
};
bpf_map_update_elem(&egress_rate, &dst_ip, &new_state, BPF_ANY);
return TC_ACT_OK;
}
if (now - state->window_start > WINDOW_NS) {
state->window_start = now;
state->bytes_sent = pkt_len;
return TC_ACT_OK;
}
/* Atomic add: several CPUs can send to the same destination at once */
__sync_fetch_and_add(&state->bytes_sent, pkt_len);
/* Convert rate limit from bits/sec to bytes/sec */
__u64 max_bytes = RATE_LIMIT_BPS / 8;
if (state->bytes_sent > max_bytes)
return TC_ACT_SHOT; /* Drop — over rate limit */
return TC_ACT_OK;
}
char _license[] SEC("license") = "GPL";
TCRLEOF
# Compile
clang -O2 -g -target bpf -c tc_egress_ratelimit.bpf.c -o tc_egress_ratelimit.bpf.o
# Attach to egress
tc qdisc add dev eth0 clsact
tc filter add dev eth0 egress bpf da obj tc_egress_ratelimit.bpf.o sec tc
# Verify
tc filter show dev eth0 egress
Output:
filter protocol all pref 49152 bpf chain 0
filter protocol all pref 49152 bpf chain 0 handle 0x1 tc_egress_ratelimit.bpf.o:[tc] direct-action not_in_hw id 67 tag 4f3e2d1c0b9a8765
DSCP / QoS marking
TC programs can modify the DSCP field in IP headers to implement QoS marking. This tells downstream routers how to prioritize the traffic — useful when your server connects to a network that honors DiffServ.
cat <<'DSCPEOF' > tc_dscp_mark.bpf.c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
/* DSCP values (6-bit, shifted left by 2 for the TOS field) */
#define DSCP_EF (46 << 2) /* Expedited Forwarding — voice/video */
#define DSCP_AF41 (34 << 2) /* Assured Forwarding 41 — interactive */
#define DSCP_CS1 (8 << 2) /* Scavenger — bulk/background */
/* Port -> DSCP mapping */
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 1024);
__type(key, __u16); /* destination port */
__type(value, __u8); /* DSCP value (pre-shifted into TOS position) */
} dscp_map SEC(".maps");
SEC("tc")
int tc_mark_dscp(struct __sk_buff *skb)
{
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
return TC_ACT_OK;
if (eth->h_proto != bpf_htons(ETH_P_IP))
return TC_ACT_OK;
struct iphdr *ip = (void *)(eth + 1);
if ((void *)(ip + 1) > data_end)
return TC_ACT_OK;
__u16 dst_port = 0;
if (ip->protocol == IPPROTO_TCP) {
struct tcphdr *tcp = (void *)ip + (ip->ihl * 4);
if ((void *)(tcp + 1) > data_end)
return TC_ACT_OK;
dst_port = bpf_ntohs(tcp->dest);
}
if (dst_port == 0)
return TC_ACT_OK;
__u16 port_key = dst_port;
__u8 *dscp = bpf_map_lookup_elem(&dscp_map, &port_key);
if (!dscp)
return TC_ACT_OK;
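/* Rewrite TOS field: preserve ECN bits (lower 2), set DSCP (upper 6) */
__u8 old_tos = ip->tos;
__u8 new_tos = (*dscp) | (old_tos & 0x03);
if (new_tos == old_tos)
return TC_ACT_OK;
/* Patch the IPv4 header checksum first: BPF_F_RECOMPUTE_CSUM only
maintains skb->csum, it does not fix the checksum inside the IP header */
bpf_l3_csum_replace(skb,
offsetof(struct iphdr, check) + sizeof(struct ethhdr),
bpf_htons(old_tos), bpf_htons(new_tos), 2);
bpf_skb_store_bytes(skb,
offsetof(struct iphdr, tos) + sizeof(struct ethhdr),
&new_tos, sizeof(new_tos), 0);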
return TC_ACT_OK;
}
char _license[] SEC("license") = "GPL";
DSCPEOF
# Compile and attach
clang -O2 -g -target bpf -c tc_dscp_mark.bpf.c -o tc_dscp_mark.bpf.o
tc qdisc add dev eth0 clsact 2>/dev/null
tc filter add dev eth0 egress bpf da obj tc_dscp_mark.bpf.o sec tc
# Mark SSH (port 22) as EF (voice-class priority)
# Mark HTTPS (port 443) as AF41 (interactive)
# Mark port 9090 as CS1 (scavenger/bulk)
# The program converts the port to host byte order before the lookup, so on
# little-endian hosts (x86_64, arm64) the key bytes go low byte first
bpftool map update name dscp_map key hex 16 00 value hex b8 # 22 (0x0016) -> EF (0xB8)
bpftool map update name dscp_map key hex bb 01 value hex 88 # 443 (0x01bb) -> AF41 (0x88)
bpftool map update name dscp_map key hex 82 23 value hex 20 # 9090 (0x2382) -> CS1 (0x20)
# Verify: send traffic and check DSCP marking
# (the program matches on destination port, so capture outbound packets to port 22)
tcpdump -i eth0 -v -n -c 3 'tcp dst port 22'
Output showing DSCP-marked packets:
10:30:01.123456 IP (tos 0xb8, ttl 64, ...) 10.0.0.1.51234 > 10.0.0.2.22: ...
10:30:01.123789 IP (tos 0xb8, ttl 64, ...) 10.0.0.1.51234 > 10.0.0.2.22: ...
10:30:01.124012 IP (tos 0xb8, ttl 64, ...) 10.0.0.1.51234 > 10.0.0.2.22: ...
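The TOS rewrite in tc_mark_dscp comes down to a few bit operations. The following is a minimal userspace sketch of that arithmetic, not code from the program itself; the helper names tos_with_dscp and dscp_of_tos are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: build a new TOS byte from a 6-bit DSCP code point
 * while preserving the two ECN bits of the old TOS byte. */
static uint8_t tos_with_dscp(uint8_t old_tos, uint8_t dscp)
{
    uint8_t dscp_bits = (uint8_t)((dscp & 0x3f) << 2); /* DSCP: upper 6 bits */
    uint8_t ecn_bits = (uint8_t)(old_tos & 0x03);      /* ECN: lower 2 bits */

    return (uint8_t)(dscp_bits | ecn_bits);
}

/* Hypothetical inverse: recover the DSCP code point from a TOS byte. */
static uint8_t dscp_of_tos(uint8_t tos)
{
    return (uint8_t)(tos >> 2);
}
```

Calling tos_with_dscp(0x01, 46) yields 0xb9: the EF code point contributes 0xb8, the value that appears in the tcpdump output, and the ECN bit of the old TOS byte is carried over unchanged.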
TC connection tracking with bpf_skb_ct_lookup
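TC programs can query the kernel's conntrack table through the bpf_skb_ct_lookup() kfunc (added around kernel 5.18; it needs nf_conntrack loaded). That lets you make decisions based on connection state: allow established connections, rate-limit new ones, or log state transitions, much like iptables' -m conntrack --ctstate ESTABLISHED,RELATED. The example below takes a lighter route and does not call the kfunc: it treats a TCP segment with SYN set and ACK clear as a new connection, rate-limits those per source IP, and lets everything else through.
# Example: pass established traffic, rate-limit new connections
# New connections are recognized by their TCP flags (SYN without ACK),
# so no conntrack lookup is needed on the fast path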
cat <<'CTEOF' > tc_conntrack.bpf.c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
#define MAX_NEW_CONNS_PER_SEC 50
#define WINDOW_NS 1000000000ULL /* 1 second */
struct conn_rate {
__u64 window_start; /* start of the current one-second window */
__u64 count; /* new connections seen in this window */
};
/* New connection rate limiter per source IP */
struct {
__uint(type, BPF_MAP_TYPE_LRU_HASH);
__uint(max_entries, 500000);
__type(key, __u32);
__type(value, struct conn_rate);
} new_conn_rate SEC(".maps");
SEC("tc")
int tc_conntrack_filter(struct __sk_buff *skb)
{
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
return TC_ACT_OK;
if (eth->h_proto != bpf_htons(ETH_P_IP))
return TC_ACT_OK;
struct iphdr *ip = (void *)(eth + 1);
if ((void *)(ip + 1) > data_end)
return TC_ACT_OK;
if (ip->protocol != IPPROTO_TCP)
return TC_ACT_OK;
struct tcphdr *tcp = (void *)ip + (ip->ihl * 4);
if ((void *)(tcp + 1) > data_end)
return TC_ACT_OK;
/* Only rate-limit SYN (new connection) packets */
if (!(tcp->syn && !tcp->ack))
return TC_ACT_OK;
__u32 src = ip->saddr;
__u64 now = bpf_ktime_get_ns();
struct conn_rate *rate = bpf_map_lookup_elem(&new_conn_rate, &src);
if (!rate) {
struct conn_rate first = { .window_start = now, .count = 1 };
bpf_map_update_elem(&new_conn_rate, &src, &first, BPF_ANY);
return TC_ACT_OK;
}
/* Start a fresh window once the previous one has expired; without this
reset a source would stay blocked after its first 50 connections */
if (now - rate->window_start > WINDOW_NS) {
rate->window_start = now;
rate->count = 1;
return TC_ACT_OK;
}
/* CPUs can race on the window reset, so the limit is approximate */
__sync_fetch_and_add(&rate->count, 1);
if (rate->count > MAX_NEW_CONNS_PER_SEC)
return TC_ACT_SHOT;
return TC_ACT_OK;
}
char _license[] SEC("license") = "GPL";
CTEOF
# Compile and attach on ingress
clang -O2 -g -target bpf -c tc_conntrack.bpf.c -o tc_conntrack.bpf.o
tc qdisc add dev eth0 clsact 2>/dev/null
tc filter add dev eth0 ingress bpf da obj tc_conntrack.bpf.o sec tc
XDP + WireGuard
WireGuard uses UDP port 51820 by default. In a mesh network with many peers, the WireGuard interface can become a bottleneck — every encrypted packet passes through the kernel's full network stack twice (once on the outer UDP socket, once on the inner wg0 interface). XDP can accelerate WireGuard deployments by handling pre-filtering, load distribution, and DoS protection at the driver level.
Pre-filter non-WireGuard traffic
Drop packets to UDP/51820 that are not valid WireGuard handshakes or data messages. WireGuard has a fixed 4-byte message type header. An XDP program can check that the first byte after the UDP header is 1 (handshake initiation), 2 (handshake response), 3 (cookie reply), or 4 (transport data) — and drop everything else before it reaches the WireGuard socket.
Rate-limit handshake initiations
WireGuard handshakes are computationally expensive (Curve25519 key exchange). An attacker can flood UDP/51820 with type-1 messages to exhaust CPU. An XDP rate limiter on type-1 messages (identical to the SYN rate limiter above) caps the handshake rate per source IP without affecting established tunnels (type-4 messages pass through unconditionally).
cat <<'WGEOF' > xdp_wg_filter.bpf.c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
#define WG_PORT 51820
#define WG_HANDSHAKE_INIT 1
#define WG_HANDSHAKE_RESP 2
#define WG_COOKIE_REPLY 3
#define WG_TRANSPORT_DATA 4
#define HANDSHAKE_RATE_LIMIT 20 /* per source per second */
#define WINDOW_NS 1000000000ULL
struct hs_rate {
__u64 window_start;
__u32 count;
__u32 _pad;
};
struct {
__uint(type, BPF_MAP_TYPE_LRU_HASH);
__uint(max_entries, 100000);
__type(key, __u32);
__type(value, struct hs_rate);
} wg_hs_rate SEC(".maps");
SEC("xdp")
int xdp_wg_protect(struct xdp_md *ctx)
{
void *data = (void *)(long)ctx->data;
void *data_end = (void *)(long)ctx->data_end;
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
return XDP_PASS;
if (eth->h_proto != bpf_htons(ETH_P_IP))
return XDP_PASS;
struct iphdr *ip = (void *)(eth + 1);
if ((void *)(ip + 1) > data_end)
return XDP_PASS;
if (ip->protocol != IPPROTO_UDP)
return XDP_PASS;
struct udphdr *udp = (void *)ip + (ip->ihl * 4);
if ((void *)(udp + 1) > data_end)
return XDP_PASS;
/* Only process WireGuard port */
if (bpf_ntohs(udp->dest) != WG_PORT)
return XDP_PASS;
/* Check WireGuard message type (first byte of payload) */
__u8 *wg_type = (void *)(udp + 1);
if ((void *)(wg_type + 1) > data_end)
return XDP_DROP; /* Too short for WireGuard */
__u8 msg_type = *wg_type;
/* Drop invalid message types */
if (msg_type < WG_HANDSHAKE_INIT || msg_type > WG_TRANSPORT_DATA)
return XDP_DROP;
/* Transport data passes unconditionally */
if (msg_type == WG_TRANSPORT_DATA)
return XDP_PASS;
/* Rate-limit handshake initiations */
if (msg_type == WG_HANDSHAKE_INIT) {
__u32 src_ip = ip->saddr;
__u64 now = bpf_ktime_get_ns();
struct hs_rate *rate = bpf_map_lookup_elem(&wg_hs_rate, &src_ip);
if (!rate) {
struct hs_rate new_rate = { .window_start = now, .count = 1 };
bpf_map_update_elem(&wg_hs_rate, &src_ip, &new_rate, BPF_ANY);
return XDP_PASS;
}
if (now - rate->window_start > WINDOW_NS) {
rate->window_start = now;
rate->count = 1;
return XDP_PASS;
}
rate->count++;
if (rate->count > HANDSHAKE_RATE_LIMIT)
return XDP_DROP;
}
return XDP_PASS;
}
char _license[] SEC("license") = "GPL";
WGEOF
# Compile and attach to the external-facing interface
clang -O2 -g -target bpf -c xdp_wg_filter.bpf.c -o xdp_wg_filter.bpf.o
ip link set dev eth0 xdp obj xdp_wg_filter.bpf.o sec xdp
# Verify
bpftool prog list | grep wg
Output:
72: xdp name xdp_wg_protect tag 3c4d5e6f7a8b9c0d gpl
loaded_at 2026-04-04T11:00:00+0000 uid 0
xlated 648B jited 384B memlock 4096B map_ids 28
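The filter above validates only the message type byte. A stricter pre-filter could also check the three reserved bytes and the message length. The sketch below models that check in plain userspace C so it can be unit-tested. The function name wg_classify is hypothetical, and the sizes (148, 92, and 64 bytes for the three handshake messages, at least 32 bytes for transport data) are taken from the published WireGuard message formats; treat them as assumptions to verify against the protocol version you run. Inside an XDP program every one of these reads would additionally need a bounds check against data_end.

```c
#include <stddef.h>
#include <stdint.h>

enum wg_verdict { WG_DROP = 0, WG_PASS = 1 };

/* Hypothetical userspace model of a stricter WireGuard pre-filter.
 * `payload` points at the first byte after the UDP header and `len`
 * is the UDP payload length. */
static enum wg_verdict wg_classify(const uint8_t *payload, size_t len)
{
    if (len < 4)
        return WG_DROP;                 /* too short for type + reserved */
    if (payload[1] || payload[2] || payload[3])
        return WG_DROP;                 /* reserved bytes must be zero */

    switch (payload[0]) {
    case 1:  return len == 148 ? WG_PASS : WG_DROP; /* handshake initiation */
    case 2:  return len == 92  ? WG_PASS : WG_DROP; /* handshake response */
    case 3:  return len == 64  ? WG_PASS : WG_DROP; /* cookie reply */
    case 4:  return len >= 32  ? WG_PASS : WG_DROP; /* transport data */
    default: return WG_DROP;                        /* unknown type */
    }
}
```

The length checks are cheap and reject most random junk aimed at the port before the rate limiter map is touched at all.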
Real Example: SSH Brute Force Rate Limiting
This is a complete XDP program that rate-limits SSH connection attempts. It can take over from fail2ban for the specific case of SSH brute force protection, with the advantage of operating at the driver level: the kernel never allocates an sk_buff for dropped SYN packets, and the SSH daemon never sees the connection attempt. Keep in mind that it counts connection attempts rather than failed logins, so tune MAX_SYN_PER_MIN for legitimate clients that reconnect often.
cat <<'SSHEOF' > xdp_ssh_ratelimit.bpf.c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
#define SSH_PORT 22
#define MAX_SYN_PER_MIN 5 /* 5 connection attempts per minute per source */
#define WINDOW_NS 60000000000ULL /* 60 seconds */
#define BAN_DURATION_NS 300000000000ULL /* 5-minute ban after exceeding limit */
struct ssh_state {
__u64 window_start;
__u32 syn_count;
__u32 banned; /* 1 = currently banned */
__u64 ban_start; /* timestamp when ban began */
};
struct {
__uint(type, BPF_MAP_TYPE_LRU_HASH);
__uint(max_entries, 500000);
__type(key, __u32);
__type(value, struct ssh_state);
} ssh_tracker SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__uint(max_entries, 3); /* 0=pass, 1=rate_drop, 2=ban_drop */
__type(key, __u32);
__type(value, __u64);
} ssh_stats SEC(".maps");
static __always_inline void inc_stat(__u32 idx)
{
__u64 *v = bpf_map_lookup_elem(&ssh_stats, &idx);
if (v) __sync_fetch_and_add(v, 1);
}
SEC("xdp")
int xdp_ssh_protect(struct xdp_md *ctx)
{
void *data = (void *)(long)ctx->data;
void *data_end = (void *)(long)ctx->data_end;
/* Parse L2 */
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
return XDP_PASS;
if (eth->h_proto != bpf_htons(ETH_P_IP))
return XDP_PASS;
/* Parse L3 */
struct iphdr *ip = (void *)(eth + 1);
if ((void *)(ip + 1) > data_end)
return XDP_PASS;
if (ip->protocol != IPPROTO_TCP)
return XDP_PASS;
/* Parse L4 */
__u32 ip_hlen = ip->ihl * 4;
if (ip_hlen < 20)
return XDP_PASS;
struct tcphdr *tcp = (void *)ip + ip_hlen;
if ((void *)(tcp + 1) > data_end)
return XDP_PASS;
/* Only SSH port */
if (bpf_ntohs(tcp->dest) != SSH_PORT)
return XDP_PASS;
/* Only SYN packets (new connections) */
if (!(tcp->syn && !tcp->ack))
return XDP_PASS;
__u32 src_ip = ip->saddr;
__u64 now = bpf_ktime_get_ns();
struct ssh_state *state = bpf_map_lookup_elem(&ssh_tracker, &src_ip);
if (!state) {
struct ssh_state new_state = {
.window_start = now,
.syn_count = 1,
.banned = 0,
.ban_start = 0,
};
bpf_map_update_elem(&ssh_tracker, &src_ip, &new_state, BPF_ANY);
inc_stat(0);
return XDP_PASS;
}
/* Check if currently banned */
if (state->banned) {
if (now - state->ban_start > BAN_DURATION_NS) {
/* Ban expired — reset */
state->banned = 0;
state->window_start = now;
state->syn_count = 1;
inc_stat(0);
return XDP_PASS;
}
inc_stat(2);
return XDP_DROP;
}
/* Check if we are in a new window */
if (now - state->window_start > WINDOW_NS) {
state->window_start = now;
state->syn_count = 1;
inc_stat(0);
return XDP_PASS;
}
state->syn_count++;
if (state->syn_count > MAX_SYN_PER_MIN) {
/* Exceeded rate — ban this source */
state->banned = 1;
state->ban_start = now;
inc_stat(1);
return XDP_DROP;
}
inc_stat(0);
return XDP_PASS;
}
char _license[] SEC("license") = "GPL";
SSHEOF
# Build and deploy
clang -O2 -g -target bpf -c xdp_ssh_ratelimit.bpf.c -o xdp_ssh_ratelimit.bpf.o
ip link set dev eth0 xdp obj xdp_ssh_ratelimit.bpf.o sec xdp
# Monitor stats
watch -n1 'bpftool map dump name ssh_stats'
Output during an SSH brute force attempt:
key: 00 00 00 00 value (per-cpu): [412, 389, 401, 395] <-- passed: ~1597
key: 01 00 00 00 value (per-cpu): [0, 1, 0, 0] <-- rate drops: 1
key: 02 00 00 00 value (per-cpu): [3847, 3921, 3803, 3889] <-- ban drops: ~15460
15,460 brute force SYN packets dropped at the driver level. The SSH daemon never saw them, sshd's logs are clean, and fail2ban was not involved. The LRU hashmap with 500K entries ensures that even a distributed brute force attack from a large botnet stays tracked without unbounded memory growth.
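The window-and-ban logic is easier to reason about, and to unit-test, outside the kernel. The following is a userspace sketch of the per-source decision made by xdp_ssh_protect, with the timestamp passed in as a parameter. The function name ssh_decide and the verdict enum are hypothetical. One simplification is an assumption of this model: a zeroed state stands in for "no map entry yet".

```c
#include <stdint.h>

#define MAX_SYN_PER_MIN 5
#define WINDOW_NS       60000000000ULL   /* 60 seconds */
#define BAN_DURATION_NS 300000000000ULL  /* 5 minutes */

struct ssh_state {
    uint64_t window_start;
    uint32_t syn_count;
    uint32_t banned;
    uint64_t ban_start;
};

/* Values match the ssh_stats indices: 0=pass, 1=rate_drop, 2=ban_drop */
enum verdict { PASS = 0, RATE_DROP = 1, BAN_DROP = 2 };

/* Hypothetical model of the decision for one SYN from one source.
 * `s` must be zero-initialised for a source that has not been seen. */
static enum verdict ssh_decide(struct ssh_state *s, uint64_t now)
{
    if (s->syn_count == 0) {              /* first SYN from this source */
        s->window_start = now;
        s->syn_count = 1;
        return PASS;
    }
    if (s->banned) {
        if (now - s->ban_start > BAN_DURATION_NS) {
            s->banned = 0;                /* ban expired: start over */
            s->window_start = now;
            s->syn_count = 1;
            return PASS;
        }
        return BAN_DROP;
    }
    if (now - s->window_start > WINDOW_NS) {
        s->window_start = now;            /* new window */
        s->syn_count = 1;
        return PASS;
    }
    if (++s->syn_count > MAX_SYN_PER_MIN) {
        s->banned = 1;                    /* threshold crossed: ban */
        s->ban_start = now;
        return RATE_DROP;
    }
    return PASS;
}
```

Walking a single source through this model shows why the sample output has a single rate drop next to thousands of ban drops: only the SYN that crosses the threshold is counted under index 1, and every later SYN inside the ban period lands under index 2.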
Real Example: Traffic Mirror to Analysis Host
This example sends a copy of all HTTP and HTTPS traffic to a second interface connected to a packet analysis host (Zeek, Suricata, or Arkime). The original traffic is untouched. XDP has no clone operation (bpf_redirect_map moves the frame to the target instead of copying it), so the mirror runs at the TC hook, where bpf_clone_redirect() can duplicate the sk_buff.
cat <<'WEBMIRROREOF' > tc_web_mirror.bpf.c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
/* Slot 0 holds the ifindex of the mirror interface (0 = mirroring disabled) */
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__uint(max_entries, 1);
__type(key, __u32);
__type(value, __u32);
} mirror_dest SEC(".maps");
SEC("tc")
int tc_web_tap(struct __sk_buff *skb)
{
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
return TC_ACT_OK;
if (eth->h_proto != bpf_htons(ETH_P_IP))
return TC_ACT_OK;
struct iphdr *ip = (void *)(eth + 1);
if ((void *)(ip + 1) > data_end)
return TC_ACT_OK;
if (ip->protocol != IPPROTO_TCP)
return TC_ACT_OK;
struct tcphdr *tcp = (void *)ip + (ip->ihl * 4);
if ((void *)(tcp + 1) > data_end)
return TC_ACT_OK;
__u16 dport = bpf_ntohs(tcp->dest);
__u16 sport = bpf_ntohs(tcp->source);
/* Mirror HTTP (80) and HTTPS (443) traffic in both directions */
if (dport == 80 || dport == 443 || sport == 80 || sport == 443) {
__u32 key = 0;
__u32 *ifindex = bpf_map_lookup_elem(&mirror_dest, &key);
/* Send a clone out of the mirror interface; the original continues
up the stack. Packet pointers are invalid after this call. */
if (ifindex && *ifindex)
bpf_clone_redirect(skb, *ifindex, 0);
}
return TC_ACT_OK;
}
char _license[] SEC("license") = "GPL";
WEBMIRROREOF
# Setup: create a dedicated mirror interface (or use a physical port)
ip link add mirror0 type veth peer name mirror0-ids
ip link set mirror0 up
ip link set mirror0-ids up
# Compile and attach on ingress
clang -O2 -g -target bpf -c tc_web_mirror.bpf.c -o tc_web_mirror.bpf.o
tc qdisc add dev eth0 clsact 2>/dev/null
tc filter add dev eth0 ingress bpf da obj tc_web_mirror.bpf.o sec tc
# Store the mirror interface's ifindex in slot 0 (bytes in little-endian order;
# the hex keyword is required because printf emits hexadecimal digits)
MIRROR_IFINDEX=$(cat /sys/class/net/mirror0/ifindex)
bpftool map update name mirror_dest key 0 0 0 0 value hex \
$(printf '%02x %02x %02x %02x' \
$((MIRROR_IFINDEX & 0xff)) \
$(((MIRROR_IFINDEX >> 8) & 0xff)) \
$(((MIRROR_IFINDEX >> 16) & 0xff)) \
$(((MIRROR_IFINDEX >> 24) & 0xff)))
# Run Suricata on the mirror endpoint
suricata -c /etc/suricata/suricata.yaml -i mirror0-ids
Debugging XDP Programs
XDP programs run in the kernel. When something goes wrong, there is no printf, no gdb, no core dump. You have four debugging tools: bpf_trace_printk, bpftool, tracepoints, and statistics maps.
bpf_trace_printk — the printf of eBPF
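# Add trace output to your XDP program. The bpf_printk() macro from
# bpf_helpers.h wraps bpf_trace_printk() and supplies the format-size argument:
# bpf_printk("src_ip=%x action=%d\n", src_ip, action);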
# Read the trace pipe (kernel trace buffer)
cat /sys/kernel/debug/tracing/trace_pipe
Output:
<idle>-0 [003] ..s1 12345.678: bpf_trace_printk: src_ip=c0a80164 action=1
<idle>-0 [003] ..s1 12345.679: bpf_trace_printk: src_ip=0a000037 action=2
<idle>-0 [001] ..s1 12345.680: bpf_trace_printk: src_ip=ac100115 action=1
Warning: bpf_trace_printk is slow. It writes to the kernel trace buffer through a global lock. Never leave it in production code — it can reduce throughput from millions of pps to thousands. Use it for development, remove it before deployment.
bpftool — inspect running programs and maps
# List all loaded BPF programs
bpftool prog list
Output:
47: xdp name xdp_block tag ab4c5d6e7f8a9b0c gpl
loaded_at 2026-04-04T09:00:00+0000 uid 0
xlated 384B jited 224B memlock 4096B map_ids 14,15
btf_id 9
58: xdp name xdp_syn_rate_l tag 9a8b7c6d5e4f3a21 gpl
loaded_at 2026-04-04T10:15:00+0000 uid 0
xlated 896B jited 512B memlock 4096B map_ids 22,23
# Dump the JIT-compiled instructions (what the CPU actually runs)
bpftool prog dump jited id 47
Output (x86_64 assembly):
int xdp_block(struct xdp_md * ctx):
0: nopl 0x0(%rax,%rax,1)
5: xchg %ax,%ax
7: push %rbp
8: mov %rsp,%rbp
11: sub $0x20,%rsp
15: push %rbx
16: push %r13
18: push %r14
...
# Show map contents with human-readable formatting
bpftool map dump id 14 -j | python3 -m json.tool
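# Show program statistics (requires kernel 5.1+)
# run_time_ns and run_cnt only appear while stats collection is enabled
sysctl -w kernel.bpf_stats_enabled=1
bpftool prog show id 47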
XDP statistics tracepoints
# Trace all XDP actions on all interfaces
perf stat -e 'xdp:*' -a -- sleep 5
Output:
Performance counter stats for 'system wide':
4,821,033 xdp:xdp_redirect
1,293,847 xdp:xdp_exception
72,384,120 xdp:xdp_bulk_tx
5.001234567 seconds time elapsed
# Per-interface XDP counters are driver-specific and exposed through ethtool
ethtool -S eth0 | grep -i xdp
Verifier errors
The eBPF verifier is the most common source of frustration. When it rejects your program, it prints a log explaining why. Common errors and fixes:
# Error: "invalid access to packet, off=X size=Y, R2(id=0,off=0,r=0)"
# Cause: accessing packet data without bounds check
# Fix: add "if ((void *)(ptr + 1) > data_end) return XDP_PASS;"
# Error: "back-edge from insn X to Y"
# Cause: the verifier detected a loop it cannot bound (kernels before 5.3 reject all loops)
# Fix: give the loop a constant bound (5.3+), unroll it with #pragma unroll, or use the bpf_loop() helper (5.17+)
# Error: "unreachable insn X"
# Cause: dead code after a return statement
# Fix: remove unreachable code
# Error: "BPF program is too large. Processed X insn"
# Cause: the verifier walked more than 1M instructions across all paths (limit since kernel 5.2)
# Fix: simplify the branching or split into multiple programs using tail calls
# Get verbose verifier output when loading fails
bpftool prog load xdp_blocklist.bpf.o /sys/fs/bpf/test \
type xdp 2>&1 | head -50
Output showing a bounds-check failure:
libbpf: prog 'xdp_block': BPF program load failed: Permission denied
libbpf: prog 'xdp_block': -- BEGIN PROG LOAD LOG --
0: (79) r2 = *(u64 *)(r1 +0)
1: (79) r3 = *(u64 *)(r1 +8)
...
15: (71) r4 = *(u8 *)(r2 +23)
invalid access to packet, off=23 size=1, R2(id=0,off=0,r=14)
R2 offset is outside of the packet
-- END PROG LOAD LOG --
Common Pitfalls
Forgetting bounds checks
Every pointer dereference into packet data must be preceded by a bounds check against data_end. The verifier rejects anything else. This is not optional — even if you "know" the packet is big enough, the verifier does not. Check every header, every field, every time. The pattern is always: if ((void *)(ptr + 1) > data_end) return XDP_PASS;
Using generic XDP in production
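Generic XDP runs after sk_buff allocation, so it does not bypass the allocation overhead that makes XDP fast. If ip link show says prog/xdpgeneric, you are getting TC-level performance at best. Plain xdp is no guarantee either: ip link tries native mode first and quietly falls back to generic mode when the driver lacks support. Verify that your driver supports native XDP and attach with xdpdrv, which fails outright instead of falling back.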
Leaving bpf_trace_printk in production
bpf_trace_printk acquires a global spinlock and writes to the trace ring buffer. At millions of pps, this serializes your fast path down to a single core. Throughput drops from 14 Mpps to ~50 Kpps. Remove all trace prints before deploying.
Not handling IP options (IHL != 5)
The IP header is not always 20 bytes. The IHL (Internet Header Length) field specifies the actual length in 4-byte words. If you hardcode (void *)ip + 20 to find the TCP header, you will parse garbage when IP options are present. Always use ip->ihl * 4 and bounds-check the result.
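As a minimal sketch of that check, here is a userspace version that computes the L4 offset from the IHL field and rejects malformed headers. The function name ipv4_l4_offset is hypothetical, and the length parameter stands in for the data_end comparison an XDP program would make.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the offset of the L4 header inside an IPv4 packet that starts at
 * `ip`, or -1 if the header is malformed or does not fit in `len` bytes.
 * Hypothetical userspace helper for illustration. */
static int ipv4_l4_offset(const uint8_t *ip, size_t len)
{
    if (len < 20)
        return -1;                        /* shorter than a minimal header */
    if ((ip[0] >> 4) != 4)
        return -1;                        /* version field is not IPv4 */

    size_t hlen = (size_t)(ip[0] & 0x0f) * 4; /* IHL counts 32-bit words */
    if (hlen < 20 || hlen > len)
        return -1;                        /* bogus IHL or truncated options */

    return (int)hlen;
}
```

A header with first byte 0x45 yields 20, one with 0x46 (a single 4-byte option) yields 24, and an IHL below 5 is rejected, which is the same guard the SSH example applies with ip_hlen < 20.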
Unbounded map growth
A BPF_MAP_TYPE_HASH with max_entries=100000 can fill up. Once full, bpf_map_update_elem fails with -E2BIG for every new key, and the failure goes unnoticed unless you check the return value. For rate-limiting maps tracking source IPs, use BPF_MAP_TYPE_LRU_HASH instead: it automatically evicts the least-recently-used entry when full. This prevents map exhaustion during DDoS attacks where millions of unique source IPs appear.
XDP_TX without MAC rewrite
When you return XDP_TX, the packet goes back out the same interface. If you did not swap the source and destination MAC addresses, the switch will see a packet with a source MAC that belongs to a different port — and either drop it or cause a MAC flap. Always rewrite both MAC addresses when using XDP_TX.
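The swap itself is three small copies. Below is a userspace sketch operating on a raw frame buffer; the function name swap_mac_addrs is hypothetical. In an XDP program the same copies would be applied to the h_dest and h_source fields of struct ethhdr after the usual bounds check.

```c
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* Swap the destination (bytes 0-5) and source (bytes 6-11) MAC addresses
 * in place. `frame` must point to at least 12 writable bytes. */
static void swap_mac_addrs(uint8_t *frame)
{
    uint8_t tmp[MAC_LEN];

    memcpy(tmp, frame, MAC_LEN);               /* save destination */
    memcpy(frame, frame + MAC_LEN, MAC_LEN);   /* source -> destination */
    memcpy(frame + MAC_LEN, tmp, MAC_LEN);     /* old destination -> source */
}
```

The EtherType that follows the two addresses is left alone, so the rest of the frame parses exactly as before.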
Forgetting checksum updates
If you modify any field in the IP header (source IP, destination IP, TTL, TOS), you must recalculate the IP checksum. If you modify L4 headers, you must recalculate the TCP/UDP checksum. The kernel does not do this for you at the XDP level. Use incremental checksum updates (RFC 1624) — do not recompute from scratch.
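To make the incremental update concrete, here is a userspace sketch of RFC 1624 equation 3, HC' = ~(~HC + ~m + m'), next to a full recomputation for comparison. The function names are hypothetical. In real programs the kernel helpers do this work: bpf_l3_csum_replace and bpf_l4_csum_replace at the TC hook, or bpf_csum_diff combined with manual folding in XDP.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit sum into 16 bits with end-around carry. */
static uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Reference: full Internet checksum over `len` bytes (len must be even),
 * reading the data as big-endian 16-bit words. */
static uint16_t csum_full(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    return (uint16_t)~csum_fold(sum);
}

/* RFC 1624, eqn. 3: update a checksum when one 16-bit word changes
 * from old_word to new_word. */
static uint16_t csum_update16(uint16_t old_csum, uint16_t old_word,
                              uint16_t new_word)
{
    uint32_t sum = (uint16_t)~old_csum;

    sum += (uint16_t)~old_word;
    sum += new_word;
    return (uint16_t)~csum_fold(sum);
}
```

Decrementing the TTL of a 20-byte header this way touches three 16-bit values instead of re-reading all ten words, and the saving grows with header size, which is the reason for the recommendation above.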
Tail calls without bounds on depth
Tail calls let you chain XDP programs (max depth: 33). A successful tail call replaces the current program and never returns. If the tail call fails (wrong map index, program not loaded), execution falls through to the next instruction in the calling program. Always follow bpf_tail_call with a safe default return (like XDP_PASS), because the code after it only runs on failure.
Production Deployment Checklist
Before attaching an XDP program to a production interface, verify every item on this list. A buggy XDP program can black-hole all traffic on the interface — there is no Netfilter safety net.
# 1. Verify native XDP support on your NIC driver
ethtool -i eth0 | grep driver
# Supported: mlx5_core, i40e, ixgbe, ice, bnxt_en, virtio_net, veth
# 2. Test in generic mode first (safe, lower performance)
ip link set dev eth0 xdpgeneric obj your_prog.bpf.o sec xdp
# ... run traffic tests ...
ip link set dev eth0 xdpgeneric off
# 3. Verify the verifier accepts the program
bpftool prog load your_prog.bpf.o /sys/fs/bpf/test type xdp
# 4. Check program size (xlated is in bytes; divide by 8 for the instruction count, which should be well under 1M)
bpftool prog show pinned /sys/fs/bpf/test | grep xlated
# 5. Run with traffic generator in staging
# Use pktgen or moongen to send realistic traffic
modprobe pktgen
# ... configure pktgen ...
# 6. Monitor for XDP_ABORTED events (program bugs)
perf record -e xdp:xdp_exception -a -- sleep 60
perf script # should be empty
# 7. Deploy native XDP with a rollback plan
ip link set dev eth0 xdp obj your_prog.bpf.o sec xdp
# 8. Emergency rollback (instant, no downtime)
ip link set dev eth0 xdp off
# 9. Verify traffic is flowing
watch -n1 'ip -s link show eth0'
# 10. Monitor ongoing
bpftool map dump name your_stats_map
The rollback command, ip link set dev eth0 xdp off, should be in your muscle memory before you deploy any XDP program. I have personally black-holed production traffic twice with buggy XDP programs. Both times, that single command restored service in under 2 seconds. One habit I picked up from that: whenever I attach an XDP program, I keep a second terminal open with the rollback command typed but not entered, so it is ready to execute with one keystroke if something goes wrong.
XDP on kldload
kldload installs all XDP/TC build dependencies and BPF tools on every desktop and server profile: bpftool, clang, llvm, libbpf-devel, and the kernel headers are present out of the box. On CentOS Stream 9 and RHEL 9 targets, the kernel is 5.14+ with full BPF and XDP support. On Fedora 41, you get kernel 6.x with the latest BPF features including bpf_loop, bpf_timer, and BPF_MAP_TYPE_BLOOM_FILTER.
# On a kldload-installed system, verify everything is ready
which bpftool clang llvm-objdump
ls /usr/include/bpf/bpf_helpers.h
uname -r
Output on a kldload CentOS Stream 9 server:
/usr/sbin/bpftool
/usr/bin/clang
/usr/bin/llvm-objdump
/usr/include/bpf/bpf_helpers.h
5.14.0-503.el9.x86_64
# Verify XDP support on your NIC (kldload KVM guests use virtio_net)
ethtool -i eth0 | grep driver
bpftool feature probe kernel | head -5
Output:
driver: virtio_net # Supports native XDP since kernel 4.10
Scanning system configuration...
bpf() syscall for unprivileged users is enabled
JIT compiler is enabled
JIT compiler hardening is disabled
Every XDP and TC program on this page compiles and runs on a fresh kldload install without installing additional packages. The build toolchain, kernel headers, and libbpf are all included in the darksite for offline use.
Where to Go Next
Facebook Katran
Production XDP L4 load balancer. Open source, Maglev consistent hashing, handles billions of packets per day. Study the source for production-grade XDP patterns: health checking, backend draining, IPIP encapsulation, and GUE tunneling. github.com/facebookincubator/katran
Cilium
Kubernetes CNI that replaces kube-proxy with XDP/TC eBPF programs. Full L3/L4/L7 load balancing, network policy enforcement, and observability — all in eBPF. The single best real-world codebase to study for production TC and XDP patterns. github.com/cilium/cilium
xdp-tutorial
Step-by-step exercises from the XDP project maintainers. Starts with basic packet counting and progresses through redirect, AF_XDP, and multi-program setups. The best hands-on learning resource. github.com/xdp-project/xdp-tutorial
Cloudflare blog
Cloudflare has published extensive write-ups on their XDP-based DDoS mitigation pipeline, including SYN cookie validation in XDP, bloom filter rate limiting, and integration with their L4 load balancer. Search "Cloudflare XDP" for the full series.