eBPF Core Dumps & Stack Traces
Performance problems don't announce themselves. A process burns CPU in a tight loop, or it sleeps on a lock nobody knows about, or it leaks memory so slowly that OOM doesn't hit for three weeks. The kernel sees all of it. eBPF lets you ask the kernel what's happening — right now — without stopping the process, without attaching a debugger, without restarting anything. Stack traces, flame graphs, off-CPU analysis, wakeup chains, memory leak detection, core dump capture. This page covers all of it.
The mental model: every thread on your system is either running on a CPU or sleeping somewhere. On-CPU profiling tells you where it's running. Off-CPU analysis tells you why it's sleeping. Stack traces connect both to actual code paths. Flame graphs make the data visual. Core dumps capture the moment of failure. Together, they give you complete visibility into any process on any Linux machine — live, in production, with near-zero overhead.
Stack Traces in eBPF
A stack trace is the chain of function calls that led to the current point of execution. When your PostgreSQL process is burning 100% CPU, the stack trace tells you which function, called by which function, called by which function. Without a stack trace, you know the process is hot. With a stack trace, you know why.
kstack() and ustack() — the two built-ins
bpftrace provides two functions for capturing stack traces. kstack() captures the kernel stack — the chain of kernel functions that are executing on behalf of a thread. ustack() captures the userspace stack — the chain of application functions. They are different stacks. A thread doing a read() syscall has a userspace stack (your application code calling read()) and a kernel stack (the kernel's VFS layer, filesystem driver, block layer, and device driver). You often want both.
Print kernel and userspace stack traces whenever a process calls read():
# bpftrace -e 'tracepoint:syscalls:sys_enter_read /comm == "postgres"/ {
printf("--- kernel stack ---\n");
print(kstack());
printf("--- user stack ---\n");
print(ustack());
}'
Attaching 1 probe...
--- kernel stack ---
ksys_read+95
do_syscall_64+91
entry_SYSCALL_64_after_hwframe+118
--- user stack ---
__GI___libc_read+18
pq_recvbuf+115
pq_getbyte+30
SocketBackend+99
ReadCommand+90
PostgresMain+1520
ServerLoop+982
PostmasterMain+5893
main+753
That userspace stack trace tells a complete story: main called PostmasterMain, which entered ServerLoop, which called PostgresMain (the per-backend main loop), which called ReadCommand to wait for the next SQL query, which went down through SocketBackend into pq_getbyte into pq_recvbuf into libc read(). You are looking at a PostgreSQL backend process waiting for its client to send the next query. Zero guesswork.
How Stack Unwinding Works
Capturing a stack trace means unwinding the call stack — walking backwards from the current frame to find each caller. There are three methods, and which one works depends on how the code was compiled:
Frame Pointer Walking
The oldest and simplest method. Each function prologue saves the previous frame pointer (RBP on x86_64) on the stack. The unwinder follows the chain: current RBP → previous RBP → previous previous RBP. Fast (just pointer chasing), but requires code compiled with -fno-omit-frame-pointer. Most distros omit frame pointers for performance (saves one register), which breaks this method. Fedora 38+ re-enabled frame pointers system-wide. CentOS/RHEL 9 did not.
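The pointer chase described above can be sketched offline. This is a toy model, not kernel code: the `memory` dictionary, the addresses, and the function name `walk_frame_pointers` are all made up for illustration, and it assumes the usual x86_64 layout where the saved RBP sits at `[rbp]` and the return address at `[rbp + 8]`.

```python
def walk_frame_pointers(memory, rbp, max_depth=127):
    """Return the return addresses found by following saved frame pointers.

    memory: dict mapping address -> 8-byte value (a toy model of the stack)
    rbp:    current frame pointer
    Assumed layout: [rbp] = caller's saved rbp, [rbp + 8] = return address.
    """
    return_addresses = []
    while rbp != 0 and len(return_addresses) < max_depth:
        return_addresses.append(memory[rbp + 8])
        rbp = memory[rbp]  # hop to the caller's frame
    return return_addresses


# Toy stack with three frames; a saved rbp of 0 marks the outermost frame.
stack = {
    0x7000: 0x7020, 0x7008: 0x401234,  # leaf frame
    0x7020: 0x7040, 0x7028: 0x401500,  # its caller
    0x7040: 0x0,    0x7048: 0x401900,  # outermost frame
}
trace = walk_frame_pointers(stack, 0x7000)
print([hex(addr) for addr in trace])  # ['0x401234', '0x401500', '0x401900']
```

If a function in the middle of the chain was compiled without a frame pointer, RBP holds arbitrary data at that point and the walk ends early or wanders off, which is why truncated userspace stacks are the typical symptom.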
DWARF Unwinding
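Uses the call frame information (CFI) stored in the .eh_frame or .debug_frame section of ELF binaries, which describes how to recover the caller's frame at every instruction address. Works even without frame pointers. More expensive (the unwinder has to interpret CFI tables), which is hard to do inside an eBPF program given the verifier's complexity limits. bpftrace's ustack() does not unwind with DWARF: it relies on the kernel's BPF stack helper, which walks frame pointers, so userspace stacks from binaries built without frame pointers come back truncated or wrong. DWARF-based unwinding is what perf record --call-graph dwarf does in userspace after copying raw stack memory, and it is the main fallback for userspace stacks on distros that omit frame pointers.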
ORC Unwinder (Kernel Only)
The Linux kernel uses its own unwinder called ORC (Oops Rewind Capability). ORC is a simplified, DWARF-like format designed specifically for kernel stacks. It's faster than DWARF and works without frame pointers. The kernel build generates the ORC tables with objtool, which analyzes the compiled object code directly, so the tables do not depend on DWARF debug info. ORC arrived for x86_64 in kernel 4.14, and kstack() benefits from it automatically when the kernel is built with CONFIG_UNWINDER_ORC. You don't need to do anything.
Check whether your kernel uses ORC or frame pointers:
# grep -c CONFIG_UNWINDER_ORC=y /boot/config-$(uname -r)
1
# If that prints 1, you have ORC. If 0, check for frame pointers:
# grep -c CONFIG_UNWINDER_FRAME_POINTER=y /boot/config-$(uname -r)
0
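The same check can be scripted when you audit many hosts. The sketch below is a hypothetical helper (the name `kernel_unwinder` is not from any tool) that takes the text of a kernel config file and reports which unwinder option is set; it only looks at the two options named above.

```python
def kernel_unwinder(config_text):
    """Return 'orc', 'frame_pointer', or 'unknown' for a kernel config dump."""
    enabled = {line.strip() for line in config_text.splitlines()}
    if "CONFIG_UNWINDER_ORC=y" in enabled:
        return "orc"
    if "CONFIG_UNWINDER_FRAME_POINTER=y" in enabled:
        return "frame_pointer"
    return "unknown"


# Two lines in the style of /boot/config-<release>
sample = """\
CONFIG_UNWINDER_ORC=y
# CONFIG_UNWINDER_FRAME_POINTER is not set
"""
print(kernel_unwinder(sample))  # orc
```

On a real machine you would pass it the contents of /boot/config-$(uname -r); some distros ship the config only as /proc/config.gz, which would need decompressing first.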
Stack Depth Limits and Stack ID Maps
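eBPF programs have a maximum stack depth they can capture. The default in bpftrace is 127 frames, which is plenty for most workloads. If you're profiling deeply recursive code (some XML parsers, tree-walking interpreters), you might hit the limit. You can request a deeper limit per call. The kernel also caps stack depth through the kernel.perf_event_max_stack sysctl (127 by default), so going past 127 frames may require raising that sysctl first: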
# bpftrace -e 'profile:hz:99 { @stacks[kstack(perf, 256)] = count(); }'
Internally, eBPF uses stack ID maps (BPF_MAP_TYPE_STACK_TRACE) to store captured stacks efficiently. Each unique stack trace gets a numeric ID. The map stores the actual instruction pointers. This means if 10,000 samples hit the same call chain, bpftrace stores the stack once and increments a counter — not 10,000 copies of the same stack. This is why eBPF profiling is memory-efficient even at high sample rates.
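The store-once, count-many idea can be illustrated with a small model. This is only a sketch of the concept: the class name `StackTable` and its sequential IDs are inventions for this example, whereas the real BPF stack map derives IDs from a hash of the stack and has a fixed number of slots.

```python
from collections import Counter


class StackTable:
    """Toy model of a stack ID map plus a per-ID sample counter."""

    def __init__(self):
        self._ids = {}           # tuple of instruction pointers -> stack ID
        self._stacks = []        # stack ID -> tuple of instruction pointers
        self.counts = Counter()  # stack ID -> number of samples

    def record(self, instruction_pointers):
        key = tuple(instruction_pointers)
        stack_id = self._ids.get(key)
        if stack_id is None:
            # First time this call chain is seen: store it once
            stack_id = len(self._stacks)
            self._ids[key] = stack_id
            self._stacks.append(key)
        self.counts[stack_id] += 1
        return stack_id

    def unique_stacks(self):
        return len(self._stacks)


table = StackTable()
for _ in range(10_000):
    table.record([0xFFFFFFFF81001234, 0xFFFFFFFF81005678])
table.record([0xFFFFFFFF81009ABC])
print(table.unique_stacks(), table.counts[0], table.counts[1])  # 2 10000 1
```

Ten thousand identical samples cost one stored stack and one counter, which is the property the paragraph above describes.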
On-CPU Profiling
On-CPU profiling answers the question: where is this process spending CPU time? The technique is sampling — at regular intervals, interrupt the CPU and record what function is currently executing. Do this thousands of times, aggregate the results, and you get a statistical picture of where CPU time goes.
Why 99 Hz, not 100 Hz?
If you sample at exactly 100 Hz and some workload has a 10ms timer (also 100 Hz), every sample lands at the same point in the timer cycle. You'd see a wildly distorted profile. Sampling at 99 Hz avoids this lockstep artifact — the sampling frequency drifts relative to any power-of-10 timer, giving you uniform coverage across the entire execution. This is standard practice from Brendan Gregg's perf methodology. 49 Hz or 199 Hz also work. Never use round numbers.
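The lockstep effect is easy to verify with exact arithmetic. The sketch below is a simulation, not a measurement: `distinct_phases` is a made-up helper that asks where in a periodic workload's cycle each sample lands, assuming perfectly regular sampling and a perfectly regular 100 Hz workload.

```python
from fractions import Fraction


def distinct_phases(sample_hz, workload_hz, n_samples):
    """Return the set of positions within the workload cycle that get sampled."""
    period = Fraction(1, workload_hz)
    # Sample i fires at time i / sample_hz; reduce it modulo the workload period
    return {Fraction(i, sample_hz) % period for i in range(n_samples)}


lockstep = distinct_phases(100, 100, 1000)
drifting = distinct_phases(99, 100, 1000)
print(len(lockstep), len(drifting))  # 1 99
```

At 100 Hz every sample observes the same instant of the cycle, so whatever runs at that instant looks like 100% of the profile. At 99 Hz the samples step through 99 evenly spaced points of the cycle before repeating.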
Sample kernel stacks at 99 Hz across all CPUs for 30 seconds:
# bpftrace -e 'profile:hz:99 { @[kstack] = count(); } interval:s:30 { exit(); }'
Attaching 2 probes...
@[
native_queued_spin_lock_slowpath+161
_raw_spin_lock+30
zfs_znode_alloc+181
zfs_zget+335
zfs_dirent_lock+451
zfs_lookup+306
zpl_lookup+94
__lookup_slow+130
walk_component+451
path_lookupat+120
filename_lookup+179
vfs_statx+142
do_statx+57
__x64_sys_statx+45
do_syscall_64+91
entry_SYSCALL_64_after_hwframe+118
]: 847
@[
finish_task_switch.isra.0+303
__schedule+785
schedule+46
schedule_hrtimeout_range_clock+256
do_select+867
core_sys_select+452
__x64_sys_select+181
do_syscall_64+91
entry_SYSCALL_64_after_hwframe+118
]: 2391
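That output just told you two things about your system. First: the most frequent kernel stack (2,391 samples) is in finish_task_switch under select(). On-CPU sampling never sees a thread while it sleeps, so these samples are threads caught in the scheduler as they were switched back in after waiting in select(). That usually points to many short wakeups rather than a hot code path, and it is rarely the problem. Second: 847 samples hit native_queued_spin_lock_slowpath under zfs_znode_alloc, spinning on a lock during directory lookups. If your system feels sluggish, that spin lock contention in ZFS is a real lead. You found it in 30 seconds with a one-liner.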
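Sample counts convert to CPU time with simple arithmetic, because profile:hz:99 samples each CPU 99 times per second and every sample therefore stands for roughly 1/99 of a CPU-second. The helper names below are made up for this sketch, and the result is a statistical estimate rather than an exact measurement.

```python
def cpu_seconds(samples, sample_hz):
    """Estimate CPU time represented by a number of profile samples."""
    return samples / sample_hz


def average_busy_cpus(samples, sample_hz, duration_s):
    """Estimate how many CPUs the stack kept busy on average."""
    return cpu_seconds(samples, sample_hz) / duration_s


# The ZFS spin lock stack from the output above: 847 samples in a 30 s run
print(round(cpu_seconds(847, 99), 2))            # 8.56
print(round(average_busy_cpus(847, 99, 30), 3))  # 0.285
```

So the 847 samples correspond to about 8.6 CPU-seconds of spinning over the 30-second capture, or a bit more than a quarter of one CPU on average.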
Profile a specific process by PID, capturing both kernel and userspace stacks:
# bpftrace -e 'profile:hz:99 /pid == 4521/ {
@[kstack, ustack] = count();
}'
The workflow is always the same: sample → aggregate → fold → visualize. Sampling gives you raw stack traces with counts. Aggregation collapses identical stacks. Folding converts the data into a format that flame graph tools understand. Visualization turns it into an interactive SVG. The next section covers the full pipeline.
Flame Graphs
A flame graph is the single most useful visualization in performance analysis. It takes thousands of stack trace samples and turns them into one interactive SVG where you can instantly see which functions consume the most CPU time. Invented by Brendan Gregg in 2011, flame graphs have become the standard way to communicate profiling results.
How to Read a Flame Graph
The x-axis is alphabetically sorted, not time — this is the most common misunderstanding. Left-to-right order means nothing. The width of each box is proportional to the number of samples where that function appeared in the stack. Wider = more CPU time. The y-axis is stack depth — bottom is the entry point (e.g., main), top is the leaf function actually running on the CPU. A wide plateau at the top means one function is consuming a lot of CPU. A wide bar at the bottom with many narrow towers above means many different code paths go through that function.
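To make the width rule concrete, the sketch below computes the share of samples behind each box from stacks written one per line, root first, with frames separated by semicolons and followed by a count. The function name `box_widths` and the sample data are invented for illustration; flamegraph.pl does its own layout, this only mirrors the proportionality.

```python
from collections import defaultdict


def box_widths(folded_lines):
    """Map each call path (tuple of frames, root first) to its share of all samples."""
    totals = defaultdict(int)
    grand_total = 0
    for line in folded_lines:
        stack, count = line.rsplit(" ", 1)
        frames = stack.split(";")
        n = int(count)
        grand_total += n
        # Every prefix of the stack is one box; each gets this stack's samples
        for depth in range(1, len(frames) + 1):
            totals[tuple(frames[:depth])] += n
    return {path: n / grand_total for path, n in totals.items()}


samples = ["main;a;b 30", "main;a;c 10", "main;d 60"]
widths = box_widths(samples)
print(widths[("main",)], widths[("main", "a")], widths[("main", "d")])
```

Here main spans the full width, the box for a above it covers 40%, and d covers 60%. A box is keyed by its whole path, so the same function reached through two different callers produces two separate boxes.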
The Full Pipeline: bpftrace to SVG
Step 1 — install Brendan Gregg's FlameGraph tools:
# git clone https://github.com/brendangregg/FlameGraph /opt/FlameGraph
Step 2 — capture stack traces with bpftrace. Save the raw output:
# bpftrace -e 'profile:hz:99 /pid == 4521/ {
@[ustack] = count();
}' > /tmp/raw-stacks.txt
# Let it run for 30-60 seconds, then Ctrl+C
Step 3 — convert bpftrace output to folded stack format. The folded format has one stack per line, with frames separated by semicolons, followed by a space and the count:
# /opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/raw-stacks.txt > /tmp/folded.txt
# What folded format looks like:
# cat /tmp/folded.txt
main;PostgresMain;exec_simple_query;PortalRun;ExecutorRun;ExecScan;heapgettup_pagemode 312
main;PostgresMain;exec_simple_query;PortalRun;ExecutorRun;ExecScan;heapgettup 198
main;PostgresMain;ReadCommand;SocketBackend;pq_getbyte;pq_recvbuf;__GI___libc_read 1547
main;PostgresMain;exec_simple_query;PortalRun;ExecutorRun;ExecSort;tuplesort_performsort 89
Step 4 — generate the flame graph SVG:
# /opt/FlameGraph/flamegraph.pl /tmp/folded.txt > /tmp/postgres-cpu.svg
Step 5 — open it in a browser. The SVG is interactive — hover over any box to see the function name and sample count, click to zoom into a subtree.
# Open locally:
# firefox /tmp/postgres-cpu.svg
# Or SCP to your workstation:
# scp root@server:/tmp/postgres-cpu.svg .
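The folding in Step 3 is a small text transformation, and seeing it spelled out helps when the output looks wrong. The following is a simplified sketch, not a reimplementation of stackcollapse-bpftrace.pl: it assumes an anonymous map keyed by a single stack (blocks that start with `@[` on its own line), and stripping the `+offset` suffix is a choice made here so the result resembles the folded lines shown above.

```python
import re


def fold_bpftrace(text):
    """Convert bpftrace '@[ ... ]: N' stack dumps into folded lines."""
    folded = []
    frames = None  # None means we are outside a stack block
    for raw in text.splitlines():
        line = raw.strip()
        if line == "@[":
            frames = []
        elif frames is not None:
            end = re.fullmatch(r"\]: (\d+)", line)
            if end:
                # bpftrace prints the leaf first; folded format wants the root first
                folded.append(";".join(reversed(frames)) + " " + end.group(1))
                frames = None
            elif line:
                frames.append(re.sub(r"\+\d+$", "", line))  # drop the +offset
    return folded


raw_output = """\
Attaching 1 probe...
@[
    pq_recvbuf+115
    pq_getbyte+30
    main+753
]: 1547
"""
print(fold_bpftrace(raw_output))  # ['main;pq_getbyte;pq_recvbuf 1547']
```

The reversal is the important part: bpftrace lists the innermost frame first, while the folded format starts at the entry point.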
Kernel Flame Graph
Same pipeline, but capture kernel stacks with kstack:
# bpftrace -e 'profile:hz:99 { @[kstack] = count(); }' > /tmp/kstacks.txt
# Run for 30 seconds, Ctrl+C
# Fold and generate:
# /opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/kstacks.txt > /tmp/kfolded.txt
# /opt/FlameGraph/flamegraph.pl --title "Kernel CPU Flame Graph" \
--colors java /tmp/kfolded.txt > /tmp/kernel-cpu.svg
Combined Kernel + Userspace Flame Graph
The most powerful variant — shows the full call chain from userspace through the syscall boundary into the kernel:
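# bpftrace -e 'profile:hz:99 /pid == 4521/ {
@[kstack, ustack] = count();
}' > /tmp/combined.txt
# kstack is listed first because the collapse step reverses the frame order,
# which should place the user frames below the kernel frames in the graph.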
# /opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/combined.txt > /tmp/combined-folded.txt
# /opt/FlameGraph/flamegraph.pl --title "PostgreSQL: User + Kernel CPU" \
/tmp/combined-folded.txt > /tmp/postgres-combined.svg
A combined flame graph shows you the complete picture. You see your application calling write(), then the kernel's VFS layer, then the ZFS I/O pipeline, then the block device driver. If the hot path is in your application code, you know it's a code problem. If the hot path is in the kernel (say, zfs_zio_compress), you know it's an I/O or configuration problem. Without the combined view, you're guessing which side of the syscall boundary to blame.
Off-CPU Analysis
On-CPU profiling only tells half the story. If your application is slow but CPU usage is only 5%, the problem isn't where it's running — it's where it's sleeping. Off-CPU analysis traces the scheduler to record stack traces when threads go to sleep, and for how long. This catches everything on-CPU profiling misses: lock contention, disk I/O waits, network latency, sleep() calls, futex waits, pipe reads, and any other blocking operation.
The Two Halves of Performance
Every thread's wall-clock time is split into on-CPU time (running on a processor) and off-CPU time (sleeping, waiting, blocked). On-CPU profiling shows where CPU cycles go. Off-CPU analysis shows where time goes. A request that takes 500ms might spend 2ms on-CPU and 498ms off-CPU waiting for a DNS lookup, a disk read, or a lock. On-CPU profiling would say "nothing interesting, the process barely uses CPU." Off-CPU analysis would say "498ms blocked in getaddrinfo — your DNS server is slow."
The mechanism: trace the sched_switch tracepoint, which fires every time a thread is switched off the CPU. Record the timestamp and stack trace at switch-off, then compute the duration when the thread comes back on-CPU:
# bpftrace -e '
tracepoint:sched:sched_switch {
if (args->prev_state == 1 || args->prev_state == 2) {
@start[args->prev_pid] = nsecs;
@stack[args->prev_pid] = kstack;
}
}
tracepoint:sched:sched_switch /
@start[args->next_pid] != 0/ {
$dur = nsecs - @start[args->next_pid];
@offcpu[@stack[args->next_pid]] = sum($dur);
delete(@start[args->next_pid]);
delete(@stack[args->next_pid]);
}
END {
clear(@start);
clear(@stack);
}'
The output shows the total off-CPU time per unique stack trace, in nanoseconds:
@offcpu[
schedule+46
schedule_hrtimeout_range_clock+256
do_poll.constprop.0+567
do_sys_poll+567
__x64_sys_poll+174
do_syscall_64+91
entry_SYSCALL_64_after_hwframe+118
]: 18923847261
@offcpu[
schedule+46
io_schedule+46
zio_wait+214
dmu_buf_hold_array_by_dnode+215
dmu_read_impl+76
dmu_read+58
zfs_read+358
zpl_read_common_iovec+162
zpl_iter_read+116
vfs_read+451
ksys_read+95
do_syscall_64+91
entry_SYSCALL_64_after_hwframe+118
]: 4281563890
Read those two stacks. The first one — 18.9 seconds total in poll() — is threads sleeping on I/O multiplexing, probably idle event loops. Uninteresting. The second one — 4.28 seconds in io_schedule inside zio_wait inside zfs_read — is threads blocked on ZFS reads waiting for disk I/O to complete. If your application feels slow, that 4.28 seconds of ZFS read latency is likely the bottleneck. On-CPU profiling would never have shown this, because the threads aren't running during that time. They're just... waiting.
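The bookkeeping behind those totals can be mirrored offline. The sketch below replays a list of context-switch events through the same two steps as the sched_switch probes; the event tuple layout, the thread IDs, and the function name `off_cpu_totals` are all hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def off_cpu_totals(events):
    """Sum off-CPU nanoseconds per stack.

    events: iterable of (timestamp_ns, prev_tid, prev_blocked, prev_stack, next_tid),
    one tuple per context switch.
    """
    start = {}                 # tid -> time it left the CPU while blocked
    stack = {}                 # tid -> stack captured when it left
    totals = defaultdict(int)  # stack -> total off-CPU nanoseconds
    for ts, prev_tid, prev_blocked, prev_stack, next_tid in events:
        if prev_blocked:
            # Outgoing thread is going to sleep: note when and where
            start[prev_tid] = ts
            stack[prev_tid] = prev_stack
        if next_tid in start:
            # Incoming thread was asleep: charge the gap to its saved stack
            totals[stack.pop(next_tid)] += ts - start.pop(next_tid)
    return dict(totals)


poll_stack = ("schedule", "do_sys_poll")
zio_stack = ("schedule", "io_schedule", "zio_wait")
events = [
    (1000, 10, True, poll_stack, 20),  # tid 10 blocks in poll, tid 20 runs
    (1500, 20, True, zio_stack, 0),    # tid 20 blocks on I/O, idle task runs
    (4000, 0, False, (), 20),          # tid 20 resumes after 2500 ns
    (9000, 20, False, (), 10),         # tid 20 is preempted, tid 10 resumes after 8000 ns
]
print(off_cpu_totals(events))
```

Note that the preempted thread in the last event is not recorded, matching the prev_state filter in the script: only voluntary sleeps count as off-CPU time here.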
Filter Off-CPU Analysis by Process
In production, you usually want to focus on a single process rather than the entire system:
# bpftrace -e '
tracepoint:sched:sched_switch /args->prev_pid == 4521/ {
if (args->prev_state != 0) {
@start[args->prev_pid] = nsecs;
@offstack[args->prev_pid] = ustack;
}
}
tracepoint:sched:sched_switch /
args->next_pid == 4521 &&
@start[args->next_pid] != 0/ {
$dur = nsecs - @start[args->next_pid];
@blocked[@offstack[args->next_pid]] = sum($dur);
delete(@start[args->next_pid]);
delete(@offstack[args->next_pid]);
}
interval:s:60 { exit(); }
END { clear(@start); clear(@offstack); }
'
Off-CPU Flame Graphs
Off-CPU flame graphs use the same visualization as CPU flame graphs, but the width of each bar represents blocked time instead of CPU time. A wide bar at the bottom means threads spent a lot of wall-clock time blocked in that code path. This is where you find your I/O bottlenecks, lock contention, and synchronization stalls.
Complete bpftrace script that outputs in a format compatible with the FlameGraph tools:
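#!/usr/bin/env bpftrace
// offcpu-flamegraph.bt: off-CPU profiling for flame graph generation
// Usage: bpftrace offcpu-flamegraph.bt PID > /tmp/offcpu-raw.txt
// PID is read as the positional parameter $1. sched_switch reports thread IDs,
// so this follows the thread whose ID equals PID (the main thread).
tracepoint:sched:sched_switch /args->prev_pid == $1 && args->prev_state != 0/ {
    // The thread is blocking: remember when, and the user stack that led here
    @start[args->prev_pid] = nsecs;
    @offstack[args->prev_pid] = ustack;
}
tracepoint:sched:sched_switch /args->next_pid == $1 && @start[args->next_pid] != 0/ {
    // The current task here is still the outgoing one, so key by next_pid
    // and use the stack saved at switch-out rather than tid and ustack.
    $dur = (nsecs - @start[args->next_pid]) / 1000; // microseconds
    @blocked[@offstack[args->next_pid]] = sum($dur);
    delete(@start[args->next_pid]);
    delete(@offstack[args->next_pid]);
}
END {
    clear(@start);
    clear(@offstack);
}
Run it, then generate the flame graph:
# bpftrace offcpu-flamegraph.bt 4521 > /tmp/offcpu-raw.txt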
# Let it run for 60 seconds, Ctrl+C
# Fold and generate:
# /opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/offcpu-raw.txt > /tmp/offcpu-folded.txt
# /opt/FlameGraph/flamegraph.pl --title "PostgreSQL Off-CPU Time" \
--countname "microseconds" \
--colors io \
/tmp/offcpu-folded.txt > /tmp/postgres-offcpu.svg
The --colors io flag makes the flame graph use blue-green tones instead of the default warm colors. This is a convention: warm colors (red/yellow/orange) for CPU flame graphs, cool colors (blue/green/aqua) for off-CPU. When you're comparing the two side-by-side, you can instantly tell which is which. The --countname "microseconds" changes the tooltip to show time units instead of sample counts.
Interpreting Off-CPU Flame Graphs
Common patterns you'll see in off-CPU flame graphs and what they mean:
Wide bar at futex_wait
Threads are blocked on a mutex or condition variable. Follow the stack below futex_wait to find which application lock is contended. The width tells you the total time all threads spent waiting on that lock. Common in thread pool implementations, database connection pools, and any code using pthread_mutex_lock.
Wide bar at io_schedule / zio_wait
Threads are blocked waiting for disk I/O. On ZFS systems, you'll see this in zio_wait inside the ZFS I/O pipeline. The fix is faster disks, better ZFS tuning (ARC size, prefetch settings), or reducing I/O demand. This is the most common finding on storage-heavy workloads.
Wide bar at sk_wait_data / inet_csk_wait_for_connect
Threads are blocked waiting for network data. sk_wait_data means waiting for data on an already-connected socket. inet_csk_wait_for_connect means a listening socket waiting for new connections. If your application is slow and the off-CPU flame graph shows most time here, the network is the bottleneck.
Wide bar at ep_poll / do_select
Threads are sleeping in epoll_wait() or select(). This is usually harmless — it means an event loop is idle, waiting for work. If you see this dominating an off-CPU flame graph, it doesn't mean your system is slow. It means your system is mostly idle. Filter it out mentally or by excluding specific comm names.
Wakeup Analysis
Off-CPU analysis tells you where a thread is sleeping. Wakeup analysis tells you who wakes it up — and that's where you find cascading latency chains. Process A is slow because it's sleeping in futex_wait. Who wakes it up? Process B, via futex_wake. Why was process B slow? Because it was blocked on a disk read. Why was the disk read slow? Because ZFS txg_sync was flushing a transaction group and saturating the I/O queue.
The Wakeup Chain
A wakeup chain traces the cause of a thread's sleep backward through the system. Thread X is blocked → Thread Y wakes it → Thread Y was blocked on something else → Thread Z woke Thread Y → Thread Z was doing disk I/O. By tracing sched_wakeup, you capture the waker's stack trace at the moment it wakes the target. This connects the sleeping thread to its root cause, even if the root cause is in a completely different process.
Trace who wakes a specific process (PID 4521):
# bpftrace -e '
tracepoint:sched:sched_wakeup /args->pid == 4521/ {
printf("Woken by: %s (PID %d)\n", comm, pid);
printf("Waker kernel stack:\n");
print(kstack());
printf("---\n");
}'
Attaching 1 probe...
Woken by: postgres (PID 4535)
Waker kernel stack:
try_to_wake_up+508
futex_wake+534
do_futex+318
__x64_sys_futex+161
do_syscall_64+91
entry_SYSCALL_64_after_hwframe+118
---
Woken by: z_wr_iss (PID 289)
Waker kernel stack:
try_to_wake_up+508
__wake_up_common+119
__wake_up_common_lock+122
zio_notify_parent+162
zio_done+1284
zio_execute+114
taskq_thread+567
---
Two different wakeup patterns. The first one is another PostgreSQL backend (PID 4535) waking our target via futex_wake — this is inter-process synchronization, probably shared buffer coordination. The second is a ZFS write issuer thread (z_wr_iss) waking our target because an I/O operation completed. If you trace PID 4535 next and find it was blocked on disk I/O, you've traced the latency chain from your slow query all the way down to the disk.
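Following a chain by hand gets tedious after two or three hops, so it can help to write the links down and walk them mechanically. The sketch below is a hypothetical notebook-style helper: `wakeup_chain` and the `blocked_on` table are inventions for this example, and the second link (PID 4535 being woken by z_wr_iss) is an assumed finding from a follow-up trace, not something shown in the output above.

```python
def wakeup_chain(target, blocked_on, max_hops=10):
    """Follow waker links from target until a thread has no recorded waker.

    blocked_on: dict mapping a thread label to a (waker label, reason) tuple.
    """
    chain = []
    current = target
    seen = {target}
    while current in blocked_on and len(chain) < max_hops:
        waker, reason = blocked_on[current]
        chain.append((current, waker, reason))
        if waker in seen:  # stop on cycles, e.g. two threads waking each other
            break
        seen.add(waker)
        current = waker
    return chain


# Links collected from successive sched_wakeup traces (second one assumed)
blocked_on = {
    "postgres:4521": ("postgres:4535", "futex_wake"),
    "postgres:4535": ("z_wr_iss:289", "zio_done"),
}
for sleeper, waker, reason in wakeup_chain("postgres:4521", blocked_on):
    print(f"{sleeper} <- {waker} ({reason})")
```

The last waker in the chain is the place to look next: it is the thread that nobody in your notes was waiting on.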
Combine wakeup analysis with off-CPU duration to find the slowest wakeup chains:
# bpftrace -e '
tracepoint:sched:sched_switch /args->prev_pid == 4521 && args->prev_state != 0/ {
@sleep_start[args->prev_pid] = nsecs;
}
tracepoint:sched:sched_wakeup /args->pid == 4521 && @sleep_start[args->pid] != 0/ {
$dur_ms = (nsecs - @sleep_start[args->pid]) / 1000000;
if ($dur_ms > 10) {
printf("[%dms] Woken by %s (PID %d)\n", $dur_ms, comm, pid);
print(kstack());
}
delete(@sleep_start[args->pid]);
}
END { clear(@sleep_start); }
'
Attaching 3 probes...
[47ms] Woken by z_wr_iss (PID 289)
try_to_wake_up+508
__wake_up_common+119
__wake_up_common_lock+122
zio_notify_parent+162
zio_done+1284
zio_execute+114
taskq_thread+567
[238ms] Woken by jbd2/sda3-8 (PID 412)
try_to_wake_up+508
wake_up_bit+51
journal_end_buffer_io_sync+50
end_buffer_write_sync+33
blkdev_bio_end_io+181
bio_endio+280
blk_update_request+332
That 238ms wakeup from jbd2 (the ext4 journal thread) is suspicious. The PostgreSQL WAL (write-ahead log) might be on an ext4 partition, and the journal commit is slow. If the WAL were on a ZFS dataset with synchronous writes going to a fast SLOG device, that 238ms might become 2ms. Wakeup analysis doesn't just find the bottleneck — it shows you exactly which subsystem to fix.
Differential Flame Graphs
You pushed a change. Performance got worse. Or maybe better. Differential flame graphs show you exactly which code paths changed and by how much. Generate a flame graph before the change, generate one after, diff them. Red means a code path got hotter (regression). Blue means it got cooler (improvement).
Step 1 — capture the "before" profile:
# bpftrace -e 'profile:hz:99 /pid == 4521/ { @[ustack] = count(); }' \
> /tmp/before-raw.txt
# Run for 60 seconds under representative load, Ctrl+C
# /opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/before-raw.txt > /tmp/before.folded
Step 2 — deploy your change, then capture the "after" profile under the same load:
# bpftrace -e 'profile:hz:99 /pid == 5102/ { @[ustack] = count(); }' \
> /tmp/after-raw.txt
# Same duration, same load, Ctrl+C
# /opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/after-raw.txt > /tmp/after.folded
Step 3 — generate the differential flame graph:
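# /opt/FlameGraph/difffolded.pl /tmp/before.folded /tmp/after.folded \
| /opt/FlameGraph/flamegraph.pl --title "Differential: before vs after" \
> /tmp/diff.svg
With the profiles passed in before, after order, flamegraph.pl already colors growth red (more time than before) and reduction blue (less time than before), and box widths come from the after profile. The --negate flag swaps the two hues. It is meant for the reversed invocation (after, before), where widths come from the before profile, which makes code paths that disappeared entirely visible as well.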
Reading a Differential Flame Graph
Red bars are code paths that now take more CPU time than before. The more saturated the red, the bigger the regression. Blue bars are code paths that take less CPU time. White/gray bars are unchanged. The width still represents total time (from the "after" profile). Look for the biggest red bars — those are the functions your change made slower. Look for blue bars to confirm expected improvements. If a function went from 5% to 15% of CPU time, it'll be bright red and wider than before.
Differential flame graphs are a cheat code for performance regressions. Instead of spending hours comparing profiles manually, you get a single image where the problem is highlighted in red. You deployed a new query planner? The diff flame graph shows ExecHashJoin is bright red (more CPU) while ExecNestLoop is blue (less CPU) — the planner is choosing hash joins over nested loops, and hash joins are more expensive for your data distribution. One image, complete diagnosis.
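The comparison underneath a differential flame graph is a per-stack subtraction. The sketch below shows that step on folded input; the function names and the sample counts are invented for illustration, and unlike difffolded.pl it does no normalization, so it assumes both profiles were captured for the same duration under the same load.

```python
def parse_folded(lines):
    """Turn folded lines ('a;b;c 12') into a {stack: count} dictionary."""
    counts = {}
    for line in lines:
        stack, count = line.rsplit(" ", 1)
        counts[stack] = counts.get(stack, 0) + int(count)
    return counts


def diff_folded(before_lines, after_lines):
    """Return {stack: (before, after, delta)} for every stack in either profile."""
    before = parse_folded(before_lines)
    after = parse_folded(after_lines)
    result = {}
    for stack in sorted(set(before) | set(after)):
        b, a = before.get(stack, 0), after.get(stack, 0)
        result[stack] = (b, a, a - b)
    return result


before = ["main;ExecutorRun;ExecNestLoop 400", "main;ExecutorRun;ExecSort 100"]
after = ["main;ExecutorRun;ExecHashJoin 650", "main;ExecutorRun;ExecSort 100"]
for stack, (b, a, delta) in diff_folded(before, after).items():
    print(f"{delta:+5d}  {stack}")
```

A positive delta is what shows up red and a negative one blue. The ExecNestLoop row has an after count of zero, which is why a stack that vanished has no width when the graph is drawn from the after profile.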
Debug Symbols
Without debug symbols, your stack traces show hex addresses instead of function names. Instead of PostgresMain+1520, you get 0x55a3c7f21a30. This makes everything useless. Debug symbols are the mapping from memory addresses to human-readable function names, source file names, and line numbers. You need them for meaningful userspace stack traces.
Installing Debug Symbols
CentOS / RHEL / Rocky / Fedora — use debuginfo-install:
# dnf install dnf-utils
# Install debug symbols for a specific package:
# debuginfo-install postgresql15-server
# Or install by running binary:
# debuginfo-install -y $(rpm -qf /usr/pgsql-15/bin/postgres)
# Install kernel debuginfo:
# debuginfo-install kernel-$(uname -r)
# Verify installation:
# ls /usr/lib/debug/usr/pgsql-15/bin/postgres.debug
/usr/lib/debug/usr/pgsql-15/bin/postgres.debug
Debian / Ubuntu — enable the dbgsym repository and install -dbgsym packages:
# For Debian:
# echo "deb http://deb.debian.org/debian-debug/ trixie-debug main" \
>> /etc/apt/sources.list.d/debug.list
# apt update
# For Ubuntu:
# echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted \
universe multiverse" >> /etc/apt/sources.list.d/ddebs.list
# apt install ubuntu-dbgsym-keyring
# apt update
# Install debug symbols:
# apt install postgresql-15-dbgsym
# apt install linux-image-$(uname -r)-dbgsym
# Verify:
# ls /usr/lib/debug/.build-id/
00/ 01/ 02/ 03/ 04/ ... fd/ fe/ ff/
The .build-id Directory Structure
Modern Linux systems store debug symbols indexed by build ID — a unique hash of the binary. The debugger looks up /usr/lib/debug/.build-id/ab/cdef1234.debug to find symbols for a binary with build ID abcdef1234. This means debug symbols always match the exact binary, even across updates. You can check a binary's build ID with readelf -n /usr/bin/postgres | grep "Build ID".
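The lookup rule is mechanical: the first two hex digits of the build ID name the subdirectory and the rest names the file. A minimal sketch, with the helper name `debug_file_for_build_id` made up for this example and /usr/lib/debug taken from the paths shown above:

```python
from pathlib import PurePosixPath


def debug_file_for_build_id(build_id, root="/usr/lib/debug"):
    """Return the path where debug symbols for a build ID are expected."""
    build_id = build_id.strip().lower()
    if len(build_id) < 3 or any(c not in "0123456789abcdef" for c in build_id):
        raise ValueError(f"not a hex build ID: {build_id!r}")
    # First two hex digits pick the directory, the remainder is the file name
    return PurePosixPath(root) / ".build-id" / build_id[:2] / (build_id[2:] + ".debug")


print(debug_file_for_build_id("abcdef1234"))
# /usr/lib/debug/.build-id/ab/cdef1234.debug
```

Real build IDs are much longer than this shortened example (typically 40 hex digits), but the split after the first two characters is the same.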
DWARF vs BTF
DWARF
The standard debug format for userspace binaries and the kernel. Contains everything: function names, variable types, source file mappings, line numbers, unwinding rules. A kernel with full DWARF debug info is 800MB+. Stored in .debug_info, .debug_line, .debug_frame, and related ELF sections. This is what gdb and perf use for userspace. bpftrace needs less: to label the addresses that ustack() returns it only needs function names, which come from ELF symbol tables, typically including those in separately installed debuginfo files.
BTF (BPF Type Format)
A compact type format designed specifically for eBPF. Contains only type information (struct layouts, function signatures), with no line numbers and no unwind rules. The BTF data for a kernel is about 5MB, compared with 800MB+ of DWARF. BTF is what makes bpftrace able to access kernel struct fields without installing kernel-debuginfo. It's embedded in the kernel image itself and exposed at /sys/kernel/btf/vmlinux. Not a replacement for DWARF, but a complement.
Check if your kernel has BTF enabled:
# ls -la /sys/kernel/btf/vmlinux
-r--r--r-- 1 root root 5765432 Apr 4 12:00 /sys/kernel/btf/vmlinux
# If that file exists, BTF is enabled. bpftrace will use it automatically.
# If not, you need kernel-debuginfo for struct access.
Application-Specific Debug Symbols
Common applications and their debug symbol packages:
# PostgreSQL
# CentOS/RHEL: debuginfo-install postgresql15-server
# Debian/Ubuntu: apt install postgresql-15-dbgsym
# nginx
# CentOS/RHEL: debuginfo-install nginx
# Debian/Ubuntu: apt install nginx-dbg # note: -dbg not -dbgsym for nginx
# Python (for profiling Python C extensions)
# CentOS/RHEL: debuginfo-install python3
# Debian/Ubuntu: apt install python3-dbg
# glibc (needed for libc function names in stacks)
# CentOS/RHEL: debuginfo-install glibc
# Debian/Ubuntu: apt install libc6-dbg
# Node.js — no debug symbols needed, but you need perf maps:
# node --perf-basic-prof your-app.js
# This writes /tmp/perf-PID.map which bpftrace and perf can read
# Java — needs perf-map-agent:
# git clone https://github.com/jvm-profiling-tools/perf-map-agent
# cmake . && make
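# java -XX:+PreserveFramePointer -agentpath:./libperfmap.so -jar your-app.jar
# -XX:+PreserveFramePointer keeps frame pointers in JIT-compiled code so the stack walk works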
A common reason for incomplete stack traces is missing glibc symbols. Your application's own symbols show function names, but every stack trace goes through libc at some point (read, write, malloc, pthread_create). Exported functions usually resolve from libc's dynamic symbol table, but internal helpers show up as hex addresses without glibc debuginfo, and the stack looks incomplete. Install glibc debuginfo early. Keep in mind that symbols only fix names: if libc itself was built without frame pointers, ustack() can still lose the frames beneath it.
Core Dump Capture Triggered by eBPF
eBPF can detect anomalies in real time — segfaults, high latency, memory spikes, specific error codes — and trigger core dump capture automatically. Instead of waiting for users to report a crash and hoping the core dump was preserved, you set up eBPF to watch for the condition and grab the evidence the moment it happens.
systemd-coredump Configuration
First, make sure systemd-coredump is configured to actually capture core dumps:
# cat /etc/systemd/coredump.conf
[Coredump]
Storage=external
Compress=yes
ProcessSizeMax=2G
ExternalSizeMax=2G
JournalSizeMax=100M
MaxUse=10G
# No reload is needed: systemd-coredump is started fresh for each crash,
# so changes apply to the next core dump.
Verify it's working:
# coredumpctl list
TIME PID UID GID SIG COREFILE EXE SIZE
Fri 2026-04-04 09:14:23 UTC 8821 1000 1000 SIGSEGV present /usr/bin/myapp 24.3M
Fri 2026-04-04 09:15:47 UTC 8834 1000 1000 SIGABRT present /usr/bin/myapp 31.1M
eBPF: Detect Segfaults and Log Context
Use eBPF to monitor for SIGSEGV (segfault) signals and capture surrounding context that coredumpctl alone doesn't provide — like what the process was doing in the seconds before the crash:
#!/usr/bin/env bpftrace
// crash-monitor.bt: detect segfaults and capture pre-crash context
// signal_generate runs in the context of the task raising the signal. For a real
// fault that is the crashing thread; for kill -SEGV it is the sender.
tracepoint:signal:signal_generate /args->sig == 11/ {
    printf("\n=== SIGSEGV DETECTED ===\n");
    printf("Time: %s\n", strftime("%Y-%m-%d %H:%M:%S", nsecs));
    printf("Raised in: %s (PID %d, TID %d)\n", comm, pid, tid);
    printf("Signal target TID: %d\n", args->pid);
    printf("\nKernel stack at crash:\n");
    print(kstack());
    printf("\nUserspace stack at crash:\n");
    print(ustack());
    printf("\nSyscalls from this process in the current 5-second window:\n");
    print(@recent_syscalls[pid]);
    printf("=========================\n");
}
// Count syscalls per process for crash context
tracepoint:raw_syscalls:sys_enter /comm == "myapp"/ {
    @recent_syscalls[pid] = count();
}
// Reset the counts every 5 seconds so they describe only the recent past
interval:s:5 {
    clear(@recent_syscalls);
}
// Track open files for crash context
tracepoint:syscalls:sys_enter_openat /comm == "myapp"/ {
    printf("[pre-crash file access] %s: open(%s)\n", comm, str(args->filename));
}
// Track memory allocations for crash context
uprobe:/lib64/libc.so.6:malloc /comm == "myapp"/ {
    @alloc_sizes[pid] = hist(arg0);
}
eBPF: Trigger Core Dump on High Latency
Sometimes you don't want to wait for a crash — you want to capture a core dump when a specific operation exceeds a latency threshold. This script monitors PostgreSQL query execution and sends SIGABRT to trigger a core dump when a query takes longer than 5 seconds:
#!/usr/bin/env bpftrace
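// slow-query-coredump.bt: capture core dump on slow queries
// Run with: bpftrace --unsafe slow-query-coredump.bt
// (signal() is rejected unless bpftrace is started with --unsafe)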
uprobe:/usr/pgsql-15/bin/postgres:exec_simple_query {
@query_start[tid] = nsecs;
}
uretprobe:/usr/pgsql-15/bin/postgres:exec_simple_query
/@query_start[tid] != 0/ {
$dur_ms = (nsecs - @query_start[tid]) / 1000000;
if ($dur_ms > 5000) {
printf("SLOW QUERY: %dms in PID %d — triggering core dump\n",
$dur_ms, pid);
print(ustack());
signal("SIGABRT");
}
delete(@query_start[tid]);
}
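Sending SIGABRT to a production PostgreSQL backend kills that backend, and the postmaster treats the abnormal exit as a crash: it terminates every other backend, runs crash recovery, and only then accepts new connections. Every client session is dropped, not just the slow one. Note also that the uretprobe fires when exec_simple_query returns, so the dump captures the backend just after the slow query finished rather than in the middle of it. This is not something you do casually. You use it when you have a recurring slow-query problem that you can't reproduce with EXPLAIN ANALYZE, and you need the memory state of the backend that ran the slow query. The core dump contains whatever is still in the backend's address space at that point, such as buffer state and lock state. One core dump from production is worth a hundred hours of trying to reproduce in dev.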
Post-Mortem Analysis
The workflow: eBPF detects the anomaly live → captures context (stack traces, syscall history, memory allocation patterns) → triggers core dump → you analyze the core dump offline with gdb and coredumpctl. eBPF gives you the when and why. The core dump gives you the what — exact variable values, heap state, thread state.
coredumpctl Analysis Workflow
# List all core dumps:
# coredumpctl list
TIME PID UID GID SIG COREFILE EXE SIZE
Fri 2026-04-04 09:14:23 UTC 8821 1000 1000 SIGSEGV present /usr/bin/myapp 24.3M
Fri 2026-04-04 09:15:47 UTC 8834 1000 1000 SIGABRT present /usr/bin/myapp 31.1M
# Get detailed info about a specific dump:
# coredumpctl info 8821
PID: 8821 (myapp)
UID: 1000 (appuser)
GID: 1000 (appuser)
Signal: 11 (SEGV)
Timestamp: Fri 2026-04-04 09:14:23 UTC
Command Line: /usr/bin/myapp --config /etc/myapp.conf
Executable: /usr/bin/myapp
Control Group: /system.slice/myapp.service
Unit: myapp.service
Slice: system.slice
Boot ID: a1b2c3d4e5f6...
Machine ID: f6e5d4c3b2a1...
Hostname: prod-web-03
Storage: /var/lib/systemd/coredump/core.myapp.1000.a1b2c3.8821.1712222063000000.zst
Size on Disk: 24.3M
# Open in gdb:
# coredumpctl gdb 8821
GNU gdb (GDB) 13.2
Reading symbols from /usr/bin/myapp...
Reading symbols from /usr/lib/debug/.build-id/ab/cdef1234.debug...
Core was generated by `/usr/bin/myapp --config /etc/myapp.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00005583a7f21a30 in process_request (req=0x0) at src/handler.c:247
247 size_t len = req->content_length;
(gdb) bt
#0 0x00005583a7f21a30 in process_request (req=0x0) at src/handler.c:247
#1 0x00005583a7f22b50 in handle_connection (conn=0x7f3a2c001230) at src/server.c:182
#2 0x00005583a7f23c70 in worker_thread (arg=0x7f3a2c000010) at src/threadpool.c:95
#3 0x00007f3a3c0a3ea7 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f3a3bdd2a2f in clone () from /lib64/libc.so.6
(gdb) frame 0
#0 0x00005583a7f21a30 in process_request (req=0x0) at src/handler.c:247
247 size_t len = req->content_length;
(gdb) print req
$1 = (struct request *) 0x0
(gdb) frame 1
#1 0x00005583a7f22b50 in handle_connection (conn=0x7f3a2c001230) at src/server.c:182
182 process_request(conn->current_request);
(gdb) print conn->current_request
$2 = (struct request *) 0x0
(gdb) print conn->state
$3 = CONNECTION_CLOSING
And there it is. The connection was in CONNECTION_CLOSING state, which means current_request was already freed and set to NULL. But handle_connection didn't check the state before calling process_request. A race condition: the connection close handler ran on one thread while the request handler ran on another. The eBPF crash monitor told you when and how often. The core dump told you exactly which variable was null and why. That's the power of combining live tracing with post-mortem analysis.
Correlating eBPF Context with Core Dumps
The real power is correlating eBPF's live context with the core dump's frozen state. Set up eBPF to log everything leading up to the crash, then use the core dump to inspect the exact state:
# 1. Run the crash monitor (writes to /var/log/ebpf-crashes.log):
# bpftrace crash-monitor.bt > /var/log/ebpf-crashes.log 2>&1 &
# 2. When a crash happens, check the eBPF log:
# tail -50 /var/log/ebpf-crashes.log
=== SIGSEGV DETECTED ===
Time: 2026-04-04 09:14:23
Process: myapp (PID 8821, TID 8825)
Recent file accesses:
open(/var/lib/myapp/session-4a2b.dat)
open(/var/lib/myapp/session-4a2b.dat.lock)
Allocation histogram (last 10s):
[64, 128) : 12847
[128, 256) : 8921
[256, 512) : 3401
[4K, 8K) : 2
[1M, 2M) : 1 ← suspicious large allocation right before crash
# 3. Open the core dump and check that 1MB allocation:
# coredumpctl gdb 8821
(gdb) info threads
Id Target Id Frame
1 Thread 0x7f3a3c0a5740 0x00007f3a3bdd1337 in epoll_wait ()
* 2 Thread 0x7f3a3b89e700 0x00005583a7f21a30 in process_request ()
3 Thread 0x7f3a3b09d700 0x00007f3a3bdd2a2f in clone ()
4 Thread 0x7f3a3a89c700 0x00007f3a3bdc94ed in nanosleep ()
(gdb) thread 2
(gdb) info locals
req = 0x0
buf = 0x7f3a2c100000
buf_size = 1048576
Memory Leak Detection
Memory leaks are silent killers. A process allocates memory, forgets to free it, and slowly grows until the OOM killer shows up three weeks later. Traditional tools like Valgrind work but require restarting the process under instrumentation, which is rarely acceptable in production. eBPF traces malloc/free (userspace) or kmalloc/kfree (kernel) on a live process. The overhead scales with the allocation rate: it is small for most services but can be significant for allocation-heavy workloads, so sample allocations (memleak's -s option) when in doubt.
The memleak BCC Tool
BCC includes a purpose-built memleak tool that does exactly this. It attaches uprobes to the allocator functions, tracks outstanding allocations, and reports the ones that haven't been freed:
# Trace userspace allocations in PID 4521, show top 10 every 5 seconds:
# /usr/share/bcc/tools/memleak -p 4521 5
Attaching to pid 4521, Ctrl+C to quit.
[09:30:15] Top 10 stacks with outstanding allocations:
948736 bytes in 3714 allocations from stack
operator new(unsigned long)+0x1c [libstdc++.so.6]
std::string::_Rep::_S_create(unsigned long, ...)+0x59
ConnectionPool::createConnection()+0x4a [myapp]
ConnectionPool::getConnection()+0x123 [myapp]
RequestHandler::handleRequest()+0x67 [myapp]
main+0x2a1 [myapp]
524288 bytes in 1 allocations from stack
malloc+0x3e [libc.so.6]
json_parse_buffer()+0x28 [libjansson.so.4]
parse_config()+0x81 [myapp]
main+0x55 [myapp]
[09:30:20] Top 10 stacks with outstanding allocations:
1523712 bytes in 5967 allocations from stack
operator new(unsigned long)+0x1c [libstdc++.so.6]
std::string::_Rep::_S_create(unsigned long, ...)+0x59
ConnectionPool::createConnection()+0x4a [myapp]
ConnectionPool::getConnection()+0x123 [myapp]
RequestHandler::handleRequest()+0x67 [myapp]
main+0x2a1 [myapp]
Between the two samples (5 seconds apart), the ConnectionPool::createConnection stack grew from 948KB (3,714 allocations) to 1.52MB (5,967 allocations). That's 2,253 more outstanding allocations in 5 seconds, about 115KB/s of growth. The connection pool is leaking: it creates new connections but never returns them to the pool or closes them. The stack trace tells you exactly which function to fix. In production. Without restarting anything. Without Valgrind's 20x slowdown.
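The comparison between two memleak samples is easy to script. The following is a minimal Python sketch, assuming the summary-line format shown in the output above; the function name leak_growth is hypothetical and is not part of BCC.

```python
import re

# Matches memleak summary lines such as
# "948736 bytes in 3714 allocations from stack".
SUMMARY = re.compile(r"^\s*(\d+) bytes in (\d+) allocations from stack")


def leak_growth(first_line, second_line, interval_s):
    """Return (extra_bytes, extra_allocs, bytes_per_second) between two samples."""
    b1, a1 = map(int, SUMMARY.match(first_line).groups())
    b2, a2 = map(int, SUMMARY.match(second_line).groups())
    extra_bytes = b2 - b1
    return extra_bytes, a2 - a1, extra_bytes / interval_s


# The two summary lines for the createConnection stack, 5 seconds apart.
growth = leak_growth(
    "948736 bytes in 3714 allocations from stack",
    "1523712 bytes in 5967 allocations from stack",
    interval_s=5,
)
print(growth)
```

A stack whose byte count keeps rising across many intervals is a leak candidate; one that rises and falls is more likely a cache or a buffer doing its job.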
Kernel Memory Leak Detection
Trace kernel memory allocations with kmalloc/kfree. This catches kernel module leaks, driver bugs, and subsystem leaks:
# /usr/share/bcc/tools/memleak 5
Attaching to kernel allocators, Ctrl+C to quit.
[09:35:10] Top 10 stacks with outstanding allocations:
4194304 bytes in 1024 allocations from stack
kmalloc_trace+0x2b
zfs_znode_alloc+0x9a
zfs_zget+0x14f
zfs_dirent_lock+0x1c3
zfs_lookup+0x132
zpl_lookup+0x5e
Manual malloc/free Tracing with bpftrace
For more control, write your own malloc/free tracer:
#!/usr/bin/env bpftrace
// malloc-tracer.bt — track allocations and frees for a specific process
uprobe:/lib64/libc.so.6:malloc /pid == $1/ {
@alloc_size[tid] = arg0;
}
uretprobe:/lib64/libc.so.6:malloc /pid == $1 && @alloc_size[tid] != 0/ {
@outstanding[retval] = @alloc_size[tid];
@alloc_stacks[ustack, @alloc_size[tid]] = count();
@total_alloc = sum(@alloc_size[tid]);
delete(@alloc_size[tid]);
}
uprobe:/lib64/libc.so.6:free /pid == $1 && arg0 != 0/ {
if (@outstanding[arg0] != 0) {
@total_free = sum(@outstanding[arg0]);
delete(@outstanding[arg0]);
}
}
interval:s:10 {
printf("\n--- Outstanding allocations: %d bytes ---\n",
@total_alloc - @total_free);
}
END {
// @alloc_stacks counts every malloc by stack and size; frees are not subtracted.
printf("\n=== Allocation counts by stack and size (includes freed) ===\n");
print(@alloc_stacks);
clear(@outstanding);
clear(@alloc_size);
}
# bpftrace malloc-tracer.bt 4521
Attaching 5 probes...
--- Outstanding allocations: 2481152 bytes ---
--- Outstanding allocations: 4915200 bytes ---
--- Outstanding allocations: 7340032 bytes ---
^C
=== Allocation counts by stack and size (includes freed) ===
@alloc_stacks[
malloc+62
ConnectionPool::createConnection()+74
ConnectionPool::getConnection()+291
RequestHandler::handleRequest()+103
main+673
, 256]: 28672
Lock Contention Analysis
Lock contention is one of the most common killers of multi-threaded application performance. Thread A holds a lock. Threads B, C, D, E, and F are all blocked waiting for it. On-CPU profiling shows Thread A consuming CPU. Off-CPU analysis shows Threads B-F sleeping. But neither tells you which lock is the problem or how long threads wait for it. Lock contention analysis does.
Tracing pthread_mutex_lock
Trace all mutex acquisitions and measure the time threads spend waiting:
#!/usr/bin/env bpftrace
// lock-contention.bt — measure mutex wait times
// On glibc 2.34+ pthread_mutex_lock lives in libc.so.6; adjust the probe path to match.
uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /pid == $1/ {
@lock_start[tid] = nsecs;
@lock_addr[tid] = arg0;
}
uretprobe:/lib64/libpthread.so.0:pthread_mutex_lock
/pid == $1 && @lock_start[tid] != 0/ {
$dur = nsecs - @lock_start[tid];
$dur_us = $dur / 1000;
@lock_wait_us = hist($dur_us);
@lock_wait_by_addr[@lock_addr[tid]] = sum($dur);
@lock_contention[@lock_addr[tid], ustack] = count();
if ($dur_us > 1000) {
printf("SLOW LOCK: %dus waiting for mutex 0x%lx\n",
$dur_us, @lock_addr[tid]);
print(ustack());
}
delete(@lock_start[tid]);
delete(@lock_addr[tid]);
}
END {
printf("\n=== Lock wait time distribution (microseconds) ===\n");
print(@lock_wait_us);
printf("\n=== Total wait time by lock address ===\n");
print(@lock_wait_by_addr);
printf("\n=== Contention by lock + stack ===\n");
print(@lock_contention);
}
# bpftrace lock-contention.bt 4521
Attaching 3 probes...
SLOW LOCK: 4721us waiting for mutex 0x55a3c8012340
pthread_mutex_lock+37
ConnectionPool::getConnection()+45
RequestHandler::handleRequest()+103
WorkerThread::run()+201
start_thread+741
SLOW LOCK: 12847us waiting for mutex 0x55a3c8012340
pthread_mutex_lock+37
ConnectionPool::getConnection()+45
RequestHandler::handleRequest()+103
WorkerThread::run()+201
start_thread+741
^C
=== Lock wait time distribution (microseconds) ===
@lock_wait_us:
[0] 8924 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 3421 |@@@@@@@@@@@@@@@ |
[2, 4) 1847 |@@@@@@@@ |
[4, 8) 921 |@@@@ |
[8, 16) 412 |@ |
[16, 32) 198 | |
[32, 64) 87 | |
[64, 128) 34 | |
[128, 256) 12 | |
[256, 512) 8 | |
[512, 1K) 5 | |
[1K, 2K) 3 | |
[2K, 4K) 2 | |
[4K, 8K) 1 | |
[8K, 16K) 1 | |
=== Total wait time by lock address ===
@lock_wait_by_addr[0x55a3c8012340]: 892341872
@lock_wait_by_addr[0x55a3c8012380]: 12847291
@lock_wait_by_addr[0x55a3c80123c0]: 847123
Lock address 0x55a3c8012340 accounts for 892ms of total wait time, about 70 times more than the next lock (12.8ms). Every contention event on that lock comes from ConnectionPool::getConnection(). The connection pool has a single global mutex that every worker thread contends on. The fix is usually a striped lock (one mutex per N connections), a lock-free ring buffer, or a per-thread connection pool. You found the exact lock, the exact function, and the exact contention pattern. In production. In 30 seconds.
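To rank locks by their share of total wait time, the @lock_wait_by_addr lines can be post-processed. This is a rough Python sketch, assuming the map is printed one entry per line as above; wait_share is a hypothetical helper name, and the regex accepts decimal keys as well as hex.

```python
import re

# One map entry per line, e.g. "@lock_wait_by_addr[0x55a3c8012340]: 892341872".
LINE = re.compile(r"@lock_wait_by_addr\[(0x[0-9a-fA-F]+|\d+)\]:\s*(\d+)")


def wait_share(output):
    """Map each lock address to (wait_ms, fraction_of_total_wait)."""
    totals = {addr: int(ns) for addr, ns in LINE.findall(output)}
    grand_total = sum(totals.values())
    return {addr: (ns / 1e6, ns / grand_total) for addr, ns in totals.items()}


sample = """\
@lock_wait_by_addr[0x55a3c8012340]: 892341872
@lock_wait_by_addr[0x55a3c8012380]: 12847291
@lock_wait_by_addr[0x55a3c80123c0]: 847123
"""
shares = wait_share(sample)
for addr, (ms, frac) in sorted(shares.items(), key=lambda kv: -kv[1][1]):
    print(f"{addr}: {ms:.1f} ms ({frac:.1%} of total wait)")
```

On the sample output, the first lock holds over 98% of all recorded wait time, which is the signal to look at before tuning anything else.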
Kernel Futex Tracing
For deeper analysis, trace the kernel-side futex operations directly. This catches all synchronization primitives that use futexes (mutexes, condition variables, semaphores, rwlocks):
# bpftrace -e '
tracepoint:syscalls:sys_enter_futex /pid == 4521/ {
@futex_ops[args->op & 0xf] = count();
if ((args->op & 0xf) == 0 || (args->op & 0xf) == 9) { // FUTEX_WAIT or FUTEX_WAIT_BITSET
@wait_start[tid] = nsecs;
@wait_stack[tid] = ustack;
}
}
tracepoint:syscalls:sys_exit_futex /pid == 4521 && @wait_start[tid] != 0/ {
$dur_us = (nsecs - @wait_start[tid]) / 1000;
@futex_wait_us = hist($dur_us);
if ($dur_us > 5000) {
printf("LONG FUTEX WAIT: %dus\n", $dur_us);
print(@wait_stack[tid]);
}
delete(@wait_start[tid]);
delete(@wait_stack[tid]);
}
END {
printf("\nFutex operation counts (0=WAIT, 1=WAKE, ...):\n");
print(@futex_ops);
}'
Lock Contention Flame Graph
Generate a flame graph where the width represents lock wait time instead of CPU time:
# bpftrace -e '
uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /pid == 4521/ {
@lock_start[tid] = nsecs;
}
uretprobe:/lib64/libpthread.so.0:pthread_mutex_lock
/pid == 4521 && @lock_start[tid] != 0/ {
$dur = (nsecs - @lock_start[tid]) / 1000;
@lock_wait[ustack] = sum($dur);
delete(@lock_start[tid]);
}
END { clear(@lock_start); }
' > /tmp/lock-raw.txt
# /opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/lock-raw.txt > /tmp/lock-folded.txt
# /opt/FlameGraph/flamegraph.pl --title "Lock Contention" \
--countname "microseconds" \
--colors aqua \
/tmp/lock-folded.txt > /tmp/lock-contention.svg
Real-World Examples
Finding Latency Spikes in PostgreSQL
A production PostgreSQL server has intermittent query latency spikes. P99 goes from 5ms to 800ms for no apparent reason. pg_stat_statements shows the slow queries are simple index lookups that should be fast. Here's how to find the root cause:
# Step 1: Confirm it's off-CPU time, not on-CPU
# bpftrace -e 'profile:hz:99 /comm == "postgres"/ { @[ustack] = count(); } interval:s:30 { exit(); }'
# Result: most CPU time is in ReadBuffer and index scan functions.
# Nothing unusual. CPU profile looks normal.
# Step 2: Check off-CPU time
# bpftrace -e '
tracepoint:sched:sched_switch /comm == "postgres" && args->prev_state != 0/ {
@start[args->prev_pid] = nsecs;
@stack[args->prev_pid] = kstack;
}
// At switch-in, comm still names the outgoing task, so match on next_pid only.
tracepoint:sched:sched_switch /@start[args->next_pid]/ {
$dur_ms = (nsecs - @start[args->next_pid]) / 1000000;
if ($dur_ms > 50) {
printf("OFF-CPU %dms:\n", $dur_ms);
print(@stack[args->next_pid]);
}
delete(@start[args->next_pid]);
delete(@stack[args->next_pid]);
}
interval:s:120 { exit(); }
END { clear(@start); clear(@stack); }
'
OFF-CPU 412ms:
schedule+46
io_schedule+46
zio_wait+214
dmu_buf_hold_array_by_dnode+215
dmu_read_impl+76
dmu_read+58
zfs_read+358
zpl_read_common_iovec+162
zpl_iter_read+116
vfs_read+451
ksys_read+95
do_syscall_64+91
entry_SYSCALL_64_after_hwframe+118
OFF-CPU 237ms:
schedule+46
io_schedule+46
zio_wait+214
dmu_tx_assign+327
zfs_write+643
zpl_write_common_iovec+162
zpl_iter_write+116
vfs_write+451
ksys_write+95
do_syscall_64+91
entry_SYSCALL_64_after_hwframe+118
Found it. PostgreSQL backends are blocking for 200-400ms in zio_wait inside both zfs_read and zfs_write. The ZFS I/O pipeline is stalling. This typically means one of three things: (1) the ARC (adaptive replacement cache) is too small, causing cache misses that hit disk, (2) a ZFS scrub or resilver is running and consuming I/O bandwidth, or (3) the underlying disk is failing and retrying I/O operations. Check with zpool status for scrub activity and zpool iostat -v 1 for per-disk latency. The eBPF trace took 2 minutes to find what might have taken days of guess-and-check.
Off-CPU Flame Graph Reveals ZFS txg_sync Blocking Writes
An application writes heavily to ZFS and experiences periodic 1-2 second stalls every 5-10 seconds. The stalls correlate with ZFS transaction group (TXG) syncs:
# Step 1: Generate off-CPU flame graph for the application
# bpftrace -e '
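tracepoint:sched:sched_switch /args->prev_pid == 9182 && args->prev_state != 0/ {
@start[args->prev_pid] = nsecs;
@stack[args->prev_pid] = kstack; // stack at the moment the thread blocks
}
tracepoint:sched:sched_switch /args->next_pid == 9182 && @start[args->next_pid]/ {
$dur = (nsecs - @start[args->next_pid]) / 1000;
@blocked[@stack[args->next_pid]] = sum($dur);
delete(@start[args->next_pid]);
delete(@stack[args->next_pid]);
}
interval:s:60 { exit(); }
END { clear(@start); clear(@stack); }
' > /tmp/txg-offcpu.txt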
# Step 2: Generate flame graph
# /opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/txg-offcpu.txt > /tmp/txg-folded.txt
# /opt/FlameGraph/flamegraph.pl --title "App Off-CPU (ZFS txg_sync visible)" \
--countname "microseconds" --colors io \
/tmp/txg-folded.txt > /tmp/txg-offcpu.svg
# The flame graph shows a massive wide bar through this path:
# io_schedule → zio_wait → dmu_tx_assign → zfs_write
# Width: ~60% of total off-CPU time
# This means the application is blocked on dmu_tx_assign waiting for
# the current TXG to have space for new writes.
# Step 3: Confirm TXG sync timing
# bpftrace -e '
// spa_sync() runs once per TXG; txg_sync_thread() is the long-lived loop that calls it.
kprobe:spa_sync {
printf("TXG sync started: %s\n", strftime("%H:%M:%S", nsecs));
}
kretprobe:spa_sync {
printf("TXG sync ended: %s\n", strftime("%H:%M:%S", nsecs));
}
interval:s:30 { exit(); }
'
TXG sync started: 09:45:03
TXG sync ended: 09:45:04
TXG sync started: 09:45:08
TXG sync ended: 09:45:09
TXG sync started: 09:45:13
TXG sync ended: 09:45:15
The fix: tune ZFS to sync TXGs more frequently with shorter durations, or, if the writes are synchronous, add a SLOG (a separate log device for the ZFS Intent Log) so they are acknowledged without waiting for the TXG commit:
# Reduce TXG sync interval from default 5s to 1s (a module parameter, not a dataset property):
# echo 1 > /sys/module/zfs/parameters/zfs_txg_timeout
# Or add a fast SLOG device:
# zpool add tank log /dev/nvme1n1
Wakeup Chain: Slow DNS Causing HTTP Timeouts
An HTTP service has intermittent 2-3 second response times. CPU usage is low. Disk I/O is fine. Network bandwidth is fine. The problem is invisible to traditional monitoring:
# Step 1: Off-CPU analysis on the HTTP worker threads
# bpftrace -e '
tracepoint:sched:sched_switch /comm == "http-worker" && args->prev_state != 0/ {
@start[args->prev_pid] = nsecs;
@offstack[args->prev_pid] = kstack;
}
// At switch-in, comm still names the outgoing task, so match on next_pid only.
tracepoint:sched:sched_switch /@start[args->next_pid]/ {
$dur_ms = (nsecs - @start[args->next_pid]) / 1000000;
if ($dur_ms > 500) {
printf("OFF-CPU %dms:\n", $dur_ms);
print(@offstack[args->next_pid]);
}
delete(@start[args->next_pid]);
delete(@offstack[args->next_pid]);
}
END { clear(@start); clear(@offstack); }
'
OFF-CPU 2341ms:
schedule+46
schedule_hrtimeout_range_clock+256
do_poll.constprop.0+567
do_sys_poll+567
__x64_sys_poll+174
do_syscall_64+91
entry_SYSCALL_64_after_hwframe+118
# The kernel stack just shows poll() — we need the userspace stack.
# Step 2: Capture userspace stacks for the same event
# bpftrace -e '
tracepoint:sched:sched_switch /comm == "http-worker" && args->prev_state != 0/ {
@start[args->prev_pid] = nsecs;
@offustack[args->prev_pid] = ustack;
}
// At switch-in, comm still names the outgoing task, so match on next_pid only.
tracepoint:sched:sched_switch /@start[args->next_pid]/ {
$dur_ms = (nsecs - @start[args->next_pid]) / 1000000;
if ($dur_ms > 500) {
printf("OFF-CPU %dms:\n", $dur_ms);
print(@offustack[args->next_pid]);
}
delete(@start[args->next_pid]);
delete(@offustack[args->next_pid]);
}
END { clear(@start); clear(@offustack); }
'
OFF-CPU 2107ms:
__GI___poll+45
__res_context_send+1842
__res_context_query+412
__res_context_search+253
gaih_inet.constprop.0+2641
getaddrinfo+341
resolve_upstream+87
proxy_handler+245
handle_request+152
worker_main+401
There it is. getaddrinfo → __res_context_search → __res_context_query → __res_context_send → poll. The HTTP worker is doing a DNS lookup via getaddrinfo() for every request to resolve the upstream backend hostname, and the DNS server is taking 2+ seconds to respond. The fix is one of: (1) use /etc/hosts or a local DNS cache (systemd-resolved, dnsmasq), (2) cache the DNS result in the application, or (3) use IP addresses instead of hostnames for upstream backends. On-CPU profiling would have shown nothing. Standard network monitoring would have shown nothing (the DNS packets are tiny). Only off-CPU tracing with userspace stacks found this.
Memory Leak in a Long-Running Daemon
A daemon's RSS grows by ~50MB per hour. After 2 weeks, the OOM killer terminates it. No crashes, no errors in the logs. Just steady growth:
# Step 1: Confirm the growth rate
# while true; do
ps -o pid,rss,comm -p 6712
sleep 60
done
PID RSS COMMAND
6712 842316 mydaemon
PID RSS COMMAND
6712 843128 mydaemon
PID RSS COMMAND
6712 843940 mydaemon
# ~800KB/minute growth. That's 48MB/hour, confirming the report.
# Step 2: Use memleak to find the leaking allocation
# /usr/share/bcc/tools/memleak -p 6712 -o 10000 10
# -o 10000 = only show allocations outstanding for 10+ seconds
# 10 = print every 10 seconds
[09:50:00] Top 10 stacks with outstanding allocations:
83886080 bytes in 20480 allocations from stack
malloc+0x3e [libc.so.6]
json_loads+0x48 [libjansson.so.4]
parse_event+0x67 [mydaemon]
event_loop+0x1a3 [mydaemon]
main+0x2b5 [mydaemon]
4194304 bytes in 1 allocations from stack
malloc+0x3e [libc.so.6]
init_buffer_pool+0x28 [mydaemon]
main+0x55 [mydaemon]
[09:50:10] Top 10 stacks with outstanding allocations:
88080384 bytes in 21504 allocations from stack
malloc+0x3e [libc.so.6]
json_loads+0x48 [libjansson.so.4]
parse_event+0x67 [mydaemon]
event_loop+0x1a3 [mydaemon]
main+0x2b5 [mydaemon]
parse_event calls json_loads to parse incoming JSON events. json_loads allocates memory for the parsed JSON object. In 10 seconds, 1,024 new allocations appeared and none were freed. The parse_event function is parsing the JSON but never calling json_decref() to free the parsed object when it's done with it. The fix is a one-line call to json_decref(root) at the end of parse_event(). Twenty minutes of eBPF tracing found a bug that had been leaking memory for months.
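The growth-rate arithmetic from Step 1 can be automated so that it is not done by eye. This is a small Python sketch, assuming evenly spaced RSS readings in KiB as printed by ps -o rss; rss_growth_mb_per_hour is a hypothetical helper name.

```python
def rss_growth_mb_per_hour(samples_kb, interval_s):
    """Average RSS growth in MiB/hour from evenly spaced `ps -o rss` readings (KiB)."""
    if len(samples_kb) < 2:
        raise ValueError("need at least two samples")
    elapsed_s = interval_s * (len(samples_kb) - 1)
    growth_kb = samples_kb[-1] - samples_kb[0]
    return growth_kb / 1024 * 3600 / elapsed_s


# The three readings from Step 1, taken 60 seconds apart.
rate = rss_growth_mb_per_hour([842316, 843128, 843940], interval_s=60)
print(f"{rate:.1f} MiB/hour")
```

Three samples are enough to confirm a trend, but a longer window smooths out allocator noise before you commit to a memleak session.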
Common Pitfalls
eBPF profiling and stack trace analysis can go wrong in subtle ways. These are the most common problems and how to fix them.
Missing Frame Pointers
The number one cause of broken userspace stack traces. Many distros compile with -fomit-frame-pointer to free up the RBP register. Without frame pointers, the stack unwinder can't walk the call chain, and you get truncated stacks with [unknown] frames. Fix: recompile with -fno-omit-frame-pointer, or use a distro that enables frame pointers by default (Fedora 38+). bpftrace's ustack walks frame pointers only, so for third-party software you can't rebuild, install the debug symbols and fall back to perf record --call-graph dwarf, which unwinds using DWARF instead.
Missing Debug Symbols
Without debug symbols, stack traces show hex addresses instead of function names: 0x55a3c7f21a30 instead of PostgresMain+1520. The stack trace is technically correct but completely useless. Fix: install the -debuginfo (RHEL) or -dbgsym (Debian) package for every binary in the stack. Don't forget glibc — it's in every stack.
JIT-Compiled Code (Java, Node.js, Python)
JIT compilers generate machine code at runtime. The eBPF stack unwinder doesn't know what functions that code belongs to; it just sees anonymous memory regions marked [unknown]. Fix: Java needs -XX:+PreserveFramePointer plus perf-map-agent to write /tmp/perf-PID.map. Node.js needs --perf-basic-prof. CPython is an interpreter rather than a JIT, so its native stacks show interpreter internals instead of Python functions: use py-spy, or on Python 3.12+ start it with -X perf to emit a perf map. Each runtime has its own mechanism for exporting symbol maps.
Stripped Binaries
Containers often ship minimal images with stripped binaries (no symbol table at all). Even readelf -s shows nothing. Fix: install the unstripped binary alongside it, or use eu-unstrip to merge a debug symbol file with the stripped binary. For container debugging, mount the debug symbols from the host into the container's /usr/lib/debug/ path.
Kernel ORC vs Frame Pointer Confusion
Kernel stacks almost always work because modern kernels use the ORC unwinder. But if you're running an older kernel (pre-4.14) or a custom kernel with ORC disabled, kernel stacks will be broken just like userspace stacks without frame pointers. Check: grep CONFIG_UNWINDER_ORC /boot/config-$(uname -r). If it's not set, you need CONFIG_UNWINDER_FRAME_POINTER=y and a kernel recompile.
Stack Depth Truncation
The default maximum stack depth in bpftrace is 127 frames. If your application has deeper call chains (some enterprise Java apps routinely exceed 200 frames), the stack gets truncated at the bottom, and you lose the entry point. Fix: increase the depth with kstack(perf, 256) or ustack(256). The kernel also caps stack depth at the kernel.perf_event_max_stack sysctl (default 127), so raise that as well. Be aware that deeper stacks use more map memory.
Inlined Functions
Compiler optimizations inline small functions into their callers. The inlined function disappears from the stack trace. If you're looking for parse_header() but the compiler inlined it into handle_request(), you'll only see handle_request() in the stack. Fix: compile with -fno-inline for debugging (not production), or use DWARF info which tracks inlining decisions and can show inline frames.
High Frequency Sampling Overhead
Sampling at 999 Hz instead of 99 Hz gives you 10x more samples but also 10x more overhead. On a busy system with 128 CPUs, 999 Hz means 127,872 interrupts per second. That's measurable overhead. Rule of thumb: 49 Hz for initial exploration, 99 Hz for detailed profiling, never above 999 Hz. For off-CPU tracing, the overhead comes from probe frequency, not a timer — heavily threaded workloads can generate millions of sched_switch events per second.
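The interrupt arithmetic is worth running for your own machine before picking a rate. This is a minimal Python sketch; the function name is hypothetical, and it assumes the profiler fires on every CPU at the requested frequency.

```python
def sampling_interrupts_per_second(cpus, hz):
    """Timer-driven profiling fires once per CPU per tick: cpus * hz."""
    return cpus * hz


# Compare the three rates from the rule of thumb on a 128-CPU machine.
rates = {hz: sampling_interrupts_per_second(128, hz) for hz in (49, 99, 999)}
for hz, rate in rates.items():
    print(f"{hz:>4} Hz on 128 CPUs: {rate:,} samples/s")
```

Each sample also walks and stores a stack, so the per-sample cost grows with stack depth as well as with the rate.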
Quick Diagnostic Checklist
When your stack traces don't look right, run through this checklist:
# 1. Do you have debug symbols?
# readelf -S /usr/bin/postgres | grep debug
[28] .debug_info PROGBITS ...
[29] .debug_abbrev PROGBITS ...
[30] .debug_line PROGBITS ...
# If no .debug sections, install debuginfo/dbgsym packages.
# 2. Does the binary have frame pointers?
# objdump -d /usr/bin/postgres | grep -c "push.*rbp"
# High count = frame pointers likely present. Near zero = omitted.
# Also check for DWARF unwind info:
# readelf -wf /usr/bin/postgres 2>/dev/null | head -5
# Output here means perf and gdb can unwind without frame pointers;
# it does not help bpftrace's ustack, which needs frame pointers.
# 3. Does the kernel have BTF?
# ls /sys/kernel/btf/vmlinux
# If missing, bpftrace can't access kernel struct fields.
# 4. Does the kernel have ORC?
# grep CONFIG_UNWINDER_ORC=y /boot/config-$(uname -r)
# If missing, kernel stacks may be broken.
# 5. Is the process a JIT runtime?
# If Java: check for /tmp/perf-PID.map
# If Node.js: was it started with --perf-basic-prof?
# If neither exists, JIT frames will show as [unknown].
# 6. Is bpftrace running as root?
# bpftrace needs root (or, on builds that support it, CAP_BPF + CAP_PERFMON).
# Without it, bpftrace exits with an error instead of tracing.
The most frustrating debugging experience is debugging your debugger. You spend an hour setting up eBPF profiling, generate a flame graph, and it's full of [unknown] frames and truncated stacks. Run the checklist first. Install debug symbols first. Verify frame pointers first. Then profile. Five minutes of setup saves an hour of staring at garbage data. The tools are only as good as the metadata you give them.
Putting It All Together
The complete profiling workflow for any performance investigation:
Step 1: On-CPU Profile
Start with profile:hz:99 to see where CPU time goes. If the application is CPU-bound and the hot path is in application code, you've found the bottleneck. Generate a flame graph. Done.
Step 2: Off-CPU Analysis
If CPU usage is low but the application is slow, the problem is off-CPU. Trace sched_switch to find where threads sleep. Generate an off-CPU flame graph. Look for wide bars in I/O, lock, or network paths.
Step 3: Wakeup Chains
If off-CPU analysis shows threads sleeping but you can't tell why, trace sched_wakeup to find who wakes them and what the waker was doing. Follow the chain until you find the root cause.
Step 4: Lock Contention
If multiple threads are sleeping on futexes, trace pthread_mutex_lock to identify the hot lock. Generate a lock contention flame graph to see which code paths contend most.
Step 5: Memory Analysis
If the process is growing in RSS, use memleak to trace allocations. Compare alloc vs free counts. The allocation stack with the highest outstanding byte count is your leak.
Step 6: Differential Flame Graphs
If performance changed after a deployment, generate before/after flame graphs and diff them. Red bars are regressions. Blue bars are improvements. No guessing.
Step 7: Core Dump Capture
If none of the above explains an intermittent crash or corruption, set up eBPF to detect the anomaly and trigger a core dump. Analyze the frozen process state with gdb and coredumpctl.
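The before/after comparison in Step 6 comes down to subtracting two folded-stack files. This is a rough Python sketch of that subtraction, assuming the "frame;frame;frame count" folded format produced by the stackcollapse scripts; the function names and the sample stacks are hypothetical, and the FlameGraph repository's own difffolded.pl is the tool to use for real graphs.

```python
from collections import Counter


def parse_folded(text):
    """Parse folded stacks ("frame;frame;frame count" per line) into a Counter."""
    stacks = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        stacks[stack] += int(count)
    return stacks


def diff_folded(before, after):
    """Return {stack: after - before} for every stack whose total changed."""
    delta = {}
    for stack in set(before) | set(after):
        change = after[stack] - before[stack]
        if change:
            delta[stack] = change
    return delta


# Hypothetical profiles taken before and after a deployment.
before = parse_folded("main;handle_request;parse_header 40\nmain;handle_request;send_reply 60\n")
after = parse_folded("main;handle_request;parse_header 90\nmain;handle_request;send_reply 55\n")
changes = diff_folded(before, after)
for stack, change in sorted(changes.items(), key=lambda kv: -kv[1]):
    print(f"{change:+d} {stack}")
```

Positive deltas correspond to the red (regression) bars and negative deltas to the blue (improvement) bars in a differential flame graph.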
Every one of these steps works in production. No restarts. No Valgrind. No strace (which can slow syscall-heavy workloads by 50-100x). No gdb attach (which stops the process). eBPF captures the data you need at low overhead and lets the process keep running. This is how you debug systems that matter: the ones you can't afford to restart.