eBPF Core Dumps & Stack Traces

Performance problems don't announce themselves. A process burns CPU in a tight loop, or it sleeps on a lock nobody knows about, or it leaks memory so slowly that OOM doesn't hit for three weeks. The kernel sees all of it. eBPF lets you ask the kernel what's happening — right now — without stopping the process, without attaching a debugger, without restarting anything. Stack traces, flame graphs, off-CPU analysis, wakeup chains, memory leak detection, core dump capture. This page covers all of it.

The mental model: every thread on your system is either running on a CPU or sleeping somewhere. On-CPU profiling tells you where it's running. Off-CPU analysis tells you why it's sleeping. Stack traces connect both to actual code paths. Flame graphs make the data visual. Core dumps capture the moment of failure. Together, they give you complete visibility into any process on any Linux machine — live, in production, with near-zero overhead.

Stack Traces in eBPF

A stack trace is the chain of function calls that led to the current point of execution. When your PostgreSQL process is burning 100% CPU, the stack trace tells you which function, called by which function, called by which function. Without a stack trace, you know the process is hot. With a stack trace, you know why.

kstack() and ustack() — the two built-ins

bpftrace provides two functions for capturing stack traces. kstack() captures the kernel stack — the chain of kernel functions that are executing on behalf of a thread. ustack() captures the userspace stack — the chain of application functions. They are different stacks. A thread doing a read() syscall has a userspace stack (your application code calling read()) and a kernel stack (the kernel's VFS layer, filesystem driver, block layer, and device driver). You often want both.

kstack = what the kernel is doing for you. ustack = what your code is doing to itself.

Print kernel and userspace stack traces whenever a process calls read():

# bpftrace -e 'tracepoint:syscalls:sys_enter_read /comm == "postgres"/ {
    printf("--- kernel stack ---\n");
    print(kstack());
    printf("--- user stack ---\n");
    print(ustack());
}'

Attaching 1 probe...
--- kernel stack ---

        ksys_read+95
        do_syscall_64+91
        entry_SYSCALL_64_after_hwframe+118

--- user stack ---

        __GI___libc_read+18
        pq_recvbuf+115
        pq_getbyte+30
        SocketBackend+99
        ReadCommand+90
        PostgresMain+1520
        ServerLoop+982
        PostmasterMain+5893
        main+753

That userspace stack trace tells a complete story: main called PostmasterMain, which entered ServerLoop, which called PostgresMain (the per-backend main loop), which called ReadCommand to wait for the next SQL query, which went down through SocketBackend into pq_getbyte into pq_recvbuf into libc read(). You are looking at a PostgreSQL backend process waiting for its client to send the next query. Zero guesswork.

How Stack Unwinding Works

Capturing a stack trace means unwinding the call stack — walking backwards from the current frame to find each caller. There are three methods, and which one works depends on how the code was compiled:

Frame Pointer Walking

The oldest and simplest method. Each function prologue saves the previous frame pointer (RBP on x86_64) on the stack. The unwinder follows the chain: current RBP → previous RBP → previous previous RBP. Fast (just pointer chasing), but requires code compiled with -fno-omit-frame-pointer. Most distros omit frame pointers for performance (saves one register), which breaks this method. Fedora 38+ re-enabled frame pointers system-wide. CentOS/RHEL 9 did not.

DWARF Unwinding

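Uses the .eh_frame section in ELF binaries (or .debug_frame from debuginfo), which contains unwinding rules for every instruction address. Works even without frame pointers, but it is more expensive because the unwinder has to interpret DWARF call-frame tables, and that is hard to do within the eBPF verifier's limits. As a result, bpftrace's ustack() generally relies on the kernel's frame-pointer walker rather than DWARF, so userspace code built without frame pointers tends to produce truncated or bogus stacks. On distros that strip frame pointers, the usual fallbacks are perf record --call-graph dwarf, which copies raw stack memory and unwinds it in userspace afterwards, or rebuilding the hot binaries with -fno-omit-frame-pointer.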

ORC Unwinder (Kernel Only)

The Linux kernel uses its own unwinder called ORC (Oops Rewind Capability). ORC is a simplified alternative to DWARF designed specifically for kernel stacks. It's faster than DWARF unwinding and works without frame pointers. At build time, objtool analyzes the compiled object code and emits compact ORC tables, so the kernel never has to carry or parse DWARF at runtime. On x86_64 kernels built with CONFIG_UNWINDER_ORC (available since 4.14), kstack() gets ORC-unwound stacks automatically. You don't need to do anything.

Check whether your kernel uses ORC or frame pointers:

# grep -c CONFIG_UNWINDER_ORC=y /boot/config-$(uname -r)
1

# If that prints 1, you have ORC. If 0, check for frame pointers:
# grep -c CONFIG_UNWINDER_FRAME_POINTER=y /boot/config-$(uname -r)
0
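
The two checks above can be folded into one helper. This is a minimal sketch: the function name kernel_unwinder is made up for illustration, and it assumes the distro ships the kernel config as a plain-text file such as /boot/config-$(uname -r) (some kernels only expose /proc/config.gz).

```shell
# kernel_unwinder CONFIG_FILE
# Hypothetical helper: reports which stack unwinder a kernel config selects.
kernel_unwinder() {
    config_file=$1
    if grep -q '^CONFIG_UNWINDER_ORC=y' "$config_file"; then
        echo "orc"
    elif grep -q '^CONFIG_UNWINDER_FRAME_POINTER=y' "$config_file"; then
        echo "frame-pointer"
    else
        echo "unknown"
    fi
}

# Example: inspect the running kernel's config, if the file is present.
if [ -r "/boot/config-$(uname -r)" ]; then
    kernel_unwinder "/boot/config-$(uname -r)"
fi
```

An "unknown" result usually just means the config file uses a different layout or the architecture has no ORC support, not that stack traces are unavailable.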

Stack Depth Limits and Stack ID Maps

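eBPF programs have a maximum stack depth they can capture. The default in bpftrace is 127 frames, which is plenty for most workloads. If you're profiling deeply recursive code (some XML parsers, tree-walking interpreters), you might hit the limit. You can request more per call, though the kernel caps stack maps at kernel.perf_event_max_stack (also 127 by default), so going beyond that may require raising the sysctl first: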

# bpftrace -e 'profile:hz:99 { @stacks[kstack(perf, 256)] = count(); }'

Internally, eBPF uses stack ID maps (BPF_MAP_TYPE_STACK_TRACE) to store captured stacks efficiently. Each unique stack trace gets a numeric ID. The map stores the actual instruction pointers. This means if 10,000 samples hit the same call chain, bpftrace stores the stack once and increments a counter — not 10,000 copies of the same stack. This is why eBPF profiling is memory-efficient even at high sample rates.

On-CPU Profiling

On-CPU profiling answers the question: where is this process spending CPU time? The technique is sampling — at regular intervals, interrupt the CPU and record what function is currently executing. Do this thousands of times, aggregate the results, and you get a statistical picture of where CPU time goes.

Why 99 Hz, not 100 Hz?

If you sample at exactly 100 Hz and some workload has a 10ms timer (also 100 Hz), every sample lands at the same point in the timer cycle. You'd see a wildly distorted profile. Sampling at 99 Hz avoids this lockstep artifact — the sampling frequency drifts relative to any power-of-10 timer, giving you uniform coverage across the entire execution. This is standard practice from Brendan Gregg's perf methodology. 49 Hz or 199 Hz also work. Never use round numbers.

Same reason you don't take photos at exactly the same frame rate as a helicopter blade: aliasing.
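
The lockstep effect can be checked with a little arithmetic. The sketch below is illustrative only: the function name distinct_phases is invented, sample times are idealized (no jitter), and it counts how many different points within a 10 ms timer period a sampler lands on.

```shell
# distinct_phases FREQ_HZ TIMER_PERIOD_US SAMPLES
# Hypothetical helper: counts the distinct offsets (phases) within the timer
# period that an ideal sampler running at FREQ_HZ would hit.
distinct_phases() {
    awk -v f="$1" -v t="$2" -v n="$3" 'BEGIN {
        for (k = 0; k < n; k++) {
            ts = int(k * 1000000 / f)   # time of sample k, in microseconds
            seen[ts % t] = 1            # offset inside the timer period
        }
        c = 0
        for (p in seen) c++
        print c
    }'
}

distinct_phases 100 10000 99   # 100 Hz sampler vs a 10 ms timer: prints 1
distinct_phases 99 10000 99    # 99 Hz sampler vs a 10 ms timer: prints 99
```

At 100 Hz every sample falls on the same offset of the timer cycle. At 99 Hz each sample lands about 101 microseconds later in the cycle than the previous one, so 99 samples cover 99 different offsets.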

Sample kernel stacks at 99 Hz across all CPUs for 30 seconds:

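# bpftrace has no duration flag (-d is a debug option), so an interval probe ends the run:
# bpftrace -e 'profile:hz:99 { @[kstack] = count(); }
    interval:s:30 { exit(); }'

Attaching 2 probes...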

@[
    native_queued_spin_lock_slowpath+161
    _raw_spin_lock+30
    zfs_znode_alloc+181
    zfs_zget+335
    zfs_dirent_lock+451
    zfs_lookup+306
    zpl_lookup+94
    __lookup_slow+130
    walk_component+451
    path_lookupat+120
    filename_lookup+179
    vfs_statx+142
    do_statx+57
    __x64_sys_statx+45
    do_syscall_64+91
    entry_SYSCALL_64_after_hwframe+118
]: 847

@[
    finish_task_switch.isra.0+303
    __schedule+785
    schedule+46
    schedule_hrtimeout_range_clock+256
    do_select+867
    core_sys_select+452
    __x64_sys_select+181
    do_syscall_64+91
    entry_SYSCALL_64_after_hwframe+118
]: 2391

That output just told you two things about your system. First: the most frequent kernel stack trace (2,391 samples) sits in the scheduler underneath select(). A CPU profile only samples running code, so these are threads caught switching on or off the CPU around a select() sleep: mostly idle threads waking up and going back to sleep, harmless. Second: 847 samples hit zfs_znode_alloc spinning on a lock during directory lookups. If your system feels sluggish, that spin lock contention in ZFS is a real lead. You found it in 30 seconds with a one-liner.

Profile a specific process by PID, capturing both kernel and userspace stacks:

# bpftrace -e 'profile:hz:99 /pid == 4521/ {
    @[kstack, ustack] = count();
}'

The workflow is always the same: sample → aggregate → fold → visualize. Sampling gives you raw stack traces with counts. Aggregation collapses identical stacks. Folding converts the data into a format that flame graph tools understand. Visualization turns it into an interactive SVG. The next section covers the full pipeline.

Flame Graphs

A flame graph is the single most useful visualization in performance analysis. It takes thousands of stack trace samples and turns them into one interactive SVG where you can instantly see which functions consume the most CPU time. Invented by Brendan Gregg in 2011, flame graphs have become the standard way to communicate profiling results.

How to Read a Flame Graph

The x-axis is alphabetically sorted, not time — this is the most common misunderstanding. Left-to-right order means nothing. The width of each box is proportional to the number of samples where that function appeared in the stack. Wider = more CPU time. The y-axis is stack depth — bottom is the entry point (e.g., main), top is the leaf function actually running on the CPU. A wide plateau at the top means one function is consuming a lot of CPU. A wide bar at the bottom with many narrow towers above means many different code paths go through that function.

Think of it as a city skyline. The tallest buildings are the deepest call stacks. The widest buildings are where the most people work.

The Full Pipeline: bpftrace to SVG

Step 1 — install Brendan Gregg's FlameGraph tools:

# git clone https://github.com/brendangregg/FlameGraph /opt/FlameGraph

Step 2 — capture stack traces with bpftrace. Save the raw output:

# bpftrace -e 'profile:hz:99 /pid == 4521/ {
    @[ustack] = count();
}' > /tmp/raw-stacks.txt

# Let it run for 30-60 seconds, then Ctrl+C

Step 3 — convert bpftrace output to folded stack format. The folded format has one stack per line, with frames separated by semicolons, followed by a space and the count:

# /opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/raw-stacks.txt > /tmp/folded.txt

# What folded format looks like:
# cat /tmp/folded.txt
main;PostgresMain;exec_simple_query;PortalRun;ExecutorRun;ExecScan;heapgettup_pagemode 312
main;PostgresMain;exec_simple_query;PortalRun;ExecutorRun;ExecScan;heapgettup 198
main;PostgresMain;ReadCommand;SocketBackend;pq_getbyte;pq_recvbuf;__GI___libc_read 1547
main;PostgresMain;exec_simple_query;PortalRun;ExecutorRun;ExecSort;tuplesort_performsort 89

Step 4 — generate the flame graph SVG:

# /opt/FlameGraph/flamegraph.pl /tmp/folded.txt > /tmp/postgres-cpu.svg

Step 5 — open it in a browser. The SVG is interactive — hover over any box to see the function name and sample count, click to zoom into a subtree.

# Open locally:
# firefox /tmp/postgres-cpu.svg

# Or SCP to your workstation:
# scp root@server:/tmp/postgres-cpu.svg .
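
Before opening the SVG, the folded file from Step 3 can be sanity-checked on the command line. The sketch below computes the share of samples in which a given function appears anywhere in the stack, which is the quantity a flame graph draws as box width. The function name frame_share is hypothetical, and the input is the four-line folded sample shown in Step 3.

```shell
# frame_share FRAME < folded-stacks
# Hypothetical helper: percentage of samples whose stack contains FRAME.
frame_share() {
    LC_ALL=C awk -v want="$1" '
        {
            count = $NF                      # last field is the sample count
            stack = $0
            sub(/ [0-9]+$/, "", stack)       # drop the trailing count
            total += count
            n = split(stack, frames, ";")
            for (i = 1; i <= n; i++) {
                # Exact match per frame; count each stack at most once.
                if (frames[i] == want) { hits += count; break }
            }
        }
        END { if (total > 0) printf "%.1f\n", 100 * hits / total }
    '
}

frame_share ExecutorRun <<'EOF'
main;PostgresMain;exec_simple_query;PortalRun;ExecutorRun;ExecScan;heapgettup_pagemode 312
main;PostgresMain;exec_simple_query;PortalRun;ExecutorRun;ExecScan;heapgettup 198
main;PostgresMain;ReadCommand;SocketBackend;pq_getbyte;pq_recvbuf;__GI___libc_read 1547
main;PostgresMain;exec_simple_query;PortalRun;ExecutorRun;ExecSort;tuplesort_performsort 89
EOF
```

For this sample, ExecutorRun appears in 599 of 2,146 samples, so the command prints 27.9. Frames are compared whole, so heapgettup does not also match heapgettup_pagemode.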

Kernel Flame Graph

Same pipeline, but capture kernel stacks with kstack:

# bpftrace -e 'profile:hz:99 { @[kstack] = count(); }' > /tmp/kstacks.txt
# Run for 30 seconds, Ctrl+C

# Fold and generate:
# /opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/kstacks.txt > /tmp/kfolded.txt
# /opt/FlameGraph/flamegraph.pl --title "Kernel CPU Flame Graph" \
    --colors java /tmp/kfolded.txt > /tmp/kernel-cpu.svg

Combined Kernel + Userspace Flame Graph

The most powerful variant — shows the full call chain from userspace through the syscall boundary into the kernel:

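# Key order matters: stackcollapse-bpftrace.pl reverses the printed frames as one
# list, so kstack goes first to keep user frames underneath the kernel frames.
# bpftrace -e 'profile:hz:99 /pid == 4521/ {
    @[kstack, ustack] = count();
}' > /tmp/combined.txt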

# /opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/combined.txt > /tmp/combined-folded.txt
# /opt/FlameGraph/flamegraph.pl --title "PostgreSQL: User + Kernel CPU" \
    /tmp/combined-folded.txt > /tmp/postgres-combined.svg

A combined flame graph shows you the complete picture. You see your application calling write(), then the kernel's VFS layer, then the ZFS I/O pipeline, then the block device driver. If the hot path is in your application code, you know it's a code problem. If the hot path is in the kernel (say, in ZFS compression code such as zio_compress_data), it points to an I/O or configuration problem. Without the combined view, you're guessing which side of the syscall boundary to blame.

Off-CPU Analysis

On-CPU profiling only tells half the story. If your application is slow but CPU usage is only 5%, the problem isn't where it's running — it's where it's sleeping. Off-CPU analysis traces the scheduler to record stack traces when threads go to sleep, and for how long. This catches everything on-CPU profiling misses: lock contention, disk I/O waits, network latency, sleep() calls, futex waits, pipe reads, and any other blocking operation.

The Two Halves of Performance

Every thread's wall-clock time is split into on-CPU time (running on a processor) and off-CPU time (sleeping, waiting, blocked). On-CPU profiling shows where CPU cycles go. Off-CPU analysis shows where time goes. A request that takes 500ms might spend 2ms on-CPU and 498ms off-CPU waiting for a DNS lookup, a disk read, or a lock. On-CPU profiling would say "nothing interesting, the process barely uses CPU." Off-CPU analysis would say "498ms blocked in getaddrinfo — your DNS server is slow."

On-CPU profiling watches people working. Off-CPU analysis watches people standing in line.

The mechanism: trace the sched_switch tracepoint, which fires every time a thread is switched off the CPU. Record the timestamp and stack trace at switch-off, then compute the duration when the thread comes back on-CPU:

# bpftrace -e '
tracepoint:sched:sched_switch {
    if (args->prev_state == 1 || args->prev_state == 2) {
        @start[args->prev_pid] = nsecs;
        @stack[args->prev_pid] = kstack;
    }
}

tracepoint:sched:sched_switch /
    @start[args->next_pid] != 0/ {
    $dur = nsecs - @start[args->next_pid];
    @offcpu[@stack[args->next_pid]] = sum($dur);
    delete(@start[args->next_pid]);
    delete(@stack[args->next_pid]);
}

END {
    clear(@start);
    clear(@stack);
}'

The output shows the total off-CPU time per unique stack trace, in nanoseconds:

@offcpu[
    schedule+46
    schedule_hrtimeout_range_clock+256
    do_poll.constprop.0+567
    do_sys_poll+567
    __x64_sys_poll+174
    do_syscall_64+91
    entry_SYSCALL_64_after_hwframe+118
]: 18923847261

@offcpu[
    schedule+46
    io_schedule+46
    zio_wait+214
    dmu_buf_hold_array_by_dnode+215
    dmu_read_impl+76
    dmu_read+58
    zfs_read+358
    zpl_read_common_iovec+162
    zpl_iter_read+116
    vfs_read+451
    ksys_read+95
    do_syscall_64+91
    entry_SYSCALL_64_after_hwframe+118
]: 4281563890

Read those two stacks. The first one — 18.9 seconds total in poll() — is threads sleeping on I/O multiplexing, probably idle event loops. Uninteresting. The second one — 4.28 seconds in io_schedule inside zio_wait inside zfs_read — is threads blocked on ZFS reads waiting for disk I/O to complete. If your application feels slow, that 4.28 seconds of ZFS read latency is likely the bottleneck. On-CPU profiling would never have shown this, because the threads aren't running during that time. They're just... waiting.
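
Raw nanosecond totals are hard to compare by eye. The sketch below condenses bpftrace map output of this shape into one line per stack: total seconds, plus the first frame that is not a scheduler entry point. The function name offcpu_summary and the list of scheduler frames to skip are assumptions made for illustration, and the sample stacks are shortened copies of the two above.

```shell
# offcpu_summary < bpftrace-output
# Hypothetical helper: prints "SECONDS BLOCKING_FRAME" per stack, largest first.
offcpu_summary() {
    LC_ALL=C awk '
        BEGIN {
            # Scheduler entry points to skip when naming the blocking frame.
            split("schedule io_schedule schedule_timeout schedule_hrtimeout_range schedule_hrtimeout_range_clock", names, " ")
            for (i in names) skip[names[i]] = 1
        }
        /^@[A-Za-z0-9_]*\[$/ { in_stack = 1; blocker = ""; next }
        in_stack && /^\]: [0-9]+$/ {
            # The value starts after the 3-character "]: " prefix.
            printf "%.2f %s\n", substr($0, 4) / 1000000000, blocker
            in_stack = 0
            next
        }
        in_stack && NF > 0 {
            frame = $1
            sub(/\+[0-9]+$/, "", frame)      # strip the +offset suffix
            if (blocker == "" && !(frame in skip)) blocker = frame
        }
    ' | sort -rn
}

offcpu_summary <<'EOF'
@offcpu[
    schedule+46
    io_schedule+46
    zio_wait+214
    zfs_read+358
]: 4281563890

@offcpu[
    schedule+46
    schedule_hrtimeout_range_clock+256
    do_poll.constprop.0+567
    __x64_sys_poll+174
]: 18923847261
EOF
```

For the sample input this prints 18.92 do_poll.constprop.0 followed by 4.28 zio_wait, matching the reading above: idle polling dominates, and ZFS read waits are the interesting part.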

Filter Off-CPU Analysis by Process

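In production, you usually want to focus on a single process rather than the entire system. Note that prev_pid and next_pid in sched_switch are kernel task IDs (thread IDs), so this filter follows the thread whose ID is 4521. That covers a single-threaded PostgreSQL backend, but a multithreaded program needs one filter per thread: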

# bpftrace -e '
tracepoint:sched:sched_switch /args->prev_pid == 4521/ {
    if (args->prev_state != 0) {
        @start[args->prev_pid] = nsecs;
        @offstack[args->prev_pid] = ustack;
    }
}

tracepoint:sched:sched_switch /
    args->next_pid == 4521 &&
    @start[args->next_pid] != 0/ {
    $dur = nsecs - @start[args->next_pid];
    @blocked[@offstack[args->next_pid]] = sum($dur);
    delete(@start[args->next_pid]);
    delete(@offstack[args->next_pid]);
}

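// bpftrace has no duration flag (-d is a debug option); exit after 60 seconds:
interval:s:60 { exit(); }

END { clear(@start); clear(@offstack); }
'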

Off-CPU Flame Graphs

Off-CPU flame graphs use the same visualization as CPU flame graphs, but the width of each bar represents blocked time instead of CPU time. A wide bar at the bottom means threads spent a lot of wall-clock time blocked in that code path. This is where you find your I/O bottlenecks, lock contention, and synchronization stalls.

Complete bpftrace script that outputs in a format compatible with the FlameGraph tools:

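#!/usr/bin/env bpftrace
// offcpu-flamegraph.bt: off-CPU profiling for flame graph generation
// Usage: bpftrace offcpu-flamegraph.bt PID > /tmp/offcpu-raw.txt
// $1 is the first positional argument. sched_switch runs in the context of
// the thread being switched out, so the stack is captured at switch-out and
// looked up by next_pid at switch-in.

tracepoint:sched:sched_switch /args->prev_pid == $1 && args->prev_state != 0/ {
    @start[args->prev_pid] = nsecs;
    @offstack[args->prev_pid] = ustack;
}

tracepoint:sched:sched_switch /
    args->next_pid == $1 &&
    @start[args->next_pid] != 0/ {
    $dur = (nsecs - @start[args->next_pid]) / 1000;  // microseconds
    @blocked[@offstack[args->next_pid]] = sum($dur);
    delete(@start[args->next_pid]);
    delete(@offstack[args->next_pid]);
}

END {
    clear(@start);
    clear(@offstack);
}

Run it, then generate the flame graph:

# bpftrace offcpu-flamegraph.bt 4521 > /tmp/offcpu-raw.txt
# Let it run for 60 seconds, Ctrl+C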

# Fold and generate:
# /opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/offcpu-raw.txt > /tmp/offcpu-folded.txt
# /opt/FlameGraph/flamegraph.pl --title "PostgreSQL Off-CPU Time" \
    --countname "microseconds" \
    --colors io \
    /tmp/offcpu-folded.txt > /tmp/postgres-offcpu.svg

The --colors io flag makes the flame graph use blue-green tones instead of the default warm colors. This is a convention: warm colors (red/yellow/orange) for CPU flame graphs, cool colors (blue/green/aqua) for off-CPU. When you're comparing the two side-by-side, you can instantly tell which is which. The --countname "microseconds" changes the tooltip to show time units instead of sample counts.

Interpreting Off-CPU Flame Graphs

Common patterns you'll see in off-CPU flame graphs and what they mean:

Wide bar at futex_wait

Threads are blocked on a mutex or condition variable. Follow the stack below futex_wait to find which application lock is contended. The width tells you the total time all threads spent waiting on that lock. Common in thread pool implementations, database connection pools, and any code using pthread_mutex_lock.

Wide bar at io_schedule / zio_wait

Threads are blocked waiting for disk I/O. On ZFS systems, you'll see this in zio_wait inside the ZFS I/O pipeline. The fix is faster disks, better ZFS tuning (ARC size, prefetch settings), or reducing I/O demand. This is the most common finding on storage-heavy workloads.

Wide bar at sk_wait_data / inet_csk_wait_for_connect

Threads are blocked waiting for network data. sk_wait_data means waiting for data on an already-connected socket. inet_csk_wait_for_connect means a thread blocked in accept() on a listening socket, waiting for new connections. If your application is slow and the off-CPU flame graph shows most time in sk_wait_data, the bottleneck is the network or whatever sits at the other end of the connection.

Wide bar at ep_poll / do_select

Threads are sleeping in epoll_wait() or select(). This is usually harmless — it means an event loop is idle, waiting for work. If you see this dominating an off-CPU flame graph, it doesn't mean your system is slow. It means your system is mostly idle. Filter it out mentally or by excluding specific comm names.

Wakeup Analysis

Off-CPU analysis tells you where a thread is sleeping. Wakeup analysis tells you who wakes it up — and that's where you find cascading latency chains. Process A is slow because it's sleeping in futex_wait. Who wakes it up? Process B, via futex_wake. Why was process B slow? Because it was blocked on a disk read. Why was the disk read slow? Because ZFS txg_sync was flushing a transaction group and saturating the I/O queue.

The Wakeup Chain

A wakeup chain traces the cause of a thread's sleep backward through the system. Thread X is blocked → Thread Y wakes it → Thread Y was blocked on something else → Thread Z woke Thread Y → Thread Z was doing disk I/O. By tracing sched_wakeup, you capture the waker's stack trace at the moment it wakes the target. This connects the sleeping thread to its root cause, even if the root cause is in a completely different process.

If off-CPU is finding the person stuck in line, wakeup analysis is finding the person in front of them who's holding up the line — and the person in front of them, all the way to the front.

Trace who wakes a specific process (PID 4521):

# bpftrace -e '
tracepoint:sched:sched_wakeup /args->pid == 4521/ {
    printf("Woken by: %s (PID %d)\n", comm, pid);
    printf("Waker kernel stack:\n");
    print(kstack());
    printf("---\n");
}'

Attaching 1 probe...
Woken by: postgres (PID 4535)
Waker kernel stack:

        try_to_wake_up+508
        futex_wake+534
        do_futex+318
        __x64_sys_futex+161
        do_syscall_64+91
        entry_SYSCALL_64_after_hwframe+118

---
Woken by: z_wr_iss (PID 289)
Waker kernel stack:

        try_to_wake_up+508
        __wake_up_common+119
        __wake_up_common_lock+122
        zio_notify_parent+162
        zio_done+1284
        zio_execute+114
        taskq_thread+567

---

Two different wakeup patterns. The first one is another PostgreSQL backend (PID 4535) waking our target via futex_wake — this is inter-process synchronization, probably shared buffer coordination. The second is a ZFS write issuer thread (z_wr_iss) waking our target because an I/O operation completed. If you trace PID 4535 next and find it was blocked on disk I/O, you've traced the latency chain from your slow query all the way down to the disk.

Combine wakeup analysis with off-CPU duration to find the slowest wakeup chains:

# bpftrace -e '
tracepoint:sched:sched_switch /args->prev_pid == 4521 && args->prev_state != 0/ {
    @sleep_start[args->prev_pid] = nsecs;
}

tracepoint:sched:sched_wakeup /args->pid == 4521 && @sleep_start[args->pid] != 0/ {
    $dur_ms = (nsecs - @sleep_start[args->pid]) / 1000000;
    if ($dur_ms > 10) {
        printf("[%dms] Woken by %s (PID %d)\n", $dur_ms, comm, pid);
        print(kstack());
    }
    delete(@sleep_start[args->pid]);
}

END { clear(@sleep_start); }
'

Attaching 3 probes...
[47ms] Woken by z_wr_iss (PID 289)

        try_to_wake_up+508
        __wake_up_common+119
        __wake_up_common_lock+122
        zio_notify_parent+162
        zio_done+1284
        zio_execute+114
        taskq_thread+567

[238ms] Woken by jbd2/sda3-8 (PID 412)

        try_to_wake_up+508
        wake_up_bit+51
        journal_end_buffer_io_sync+50
        end_buffer_write_sync+33
        blkdev_bio_end_io+181
        bio_endio+280
        blk_update_request+332

That 238ms wakeup from jbd2 (the ext4 journal thread) is suspicious. The PostgreSQL WAL (write-ahead log) might be on an ext4 partition, and the journal commit is slow. If the WAL were on a ZFS dataset with synchronous writes going to a fast SLOG device, that 238ms might become 2ms. Wakeup analysis doesn't just find the bottleneck — it shows you exactly which subsystem to fix.

Differential Flame Graphs

You pushed a change. Performance got worse. Or maybe better. Differential flame graphs show you exactly which code paths changed and by how much. Generate a flame graph before the change, generate one after, diff them. Red means a code path got hotter (regression). Blue means it got cooler (improvement).

Step 1 — capture the "before" profile:

# bpftrace -e 'profile:hz:99 /pid == 4521/ { @[ustack] = count(); }' \
    > /tmp/before-raw.txt
# Run for 60 seconds under representative load, Ctrl+C

# /opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/before-raw.txt > /tmp/before.folded

Step 2 — deploy your change, then capture the "after" profile under the same load:

# bpftrace -e 'profile:hz:99 /pid == 5102/ { @[ustack] = count(); }' \
    > /tmp/after-raw.txt
# Same duration, same load, Ctrl+C

# /opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/after-raw.txt > /tmp/after.folded

Step 3 — generate the differential flame graph:

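# /opt/FlameGraph/difffolded.pl /tmp/before.folded /tmp/after.folded \
    | /opt/FlameGraph/flamegraph.pl --title "Differential: before vs after" \
    > /tmp/diff.svg

With the files passed in before, after order, red = regression (more time than before) and blue = improvement (less time than before), and the widths come from the "after" profile. The --negate flag swaps the hues. It exists for the reverse view, where you pass the files in after, before order to draw widths from the "before" profile while keeping red as regression. Adding --negate to the command above would invert the colors, which is confusing.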

Reading a Differential Flame Graph

Red bars are code paths that now take more CPU time than before. The more saturated the red, the bigger the regression. Blue bars are code paths that take less CPU time. White/gray bars are unchanged. The width still represents total time (from the "after" profile). Look for the biggest red bars — those are the functions your change made slower. Look for blue bars to confirm expected improvements. If a function went from 5% to 15% of CPU time, it'll be bright red and wider than before.

Differential flame graphs are a cheat code for performance regressions. Instead of spending hours comparing profiles manually, you get a single image where the problem is highlighted in red. You deployed a new query planner? The diff flame graph shows ExecHashJoin is bright red (more CPU) while ExecNestLoop is blue (less CPU) — the planner is choosing hash joins over nested loops, and hash joins are more expensive for your data distribution. One image, complete diagnosis.
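
When the SVG is not at hand, a rough text version of the same comparison can be made from the two folded files. This is a sketch of the idea, not what difffolded.pl does internally: the function name folded_delta and the sample counts are invented for illustration.

```shell
# folded_delta BEFORE.folded AFTER.folded
# Hypothetical helper: prints "DELTA STACK" per stack, biggest regression first.
folded_delta() {
    LC_ALL=C awk '
        {
            count = $NF                      # last field is the sample count
            stack = $0
            sub(/ [0-9]+$/, "", stack)       # drop the trailing count
        }
        FNR == NR { before[stack] += count; seen[stack] = 1; next }
        { after[stack] += count; seen[stack] = 1 }
        END {
            for (s in seen) printf "%d %s\n", after[s] - before[s], s
        }
    ' "$1" "$2" | sort -k1,1nr
}

# Made-up sample profiles for the hash-join example.
before=$(mktemp)
after=$(mktemp)
printf '%s\n' 'main;ExecNestLoop 300' 'main;ExecHashJoin 100' 'main;ReadCommand 500' > "$before"
printf '%s\n' 'main;ExecNestLoop 120' 'main;ExecHashJoin 340' 'main;ReadCommand 500' > "$after"
folded_delta "$before" "$after"
rm -f "$before" "$after"
```

With these made-up numbers the top line is 240 main;ExecHashJoin and the bottom line is -180 main;ExecNestLoop. The comparison is only meaningful when both profiles were captured for the same duration under similar load, because it uses raw sample counts.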

Debug Symbols

Without symbols, your stack traces show hex addresses instead of function names. Instead of PostgresMain+1520, you get 0x55a3c7f21a30, which is close to useless. Stripped binaries keep only their exported dynamic symbols, so most internal functions lose their names. Debug symbol packages restore the mapping from memory addresses to human-readable function names, source file names, and line numbers. You need them for meaningful userspace stack traces.

Installing Debug Symbols

CentOS / RHEL / Rocky / Fedora — use debuginfo-install:

# dnf install dnf-utils

# Install debug symbols for a specific package:
# debuginfo-install postgresql15-server

# Or install by running binary:
# debuginfo-install -y $(rpm -qf /usr/pgsql-15/bin/postgres)

# Install kernel debuginfo:
# debuginfo-install kernel-$(uname -r)

# Verify installation:
# ls /usr/lib/debug/usr/pgsql-15/bin/postgres.debug
/usr/lib/debug/usr/pgsql-15/bin/postgres.debug

Debian / Ubuntu — enable the dbgsym repository and install -dbgsym packages:

# For Debian:
# echo "deb http://deb.debian.org/debian-debug/ trixie-debug main" \
    >> /etc/apt/sources.list.d/debug.list
# apt update

# For Ubuntu:
# echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted \
    universe multiverse" >> /etc/apt/sources.list.d/ddebs.list
# apt install ubuntu-dbgsym-keyring
# apt update

# Install debug symbols:
# apt install postgresql-15-dbgsym
# apt install linux-image-$(uname -r)-dbgsym

# Verify:
# ls /usr/lib/debug/.build-id/
00/ 01/ 02/ 03/ 04/ ... fd/ fe/ ff/

The .build-id Directory Structure

Modern Linux systems store debug symbols indexed by build ID — a unique hash of the binary. The debugger looks up /usr/lib/debug/.build-id/ab/cdef1234.debug to find symbols for a binary with build ID abcdef1234. This means debug symbols always match the exact binary, even across updates. You can check a binary's build ID with readelf -n /usr/bin/postgres | grep "Build ID".
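
The path rule is simple enough to express as a helper. This is a small sketch: the function name build_id_debug_path is hypothetical, and it only builds the expected path without checking whether the file exists.

```shell
# build_id_debug_path BUILD_ID
# Hypothetical helper: maps a build ID to the path where its debug file
# is expected under /usr/lib/debug/.build-id/.
build_id_debug_path() {
    id=$1
    prefix=$(printf '%s' "$id" | cut -c1-2)   # first two hex characters
    rest=$(printf '%s' "$id" | cut -c3-)      # remainder of the ID
    printf '/usr/lib/debug/.build-id/%s/%s.debug\n' "$prefix" "$rest"
}

build_id_debug_path abcdef1234
# prints /usr/lib/debug/.build-id/ab/cdef1234.debug
```

Feeding it the ID reported by readelf -n for a binary, then running ls on the result, is a quick way to confirm that the matching debuginfo package is installed.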

DWARF vs BTF

DWARF

The standard debug format for userspace binaries and the kernel. Contains everything: function names, variable types, source file mappings, line numbers, unwinding rules. A kernel with full DWARF debug info is 800MB+. Stored in .debug_info, .debug_line, .debug_frame, and related ELF sections. This is what gdb and perf read directly. bpftrace mostly needs the symbol tables shipped in the same debuginfo packages, which it uses to turn ustack() addresses into function names.

BTF (BPF Type Format)

A compact type format designed specifically for eBPF. Contains only type information (struct layouts, function signatures): no line numbers, no variable names, no unwind rules. The BTF data for a kernel is about 5MB, compared with 800MB+ for full DWARF. BTF is what lets bpftrace access kernel struct fields without installing kernel-debuginfo. It's embedded in the kernel image itself and exposed at /sys/kernel/btf/vmlinux. It is a complement to DWARF, not a replacement.

Check if your kernel has BTF enabled:

# ls -la /sys/kernel/btf/vmlinux
-r--r--r-- 1 root root 5765432 Apr  4 12:00 /sys/kernel/btf/vmlinux

# If that file exists, BTF is enabled. bpftrace will use it automatically.
# If not, you need kernel-debuginfo for struct access.

Application-Specific Debug Symbols

Common applications and their debug symbol packages:

# PostgreSQL
# CentOS/RHEL: debuginfo-install postgresql15-server
# Debian/Ubuntu: apt install postgresql-15-dbgsym

# nginx
# CentOS/RHEL: debuginfo-install nginx
# Debian/Ubuntu: apt install nginx-dbg   # note: -dbg not -dbgsym for nginx

# Python (for profiling Python C extensions)
# CentOS/RHEL: debuginfo-install python3
# Debian/Ubuntu: apt install python3-dbg

# glibc (needed for libc function names in stacks)
# CentOS/RHEL: debuginfo-install glibc
# Debian/Ubuntu: apt install libc6-dbg

# Node.js — no debug symbols needed, but you need perf maps:
# node --perf-basic-prof your-app.js
# This writes /tmp/perf-PID.map which bpftrace and perf can read

# Java: needs perf-map-agent, and the JVM should keep frame pointers:
# git clone https://github.com/jvm-profiling-tools/perf-map-agent
# cmake . && make
# java -XX:+PreserveFramePointer -agentpath:./libperfmap.so -jar your-app.jar

A very common reason for broken stack traces is libc. Your application's own symbols show function names, but every stack trace goes through libc at some point (read, write, malloc, pthread_create). Exported functions usually resolve from the dynamic symbol table, but libc's internal helpers show up as hex addresses without glibc debuginfo, and the stack looks incomplete. Install glibc debuginfo first. If stacks still stop at the libc frame, the cause is more likely unwinding through a libc built without frame pointers than missing symbols.

Core Dump Capture Triggered by eBPF

eBPF can detect anomalies in real time — segfaults, high latency, memory spikes, specific error codes — and trigger core dump capture automatically. Instead of waiting for users to report a crash and hoping the core dump was preserved, you set up eBPF to watch for the condition and grab the evidence the moment it happens.

systemd-coredump Configuration

First, make sure systemd-coredump is configured to actually capture core dumps:

# cat /etc/systemd/coredump.conf
[Coredump]
Storage=external
Compress=yes
ProcessSizeMax=2G
ExternalSizeMax=2G
JournalSizeMax=100M
MaxUse=10G

# Reload after changes:
# systemctl daemon-reload

Verify it's working:

# coredumpctl list
TIME                            PID  UID  GID SIG     COREFILE EXE                SIZE
Fri 2026-04-04 09:14:23 UTC    8821 1000 1000 SIGSEGV present  /usr/bin/myapp    24.3M
Fri 2026-04-04 09:15:47 UTC    8834 1000 1000 SIGABRT present  /usr/bin/myapp    31.1M

eBPF: Detect Segfaults and Log Context

Use eBPF to monitor for SIGSEGV (segfault) signals and capture surrounding context that coredumpctl alone doesn't provide — like what the process was doing in the seconds before the crash:

#!/usr/bin/env bpftrace
// crash-monitor.bt: detect segfaults and capture pre-crash context

// Count syscalls per process in 5-second windows for crash context
tracepoint:raw_syscalls:sys_enter /comm == "myapp"/ {
    @recent_syscalls[pid]++;
}

interval:s:5 {
    clear(@recent_syscalls);
}

// Track open files for crash context
tracepoint:syscalls:sys_enter_openat /comm == "myapp"/ {
    printf("[pre-crash file access] %s: open(%s)\n", comm, str(args->filename));
}

// Track memory allocations for crash context.
// The libc path varies by distro (e.g. /lib/x86_64-linux-gnu/libc.so.6 on Debian/Ubuntu).
uprobe:/lib64/libc.so.6:malloc /comm == "myapp"/ {
    @alloc_sizes[pid] = hist(arg0);
}

// signal_generate fires in the context of the task raising the signal. For a
// fault-driven SIGSEGV that is the crashing thread itself, so comm, pid and
// the stacks below describe the crash. For a SIGSEGV sent with kill(2) they
// describe the sender, and args->pid is the target.
tracepoint:signal:signal_generate /args->sig == 11/ {
    printf("\n=== SIGSEGV DETECTED ===\n");
    printf("Time: %s\n", strftime("%Y-%m-%d %H:%M:%S", nsecs));
    printf("Raised in: %s (PID %d, TID %d)\n", comm, pid, tid);
    printf("Signal target PID: %d\n", args->pid);
    printf("\nKernel stack at crash:\n");
    print(kstack());
    printf("\nUserspace stack at crash:\n");
    print(ustack());
    printf("\nSyscalls from this process in the current 5-second window: %d\n",
           @recent_syscalls[pid]);
    printf("=========================\n");
}

eBPF: Trigger Core Dump on High Latency

Sometimes you don't want to wait for a crash. You want to capture a core dump when a specific operation exceeds a latency threshold. This script monitors PostgreSQL query execution and sends SIGABRT to trigger a core dump when a query takes longer than 5 seconds. Because signal() changes the behavior of the traced system, bpftrace only allows it when started with --unsafe:

#!/usr/bin/env bpftrace
// slow-query-coredump.bt: capture core dump on slow queries
// Usage: bpftrace --unsafe slow-query-coredump.bt

uprobe:/usr/pgsql-15/bin/postgres:exec_simple_query {
    @query_start[tid] = nsecs;
}

uretprobe:/usr/pgsql-15/bin/postgres:exec_simple_query
    /@query_start[tid] != 0/ {
    $dur_ms = (nsecs - @query_start[tid]) / 1000000;
    if ($dur_ms > 5000) {
        printf("SLOW QUERY: %dms in PID %d — triggering core dump\n",
               $dur_ms, pid);
        print(ustack());
        signal("SIGABRT");
    }
    delete(@query_start[tid]);
}

Sending SIGABRT to a production PostgreSQL backend kills that backend, and the damage does not stop there. The postmaster treats any abnormal backend exit as possible shared-memory corruption, so it terminates every other backend, runs crash recovery, and only then accepts connections again; every client session is dropped, not just the slow one. Note also that bpftrace rejects signal() unless it is started with --unsafe. This is not something you do casually. You use it when you have a recurring slow-query problem that you can't reproduce with EXPLAIN ANALYZE, and you need the full memory state at the moment of the slow query. The core dump will contain the query plan, buffer state, lock state, and everything else in the backend's address space. One core dump from production is worth a hundred hours of trying to reproduce in dev.

Post-Mortem Analysis

The workflow: eBPF detects the anomaly live → captures context (stack traces, syscall history, memory allocation patterns) → triggers core dump → you analyze the core dump offline with gdb and coredumpctl. eBPF gives you the when and why. The core dump gives you the what — exact variable values, heap state, thread state.

coredumpctl Analysis Workflow

# List all core dumps:
# coredumpctl list
TIME                            PID  UID  GID SIG     COREFILE EXE                SIZE
Fri 2026-04-04 09:14:23 UTC    8821 1000 1000 SIGSEGV present  /usr/bin/myapp    24.3M
Fri 2026-04-04 09:15:47 UTC    8834 1000 1000 SIGABRT present  /usr/bin/myapp    31.1M

# Get detailed info about a specific dump:
# coredumpctl info 8821
           PID: 8821 (myapp)
           UID: 1000 (appuser)
           GID: 1000 (appuser)
        Signal: 11 (SEGV)
     Timestamp: Fri 2026-04-04 09:14:23 UTC
  Command Line: /usr/bin/myapp --config /etc/myapp.conf
    Executable: /usr/bin/myapp
   Control Group: /system.slice/myapp.service
          Unit: myapp.service
         Slice: system.slice
       Boot ID: a1b2c3d4e5f6...
    Machine ID: f6e5d4c3b2a1...
      Hostname: prod-web-03
       Storage: /var/lib/systemd/coredump/core.myapp.1000.a1b2c3.8821.1712222063000000.zst
  Size on Disk: 24.3M

# Open in gdb:
# coredumpctl gdb 8821
GNU gdb (GDB) 13.2
Reading symbols from /usr/bin/myapp...
Reading symbols from /usr/lib/debug/.build-id/ab/cdef1234.debug...
Core was generated by `/usr/bin/myapp --config /etc/myapp.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005583a7f21a30 in process_request (req=0x0) at src/handler.c:247
247         size_t len = req->content_length;

(gdb) bt
#0  0x00005583a7f21a30 in process_request (req=0x0) at src/handler.c:247
#1  0x00005583a7f22b50 in handle_connection (conn=0x7f3a2c001230) at src/server.c:182
#2  0x00005583a7f23c70 in worker_thread (arg=0x7f3a2c000010) at src/threadpool.c:95
#3  0x00007f3a3c0a3ea7 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f3a3bdd2a2f in clone () from /lib64/libc.so.6

(gdb) frame 0
#0  0x00005583a7f21a30 in process_request (req=0x0) at src/handler.c:247
247         size_t len = req->content_length;

(gdb) print req
$1 = (struct request *) 0x0

(gdb) frame 1
#1  0x00005583a7f22b50 in handle_connection (conn=0x7f3a2c001230) at src/server.c:182
182         process_request(conn->current_request);

(gdb) print conn->current_request
$2 = (struct request *) 0x0

(gdb) print conn->state
$3 = CONNECTION_CLOSING

And there it is. The connection was in CONNECTION_CLOSING state, which means current_request was already freed and set to NULL. But handle_connection didn't check the state before calling process_request. A race condition: the connection close handler ran on one thread while the request handler ran on another. The eBPF crash monitor told you when and how often. The core dump told you exactly which variable was null and why. That's the power of combining live tracing with post-mortem analysis.

Correlating eBPF Context with Core Dumps

The real power is correlating eBPF's live context with the core dump's frozen state. Set up eBPF to log everything leading up to the crash, then use the core dump to inspect the exact state:

# 1. Run the crash monitor (writes to /var/log/ebpf-crashes.log):
# bpftrace crash-monitor.bt > /var/log/ebpf-crashes.log 2>&1 &

# 2. When a crash happens, check the eBPF log:
# tail -50 /var/log/ebpf-crashes.log
=== SIGSEGV DETECTED ===
Time: 2026-04-04 09:14:23
Process: myapp (PID 8821, TID 8825)
Recent file accesses:
  open(/var/lib/myapp/session-4a2b.dat)
  open(/var/lib/myapp/session-4a2b.dat.lock)
Allocation histogram (last 10s):
  [64, 128)     : 12847
  [128, 256)    : 8921
  [256, 512)    : 3401
  [4K, 8K)      : 2
  [1M, 2M)      : 1    ← suspicious large allocation right before crash

# 3. Open the core dump and check that 1MB allocation:
# coredumpctl gdb 8821
(gdb) info threads
  Id   Target Id                  Frame
  1    Thread 0x7f3a3c0a5740      0x00007f3a3bdd1337 in epoll_wait ()
* 2    Thread 0x7f3a3b89e700      0x00005583a7f21a30 in process_request ()
  3    Thread 0x7f3a3b09d700      0x00007f3a3bdd2a2f in clone ()
  4    Thread 0x7f3a3a89c700      0x00007f3a3bdc94ed in nanosleep ()

(gdb) thread 2
(gdb) info locals
req = 0x0
buf = 0x7f3a2c100000
buf_size = 1048576

Memory Leak Detection

Memory leaks are silent killers. A process allocates memory, forgets to free it, and slowly grows until the OOM killer shows up three weeks later. Traditional tools like Valgrind work but require restarting the process under instrumentation, which is rarely acceptable in production. eBPF traces malloc/free (userspace) or kmalloc/kfree (kernel) on a live process without a restart. The overhead is low for most workloads but grows with allocation rate, so keep tracing windows short on allocation-heavy processes.

The memleak BCC Tool

BCC includes a purpose-built memleak tool that does exactly this. It attaches uprobes to the allocator functions, tracks outstanding allocations, and reports the ones that haven't been freed:

# Trace userspace allocations in PID 4521, show top 10 every 5 seconds:
# /usr/share/bcc/tools/memleak -p 4521 5

Attaching to pid 4521, Ctrl+C to quit.
[09:30:15] Top 10 stacks with outstanding allocations:
        948736 bytes in 3714 allocations from stack
                operator new(unsigned long)+0x1c [libstdc++.so.6]
                std::string::_Rep::_S_create(unsigned long, ...)+0x59
                ConnectionPool::createConnection()+0x4a [myapp]
                ConnectionPool::getConnection()+0x123 [myapp]
                RequestHandler::handleRequest()+0x67 [myapp]
                main+0x2a1 [myapp]

        524288 bytes in 1 allocations from stack
                malloc+0x3e [libc.so.6]
                json_parse_buffer()+0x28 [libjansson.so.4]
                parse_config()+0x81 [myapp]
                main+0x55 [myapp]

[09:30:20] Top 10 stacks with outstanding allocations:
        1523712 bytes in 5967 allocations from stack
                operator new(unsigned long)+0x1c [libstdc++.so.6]
                std::string::_Rep::_S_create(unsigned long, ...)+0x59
                ConnectionPool::createConnection()+0x4a [myapp]
                ConnectionPool::getConnection()+0x123 [myapp]
                RequestHandler::handleRequest()+0x67 [myapp]
                main+0x2a1 [myapp]

Between the two samples (5 seconds apart), the ConnectionPool::createConnection stack grew from 948KB (3,714 allocations) to 1.52MB (5,967 allocations). That's 2,253 more outstanding allocations in 5 seconds, with no sign of any being freed. The connection pool is leaking: it creates new connections but never returns them to the pool or closes them. The stack trace tells you exactly which function to fix. In production. Without restarting anything. Without Valgrind's 20x slowdown.
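The comparison above is plain arithmetic on two memleak reports. A minimal Python sketch, using the byte and allocation counts from the output above; the `Sample` type and `growth` helper are illustrative names and not part of BCC:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One memleak report entry for a single stack (illustrative type)."""
    outstanding_bytes: int
    allocations: int


def growth(first: Sample, second: Sample, interval_s: float) -> dict:
    """Return how much one stack's outstanding memory grew between two reports."""
    new_allocs = second.allocations - first.allocations
    new_bytes = second.outstanding_bytes - first.outstanding_bytes
    return {
        "new_allocations": new_allocs,
        "new_bytes": new_bytes,
        "bytes_per_second": new_bytes / interval_s,
        "allocations_per_second": new_allocs / interval_s,
    }


# Figures for the ConnectionPool::createConnection stack in the two reports above.
before = Sample(outstanding_bytes=948_736, allocations=3_714)
after = Sample(outstanding_bytes=1_523_712, allocations=5_967)
result = growth(before, after, interval_s=5)
print(result)
```

A stack whose outstanding bytes keep rising across several intervals is a leak candidate; one that rises and then levels off is more likely a cache or pool warming up.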

Kernel Memory Leak Detection

Trace kernel memory allocations with kmalloc/kfree. This catches kernel module leaks, driver bugs, and subsystem leaks:

# /usr/share/bcc/tools/memleak 5

Attaching to kernel allocators, Ctrl+C to quit.
[09:35:10] Top 10 stacks with outstanding allocations:
        4194304 bytes in 1024 allocations from stack
                kmalloc_trace+0x2b
                zfs_znode_alloc+0x9a
                zfs_zget+0x14f
                zfs_dirent_lock+0x1c3
                zfs_lookup+0x132
                zpl_lookup+0x5e

Manual malloc/free Tracing with bpftrace

For more control, write your own malloc/free tracer:

#!/usr/bin/env bpftrace
// malloc-tracer.bt — track allocations and frees for a specific process

uprobe:/lib64/libc.so.6:malloc /pid == $1/ {
    @alloc_size[tid] = arg0;
}

uretprobe:/lib64/libc.so.6:malloc /pid == $1 && @alloc_size[tid] != 0/ {
    @outstanding[retval] = @alloc_size[tid];
    @alloc_stacks[ustack, @alloc_size[tid]] = count();
    @total_alloc = sum(@alloc_size[tid]);
    delete(@alloc_size[tid]);
}

uprobe:/lib64/libc.so.6:free /pid == $1 && arg0 != 0/ {
    if (@outstanding[arg0] != 0) {
        @total_free = sum(@outstanding[arg0]);
        delete(@outstanding[arg0]);
    }
}

interval:s:10 {
    printf("\n--- Outstanding allocations: %d bytes ---\n",
           @total_alloc - @total_free);
}

END {
    printf("\n=== Allocation counts by stack and size (includes blocks freed later) ===\n");
    print(@alloc_stacks);
    clear(@outstanding);
    clear(@alloc_size);
}
# bpftrace malloc-tracer.bt 4521
Attaching 5 probes...

--- Outstanding allocations: 2481152 bytes ---
--- Outstanding allocations: 4915200 bytes ---
--- Outstanding allocations: 7340032 bytes ---
^C

=== Allocation counts by stack and size (includes blocks freed later) ===
@alloc_stacks[
    malloc+62
    ConnectionPool::createConnection()+74
    ConnectionPool::getConnection()+291
    RequestHandler::handleRequest()+103
    main+673
, 256]: 28672

Lock Contention Analysis

Lock contention is the number one killer of multi-threaded application performance. Thread A holds a lock. Threads B, C, D, E, and F are all blocked waiting for it. On-CPU profiling shows Thread A consuming CPU. Off-CPU analysis shows Threads B-F sleeping. But neither tells you which lock is the problem or how long threads wait for it. Lock contention analysis does.

Tracing pthread_mutex_lock

Trace all mutex acquisitions and measure the time threads spend waiting:

#!/usr/bin/env bpftrace
// lock-contention.bt — measure mutex wait times

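// On glibc 2.34+ pthread_mutex_lock lives in libc.so.6; point both probes there if libpthread.so.0 has no such symbol.
uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /pid == $1/ {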
    @lock_start[tid] = nsecs;
    @lock_addr[tid] = arg0;
}

uretprobe:/lib64/libpthread.so.0:pthread_mutex_lock
    /pid == $1 && @lock_start[tid] != 0/ {
    $dur = nsecs - @lock_start[tid];
    $dur_us = $dur / 1000;

    @lock_wait_us = hist($dur_us);
    @lock_wait_by_addr[@lock_addr[tid]] = sum($dur);
    @lock_contention[@lock_addr[tid], ustack] = count();

    if ($dur_us > 1000) {
        printf("SLOW LOCK: %dus waiting for mutex 0x%lx\n",
               $dur_us, @lock_addr[tid]);
        print(ustack());
    }

    delete(@lock_start[tid]);
    delete(@lock_addr[tid]);
}

END {
    printf("\n=== Lock wait time distribution (microseconds) ===\n");
    print(@lock_wait_us);
    printf("\n=== Total wait time by lock address ===\n");
    print(@lock_wait_by_addr);
    printf("\n=== Contention by lock + stack ===\n");
    print(@lock_contention);
}
# bpftrace lock-contention.bt 4521
Attaching 3 probes...
SLOW LOCK: 4721us waiting for mutex 0x55a3c8012340

        pthread_mutex_lock+37
        ConnectionPool::getConnection()+45
        RequestHandler::handleRequest()+103
        WorkerThread::run()+201
        start_thread+741

SLOW LOCK: 12847us waiting for mutex 0x55a3c8012340

        pthread_mutex_lock+37
        ConnectionPool::getConnection()+45
        RequestHandler::handleRequest()+103
        WorkerThread::run()+201
        start_thread+741

^C

=== Lock wait time distribution (microseconds) ===
@lock_wait_us:
[0]                    8924 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                    3421 |@@@@@@@@@@@@@@@                         |
[2, 4)                 1847 |@@@@@@@@                                |
[4, 8)                  921 |@@@@                                    |
[8, 16)                 412 |@                                       |
[16, 32)                198 |                                        |
[32, 64)                 87 |                                        |
[64, 128)                34 |                                        |
[128, 256)               12 |                                        |
[256, 512)                8 |                                        |
[512, 1K)                 5 |                                        |
[1K, 2K)                  3 |                                        |
[2K, 4K)                  2 |                                        |
[4K, 8K)                  1 |                                        |
[8K, 16K)                 1 |                                        |

=== Total wait time by lock address ===
@lock_wait_by_addr[0x55a3c8012340]: 892341872
@lock_wait_by_addr[0x55a3c8012380]: 12847291
@lock_wait_by_addr[0x55a3c80123c0]: 847123

Lock address 0x55a3c8012340 accounts for 892ms of total wait time — orders of magnitude more than any other lock. Every contention event on that lock comes from ConnectionPool::getConnection(). The connection pool has a single global mutex that every worker thread contends on. The fix is usually a striped lock (one mutex per N connections), a lock-free ring buffer, or a per-thread connection pool. You found the exact lock, the exact function, and the exact contention pattern. In production. In 30 seconds.
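Ranking locks by their share of total wait time makes the imbalance explicit. A small Python sketch that parses the `@lock_wait_by_addr` lines shown above; the `rank_locks` helper is a hypothetical name, and the regular expression assumes the output format printed in this example:

```python
import re

# Matches map output such as "@lock_wait_by_addr[0x55a3c8012340]: 892341872".
# Keys are accepted in hex or decimal form.
LINE_RE = re.compile(r"^@lock_wait_by_addr\[(0x[0-9a-fA-F]+|\d+)\]:\s*(\d+)$")


def rank_locks(output: str) -> list[tuple[str, float, float]]:
    """Return (address, wait_ms, share_of_total) sorted by wait time, largest first."""
    waits = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            waits[match.group(1)] = int(match.group(2))  # nanoseconds
    total = sum(waits.values())
    if total == 0:
        return []
    ranked = sorted(waits.items(), key=lambda item: item[1], reverse=True)
    return [(addr, ns / 1e6, ns / total) for addr, ns in ranked]


sample = """\
@lock_wait_by_addr[0x55a3c8012340]: 892341872
@lock_wait_by_addr[0x55a3c8012380]: 12847291
@lock_wait_by_addr[0x55a3c80123c0]: 847123
"""

ranking = rank_locks(sample)
for addr, wait_ms, share in ranking:
    print(f"{addr}  {wait_ms:10.1f} ms  {share:6.1%}")
```

When one address holds nearly all of the wait time, as here, splitting or removing that single lock is likely to matter more than tuning any of the others.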

Kernel Futex Tracing

For deeper analysis, trace the kernel-side futex operations directly. This catches all synchronization primitives that use futexes (mutexes, condition variables, semaphores, rwlocks):

# bpftrace -e '
tracepoint:syscalls:sys_enter_futex /pid == 4521/ {
    @futex_ops[args->op & 0xf] = count();
    if ((args->op & 0xf) == 0) {  // FUTEX_WAIT
        @wait_start[tid] = nsecs;
        @wait_stack[tid] = ustack;
    }
}

tracepoint:syscalls:sys_exit_futex /pid == 4521 && @wait_start[tid] != 0/ {
    $dur_us = (nsecs - @wait_start[tid]) / 1000;
    @futex_wait_us = hist($dur_us);
    if ($dur_us > 5000) {
        printf("LONG FUTEX WAIT: %dus\n", $dur_us);
        print(@wait_stack[tid]);
    }
    delete(@wait_start[tid]);
    delete(@wait_stack[tid]);
}

END {
    printf("\nFutex operation counts (0=WAIT, 1=WAKE, ...):\n");
    print(@futex_ops);
}'
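The `@futex_ops` map is keyed by raw command numbers, which are hard to read at a glance. The Python sketch below decodes an `op` value into a command name plus its two flag bits, using the constants from the kernel's `<linux/futex.h>`. The script above masks with `0xf`, which works because every current command number fits in four bits; the sketch uses the kernel's own command mask instead. The `decode_futex_op` helper is an illustrative name.

```python
# Flag bits and command numbers as defined in <linux/futex.h>.
FUTEX_PRIVATE_FLAG = 128
FUTEX_CLOCK_REALTIME = 256
FUTEX_CMD_MASK = ~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME)

FUTEX_COMMANDS = {
    0: "FUTEX_WAIT",
    1: "FUTEX_WAKE",
    2: "FUTEX_FD",
    3: "FUTEX_REQUEUE",
    4: "FUTEX_CMP_REQUEUE",
    5: "FUTEX_WAKE_OP",
    6: "FUTEX_LOCK_PI",
    7: "FUTEX_UNLOCK_PI",
    8: "FUTEX_TRYLOCK_PI",
    9: "FUTEX_WAIT_BITSET",
    10: "FUTEX_WAKE_BITSET",
    11: "FUTEX_WAIT_REQUEUE_PI",
    12: "FUTEX_CMP_REQUEUE_PI",
    13: "FUTEX_LOCK_PI2",
}


def decode_futex_op(op: int) -> tuple[str, bool, bool]:
    """Split a futex op into (command name, is_private, uses_clock_realtime)."""
    cmd = op & FUTEX_CMD_MASK
    name = FUTEX_COMMANDS.get(cmd, f"UNKNOWN({cmd})")
    return name, bool(op & FUTEX_PRIVATE_FLAG), bool(op & FUTEX_CLOCK_REALTIME)


for raw_op in (0, 128, 129, 137, 393):
    print(raw_op, decode_futex_op(raw_op))
```

Some glibc code paths block with FUTEX_WAIT_BITSET (9) instead of FUTEX_WAIT (0), so if the wait histogram looks suspiciously empty, it may be worth timing command 9 as well.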

Lock Contention Flame Graph

Generate a flame graph where the width represents lock wait time instead of CPU time:

# bpftrace -e '
uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /pid == 4521/ {
    @lock_start[tid] = nsecs;
}

uretprobe:/lib64/libpthread.so.0:pthread_mutex_lock
    /pid == 4521 && @lock_start[tid] != 0/ {
    $dur = (nsecs - @lock_start[tid]) / 1000;
    @lock_wait[ustack] = sum($dur);
    delete(@lock_start[tid]);
}

END { clear(@lock_start); }
' > /tmp/lock-raw.txt

# /opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/lock-raw.txt > /tmp/lock-folded.txt
# /opt/FlameGraph/flamegraph.pl --title "Lock Contention" \
    --countname "microseconds" \
    --colors aqua \
    /tmp/lock-folded.txt > /tmp/lock-contention.svg

Real-World Examples

Finding Latency Spikes in PostgreSQL

A production PostgreSQL server has intermittent query latency spikes. P99 goes from 5ms to 800ms for no apparent reason. pg_stat_statements shows the slow queries are simple index lookups that should be fast. Here's how to find the root cause:

# Step 1: Confirm it's off-CPU time, not on-CPU
# bpftrace -e 'profile:hz:99 /comm == "postgres"/ { @[ustack] = count(); } interval:s:30 { exit(); }'

# Result: most CPU time is in ReadBuffer and index scan functions.
# Nothing unusual. CPU profile looks normal.

# Step 2: Check off-CPU time
# bpftrace -e '
tracepoint:sched:sched_switch /comm == "postgres" && args->prev_state != 0/ {
    @start[args->prev_pid] = nsecs;
    @stack[args->prev_pid] = kstack;
}
tracepoint:sched:sched_switch /@start[args->next_pid]/ {
    $dur_ms = (nsecs - @start[args->next_pid]) / 1000000;
    if ($dur_ms > 50) {
        printf("OFF-CPU %dms:\n", $dur_ms);
        print(@stack[args->next_pid]);
    }
    delete(@start[args->next_pid]);
    delete(@stack[args->next_pid]);
}
interval:s:120 { exit(); }
END { clear(@start); clear(@stack); }
'

OFF-CPU 412ms:

        schedule+46
        io_schedule+46
        zio_wait+214
        dmu_buf_hold_array_by_dnode+215
        dmu_read_impl+76
        dmu_read+58
        zfs_read+358
        zpl_read_common_iovec+162
        zpl_iter_read+116
        vfs_read+451
        ksys_read+95
        do_syscall_64+91
        entry_SYSCALL_64_after_hwframe+118

OFF-CPU 237ms:

        schedule+46
        io_schedule+46
        zio_wait+214
        dmu_tx_assign+327
        zfs_write+643
        zpl_write_common_iovec+162
        zpl_iter_write+116
        vfs_write+451
        ksys_write+95
        do_syscall_64+91
        entry_SYSCALL_64_after_hwframe+118

Found it. PostgreSQL backends are blocking for 200-400ms in zio_wait inside both zfs_read and zfs_write. The ZFS I/O pipeline is stalling. This typically means one of three things: (1) the ARC (adaptive replacement cache) is too small, causing cache misses that hit disk, (2) a ZFS scrub or resilver is running and consuming I/O bandwidth, or (3) the underlying disk is failing and retrying I/O operations. Check with zpool status for scrub activity and zpool iostat -v 1 for per-disk latency. The eBPF trace took 2 minutes to find what might have taken days of guess-and-check.
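With more than a handful of events, it helps to total the blocked time per kernel function rather than read stacks one by one. The Python sketch below parses the `OFF-CPU <n>ms:` blocks printed by the script above and sums durations for a chosen set of frames. The helper names are illustrative, and the sample text is an abbreviated copy of the two stacks shown.

```python
import re
from collections import defaultdict

HEADER_RE = re.compile(r"^OFF-CPU (\d+)ms:$")


def parse_events(text: str) -> list[tuple[int, list[str]]]:
    """Parse 'OFF-CPU <n>ms:' blocks into (duration_ms, [frame names]) tuples."""
    events = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            current = (int(header.group(1)), [])
            events.append(current)
        elif line and current is not None:
            # Drop the "+offset" suffix so identical call paths compare equal.
            current[1].append(line.split("+")[0])
    return events


def blocked_ms_by_frame(events, frames_of_interest):
    """Sum blocked time per frame; an event counts once for each listed frame it contains."""
    totals = defaultdict(int)
    for duration_ms, frames in events:
        for frame in frames_of_interest:
            if frame in frames:
                totals[frame] += duration_ms
    return dict(totals)


sample = """
OFF-CPU 412ms:

        schedule+46
        io_schedule+46
        zio_wait+214
        dmu_read+58
        zfs_read+358

OFF-CPU 237ms:

        schedule+46
        io_schedule+46
        zio_wait+214
        dmu_tx_assign+327
        zfs_write+643
"""

events = parse_events(sample)
totals = blocked_ms_by_frame(events, ["zio_wait", "zfs_read", "zfs_write"])
print(totals)
```

Totals per frame overlap by design (both events pass through zio_wait), so treat them as "time spent under this function" and not as parts of a whole.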

Off-CPU Flame Graph Reveals ZFS txg_sync Blocking Writes

An application writes heavily to ZFS and experiences periodic 1-2 second stalls every 5-10 seconds. The stalls correlate with ZFS transaction group (TXG) syncs:

# Step 1: Generate off-CPU flame graph for the application
# bpftrace -e '
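tracepoint:sched:sched_switch /args->prev_pid == 9182 && args->prev_state != 0/ {
    // Record the blocking stack at switch-out, keyed by the task going to sleep.
    @start[args->prev_pid] = nsecs;
    @stack[args->prev_pid] = kstack;
}
tracepoint:sched:sched_switch /args->next_pid == 9182 && @start[args->next_pid]/ {
    // At switch-in, tid and kstack belong to the outgoing task, so use the saved values.
    $dur = (nsecs - @start[args->next_pid]) / 1000;
    @blocked[@stack[args->next_pid]] = sum($dur);
    delete(@start[args->next_pid]);
    delete(@stack[args->next_pid]);
}
interval:s:60 { exit(); }
END { clear(@start); clear(@stack); }
' > /tmp/txg-offcpu.txt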

# Step 2: Generate flame graph
# /opt/FlameGraph/stackcollapse-bpftrace.pl /tmp/txg-offcpu.txt > /tmp/txg-folded.txt
# /opt/FlameGraph/flamegraph.pl --title "App Off-CPU (ZFS txg_sync visible)" \
    --countname "microseconds" --colors io \
    /tmp/txg-folded.txt > /tmp/txg-offcpu.svg

# The flame graph shows a massive wide bar through this path:
#   io_schedule → zio_wait → dmu_tx_assign → zfs_write
# Width: ~60% of total off-CPU time
# This means the application is blocked on dmu_tx_assign waiting for
# the current TXG to have space for new writes.

# Step 3: Confirm TXG sync timing
# bpftrace -e '
// txg_sync_thread is a long-lived loop; spa_sync runs once per TXG, so probe that.
kprobe:spa_sync {
    printf("TXG sync started: %s\n", strftime("%H:%M:%S", nsecs));
}
kretprobe:spa_sync {
    printf("TXG sync ended:   %s\n", strftime("%H:%M:%S", nsecs));
}
interval:s:30 { exit(); }
'

TXG sync started: 09:45:03
TXG sync ended:   09:45:04
TXG sync started: 09:45:08
TXG sync ended:   09:45:09
TXG sync started: 09:45:13
TXG sync ended:   09:45:15

The fix: tune ZFS to sync TXGs more frequently so that each sync is shorter, or, if the stalled writes are synchronous, add a SLOG (a separate log device that holds the ZFS Intent Log) so those writes are acknowledged without waiting for the TXG commit:

# Reduce TXG sync interval from default 5s to 1s (a module parameter, not a dataset property):
# echo 1 > /sys/module/zfs/parameters/zfs_txg_timeout

# Or add a fast SLOG device:
# zpool add tank log /dev/nvme1n1

Wakeup Chain: Slow DNS Causing HTTP Timeouts

An HTTP service has intermittent 2-3 second response times. CPU usage is low. Disk I/O is fine. Network bandwidth is fine. The problem is invisible to traditional monitoring:

# Step 1: Off-CPU analysis on the HTTP worker threads
# bpftrace -e '
tracepoint:sched:sched_switch /comm == "http-worker" && args->prev_state != 0/ {
    @start[args->prev_pid] = nsecs;
    @offstack[args->prev_pid] = kstack;
}
tracepoint:sched:sched_switch /@start[args->next_pid]/ {
    $dur_ms = (nsecs - @start[args->next_pid]) / 1000000;
    if ($dur_ms > 500) {
        printf("OFF-CPU %dms:\n", $dur_ms);
        print(@offstack[args->next_pid]);
    }
    delete(@start[args->next_pid]);
    delete(@offstack[args->next_pid]);
}
END { clear(@start); clear(@offstack); }
'

OFF-CPU 2341ms:

        schedule+46
        schedule_hrtimeout_range_clock+256
        do_poll.constprop.0+567
        do_sys_poll+567
        __x64_sys_poll+174
        do_syscall_64+91
        entry_SYSCALL_64_after_hwframe+118

# The kernel stack just shows poll() — we need the userspace stack.

# Step 2: Capture userspace stacks for the same event
# bpftrace -e '
tracepoint:sched:sched_switch /comm == "http-worker" && args->prev_state != 0/ {
    @start[args->prev_pid] = nsecs;
    @offustack[args->prev_pid] = ustack;
}
tracepoint:sched:sched_switch /@start[args->next_pid]/ {
    $dur_ms = (nsecs - @start[args->next_pid]) / 1000000;
    if ($dur_ms > 500) {
        printf("OFF-CPU %dms:\n", $dur_ms);
        print(@offustack[args->next_pid]);
    }
    delete(@start[args->next_pid]);
    delete(@offustack[args->next_pid]);
}
END { clear(@start); clear(@offustack); }
'

OFF-CPU 2107ms:

        __GI___poll+45
        __res_context_send+1842
        __res_context_query+412
        __res_context_search+253
        gaih_inet.constprop.0+2641
        getaddrinfo+341
        resolve_upstream+87
        proxy_handler+245
        handle_request+152
        worker_main+401

There it is. getaddrinfo → __res_context_search → __res_context_query → __res_context_send → poll. The HTTP worker is doing a DNS lookup via getaddrinfo() for every request to resolve the upstream backend hostname, and the DNS server is taking 2+ seconds to respond. The fix is one of: (1) use /etc/hosts or a local DNS cache (systemd-resolved, dnsmasq), (2) cache the DNS result in the application, or (3) use IP addresses instead of hostnames for upstream backends. On-CPU profiling would have shown nothing. Standard network monitoring would have shown nothing (the DNS packets are tiny). Only off-CPU tracing with userspace stacks found this.
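Option (2), caching the DNS result in the application, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the service's actual code: `CachedResolver`, the 30-second TTL, the hostname and the stub resolver are all hypothetical, and the resolver and clock are injectable so the example runs without network access.

```python
import socket
import time


class CachedResolver:
    """Cache getaddrinfo() results for a fixed TTL (illustrative helper)."""

    def __init__(self, ttl_s=30.0, resolve=socket.getaddrinfo, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._resolve = resolve
        self._clock = clock
        self._cache = {}  # (host, port) -> (expires_at, results)

    def lookup(self, host, port):
        key = (host, port)
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh cache hit: no DNS round trip
        results = self._resolve(host, port)
        self._cache[key] = (now + self._ttl_s, results)
        return results


# Demonstration with a stub resolver (hypothetical address and hostname).
lookups = []


def stub_resolve(host, port):
    lookups.append(host)
    return [("203.0.113.10", port)]


resolver = CachedResolver(ttl_s=30.0, resolve=stub_resolve)
resolver.lookup("backend.internal", 8080)
resolver.lookup("backend.internal", 8080)
print(len(lookups))  # the second call is served from the cache
```

A fixed TTL ignores the TTL carried in the DNS record, so this trades freshness for latency; a local caching resolver such as systemd-resolved or dnsmasq avoids that trade-off without touching application code.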

Memory Leak in a Long-Running Daemon

A daemon's RSS grows by ~50MB per hour. After 2 weeks, the OOM killer terminates it. No crashes, no errors in the logs. Just steady growth:

# Step 1: Confirm the growth rate
# while true; do
    ps -o pid,rss,comm -p 6712
    sleep 60
  done

    PID    RSS COMMAND
   6712 842316 mydaemon
    PID    RSS COMMAND
   6712 843128 mydaemon
    PID    RSS COMMAND
   6712 843940 mydaemon

# ~800KB/minute growth. That's 48MB/hour, confirming the report.

# Step 2: Use memleak to find the leaking allocation
# /usr/share/bcc/tools/memleak -p 6712 -o 10000 10

# -o 10000 = only show allocations outstanding for 10+ seconds
# 10 = print every 10 seconds

[09:50:00] Top 10 stacks with outstanding allocations:
        83886080 bytes in 20480 allocations from stack
                malloc+0x3e [libc.so.6]
                json_loads+0x48 [libjansson.so.4]
                parse_event+0x67 [mydaemon]
                event_loop+0x1a3 [mydaemon]
                main+0x2b5 [mydaemon]

        4194304 bytes in 1 allocations from stack
                malloc+0x3e [libc.so.6]
                init_buffer_pool+0x28 [mydaemon]
                main+0x55 [mydaemon]

[09:50:10] Top 10 stacks with outstanding allocations:
        88080384 bytes in 21504 allocations from stack
                malloc+0x3e [libc.so.6]
                json_loads+0x48 [libjansson.so.4]
                parse_event+0x67 [mydaemon]
                event_loop+0x1a3 [mydaemon]
                main+0x2b5 [mydaemon]

parse_event calls json_loads to parse incoming JSON events. json_loads allocates memory for the parsed JSON object. In 10 seconds, 1,024 new allocations appeared and none were freed. The parse_event function is parsing the JSON but never calling json_decref() to free the parsed object when it's done with it. The fix is a one-line call to json_decref(root) at the end of parse_event(). Twenty minutes of eBPF tracing found a bug that had been leaking memory for months.
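The growth estimate in Step 1 comes from differencing the ps samples. A small Python sketch using the three RSS readings shown above (ps reports RSS in KB); `rss_growth_kb_per_hour` is an illustrative helper name:

```python
def rss_growth_kb_per_hour(samples_kb, interval_s):
    """Average RSS growth in KB/hour from evenly spaced ps RSS samples (in KB)."""
    if len(samples_kb) < 2:
        raise ValueError("need at least two samples")
    elapsed_s = interval_s * (len(samples_kb) - 1)
    return (samples_kb[-1] - samples_kb[0]) * 3600 / elapsed_s


# The three readings from the ps loop above, taken 60 seconds apart.
samples = [842_316, 843_128, 843_940]
rate = rss_growth_kb_per_hour(samples, interval_s=60)
print(f"{rate:.0f} KB/hour ({rate / 1024:.1f} MiB/hour)")
```

Three samples are enough to confirm a trend but not to rule out bursts; sampling over a full traffic cycle gives a steadier figure before you attach memleak.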

Common Pitfalls

eBPF profiling and stack trace analysis can go wrong in subtle ways. These are the most common problems and how to fix them.

Missing Frame Pointers

The number one cause of broken userspace stack traces. Most distros have historically compiled with -fomit-frame-pointer to free up the RBP register. Without frame pointers, the stack unwinder can't walk the call chain, and you get truncated stacks with [unknown] frames. Fix: recompile with -fno-omit-frame-pointer, or use a distro that enables frame pointers by default (Fedora 38+, Ubuntu 24.04+). bpftrace's ustack walks frame pointers only, so for third-party software you can't rebuild, install DWARF debug symbols and profile with a DWARF-capable unwinder such as perf record --call-graph dwarf.

Missing Debug Symbols

Without debug symbols, stack traces show hex addresses instead of function names: 0x55a3c7f21a30 instead of PostgresMain+1520. The stack trace is technically correct but completely useless. Fix: install the -debuginfo (RHEL) or -dbgsym (Debian) package for every binary in the stack. Don't forget glibc — it's in every stack.

JIT-Compiled Code (Java, Node.js, Python)

JIT compilers generate machine code at runtime. The eBPF stack unwinder doesn't know what functions that code belongs to — it just sees anonymous memory regions marked [unknown]. Fix: Java needs perf-map-agent to write /tmp/perf-PID.map. Node.js needs --perf-basic-prof. Python needs py-spy or the python3-dbg interpreter. Each runtime has its own mechanism for exporting symbol maps.

Stripped Binaries

Containers often ship minimal images with stripped binaries (no symbol table at all). Even readelf -s shows nothing. Fix: install the unstripped binary alongside it, or use eu-unstrip to merge a debug symbol file with the stripped binary. For container debugging, mount the debug symbols from the host into the container's /usr/lib/debug/ path.

Kernel ORC vs Frame Pointer Confusion

Kernel stacks almost always work because modern kernels use the ORC unwinder. But if you're running an older kernel (pre-4.14) or a custom kernel with ORC disabled, kernel stacks will be broken just like userspace stacks without frame pointers. Check: grep CONFIG_UNWINDER_ORC /boot/config-$(uname -r). If it's not set, you need CONFIG_UNWINDER_FRAME_POINTER=y and a kernel recompile.

Stack Depth Truncation

The default maximum stack depth in bpftrace is 127 frames. If your application has deeper call chains (some enterprise Java apps routinely exceed 200 frames), the stack gets truncated at the bottom, and you lose the entry point. Fix: increase the depth with kstack(perf, 256) or ustack(256). Be aware that deeper stacks use more map memory.

Inlined Functions

Compiler optimizations inline small functions into their callers. The inlined function disappears from the stack trace. If you're looking for parse_header() but the compiler inlined it into handle_request(), you'll only see handle_request() in the stack. Fix: compile with -fno-inline for debugging (not production), or use DWARF info which tracks inlining decisions and can show inline frames.

High Frequency Sampling Overhead

Sampling at 999 Hz instead of 99 Hz gives you 10x more samples but also 10x more overhead. On a busy system with 128 CPUs, 999 Hz means 127,872 interrupts per second. That's measurable overhead. Rule of thumb: 49 Hz for initial exploration, 99 Hz for detailed profiling, never above 999 Hz. For off-CPU tracing, the overhead comes from probe frequency, not a timer — heavily threaded workloads can generate millions of sched_switch events per second.
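The interrupt count is just CPUs multiplied by sampling frequency. The Python sketch below reproduces the figure above and adds a rough per-CPU overhead estimate. The per-sample cost of 5 microseconds is a hypothetical placeholder, not a measured value; substitute a figure measured on your own hardware.

```python
def sampling_interrupts_per_second(cpus: int, hz: int) -> int:
    """Timer-based profiling fires once per CPU per tick, so the rate is cpus * hz."""
    return cpus * hz


def per_cpu_overhead_fraction(hz: int, sample_cost_us: float) -> float:
    """Fraction of each CPU spent taking samples, given an assumed cost per sample."""
    return hz * sample_cost_us / 1_000_000


ASSUMED_SAMPLE_COST_US = 5.0  # hypothetical; depends on stack depth and hardware

for hz in (49, 99, 999):
    total = sampling_interrupts_per_second(128, hz)
    overhead = per_cpu_overhead_fraction(hz, ASSUMED_SAMPLE_COST_US)
    print(f"{hz:4d} Hz: {total:7d} interrupts/s, ~{overhead:.3%} of each CPU")
```

Because overhead scales linearly with frequency, dropping from 999 Hz to 99 Hz cuts it by roughly a factor of ten regardless of the per-sample cost you plug in.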

Quick Diagnostic Checklist

When your stack traces don't look right, run through this checklist:

# 1. Do you have debug symbols?
# readelf -S /usr/bin/postgres | grep debug
  [28] .debug_info       PROGBITS ...
  [29] .debug_abbrev     PROGBITS ...
  [30] .debug_line       PROGBITS ...
# If no .debug sections, install debuginfo/dbgsym packages.

# 2. Does the binary have frame pointers?
# objdump -d /usr/bin/postgres | grep -c "mov.*%rsp,%rbp"
# A count close to the number of functions = frame pointers present. Near zero = omitted.
# (Counting "push %rbp" alone is unreliable: RBP is also saved as an ordinary callee-saved register.)
# Separately, check for DWARF unwind info, which perf and gdb can use:
# readelf -wf /usr/bin/postgres 2>/dev/null | head -5

# 3. Does the kernel have BTF?
# ls /sys/kernel/btf/vmlinux
# If missing, bpftrace needs matching kernel headers installed to resolve kernel struct fields.

# 4. Does the kernel have ORC?
# grep CONFIG_UNWINDER_ORC=y /boot/config-$(uname -r)
# If missing, kernel stacks may be broken.

# 5. Is the process a JIT runtime?
# If Java: check for /tmp/perf-PID.map
# If Node.js: was it started with --perf-basic-prof?
# If neither exists, JIT frames will show as [unknown].

# 6. Is bpftrace running as root?
# bpftrace currently requires root and exits with an error otherwise.
# In containers, root alone is not enough: the container also needs
# CAP_BPF and CAP_PERFMON (or CAP_SYS_ADMIN on pre-5.8 kernels).

The most frustrating debugging experience is debugging your debugger. You spend an hour setting up eBPF profiling, generate a flame graph, and it's full of [unknown] frames and truncated stacks. Run the checklist first. Install debug symbols first. Verify frame pointers first. Then profile. Five minutes of setup saves an hour of staring at garbage data. The tools are only as good as the metadata you give them.
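A few of the checklist items are simple file and privilege checks that can be scripted. The Python sketch below covers items 3, 5 and 6 only; `preflight` and its parameters are illustrative names, and the symbol and frame-pointer checks still need readelf and objdump as shown above.

```python
import os


def preflight(btf_path="/sys/kernel/btf/vmlinux", euid=None, perf_map_pid=None):
    """Return a dict of check name -> bool for a few checklist items."""
    if euid is None:
        euid = os.geteuid()
    checks = {
        "running_as_root": euid == 0,
        "kernel_btf_present": os.path.exists(btf_path),
    }
    if perf_map_pid is not None:
        # JIT runtimes such as Java (with perf-map-agent) write their symbols here.
        checks["perf_map_present"] = os.path.exists(f"/tmp/perf-{perf_map_pid}.map")
    return checks


for name, ok in preflight().items():
    print(f"{'ok  ' if ok else 'FAIL'} {name}")
```

Passing `perf_map_pid` only makes sense for JIT runtimes; for native binaries the missing map file is expected and not a problem.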

Putting It All Together

The complete profiling workflow for any performance investigation:

Step 1: On-CPU Profile

Start with profile:hz:99 to see where CPU time goes. If the application is CPU-bound and the hot path is in application code, you've found the bottleneck. Generate a flame graph. Done.

Step 2: Off-CPU Analysis

If CPU usage is low but the application is slow, the problem is off-CPU. Trace sched_switch to find where threads sleep. Generate an off-CPU flame graph. Look for wide bars in I/O, lock, or network paths.

Step 3: Wakeup Chains

If off-CPU analysis shows threads sleeping but you can't tell why, trace sched_wakeup to find who wakes them and what the waker was doing. Follow the chain until you find the root cause.

Step 4: Lock Contention

If multiple threads are sleeping on futexes, trace pthread_mutex_lock to identify the hot lock. Generate a lock contention flame graph to see which code paths contend most.

Step 5: Memory Analysis

If the process is growing in RSS, use memleak to trace allocations. Compare alloc vs free counts. The allocation stack with the highest outstanding byte count is your leak.

Step 6: Differential Flame Graphs

If performance changed after a deployment, generate before/after flame graphs and diff them. Red bars are regressions. Blue bars are improvements. No guessing.

Step 7: Core Dump Capture

If none of the above explains an intermittent crash or corruption, set up eBPF to detect the anomaly and trigger a core dump. Analyze the frozen process state with gdb and coredumpctl.
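Step 6's before/after comparison can also be done numerically on the folded stack files, before rendering anything. The Python sketch below diffs two folded profiles (one `frame;frame;frame count` entry per line, the format produced by the stackcollapse scripts). The helper names and the sample stacks are made up for illustration; the FlameGraph repository ships difffolded.pl for producing the actual red/blue graph.

```python
def parse_folded(text):
    """Parse folded stacks ('frame;frame;frame <count>' per line) into a dict."""
    stacks = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        stacks[stack] = stacks.get(stack, 0) + int(count)
    return stacks


def diff_folded(before, after):
    """Return (stack, delta) pairs, largest increase first."""
    all_stacks = set(before) | set(after)
    deltas = {s: after.get(s, 0) - before.get(s, 0) for s in all_stacks}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Hypothetical profiles taken before and after a deployment.
before_text = """
main;handle_request;parse_header 120
main;handle_request;render 300
"""
after_text = """
main;handle_request;parse_header 480
main;handle_request;render 280
main;gc 15
"""

changes = diff_folded(parse_folded(before_text), parse_folded(after_text))
for stack, delta in changes:
    print(f"{delta:+6d}  {stack}")
```

Raw sample counts are only comparable when both profiles cover the same duration at the same frequency; otherwise normalize each profile to fractions of its total first.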

Every one of these steps works in production. No restarts. No Valgrind. No strace (which adds 50-100x overhead for some syscalls). No gdb attach (which stops the process). eBPF samples the kernel at near-zero cost, captures the data you need, and lets the process keep running. This is how you debug systems that matter — the ones you can't afford to restart.
