kldload — Seeing Into the Kernel: what eBPF gives you that logs never can

Modern Linux · 1 of 3

Seeing into the kernel.

Every service runs through the kernel. Your web server, your database, your container runtime, your VPN — none of them touch the network card directly, none of them write to disk directly, none of them schedule themselves. The kernel does all of it on their behalf. The kernel is the platform.

For most of Linux's history, the kernel was a black box. You watched what your applications wrote to logs — and you hoped someone had remembered to print the thing that mattered. When they hadn't, you guessed.

You can't operate a platform you can't see.

And you can't run a fleet of services on a platform whose state you have to guess at. eBPF is how you stop guessing.

The kernel, area by area

Brendan Gregg's tracing-tools diagram (below) is famous because it shows, at a glance, every place a modern Linux kernel can be observed from userspace, and the tool that observes it. Each colored region is a subsystem. Each label hanging off it is a tool you can run without recompiling anything, without rebooting, without asking the application to cooperate.

The Linux kernel observability surface · bcc/bpftrace tools annotated at each subsystem · original inspiration: iovisor/bcc tracing tools 2019

Below, one short walkthrough per layer: what it is, the real-world question it answers, and the tool that gives you the answer. None of these tools require changes to your application — they attach to the kernel itself.

User applications · libraries

Question: where is my application actually spending time?

A Java service is slow. Logs say "request took 4s." Where did those four seconds go: GC, a database call, a regex backtrack, a DNS lookup? Traditional profiling usually means loading an agent into the JVM. uprobes let you attach to functions in native binaries and shared libraries on the running system (the JVM itself, libc, OpenSSL) without a restart, and tools such as javastat read the JVM's built-in USDT probes for GC and method activity.

Tools: uprobes · gethostlatency · javastat · bashreadline (audit every interactive shell command)
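
As a sketch of the DNS case: gethostlatency attaches uprobes to the libc resolver functions and prints one line per lookup. The helper below is hypothetical (slow_lookups is not part of bcc) and assumes the default column layout TIME PID COMM LATms HOST. It keeps the header plus every lookup at or above a latency threshold in milliseconds.

```shell
# Hypothetical helper: filter gethostlatency output by latency.
# Assumes the default columns: TIME PID COMM LATms HOST (LATms is field 4).
slow_lookups() {
    awk -v min="$1" 'NR == 1 { print; next } $4 + 0 >= min'
}

# Usage (needs root and bcc-tools):
#   sudo gethostlatency | slow_lookups 50
```

If the slow requests line up with slow lookups here, the four seconds went to the resolver and not to your code.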

System calls

Question: what is this process actually doing?

Container start time has crept from 200ms to 2s. Logs in the container tell you nothing — the slow part is before any of your code runs. execsnoop shows every process launched, system-wide, with arguments and parent. opensnoop shows every file touched. You can see the entire startup, from the operator's seat, in plain English.

Tools: execsnoop · opensnoop · syscount · statsnoop · killsnoop
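
To turn a raw execsnoop capture into an answer, it helps to count what was launched. A minimal sketch, assuming execsnoop's default columns (PCOMM PID PPID RET ARGS, one header line) and a hypothetical helper name, count_execs:

```shell
# Hypothetical helper: count how many times each command was exec'd.
# Assumes default execsnoop output, where the command name is field 1
# and the first line is the header.
count_execs() {
    awk 'NR > 1 { n[$1]++ } END { for (c in n) print n[c], c }' | sort -rn
}

# Usage (needs root and bcc-tools; stop the capture with Ctrl-C once the
# container is up):
#   sudo execsnoop > /tmp/execs.txt
#   count_execs < /tmp/execs.txt
```

A startup that forks the same helper binary hundreds of times shows up at the top of this list immediately.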

VFS · filesystem cache

Question: is my workload actually hitting disk, or is it hitting cache?

A benchmark shows 5 GB/s "read throughput" but the disks are almost idle. cachestat answers this in seconds: your reads are 99% page-cache hits, 1% actual disk reads. The disks aren't slow — you're measuring RAM.

Tools: vfsstat · vfscount · cachestat · mountsnoop
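
The arithmetic behind "99% page-cache hits" is just hits divided by hits plus misses. A small sketch, with a hypothetical helper name (hit_ratio) and the counts passed in by hand, since cachestat's exact column layout differs between bcc versions:

```shell
# Hypothetical helper: page-cache hit ratio from a hits count and a
# misses count, as read off one line of cachestat output.
hit_ratio() {
    awk -v h="$1" -v m="$2" 'BEGIN {
        if (h + m == 0) { print "n/a"; exit }
        printf "%.1f%%\n", 100 * h / (h + m)
    }'
}

# Usage (needs root and bcc-tools): print five one-second samples, then
# feed the HITS and MISSES values from a line into the helper.
#   sudo cachestat 1 5
#   hit_ratio 9900 100
```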

Filesystems · block I/O

Question: which workload is making my disks slow?

Postgres is fine, then suddenly p99 latency triples. biolatency shows a histogram of disk I/O completion times. biotop shows which PIDs are issuing the I/O. ext4slower (or xfsslower) shows you which files are taking >10ms to read. In two minutes you go from "Postgres is slow" to "a backup job on another tenant just started a 20-hour rsync."

Tools: biolatency · biotop · biosnoop · ext4slower · xfsslower
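
A histogram is easier to alert on once it is reduced to one number. The sketch below assumes biolatency's default log2 histogram lines (low -> high : count) in microseconds and uses a hypothetical helper name, slow_share. It reports what share of I/Os landed in buckets starting at or above a threshold.

```shell
# Hypothetical helper: share of I/Os in histogram buckets whose lower
# bound is >= the threshold (same unit as the histogram, usecs by default).
# Only lines shaped like "low -> high : count |***|" are counted.
slow_share() {
    awk -v t="$1" '
        $2 == "->" && $4 == ":" {
            total += $5
            if ($1 + 0 >= t) slow += $5
        }
        END {
            if (total) printf "%.1f%%\n", 100 * slow / total
            else print "n/a"
        }'
}

# Usage (needs root and bcc-tools): one 5-second sample, then the share
# of I/Os that took roughly 8 ms or longer.
#   sudo biolatency 5 1 | slow_share 8192
```

Because the buckets are powers of two, the threshold is only meaningful at bucket boundaries; 8192 is the closest boundary below 10 ms.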

Network stack

Question: who is my host talking to, and is the conversation healthy?

A microservice is timing out intermittently. tcpdump catches packets but doesn't tell you which PID. tcplife records every completed TCP connection (source PID, peer, bytes in and out, duration) into a stream Loki can index. tcpretrans shows you when packets are being retransmitted, by peer. sslsniff shows TLS traffic in the clear without breaking it: it hooks the TLS library's read and write calls and captures the plaintext before encryption and after decryption. The network black box becomes a glass box.

Tools: tcpconnect · tcplife · tcptop · tcpretrans · sslsniff
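
For an intermittent timeout, a per-peer rollup of tcplife output is often enough to spot the bad conversation. A minimal sketch, assuming tcplife's default whitespace-separated table with a header row containing RADDR and MS; the helper name peer_summary is hypothetical. Columns are located by header name, so extra leading columns from -t or -T do not shift the result.

```shell
# Hypothetical helper: per remote address, count connections and report
# the longest connection duration in milliseconds.
# Column positions are looked up from the header line (RADDR, MS).
peer_summary() {
    awk '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        {
            peer = $(col["RADDR"])
            ms = $(col["MS"]) + 0
            n[peer]++
            if (ms > max[peer]) max[peer] = ms
        }
        END {
            for (p in n) printf "%s conns=%d max_ms=%.2f\n", p, n[p], max[p]
        }' | sort
}

# Usage (needs root and bcc-tools; stop the capture with Ctrl-C):
#   sudo tcplife > /tmp/tcplife.txt
#   peer_summary < /tmp/tcplife.txt
```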

Scheduler · memory

Question: my CPU usage looks fine. Why is my service slow?

A pod uses 30% CPU and feels sluggish. CPU isn't the problem — runqlat shows the process is sitting in the scheduler's run queue waiting for a CPU it could have used. The neighbor pod isn't using more cores; it's using them in tight bursts that starve everyone else. offcputime shows the same problem from the application's side: it's blocked, not running.

Tools: runqlat · runqlen · offcputime · profile · memleak · slabratetop
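
To see where a process is blocked, offcputime can emit folded stacks (-f): one line per stack, ending in the total off-CPU time in microseconds. The sketch below uses a hypothetical helper name, top_blocked, and assumes that trailing number is the last whitespace-separated field on each line.

```shell
# Hypothetical helper: print the N folded stacks with the most off-CPU
# time (default 5). The time is taken from the last field of each line.
top_blocked() {
    awk '{ print $NF, $0 }' | sort -rn | head -n "${1:-5}" | cut -d" " -f2-
}

# Usage (needs root and bcc-tools): trace one PID for 10 seconds, then
# list its three most expensive blocking stacks.
#   sudo offcputime -f -p "$PID" 10 > /tmp/offcpu.folded
#   top_blocked 3 < /tmp/offcpu.folded
```

The same folded file can also be rendered as a flame graph if a list of stacks is not enough.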

Drivers · hardware

Question: is the slowness above this line, or below it?

NVMe latency spikes can come from the firmware, the driver, the queue depth, the controller, or a degrading SSD. hardirqs / softirqs show CPU time by interrupt source. llcstat shows last-level-cache references and misses per process. Cache contention is invisible to most everyday tools, and it's behind many of the "we threw more CPU at it and nothing got better" stories.

Tools: hardirqs · softirqs · llcstat · tlbstat · cpudist

Logs, metrics, traces — and what eBPF adds

"We already have observability." Most operators do. They have a Splunk or Loki for logs, a Prometheus for metrics, a Tempo or Jaeger for traces. Those tools are good at what they do. eBPF doesn't replace them. It answers a different question.

Logs

What your application chose to write down. You see what someone remembered to print. Quality depends on the engineer who wrote the code six months ago.

Metrics

Summarized numbers. You see what someone configured a counter for. Great for "is the rate of X going up?", useless for "why did the rate go up at 02:14?"

Traces

A request's path through services. You see what the SDK was told to instrument. Stops at the application boundary — you can't trace into the kernel.

eBPF events

What the kernel actually did. No code changes, no SDK, no cooperation from the application. Every syscall, every block I/O, every TCP connection — visible.

The three traditional pillars all depend on someone, somewhere, having decided in advance to record this thing. eBPF doesn't. It sees what happened, even when nobody thought to log it — especially when nobody thought to log it. That's where the surprises hide.

Common misconceptions

We have Splunk / Loki / Datadog. We're covered.

Those are log aggregators. They store and index what your applications wrote. They do not see kernel events. No matter how many logs you ingest, there is no log line for "the scheduler held this process off-CPU for 80ms" — because nobody wrote the code that would print it. eBPF is the missing layer underneath.

More logging = more visibility.

More logging is more cost and more noise from the same vantage point. Every byte you ingest costs storage, indexing, and retention, and dilutes the signal you are looking for. You still see only what your apps wrote. eBPF adds a different vantage point, the kernel's, for a fraction of the data volume.

We instrumented everything.

You instrumented your code. You did not instrument the syscall layer, the page cache, the run queue, the block-I/O queue depth, the TCP retransmission counter, the last-level cache, or the NVMe controller's interrupt path. eBPF instruments those without you doing anything.

eBPF is too low-level to be useful to operators.

The tools above are one-line invocations. biolatency, execsnoop, tcplife are as easy to run as top. The kernel work was done years ago by Brendan Gregg, Alexei Starovoitov, and a small army of bcc/bpftrace maintainers. You reap their work for free on any reasonably recent kernel: bcc documents Linux 4.1 as its floor, and bpftrace and several of the tools need 4.9 or later.

What this looks like, two commands deep

Just enough code to make it tangible. Both of these work on every kldload install out of the box.

# Histogram of block I/O completion times: one 5-second sample, then exit (drop the 1 to refresh every 5s)
$ sudo biolatency 5 1

# Completed TCP connections on local ports 443 and 80, with timestamps: PID, peer, bytes, duration
$ sudo tcplife -L 443,80 -t

That's it. No agent install, no SDK wiring, no config file. The output is plain text (tcplife can also emit CSV with -s), so it feeds straight into awk, or into a Grafana Loki pipeline if you want to keep it. Out of the box, kldload pipes many of these into a dashboard automatically.

What kldload ships ready

Every kldload install includes the full kernel-observability toolchain, installed and indexed:

bcc-tools: the full library of biolatency / execsnoop / tcplife / ~150 others, symlinked into /usr/local/bin for direct invocation.
bpftrace: the awk of the kernel; one-liners and short scripts that attach probes anywhere.
Cilium & Hubble: for K8s installs, every pod's network traffic is observed in the kernel datapath by eBPF programs, not by a sidecar proxy.
Tetragon: security-policy enforcement and process-exec audit, all in eBPF.
Parca / Pyroscope: continuous CPU profiling, always-on, attached to every process without instrumentation.
Prometheus exporters: the latency histograms from biolatency and runqlat, and the retransmit counts from tcpretrans, among others, are exposed as native Prometheus metrics, ready for Grafana.

The whole stack lives under Metrics in the kldload web UI — a Grafana embed with dashboards already wired up. You don't have to know which tool to run; the dashboards already do.

Modern Linux · 2 of 3

What ZFS obsoletes

If the kernel is the platform, the filesystem is most of it — and most of the storage stack you've been installing for 30 years is no longer necessary.
