Observability — Intermediate

You know execsnoop and bpftrace one-liners. Now build real-time monitoring: eBPF-powered socket tracking, latency measurement with Prometheus metrics, and dashboards that show what’s happening inside the kernel.

The beginner page showed you ad-hoc tools — run a command, see what’s happening now, close it. This page bridges to continuous monitoring: eBPF programs that run as services, export metrics to Prometheus, and show up on Grafana dashboards. The difference: ad-hoc tools tell you what’s happening when you look. Continuous monitoring tells you what happened while you weren’t looking. Both matter. The workflow: Prometheus alert fires → Grafana shows the spike → SSH in → run bcc/bpftrace from the beginner page to find the root cause. Continuous monitoring is your eyes. Ad-hoc tools are your microscope. For the full Prometheus + Grafana + Alertmanager stack setup, see the Observability Masterclass.

socket_snoop hooks into the kernel at the TCP state machine level: it sees every connection state transition (SYN_SENT, ESTABLISHED, FIN_WAIT1, FIN_WAIT2, TIME_WAIT, CLOSE, and so on) as it happens. This is fundamentally different from ss, which shows a snapshot, or tcpdump, which shows packets; socket_snoop shows state changes. A connection stuck in SYN_SENT means the remote isn't answering the SYN. A flood of TIME_WAIT means connections are being opened and closed rather than reused. Retransmissions point to packet loss. The pattern tells the story.

socket_snoop — Real-time TCP state monitoring

socket_snoop hooks into the kernel’s inet_sock_set_state tracepoint and logs every TCP state change — connections opening, closing, hanging in TIME_WAIT, retransmitting. It’s a lightweight alternative to running tcpdump or ss in a loop.
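To make the numeric side of that tracepoint concrete, here is a minimal Python sketch of how the integer old/new states reported by inet_sock_set_state could be turned into readable names. The numbers follow the kernel's TCP state enum; the helper name describe_transition is hypothetical and is not taken from socket_snoop's source.

```python
# Kernel TCP state numbers (include/net/tcp_states.h).
TCP_STATES = {
    1: "ESTABLISHED",
    2: "SYN_SENT",
    3: "SYN_RECV",
    4: "FIN_WAIT1",
    5: "FIN_WAIT2",
    6: "TIME_WAIT",
    7: "CLOSE",
    8: "CLOSE_WAIT",
    9: "LAST_ACK",
    10: "LISTEN",
    11: "CLOSING",
    12: "NEW_SYN_RECV",
}


def describe_transition(oldstate: int, newstate: int) -> str:
    """Render a state change such as 'SYN_SENT -> ESTABLISHED'.

    Hypothetical helper for illustration; unknown numbers are kept visible
    instead of raising, so a newer kernel state does not crash the logger.
    """
    old = TCP_STATES.get(oldstate, f"UNKNOWN({oldstate})")
    new = TCP_STATES.get(newstate, f"UNKNOWN({newstate})")
    return f"{old} -> {new}"


if __name__ == "__main__":
    print(describe_transition(2, 1))  # client handshake completed
    print(describe_transition(1, 4))  # local side started closing
```

A lookup table like this is usually enough, because the tracepoint already delivers both states and user space only has to format them.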

Install

cd /opt/linux-tools/debian/monitoring

# One-shot install (system deps + Python venv)
chmod +x install-deps-debian.sh
sudo ./install-deps-debian.sh

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Or use the Makefile:

make setup    # system deps
make deps     # Python venv + pip
make test     # run tests
sudo make run # start monitoring

Run it

sudo .venv/bin/python socket_snoop.py --log-file /var/log/socket_monitor.log

Output (live to console + file):

Mar 21 2026 14:23:25.454 State Change: SRC=10.100.10.150:39134 DST=10.100.10.202:8000 PID=30512 COMM=nginx STATE=Connection Closing (FIN_WAIT1)
Mar 21 2026 14:23:25.455 State Change: SRC=10.100.10.150:39134 DST=10.100.10.202:8000 PID=30512 COMM=nginx STATE=Waiting (FIN_WAIT2)
Mar 21 2026 14:23:25.501 State Change: SRC=10.100.10.150:39134 DST=10.100.10.202:8000 PID=30512 COMM=nginx STATE=Cooldown (TIME_WAIT)
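If you want to post-process the log file, the line format above is regular enough to parse. The following is a rough sketch that assumes every line looks exactly like the three samples (IPv4 addresses, millisecond timestamps, the kernel state name in parentheses); parse_line is a hypothetical helper, not part of socket_snoop.

```python
import re
from datetime import datetime

# Pattern for the IPv4 log lines shown above; IPv6 addresses are not handled.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}\.\d+) State Change: "
    r"SRC=(?P<src_ip>[\d.]+):(?P<src_port>\d+) "
    r"DST=(?P<dst_ip>[\d.]+):(?P<dst_port>\d+) "
    r"PID=(?P<pid>\d+) COMM=(?P<comm>\S+) "
    r"STATE=.*\((?P<state>[A-Z0-9_]+)\)\s*$"
)


def parse_line(line):
    """Return a dict for one 'State Change' line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ev = m.groupdict()
    # %b expects English month abbreviations in the default C locale.
    ev["ts"] = datetime.strptime(ev["ts"], "%b %d %Y %H:%M:%S.%f")
    for field in ("src_port", "dst_port", "pid"):
        ev[field] = int(ev[field])
    return ev


if __name__ == "__main__":
    sample = (
        "Mar 21 2026 14:23:25.454 State Change: SRC=10.100.10.150:39134 "
        "DST=10.100.10.202:8000 PID=30512 COMM=nginx "
        "STATE=Connection Closing (FIN_WAIT1)"
    )
    print(parse_line(sample))
```

Returning None for non-matching lines lets you feed the whole file through the parser and simply skip anything that is not a state change.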

Filter by process, IP, or port

# Only watch one nginx process (plain `pgrep nginx` prints the master and every worker PID; -n picks the newest)
sudo .venv/bin/python socket_snoop.py --pid "$(pgrep -n nginx)"

# Only watch connections to the database
sudo .venv/bin/python socket_snoop.py --dst-ip 10.100.10.50 --dst-port 5432

# Only active connections (skip TIME_WAIT noise)
sudo .venv/bin/python socket_snoop.py --active-only
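The filters above can also be reproduced when reading a saved log. This sketch shows the matching logic as a plain predicate. It assumes events are dicts with pid, dst_ip, dst_port and state keys, and that "active only" means dropping TIME_WAIT events, as the comment above suggests; the real tool's semantics may differ.

```python
def event_matches(ev, pid=None, dst_ip=None, dst_port=None, active_only=False):
    """Return True if the event passes every filter that was supplied.

    A filter left at None is ignored, so calling with no filters accepts all.
    """
    if pid is not None and ev["pid"] != pid:
        return False
    if dst_ip is not None and ev["dst_ip"] != dst_ip:
        return False
    if dst_port is not None and ev["dst_port"] != dst_port:
        return False
    # Assumption: "active only" simply hides TIME_WAIT entries.
    if active_only and ev["state"] == "TIME_WAIT":
        return False
    return True


if __name__ == "__main__":
    events = [
        {"pid": 30512, "dst_ip": "10.100.10.50", "dst_port": 5432, "state": "ESTABLISHED"},
        {"pid": 30512, "dst_ip": "10.100.10.50", "dst_port": 5432, "state": "TIME_WAIT"},
        {"pid": 411, "dst_ip": "10.100.10.202", "dst_port": 8000, "state": "SYN_SENT"},
    ]
    db_only = [e for e in events if event_matches(e, dst_ip="10.100.10.50", dst_port=5432)]
    print(len(db_only))
```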

Run as a systemd service

sudo cp systemd/socket-snoop.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now socket-snoop.service

# Follow the service logs
journalctl -u socket-snoop -f

Run in Docker

sudo docker build -t socket-snoop:latest .
sudo docker run --rm -it \
  --privileged --pid=host --net=host \
  -v /lib/modules:/lib/modules:ro \
  -v /usr/src:/usr/src:ro \
  -v /sys:/sys:ro \
  -v /var/log:/var/log \
  socket-snoop:latest \
  /app/.venv/bin/python /app/socket_snoop.py --log-file /var/log/socket_monitor.log

What to look for

Pattern	Meaning
Many SYN_SENT with no following ESTABLISHED	Remote not listening (connection refused) or a firewall dropping the SYN
Piling up TIME_WAIT	High connection churn — consider connection pooling
Retransmissions	Network packet loss or congestion
Long-lived ESTABLISHED	Persistent connections (database pools, WebSockets)
CLOSE_WAIT accumulating	Application not closing sockets (resource leak)
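The table can be turned into a simple automated check. The sketch below keeps the last observed state per connection and flags destinations where several connections sit in SYN_SENT, TIME_WAIT, or CLOSE_WAIT. The event shape (src, dst, state) and the threshold of 3 are assumptions chosen for illustration; tune both for real traffic.

```python
from collections import Counter

# Hints paraphrased from the table above.
HINTS = {
    "SYN_SENT": "no ESTABLISHED seen: connection refused or firewall blocking",
    "TIME_WAIT": "high connection churn: consider connection pooling",
    "CLOSE_WAIT": "application not closing sockets",
}


def last_state_counts(events):
    """Count connections by their most recent state, grouped by destination."""
    last = {}
    for ev in events:
        last[(ev["src"], ev["dst"])] = ev["state"]  # later events overwrite earlier ones
    counts = Counter()
    for (_src, dst), state in last.items():
        counts[(dst, state)] += 1
    return counts


def findings(counts, threshold=3):
    """Return one message per (destination, state) pair at or above the threshold."""
    out = []
    for (dst, state), n in sorted(counts.items()):
        if state in HINTS and n >= threshold:
            out.append(f"{dst}: {n} connections in {state} ({HINTS[state]})")
    return out


if __name__ == "__main__":
    events = [
        {"src": f"10.0.0.1:{port}", "dst": "10.0.0.9:5432", "state": "SYN_SENT"}
        for port in (40001, 40002, 40003)
    ]
    for message in findings(last_state_counts(events)):
        print(message)
```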

latency_snoop answers the question that network teams fight about most: "is it the network or the application?" It measures actual TCP connect latency — the time between SYN and SYN-ACK. If that's high, it's the network (or the remote server's TCP stack). If connect latency is fine but application response is slow, the problem is in the application layer. The Prometheus histogram export means you get p50/p95/p99 percentiles over time, which is how you catch intermittent latency spikes that averages hide.

latency_snoop — TCP latency with Prometheus

latency_snoop goes deeper — it measures the actual time between TCP state transitions and exports metrics to Prometheus.
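Measuring "time between state transitions" boils down to remembering when a socket entered SYN_SENT and subtracting that from the moment it reaches ESTABLISHED. Here is a user-space sketch of that pairing logic. It assumes transitions arrive as (timestamp_ns, socket_id, new_state) tuples; the real tool presumably does the bookkeeping in an eBPF map, and the function name is hypothetical.

```python
def connect_latencies_ms(transitions):
    """Pair SYN_SENT with ESTABLISHED per socket and return latencies in ms.

    transitions: iterable of (timestamp_ns, socket_id, new_state_name),
    ordered by time. Sockets that close before ESTABLISHED yield no sample.
    """
    started = {}  # socket_id -> timestamp of SYN_SENT
    samples = []
    for ts_ns, sock, state in transitions:
        if state == "SYN_SENT":
            started[sock] = ts_ns
        elif state == "ESTABLISHED" and sock in started:
            samples.append((sock, (ts_ns - started.pop(sock)) / 1e6))
        elif state == "CLOSE":
            started.pop(sock, None)  # handshake never completed
    return samples


if __name__ == "__main__":
    demo = [
        (1_000_000, "a", "SYN_SENT"),
        (1_750_000, "a", "ESTABLISHED"),
        (2_000_000, "b", "SYN_SENT"),
        (5_000_000, "b", "CLOSE"),
    ]
    print(connect_latencies_ms(demo))
```

Dropping the entry on CLOSE matters: without it, failed connection attempts would leak entries and could later be paired with an unrelated socket that reuses the same identifier.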

Install

Same venv as socket_snoop, plus:

source /opt/linux-tools/debian/monitoring/.venv/bin/activate
pip install prometheus_client

Run with Prometheus exporter

sudo .venv/bin/python latency_snoop.py \
  --prometheus-port 9900 \
  --log-file /var/log/latency_monitor.log

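Now http://localhost:9900/metrics serves Prometheus metrics (sample abbreviated; a real histogram also exposes the le="+Inf" bucket plus the _sum and _count series):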

# HELP tcp_connect_latency_ms TCP connect latency (SYN_SENT to ESTABLISHED)
# TYPE tcp_connect_latency_ms histogram
tcp_connect_latency_ms_bucket{le="1.0"} 142
tcp_connect_latency_ms_bucket{le="5.0"} 203
tcp_connect_latency_ms_bucket{le="10.0"} 215
tcp_connect_latency_ms_bucket{le="50.0"} 218
tcp_connect_latency_ms_bucket{le="100.0"} 218

# HELP tcp_retransmits_total TCP segment retransmissions
# TYPE tcp_retransmits_total counter
tcp_retransmits_total 7
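Histogram buckets in this format are cumulative: each le bucket counts every observation at or below that bound. The sketch below parses the bucket lines from the sample above and converts them to per-range counts, which is often easier to read. The helper names are hypothetical, and the regex only handles series whose sole label is le (so not the --per-flow variant).

```python
import re

BUCKET_RE = re.compile(r'^(?P<name>\w+)_bucket\{le="(?P<le>[^"]+)"\}\s+(?P<count>\S+)$')

SAMPLE = """\
# TYPE tcp_connect_latency_ms histogram
tcp_connect_latency_ms_bucket{le="1.0"} 142
tcp_connect_latency_ms_bucket{le="5.0"} 203
tcp_connect_latency_ms_bucket{le="10.0"} 215
tcp_connect_latency_ms_bucket{le="50.0"} 218
tcp_connect_latency_ms_bucket{le="100.0"} 218
tcp_retransmits_total 7
"""


def parse_buckets(text, metric):
    """Return sorted (upper_bound, cumulative_count) pairs for one histogram."""
    buckets = []
    for line in text.splitlines():
        m = BUCKET_RE.match(line.strip())
        if m and m.group("name") == metric:
            buckets.append((float(m.group("le")), float(m.group("count"))))
    return sorted(buckets)


def per_bucket_counts(buckets):
    """Convert cumulative counts into counts per (lower, upper] range."""
    out = []
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        out.append(((prev_le, le), count - prev_count))
        prev_le, prev_count = le, count
    return out


if __name__ == "__main__":
    for (low, high), n in per_bucket_counts(parse_buckets(SAMPLE, "tcp_connect_latency_ms")):
        print(f"({low}, {high}] ms: {n:.0f} connections")
```

Read this way, the sample says that 142 of 218 connects finished within 1 ms and only 3 took longer than 10 ms.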

Advanced options

# JSON output (for piping to jq, Elasticsearch, etc.)
sudo .venv/bin/python latency_snoop.py --json

# Per-flow metrics (WARNING: high cardinality — use for debugging, not production)
sudo .venv/bin/python latency_snoop.py --prometheus-port 9900 --per-flow

# Collect user-space stack traces
sudo .venv/bin/python latency_snoop.py --stacks --prometheus-port 9900

# Custom histogram buckets (in milliseconds)
sudo .venv/bin/python latency_snoop.py --prometheus-port 9900 \
  --buckets "0.5,1,2,5,10,25,50,100,250,500,1000"

# Filter to just database traffic
sudo .venv/bin/python latency_snoop.py \
  --dst-port 5432 \
  --prometheus-port 9900

What latency_snoop measures

Metric	What it captures
Connect latency	Time from SYN_SENT → ESTABLISHED (TCP handshake)
RTT (srtt_us)	Kernel’s smoothed round-trip time estimate
RTT deviation (mdev_us)	Jitter — how much RTT varies
Retransmits	Packets the kernel had to resend
Process metadata	PID, thread ID, parent PID, UID, command name
Cgroup/K8s	Cgroup ID, pod UID, container ID (if running in K8s)
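One detail worth knowing when reading the RTT fields: in the kernel's struct tcp_sock, srtt_us stores the smoothed RTT multiplied by 8 and mdev_us stores the deviation multiplied by 4 (the kernel shifts them right by 3 and 2 when it fills tcp_info). Whether latency_snoop already removes that scaling before export is not stated on this page, so treat the helpers below as a sketch for raw kernel values only; their names are hypothetical.

```python
def srtt_ms(raw_srtt_us: int) -> float:
    """Convert a raw tcp_sock.srtt_us value (8x scaled, microseconds) to ms."""
    return (raw_srtt_us >> 3) / 1000.0


def mdev_ms(raw_mdev_us: int) -> float:
    """Convert a raw tcp_sock.mdev_us value (4x scaled, microseconds) to ms."""
    return (raw_mdev_us >> 2) / 1000.0


if __name__ == "__main__":
    # A raw srtt_us of 8000 corresponds to a 1 ms smoothed RTT.
    print(srtt_ms(8000), mdev_ms(2000))
```

If the numbers you see look roughly eight times larger than ping reports, missing unscaling is a likely cause.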

Wire it into Prometheus + Grafana

Add latency_snoop to your Prometheus config

# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: "latency-snoop"
    static_configs:
      - targets: ["localhost:9900"]

sudo systemctl restart prometheus

Build a Grafana dashboard

Create panels with these queries:

Connect latency (p50 / p95 / p99):

histogram_quantile(0.50, rate(tcp_connect_latency_ms_bucket[5m]))
histogram_quantile(0.95, rate(tcp_connect_latency_ms_bucket[5m]))
histogram_quantile(0.99, rate(tcp_connect_latency_ms_bucket[5m]))

Retransmit rate:

rate(tcp_retransmits_total[5m])

Connection rate:

rate(tcp_connect_latency_ms_count[5m])
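The histogram_quantile queries above estimate a percentile by finding the bucket that contains the target rank and interpolating linearly inside it. The sketch below reimplements that idea in Python on the sample counts from earlier on this page. Two assumptions: the +Inf bucket is taken to equal the le="100.0" count (218), since the sample omits it, and raw counts are used where Prometheus would use per-second rates. It is a simplified illustration, not Prometheus's exact code.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from sorted (upper_bound, cumulative_count) pairs.

    The last bucket must be +Inf. The lowest bucket is assumed to start at 0.
    """
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le) or count == prev_count:
                # Beyond the last finite bound (or an empty bucket): no interpolation.
                return prev_le
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan


# Sample from this page; the +Inf value is an assumption.
SAMPLE_BUCKETS = [(1.0, 142), (5.0, 203), (10.0, 215), (50.0, 218), (100.0, 218), (math.inf, 218)]

if __name__ == "__main__":
    for q in (0.50, 0.95, 0.99):
        print(f"p{int(q * 100)}: {histogram_quantile(q, SAMPLE_BUCKETS):.2f} ms")
```

Because the estimate is interpolated within a bucket, its precision is bounded by bucket width; that is the practical reason to pick bucket boundaries close to the latencies you care about.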

Alert on latency spikes

# /etc/prometheus/alerts.yml
# (Prometheus only loads this file if prometheus.yml lists it under rule_files)
groups:
  - name: latency
    rules:
      - alert: HighTCPConnectLatency
        expr: histogram_quantile(0.95, rate(tcp_connect_latency_ms_bucket[5m])) > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 TCP connect latency >50ms on {{ $labels.instance }}"

      - alert: HighRetransmitRate
        expr: rate(tcp_retransmits_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TCP retransmit rate >10/s on {{ $labels.instance }}"
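Both alert expressions rest on rate(), which turns an ever-growing counter into a per-second figure over the window. As a rough illustration of what is being compared against the threshold of 10, here is a simplified version in Python. It handles counter resets (for example after latency_snoop restarts) but skips the window-edge extrapolation that Prometheus performs, so results will differ slightly from the real function.

```python
def simple_rate(samples):
    """Per-second increase of a counter over the given samples.

    samples: list of (unix_seconds, counter_value), oldest first.
    Returns None when there are fewer than two samples or no elapsed time.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter restarted from zero.
        increase += value if value < prev else value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


if __name__ == "__main__":
    window = [(0, 100), (60, 700), (120, 1300), (180, 1900), (240, 2500), (300, 3100)]
    rate = simple_rate(window)
    print(rate, rate > 10)
```

In this example the rate is exactly 10 retransmits per second, so the `> 10` comparison stays false and the alert would not start its 5-minute pending period.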

Combine socket_snoop + latency_snoop + Prometheus

Run both as systemd services:

# socket_snoop — event log (console + file)
# latency_snoop — metrics (Prometheus)

# Create latency_snoop service
sudo tee /etc/systemd/system/latency-snoop.service > /dev/null << 'EOF'
[Unit]
Description=TCP Latency Monitor (eBPF)
After=network.target

[Service]
ExecStart=/opt/linux-tools/debian/monitoring/.venv/bin/python \
  /opt/linux-tools/debian/monitoring/latency_snoop.py \
  --prometheus-port 9900 \
  --log-file /var/log/latency_monitor.log
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now socket-snoop latency-snoop

Now you have:

- socket_snoop → real-time event stream in /var/log/socket_monitor.log (grep it, tail it, feed to LogHog)
- latency_snoop → continuous metrics on :9900 (scrape with Prometheus, visualize in Grafana)

Email infrastructure auditing with mail-audit

If your kldload system runs a mail server or you need to audit mail delivery:

cd /opt/linux-tools/debian/email

# Install dependencies (inside an activated venv; recent Debian releases block system-wide pip installs)
pip install dnspython cryptography pyOpenSSL requests

# Audit a domain
./mail-audit.py example.com

Generates example.com.json + example.com.txt with:

- SPF record analysis (DNS cost, over-limit detection)
- DKIM selector discovery (brute-force scan)
- DMARC policy parsing and linting
- TLS/STARTTLS handshake testing per MX
- DANE/TLSA record checks
- MTA-STS policy fetch
- DNSBL blacklist checks (Spamhaus, Spamcop, Barracuda, etc.)
- Per-MX port probing (25, 465, 587, IMAP, POP)
- Overall score: Authentication (40%), Transport (30%), Hygiene (20%), Client (10%)
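To illustrate what "DNS cost" means for SPF: RFC 7208 limits evaluation to 10 DNS-querying terms (include, a, mx, ptr, exists, and the redirect modifier). The sketch below counts those terms in a single record string. It is a simplified, non-recursive illustration with a hypothetical function name; a real audit, presumably including mail-audit, also has to follow each include and redirect and add up the nested cost.

```python
# Terms that trigger a DNS query during SPF evaluation (RFC 7208, section 4.6.4).
LOOKUP_TERMS = ("include", "a", "mx", "ptr", "exists", "redirect")
SPF_LOOKUP_LIMIT = 10


def count_spf_lookup_terms(record: str) -> int:
    """Count DNS-querying terms in one SPF record (nested includes not followed)."""
    count = 0
    for term in record.split()[1:]:  # skip the leading "v=spf1"
        term = term.lstrip("+-~?").lower()  # drop the qualifier
        # The term name ends at the first ':', '=' or '/'.
        name = term.replace("=", ":").replace("/", ":").split(":", 1)[0]
        if name in LOOKUP_TERMS:
            count += 1
    return count


if __name__ == "__main__":
    record = "v=spf1 a mx/24 include:_spf.example.net ip4:192.0.2.0/24 ~all"
    used = count_spf_lookup_terms(record)
    print(f"{used} of {SPF_LOOKUP_LIMIT} lookups used at the top level")
```

Terms such as ip4, ip6, and all cost nothing, which is why flattening includes into explicit IP ranges is a common way to get back under the limit.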

# Verbose mode
./mail-audit.py example.com -vv

# Rate-limited (for batch scanning)
./mail-audit.py example.com --max-qps 2

# Skip port 25 checks (cloud environments block outbound SMTP)
./mail-audit.py example.com --assume-port25-blocked

# Batch scan
xargs -a domains.txt -I{} ./mail-audit.py {} --max-qps 1

Next level

Ready to write your own eBPF programs in C, build custom kernel modules, and deploy kldload images to AWS/Azure? Move on to Observability — Advanced.
