Containers Masterclass
This guide covers the complete container stack on kldload: from Linux namespace fundamentals and cgroup resource controls through Podman, Docker compatibility, the ZFS storage driver, rootless containers, container networking, pods, systemd integration via Quadlet, Firecracker microVMs, SELinux MCS isolation, image management with Buildah, multi-arch builds, local registries for air-gapped deployments, GPU containers, monitoring, and a comprehensive troubleshooting reference. By the end you will understand every layer of the container stack and know how to run production container workloads on ZFS with proper isolation, networking, and observability.
The premise: Containers are the most misunderstood technology in modern infrastructure. Most engineers treat them as lightweight VMs. They are not. A container is a set of kernel isolation primitives — namespaces, cgroups, seccomp, LSM — applied to an ordinary Linux process. There is no container kernel. There is no hypervisor. There is just your process, running on the host kernel, with restricted views of the filesystem, network, PIDs, and resources. Understanding this changes everything about how you design, deploy, secure, and debug containerized workloads.
What this page covers: Linux namespaces and cgroups fundamentals, the OCI runtime specification, Podman vs Docker architecture, the ZFS storage driver and dataset-per-container model, rootless containers with user namespaces, container networking with CNI and Netavark, Podman pods and Kubernetes YAML portability, systemd Quadlet integration, Firecracker microVMs and kata-containers for hardware isolation, SELinux MCS label isolation, Buildah image builds, multi-arch manifests, local registries for darksite environments, GPU passthrough to containers, monitoring and logging, and a complete troubleshooting reference.
Prerequisites: a running kldload system with ZFS on root. The Firecracker sections assume KVM is available. The GPU sections assume the GPU & NVIDIA Masterclass setup is complete. Networking sections build on the Backplane Networks Masterclass.
Why the isolation primitives matter more than the runtime
A container shares the host kernel, so every kernel vulnerability is a container vulnerability. A container escape is not like a VM escape: it is generally easier, because there is no hypervisor boundary to cross. This is why understanding the actual isolation primitives (namespaces, cgroups, seccomp, SELinux) matters more than understanding any particular container runtime. The runtime is just the tool that sets up those primitives. The primitives are the security boundary.
1. Container Fundamentals
Before touching Podman or Docker, you need to understand what a container actually is at the kernel level. A container is not a thing — it is a collection of kernel features applied to a process. There is no "container" data structure in the Linux kernel. There are namespaces (which restrict what a process can see), cgroups (which restrict what a process can use), seccomp (which restricts what system calls a process can make), and LSM hooks like SELinux and AppArmor (which enforce mandatory access control). A "container runtime" is simply a program that creates a process with the right combination of these restrictions.
Linux Namespaces
Namespaces partition kernel resources so that one set of processes sees one set of resources and another set of processes sees a different set. There are eight namespace types in modern Linux kernels. Each namespace isolates a specific kernel resource.
Mount Namespace (mnt)
Isolates the filesystem mount table. Each container gets its own view of the filesystem tree. The container sees its rootfs at / and cannot see the host's mounts. Created with CLONE_NEWNS. This is the oldest namespace, added in Linux 2.4.19 (2002).
PID Namespace (pid)
Isolates the process ID number space. The first process in the container is PID 1 inside the namespace, but has a different PID (e.g., 45832) on the host. The container cannot see or signal host processes. CLONE_NEWPID.
Network Namespace (net)
Isolates network devices, IP addresses, routing tables, firewall rules, and /proc/net. Each container gets its own network stack — its own eth0, its own IP, its own iptables. Traffic between namespaces crosses a virtual bridge or veth pair. CLONE_NEWNET.
User Namespace (user)
Maps UIDs and GIDs inside the namespace to different UIDs/GIDs outside. Root (UID 0) inside the container can be UID 100000 on the host — an unprivileged user. This is the foundation of rootless containers. CLONE_NEWUSER.
UTS Namespace (uts)
Isolates the hostname and NIS domain name. Each container can have its own hostname without affecting the host. Simple but essential — many applications use hostname for identity. CLONE_NEWUTS.
IPC Namespace (ipc)
Isolates System V IPC objects (shared memory segments, message queues, semaphore arrays) and POSIX message queues. Prevents containers from interfering with each other's IPC. CLONE_NEWIPC.
Cgroup Namespace (cgroup)
Virtualizes the view of /proc/self/cgroup. The container sees its cgroup root as / rather than its actual position in the host's cgroup hierarchy. Prevents information leakage about host cgroup structure. CLONE_NEWCGROUP.
Time Namespace (time)
Added in Linux 5.6. Allows per-namespace offsets for CLOCK_MONOTONIC and CLOCK_BOOTTIME. Useful for migrating containers between hosts with different uptimes without breaking timers. CLONE_NEWTIME.
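The namespace types above are visible for any process under /proc/<pid>/ns, where each entry is a symlink such as mnt:[4026531841]. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function name list_namespaces is made up for this illustration. Two processes share a namespace when the inode numbers match.

```shell
#!/bin/sh
# Print "name inode" for every namespace a process belongs to.
# list_namespaces is a hypothetical helper name, not part of any tool.
list_namespaces() {
    pid="${1:-self}"
    for link in /proc/"$pid"/ns/*; do
        # Each entry is a symlink whose target looks like "mnt:[4026531841]"
        target=$(readlink "$link") || continue
        name=${link##*/}
        inode=${target#*:\[}
        inode=${inode%\]}
        printf '%s %s\n' "$name" "$inode"
    done
}

# Namespaces of the current shell
list_namespaces self
```

Comparing the output for a host shell with the output for a container's host PID shows which namespaces the runtime actually unshared.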
Cgroups v2
Control groups (cgroups) limit, account for, and isolate resource usage. Cgroups v2 is the unified hierarchy: all controllers (cpu, memory, io, pids) live in a single tree. kldload uses cgroups v2 exclusively. The commands below show how to confirm this and how Podman's resource flags translate into controller settings:
# View the cgroup hierarchy for a container
podman inspect --format '{{.State.CgroupPath}}' my-container
# Check cgroup v2 is active
stat -fc %T /sys/fs/cgroup/
# Returns "cgroup2fs" on cgroups v2
# Set memory limit to 2GB and CPU to 1.5 cores
podman run --memory 2g --cpus 1.5 my-image
# These translate to cgroup writes:
# memory.max = 2147483648
# cpu.max = 150000 100000 (150ms per 100ms period = 1.5 CPUs)
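The arithmetic behind those two cgroup writes can be sketched in a few lines of shell. This is an illustration only: the helper names cpus_to_cpu_max and mem_to_bytes are invented here, the 100000 microsecond period is the value shown above, and the memory helper assumes integer values with binary suffixes (k, m, g).

```shell
#!/bin/sh
# Convert a --cpus value into the "quota period" pair written to cpu.max.
# Assumes the 100000 microsecond period used in the example above.
cpus_to_cpu_max() {
    period=100000
    awk -v cpus="$1" -v period="$period" \
        'BEGIN { printf "%.0f %d\n", cpus * period, period }'
}

# Convert a --memory value such as 2g or 512m into bytes for memory.max.
# Integer values only; suffixes are treated as binary multiples.
mem_to_bytes() {
    value=${1%[bkmgBKMG]}
    suffix=${1#"$value"}
    case "$suffix" in
        ""|b|B) mult=1 ;;
        k|K)    mult=1024 ;;
        m|M)    mult=1048576 ;;
        g|G)    mult=1073741824 ;;
        *)      echo "unknown suffix: $suffix" >&2; return 1 ;;
    esac
    echo $(( value * mult ))
}

cpus_to_cpu_max 1.5   # prints: 150000 100000
mem_to_bytes 2g       # prints: 2147483648
```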
OCI Runtime Specification
The Open Container Initiative (OCI) defines three specifications: the Runtime Specification
(how to run a container), the Image Specification (how to package a container image), and
the Distribution Specification (how to push/pull images). The runtime spec defines a JSON
config (config.json) that describes namespaces, cgroups, mounts, capabilities, and the
process to execute. Any OCI-compliant runtime (runc, crun, youki, kata-runtime) can
execute any OCI-compliant image. This is what makes containers portable — not Docker, not
Podman, but the OCI spec.
# Inspect the OCI config that Podman generates for a container
podman create --name test alpine sleep 3600
podman inspect test | jq '.[0].OCIRuntime'
# "crun" — kldload uses crun by default (written in C, faster than runc)
# Export the container's root filesystem (the rootfs half of an OCI bundle)
mkdir -p /tmp/bundle
podman export test | tar -C /tmp/bundle -xf -
ls /tmp/bundle/
# bin dev etc home lib ... (the rootfs)
# The runtime config (config.json) is generated when the container is started.
# Ask Podman where it lives rather than hard-coding a storage-driver-specific path.
jq '.linux.namespaces' "$(podman inspect --format '{{.OCIConfigPath}}' test)"
# [{"type":"pid"},{"type":"network"},{"type":"mount"},{"type":"ipc"},{"type":"uts"},{"type":"cgroup"}]
2. Podman vs Docker
kldload ships Podman as the default container engine. Podman is a daemonless engine that
supports rootless operation and is command-line compatible with Docker. You can alias docker to
podman and most workflows work unchanged. But the architectural differences matter deeply
for security, reliability, and systemd integration.
Architecture: fork-exec vs client-server
Docker uses a client-server model: the CLI talks to a daemon (dockerd) that does all the
work. By default the daemon runs as root, and every container operation goes through this
single root-owned daemon. If the daemon stops, its containers stop with it unless live-restore
is enabled. If the daemon is compromised, the attacker has root on the host.
Podman uses a fork-exec model: each container is a direct child of the process that started it. No daemon. No single point of failure. No permanent root process. Processes are children of the process that started them, managed by the init system (systemd), not by a bespoke daemon that reimplements half of an init system badly.
| Feature | Podman | Docker |
|---|---|---|
| Architecture | Fork-exec, daemonless | Client-server daemon (dockerd) |
| Root required | No, rootless by default | By default, yes (dockerd runs as root; rootless mode is optional) |
| Container parent | conmon (per-container monitor) | containerd-shim |
| Daemon crash | Containers unaffected (no daemon) | All containers may die |
| OCI runtime | crun (default), runc, kata | runc (default) |
| systemd integration | Native — Quadlet, socket activation | Restart policies only |
| Pod support | Native — shares namespaces like K8s | Docker Compose (different model) |
| SELinux | Full MCS label separation | Supported but less integrated |
| Docker CLI compat | Most commands work unchanged (alias docker=podman) | N/A |
# Install Podman on kldload (already included in desktop/server profiles)
dnf install -y podman podman-plugins buildah skopeo
# Docker compatibility — alias or install podman-docker
dnf install -y podman-docker
# This creates /usr/bin/docker -> podman and emulates the Docker socket
# Verify the setup
podman info --format '{{.Host.OCIRuntime.Name}}'
# crun
podman info --format '{{.Store.GraphDriverName}}'
# zfs (on kldload with ZFS root)
# Run a container — identical syntax to docker
podman run --rm -it alpine:3.19 cat /etc/os-release
Docker Compose compatibility
# Podman supports Docker Compose files via podman-compose or the docker-compose binary
# pointing at the Podman socket
# Enable the Podman socket (emulates Docker socket)
systemctl --user enable --now podman.socket
# Verify the socket exists
ls -la /run/user/$(id -u)/podman/podman.sock
# Point docker-compose at Podman
export DOCKER_HOST=unix:///run/user/$(id -u)/podman/podman.sock
docker-compose up -d
# Or use podman-compose directly
pip install podman-compose
podman-compose up -d
3. ZFS Storage Driver
When Podman or Docker runs on a ZFS filesystem, it can use the ZFS storage driver. This is one of the most powerful and least understood container storage configurations. Instead of overlay filesystems (which stack layers using OverlayFS), the ZFS driver creates a ZFS dataset per image layer and a ZFS clone per container. Every layer is a proper ZFS dataset with its own properties, compression, and snapshot lineage.
ZFS vs OverlayFS for container storage
OverlayFS, the default storage driver on most Linux distributions, is a kernel filesystem that presents a merged view of multiple directories. It has no block checksums, block-level copy-on-write, or transactional writes of its own; data integrity depends entirely on the filesystem underneath it. On a backing filesystem without data checksums, an interrupted write can leave a layer inconsistent, and a flipped bit on disk is served to the container without any error.
ZFS closes these gaps. Every block is checksummed. Every write is transactional. Every layer
is a proper dataset with its own compression policy, quota, and snapshot lineage. You can
snapshot a container, clone it, send it to another host with zfs send, and roll it back
— using the same ZFS tools you use for everything else. No special container-specific
backup tools needed.
How it works
# When Podman pulls an image, each layer becomes a ZFS dataset
podman pull nginx:alpine
# View the ZFS datasets created
zfs list -r -o name,used,refer,mountpoint rpool/var/lib/containers
# NAME USED REFER MOUNTPOINT
# rpool/var/lib/containers 856M 128K /var/lib/containers
# rpool/var/lib/containers/storage 856M 128K /var/lib/containers/storage
# rpool/var/lib/containers/storage/zfs 854M 128K /var/lib/containers/storage/zfs
# rpool/var/lib/containers/storage/zfs/abc123def456 3.2M 3.2M legacy (base layer)
# rpool/var/lib/containers/storage/zfs/789abc012def 28M 28M legacy (nginx layer)
# rpool/var/lib/containers/storage/zfs/345ghi678jkl 1.1M 1.1M legacy (config layer)
# When you run a container, Podman creates a ZFS clone of the top layer
podman run -d --name web nginx:alpine
# The container's writable layer is a ZFS clone; clones show a snapshot in their "origin" property
zfs list -r -o name,origin rpool/var/lib/containers/storage/zfs
# A clone is a writable dataset created from a snapshot: copy-on-write, instant creation, zero initial space
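Done by hand, the layer-to-container step amounts to a snapshot followed by a clone. The sketch below only prints the commands rather than running them, so it is safe on any machine. The dataset names and the helper name container_layer_cmds are hypothetical, and the real driver picks its own snapshot names and does extra bookkeeping.

```shell
#!/bin/sh
# Print (do not run) the ZFS commands that approximate creating a
# container's writable layer from an image's top layer.
# All dataset names here are made up for the example.
container_layer_cmds() {
    parent="$1"   # dataset holding the image's top layer
    child="$2"    # dataset for the container's writable layer
    snap="${parent}@clone-src"
    printf 'zfs snapshot %s\n' "$snap"
    printf 'zfs clone -o mountpoint=legacy %s %s\n' "$snap" "$child"
}

pool=rpool/var/lib/containers/storage/zfs
container_layer_cmds "$pool/image-top-layer" "$pool/container-rw-layer"
```

Because the clone shares every unmodified block with its origin snapshot, creating it takes the same time regardless of image size.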
Configure ZFS storage driver
# /etc/containers/storage.conf — kldload sets this automatically on ZFS systems
[storage]
driver = "zfs"
[storage.options.zfs]
# Parent dataset for all container storage
# Podman creates sub-datasets automatically
fsname = "rpool/var/lib/containers/storage/zfs"
# Mountopt controls mount options for container rootfs
mountopt = "nodev"
ZFS advantages for container workloads
# Snapshot a running container's filesystem
podman commit my-container my-snapshot:v1
# Under the hood: zfs snapshot + zfs clone
# Instant container cloning — create 100 identical containers in seconds
for i in $(seq 1 100); do
podman run -d --name "worker-$i" my-image:latest
done
# Each container's writable layer is a ZFS clone — zero copy, instant
# Compression for all container datasets (set on the parent; child datasets inherit it)
zfs set compression=zstd rpool/var/lib/containers/storage/zfs
# Block-level deduplication across layers (use with caution — RAM hungry)
zfs set dedup=on rpool/var/lib/containers/storage/zfs
# Quota per container (via ZFS dataset properties)
zfs set quota=10G rpool/var/lib/containers/storage/zfs/<container-dataset>
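# Send an image layer's dataset to another host using ZFS send
# (this copies the data only; to make an image usable by the remote Podman, transfer it with podman push/pull or skopeo copy)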
zfs snapshot rpool/var/lib/containers/storage/zfs/abc123@export
zfs send rpool/var/lib/containers/storage/zfs/abc123@export | \
ssh remote-host zfs recv rpool/var/lib/containers/storage/zfs/abc123
4. Rootless Containers
Rootless containers run entirely as an unprivileged user. The container process never touches UID 0 on the host. This eliminates an entire class of container escape vulnerabilities — even if an attacker breaks out of the namespace, they land as an unprivileged user. kldload configures rootless Podman by default for non-root users.
Why rootless matters
Rootless containers are one of the most important security advances in container technology.
The reason they remain underused in production is that they are harder to set up. Network access
requires slirp4netns or pasta instead of direct bridge access. Binding to ports below 1024
requires net.ipv4.ip_unprivileged_port_start=0. Volume mounts need careful UID mapping. The
ZFS storage driver in rootless mode requires the ZFS datasets to be owned by the rootless
user, which means you need delegated datasets.
None of this is impossible, but Docker's "just run as root" default made everyone lazy. On kldload, all of this is configured out of the box because the security benefit is worth the complexity. A rootless container escape leaves the attacker with the invoking user's unprivileged UID, or one of its subordinate UIDs such as 100000, on the host. They never hold host root.
User namespace mapping
# View subuid/subgid allocations
cat /etc/subuid
# live:100000:65536
# todd:165536:65536
cat /etc/subgid
# live:100000:65536
# todd:165536:65536
# This means user "todd" gets UIDs 165536-231071 for container use
# In rootless Podman, UID 0 inside the container maps to todd's own UID on the host
# (assumed to be 1000 in this example); UIDs 1 and up map into the subordinate range
# UID 1 inside the container = UID 165536 on the host
# UID 1000 inside the container = UID 166535 on the host
# Verify rootless configuration
podman unshare cat /proc/self/uid_map
# 0 1000 1
# 1 165536 65536
# Run a rootless container
podman run --rm -it alpine id
# uid=0(root) gid=0(root) (root inside the namespace only)
# On the host: ps -o uid,pid,cmd shows todd's own UID for that process
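Each line of /proc/self/uid_map has three fields: the first UID inside the namespace, the first UID outside it, and the length of the range. The lookup can be sketched as a small awk filter. The helper name host_uid is invented for this example, and the two-line map is an assumed one for a user whose own UID is 1000 with a subordinate range starting at 165536.

```shell
#!/bin/sh
# Translate a UID inside a user namespace to the host UID.
# Reads uid_map lines ("inside outside count") on stdin.
# Exits non-zero and prints nothing if the UID is unmapped.
host_uid() {
    awk -v uid="$1" '
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { if (!found) exit 1 }
    '
}

# Assumed example map: user UID 1000, subordinate range 165536:65536
map='0 1000 1
1 165536 65536'

printf '%s\n' "$map" | host_uid 0      # prints: 1000
printf '%s\n' "$map" | host_uid 1000   # prints: 166535
```

To check a real system, feed it the live map instead: podman unshare cat /proc/self/uid_map | host_uid 1000.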
Rootless networking with pasta
# Podman 5.x+ uses pasta (from passt) by default instead of slirp4netns
# pasta is faster and supports IPv6 properly
# Check which network backend is active
podman info --format '{{.Host.Pasta.Executable}}'
# /usr/bin/pasta
# Rootless port forwarding — no root required
podman run -d -p 8080:80 nginx:alpine
# pasta sets up port forwarding without iptables (which requires root)
# Allow binding to privileged ports (kldload sets this)
sysctl net.ipv4.ip_unprivileged_port_start=0
# Now rootless containers can bind to port 80, 443, etc.
# Rootless container storage location
# Root: /var/lib/containers/storage/
# Rootless: ~/.local/share/containers/storage/
podman info --format '{{.Store.GraphRoot}}'
ZFS delegated datasets for rootless
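# Create a delegated ZFS dataset for rootless container storage
zfs create rpool/containers-todd
zfs allow todd create,destroy,mount,snapshot,clone,promote,rename,send,receive rpool/containers-todd
# Note: on Linux, OpenZFS cannot delegate the mount and rename operations themselves to a
# non-root user, so verify that dataset mounts succeed as todd before relying on this setup
chown todd:todd /rpool/containers-todd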
# Configure Podman to use the delegated dataset
# ~/.config/containers/storage.conf (as user todd)
[storage]
driver = "zfs"
graphroot = "/rpool/containers-todd"
[storage.options.zfs]
fsname = "rpool/containers-todd"
5. Image Management
Container images are the unit of distribution. An image is an ordered collection of filesystem layers plus metadata (environment variables, entrypoint, exposed ports). Understanding how images are built, stored, tagged, and distributed is essential for reproducible deployments.
Building images with Buildah
# Buildah is a standalone image builder — no daemon, scriptable, OCI-native
# It can build from Dockerfiles OR from shell scripts
# Build from a Dockerfile (standard approach)
buildah bud -t my-app:v1 -f Containerfile .
# Build from shell script (more flexible, no Dockerfile syntax limitations)
ctr=$(buildah from registry.access.redhat.com/ubi9/ubi-minimal:latest)
# ubi-minimal ships microdnf rather than dnf
buildah run "$ctr" -- microdnf install -y python3 python3-pip
buildah run "$ctr" -- pip3 install flask gunicorn
buildah copy "$ctr" ./app /opt/app
# Let the runtime find gunicorn on PATH and start it from the application directory
buildah config --entrypoint '["gunicorn", "--chdir", "/opt/app", "-w", "4", "-b", "0.0.0.0:8000", "app:app"]' "$ctr"
buildah config --port 8000 "$ctr"
buildah config --label maintainer="todd@kldload.com" "$ctr"
buildah commit "$ctr" my-app:v1
buildah rm "$ctr"
Multi-stage builds
# Containerfile with multi-stage build — compile in one stage, deploy in another
# This produces a minimal final image without build tools
FROM registry.access.redhat.com/ubi9/ubi:latest AS builder
RUN dnf install -y gcc make openssl-devel
COPY src/ /build/
WORKDIR /build
RUN make -j$(nproc) && make install DESTDIR=/output
FROM registry.access.redhat.com/ubi9/ubi-minimal:latest
COPY --from=builder /output/usr/local/bin/myapp /usr/local/bin/myapp
RUN microdnf install -y openssl-libs && microdnf clean all
USER 1001
ENTRYPOINT ["/usr/local/bin/myapp"]
Multi-arch builds
ARM servers (Graviton, Ampere) are often priced well below comparable x86 instances per vCPU in the major clouds. Apple Silicon developers are running ARM natively, and many edge deployments are ARM. If your CI pipeline does not produce multi-arch images, you are leaving performance and cost savings on the table.
The manifest list is the key abstraction: it is an index that says "for amd64, use this digest; for arm64, use this digest." The client resolves the tag to the entry that matches its platform automatically. One tag, multiple architectures, and no per-architecture tags for users to manage.
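Manifest list entries use Go-style architecture names (amd64, arm64), which differ from what uname -m reports (x86_64, aarch64). Build scripts often need to translate between the two. The helper below is a sketch; the name oci_arch is hypothetical and the table covers only common machine types.

```shell
#!/bin/sh
# Map a kernel machine name (uname -m) to the architecture name used
# in image manifests. Only common machine types are covered here.
oci_arch() {
    machine="${1:-$(uname -m)}"
    case "$machine" in
        x86_64|amd64)          echo amd64 ;;
        aarch64|arm64)         echo arm64 ;;
        armv7l|armv6l)         echo arm ;;
        i386|i686)             echo 386 ;;
        ppc64le|s390x|riscv64) echo "$machine" ;;
        *) echo "unsupported machine type: $machine" >&2; return 1 ;;
    esac
}

# Platform string for the machine this script runs on
echo "linux/$(oci_arch)"
```

The resulting string can be passed to a native build, for example podman build --platform "linux/$(oci_arch)".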
# Build for multiple architectures and create a manifest list
# Requires qemu-user-static for cross-architecture emulation
dnf install -y qemu-user-static
# Build for amd64 and arm64
podman build --platform linux/amd64,linux/arm64 \
--manifest my-app:v1 \
-f Containerfile .
# Inspect the manifest list
podman manifest inspect my-app:v1
# Shows two entries: one for amd64, one for arm64
# Push the manifest list to a registry
podman manifest push --all my-app:v1 \
docker://registry.example.com/my-app:v1
# Alternative: build separately and combine
podman build --platform linux/amd64 -t my-app:v1-amd64 .
podman build --platform linux/arm64 -t my-app:v1-arm64 .
podman manifest create my-app:v1 my-app:v1-amd64 my-app:v1-arm64
podman manifest push --all my-app:v1 docker://registry.example.com/my-app:v1
Image inspection with Skopeo
# Skopeo inspects and copies images without pulling them
# No daemon required — works directly against registries
# Inspect a remote image without downloading
skopeo inspect docker://docker.io/library/nginx:alpine
# Returns JSON with layers, config, labels, architecture
# Copy between registries without local storage
skopeo copy docker://docker.io/library/nginx:alpine \
docker://registry.local:5000/nginx:alpine
# Copy to a local OCI directory (for air-gapped transfer)
skopeo copy docker://docker.io/library/nginx:alpine \
oci:/tmp/nginx-alpine:latest
# Copy to a Docker archive (tarball)
skopeo copy docker://docker.io/library/nginx:alpine \
docker-archive:/tmp/nginx-alpine.tar
6. Container Networking
Container networking is where most of the complexity lives. Each container gets its own network namespace with its own interfaces, IP addresses, and routing table. Connecting containers to each other and to the outside world requires virtual network infrastructure — bridges, veth pairs, NAT rules, DNS. Podman 5.x uses Netavark as its network backend (replacing CNI plugins).
Network modes
# Bridge mode (default) — container gets a veth pair connected to a bridge
podman run -d --name web --network bridge -p 8080:80 nginx:alpine
# Container gets 10.88.0.x, host forwards port 8080 to container port 80
# Host mode — container shares the host's network namespace
podman run -d --name web --network host nginx:alpine
# Container binds directly to host ports, no NAT, maximum performance
# WARNING: no network isolation
# None — container has no network (only loopback)
podman run -d --name isolated --network none alpine sleep 3600
# Macvlan — container gets a MAC address directly on the physical network
podman network create -d macvlan \
--subnet 192.168.1.0/24 --gateway 192.168.1.1 \
-o parent=eno1 my-macvlan
podman run -d --network my-macvlan --ip 192.168.1.50 nginx:alpine
# Container appears as a real host on the LAN — no NAT, no port mapping
# Create a custom bridge network with specific subnet
podman network create \
--subnet 10.50.0.0/24 \
--gateway 10.50.0.1 \
--dns 10.50.0.1 \
app-network
podman run -d --network app-network --name api my-api:latest
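# The postgres image exits at startup unless a password is set; change-me is a placeholder
podman run -d --network app-network --name db -e POSTGRES_PASSWORD=change-me postgres:16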
# api can reach db at "db:5432" via built-in DNS
Inter-container DNS
# Podman provides automatic DNS resolution for containers on the same network
# Container names resolve to their IP addresses
# Create a network and two containers
podman network create backend
podman run -d --network backend --name redis redis:7-alpine
podman run -d --network backend --name app my-app:latest
# From inside app, redis resolves automatically
podman exec app ping -c 1 redis
# PING redis (10.89.0.2): 56 data bytes
# 64 bytes from 10.89.0.2: seq=0 ttl=64 time=0.043 ms
# The DNS server is aardvark-dns, managed by Netavark
podman network inspect backend | jq '.[0].dns_enabled'
# true
NAT overhead and high-performance alternatives
The default bridge mode uses iptables/nftables DNAT to forward traffic from host ports to container ports. For low-traffic web services this is fine. For high-throughput workloads (databases, message queues, storage services), NAT adds measurable latency and CPU overhead.
The solution depends on your requirements: host networking eliminates all overhead but sacrifices isolation. Macvlan gives you direct LAN access with isolation but breaks container-to-host communication. IPVLAN is similar but shares the MAC address. For truly high-performance container networking on kldload, consider macvlan with a dedicated VLAN for container traffic — you get near-native performance with proper network segmentation.
7. Podman Pods
A Podman pod is a group of containers that share network and (optionally) PID, IPC, and UTS namespaces. This is the same concept as a Kubernetes pod. Containers in a pod communicate over localhost, share the same IP address, and are scheduled as a unit. This makes pods the natural way to co-locate tightly coupled services — like an application and its sidecar proxy, or a web server and a log shipper.
# Create a pod with port mapping
podman pod create --name my-pod -p 8080:80 -p 5432:5432
# Add containers to the pod
podman run -d --pod my-pod --name web nginx:alpine
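# The postgres image exits at startup unless a password is set; change-me is a placeholder
podman run -d --pod my-pod --name db -e POSTGRES_PASSWORD=change-me postgres:16-alpine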
# web and db share the same network namespace
# web can reach db at localhost:5432
# external clients reach web at host:8080
# View pod status
podman pod ps
# POD ID NAME STATUS CREATED INFRA ID # OF CONTAINERS
# abc123def456 my-pod Running 2m ago 789ghi012jkl 3
# The "3" includes the infra container — a pause container that holds the namespaces
podman pod inspect --format '{{.InfraContainerID}}' my-pod
Pod YAML and Kubernetes portability
# Generate Kubernetes YAML from a running pod
podman generate kube my-pod > my-pod.yaml
# The generated YAML is valid Kubernetes YAML
cat my-pod.yaml
# apiVersion: v1
# kind: Pod
# metadata:
# name: my-pod
# spec:
# containers:
# - name: web
# image: docker.io/library/nginx:alpine
# ports:
# - containerPort: 80
# hostPort: 8080
# - name: db
# image: docker.io/library/postgres:16-alpine
# Deploy the same YAML on Kubernetes
kubectl apply -f my-pod.yaml
# Or play it back on another Podman host
podman play kube my-pod.yaml
# Tear down
podman play kube --down my-pod.yaml
8. systemd Integration — Quadlet
Quadlet is Podman's native systemd integration. Instead of writing systemd unit files
that call podman run, you write declarative .container, .pod, .volume, and
.network files. A systemd generator converts these into proper unit files at boot and
on every systemctl daemon-reload. The result is containers managed by systemd, with
dependency ordering, restart policies, socket activation, and journald logging built in.
Why Quadlet replaces podman generate systemd
Before Quadlet, the standard advice was podman generate systemd — which produced
40-line unit files that were fragile, hard to read, and painful to maintain. Quadlet files
are 10-15 lines of declarative configuration. systemd handles the lifecycle. journald
handles the logs. You get automatic restarts, dependency ordering, and systemctl status
for your containers.
For single-host deployments, small clusters, and edge devices, Quadlet is often all you need. Kubernetes is for multi-host orchestration. Quadlet is for single-host orchestration. Most workloads that people deploy on Kubernetes actually belong on Quadlet.
Quadlet .container file
# /etc/containers/systemd/nginx.container
[Unit]
Description=Nginx web server
After=network-online.target
[Container]
Image=docker.io/library/nginx:alpine
PublishPort=80:80
PublishPort=443:443
Volume=/etc/nginx/conf.d:/etc/nginx/conf.d:ro,Z
Volume=/var/www/html:/usr/share/nginx/html:ro,Z
Volume=/etc/letsencrypt:/etc/letsencrypt:ro,Z
Environment=NGINX_ENTRYPOINT_QUIET_LOGS=1
AutoUpdate=registry
HealthCmd=curl -sf http://localhost/ || exit 1
HealthInterval=30s
[Service]
Restart=always
TimeoutStartSec=120
[Install]
WantedBy=multi-user.target default.target
# /etc/containers/systemd/postgres.container
[Unit]
Description=PostgreSQL 16
After=network-online.target
[Container]
Image=docker.io/library/postgres:16-alpine
PublishPort=5432:5432
Volume=pg-data.volume:/var/lib/postgresql/data:Z
Environment=POSTGRES_PASSWORD_FILE=/run/secrets/pg_password
Secret=pg_password,type=mount
HealthCmd=pg_isready -U postgres
HealthInterval=10s
[Service]
Restart=always
TimeoutStartSec=300
[Install]
WantedBy=multi-user.target
Quadlet .volume file
# /etc/containers/systemd/pg-data.volume
[Volume]
# A named volume managed by Podman; with the local driver it is stored under the volumes directory of the container storage root
Driver=local
Label=app=postgres
Quadlet .pod file
# /etc/containers/systemd/webapp.pod
[Unit]
Description=Web application pod
[Pod]
PodName=webapp
PublishPort=8080:8080
# Containers reference this pod with Pod=webapp.pod
# Reload systemd to pick up Quadlet files
systemctl daemon-reload
# Start the container
systemctl start nginx.service
# View status — full systemd integration
systemctl status nginx.service
# Shows container ID, health status, recent logs
# View logs via journald
journalctl -u nginx.service -f
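# Auto-start at boot comes from the [Install] section of the Quadlet file (WantedBy=...)
# Generated units cannot be enabled with systemctl enable; edit the .container file and run daemon-reload instead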
# Auto-update containers (check registry for new images)
systemctl enable --now podman-auto-update.timer
# Manual update check
podman auto-update --dry-run
podman auto-update
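The service names used with systemctl above follow from the Quadlet file name: a .container file keeps its base name, while other unit types get a suffix so they cannot collide with a container of the same name. The sketch below encodes that naming rule. The helper name quadlet_unit_name is made up here, and the rule should be checked against the Quadlet documentation for the installed Podman version.

```shell
#!/bin/sh
# Derive the systemd service name generated for a Quadlet file.
# Sketch of the naming rule only; it does not read or validate the file.
quadlet_unit_name() {
    file=${1##*/}
    base=${file%.*}
    ext=${file##*.}
    case "$ext" in
        container|kube)                 echo "$base.service" ;;
        volume|network|pod|image|build) echo "$base-$ext.service" ;;
        *) echo "unknown Quadlet type: $ext" >&2; return 1 ;;
    esac
}

quadlet_unit_name /etc/containers/systemd/nginx.container   # prints: nginx.service
quadlet_unit_name /etc/containers/systemd/pg-data.volume    # prints: pg-data-volume.service
quadlet_unit_name /etc/containers/systemd/webapp.pod        # prints: webapp-pod.service
```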
Socket activation
# Socket activation starts the container only when a connection arrives
# Perfect for rarely-used services that should not consume resources when idle
# /etc/containers/systemd/dev-tools.container
[Unit]
Description=Development tools (socket activated)
[Container]
Image=my-dev-tools:latest
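# No PublishPort here: systemd owns port 9090 and hands the listening socket to the
# container, so the application inside must accept an inherited socket (LISTEN_FDS)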
[Service]
# Container only starts when someone connects to port 9090
Type=notify
Restart=on-failure
[Install]
# No WantedBy — started by socket only
# /etc/systemd/system/dev-tools.socket
[Socket]
ListenStream=9090
[Install]
WantedBy=sockets.target
9. Firecracker MicroVMs
Firecracker is a virtual machine monitor (VMM) built by AWS that creates lightweight microVMs with a minimal device model. Each microVM boots a real Linux kernel in <125ms, uses ~5MB of memory overhead, and provides the hardware isolation of a VM with startup times approaching a container. Firecracker is what powers AWS Lambda and Fargate.
Containers vs microVMs: choosing the right isolation boundary
Containers share the host kernel. A kernel exploit in any container compromises every container on the host. VMs have separate kernels but take seconds to boot and consume hundreds of megabytes per instance. Firecracker gives you a separate kernel in 125ms with 5MB overhead.
The tradeoff is operational complexity — you need to manage VM images (not container images), the network setup is different, and the tooling is less mature. But for multi-tenant workloads where you cannot trust the code running in the container (FaaS, student labs, CI runners), Firecracker or kata-containers are the correct isolation boundary. Containers are for code you trust. MicroVMs are for code you do not trust.
Firecracker on kldload
# Install Firecracker
ARCH=$(uname -m)
FC_VERSION="1.7.0"
curl -fsSL "https://github.com/firecracker-microvm/firecracker/releases/download/v${FC_VERSION}/firecracker-v${FC_VERSION}-${ARCH}.tgz" \
| tar -xz -C /usr/local/bin --strip-components=1 \
"release-v${FC_VERSION}-${ARCH}/firecracker-v${FC_VERSION}-${ARCH}" \
"release-v${FC_VERSION}-${ARCH}/jailer-v${FC_VERSION}-${ARCH}"
ln -sf /usr/local/bin/firecracker-v${FC_VERSION}-${ARCH} /usr/local/bin/firecracker
ln -sf /usr/local/bin/jailer-v${FC_VERSION}-${ARCH} /usr/local/bin/jailer
# Verify KVM is available (required)
ls -la /dev/kvm
# crw-rw-rw- 1 root kvm 10, 232 ... /dev/kvm
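# Prepare a minimal kernel and rootfs
# The kernel must be uncompressed (vmlinux, not vmlinuz)
mkdir -p /var/lib/firecracker
curl -fsSLo /var/lib/firecracker/vmlinux \
https://s3.amazonaws.com/spec.ccfc.min/img/quickstart_guide/x86_64/kernels/vmlinux.bin
# Create a rootfs from an Alpine container
# --privileged is needed for the loop mount; e2fsprogs provides mkfs.ext4
podman run --rm --privileged -v /var/lib/firecracker:/output:Z alpine sh -c '
apk add --no-cache e2fsprogs
dd if=/dev/zero of=/output/rootfs.ext4 bs=1M count=512
mkfs.ext4 -F /output/rootfs.ext4
mkdir -p /mnt/rootfs
mount -o loop /output/rootfs.ext4 /mnt/rootfs
# Seed the new root with the repository list and signing keys so apk can install into it
mkdir -p /mnt/rootfs/etc/apk
cp -r /etc/apk/keys /etc/apk/repositories /mnt/rootfs/etc/apk/
apk add --root /mnt/rootfs --initdb alpine-base openrc
echo "ttyS0::respawn:/sbin/getty -L ttyS0 115200 vt100" > /mnt/rootfs/etc/inittab
umount /mnt/rootfs
'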
Launch a microVM
# Start Firecracker with a Unix socket for API control
firecracker --api-sock /tmp/firecracker.sock &
# Configure the VM via the API
curl --unix-socket /tmp/firecracker.sock -X PUT \
http://localhost/boot-source \
-H 'Content-Type: application/json' \
-d '{
"kernel_image_path": "/var/lib/firecracker/vmlinux",
"boot_args": "console=ttyS0 reboot=k panic=1 pci=off"
}'
curl --unix-socket /tmp/firecracker.sock -X PUT \
http://localhost/drives/rootfs \
-H 'Content-Type: application/json' \
-d '{
"drive_id": "rootfs",
"path_on_host": "/var/lib/firecracker/rootfs.ext4",
"is_root_device": true,
"is_read_only": false
}'
curl --unix-socket /tmp/firecracker.sock -X PUT \
http://localhost/machine-config \
-H 'Content-Type: application/json' \
-d '{
"vcpu_count": 2,
"mem_size_mib": 256
}'
# Start the microVM — boots in <125ms
curl --unix-socket /tmp/firecracker.sock -X PUT \
http://localhost/actions \
-H 'Content-Type: application/json' \
-d '{"action_type": "InstanceStart"}'
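The launch sequence starts firecracker in the background and then calls its API straight away, so the first curl can race the creation of the socket. One way to close that gap is to poll for the socket before configuring the VM. This is a sketch: wait_for_socket is a hypothetical helper, and it assumes a sleep that accepts fractional seconds (GNU coreutils and BusyBox both do).

```shell
#!/bin/sh
# Wait until a Unix socket exists, polling every 0.1 s.
# $1 = socket path, $2 = maximum number of polls (default 50, about 5 s).
# Returns 0 once the socket is present, 1 on timeout.
wait_for_socket() {
    sock="$1"
    tries="${2:-50}"
    while [ "$tries" -gt 0 ]; do
        [ -S "$sock" ] && return 0
        sleep 0.1
        tries=$((tries - 1))
    done
    return 1
}

# Intended use, placed between starting firecracker and the first API call:
#   firecracker --api-sock /tmp/firecracker.sock &
#   wait_for_socket /tmp/firecracker.sock || { echo "API socket never appeared" >&2; exit 1; }
```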
Kata Containers — OCI-compatible microVMs
# Kata Containers wraps Firecracker (or QEMU) as an OCI runtime
# Podman uses it transparently — same CLI, hardware isolation
dnf install -y kata-containers
# Configure Podman to use kata as an alternative runtime
# /etc/containers/containers.conf
[engine.runtimes]
kata = ["/usr/bin/kata-runtime"]
# Run a container with hardware isolation
podman run --rm --runtime kata -it alpine sh
# This boots a real microVM, runs the container inside it, and destroys it
# ~200ms startup vs ~50ms for crun, but full kernel isolation
# Use kata for specific containers, crun for others
podman run --runtime crun -d --name trusted-app my-trusted:v1
podman run --runtime kata -d --name untrusted-job user-submitted-code:latest
10. SELinux & Containers
SELinux provides mandatory access control (MAC) that operates independently of
Unix permissions. For containers, SELinux uses Multi-Category Security (MCS)
labels to isolate containers from each other and from the host. Each container gets a
unique MCS label (e.g., s0:c123,c456), and SELinux enforces that it can only
access files and resources with matching labels.
The SELinux container model
SELinux assigns a type and an MCS label to every process and file. Containers run as
type container_t. Container files are labeled container_file_t. The policy says
container_t can read/write container_file_t. When you bind-mount a host directory,
it has the host's SELinux type (e.g., default_t or var_t), and container_t
is not allowed to access those types.
The fix is volume labels: the :Z flag tells Podman to relabel the mount with the
container's MCS label (private to that container). The :z flag relabels it as shared
(accessible by multiple containers). This is the correct fix — not disabling SELinux.
A common reaction to an SELinux denial is setenforce 0, which is like disabling your firewall because it blocked something. The actual fix is almost always a volume label (:Z or :z) or a udica-generated policy.
MCS label isolation
# View the SELinux context of a running container
podman run -d --name test1 alpine sleep 3600
podman run -d --name test2 alpine sleep 3600
podman inspect --format '{{.ProcessLabel}}' test1
# system_u:system_r:container_t:s0:c100,c200
podman inspect --format '{{.ProcessLabel}}' test2
# system_u:system_r:container_t:s0:c300,c400
# Different MCS labels — SELinux prevents cross-container access
# Even if a container escapes its namespace, it cannot read files
# labeled with a different MCS category
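To compare labels in a script rather than by eye, the level part of each label can be split off and compared. This is a small sketch; `mcs_level` and `same_mcs` are hypothetical helper names, and the two labels are the sample values shown above.

```shell
# Print the MLS/MCS level of an SELinux label.
# A label has the form user:role:type:level, and the level may contain a colon.
mcs_level() {
    printf '%s\n' "$1" | cut -d: -f4-
}

# Succeed when two labels carry the same level, i.e. the same MCS categories.
same_mcs() {
    [ "$(mcs_level "$1")" = "$(mcs_level "$2")" ]
}

a='system_u:system_r:container_t:s0:c100,c200'
b='system_u:system_r:container_t:s0:c300,c400'
mcs_level "$a"
if same_mcs "$a" "$b"; then echo "shared categories"; else echo "isolated"; fi
```

On a live system the two labels would come from the `podman inspect --format '{{.ProcessLabel}}'` calls shown above.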
Volume mount labels
# WRONG — SELinux will block access
podman run -v /data/app:/data alpine ls /data
# ls: can't open '/data': Permission denied
# RIGHT — :Z relabels the mount for this specific container
podman run -v /data/app:/data:Z alpine ls /data
# (works — /data/app is relabeled to container's MCS label)
# :z — shared label (multiple containers can access)
podman run -v /data/shared:/data:z alpine ls /data
# :Z — private label (only this container)
podman run -v /data/private:/data:Z alpine ls /data
# View the relabeling
ls -lZ /data/app
# drwxr-xr-x. root root system_u:object_r:container_file_t:s0:c100,c200 app
# WARNING: :Z on a host system directory (like /etc or /var) will break the host
# Only use :Z on directories dedicated to the container
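The warning above can be enforced in wrapper scripts with a simple path check before a relabeling mount is requested. This is a sketch only: `safe_to_relabel` is a hypothetical helper, and its deny list is illustrative rather than exhaustive.

```shell
# Hypothetical guard: refuse :Z/:z relabeling of host system directories.
# The deny list below is an assumption; extend it for your own hosts.
safe_to_relabel() {
    case "${1%/}" in
        ""|/etc|/etc/*|/var|/usr|/usr/*|/home|/root|/boot|/boot/*|/bin|/sbin|/lib|/lib64)
            return 1 ;;
        *)
            return 0 ;;
    esac
}

for dir in /data/app /etc /var; do
    if safe_to_relabel "$dir"; then
        echo "$dir: ok to mount with :Z"
    else
        echo "$dir: refusing to relabel"
    fi
done
```

The empty pattern catches the root directory itself, because stripping the trailing slash from `/` leaves an empty string.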
udica — automatic policy generation
# udica generates SELinux policies from container inspection data
dnf install -y udica
# Run your container and capture its inspection data
podman inspect my-container > container.json
# Generate a policy
udica -j container.json my-container-policy
# Load the policy
semodule -i my-container-policy.cil /usr/share/udica/templates/{base_container.cil,net_container.cil}
# Run the container with the custom policy
podman run --security-opt label=type:my-container-policy.process -d my-image
# This gives the container exactly the permissions it needs — no more
# Much more secure than --privileged or setenforce 0
11. Container Storage on ZFS
Beyond the ZFS storage driver for image layers, containers need persistent storage for data that survives container restarts and rebuilds. On kldload, all persistent container storage sits on ZFS datasets — giving you snapshots, compression, encryption, and replication for container data with zero additional tooling.
Named volumes on ZFS
# Create a named volume (with the default local driver, Podman stores it as a directory under the volumes path)
podman volume create pg-data
# Inspect the volume
podman volume inspect pg-data
# "Mountpoint": "/var/lib/containers/storage/volumes/pg-data/_data"
# On ZFS, this is backed by a ZFS dataset
# Use the volume
podman run -d --name postgres \
-v pg-data:/var/lib/postgresql/data:Z \
-e POSTGRES_PASSWORD=secret \
postgres:16-alpine
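# Snapshot the dataset that backs the volume
# (this assumes pg-data has its own dataset at this path; if not, snapshot the parent volumes dataset)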
zfs snapshot rpool/var/lib/containers/storage/volumes/pg-data@before-migration
# Rollback if the migration fails
zfs rollback rpool/var/lib/containers/storage/volumes/pg-data@before-migration
ZFS datasets as direct mounts
# Create dedicated ZFS datasets for container data
zfs create -o mountpoint=/data/redis \
-o recordsize=4K \
-o compression=lz4 \
-o atime=off \
rpool/data/redis
zfs create -o mountpoint=/data/postgres \
-o recordsize=8K \
-o compression=lz4 \
-o logbias=throughput \
rpool/data/postgres
zfs create -o mountpoint=/data/minio \
-o recordsize=1M \
-o compression=zstd \
rpool/data/minio
# Mount with appropriate ZFS tuning per workload
podman run -d --name redis \
-v /data/redis:/data:Z \
redis:7-alpine --save 60 1 --dir /data
podman run -d --name postgres \
-v /data/postgres:/var/lib/postgresql/data:Z \
postgres:16-alpine
podman run -d --name minio \
-v /data/minio:/data:Z \
-p 9000:9000 \
minio/minio server /data
Why ZFS recordsize tuning matters for containers
The ZFS recordsize tuning per container workload is where kldload shines compared to running containers on ext4 or XFS. PostgreSQL performs best with 8K recordsize (matching its page size). Redis needs 4K for small key-value pairs. MinIO stores large objects and benefits from 1M recordsize. On ext4, you get one block size for the whole filesystem. On ZFS, each dataset — and therefore each container's storage — gets its own recordsize, compression algorithm, and caching policy.
This is not a micro-optimization. A mismatched recordsize multiplies write amplification. In the worst case, a PostgreSQL database on a 128K recordsize dataset (the ZFS default) rewrites a full 128 KB record for every 8 KB page update, while on an 8K dataset it writes 8 KB. That is up to a 16x difference in write amplification (before compression), and it translates directly into disk throughput and latency.
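The arithmetic behind the PostgreSQL example can be checked directly. This sketch computes the worst-case ratio only: it assumes each page update rewrites one full record and ignores compression and any coalescing of updates. `write_amp` is a hypothetical helper name.

```shell
# Worst-case write amplification: one full record rewritten per page update.
# Arguments: recordsize in KiB, application page size in KiB.
write_amp() {
    echo $(( $1 / $2 ))
}

write_amp 128 8   # ZFS default recordsize with a PostgreSQL page: prints 16
write_amp 8 8     # recordsize matched to the page size: prints 1
```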
tmpfs for ephemeral data
# Use tmpfs for data that does not need to persist
# Faster than any disk — lives in RAM
podman run -d --name build-runner \
--tmpfs /tmp:rw,size=2g,exec \
--tmpfs /build:rw,size=10g \
my-build-image:latest
# tmpfs is perfect for:
# - Build artifacts during CI
# - Temporary caches
# - Session data
# - Test databases that are rebuilt each run
12. Registries & Mirrors
A container registry stores and distributes OCI images. For air-gapped environments, darksite deployments, and performance optimization, running a local registry mirror is essential. kldload's darksite model extends naturally to container images — pull once, serve locally forever.
Running a local registry
# Deploy a local registry on ZFS
zfs create -o mountpoint=/data/registry \
-o compression=zstd \
-o recordsize=128K \
rpool/data/registry
# Quadlet file for the registry
# /etc/containers/systemd/registry.container
[Unit]
Description=Local container registry
After=network-online.target
[Container]
Image=docker.io/library/registry:2
PublishPort=5000:5000
Volume=/data/registry:/var/lib/registry:Z
Environment=REGISTRY_STORAGE_DELETE_ENABLED=true
Environment=REGISTRY_HTTP_HEADERS_Access-Control-Allow-Origin=['*']
HealthCmd=wget -q --spider http://localhost:5000/v2/ || exit 1
HealthInterval=30s
[Service]
Restart=always
[Install]
WantedBy=multi-user.target
Configure registry mirrors
# /etc/containers/registries.conf.d/010-local-mirror.conf
# Mirror Docker Hub through the local registry
[[registry]]
prefix = "docker.io"
location = "docker.io"
[[registry.mirror]]
location = "registry.local:5000"
insecure = false
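# For air-gapped environments, use the entries below in place of the mirror entry above
# (do not keep both for the same prefix). They redirect every pull to the local
# registry, so no external registry is contacted.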
[[registry]]
prefix = "docker.io"
blocked = false
location = "registry.local:5000"
[[registry]]
prefix = "quay.io"
blocked = false
location = "registry.local:5000"
# Short-name aliases — resolve unqualified names
[aliases]
"nginx" = "registry.local:5000/library/nginx"
"postgres" = "registry.local:5000/library/postgres"
"redis" = "registry.local:5000/library/redis"
Mirroring images for darksite deployment
The air-gapped container registry is the missing piece in most darksite deployments. People focus on OS packages (RPM, APT) and forget that their containerized services need images too. On kldload, the darksite model already handles RPM and APT packages. Adding a container registry mirror is the natural extension.
The workflow is: pull images on a connected host, skopeo-copy them to a portable medium or
to the local registry, and configure all hosts to resolve image names to the local mirror.
The registries.conf approach is cleaner than retagging images — the original image
references in your Quadlet files and Kubernetes manifests stay unchanged.
#!/bin/bash
# mirror-images.sh: mirror a list of images to the local registry
# Run on a connected host, then copy the results to the air-gapped side
set -euo pipefail
REGISTRY="registry.local:5000"
IMAGES=(
"docker.io/library/nginx:alpine"
"docker.io/library/postgres:16-alpine"
"docker.io/library/redis:7-alpine"
"docker.io/library/registry:2"
"docker.io/grafana/grafana:latest"
"docker.io/prom/prometheus:latest"
"docker.io/prom/node-exporter:latest"
)
for img in "${IMAGES[@]}"; do
echo "Mirroring $img -> $REGISTRY"
skopeo copy --all \
"docker://$img" \
"docker://$REGISTRY/$(echo "$img" | sed 's|docker.io/||')"
done
echo "Mirrored ${#IMAGES[@]} images to $REGISTRY"
Image signing with sigstore
# Sign images with cosign (sigstore)
dnf install -y cosign
# Generate a key pair
cosign generate-key-pair
# Sign an image
cosign sign --key cosign.key registry.local:5000/my-app:v1
# Verify the signature
cosign verify --key cosign.pub registry.local:5000/my-app:v1
# Configure Podman to require signatures
# /etc/containers/policy.json
{
"default": [{"type": "reject"}],
"transports": {
"docker": {
"registry.local:5000": [
{
"type": "sigstoreSigned",
"keyPath": "/etc/pki/containers/cosign.pub",
"signedIdentity": {"type": "matchRepository"}
}
],
"docker.io/library": [{"type": "insecureAcceptAnything"}]
}
}
}
13. GPU Containers
Running GPU workloads in containers requires the NVIDIA Container Toolkit, which provides a custom OCI runtime hook that mounts the GPU driver and device files into the container at startup. On kldload, this integrates with Podman, CDI (Container Device Interface), and rootless GPU access.
CDI: the modern approach to GPU containers
CDI (Container Device Interface) is a game-changer for GPU containers. Before CDI, GPU
access required the nvidia-container-runtime as a wrapper around runc — it intercepted
the container start, detected environment variables like NVIDIA_VISIBLE_DEVICES, and
injected the driver. It was fragile, version-sensitive, and did not work with rootless Podman.
CDI takes a completely different approach: it generates a static JSON specification of all GPU devices and their required mounts/hooks. The container runtime reads the CDI spec and handles device injection natively. No wrapper runtime. No environment variable magic. Works with crun, runc, and kata. Works rootless. Works with any OCI runtime. This is why kldload configures CDI by default.
# Install NVIDIA Container Toolkit (see GPU Masterclass for driver setup)
dnf install -y nvidia-container-toolkit
# Generate CDI specification
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Verify CDI devices
podman run --rm --device nvidia.com/gpu=all \
nvidia/cuda:12.4.0-base-ubi9 nvidia-smi
# Run on a specific GPU (multi-GPU hosts)
podman run --rm --device nvidia.com/gpu=0 \
nvidia/cuda:12.4.0-base-ubi9 nvidia-smi
# Non-root user inside the container; for rootless Podman, run the same command as an unprivileged host user (CDI makes this possible)
podman run --rm --user 1000:1000 --device nvidia.com/gpu=all \
nvidia/cuda:12.4.0-base-ubi9 nvidia-smi
# Quadlet with GPU access
# /etc/containers/systemd/ollama.container
[Unit]
Description=Ollama LLM server
After=network-online.target
[Container]
Image=docker.io/ollama/ollama:latest
PublishPort=11434:11434
Volume=ollama-models.volume:/root/.ollama:Z
AddDevice=nvidia.com/gpu=all
[Service]
Restart=always
[Install]
WantedBy=multi-user.target
If your GPU containers still depend on NVIDIA_VISIBLE_DEVICES environment variables, you are using the old path; switch to CDI.
14. Monitoring & Logging
Container observability requires visibility into resource usage, health status, logs,
and lifecycle events. On kldload, containers log to journald by default (via Podman's
journald log driver), and resource metrics are available through Podman's stats API,
Prometheus cAdvisor, and eBPF-based tools.
Podman stats and health checks
# Real-time resource usage for all containers
podman stats --no-stream
# ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O
# abc123def456 web 0.12% 15.2MiB / 2GiB 0.74% 12.3kB / 8.1kB 0B / 4.1MB
# 789ghi012jkl postgres 2.34% 128MiB / 4GiB 3.13% 1.2MB / 856kB 24MB / 156MB
# Health check status
podman healthcheck run web
podman inspect --format '{{.State.Health.Status}}' web
# healthy
# Container events stream
podman events --filter event=start --filter event=die --filter event=health_status
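For a quick alert without a full metrics stack, the stats output can be filtered against a threshold. This is a minimal sketch; `over_mem_threshold` is a hypothetical helper, and the sample input reuses the values from the stats listing above.

```shell
# Print the names of containers whose memory usage exceeds a threshold.
# Reads "NAME PERCENT%" lines on stdin; $1 is the threshold in percent.
over_mem_threshold() {
    awk -v limit="$1" '{
        pct = $2
        sub(/%$/, "", pct)                 # strip the trailing percent sign
        if (pct + 0 > limit + 0) print $1  # force numeric comparison
    }'
}

# Sample input in the shape produced by:
#   podman stats --no-stream --format '{{.Name}} {{.MemPerc}}'
printf '%s\n' 'web 0.74%' 'postgres 3.13%' | over_mem_threshold 2
```

With the sample data and a 2% threshold, only postgres is reported.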
journald logging
# Podman logs to journald by default on kldload
# View logs for a specific container
journalctl CONTAINER_NAME=web --since "1 hour ago"
# Follow logs
journalctl CONTAINER_NAME=postgres -f
# View logs by container ID
journalctl CONTAINER_ID=abc123def456
# Structured log queries
journalctl CONTAINER_NAME=web --output json-pretty | jq '.MESSAGE'
# Configure log driver per container
podman run -d --log-driver journald \
--log-opt tag="{{.Name}}" \
--name web nginx:alpine
Prometheus cAdvisor
# cAdvisor exports container metrics for Prometheus scraping
# /etc/containers/systemd/cadvisor.container
[Unit]
Description=cAdvisor container metrics
After=network-online.target
[Container]
Image=gcr.io/cadvisor/cadvisor:v0.49.1
PublishPort=8080:8080
Volume=/:/rootfs:ro
Volume=/var/run:/var/run:ro
Volume=/sys:/sys:ro
Volume=/var/lib/containers:/var/lib/containers:ro
SecurityLabelDisable=true
[Service]
Restart=always
[Install]
WantedBy=multi-user.target
# Prometheus scrape config for cAdvisor
# /etc/prometheus/prometheus.yml (add to scrape_configs)
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics
    scrape_interval: 15s
# Key metrics to alert on:
# container_cpu_usage_seconds_total — CPU usage per container
# container_memory_usage_bytes — memory consumption
# container_network_receive_bytes_total — network ingress
# container_fs_usage_bytes — filesystem usage
# container_oom_events_total — OOM kills (critical)
Loki for container log aggregation
# Promtail ships container logs from journald to Loki
# /etc/promtail/config.yml
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: containers
    journal:
      json: false
      max_age: 12h
      labels:
        job: containers
    relabel_configs:
      - source_labels: ['__journal_container_name']
        target_label: 'container'
      - source_labels: ['__journal_container_id']
        target_label: 'container_id'
      - source_labels: ['__journal__hostname']
        target_label: 'host'
15. Troubleshooting Reference
Container problems fall into predictable categories. The table below covers the most common issues, their root causes, and the exact commands to diagnose and fix them.
Essential debugging tools
The single most useful troubleshooting command for container issues is podman inspect.
It gives you every configuration detail of a container — network settings, mount points,
SELinux labels, health check config, environment variables, OCI runtime used, cgroup path,
and more. The second most useful is podman logs (or journalctl CONTAINER_NAME=x).
Between these two commands, you can diagnose 80% of container problems. The remaining 20%
require nsenter to enter the container's namespaces without exec-ing into the container
itself (useful when the container has no shell), and ausearch -m avc for SELinux denials.
| Symptom | Likely Cause | Diagnosis & Fix |
|---|---|---|
| Permission denied on volume mount | SELinux label mismatch | Add :Z (private) or :z (shared) to the volume mount. Check ausearch -m avc -ts recent. |
| Container starts then immediately exits | Entrypoint/CMD fails or PID 1 exits | podman logs <container> to see the error. Check entrypoint with podman inspect --format '{{.Config.Entrypoint}}'. |
| Cannot pull image (timeout/TLS error) | Network/DNS/registry config issue | podman pull --log-level debug. Check /etc/containers/registries.conf. Verify DNS with dig registry.example.com. |
| Port binding fails (address already in use) | Host port conflict or stale container | ss -tlnp \| grep :<port> to find the conflict. podman ps -a to check for stopped containers holding ports. |
| Container OOM killed | Memory limit too low | podman inspect --format '{{.State.OOMKilled}}'. Check journalctl -k --grep oom. Increase --memory. |
| Rootless container cannot bind port <1024 | ip_unprivileged_port_start too high | sysctl net.ipv4.ip_unprivileged_port_start=0. Make permanent in /etc/sysctl.d/. |
| DNS resolution fails inside container | Network DNS not configured | podman exec <ctr> cat /etc/resolv.conf. Check podman network inspect <net> for DNS settings. |
| ZFS storage driver: "dataset busy" | Container or mount still referencing dataset | podman ps -a for stopped containers. lsof +D /var/lib/containers. podman system prune. |
| Slow container startup | Large image layers or ZFS fragmentation | Check image size: podman image ls. Use multi-stage builds. Check the FRAG column of zpool list for fragmentation. |
| Container cannot reach other containers | Containers on different networks | podman inspect --format '{{.NetworkSettings.Networks}}' for both containers. Ensure same network name. |
| Podman socket not found (Docker compat) | Podman socket not enabled | systemctl --user enable --now podman.socket (rootless) or systemctl enable --now podman.socket (root). |
| Image build fails with "no space" | ZFS dataset quota or pool full | zfs list -o name,used,avail,quota. podman system prune -a to reclaim space. Check zpool list. |
Advanced debugging with nsenter
There is a debugging technique that most container engineers never learn: nsenter.
When a container is misbehaving and you need to inspect its network, filesystem, or process
state — but the container image has no shell, no curl, no debugging tools —
nsenter lets you enter any combination of the container's namespaces from the host.
You stay as root on the host with full access to host tools, but you see the container's
network namespace, mount namespace, or PID namespace.
This is strictly more powerful than podman exec because you are not limited to the
tools inside the container image. And unlike podman exec, it works even when the
container's PID 1 has crashed but the namespaces still exist.
# Enter a container's network and PID namespaces from the host
# -m is omitted on purpose so sh and other tools come from the host; this works even if the image has no shell
nsenter -t "$(podman inspect --format '{{.State.Pid}}' web)" -n -p sh
# Trace network syscalls made by a running container, from the host
strace -f -e trace=network -p "$(podman inspect --format '{{.State.Pid}}' web)"
# Debug container networking from the host
# Find the container's network namespace
NETNS=$(podman inspect --format '{{.NetworkSettings.SandboxKey}}' web)
echo "$NETNS"
# /run/netns/cni-abc123 (the name varies by network backend)
# Execute commands in the container's network namespace
nsenter --net="$NETNS" ip addr
nsenter --net="$NETNS" ss -tlnp
nsenter --net="$NETNS" iptables -t nat -L -n -v
# Check SELinux denials for a specific container
ausearch -m avc -ts recent | grep container_t
# Full container lifecycle audit
podman events --since 1h --format '{{.Time}} {{.Status}} {{.Name}}'
# Reset everything (nuclear option — development only)
podman system reset
# WARNING: This destroys all containers, images, volumes, and networks
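The ausearch check above prints raw audit records, which get noisy on a busy host. The sketch below condenses them into a count of the target types that container_t was denied. `denied_target_types` is a hypothetical helper, and the sample record is illustrative rather than captured from a real system.

```shell
# Summarize which target types container_t was denied access to.
# Reads audit records (for example from: ausearch -m avc -ts recent) on stdin.
denied_target_types() {
    grep 'scontext=[^ ]*:container_t:' \
        | grep -o 'tcontext=[^ ]*' \
        | cut -d: -f3 \
        | sort | uniq -c | sort -rn
}

# Illustrative AVC record (an assumption, not real log output)
sample='type=AVC msg=audit(1.0:1): avc:  denied  { read } for  pid=1 comm="ls" scontext=system_u:system_r:container_t:s0:c100,c200 tcontext=system_u:object_r:default_t:s0 tclass=dir permissive=0'
printf '%s\n' "$sample" | denied_target_types
```

A target type such as default_t or var_t at the top of the list usually points to a bind mount that is missing :Z or :z.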
Learn nsenter. It is the single most powerful container debugging tool that almost nobody uses, and it works when podman exec cannot: crashed containers, minimal images, broken shells.
16. Production Patterns
Running containers in production on kldload means combining everything above into coherent patterns. Here are the configurations that work for real workloads.
Hardened container runtime
# /etc/containers/containers.conf — production hardening
[containers]
# Drop all capabilities, add back only what is needed
default_capabilities = [
"CHOWN",
"DAC_OVERRIDE",
"FOWNER",
"FSETID",
"KILL",
"NET_BIND_SERVICE",
"SETFCAP",
"SETGID",
"SETPCAP",
"SETUID"
]
# Default sysctls: let processes in GID 0 open unprivileged ICMP (ping) sockets
default_sysctls = [
"net.ipv4.ping_group_range=0 0"
]
# Default ulimits
default_ulimits = [
"nofile=65536:65536",
"nproc=4096:4096"
]
# Log driver
log_driver = "journald"
# Default timezone
tz = "UTC"
[engine]
# Use crun (faster, written in C)
runtime = "crun"
# Record health check results as health_status events
healthcheck_events = true
# Image pull policy
pull_policy = "newer"
[network]
# Use netavark (modern network backend)
network_backend = "netavark"
dns_bind_port = 5353
Backup strategy for containerized workloads
# ZFS snapshots for container volumes — automated with sanoid
# /etc/sanoid/sanoid.conf
[rpool/var/lib/containers/storage/volumes]
use_template = container-volumes
recursive = yes
[template_container-volumes]
frequently = 4
hourly = 24
daily = 30
monthly = 3
autosnap = yes
autoprune = yes
# Replicate container volumes to a backup host
syncoid --recursive \
rpool/var/lib/containers/storage/volumes \
backup-host:rpool/backup/container-volumes
# Restore a specific volume from snapshot
zfs rollback rpool/var/lib/containers/storage/volumes/pg-data@autosnap_2026-04-05_hourly
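Choosing the rollback target by hand is error-prone, so the newest snapshot can be selected from the snapshot list instead. This sketch assumes sanoid's autosnap_YYYY-MM-DD_HH:MM:SS_type naming, which sorts chronologically as plain text; `latest_autosnap` is a hypothetical helper and the dataset names in the demo are made up.

```shell
# Pick the newest sanoid snapshot of one dataset from a list of snapshot names.
# Reads names (as printed by: zfs list -H -t snapshot -o name) on stdin.
latest_autosnap() {
    awk -v prefix="$1@autosnap_" 'index($0, prefix) == 1' \
        | LC_ALL=C sort \
        | tail -n 1
}

# Demo with made-up snapshot names
printf '%s\n' \
    'tank/vol/pg-data@autosnap_2026-04-05_10:00:01_hourly' \
    'tank/vol/pg-data@autosnap_2026-04-05_11:00:01_hourly' \
    'tank/vol/other@autosnap_2026-04-05_12:00:01_hourly' \
    | latest_autosnap tank/vol/pg-data
```

Targeting the newest snapshot matters because zfs rollback only goes to the most recent snapshot by default; rolling back further requires -r, which destroys every snapshot taken after the target.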
Related Pages
- Docker & Podman on ZFS — tutorial-level guide to running containers on ZFS
- Containers on ZFS — build recipe for a container host
- Serverless / Firecracker — FaaS workloads on microVMs
- GPU & NVIDIA Masterclass — GPU passthrough and container GPU access
- Kubernetes Masterclass — multi-host container orchestration
- systemd Masterclass — deep dive on service management and Quadlet
- Keycloak & SELinux Masterclass — SELinux policy in depth
- ZFS Masterclass — ZFS fundamentals and tuning
- Observability Masterclass — Prometheus, Grafana, and Loki for full-stack monitoring
- Packer & IaC Masterclass — infrastructure as code for container hosts
- Backup & DR Masterclass — ZFS replication and disaster recovery