Containers Masterclass

This guide covers the complete container stack on kldload: from Linux namespace fundamentals and cgroup resource controls through Podman, Docker compatibility, the ZFS storage driver, rootless containers, container networking, pods, systemd integration via Quadlet, Firecracker microVMs, SELinux MCS isolation, image management with Buildah, multi-arch builds, local registries for air-gapped deployments, GPU containers, monitoring, and a comprehensive troubleshooting reference. By the end you will understand every layer of the container stack and know how to run production container workloads on ZFS with proper isolation, networking, and observability.

The premise: Containers are the most misunderstood technology in modern infrastructure. Most engineers treat them as lightweight VMs. They are not. A container is a set of kernel isolation primitives — namespaces, cgroups, seccomp, LSM — applied to an ordinary Linux process. There is no container kernel. There is no hypervisor. There is just your process, running on the host kernel, with restricted views of the filesystem, network, PIDs, and resources. Understanding this changes everything about how you design, deploy, secure, and debug containerized workloads.

What this page covers: Linux namespaces and cgroups fundamentals, the OCI runtime specification, Podman vs Docker architecture, the ZFS storage driver and dataset-per-container model, rootless containers with user namespaces, container networking with CNI and Netavark, Podman pods and Kubernetes YAML portability, systemd Quadlet integration, Firecracker microVMs and kata-containers for hardware isolation, SELinux MCS label isolation, Buildah image builds, multi-arch manifests, local registries for darksite environments, GPU passthrough to containers, monitoring and logging, and a complete troubleshooting reference.

Prerequisites: a running kldload system with ZFS on root. The Firecracker sections assume KVM is available. The GPU sections assume the GPU & NVIDIA Masterclass setup is complete. Networking sections build on the Backplane Networks Masterclass.

Why the isolation primitives matter more than the runtime

A container shares the host kernel. Every kernel vulnerability is a container vulnerability. A container escape is not like a VM escape: it is generally easier, because there is no hypervisor boundary to cross, only the kernel's own isolation code. This is why understanding the actual isolation primitives (namespaces, cgroups, seccomp, SELinux) matters more than understanding any particular container runtime. The runtime is just the tool that sets up those primitives. The primitives are the security boundary.

Docker made containers popular by selling them as "lightweight VMs" — and that mental model causes real production incidents. The container industry still has this marketing problem, and this guide exists to fix the understanding underneath it.

1. Container Fundamentals

Before touching Podman or Docker, you need to understand what a container actually is at the kernel level. A container is not a thing — it is a collection of kernel features applied to a process. There is no "container" data structure in the Linux kernel. There are namespaces (which restrict what a process can see), cgroups (which restrict what a process can use), seccomp (which restricts what system calls a process can make), and LSM hooks like SELinux and AppArmor (which enforce mandatory access control). A "container runtime" is simply a program that creates a process with the right combination of these restrictions.

Linux Namespaces

Namespaces partition kernel resources so that one set of processes sees one set of resources and another set of processes sees a different set. There are eight namespace types in modern Linux kernels. Each namespace isolates a specific kernel resource.

Mount Namespace (mnt)

Isolates the filesystem mount table. Each container gets its own view of the filesystem tree. The container sees its rootfs at / and cannot see the host's mounts. Created with CLONE_NEWNS. This is the oldest namespace, added in Linux 2.4.19 (2002).

// The container thinks it has its own hard drive. It is actually looking at an overlay of read-only image layers and a writable top layer.

PID Namespace (pid)

Isolates the process ID number space. The first process in the container is PID 1 inside the namespace, but has a different PID (e.g., 45832) on the host. The container cannot see or signal host processes. CLONE_NEWPID.

// The container's init process thinks it is PID 1 and running alone. The host knows better.

Network Namespace (net)

Isolates network devices, IP addresses, routing tables, firewall rules, and /proc/net. Each container gets its own network stack — its own eth0, its own IP, its own iptables. Traffic between namespaces crosses a virtual bridge or veth pair. CLONE_NEWNET.

// Each container is on its own virtual LAN. The bridge is the switch. The host is the router.

User Namespace (user)

Maps UIDs and GIDs inside the namespace to different UIDs/GIDs outside. Root (UID 0) inside the container can be UID 100000 on the host — an unprivileged user. This is the foundation of rootless containers. CLONE_NEWUSER.

// The container thinks it is root. The host knows it is nobody special. Both are correct.

UTS Namespace (uts)

Isolates the hostname and NIS domain name. Each container can have its own hostname without affecting the host. Simple but essential — many applications use hostname for identity. CLONE_NEWUTS.

// The container can call itself anything it wants. hostname(1) returns the container name, not the host name.

IPC Namespace (ipc)

Isolates System V IPC objects (shared memory segments, message queues, semaphore arrays) and POSIX message queues. Prevents containers from interfering with each other's IPC. CLONE_NEWIPC.

// Shared memory is only shared with processes in the same namespace. No cross-container side channels via SHM.

Cgroup Namespace (cgroup)

Virtualizes the view of /proc/self/cgroup. The container sees its cgroup root as / rather than its actual position in the host's cgroup hierarchy. Prevents information leakage about host cgroup structure. CLONE_NEWCGROUP.

// The container cannot discover it is in a cgroup — it thinks the cgroup root is its own.

Time Namespace (time)

Added in Linux 5.6. Allows per-namespace offsets for CLOCK_MONOTONIC and CLOCK_BOOTTIME. Useful for migrating containers between hosts with different uptimes without breaking timers. CLONE_NEWTIME.

// The container can think the system booted at a different time. Useful for live migration, dangerous for forensics.
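
The namespaces a process belongs to are visible under /proc/<pid>/ns, which makes the descriptions above easy to check on a live system. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function name list_namespaces is illustrative and not part of any tool.

```shell
#!/usr/bin/env bash
# Print the namespace type and namespace ID for a process.
# Defaults to the current shell when no PID is given.
list_namespaces() {
  local pid="${1:-$$}" link
  for link in /proc/"$pid"/ns/*; do
    # Skip the literal glob pattern if the directory is missing or unreadable
    [ -L "$link" ] || continue
    printf '%-18s %s\n' "$(basename "$link")" "$(readlink "$link")"
  done
}

list_namespaces
```

Two processes share a namespace when the bracketed inode numbers match. Run with sufficient privileges against a container's host PID (for example the value printed by podman inspect --format '{{.State.Pid}}' my-container) and against PID 1, the output should differ for every namespace type the runtime unshared.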

Cgroups v2

Control groups (cgroups) limit, account for, and isolate resource usage. Cgroups v2 is the unified hierarchy: all controllers live in a single tree. kldload uses cgroups v2 exclusively. The key controllers are cpu, memory, io, and pids; the commands below show how Podman's resource flags map onto them.

# View the cgroup hierarchy for a container
podman inspect --format '{{.State.CgroupPath}}' my-container

# Check cgroup v2 is active
stat -fc %T /sys/fs/cgroup/
# Returns "cgroup2fs" on cgroups v2

# Set memory limit to 2 GiB and CPU to 1.5 cores
podman run --memory 2g --cpus 1.5 my-image

# These translate to cgroup writes:
# memory.max = 2147483648
# cpu.max = 150000 100000  (150ms per 100ms period = 1.5 CPUs)
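
The translation from flags to cgroup values is plain arithmetic, and it can help to reproduce it when reading cpu.max or memory.max by hand. This is a sketch under stated assumptions: the helper names cpus_to_cpu_max and size_to_bytes are made up for illustration, the period defaults to 100000 microseconds, and size suffixes are treated as binary units (k, m, g meaning KiB, MiB, GiB).

```shell
#!/usr/bin/env bash
# Convert a --cpus value into the "quota period" pair stored in cpu.max.
# Both numbers are in microseconds.
cpus_to_cpu_max() {
  local cpus="$1" period="${2:-100000}"
  awk -v c="$cpus" -v p="$period" 'BEGIN { printf "%.0f %d\n", c * p, p }'
}

# Convert a size such as 512m or 2g into the byte count stored in memory.max.
size_to_bytes() {
  local size="$1"
  local num="${size%[bkmgBKMG]}" unit="${size##*[0-9.]}"
  local mult
  case "$unit" in
    ""|b|B) mult=1 ;;
    k|K)    mult=1024 ;;
    m|M)    mult=$((1024 * 1024)) ;;
    g|G)    mult=$((1024 * 1024 * 1024)) ;;
    *) echo "unsupported unit: $unit" >&2; return 1 ;;
  esac
  awk -v n="$num" -v m="$mult" 'BEGIN { printf "%.0f\n", n * m }'
}

cpus_to_cpu_max 1.5   # prints: 150000 100000
size_to_bytes 2g      # prints: 2147483648
```

Comparing these values with the contents of cpu.max and memory.max in the container's cgroup directory is a quick way to confirm that a limit was actually applied.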

OCI Runtime Specification

The Open Container Initiative (OCI) defines three specifications: the Runtime Specification (how to run a container), the Image Specification (how to package a container image), and the Distribution Specification (how to push/pull images). The runtime spec defines a JSON config (config.json) that describes namespaces, cgroups, mounts, capabilities, and the process to execute. Any OCI-compliant runtime (runc, crun, youki, kata-runtime) can execute any OCI-compliant image. This is what makes containers portable — not Docker, not Podman, but the OCI spec.

# See which OCI runtime Podman uses for a container
podman create --name test alpine sleep 3600
podman inspect test | jq '.[0].OCIRuntime'
# "crun": kldload uses crun by default (written in C, faster than runc)

# Export the container's root filesystem (the rootfs half of an OCI bundle)
mkdir -p /tmp/bundle
podman export test | tar -C /tmp/bundle -xf -
ls /tmp/bundle/
# bin  dev  etc  home  lib  ...  (the rootfs)

# The runtime config (config.json) is written when the container starts.
# Ask Podman for its path instead of guessing the storage layout.
podman start test
jq '.linux.namespaces' "$(podman inspect --format '{{.OCIConfigPath}}' test)"
# [{"type":"pid"},{"type":"network"},{"type":"mount"},{"type":"ipc"},{"type":"uts"},{"type":"cgroup"}]

2. Podman vs Docker

kldload ships Podman as the default container engine. Podman is a daemonless engine with first-class rootless support, and its command line is compatible with Docker's CLI. You can alias docker to podman and most workflows work unchanged. But the architectural differences matter deeply for security, reliability, and systemd integration.

Architecture: fork-exec vs client-server

Docker uses a client-server model: the CLI talks to a daemon (dockerd) that does all the work. By default the daemon runs as root, and every container operation goes through this single root-owned process. If the daemon dies, running containers go down with it unless live-restore is enabled. If the daemon is compromised, the attacker has root on the host.

Podman uses a fork-exec model: the podman command forks and execs the OCI runtime directly, and a small per-container monitor (conmon) supervises each container. No daemon. No single point of failure. No permanent root process. Lifecycle management is left to the init system (systemd), not to a bespoke daemon that reimplements half of an init system badly.

The deepest difference between Podman and Docker is not technical — it is philosophical. Podman's fork-exec model is more Unix, more secure, and more compatible with systemd. Docker's daemon model was an engineering shortcut that became an architectural constraint.

Feature	Podman	Docker
Architecture	Fork-exec, daemonless	Client-server daemon (dockerd)
Root required	No — rootless by default	Yes — daemon runs as root
Container parent	conmon (per-container monitor)	containerd-shim
Daemon crash	Containers unaffected (no daemon)	All containers may die
OCI runtime	crun (default), runc, kata	runc (default)
systemd integration	Native — Quadlet, socket activation	Restart policies only
Pod support	Native — shares namespaces like K8s	Docker Compose (different model)
SELinux	Full MCS label separation	Supported but less integrated
Docker CLI compat	99%+ — alias docker=podman	N/A

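# Install Podman on kldload (already included in desktop/server profiles)
# podman-plugins carried the old CNI dnsname plugin and is not needed with Netavark
dnf install -y podman buildah skopeo

# Docker compatibility: alias docker to podman, or install podman-docker
dnf install -y podman-docker
# This installs a /usr/bin/docker wrapper that calls podman.
# The Docker-compatible API socket is provided separately by podman.socket.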

# Verify the setup
podman info --format '{{.Host.OCIRuntime.Name}}'
# crun

podman info --format '{{.Store.GraphDriverName}}'
# zfs (on kldload with ZFS root)

# Run a container — identical syntax to docker
podman run --rm -it alpine:3.19 cat /etc/os-release

Docker Compose compatibility

# Podman supports Docker Compose files via podman-compose or the docker-compose binary
# pointing at the Podman socket

# Enable the Podman socket (emulates Docker socket)
systemctl --user enable --now podman.socket

# Verify the socket exists
ls -la /run/user/$(id -u)/podman/podman.sock

# Point docker-compose at Podman
export DOCKER_HOST=unix:///run/user/$(id -u)/podman/podman.sock
docker-compose up -d

# Or use podman-compose directly
pip install podman-compose
podman-compose up -d

3. ZFS Storage Driver

When Podman or Docker runs on a ZFS filesystem, it can use the ZFS storage driver. This is one of the most powerful and least understood container storage configurations. Instead of overlay filesystems (which stack layers using OverlayFS), the ZFS driver creates a ZFS dataset per image layer and a ZFS clone per container. Every layer is a proper ZFS dataset with its own properties, compression, and snapshot lineage.

ZFS vs OverlayFS for container storage

OverlayFS, the default storage driver on most other Linux distributions, is a kernel filesystem that presents a merged view of multiple directories. It has no block checksums, block-level copy-on-write, or transactional writes of its own; integrity is whatever the backing filesystem (typically ext4 or XFS) provides. If power dies mid-write, a layer can be left corrupt. If a bit flips on disk, the corrupted data is usually served silently.

ZFS does none of this. Every block is checksummed. Every write is transactional. Every layer is a proper dataset with its own compression policy, quota, and snapshot lineage. You can snapshot a container, clone it, send it to another host with zfs send, and roll it back — using the same ZFS tools you use for everything else. No special container-specific backup tools needed.

The ZFS storage driver is the single biggest reason to run containers on kldload. You get data integrity, instant clones, and per-layer compression for free — things that OverlayFS cannot provide at any cost.
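
The snapshot and rollback workflow described above can be sketched with two small wrappers. This is an illustration under stated assumptions, not a kldload tool: run, checkpoint, and restore are hypothetical helper names, rpool/example/dataset is a placeholder dataset, and the script only prints the commands unless DRY_RUN=0 is set. It also assumes the container using the dataset is stopped before a rollback.

```shell
#!/usr/bin/env bash
# Print commands by default; set DRY_RUN=0 to execute them for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Take a named snapshot of a dataset before a risky change.
checkpoint() {
  run zfs snapshot "$1@$2"
}

# Discard everything written to the dataset since that snapshot.
restore() {
  run zfs rollback "$1@$2"
}

checkpoint rpool/example/dataset pre-change
restore rpool/example/dataset pre-change
```

Note that zfs rollback only returns to the most recent snapshot unless -r is added, which destroys any later snapshots, so dry-run output is worth reading before switching DRY_RUN off.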

How it works

# When Podman pulls an image, each layer becomes a ZFS dataset
podman pull nginx:alpine

# View the ZFS datasets created
zfs list -r -o name,used,refer,mountpoint rpool/var/lib/containers
# NAME                                                    USED  REFER  MOUNTPOINT
# rpool/var/lib/containers                                856M   128K  /var/lib/containers
# rpool/var/lib/containers/storage                        856M   128K  /var/lib/containers/storage
# rpool/var/lib/containers/storage/zfs                    854M   128K  /var/lib/containers/storage/zfs
# rpool/var/lib/containers/storage/zfs/abc123def456       3.2M  3.2M   legacy   (base layer)
# rpool/var/lib/containers/storage/zfs/789abc012def       28M   28M    legacy   (nginx layer)
# rpool/var/lib/containers/storage/zfs/345ghi678jkl       1.1M  1.1M   legacy   (config layer)

# When you run a container, Podman creates a ZFS clone of the top layer
podman run -d --name web nginx:alpine

# The container's writable layer is a ZFS clone: its ORIGIN column points at a snapshot of the image layer
zfs list -r -o name,origin rpool/var/lib/containers/storage/zfs
# A clone is a writable dataset created from a snapshot (copy-on-write, instant creation, near-zero initial space)

Configure ZFS storage driver

# /etc/containers/storage.conf — kldload sets this automatically on ZFS systems
[storage]
driver = "zfs"

[storage.options.zfs]
# Parent dataset for all container storage
# Podman creates sub-datasets automatically
fsname = "rpool/var/lib/containers/storage/zfs"

# Mountopt controls mount options for container rootfs
mountopt = "nodev"

ZFS advantages for container workloads

# Capture a container's filesystem changes as a new image
podman commit my-container my-snapshot:v1
# With the ZFS driver, the new image layer is stored as its own ZFS dataset

# Instant container cloning — create 100 identical containers in seconds
for i in $(seq 1 100); do
  podman run -d --name "worker-$i" my-image:latest
done
# Each container's writable layer is a ZFS clone — zero copy, instant

# Compression for all container datasets (children inherit the property; only newly written blocks are compressed)
zfs set compression=zstd rpool/var/lib/containers/storage/zfs

# Block-level deduplication across layers (use with caution — RAM hungry)
zfs set dedup=on rpool/var/lib/containers/storage/zfs

# Quota per container (via ZFS dataset properties)
zfs set quota=10G rpool/var/lib/containers/storage/zfs/<container-dataset>

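# Replicate a layer dataset to another host using ZFS send
# (this copies the data only; Podman on the remote host does not register it as an image,
# so move images themselves with skopeo or a registry)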
zfs snapshot rpool/var/lib/containers/storage/zfs/abc123@export
zfs send rpool/var/lib/containers/storage/zfs/abc123@export | \
  ssh remote-host zfs recv rpool/var/lib/containers/storage/zfs/abc123

4. Rootless Containers

Rootless containers run entirely as an unprivileged user. The container process never touches UID 0 on the host. This eliminates an entire class of container escape vulnerabilities — even if an attacker breaks out of the namespace, they land as an unprivileged user. kldload configures rootless Podman by default for non-root users.

Why rootless matters

Rootless containers are one of the most important security advances in container technology. The reason they remain underused in production is that they are harder to set up. Network access requires slirp4netns or pasta instead of direct bridge access. Binding to ports below 1024 requires net.ipv4.ip_unprivileged_port_start=0. Volume mounts need careful UID mapping. The ZFS storage driver in rootless mode requires the ZFS datasets to be owned by the rootless user, which means you need delegated datasets.

None of this is impossible, but Docker's "just run as root" default made everyone lazy. On kldload, all of this is configured out of the box because the security benefit is worth the complexity. A rootless container escape leaves the attacker as the unprivileged user who started the container, or as a subordinate UID such as 100000 that owns nothing on the host.

Docker normalized running everything as root. Podman normalized not doing that. The operational overhead of rootless is real but finite, and kldload handles the setup for you.

User namespace mapping

# View subuid/subgid allocations
cat /etc/subuid
# live:100000:65536
# todd:165536:65536

cat /etc/subgid
# live:100000:65536
# todd:165536:65536

# This means user "todd" gets UIDs 165536-231071 for container use
# Assuming todd is UID 1000 on the host:
# UID 0 inside the container = UID 1000 (todd himself) on the host
# UID 1 inside the container = UID 165536 on the host
# UID 1000 inside the container = UID 166535 on the host

# Verify rootless configuration
podman unshare cat /proc/self/uid_map
#          0       1000          1
#          1     165536      65536

# Run a rootless container
podman run --rm -it alpine id
# uid=0(root) gid=0(root) (root inside the namespace)
# On the host: ps -o uid,pid,cmd shows todd's own UID for the container's root process
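
The kernel's uid_map format is three columns: first ID inside the namespace, first ID outside, and the length of the range. Rootless Podman maps container UID 0 to the invoking user and container UIDs from 1 upward onto the subordinate range. The sketch below applies that rule by hand; map_uid is an illustrative helper, and the sample map assumes a host user with UID 1000 and the subuid range 165536:65536.

```shell
#!/usr/bin/env bash
# Translate a UID inside a user namespace to the host UID, given
# uid_map-style lines: <first-id-inside> <first-id-outside> <count>
map_uid() {
  local target="$1" map="$2" inside outside count
  while read -r inside outside count; do
    [ -n "$count" ] || continue
    if [ "$target" -ge "$inside" ] && [ "$target" -lt $((inside + count)) ]; then
      echo $((outside + target - inside))
      return 0
    fi
  done <<< "$map"
  echo "UID $target is not mapped" >&2
  return 1
}

# Sample map for a rootless user (assumed host UID 1000, subuid 165536:65536)
rootless_map=$'0 1000 1\n1 165536 65536'

map_uid 0 "$rootless_map"      # prints: 1000
map_uid 1000 "$rootless_map"   # prints: 166535
```

A UID outside every range has no host identity at all, which is why files owned by unmapped IDs appear as nobody inside the container.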

Rootless networking with pasta

# Podman 5.x+ uses pasta (from passt) by default instead of slirp4netns
# pasta is faster and supports IPv6 properly

# Check which network backend is active
podman info --format '{{.Host.Pasta.Executable}}'
# /usr/bin/pasta

# Rootless port forwarding — no root required
podman run -d -p 8080:80 nginx:alpine
# pasta sets up port forwarding without iptables (which requires root)

# Allow binding to privileged ports (kldload sets this)
sysctl net.ipv4.ip_unprivileged_port_start=0
# Now rootless containers can bind to port 80, 443, etc.

# Rootless container storage location
# Root: /var/lib/containers/storage/
# Rootless: ~/.local/share/containers/storage/
podman info --format '{{.Store.GraphRoot}}'

ZFS delegated datasets for rootless

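# Create a delegated ZFS dataset for rootless container storage
# Caveat: OpenZFS on Linux does not let non-root users mount or unmount datasets even when
# those permissions are delegated, so verify rootless ZFS storage on your release before relying on it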
zfs create rpool/containers-todd
zfs allow todd create,destroy,mount,snapshot,clone,promote,rename,send,receive rpool/containers-todd
chown todd:todd /rpool/containers-todd

# Configure Podman to use the delegated dataset
# ~/.config/containers/storage.conf (as user todd)
[storage]
driver = "zfs"
graphroot = "/rpool/containers-todd"

[storage.options.zfs]
fsname = "rpool/containers-todd"

5. Image Management

Container images are the unit of distribution. An image is an ordered collection of filesystem layers plus metadata (environment variables, entrypoint, exposed ports). Understanding how images are built, stored, tagged, and distributed is essential for reproducible deployments.

Building images with Buildah

# Buildah is a standalone image builder — no daemon, scriptable, OCI-native
# It can build from Dockerfiles OR from shell scripts

# Build from a Dockerfile (standard approach)
buildah bud -t my-app:v1 -f Containerfile .

# Build from shell script (more flexible, no Dockerfile syntax limitations)
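ctr=$(buildah from registry.access.redhat.com/ubi9/ubi-minimal:latest)
# ubi-minimal ships microdnf, not dnf
buildah run "$ctr" -- microdnf install -y python3 python3-pip
buildah run "$ctr" -- pip3 install flask gunicorn
buildah copy "$ctr" ./app /opt/app
# gunicorn resolves app:app relative to the working directory
buildah config --workingdir /opt/app "$ctr"
# pip run as root places console scripts in /usr/local/bin
buildah config --entrypoint '["/usr/local/bin/gunicorn", "-w", "4", "-b", "0.0.0.0:8000", "app:app"]' "$ctr"
buildah config --port 8000 "$ctr"
buildah config --label maintainer="todd@kldload.com" "$ctr"
buildah commit "$ctr" my-app:v1
buildah rm "$ctr"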

Multi-stage builds

# Containerfile with multi-stage build — compile in one stage, deploy in another
# This produces a minimal final image without build tools

FROM registry.access.redhat.com/ubi9/ubi:latest AS builder
RUN dnf install -y gcc make openssl-devel
COPY src/ /build/
WORKDIR /build
RUN make -j$(nproc) && make install DESTDIR=/output

FROM registry.access.redhat.com/ubi9/ubi-minimal:latest
COPY --from=builder /output/usr/local/bin/myapp /usr/local/bin/myapp
RUN microdnf install -y openssl-libs && microdnf clean all
USER 1001
ENTRYPOINT ["/usr/local/bin/myapp"]

Multi-arch builds

ARM servers (Graviton, Ampere) are often cheaper per vCPU than comparable x86 instances; savings of 30-40% are possible, though pricing varies by cloud and instance family. Apple Silicon developers are running ARM natively. Edge deployments are overwhelmingly ARM. If your CI pipeline does not produce multi-arch images, you are leaving performance and cost savings on the table.

The manifest list is the key abstraction — it is a pointer that says "for amd64, use this digest; for arm64, use this digest." The registry serves the right image for the platform automatically. One tag, multiple architectures, zero client-side complexity.

# Build for multiple architectures and create a manifest list
# Requires qemu-user-static for cross-architecture emulation

dnf install -y qemu-user-static

# Build for amd64 and arm64
podman build --platform linux/amd64,linux/arm64 \
  --manifest my-app:v1 \
  -f Containerfile .

# Inspect the manifest list
podman manifest inspect my-app:v1
# Shows two entries: one for amd64, one for arm64

# Push the manifest list to a registry
podman manifest push --all my-app:v1 \
  docker://registry.example.com/my-app:v1

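# Alternative: build separately and combine
podman build --platform linux/amd64 -t my-app:v1-amd64 .
podman build --platform linux/arm64 -t my-app:v1-arm64 .
# Name the local images with an explicit transport so they are read from local storage, not pulled from a registry
podman manifest create my-app:v1 \
  containers-storage:localhost/my-app:v1-amd64 \
  containers-storage:localhost/my-app:v1-arm64
podman manifest push --all my-app:v1 docker://registry.example.com/my-app:v1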
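
Platform strings in a manifest list use OCI architecture names, which differ from what uname -m prints. Build scripts that choose a --platform value for the local machine usually need a small translation step. The helper below is a sketch; oci_arch is a hypothetical name and the table covers only a handful of common machine types.

```shell
#!/usr/bin/env bash
# Map a `uname -m` machine type to the architecture name used in
# OCI platform strings such as linux/amd64.
oci_arch() {
  local machine="${1:-$(uname -m)}"
  case "$machine" in
    x86_64|amd64)  echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    ppc64le)       echo ppc64le ;;
    s390x)         echo s390x ;;
    riscv64)       echo riscv64 ;;
    *) echo "unknown machine type: $machine" >&2; return 1 ;;
  esac
}

echo "linux/$(oci_arch x86_64)"    # prints: linux/amd64
echo "linux/$(oci_arch aarch64)"   # prints: linux/arm64
```

Called without an argument, the function uses the local machine type, so a script can pass "linux/$(oci_arch)" to podman build --platform for a native-only build.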

If you are building container images and not producing multi-arch manifests, you are already behind. ARM is not the future — it is the present, and the cost savings are too large to ignore.

Image inspection with Skopeo

# Skopeo inspects and copies images without pulling them
# No daemon required — works directly against registries

# Inspect a remote image without downloading
skopeo inspect docker://docker.io/library/nginx:alpine
# Returns JSON with layers, config, labels, architecture

# Copy between registries without local storage
skopeo copy docker://docker.io/library/nginx:alpine \
  docker://registry.local:5000/nginx:alpine

# Copy to a local OCI directory (for air-gapped transfer)
skopeo copy docker://docker.io/library/nginx:alpine \
  oci:/tmp/nginx-alpine:latest

# Copy to a Docker archive (tarball)
skopeo copy docker://docker.io/library/nginx:alpine \
  docker-archive:/tmp/nginx-alpine.tar

6. Container Networking

Container networking is where most of the complexity lives. Each container gets its own network namespace with its own interfaces, IP addresses, and routing table. Connecting containers to each other and to the outside world requires virtual network infrastructure — bridges, veth pairs, NAT rules, DNS. Podman 5.x uses Netavark as its network backend (replacing CNI plugins).

Network modes

# Bridge mode (default) — container gets a veth pair connected to a bridge
podman run -d --name web --network bridge -p 8080:80 nginx:alpine
# Container gets 10.88.0.x, host forwards port 8080 to container port 80

# Host mode — container shares the host's network namespace
podman run -d --name web-host --network host nginx:alpine
# Container binds directly to host ports, no NAT, maximum performance
# WARNING: no network isolation

# None — container has no network (only loopback)
podman run -d --name isolated --network none alpine sleep 3600

# Macvlan — container gets a MAC address directly on the physical network
podman network create -d macvlan \
  --subnet 192.168.1.0/24 --gateway 192.168.1.1 \
  -o parent=eno1 my-macvlan

podman run -d --network my-macvlan --ip 192.168.1.50 nginx:alpine
# Container appears as a real host on the LAN — no NAT, no port mapping

# Create a custom bridge network with specific subnet
podman network create \
  --subnet 10.50.0.0/24 \
  --gateway 10.50.0.1 \
  --dns 10.50.0.1 \
  app-network

podman run -d --network app-network --name api my-api:latest
podman run -d --network app-network --name db postgres:16
# api can reach db at "db:5432" via built-in DNS

Inter-container DNS

# Podman provides automatic DNS resolution for containers on the same network
# Container names resolve to their IP addresses

# Create a network and two containers
podman network create backend
podman run -d --network backend --name redis redis:7-alpine
podman run -d --network backend --name app my-app:latest

# From inside app, redis resolves automatically
podman exec app ping -c 1 redis
# PING redis (10.89.0.2): 56 data bytes
# 64 bytes from 10.89.0.2: seq=0 ttl=64 time=0.043 ms

# The DNS server is aardvark-dns, managed by Netavark
podman network inspect backend | jq '.[0].dns_enabled'
# true

NAT overhead and high-performance alternatives

The default bridge mode uses iptables/nftables DNAT to forward traffic from host ports to container ports. For low-traffic web services this is fine. For high-throughput workloads (databases, message queues, storage services), NAT adds measurable latency and CPU overhead.

The solution depends on your requirements: host networking eliminates the NAT overhead but sacrifices isolation. Macvlan gives you direct LAN access with isolation, but the host cannot reach its own macvlan containers through the parent interface. IPVLAN is similar, except that containers share the parent interface's MAC address. For truly high-performance container networking on kldload, consider macvlan with a dedicated VLAN for container traffic: you get near-native performance with proper network segmentation.

NAT is the silent performance killer in container networking. Most people never measure it, and for web services it does not matter. But for databases or message queues doing 100K+ ops/sec, the difference between NAT and macvlan is measurable and significant.

7. Podman Pods

A Podman pod is a group of containers that share network and (optionally) PID, IPC, and UTS namespaces. This is the same concept as a Kubernetes pod. Containers in a pod communicate over localhost, share the same IP address, and are scheduled as a unit. This makes pods the natural way to co-locate tightly coupled services — like an application and its sidecar proxy, or a web server and a log shipper.

# Create a pod with port mapping
podman pod create --name my-pod -p 8080:80 -p 5432:5432

# Add containers to the pod
podman run -d --pod my-pod --name web nginx:alpine
podman run -d --pod my-pod --name db postgres:16-alpine

# web and db share the same network namespace
# web can reach db at localhost:5432
# external clients reach web at host:8080

# View pod status
podman pod ps
# POD ID        NAME     STATUS   CREATED   INFRA ID     # OF CONTAINERS
# abc123def456  my-pod   Running  2m ago    789ghi012jkl  3

# The "3" includes the infra container, a pause container that holds the namespaces
podman pod inspect --format '{{.InfraContainerID}}' my-pod

Pod YAML and Kubernetes portability

# Generate Kubernetes YAML from a running pod
# (podman generate kube is the older spelling of the same command)
podman kube generate my-pod > my-pod.yaml

# The generated YAML is valid Kubernetes YAML
cat my-pod.yaml
# apiVersion: v1
# kind: Pod
# metadata:
#   name: my-pod
# spec:
#   containers:
#   - name: web
#     image: docker.io/library/nginx:alpine
#     ports:
#     - containerPort: 80
#       hostPort: 8080
#   - name: db
#     image: docker.io/library/postgres:16-alpine

# Deploy the same YAML on Kubernetes
kubectl apply -f my-pod.yaml

# Or play it back on another Podman host
podman kube play my-pod.yaml

# Tear down
podman kube down my-pod.yaml

8. systemd Integration — Quadlet

Quadlet is Podman's native systemd integration. Instead of writing systemd unit files that call podman run, you write declarative .container, .pod, .volume, and .network files. A systemd generator converts these into proper unit files at boot and on every systemctl daemon-reload. The result is containers managed by systemd, with dependency ordering, restart policies, socket activation, and journald logging built in.

Why Quadlet replaces podman generate systemd

Before Quadlet, the standard advice was podman generate systemd — which produced 40-line unit files that were fragile, hard to read, and painful to maintain. Quadlet files are 10-15 lines of declarative configuration. systemd handles the lifecycle. journald handles the logs. You get automatic restarts, dependency ordering, and systemctl status for your containers.

For single-host deployments, small clusters, and edge devices, Quadlet is often all you need. Kubernetes is for multi-host orchestration. Quadlet is for single-host orchestration. Most workloads that people deploy on Kubernetes actually belong on Quadlet.

Quadlet is the correct answer to "how do I run containers in production without Kubernetes." For single-host workloads, it is simpler, more reliable, and easier to debug than any orchestrator.

Quadlet .container file

# /etc/containers/systemd/nginx.container
[Unit]
Description=Nginx web server
After=network-online.target

[Container]
Image=docker.io/library/nginx:alpine
PublishPort=80:80
PublishPort=443:443
Volume=/etc/nginx/conf.d:/etc/nginx/conf.d:ro,Z
Volume=/var/www/html:/usr/share/nginx/html:ro,Z
Volume=/etc/letsencrypt:/etc/letsencrypt:ro,Z
Environment=NGINX_ENTRYPOINT_QUIET_LOGS=1
AutoUpdate=registry
HealthCmd=curl -sf http://localhost/ || exit 1
HealthInterval=30s

[Service]
Restart=always
TimeoutStartSec=120

[Install]
WantedBy=multi-user.target default.target

# /etc/containers/systemd/postgres.container
[Unit]
Description=PostgreSQL 16
After=network-online.target

[Container]
Image=docker.io/library/postgres:16-alpine
PublishPort=5432:5432
Volume=pg-data.volume:/var/lib/postgresql/data:Z
Environment=POSTGRES_PASSWORD_FILE=/run/secrets/pg_password
Secret=pg_password,type=mount
HealthCmd=pg_isready -U postgres
HealthInterval=10s

[Service]
Restart=always
TimeoutStartSec=300

[Install]
WantedBy=multi-user.target

Quadlet .volume file

# /etc/containers/systemd/pg-data.volume
[Volume]
# Driver=local keeps the volume as a directory in Podman's volume store, which on kldload sits on ZFS
Driver=local
Label=app=postgres

Quadlet .pod file

# /etc/containers/systemd/webapp.pod
[Unit]
Description=Web application pod

[Pod]
PodName=webapp
PublishPort=8080:8080

# Containers reference this pod with Pod=webapp.pod

# Reload systemd to pick up Quadlet files
systemctl daemon-reload

# Start the container
systemctl start nginx.service

# View status — full systemd integration
systemctl status nginx.service
# Shows container ID, health status, recent logs

# View logs via journald
journalctl -u nginx.service -f

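# Auto-start at boot comes from the [Install] section of the .container file.
# Generated units cannot be enabled with systemctl enable; edit the file and reload instead.
systemctl daemon-reload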

# Auto-update containers (check registry for new images)
systemctl enable --now podman-auto-update.timer

# Manual update check
podman auto-update --dry-run
podman auto-update
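
Quadlet derives the service name from the file name and type, which is easy to get wrong when typing systemctl commands: a .container file keeps its base name, while volume, network, and pod files get a type suffix. The helper below sketches that naming rule; quadlet_service_name is an illustrative function, not part of Podman.

```shell
#!/usr/bin/env bash
# Print the systemd service name Quadlet generates for a unit file.
quadlet_service_name() {
  local file base ext
  file="$(basename "$1")"
  base="${file%.*}"
  ext="${file##*.}"
  case "$ext" in
    container|kube)                 echo "${base}.service" ;;
    volume|network|pod|image|build) echo "${base}-${ext}.service" ;;
    *) echo "not a Quadlet file: $1" >&2; return 1 ;;
  esac
}

quadlet_service_name /etc/containers/systemd/nginx.container   # prints: nginx.service
quadlet_service_name /etc/containers/systemd/pg-data.volume    # prints: pg-data-volume.service
quadlet_service_name /etc/containers/systemd/webapp.pod        # prints: webapp-pod.service
```

The output can be passed straight to systemctl status or journalctl -u when a unit does not show up under the name you expected.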

Socket activation

# Socket activation starts the container only when a connection arrives
# Perfect for rarely-used services that should not consume resources when idle

# /etc/containers/systemd/dev-tools.container
[Unit]
Description=Development tools (socket activated)

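[Container]
Image=my-dev-tools:latest
# No PublishPort here: systemd owns port 9090 and hands the listening socket to the container

[Service]
# The service starts on the first connection to port 9090.
# The application in the image must accept a socket passed in by systemd.
Type=notify
Restart=on-failure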

[Install]
# No WantedBy — started by socket only

# /etc/systemd/system/dev-tools.socket
[Socket]
ListenStream=9090

[Install]
WantedBy=sockets.target

9. Firecracker MicroVMs

Firecracker is a virtual machine monitor (VMM) built by AWS that creates lightweight microVMs with a minimal device model. Each microVM boots a real Linux kernel in <125ms, uses ~5MB of memory overhead, and provides the hardware isolation of a VM with startup times approaching a container. Firecracker is what powers AWS Lambda and Fargate.

Containers vs microVMs: choosing the right isolation boundary

Containers share the host kernel. A kernel exploit in any container compromises every container on the host. VMs have separate kernels but take seconds to boot and consume hundreds of megabytes per instance. Firecracker gives you a separate kernel in 125ms with 5MB overhead.

The tradeoff is operational complexity — you need to manage VM images (not container images), the network setup is different, and the tooling is less mature. But for multi-tenant workloads where you cannot trust the code running in the container (FaaS, student labs, CI runners), Firecracker or kata-containers are the correct isolation boundary. Containers are for code you trust. MicroVMs are for code you do not trust.

The rule is simple: if you wrote the code or trust the vendor, use a container. If you are running user-submitted code, student projects, or untrusted CI jobs, use a microVM. The 150ms startup penalty is nothing compared to the cost of a kernel escape.
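The rule can be written down as a small dispatch helper. This is only a sketch: the trust labels `trusted` and `untrusted` and the function name `pick_runtime` are hypothetical, and the runtime names `crun` and `kata` are the ones used elsewhere on this page. It prints the command instead of running it.

```shell
# pick_runtime: map a trust label to an OCI runtime name.
# The labels are hypothetical; adapt them to your own policy.
pick_runtime() {
  case "$1" in
    trusted)   echo "crun" ;;   # code you wrote, or a vendor you trust
    untrusted) echo "kata" ;;   # user-submitted code, student labs, CI jobs
    *)         echo "kata" ;;   # unknown label: fail closed to the stronger boundary
  esac
}

# Print (rather than execute) the resulting command line
echo "podman run --runtime $(pick_runtime untrusted) -d user-submitted-code:latest"
```

Defaulting unknown labels to the microVM runtime is a deliberate choice in this sketch: a typo in a label then costs startup time rather than isolation.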

Firecracker on kldload

# Install Firecracker
ARCH=$(uname -m)
FC_VERSION="1.7.0"
curl -fsSL "https://github.com/firecracker-microvm/firecracker/releases/download/v${FC_VERSION}/firecracker-v${FC_VERSION}-${ARCH}.tgz" \
  | tar -xz -C /usr/local/bin --strip-components=1 \
    "release-v${FC_VERSION}-${ARCH}/firecracker-v${FC_VERSION}-${ARCH}" \
    "release-v${FC_VERSION}-${ARCH}/jailer-v${FC_VERSION}-${ARCH}"
ln -sf /usr/local/bin/firecracker-v${FC_VERSION}-${ARCH} /usr/local/bin/firecracker
ln -sf /usr/local/bin/jailer-v${FC_VERSION}-${ARCH} /usr/local/bin/jailer

# Verify KVM is available (required)
ls -la /dev/kvm
# crw-rw-rw- 1 root kvm 10, 232 ... /dev/kvm

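# Prepare a minimal kernel and rootfs
# The kernel must be uncompressed (vmlinux, not vmlinuz)
mkdir -p /var/lib/firecracker
curl -fsSLo /var/lib/firecracker/vmlinux \
  https://s3.amazonaws.com/spec.ccfc.min/img/quickstart_guide/x86_64/kernels/vmlinux.bin

# Create a rootfs from an Alpine container
# --privileged is needed for the loop mount; e2fsprogs provides mkfs.ext4
podman run --rm --privileged -v /var/lib/firecracker:/output:Z alpine sh -c '
  set -e
  apk add --no-cache e2fsprogs
  dd if=/dev/zero of=/output/rootfs.ext4 bs=1M count=512
  mkfs.ext4 /output/rootfs.ext4
  mkdir -p /mnt/rootfs
  mount -o loop /output/rootfs.ext4 /mnt/rootfs
  # Give the new root the same repositories and signing keys as the build container
  mkdir -p /mnt/rootfs/etc/apk
  cp -r /etc/apk/repositories /etc/apk/keys /mnt/rootfs/etc/apk/
  apk add --root /mnt/rootfs --initdb alpine-base openrc
  echo "ttyS0::respawn:/sbin/getty -L ttyS0 115200 vt100" >> /mnt/rootfs/etc/inittab
  umount /mnt/rootfs
'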

Launch a microVM

# Start Firecracker with a Unix socket for API control
firecracker --api-sock /tmp/firecracker.sock &

# Configure the VM via the API
curl --unix-socket /tmp/firecracker.sock -X PUT \
  http://localhost/boot-source \
  -H 'Content-Type: application/json' \
  -d '{
    "kernel_image_path": "/var/lib/firecracker/vmlinux",
    "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"
  }'

curl --unix-socket /tmp/firecracker.sock -X PUT \
  http://localhost/drives/rootfs \
  -H 'Content-Type: application/json' \
  -d '{
    "drive_id": "rootfs",
    "path_on_host": "/var/lib/firecracker/rootfs.ext4",
    "is_root_device": true,
    "is_read_only": false
  }'

curl --unix-socket /tmp/firecracker.sock -X PUT \
  http://localhost/machine-config \
  -H 'Content-Type: application/json' \
  -d '{
    "vcpu_count": 2,
    "mem_size_mib": 256
  }'

# Start the microVM — boots in <125ms
curl --unix-socket /tmp/firecracker.sock -X PUT \
  http://localhost/actions \
  -H 'Content-Type: application/json' \
  -d '{"action_type": "InstanceStart"}'
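The four PUT calls above share the same shape, so they can be folded into two small helpers. This is a sketch under assumptions: the function names `fc_machine_config` and `fc_put` and the `FC_SOCK` variable are made up for illustration, and only the socket path and endpoint names come from the commands above.

```shell
# fc_machine_config: emit the JSON body for PUT /machine-config.
fc_machine_config() {
  printf '{"vcpu_count": %d, "mem_size_mib": %d}\n' "$1" "$2"
}

# fc_put: send a JSON body to a Firecracker API path over the Unix socket.
# FC_SOCK is an assumed variable name; it defaults to the socket path used above.
fc_put() {
  curl --silent --show-error --fail \
    --unix-socket "${FC_SOCK:-/tmp/firecracker.sock}" -X PUT \
    "http://localhost/$1" \
    -H 'Content-Type: application/json' \
    -d "$2"
}

# Example call (needs a running firecracker process):
#   fc_put machine-config "$(fc_machine_config 2 256)"
fc_machine_config 2 256
```

Adding `--fail` makes curl return a non-zero exit status on an HTTP error, which is useful when these calls are chained in a script.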

Kata Containers — OCI-compatible microVMs

# Kata Containers wraps Firecracker (or QEMU) as an OCI runtime
# Podman uses it transparently — same CLI, hardware isolation

dnf install -y kata-containers

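# Configure Podman to use kata as an alternative runtime
# Note: this relies on an OCI command-line runtime binary. Kata 2.x and later center on
# the containerd shim (containerd-shim-kata-v2), so confirm your package still ships one.
# /etc/containers/containers.conf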
[engine.runtimes]
kata = ["/usr/bin/kata-runtime"]

# Run a container with hardware isolation
podman run --rm --runtime kata -it alpine sh
# This boots a real microVM, runs the container inside it, and destroys it
# ~200ms startup vs ~50ms for crun, but full kernel isolation

# Use kata for specific containers, crun for others
podman run --runtime crun -d --name trusted-app my-trusted:v1
podman run --runtime kata -d --name untrusted-job user-submitted-code:latest

10. SELinux & Containers

SELinux provides mandatory access control (MAC) that operates independently of Unix permissions. For containers, SELinux uses Multi-Category Security (MCS) labels to isolate containers from each other and from the host. Each container gets a unique MCS label (e.g., s0:c123,c456), and SELinux enforces that it can only access files and resources with matching labels.

The SELinux container model

SELinux assigns a type and an MCS label to every process and file. Containers run as type container_t. Container files are labeled container_file_t. The policy says container_t can read/write container_file_t. When you bind-mount a host directory, it has the host's SELinux type (e.g., default_t or var_t), and container_t is not allowed to access those types.

The fix is volume labels: the :Z flag tells Podman to relabel the mount with the container's MCS label (private to that container). The :z flag relabels it as shared (accessible by multiple containers). This is the correct fix — not disabling SELinux.

The reflexive response to an SELinux denial is often setenforce 0, which is like disabling your firewall because it blocked something. The actual fix is almost always a volume label (:Z or :z) or a udica-generated policy.

MCS label isolation

# View the SELinux context of a running container
podman run -d --name test1 alpine sleep 3600
podman run -d --name test2 alpine sleep 3600

podman inspect --format '{{.ProcessLabel}}' test1
# system_u:system_r:container_t:s0:c100,c200

podman inspect --format '{{.ProcessLabel}}' test2
# system_u:system_r:container_t:s0:c300,c400

# Different MCS labels — SELinux prevents cross-container access
# Even if a container escapes its namespace, it cannot read files
# labeled with a different MCS category
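To make the label layout concrete, the sketch below pulls the category part out of two labels and compares them. The function names are illustrative, and plain string equality is a simplification: SELinux itself evaluates category sets through its policy rather than comparing strings.

```shell
# mcs_categories: print the last field of an SELinux label
# (user:role:type:sensitivity:categories).
# A label without categories (ending in :s0) would return "s0".
mcs_categories() {
  echo "${1##*:}"    # strip everything up to the last colon
}

# same_mcs: succeed only if two labels carry identical category strings.
same_mcs() {
  [ "$(mcs_categories "$1")" = "$(mcs_categories "$2")" ]
}

a="system_u:system_r:container_t:s0:c100,c200"
b="system_u:system_r:container_t:s0:c300,c400"
if same_mcs "$a" "$b"; then echo "shared categories"; else echo "isolated"; fi
```

With the two example labels from above, the script prints `isolated`.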

Volume mount labels

# WRONG — SELinux will block access
podman run -v /data/app:/data alpine ls /data
# ls: can't open '/data': Permission denied

# RIGHT — :Z relabels the mount for this specific container
podman run -v /data/app:/data:Z alpine ls /data
# (works — /data/app is relabeled to container's MCS label)

# :z — shared label (multiple containers can access)
podman run -v /data/shared:/data:z alpine ls /data

# :Z — private label (only this container)
podman run -v /data/private:/data:Z alpine ls /data

# View the relabeling (-d shows the directory itself rather than its contents)
ls -ldZ /data/app
# drwxr-xr-x. root root system_u:object_r:container_file_t:s0:c100,c200 app

# WARNING: :Z on a host system directory (like /etc or /var) will break the host
# Only use :Z on directories dedicated to the container
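One way to enforce that warning in wrapper scripts is a guard that refuses obvious system paths before a `:Z` mount is built. This is a sketch only: the function name `safe_to_relabel` is hypothetical and the deny-list is an illustrative assumption, not a complete policy.

```shell
# safe_to_relabel: succeed only for paths that are not host system directories.
# The deny-list below is illustrative; extend it for your own layout.
safe_to_relabel() {
  path="${1%/}"                # drop one trailing slash
  [ -n "$path" ] || return 1   # "/" becomes empty: never relabel the root
  case "$path" in
    /etc|/etc/*|/usr|/usr/*|/boot|/boot/*|/root|/root/*|/var|/home) return 1 ;;
    *) return 0 ;;
  esac
}

for p in /data/app /etc /var; do
  if safe_to_relabel "$p"; then echo "$p: ok to use :Z"; else echo "$p: refuse :Z"; fi
done
```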

udica — automatic policy generation

# udica generates SELinux policies from container inspection data
dnf install -y udica

# Run your container and capture its inspection data
podman inspect my-container > container.json

# Generate a policy
udica -j container.json my-container-policy

# Load the policy
semodule -i my-container-policy.cil /usr/share/udica/templates/{base_container.cil,net_container.cil}

# Run the container with the custom policy
podman run --security-opt label=type:my-container-policy.process -d my-image

# This gives the container exactly the permissions it needs — no more
# Much more secure than --privileged or setenforce 0

11. Container Storage on ZFS

Beyond the ZFS storage driver for image layers, containers need persistent storage for data that survives container restarts and rebuilds. On kldload, all persistent container storage sits on ZFS datasets — giving you snapshots, compression, encryption, and replication for container data with zero additional tooling.

Named volumes on ZFS

# Create a named volume
# Podman itself only creates a directory; the data is on ZFS because the
# volumes path sits on a ZFS dataset
podman volume create pg-data

# Inspect the volume
podman volume inspect pg-data
# "Mountpoint": "/var/lib/containers/storage/volumes/pg-data/_data"

# Use the volume
podman run -d --name postgres \
  -v pg-data:/var/lib/postgresql/data:Z \
  -e POSTGRES_PASSWORD=secret \
  postgres:16-alpine

# Snapshot the volume's ZFS dataset
# (assumes pg-data is backed by its own dataset at this path; confirm with
#  zfs list -r rpool/var/lib/containers/storage/volumes, otherwise snapshot the parent dataset)
zfs snapshot rpool/var/lib/containers/storage/volumes/pg-data@before-migration

# Rollback if the migration fails (stop the container first)
zfs rollback rpool/var/lib/containers/storage/volumes/pg-data@before-migration

ZFS datasets as direct mounts

# Create dedicated ZFS datasets for container data
zfs create -o mountpoint=/data/redis \
  -o recordsize=4K \
  -o compression=lz4 \
  -o atime=off \
  rpool/data/redis

zfs create -o mountpoint=/data/postgres \
  -o recordsize=8K \
  -o compression=lz4 \
  -o logbias=throughput \
  rpool/data/postgres

zfs create -o mountpoint=/data/minio \
  -o recordsize=1M \
  -o compression=zstd \
  rpool/data/minio

# Mount with appropriate ZFS tuning per workload
podman run -d --name redis \
  -v /data/redis:/data:Z \
  redis:7-alpine --save 60 1 --dir /data

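# The postgres image refuses to initialize without a superuser password
podman run -d --name postgres \
  -v /data/postgres:/var/lib/postgresql/data:Z \
  -e POSTGRES_PASSWORD=secret \
  postgres:16-alpine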

podman run -d --name minio \
  -v /data/minio:/data:Z \
  -p 9000:9000 \
  minio/minio server /data

Why ZFS recordsize tuning matters for containers

The ZFS recordsize tuning per container workload is where kldload shines compared to running containers on ext4 or XFS. PostgreSQL performs best with 8K recordsize (matching its page size). Redis needs 4K for small key-value pairs. MinIO stores large objects and benefits from 1M recordsize. On ext4, you get one block size for the whole filesystem. On ZFS, each dataset — and therefore each container's storage — gets its own recordsize, compression algorithm, and caching policy.

This is not a micro-optimization. A mismatched recordsize multiplies write amplification. In the worst case (no compression, random page updates), a PostgreSQL database on a 128K recordsize dataset (the ZFS default) rewrites a full 128KB record for every 8KB page update. On an 8K dataset, it writes 8KB. That is up to a 16x difference in write amplification, and it translates directly into disk throughput and latency.
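The arithmetic behind that ratio is short enough to check in the shell. This is a worst-case sketch that assumes no compression and a recordsize at least as large as the page size; the function name `write_amp` is illustrative.

```shell
# write_amp: worst-case bytes written per page update, as a ratio.
# Arguments: recordsize and database page size, both in KiB.
write_amp() {
  echo $(( $1 / $2 ))
}

echo "128K recordsize, 8K pages: $(write_amp 128 8)x"
echo "8K recordsize, 8K pages:   $(write_amp 8 8)x"
```

The first line reports 16x and the second 1x, matching the PostgreSQL comparison in the text.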

Per-workload recordsize tuning is probably the single most impactful ZFS feature for containerized databases. Most people never change it from the 128K default and wonder why their PostgreSQL writes are slow.

tmpfs for ephemeral data

# Use tmpfs for data that does not need to persist
# Faster than any disk — lives in RAM

podman run -d --name build-runner \
  --tmpfs /tmp:rw,size=2g,exec \
  --tmpfs /build:rw,size=10g \
  my-build-image:latest

# tmpfs is perfect for:
# - Build artifacts during CI
# - Temporary caches
# - Session data
# - Test databases that are rebuilt each run

12. Registries & Mirrors

A container registry stores and distributes OCI images. For air-gapped environments, darksite deployments, and performance optimization, running a local registry mirror is essential. kldload's darksite model extends naturally to container images — pull once, serve locally forever.

Running a local registry

# Deploy a local registry on ZFS
zfs create -o mountpoint=/data/registry \
  -o compression=zstd \
  -o recordsize=128K \
  rpool/data/registry

# Quadlet file for the registry
# /etc/containers/systemd/registry.container
[Unit]
Description=Local container registry
After=network-online.target

[Container]
Image=docker.io/library/registry:2
PublishPort=5000:5000
Volume=/data/registry:/var/lib/registry:Z
Environment=REGISTRY_STORAGE_DELETE_ENABLED=true
Environment=REGISTRY_HTTP_HEADERS_Access-Control-Allow-Origin=['*']
HealthCmd=wget -q --spider http://localhost:5000/v2/ || exit 1
HealthInterval=30s

[Service]
Restart=always

[Install]
WantedBy=multi-user.target

Configure registry mirrors

# /etc/containers/registries.conf.d/010-local-mirror.conf
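# Mirror Docker Hub through the local registry

[[registry]]
prefix = "docker.io"
location = "docker.io"

[[registry.mirror]]
location = "registry.local:5000"
# Set insecure = true only if the local registry serves plain HTTP (no TLS)
insecure = false

# Alternative for air-gapped environments: use the entries below instead of the
# docker.io entry above (do not define the same prefix twice). They rewrite every
# pull to the local registry, so no external registry is contacted.
[[registry]]
prefix = "docker.io"
blocked = false
location = "registry.local:5000"

[[registry]]
prefix = "quay.io"
blocked = false
location = "registry.local:5000"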

# Short-name aliases — resolve unqualified names
[aliases]
"nginx" = "registry.local:5000/library/nginx"
"postgres" = "registry.local:5000/library/postgres"
"redis" = "registry.local:5000/library/redis"

Mirroring images for darksite deployment

The air-gapped container registry is the missing piece in most darksite deployments. People focus on OS packages (RPM, APT) and forget that their containerized services need images too. On kldload, the darksite model already handles RPM and APT packages. Adding a container registry mirror is the natural extension.

The workflow is: pull images on a connected host, copy them with skopeo to a portable medium or directly to the local registry, and configure all hosts to resolve image names to the local mirror. The registries.conf approach is cleaner than retagging images, because the original image references in your Quadlet files and Kubernetes manifests stay unchanged.

#!/bin/bash
# mirror-images.sh: mirror a list of images to the local registry.
# Run on a connected host, then copy the results to the air-gapped side.
set -euo pipefail

REGISTRY="registry.local:5000"
IMAGES=(
  "docker.io/library/nginx:alpine"
  "docker.io/library/postgres:16-alpine"
  "docker.io/library/redis:7-alpine"
  "docker.io/library/registry:2"
  "docker.io/grafana/grafana:latest"
  "docker.io/prom/prometheus:latest"
  "docker.io/prom/node-exporter:latest"
)

for img in "${IMAGES[@]}"; do
  echo "Mirroring $img -> $REGISTRY"
  # Strip only the leading docker.io/ so the repository path is preserved
  skopeo copy --all \
    "docker://$img" \
    "docker://$REGISTRY/${img#docker.io/}"
done

echo "Mirrored ${#IMAGES[@]} images to $REGISTRY"
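The script above covers the registry-to-registry case. For the portable-medium case mentioned earlier, a sketch along these lines could generate one archive per image. The helper names and the `MEDIA_DIR` mount point are hypothetical, the commands are printed rather than executed, and multi-arch handling is left out of this sketch.

```shell
# archive_name: turn an image reference into a flat file name.
# Slashes and colons are replaced because the oci-archive transport
# uses ':' to separate the file path from an optional reference.
archive_name() {
  printf '%s.tar\n' "$(printf '%s' "$1" | tr '/:' '__')"
}

# export_cmd: print the skopeo command that would write one image to an archive.
# MEDIA_DIR is a hypothetical mount point for the portable medium.
export_cmd() {
  echo "skopeo copy docker://$1 oci-archive:${MEDIA_DIR:-/mnt/usb/images}/$(archive_name "$1")"
}

export_cmd "docker.io/library/nginx:alpine"
```

On the air-gapped side the direction is reversed: the archive becomes the source and `docker://registry.local:5000/...` the destination.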

If your darksite handles RPM and APT packages but not container images, your deployment is incomplete. The registry mirror closes the gap with zero changes to your Quadlet files or Kubernetes manifests.

Image signing with sigstore

# Sign images with cosign (sigstore)
dnf install -y cosign

# Generate a key pair
cosign generate-key-pair

# Sign an image
cosign sign --key cosign.key registry.local:5000/my-app:v1

# Verify the signature
cosign verify --key cosign.pub registry.local:5000/my-app:v1

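# Configure Podman to require signatures
# (fetching cosign signatures also needs use-sigstore-attachments: true for this
#  registry in a file under /etc/containers/registries.d/)
# /etc/containers/policy.json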
{
  "default": [{"type": "reject"}],
  "transports": {
    "docker": {
      "registry.local:5000": [
        {
          "type": "sigstoreSigned",
          "keyPath": "/etc/pki/containers/cosign.pub",
          "signedIdentity": {"type": "matchRepository"}
        }
      ],
      "docker.io/library": [{"type": "insecureAcceptAnything"}]
    }
  }
}

13. GPU Containers

Running GPU workloads in containers requires the NVIDIA Container Toolkit, which provides a custom OCI runtime hook that mounts the GPU driver and device files into the container at startup. On kldload, this integrates with Podman, CDI (Container Device Interface), and rootless GPU access.

CDI: the modern approach to GPU containers

CDI (Container Device Interface) is a game-changer for GPU containers. Before CDI, GPU access required the nvidia-container-runtime as a wrapper around runc — it intercepted the container start, detected environment variables like NVIDIA_VISIBLE_DEVICES, and injected the driver. It was fragile, version-sensitive, and did not work with rootless Podman.

CDI takes a completely different approach: it generates a static specification of all GPU devices and their required mounts and hooks. The container runtime reads the CDI spec and handles device injection natively. No wrapper runtime and no environment variable magic. It works with crun, runc, and kata, and it works rootless. This is why kldload configures CDI by default.

# Install NVIDIA Container Toolkit (see GPU Masterclass for driver setup)
dnf install -y nvidia-container-toolkit

# Generate CDI specification
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Verify CDI devices
podman run --rm --device nvidia.com/gpu=all \
  nvidia/cuda:12.4.0-base-ubi9 nvidia-smi

# Run a specific GPU (multi-GPU hosts)
podman run --rm --device nvidia.com/gpu=0 \
  nvidia/cuda:12.4.0-base-ubi9 nvidia-smi

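# GPU access in rootless mode (CDI makes this possible)
# Rootless means Podman itself runs as an unprivileged user, so run this without sudo;
# --user only sets the UID/GID of the process inside the container
podman run --rm --user 1000:1000 --device nvidia.com/gpu=all \
  nvidia/cuda:12.4.0-base-ubi9 nvidia-smi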
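On multi-GPU hosts it is sometimes useful to hand a container a subset of devices. The sketch below builds one `--device` flag per GPU index using the CDI names shown above. The function name `gpu_device_flags` is illustrative, and the command is printed rather than run.

```shell
# gpu_device_flags: build one --device flag per GPU index, using CDI device names.
gpu_device_flags() {
  flags=""
  for idx in "$@"; do
    flags="${flags}${flags:+ }--device nvidia.com/gpu=${idx}"
  done
  printf '%s\n' "$flags"
}

# Print the command for GPUs 0 and 2
echo "podman run --rm $(gpu_device_flags 0 2) nvidia/cuda:12.4.0-base-ubi9 nvidia-smi"
```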

# Quadlet with GPU access
# /etc/containers/systemd/ollama.container
[Unit]
Description=Ollama LLM server
After=network-online.target

[Container]
Image=docker.io/ollama/ollama:latest
PublishPort=11434:11434
Volume=ollama-models.volume:/root/.ollama:Z
AddDevice=nvidia.com/gpu=all

[Service]
Restart=always

[Install]
WantedBy=multi-user.target

CDI replaced a fragile runtime-wrapper hack with a clean, declarative device specification. If you are still using NVIDIA_VISIBLE_DEVICES environment variables, you are using the old path — switch to CDI.

14. Monitoring & Logging

Container observability requires visibility into resource usage, health status, logs, and lifecycle events. On kldload, containers log to journald by default (via Podman's journald log driver), and resource metrics are available through Podman's stats API, Prometheus cAdvisor, and eBPF-based tools.

Podman stats and health checks

# Real-time resource usage for all containers
podman stats --no-stream
# ID            NAME        CPU %    MEM USAGE / LIMIT  MEM %    NET I/O          BLOCK I/O
# abc123def456  web         0.12%    15.2MiB / 2GiB     0.74%    12.3kB / 8.1kB   0B / 4.1MB
# 789ghi012jkl  postgres    2.34%    128MiB / 4GiB      3.13%    1.2MB / 856kB    24MB / 156MB

# Health check status
podman healthcheck run web
podman inspect --format '{{.State.Health.Status}}' web
# healthy

# Container events stream
podman events --filter event=start --filter event=die --filter event=health_status
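For a quick threshold check without a metrics stack, the stats output can be filtered with awk. This is a sketch: the function name `over_mem_threshold` is illustrative, and the sample input reuses the two containers from the example output above.

```shell
# over_mem_threshold: read "NAME PERCENT%" lines on stdin and print the names
# whose memory percentage exceeds the threshold passed as $1.
over_mem_threshold() {
  awk -v limit="$1" '{ pct = $2; sub(/%$/, "", pct); if (pct + 0 > limit + 0) print $1 }'
}

# Sample input; on a live host, feed the filter from:
#   podman stats --no-stream --format '{{.Name}} {{.MemPerc}}'
printf 'web 0.74%%\npostgres 3.13%%\n' | over_mem_threshold 2
```

With a threshold of 2, only `postgres` is printed.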

journald logging

# Podman logs to journald by default on kldload
# View logs for a specific container
journalctl CONTAINER_NAME=web --since "1 hour ago"

# Follow logs
journalctl CONTAINER_NAME=postgres -f

# View logs by container ID
journalctl CONTAINER_ID=abc123def456

# Structured log queries
journalctl CONTAINER_NAME=web --output json-pretty | jq '.MESSAGE'

# Configure log driver per container
podman run -d --log-driver journald \
  --log-opt tag="{{.Name}}" \
  --name web nginx:alpine

Prometheus cAdvisor

# cAdvisor exports container metrics for Prometheus scraping
# /etc/containers/systemd/cadvisor.container
[Unit]
Description=cAdvisor container metrics
After=network-online.target

[Container]
Image=gcr.io/cadvisor/cadvisor:v0.49.1
PublishPort=8080:8080
Volume=/:/rootfs:ro
Volume=/var/run:/var/run:ro
Volume=/sys:/sys:ro
Volume=/var/lib/containers:/var/lib/containers:ro
SecurityLabelDisable=true

[Service]
Restart=always

[Install]
WantedBy=multi-user.target

# Prometheus scrape config for cAdvisor
# /etc/prometheus/prometheus.yml (add to scrape_configs)
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics
    scrape_interval: 15s

# Key metrics to alert on:
# container_cpu_usage_seconds_total — CPU usage per container
# container_memory_usage_bytes — memory consumption
# container_network_receive_bytes_total — network ingress
# container_fs_usage_bytes — filesystem usage
# container_oom_events_total — OOM kills (critical)
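Before alert rules exist, the OOM counter can be checked by hand from the metrics endpoint. The sketch below filters Prometheus text-format lines for non-zero `container_oom_events_total` values. The sample input is hypothetical and only shaped like exporter output; the function name is illustrative.

```shell
# oom_offenders: read Prometheus text-format metrics on stdin and print every
# container_oom_events_total series whose value is greater than zero.
oom_offenders() {
  awk '
  /^container_oom_events_total[{ ]/ {
    rest = $0
    if (index(rest, "}") > 0) sub(/^.*[}][ \t]*/, "", rest)  # drop name and labels
    else sub(/^[^ \t]+[ \t]+/, "", rest)                     # no labels: drop the name
    split(rest, f, " ")            # f[1] = value, f[2] = optional timestamp
    if (f[1] + 0 > 0) print
  }'
}

# Hypothetical sample; on a live host use:
#   curl -s http://localhost:8080/metrics | oom_offenders
oom_offenders <<'EOF'
container_oom_events_total{name="web"} 0 1700000000000
container_oom_events_total{name="postgres"} 2 1700000000000
container_memory_usage_bytes{name="postgres"} 134217728
EOF
```

The value is read from the field after the label set rather than the last field, because a sample line may carry a trailing timestamp.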

Loki for container log aggregation

# Promtail ships container logs from journald to Loki
# /etc/promtail/config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: containers
    journal:
      json: false
      max_age: 12h
      labels:
        job: containers
    relabel_configs:
      - source_labels: ['__journal_container_name']
        target_label: 'container'
      - source_labels: ['__journal_container_id']
        target_label: 'container_id'
      - source_labels: ['__journal__hostname']
        target_label: 'host'

15. Troubleshooting Reference

Container problems fall into predictable categories. This table covers the most common issues, their root causes, and the exact commands to diagnose and fix them.

Essential debugging tools

The single most useful troubleshooting command for container issues is podman inspect. It gives you every configuration detail of a container — network settings, mount points, SELinux labels, health check config, environment variables, OCI runtime used, cgroup path, and more. The second most useful is podman logs (or journalctl CONTAINER_NAME=x). Between these two commands, you can diagnose 80% of container problems. The remaining 20% require nsenter to enter the container's namespaces without exec-ing into the container itself (useful when the container has no shell), and ausearch -m avc for SELinux denials.
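The order described above can be captured as a short checklist generator. This is a sketch that only prints the commands for a given container name so they can be reviewed or pasted; the function name `triage_plan` is hypothetical.

```shell
# triage_plan: print the first-pass diagnostic commands for one container.
triage_plan() {
  cat <<EOF
podman inspect $1
podman logs --tail 50 $1
journalctl CONTAINER_NAME=$1 --since "1 hour ago"
ausearch -m avc -ts recent
EOF
}

triage_plan web
```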

Symptom	Likely Cause	Diagnosis & Fix
Permission denied on volume mount	SELinux label mismatch	Add `:Z` (private) or `:z` (shared) to the volume mount. Check `ausearch -m avc -ts recent`.
Container starts then immediately exits	Entrypoint/CMD fails or PID 1 exits	`podman logs <container>` to see the error. Check entrypoint with `podman inspect --format '{{.Config.Entrypoint}}'`.
Cannot pull image (timeout/TLS error)	Network/DNS/registry config issue	`podman pull --log-level debug`. Check `/etc/containers/registries.conf`. Verify DNS with `dig registry.example.com`.
Port binding fails (address already in use)	Host port conflict or stale container	`ss -tlnp | grep :<port>` to find the conflict. `podman ps -a` to check for stopped containers holding ports.
Container OOM killed	Memory limit too low	`podman inspect --format '{{.State.OOMKilled}}'`. Check `journalctl -k --grep oom`. Increase `--memory`.
Rootless container cannot bind port <1024	`ip_unprivileged_port_start` too high	`sysctl net.ipv4.ip_unprivileged_port_start=0`. Make permanent in `/etc/sysctl.d/`.
DNS resolution fails inside container	Network DNS not configured	`podman exec <ctr> cat /etc/resolv.conf`. Check `podman network inspect <net>` for DNS settings.
ZFS storage driver: "dataset busy"	Container or mount still referencing dataset	`podman ps -a` for stopped containers. `lsof +D /var/lib/containers`. `podman system prune`.
Slow container startup	Large image layers or ZFS fragmentation	Check image size: `podman image ls`. Use multi-stage builds. Check the FRAG column of `zpool list` for fragmentation.
Container cannot reach other containers	Containers on different networks	`podman inspect --format '{{.NetworkSettings.Networks}}'` for both containers. Ensure same network name.
Podman socket not found (Docker compat)	Podman socket not enabled	`systemctl --user enable --now podman.socket` (rootless) or `systemctl enable --now podman.socket` (root).
Image build fails with "no space"	ZFS dataset quota or pool full	`zfs list -o name,used,avail,quota`. `podman system prune -a` to reclaim space. Check `zpool list`.

Advanced debugging with nsenter

There is a debugging technique that most container engineers never learn: nsenter. When a container is misbehaving and you need to inspect its network, filesystem, or process state — but the container image has no shell, no curl, no debugging tools — nsenter lets you enter any combination of the container's namespaces from the host. You stay as root on the host with full access to host tools, but you see the container's network namespace, mount namespace, or PID namespace.

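This often goes further than podman exec because, as long as you stay out of the container's mount namespace, you are not limited to the tools inside the container image. It also works when podman exec has nothing to run, provided a process in the container (or a named network namespace under /run/netns) still keeps the namespaces alive.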

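# Enter a container's network namespace from the host (works even if the image has no
# shell, because the mount namespace stays the host's and host tools remain available)
nsenter -t "$(podman inspect --format '{{.State.Pid}}' web)" -n ss -tlnp
# Adding -m also enters the mount namespace; commands then resolve inside the container

# Trace the network syscalls made by a process in a container
podman run --rm -it --cap-add SYS_PTRACE alpine sh -c '
  apk add --no-cache strace &&
  strace -f -e trace=network wget -q -O /dev/null http://example.com'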

# Debug container networking from the host
# Find the container's network namespace (the path differs per container and backend)
NETNS=$(podman inspect --format '{{.NetworkSettings.SandboxKey}}' web)
echo "$NETNS"
# /run/netns/cni-abc123

# Execute commands in the container's network namespace
nsenter --net="$NETNS" ip addr
nsenter --net="$NETNS" ss -tlnp
nsenter --net="$NETNS" iptables -t nat -L -n -v

# Check SELinux denials for a specific container
ausearch -m avc -ts recent | grep container_t

# Full container lifecycle audit
podman events --since 1h --format '{{.Time}} {{.Status}} {{.Name}}'

# Reset everything (nuclear option — development only)
podman system reset
# WARNING: This destroys all containers, images, volumes, and networks

Learn nsenter. It is the single most powerful container debugging tool that almost nobody uses, and it works when podman exec cannot — crashed containers, minimal images, broken shells.

16. Production Patterns

Running containers in production on kldload means combining everything above into coherent patterns. Here are the configurations that work for real workloads.

Hardened container runtime

# /etc/containers/containers.conf — production hardening
[containers]
# Drop all capabilities, add back only what is needed
default_capabilities = [
  "CHOWN",
  "DAC_OVERRIDE",
  "FOWNER",
  "FSETID",
  "KILL",
  "NET_BIND_SERVICE",
  "SETFCAP",
  "SETGID",
  "SETPCAP",
  "SETUID"
]

# Default sysctls applied to every container
default_sysctls = [
  "net.ipv4.ping_group_range=0 0"
]

# Default ulimits
default_ulimits = [
  "nofile=65536:65536",
  "nproc=4096:4096"
]

# Log driver
log_driver = "journald"

# Default timezone
tz = "UTC"

[engine]
# Use crun (faster, written in C)
runtime = "crun"

# Record health check status changes as events
healthcheck_events = true

# Image pull policy
pull_policy = "newer"

[network]
# Use netavark (modern network backend)
network_backend = "netavark"
dns_bind_port = 5353

Backup strategy for containerized workloads

# ZFS snapshots for container volumes — automated with sanoid
# /etc/sanoid/sanoid.conf

[rpool/var/lib/containers/storage/volumes]
  use_template = container-volumes
  recursive = yes

[template_container-volumes]
  frequently = 4
  hourly = 24
  daily = 30
  monthly = 3
  autosnap = yes
  autoprune = yes

# Replicate container volumes to a backup host
syncoid --recursive \
  rpool/var/lib/containers/storage/volumes \
  backup-host:rpool/backup/container-volumes

# Restore a specific volume from snapshot
# (the snapshot name is an example; list real names with zfs list -t snapshot)
zfs rollback rpool/var/lib/containers/storage/volumes/pg-data@autosnap_2026-04-05_hourly

Docker & Podman on ZFS — tutorial-level guide to running containers on ZFS
Containers on ZFS — build recipe for a container host
Serverless / Firecracker — FaaS workloads on microVMs
GPU & NVIDIA Masterclass — GPU passthrough and container GPU access
Kubernetes Masterclass — multi-host container orchestration
systemd Masterclass — deep dive on service management and Quadlet
Keycloak & SELinux Masterclass — SELinux policy in depth
ZFS Masterclass — ZFS fundamentals and tuning
Observability Masterclass — Prometheus, Grafana, and Loki for full-stack monitoring
Packer & IaC Masterclass — infrastructure as code for container hosts
Backup & DR Masterclass — ZFS replication and disaster recovery
