
Containers on ZFS — Docker, Podman & Firecracker.

Containers on overlay2 are disposable by default. Containers on ZFS are disposable by choice. Every layer is a dataset. Every volume is a dataset. Every dataset gets checksums, compression, snapshots, clones, and send/recv. You can snapshot before docker pull, rollback a bad image in seconds, clone a volume for testing at zero cost, and replicate your entire container state to another host with syncoid.

kldloadOS ships with ZFS as the root filesystem. Docker, Podman, and Firecracker all sit on top of it. This page shows you how to configure each one, how to secure them, and how to use ZFS to do things that overlay2 cannot.

1. Docker on ZFS storage driver

By default, Docker uses overlay2. On kldloadOS, you switch it to the ZFS storage driver. Each container layer becomes a ZFS dataset. Each image layer becomes a ZFS dataset. Docker manages the datasets automatically — you just tell it to use ZFS.

Configure Docker to use ZFS

# Stop Docker before repointing its data root, then create a dedicated dataset
systemctl stop docker
zfs create -o mountpoint=/var/lib/docker rpool/docker
zfs create rpool/docker/volumes

# Configure the storage driver
mkdir -p /etc/docker
cat > /etc/docker/daemon.json <<'EOF'
{
  "storage-driver": "zfs",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "default-ulimits": {
    "nofile": { "Name": "nofile", "Hard": 65536, "Soft": 65536 }
  }
}
EOF

# Restart Docker
systemctl restart docker

# Verify
docker info | grep -A5 'Storage Driver'
# Storage Driver: zfs
#  Zpool: rpool
#  Zpool Health: ONLINE
#  Parent Dataset: rpool/docker

# Check ZFS datasets created by Docker
zfs list -r rpool/docker | head -20

overlay2 stacks files like transparencies on an overhead projector. ZFS stacks datasets in a copy-on-write tree. Both layer images. Only one gives you checksums, compression, and instant snapshots of every layer.

What ZFS gives Docker

With the ZFS storage driver, every docker pull creates ZFS datasets for each image layer. Every docker run creates a ZFS clone for the container's writable layer. This means:

# Snapshot recursively before pulling a new image (Docker's layers are
# child datasets, so a non-recursive snapshot would miss them)
zfs snapshot -r rpool/docker@before-pull-$(date +%F)
docker pull nginx:latest
# Something wrong with the new image? Stop Docker, roll back every dataset.
systemctl stop docker
zfs list -H -o name -r rpool/docker | \
    xargs -I{} zfs rollback {}@before-pull-2026-03-23
systemctl start docker

# See compression savings on container data
zfs get compressratio rpool/docker
# rpool/docker  compressratio  2.83x  -
# That's 2.83x compression on all image layers. Free disk space.

# Clone a volume for testing (instant, zero disk cost)
zfs snapshot rpool/docker/volumes/myapp-data@test
zfs clone rpool/docker/volumes/myapp-data@test rpool/docker/volumes/myapp-data-test

# Replicate Docker state to another host
syncoid -r rpool/docker root@node2:rpool/docker

2. Podman — rootless, daemonless

Podman runs containers without a daemon and without root. It's CLI-compatible with Docker (alias docker=podman and most scripts work). On kldloadOS, Podman uses ZFS for storage just like Docker does.

Podman on ZFS

# Install Podman (already included in kldloadOS desktop/server profiles)
dnf install -y podman

# Rootless setup — runs as your user, no daemon, no root
podman info | grep -A3 graphDriver
# graphDriverName: zfs

# Run a container (same syntax as Docker)
podman run -d --name web -p 8080:80 nginx:alpine

# Podman Compose (drop-in for docker compose)
dnf install -y podman-compose
podman-compose up -d

# Generate a systemd unit from a running container
podman generate systemd --name web --files --new
mv container-web.service ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user enable --now container-web.service

# Rootless means: no daemon to crash, no root to exploit,
# and the container runs as your UID with user namespaces

Docker is a taxi service with a dispatcher (the daemon). Podman is a rental car — you drive it yourself, no dispatcher needed. If the dispatcher crashes, all Docker containers are orphaned. If Podman crashes, nothing else is affected.

When to use Podman vs Docker

Use Podman when you want rootless containers, systemd integration, no long-running daemon, or you're running on a machine where Docker's daemon model is a liability (single-user servers, CI runners, embedded systems).

Use Docker when you need Docker Compose with full feature parity, Docker Swarm, or your team's tooling depends on the Docker API socket.

Use both — they coexist on kldloadOS. Same images, same registries, same OCI format.
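"Same images" is literal: an image built by one runtime runs unchanged on the other. A sketch using the docker-daemon transport to copy a locally built image into Podman's store (the myapp name is illustrative):

```shell
# Build with Docker, copy the OCI image into Podman's store, run with Podman
docker build -t myapp:latest .
podman pull docker-daemon:myapp:latest
podman run -d --name myapp-rootless myapp:latest
```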

3. Private registry on ZFS

A private registry stores your images locally. On ZFS, the registry data is compressed, checksummed, snapshottable, and replicable. No cloud registry fees, no egress charges, no dependency on someone else's infrastructure.

Set up a local registry

# Create a ZFS dataset for the registry
zfs create -o compression=zstd rpool/srv/registry

# Run the registry
docker run -d --restart=always --name registry \
    -p 5000:5000 \
    -v /srv/registry:/var/lib/registry \
    registry:2

# Tag and push an image
docker tag myapp:latest localhost:5000/myapp:latest
docker push localhost:5000/myapp:latest

# Pull from the registry (from any host on the network)
# Plain HTTP: each client needs "insecure-registries": ["192.168.1.10:5000"]
# in its daemon.json, or set up TLS as shown below
docker pull 192.168.1.10:5000/myapp:latest

# Snapshot the registry before changes
ksnap /srv/registry

# Replicate the registry to another host
syncoid rpool/srv/registry root@node2:rpool/srv/registry

# Check compression savings
zfs get compressratio rpool/srv/registry
# Container images compress well — expect 2-4x with zstd

Docker Hub is a library where someone else decides when books go missing. A private registry on ZFS is your own bookshelf — checksummed, backed up, and nobody removes images because they changed their pricing model.

TLS and authentication

# Generate a self-signed cert (or use Let's Encrypt)
# The SAN extension matters: modern clients reject certs with only a CN
mkdir -p /srv/registry/certs /srv/registry/auth
openssl req -x509 -nodes -days 3650 -newkey rsa:4096 \
    -keyout /srv/registry/certs/registry.key \
    -out /srv/registry/certs/registry.crt \
    -subj "/CN=registry.local" \
    -addext "subjectAltName=DNS:registry.local"

# Create htpasswd auth
dnf install -y httpd-tools
htpasswd -Bc /srv/registry/auth/htpasswd admin

# Run registry with TLS + auth
docker run -d --restart=always --name registry \
    -p 5000:5000 \
    -v /srv/registry/data:/var/lib/registry \
    -v /srv/registry/certs:/certs \
    -v /srv/registry/auth:/auth \
    -e REGISTRY_HTTP_TLS_CERTIFICATE=/certs/registry.crt \
    -e REGISTRY_HTTP_TLS_KEY=/certs/registry.key \
    -e REGISTRY_AUTH=htpasswd \
    -e REGISTRY_AUTH_HTPASSWD_REALM="kldload Registry" \
    -e REGISTRY_AUTH_HTPASSWD_PATH=/auth/htpasswd \
    registry:2

# Login from any host. Self-signed certs must first be trusted:
# copy registry.crt to /etc/docker/certs.d/registry.local:5000/ca.crt
docker login registry.local:5000

4. Compose patterns for common stacks

Real Compose files for real stacks. All volumes on ZFS. All images pinned. All health checks defined. Copy, adjust, deploy.

Web + Database + Cache

# docker-compose.yml — typical web application stack
services:
  web:
    image: nginx:1.27-alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./html:/usr/share/nginx/html:ro
    depends_on:
      app:
        condition: service_healthy
    restart: unless-stopped

  app:
    image: node:22-alpine
    working_dir: /app
    volumes:
      - ./app:/app
    command: node server.js
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
    environment:
      - DATABASE_URL=postgresql://app:secret@db:5432/myapp
      - REDIS_URL=redis://cache:6379
    restart: unless-stopped

  db:
    image: postgres:16-alpine
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: app
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app -d myapp"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  cache:
    image: redis:7-alpine
    command: redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
    restart: unless-stopped

volumes:
  pgdata:
    driver: local

# Create a ZFS dataset for the postgres volume with tuned recordsize
zfs create -o recordsize=16k rpool/docker/volumes/pgdata

# Snapshot before deploying
ksnap /var/lib/docker/volumes/pgdata

# Deploy
docker compose up -d

A Compose file is a recipe. Pin the image versions (ingredients) and define health checks (doneness tests). Without pinned versions, your recipe changes every time you run it. Without health checks, you have no idea if it's done.
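To make the pinned ingredients literal, you can pin images by digest instead of tag; a tag can move, a digest cannot. A sketch (the digest your registry prints is whatever it returns, not a known value):

```shell
# Resolve a tag to its immutable digest, then pin the Compose file to it
docker pull nginx:1.27-alpine
docker image inspect --format '{{index .RepoDigests 0}}' nginx:1.27-alpine
# Paste the printed nginx@sha256:... reference into the image: field
```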

5. Firecracker microVMs vs Docker containers

Docker containers share the host kernel. Firecracker microVMs each get their own kernel in a lightweight VM that boots in under 125 milliseconds. Different isolation levels, different use cases, same ZFS storage underneath.

Docker containers

Isolation: cgroups + namespaces (process-level)
Boot time: milliseconds
Overhead: near zero (shares host kernel)
Use for: web apps, databases, caches, microservices, CI/CD
Risk: kernel exploit in container = host compromise

Firecracker microVMs

Isolation: hardware virtualization (VM-level)
Boot time: <125ms
Overhead: ~5MB per microVM
Use for: untrusted code, CI runners, serverless functions, sandboxing
Risk: VM escape required for host compromise (much harder)

Firecracker on kldloadOS

# Download Firecracker (pin the release: "latest" and a versioned
# filename drift apart over time)
ARCH=$(uname -m)
VER=v1.9.1
curl -L "https://github.com/firecracker-microvm/firecracker/releases/download/${VER}/firecracker-${VER}-${ARCH}.tgz" | \
    tar xz -C /usr/local/bin --strip-components=1
# The tarball ships versioned binary names; link the one we call below
ln -sf /usr/local/bin/firecracker-${VER}-${ARCH} /usr/local/bin/firecracker

# Prepare rootfs on ZFS
zfs create rpool/srv/firecracker
# (copy or build your rootfs.ext4 and vmlinux kernel here)

# Snapshot the clean rootfs — rollback after each run
zfs snapshot rpool/srv/firecracker@clean

# Launch a microVM
firecracker --api-sock /tmp/fc.sock --config-file config.json

# After the workload completes, rollback to pristine state
zfs rollback rpool/srv/firecracker@clean
# Next microVM starts with a perfectly clean filesystem. Every time.

A Docker container is a room with a locked door in a shared building. A Firecracker microVM is a separate building with its own foundation. The room is cheaper and faster to build. The separate building is harder to break into.
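The config.json referenced above describes the kernel, root drive, and machine size. A minimal sketch, assuming the vmlinux and rootfs.ext4 you prepared under /srv/firecracker (field names follow Firecracker's VM config schema):

```shell
# Write a minimal Firecracker VM config (paths assume /srv/firecracker)
cat > config.json <<'EOF'
{
  "boot-source": {
    "kernel_image_path": "/srv/firecracker/vmlinux",
    "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"
  },
  "drives": [
    {
      "drive_id": "rootfs",
      "path_on_host": "/srv/firecracker/rootfs.ext4",
      "is_root_device": true,
      "is_read_only": false
    }
  ],
  "machine-config": {
    "vcpu_count": 1,
    "mem_size_mib": 256
  }
}
EOF
```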

6. Resource limits — cgroups v2

kldloadOS uses cgroups v2 (unified hierarchy). Every container gets explicit resource limits. No container should be able to starve the host or other containers of CPU, memory, or I/O.

Resource limits in practice

# Memory limit: container is OOM-killed if it exceeds 512MB
docker run -d --memory=512m --memory-swap=512m myapp

# CPU limit: container gets at most 1.5 CPU cores
docker run -d --cpus=1.5 myapp

# CPU shares: relative weight (default 1024)
docker run -d --cpu-shares=512 myapp    # half priority
docker run -d --cpu-shares=2048 myapp   # double priority

# I/O limit: cap write throughput to 50MB/s
docker run -d --device-write-bps /dev/zd0:50mb myapp

# PID limit: prevent fork bombs
docker run -d --pids-limit=100 myapp

# In Compose:
services:
  app:
    image: myapp:latest
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 1G
        reservations:
          cpus: '0.5'
          memory: 256M

cgroups are the building code for containers. Without limits, one container can consume all the RAM and crash everything. With limits, each container has a budget and cannot exceed it.
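You can verify what the kernel actually enforces by reading the container's cgroup v2 files directly. A sketch, assuming the systemd cgroup driver and a running container named myapp:

```shell
# Look up the container ID, then read its cgroup v2 limit files
CID=$(docker inspect --format '{{.Id}}' myapp)
CG=/sys/fs/cgroup/system.slice/docker-${CID}.scope
cat ${CG}/memory.max   # bytes, or "max" if unlimited
cat ${CG}/cpu.max      # "<quota> <period>" in microseconds
cat ${CG}/pids.max
```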

7. Networking

Containers need to talk to each other, to the host, and to the outside world. Docker provides bridge, host, macvlan, and overlay networks. Pick the right one.

Bridge (default)

Containers get a private IP on a virtual bridge. Port mapping (-p 80:80) exposes services to the host network. Containers on the same bridge resolve each other by name. Good for most workloads.

docker network create mynet
docker run -d --network mynet --name web nginx
docker run -d --network mynet --name app myapp
# app can reach web via: http://web:80

Macvlan (LAN IP)

Each container gets its own IP on the physical LAN. No port mapping, no NAT. The container appears as a separate host to the rest of the network. One caveat: the host itself cannot reach its own macvlan containers through the parent interface (a kernel restriction); other machines on the LAN can. Good for services that need to be discoverable on the LAN (NFS, DNS, DHCP).

docker network create -d macvlan \
    --subnet=192.168.1.0/24 \
    --gateway=192.168.1.1 \
    -o parent=eth0 lannet
docker run -d --network lannet \
    --ip 192.168.1.50 --name dns \
    pihole/pihole

WireGuard overlay

Connect containers across hosts using WireGuard tunnels. Each host runs WireGuard, containers route through it. Encrypted, fast, and works across the internet. Good for multi-site container clusters without Kubernetes.

# On each host, set up WireGuard
# (see WireGuard Masterclass page)
# Then route container traffic through wg0
docker network create \
    --subnet=10.10.0.0/16 \
    -o com.docker.network.bridge.name=wg-br \
    wg-containers
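The network create above only makes the local bridge; each host still needs a route sending the other hosts' container subnets into the tunnel. A sketch, assuming 10.10.0.0/16 is split per host, node2's containers live in 10.10.2.0/24, and wg0 is already up (the subnets are illustrative):

```shell
# Route node2's container subnet over the WireGuard interface
ip route add 10.10.2.0/24 dev wg0
# Let this host forward packets between wg0 and the container bridge
sysctl -w net.ipv4.ip_forward=1
```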

8. Container security

A container is only as secure as its configuration. Default Docker settings are more permissive than they should be. Lock them down.

Security hardening checklist

# 1. Run as a non-root UID (rootless Podman additionally maps it to
#    your unprivileged account via user namespaces)
podman run --user 1000:1000 myapp

# 2. Read-only root filesystem
docker run --read-only --tmpfs /tmp --tmpfs /run myapp

# 3. Drop ALL capabilities, add only what's needed
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myapp

# 4. No new privileges (prevent setuid binaries)
docker run --security-opt=no-new-privileges myapp

# 5. Seccomp profile (restrict syscalls)
docker run --security-opt seccomp=strict-profile.json myapp

# 6. AppArmor profile
docker run --security-opt apparmor=docker-custom myapp

# 7. Resource limits (prevent resource exhaustion)
docker run --memory=512m --cpus=1.0 --pids-limit=100 myapp

# 8. Non-root user in Dockerfile
# FROM alpine:3.19
# RUN adduser -D -u 1000 appuser
# USER appuser
# CMD ["/app/server"]

# 9. Scan images for vulnerabilities
trivy image myapp:latest

# 10. Never use --privileged unless you have a specific,
# documented reason. --privileged gives the container
# full access to the host. It defeats the purpose of
# containerization.

A container without security hardening is a locked door with the key taped to the frame. The lock exists. The security does not. Drop capabilities, read-only rootfs, non-root user — do all three, every time.
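Putting the checklist together, a hardened baseline invocation might look like this (myapp and its flag values are illustrative; add back only the capabilities your app proves it needs):

```shell
# Checklist items 2, 3, 4, 7, and 8 in one invocation
docker run -d --name myapp \
    --read-only --tmpfs /tmp --tmpfs /run \
    --cap-drop=ALL \
    --security-opt=no-new-privileges \
    --memory=512m --cpus=1.0 --pids-limit=100 \
    --user 1000:1000 \
    myapp:latest
```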

9. ZFS advantages for containers

Here is what you can do with ZFS under your containers that you cannot do with overlay2, ext4, or XFS.

Snapshot before docker pull

A bad image can break your stack. Snapshot the entire Docker dataset before pulling. If the new image causes problems, zfs rollback restores every layer, every volume, every container to the exact state before the pull.

Clone volumes for testing

Need to test a database migration? zfs clone the production volume. The clone is instant, shares blocks with the original, and costs zero disk space until the test writes diverge. Delete it when done.
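Concretely, a migration test against a clone might look like this (dataset and container names are illustrative; the clone mounts under /var/lib/docker/volumes like its parent):

```shell
# Clone production data, run a throwaway Postgres against it
zfs snapshot rpool/docker/volumes/pgdata@pre-migration
zfs clone rpool/docker/volumes/pgdata@pre-migration \
    rpool/docker/volumes/pgdata-test
docker run -d --name pg-test \
    -v /var/lib/docker/volumes/pgdata-test:/var/lib/postgresql/data \
    -e POSTGRES_PASSWORD=secret postgres:16-alpine
# Run the migration against pg-test, inspect, then discard everything:
docker rm -f pg-test
zfs destroy rpool/docker/volumes/pgdata-test
zfs destroy rpool/docker/volumes/pgdata@pre-migration
```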

Compression on all layers

ZFS compresses every image layer and every volume with zstd. Container images are highly compressible (text files, binaries, libraries). Expect 2-4x compression. That is 2-4x more images on the same disk.

Checksums on all data

ZFS checksums every block of every container layer and volume. Silent data corruption (bit rot) is detected and auto-repaired from redundancy. overlay2 on ext4 does not checksum anything. A corrupt image layer is served silently.
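Detection is passive (every read is verified); you can also force a full pass with a scrub, which walks every block in the pool:

```shell
# Verify every checksum in the pool; repairs come from mirror/raidz redundancy
zpool scrub rpool
zpool status rpool   # shows scan progress and any repaired or errored blocks
```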

Send/recv containers

Replicate your entire Docker state to another host with syncoid -r rpool/docker root@node2:rpool/docker. All images, all volumes, all layers. Incremental. Efficient. No Docker registry needed.

Rollback bad images

Pulled nginx:latest and it broke your TLS config? Stop the container, zfs rollback to before the pull, start the container. You are back to the exact previous image. Seconds, not a rebuild.

Containers are not magic. They are Linux namespaces, cgroups, and a layered filesystem. The filesystem matters. overlay2 gives you layers. ZFS gives you layers plus checksums, compression, snapshots, clones, and replication. Same containers, different foundation. The containers don't know the difference. You will.

Run Docker for the ecosystem. Run Podman for rootless. Run Firecracker for isolation. Run all three on ZFS and stop worrying about silent corruption, irreversible upgrades, and volumes you cannot replicate.