kldload — your platform, your way, free

ZFS on root. Every distro. Automatic. Identical.

Every kldload install creates the same ZFS dataset hierarchy. The pool layout, compression settings, snapshot policies, and mount points are identical across every supported distro — CentOS, Debian, Ubuntu, Fedora, Rocky, RHEL, Arch, and Alpine. Switch distros and your ZFS muscle memory transfers completely. The pool does not care which kernel is above it.

The thesis: Storage is the foundation everything else sits on. If your storage is wrong, everything above it (your app, your backups, your security, your recovery plan) is built on sand. kldload bets the entire platform on ZFS because ZFS checksums every block, compresses transparently, snapshots atomically, replicates natively, and encrypts per-dataset, all in one coherent system. Most other storage stacks need a separate tool bolted on for each of those features.

The kldload installer does not ask you how to set up storage. It creates the correct layout, sets the correct properties, configures the correct snapshot policies, and gets out of your way. You can override everything. But the defaults are right.

This is the most important page in the platform section. kldload's dataset layout is not arbitrary. Every dataset exists for a specific operational reason, and understanding those reasons is the difference between "I have ZFS" and "I use ZFS correctly."

ZFS on root — the foundation

Every kldload install — regardless of distro, profile, or topology — puts the operating system directly on ZFS. There is no ext4 boot partition. There is no LVM layer. There is no mdraid array. The disk is partitioned into exactly two parts: a 512MB EFI System Partition (FAT32) and the rest of the disk as a ZFS pool named rpool.

The partitioning is handled by sgdisk in the installer. On a single-disk install:

# Single-disk layout: EFI + rpool
sgdisk -n1:1M:+512M -t1:EF00 -c1:"EFI System Partition" /dev/sda
sgdisk -n2:0:0      -t2:BF01 -c2:"KLDload rpool"        /dev/sda

Partition 1 is the ESP — formatted FAT32, mounted at /boot/efi. This holds the EFI bootloader (ZFSBootMenu) and nothing else. Partition 2 is the entire remaining disk, used as the sole vdev in the rpool pool. On multi-disk topologies, only the boot disk gets an ESP; the data disks are used raw as pool vdevs.
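
The listings on this page hard-code /dev/sda2, but the partition suffix depends on the disk name. A minimal sketch of how a script might derive the partition node, assuming the usual Linux naming rule (a "p" is inserted when the disk name ends in a digit). The helper name part_path is hypothetical and is not taken from the kldload installer.

```shell
# part_path DISK N: print the device node for partition N of DISK.
# Linux inserts a "p" when the disk name ends in a digit (nvme0n1, mmcblk0).
part_path() (
  disk=$1
  num=$2
  case "$disk" in
    *[0-9]) printf '%s\n' "${disk}p${num}" ;;
    *)      printf '%s\n' "${disk}${num}" ;;
  esac
)

part_path /dev/sda 2        # /dev/sda2
part_path /dev/nvme0n1 2    # /dev/nvme0n1p2
```

For production use, /dev/disk/by-id paths are generally a safer choice than sdX names, since they do not change when disks are reordered.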

There is no separate /boot partition. ZFSBootMenu reads kernels and initramfs directly from the ZFS dataset. This eliminates the classic ZFS-on-root headache where GRUB needs a separate ext4 /boot because it cannot read ZFS (or can only read old pool versions). ZFSBootMenu is an EFI application that loads the ZFS kernel module in its own initramfs, imports the pool, reads the kernel from the root dataset, and kexec's into it. One EFI partition. One ZFS pool. No intermediate filesystem.

Boot pool vs root pool

Some ZFS-on-root guides create two pools: a small bpool for /boot and a larger rpool for everything else. This exists to work around GRUB's limited ZFS feature support. kldload does not use GRUB. kldload uses ZFSBootMenu, which supports all modern ZFS features. Therefore kldload creates a single pool: rpool. One pool. No feature restrictions. No compatibility hacks.

The rpool pool is created with -R /target (altroot) during installation, which temporarily mounts everything under /target. After installation completes, the pool is exported and the target reboots. ZFSBootMenu imports the pool at boot with the correct mountpoints.

ESP handling

The EFI System Partition is 512MB FAT32. It holds ZFSBootMenu's EFI binary, the ZFSBootMenu initramfs (which contains the ZFS kernel module), and a fallback kernel. On multi-disk topologies, only the boot disk (KLDLOAD_DISK, typically /dev/sda) has an ESP. The rpool lives on separate data disks. This means the boot disk can be a small SSD while the data disks are large spinners or NVMe.

# What lives on the ESP
/boot/efi/
  EFI/
    BOOT/
      BOOTX64.EFI          # ZFSBootMenu EFI binary
    kldload/
      vmlinuz-zbm           # ZFSBootMenu kernel
      initramfs-zbm.img     # ZFSBootMenu initramfs (contains zfs.ko)

Default dataset layout

The desktop and server profiles create a specific dataset hierarchy. Every dataset exists for a specific operational reason. This is not a guess. This is the result of running ZFS in production across hundreds of machines and learning what breaks when the layout is wrong.

| Dataset | Mount | Purpose |
|---|---|---|
| rpool | none | Pool root. Container only: canmount=off, never mounted directly. All properties set here are inherited by child datasets. |
| rpool/ROOT | none | Boot environment container. canmount=off. Holds one child per boot environment. ZFSBootMenu reads org.zfsbootmenu:commandline from this dataset. |
| rpool/ROOT/<host> | / | Active boot environment. Your running OS. canmount=noauto: only mounted when selected as the active BE. Snapshottable, cloneable, rollbackable. |
| rpool/home | /home | User homes. Separate dataset that survives root rollbacks. Each user gets their own child dataset automatically. |
| rpool/home/<user> | /home/<user> | Per-user home. Created at install time for the primary user. Allows per-user quotas, snapshots, and replication. |
| rpool/root | /root | Root user's home. Isolated from the system root dataset so root's files survive OS rollback. |
| rpool/srv | /srv | Service data. Web roots, databases, app state. Snapshotted every 15 minutes by Sanoid. |
| rpool/opt | /opt | Third-party software. Separate so custom installs survive OS rollback. |
| rpool/usr | /usr | Container dataset (canmount=off). Exists to provide a mountpoint namespace for rpool/usr/local. |
| rpool/usr/local | /usr/local | Locally installed software. Separate so custom binaries and libraries survive OS rollback. |
| rpool/var | /var | Variable data container (canmount=off). Namespace parent for var subdatasets. |
| rpool/var/cache | /var/cache | Package cache. Survives rollbacks so you do not re-download packages after reverting. |
| rpool/var/lib | /var/lib | Application state. Docker images, libvirt storage, package databases. Persists across root rollbacks. |
| rpool/var/log | /var/log | Logs. Persists across root rollbacks, so you can always see what happened before the rollback. |
| rpool/var/spool | /var/spool | Mail queues, print queues, cron state. Isolated so spool data survives OS changes. |
| rpool/var/tmp | /var/tmp | Persistent temp files. Mode 1777. Excluded from meaningful snapshot policies. |
| rpool/tmp | /tmp | Temporary files. sync=disabled, setuid=off, exec=off, devices=off: performance and security hardening impossible with a directory. |

Why separate datasets instead of separate partitions?

On ext4, you partition the disk: 50GB for root, 200GB for home, 100GB for var. Guess wrong and root fills up while home has 150GB free. Resizing means booting a rescue disk and running resize2fs. With ZFS datasets, all of them share the pool's free space. /home can use 500GB if /var only needs 20GB. No partitions. No guessing. No resizing.

Why these specific datasets?

Each one exists for rollback isolation. When you roll back root (rpool/ROOT/host), everything else survives:

  • /home — your users' data survives OS rollback
  • /var/log — the logs that explain why you rolled back survive the rollback
  • /srv — your application data survives OS rollback
  • /var/lib — Docker images, database state, libvirt configs survive rollback
  • /var/cache — package cache survives so you do not re-download everything
  • /opt — third-party software you compiled or installed survives rollback
  • /usr/local — custom binaries survive rollback
  • /tmp — has sync=disabled, exec=off, setuid=off — performance and security hardening that is impossible with a directory

Rolling back the OS should not destroy your work. Rolling back the OS should not hide the evidence of why you needed to. That is the layout.
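
The rollback-isolation rule can be checked mechanically. The sketch below is an illustration only: it lists the mountpoints of the datasets that are actually mounted in the default layout (the canmount=off containers rpool/usr and rpool/var are left out on purpose) and reports whether a path would be reverted with the boot environment. The names SEPARATE_MOUNTS and rollback_fate are hypothetical.

```shell
# Mountpoints of datasets that are really mounted in the default layout.
# /usr and /var themselves are canmount=off containers, so anything directly
# under them that is not listed here belongs to the root dataset.
SEPARATE_MOUNTS="/root /home /srv /opt /usr/local /var/cache /var/lib /var/log /var/spool /var/tmp /tmp"

# rollback_fate PATH: print "survives" if PATH lives on its own dataset,
# "reverted" if it is part of the boot environment.
rollback_fate() (
  path=$1
  for m in $SEPARATE_MOUNTS; do
    case "$path" in
      "$m"|"$m"/*) echo survives; exit 0 ;;
    esac
  done
  echo reverted
)

for p in /etc/fstab /var/log/messages /usr/bin/ssh /usr/local/bin/tool; do
  echo "$p: $(rollback_fate "$p")"
done
```

On a live system, the authoritative answer comes from the mounted filesystems themselves rather than from a hard-coded list like this one.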

This is the exact code from storage-zfs.sh that creates the hierarchy:

# Root dataset hierarchy
zfs create -o canmount=off -o mountpoint=none rpool/ROOT
zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/${KLDLOAD_HOSTNAME}
zfs mount rpool/ROOT/${KLDLOAD_HOSTNAME}

# Data datasets
zfs create -o mountpoint=/root      rpool/root
zfs create -o mountpoint=/home      rpool/home
zfs create -o mountpoint=/home/todd rpool/home/todd   # per-user dataset
zfs create -o mountpoint=/srv       rpool/srv
zfs create -o mountpoint=/opt       rpool/opt

# /usr namespace
zfs create -o canmount=off -o mountpoint=/usr rpool/usr
zfs create -o mountpoint=/usr/local rpool/usr/local

# /var namespace
zfs create -o canmount=off -o mountpoint=/var rpool/var
zfs create -o mountpoint=/var/cache rpool/var/cache
zfs create -o mountpoint=/var/lib   rpool/var/lib
zfs create -o mountpoint=/var/log   rpool/var/log
zfs create -o mountpoint=/var/spool rpool/var/spool
zfs create -o mountpoint=/var/tmp   rpool/var/tmp

# /tmp — hardened: no suid, no exec, no devices, async writes
zfs create -o mountpoint=/tmp -o sync=disabled \
  -o setuid=off -o exec=off -o devices=off rpool/tmp

Core profile: design your own. The desktop and server profiles create this layout automatically. The core profile does not. It gives you ZFS on root with rpool/ROOT/hostname and nothing else. You design your own dataset hierarchy. This is for people who know exactly what they want — database servers with custom recordsizes, storage nodes with specific vdev layouts, appliances with single-purpose datasets. Core is the blank canvas. The layout above is the opinionated default for everyone else.

kldload's ZFS defaults — and why each was chosen

Every property set on the pool at creation time is inherited by every child dataset unless explicitly overridden. These are the properties kldload sets, and the engineering reason behind each one.

compression=lz4
Always on. LZ4 compresses at several hundred MB/s per core and decompresses at multiple GB/s. On modern CPUs, compressing and writing less data to disk is usually faster than writing uncompressed data, so you gain both space and speed. There is rarely a reason to turn this off; the usual exception is data that is already compressed or encrypted. The real question is whether to use lz4 (fast, default) or zstd (slower, better ratio for cold storage).
ashift=12
4K sector alignment. All modern SSDs and HDDs use 4096-byte physical sectors. Some report 512-byte sectors for backwards compatibility, which causes ZFS to use ashift=9 and misalign every write. kldload forces ashift=12 to guarantee alignment. Set at pool creation — cannot be changed later. Getting this wrong halves SSD performance.
autotrim=on
Automatic TRIM for SSDs. When ZFS frees blocks, the SSD is informed immediately. This maintains write performance over time and extends SSD lifespan. On HDDs this is a no-op. On SSDs it is essential.
acltype=posixacl
POSIX ACLs enabled. Required by systemd, containers, and most Linux services. Without this, systemd-tmpfiles fails, journald fails, and Docker refuses to start. Not optional on Linux.
xattr=sa
Extended attributes stored in the dnode. The alternative (xattr=dir) stores xattrs as hidden files in a directory, which requires extra disk lookups. xattr=sa stores them directly in the inode's bonus buffer — one read instead of many. SELinux labels, ACLs, and file capabilities all use xattrs. This makes them fast.
dnodesize=auto
Automatic dnode sizing. ZFS can use 512-byte or larger dnodes (up to 16KB). Larger dnodes store more metadata (xattrs, system attributes) inline, avoiding extra I/O. auto lets ZFS choose the optimal size per file. Required for efficient xattr=sa on files with many extended attributes.
relatime=on
Reduce atime writes. Standard atime updates the access timestamp on every read — turning every read into a write. relatime only updates atime if the file was modified since last access or the atime is older than 24 hours. Eliminates 90%+ of atime writes while keeping tmpwatch and mutt happy.
normalization=formD
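Unicode normalization. Ensures filenames with accented characters are compared consistently. Prevents the invisible bug where café with é stored as two codepoints (e plus a combining accent) and café with é stored as one precomposed codepoint are different files on disk but look identical to humans. Fixed when a dataset is created and cannot be changed afterward, so kldload sets it at pool creation and every child dataset inherits it.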
mountpoint=none
Pool root is not mounted. The pool root dataset (rpool) is a container only. Child datasets set their own mountpoints. This prevents the pool root from shadowing the real root filesystem.

The exact zpool create command from the installer:

zpool create -f \
  -o ashift=12 \
  -o autotrim=on \
  -O acltype=posixacl \
  -O canmount=off \
  -O compression=lz4 \
  -O dnodesize=auto \
  -O normalization=formD \
  -O relatime=on \
  -O xattr=sa \
  -O mountpoint=none \
  -R /target \
  rpool /dev/sda2

Every one of these properties was learned the hard way. ashift=9 on a 4K SSD halves write speed and you cannot fix it without destroying the pool. xattr=dir makes SELinux and ACLs unusably slow. atime=on generates terabytes of useless writes on file servers. normalization not being set means Unicode filenames silently create duplicates that break backup scripts. These are not preferences. These are the correct settings for Linux. Any ZFS guide that omits them is incomplete.
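
The normalization problem is easy to reproduce without ZFS. The sketch below builds the two UTF-8 byte sequences that both render as "café" and shows that a plain byte comparison treats them as different names. With normalization=formD, ZFS normalizes names before comparing them, so both resolve to the same file. The variable and function names are illustrative.

```shell
# Two byte sequences that both render as "café".
nfc=$(printf 'caf\303\251')    # é as one codepoint, U+00E9
nfd=$(printf 'cafe\314\201')   # e followed by combining acute accent, U+0301

# byte_len STRING: length in bytes, independent of the current locale.
byte_len() { printf '%s' "$1" | wc -c | tr -d ' '; }

echo "precomposed: $(byte_len "$nfc") bytes"   # 5
echo "decomposed:  $(byte_len "$nfd") bytes"   # 6

if [ "$nfc" = "$nfd" ]; then
  echo "same name"
else
  echo "different names unless the filesystem normalizes"
fi
```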

Pool design — topologies and disk selection

kldload supports four pool topologies. The web UI presents them as choices during install. The topology determines how many disks are needed, how redundancy works, and how the rpool vdev is constructed.

single — one disk

The boot disk has two partitions: EFI (part 1) and rpool (part 2). No redundancy. Suitable for VMs, dev machines, and any system with external backups. This is the default for single-disk installs.

sgdisk -n1:1M:+512M -t1:EF00 /dev/sda
sgdisk -n2:0:0      -t2:BF01 /dev/sda
# rpool on partition 2
rpool_vdevs=( /dev/sda2 )

mirror — two disks

EFI on the boot disk. rpool is a mirror across two data disks. Survives one disk failure. Read performance scales because ZFS can read from both sides of the mirror. Write performance is that of a single disk. The standard choice for servers.

# EFI on /dev/sda, rpool mirrors sdb+sdc
rpool_vdevs=( mirror /dev/sdb /dev/sdc )

raidz1 — three or four disks

Single-parity RAID across 3+ data disks. Survives one disk failure. Better space efficiency than mirrors (66-75% usable). Sequential read/write performance scales with disk count. Random read IOPS limited — do not use for database primary storage.

# EFI on /dev/sda, rpool raidz1 across sdb-sde
rpool_vdevs=( raidz1 /dev/sdb /dev/sdc /dev/sdd /dev/sde )
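
The 66-75% figure follows from raidz1 giving up one disk's worth of capacity to parity, so the usable share is (n-1)/n. A rough sketch of that arithmetic for the four topologies, using whole percentages and ignoring metadata and padding overhead. The function name usable_percent is hypothetical.

```shell
# usable_percent TOPOLOGY DATA_DISKS: whole-percent share of raw capacity
# left after redundancy. Ignores metadata, padding, and reserved space.
usable_percent() (
  topology=$1
  disks=$2
  case "$topology" in
    single)        echo 100 ;;
    mirror)        echo $((100 / disks)) ;;
    raidz1)        echo $((100 * (disks - 1) / disks)) ;;
    mirror-stripe) echo 50 ;;
    *)             echo "unknown topology: $topology" >&2; exit 1 ;;
  esac
)

echo "raidz1, 3 disks: $(usable_percent raidz1 3)%"   # 66%
echo "raidz1, 4 disks: $(usable_percent raidz1 4)%"   # 75%
echo "mirror, 2 disks: $(usable_percent mirror 2)%"   # 50%
```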

mirror-stripe (RAID10) — four disks

Two mirrored pairs striped together. Best random I/O performance. Survives one disk failure per mirror pair. 50% space efficiency. The choice for database servers, KVM hypervisors, and any IOPS-sensitive workload.

# EFI on /dev/sda, rpool RAID10 across sdb-sde
rpool_vdevs=(
  mirror /dev/sdb /dev/sdc
  mirror /dev/sdd /dev/sde
)
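
One way to picture how a topology name turns into a vdev list is a small function that checks the disk count and prints the arguments for zpool create. This is a sketch under assumptions, not the installer's code: the function name vdev_spec is hypothetical, and the accepted disk counts are taken from the section headings above.

```shell
# vdev_spec TOPOLOGY DISK...: print the vdev arguments for zpool create,
# or fail if the number of disks does not fit the topology.
vdev_spec() (
  topology=$1
  shift
  case "$topology:$#" in
    single:1)          echo "$1" ;;
    mirror:2)          echo "mirror $1 $2" ;;
    raidz1:3|raidz1:4) echo "raidz1 $*" ;;
    mirror-stripe:4)   echo "mirror $1 $2 mirror $3 $4" ;;
    *) echo "cannot build $topology from $# disk(s)" >&2; exit 1 ;;
  esac
)

vdev_spec mirror-stripe /dev/sdb /dev/sdc /dev/sdd /dev/sde
# mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde
```

Rejecting a mismatched disk count before calling zpool create is cheaper than discovering it after the disks have been partitioned.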

ashift auto-detection

kldload always forces ashift=12 (4K sectors). ZFS can auto-detect the physical sector size, but many drives lie about it. Samsung SSDs report 512-byte sectors when they use 4K internally. WD Reds report 512e (512-byte emulated). If ZFS trusts these drives and uses ashift=9, every 4K physical write requires a read-modify-write cycle on the drive controller. Performance drops 30-50%. Since ashift cannot be changed after pool creation, kldload always sets 12. Using ashift=12 on a true 512-byte drive wastes at most a few percent of space. Using ashift=9 on a 4K drive wastes half your IOPS permanently.
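
ashift is simply the base-2 logarithm of the sector size ZFS assumes, which is why 512-byte sectors give 9 and 4096-byte sectors give 12. A minimal sketch of that relationship; the function name ashift_for is illustrative, and the sysfs paths in the comments assume a disk named sda.

```shell
# ashift_for BYTES: print log2 of a power-of-two sector size.
ashift_for() (
  size=$1
  bits=0
  while [ "$size" -gt 1 ]; do
    size=$((size / 2))
    bits=$((bits + 1))
  done
  echo "$bits"
)

ashift_for 512     # 9
ashift_for 4096    # 12

# What a drive reports to the kernel (may not reflect the real flash page
# or platter sector size, which is the reason for forcing ashift=12):
#   cat /sys/block/sda/queue/physical_block_size
#   cat /sys/block/sda/queue/logical_block_size
```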

Special vdev acceleration

kldload supports an optional special vdev for metadata and small-block acceleration. If you provide one or two fast NVMe drives as KLDLOAD_ZFS_SPECIAL_DISKS, the installer adds a special vdev to the pool after creation. ZFS stores metadata (directory entries, file sizes, block pointers) and optionally small files on the special vdev instead of the main data vdevs.

# Two NVMe drives as mirrored special vdev
zpool add rpool special mirror /dev/nvme0n1 /dev/nvme1n1

# Direct small blocks (64K and smaller) to the special vdev
zfs set special_small_blocks=65536 rpool

A special vdev on NVMe in front of a raidz1 of spinning disks transforms performance. Metadata lookups that normally require seeking across 4+ HDDs complete in microseconds on NVMe. Directory listings, find commands, ls -la — all become instant. The main vdevs handle large sequential I/O, which is what HDDs are good at.

If you have fast NVMe and a bunch of HDDs, the special vdev is the single biggest performance improvement you can make. It turns a NAS-grade pool into something that feels like an SSD for daily operations while keeping HDD capacity for bulk storage. The metadata is the bottleneck on spinning disk pools. Move the metadata to NVMe and the bottleneck vanishes. Always mirror the special vdev: it holds the pool's metadata, so losing an unmirrored special vdev loses the entire pool.
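
The special_small_blocks threshold is inclusive: a data block whose size is at or below the threshold is allocated on the special vdev, and anything larger goes to the main vdevs. A small sketch of that decision, with a hypothetical function name, useful mainly for sanity-checking a threshold against a dataset's recordsize.

```shell
# block_target BLOCK_BYTES THRESHOLD_BYTES: where a data block is allocated.
# Blocks at or below special_small_blocks go to the special vdev.
block_target() (
  block=$1
  threshold=$2
  if [ "$block" -le "$threshold" ]; then
    echo special
  else
    echo main
  fi
)

block_target 16384 65536     # special
block_target 131072 65536    # main
```

One consequence worth noting: if a dataset's recordsize is lowered to the threshold or below, every data block of that dataset qualifies, and the special vdev can fill up quickly.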

Snapshots and replication

kldload pre-configures Sanoid for automatic snapshot management on desktop and server profiles. Sanoid handles creation and pruning. Syncoid handles replication. Both are installed and configured at install time. You never have to think about snapshot rotation unless you want to change the defaults.

Snapshot tiers

Factory snapshot
Taken at install completion: rpool@install-20260404-1530. Your known-good baseline. Never auto-pruned. This is your "as installed" state — you can always get back to day zero.
Package snapshots
Before every dnf or apt operation. Keep last 10. Auto-pruned by Sanoid. Undo any package install, upgrade, or removal.
Boot environment hourly
rpool/ROOT snapshotted every hour. Keep 48 (2 days). Roll back to any hour in the last 2 days.
Service data (/srv)
Snapshotted every 15 minutes. Keep last 4. Application data is never more than 15 minutes stale.
Home directories
Snapshotted hourly. Keep 48 (2 days), matching the production template. User data recoverable at hourly granularity.

Default sanoid.conf

# /etc/sanoid/sanoid.conf — kldload default

[rpool/ROOT]
  use_template = production
  recursive = yes

[rpool/home]
  use_template = production
  recursive = yes

[rpool/srv]
  use_template = frequent
  recursive = yes

[rpool/var/log]
  use_template = production

[rpool/var/lib]
  use_template = production

#---- Templates ----
[template_production]
  frequently = 0
  hourly = 48
  daily = 30
  monthly = 6
  yearly = 0
  autosnap = yes
  autoprune = yes

[template_frequent]
  frequently = 4
  hourly = 48
  daily = 30
  monthly = 6
  yearly = 0
  autosnap = yes
  autoprune = yes

The retention numbers are deliberate. 48 hourly snapshots means you can roll back to any hour in the last 2 days. 30 daily snapshots means daily granularity for a month. 6 monthly snapshots reach back half a year. 4 frequent snapshots on /srv at 15-minute intervals means app data is never more than 15 minutes stale. Old snapshots are pruned by systemd timers, so you never have to think about it.
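
The retention windows are plain multiplication: the number of snapshots kept times the interval between them. A short sketch of that arithmetic using the values from the templates above; the function name window_hours is illustrative.

```shell
# window_hours COUNT PERIOD_MINUTES: how far back COUNT snapshots reach.
window_hours() ( echo $(( $1 * $2 / 60 )) )

echo "frequent: $(window_hours 4 15) h"      # 4 x 15 min = 1 h
echo "hourly:   $(window_hours 48 60) h"     # 48 x 1 h   = 48 h
echo "daily:    $(window_hours 30 1440) h"   # 30 x 24 h  = 720 h

# Steady-state snapshot count per dataset under template_frequent:
# frequently + hourly + daily + monthly.
echo "snapshots kept: $((4 + 48 + 30 + 6))"  # 88
```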

All of this is overridable. Want hourly snapshots for 90 days? Change the retention number. Want snapshots every 5 minutes for a database dataset? Add a dataset-specific policy with frequent_period = 5 and frequently = 12, which keeps one hour of 5-minute snapshots. The defaults are sane. The overrides are a line or two.

Replication with Syncoid

Syncoid replicates datasets to any ZFS target: another kldload node, TrueNAS, any Linux with OpenZFS. Initial full send, then incremental (only changed blocks). Over SSH, over WireGuard, over anything that can carry a byte stream.

# Replicate /srv to a backup host (one run; see the cron entry below for the 15-minute schedule)
syncoid rpool/srv backup-host:backup/srv

# Replicate recursively (all child datasets)
syncoid -r rpool/home backup-host:backup/home

# Cron job: replicate every 15 minutes over WireGuard
*/15 * * * * /usr/sbin/syncoid --no-sync-snap rpool/srv wg-backup:backup/srv

# Manual zfs send/recv for full control
zfs send -R rpool/srv@snap | ssh backup "zfs recv -F backup/srv"

# Incremental (only changes since the last sync)
zfs send -R -I rpool/srv@old rpool/srv@new | ssh backup "zfs recv backup/srv"

# Encrypted replication (ciphertext only — receiver cannot read)
zfs send -w rpool/srv@snap | ssh backup "zfs recv backup/srv"

Replication over WireGuard is the endgame. Both modules are in the kernel. ZFS sends checksummed blocks. WireGuard encrypts the transport. The data is verified at the source, encrypted in flight, and verified again at the destination. No agent. No daemon. No intermediate storage. No vendor. Two kernel modules piped together. That is the entire DR strategy.
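
An incremental send needs a snapshot that exists on both sides to use as its base. Syncoid works this out for you; the sketch below only illustrates the idea with plain lists of snapshot names, oldest first. The function name latest_common and the snapshot names are hypothetical. On a real system the lists could come from zfs list -H -t snapshot -o name -s creation.

```shell
# latest_common "SRC_SNAPS" "DST_SNAPS": print the newest snapshot name that
# exists on both sides. Each argument is a space-separated list, oldest first.
latest_common() (
  found=
  for s in $1; do
    for d in $2; do
      if [ "$s" = "$d" ]; then found=$s; fi
    done
  done
  if [ -n "$found" ]; then
    echo "$found"
  else
    echo "no common snapshot: a full send is required" >&2
    exit 1
  fi
)

base=$(latest_common "hourly-01 hourly-02 hourly-03" "hourly-01 hourly-02")
echo "zfs send -I rpool/srv@$base rpool/srv@hourly-03"
```

The sketch prints the command instead of running it, so it can be reviewed before anything is sent.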

zfs send -w is the enterprise feature nobody talks about. The -w flag sends raw encrypted blocks. The receiving machine stores ciphertext it cannot decrypt. You can replicate to an untrusted host — a cloud provider, a colo, a partner's datacenter — and they physically cannot read your data. The encryption key never leaves the source. This is data sovereignty at the storage layer.

Boot environments

A boot environment is a bootable dataset under rpool/ROOT, typically a clone of a snapshot of rpool/ROOT/<hostname>. Before every kernel update, major package upgrade, or other risky change, create a new boot environment. If the change breaks the system, select the previous BE in ZFSBootMenu and boot it. Total downtime: one reboot.

kldload uses ZFSBootMenu as its bootloader. ZFSBootMenu scans all datasets under rpool/ROOT and presents them as bootable entries. The bootfs property on the pool controls which one boots by default.

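# Create a boot environment before a risky upgrade
# (a clone does not copy the origin's local properties, so set them explicitly)
zfs snapshot rpool/ROOT/myhost@before-upgrade
zfs clone -o canmount=noauto -o mountpoint=/ \
  rpool/ROOT/myhost@before-upgrade rpool/ROOT/myhost-backup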

# List all boot environments
zfs list -r -o name,used,creation rpool/ROOT

# Roll back to the previous boot environment
zpool set bootfs=rpool/ROOT/myhost-backup rpool
# Reboot — ZFSBootMenu boots the backup BE automatically

# Or use ZFSBootMenu interactively at boot (press ESC during boot countdown)
# Select any BE from the menu, boot it, done

# After confirming the old BE works, promote it so it no longer depends on the broken one
zfs promote rpool/ROOT/myhost-backup

# Clean up the broken BE (-r also removes snapshots taken of it after the clone point)
zfs destroy -r rpool/ROOT/myhost

This is what makes ZFS on root transformative for operations. On ext4, a bad kernel update means booting a rescue ISO, chrooting, and manually downgrading packages. On ZFS, it means pressing a key at boot and selecting the previous snapshot. The entire recovery is one reboot and one menu selection. This is why every kldload install puts the OS on ZFS — not for the compression, not for the checksumming, but for the ability to undo any change to the operating system in 30 seconds.

The hourly snapshots of rpool/ROOT mean you always have a recent known-good state. The factory snapshot means you can always get back to the freshly installed state. The package snapshots mean you can undo the specific dnf update that broke things. Layers of safety, all automatic, all free.

ZFSBootMenu kernel commandline

kldload sets the kernel commandline via a ZFS property so it is stored in the pool itself, not in a separate bootloader config file:

# Set during install — inherited by all boot environments
zfs set org.zfsbootmenu:commandline="rw console=tty1 console=ttyS0,115200" rpool/ROOT

# console=tty1       — VGA output for physical machines
# console=ttyS0      — serial console for Proxmox/KVM/headless
# Both are set so it works everywhere without reconfiguration
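
When several console= arguments are given, the kernel prints messages to all of them, and the last one listed becomes /dev/console, which is where the interactive login and emergency shell appear. A small sketch that extracts that last console from a command line string; the function name primary_console is illustrative.

```shell
# primary_console "CMDLINE": print the device named by the last console=
# argument, with any ",baud" options stripped.
primary_console() (
  last=
  for word in $1; do
    case "$word" in
      console=*) last=${word#console=} ;;
    esac
  done
  echo "${last%%,*}"
)

primary_console "rw console=tty1 console=ttyS0,115200"   # ttyS0
```

With the default shown above, the serial port wins. On a machine where the emergency shell should land on the physical display instead, the order of the two arguments can be swapped.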

Storage for virtual machines

kldload's desktop and server profiles install KVM with ZFS zvols as VM storage. A zvol is a ZFS dataset that presents itself as a raw block device at /dev/zvol/rpool/vms/name. QEMU sees a block device, not a file. There is no qcow2 layer. No double copy-on-write. ZFS handles snapshots, clones, compression, and replication transparently.

# Create a 100GB zvol for a VM
zfs create -V 100G -o volblocksize=64K rpool/vms/webserver

# The zvol appears as a block device
ls -la /dev/zvol/rpool/vms/webserver
# lrwxrwxrwx 1 root root 12 Apr  4 15:30 /dev/zvol/rpool/vms/webserver -> ../../../zd0

# Snapshot before upgrade — instant, O(1)
zfs snapshot rpool/vms/webserver@before-upgrade

# Upgrade goes wrong — rollback in milliseconds
zfs rollback rpool/vms/webserver@before-upgrade

# Clone a VM — instant, zero disk space used initially
zfs snapshot rpool/vms/webserver@template
zfs clone rpool/vms/webserver@template rpool/vms/webserver-staging
# New VM is ready immediately — shares all blocks with parent via COW

# Replicate a VM to another host
zfs send -R rpool/vms/webserver@snap | ssh dr-host "zfs recv backup/vms/webserver"

volblocksize tuning

The volblocksize is the ZFS block size for the zvol. It should match or be a multiple of the guest filesystem's block size, and it can only be set when the zvol is created. The default is 8K (16K since OpenZFS 2.2), but for VM workloads:

volblocksize=64K
Best for general-purpose Linux VMs. Matches the typical ext4 block allocation group and gives ZFS good compression ratios. kldload's default for rpool/vms.
volblocksize=16K
Better for database VMs (PostgreSQL 8K pages, MySQL 16K pages). Reduces write amplification at the cost of slightly lower compression ratio.
volblocksize=128K
Good for Windows VMs with NTFS. Large blocks match NTFS cluster allocation. Better sequential performance.
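
The write amplification mentioned above can be estimated with a deliberately simplified model: when the guest writes less than one volblocksize, ZFS has to rewrite the whole block, so the ratio is roughly volblocksize divided by the write size. The sketch ignores compression, caching, and write coalescing, and the function name write_amp is hypothetical.

```shell
# write_amp VOLBLOCK_KB IO_KB: approximate data rewritten per guest write,
# as a whole-number ratio. Writes at or above volblocksize count as 1x.
write_amp() (
  vol=$1
  io=$2
  if [ "$io" -ge "$vol" ]; then
    echo 1
  else
    echo $((vol / io))
  fi
)

echo "8K page on 64K zvol:  $(write_amp 64 8)x"    # 8x
echo "8K page on 16K zvol:  $(write_amp 16 8)x"    # 2x
echo "16K page on 16K zvol: $(write_amp 16 16)x"   # 1x
```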

The instant clone is the killer feature for VM management. Need to test a kernel upgrade on your production database? Clone the VM in under a second. The clone shares all existing data with the original via copy-on-write. Run your test. If it works, promote the clone. If it fails, destroy it. Zero risk. Zero wait time. Try doing that with qcow2 files on ext4.

Storage for containers

Docker and Podman on kldload use the ZFS storage driver. Each container layer is a ZFS dataset. Each container gets its own dataset. Copy-on-write is native — no overlayfs, no devicemapper, no loopback files. The ZFS driver is the most efficient storage backend for containers on Linux when you already have ZFS.

# Docker ZFS storage driver configuration
# /etc/docker/daemon.json (set automatically by kldload)
{
  "storage-driver": "zfs",
  "storage-opts": [
    "zfs.fsname=rpool/var/lib/docker"
  ]
}

# Verify Docker is using ZFS
docker info | grep "Storage Driver"
# Storage Driver: zfs

# Each image layer is a ZFS dataset
zfs list -r rpool/var/lib/docker | head
# NAME                                              USED  AVAIL  REFER  MOUNTPOINT
# rpool/var/lib/docker                              4.2G   180G   24K   legacy
# rpool/var/lib/docker/abc123...                    1.1G   180G  1.1G   legacy
# rpool/var/lib/docker/def456...                    256M   180G  256M   legacy

# Snapshot all container storage
zfs snapshot -r rpool/var/lib/docker@backup-$(date +%Y%m%d)

# Per-container datasets mean per-container quotas
# No single container can fill the disk

The ZFS storage driver provides real copy-on-write at the filesystem level. When Docker creates a new container from an image, ZFS clones the image dataset. The clone is instant and uses zero additional space until the container writes data. Overlayfs achieves similar COW semantics but at the VFS layer with significant overhead for deep layer stacks. The ZFS driver eliminates that overhead entirely.

Podman rootless on ZFS

Podman also uses the ZFS driver when configured. Each rootless user gets containers under their own ZFS dataset subtree. Per-user quotas are enforced at the ZFS level, not the container runtime level. A misbehaving container owned by user A cannot fill the disk for user B.

# Podman per-user ZFS storage (-p creates the intermediate .local and .local/share datasets)
zfs create -p rpool/home/todd/.local/share/containers
podman info --format '{{.Store.GraphDriverName}}'
# zfs

COW efficiency

Run 50 containers from the same base image. On overlayfs, each container copies metadata for every layer. On ZFS, all 50 containers are clones of the same dataset — they share every unchanged block. The 50th container uses the same disk space as the first. Only bytes that differ are stored separately. This is real COW, not VFS-layer COW.

NFS and iSCSI — sharing ZFS storage

ZFS datasets can be shared via NFS with a single property. ZFS zvols can be shared via iSCSI with targetcli. No separate volume manager needed. No LVM. No mdraid. The ZFS pool is the storage layer, NFS/iSCSI is the transport.

NFS exports

# Share a dataset via NFS — one command
zfs set sharenfs="rw=@10.0.0.0/24,no_root_squash" rpool/srv/shared

# ZFS writes /etc/exports.d/zfs.exports automatically
# exportfs picks it up — no manual /etc/exports editing

# Verify the export
showmount -e localhost
# /srv/shared  10.0.0.0/24

# Client mount
mount -t nfs server:/srv/shared /mnt/shared

# Performance tuning for NFS on ZFS
zfs set recordsize=128K rpool/srv/shared    # match NFS rsize/wsize
zfs set sync=standard rpool/srv/shared      # NFS requires sync writes for NFSv3
# For NFSv4 with async mount: zfs set sync=disabled (unsafe but fast)

# Multiple shares with different ACLs
zfs set sharenfs="rw=@10.0.0.0/24,ro=@10.0.1.0/24" rpool/srv/shared
zfs set sharenfs="rw=@10.0.0.5,no_root_squash" rpool/srv/admin-share

iSCSI targets

# Create a zvol for iSCSI
zfs create -V 500G -o volblocksize=4K rpool/iscsi/database-lun

# Configure targetcli
targetcli /backstores/block create db-lun /dev/zvol/rpool/iscsi/database-lun
targetcli /iscsi create iqn.2026-04.com.kldload:storage
targetcli /iscsi/iqn.2026-04.com.kldload:storage/tpg1/luns create /backstores/block/db-lun
targetcli /iscsi/iqn.2026-04.com.kldload:storage/tpg1/acls create iqn.2026-04.com.kldload:client1

# Save the targetcli config
targetcli saveconfig

# Client discovery and login
iscsiadm -m discovery -t sendtargets -p 10.0.0.1:3260
iscsiadm -m node --login

# Performance tuning for iSCSI on ZFS
# volblocksize is fixed at creation (set with -o volblocksize above); zfs set cannot change it
zfs set compression=off rpool/iscsi/database-lun     # database does its own compression
zfs set primarycache=metadata rpool/iscsi/database-lun  # DB has its own buffer cache
zfs set logbias=throughput rpool/iscsi/database-lun  # write sync data to the main pool instead of a SLOG

NFS on ZFS is criminally underrated. The sharenfs property means the export definition lives in the pool, not in a config file. When you replicate the pool to a DR host, the NFS export definition comes with it. Import the pool on the DR host, start nfs-server, and clients can mount immediately. No config file synchronization. No Ansible playbook. The storage and its sharing policy travel together.

iSCSI on zvols is the way to give remote hosts block-level access to ZFS storage. The zvol handles checksumming, compression, and snapshots. The iSCSI layer only transports blocks. You get the full ZFS feature set on storage that a remote database server sees as a local disk. Snapshot the zvol before a database migration. If it fails, stop the database and roll back. The database server never knows the difference.
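
Target and initiator names in the targetcli commands follow the IQN convention: iqn, a year and month, a reversed domain name, and an optional label after a colon. A rough format check is sketched below. The regular expression is an approximation of that convention rather than a full validator, and the function name check_iqn is hypothetical.

```shell
# check_iqn NAME: print "valid" if NAME looks like
# iqn.YYYY-MM.reversed.domain[:label], otherwise "invalid".
check_iqn() {
  if printf '%s\n' "$1" |
    grep -Eq '^iqn\.[0-9]{4}-(0[1-9]|1[0-2])\.[a-z0-9]([a-z0-9.-]*[a-z0-9])?(:[^[:space:]]+)?$'
  then
    echo valid
  else
    echo invalid
  fi
}

check_iqn iqn.2026-04.com.kldload:storage   # valid
check_iqn iqn.2026-13.com.kldload:storage   # invalid (month 13)
```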

Encryption — native ZFS, not LUKS

kldload offers optional ZFS-native encryption at install time. This is not LUKS. Not dm-crypt. Not a block device encryption layer under ZFS. It is AES-256-GCM encryption built into the ZFS storage layer itself. Per-dataset. Hardware-accelerated on any CPU with AES-NI, which covers nearly every x86-64 server and desktop CPU made since the early 2010s.

encryption=aes-256-gcm
AES-256 in GCM mode. Authenticated encryption — both confidentiality and integrity. Hardware-accelerated via AES-NI.
keyformat=passphrase
Key derived from a passphrase via PBKDF2. Entered at ZFSBootMenu during boot.
keylocation=prompt
ZFS prompts for the key at pool import. ZFSBootMenu presents the prompt before the OS loads.

When encryption is enabled, the installer adds these properties to the zpool create command. Every dataset inherits encryption from the pool root. The passphrase is entered once at boot in ZFSBootMenu; all datasets unlock together.

# Pool creation with encryption (from storage-zfs.sh)
zpool create -f \
  -o ashift=12 -o autotrim=on \
  -O acltype=posixacl -O canmount=off -O compression=lz4 \
  -O dnodesize=auto -O normalization=formD -O relatime=on -O xattr=sa \
  -O encryption=aes-256-gcm \
  -O keyformat=passphrase \
  -O keylocation=prompt \
  -O mountpoint=none \
  -R /target \
  rpool /dev/sda2

# Verify encryption status
zfs get encryption,keystatus rpool
# NAME   PROPERTY      VALUE            SOURCE
# rpool  encryption    aes-256-gcm      -
# rpool  keystatus     available        -

# Every child dataset inherits encryption; encryptionroot shows which dataset holds the key
zfs get encryption,encryptionroot rpool/home
# NAME        PROPERTY        VALUE        SOURCE
# rpool/home  encryption      aes-256-gcm  -
# rpool/home  encryptionroot  rpool        -

Per-dataset encryption

Even on a non-encrypted pool, individual datasets can be encrypted with different keys. This enables multi-tenant scenarios where each tenant's data has a separate passphrase:

# Encrypt a specific dataset on an unencrypted pool
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase \
  -o keylocation=prompt rpool/srv/tenant-a

# Different key for a different tenant
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase \
  -o keylocation=prompt rpool/srv/tenant-b

# Lock a dataset (unmount and unload key)
zfs unmount rpool/srv/tenant-a
zfs unload-key rpool/srv/tenant-a

# Unlock and mount
zfs load-key rpool/srv/tenant-a
zfs mount rpool/srv/tenant-a
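
With several tenants it helps to see at a glance which datasets are currently locked. A minimal sketch, assuming the tab-separated output of zfs get -H -o name,value keystatus; the function name locked_datasets is hypothetical and not something kldload ships:

```shell
#!/bin/bash
# locked_datasets: read "name<TAB>keystatus" lines on stdin and print the
# names whose key is not loaded (keystatus = unavailable).
# Hypothetical helper for illustration; not part of kldload.
locked_datasets() {
  awk -F'\t' '$2 == "unavailable" { print $1 }'
}
```

On a live system the input would come from something like zfs get -H -o name,value -r keystatus rpool/srv | locked_datasets. Unencrypted datasets report "-" and are ignored.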

Encrypted replication

The -w (raw) flag on zfs send transmits encrypted blocks as ciphertext, exactly as they are stored on disk. The receiving machine stores data it cannot decrypt. This is data sovereignty at the storage layer: replicate to an untrusted host, and without the key it cannot read your data.

# Send encrypted — receiver stores ciphertext, cannot decrypt
zfs send -w rpool/srv@snap | ssh untrusted-host "zfs recv backup/srv"

# The key never leaves the source machine
# The receiving host cannot: load-key, mount, or read the data
# Perfect for offsite DR to cloud or partner infrastructure
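
After the first full send, later transfers would normally be incremental. The sketch below only prints the pipeline so it can be reviewed before anything runs. The function name raw_send_cmd and all dataset, snapshot, and host names are placeholders, and the use of recv -u (receive without mounting) is a choice made for this illustration, not something taken from the kldload scripts:

```shell
#!/bin/bash
# raw_send_cmd: print (do not run) an incremental raw-send pipeline.
#   $1 dataset, $2 previous snapshot, $3 new snapshot, $4 ssh host, $5 target
# -w keeps the blocks encrypted in transit, -i sends only the changes
# between the two snapshots, recv -u leaves the received dataset unmounted.
raw_send_cmd() {
  printf 'zfs send -w -i %s@%s %s@%s | ssh %s zfs recv -u %s\n' \
    "$1" "$2" "$1" "$3" "$4" "$5"
}
```

For example, raw_send_cmd rpool/srv snap1 snap2 untrusted-host backup/srv prints the command to copy and run. The receiver must already hold snap1 from an earlier raw send.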

ZFS native encryption is fundamentally different from LUKS. LUKS encrypts a block device; ZFS sits on top and has no idea the blocks underneath are encrypted. Compression still works in that arrangement, because ZFS compresses the plaintext before it ever reaches the LUKS layer. What you lose is everything that depends on ZFS knowing about the encryption: per-dataset keys, locking one dataset while the rest of the pool stays online, and raw sends to a host that never sees the key. With LUKS under ZFS, zfs send always emits plaintext, so the receiver has to be trusted. ZFS native encryption sits inside the write pipeline: data is compressed first, then encrypted. You get compression savings, encryption, and untrusted-host replication together. This makes ZFS native encryption the correct choice.

Recovery: there is none. Forget the passphrase, lose the data. There is no recovery key, no backdoor, no master password. This is by design. The overhead is roughly 5-15% for sequential I/O and negligible for random I/O. On a modern CPU with AES-NI, AES-256-GCM can run at 5+ GB/s, faster than any SATA SSD can read, though the fastest NVMe drives can exceed that.

Monitoring — health checks, scrubs, ARC, and alerts

kldload configures ZFS monitoring out of the box. Pool health checks, automatic scrubs, ARC utilization tracking, and capacity alerts are all set up at install time. The monitoring stack integrates with the kldload observability tools if installed.

Scrub scheduling

# Weekly scrub via systemd timer (installed by kldload)
# /etc/systemd/system/zfs-scrub@.timer
[Timer]
OnCalendar=Sun *-*-* 02:00:00
Persistent=true

# Manual scrub
zpool scrub rpool

# Check scrub status
zpool status rpool | grep scan
#   scan: scrub repaired 0B in 00:12:34 with 0 errors on Sun Apr  5 02:12:34 2026

# Pause a scrub (if it's impacting production I/O)
zpool scrub -p rpool

# Resume
zpool scrub rpool

Pool health and status

# Full pool status — health, scrub, errors, vdev layout
zpool status rpool
#   pool: rpool
#  state: ONLINE
# status: One or more devices has experienced an unrecoverable error.
# action: Determine if the device needs to be replaced.
#   scan: scrub repaired 4K in 00:12:34 with 0 errors
# config:
#   NAME          STATE     READ WRITE CKSUM
#   rpool         ONLINE       0     0     0
#     mirror-0    ONLINE       0     0     0
#       sdb       ONLINE       0     0     0
#       sdc       ONLINE       0     0     1

# Quick health check (scriptable)
zpool status -x
# all pools are healthy

# Pool I/O statistics (like iostat for ZFS)
zpool iostat rpool 5
#               capacity     operations     bandwidth
# pool        alloc   free   read  write   read  write
# rpool       45.2G   180G    125    340  4.12M  12.5M

# Per-vdev I/O stats
zpool iostat -v rpool 5
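
The "scriptable" health check can be wrapped so that a cron job or monitoring hook gets a plain exit code. A minimal sketch, assuming the two healthy messages zpool status -x prints (one for all pools, one for a named pool); pool_health_ok is a hypothetical helper name, and it takes the command output as an argument so it can be exercised without a pool:

```shell
#!/bin/bash
# pool_health_ok: succeed only if the given text is the healthy output of
# `zpool status -x` (all pools) or `zpool status -x <pool>` (one pool).
# Anything else (errors, degraded vdevs, empty output) counts as unhealthy.
pool_health_ok() {
  case "$1" in
    "all pools are healthy") return 0 ;;
    "pool '"*"' is healthy") return 0 ;;
    *) return 1 ;;
  esac
}
```

Typical use would be pool_health_ok "$(zpool status -x)" || followed by whatever alerting command the host already uses.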

ARC monitoring

# ARC (Adaptive Replacement Cache) statistics
arc_summary

# Quick ARC hit ratio
arcstat 5
#     time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c
# 15:30:01  1.2K    45   3.7%    30  2.5%    15  1.2%     0  0.0%  3.8G  4.0G

# ARC size and target
grep -E "^(size|c_max|c_min) " /proc/spl/kstat/zfs/arcstats
# size     4   4089446400  (current ARC size ~3.8GB)
# c_min    4   1073741824  (minimum target ~1GB)
# c_max    4   8589934592  (maximum target ~8GB)

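# Set maximum ARC size (e.g., limit to 4GB on a 16GB machine)
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
# Persist across reboots (note: > overwrites any existing zfs.conf):
echo "options zfs zfs_arc_max=4294967296" > /etc/modprobe.d/zfs.conf
# With root on ZFS the module loads from the initramfs, so rebuild it
# afterwards (dracut -f, update-initramfs -u, or your distro's equivalent)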
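
The byte value for zfs_arc_max is easy to mistype. A small sketch that derives it from a size in GiB and prints the modprobe line; arc_max_option is a hypothetical helper name, not part of kldload:

```shell
#!/bin/bash
# arc_max_option: print a modprobe options line capping the ARC at N GiB.
#   $1 = limit in GiB (integer)
arc_max_option() {
  local gib=$1
  printf 'options zfs zfs_arc_max=%d\n' "$(( gib * 1024 * 1024 * 1024 ))"
}
```

arc_max_option 4 prints the same line used above for a 4GB limit.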

Capacity alerts

# Check pool capacity — ZFS degrades above 80%, critical above 90%
zpool list -o name,size,alloc,free,cap,health
# NAME    SIZE  ALLOC   FREE   CAP  HEALTH
# rpool   224G  45.2G   179G   20%  ONLINE

# Simple alert script (runs via cron)
#!/bin/bash
CAP=$(zpool list -Hp -o capacity rpool)
if [ "$CAP" -gt 80 ]; then
  echo "WARNING: rpool at ${CAP}% capacity" | \
    mail -s "ZFS capacity alert" admin@example.com
fi

# Per-dataset usage breakdown
zfs list -o name,used,avail,refer,compressratio -r rpool | head -20

# Find datasets consuming the most space
zfs list -o name,used -s used -r rpool | tail -10

# Check compression savings
zfs get compressratio rpool
# NAME   PROPERTY       VALUE  SOURCE
# rpool  compressratio  1.82x  -
# That means 45GB on disk is storing 82GB of logical data

ZFS performance degrades significantly above 80% capacity. This is not a bug; it is a consequence of copy-on-write. When the pool is nearly full, ZFS has to search harder for free blocks, fragmentation increases, and write amplification rises. Treat the 80% threshold as an operational boundary, not a suggestion. If your pool is above 80%, you need to add storage, delete data, or both. At 90% you are in emergency territory. At 95% and beyond, writes can start failing with out-of-space errors, because ZFS holds back a small reserve so that deletes and other cleanup can still succeed.
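
These thresholds can be turned into a severity label for an alert script. A minimal sketch: the function name capacity_level and the label names are illustrative assumptions, and the input is the bare integer that zpool list -Hp -o capacity prints:

```shell
#!/bin/bash
# capacity_level: map a pool capacity percentage to a severity label.
#   above 80 -> WARNING, 90 and up -> CRITICAL, 95 and up -> EMERGENCY
capacity_level() {
  local cap=$1
  if   [ "$cap" -ge 95 ]; then echo EMERGENCY
  elif [ "$cap" -ge 90 ]; then echo CRITICAL
  elif [ "$cap" -gt 80 ]; then echo WARNING
  else echo OK
  fi
}
```

On a live host this would be called as capacity_level "$(zpool list -Hp -o capacity rpool)".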

The ARC is ZFS's read cache. It sits in RAM and caches recently and frequently accessed blocks. A healthy ARC hit rate is above 90%. Below 80%, you are thrashing: ZFS is reading from disk too often. The fix is either more RAM (the ARC grows toward its configured maximum, zfs_arc_max) or an L2ARC (SSD read cache). On a 32GB machine, ZFS will typically use 16GB or more for ARC, depending on that limit. This is not a memory leak. It is ZFS doing its job. The ARC shrinks when applications need the memory, although under sudden pressure the release is not always instantaneous.
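
The hit rate can be computed directly from the kernel counters as hits / (hits + misses). A minimal sketch, assuming the "name type value" line layout of /proc/spl/kstat/zfs/arcstats; arc_hit_ratio is a hypothetical helper name. The counters are cumulative since the module was loaded, so this is a long-run average rather than the per-interval figure arcstat shows:

```shell
#!/bin/bash
# arc_hit_ratio: read arcstats-style lines on stdin and print the overall
# ARC hit percentage to one decimal place. Prints nothing if no reads yet.
arc_hit_ratio() {
  awk '$1 == "hits"   { h = $3 }
       $1 == "misses" { m = $3 }
       END { if (h + m > 0) printf "%.1f\n", 100 * h / (h + m) }'
}
```

On a live host: arc_hit_ratio < /proc/spl/kstat/zfs/arcstats.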

Comparison — kldload ZFS vs the alternatives

Every storage decision is a tradeoff. ZFS is not the answer to every problem. But for the use cases kldload targets — OS root, VM storage, container storage, NFS/iSCSI serving, single-node and replicated setups — it is the correct choice. Here is why.

kldload ZFS vs ext4 + LVM

| Capability | kldload ZFS | ext4 + LVM |
|---|---|---|
| Checksumming | Every block, always | None. Silent corruption is undetectable. |
| Compression | LZ4 transparent, 1.5-2x savings | None at filesystem level |
| Snapshots | O(1), per-dataset, instant | LVM snapshots exist but degrade performance and are crash-unsafe |
| Replication | Native, incremental, block-level | rsync (file-level, slow) or LVM snap + dd (clunky) |
| Encryption | Per-dataset, native, works with compression | LUKS under LVM. Whole-volume only, no per-dataset keys. |
| Boot environments | Native. Clone, boot, rollback in seconds. | Not possible without Snapper + complex GRUB config |
| Resize | Datasets share pool space. No resizing needed. | LV resize + resize2fs. Can grow online, shrinking requires umount. |
| Self-healing | Mirror/RAIDZ auto-repairs corrupt blocks on read | mdraid rebuilds entire disks. No block-level repair. |
| Complexity | One tool (zfs/zpool). Higher learning curve. | Three tools (fdisk + pvcreate/lvcreate + mkfs). More moving parts. |

kldload ZFS vs Ceph

| Capability | kldload ZFS (single node / replicated) | Ceph |
|---|---|---|
| Minimum nodes | 1 node. Replication to a second is optional. | 3 nodes minimum (MON quorum). 5+ recommended. |
| Operational complexity | zfs/zpool commands. Shell scripts. cron. | ceph-deploy or cephadm. PG balancing. CRUSH maps. OSD management. MON quorum. |
| Block storage | zvols as block devices. Native. | RBD. Distributed, replicated. Network overhead. |
| File storage | NFS on ZFS datasets. Simple. | CephFS. Metadata server required. Complex failure modes. |
| Object storage | Not applicable | RADOS Gateway. S3-compatible. |
| Network dependency | None for local. SSH for replication. | Requires dedicated cluster network. Latency-sensitive. |
| Scaling | Scale-up (add disks to pool). Replication for DR. | Scale-out (add nodes). Automatic rebalancing. |
| Best for | 1-10 nodes. Single-site or replicated pairs. | 10+ nodes. Multi-site. Cloud-scale object storage. |

kldload ZFS vs GlusterFS

| Capability | kldload ZFS | GlusterFS |
|---|---|---|
| Architecture | Kernel module. Block-level COW. | FUSE-based (userspace). File-level operations. Higher latency per I/O. |
| Checksumming | Every block | None. Relies on underlying filesystem. |
| Compression | Native LZ4/zstd | None. Must use compressed filesystem underneath. |
| Small file performance | Excellent with special vdev | Poor. Each file is a file on the brick filesystem. Metadata overhead is high. |
| Replication | Block-level incremental (zfs send) | File-level geo-replication. Slower, higher bandwidth. |
| Distributed volumes | Not applicable | Native distributed + replicated volumes across nodes |
| Best for | Single-node NFS. Replicated pairs for DR. | Distributed NAS. Multi-node file access. Slowly being replaced by CephFS. |

Use ZFS when you have 1-10 nodes and want simple, reliable, high-performance storage with snapshots, compression, checksumming, and replication. Use Ceph when you have 10+ nodes, need object storage, and have a dedicated storage network with staff to operate it. Use GlusterFS when someone already set it up and you have not migrated to Ceph yet.

kldload's bet is that most infrastructure runs on 1-10 nodes and most teams do not have the staff or budget for a Ceph cluster. For these teams, ZFS replication between a pair of nodes provides 90% of what Ceph offers at 10% of the complexity. Two kldload boxes with syncoid running every 15 minutes is a complete storage and DR solution. No quorum. No CRUSH map. No MON daemons. Two machines and a cron job.

When NOT to use ZFS

Severely RAM-constrained systems

ZFS needs at least 2GB of RAM for the ARC. On a 512MB embedded device, use ext4 or f2fs. The ARC will consume available memory (and release it when needed), which can confuse monitoring tools that do not understand ZFS's memory model.

Write-heavy dedup workloads

ZFS deduplication requires a DDT (dedup table) that must fit in RAM. A 10TB pool with dedup might need 50-100GB of RAM for the DDT. If the DDT spills to disk, performance collapses. Unless you have verified the DDT fits in RAM, do not enable dedup. Use compression instead: LZ4 is nearly free and almost always helps.
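
The 50-100GB figure follows from simple arithmetic: number of unique blocks times the RAM each DDT entry occupies. The sketch below assumes about 320 bytes per entry, a commonly quoted rule of thumb rather than a measured value, and ddt_ram_gib is a hypothetical helper name:

```shell
#!/bin/bash
# ddt_ram_gib: rough dedup-table RAM estimate in GiB (integer division).
#   $1 = unique data in TiB, $2 = average block size in KiB
# ASSUMPTION: ~320 bytes of RAM per DDT entry (rule of thumb only).
ddt_ram_gib() {
  local tib=$1 block_kib=$2 bytes_per_entry=320
  # TiB -> KiB, then divide by the block size to get the block count
  local blocks=$(( tib * 1024 * 1024 * 1024 / block_kib ))
  echo $(( blocks * bytes_per_entry / 1024 / 1024 / 1024 ))
}
```

ddt_ram_gib 10 64 gives 50 and ddt_ram_gib 10 32 gives 100, which is where the range above comes from. For a real pool, zdb -S can simulate dedup against existing data before the feature is enabled.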

Cloud-scale object storage

If you need S3-compatible object storage across 100+ nodes, use Ceph or MinIO. ZFS is a local/replicated filesystem, not a distributed object store. It excels at block and file storage on individual nodes. Distributed consensus is not its domain.

Extremely latency-sensitive databases

Some database benchmarks show ext4 with O_DIRECT slightly outperforming ZFS on latency-sensitive OLTP workloads. ZFS's COW mechanism adds a write amplification factor. For sub-millisecond database latency requirements on NVMe, benchmark both. For everything else, ZFS's safety features (checksumming, snapshots, replication) far outweigh the latency difference.

Summary — what kldload gives you out of the box

Every kldload install creates a ZFS root pool with correct properties, a dataset hierarchy designed for rollback isolation, automatic snapshot rotation, Sanoid/Syncoid for replication, ZFSBootMenu for boot environments, and optional AES-256-GCM encryption. The storage layer is identical across every supported distro. Switch from CentOS to Debian and your pool, datasets, snapshots, and replication continue unchanged. The OS is a guest on ZFS's stage.

The philosophy: Storage must be correct before anything else can be correct. Correct means checksummed, compressed, snapshottable, replicable, and encrypted when needed. ZFS provides all five in one system. Everything else requires bolting together separate tools and hoping they do not conflict. kldload chooses the integrated solution and configures it right the first time.