ZFS Masterclass
This guide goes deep on ZFS — well past the basics covered in ZFS Zero to Hero. You already know how to create a pool, take a snapshot, and roll back. This guide covers the decisions that separate an expert ZFS operator from someone who knows the commands: pool topology for real hardware, dataset architecture as a design discipline, send/receive pipelines for production DR, encryption that survives replication, boot environments as OS safety nets, and the performance knobs that make a 2x difference on databases.
What this page covers: pool topology and vdev selection for real workloads, datasets as architecture, the full send/receive replication model, native ZFS encryption with encrypted replication, boot environments, ARC and L2ARC tuning, SLOG and special vdevs, delegation and permissions, performance tuning by workload, ZFS with databases (PostgreSQL, MySQL, MongoDB, Redis), channel programs for atomic automation, and a troubleshooting reference.
Prerequisites: the ZFS Zero to Hero tutorial. You should be comfortable with zpool create, zfs snapshot, zfs rollback, and basic property management before reading this.
1. ZFS Is Not a Filesystem
When people say "ZFS is a filesystem" they are describing one feature of a storage platform. ZFS is a volume manager, a filesystem, a snapshot engine, a replication system, and a data integrity layer — all in one, all sharing the same on-disk format, all operating atomically. No other tool does all of this. You cannot bolt these properties onto ext4. You cannot simulate them with LVM + mdadm + rsync. They only exist as a unified whole.
The traditional Linux storage stack looks like this:
# Traditional stack: four separate tools, no shared atomicity
Hardware
└─ mdadm (RAID)
└─ LVM (volume management)
└─ ext4 / xfs (filesystem)
└─ rsync / tar (backup/replication)
Each layer is independent. RAID does not know about the filesystem. LVM snapshots know nothing about the filesystem above them. rsync does not know about block changes: it scans files. When you replicate with rsync, you are comparing the sizes and modification times (or, with --checksum, the full contents) of millions of files. When you scrub with mdadm, you are checking parity, not data integrity. When data on ext4 is silently corrupted, nothing detects it until an application reads garbage.
ZFS collapses all of this:
# ZFS: one tool, all layers, full atomicity
Hardware
└─ ZFS (RAID + volume management + filesystem + snapshot + replication + checksums)
The copy-on-write guarantee. Every write in ZFS goes to a new location on disk. The old location stays valid until the transaction commits. This means: no partial writes, no torn blocks, no "filesystem in an inconsistent state after power loss." You never run fsck on ZFS. There is nothing to check — either the write committed atomically or the old version is still intact.
The checksum guarantee. Every block in ZFS carries a checksum stored in the block's parent pointer — not in the block itself. A corrupted block cannot forge its own checksum. ZFS detects corruption on every read and, if you have redundancy, silently repairs it from a good copy. This is called self-healing and it runs automatically, every time data is read.
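The value of keeping the checksum in the parent rather than in the block itself can be shown with a toy model outside ZFS. The sketch below is only an analogy: ordinary files and sha256sum stand in for blocks and block pointers, and the file names and the verify_block helper are made up for illustration.

```shell
# Toy model: the "parent" records the checksum of the "child" block.
# Plain files and sha256sum stand in for ZFS blocks and block pointers.
workdir=$(mktemp -d)
block="$workdir/block.dat"
parent="$workdir/parent.ptr"

printf 'important data\n' > "$block"

# The parent stores the child's checksum, separate from the child itself.
sha256sum "$block" | awk '{print $1}' > "$parent"

# verify_block BLOCK PARENT succeeds only if the block still matches
# the checksum recorded in the parent.
verify_block() {
    [ "$(sha256sum "$1" | awk '{print $1}')" = "$(cat "$2")" ]
}

verify_block "$block" "$parent" && echo "before corruption: ok"

# Simulate silent corruption: the block changes, the parent does not.
printf 'importent data\n' > "$block"

verify_block "$block" "$parent" || echo "after corruption: checksum mismatch detected"
```

Because the corrupted block has no say over the checksum it is compared against, it cannot vouch for itself. With redundancy, ZFS would then fetch a copy that does match.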
2. Pool Design for Real Workloads
A ZFS pool (zpool) is the top-level storage container. Everything below it —
datasets, volumes, snapshots — lives inside the pool. The pool is made of vdevs
(virtual devices). The topology of those vdevs determines your IOPS, redundancy,
rebuild speed, and failure tolerance.
Mirror
Two or more disks that hold identical copies of all data. Reads can be served from either disk (doubling read IOPS). Writes go to all mirrors simultaneously (write IOPS = single disk). Rebuild time is bounded by disk read speed, not pool size — a 4TB mirror pair rebuilds in roughly the time it takes to read 4TB from one disk.
# 2-disk mirror (the reliable homelab minimum)
zpool create rpool mirror /dev/sda /dev/sdb
# 3-way mirror (survive 2 simultaneous drive failures)
zpool create rpool mirror /dev/sda /dev/sdb /dev/sdc
# Two 2-disk mirror vdevs striped (4 disks, 2x write IOPS, 4x read IOPS)
zpool create rpool \
mirror /dev/sda /dev/sdb \
mirror /dev/sdc /dev/sdd
RAIDZ1, RAIDZ2, RAIDZ3
RAIDZ is ZFS's answer to RAID5/6. Unlike hardware RAID, RAIDZ uses variable-width stripes that eliminate the write hole — no partial-stripe writes, no need for a battery-backed write cache. RAIDZ1 tolerates one disk failure. RAIDZ2 tolerates two. RAIDZ3 tolerates three.
# RAIDZ2 across 6 disks (4 data + 2 parity)
zpool create rpool raidz2 /dev/sd{a..f}
# Two RAIDZ2 vdevs striped (12 disks total — common production layout)
zpool create rpool \
raidz2 /dev/sd{a..f} \
raidz2 /dev/sd{g..l}
dRAID
Distributed RAID, available in OpenZFS 2.1+. Data, parity, and spare capacity are spread across all disks in the vdev rather than tied to fixed disks. This enables faster rebuilds (rebuild I/O is distributed) and lets you reserve distributed spare capacity that becomes active automatically during a rebuild. dRAID shines on large drive counts (12+) where RAIDZ rebuild times become dangerous.
# dRAID2 across 12 disks: parity=2, data=8 per group, 2 distributed spares
# (data + parity + spares must not exceed the number of children: 8 + 2 + 2 = 12)
zpool create rpool draid2:8d:12c:2s /dev/sd{a..l}
# notation: draid{parity}:{data}d:{children}c:{spares}s
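Getting the dRAID numbers wrong makes zpool create reject the layout, so it can help to check the arithmetic first. The helper below is a hypothetical sketch (draid_spec is not a ZFS command). It assumes the documented constraint that data, parity, and spare disks together cannot exceed the number of children.

```shell
# draid_spec PARITY DATA CHILDREN SPARES
# Prints a dRAID vdev spec, or fails if the layout cannot fit.
# Assumed rule: data + parity + spares must not exceed children.
draid_spec() (
    parity=$1; data=$2; children=$3; spares=$4
    if [ $((data + parity + spares)) -gt "$children" ]; then
        echo "invalid: ${data}d + ${parity}p + ${spares}s does not fit in ${children} disks" >&2
        exit 1
    fi
    echo "draid${parity}:${data}d:${children}c:${spares}s"
)

draid_spec 2 8 12 2                      # prints draid2:8d:12c:2s
draid_spec 2 10 12 2 || echo "rejected"  # 10 + 2 + 2 = 14 > 12
```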
When to use each
Mirror
Best random read IOPS. Fastest rebuild. Highest cost per usable TB. Best for: OS pools, KVM host root, databases, anything latency-sensitive. Minimum: 2 disks. Maximum redundancy: as many mirrors as you want.
RAIDZ2
Good sequential throughput. Two disk failure tolerance. Slower rebuild than mirrors. Best for: NAS, media storage, large cold storage. Use RAIDZ2 as the minimum for drives over 2TB — RAIDZ1 on large drives has a meaningful chance of a second failure during the hours-long rebuild.
RAIDZ3
Three disk failure tolerance. High overhead (3 parity disks). Justified for large pools (10+ disks) where simultaneous triple failure is non-trivial, or for archival storage where rebuild windows stretch into days.
dRAID
Distributed rebuild — every disk participates in restoring a failed drive, so rebuild speed scales with drive count instead of a single rebuild drive bottleneck. Ideal for 12+ disk arrays.
Stripe width and IOPS
RAIDZ random IOPS scale with the number of vdevs, not the number of disks per vdev: each RAIDZ vdev delivers roughly the random IOPS of a single disk. Two RAIDZ2 vdevs of 6 disks each deliver roughly 2x the random IOPS of one RAIDZ2 vdev of 12 disks, at the cost of two more parity disks (8 data disks instead of 10). When IOPS matter, split your disks into more vdevs rather than wider vdevs.
# Low IOPS: one wide vdev
zpool create rpool raidz2 /dev/sd{a..l} # 1 vdev, ~1x random IOPS, 10 data disks
# Higher IOPS: two narrower vdevs striped
zpool create rpool \
raidz2 /dev/sd{a..f} \
raidz2 /dev/sd{g..l} # 2 vdevs, ~2x random IOPS, 8 data disks
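The trade between vdev count and usable capacity is simple arithmetic. This sketch uses a hypothetical helper, raidz_data_disks, and counts whole data disks only; it ignores RAIDZ padding, metadata overhead, and free-space reserves.

```shell
# raidz_data_disks VDEVS DISKS_PER_VDEV PARITY
# Data disks = vdevs * (disks per vdev - parity). Rough planning figure only.
raidz_data_disks() {
    echo $(( $1 * ($2 - $3) ))
}

echo "1 x 12-disk RAIDZ2: $(raidz_data_disks 1 12 2) data disks, 1 vdev of random IOPS"
echo "2 x  6-disk RAIDZ2: $(raidz_data_disks 2 6 2) data disks, 2 vdevs of random IOPS"
echo "3 x  4-disk RAIDZ2: $(raidz_data_disks 3 4 2) data disks, 3 vdevs of random IOPS"
```

Each step toward narrower vdevs buys random IOPS and gives up data disks. Which point on that curve is right depends on the workload.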
Special allocation classes: metadata vdev, L2ARC, SLOG
Beyond data vdevs, ZFS supports three auxiliary vdev types that accelerate specific access patterns:
- special vdev: receives metadata and small blocks (at or below the special_small_blocks threshold). Put this on NVMe to make a slow spinning-disk pool feel fast for random access.
- L2ARC — a second-level read cache (SSD) that extends the in-RAM ARC. Covered in section 7.
- SLOG (Separate Log) — an NVMe device that accelerates synchronous writes. Covered in section 8.
# Add a special vdev to an existing pool
zpool add rpool special mirror /dev/nvme0n1 /dev/nvme1n1
# Route blocks of 64k or smaller to the special vdev (metadata + small files).
# Datasets whose recordsize is 64k or less would then place all of their data there.
zfs set special_small_blocks=64k rpool
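special_small_blocks interacts with recordsize: if a dataset's recordsize is at or below the threshold, every data block it writes qualifies as "small" and the special vdev can fill up with ordinary data. The check below is a sketch with hypothetical helper names, assuming the documented rule that blocks smaller than or equal to the threshold go to the special class.

```shell
# to_bytes SIZE: convert values like 8k, 64K, 1m to bytes.
to_bytes() {
    case $1 in
        *[kK]) echo $(( ${1%[kK]} * 1024 )) ;;
        *[mM]) echo $(( ${1%[mM]} * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}

# all_data_on_special RECORDSIZE SPECIAL_SMALL_BLOCKS
# True when even a full-size record is at or below the threshold.
all_data_on_special() {
    [ "$(to_bytes "$2")" -gt 0 ] && [ "$(to_bytes "$1")" -le "$(to_bytes "$2")" ]
}

for rs in 8k 64k 128k 1m; do
    if all_data_on_special "$rs" 64k; then
        echo "recordsize=$rs: every data block qualifies for the special vdev"
    else
        echo "recordsize=$rs: only blocks of 64k or smaller qualify"
    fi
done
```

A per-dataset override (a lower special_small_blocks, or 0 to disable) is one way to keep small-record datasets such as databases off the special vdev.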
Concrete examples
2-disk homelab: A mirror. Two identical drives, zpool create rpool mirror /dev/sda /dev/sdb. Nothing else makes sense at 2 drives — you have no room for RAIDZ, and a single disk is a single point of failure. If you only have 2 drives, use a mirror.
4-disk NAS: Two 2-disk mirrors striped, or RAIDZ2 across 4 disks. Both layouts yield two drives of usable capacity. Mirrors give better IOPS and faster rebuilds, but survive a second failure only if it hits the other pair; RAIDZ2 survives any two failures. For a NAS serving media, RAIDZ2 is fine. For a NAS serving a database, use mirrors.
12-disk production server: Two 6-disk RAIDZ2 vdevs for storage. A mirror pair of NVMe for the boot pool. Optionally, a special vdev on NVMe to accelerate metadata. If the workload is database-heavy, replace the two RAIDZ2 vdevs with six 2-disk mirrors: more IOPS and faster rebuilds, at the cost of usable capacity (6 drives instead of 8).
3. Datasets as Architecture
A ZFS dataset is a filesystem with its own properties, its own snapshot timeline, its own quota, its own compression algorithm, its own record size, and its own mount point. Creating a dataset costs nothing — there is no pre-allocation, no format step, no minimum size. It is as cheap as creating a directory. But a dataset gives you control that a directory never can.
Datasets vs directories
# A directory: you can set permissions, that is all
mkdir -p /srv/postgres
chown postgres:postgres /srv/postgres
# A dataset: you control everything
zfs create -o recordsize=8k \
-o compression=lz4 \
-o atime=off \
-o logbias=throughput \
rpool/srv/postgres
# Every property can be different per dataset
# Every dataset has its own snapshot timeline
# Every dataset can be replicated independently
Inheritance
ZFS properties cascade from parent to child unless overridden. Set compression on
rpool and every dataset beneath it inherits it. Set a different compression on
rpool/vms and that dataset and its children use the override. Clear the override
with zfs inherit and the dataset falls back to the parent value.
# Set defaults on the pool root — all children inherit
zfs set compression=lz4 rpool
zfs set atime=off rpool
zfs set xattr=sa rpool
# Override for a specific workload
zfs set compression=zstd rpool/backups # better ratio for cold storage
zfs set atime=on rpool/home # some apps need access times
# Check where a property comes from
zfs get -o name,property,value,source compression rpool rpool/backups
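The lookup rule behind inheritance is "walk up the tree until some ancestor has a local value". The toy model below imitates that rule with sample data; effective_compression and the local_settings table are illustrative stand-ins, not ZFS interfaces.

```shell
# Locally set compression values, one "dataset value" pair per line (sample data).
local_settings='rpool lz4
rpool/backups zstd'

# effective_compression DATASET
# Walks from the dataset up to the pool root and reports the first local value found.
effective_compression() (
    ds=$1
    while :; do
        value=$(printf '%s\n' "$local_settings" | awk -v d="$ds" '$1 == d { print $2 }')
        if [ -n "$value" ]; then
            echo "$value (set on $ds)"
            exit 0
        fi
        case $ds in
            */*) ds=${ds%/*} ;;            # step up to the parent dataset
            *)   echo "default"; exit 0 ;; # reached the pool root with nothing set
        esac
    done
)

effective_compression rpool/srv/postgres   # prints: lz4 (set on rpool)
effective_compression rpool/backups/daily  # prints: zstd (set on rpool/backups)
```

The "(set on ...)" part corresponds to what the source column of zfs get reports as "inherited from".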
The kldload dataset layout
kldload installs with a deliberate dataset hierarchy. Each workload gets its own dataset with tuned properties:
rpool/
ROOT/ # boot environments live here
default/ # the active OS root
home/ # user home directories (atime=on, compression=lz4)
vms/ # KVM disk images (recordsize=64k or volblocksize=64k for zvols)
docker/ # Docker storage driver root (compression=lz4, atime=off)
srv/ # services
postgres/ # recordsize=8k, logbias=throughput
nginx/ # default settings
redis/ # recordsize=64k, sync=disabled
backups/ # zfs send output (compression=zstd, atime=off)
# Inspect the full property set for a dataset
zfs get all rpool/srv/postgres
# See only locally-set properties (not inherited)
zfs get -s local all rpool/srv/postgres
Per-application tuning reference
recordsize
The maximum block size ZFS uses when writing data. Match it to the application's I/O unit. Default is 128k, which is wrong for databases. See sections 10 and 11 for specifics by database.
compression
lz4 for almost everything — near-zero CPU cost, 1.5-3x ratio on typical data. zstd for cold archives where CPU time is acceptable. off only for already-compressed data (video, images, encrypted blobs).
atime
Access time tracking writes a metadata update every time a file is read. Turn it off on any dataset where access patterns do not require it — nearly everything except mail spools and some application logs.
sync
standard (default): applications can request synchronous writes. always: forces every write to sync — maximum durability, lower throughput. disabled: async only — maximum speed, data loss on power failure. Use disabled only for ephemeral or externally-replicated data.
logbias
Hint to ZFS about the workload type. latency (default): prefer low-latency paths including the SLOG. throughput: prefer throughput over latency, bypass the SLOG. Use throughput for databases that already manage their own write ordering.
quota / reservation
quota caps how much space a dataset can use. reservation guarantees space is available. Use quotas to prevent runaway datasets from filling the pool. Use reservations to guarantee space for critical workloads.
4. Send/Receive — the Replication Engine
zfs send serializes the state of a snapshot into a byte stream. zfs receive
deserializes it into a dataset on any ZFS pool, anywhere. Together they form a
block-level, incremental, checksummed, resumable replication tool built into the
filesystem. rsync scans files. zfs send walks block pointers and sends only blocks born after the base snapshot.
Full send
# Send a snapshot to a remote host over SSH
zfs snapshot rpool/srv/postgres@2026-04-01
zfs send rpool/srv/postgres@2026-04-01 \
| ssh backup-host zfs receive -F backup/postgres
Incremental send
An incremental send transmits only the blocks that changed between two snapshots. ZFS computes this by comparing snapshot metadata — it does not scan files. A 2TB dataset that changed 1GB sends 1GB.
# Incremental send from @2026-04-01 to @2026-04-02
zfs snapshot rpool/srv/postgres@2026-04-02
zfs send -i rpool/srv/postgres@2026-04-01 \
rpool/srv/postgres@2026-04-02 \
| ssh backup-host zfs receive backup/postgres
# -I (capital) also sends every intermediate snapshot between the two endpoints,
# so the remote receives each one (useful when catching up after missed runs)
zfs send -I rpool/srv/postgres@2026-04-01 \
rpool/srv/postgres@2026-04-02 \
| ssh backup-host zfs receive backup/postgres
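An incremental send only works if the base snapshot exists on both sides. One way to pick the base is to intersect the two snapshot lists and take the newest match. The sketch below works on plain text files so it can be tried without a pool; newest_common_snapshot is a hypothetical helper, and in real use the lists would come from zfs list -H -t snapshot -o name -s creation on each host.

```shell
# newest_common_snapshot SOURCE_LIST REMOTE_LIST
# Each file holds snapshot names (the part after @), oldest first, one per line.
# Prints the newest name present in both, or nothing if there is none.
newest_common_snapshot() {
    grep -F -x -f "$2" "$1" | tail -n 1
}

tmp=$(mktemp -d)
printf '%s\n' 2026-03-30 2026-03-31 2026-04-01 2026-04-02 > "$tmp/source"
printf '%s\n' 2026-03-30 2026-03-31 2026-04-01 > "$tmp/remote"

base=$(newest_common_snapshot "$tmp/source" "$tmp/remote")
echo "incremental base: rpool/srv/postgres@$base"
```

An empty result means there is no common snapshot, and the only option is a full send.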
Recursive send
# Send an entire dataset tree recursively
# Creates matching snapshots on all child datasets
zfs snapshot -r rpool/srv@2026-04-02
zfs send -R rpool/srv@2026-04-02 \
| ssh backup-host zfs receive -F backup/srv
Raw send (encrypted datasets)
Raw sends transmit the encrypted bytes without decrypting. The receiving end stores the ciphertext. Without the key, the data is unreadable — you can replicate to an untrusted remote.
# Send encrypted dataset without decrypting
zfs send --raw rpool/home/alice@2026-04-02 \
| ssh untrusted-remote zfs receive backup/home/alice
# The remote stores encrypted blocks — it cannot read them without the key
# The key never leaves the source system
Resume support
Large sends over unreliable networks can be interrupted. ZFS records the send position and allows resuming from where it stopped.
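# Resume tokens are only saved when the receive runs with -s:
#   zfs send ... | ssh backup-host zfs receive -s backup/postgres
# If that receive fails mid-stream, the target records a resume token (check it on backup-host)
zfs get receive_resume_token backup/postgres
# Resume the interrupted send
RESUME_TOKEN=$(ssh backup-host zfs get -H -o value receive_resume_token backup/postgres)
zfs send -t "$RESUME_TOKEN" | ssh backup-host zfs receive -s backup/postgres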
Bookmarks — keep the send position without keeping the snapshot
An incremental send requires the previous snapshot to exist on the source. But snapshots consume space. A bookmark records the send position without retaining the data — it costs almost nothing and lets the source prune old snapshots while keeping the ability to continue incrementally.
# Create a bookmark from a snapshot
zfs bookmark rpool/srv/postgres@2026-04-01 rpool/srv/postgres#2026-04-01
# Now you can destroy the snapshot but keep the bookmark
zfs destroy rpool/srv/postgres@2026-04-01
# The next incremental send uses the bookmark as the base
zfs send -i rpool/srv/postgres#2026-04-01 \
rpool/srv/postgres@2026-04-02 \
| ssh backup-host zfs receive backup/postgres
Automated hourly replication pipeline
#!/bin/bash
# /usr/local/bin/zfs-replicate — run hourly via systemd timer
set -euo pipefail
POOL=rpool
DATASET=srv
REMOTE=backup-host
REMOTE_POOL=backup
SNAP_NAME="$(date +%Y-%m-%dT%H:%M)"
KEEP_LOCAL=48 # keep 48 hourly snapshots on source
KEEP_REMOTE=168 # keep 7 days of hourly snapshots on remote
# Snapshot the full tree
zfs snapshot -r "${POOL}/${DATASET}@${SNAP_NAME}"
# Find the most recent snapshot on the remote to use as the incremental base.
# "|| true" keeps set -e from aborting on the first run, when the remote dataset does not exist yet.
LAST_REMOTE=$(ssh "${REMOTE}" \
  zfs list -H -t snapshot -o name -s creation -d 1 "${REMOTE_POOL}/${DATASET}" \
  | tail -1 | sed 's/.*@//' || true)
if [[ -z "$LAST_REMOTE" ]]; then
  # First run: full send
  zfs send -R "${POOL}/${DATASET}@${SNAP_NAME}" \
    | ssh "${REMOTE}" zfs receive -F "${REMOTE_POOL}/${DATASET}"
else
  # Incremental send. Point REMOTE at the host's WireGuard address so traffic uses the DR tunnel.
  # Host key checking stays enabled; add the remote's key to known_hosts once.
  zfs send -R -I \
    "${POOL}/${DATASET}@${LAST_REMOTE}" \
    "${POOL}/${DATASET}@${SNAP_NAME}" \
    | ssh "${REMOTE}" \
      zfs receive -F "${REMOTE_POOL}/${DATASET}"
fi
# Prune old snapshots on source. List only the top dataset (-d 1) so the count is per
# snapshot name, then destroy recursively (-r) to remove that name from the children too.
zfs list -H -t snapshot -o name -s creation -d 1 "${POOL}/${DATASET}" \
  | head -n -${KEEP_LOCAL} \
  | xargs -r -n1 zfs destroy -r
# Prune old snapshots on remote
ssh "${REMOTE}" \
  zfs list -H -t snapshot -o name -s creation -d 1 "${REMOTE_POOL}/${DATASET}" \
  | head -n -${KEEP_REMOTE} \
  | xargs -r -n1 ssh "${REMOTE}" zfs destroy -r
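# systemd timer to run hourly (it activates the matching zfs-replicate.service,
# whose ExecStart should point at /usr/local/bin/zfs-replicate)
# /etc/systemd/system/zfs-replicate.timer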
[Unit]
Description=ZFS hourly replication to DR site
[Timer]
OnCalendar=hourly
Persistent=true
[Install]
WantedBy=timers.target
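The pruning step in the replication script is the part most worth dry-running, since a mistake destroys snapshots. The sketch below isolates the selection logic so it can be tested on sample names before zfs destroy is attached. snapshots_to_prune is a hypothetical helper; it uses awk rather than GNU head -n -N, so it does not depend on GNU coreutils.

```shell
# snapshots_to_prune KEEP
# Reads snapshot names on stdin, oldest first, and prints all but the newest KEEP.
snapshots_to_prune() {
    awk -v keep="$1" '
        { line[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print line[i] }
    '
}

# Dry run on sample names: with KEEP=3, only the two oldest are selected.
printf '%s\n' \
    rpool/srv@2026-04-01T00:00 \
    rpool/srv@2026-04-01T01:00 \
    rpool/srv@2026-04-01T02:00 \
    rpool/srv@2026-04-01T03:00 \
    rpool/srv@2026-04-01T04:00 \
    | snapshots_to_prune 3
```

Piping the output through a reviewer's eyes first, and into xargs zfs destroy only afterwards, keeps the destructive step separate from the selection step.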
5. Encryption
ZFS encryption is native, per-dataset, and transparent to applications. It operates at the dataset level — not the pool level. You can encrypt some datasets and leave others unencrypted, all in the same pool. Encryption happens in the kernel ZFS module, before data reaches disk. It is AES-256-GCM by default.
Creating encrypted datasets
# Passphrase-protected dataset (prompts at creation)
zfs create -o encryption=aes-256-gcm \
-o keyformat=passphrase \
-o keylocation=prompt \
rpool/home/alice
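# Alternative to the passphrase example: keyfile (for automated unlock).
# keyformat=raw requires a key of exactly 32 bytes.
install -d -m 700 /etc/zfs/keys
dd if=/dev/urandom of=/etc/zfs/keys/alice.key bs=32 count=1
chmod 400 /etc/zfs/keys/alice.key
zfs create -o encryption=aes-256-gcm \
-o keyformat=raw \
-o keylocation=file:///etc/zfs/keys/alice.key \
rpool/home/alice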
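A raw keyfile of the wrong length is rejected when the key is loaded, so it is worth checking the size when the key is generated. The sketch below writes to a temporary directory rather than /etc/zfs/keys so it can be run safely anywhere. The paths and variable names are illustrative.

```shell
# Generate a raw key with private permissions and confirm it is exactly 32 bytes.
umask 077                      # files created below are readable by the owner only
keydir=$(mktemp -d)
keyfile="$keydir/example.key"

head -c 32 /dev/urandom > "$keyfile"

key_size=$(wc -c < "$keyfile" | tr -d ' ')
if [ "$key_size" -eq 32 ]; then
    echo "key ok: $key_size bytes"
else
    echo "unexpected key size: $key_size bytes" >&2
fi
```

Setting the umask before creating the file avoids the brief window in which a key written with default permissions is world-readable.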
Loading and unloading keys
# Load key and mount (on boot or after key change)
zfs load-key rpool/home/alice
zfs mount rpool/home/alice
# Load all encrypted datasets (systemd service does this)
zfs load-key -a
# Unload key (the dataset must be unmounted first; unload-key fails while it is in use)
zfs unmount rpool/home/alice
zfs unload-key rpool/home/alice
# Check encryption status
zfs get encryption,keystatus,keylocation rpool/home/alice
Encryption + send/receive
Raw sends preserve the encryption. The ciphertext is transmitted without decryption. The receiving pool stores the encrypted blocks. Without loading the key on the remote, the data cannot be read.
# Raw send — encrypted blocks, key not transmitted
zfs snapshot rpool/home/alice@2026-04-02
zfs send --raw rpool/home/alice@2026-04-02 \
| ssh dr-host zfs receive backup/home/alice
# On the DR host — data exists but cannot be read without the key
zfs list backup/home/alice
zfs get keystatus backup/home/alice # shows: unavailable
# To restore: bring the key to the DR host and load it
zfs load-key backup/home/alice # prompts for passphrase
zfs mount backup/home/alice
Key management: passphrase, keyfile, prompted on boot
# Change the key format or location without re-encrypting data
# (only the wrapping key changes, not the actual encryption key)
zfs change-key -o keyformat=passphrase rpool/home/alice
# Change the passphrase
zfs change-key rpool/home/alice
# Rotate to a new keyfile
zfs change-key -o keylocation=file:///etc/zfs/keys/alice-new.key rpool/home/alice
Encrypted home directories with auto-unlock via PAM
kldload supports per-user encrypted datasets that unlock at login using
pam_zfs_key. The dataset key is derived from the user's login password. The
dataset is mounted when the user logs in and unmounted when their last session exits.
# Install pam_zfs_key (included in kldload)
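# /etc/pam.d/system-auth: add pam_zfs_key to the auth, password, and session stacks
# (auth captures the password, session mounts and unmounts, password keeps the key in sync)
# auth optional pam_zfs_key.so homes=rpool/home
# password optional pam_zfs_key.so homes=rpool/home
# session optional pam_zfs_key.so homes=rpool/home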
# The dataset name must match the username
# zfs create -o encryption=on -o keyformat=passphrase rpool/home/alice
# The passphrase must match the login password (or be derived from it)
# On login: pam_zfs_key loads the key, mounts the dataset
# On logout: key is unloaded if no other sessions exist
You can encrypt /home but leave /var/lib/docker unencrypted. You can replicate encrypted datasets to an untrusted remote: they arrive encrypted and the remote cannot read them without the key. ZFS gives you encrypted replication to untrusted storage for free. A cloud backup target receives your ciphertext. An offsite NAS stores your ciphertext. You own the key. Nobody else can read it.
6. Boot Environments
A boot environment is a bootable clone of rpool/ROOT/<name>. ZFS boot
environments give you the ability to snapshot the entire OS root filesystem, switch
between them from the bootloader, and roll back a failed update in seconds — without
any special tooling beyond ZFS and a compatible bootloader (ZFSBootMenu or GRUB with
ZFS support).
How kldload uses boot environments
kldload creates a new boot environment automatically before any system update. The sequence:
- Snapshot the current rpool/ROOT/default dataset.
- Create a new boot environment from that snapshot.
- Mark the new BE as the active boot target.
- Apply the update inside the running system.
- If the update fails or the system fails to boot: select the previous BE from ZFSBootMenu.
Manual boot environment management
# List boot environments
beadm list
# Create a new boot environment (snapshot of current root)
beadm create before-kernel-update
# Activate a boot environment (next boot will use it)
beadm activate before-kernel-update
# Mount a boot environment without booting it (inspect or repair)
beadm mount before-kernel-update /mnt
# Destroy a boot environment you no longer need
beadm destroy old-desktop-environment
Direct ZFS commands (without beadm)
# The pool layout for boot environments
# rpool/ROOT/ — parent dataset, never mounted directly
# default/ — the active root
# before-k6.2/ — saved before a kernel update
# Create a BE manually
zfs snapshot rpool/ROOT/default@before-k6.2
zfs clone rpool/ROOT/default@before-k6.2 rpool/ROOT/before-k6.2
# Set the bootfs property to switch which BE boots
zpool set bootfs=rpool/ROOT/before-k6.2 rpool
ZFSBootMenu
ZFSBootMenu is kldload's default bootloader for ZFS-on-root systems. It is a small initramfs that can read ZFS directly and presents a menu of available boot environments. You can select a BE, edit kernel arguments, import pools, and drop to a recovery shell — all before the OS starts.
# From the ZFSBootMenu main menu (bindings can differ between versions; the on-screen help is authoritative):
# [Enter]  boot the highlighted environment
# [Ctrl+K] choose a different kernel for this BE
# [Ctrl+E] edit kernel command line
# [Ctrl+S] snapshots menu (boot from a snapshot)
# [Ctrl+N] inside the snapshots menu: create a new snapshot of the BE
# To rollback a failed update:
# 1. Reboot
# 2. ZFSBootMenu shows list of boot environments
# 3. Select "before-kernel-update"
# 4. Boot — you are running the pre-update OS in seconds
7. ARC and L2ARC — the Caching Layer
ZFS does not use the Linux page cache. It has its own in-kernel cache called the ARC (Adaptive Replacement Cache). The ARC holds two lists: recently-used blocks and frequently-used blocks. It dynamically balances between them to maximize cache hits across both access patterns. On a read-heavy workload, ARC hit rates of 95%+ are normal.
How ARC works
The ARC grows to use available RAM and shrinks when the OS needs memory for applications. It is not a fixed allocation. On a KVM host under memory pressure, the ARC releases pages to guest VMs automatically. On an idle NAS, the ARC expands to hold the entire working set.
# Check current ARC size and hit rate
arc_summary
# Or directly from /proc/spl/kstat/zfs/arcstats
awk '/^c / {printf "ARC target: %.1f GB\n", $3/1024/1024/1024}
/^size / {printf "ARC current: %.1f GB\n", $3/1024/1024/1024}
/^hits / {h=$3}
/^misses / {m=$3; printf "Hit rate: %.1f%%\n", h/(h+m)*100}' \
/proc/spl/kstat/zfs/arcstats
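The same hit-rate calculation can be wrapped in a function that reads from stdin, which makes it testable against sample data and safe when the counters are still zero. arc_hit_rate is a hypothetical helper. It assumes the kstat layout of "name type data" columns and matches on the exact field name.

```shell
# arc_hit_rate: read arcstats-formatted lines on stdin, print the hit rate in percent.
# Prints "n/a" when there have been no hits or misses yet.
arc_hit_rate() {
    awk '
        $1 == "hits"   { h = $3 }
        $1 == "misses" { m = $3 }
        END {
            if (h + m > 0) printf "%.1f\n", h / (h + m) * 100
            else print "n/a"
        }
    '
}

# Sample input in the kstat layout.
# Real use: arc_hit_rate < /proc/spl/kstat/zfs/arcstats
printf 'hits 4 950\nmisses 4 50\n' | arc_hit_rate   # prints 95.0
```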
ARC size tuning
# Cap ARC at 8GB (suitable for a shared server with 32GB RAM)
# /etc/modprobe.d/zfs.conf
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
# 8GB = 8 * 1024 * 1024 * 1024 = 8589934592
# Set a minimum ARC size (ZFS will never release below this)
echo "options zfs zfs_arc_min=2147483648" >> /etc/modprobe.d/zfs.conf
# 2GB minimum
# Apply without rebooting (the value is in bytes)
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
# kldload's KVM profile sets arc_max to 50% of system RAM automatically
# Check the current live value
cat /sys/module/zfs/parameters/zfs_arc_max
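Since zfs_arc_max is given in bytes, the value is easy to mistype. The helpers below generate the modprobe line from a size in GiB, and compute half of RAM from /proc/meminfo-style input. They are hypothetical illustrations of the arithmetic, not the code kldload's KVM profile actually runs.

```shell
# gib_to_bytes N: N GiB expressed in bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# arc_max_option N: the modprobe.d line that caps the ARC at N GiB.
arc_max_option() {
    echo "options zfs zfs_arc_max=$(gib_to_bytes "$1")"
}

# half_ram_bytes: read /proc/meminfo-style input on stdin, print 50% of MemTotal in bytes.
half_ram_bytes() {
    kb=$(awk '$1 == "MemTotal:" { print $2 }')
    echo $(( kb * 1024 / 2 ))
}

arc_max_option 8                                       # options zfs zfs_arc_max=8589934592
printf 'MemTotal:       32768000 kB\n' | half_ram_bytes
```

On a live system the second helper would read from /proc/meminfo directly: half_ram_bytes < /proc/meminfo.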
L2ARC — second-level cache on SSD
The L2ARC extends the ARC onto an SSD. Blocks evicted from the RAM-based ARC are written to the L2ARC device. On a subsequent cache miss in RAM, ZFS checks the L2ARC before going to disk. The L2ARC is most useful when:
- Your working set is larger than RAM but fits on an SSD.
- Primary storage is slow spinning disks.
- The workload is predominantly reads (the L2ARC does not accelerate writes).
# Add an L2ARC device to a pool
zpool add rpool cache /dev/nvme2n1
# Check L2ARC statistics
arc_summary | grep -A 20 "L2 ARC"
# Remove the L2ARC device
zpool remove rpool /dev/nvme2n1
primarycache property
# Control what a dataset caches in ARC
# all (default): data + metadata cached in ARC
# metadata: only metadata cached, file data is read from disk on every access
# none: no ARC caching for this dataset
# Disable caching for a write-once archive dataset (free ARC for hot data)
zfs set primarycache=metadata rpool/backups
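# Disable entirely only for data that is read once and never revisited. This also stops
# caching metadata, so measure first (primarycache=metadata is usually the safer choice).
zfs set primarycache=none rpool/media/raw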
8. SLOG and Special Vdevs
SLOG — Separate Log
ZFS groups writes into transaction groups (TXGs) that commit roughly every 5 seconds. Synchronous writes (where the calling application waits for confirmation that data is on stable storage) cannot wait that long, so ZFS first records them in the ZFS Intent Log (ZIL). Without a SLOG, the ZIL lives on the pool's data vdevs, so every fsync costs a write to the main disks: several milliseconds on spinning disks, and more when the pool is busy. A SLOG is a dedicated fast device (typically NVMe) that holds the ZIL instead, so the write is acknowledged in tens of microseconds. The data is still written to the main pool on the normal TXG schedule; the SLOG is only read back after a crash.
# Add a SLOG mirror (mirroring protects in-flight sync writes if one log device dies)
zpool add rpool log mirror /dev/nvme0n1p1 /dev/nvme1n1p1
# The SLOG device should be:
# 1. Fast (NVMe, not SATA SSD)
# 2. Power-loss safe (enterprise NVMe with capacitors, or consumer NVMe with flush
#    support verified; some consumer SSDs acknowledge a flush before data is durable)
# 3. Mirrored (losing an unmirrored SLOG during a crash loses the most recent sync
#    writes; the pool itself can still be imported with zpool import -m)
# Check SLOG status
zpool status rpool | grep log -A 5
When SLOG matters
High benefit
PostgreSQL and MySQL (fsync-heavy), NFS with sync exports, Samba with oplocks disabled, any application that calls fsync() or O_SYNC frequently. Synchronous write latency drops from milliseconds per fsync on spinning disks to tens of microseconds.
No benefit
File servers, media streaming, web server content, Docker layers, most container workloads. These use async writes. The TXG batches them efficiently. Adding a SLOG to an async workload does nothing — the synchronous write path is never taken.
Special vdev — metadata acceleration
The special vdev receives metadata allocations and, optionally, small data blocks
below the special_small_blocks threshold. Moving metadata to NVMe means that
directory lookups, attribute reads, and small file access hit NVMe rather than spinning
disks — even when the data vdevs are HDD.
# Add a special vdev (mirror for durability)
zpool add rpool special mirror /dev/nvme0n1 /dev/nvme1n1
# Route small blocks (32k or smaller) to the special vdev as well
zfs set special_small_blocks=32768 rpool
# The special vdev must be mirrored — losing it loses all metadata = losing the pool
# Use the same or better redundancy as your data vdevs for the special vdev
Concrete example: NVMe SLOG for a PostgreSQL server
# Scenario: 6-disk RAIDZ2 spinning disk pool, PostgreSQL sync-heavy workload
# Without SLOG: each fsync writes the ZIL to the RAIDZ2 disks and waits for them (milliseconds per commit)
# With SLOG: each fsync is acknowledged by the NVMe log device in microseconds. The TXG still
# commits every 5s, but postgres does not wait for the spinning disks
# 1. Create the pool
zpool create rpool raidz2 /dev/sd{a..f}
# 2. Add mirrored SLOG (two cheap NVMe sticks with power-loss protection)
zpool add rpool log mirror /dev/nvme0n1 /dev/nvme1n1
# 3. Create the postgres dataset with appropriate settings
zfs create -o recordsize=8k \
-o compression=lz4 \
-o atime=off \
-o logbias=latency \
rpool/srv/postgres
# logbias=latency tells ZFS to use the SLOG path for sync writes
# logbias=throughput bypasses the SLOG — use throughput only when you
# are managing write ordering yourself (e.g. async postgres workloads)
Run zpool iostat -v rpool 1 with the SLOG in place. If the SLOG's write counters are not moving during database operations, your workload is async and the SLOG is decorative.
9. Delegation and Permissions
ZFS delegation allows non-root users and groups to perform specific ZFS operations on specific datasets. The model is fine-grained: you can allow a user to snapshot a dataset but not destroy it. You can allow a group to receive data into a dataset but not create child datasets. You can allow a service account to manage properties on its own dataset subtree without any access to other datasets.
zfs allow and zfs unallow
# Show current delegations on a dataset
zfs allow rpool/srv/postgres
# Grant a user the ability to snapshot their dataset
zfs allow alice snapshot rpool/home/alice
# Grant a group snapshot + send (backup team)
zfs allow -g backup snapshot,send rpool
# Grant a user full control of their subtree
# (snapshot, rollback, clone, destroy, create, mount, rename, receive)
zfs allow alice \
snapshot,rollback,clone,destroy,create,mount,rename,receive \
rpool/home/alice
# Delegate to a specific user for that dataset and its children
# (this is the default scope; -l limits it to the dataset itself, -d to descendants only)
zfs allow -u dbadmin \
snapshot,rollback,create,destroy,mount,quota,reservation \
rpool/srv/postgres
# Remove a delegation
zfs unallow alice snapshot rpool/home/alice
zfs unallow -g backup snapshot,send rpool
Available permissions
# Commonly delegated permissions (see zfs-allow(8) for the full list)
allow # grant your own permissions to other users
snapshot # create snapshots
rollback # rollback to snapshot
clone # clone a snapshot
destroy # destroy datasets and snapshots
create # create child datasets
mount # mount/unmount
rename # rename datasets
receive # zfs receive into dataset
send # zfs send (read all snapshots)
share # share via NFS/SMB
quota # set quota property
reservation # set reservation property
compression # set compression property
recordsize # set recordsize property
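# Group permissions into a reusable set with: zfs allow -s @setname perm[,perm...] dataset
# then grant the set by name (for example: zfs allow -g backup @setname dataset)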
Concrete example: backup user that can snapshot and send
# Create a dedicated backup service account
useradd -r -s /usr/sbin/nologin zfsbackup
# Grant it snapshot and send rights on the datasets it needs to replicate
zfs allow -u zfsbackup snapshot,send rpool/srv
zfs allow -u zfsbackup snapshot,send rpool/home
zfs allow -u zfsbackup snapshot,send rpool/vms
# The backup script runs as zfsbackup: it can take snapshots and stream them
# It cannot destroy datasets, change properties, or touch other pools
# Compute the name once so the snapshot and the send refer to the same snapshot
SNAP="rpool/srv@$(date +%Y-%m-%dT%H:%M)"
sudo -u zfsbackup zfs snapshot -r "$SNAP"
sudo -u zfsbackup zfs send -R "$SNAP" \
| ssh backup-host zfs receive backup/srv
Concrete example: DBA managing database datasets
# The DBA team gets full control of the postgres subtree
# They can create and destroy datasets, manage quotas, resize, snapshot, rollback
# They have no access to any other dataset on rpool
zfs allow -g dbadmin \
snapshot,rollback,clone,destroy,create,mount,quota,reservation,recordsize,compression \
rpool/srv/postgres
# DBAs can now manage their storage directly
# They cannot touch rpool/home, rpool/vms, or the pool itself
# The sysadmin retains control of pool-level settings and vdev topology
zfs allow gives them exactly the operations they need on exactly the datasets they own.
10. Performance Tuning
recordsize
recordsize is the most impactful single property in ZFS performance. It sets the maximum size of a ZFS block. A file smaller than recordsize is stored in a single block sized to fit it; a larger file is split into recordsize-sized blocks. When an application rewrites part of a record (an 8k database page inside a 128k record, say), ZFS must read the whole record, modify it, and write it back, so a mismatch multiplies I/O. The default is 128k, which is right for general-purpose workloads and wrong for databases.
# Check current recordsize
zfs get recordsize rpool/srv/postgres
# Set appropriate recordsize before loading data
# (changing recordsize does not rewrite existing data — only new writes use the new size)
zfs set recordsize=8k rpool/srv/postgres # PostgreSQL (8k pages)
zfs set recordsize=16k rpool/srv/mysql # MySQL InnoDB (16k pages)
zfs set recordsize=64k rpool/srv/mongodb # MongoDB WiredTiger (64k pages)
zfs set recordsize=128k rpool/srv/nginx # general files (default)
zfs set recordsize=1m rpool/media # large sequential files
Compression
# lz4: negligible CPU overhead, 1.5-3x compression ratio on typical data
# The default and right choice for almost everything
zfs set compression=lz4 rpool
# zstd: better ratio than lz4, significantly more CPU
# Right for cold archives and backup datasets
zfs set compression=zstd rpool/backups
# zstd-1 through zstd-19: higher level = better ratio, more CPU (plain zstd means zstd-3)
zfs set compression=zstd-3 rpool/backups # fast zstd, good ratio
zfs set compression=zstd-9 rpool/cold # slower, better ratio
# off: only for data that cannot be compressed (pre-compressed video, encrypted blobs)
zfs set compression=off rpool/media/raw
# Check compression ratios
zfs get compressratio rpool rpool/backups rpool/srv/postgres
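A compressratio such as 2.00x is easier to reason about as a percentage of space saved. The following is a minimal sketch, assuming the ratio is passed in the form zfs prints it; saved_percent is a hypothetical helper name, not part of ZFS.

```shell
# Hypothetical helper: convert a compressratio like "2.00x" into percent of space saved.
saved_percent() {
    # Strip the trailing "x" that zfs appends to the ratio.
    ratio="${1%x}"
    # saved = (1 - 1/ratio) * 100, truncated to a whole number.
    awk -v r="$ratio" 'BEGIN { if (r <= 0) exit 1; printf "%d\n", (1 - 1 / r) * 100 }'
}
```

Typical use would be saved_percent "$(zfs get -H -o value compressratio rpool/backups)".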
sync=disabled
# sync=disabled means ZFS acknowledges writes before they hit stable storage
# Data in the TXG buffer is at risk until the next TXG commit (~5 seconds)
# Power failure in that window = data loss
# Acceptable use cases:
# - Build caches and artifact stores (you can rebuild from source)
# - Temp/scratch datasets
# - Redis used as a cache (losing the last few seconds of writes is acceptable)
# - CI runner workspaces
zfs set sync=disabled rpool/tmp
zfs set sync=disabled rpool/build-cache
zfs set sync=disabled rpool/srv/redis # only if losing ~5 seconds of Redis writes is acceptable
# NEVER use sync=disabled on databases without external durability
# NEVER use sync=disabled on the OS root
# NEVER use sync=disabled on datasets you replicate to DR (unless the replica is your durability)
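One way to keep sync=disabled from spreading to the wrong datasets is to route the change through a wrapper that checks an allowlist first. This is a sketch under stated assumptions: SYNC_DISABLED_OK, may_disable_sync, and set_sync_disabled are hypothetical names, and the allowlist simply mirrors the datasets used above.

```shell
# Hypothetical allowlist: dataset prefixes where losing the last few seconds is acceptable.
SYNC_DISABLED_OK="rpool/tmp rpool/build-cache rpool/srv/redis"

# Succeeds only if the dataset is an allowlisted dataset or one of its children.
may_disable_sync() {
    ds="$1"
    for allowed in $SYNC_DISABLED_OK; do
        case "$ds" in
            "$allowed"|"$allowed"/*) return 0 ;;
        esac
    done
    return 1
}

# Applies sync=disabled only when the allowlist check passes.
set_sync_disabled() {
    if may_disable_sync "$1"; then
        zfs set sync=disabled "$1"
    else
        echo "refusing sync=disabled on $1" >&2
        return 1
    fi
}
```

With this in place, set_sync_disabled rpool/srv/postgres is rejected while set_sync_disabled rpool/tmp/ci goes through.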
Scrub scheduling
# Manual scrub (reads every block, verifies checksums, repairs from a mirror copy or parity if corrupt)
zpool scrub rpool
# Check scrub status
zpool status rpool
# Schedule weekly scrubs via systemd (kldload does this by default)
# /etc/systemd/system/zfs-scrub.timer
[Timer]
OnCalendar=weekly
Persistent=true
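# Throttle scrub I/O to reduce impact on foreground workloads
# (zfs_scrub_delay was removed in OpenZFS 0.8; limit concurrent scrub I/Os per vdev instead)
# Lower value = slower scrub, less I/O impact
echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active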
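A scheduled scrub only helps if its result is checked. The sketch below pulls the error count out of the scan line that zpool status prints after a completed scrub; scrub_errors and scrub_clean are hypothetical helper names, and the pattern assumes the "scrub repaired ... with N errors" wording.

```shell
# Hypothetical helper: read `zpool status` output on stdin, print the scrub error count.
scrub_errors() {
    sed -n 's/.*scrub repaired .* with \([0-9][0-9]*\) errors.*/\1/p'
}

# Succeeds only if a finished scrub is reported and it found zero errors.
scrub_clean() {
    n="$(scrub_errors)"
    [ -n "$n" ] && [ "$n" -eq 0 ]
}
```

A monitoring job could run zpool status rpool | scrub_clean and alert on failure. A scrub that is still running has no such line, so it is treated as not clean.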
11. ZFS and Databases (Deep Dive)
Databases and ZFS have a complex relationship. The wrong settings multiply write amplification: a database updating 8k pages on the default 128k recordsize can rewrite up to 16x the data it changed. The right settings give you strong performance plus data integrity guarantees that eliminate the need for some database safety features.
The core insight: page size alignment
Every database engine works in pages — fixed-size units it reads and writes atomically. If ZFS's recordsize does not match the database's page size, ZFS reads a full record to modify a partial page. Match them and writes become 1:1.
Additionally: ZFS provides copy-on-write atomicity. When ZFS writes a block, the old version stays intact until the new write commits. This is the same guarantee that databases use expensive mechanisms (double-write buffer, full-page writes) to achieve. When your filesystem already provides atomic writes, you can turn those database mechanisms off — less I/O, same or better durability.
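The cost of a page-size mismatch can be estimated with simple arithmetic. This is a rough sketch that ignores compression and metadata; rmw_amplification is a hypothetical helper name, and both arguments are sizes in bytes.

```shell
# Hypothetical helper: how many bytes ZFS rewrites per byte of page the database changes.
# Arguments: recordsize in bytes, database page size in bytes.
rmw_amplification() {
    recordsize="$1"
    page="$2"
    if [ "$recordsize" -le "$page" ]; then
        # A page covers whole records, so no read-modify-write is needed.
        echo 1
    else
        # One page update rewrites a full record.
        echo $(( recordsize / page ))
    fi
}
```

For example, rmw_amplification 131072 8192 describes PostgreSQL on the default recordsize, and rmw_amplification 8192 8192 describes the matched case.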
PostgreSQL
# PostgreSQL uses 8k pages by default (adjustable at initdb time)
zfs create -o recordsize=8k \
-o compression=lz4 \
-o atime=off \
-o logbias=throughput \
rpool/srv/postgres
# Separate dataset for WAL (Write-Ahead Log)
# WAL is sequential writes — larger recordsize is fine here
zfs create -o recordsize=128k \
-o compression=lz4 \
-o atime=off \
-o logbias=latency \
rpool/srv/postgres-wal
# In postgresql.conf:
# full_page_writes = off — ZFS provides this guarantee via CoW
# safe ONLY when postgres and ZFS are on the same host
# (not safe if using iSCSI or NFS to ZFS)
# Link the WAL to the separate dataset
# initdb -X /srv/postgres-wal/pg_wal /srv/postgres/data
# Verify settings
sudo -u postgres psql -c "SHOW full_page_writes;"
zfs get recordsize,compression,atime,logbias rpool/srv/postgres
MySQL / InnoDB
# InnoDB uses 16k pages by default
zfs create -o recordsize=16k \
-o compression=lz4 \
-o atime=off \
-o logbias=throughput \
rpool/srv/mysql
# Separate dataset for InnoDB redo log
zfs create -o recordsize=128k \
-o compression=lz4 \
-o atime=off \
rpool/srv/mysql-log
# In my.cnf:
# innodb_doublewrite = OFF
# The doublewrite buffer writes every page twice to prevent torn writes.
# ZFS copy-on-write already prevents torn writes.
# innodb_doublewrite=OFF on ZFS is safe and cuts write I/O by ~50%.
# innodb_flush_method = O_DIRECT
# Bypass the Linux page cache (ZFS has its own ARC — double caching wastes RAM)
# innodb_use_native_aio = ON (default — ZFS supports AIO)
MongoDB / WiredTiger
# WiredTiger uses 64k internal pages
zfs create -o recordsize=64k \
-o compression=lz4 \
-o atime=off \
rpool/srv/mongodb
# In mongod.conf, storage section:
# storage:
#   wiredTiger:
#     engineConfig:
#       cacheSizeGB: 4            # WiredTiger cache, keep below 50% RAM
#       journalCompressor: none   # ZFS compression handles this
#     collectionConfig:
#       blockCompressor: none     # let ZFS compress; avoid double compression
# MongoDB's journal provides crash consistency within WiredTiger.
# ZFS CoW provides the underlying block-level atomicity.
# Both are active and complement each other.
Redis
# Redis works in memory — its persistence (AOF/RDB) is sequential writes
zfs create -o recordsize=64k \
-o compression=lz4 \
-o atime=off \
-o sync=disabled \
rpool/srv/redis
# sync=disabled is a deliberate trade-off here:
# 1. With sync=disabled, ZFS acknowledges fsync() immediately, so Redis AOF fsyncs
#    no longer reach stable storage on Redis's schedule
# 2. The AOF rewrite is an atomic rename, and ZFS CoW keeps the on-disk state consistent
# 3. RDB snapshots are written to a temp file then renamed, which is also crash-consistent
# 4. After a power failure you can lose up to the last TXG interval (~5 seconds) of writes,
#    not just the last second; Redis recovers from whatever part of the AOF reached disk
# In redis.conf:
# appendonly yes
# appendfsync everysec # with sync=disabled the real loss window is the TXG interval,
# so use this only where losing a few seconds of writes is acceptable
SQLite
# SQLite uses variable page sizes (default 4k, tunable)
# Match recordsize to your SQLite page_size pragma
zfs create -o recordsize=4k \
-o compression=lz4 \
-o atime=off \
rpool/srv/sqlite
# For large SQLite databases, increase both:
# PRAGMA page_size = 8192; -- set before creating tables; changing it later requires a VACUUM (not possible in WAL mode)
# zfs set recordsize=8k rpool/srv/sqlite
12. Channel Programs (Advanced Automation)
Channel programs are Lua scripts that execute inside the ZFS kernel module with full atomicity. No other ZFS operation can interleave with a running channel program. This makes them the right tool for operations that need to check state and act conditionally — race-condition-free.
What channel programs can do
- Create and destroy snapshots atomically
- Check properties and take conditional action in the same atomic context
- Iterate over datasets and snapshots
- Set and get properties
- Perform multi-dataset operations that must succeed or fail as a unit
Basic usage
# Run a channel program
# zfs program <pool> <script> [args...]
# Simple: create a snapshot if the dataset exists
cat > /tmp/snap-if-exists.lua << 'EOF'
-- Script arguments arrive as a table; positional ones are in args["argv"]
local args = ...
local argv = args["argv"]
local dataset = argv[1]
local snapname = argv[2]
-- Check if the dataset exists
if not zfs.exists(dataset) then
    return "dataset does not exist: " .. dataset
end
-- Create the snapshot in syncing context; 0 means success, anything else is an errno
local err = zfs.sync.snapshot(dataset .. "@" .. snapname)
if err ~= 0 then
    return "snapshot failed: errno " .. err
end
return "snapshot created: " .. dataset .. "@" .. snapname
EOF
zfs program rpool /tmp/snap-if-exists.lua rpool/srv/postgres 2026-04-02T14:00
Atomic multi-dataset snapshot + property check
# Snapshot multiple datasets, but only if they are all under quota
cat > /tmp/quota-aware-snapshot.lua << 'EOF'
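local args = ...
local datasets = { "rpool/srv/postgres", "rpool/srv/mysql", "rpool/home" }
local snapname = args["argv"][1]
-- Check all datasets before snapshotting any
for _, ds in ipairs(datasets) do
    local used = zfs.get_prop(ds, "used")
    local quota = zfs.get_prop(ds, "quota")
    -- quota of 0 means no quota set
    -- Integer comparison avoids floating point and the math library,
    -- which the in-kernel Lua interpreter does not provide
    if quota > 0 and used * 100 > quota * 95 then
        return "ABORT: " .. ds .. " is above 95% of quota (" .. used .. " of " .. quota .. " bytes)"
    end
    -- Dry-run the snapshot so a failure is caught before anything is created
    local check = zfs.check.snapshot(ds .. "@" .. snapname)
    if check ~= 0 then
        return "ABORT: cannot snapshot " .. ds .. ": errno " .. check
    end
end
-- All checks passed: create the snapshots in the same transaction group
for _, ds in ipairs(datasets) do
    local err = zfs.sync.snapshot(ds .. "@" .. snapname)
    if err ~= 0 then
        return "snapshot failed on " .. ds .. ": errno " .. err
    end
end
return "all snapshots created: " .. snapname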
EOF
zfs program rpool /tmp/quota-aware-snapshot.lua "$(date +%Y-%m-%dT%H:%M)"
Conditional snapshot + destroy old snapshots atomically
# Create a new snapshot and destroy snapshots older than N, atomically
cat > /tmp/rolling-snap.lua << 'EOF'
local args = ...
local argv = args["argv"]
local dataset = argv[1]
local keep = tonumber(argv[2]) or 10
local now = argv[3]
-- Create new snapshot (0 means success)
local newsnap = dataset .. "@" .. now
local err = zfs.sync.snapshot(newsnap)
if err ~= 0 then return "snapshot failed: errno " .. err end
-- Collect all snapshots for this dataset
local snaps = {}
for snap in zfs.list.snapshots(dataset) do
    table.insert(snaps, snap)
end
-- Sort by name; with ISO-8601 timestamp names this is oldest first
table.sort(snaps)
-- Destroy oldest if over limit
local to_destroy = #snaps - keep
if to_destroy < 0 then to_destroy = 0 end
for i = 1, to_destroy do
    zfs.sync.destroy(snaps[i])
end
return "snapshot created, kept " .. keep .. ", destroyed " .. to_destroy
EOF
zfs program rpool /tmp/rolling-snap.lua rpool/srv/postgres 10 "$(date +%Y-%m-%dT%H:%M)"
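Because the rolling script destroys data, it can be worth previewing the retention rule outside the kernel first. The following sketch reproduces the same selection (sort by name, keep the newest N) in plain shell; snapshots_to_destroy is a hypothetical helper name, and it assumes snapshot names sort chronologically, as the timestamp names used above do.

```shell
# Hypothetical helper: read snapshot names on stdin, print the ones a keep-N policy would destroy.
# Argument: number of newest snapshots to keep.
snapshots_to_destroy() {
    keep="$1"
    # Sort by name (oldest first for timestamp names), then emit all but the last $keep.
    sort | awk -v keep="$keep" '
        { s[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print s[i] }
    '
}
```

A dry run could look like zfs list -H -t snapshot -o name -d 1 rpool/srv/postgres | snapshots_to_destroy 10, which prints candidates without destroying anything.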
13. Troubleshooting
zpool status — pool health and scrub state
# Full pool status
zpool status rpool
# Watch for errors in the output:
# state: ONLINE — healthy
# state: DEGRADED — one or more vdevs have failed, pool is still accessible
# state: FAULTED — pool cannot be opened (too many failed vdevs or unresolvable error)
# state: OFFLINE — vdev was taken offline manually
# Example degraded pool output
zpool status rpool
# pool: rpool
# state: DEGRADED
# status: One or more devices has been removed by the administrator.
# action: Online the device using 'zpool online' or replace the device with 'zpool replace'.
# scan: scrub repaired 0B in 00:01:15 with 0 errors on Sun Mar 30 00:26:15 2026
#
# config:
#   NAME          STATE     READ WRITE CKSUM
#   rpool         DEGRADED     0     0     0
#     mirror-0    DEGRADED     0     0     0
#       sda       ONLINE       0     0     0
#       sdb       OFFLINE      0     0     0
# Replace a failed drive (in-place, maintains redundancy)
zpool replace rpool /dev/sdb /dev/sdc
# Check the resilver (rebuild) progress
zpool status rpool
# Online a device that was offlined
zpool online rpool /dev/sdb
# Check when the last scrub ran and how many errors it found
zpool status -v rpool
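For alerting, the device table in zpool status can be reduced to just the entries that are not healthy. This is a sketch that assumes the column layout shown above (NAME, STATE, then counters); unhealthy_devices is a hypothetical helper name, and zpool status -x remains the simpler built-in check.

```shell
# Hypothetical helper: read `zpool status` output on stdin and print "name state"
# for every entry in the config table whose state is not ONLINE.
unhealthy_devices() {
    awk '
        # The table starts after the NAME/STATE header row.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # A blank line ends the table.
        in_config && NF == 0 { in_config = 0 }
        # Skip healthy devices and idle hot spares (state AVAIL).
        in_config && NF >= 2 && $2 != "ONLINE" && $2 != "AVAIL" { print $1, $2 }
    '
}
```

Run as zpool status rpool | unhealthy_devices; empty output means every listed device is ONLINE.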
zpool events — kernel-level ZFS events
# Stream kernel ZFS events live (checksum errors, I/O errors, pool state changes)
zpool events -vf
# Dump all events still held in the kernel event buffer (not preserved across reboots)
zpool events -v
# Key event classes:
# sysevent.fs.zfs.scrub_finish: scrub completed
# ereport.fs.zfs.checksum: checksum error detected
# ereport.fs.zfs.io: I/O error
# sysevent.fs.zfs.resilver_finish: resilver (rebuild) completed
# sysevent.fs.zfs.vdev_remove: vdev removed from pool
arc_summary — ARC statistics
# Full ARC report
arc_summary
# Key numbers to watch:
# ARC Size: current RAM used by cache
# ARC Target Size: what ZFS is aiming for
# Cache Hit Ratio: should be 90%+ for a healthy read-heavy workload
# L2ARC Hits: if you have an L2ARC, this shows how much it is contributing
# Evicted: blocks pushed out of ARC — normal, but high eviction + low hits = undersized ARC
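The hit ratio can also be computed directly from the kernel counters when arc_summary is not installed. The sketch below assumes the three-column name/type/value layout of /proc/spl/kstat/zfs/arcstats on Linux; arc_hit_percent is a hypothetical helper name, and the counters are cumulative since the module was loaded.

```shell
# Hypothetical helper: read arcstats-formatted text on stdin, print the hit ratio as a whole percent.
arc_hit_percent() {
    awk '
        $1 == "hits"   { h = $3 }
        $1 == "misses" { m = $3 }
        END {
            # No lookups recorded yet: nothing meaningful to report.
            if (h + m == 0) exit 1
            printf "%d\n", h * 100 / (h + m)
        }
    '
}
```

On a live system this would be arc_hit_percent < /proc/spl/kstat/zfs/arcstats.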
zdb — low-level pool inspection
# Inspect a pool's configuration (block device layout, vdev tree)
zdb -C rpool
# Show all metadata for a dataset (object types, property blocks)
zdb -d rpool/srv/postgres
# Read a specific object (advanced: useful when debugging data corruption)
zdb -ddddd rpool/srv/postgres 5
# Verify block checksums without scrubbing (read-only verification)
zdb -cc rpool # WARNING: this reads the entire pool, slow on large pools (-c alone checks metadata only)
Common issues and fixes
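ARC consuming too much RAM. On Linux the ARC defaults to a large fraction of system memory (historically half) unless you set zfs_arc_max. On a shared server, cap it: echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max (8 GiB). To persist across reboots, add options zfs zfs_arc_max=8589934592 to /etc/modprobe.d/zfs.conf.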
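Since zfs_arc_max takes a byte count, it is easy to get the number of zeros wrong. A small sketch that generates the modprobe line from a size in GiB; arc_max_option is a hypothetical helper name.

```shell
# Hypothetical helper: print the modprobe option line for an ARC cap given in GiB.
arc_max_option() {
    gib="$1"
    echo "options zfs zfs_arc_max=$(( gib * 1024 * 1024 * 1024 ))"
}
```

For instance, arc_max_option 8 > /etc/modprobe.d/zfs.conf writes the 8 GiB cap. On systems that load the zfs module from the initramfs, the initramfs likely needs regenerating before the setting applies at boot.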
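Slow scrubs impacting I/O. Throttle scrub I/O by limiting concurrent scrub reads per vdev: echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active. Lower value = fewer scrub I/Os in flight = less impact on foreground I/O (the older zfs_scrub_delay tunable no longer exists in current OpenZFS). A running scrub can also be paused with zpool scrub -p rpool and resumed with zpool scrub rpool. Schedule scrubs during off-hours with a systemd timer.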
Pool import fails after unclean shutdown. On import ZFS opens the last consistent transaction group and replays the intent log (ZIL); this is normal and fast. If import fails and reports a missing device, a disk is physically missing or not visible under the search path. List what ZFS can find with zpool import -d /dev, then reattach or replace the missing device.
# Import a pool whose separate log device is missing (a pool with enough redundancy imports degraded without -m)
zpool import -m -f -d /dev rpool # -m: missing log device OK
# -f: force (ignore hostid mismatch)
# If the pool was exported on another host
zpool import -d /dev rpool
# List all importable pools (shows what ZFS can find on available devices)
zpool import -d /dev
Checksum errors. Checksum errors during scrub mean a block was corrupted on disk. If you have redundancy (mirror, RAIDZ), ZFS repairs it automatically and increments the repair counter. Zero errors after repair = fixed. Persistent errors on a specific vdev = that drive is failing.
# After a scrub that found and repaired errors
zpool status -v rpool
# Look for "repaired" in the scan line and per-vdev error counts
# A vdev with rising CKSUM errors is failing — replace it soon
# Clear error counters after replacing a drive
zpool clear rpool
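Pool will not import: wrong hostid. ZFS records the host ID of the system that last imported the pool. If the hostid changes (new OS install, cloud instance clone) and the pool was not exported first, import is refused because the pool appears to be in use by another system. Fix: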
# Check current hostid
hostid
# Force import ignoring hostid mismatch
zpool import -f rpool
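# After import, write the current hostid to /etc/hostid so it stays stable across reboots
# (add -f to overwrite an existing /etc/hostid)
zgenhostid "$(hostid)"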
Related pages
- ZFS Zero to Hero — the prerequisite: pools, snapshots, basic properties
- ZFS Wiki: Pool Design — hardware selection and vdev topology reference
- ZFS Wiki: Snapshots & Replication — sanoid/syncoid for automated lifecycle management
- ZFS Wiki: Encryption — full encryption reference including LUKS comparison
- ZFS Wiki: Memory & ARC — deep ARC internals and tuning tables
- ZFS Wiki: Tuning for Workloads — property cheat sheet by workload type
- Databases on ZFS — full setup guides for PostgreSQL, MySQL, MongoDB on kldload
- NAS Server recipe — concrete pool layout for a home or lab NAS
- dRAID Storage recipe — 12+ disk dRAID array from scratch