ZFS Masterclass

This guide goes deep on ZFS — well past the basics covered in ZFS Zero to Hero. You already know how to create a pool, take a snapshot, and roll back. This guide covers the decisions that separate an expert ZFS operator from someone who knows the commands: pool topology for real hardware, dataset architecture as a design discipline, send/receive pipelines for production DR, encryption that survives replication, boot environments as OS safety nets, and the performance knobs that make a 2x difference on databases.

What this page covers: pool topology and vdev selection for real workloads, datasets as architecture, the full send/receive replication model, native ZFS encryption with encrypted replication, boot environments, ARC and L2ARC tuning, SLOG and special vdevs, delegation and permissions, performance tuning by workload, ZFS with databases (PostgreSQL, MySQL, MongoDB, Redis), channel programs for atomic automation, and a troubleshooting reference.

Prerequisites: the ZFS Zero to Hero tutorial. You should be comfortable with zpool create, zfs snapshot, zfs rollback, and basic property management before reading this.

1. ZFS Is Not a Filesystem

When people say "ZFS is a filesystem" they are describing one feature of a storage platform. ZFS is a volume manager, a filesystem, a snapshot engine, a replication system, and a data integrity layer — all in one, all sharing the same on-disk format, all operating atomically. No other tool does all of this. You cannot bolt these properties onto ext4. You cannot simulate them with LVM + mdadm + rsync. They only exist as a unified whole.

The traditional Linux storage stack looks like this:

# Traditional stack: four separate tools, no shared atomicity
Hardware
  └─ mdadm (RAID)
       └─ LVM (volume management)
            └─ ext4 / xfs (filesystem)
                 └─ rsync / tar (backup/replication)

Each layer is independent. RAID does not know about the filesystem. LVM does not know about snapshots. rsync does not know about block changes — it scans files. When you replicate with rsync, you are comparing modification times and checksums of millions of files. When you scrub with mdadm, you are checking parity — not data integrity. When ext4 corrupts silently, nothing detects it until an application reads garbage.

ZFS collapses all of this:

# ZFS: one tool, all layers, full atomicity
Hardware
  └─ ZFS (RAID + volume management + filesystem + snapshot + replication + checksums)

The copy-on-write guarantee. Every write in ZFS goes to a new location on disk. The old location stays valid until the transaction commits. This means: no partial writes, no torn blocks, no "filesystem in an inconsistent state after power loss." You never run fsck on ZFS. There is nothing to check — either the write committed atomically or the old version is still intact.

The checksum guarantee. Every block in ZFS carries a checksum stored in the block's parent pointer — not in the block itself. A corrupted block cannot forge its own checksum. ZFS detects corruption on every read and, if you have redundancy, silently repairs it from a good copy. This is called self-healing and it runs automatically, every time data is read.
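
The parent-held checksum idea can be sketched in a few lines of shell. This is an illustration only: sha256sum stands in for the checksum ZFS stores in the block pointer (the ZFS default algorithm is fletcher4), and block_checksum and verify_block are made-up helper names, not ZFS commands.

```shell
# Illustration only: sha256sum stands in for the checksum ZFS keeps in the
# parent block pointer. block_checksum and verify_block are made-up helpers.

# Print the checksum of the data arriving on stdin.
block_checksum() {
  sha256sum | cut -d' ' -f1
}

# $1 = checksum held by the parent; stdin = block contents as read from disk.
# Succeeds only when the contents still match what the parent recorded.
verify_block() {
  [ "$(block_checksum)" = "$1" ]
}

# At write time the parent records the checksum of the block.
parent_sum=$(printf 'important data' | block_checksum)

# At read time the block is checked against the parent's copy.
if printf 'important data' | verify_block "$parent_sum"; then
  echo "block ok"
fi
if ! printf 'important dat4' | verify_block "$parent_sum"; then
  echo "corruption detected"
fi
```

Because the expected value lives outside the block, a block that is damaged on disk has no way to make itself look valid.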

The reason kldload uses ZFS on root for everything — KVM, Docker, Kubernetes, databases, home directories — is that ZFS provides the same primitives everywhere: snapshots, clones, compression, checksums, replication. Learn ZFS once and you have the tool for every storage problem. A KVM disk image is a ZFS volume with snapshots. A Docker layer cache is a ZFS dataset with clones. A database WAL is a ZFS dataset with its own sync and recordsize settings. The storage model is uniform across the entire stack.

2. Pool Design for Real Workloads

A ZFS pool (zpool) is the top-level storage container. Everything below it — datasets, volumes, snapshots — lives inside the pool. The pool is made of vdevs (virtual devices). The topology of those vdevs determines your IOPS, redundancy, rebuild speed, and failure tolerance.

Mirror

Two or more disks that hold identical copies of all data. Reads can be served from either disk (doubling read IOPS). Writes go to all mirrors simultaneously (write IOPS = single disk). Rebuild time is bounded by disk read speed, not pool size — a 4TB mirror pair rebuilds in roughly the time it takes to read 4TB from one disk.

# 2-disk mirror (the reliable homelab minimum)
zpool create rpool mirror /dev/sda /dev/sdb

# 3-way mirror (survive 2 simultaneous drive failures)
zpool create rpool mirror /dev/sda /dev/sdb /dev/sdc

# Two 2-disk mirror vdevs striped (4 disks, 2x write IOPS, 4x read IOPS)
zpool create rpool \
  mirror /dev/sda /dev/sdb \
  mirror /dev/sdc /dev/sdd

RAIDZ1, RAIDZ2, RAIDZ3

RAIDZ is ZFS's answer to RAID5/6. Unlike hardware RAID, RAIDZ uses variable-width stripes that eliminate the write hole — no partial-stripe writes, no need for a battery-backed write cache. RAIDZ1 tolerates one disk failure. RAIDZ2 tolerates two. RAIDZ3 tolerates three.

# RAIDZ2 across 6 disks (4 data + 2 parity)
zpool create rpool raidz2 /dev/sd{a..f}

# Two RAIDZ2 vdevs striped (12 disks total — common production layout)
zpool create rpool \
  raidz2 /dev/sd{a..f} \
  raidz2 /dev/sd{g..l}

dRAID

Distributed RAID, available in OpenZFS 2.1+. Data, parity, and spare capacity are distributed across all disks in the vdev instead of being confined to fixed parity groups and dedicated spare drives. This enables faster rebuilds (rebuild I/O is spread over every disk) and lets you reserve distributed spares that take over automatically when a drive fails. dRAID shines on large drive counts (12+) where RAIDZ rebuild times become dangerous.

# dRAID2 across 12 disks: parity=2, data=8 per group, 2 distributed spares
# (data + parity + spares must not exceed the number of children: 8 + 2 + 2 = 12)
zpool create rpool draid2:8d:12c:2s /dev/sd{a..l}
# notation: draid{parity}:{data}d:{children}c:{spares}s
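
A dRAID specification is easy to get wrong by one or two disks, so it can help to check the arithmetic before running zpool create. The sketch below assumes the OpenZFS rule that data + parity + spares may not exceed the number of children. draid_fits and draid_max_data are hypothetical helper names, not part of ZFS.

```shell
# draid_fits PARITY DATA CHILDREN SPARES
# Succeeds when one redundancy group (DATA + PARITY disks) fits in the
# children that remain after the distributed spares are set aside.
draid_fits() {
  [ $(( $1 + $2 + $4 )) -le "$3" ]
}

# draid_max_data PARITY CHILDREN SPARES
# Prints the widest data stripe that still fits.
draid_max_data() {
  echo $(( $2 - $3 - $1 ))
}

if draid_fits 1 4 11 1; then echo "draid1:4d:11c:1s fits"; fi
if ! draid_fits 3 20 24 2; then echo "draid3:20d:24c:2s does not fit"; fi
draid_max_data 3 24 2    # prints 19
```

A narrower data width than the maximum is often a reasonable choice, since smaller groups trade capacity for IOPS.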

When to use each

Mirror

Best random read IOPS. Fastest rebuild. Highest cost per usable TB. Best for: OS pools, KVM host root, databases, anything latency-sensitive. Minimum: 2 disks. Maximum redundancy: as many mirrors as you want.

rule: if it runs databases or VMs, use mirrors

RAIDZ2

Good sequential throughput. Two disk failure tolerance. Slower rebuild than mirrors. Best for: NAS, media storage, large cold storage. Use RAIDZ2 as the minimum for drives over 2TB — RAIDZ1 on large drives has a meaningful chance of a second failure during the hours-long rebuild.

rule: RAIDZ2 minimum for drives 2TB+. RAIDZ1 is for small drives or acceptable data loss risk.

RAIDZ3

Three disk failure tolerance. High overhead (3 parity disks). Justified for large pools (10+ disks) where simultaneous triple failure is non-trivial, or for archival storage where rebuild windows stretch into days.

rule: RAIDZ3 for large archival arrays where rebuilds take more than 24 hours

dRAID

Distributed rebuild — every disk participates in restoring a failed drive, so rebuild speed scales with drive count instead of a single rebuild drive bottleneck. Ideal for 12+ disk arrays.

rule: dRAID when you have 12+ disks and rebuild time is the threat model

Stripe width and IOPS

RAIDZ random IOPS scale with the number of vdevs, not the number of disks per vdev: each RAIDZ vdev delivers roughly the random IOPS of a single disk. Two RAIDZ2 vdevs of 6 disks each deliver roughly 2x the random IOPS of one RAIDZ2 vdev of 12 disks, at the cost of two more parity disks (8 disks of usable capacity instead of 10). When IOPS matter, split your disks into more vdevs rather than wider vdevs.

# Low IOPS: one wide vdev
zpool create rpool raidz2 /dev/sd{a..l}   # 1 vdev, 12 disks, ~1x random IOPS, 10 disks usable

# Higher IOPS: two narrower vdevs striped
zpool create rpool \
  raidz2 /dev/sd{a..f} \
  raidz2 /dev/sd{g..l}   # 2 vdevs, 12 disks, ~2x random IOPS, 8 disks usable

Special allocation classes: metadata vdev, L2ARC, SLOG

Beyond data vdevs, ZFS supports three auxiliary vdev types that accelerate specific access patterns:

special vdev: receives metadata and, optionally, small data blocks (at or below the special_small_blocks threshold). Put this on NVMe to make a slow spinning-disk pool feel fast for random access.
L2ARC: a second-level read cache (SSD) that extends the in-RAM ARC. Covered in section 7.
SLOG (Separate Log): an NVMe device that accelerates synchronous writes. Covered in section 8.

# Add a special vdev to an existing pool
zpool add rpool special mirror /dev/nvme0n1 /dev/nvme1n1

# Route blocks smaller than 64k to the special vdev (metadata + small files)
zfs set special_small_blocks=64k rpool

Concrete examples

2-disk homelab: A mirror. Two identical drives, zpool create rpool mirror /dev/sda /dev/sdb. Nothing else makes sense at 2 drives — you have no room for RAIDZ, and a single disk is a single point of failure. If you only have 2 drives, use a mirror.

4-disk NAS: Two 2-disk mirrors striped, or RAIDZ2 across 4 disks. Both layouts leave two drives of usable capacity. Mirrors give better IOPS and faster rebuild. RAIDZ2 survives the loss of any two drives, while striped mirrors survive two failures only when they hit different mirror pairs. For a NAS serving media, RAIDZ2 is fine. For a NAS serving a database, use mirrors.

12-disk production server: Two 6-disk RAIDZ2 vdevs for storage. A mirror pair of NVMe for the boot pool. Optionally, a special vdev on NVMe to accelerate metadata. If the workload is database-heavy, replace the two RAIDZ2 vdevs with six 2-disk mirrors — more IOPS, faster rebuild, no stranded capacity problem.
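
The capacity side of these layouts is simple arithmetic, sketched below. usable_disks is a hypothetical helper, and the result is counted in whole disks: it ignores RAIDZ padding, metadata, and the space ZFS reserves for itself, so treat it as an upper bound.

```shell
# usable_disks VDEVS WIDTH REDUNDANCY
# Disks' worth of usable space for VDEVS identical vdevs of WIDTH disks each.
# REDUNDANCY is the parity count for RAIDZ (1, 2 or 3). For an N-way mirror
# pass WIDTH-1, because a mirror stores one disk's worth of data.
usable_disks() {
  echo $(( $1 * ($2 - $3) ))
}

echo "4 disks,  2x 2-way mirror:   $(usable_disks 2 2 1)"   # 2
echo "4 disks,  1x RAIDZ2:         $(usable_disks 1 4 2)"   # 2
echo "12 disks, 2x 6-wide RAIDZ2:  $(usable_disks 2 6 2)"   # 8
echo "12 disks, 6x 2-way mirror:   $(usable_disks 6 2 1)"   # 6
```

The 12-disk comparison shows the price of the mirror layout: two fewer disks of capacity in exchange for six vdevs' worth of random IOPS.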

The most common ZFS mistake is using RAIDZ1 because "it's like RAID5." RAIDZ1 on large drives (4TB+) has a real chance of a second drive failure during rebuild — the rebuild stresses every remaining drive for hours or days. An unrecoverable read error on the surviving drives during rebuild means data loss with no warning. RAIDZ2 is the minimum for drives over 2TB. For anything mission-critical, mirrors are better: faster rebuilds, faster random reads, and you can lose any one drive without rebuild stress causing a second failure.

3. Datasets as Architecture

A ZFS dataset is a filesystem with its own properties, its own snapshot timeline, its own quota, its own compression algorithm, its own record size, and its own mount point. Creating a dataset costs nothing — there is no pre-allocation, no format step, no minimum size. It is as cheap as creating a directory. But a dataset gives you control that a directory never can.

Datasets vs directories

# A directory: you can set permissions, that is all
mkdir -p /srv/postgres
chown postgres:postgres /srv/postgres

# A dataset: you control everything
zfs create -o recordsize=8k \
           -o compression=lz4 \
           -o atime=off \
           -o logbias=throughput \
           rpool/srv/postgres

# Every property can be different per dataset
# Every dataset has its own snapshot timeline
# Every dataset can be replicated independently

Inheritance

ZFS properties cascade from parent to child unless overridden. Set compression on rpool and every dataset beneath it inherits it. Set a different compression on rpool/vms and that dataset and its children use the override. Clear the override with zfs inherit (for example, zfs inherit compression rpool/vms) and the dataset falls back to the parent value.

# Set defaults on the pool root — all children inherit
zfs set compression=lz4 rpool
zfs set atime=off rpool
zfs set xattr=sa rpool

# Override for a specific workload
zfs set compression=zstd rpool/backups   # better ratio for cold storage
zfs set atime=on rpool/home              # some apps need access times

# Check where a property comes from
zfs get -o name,property,value,source compression rpool rpool/backups

The kldload dataset layout

kldload installs with a deliberate dataset hierarchy. Each workload gets its own dataset with tuned properties:

rpool/
  ROOT/          # boot environments live here
    default/     # the active OS root
  home/          # user home directories (atime=on, compression=lz4)
  vms/           # KVM disk images (recordsize=64k or volblocksize=64k for zvols)
  docker/        # Docker storage driver root (compression=lz4, atime=off)
  srv/           # services
    postgres/    # recordsize=8k, logbias=throughput
    nginx/       # default settings
    redis/       # recordsize=64k, sync=disabled
  backups/       # zfs send output (compression=zstd, atime=off)

# Inspect the full property set for a dataset
zfs get all rpool/srv/postgres

# See only locally-set properties (not inherited)
zfs get -s local all rpool/srv/postgres

Per-application tuning reference

recordsize

The maximum block size ZFS uses when writing data. Match it to the application's I/O unit. The default is 128k, which is larger than the page size of most databases and causes read-modify-write amplification on small random updates. See sections 10 and 11 for specifics by database.

zfs set recordsize=8k rpool/srv/postgres

compression

lz4 for almost everything — near-zero CPU cost, 1.5-3x ratio on typical data. zstd for cold archives where CPU time is acceptable. off only for already-compressed data (video, images, encrypted blobs).

zfs set compression=lz4 rpool

atime

Access time tracking writes a metadata update every time a file is read. Turn it off on any dataset where access patterns do not require it — nearly everything except mail spools and some application logs.

zfs set atime=off rpool

sync

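standard (default): writes are synchronous only when the application asks for it (fsync, O_SYNC). always: forces every write to sync, for maximum durability at lower throughput. disabled: every write is treated as asynchronous, even when the application calls fsync, so a crash or power failure can lose the last few seconds of acknowledged writes. Use disabled only for ephemeral or externally-replicated data.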

zfs set sync=disabled rpool/tmp

logbias

Hint to ZFS about the workload type. latency (default): prefer low-latency paths including the SLOG. throughput: prefer throughput over latency, bypass the SLOG. Use throughput for databases that already manage their own write ordering.

zfs set logbias=throughput rpool/srv/postgres

quota / reservation

quota caps how much space a dataset can use. reservation guarantees space is available. Use quotas to prevent runaway datasets from filling the pool. Use reservations to guarantee space for critical workloads.

zfs set quota=500G rpool/vms/untrusted-vm

Datasets are ZFS's killer feature that people learn last. A dataset is a filesystem with its own properties, its own snapshot timeline, its own quota, its own compression setting. Creating a dataset costs nothing — there is no pre-allocation, no size limit, no formatting step. Use datasets like you use directories. More datasets means more granular control. If you find yourself thinking "should I create a dataset for this?" — yes, you should. The cost is zero. The benefit is snapshot isolation, independent replication, and per-workload tuning.

4. Send/Receive — the Replication Engine

zfs send serializes the state of a snapshot into a byte stream. zfs receive deserializes it into a dataset on any ZFS pool, anywhere. Together they form the only block-level, incremental, checksummed, resumable replication tool built into a filesystem. rsync scans files. zfs send compares block pointers.

Full send

# Send a snapshot to a remote host over SSH
zfs snapshot rpool/srv/postgres@2026-04-01

zfs send rpool/srv/postgres@2026-04-01 \
  | ssh backup-host zfs receive -F backup/postgres

Incremental send

An incremental send transmits only the blocks that changed between two snapshots. ZFS computes this by comparing snapshot metadata — it does not scan files. A 2TB dataset that changed 1GB sends 1GB.

# Incremental send from @2026-04-01 to @2026-04-02
zfs snapshot rpool/srv/postgres@2026-04-02

zfs send -i rpool/srv/postgres@2026-04-01 \
             rpool/srv/postgres@2026-04-02 \
  | ssh backup-host zfs receive backup/postgres

# -I (capital) sends every intermediate snapshot between the two (catches up if you missed one)
zfs send -I rpool/srv/postgres@2026-04-01 \
            rpool/srv/postgres@2026-04-02 \
  | ssh backup-host zfs receive backup/postgres
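
An incremental send only works when the base snapshot exists on both sides, so a replication job usually starts by finding the newest snapshot the two hosts have in common. A minimal sketch of that step follows. latest_common_snapshot is a hypothetical helper, and it works on plain text files holding snapshot names so the logic can be tried without a pool.

```shell
# latest_common_snapshot SOURCE_LIST REMOTE_LIST
# Each file holds snapshot names (the part after "@"), one per line,
# oldest first. Prints the newest name present in both, or nothing.
latest_common_snapshot() {
  # -F fixed strings, -x whole-line match, -f read patterns from a file
  grep -Fxf "$2" "$1" | tail -n 1
}

src=$(mktemp)
dst=$(mktemp)
printf '%s\n' 2026-03-30 2026-03-31 2026-04-01 2026-04-02 > "$src"
printf '%s\n' 2026-03-30 2026-03-31 > "$dst"

latest_common_snapshot "$src" "$dst"    # prints 2026-03-31
rm -f "$src" "$dst"
```

On a real system the two lists could come from zfs list -H -t snapshot -o name -s creation -d 1 on each host, with the dataset prefix stripped by sed 's/.*@//'.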

Recursive send

# Send an entire dataset tree recursively
# Creates matching snapshots on all child datasets
zfs snapshot -r rpool/srv@2026-04-02

zfs send -R rpool/srv@2026-04-02 \
  | ssh backup-host zfs receive -F backup/srv

Raw send (encrypted datasets)

Raw sends transmit the encrypted bytes without decrypting. The receiving end stores the ciphertext. Without the key, the data is unreadable — you can replicate to an untrusted remote.

# Send encrypted dataset without decrypting
zfs send --raw rpool/home/alice@2026-04-02 \
  | ssh untrusted-remote zfs receive backup/home/alice

# The remote stores encrypted blocks — it cannot read them without the key
# The key never leaves the source system

Resume support

Large sends over unreliable networks can be interrupted. When the receiving side runs zfs receive -s, ZFS keeps the partially received state and records a resume token, so the transfer can continue from where it stopped.

# On the backup host: a receive started with -s that fails mid-stream leaves a resume token
zfs get receive_resume_token backup/postgres

# Resume the interrupted send from the source
RESUME_TOKEN=$(ssh backup-host zfs get -H -o value receive_resume_token backup/postgres)
zfs send -t "$RESUME_TOKEN" | ssh backup-host zfs receive -s backup/postgres
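
When no interrupted receive is pending, zfs get -H -o value prints a single dash for receive_resume_token, so a script should check the value before passing it to zfs send -t. A small sketch, using the hypothetical helpers have_resume_token and choose_send_mode:

```shell
# have_resume_token VALUE
# VALUE is the output of: zfs get -H -o value receive_resume_token DATASET
# ZFS prints "-" when the property is not set.
have_resume_token() {
  [ -n "$1" ] && [ "$1" != "-" ]
}

# choose_send_mode VALUE
# Prints "resume" when a token is present, "fresh" otherwise.
choose_send_mode() {
  if have_resume_token "$1"; then
    echo resume
  else
    echo fresh
  fi
}

choose_send_mode "-"                # prints fresh
choose_send_mode "1-abc-example"    # prints resume (placeholder, not a real token)
```

A replication script can branch on the result: resume with zfs send -t when a token exists, otherwise fall back to the normal incremental send.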

Bookmarks — keep the send position without keeping the snapshot

An incremental send requires the previous snapshot to exist on the source. But snapshots consume space. A bookmark records the send position without retaining the data — it costs almost nothing and lets the source prune old snapshots while keeping the ability to continue incrementally.

# Create a bookmark from a snapshot
zfs bookmark rpool/srv/postgres@2026-04-01 rpool/srv/postgres#2026-04-01

# Now you can destroy the snapshot but keep the bookmark
zfs destroy rpool/srv/postgres@2026-04-01

# The next incremental send uses the bookmark as the base
zfs send -i rpool/srv/postgres#2026-04-01 \
            rpool/srv/postgres@2026-04-02 \
  | ssh backup-host zfs receive backup/postgres

Automated hourly replication pipeline

#!/bin/bash
# /usr/local/bin/zfs-replicate — run hourly via systemd timer
set -euo pipefail

POOL=rpool
DATASET=srv
REMOTE=backup-host
REMOTE_POOL=backup
SNAP_NAME="$(date +%Y-%m-%dT%H:%M)"
KEEP_LOCAL=48     # keep 48 hourly snapshots on source
KEEP_REMOTE=168   # keep 7 days of hourly snapshots on remote

# Snapshot the full tree
zfs snapshot -r "${POOL}/${DATASET}@${SNAP_NAME}"

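# Find the most recent snapshot on the remote to use as the incremental base.
# On the first run the remote dataset does not exist yet, so a failed listing
# is treated as "no base" instead of aborting the script.
LAST_REMOTE=$(ssh "${REMOTE}" \
  zfs list -H -t snapshot -o name -s creation -d 1 "${REMOTE_POOL}/${DATASET}" \
  2>/dev/null | tail -1 | sed 's/.*@//' || true)

if [[ -z "$LAST_REMOTE" ]]; then
  # First run: full send
  zfs send -R "${POOL}/${DATASET}@${SNAP_NAME}" \
    | ssh "${REMOTE}" zfs receive -F "${REMOTE_POOL}/${DATASET}"
else
  # Incremental send over WireGuard (wg0 is the DR tunnel).
  # Host key checking stays enabled. No -F here: with a -R stream, receive -F
  # also destroys remote snapshots the source has already pruned, which would
  # defeat KEEP_REMOTE. Keep the remote dataset readonly so it never diverges.
  zfs send -R -I \
    "${POOL}/${DATASET}@${LAST_REMOTE}" \
    "${POOL}/${DATASET}@${SNAP_NAME}" \
    | ssh "${REMOTE}" zfs receive "${REMOTE_POOL}/${DATASET}"
fi

# Prune old snapshots on source. -d 1 lists only the top-level dataset's
# snapshots so KEEP_LOCAL counts snapshot runs; destroy -r removes each run
# from every child dataset.
zfs list -H -t snapshot -o name -s creation -d 1 "${POOL}/${DATASET}" \
  | head -n -"${KEEP_LOCAL}" \
  | xargs -r -n1 zfs destroy -r

# Prune old snapshots on remote
ssh "${REMOTE}" \
  zfs list -H -t snapshot -o name -s creation -d 1 "${REMOTE_POOL}/${DATASET}" \
  | head -n -"${KEEP_REMOTE}" \
  | xargs -r -n1 ssh "${REMOTE}" zfs destroy -r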

# systemd timer to run hourly (activates the matching zfs-replicate.service, which runs the script)
# /etc/systemd/system/zfs-replicate.timer
[Unit]
Description=ZFS hourly replication to DR site

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target
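
A note on retention arithmetic: when a snapshot listing covers a whole tree (zfs list -r), it interleaves the snapshots of every child dataset, so trimming by line count keeps N entries overall, not N per dataset. One way to compute per-dataset retention is sketched below. prune_candidates is a hypothetical helper that only prints names; it destroys nothing.

```shell
# prune_candidates KEEP
# Reads "dataset@snapshot" names on stdin, oldest first, possibly mixing
# several datasets. Prints every snapshot except the newest KEEP of each
# dataset, in the original order.
prune_candidates() {
  awk -F@ -v keep="$1" '
    { line[NR] = $0; ds[NR] = $1; total[$1]++ }
    END {
      for (i = 1; i <= NR; i++) {
        seen[ds[i]]++
        if (seen[ds[i]] <= total[ds[i]] - keep) print line[i]
      }
    }'
}

# Three snapshots of srv/a and two of srv/b, keeping the newest two of each:
printf '%s\n' \
  rpool/srv/a@h1 rpool/srv/b@h1 \
  rpool/srv/a@h2 rpool/srv/b@h2 \
  rpool/srv/a@h3 \
  | prune_candidates 2              # prints rpool/srv/a@h1
```

The output can be reviewed first and then piped to xargs -r -n1 zfs destroy once it looks right.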

zfs send/receive is the most underappreciated feature in ZFS. It is a block-level, incremental, checksummed, resumable replication tool built into the filesystem. Nothing in the Linux ecosystem comes close. rsync compares files. zfs send compares block pointers. A 2TB dataset that changed 1GB sends 1GB — not by scanning 2TB of files, but by comparing snapshot metadata in milliseconds. The incremental send of a lightly-used PostgreSQL database is typically a few hundred megabytes per hour. The incremental send of a heavily-modified VM disk is proportional to writes, not pool size. This is the model that enterprise storage SANs sell for hundreds of thousands of dollars. ZFS ships it for free.

5. Encryption

ZFS encryption is native, per-dataset, and transparent to applications. It operates at the dataset level — not the pool level. You can encrypt some datasets and leave others unencrypted, all in the same pool. Encryption happens in the kernel ZFS module, before data reaches disk. It is AES-256-GCM by default.

Creating encrypted datasets

# Passphrase-protected dataset (prompts at creation)
zfs create -o encryption=aes-256-gcm \
           -o keyformat=passphrase \
           -o keylocation=prompt \
           rpool/home/alice

# Keyfile (for automated unlock)
dd if=/dev/urandom bs=32 count=1 > /etc/zfs/keys/alice.key
chmod 400 /etc/zfs/keys/alice.key

zfs create -o encryption=aes-256-gcm \
           -o keyformat=raw \
           -o keylocation=file:///etc/zfs/keys/alice.key \
           rpool/home/alice

Loading and unloading keys

# Load key and mount (on boot or after key change)
zfs load-key rpool/home/alice
zfs mount rpool/home/alice

# Load all encrypted datasets (systemd service does this)
zfs load-key -a

# Unload key (unmount first: unload-key fails while the dataset is mounted)
zfs unmount rpool/home/alice
zfs unload-key rpool/home/alice

# Check encryption status
zfs get encryption,keystatus,keylocation rpool/home/alice

Encryption + send/receive

Raw sends preserve the encryption. The ciphertext is transmitted without decryption. The receiving pool stores the encrypted blocks. Without loading the key on the remote, the data cannot be read.

# Raw send — encrypted blocks, key not transmitted
zfs snapshot rpool/home/alice@2026-04-02
zfs send --raw rpool/home/alice@2026-04-02 \
  | ssh dr-host zfs receive backup/home/alice

# On the DR host — data exists but cannot be read without the key
zfs list backup/home/alice
zfs get keystatus backup/home/alice  # shows: unavailable

# To restore: bring the key to the DR host and load it
zfs load-key backup/home/alice   # prompts for passphrase
zfs mount backup/home/alice

Key management: passphrase, keyfile, prompted on boot

# Change the key format or location without re-encrypting data
# (only the wrapping key changes, not the actual encryption key)
zfs change-key -o keyformat=passphrase rpool/home/alice

# Change the passphrase
zfs change-key rpool/home/alice

# Rotate to a new keyfile
zfs change-key -o keylocation=file:///etc/zfs/keys/alice-new.key rpool/home/alice

Encrypted home directories with auto-unlock via PAM

kldload supports per-user encrypted datasets that unlock at login using pam_zfs_key. The dataset key is derived from the user's login password. The dataset is mounted when the user logs in and unmounted when their last session exits.

# Install pam_zfs_key (included in kldload)
# /etc/pam.d/system-auth — add after auth stack
# auth optional pam_zfs_key.so homes=rpool/home

# The dataset name must match the username
# zfs create -o encryption=on -o keyformat=passphrase rpool/home/alice
# The passphrase must match the login password (or be derived from it)

# On login: pam_zfs_key loads the key, mounts the dataset
# On logout: key is unloaded if no other sessions exist

ZFS encryption is per-dataset, not per-pool. You can encrypt /home but leave /var/lib/docker unencrypted. You can replicate encrypted datasets to an untrusted remote — they arrive encrypted and the remote cannot read them without the key. This is the only filesystem that gives you encrypted replication to untrusted storage for free. A cloud backup target receives your ciphertext. An offsite NAS stores your ciphertext. You own the key. Nobody else can read it.

6. Boot Environments

A boot environment is a bootable clone of rpool/ROOT/<name>. ZFS boot environments give you the ability to snapshot the entire OS root filesystem, switch between them from the bootloader, and roll back a failed update in seconds — without any special tooling beyond ZFS and a compatible bootloader (ZFSBootMenu or GRUB with ZFS support).

How kldload uses boot environments

kldload creates a new boot environment automatically before any system update. The sequence:

Snapshot the current rpool/ROOT/default dataset.
Create a new boot environment from that snapshot.
Mark the new BE as the active boot target.
Apply the update inside the running system.
If the update fails or the system fails to boot: select the previous BE from ZFSBootMenu.

Manual boot environment management

# List boot environments
beadm list

# Create a new boot environment (snapshot of current root)
beadm create before-kernel-update

# Activate a boot environment (next boot will use it)
beadm activate before-kernel-update

# Mount a boot environment without booting it (inspect or repair)
beadm mount before-kernel-update /mnt

# Destroy a boot environment you no longer need
beadm destroy old-desktop-environment

Direct ZFS commands (without beadm)

# The pool layout for boot environments
# rpool/ROOT/          — parent dataset, never mounted directly
#   default/           — the active root
#   before-k6.2/       — saved before a kernel update

# Create a BE manually
zfs snapshot rpool/ROOT/default@before-k6.2
zfs clone rpool/ROOT/default@before-k6.2 rpool/ROOT/before-k6.2

# Set the bootfs property to switch which BE boots
zpool set bootfs=rpool/ROOT/before-k6.2 rpool

ZFSBootMenu

ZFSBootMenu is kldload's default bootloader for ZFS-on-root systems. It is a small initramfs that can read ZFS directly and presents a menu of available boot environments. You can select a BE, edit kernel arguments, import pools, and drop to a recovery shell — all before the OS starts.

# From the ZFSBootMenu prompt:
# [Enter]           — boot the highlighted environment
# [K]               — choose a different kernel for this BE
# [E]               — edit kernel command line
# [S]               — snapshots menu (boot directly from a snapshot)
# [Alt+S]           — snapshot the current BE before booting

# To rollback a failed update:
# 1. Reboot
# 2. ZFSBootMenu shows list of boot environments
# 3. Select "before-kernel-update"
# 4. Boot — you are running the pre-update OS in seconds

Boot environments are the reason you can fearlessly update a kldload system. Before the update, a snapshot is taken. If the update breaks boot, you select the previous environment from the bootloader. Total rollback time: 10 seconds. This is why ZFS on root matters: it is not just about data protection, it is about OS protection. Most other Linux distributions keep /boot on a separate ext4 or FAT32 partition and hope the update completes cleanly. kldload treats the OS as data, subject to the same snapshot and rollback guarantees as everything else on the system.

7. ARC and L2ARC — the Caching Layer

ZFS does not use the Linux page cache. It has its own in-kernel cache called the ARC (Adaptive Replacement Cache). The ARC holds two lists: recently-used blocks and frequently-used blocks. It dynamically balances between them to maximize cache hits across both access patterns. On a read-heavy workload, ARC hit rates of 95%+ are normal.

How ARC works

The ARC grows to use available RAM and shrinks when the OS needs memory for applications. It is not a fixed allocation. On a KVM host under memory pressure, the ARC releases pages to guest VMs automatically. On an idle NAS, the ARC expands to hold the entire working set.

# Check current ARC size and hit rate
arc_summary

# Or directly from /proc/spl/kstat/zfs/arcstats
awk '/^c / {printf "ARC target: %.1f GB\n", $3/1024/1024/1024}
     /^size / {printf "ARC current: %.1f GB\n", $3/1024/1024/1024}
     /^hits / {h=$3}
     /^misses / {m=$3; printf "Hit rate: %.1f%%\n", h/(h+m)*100}' \
  /proc/spl/kstat/zfs/arcstats

ARC size tuning

# Cap ARC at 8GB (suitable for a shared server with 32GB RAM)
# /etc/modprobe.d/zfs.conf
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
# 8GB = 8 * 1024 * 1024 * 1024 = 8589934592

# Set a minimum ARC size (ZFS will never release below this)
echo "options zfs zfs_arc_min=2147483648" >> /etc/modprobe.d/zfs.conf
# 2GB minimum

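# With root on ZFS the module loads from the initramfs, so rebuild it
# (update-initramfs -u on Debian/Ubuntu, dracut -f on RHEL-family)
# for the modprobe.d options to take effect at boot

# Apply without rebooting (the value is in bytes)
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max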

# kldload's KVM profile sets arc_max to 50% of system RAM automatically
# Check the current live value
cat /sys/module/zfs/parameters/zfs_arc_max
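
The byte values above are easy to mistype, so it can be simpler to derive them from the amount of installed RAM. A minimal sketch, where arc_max_bytes is a hypothetical helper that takes total memory in KiB (the unit /proc/meminfo reports) and a percentage:

```shell
# arc_max_bytes MEM_KIB PERCENT
# Prints PERCENT of MEM_KIB kibibytes, expressed in bytes, which is the
# unit zfs_arc_max expects.
arc_max_bytes() {
  echo $(( $1 * 1024 * $2 / 100 ))
}

arc_max_bytes 33554432 25    # 32 GiB host at 25%: prints 8589934592 (8 GiB)

# On a live Linux system the input could come from /proc/meminfo:
#   arc_max_bytes "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)" 50
```

The printed number can be written to /sys/module/zfs/parameters/zfs_arc_max or placed in /etc/modprobe.d/zfs.conf as shown above.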

L2ARC — second-level cache on SSD

The L2ARC extends the ARC onto an SSD. Blocks evicted from the RAM-based ARC are written to the L2ARC device. On a subsequent cache miss in RAM, ZFS checks the L2ARC before going to disk. The L2ARC is most useful when:

Your working set is larger than RAM but fits on an SSD.
Primary storage is slow spinning disks.
The workload is predominantly reads (the L2ARC does not accelerate writes).

# Add an L2ARC device to a pool
zpool add rpool cache /dev/nvme2n1

# Check L2ARC statistics
arc_summary | grep -A 20 "L2 ARC"

# Remove the L2ARC device
zpool remove rpool /dev/nvme2n1

primarycache property

# Control what a dataset caches in ARC
# all (default): data + metadata cached in ARC
# metadata: only metadata cached, data goes straight to disk
# none: no ARC caching for this dataset

# Disable caching for a write-once archive dataset (free ARC for hot data)
zfs set primarycache=metadata rpool/backups

# Disable entirely for pre-compressed video storage (no benefit from cache)
zfs set primarycache=none rpool/media/raw

ARC is the reason ZFS "uses all your RAM." It is not a bug — it is an intelligent cache that gives back memory when applications need it. But on a KVM host or container host, you need to cap it so guests get memory. The kvm profile in kldload caps at 50% of RAM. The rule: dedicated storage server = let ARC use everything. Shared server (KVM, containers, databases) = cap ARC at 50%. Without a cap on a KVM host, the ARC can hold 90% of RAM and leave guests starved, causing them to swap to disk — which then triggers more ZFS I/O — which the ARC tries to cache — creating a positive feedback loop. Cap it before the first VM boots.

8. SLOG and Special Vdevs

SLOG — Separate Log

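ZFS groups writes into transaction groups (TXGs) that commit roughly every 5 seconds. Synchronous writes (where the calling application waits for confirmation that data is on stable storage) do not wait for the TXG: ZFS records them in the ZFS Intent Log (ZIL) and acknowledges them as soon as the log record is on stable storage. Without a SLOG, the ZIL lives on the pool's data disks, so every fsync pays the write-and-flush latency of those disks, typically several milliseconds on spinning drives, and competes with regular pool I/O. A SLOG is a dedicated fast device (usually NVMe) that holds the ZIL instead, so the sync write is acknowledged at NVMe latency. The main pool write still happens on the normal TXG schedule; the SLOG is only read back after a crash, to replay writes that had not yet reached a TXG.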

# Add a SLOG mirror (mirroring protects in-flight sync writes if one log device dies)
zpool add rpool log mirror /dev/nvme0n1p1 /dev/nvme1n1p1

# The SLOG device must be:
# 1. Fast (NVMe, not SATA SSD)
# 2. Power-loss safe (enterprise NVMe with capacitors, or consumer NVMe
#    with flush support verified; consumer SATA SSDs lie about flush)
# 3. Mirrored (an unmirrored SLOG that dies during a crash takes the not-yet-committed
#    sync writes with it; the pool itself can still be imported with zpool import -m)

# Check SLOG status
zpool status rpool | grep log -A 5

When SLOG matters

High benefit

PostgreSQL and MySQL (fsync-heavy), NFS with sync exports, Samba with oplocks disabled, any application that calls fsync() or uses O_SYNC frequently. Synchronous write latency drops from the milliseconds of a ZIL write to spinning pool disks to the tens of microseconds of an NVMe write.

postgres, nfs, iscsi, databases with sync=always

No benefit

File servers, media streaming, web server content, Docker layers, most container workloads. These use async writes. The TXG batches them efficiently. Adding a SLOG to an async workload does nothing — the synchronous write path is never taken.

http files, media, docker layers, logs

Special vdev — metadata acceleration

The special vdev receives metadata allocations and, optionally, small data blocks below the special_small_blocks threshold. Moving metadata to NVMe means that directory lookups, attribute reads, and small file access hit NVMe rather than spinning disks — even when the data vdevs are HDD.

# Add a special vdev (mirror for durability)
zpool add rpool special mirror /dev/nvme0n1 /dev/nvme1n1

# Route small blocks (under 32k) to the special vdev as well
zfs set special_small_blocks=32768 rpool

# The special vdev must be mirrored — losing it loses all metadata = losing the pool
# Use the same or better redundancy as your data vdevs for the special vdev

Concrete example: NVMe SLOG for a PostgreSQL server

# Scenario: 6-disk RAIDZ2 spinning disk pool, PostgreSQL sync-heavy workload
# Without SLOG: each fsync writes the ZIL to the RAIDZ2 disks (milliseconds per commit)
# With SLOG: each fsync lands on NVMe (tens of microseconds); the TXG still
#            commits every ~5s but postgres does not wait for it

# 1. Create the pool
zpool create rpool raidz2 /dev/sd{a..f}

# 2. Add mirrored SLOG (two cheap NVMe sticks with power-loss protection)
zpool add rpool log mirror /dev/nvme0n1 /dev/nvme1n1

# 3. Create the postgres dataset with appropriate settings
zfs create -o recordsize=8k \
           -o compression=lz4 \
           -o atime=off \
           -o logbias=latency \
           rpool/srv/postgres

# logbias=latency (the default) sends sync writes through the SLOG
# logbias=throughput bypasses the SLOG and writes sync data straight to the
# main pool; consider it only when many concurrent sync writers would
# saturate the log device and aggregate throughput matters more than latency

SLOG is the most misunderstood ZFS feature. It does not speed up all writes, only synchronous writes (where the application waits for the write to hit stable storage). Databases and NFS are sync-heavy. File servers and media streaming are not. Adding a SLOG to an async workload does nothing. Know your workload before buying hardware. Run zpool iostat -v rpool 1 with the SLOG in place: if the log device's write counters are not moving during database operations, your workload is async and the SLOG is decorative. (Read counters on a SLOG stay near zero in normal operation, because the log is only read back during crash recovery.)
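
One way to get a feel for synchronous write cost before buying hardware is to time a burst of small O_DSYNC writes on the dataset in question. The sketch below assumes GNU dd (for oflag=dsync); the function name sync_write_probe and the default of 200 writes are illustrative choices, not part of ZFS. Dividing the elapsed time dd reports by the write count gives a rough per-write latency.

```shell
# sync_write_probe DIR [COUNT]
# Writes COUNT 8k blocks into a temp file under DIR, each with O_DSYNC, and
# prints dd's summary line (bytes copied, elapsed seconds, throughput).
# Assumes GNU dd; the function name and defaults are hypothetical.
sync_write_probe() {
  probe_dir=$1
  probe_count=${2:-200}
  probe_file=$(mktemp "$probe_dir/syncprobe.XXXXXX") || return 1
  dd if=/dev/zero of="$probe_file" bs=8k count="$probe_count" oflag=dsync 2>&1 | tail -n 1
  rm -f "$probe_file"
}

# Example: probe the system temp directory. Point it at a directory on the
# dataset you care about (for instance the postgres data directory's parent).
sync_write_probe "${TMPDIR:-/tmp}" 50
```

Run it once before and once after adding a log device. If the elapsed time barely changes, the bottleneck is elsewhere or the dataset is not taking the sync path.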

9. Delegation and Permissions

ZFS delegation allows non-root users and groups to perform specific ZFS operations on specific datasets. The model is fine-grained: you can allow a user to snapshot a dataset but not destroy it. You can allow a group to receive data into a dataset but not create child datasets. You can allow a service account to manage properties on its own dataset subtree without any access to other datasets.

zfs allow and zfs unallow

# Show current delegations on a dataset
zfs allow rpool/srv/postgres

# Grant a user the ability to snapshot their dataset
zfs allow alice snapshot rpool/home/alice

# Grant a group snapshot + send (backup team)
zfs allow -g backup snapshot,send rpool

# Grant a user full control of their subtree
# (snapshot, rollback, clone, destroy, create, mount, rename, receive)
zfs allow alice \
  snapshot,rollback,clone,destroy,create,mount,rename,receive \
  rpool/home/alice

# Delegate to a user for the dataset's descendants only (-d); the dataset itself
# is excluded. Use -l for the dataset only, or neither flag for both.
zfs allow -d -u dbadmin \
  snapshot,rollback,create,destroy,mount,quota,reservation \
  rpool/srv/postgres

# Remove a delegation
zfs unallow alice snapshot rpool/home/alice
zfs unallow -g backup snapshot,send rpool

Available permissions

# Commonly delegated permissions (see zfs-allow(8) for the full list)
allow                     # grant your own permissions to other users
snapshot                  # create snapshots
rollback                  # rollback to snapshot
clone                     # clone a snapshot
destroy                   # destroy datasets and snapshots
create                    # create child datasets
mount                     # mount/unmount
rename                    # rename datasets
receive                   # zfs receive into dataset
send                      # zfs send (read all snapshots)
share                     # share via NFS/SMB
quota                     # set quota property
reservation               # set reservation property
compression               # set compression property
recordsize                # set recordsize property
# Group permissions into named sets with: zfs allow -s @setname perm[,perm...] dataset

Concrete example: backup user that can snapshot and send

# Create a dedicated backup service account
useradd -r -s /usr/sbin/nologin zfsbackup

# Grant it snapshot and send rights on the datasets it needs to replicate
zfs allow -u zfsbackup snapshot,send rpool/srv
zfs allow -u zfsbackup snapshot,send rpool/home
zfs allow -u zfsbackup snapshot,send rpool/vms

# The backup script runs as zfsbackup: it can take snapshots and stream them
# It cannot destroy datasets, change properties, or touch other pools
# Compute the snapshot name once so both commands agree across a minute boundary
SNAP="rpool/srv@$(date +%Y-%m-%dT%H:%M)"
sudo -u zfsbackup zfs snapshot -r "$SNAP"
sudo -u zfsbackup zfs send -R "$SNAP" \
  | ssh backup-host zfs receive backup/srv

Concrete example: DBA managing database datasets

# The DBA team gets full control of the postgres subtree
# They can create and destroy datasets, manage quotas, resize, snapshot, rollback
# They have no access to any other dataset on rpool

zfs allow -g dbadmin \
  snapshot,rollback,clone,destroy,create,mount,quota,reservation,recordsize,compression \
  rpool/srv/postgres

# DBAs can now manage their storage directly
# They cannot touch rpool/home, rpool/vms, or the pool itself
# The sysadmin retains control of pool-level settings and vdev topology

Delegation is how you give teams ownership of their storage without giving them root. The backup team can snapshot and replicate. The database team can resize their datasets. Nobody can touch the root pool or other teams' datasets. This is ZFS's access control model and it is more granular than anything sudo can do for storage. A sudo rule saying "this user can run zfs commands" gives them full ZFS access. zfs allow gives them exactly the operations they need on exactly the datasets they own.
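
The backup example above repeats the same snapshot,send pair for every dataset. zfs allow also supports named permission sets (zfs allow -s), which keep that list in one place. The sketch below only prints the commands rather than running them, so it can be reviewed first; the set name @backupops and the helper name print_backup_delegation are hypothetical.

```shell
# print_backup_delegation USER DATASET...
# Prints (does not execute) the commands that define a permission set on each
# dataset and grant it to USER. "@backupops" is a made-up set name.
print_backup_delegation() {
  deleg_user=$1
  shift
  for deleg_ds in "$@"; do
    echo "zfs allow -s @backupops snapshot,send $deleg_ds"
    echo "zfs allow -u $deleg_user @backupops $deleg_ds"
  done
}

print_backup_delegation zfsbackup rpool/srv rpool/home
# prints:
# zfs allow -s @backupops snapshot,send rpool/srv
# zfs allow -u zfsbackup @backupops rpool/srv
# zfs allow -s @backupops snapshot,send rpool/home
# zfs allow -u zfsbackup @backupops rpool/home
```

Once the output looks right, pipe it to sh as root. Changing what the backup account may do then means editing the set definition, not every grant.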

10. Performance Tuning

recordsize

recordsize is the most impactful single property in ZFS performance. It sets the maximum size of a block within a file. A file smaller than recordsize is stored as a single block sized to fit. A larger file is split into recordsize-sized blocks, and changing any part of one of those blocks rewrites the whole block. The default is 128k, which is right for general-purpose workloads and wrong for databases that update small pages in place.

# Check current recordsize
zfs get recordsize rpool/srv/postgres

# Set appropriate recordsize before loading data
# (changing recordsize does not rewrite existing data — only new writes use the new size)
zfs set recordsize=8k rpool/srv/postgres        # PostgreSQL (8k pages)
zfs set recordsize=16k rpool/srv/mysql          # MySQL InnoDB (16k pages)
zfs set recordsize=64k rpool/srv/mongodb        # MongoDB WiredTiger (64k pages)
zfs set recordsize=128k rpool/srv/nginx         # general files (default)
zfs set recordsize=1m rpool/media               # large sequential files

Compression

# lz4: negligible CPU overhead, 1.5-3x compression ratio on typical data
# The default and right choice for almost everything
zfs set compression=lz4 rpool

# zstd: better ratio than lz4, significantly more CPU
# Right for cold archives and backup datasets
zfs set compression=zstd rpool/backups

# zstd-1 through zstd-19: higher level = better ratio, more CPU (plain zstd means zstd-3)
zfs set compression=zstd-3 rpool/backups     # fast zstd, good ratio
zfs set compression=zstd-9 rpool/cold        # slower, better ratio

# off: only for data that cannot be compressed (pre-compressed video, encrypted blobs)
zfs set compression=off rpool/media/raw

# Check compression ratios
zfs get compressratio rpool rpool/backups rpool/srv/postgres

sync=disabled

# sync=disabled means ZFS acknowledges writes before they hit stable storage
# Data in the TXG buffer is at risk until the next TXG commit (~5 seconds)
# Power failure in that window = data loss

# Acceptable use cases:
# - Build caches and artifact stores (you can rebuild from source)
# - Temp/scratch datasets
# - Redis, when losing the last few seconds of writes on power failure is acceptable
# - CI runner workspaces

zfs set sync=disabled rpool/tmp
zfs set sync=disabled rpool/build-cache
zfs set sync=disabled rpool/srv/redis  # fsync becomes a no-op: up to ~5s of writes at risk

# NEVER use sync=disabled on databases without external durability
# NEVER use sync=disabled on the OS root
# NEVER use sync=disabled on datasets you replicate to DR (unless the replica is your durability)

Scrub scheduling

# Manual scrub (reads every block, verifies checksums, repairs from parity if corrupt)
zpool scrub rpool

# Check scrub status
zpool status rpool

# Schedule weekly scrubs via systemd (kldload does this by default)
# /etc/systemd/system/zfs-scrub.timer
[Timer]
OnCalendar=weekly
Persistent=true

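# Reduce scrub impact on foreground workloads by lowering the number of
# concurrent scrub I/Os per vdev (lower = slower scrub, less I/O impact)
# Note: the older zfs_scrub_delay tunable was removed in ZFS 0.8
echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active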

The single biggest performance win in ZFS is getting recordsize right. A PostgreSQL database with recordsize=128k (the default) does 16x the I/O it needs because each 8k page write triggers a 128k block write. ZFS reads the existing 128k block, modifies the 8k portion, and writes 128k back. This is called write amplification. Set recordsize=8k and write amplification drops to 1x — the 8k write goes directly into an 8k block. This one property change can double your database IOPS on the same hardware. It costs you nothing except the time to set it before loading data.
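
The 16x figure is just the ratio of recordsize to page size. A minimal sketch of that arithmetic, assuming sizes are given in KiB and the worst case where every page write rewrites its whole record (write_amp is a hypothetical helper name):

```shell
# write_amp RECORDSIZE_KIB PAGE_KIB
# Worst-case KiB rewritten per page write, expressed as a multiple of the page.
write_amp() {
  wa_rs=$1
  wa_page=$2
  if [ "$wa_rs" -le "$wa_page" ]; then
    echo 1
  else
    echo $(( (wa_rs + wa_page - 1) / wa_page ))
  fi
}

for rs in 128 64 16 8; do
  printf 'recordsize=%sk, 8k page: %sx\n' "$rs" "$(write_amp "$rs" 8)"
done
# prints:
# recordsize=128k, 8k page: 16x
# recordsize=64k, 8k page: 8x
# recordsize=16k, 8k page: 2x
# recordsize=8k, 8k page: 1x
```

Real workloads land below the worst case when several pages in the same record are dirtied within one TXG, so treat the number as an upper bound.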

11. ZFS and Databases (Deep Dive)

Databases and ZFS have a complex relationship. The wrong settings produce 10x write amplification. The right settings give you better performance than any other filesystem plus data integrity guarantees that eliminate the need for some database safety features.

The core insight: page size alignment

Every database engine works in pages — fixed-size units it reads and writes atomically. If ZFS's recordsize does not match the database's page size, ZFS reads a full record to modify a partial page. Match them and writes become 1:1.

Additionally: ZFS provides copy-on-write atomicity. When ZFS writes a block, the old version stays intact until the new write commits. This is the same guarantee that databases use expensive mechanisms (double-write buffer, full-page writes) to achieve. When your filesystem already provides atomic writes, you can turn those database mechanisms off — less I/O, same or better durability.

PostgreSQL

# PostgreSQL uses 8k pages by default (adjustable at initdb time)
zfs create -o recordsize=8k \
           -o compression=lz4 \
           -o atime=off \
           -o logbias=throughput \
           rpool/srv/postgres

# Separate dataset for WAL (Write-Ahead Log)
# WAL is sequential writes — larger recordsize is fine here
zfs create -o recordsize=128k \
           -o compression=lz4 \
           -o atime=off \
           -o logbias=latency \
           rpool/srv/postgres-wal

# In postgresql.conf:
# full_page_writes = off   — ZFS provides this guarantee via CoW
#                            safe ONLY when postgres and ZFS are on the same host
#                            (not safe if using iSCSI or NFS to ZFS)

# Link the WAL to the separate dataset
# initdb -X /srv/postgres-wal/pg_wal /srv/postgres/data

# Verify settings
sudo -u postgres psql -c "SHOW full_page_writes;"
zfs get recordsize,compression,atime,logbias rpool/srv/postgres

MySQL / InnoDB

# InnoDB uses 16k pages by default
zfs create -o recordsize=16k \
           -o compression=lz4 \
           -o atime=off \
           -o logbias=throughput \
           rpool/srv/mysql

# Separate dataset for InnoDB redo log
zfs create -o recordsize=128k \
           -o compression=lz4 \
           -o atime=off \
           rpool/srv/mysql-log

# In my.cnf:
# innodb_doublewrite = OFF
#   The doublewrite buffer writes every page twice to prevent torn writes.
#   ZFS copy-on-write already prevents torn writes.
#   innodb_doublewrite=OFF on ZFS is safe and cuts write I/O by ~50%.

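# innodb_flush_method = O_DIRECT
#   ZFS does not use the Linux page cache for regular file I/O; it caches in ARC.
#   The double caching to watch is the InnoDB buffer pool on top of ARC.
#   OpenZFS 0.8+ accepts O_DIRECT; true cache-bypassing direct I/O arrived in 2.3.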

# innodb_use_native_aio = ON   (default — ZFS supports AIO)

MongoDB / WiredTiger

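# WiredTiger page sizes are configurable; this guide uses 64k as a starting
# point. Check your collection settings and benchmark before committing.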
zfs create -o recordsize=64k \
           -o compression=lz4 \
           -o atime=off \
           rpool/srv/mongodb

# In mongod.conf storage section:
# wiredTiger:
#   engineConfig:
#     cacheSizeGB: 4             # WiredTiger cache, keep below 50% RAM
#     journalCompressor: none    # ZFS compression handles this
#   collectionConfig:
#     blockCompressor: none      # let ZFS compress — avoid double compression

# MongoDB's journal provides crash consistency within WiredTiger.
# ZFS CoW provides the underlying block-level atomicity.
# Both are active and complement each other.

Redis

# Redis works in memory — its persistence (AOF/RDB) is sequential writes
zfs create -o recordsize=64k \
           -o compression=lz4 \
           -o atime=off \
           -o sync=disabled \
           rpool/srv/redis

# sync=disabled is a trade-off here, not a free win:
# 1. With sync=disabled, fsync() returns immediately, so appendfsync everysec
#    no longer bounds data loss to one second
# 2. On power failure you can lose up to the last TXG (about 5 seconds) of writes
# 3. The files stay consistent: ZFS commits each TXG atomically, and the AOF
#    rewrite and RDB snapshot are both write-to-temp-then-rename
# 4. Redis recovers from the AOF up to the last write that reached a committed TXG

# In redis.conf:
# appendonly yes
# appendfsync everysec    # still useful if you later switch back to sync=standard

SQLite

# SQLite uses variable page sizes (default 4k, tunable)
# Match recordsize to your SQLite page_size pragma
zfs create -o recordsize=4k \
           -o compression=lz4 \
           -o atime=off \
           rpool/srv/sqlite

# For large SQLite databases, increase both:
# PRAGMA page_size = 8192;   -- set before creating tables; changing it later requires VACUUM
# zfs set recordsize=8k rpool/srv/sqlite

Every database has its own page size. Match recordsize to the page size and you eliminate write amplification. Then disable the database's own crash protection features (full_page_writes, innodb_doublewrite) because ZFS copy-on-write already provides atomic writes. You get better performance AND better data integrity — ZFS checksum detects corruption that these database mechanisms cannot. The doublewrite buffer protects against torn writes. ZFS protects against torn writes AND detects silent corruption. On ZFS, the doublewrite buffer is redundant overhead. Turn it off.
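
The per-engine values in this section can be collected into one lookup so provisioning scripts do not hard-code them in several places. This is a sketch using the recordsize values from this guide; the function name recordsize_for is hypothetical, and the final line prints the command instead of running it.

```shell
# recordsize_for ENGINE
# Prints the recordsize this guide pairs with each engine's default page size.
recordsize_for() {
  case "$1" in
    postgres|postgresql) echo 8k ;;
    mysql|innodb)        echo 16k ;;
    mongodb|redis)       echo 64k ;;
    sqlite)              echo 4k ;;
    *) echo "unknown engine: $1" >&2; return 1 ;;
  esac
}

engine=postgres
echo "zfs create -o recordsize=$(recordsize_for "$engine") -o compression=lz4 -o atime=off rpool/srv/$engine"
# prints: zfs create -o recordsize=8k -o compression=lz4 -o atime=off rpool/srv/postgres
```

If a database was initialised with a non-default page size, override the lookup for that dataset instead of trusting the table.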

12. Channel Programs (Advanced Automation)

Channel programs are Lua scripts that execute inside the ZFS kernel module with full atomicity. No other ZFS operation can interleave with a running channel program. This makes them the right tool for operations that need to check state and act conditionally — race-condition-free.

What channel programs can do

Create and destroy snapshots atomically
Check properties and take conditional action in the same atomic context
Iterate over datasets and snapshots
Set and get properties
Perform multi-dataset operations that must succeed or fail as a unit

Basic usage

# Run a channel program
# zfs program <pool> <script> [args...]

# Simple: create a snapshot if the dataset exists
cat > /tmp/snap-if-exists.lua << 'EOF'
-- Arguments arrive as a table; positional args are in its "argv" field
local args = ...
local argv = args["argv"]
local dataset = argv[1]
local snapname = argv[2]

-- Check if the dataset exists
if not zfs.exists(dataset) then
  return "dataset does not exist: " .. dataset
end

-- Create the snapshot (zfs.sync.* returns 0 on success, an errno otherwise)
local err = zfs.sync.snapshot(dataset .. "@" .. snapname)
if err ~= 0 then
  return "snapshot failed, errno " .. err
end

return "snapshot created: " .. dataset .. "@" .. snapname
EOF

zfs program rpool /tmp/snap-if-exists.lua rpool/srv/postgres 2026-04-02T14:00

Atomic multi-dataset snapshot + property check

# Snapshot multiple datasets, but only if they are all under quota
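cat > /tmp/quota-aware-snapshot.lua << 'EOF'
local args = ...
local argv = args["argv"]
local datasets = { "rpool/srv/postgres", "rpool/srv/mysql", "rpool/home" }
local snapname = argv[1]

-- Check all datasets before snapshotting any
-- (the channel program Lua environment is integer-only and has no math
-- library, so the 95% test is written with integer arithmetic)
for _, ds in ipairs(datasets) do
  local used  = zfs.get_prop(ds, "used")
  local quota = zfs.get_prop(ds, "quota")
  -- quota of 0 means no quota set
  if quota > 0 and used * 100 > quota * 95 then
    return "ABORT: " .. ds .. " is over 95% of quota (" .. used .. " of " .. quota .. " bytes)"
  end
  -- Dry-run the snapshot so a failure is caught before anything is created
  local chk = zfs.check.snapshot(ds .. "@" .. snapname)
  if chk ~= 0 then
    return "ABORT: snapshot would fail on " .. ds .. ", errno " .. chk
  end
end

-- All checks passed: create the snapshots in the same atomic context
for _, ds in ipairs(datasets) do
  local err = zfs.sync.snapshot(ds .. "@" .. snapname)
  if err ~= 0 then
    return "snapshot failed on " .. ds .. ", errno " .. err
  end
end

return "all snapshots created: " .. snapname
EOF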

zfs program rpool /tmp/quota-aware-snapshot.lua "$(date +%Y-%m-%dT%H:%M)"

Conditional snapshot + destroy old snapshots atomically

# Create a new snapshot and keep only the newest N, atomically
cat > /tmp/rolling-snap.lua << 'EOF'
local args = ...
local argv = args["argv"]
local dataset = argv[1]
local keep    = tonumber(argv[2]) or 10
local now     = argv[3]

-- Create new snapshot
local newsnap = dataset .. "@" .. now
local err = zfs.sync.snapshot(newsnap)
if err ~= 0 then return "snapshot failed, errno " .. err end

-- Collect all snapshots for this dataset
local snaps = {}
for snap in zfs.list.snapshots(dataset) do
  table.insert(snaps, snap)
end

-- Sort by name (oldest first, assuming every snapshot name is an ISO 8601 timestamp)
table.sort(snaps)

-- Destroy oldest if over limit
local to_destroy = #snaps - keep
if to_destroy < 0 then to_destroy = 0 end
for i = 1, to_destroy do
  zfs.sync.destroy(snaps[i])
end

return "snapshot created, kept " .. keep .. ", destroyed " .. to_destroy
EOF

zfs program rpool /tmp/rolling-snap.lua rpool/srv/postgres 10 "$(date +%Y-%m-%dT%H:%M)"

Channel programs are ZFS's secret weapon for automation. They run inside the ZFS kernel module with full atomicity — no other operation can interleave. For complex snapshot policies that need to check state and act conditionally, channel programs eliminate race conditions. Without them, a shell script that checks a quota and then creates a snapshot has a window between the check and the create where another process can write data. With a channel program, the check and the create are a single atomic operation. No window. No race.
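
Before trusting a retention policy with zfs.sync.destroy, it helps to preview which snapshots it would select. The sketch below reproduces only the selection step of the rolling-snapshot program (sort by name, keep the newest N) in plain shell; it offers none of the atomicity of a channel program. The function name snaps_to_prune is hypothetical, and it assumes snapshot names sort chronologically.

```shell
# snaps_to_prune KEEP
# Reads snapshot names on stdin, sorts them, and prints every name except the
# newest KEEP. Assumes names sort chronologically (ISO 8601 timestamps do).
snaps_to_prune() {
  sort | awk -v keep="$1" '
    { name[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print name[i] }'
}

printf '%s\n' \
  'rpool/srv/postgres@2026-04-02T14:00' \
  'rpool/srv/postgres@2026-03-31T14:00' \
  'rpool/srv/postgres@2026-04-01T14:00' \
  | snaps_to_prune 2
# prints: rpool/srv/postgres@2026-03-31T14:00
```

On a live system the input would come from zfs list -H -t snapshot -o name for the dataset in question. Manually named snapshots will sort among the timestamps, so filter them out first if they should never be pruned.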

13. Troubleshooting

zpool status — pool health and scrub state

# Full pool status
zpool status rpool

# Watch for errors in the output:
# state: ONLINE   — healthy
# state: DEGRADED — one or more vdevs have failed, pool is still accessible
# state: FAULTED  — pool cannot be opened (too many failed vdevs or unresolvable error)
# state: OFFLINE  — vdev was taken offline manually

# Example degraded pool output
zpool status rpool
#   pool: rpool
#  state: DEGRADED
# status: One or more devices has been removed by the administrator.
# action: Online the device using 'zpool online' or replace the device with 'zpool replace'.
#   scan: scrub repaired 0B in 00:01:15 with 0 errors on Sun Mar 30 00:26:15 2026
#
# config:
#   NAME        STATE     READ WRITE CKSUM
#   rpool       DEGRADED     0     0     0
#     mirror-0  DEGRADED     0     0     0
#       sda     ONLINE       0     0     0
#       sdb     OFFLINE      0     0     0

# Replace a failed drive (in-place, maintains redundancy)
zpool replace rpool /dev/sdb /dev/sdc

# Check the resilver (rebuild) progress
zpool status rpool

# Online a device that was offlined
zpool online rpool /dev/sdb

# Check when the last scrub ran and how many errors it found
zpool status -v rpool

zpool events — kernel-level ZFS events

# Stream kernel ZFS events live (checksum errors, I/O errors, pool state changes)
zpool events -vf

# Dump all historical events
zpool events -v

# Key event classes:
# sysevent.fs.zfs.scrub_finish        scrub completed
# ereport.fs.zfs.checksum             checksum error detected
# ereport.fs.zfs.io                   I/O error
# sysevent.fs.zfs.resilver_finish     resilver (rebuild) completed
# sysevent.fs.zfs.vdev_remove         vdev removed from pool

arc_summary — ARC statistics

# Full ARC report
arc_summary

# Key numbers to watch:
# ARC Size: current RAM used by cache
# ARC Target Size: what ZFS is aiming for
# Cache Hit Ratio: should be 90%+ for a healthy read-heavy workload
# L2ARC Hits: if you have an L2ARC, this shows how much it is contributing
# Evicted: blocks pushed out of ARC — normal, but high eviction + low hits = undersized ARC

zdb — low-level pool inspection

# Inspect a pool's configuration (block device layout, vdev tree)
zdb -C rpool

# Show all metadata for a dataset (object types, property blocks)
zdb -d rpool/srv/postgres

# Read a specific object (advanced: useful when debugging data corruption)
zdb -ddddd rpool/srv/postgres 5

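# Traverse every block and verify checksums without starting a scrub (read-only)
# (-b alone only gathers block statistics; -cc adds checksum verification of all blocks)
zdb -cc rpool   # WARNING: this reads the entire pool and is slow on large pools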

Common issues and fixes

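ARC consuming too much RAM. By default the ARC is capped at a fraction of system RAM (historically about half on Linux) and shrinks under memory pressure, but not always fast enough for other services. On a shared server, cap it explicitly: echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max (8 GiB). Persist it with the line options zfs zfs_arc_max=8589934592 in /etc/modprobe.d/zfs.conf.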
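
zfs_arc_max takes a byte count, which is easy to mistype. A small sketch that derives the modprobe line from a size in GiB (arc_max_line is a hypothetical helper name):

```shell
# arc_max_line GIB
# Prints the /etc/modprobe.d/zfs.conf line for an ARC cap of GIB gibibytes.
arc_max_line() {
  echo "options zfs zfs_arc_max=$(( $1 * 1024 * 1024 * 1024 ))"
}

arc_max_line 8
# prints: options zfs zfs_arc_max=8589934592
```

Review the output, then append it to /etc/modprobe.d/zfs.conf as root. If the zfs module is loaded from the initramfs, the initramfs usually needs regenerating for the setting to apply at boot.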

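Slow scrubs impacting I/O. Reduce scrub concurrency: echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active. A lower value means fewer concurrent scrub reads per vdev and less impact on foreground I/O. You can also pause a running scrub with zpool scrub -p rpool and resume it later with zpool scrub rpool. Schedule scrubs during off-hours with a systemd timer.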

Pool import fails after unclean shutdown. ZFS replays the intent log (ZIL) on import; this is normal and fast. If import fails and reports a missing or unavailable device, a disk is physically absent or unreadable. List what ZFS can see with zpool import -d /dev, identify the missing vdev in that output, then reattach or replace it.

# Import a pool whose log (SLOG) device is missing
zpool import -m -f -d /dev rpool   # -m: allow import with a missing log device
                                   # -f: force (pool looks in use by another host)

# If the pool was exported on another host
zpool import -d /dev rpool

# List all importable pools (shows what ZFS can find on available devices)
zpool import -d /dev

Checksum errors. Checksum errors during scrub mean a block was corrupted on disk. If you have redundancy (mirror, RAIDZ), ZFS repairs it automatically and increments the repair counter. Zero errors after repair = fixed. Persistent errors on a specific vdev = that drive is failing.

# After a scrub that found and repaired errors
zpool status -v rpool
# Look for "repaired" in the scan line and per-vdev error counts
# A vdev with rising CKSUM errors is failing — replace it soon

# Clear error counters after replacing a drive
zpool clear rpool
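
Scanning the CKSUM column by eye gets tedious on pools with many devices. The sketch below filters zpool status output down to the devices with a non-zero checksum count. It assumes the five-column NAME/STATE/READ/WRITE/CKSUM layout shown earlier in this guide; the function name cksum_offenders is hypothetical, and the sample input is illustrative with a made-up count.

```shell
# cksum_offenders
# Reads `zpool status` output on stdin and prints "name count" for every
# vdev or device whose CKSUM column is not 0.
cksum_offenders() {
  awk 'NF >= 5 && $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ && $5 != "0" { print $1, $5 }'
}

# Illustrative input only: the layout follows the degraded-pool example in
# this guide, with an invented CKSUM count on sdb.
cksum_offenders <<'SAMPLE'
  NAME        STATE     READ WRITE CKSUM
  rpool       DEGRADED     0     0     0
    mirror-0  DEGRADED     0     0     0
      sda     ONLINE       0     0     0
      sdb     ONLINE       0     0    12
SAMPLE
# prints: sdb 12
```

In practice, feed it live output with zpool status rpool | cksum_offenders, for example from a cron job that alerts when the output is non-empty.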

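Pool will not import: wrong hostid. ZFS records the host ID of the system that last imported the pool. If the pool was not exported cleanly and the hostid then changes (new OS install, cloud instance clone), import is refused because the pool appears to be in use by another system. Fix: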

# Check current hostid
hostid

# Force import ignoring hostid mismatch
zpool import -f rpool

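# After import, persist the current hostid in /etc/hostid so it stays stable
# (zgenhostid refuses to overwrite an existing file unless given -f)
zgenhostid "$(hostid)"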

ZFS Zero to Hero — the prerequisite: pools, snapshots, basic properties
ZFS Wiki: Pool Design — hardware selection and vdev topology reference
ZFS Wiki: Snapshots & Replication — sanoid/syncoid for automated lifecycle management
ZFS Wiki: Encryption — full encryption reference including LUKS comparison
ZFS Wiki: Memory & ARC — deep ARC internals and tuning tables
ZFS Wiki: Tuning for Workloads — property cheat sheet by workload type
Databases on ZFS — full setup guides for PostgreSQL, MySQL, MongoDB on kldload
NAS Server recipe — concrete pool layout for a home or lab NAS
dRAID Storage recipe — 12+ disk dRAID array from scratch
