
Tuning for Workloads — defaults are for nobody.

ZFS ships with defaults designed to be safe and broadly acceptable. They are optimized for no workload in particular, which means they are optimal for no workload at all. A PostgreSQL server, a Plex library, a KVM hypervisor, and a backup target all have radically different I/O patterns. Tuning ZFS means matching the filesystem's behavior to the work your system actually does. This page teaches you how — from dataset properties you set once, to kernel module parameters that reshape how ZFS manages memory and I/O at a system level.

The single biggest performance win in ZFS is recordsize. Get that right for your workload and you've captured 60% of the tuning benefit. Everything else on this page is refinement. Don't skip the monitoring section — you should profile before you tune, not after.

The tuning philosophy

Don't tune what you don't understand. Every parameter on this page exists because ZFS is making a tradeoff — latency vs throughput, memory vs disk, safety vs speed. If you change a parameter without understanding the tradeoff, you may make things worse. The defaults are conservative but safe. Only change them when you have measured a problem and understand why the change helps.

The anti-pattern: copying a block of tunables from a blog post or Reddit thread and pasting them into /etc/modprobe.d/zfs.conf. Those tunables were written for someone else's hardware, someone else's workload, and possibly an older version of OpenZFS. They may have been wrong even for the author. Understand each parameter individually, measure before and after, and keep only what helps.

ZFS tuning happens at three levels, each with different scope and persistence:

Dataset-level
Properties set with zfs set or at zfs create. Apply to individual datasets or ZVOLs. Inherited by child datasets. Examples: recordsize, compression, atime, logbias, primarycache, xattr. Persistent by default.
Pool-level
Properties set with zpool set or at zpool create. Apply to the entire pool. Examples: ashift, autotrim. Persistent by default. Some (like ashift) are immutable after creation.
Module-level
Kernel module parameters set via /etc/modprobe.d/zfs.conf or /sys/module/zfs/parameters/. Apply to the entire system. Examples: zfs_txg_timeout, zfs_dirty_data_max, zfs_arc_max. Must be made persistent manually.

Monitor before you tune

Never tune blind. ZFS provides excellent introspection tools. Use them to identify your actual bottleneck before changing anything.

zpool iostat — the first tool you reach for

# Basic I/O stats, refreshing every 5 seconds
zpool iostat tank 5

# Per-vdev breakdown — shows which vdevs are bottlenecked
zpool iostat -v tank 5

# Queue depths — critical for diagnosing sync write latency
# Look at syncq_read/syncq_write columns
zpool iostat -q tank 5

# Latency histograms — shows distribution, not just averages
zpool iostat -l tank 5

# Everything at once
zpool iostat -vql tank 5

What to look for: consistently high syncq_write means sync writes are queuing (SLOG will help). High asyncq_write means the pool can't flush dirty data fast enough (more IOPS or larger zfs_dirty_data_max). High read latency with low cache hit rate means you need more ARC (RAM) or L2ARC.

arc_summary — ARC health at a glance

# Full ARC report — hit rates, size, evictions
arc_summary

# Key metrics to check:
#   ARC hit rate — should be > 90% for read-heavy workloads
#   ARC size vs max — if ARC is at max and hit rate is low, you need more RAM
#   Prefetch hit rate — if low, prefetch may be hurting more than helping
#   L2ARC hit rate — if you have L2ARC, this tells you if it's earning its keep

arcstat — ARC metrics over time

# ARC stats refreshing every 2 seconds
arcstat 2

# Watch specific fields
arcstat -f hits,miss,hit%,arcsz,c 2

/proc/spl/kstat/zfs — the raw data

# ARC stats (the source of truth)
cat /proc/spl/kstat/zfs/arcstats

# Per-dataset I/O stats (OpenZFS 0.8+; one objset-* file per dataset, under the pool's directory)
cat /proc/spl/kstat/zfs/<poolname>/objset-*

# TXG commit times — high values mean writes are batching too long
cat /proc/spl/kstat/zfs/<poolname>/txgs
I've seen people spend weeks tuning ZFS parameters when the actual problem was a bad disk (check zpool status for checksum errors), insufficient RAM (ARC was being evicted constantly), or the wrong VDEV topology (RAIDZ for a database). Profile first. The tools above take five minutes and save you from solving the wrong problem.
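
If you want a single number to track instead of reading the whole arcstats dump, a small helper like the one below can compute the hit rate. This is a sketch, not part of kldload: the function name arc_hit_rate is made up for illustration, and it assumes the usual three-column kstat layout (name, type, data). The counters are cumulative since module load, so the result is a long-run average, not a live reading.

```sh
#!/bin/sh
# arc_hit_rate [FILE]
# Print the overall ARC hit rate as a percentage with one decimal.
# FILE defaults to the live kstat; pass a saved copy to analyse offline.
# (Hypothetical helper name; assumes "name type data" columns.)
arc_hit_rate() {
    stats_file=${1:-/proc/spl/kstat/zfs/arcstats}
    awk '
        $1 == "hits"   { hits = $3 }
        $1 == "misses" { misses = $3 }
        END {
            total = hits + misses
            if (total > 0) printf "%.1f\n", 100 * hits / total
            else print "n/a"
        }
    ' "$stats_file"
}
```

Run arc_hit_rate with no arguments on a live system, or point it at a copy of arcstats collected from another host.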

recordsize — the most important tunable

recordsize is the maximum block size ZFS uses for a dataset. It defaults to 128K. When ZFS writes a file, it breaks it into records of up to this size. When ZFS reads, it reads entire records. If your application reads 8K but your recordsize is 128K, ZFS reads 128K from disk and throws away 120K. This is read amplification. If your application writes 8K, ZFS must read the existing 128K record, modify 8K within it, and write back 128K. This is write amplification.

The rule is simple: match recordsize to your application's I/O size. If you don't know what your application does, leave the default. The default is not bad — it's a reasonable middle ground. But it's not optimal for anything specific.
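
The amplification arithmetic above is easy to check for your own numbers. The helper below is a rough worst-case sketch (the name io_amplification is hypothetical): it ignores compression, ARC hits, and partially filled records, and simply divides the record size by the application's I/O size.

```sh
#!/bin/sh
# io_amplification RECORDSIZE_KIB IO_KIB
# Worst-case amplification factor when the application issues IO_KIB
# requests against records of RECORDSIZE_KIB. Integer KiB values only.
# (Hypothetical helper; ignores caching and compression.)
io_amplification() {
    recordsize_kib=$1
    io_kib=$2
    if [ "$io_kib" -ge "$recordsize_kib" ]; then
        # Requests at least as large as a record are not amplified.
        echo 1
    else
        echo $(( recordsize_kib / io_kib ))
    fi
}
```

io_amplification 128 8 prints 16, and io_amplification 16 8 prints 2, which is the gap between the default and a 16K dataset for an 8K-page database.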

| Workload | Application I/O size | Recommended recordsize | Why |
|---|---|---|---|
| PostgreSQL | 8K pages | 8K or 16K | Eliminates read/write amplification. 16K handles WAL + data mix better. |
| MySQL InnoDB | 16K pages | 16K | Exact match to InnoDB page size. No amplification. |
| MongoDB (WiredTiger) | 32K–64K | 32K or 64K | WiredTiger checkpoints in 32K–64K blocks. Match to your config. |
| KVM/QEMU (ZVOL) | Varies | volblocksize=16K | Guest OS issues mixed I/O. 16K balances small random + sequential. |
| Samba/NFS file shares | Mixed | 128K (default) | Mixed file sizes. Default works well. Consider 1M if files are large. |
| Video/media files | Sequential MB+ | 1M | Large sequential reads. Fewer, larger blocks = less metadata, more throughput. |
| Backup targets (restic, borg) | 1M–4M chunks | 1M | Backup tools write large sequential chunks. Match to chunk size. |
| Build server (/tmp, ccache) | Mixed small | 128K (default) | Lots of small files created and deleted. Default handles this fine. |
| Container images (Docker) | Mixed small | 128K (default) | Many small files in layers. Default is fine. Special vdev helps more. |
| General / unknown | Unknown | 128K (default) | Leave the default. It's a safe middle ground. |

# Set recordsize at dataset creation (preferred — cannot be retroactive for existing data)
zfs create -o recordsize=16K rpool/srv/postgres/data

# Change recordsize on existing dataset (only affects NEW writes)
zfs set recordsize=1M rpool/srv/media

# Check current recordsize
zfs get recordsize rpool/srv/postgres/data

Critical detail: changing recordsize on an existing dataset only affects data written after the change. Existing blocks keep their original size. To fully convert, you must rewrite the data (copy out and back, or zfs send | zfs receive into a new dataset with the correct recordsize).

The most common mistake I see: someone runs PostgreSQL on a dataset with recordsize=128K and wonders why their database is slow. Every 8K page read pulls 128K from disk. Every 8K page write triggers a 128K read-modify-write. That's 16x amplification. Setting recordsize=16K on the pgdata dataset is a 10-minute change that can double throughput once the existing data has been rewritten. Do it.

atime and relatime — stop writing on reads

atime (access time) tells ZFS to update a file's access timestamp every time it is read. This means every cat, every grep, every backup scan generates a write to update metadata. On a busy system, this creates thousands of unnecessary writes per second. Almost no application needs access-time tracking.

atime=off
Never update access times. Best performance. Use this for databases, VMs, media, backups — anything where you don't need to know when a file was last read.
relatime=on
Only applies while atime=on. Update access time only if it is older than the modify time, or older than 24 hours. Same behavior as the Linux relatime mount option. This is what kldload sets by default. Good balance of compatibility and performance.
atime=on (default)
Update access time on every read. Only needed for very specific compliance requirements (some mail systems check atime). Performance penalty on read-heavy workloads.
# kldload default: relatime=on (set at pool creation, inherited by all datasets)
# To disable atime entirely on a specific dataset:
zfs set atime=off rpool/srv/postgres
zfs set atime=off rpool/srv/media

# Check current setting
zfs get atime,relatime rpool/srv/postgres
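
To make the relatime rule concrete, here is a minimal model of the decision as described above. It is a sketch only: the function name relatime_updates is hypothetical, timestamps are plain epoch seconds, and the real kernel logic also looks at the change time.

```sh
#!/bin/sh
# relatime_updates ATIME MTIME NOW   (all in epoch seconds)
# Prints "yes" if a read at NOW would update the access time under the
# relatime rule: atime not newer than mtime, or atime at least 24h old.
# (Illustrative model; hypothetical helper name.)
relatime_updates() {
    atime=$1
    mtime=$2
    now=$3
    if [ "$atime" -le "$mtime" ] || [ $(( now - atime )) -ge 86400 ]; then
        echo yes
    else
        echo no
    fi
}
```

A file read twice in the same hour gets at most one atime write under this rule, which is why relatime removes most of the write traffic without breaking tools that compare atime to mtime.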

sync — the safety vs speed tradeoff

The sync property controls how ZFS handles synchronous write requests. When an application calls fsync() or opens a file with O_SYNC, it is asking the filesystem to guarantee the data is on stable storage before returning. Databases do this on every commit. NFS does this by default.

sync=standard
Default. Honor sync requests. When an app calls fsync(), ZFS writes to the ZIL (or SLOG if present) and waits for confirmation before returning. Safe. Can be slow without an SLOG on spinning rust.
sync=disabled
Dangerous. Lie to applications. ZFS tells the app the write is on disk when it's only in RAM. If the system crashes or loses power, committed transactions may be lost. The application (e.g., PostgreSQL) believes data is safe when it isn't. Never use this for databases in production. Acceptable for build caches, temp data, or any dataset where data loss on crash is tolerable.
sync=always
Force all writes through the ZIL. Even asynchronous writes become synchronous. Maximum safety, significant performance cost. Used for NFS exports that must guarantee write ordering, or paranoid replication targets.
# Disable sync for a build cache (data loss on crash is acceptable)
zfs set sync=disabled rpool/var/cache/builds

# Force sync for an NFS export
zfs set sync=always rpool/srv/nfs-critical

# Check current setting
zfs get sync rpool/srv/postgres
I use sync=disabled on exactly two things: build caches and throwaway temp datasets. Everything else gets sync=standard (the default). If sync writes are slow, the answer is an SLOG, not sync=disabled. Lying to your database about write durability is how you get corrupted data after a power outage. Not worth the risk.

primarycache and secondarycache

primarycache controls what the ARC (in-memory cache) stores. secondarycache controls what the L2ARC (SSD cache) stores. Both accept three values: all (default), metadata, or none.

primarycache=all
Default. Cache both data blocks and metadata in ARC. Correct for most workloads.
primarycache=metadata
Cache only metadata (block pointers, directory entries). Use when the application has its own cache (PostgreSQL shared_buffers, MySQL innodb_buffer_pool) and double-caching in ARC wastes RAM.
primarycache=none
Cache nothing. Almost never correct. Only for benchmarking or very specialized cases where you want to bypass ARC entirely.
# PostgreSQL: the database manages its own data cache (shared_buffers)
# Let ARC cache metadata only — avoids double-caching data pages
zfs set primarycache=metadata rpool/srv/postgres/data

# Media library: data caching helps — large reads benefit from ARC
zfs set primarycache=all rpool/srv/media

# L2ARC: only cache metadata for database ZVOLs
zfs set secondarycache=metadata rpool/srv/vms/db-server

When to use primarycache=metadata: Only when the application manages its own buffer cache and you're seeing ARC filled with data that's already cached at the application level. Check with arc_summary — if ARC hit rate is high and the application's cache hit rate is also high, you're double-caching. Switching to metadata frees that ARC space for other datasets that benefit from it.

logbias — latency vs throughput

logbias controls how ZFS handles synchronous writes to the ZIL.

logbias=latency
Default. Sync writes are logged in the ZIL (on the SLOG if present) and written to the main pool later with the next TXG. Optimized for low-latency sync writes. Best when you have an SLOG or when your pool is on SSDs.
logbias=throughput
Sync data is written straight to its final location in the main pool, and the ZIL records only a pointer to it. A configured SLOG is not used. Reduces double-write overhead at the cost of higher commit latency. Best for databases on all-SSD pools without an SLOG, where the pool itself is fast enough and the ZIL double-write is the bottleneck.
# Database on all-SSD pool, no SLOG: skip the ZIL double-write
zfs set logbias=throughput rpool/srv/postgres/data

# NFS export with SLOG: keep latency mode for fast sync acks
zfs set logbias=latency rpool/srv/nfs
logbias=throughput is one of those tunables that gets cargo-culted. People set it because a blog told them to. It only helps in a specific scenario: high sync write volume on an all-SSD pool without an SLOG. If you have an SLOG, logbias=latency (the default) is almost always better because the SLOG handles the ZIL writes at NVMe speed. Measure with zpool iostat -q before changing this.

redundant_metadata

ZFS stores extra copies of metadata blocks for safety. The redundant_metadata property controls how many extra copies:

redundant_metadata=all
Default. Store extra copies of all metadata. Maximum safety. Small space overhead. Leave this for anything you care about.
redundant_metadata=most
Store extra copies of most metadata, but not the lowest-level indirect blocks. Minor space savings. Only consider for very large pools where metadata overhead is significant.
redundant_metadata=none
Dangerous. No extra metadata copies. A single bit flip in a metadata block can lose data. Never use on a pool without mirrors or RAIDZ.

Recommendation: leave the default (all). The space overhead is tiny compared to the protection it provides. The only time to consider most is on very large backup pools where every byte of metadata overhead matters and the data is expendable.

dnodesize — auto vs legacy

Dnodes are ZFS's equivalent of inodes — they store file metadata, extended attributes, and system attributes. The dnodesize property controls their size:

dnodesize=legacy
512 bytes. Compatible with all ZFS implementations. Use for pools that must also be imported by implementations without large-dnode support (FreeBSD before 12, very old OpenZFS).
dnodesize=auto
Recommended. ZFS allocates larger dnodes (up to 16K) when needed. Stores extended attributes and system attributes directly in the dnode instead of spilling to a separate block. Faster metadata operations, especially with xattr=sa. This is what kldload sets by default.
# kldload sets this at pool creation:
# zpool create ... -O dnodesize=auto ...

# Check current setting
zfs get dnodesize rpool

When to use legacy: Only when you need to import the pool on a system that doesn't support large dnodes (FreeBSD pre-12, very old OpenZFS). For Linux-only pools, always use auto.

xattr — sa vs dir

Extended attributes (SELinux labels, POSIX ACLs, Samba metadata) are stored either as separate hidden files in a directory (xattr=dir) or directly in the dnode's system attribute area (xattr=sa).

xattr=sa
Recommended. Store xattrs in the system attribute (SA) area of the dnode. One I/O to read the file's metadata also fetches its xattrs. 3–7x faster for workloads that touch xattrs (SELinux, Samba, POSIX ACLs). Requires dnodesize=auto for best results. This is what kldload sets by default.
xattr=dir
Store xattrs as separate hidden files. Compatible with all ZFS versions. Each xattr access requires a separate I/O. Slow on metadata-heavy workloads (Samba with many ACLs, containers with SELinux).
# kldload sets this at pool creation:
# zpool create ... -O xattr=sa ...

# If you inherited xattr=dir and want to switch:
zfs set xattr=sa rpool/srv/samba

# Verify
zfs get xattr rpool/srv/samba

special_small_blocks

If your pool has a special vdev (mirrored SSDs that store metadata and small blocks), the special_small_blocks property sets the block-size threshold. Data blocks no larger than this value are stored on the special vdev instead of the main pool. Metadata always goes to the special vdev regardless of this setting. Because the threshold applies to blocks, not files, a value at or above a dataset's recordsize sends all of that dataset's data to the special vdev.

# Store blocks <= 64K on the special vdev (good default)
zfs set special_small_blocks=65536 tank

# For database workloads with 8K pages — store pages on SSD
zfs set special_small_blocks=16384 tank/srv/postgres

# For container storage — many tiny files
zfs set special_small_blocks=131072 tank/srv/containers

# Check current setting
zfs get special_small_blocks tank

Sizing the special vdev: Monitor how much data lands on it with zpool list -v tank. If the special vdev fills up, new small blocks overflow to the main pool (no data loss, just slower). Size it at 3–5% of your total pool capacity as a starting point. For metadata-heavy workloads (millions of small files), size it larger.
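
The 3–5% starting point is simple to turn into numbers for a given pool. The helper below is a sketch with a hypothetical name (special_vdev_range); it takes the pool capacity in GiB and prints the low and high end of that range, rounded down.

```sh
#!/bin/sh
# special_vdev_range POOL_GIB
# Print "LOW HIGH" in GiB: 3% and 5% of the pool capacity, rounded down.
# (Hypothetical helper; the percentages are the starting point given above.)
special_vdev_range() {
    pool_gib=$1
    echo "$(( pool_gib * 3 / 100 )) $(( pool_gib * 5 / 100 ))"
}
```

For a 40 TiB pool, special_vdev_range 40960 prints 1228 2048, so a mirrored pair of 2 TB SSDs would be a reasonable first guess before you measure actual usage.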

Kernel module parameters

These parameters control ZFS behavior at the system level. They apply to all pools on the system. Set them at runtime via /sys/module/zfs/parameters/ for testing, then make them persistent in /etc/modprobe.d/zfs.conf.

zfs_txg_timeout — transaction group commit interval

ZFS batches writes into transaction groups (TXGs) and commits them to disk periodically. The zfs_txg_timeout parameter sets the maximum seconds between commits. Default: 5 seconds.

# Check current value
cat /sys/module/zfs/parameters/zfs_txg_timeout
# Default: 5

# Lower for databases — commit more frequently, smoother latency
echo 3 > /sys/module/zfs/parameters/zfs_txg_timeout

# Higher for bulk ingestion — batch more, higher throughput
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout

Lower values (3–5) reduce worst-case latency spikes because each TXG is smaller. Good for database servers. Higher values (10–30) let ZFS batch more writes together, improving throughput for bulk workloads (backup ingestion, large file copies). The tradeoff: higher values mean more data in flight, more RAM pressure, and longer pauses during the commit.

zfs_dirty_data_max — write throttle threshold

Controls how much dirty (uncommitted) data ZFS allows in memory before throttling new writes. Default: 10% of RAM (capped at 4GB). When dirty data reaches this limit, ZFS slows down incoming writes to let the pool flush.

# Check current value (in bytes)
cat /sys/module/zfs/parameters/zfs_dirty_data_max
# Default on a 64GB system: 4294967296 (10% of RAM would be ~6.4GB, so the 4GB cap applies)

# Increase for systems with fast NVMe pools that can flush quickly
# 8GB on a 128GB system — lets ZFS buffer more before throttling
echo 8589934592 > /sys/module/zfs/parameters/zfs_dirty_data_max

# Decrease on systems with slow HDDs to avoid long commit stalls
echo 2147483648 > /sys/module/zfs/parameters/zfs_dirty_data_max

Too high: ZFS buffers massive amounts of dirty data, then hits a wall when the TXG commits. The resulting flush causes a long pause visible as latency spikes. Too low: ZFS throttles writes prematurely, never utilizing the full disk bandwidth. Tune based on your pool's write throughput — if your disks can flush 500 MB/s, you want enough dirty data to keep them busy for the TXG timeout interval.
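
Both rules of thumb in this section reduce to arithmetic. The sketch below uses two hypothetical helpers: dirty_data_default reproduces the "10% of RAM, capped at 4GB" default, and dirty_data_target multiplies pool write throughput by the TXG timeout. Throughput is assumed to be in MiB/s, and the result is a starting point to measure against, not a recommendation.

```sh
#!/bin/sh
# dirty_data_default RAM_BYTES
# The default described above: 10% of RAM, capped at 4 GiB.
dirty_data_default() {
    ram_bytes=$1
    cap=$(( 4 * 1024 * 1024 * 1024 ))
    tenth=$(( ram_bytes / 10 ))
    if [ "$tenth" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$tenth"
    fi
}

# dirty_data_target THROUGHPUT_MIB_S TXG_TIMEOUT_S
# Bytes of dirty data needed to keep the pool busy for one TXG interval.
dirty_data_target() {
    echo $(( $1 * $2 * 1024 * 1024 ))
}
```

For a pool that flushes 500 MiB/s with the default 5-second timeout, dirty_data_target 500 5 prints 2621440000, comfortably under the 4GB default cap.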

zfs_prefetch_disable — when prefetch hurts

ZFS tries to detect sequential access patterns and prefetch upcoming blocks into ARC. This works well for sequential workloads (media streaming, backups) but can waste ARC space and bandwidth on purely random workloads (databases with random reads).

# Check current value
cat /sys/module/zfs/parameters/zfs_prefetch_disable
# Default: 0 (prefetch enabled)

# Disable prefetch for database servers with purely random I/O
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable

Check before disabling: run arc_summary and look at the prefetch hit rate. If the prefetch hit rate is above 50%, prefetch is working and you should leave it on. If it's below 20%, prefetch is wasting ARC space with data that never gets read, and disabling it will free ARC for useful data.
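
If you prefer to script that check, the thresholds above can be applied to the raw counters. This is a sketch under assumptions: prefetch_advice is a hypothetical name, it uses only the prefetch_data_hits and prefetch_data_misses counters from arcstats, and arc_summary may compute its own figure differently.

```sh
#!/bin/sh
# prefetch_advice [FILE]
# Apply the 50% / 20% thresholds to the data-prefetch counters in arcstats.
# FILE defaults to the live kstat. (Hypothetical helper name.)
prefetch_advice() {
    stats_file=${1:-/proc/spl/kstat/zfs/arcstats}
    awk '
        $1 == "prefetch_data_hits"   { hits = $3 }
        $1 == "prefetch_data_misses" { misses = $3 }
        END {
            total = hits + misses
            if (total == 0) { print "no prefetch activity recorded"; exit }
            rate = 100 * hits / total
            if (rate > 50)      verdict = "keep prefetch enabled"
            else if (rate < 20) verdict = "consider disabling prefetch"
            else                verdict = "inconclusive, keep measuring"
            printf "%.0f%% %s\n", rate, verdict
        }
    ' "$stats_file"
}
```

The counters are cumulative, so take the reading after the workload has been running for a representative period.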

VDEV queue tuning — zfs_vdev_*_max_active

ZFS has its own I/O scheduler that queues requests to each vdev. The zfs_vdev_*_max_active parameters control how many concurrent I/O operations of each type ZFS sends to each device. The defaults are conservative, tuned for spinning disks.

# Defaults (tuned for HDDs — low queue depths)
cat /sys/module/zfs/parameters/zfs_vdev_async_read_max_active    # 3
cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active   # 10
cat /sys/module/zfs/parameters/zfs_vdev_sync_read_max_active     # 10
cat /sys/module/zfs/parameters/zfs_vdev_sync_write_max_active    # 10

# For all-SSD or NVMe pools — SSDs thrive on deep queues
echo 32 > /sys/module/zfs/parameters/zfs_vdev_async_read_max_active
echo 32 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
echo 32 > /sys/module/zfs/parameters/zfs_vdev_sync_read_max_active
echo 32 > /sys/module/zfs/parameters/zfs_vdev_sync_write_max_active

# For mixed HDD+SSD (SSD cache/SLOG, HDD main pool) — leave defaults
# The HDD vdevs would choke on deep queues

For NVMe: increase these aggressively (32–64). NVMe drives have internal parallelism measured in thousands of queues. The default of 3 async reads starves the drive. For HDDs: leave the defaults. Sending too many concurrent requests to a spinning disk causes excessive seeking and actually reduces throughput.
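
Before changing any of these, record what the system is running with now. The loop below is a small sketch (show_queue_depths is a hypothetical name) that prints every zfs_vdev_*_max_active parameter it finds. On a real system the glob also matches the scrub, trim, and other queue classes, which is useful context.

```sh
#!/bin/sh
# show_queue_depths [DIR]
# Print name=value for every zfs_vdev_*_max_active parameter in DIR.
# DIR defaults to the live module parameter directory.
# (Hypothetical helper name.)
show_queue_depths() {
    param_dir=${1:-/sys/module/zfs/parameters}
    for f in "$param_dir"/zfs_vdev_*_max_active; do
        # Skip the unexpanded pattern when nothing matches.
        [ -r "$f" ] || continue
        printf '%s=%s\n' "$(basename "$f")" "$(cat "$f")"
    done
}
```

Save the output alongside your benchmark results so you can compare before and after, and so you know what to revert to.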

I/O scheduler interaction — use none/noop

Linux has its own I/O schedulers (mq-deadline, bfq, kyber, none). ZFS has its own I/O scheduler internally. Running two schedulers is redundant and harmful — the Linux scheduler reorders I/O that ZFS has already carefully ordered, adding latency for no benefit.

# Check current scheduler for a disk
cat /sys/block/sda/queue/scheduler
# Output: [mq-deadline] kyber bfq none

# Set to none (bypass Linux scheduler — let ZFS handle it)
echo none > /sys/block/sda/queue/scheduler

# Make persistent via udev rule:
# /etc/udev/rules.d/60-zfs-scheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]*|nvme*", ATTR{queue/scheduler}="none"

Always set the Linux I/O scheduler to none on disks used by ZFS. This is one of the few "always do this" rules in ZFS tuning. The performance difference is measurable, especially on HDDs where the Linux scheduler's reordering fights with ZFS's own ordering.
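
The scheduler file marks the active choice with square brackets, which makes it easy to audit every disk in one pass. The helper below is a sketch with a hypothetical name (active_scheduler); it only extracts the bracketed entry from a line in that format.

```sh
#!/bin/sh
# active_scheduler LINE
# Extract the active scheduler (the bracketed entry) from the contents of
# a /sys/block/<dev>/queue/scheduler file. (Hypothetical helper name.)
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Example audit loop (commented out so the snippet has no side effects):
# for f in /sys/block/*/queue/scheduler; do
#     printf '%s: %s\n' "$f" "$(active_scheduler "$(cat "$f")")"
# done
```

Any pool member that reports something other than none is a candidate for the udev rule above.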

Making module parameters persistent

Runtime changes via /sys/module/zfs/parameters/ are lost on reboot. To make them permanent, write them to /etc/modprobe.d/zfs.conf:

# /etc/modprobe.d/zfs.conf — persistent ZFS module parameters
#
# TXG timeout — commit every 3 seconds for smoother database latency
options zfs zfs_txg_timeout=3

# Dirty data max — 4GB, tuned for our NVMe pool
options zfs zfs_dirty_data_max=4294967296

# VDEV queue depths — tuned for all-NVMe pool
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_vdev_async_write_max_active=32
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_sync_write_max_active=32

# After editing zfs.conf, rebuild initramfs so early-boot ZFS uses the new values.
# The commands below are run in a shell; they are not part of zfs.conf.
# CentOS / RHEL / Rocky / Fedora:
dracut -f

# Debian / Ubuntu:
update-initramfs -u

# Arch:
mkinitcpio -P
I've lost count of how many times someone set a ZFS tunable at runtime, confirmed it fixed their issue, and then forgot to add it to /etc/modprobe.d/zfs.conf. Three months later the server reboots and the problem comes back. Always make it persistent. Always rebuild the initramfs if you boot from ZFS. kldload uses ZFSBootMenu, so the initramfs matters.
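
One way to avoid that trap is to generate the persistent lines from the values that are live right now. The function below is a sketch (emit_modprobe_options is a hypothetical name): it reads the named parameters from a directory and prints them in modprobe.d syntax, so you can review the output before appending it to zfs.conf.

```sh
#!/bin/sh
# emit_modprobe_options DIR NAME...
# Print an "options zfs NAME=VALUE" line for each named parameter found
# in DIR (normally /sys/module/zfs/parameters). Missing names are
# reported on stderr. (Hypothetical helper name.)
emit_modprobe_options() {
    param_dir=$1
    shift
    for name in "$@"; do
        if [ -r "$param_dir/$name" ]; then
            printf 'options zfs %s=%s\n' "$name" "$(cat "$param_dir/$name")"
        else
            printf '# %s: not found in %s\n' "$name" "$param_dir" >&2
        fi
    done
}
```

For example, emit_modprobe_options /sys/module/zfs/parameters zfs_txg_timeout prints the line for the current timeout. List only the parameters you changed on purpose; dumping every parameter would freeze defaults that may change in later OpenZFS releases.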

kldload defaults and why they were chosen

kldload sets the following at pool creation. These are inherited by all datasets unless overridden. Each choice was deliberate:

| Property | kldload default | ZFS default | Why kldload changes it |
|---|---|---|---|
| ashift | 12 | auto-detect | Auto-detect lies. Many 4K-sector drives report 512. Wrong ashift is permanent and halves throughput. |
| compression | lz4 | off | LZ4 is nearly free in CPU and saves 30–50% disk. No reason to leave compression off. |
| relatime | on | off | Prevents unnecessary writes on reads. Compatible with applications that check atime (mail, AIDE). |
| xattr | sa | dir | 3–7x faster for SELinux, Samba ACLs, and any xattr-heavy workload. No downside on Linux. |
| dnodesize | auto | legacy | Larger dnodes store xattrs inline (works with xattr=sa). Faster metadata. Only breaks FreeBSD pre-12. |
| acltype | posixacl | off | Required for POSIX ACLs (Samba, NFS, containers). No performance cost. |
| normalization | formD | none | Unicode normalization prevents filenames that look identical from being treated as different files. |
| autotrim | on | off | Issues TRIM/DISCARD to SSDs automatically. Maintains SSD performance and longevity over time. |

These defaults are set in storage-zfs.sh at the zpool create call. They are deliberately conservative — no module-level tunables, no aggressive caching changes. The goal is a pool that performs well for any workload out of the box. You tune further based on your specific workload using the guidance on this page.

Workload-specific tuning profiles

Below are complete, copy-paste recipes for common workloads. Each includes dataset properties, module parameters where relevant, and explanations. Start with the profile closest to your workload and adjust.

Database server (PostgreSQL)

The PostgreSQL recipe

# Dataset for PostgreSQL data directory
zfs create -o recordsize=16K \
           -o logbias=throughput \
           -o primarycache=metadata \
           -o compression=lz4 \
           -o atime=off \
           -o redundant_metadata=all \
           rpool/srv/postgres

# Separate dataset for WAL (write-ahead log) — different I/O pattern
zfs create -o recordsize=64K \
           -o logbias=latency \
           -o primarycache=metadata \
           -o compression=lz4 \
           -o atime=off \
           rpool/srv/postgres/wal

Why 16K for data? PostgreSQL uses 8K pages. 16K gives a 2:1 ratio that handles both single-page reads and sequential scans well. 8K is theoretically perfect but 16K better accommodates TOAST tables and index pages that span multiple 8K blocks.

Why separate WAL? WAL writes are sequential and latency-sensitive. Data writes are random. Separating them prevents WAL flushes from competing with data I/O. If you have an SLOG, it handles the ZIL for both, but the dataset separation still helps ARC and cache management.

Why primarycache=metadata? PostgreSQL has its own data cache (shared_buffers). Double-caching in ARC wastes RAM. Let ARC cache the metadata (block pointers, directory entries) and let PostgreSQL cache the data pages.

Module-level tuning for database servers:

# /etc/modprobe.d/zfs.conf additions for PostgreSQL
options zfs zfs_txg_timeout=3
# Smoother commit latency — smaller, more frequent TXG flushes
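
SLOG strongly recommended. Without one, every PostgreSQL COMMIT waits for the ZIL to flush to the data pool. Keep in mind that logbias=throughput bypasses the SLOG, so if you add one, switch the data dataset to logbias=latency.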

VM host (KVM / libvirt / Proxmox)

The VM storage recipe

# ZVOLs for VM disks — raw block devices, no POSIX overhead
zfs create -V 40G -s \
           -o volblocksize=16K \
           -o compression=lz4 \
           -o primarycache=all \
           rpool/srv/vms/webserver-01

# Dataset for ISO images — large sequential reads
zfs create -o recordsize=1M \
           -o compression=off \
           -o atime=off \
           rpool/srv/vms/isos

# Module-level tuning for VM hosts
# /etc/modprobe.d/zfs.conf
# Increase ZVOL threads for concurrent VM I/O
options zfs zvol_threads=32

Why ZVOLs, not files? ZVOLs expose a block device to QEMU. The guest OS issues block I/O directly without the overhead of a POSIX filesystem layer. Less CPU, less latency, better IOPS.

Why volblocksize=16K? The guest OS generates mixed I/O — 4K filesystem metadata, 8K database pages, sequential reads. 16K is the best compromise. Larger values (64K, 128K) cause write amplification for small random writes. Smaller values (4K, 8K) increase metadata overhead and fragment the pool.

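Why mirrors, not RAIDZ? VMs generate random I/O. A RAIDZ vdev spreads every block across all of its disks, so it delivers roughly the random IOPS of a single disk, and small volblocksize values waste space on parity and padding. Mirrors scale linearly under random I/O because each mirror pair serves independent requests.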

Never use RAIDZ for VM storage. This is the most common ZFS performance mistake in virtualization.

File server (Samba / NFS)

The file server recipe

# General file shares — mixed workload
zfs create -o recordsize=128K \
           -o compression=lz4 \
           -o atime=off \
           -o xattr=sa \
           -o acltype=posixacl \
           rpool/srv/shares

# Large file shares (engineering, media) — sequential I/O
zfs create -o recordsize=1M \
           -o compression=zstd \
           -o atime=off \
           rpool/srv/shares/media

# Home directories — mixed small files
zfs create -o recordsize=128K \
           -o compression=lz4 \
           -o atime=off \
           rpool/srv/shares/homes

NFS and sync writes: NFS v3 with sync exports (the default) generates heavy sync write traffic. Without an SLOG, every NFS write waits for the ZIL to commit to the data pool. An SLOG drops NFS write latency by 10–100x on spinning rust.

Samba and xattrs: xattr=sa is critical for Samba. Windows clients set ACLs, alternate data streams, and DOS attributes as extended attributes. With xattr=dir, each attribute access is a separate I/O. With xattr=sa, they're inline in the dnode — a single I/O reads the file and its attributes together.

For NFS exports: add an SLOG. For Samba: xattr=sa + dnodesize=auto + acltype=posixacl. kldload sets all three by default.

Backup target (restic, borg, zfs send)

The backup recipe

# Backup dataset — large sequential writes, maximize compression
zfs create -o recordsize=1M \
           -o compression=zstd-3 \
           -o atime=off \
           -o sync=standard \
           -o redundant_metadata=all \
           rpool/srv/backups

# For zfs receive targets — match the source recordsize
# or use 1M if receiving mixed sources
zfs create -o recordsize=1M \
           -o compression=lz4 \
           rpool/srv/backups/remote

Why 1M recordsize? Backup tools write large sequential chunks. restic uses 1–8 MB packs. borg uses 1–2 MB chunks. zfs send streams are sequential by nature. Larger records mean fewer metadata blocks and higher sequential throughput.

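Why zstd-3? Backup data compresses well when the source was not already compressed. On such data, zstd level 3 gives 2–4x compression at moderate CPU cost. Repositories that the backup tool has already compressed or encrypted (restic encrypts everything it stores) gain little from ZFS compression. For very slow CPUs, use lz4. For backup targets where CPU is plentiful and space is tight, zstd-9 or higher can be worth it.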

Module tuning: increase zfs_txg_timeout to 10–15 for bulk ingestion, then revert it after the backup window.

Build server (CI, compilation, containers)

The build server recipe

# Build workspace — short-lived data, speed over safety
zfs create -o recordsize=128K \
           -o compression=lz4 \
           -o atime=off \
           -o sync=disabled \
           rpool/srv/builds

# ccache directory — lots of small reads, some writes
zfs create -o recordsize=128K \
           -o compression=lz4 \
           -o atime=off \
           rpool/srv/ccache

# Container storage (Docker/Podman ZFS driver)
zfs create -o recordsize=128K \
           -o compression=lz4 \
           -o atime=off \
           rpool/srv/containers

Why sync=disabled for builds? Build artifacts are ephemeral. If the system crashes mid-build, you re-run the build. The 2–5x write performance improvement is worth the risk of losing in-flight build data. Never use sync=disabled for the ccache or anything you don't want to regenerate.

Container special vdev: Container images are layers of many small files. A mirrored SSD special vdev with special_small_blocks=131072 dramatically accelerates image pulls, container starts, and layer deduplication lookups.

Build servers benefit more from a special vdev than from recordsize tuning. The bottleneck is metadata, not data.

Comprehensive recordsize reference

| Workload | Storage Type | recordsize / volblocksize | Compression | SLOG? | primarycache | VDEV Type |
|---|---|---|---|---|---|---|
| PostgreSQL (data) | Dataset | 16K | lz4 | Yes | metadata | Mirrors |
| PostgreSQL (WAL) | Dataset | 64K | lz4 | Yes | metadata | Mirrors |
| MySQL InnoDB | Dataset | 16K | lz4 | Yes | metadata | Mirrors |
| MongoDB | Dataset | 64K | lz4 | Recommended | all | Mirrors |
| VM disks (KVM) | ZVOL | volblocksize=16K | lz4 | Recommended | all | Mirrors |
| NFS shares | Dataset | 128K | lz4 | Yes | all | Mirrors or RAIDZ2 |
| Samba shares | Dataset | 128K | lz4 | Recommended | all | Mirrors or RAIDZ2 |
| Media files | Dataset | 1M | zstd | No | all | RAIDZ2 |
| Backups / archives | Dataset | 1M | zstd-3 | No | all | RAIDZ2/3 |
| Build artifacts | Dataset | 128K | lz4 | No | all | Any |
| Containers (Docker) | Dataset | 128K | lz4 | No | all | Mirrors + Special |
| General / unknown | Dataset | 128K | lz4 | No | all | Any |

Common anti-patterns

Copying tunables from the internet

The #1 anti-pattern. Someone's "ultimate ZFS tuning guide" was written for their hardware, their workload, and possibly an older OpenZFS version. Parameters that helped them may hurt you. Always understand what each parameter does and measure the impact on your system.

Setting recordsize=8K everywhere

Databases need small recordsize. Your media library does not. Using 8K recordsize on large files multiplies metadata overhead 16x (compared to 128K) and destroys sequential throughput. Set recordsize per dataset, not per pool.
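
The 16x figure is plain block-count arithmetic. The sketch below counts how many records a single large file is split into at each recordsize; it ignores compression and indirect-block layout, and the helper name blocks_for and the 10 GiB file size are illustrative assumptions.

```shell
#!/bin/sh
# Number of records ZFS must track for a file of a given size
# at a given recordsize (ceiling division).
blocks_for() {
    file_bytes=$1
    record_bytes=$2
    echo $(( (file_bytes + record_bytes - 1) / record_bytes ))
}

file=$(( 10 * 1024 * 1024 * 1024 ))            # a 10 GiB media file
small=$(blocks_for "$file" $(( 8 * 1024 )))    # recordsize=8K
large=$(blocks_for "$file" $(( 128 * 1024 )))  # recordsize=128K

echo "8K:    $small blocks"
echo "128K:  $large blocks"
echo "ratio: $(( small / large ))x"
```

Every one of those records needs its own block pointer and checksum, which is where the extra metadata and the lost sequential throughput come from.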

sync=disabled on databases

Lying to PostgreSQL about write durability "works" until a power outage. Then you have a corrupted database. The correct fix for slow sync writes is an SLOG, not disabling sync. sync=disabled is for throwaway data only.
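
A quick audit can catch a sync=disabled that was set on, or inherited by, the wrong dataset. This is a sketch: the helper name sync_disabled is hypothetical, and it only filters the tab-separated output that zfs get prints with -H.

```shell
#!/bin/sh
# Read "name<TAB>value" lines, as printed by
#   zfs get -H -o name,value -t filesystem,volume sync
# and print only the datasets where sync is disabled.
sync_disabled() {
    awk -F '\t' '$2 == "disabled" { print $1 }'
}

# Only query ZFS when the tools are installed.
if command -v zfs >/dev/null 2>&1; then
    zfs get -H -o name,value -t filesystem,volume sync | sync_disabled
fi
```

Anything in the output that is not throwaway data (a database, a ccache, a home directory) should be set back with zfs inherit sync or zfs set sync=standard.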

primarycache=none

Almost never correct. Even databases benefit from metadata caching in ARC. If you want to avoid double-caching data, use primarycache=metadata, not none.

Maxing out zfs_arc_max

Setting ARC to use 90% of RAM leaves almost nothing for applications, page cache, or the kernel. Databases, VMs, and applications need RAM too. Start with the default (50% of RAM) and only increase if arc_summary shows high eviction pressure and your applications have memory to spare.
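
zfs_arc_max is specified in bytes, so it helps to compute the value rather than guess it. The sketch below prints the modprobe line for a chosen percentage of RAM without writing anything; the helper name arc_max_line is hypothetical, and reading MemTotal from /proc/meminfo assumes Linux.

```shell
#!/bin/sh
# Print a modprobe options line capping ARC at a percentage of RAM.
# Arguments: total RAM in KiB, percentage (integer).
arc_max_line() {
    mem_kib=$1
    percent=$2
    echo "options zfs zfs_arc_max=$(( mem_kib * 1024 * percent / 100 ))"
}

# MemTotal in /proc/meminfo is reported in KiB on Linux.
if [ -r /proc/meminfo ]; then
    total_kib=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    arc_max_line "$total_kib" 50
fi
```

Review the printed line, compare it against what your applications need, and only then add it to /etc/modprobe.d/zfs.conf.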

Tuning before profiling

If you haven't run zpool iostat -vql and arc_summary, you don't know what your bottleneck is. You might be tuning the wrong thing entirely. The disk might be failing. The pool might be 95% full (ZFS performance degrades above 80%). Profile first.

Ignoring pool fullness

ZFS performance drops sharply above 80% capacity due to fragmentation and COW overhead. No amount of tuning fixes this. Keep pools below 80%. If you're above 80%, the fix is more disks, not more tunables.
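
A capacity check is cheap enough to run from cron or a monitoring agent. This sketch filters the tab-separated output of zpool list; the helper name pools_over and its default threshold argument are assumptions made for illustration.

```shell
#!/bin/sh
# Read "name<TAB>capacity" lines, as printed by
#   zpool list -H -o name,capacity
# and print pools at or above the threshold (default 80).
pools_over() {
    awk -F '\t' -v limit="${1:-80}" '
        { cap = $2; sub(/%$/, "", cap) }
        cap + 0 >= limit + 0 { print $1 " is " cap "% full" }
    '
}

# Only query ZFS when the tools are installed.
if command -v zpool >/dev/null 2>&1; then
    zpool list -H -o name,capacity | pools_over 80
fi
```

An empty result means every pool is under the threshold.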

L2ARC on low-RAM systems

L2ARC index headers consume ~70 bytes of ARC per cached block. A 1TB L2ARC with 4K blocks needs ~17GB of RAM just for the index. On a 32GB system, that's more than half your RAM, and more than the entire default ARC. L2ARC only makes sense when you have plenty of RAM (64GB+) and a working set larger than ARC.
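
The header cost scales with the number of cached blocks, not with the size of the device, so the same 1TB L2ARC is far cheaper with large records. A small sketch of the arithmetic, using the ~70-byte figure quoted above as the default; the helper name l2arc_header_bytes is hypothetical.

```shell
#!/bin/sh
# Estimate ARC memory consumed by L2ARC headers.
# Arguments: L2ARC size in bytes, average block size in bytes,
# header bytes per block (defaults to 70).
l2arc_header_bytes() {
    l2_bytes=$1
    block_bytes=$2
    header_bytes=${3:-70}
    echo $(( l2_bytes / block_bytes * header_bytes ))
}

tib=$(( 1024 * 1024 * 1024 * 1024 ))
gib=$(( 1024 * 1024 * 1024 ))
mib=$(( 1024 * 1024 ))

echo "4K blocks:   $(( $(l2arc_header_bytes "$tib" 4096) / gib )) GiB"
echo "128K blocks: $(( $(l2arc_header_bytes "$tib" 131072) / mib )) MiB"
```

With 4K blocks the index costs about 17 GiB; with 128K blocks it costs 560 MiB.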

The tuning checklist — in order

Follow this order. Each step builds on the previous. Don't skip to step 6 because it sounds exciting — the early steps deliver the most impact.

Step 1: Verify basics
Check zpool status — no errors, no degraded vdevs. Check pool capacity (zpool list) — below 80%. Check ashift=12. These are prerequisites, not tuning.
Step 2: Profile
Run zpool iostat -vql tank 5 and arc_summary under production load. Identify the bottleneck: read latency? sync write queuing? ARC eviction? Dirty data throttling?
Step 3: Set recordsize
Match recordsize to your application's I/O size on each dataset. This is the single biggest win. Use the table above.
Step 4: Set dataset properties
atime=off or relatime=on. compression=lz4 (or zstd for cold data). xattr=sa + dnodesize=auto. primarycache=metadata for databases with their own cache.
Step 5: I/O scheduler
Set Linux I/O scheduler to none on all ZFS disks. Create the udev rule. This is always correct.
Step 6: SLOG (if needed)
If zpool iostat -q shows sync write queuing, add a mirrored SLOG with PLP. If sync write queuing is zero, a SLOG won't help.
Step 7: Special vdev (if needed)
If metadata operations are slow (ls, find, container pulls), add a mirrored SSD special vdev. Set special_small_blocks appropriately.
Step 8: Module parameters
Only after steps 1–7. Adjust zfs_txg_timeout, zfs_dirty_data_max, VDEV queue depths based on profiling data. Make persistent. Rebuild initramfs.
Step 9: Re-profile
Run the same profiling tools from step 2. Compare before and after. If a change didn't help, revert it. Keep only what measurably improves your workload.
Most people who come to me with "ZFS is slow" are at step 1 — wrong ashift, pool at 90% capacity, or RAIDZ for a database. I've never seen someone who completed steps 1–5 and still had a performance problem that required module-level tuning. The basics matter more than the exotic knobs.
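
Steps 2 and 9 only work if the two measurements are comparable. One way to make them so is to capture both with the same helper, as in this sketch; the function name capture_profile, the output directory, and the sample count are illustrative assumptions, and it needs the zpool and arc_summary tools on the host.

```shell
#!/bin/sh
# Capture a labelled profiling snapshot so step 2 and step 9 can be compared.
# Arguments: label (e.g. before/after), pool name, output directory.
capture_profile() {
    label=$1
    pool=$2
    outdir=${3:-./zfs-profile}
    mkdir -p "$outdir" || return 1
    # Six samples at a 5-second interval. The first sample reports
    # averages since the pool was imported; the rest reflect current load.
    zpool iostat -vql "$pool" 5 6 > "$outdir/$label-iostat.txt" || return 1
    arc_summary > "$outdir/$label-arc.txt" || return 1
    echo "$outdir/$label-iostat.txt $outdir/$label-arc.txt"
}

# Example (pool name "tank" as in step 2):
#   capture_profile before tank
#   ...apply ONE change...
#   capture_profile after tank
#   diff -u zfs-profile/before-arc.txt zfs-profile/after-arc.txt
```

Capture under the same production load both times, and change one thing between captures so the difference can be attributed.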

The short version

recordsize matched to your workload's I/O size is 60% of ZFS tuning.

compression=lz4 (or zstd for cold data), atime=off (or relatime), xattr=sa: set these on every dataset by default.

Set the Linux I/O scheduler to none on all ZFS disks — always.

Profile before tuning. Run zpool iostat -vql and arc_summary. Know your bottleneck.

Don't copy tunables from the internet. Understand each parameter. Measure before and after. Keep only what helps.

If your pool is above 80% full, no tuning will save you. Add disks.