Tuning for Workloads — defaults are for nobody.
ZFS ships with defaults designed to be safe and broadly acceptable. They are optimized for no workload in particular, which means they are optimal for no workload at all. A PostgreSQL server, a Plex library, a KVM hypervisor, and a backup target all have radically different I/O patterns. Tuning ZFS means matching the filesystem's behavior to the work your system actually does. This page teaches you how — from dataset properties you set once, to kernel module parameters that reshape how ZFS manages memory and I/O at a system level.
The single most important tunable is recordsize.
Get that right for your workload and you've captured 60% of the tuning benefit. Everything
else on this page is refinement. Don't skip the monitoring section: you should profile
before you tune, not after.
The tuning philosophy
Don't tune what you don't understand. Every parameter on this page exists because ZFS is making a tradeoff — latency vs throughput, memory vs disk, safety vs speed. If you change a parameter without understanding the tradeoff, you may make things worse. The defaults are conservative but safe. Only change them when you have measured a problem and understand why the change helps.
The anti-pattern: copying a block of tunables from a blog post or Reddit thread
and pasting them into /etc/modprobe.d/zfs.conf. Those tunables were written for
someone else's hardware, someone else's workload, and possibly an older version of OpenZFS.
They may have been wrong even for the author. Understand each parameter individually, measure
before and after, and keep only what helps.
ZFS tuning happens at three levels, each with different scope and persistence:
Dataset properties: set with zfs set or at zfs create. Apply to individual datasets or ZVOLs. Inherited by child datasets. Examples: recordsize, compression, atime, logbias, primarycache, xattr. Persistent by default.
Pool properties: set with zpool set or at zpool create. Apply to the entire pool. Examples: ashift, autotrim. Persistent by default. Some (like ashift) are immutable after creation.
Kernel module parameters: set in /etc/modprobe.d/zfs.conf or /sys/module/zfs/parameters/. Apply to the entire system. Examples: zfs_txg_timeout, zfs_dirty_data_max, zfs_arc_max. Must be made persistent manually.
Monitor before you tune
Never tune blind. ZFS provides excellent introspection tools. Use them to identify your actual bottleneck before changing anything.
zpool iostat — the first tool you reach for
# Basic I/O stats, refreshing every 5 seconds
zpool iostat tank 5
# Per-vdev breakdown — shows which vdevs are bottlenecked
zpool iostat -v tank 5
# Queue depths — critical for diagnosing sync write latency
# Look at syncq_read/syncq_write columns
zpool iostat -q tank 5
# Latency histograms — shows distribution, not just averages
zpool iostat -l tank 5
# Everything at once
zpool iostat -vql tank 5
What to look for: consistently high syncq_write means sync writes are queuing
(SLOG will help). High asyncq_write means the pool can't flush dirty data fast enough
(more IOPS or larger zfs_dirty_data_max). High read latency with low cache hit rate means
you need more ARC (RAM) or L2ARC.
arc_summary — ARC health at a glance
# Full ARC report — hit rates, size, evictions
arc_summary
# Key metrics to check:
# ARC hit rate — should be > 90% for read-heavy workloads
# ARC size vs max — if ARC is at max and hit rate is low, you need more RAM
# Prefetch hit rate — if low, prefetch may be hurting more than helping
# L2ARC hit rate — if you have L2ARC, this tells you if it's earning its keep
arcstat — ARC metrics over time
# ARC stats refreshing every 2 seconds
arcstat 2
# Watch specific fields
arcstat -f hits,miss,hit%,arcsz,c 2
/proc/spl/kstat/zfs — the raw data
# ARC stats (the source of truth)
cat /proc/spl/kstat/zfs/arcstats
# Per-dataset I/O stats (one objset-* file per dataset, under the pool's directory)
cat /proc/spl/kstat/zfs/<poolname>/objset-*
# TXG commit times — high values mean writes are batching too long
cat /proc/spl/kstat/zfs/<poolname>/txgs
Many apparent tuning problems are not tuning problems at all: a failing disk (check
zpool status for checksum errors), insufficient RAM (the ARC is being
evicted constantly), or the wrong VDEV topology (RAIDZ for a database). Profile first. The tools above
take five minutes and save you from solving the wrong problem.
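As one way to use the raw counters listed above, the overall ARC hit rate can be derived directly from arcstats. The helper below is a minimal sketch, not part of OpenZFS: the name arc_hit_pct is hypothetical, and it assumes the usual kstat layout of "name type data" rows with hits and misses counters.

```shell
# Sketch: compute the ARC hit percentage from arcstats-formatted text on stdin.
# Assumes rows of the form "name type data" and counters named "hits" and "misses".
arc_hit_pct() {
    awk '$1 == "hits"   { h = $3 }
         $1 == "misses" { m = $3 }
         END {
             if (h + m > 0) printf "%.1f\n", 100 * h / (h + m)
             else print "n/a"
         }'
}

# On a live system:
#   arc_hit_pct < /proc/spl/kstat/zfs/arcstats
```

The counters are cumulative since module load, so for a current picture sample twice and compare, or use arcstat as shown earlier.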
recordsize — the most important tunable
recordsize is the maximum block size ZFS uses for a dataset. It defaults to 128K.
When ZFS writes a file, it breaks it into records of up to this size. When ZFS reads, it reads
entire records. If your application reads 8K but your recordsize is 128K, ZFS reads 128K from
disk and throws away 120K. This is read amplification. If your application writes 8K,
ZFS must read the existing 128K record, modify 8K within it, and write back 128K. This is
write amplification.
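The amplification described above is simple arithmetic. The sketch below uses a hypothetical helper, amplification, to estimate the worst case for small random I/O. It ignores compression and ARC caching, both of which reduce the real cost.

```shell
# Hypothetical helper: worst-case amplification for small random I/O.
# Arguments: recordsize in bytes, application I/O size in bytes.
amplification() {
    local recordsize=$1 io_size=$2
    if [ "$io_size" -ge "$recordsize" ]; then
        # I/O covers at least one whole record: no amplification.
        echo 1
    else
        # A partial record still costs a whole record; round up.
        echo $(( (recordsize + io_size - 1) / io_size ))
    fi
}

amplification 131072 8192   # 128K records, 8K application I/O
amplification 16384 16384   # 16K records, 16K application I/O
```

The first call prints 16 (each 8K request moves 128K) and the second prints 1.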
The rule is simple: match recordsize to your application's I/O size. If you don't know what your application does, leave the default. The default is not bad — it's a reasonable middle ground. But it's not optimal for anything specific.
| Workload | Application I/O size | Recommended recordsize | Why |
|---|---|---|---|
| PostgreSQL | 8K pages | 8K or 16K | Eliminates read/write amplification. 16K handles WAL + data mix better. |
| MySQL InnoDB | 16K pages | 16K | Exact match to InnoDB page size. No amplification. |
| MongoDB (WiredTiger) | 32K–64K | 32K or 64K | WiredTiger checkpoints in 32K–64K blocks. Match to your config. |
| KVM/QEMU (ZVOL) | Varies | volblocksize=16K | Guest OS issues mixed I/O. 16K balances small random + sequential. |
| Samba/NFS file shares | Mixed | 128K (default) | Mixed file sizes. Default works well. Consider 1M if files are large. |
| Video/media files | Sequential MB+ | 1M | Large sequential reads. Fewer, larger blocks = less metadata, more throughput. |
| Backup targets (restic, borg) | 1M–4M chunks | 1M | Backup tools write large sequential chunks. Match to chunk size. |
| Build server (/tmp, ccache) | Mixed small | 128K (default) | Lots of small files created and deleted. Default handles this fine. |
| Container images (Docker) | Mixed small | 128K (default) | Many small files in layers. Default is fine. Special vdev helps more. |
| General / unknown | Unknown | 128K (default) | Leave the default. It's a safe middle ground. |
# Set recordsize at dataset creation (preferred: a later change is not applied to existing data)
zfs create -o recordsize=16K rpool/srv/postgres/data
# Change recordsize on existing dataset (only affects NEW writes)
zfs set recordsize=1M rpool/srv/media
# Check current recordsize
zfs get recordsize rpool/srv/postgres/data
Critical detail: changing recordsize on an existing dataset only affects
data written after the change. Existing blocks keep their original size. To fully convert,
you must rewrite the data at the file level (copy it out and back, or copy it into a new dataset
created with the correct recordsize). A zfs send | zfs receive stream generally carries the
existing block sizes with it, so do not rely on it to re-block old data.
atime and relatime — stop writing on reads
atime (access time) tells ZFS to update a file's access timestamp every time
it is read. This means every cat, every grep, every backup scan
generates a write to update metadata. On a busy system, this creates thousands of
unnecessary writes per second. Almost no application needs access-time tracking.
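With atime=on and relatime=on, ZFS updates the access time only when the previous one is older than the file's last modification or change time, or more than 24 hours old. This matches the Linux relatime mount option. This is what kldload sets by default. Good balance of compatibility and performance.
# kldload default: relatime=on (set at pool creation, inherited by all datasets)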
# To disable atime entirely on a specific dataset:
zfs set atime=off rpool/srv/postgres
zfs set atime=off rpool/srv/media
# Check current setting
zfs get atime,relatime rpool/srv/postgres
sync — the safety vs speed tradeoff
The sync property controls how ZFS handles synchronous write requests.
When an application calls fsync() or opens a file with O_SYNC,
it is asking the filesystem to guarantee the data is on stable storage before returning.
Databases do this on every commit. NFS does this by default.
# Disable sync for a build cache (data loss on crash is acceptable)
zfs set sync=disabled rpool/var/cache/builds
# Force sync for an NFS export
zfs set sync=always rpool/srv/nfs-critical
# Check current setting
zfs get sync rpool/srv/postgres
Use sync=disabled on exactly two things: build caches and
throwaway temp datasets. Everything else gets sync=standard (the default). If sync writes
are slow, the answer is an SLOG, not sync=disabled. Lying to your database about write
durability is how you get corrupted data after a power outage. Not worth the risk.
primarycache and secondarycache
primarycache controls what the ARC (in-memory cache) stores.
secondarycache controls what the L2ARC (SSD cache) stores.
Both accept three values: all (default), metadata, or none.
# PostgreSQL: the database manages its own data cache (shared_buffers)
# Let ARC cache metadata only — avoids double-caching data pages
zfs set primarycache=metadata rpool/srv/postgres/data
# Media library: data caching helps — large reads benefit from ARC
zfs set primarycache=all rpool/srv/media
# L2ARC: only cache metadata for database ZVOLs
zfs set secondarycache=metadata rpool/srv/vms/db-server
When to use primarycache=metadata: Only when the application manages its
own buffer cache and you're seeing ARC filled with data that's already cached at the application level.
Check with arc_summary — if ARC hit rate is high and the application's cache hit rate
is also high, you're double-caching. Switching to metadata frees that ARC space for other
datasets that benefit from it.
logbias — latency vs throughput
logbias controls how ZFS handles synchronous writes to the ZIL.
# Database on all-SSD pool, no SLOG: skip the ZIL double-write
zfs set logbias=throughput rpool/srv/postgres/data
# NFS export with SLOG: keep latency mode for fast sync acks
zfs set logbias=latency rpool/srv/nfs
logbias=throughput is one of those tunables that gets cargo-culted.
People set it because a blog told them to. It only helps in a specific scenario: high sync write volume
on an all-SSD pool without an SLOG. If you have an SLOG, logbias=latency (the default) is
almost always better because the SLOG handles the ZIL writes at NVMe speed. Measure with
zpool iostat -q before changing this.
redundant_metadata
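ZFS stores extra copies of metadata blocks for safety. The redundant_metadata
property controls how much metadata gets an extra copy: all (the default) covers every
metadata block, while most keeps an extra copy of most, but not all, metadata types.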
Recommendation: leave the default (all). The space overhead is tiny compared
to the protection it provides. The only time to consider most is on very large backup pools
where every byte of metadata overhead matters and the data is expendable.
dnodesize — auto vs legacy
Dnodes are ZFS's equivalent of inodes — they store file metadata, extended attributes,
and system attributes. The dnodesize property controls their size:
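legacy uses the original fixed 512-byte dnodes. auto lets ZFS use larger dnodes where they help, which pairs well with xattr=sa. This is what kldload sets by default.
# kldload sets this at pool creation: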
# zpool create ... -O dnodesize=auto ...
# Check current setting
zfs get dnodesize rpool
When to use legacy: Only when you need to import the pool on a system
that doesn't support large dnodes (FreeBSD pre-12, very old OpenZFS). For Linux-only pools,
always use auto.
xattr — sa vs dir
Extended attributes (SELinux labels, POSIX ACLs, Samba metadata) are stored either as
separate hidden files in a directory (xattr=dir) or directly in the dnode's
system attribute area (xattr=sa).
xattr=sa is the faster option on Linux. Pair it with dnodesize=auto for best results. This is what kldload sets by default.
# kldload sets this at pool creation:
# zpool create ... -O xattr=sa ...
# If you inherited xattr=dir and want to switch:
zfs set xattr=sa rpool/srv/samba
# Verify
zfs get xattr rpool/srv/samba
special_small_blocks
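If your pool has a special vdev (mirrored SSDs that store metadata and small blocks),
the special_small_blocks property controls the block-size threshold. Blocks smaller than or
equal to this value are stored on the special vdev instead of the main pool. Metadata always goes to the
special vdev regardless of this setting. Because the threshold applies to blocks, a value at or above
the dataset's recordsize sends all of that dataset's data to the special vdev.
# Store blocks up to 64K on the special vdev (good default)
zfs set special_small_blocks=65536 tank
# For database workloads with 8K pages: store pages on SSD
# (at recordsize=16K this places the whole dataset on the special vdev)
zfs set special_small_blocks=16384 tank/srv/postgres
# For container storage with many tiny files
# (at the default recordsize=128K this also places every block on the special vdev)
zfs set special_small_blocks=131072 tank/srv/containers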
# Check current setting
zfs get special_small_blocks tank
Sizing the special vdev: Monitor how much data lands on it with
zpool list -v tank. If the special vdev fills up, new small blocks overflow to the
main pool (no data loss, just slower). Size it at 3–5% of your total pool capacity as a
starting point. For metadata-heavy workloads (millions of small files), size it larger.
Kernel module parameters
These parameters control ZFS behavior at the system level. They apply to all pools on the system.
Set them at runtime via /sys/module/zfs/parameters/ for testing, then make them
persistent in /etc/modprobe.d/zfs.conf.
zfs_txg_timeout — transaction group commit interval
ZFS batches writes into transaction groups (TXGs) and commits them to disk
periodically. The zfs_txg_timeout parameter sets the maximum seconds between commits.
Default: 5 seconds.
# Check current value
cat /sys/module/zfs/parameters/zfs_txg_timeout
# Default: 5
# Lower for databases — commit more frequently, smoother latency
echo 3 > /sys/module/zfs/parameters/zfs_txg_timeout
# Higher for bulk ingestion — batch more, higher throughput
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout
Lower values (3–5) reduce worst-case latency spikes because each TXG is smaller. Good for database servers. Higher values (10–30) let ZFS batch more writes together, improving throughput for bulk workloads (backup ingestion, large file copies). The tradeoff: higher values mean more data in flight, more RAM pressure, and longer pauses during the commit.
zfs_dirty_data_max — write throttle threshold
Controls how much dirty (uncommitted) data ZFS allows in memory before throttling new writes. Default: 10% of RAM (capped at 4GB). When dirty data reaches this limit, ZFS slows down incoming writes to let the pool flush.
# Check current value (in bytes)
cat /sys/module/zfs/parameters/zfs_dirty_data_max
# Default on a 64GB system: 4GB (10% of RAM would be 6.4GB, so the cap applies)
# Increase for systems with fast NVMe pools that can flush quickly
# 8GB on a 128GB system — lets ZFS buffer more before throttling
echo 8589934592 > /sys/module/zfs/parameters/zfs_dirty_data_max
# Decrease on systems with slow HDDs to avoid long commit stalls
echo 2147483648 > /sys/module/zfs/parameters/zfs_dirty_data_max
Too high: ZFS buffers massive amounts of dirty data, then hits a wall when the TXG commits. The resulting flush causes a long pause visible as latency spikes. Too low: ZFS throttles writes prematurely, never utilizing the full disk bandwidth. Tune based on your pool's write throughput — if your disks can flush 500 MB/s, you want enough dirty data to keep them busy for the TXG timeout interval.
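The sizing rule above (enough dirty data to keep the disks busy for one TXG interval) works out to throughput multiplied by the timeout. A minimal sketch, using a hypothetical helper name and assuming you have already measured sustained pool write throughput in MiB/s:

```shell
# Sketch: dirty-data budget = sustained pool write throughput x TXG interval.
# Arguments: throughput in MiB/s, zfs_txg_timeout in seconds.
# Prints a candidate zfs_dirty_data_max in bytes.
dirty_data_target_bytes() {
    local mib_per_sec=$1 txg_seconds=$2
    echo $(( mib_per_sec * txg_seconds * 1024 * 1024 ))
}

# Pool that flushes 500 MiB/s with the default 5-second TXG timeout:
dirty_data_target_bytes 500 5
```

This prints 2621440000 (about 2.4 GiB). Treat the result as a starting point to measure against, not a value to set blindly.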
zfs_prefetch_disable — when prefetch hurts
ZFS tries to detect sequential access patterns and prefetch upcoming blocks into ARC. This works well for sequential workloads (media streaming, backups) but can waste ARC space and bandwidth on purely random workloads (databases with random reads).
# Check current value
cat /sys/module/zfs/parameters/zfs_prefetch_disable
# Default: 0 (prefetch enabled)
# Disable prefetch for database servers with purely random I/O
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
Check before disabling: run arc_summary and look at the prefetch hit rate.
If the prefetch hit rate is above 50%, prefetch is working and you should leave it on.
If it's below 20%, prefetch is wasting ARC space with data that never gets read, and disabling
it will free ARC for useful data.
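The 50% and 20% thresholds above can be turned into a small decision helper. This is a sketch with a hypothetical function name. It takes prefetch hit and miss counts as arguments. On a live system those would come from arc_summary or from counters such as prefetch_data_hits and prefetch_data_misses in arcstats. Treat those field names as an assumption and confirm them on your OpenZFS version.

```shell
# Sketch: classify prefetch effectiveness using the thresholds from the text.
# Arguments: prefetch hits, prefetch misses (cumulative counters).
prefetch_verdict() {
    local hits=$1 misses=$2 total pct
    total=$(( hits + misses ))
    if [ "$total" -eq 0 ]; then
        echo "no-data"
        return
    fi
    pct=$(( 100 * hits / total ))
    if [ "$pct" -gt 50 ]; then
        echo "keep ($pct%)"
    elif [ "$pct" -lt 20 ]; then
        echo "consider-disabling ($pct%)"
    else
        echo "inconclusive ($pct%)"
    fi
}

prefetch_verdict 800 200
prefetch_verdict 10 90
```

Values between the two thresholds are reported as inconclusive. In that range, measure the workload with prefetch on and off before deciding.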
VDEV queue tuning — zfs_vdev_*_max_active
ZFS has its own I/O scheduler that queues requests to each vdev. The zfs_vdev_*_max_active
parameters control how many concurrent I/O operations of each type ZFS sends to each device.
The defaults are conservative, tuned for spinning disks.
# Defaults (tuned for HDDs — low queue depths)
cat /sys/module/zfs/parameters/zfs_vdev_async_read_max_active # 3
cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active # 10
cat /sys/module/zfs/parameters/zfs_vdev_sync_read_max_active # 10
cat /sys/module/zfs/parameters/zfs_vdev_sync_write_max_active # 10
# For all-SSD or NVMe pools — SSDs thrive on deep queues
echo 32 > /sys/module/zfs/parameters/zfs_vdev_async_read_max_active
echo 32 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
echo 32 > /sys/module/zfs/parameters/zfs_vdev_sync_read_max_active
echo 32 > /sys/module/zfs/parameters/zfs_vdev_sync_write_max_active
# For mixed HDD+SSD (SSD cache/SLOG, HDD main pool) — leave defaults
# The HDD vdevs would choke on deep queues
For NVMe: increase these aggressively (32–64). NVMe drives have internal parallelism measured in thousands of queues. The default of 3 async reads starves the drive. For HDDs: leave the defaults. Sending too many concurrent requests to a spinning disk causes excessive seeking and actually reduces throughput.
I/O scheduler interaction — use none/noop
Linux has its own I/O schedulers (mq-deadline, bfq, kyber, none). ZFS has its own I/O scheduler internally. Running two schedulers is redundant and harmful — the Linux scheduler reorders I/O that ZFS has already carefully ordered, adding latency for no benefit.
# Check current scheduler for a disk
cat /sys/block/sda/queue/scheduler
# Output: [mq-deadline] kyber bfq none
# Set to none (bypass Linux scheduler — let ZFS handle it)
echo none > /sys/block/sda/queue/scheduler
# Make persistent via udev rule:
# /etc/udev/rules.d/60-zfs-scheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]*|nvme*", ATTR{queue/scheduler}="none"
Always set the Linux I/O scheduler to none on disks used by ZFS.
This is one of the few "always do this" rules in ZFS tuning. The performance difference is
measurable, especially on HDDs where the Linux scheduler's reordering fights with ZFS's own ordering.
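To audit many disks at once, the active scheduler can be extracted from the bracketed entry in the scheduler file. A minimal sketch. The function name active_scheduler is hypothetical, and the loop in the comment assumes the standard /sys/block layout shown above.

```shell
# Sketch: print the active scheduler (the bracketed entry) from a scheduler line on stdin.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

echo "[mq-deadline] kyber bfq none" | active_scheduler

# On a live system, list every block device and its active scheduler:
#   for f in /sys/block/*/queue/scheduler; do
#       printf '%s %s\n' "$f" "$(active_scheduler < "$f")"
#   done
```

The echo line prints mq-deadline. Any ZFS member disk that reports something other than none is a candidate for the udev rule.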
Making module parameters persistent
Runtime changes via /sys/module/zfs/parameters/ are lost on reboot.
To make them permanent, write them to /etc/modprobe.d/zfs.conf:
# /etc/modprobe.d/zfs.conf — persistent ZFS module parameters
#
# TXG timeout — commit every 3 seconds for smoother database latency
options zfs zfs_txg_timeout=3
# Dirty data max — 4GB, tuned for our NVMe pool
options zfs zfs_dirty_data_max=4294967296
# VDEV queue depths — tuned for all-NVMe pool
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_vdev_async_write_max_active=32
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_sync_write_max_active=32
# After editing zfs.conf, rebuild initramfs so early-boot ZFS uses the new values
# CentOS / RHEL / Rocky / Fedora:
dracut -f
# Debian / Ubuntu:
update-initramfs -u
# Arch:
mkinitcpio -P
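The classic mistake is to fix a problem at runtime through /sys/module/zfs/parameters/ and never write the change to /etc/modprobe.d/zfs.conf.
Three months later the server reboots and the problem comes back. Always make it persistent. Always
rebuild the initramfs if you boot from ZFS. kldload uses ZFSBootMenu, so the initramfs matters.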
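A quick check can catch runtime-only changes before the next reboot does. The sketch below uses two hypothetical helpers: list_zfs_options parses "options zfs name=value" lines from a modprobe.d-style file, and unpersisted reports which of the parameters you name have no entry there.

```shell
# Sketch: print "name value" for each ZFS option in a modprobe.d-style file.
list_zfs_options() {
    awk '$1 == "options" && $2 == "zfs" {
             for (i = 3; i <= NF; i++) { split($i, kv, "="); print kv[1], kv[2] }
         }' "$1"
}

# Sketch: print each named parameter that has no entry in the conf file.
# Usage: unpersisted CONF_FILE PARAM [PARAM...]
unpersisted() {
    local conf=$1 name
    shift
    for name in "$@"; do
        list_zfs_options "$conf" | grep -q "^$name " || echo "$name"
    done
}

# Example:
#   unpersisted /etc/modprobe.d/zfs.conf zfs_txg_timeout zfs_prefetch_disable
```

Pass it the parameters you changed through /sys/module/zfs/parameters/. Anything it prints still needs a line in zfs.conf and an initramfs rebuild.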
kldload defaults and why they were chosen
kldload sets the following at pool creation. These are inherited by all datasets unless overridden. Each choice was deliberate:
| Property | kldload default | ZFS default | Why kldload changes it |
|---|---|---|---|
| ashift | 12 | auto-detect | Auto-detect lies. Many 4K-sector drives report 512. Wrong ashift is permanent and halves throughput. |
| compression | lz4 | off | LZ4 is nearly free in CPU and saves 30–50% disk. No reason to leave compression off. |
| relatime | on | off | Prevents unnecessary writes on reads. Compatible with applications that check atime (mail, AIDE). |
| xattr | sa | dir | 3–7x faster for SELinux, Samba ACLs, and any xattr-heavy workload. No downside on Linux. |
| dnodesize | auto | legacy | Larger dnodes store xattrs inline (works with xattr=sa). Faster metadata. Only breaks FreeBSD pre-12. |
| acltype | posixacl | off | Required for POSIX ACLs (Samba, NFS, containers). No performance cost. |
| normalization | formD | none | Unicode normalization prevents filenames that look identical from being treated as different files. |
| autotrim | on | off | Issues TRIM/DISCARD to SSDs automatically. Maintains SSD performance and longevity over time. |
These defaults are set in storage-zfs.sh at the zpool create call.
They are deliberately conservative — no module-level tunables, no aggressive caching changes.
The goal is a pool that performs well for any workload out of the box. You tune further based on
your specific workload using the guidance on this page.
Workload-specific tuning profiles
Below are complete, copy-paste recipes for common workloads. Each includes dataset properties, module parameters where relevant, and explanations. Start with the profile closest to your workload and adjust.
Database server (PostgreSQL)
The PostgreSQL recipe
# Dataset for PostgreSQL data directory
# (logbias=throughput assumes an all-SSD pool without an SLOG; with an SLOG, keep logbias=latency)
zfs create -o recordsize=16K \
-o logbias=throughput \
-o primarycache=metadata \
-o compression=lz4 \
-o atime=off \
-o redundant_metadata=all \
rpool/srv/postgres
# Separate dataset for WAL (write-ahead log) — different I/O pattern
zfs create -o recordsize=64K \
-o logbias=latency \
-o primarycache=metadata \
-o compression=lz4 \
-o atime=off \
rpool/srv/postgres/wal
Why 16K for data? PostgreSQL uses 8K pages. 16K gives a 2:1 ratio that handles both single-page reads and sequential scans well. 8K is theoretically perfect but 16K better accommodates TOAST tables and index pages that span multiple 8K blocks.
Why separate WAL? WAL writes are sequential and latency-sensitive. Data writes are random. Separating them prevents WAL flushes from competing with data I/O. If you have an SLOG, it handles the ZIL for both, but the dataset separation still helps ARC and cache management.
Why primarycache=metadata? PostgreSQL has its own
data cache (shared_buffers). Double-caching in ARC wastes RAM. Let ARC cache the metadata
(block pointers, directory entries) and let PostgreSQL cache the data pages.
Module-level tuning for database servers:
# /etc/modprobe.d/zfs.conf additions for PostgreSQL
options zfs zfs_txg_timeout=3
# Smoother commit latency — smaller, more frequent TXG flushes
VM host (KVM / libvirt / Proxmox)
The VM storage recipe
# ZVOLs for VM disks — raw block devices, no POSIX overhead
zfs create -V 40G -s \
-o volblocksize=16K \
-o compression=lz4 \
-o primarycache=all \
rpool/srv/vms/webserver-01
# Dataset for ISO images — large sequential reads
zfs create -o recordsize=1M \
-o compression=off \
-o atime=off \
rpool/srv/vms/isos
# Module-level tuning for VM hosts
# /etc/modprobe.d/zfs.conf
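# Set ZVOL worker threads for concurrent VM I/O (check /sys/module/zfs/parameters/zvol_threads first; the current value may already be 32 or higher)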
options zfs zvol_threads=32
Why ZVOLs, not files? ZVOLs expose a block device to QEMU. The guest OS issues block I/O directly without the overhead of a POSIX filesystem layer. Less CPU, less latency, better IOPS.
Why volblocksize=16K? The guest OS generates mixed
I/O — 4K filesystem metadata, 8K database pages, sequential reads. 16K is the best compromise.
Larger values (64K, 128K) cause write amplification for small random writes. Smaller values (4K, 8K)
increase metadata overhead and fragment the pool.
Why mirrors, not RAIDZ? VMs generate random I/O. RAIDZ spreads every block across all disks in the vdev, so a RAIDZ vdev delivers roughly the random IOPS of a single disk, which destroys VM IOPS. Mirrors handle random I/O with linear scaling: each mirror pair serves independent requests.
File server (Samba / NFS)
The file server recipe
# General file shares — mixed workload
zfs create -o recordsize=128K \
-o compression=lz4 \
-o atime=off \
-o xattr=sa \
-o acltype=posixacl \
rpool/srv/shares
# Large file shares (engineering, media) — sequential I/O
zfs create -o recordsize=1M \
-o compression=zstd \
-o atime=off \
rpool/srv/shares/media
# Home directories — mixed small files
zfs create -o recordsize=128K \
-o compression=lz4 \
-o atime=off \
rpool/srv/shares/homes
NFS and sync writes: NFS v3 with sync exports (the default) generates
heavy sync write traffic. Without an SLOG, every NFS write waits for the ZIL to commit to the data
pool. An SLOG drops NFS write latency by 10–100x on spinning rust.
Samba and xattrs: xattr=sa is critical for Samba.
Windows clients set ACLs, alternate data streams, and DOS attributes as extended attributes. With
xattr=dir, each attribute access is a separate I/O. With xattr=sa,
they're inline in the dnode — a single I/O reads the file and its attributes together.
Backup target (restic, borg, zfs send)
The backup recipe
# Backup dataset — large sequential writes, maximize compression
zfs create -o recordsize=1M \
-o compression=zstd-3 \
-o atime=off \
-o sync=standard \
-o redundant_metadata=all \
rpool/srv/backups
# For zfs receive targets — match the source recordsize
# or use 1M if receiving mixed sources
zfs create -o recordsize=1M \
-o compression=lz4 \
rpool/srv/backups/remote
Why 1M recordsize? Backup tools write large sequential chunks. restic uses 1–8 MB
packs. borg uses 1–2 MB chunks. zfs send streams are sequential by nature. Larger
records mean fewer metadata blocks and higher sequential throughput.
Why zstd-3? Backup data compresses well (especially if the source didn't compress). zstd level 3 gives 2–4x compression at moderate CPU cost. For very slow CPUs, use lz4. For backup targets where CPU is plentiful and space is tight, zstd-9 or higher can be worth it.
Build server (CI, compilation, containers)
The build server recipe
# Build workspace — short-lived data, speed over safety
zfs create -o recordsize=128K \
-o compression=lz4 \
-o atime=off \
-o sync=disabled \
rpool/srv/builds
# ccache directory — lots of small reads, some writes
zfs create -o recordsize=128K \
-o compression=lz4 \
-o atime=off \
rpool/srv/ccache
# Container storage (Docker/Podman ZFS driver)
zfs create -o recordsize=128K \
-o compression=lz4 \
-o atime=off \
rpool/srv/containers
Why sync=disabled for builds? Build artifacts are ephemeral. If the system
crashes mid-build, you re-run the build. The 2–5x write performance improvement is worth the risk
of losing in-flight build data. Never use sync=disabled for the ccache or anything
you don't want to regenerate.
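Container special vdev: Container images are layers of many small
files. A mirrored SSD special vdev with special_small_blocks=131072 dramatically accelerates
image pulls, container starts, and layer deduplication lookups. Note that with recordsize=128K
this threshold sends every block of the dataset to the special vdev, so size it for the full dataset.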
Comprehensive recordsize reference
| Workload | Storage Type | recordsize / volblocksize | Compression | SLOG? | primarycache | VDEV Type |
|---|---|---|---|---|---|---|
| PostgreSQL (data) | Dataset | 16K | lz4 | Yes | metadata | Mirrors |
| PostgreSQL (WAL) | Dataset | 64K | lz4 | Yes | metadata | Mirrors |
| MySQL InnoDB | Dataset | 16K | lz4 | Yes | metadata | Mirrors |
| MongoDB | Dataset | 64K | lz4 | Recommended | all | Mirrors |
| VM disks (KVM) | ZVOL | volblocksize=16K | lz4 | Recommended | all | Mirrors |
| NFS shares | Dataset | 128K | lz4 | Yes | all | Mirrors or RAIDZ2 |
| Samba shares | Dataset | 128K | lz4 | Recommended | all | Mirrors or RAIDZ2 |
| Media files | Dataset | 1M | zstd | No | all | RAIDZ2 |
| Backups / archives | Dataset | 1M | zstd-3 | No | all | RAIDZ2/3 |
| Build artifacts | Dataset | 128K | lz4 | No | all | Any |
| Containers (Docker) | Dataset | 128K | lz4 | No | all | Mirrors + Special |
| General / unknown | Dataset | 128K | lz4 | No | all | Any |
Common anti-patterns
Copying tunables from the internet
The #1 anti-pattern. Someone's "ultimate ZFS tuning guide" was written for their hardware, their workload, and possibly an older OpenZFS version. Parameters that helped them may hurt you. Always understand what each parameter does and measure the impact on your system.
Setting recordsize=8K everywhere
Databases need small recordsize. Your media library does not. Using 8K recordsize on large files multiplies metadata overhead 16x (compared to 128K) and destroys sequential throughput. Set recordsize per dataset, not per pool.
sync=disabled on databases
Lying to PostgreSQL about write durability "works" until a power outage. Then you have a
corrupted database. The correct fix for slow sync writes is an SLOG, not disabling sync.
sync=disabled is for throwaway data only.
primarycache=none
Almost never correct. Even databases benefit from metadata caching in ARC. If you want
to avoid double-caching data, use primarycache=metadata, not none.
Maxing out zfs_arc_max
Setting ARC to use 90% of RAM leaves nothing for applications, page cache, or the kernel.
Databases, VMs, and applications need RAM too. Start with the default (50% of RAM) and only
increase if arc_summary shows high eviction pressure and your applications
have memory to spare.
Tuning before profiling
If you haven't run zpool iostat -vql and arc_summary, you don't
know what your bottleneck is. You might be tuning the wrong thing entirely. The disk might be
failing. The pool might be 95% full (ZFS performance degrades above 80%). Profile first.
Ignoring pool fullness
ZFS performance drops sharply above 80% capacity due to fragmentation and COW overhead. No amount of tuning fixes this. Keep pools below 80%. If you're above 80%, the fix is more disks, not more tunables.
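The 80% rule is easy to check mechanically. The sketch below uses a hypothetical helper, pools_over_threshold, that filters the tab-separated output of zpool list -H -o name,capacity.

```shell
# Sketch: read "name<TAB>capacity%" lines on stdin and print pools above a threshold.
# Argument: threshold in percent (defaults to 80).
pools_over_threshold() {
    local limit=${1:-80}
    awk -v limit="$limit" '{
        cap = $2
        sub(/%$/, "", cap)          # strip the trailing percent sign
        if (cap + 0 > limit) print $1 " " cap "%"
    }'
}

# On a live system:
#   zpool list -H -o name,capacity | pools_over_threshold 80
```

An empty result means every pool is at or below the threshold.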
L2ARC on low-RAM systems
L2ARC index headers consume ~70 bytes of ARC per cached block. A 1TB L2ARC with 4K blocks needs ~17GB of RAM just for the index. On a 32GB system, that is more than half of your RAM and more than the entire default ARC. L2ARC only makes sense when you have plenty of RAM (64GB+) and a working set larger than ARC.
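The header cost is block count multiplied by header size. A minimal sketch with a hypothetical helper, taking the ~70 bytes per block from the text as its default. The real header size varies by OpenZFS version.

```shell
# Sketch: RAM consumed by L2ARC headers.
# Arguments: L2ARC size in bytes, average cached block size in bytes,
#            header size in bytes (defaults to 70, per the estimate above).
l2arc_header_bytes() {
    local l2arc_bytes=$1 block_bytes=$2 header_bytes=${3:-70}
    echo $(( l2arc_bytes / block_bytes * header_bytes ))
}

# 1 TiB of L2ARC filled with 4K blocks:
l2arc_header_bytes $(( 1024 * 1024 * 1024 * 1024 )) 4096
# The same device filled with 128K blocks:
l2arc_header_bytes $(( 1024 * 1024 * 1024 * 1024 )) 131072
```

The first call prints 18790481920 (about 17.5 GiB) and the second prints 587202560 (about 560 MiB), which is why L2ARC is far cheaper on large-record datasets than on small-block ones.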
The tuning checklist — in order
Follow this order. Each step builds on the previous. Don't skip to step 6 because it sounds exciting — the early steps deliver the most impact.
1. Check pool health. Run zpool status: no errors, no degraded vdevs. Check pool capacity (zpool list): below 80%. Check ashift=12. These are prerequisites, not tuning.
2. Profile. Run zpool iostat -vql tank 5 and arc_summary under production load. Identify the bottleneck: read latency? sync write queuing? ARC eviction? Dirty data throttling?
3. Set dataset properties. recordsize matched to the workload's I/O size. atime=off or relatime=on. compression=lz4 (or zstd for cold data). xattr=sa + dnodesize=auto. primarycache=metadata for databases with their own cache.
4. Set the I/O scheduler. Use none on all ZFS disks. Create the udev rule. This is always correct.
5. Add an SLOG if needed. If zpool iostat -q shows sync write queuing, add a mirrored SLOG with PLP. If sync write queuing is zero, a SLOG won't help.
6. Add a special vdev if metadata or small-file I/O is the bottleneck, and set special_small_blocks appropriately.
7. Tune module parameters. Adjust zfs_txg_timeout, zfs_dirty_data_max, VDEV queue depths based on profiling data. Make persistent. Rebuild initramfs.
The short version
recordsize matched to your workload's I/O size is 60% of ZFS tuning.
compression=lz4, atime=off (or relatime), xattr=sa — set these on every pool, no exceptions.
Set the Linux I/O scheduler to none on all ZFS disks — always.
Profile before tuning. Run zpool iostat -vql and arc_summary. Know your bottleneck.
Don't copy tunables from the internet. Understand each parameter. Measure before and after. Keep only what helps.
If your pool is above 80% full, no tuning will save you. Add disks.