Pool Design & VDEV Layout — the decision you can't undo.
The layout of your pool determines performance, redundancy, and scalability. Once a ZFS pool is created, you cannot change the RAID layout without rebuilding the entire pool. This is the most important decision you'll make. Get it right the first time.
zpool create is the topology you live with for the life of that pool. You can add vdevs, but you cannot
reshape existing ones (RAIDZ expansion in OpenZFS 2.3+ is the first exception, and it's still experimental).
This page gives you the recipes to get it right. For a deeper treatment, see the ZFS Masterclass
pool-design section and the dRAID Storage recipe for large arrays.
ashift — sector size alignment
ashift tells ZFS the minimum block size your disks use, expressed as a power of two.
ashift=12 means 2^12 = 4096 bytes (4K sectors). ashift=9 means 2^9 = 512 bytes.
Nearly every modern HDD and SSD uses 4K (or larger) sectors or pages internally, even if the firmware reports 512 bytes for backwards compatibility.
Always use ashift=12.
If you get ashift wrong (too low), ZFS writes 512-byte blocks to a 4K-sector drive. The drive must read-modify-write every block internally, cutting throughput in half and dramatically increasing latency. There is no way to fix this after pool creation — ashift is permanent per vdev.
# Always specify ashift=12 explicitly. Never trust auto-detection.
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb
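Since ashift is just the base-2 logarithm of the sector size, the mapping is easy to sketch in shell. The helper below is hypothetical (the name `ashift_for_sector` is not part of ZFS) and applies the floor of 12 recommended above. On Linux, `lsblk -o NAME,PHY-SEC,LOG-SEC` shows what each drive reports, bearing in mind that firmware may report 512 for compatibility.

```shell
# Hypothetical helper: map a sector size in bytes to an ashift value.
# ashift is log2(sector size); this sketch never returns less than 12.
ashift_for_sector() {
    local bytes=$1 exp=0
    # Find the smallest exponent whose power of two covers the sector size.
    while [ $((1 << exp)) -lt "$bytes" ]; do
        exp=$((exp + 1))
    done
    # Apply the floor: treat anything reporting under 4K as 4K.
    if [ "$exp" -lt 12 ]; then
        exp=12
    fi
    echo "$exp"
}

ashift_for_sector 512    # prints 12 (floor applied)
ashift_for_sector 4096   # prints 12
ashift_for_sector 8192   # prints 13
```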
VDEV types
The RAIDZ random write penalty
Why RAIDZ is terrible for databases and VMs
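RAIDZ excels at sequential workloads: large blocks written contiguously, as in media streaming, archives, and backups. Random writes (databases, VMs, email servers) are a different story. RAIDZ never does a parity read-modify-write, because every block is written as its own variable-width stripe. The cost is that every block is spread across the disks in the vdev, so a whole RAIDZ vdev delivers roughly the random IOPS of a single disk. Small blocks also pay proportionally more for parity and padding. The result is low IOPS and high latency.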
Example: A PostgreSQL server on RAIDZ2 performing frequent 8KB updates. Each write touches multiple disks inefficiently, causing IOPS bottlenecks. The same workload on mirrors scales linearly — each mirror pair serves requests independently.
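A back-of-the-envelope way to compare layouts is the common rule of thumb that each top-level vdev, mirror or RAIDZ, delivers about one disk's worth of small random-write IOPS. The sketch below encodes only that assumption. The function name and the 150 IOPS figure are illustrative, not measurements.

```shell
# Rule-of-thumb estimate (an assumption, not a benchmark):
# small random-write IOPS ~= number of top-level vdevs x IOPS of one disk.
pool_random_write_iops() {
    local vdevs=$1 per_disk_iops=$2
    echo $((vdevs * per_disk_iops))
}

# Six disks at a hypothetical 150 IOPS each:
pool_random_write_iops 3 150   # three 2-way mirrors: prints 450
pool_random_write_iops 1 150   # one 6-disk RAIDZ2:   prints 150
```

The same six disks give three times the estimated random-write IOPS when arranged as mirrors, which is the gap the PostgreSQL example describes.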
Pool recipes by disk count
Every recipe below includes the properties you should always set at pool creation:
ashift=12 (4K sectors), compression=lz4 (free performance),
atime=off (no access-time writes), xattr=sa (extended attributes stored in the dnode as system attributes),
and dnodesize=auto (larger dnodes for metadata-heavy workloads).
1 disk — testing only, no redundancy
zpool create -o ashift=12 \
-O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
tank /dev/sda
No redundancy at all. One disk failure = total data loss. Acceptable for throwaway test environments. Never use this for data you care about.
2 disks — mirror (recommended minimum)
zpool create -o ashift=12 \
-O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
tank mirror /dev/sda /dev/sdb
50% usable capacity. Survives one disk failure. Fast reads (both disks serve reads). This is the minimum for any system where data matters.
3 disks — 3-way mirror or RAIDZ1
# 3-way mirror — maximum safety, 33% usable capacity
zpool create -o ashift=12 \
-O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
tank mirror /dev/sda /dev/sdb /dev/sdc
# RAIDZ1 — 67% usable capacity, survives one failure
zpool create -o ashift=12 \
-O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
tank raidz1 /dev/sda /dev/sdb /dev/sdc
3-way mirror for IOPS workloads (databases, VMs). RAIDZ1 only for sequential workloads and only with small drives — see the resilver warning below.
4 disks — 2 mirror pairs (recommended for mixed workloads)
zpool create -o ashift=12 \
-O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd
50% usable capacity. Two independent mirror pairs = 2x IOPS of a single mirror. Survives one failure per pair. Easy to expand by adding more mirror pairs. This is the sweet spot for most general-purpose servers.
6 disks — 3 mirror pairs or RAIDZ2
# 3 mirror pairs — 50% capacity, excellent IOPS
zpool create -o ashift=12 \
-O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf
# RAIDZ2 — 67% capacity, survives two failures, sequential throughput
zpool create -o ashift=12 \
-O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
8 disks — 4 mirror pairs or RAIDZ2
# 4 mirror pairs — 50% capacity, 4x IOPS
zpool create -o ashift=12 \
-O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd \
mirror /dev/sde /dev/sdf mirror /dev/sdg /dev/sdh
# RAIDZ2 — 75% capacity, survives two failures
zpool create -o ashift=12 \
-O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh
12 disks — dRAID2 with 1 spare or 6 mirror pairs
# dRAID2 with 1 distributed spare — resilvers in minutes
zpool create -o ashift=12 \
-O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
tank draid2:1s /dev/sd{a..l}
# 6 mirror pairs — maximum IOPS
zpool create -o ashift=12 \
-O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd \
mirror /dev/sde /dev/sdf mirror /dev/sdg /dev/sdh \
mirror /dev/sdi /dev/sdj mirror /dev/sdk /dev/sdl
24 disks — dRAID2 with 2 spares
# dRAID2 with 2 distributed spares — large-scale storage
zpool create -o ashift=12 \
-O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
tank draid2:2s /dev/sd{a..x}
At 24 disks, dRAID is the clear winner. Traditional RAIDZ2 resilvers would take 12+ hours per vdev. dRAID spreads the rebuild across all disks in parallel. See the dRAID Storage recipe for full details, tuning, and capacity planning.
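The usable-capacity percentages quoted in the mirror and RAIDZ recipes above follow from simple arithmetic: a mirror keeps one disk's worth of data per vdev, and RAIDZ loses one disk's worth per parity level. The helper below is a hypothetical sketch of that calculation. It ignores metadata, padding, and reserved space, so real pools report somewhat less.

```shell
# usable_pct LAYOUT DISKS_PER_VDEV [PARITY]
# Hypothetical helper: raw usable capacity of one vdev as a whole percentage.
usable_pct() {
    local layout=$1 disks=$2 parity=${3:-0} data
    case "$layout" in
        mirror) data=1 ;;                     # every disk holds the same copy
        raidz)  data=$((disks - parity)) ;;   # parity costs PARITY disks of space
        *) echo "unknown layout: $layout" >&2; return 1 ;;
    esac
    # Integer percentage, rounded to the nearest whole number.
    echo $(( (100 * data + disks / 2) / disks ))
}

usable_pct mirror 2     # prints 50
usable_pct mirror 3     # prints 33
usable_pct raidz 3 1    # prints 67
usable_pct raidz 6 2    # prints 67
usable_pct raidz 8 2    # prints 75
```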
dRAID — distributed RAID for large arrays
Traditional RAIDZ resilvers onto one disk at a time: the replacement drive is the bottleneck, and an 8TB drive takes 8–12 hours to fill. During those hours the vdev runs with reduced redundancy, and on RAIDZ1 one more failure kills the pool. dRAID (OpenZFS 2.1+) fixes this by distributing parity and spare capacity across every disk in the vdev. When a drive fails, all remaining drives participate in the rebuild simultaneously, so resilvers that took hours finish in minutes.
The tradeoff: dRAID uses fixed-width stripe groups, which can waste space on small writes. It shines at 12+ disks where resilver time is the dominant risk. Below 12 disks, mirrors or traditional RAIDZ are simpler and equally safe.
dRAID syntax: draid[parity]:[spares]s. For example, draid2:1s
means double-parity with one distributed spare. The spare capacity is reserved across all disks —
no physical hot spare sits idle. When a disk fails, the spare space activates immediately with no
human intervention.
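The short form above leaves the data width and child count to defaults. OpenZFS also accepts a fuller form, draid[parity][:Nd][:Nc][:Ns], where the `d` and `c` fields spell out the data disks per group and the total number of children. The sketch below builds such a string. The function name is hypothetical, and the sanity check (children at least data + parity + spares) is my understanding of the constraint, so verify it against your OpenZFS version.

```shell
# draid_spec PARITY DATA CHILDREN SPARES
# Hypothetical helper: build an explicit dRAID vdev spec string.
draid_spec() {
    local parity=$1 data=$2 children=$3 spares=$4
    local minimum=$((data + parity + spares))
    # Assumed constraint: one full redundancy group plus spares must fit.
    if [ "$children" -lt "$minimum" ]; then
        echo "need at least $minimum disks, got $children" >&2
        return 1
    fi
    echo "draid${parity}:${data}d:${children}c:${spares}s"
}

draid_spec 2 8 12 1   # prints draid2:8d:12c:1s
```

Writing the child count explicitly makes `zpool create` cross-check it against the number of devices actually listed, which helps when the device list comes from a brace expansion.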
Special VDEV deep dive
The special vdev stores two things: pool metadata (directory entries, file sizes,
block pointers) and small file blocks below the special_small_blocks
threshold. When your data pool lives on spinning rust, putting metadata on an SSD special vdev
means ls, find, du, container image pulls, and database
index lookups hit the SSD instead of waiting for a seek on a 7200 RPM platter.
For metadata-heavy operations the improvement can be 10–50x, depending on the workload.
The critical rule: the special vdev must be mirrored. If you add an unmirrored special vdev and that single SSD fails, the pool's metadata is gone. The pool is unrecoverable. ZFS will warn you, but it will let you do it. Don't.
# Add a mirrored special vdev to an existing pool
# This stores metadata + data blocks of 64K or smaller on the SSDs
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
zfs set special_small_blocks=65536 tank
The special_small_blocks property controls the threshold. Blocks no larger than this value
go to the special vdev. Set it to 65536 (64K) for general use. For database workloads
with 8K pages, even 16384 makes a dramatic difference, but keep the value below the dataset's
recordsize: at or above it, every data block lands on the special vdev. Metadata always goes
to the special vdev regardless of this setting.
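The routing rule can be written as a one-line comparison. The sketch below assumes the behaviour described in the OpenZFS property documentation: a data block qualifies when it is no larger than the threshold, and a threshold of 0 (the default) disables small-block placement. The function name is hypothetical.

```shell
# lands_on_special BLOCK_BYTES THRESHOLD_BYTES
# Hypothetical helper: where would a data block of this size be allocated?
lands_on_special() {
    local block=$1 threshold=$2
    # A threshold of 0 means only metadata uses the special vdev.
    if [ "$threshold" -gt 0 ] && [ "$block" -le "$threshold" ]; then
        echo special
    else
        echo normal
    fi
}

lands_on_special 8192 65536     # prints special
lands_on_special 131072 65536   # prints normal
lands_on_special 8192 0         # prints normal
```

One consequence worth checking before setting the property: a dataset whose recordsize is at or below the threshold sends all of its data blocks to the special vdev, which can fill a small SSD pair quickly.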
SLOG deep dive
The ZFS Intent Log (ZIL) records synchronous write transactions so they survive a power failure.
By default the ZIL lives on the data disks. A SLOG (Separate LOG device) moves
the ZIL to a dedicated, fast device. This only helps workloads that issue
synchronous writes: databases calling fsync(), NFS exports (which commit writes synchronously
by default), datasets set to sync=always, iSCSI, and VM disk images on NFS/iSCSI.
A SLOG is not a general write cache. Asynchronous writes (the default for most Linux applications) bypass the ZIL entirely — they go straight to the transaction group. Adding a SLOG to an async-write workload does literally nothing. Before buying hardware, check whether your workload actually generates sync writes:
# Check if sync writes are queuing: look at the "syncq_write" columns (pend/activ)
zpool iostat -q tank 5
# If syncq_write is consistently > 0, a SLOG is likely to help.
# If it is always 0, a SLOG is a waste of money.
SLOG devices must have power loss protection (PLP). Consumer NVMe drives have write-back caches that lose data on power failure — which defeats the entire purpose of the ZIL. Use enterprise NVMe (Intel DC P4510/P5800X, Samsung PM983/PM9A3) or Intel Optane. A SLOG does not need to be large — 16–32GB is plenty for most workloads, since the ZIL only holds data for a few seconds before the next transaction group commit.
# Add a SLOG device (mirror recommended for safety)
zpool add tank log mirror /dev/nvme2n1 /dev/nvme3n1
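The small size recommendation can be sanity-checked with arithmetic: the SLOG only has to absorb the sync writes that arrive between transaction group commits. The sketch below is a rule-of-thumb estimate, assuming the default 5-second transaction group interval and a safety factor of 2. The function name and both defaults are illustrative choices.

```shell
# slog_size_mb INGEST_MB_PER_S [TXG_SECONDS] [SAFETY_FACTOR]
# Hypothetical helper: rough SLOG capacity needed, in MB.
slog_size_mb() {
    local ingest=$1 txg=${2:-5} factor=${3:-2}
    # Peak sync-write rate x seconds between commits x headroom.
    echo $((ingest * txg * factor))
}

slog_size_mb 1250   # saturated 10 GbE link (~1250 MB/s): prints 12500
slog_size_mb 125    # saturated 1 GbE link (~125 MB/s):   prints 1250
```

Even a fully saturated 10 GbE link of pure sync writes comes out around 12.5 GB under these assumptions, which is consistent with 16–32GB being plenty.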
L2ARC — when it helps and when it doesn't
L2ARC extends the in-memory ARC read cache to an SSD. It helps when your working set is larger than RAM and the workload is read-heavy: file servers, media libraries, build caches, read-replica databases. In these cases, L2ARC gives you SSD-speed reads for data that would otherwise hit spinning disks.
L2ARC does not help when:
# Add L2ARC (no mirror needed — it's a cache, loss is harmless)
zpool add tank cache /dev/nvme4n1
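Before adding a cache device, it can help to look at how well the in-memory ARC is already doing. On Linux, OpenZFS exposes hit and miss counters in /proc/spl/kstat/zfs/arcstats. The sketch below turns two such counters into a percentage. The function name is hypothetical, and reading a high hit rate as "the working set already fits in RAM" is a heuristic, not a guarantee.

```shell
# arc_hit_pct HITS MISSES
# Hypothetical helper: ARC hit rate as a whole percentage (rounded down).
arc_hit_pct() {
    local hits=$1 misses=$2 total=$(( $1 + $2 ))
    if [ "$total" -eq 0 ]; then
        echo 0
        return
    fi
    echo $((100 * hits / total))
}

# On a Linux host with ZFS loaded, the counters can be read like this:
#   hits=$(awk '$1 == "hits" {print $3}' /proc/spl/kstat/zfs/arcstats)
#   misses=$(awk '$1 == "misses" {print $3}' /proc/spl/kstat/zfs/arcstats)
#   arc_hit_pct "$hits" "$misses"

arc_hit_pct 950 50   # prints 95
```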
Common pitfalls
Choosing RAIDZ for VMs or databases
This is the #1 mistake. RAIDZ has terrible random write performance. Use mirrors for anything that needs IOPS.
RAIDZ1 on large drives
A 16TB drive takes 12+ hours to resilver. During those hours, one more disk failure and the pool is gone. RAIDZ2 minimum for any drive over 2TB.
Not using special vdevs for metadata
Metadata-heavy workloads (containers, small files, databases) crawl without an SSD special vdev. One mirrored SSD pair changes everything.
Unmirrored special vdev
An unmirrored special vdev is a single point of failure for the entire pool. If that one SSD dies, the pool is unrecoverable. Always mirror the special vdev.
Adding single disks instead of vdevs
You cannot add a single disk to an existing RAIDZ vdev (pre-2.3). You must add an entirely new vdev of equal size. Plan for expansion from day one.
Mixing vdev sizes
ZFS distributes writes across vdevs in proportion to each vdev's free space. Mismatched vdev sizes create unbalanced performance. Keep all vdevs the same size and type.
Wrong ashift
Auto-detected ashift=9 on a 4K-sector drive halves throughput. Always pass -o ashift=12 explicitly. Cannot be changed after creation.
Consumer SSD as SLOG
Consumer NVMe without power loss protection defeats the purpose of the ZIL. On power failure, the SLOG can lose sync writes it has already acknowledged, which is the exact scenario it's supposed to protect against.
Resilver time comparison
Resilver time is how long it takes to rebuild redundancy after a disk failure. During resilver, you are running degraded — one more failure (in RAIDZ1) or two more (in RAIDZ2) means data loss. Shorter resilver = smaller risk window.
| Configuration | 8TB drive | 16TB drive | Risk during resilver |
|---|---|---|---|
| Mirror | 3–6 hours | 6–12 hours | Low — only the degraded pair is at risk |
| RAIDZ1 | 8–14 hours | 16–28 hours | Critical — one more failure = pool gone |
| RAIDZ2 | 8–14 hours | 16–28 hours | Moderate — can survive one more failure during resilver |
| dRAID2 (12 disks) | 15–30 minutes | 30–60 minutes | Minimal — parallel rebuild across all disks |
| dRAID2 (24 disks) | 8–15 minutes | 15–30 minutes | Minimal — more disks = faster parallel rebuild |
Mirror resilver times are driven by the single replacement disk's write speed. RAIDZ resilver reads from all surviving disks but writes to one — the replacement drive is the bottleneck. dRAID reads and writes across all surviving disks simultaneously, which is why resilvers finish in minutes instead of hours.
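When a single replacement disk is the bottleneck, a rough estimate of resilver time is the amount of data to rebuild divided by that disk's sustained write speed. The sketch below does that division. The function name and the 200 MB/s figure are illustrative assumptions, and real resilvers copy only allocated data and compete with production I/O.

```shell
# resilver_hours DATA_GB WRITE_MB_PER_S
# Hypothetical helper: estimated hours to write DATA_GB to one disk.
resilver_hours() {
    local gb=$1 mbps=$2
    # GB -> MB (x1000), divide by MB/s to get seconds.
    local seconds=$(( gb * 1000 / mbps ))
    # Round to the nearest whole hour.
    echo $(( (seconds + 1800) / 3600 ))
}

resilver_hours 8000 200    # full 8TB disk at 200 MB/s:  prints 11
resilver_hours 16000 200   # full 16TB disk at 200 MB/s: prints 22
resilver_hours 4000 200    # half-full 8TB disk:         prints 6
```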
Pool expansion
How you expand a pool depends entirely on the VDEV type you chose at creation. This is another reason pool design is permanent — your expansion path is locked in.
Mirrors — add more mirror pairs
The simplest and most flexible expansion. Each new mirror pair adds capacity and IOPS linearly. No rebalancing needed (new writes naturally spread across all vdevs).
# Add a new mirror pair to an existing pool
zpool add tank mirror /dev/sdg /dev/sdh
RAIDZ — add another full RAIDZ vdev
You should add a complete RAIDZ vdev of the same width and parity level; zpool add rejects a mismatched layout unless you force it with -f. If your pool is a 6-disk RAIDZ2, you add another 6-disk RAIDZ2. You cannot add individual disks to an existing RAIDZ vdev (pre-2.3).
# Add a second RAIDZ2 vdev (should match existing vdev width and parity)
zpool add tank raidz2 /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl
RAIDZ expansion (OpenZFS 2.3+) — experimental
OpenZFS 2.3 introduces RAIDZ expansion: you can add a single disk to an existing RAIDZ vdev. ZFS re-stripes the data in the background. This is a major new feature, and it is new enough to treat as experimental. The re-stripe can take days on large pools, and pool performance is reduced while it runs. Blocks written before the expansion keep their original data-to-parity ratio until they are rewritten, so the capacity gain is smaller than the new width suggests. Test thoroughly before using in production.
# RAIDZ expansion — add one disk to existing RAIDZ vdev (OpenZFS 2.3+)
zpool attach tank raidz2-0 /dev/sdm
# Monitor progress
zpool status tank
Best practices
Before buying SLOG hardware, check for sync writes with zpool iostat -q first.