Pool Design & VDEV Layout

Pool Design & VDEV Layout — the decision you can't undo.

The layout of your pool determines performance, redundancy, and scalability. Once a ZFS pool is created, you cannot change the RAID layout without rebuilding the entire pool. This is the most important decision you'll make. Get it right the first time.

Pool design is permanent. The VDEV topology you choose at zpool create is the topology you live with for the life of that pool. You can add vdevs, and you can attach or detach mirror members, but you cannot convert between RAIDZ levels or between mirrors and RAIDZ. RAIDZ expansion in OpenZFS 2.3+ is the first way to widen an existing RAIDZ vdev, and it is new enough that it is best treated as experimental. This page gives you the recipes to get it right. For a deeper treatment, see the ZFS Masterclass pool-design section and the dRAID Storage recipe for large arrays.

ashift — sector size alignment

ashift tells ZFS the minimum block size your disks use, expressed as a power of two. ashift=12 means 2¹² = 4096 bytes (4K sectors). ashift=9 means 2⁹ = 512 bytes. Nearly every modern HDD and SSD uses physical sectors of 4K or larger internally, even if the firmware reports 512 bytes for backwards compatibility. Use ashift=12 unless you know the drive needs something larger.

If you get ashift wrong (too low), ZFS writes 512-byte blocks to a 4K-sector drive. The drive must then read-modify-write whole sectors internally, which can cut write throughput roughly in half and sharply increase latency. There is no way to fix this after pool creation: ashift is fixed per vdev when the vdev is created.

# Always specify ashift=12 explicitly. Never trust auto-detection.
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb
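
To make the power-of-two relationship concrete, here is a small sketch. The helper names ashift_to_bytes and bytes_to_ashift are hypothetical, not ZFS commands. The commented commands at the end are the usual ways to see what a drive reports and what an existing pool uses; they are not run here.

```shell
# Hypothetical helpers (not part of ZFS): convert between ashift and bytes.
ashift_to_bytes() {
  echo $((1 << $1))
}

bytes_to_ashift() {
  local bytes=$1 n=0
  # Find the smallest n where 2^n is at least the sector size.
  while [ $((1 << n)) -lt "$bytes" ]; do
    n=$((n + 1))
  done
  echo "$n"
}

ashift_to_bytes 9      # 512
ashift_to_bytes 12     # 4096
bytes_to_ashift 4096   # 12

# On a real system (not run here):
#   lsblk -o NAME,PHY-SEC,LOG-SEC   # sector sizes the drive reports
#   zdb -C tank | grep ashift       # ashift actually in use per vdev
```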

VDEV types

Mirror

Best for high IOPS. Two (or more) disks with identical data. Every mirror vdev can serve reads independently. Use for: VMs, databases, high-traffic workloads. Expandable by adding more mirror pairs.

RAIDZ1

Single-parity striping. Survives one disk failure. Good for: bulk storage, archives, media. Terrible for random writes. Cannot add disks to an existing RAIDZ vdev (except via RAIDZ expansion in OpenZFS 2.3+).

RAIDZ2

Double-parity. Survives two disk failures. Recommended for large arrays where resilver times are long. Best balance of space efficiency and safety for bulk storage.

RAIDZ3

Triple-parity. Survives three disk failures. For very large arrays (20+ disks) where resilver can take days and the risk of a second failure during resilver is real.

dRAID

Distributed RAID (OpenZFS 2.1+). Parity and spare capacity spread across all disks. Resilvers in minutes instead of hours. Use for 12+ disk arrays. See dRAID recipe.

Special VDEV

SSD-based vdev that stores metadata and small files. Dramatically accelerates database workloads, container storage, and anything metadata-heavy. Must be mirrored — losing this vdev loses the pool.

SLOG

Separate ZFS Intent Log. Accelerates synchronous writes (databases, NFS, VM storage). Must be enterprise NVMe with power loss protection. Not a general write cache.

L2ARC

SSD-based read cache that extends ARC beyond RAM. Useful when RAM is limited but you need fast reads. Does NOT improve writes.

Mirrors for IOPS. RAIDZ for capacity. This is the fundamental tradeoff. Everything else is details.

The RAIDZ random write penalty

Why RAIDZ is terrible for databases and VMs

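RAIDZ excels at sequential workloads: large blocks written contiguously, as in media streaming, archives, and backups. Random I/O (databases, VMs, email servers) is a different story. RAIDZ does not do the classic RAID5 read-modify-write, because every block is written as its own variable-width stripe with its own parity. The cost shows up elsewhere: a block is spread across the disks of the vdev, so each random read or write keeps many disks busy at once, and a RAIDZ vdev delivers roughly the random IOPS of a single disk. Small blocks also carry proportionally more parity and padding, which wastes space.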

Example: A PostgreSQL server on RAIDZ2 performing frequent 8KB updates. Each write touches multiple disks inefficiently, causing IOPS bottlenecks. The same workload on mirrors scales linearly — each mirror pair serves requests independently.

Rule of thumb: RAIDZ for throughput. Mirrors for IOPS.
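
As a rough illustration of that rule of thumb, the sketch below treats each vdev as delivering about the random-write IOPS of one member disk. That is a common simplification, not a measurement. The function name est_write_iops and the 200 IOPS per-disk figure are assumptions chosen for the example.

```shell
# Rough estimate only: random-write IOPS scale with the number of vdevs,
# not the number of disks. Each vdev counts as about one disk's worth.
# est_write_iops is a hypothetical helper; per_disk_iops is an assumed figure.
est_write_iops() {
  local vdevs=$1 per_disk_iops=$2
  echo $((vdevs * per_disk_iops))
}

# Six disks at an assumed 200 random IOPS each:
est_write_iops 3 200   # three mirror pairs  -> 600
est_write_iops 1 200   # one 6-disk RAIDZ2   -> 200
```

Same six disks, three times the estimated random-write IOPS, at the price of 50% instead of 67% usable capacity.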

Pool recipes by disk count

Every recipe below includes the properties you should always set at pool creation: ashift=12 (4K sectors), compression=lz4 (cheap on CPU and usually a net performance win), atime=off (no access-time writes), xattr=sa (extended attributes stored as system attributes in the dnode instead of hidden directories), and dnodesize=auto (larger dnodes for metadata-heavy workloads).

1 disk — testing only, no redundancy

zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
  tank /dev/sda

No redundancy at all. One disk failure = total data loss. Acceptable for throwaway test environments. Never use this for data you care about.

2 disks — mirror (recommended minimum)

zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
  tank mirror /dev/sda /dev/sdb

50% usable capacity. Survives one disk failure. Fast reads (both disks serve reads). This is the minimum for any system where data matters.

3 disks — 3-way mirror or RAIDZ1

# 3-way mirror — maximum safety, 33% usable capacity
zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
  tank mirror /dev/sda /dev/sdb /dev/sdc

# RAIDZ1 — 67% usable capacity, survives one failure
zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
  tank raidz1 /dev/sda /dev/sdb /dev/sdc

3-way mirror for IOPS workloads (databases, VMs). RAIDZ1 only for sequential workloads and only with small drives — see the resilver warning below.

4 disks — 2 mirror pairs (recommended for mixed workloads)

zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
  tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

50% usable capacity. Two independent mirror pairs = 2x IOPS of a single mirror. Survives one failure per pair. Easy to expand by adding more mirror pairs. This is the sweet spot for most general-purpose servers.

6 disks — 3 mirror pairs or RAIDZ2

# 3 mirror pairs — 50% capacity, excellent IOPS
zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
  tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf

# RAIDZ2 — 67% capacity, survives two failures, sequential throughput
zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
  tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

8 disks — 4 mirror pairs or RAIDZ2

# 4 mirror pairs — 50% capacity, 4x IOPS
zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
  tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd \
  mirror /dev/sde /dev/sdf mirror /dev/sdg /dev/sdh

# RAIDZ2 — 75% capacity, survives two failures
zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
  tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh

12 disks — dRAID2 with 1 spare or 6 mirror pairs

# dRAID2 with 1 distributed spare — resilvers in minutes
zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
  tank draid2:1s /dev/sd{a..l}

# 6 mirror pairs — maximum IOPS
zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
  tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd \
  mirror /dev/sde /dev/sdf mirror /dev/sdg /dev/sdh \
  mirror /dev/sdi /dev/sdj mirror /dev/sdk /dev/sdl
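
Long mirror lists are easy to mistype. One possible approach is to generate the vdev arguments from a flat device list and print the command for review before running it. The helper name mirror_pairs is hypothetical; the device names are the same placeholders used in the recipes.

```shell
# Hypothetical helper: turn a flat device list into "mirror a b mirror c d ...".
mirror_pairs() {
  if [ $(( $# % 2 )) -ne 0 ]; then
    echo "mirror_pairs: need an even number of devices" >&2
    return 1
  fi
  local out=""
  while [ $# -gt 0 ]; do
    out="$out mirror $1 $2"
    shift 2
  done
  # Strip the leading space before printing.
  echo "${out# }"
}

# Print the command instead of running it, so it can be reviewed first.
echo zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
  tank $(mirror_pairs /dev/sda /dev/sdb /dev/sdc /dev/sdd)
```

Once the printed command looks right, zpool create -n performs a dry run that shows the resulting layout without creating the pool.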

24 disks — dRAID2 with 2 spares

# dRAID2 with 2 distributed spares — large-scale storage
zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
  tank draid2:2s /dev/sd{a..x}

At 24 disks, dRAID is the clear winner. Traditional RAIDZ2 resilvers would take 12+ hours per vdev. dRAID spreads the rebuild across all disks in parallel. See the dRAID Storage recipe for full details, tuning, and capacity planning.

For deeper coverage of pool geometry, stripe width selection, and performance benchmarks, see the ZFS Masterclass pool-design section. For dRAID configuration at scale, see the dRAID Storage recipe.
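
The capacity percentages quoted in the recipes above follow from simple arithmetic: a mirror keeps one disk's worth of data per vdev, and RAIDZ gives up one disk per parity level. The sketch below reproduces them. usable_pct is a hypothetical helper, and the result is nominal: it ignores metadata, padding, and the free-space headroom a pool needs in practice.

```shell
# Hypothetical helper: nominal usable capacity as a whole percentage.
# layout is mirror, raidz1, raidz2, or raidz3.
# For "mirror", disks is the number of copies in each mirror vdev.
usable_pct() {
  local layout=$1 disks=$2 parity
  case $layout in
    mirror) echo $((100 / disks)); return ;;
    raidz1) parity=1 ;;
    raidz2) parity=2 ;;
    raidz3) parity=3 ;;
    *) echo "unknown layout: $layout" >&2; return 1 ;;
  esac
  # Round to the nearest whole percent.
  echo $(( ((disks - parity) * 100 + disks / 2) / disks ))
}

usable_pct mirror 2   # 50
usable_pct mirror 3   # 33
usable_pct raidz1 3   # 67
usable_pct raidz2 6   # 67
usable_pct raidz2 8   # 75
```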

dRAID — distributed RAID for large arrays

Traditional RAIDZ resilvers onto a single replacement disk: that drive is the bottleneck, and an 8TB drive takes 8–12 hours to fill. During those hours, one more failure kills a RAIDZ1 pool and leaves a RAIDZ2 pool with no margin. dRAID (OpenZFS 2.1+) fixes this by distributing parity and spare capacity across every disk in the vdev. When a drive fails, all remaining drives participate in the rebuild simultaneously, so resilvers that took hours can finish in minutes.

The tradeoff: dRAID uses fixed-width stripe groups, which can waste space on small writes. It shines at 12+ disks where resilver time is the dominant risk. Below 12 disks, mirrors or traditional RAIDZ are simpler and equally safe.

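dRAID syntax: draid[parity][:<data>d][:<children>c][:<spares>s]. The recipes on this page use the short form draid[parity]:[spares]s and leave the other fields at their defaults. For example, draid2:1s means double parity with one distributed spare. The spare capacity is reserved across all disks, so no physical hot spare sits idle. When a disk fails, ZFS rebuilds onto the distributed spare space without waiting for a replacement drive; you still replace the failed disk afterwards to get the spare capacity back.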
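
Spelling out every field makes the layout explicit, and the children count acts as a cross-check against the number of devices actually passed. The sketch below builds such a spec string. draid_spec is a hypothetical helper, and the 8 data disks per group used in the examples is an assumption matching the usual default; confirm both against the zpoolconcepts man page for your OpenZFS version.

```shell
# Hypothetical helper: build an explicit dRAID vdev spec string.
# Args: parity, data disks per group, total children, distributed spares.
draid_spec() {
  local parity=$1 data=$2 children=$3 spares=$4
  # A redundancy group needs data + parity disks once spares are set aside.
  if [ $((data + parity)) -gt $((children - spares)) ]; then
    echo "draid_spec: $children disks cannot hold $data data + $parity parity + $spares spares" >&2
    return 1
  fi
  echo "draid${parity}:${data}d:${children}c:${spares}s"
}

draid_spec 2 8 12 1   # draid2:8d:12c:1s
draid_spec 2 8 24 2   # draid2:8d:24c:2s
```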

Special VDEV deep dive

The special vdev stores two things: pool metadata (directory entries, file sizes, block pointers) and small file blocks at or below the special_small_blocks threshold. When your data pool lives on spinning rust, putting metadata on an SSD special vdev means ls, find, du, container image pulls, and database index lookups hit the SSD instead of waiting for a seek on a 7200 RPM platter. The improvement can reach 10–50x for metadata-heavy operations, depending on the workload and how much metadata is already cached in ARC.

The critical rule: the special vdev must be mirrored. If you add an unmirrored special vdev and that single SSD fails, the pool's metadata is gone. The pool is unrecoverable. ZFS will warn you, but it will let you do it. Don't.

# Add a mirrored special vdev to an existing pool
# This stores metadata + data blocks up to 64K on the SSDs (new writes only)
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
zfs set special_small_blocks=65536 tank

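The special_small_blocks property controls the threshold. Blocks smaller than or equal to this value go to the special vdev. Set it to 65536 (64K) for general use. For database datasets, even 16384 makes a dramatic difference, but keep the value below the dataset's recordsize: if recordsize is 8K and the threshold is 16384, every data block qualifies and the whole dataset lands on the SSDs. Metadata always goes to the special vdev regardless of this setting. Both the vdev and the property affect new writes only; existing blocks stay where they are until they are rewritten.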
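
A quick sanity check before setting the property is to compare the threshold against the dataset's recordsize. This sketch assumes blocks at or below the threshold qualify for the special vdev; check_small_blocks is a hypothetical helper and both arguments are in bytes.

```shell
# Hypothetical helper: report whether special_small_blocks would capture
# every data block of a dataset. Both arguments are in bytes.
check_small_blocks() {
  local small_blocks=$1 recordsize=$2
  if [ "$small_blocks" -ge "$recordsize" ]; then
    echo "all data blocks go to the special vdev"
  else
    echo "only blocks up to $small_blocks bytes go to the special vdev"
  fi
}

check_small_blocks 65536 131072   # 64K threshold, 128K recordsize
check_small_blocks 16384 8192     # 16K threshold, 8K recordsize

# On a real system the inputs could come from (not run here):
#   zfs get -Hp -o value special_small_blocks,recordsize tank
```

The second case is the one to watch for: the SSD pair has to be sized for the entire dataset, not just its metadata.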

SLOG deep dive

The ZFS Intent Log (ZIL) records synchronous write transactions so they survive a power failure. By default the ZIL lives on the data disks. A SLOG (Separate LOG device) moves the ZIL to a dedicated, fast device. This only helps workloads that issue synchronous writes: databases calling fsync(), NFS exports, datasets with sync=always, iSCSI, and VM disk images on NFS/iSCSI.

A SLOG is not a general write cache. Asynchronous writes (the default for most Linux applications) bypass the ZIL entirely — they go straight to the transaction group. Adding a SLOG to an async-write workload does literally nothing. Before buying hardware, check whether your workload actually generates sync writes:

# Check if sync writes are queuing: look at the "syncq_write" columns
zpool iostat -q tank 5

# If syncq_write pend/activ are consistently > 0, a SLOG will help.
# If they're always 0, a SLOG is a waste of money.

SLOG devices must have power loss protection (PLP). Consumer NVMe drives have write-back caches that lose data on power failure — which defeats the entire purpose of the ZIL. Use enterprise NVMe (Intel DC P4510/P5800X, Samsung PM983/PM9A3) or Intel Optane. A SLOG does not need to be large — 16–32GB is plenty for most workloads, since the ZIL only holds data for a few seconds before the next transaction group commit.

# Add a SLOG device (mirror recommended for safety)
zpool add tank log mirror /dev/nvme2n1 /dev/nvme3n1
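
The 16–32GB guidance can be sanity-checked with a back-of-the-envelope estimate: peak sync ingest rate, times the seconds between transaction group commits, times a few groups of headroom. slog_size_mb is a hypothetical helper, and all three inputs in the example are assumptions to adjust for your own system (5 seconds is the usual default txg interval).

```shell
# Rule-of-thumb SLOG sizing: peak sync ingest x seconds per txg x txgs in flight.
# slog_size_mb is a hypothetical helper; every input is an assumption to adjust.
slog_size_mb() {
  local ingest_mb_s=$1 txg_seconds=$2 txgs_in_flight=$3
  echo $((ingest_mb_s * txg_seconds * txgs_in_flight))
}

# 10GbE link (about 1250 MB/s), 5-second txg interval, 3 txgs of headroom:
slog_size_mb 1250 5 3   # 18750 MB
```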

L2ARC — when it helps and when it doesn't

L2ARC extends the in-memory ARC read cache to an SSD. It helps when your working set is larger than RAM and the workload is read-heavy: file servers, media libraries, build caches, read-replica databases. In these cases, L2ARC gives you SSD-speed reads for data that would otherwise hit spinning disks.

L2ARC does not help when:

Write-heavy

L2ARC is a read cache. It does nothing for writes. Write-heavy workloads gain zero benefit.

Small working set

If your hot data fits in RAM, ARC already caches it. L2ARC adds latency (SSD vs RAM) for data ARC already has.

Low RAM systems

L2ARC headers consume ~70 bytes of ARC (RAM) per cached block. On an SSD with millions of 4K blocks, that overhead can consume gigabytes of RAM — stealing from ARC to index L2ARC. On systems with less than 64GB RAM, the overhead often costs more than the benefit.

# Add L2ARC (no mirror needed — it's a cache, loss is harmless)
zpool add tank cache /dev/nvme4n1
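
The header overhead is easy to estimate before buying a cache device. The sketch below uses the ~70 bytes per cached block quoted above; l2arc_header_mib is a hypothetical helper, and the real per-block cost varies by OpenZFS version.

```shell
# Hypothetical helper: RAM consumed by L2ARC headers, in MiB.
# Args: L2ARC size in GiB, average cached block size in bytes,
#       header bytes per cached block.
l2arc_header_mib() {
  local l2arc_gib=$1 block_bytes=$2 header_bytes=$3
  local blocks=$(( l2arc_gib * 1024 * 1024 * 1024 / block_bytes ))
  echo $(( blocks * header_bytes / 1024 / 1024 ))
}

l2arc_header_mib 1024 4096 70     # 1 TiB of 4K blocks   -> 17920 MiB
l2arc_header_mib 1024 131072 70   # 1 TiB of 128K blocks -> 560 MiB
```

Block size dominates the result: the same 1 TiB device costs about 17.5 GiB of RAM when filled with 4K blocks and about half a gibibyte when filled with 128K records.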

Common pitfalls

Choosing RAIDZ for VMs or databases

This is the #1 mistake. RAIDZ has terrible random write performance. Use mirrors for anything that needs IOPS.

RAIDZ1 on large drives

A 16TB drive takes 12+ hours to resilver. During those hours, one more disk failure and the pool is gone. RAIDZ2 minimum for any drive over 2TB.

Not using special vdevs for metadata

Metadata-heavy workloads (containers, small files, databases) crawl without an SSD special vdev. One mirrored SSD pair changes everything.

Unmirrored special vdev

An unmirrored special vdev is a single point of failure for the entire pool. If that one SSD dies, the pool is unrecoverable. Always mirror the special vdev.

Adding single disks instead of vdevs

You cannot add a single disk to an existing RAIDZ vdev (pre-2.3). You must add an entirely new vdev of equal size. Plan for expansion from day one.

Mixing vdev sizes

ZFS distributes new writes across vdevs in proportion to their free space, so larger or emptier vdevs take more of the load. Mismatched vdev sizes create unbalanced performance. Keep all vdevs the same size and type.

Wrong ashift

Auto-detected ashift=9 on a 4K-sector drive can roughly halve write throughput. Always pass -o ashift=12 explicitly. Cannot be changed after creation.

Consumer SSD as SLOG

Consumer NVMe without power loss protection defeats the purpose of the ZIL. On power failure, the SLOG loses uncommitted sync writes — the exact scenario it's supposed to protect against.

The most common regret in ZFS: choosing RAIDZ1 on large drives. A 16TB drive takes 12+ hours to resilver. During those hours, one more failure and the pool is gone. RAIDZ2 minimum for drives over 2TB. If you have 12+ drives, dRAID shrinks the resilver window from hours to minutes.

Resilver time comparison

Resilver time is how long it takes to rebuild redundancy after a disk failure. During resilver, you are running degraded — one more failure (in RAIDZ1) or two more (in RAIDZ2) means data loss. Shorter resilver = smaller risk window.

Configuration	8TB drive	16TB drive	Risk during resilver
Mirror	3–6 hours	6–12 hours	Low — only the degraded pair is at risk
RAIDZ1	8–14 hours	16–28 hours	Critical — one more failure = pool gone
RAIDZ2	8–14 hours	16–28 hours	Moderate — can survive one more failure during resilver
dRAID2 (12 disks)	15–30 minutes	30–60 minutes	Minimal — parallel rebuild across all disks
dRAID2 (24 disks)	8–15 minutes	15–30 minutes	Minimal — more disks = faster parallel rebuild

Mirror resilver times are driven by the single replacement disk's write speed. RAIDZ resilver reads from all surviving disks but writes to one — the replacement drive is the bottleneck. dRAID reads and writes across all surviving disks simultaneously, which is why resilvers finish in minutes instead of hours.
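
When a single replacement disk is the bottleneck, a lower bound on resilver time is simply the data to rebuild divided by that disk's sustained write speed. The sketch below shows the arithmetic. resilver_hours is a hypothetical helper, and 200 MB/s is an assumed sustained rate for a full drive; real resilvers are usually slower because of fragmentation and competing I/O.

```shell
# Hypothetical helper: lower-bound resilver time in hours when one
# replacement disk is the bottleneck.
# Args: data to rebuild in TB, sustained write speed in MB/s (assumed).
resilver_hours() {
  awk -v tb="$1" -v mbps="$2" \
    'BEGIN { printf "%.1f\n", tb * 1000000 / mbps / 3600 }'
}

resilver_hours 8 200    # 11.1
resilver_hours 16 200   # 22.2
```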

Pool expansion

How you expand a pool depends entirely on the VDEV type you chose at creation. This is another reason pool design is permanent — your expansion path is locked in.

Mirrors — add more mirror pairs

The simplest and most flexible expansion. Each new mirror pair adds capacity and IOPS linearly. No manual rebalancing is needed: existing data stays where it is, and new writes favor the emptier vdev until the pool evens out.

# Add a new mirror pair to an existing pool
zpool add tank mirror /dev/sdg /dev/sdh
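
After adding a pair, new writes are not split evenly. As a simplified model, assume each vdev receives new writes in proportion to its free space. write_share_pct is a hypothetical helper, and the free-space figures are made up for the example.

```shell
# Simplified model: new writes are shared in proportion to each vdev's free space.
# write_share_pct is a hypothetical helper; pass free space per vdev in any one unit.
write_share_pct() {
  local total=0 free
  for free in "$@"; do
    total=$((total + free))
  done
  # One line of output per vdev, in the order given.
  for free in "$@"; do
    echo $((free * 100 / total))
  done
}

# Old mirror pair with 1000 GB free, new pair with 4000 GB free:
write_share_pct 1000 4000   # prints 20, then 80
```

On a live pool, zpool list -v tank shows the allocated and free space of each vdev, which is the input this model needs.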

RAIDZ — add another full RAIDZ vdev

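Add a complete RAIDZ vdev of the same width and parity level. If your pool is a 6-disk RAIDZ2, you add another 6-disk RAIDZ2. zpool add rejects a mismatched replication level unless forced with -f, and forcing it leaves you with an unbalanced pool. You cannot add individual disks to an existing RAIDZ vdev (pre-2.3).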

# Add a second RAIDZ2 vdev (must match existing vdev width)
zpool add tank raidz2 /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl

RAIDZ expansion (OpenZFS 2.3+) — experimental

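OpenZFS 2.3 introduces RAIDZ expansion: you can add a single disk to an existing RAIDZ vdev. ZFS re-stripes the data in the background. This is a major feature, and it is new enough that you should treat it as experimental. The re-stripe can take days on large pools, and performance is reduced while it runs. Blocks written before the expansion keep their old data-to-parity ratio, so the pool gains less usable space than a freshly created vdev of the same width until that data is rewritten. Test thoroughly before using in production.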

# RAIDZ expansion — add one disk to existing RAIDZ vdev (OpenZFS 2.3+)
zpool attach tank raidz2-0 /dev/sdm

# Monitor progress
zpool status tank

Best practices

VMs & databases

Mirrored vdevs. 6 disks = 3 mirror pairs. Linear IOPS scaling. Easy expansion by adding more pairs. Add a mirrored special vdev for metadata acceleration.

Bulk storage

RAIDZ2 or RAIDZ3. Optimize for capacity and sequential throughput. Accept the random write penalty. Never RAIDZ1 on drives over 2TB.

Mixed workloads

Mirrors for data + special vdev (mirrored SSDs) for metadata. Best of both worlds.

Large scale (12+)

dRAID2 with distributed spares. Resilvers in minutes. See the dRAID recipe for sizing and tuning.

NFS / iSCSI

Add a mirrored SLOG (enterprise NVMe with PLP). Verify sync write pressure with zpool iostat -q first.

Expansion planning

Mirrors expand by adding pairs. RAIDZ expands by adding full vdevs (or single-disk expansion on 2.3+). Plan your growth path before creating the pool.
