Appliance Recipe

Distributed Storage with ZFS dRAID

This is for the large array crowd: 12+ disks, petabyte-scale, where resilver time on traditional RAIDZ would be unacceptable. dRAID distributes parity and spare capacity across every disk in the vdev. When a disk fails, every surviving disk participates in the rebuild, so a resilver that takes around 12 hours on RAIDZ2 can finish in well under an hour on dRAID.

What dRAID Is

Distributed Parity

Parity is spread across all disks in the vdev, not confined to fixed positions. Similar to Ceph's approach to redundancy, but built into ZFS at the vdev layer.

Distributed Hot Spares

Spare capacity is spread across every disk — not a dedicated idle disk sitting in a slot doing nothing. When a failure occurs, the spare space is already distributed and ready.

Massively Parallel Resilver

All surviving disks read AND write during rebuild. The bottleneck shifts from a single replacement disk to the aggregate bandwidth of the entire vdev.

Available Since OpenZFS 2.1.0

Production-ready since 2021. kldload ships current OpenZFS — dRAID works out of the box on every supported distro.

RAIDZ2 (8 disks):
  [D1][D2][D3][D4][D5][D6][P1][P2]
  Resilver: replacement disk is the bottleneck
  All surviving disks read, but ONE disk writes
  Time: hours to days

dRAID2 (8 disks, 1 distributed spare):
  [D+P+S scattered across all 8 disks]
  Resilver: all 7 surviving disks read AND write to distributed spare space
  The rebuild is a shuffle across existing disks, not a sequential copy
  Time: minutes

The key insight: in traditional RAIDZ, the replacement disk is a single sequential writer, so the rebuild can only go as fast as one disk can write. In dRAID, spare capacity is pre-distributed across every disk, so the rebuild is a parallel scatter/gather operation bounded by the aggregate bandwidth of all surviving disks. Swapping in a physical replacement later still copies data onto one new disk, but by then the pool is already back at full redundancy.

dRAID solves the scariest problem in large ZFS deployments: the resilver window. With 16TB drives and RAIDZ2, a resilver takes 12-24 hours. During those hours, you're running on reduced redundancy — one more failure and you lose data. The probability of a second failure during a 24-hour window on a 48-disk array is not negligible. dRAID shrinks that window from hours to minutes. A 20-minute resilver on a 24-disk array means your exposure window is 20 minutes, not 18 hours. For anyone running 12+ disks with multi-terabyte drives, dRAID isn't optional — it's the difference between theoretical risk and practical safety. For the full ZFS pool design analysis, see the ZFS Masterclass.
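To put a number on "not negligible", here is a rough sketch of the second-failure probability during a resilver window. It assumes independent failures at a constant annualized failure rate (AFR); the 2% AFR and the function name `second_failure_pct` are illustrative assumptions, not measured values. Real failures tend to be correlated (same batch, same shelf, rebuild stress), so treat the result as a lower bound.

```shell
#!/bin/sh
# Chance (in percent) that at least one surviving disk fails during the
# resilver window, assuming independent failures at a constant annual rate.
# Usage: second_failure_pct <surviving_disks> <window_hours> <afr_percent>
second_failure_pct() {
    awk -v n="$1" -v h="$2" -v afr="$3" 'BEGIN {
        p_disk = (afr / 100) * (h / 8760)      # per-disk failure chance in the window
        p_any  = 1 - exp(n * log(1 - p_disk))  # at least one of n disks fails
        printf "%.4f\n", p_any * 100
    }'
}

# 48-disk array with one disk already failed, hypothetical 2% AFR
second_failure_pct 47 24 2      # 24-hour window
second_failure_pct 47 0.25 2    # 15-minute window
```

The absolute numbers are small either way; the point is the ratio. Shrinking the window by roughly 100x shrinks the exposure by roughly 100x.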

When to Use dRAID

Use dRAID when:

12+ disks — below that, RAIDZ is simpler and the resilver time difference is negligible
Resilver time is critical — production systems that cannot afford extended degraded state
Large sequential workloads — video editing, backup targets, media archives, surveillance
Disk failure probability compounds — with 48 spinning disks, a second failure during a 12-hour resilver is not theoretical

Do NOT use dRAID for:

Small random I/O — databases belong on mirror vdevs, period
Fewer than 8 disks — RAIDZ is simpler, the overhead is not worth it
Pools where you need to add individual disks — dRAID requires adding whole vdevs

dRAID Topologies

dRAID1 — Single Parity (Distributed)

Like RAIDZ1 but with distributed parity and parallel resilver. Use with 12+ disks where single-parity risk is acceptable (non-critical data, or replicated elsewhere).

# 12 disks, single parity, 1 distributed spare
zpool create tank draid1:1s /dev/sd{a..l}

# Usable capacity: ~9.8 disks worth: (12 - 1 spare) x 8/9 with the default 8 data disks per group, before overhead
# Resilver: all 11 surviving disks participate

dRAID2 — Double Parity (Distributed) — Recommended

The sweet spot. Survives two simultaneous disk failures with resilver times measured in minutes instead of hours. Use with 12–48 disks.

# 12 disks, double parity, 1 distributed spare
zpool create tank draid2:1s /dev/sd{a..l}

# 24 disks, double parity, 2 distributed spares
zpool create tank draid2:2s /dev/sd{a..x}

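# 36 disks, double parity, 2 distributed spares, 9 data disks per group
# (bash brace expansion cannot run from sdz to sdaa, so list the two ranges separately)
zpool create tank draid2:2s:9d /dev/sd{a..z} /dev/sda{a..j}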

dRAID3 — Triple Parity

Maximum protection for massive arrays. Survives three simultaneous failures. Use with 24+ disks where data loss is unacceptable and resilver overlaps are likely.

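# 48 disks, triple parity, 3 distributed spares
# (two brace ranges: sda..sdz, then sdaa..sdav)
zpool create tank draid3:3s /dev/sd{a..z} /dev/sda{a..v}

# Usable capacity: ~32.7 disks worth: (48 - 3 spares) x 8/11 with the default 8 data disks per group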
# Can lose 3 disks simultaneously and still serve data
# Resilver after first failure: minutes, not days
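Bash brace expansion only handles single-letter ranges, so a range like `{a..aj}` is passed to zpool literally instead of expanding. For large device lists, a small generator is less error-prone. The helper below is a sketch; `sd_devices` is a hypothetical name, and it assumes the kernel numbered the disks contiguously from sda. In production, stable names under /dev/disk/by-id are usually the safer choice.

```shell
#!/bin/sh
# Print the first N sd-style device paths: sda..sdz, then sdaa, sdab, ...
# Handles up to 702 names (one- and two-letter suffixes).
# Usage: sd_devices <count>
sd_devices() {
    awk -v n="$1" 'BEGIN {
        letters = "abcdefghijklmnopqrstuvwxyz"
        for (i = 0; i < n; i++) {
            if (i < 26)
                name = substr(letters, i + 1, 1)
            else
                name = substr(letters, int(i / 26), 1) substr(letters, i % 26 + 1, 1)
            printf "/dev/sd%s\n", name
        }
    }'
}

# Show the last device of a 36-disk and a 48-disk list
sd_devices 36 | tail -n 1
sd_devices 48 | tail -n 1

# Hypothetical use with zpool (not executed here):
#   zpool create tank draid3:3s $(sd_devices 48)
```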

Anatomy of the Syntax

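zpool create tank draid[<parity>][:<data>d][:<children>c][:<spares>s] devices...

  parity   = 1, 2, or 3 (parity devices per redundancy group; defaults to 1)
  data     = (optional) data disks per group, controls stripe width; defaults to 8
  children = (optional) expected number of disks in the vdev; a cross-check, zpool create fails if the device list does not match
  spares   = (optional) number of distributed spares (e.g., 1s, 2s, 3s); defaults to 0

  The suffixed fields can appear in any order, so draid2:2s:9d and draid2:9d:2s are equivalent.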
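A rough way to estimate usable capacity for any of these layouts: parity is paid per redundancy group, not once per vdev, so the usable fraction is data / (data + parity) of the non-spare disks. The sketch below assumes the default of 8 data disks per group when none is given and ignores metadata, padding, and slop space; `draid_usable_disks` is a hypothetical helper name.

```shell
#!/bin/sh
# Usable capacity of one dRAID vdev, in "disks worth":
#   (children - spares) * data / (data + parity)
# Usage: draid_usable_disks <children> <parity> <spares> [data]
draid_usable_disks() {
    awk -v c="$1" -v p="$2" -v s="$3" -v d="${4:-8}" 'BEGIN {
        printf "%.1f\n", (c - s) * d / (d + p)
    }'
}

draid_usable_disks 12 1 1      # draid1:1s on 12 disks    -> 9.8
draid_usable_disks 24 2 2      # draid2:2s on 24 disks    -> 17.6
draid_usable_disks 36 2 2 9    # draid2:2s:9d on 36 disks -> 27.8
draid_usable_disks 48 3 3      # draid3:3s on 48 disks    -> 32.7
```

Multiply by the disk size to get raw usable terabytes. A wider group (larger data value) buys capacity at the cost of IOPS and a larger minimum allocation.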

Resilver Speed Comparison

Real-world resilver times on spinning rust (7200 RPM SATA). These are representative — your mileage varies with disk speed, fragmentation, and pool utilization. The point is the order of magnitude difference.

Config	Disks	Rebuild Target	Pool Used	Resilver Time
RAIDZ2	8x 16TB	16TB replacement	~60%	~12 hours
dRAID2:1s	8x 16TB	distributed	~60%	~45 minutes
RAIDZ2	24x 16TB	16TB replacement	~60%	~18 hours
dRAID2:2s	24x 16TB	distributed	~60%	~20 minutes
RAIDZ2	48x 16TB	16TB replacement	~60%	~24 hours
dRAID3:3s	48x 16TB	distributed	~60%	~15 minutes

The math is straightforward. In RAIDZ2 with 24 disks, the replacement disk writes at ~200 MB/s sustained: 16TB / 200 MB/s = ~22 hours for a full disk. In dRAID2 with 24 disks, the 23 surviving disks each write their portion in parallel: 16TB / (23 * 200 MB/s) = ~58 minutes for a full disk, or ~35 minutes at 60% utilization. This is an idealized estimate, and measured times depend on disk speed, layout, and load. The bottleneck shifts from one disk to all disks.
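The same back-of-envelope estimate as a reusable function. This is a sketch under simplifying assumptions: every writer sustains the same throughput, reads are never the bottleneck, and only the used fraction of the failed disk has to be rebuilt. A 24-disk vdev with one failed disk leaves 23 survivors. `rebuild_minutes` is a hypothetical name.

```shell
#!/bin/sh
# Idealized rebuild time in minutes:
#   data to rebuild / aggregate write bandwidth
# Usage: rebuild_minutes <disk_tb> <used_pct> <writers> <mb_per_s>
rebuild_minutes() {
    awk -v tb="$1" -v used="$2" -v w="$3" -v mbs="$4" 'BEGIN {
        bytes = tb * 1e12 * used / 100
        printf "%.0f\n", bytes / (w * mbs * 1e6) / 60
    }'
}

rebuild_minutes 16 100 1 200     # one replacement disk, full     -> 1333 (~22 hours)
rebuild_minutes 16 100 23 200    # 23 surviving disks, full       -> 58
rebuild_minutes 16 60 23 200     # 23 surviving disks, 60% used   -> 35
```

Plug in your own disk size and sustained write speed. The ratio between the first and second line is simply the number of parallel writers.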

Dataset Layout for Large Storage

tank/
├── archive/        recordsize=1M, compression=zstd
│   ├── video-raw/  # Production footage, 4K/8K masters
│   ├── video-edit/ # Active editing projects
│   └── video-final/# Delivered, graded finals
├── backup-targets/ recordsize=1M, compression=lz4
│   ├── site-a/     # ZFS recv from remote site
│   └── site-b/     # ZFS recv from second site
├── cold-storage/   recordsize=1M, compression=zstd-19
└── scratch/        recordsize=1M, sync=disabled

# Create the hierarchy
zfs create -o mountpoint=/tank/archive -o recordsize=1M \
    -o compression=zstd tank/archive
zfs create tank/archive/video-raw
zfs create tank/archive/video-edit
zfs create tank/archive/video-final

zfs create -o mountpoint=/tank/backup-targets -o recordsize=1M \
    -o compression=lz4 tank/backup-targets
zfs create tank/backup-targets/site-a
zfs create tank/backup-targets/site-b

# Cold storage: maximum compression, data is write-once-read-rarely
zfs create -o mountpoint=/tank/cold-storage -o recordsize=1M \
    -o compression=zstd-19 tank/cold-storage

# Scratch: fast writes, no sync overhead, data is expendable
zfs create -o mountpoint=/tank/scratch -o recordsize=1M \
    -o sync=disabled tank/scratch

Why recordsize=1M everywhere: Large sequential workloads (video, backups, archives) benefit from 1MB records. Fewer metadata operations, better compression ratios, and higher throughput. Do not use 1M recordsize for databases — those belong on mirror vdevs with 8K–16K recordsize, not on dRAID.

The compression tier matters here. lz4 for active data (near-zero CPU, modest compression). zstd for archive data (better ratio, more CPU on write, acceptable because the data is written once and rarely touched). zstd-19 for cold storage (maximum compression, significant CPU on write, but the data is write-once-read-never). off for already-compressed data (JPEG, H.264/H.265, compressed archives). The scratch dataset has sync=disabled: synchronous writes are acknowledged before they reach stable storage, so the last few seconds of writes can be lost in a crash, in exchange for much higher throughput on sync-heavy workloads. Only use this for data you can regenerate (render proxies, temp files, cache).

Monitoring dRAID

Pool and Spare Status

# Show pool status including distributed spare state
zpool status tank

# Example output during resilver (abridged, illustrative):
#   scan: resilver in progress, 45.2% done, 00:12:34 to go
#   NAME                          STATE     READ WRITE CKSUM
#   tank                          DEGRADED     0     0     0
#     draid2:8d:12c:1s-0          DEGRADED     0     0     0
#       sda                       ONLINE       0     0     0
#       spare-1                   DEGRADED     0     0     0
#         sdb                     FAULTED      0     0     0  # failed disk
#         draid2-0-0              ONLINE       0     0     0  # distributed spare active
#       sdc                       ONLINE       0     0     0
#       ...
#   spares
#     draid2-0-0                  INUSE     currently in use
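For dashboards or scripts, the progress figure can be pulled out of the status text. The filter below is a minimal sketch: it only assumes the scan section contains a number followed by "% done", since the exact wording of that section varies between OpenZFS versions. `resilver_pct` is a hypothetical helper name.

```shell
#!/bin/sh
# Read `zpool status` text on stdin and print the first "NN.N% done" figure.
# Prints nothing if no progress line is present.
resilver_pct() {
    sed -n 's/.*[^0-9.]\([0-9][0-9.]*\)% done.*/\1/p' | head -n 1
}

# Demo on a sample line; on a live system use: zpool status tank | resilver_pct
printf '  scan: resilver in progress, 45.2%% done, 00:12:34 to go\n' | resilver_pct
```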

I/O During Resilver

# Watch per-disk I/O during resilver — every disk should be busy
zpool iostat -v tank 5

# All surviving disks should show read AND write activity
# This is the key difference from RAIDZ: in RAIDZ only the replacement
# disk shows heavy writes. In dRAID, every disk writes.

eBPF Latency Monitoring During Rebuild

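# Track I/O latency per disk during resilver
# Ensures no single disk is a bottleneck
bpftrace -e '
tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
    // microseconds from issue to completion, one histogram per device
    @lat_us[args->dev] = hist((nsecs - @start[args->dev, args->sector]) / 1000);
    delete(@start[args->dev, args->sector]);
}
interval:s:10 { print(@lat_us); clear(@lat_us); }
END { clear(@start); }
'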

# Or use kldload built-in monitoring
k-zfs-monitor --resilver

Alerting on Resilver Events

# ZFS Event Daemon (zed) handles this automatically on kldload
# Configure email alerts in /etc/zfs/zed.d/zed.rc:
ZED_EMAIL_ADDR="admin@example.com"
ZED_NOTIFY_VERBOSE=1

# Key events to monitor (list recent ones with: zpool events):
#   resilver_start   (a rebuild onto a distributed spare has begun)
#   resilver_finish  (rebuild complete)
#   statechange      (a vdev changed state, e.g. a disk went FAULTED)
#   delay            (slow I/O detected, possible failing disk)

# Check distributed spare state (each spare shows as AVAIL or INUSE)
zpool status tank | grep -A 2 spares
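If you also want a check you can call from cron or an external monitoring system, the scripted output of zpool list is easy to filter. A minimal sketch, assuming tab-separated "name, health" input as produced by `zpool list -H -o name,health`; `unhealthy_pools` is a hypothetical helper name.

```shell
#!/bin/sh
# Read "name<TAB>health" lines and print every pool that is not ONLINE.
# Exit status is 1 if any such pool was found, 0 otherwise.
unhealthy_pools() {
    awk -F '\t' '$2 != "ONLINE" { print $1 ": " $2; bad = 1 } END { exit bad }'
}

# Demo with sample data; on a live system use:
#   zpool list -H -o name,health | unhealthy_pools
printf 'tank\tDEGRADED\nrpool\tONLINE\n' | unhealthy_pools || echo "attention needed"
```

The non-zero exit status makes it easy to chain with a mailer or a webhook call of your choice.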

Expanding dRAID Pools

You cannot add individual disks to a dRAID vdev. Growth means adding entire vdevs.

# Start with one 12-disk dRAID2 vdev
zpool create tank draid2:1s /dev/sd{a..l}

# Later: add a second 12-disk dRAID2 vdev — pool doubles in size
zpool add tank draid2:1s /dev/sd{m..x}

# The pool now stripes across two dRAID2 vdevs
# Both vdevs resilver independently and in parallel

Planning for growth: Decide your vdev width at build time and stick with it. If you start with 12-disk vdevs, buy disks in groups of 12. Mixing vdev widths works but complicates capacity planning and makes the pool harder to reason about.

Rebalancing After Expansion

# ZFS does not automatically rebalance data across new vdevs.
# New writes go to the vdev with the most free space (roughly).
# To rebalance existing data, copy it through ZFS:

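# Option 1: zfs send/recv within the same pool
# (needs free space for a second copy; -r/-R include the child datasets,
#  -u keeps the received copy unmounted so mountpoints do not collide)
zfs snapshot -r tank/archive@rebalance
zfs send -R tank/archive@rebalance | zfs recv -u tank/archive-new
zfs rename tank/archive tank/archive-old
zfs rename tank/archive-new tank/archive
zfs destroy -r tank/archive-old
zfs mount -a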

# Option 2: for small datasets, cp and delete
# Option 3: accept uneven distribution — ZFS handles it fine,
#            new writes will naturally fill the new vdev

dRAID + Replication

dRAID and replication protect against different failure modes. dRAID protects against disk failure (hardware). Replication protects against site failure (fire, flood, ransomware, human error). You need both. A 48-disk dRAID3 array that survives three simultaneous disk failures is still gone if the building floods. Replication to a remote site over a WireGuard tunnel means you have a second copy that survives anything short of a global catastrophe. The incremental nature of zfs send means even petabyte-scale replication is practical — only changed blocks transfer.


# syncoid for offsite backup of dRAID pools
# Incremental send/recv over WireGuard

# Initial full send (will take a while for petabyte-scale)
syncoid -r tank/archive remote-host:backup/archive

# Subsequent runs send only changed blocks
# Run every hour via cron
cat > /etc/cron.d/draid-replicate << 'EOF'
0 * * * * root syncoid -r --no-sync-snap tank/archive remote-host:backup/archive 2>&1 | logger -t zfs-replicate
0 * * * * root syncoid -r --no-sync-snap tank/backup-targets remote-host:backup/targets 2>&1 | logger -t zfs-replicate
EOF
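How long "a while" is for the initial full send depends almost entirely on the link. A quick estimate, as a sketch: it assumes the link is the bottleneck and ignores compression in the send stream; the efficiency factor is an assumed allowance for protocol and tunnel overhead, and `send_days` is a hypothetical helper name.

```shell
#!/bin/sh
# Days needed to move <terabytes> over a link of <gbit_per_s>,
# optionally derated by an efficiency percentage (default 100).
# Usage: send_days <terabytes> <gbit_per_s> [efficiency_pct]
send_days() {
    awk -v tb="$1" -v gbps="$2" -v eff="${3:-100}" 'BEGIN {
        seconds = tb * 8e12 / (gbps * 1e9 * eff / 100)
        printf "%.1f\n", seconds / 86400
    }'
}

send_days 1000 10       # 1 PB over 10 Gbit/s at line rate  -> 9.3
send_days 1000 1 80     # 1 PB over 1 Gbit/s at 80%         -> 115.7
```

If the number is measured in months, seed the remote pool locally and ship the hardware, then let the hourly incrementals take over.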

WireGuard Tunnel for Replication

# Dedicated WireGuard tunnel for replication traffic
# See the Firewall & Gateway recipe for full WireGuard setup

cat > /etc/wireguard/wg-repl.conf << EOF
[Interface]
Address = 10.99.0.1/30
PrivateKey = $(cat /etc/wireguard/repl-private.key)
ListenPort = 51821

[Peer]
PublicKey = 
AllowedIPs = 10.99.0.2/32
Endpoint = remote-host.example.com:51821
PersistentKeepalive = 25
EOF

# The config embeds the private key, so restrict it to root
chmod 600 /etc/wireguard/wg-repl.conf
systemctl enable --now wg-quick@wg-repl

The 3-2-1 Rule at Petabyte Scale

Three copies of data, two different media types, one offsite. At petabyte scale:

Copy 1: dRAID pool (production, fast access)
Copy 2: ZFS replication to a second dRAID pool at a remote site
Copy 3: Cold storage — tape (LTO-9 at 18TB/tape), or a third site with slower disks and zstd-19 compression

At 1 PB, a full tape backup is ~56 LTO-9 tapes. Incremental ZFS send/recv keeps the remote replicas current without retransmitting the entire dataset.
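The tape count is a ceiling division. The sketch below assumes native (uncompressed) capacity, 18 TB for LTO-9, and no spare tapes or per-tape overhead; `tapes_needed` is a hypothetical helper name.

```shell
#!/bin/sh
# Number of tapes needed for <terabytes> at <tb_per_tape> (default 18, LTO-9 native).
# Usage: tapes_needed <terabytes> [tb_per_tape]
tapes_needed() {
    awk -v tb="$1" -v cap="${2:-18}" 'BEGIN {
        n = int(tb / cap)
        if (n * cap < tb) n++      # round up to a whole tape
        print n
    }'
}

tapes_needed 1000      # 1 PB on LTO-9 -> 56
```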

Hardware for Large Arrays

HBA Cards

Use HBA (Host Bus Adapter) cards in IT mode — never hardware RAID controllers. ZFS needs direct access to disks. Hardware RAID hides disk errors from ZFS and defeats the entire purpose of end-to-end checksumming.

Recommended HBAs:
  Broadcom (LSI) SAS 9300/9400 series: 12 Gbps SAS, IT mode
  Broadcom SAS 9500 series: 12 Gbps SAS on PCIe 4.0, NVMe support (24 Gbps SAS arrives with the 9600 series)
  Dell HBA330/HBA355i: rebranded LSI, same firmware

Flash to IT mode firmware if the card ships in IR (RAID) mode.
kldload includes mpt3sas and mpi3mr kernel modules for these controllers.

SAS Expanders and Disk Shelves

For 24+ disks:
  SAS expander backplane (HP D3600/D3700, Dell MD1200/MD1400, Supermicro JBOD)
  One SAS HBA connects to one or more JBODs via SFF-8644 cables
  Each JBOD holds 12-60 disks depending on form factor

Topology:
  [Server] --SFF-8644--> [SAS Expander JBOD 1: 24 disks]
            --SFF-8644--> [SAS Expander JBOD 2: 24 disks]

  Total: 48 disks on one HBA, two cables
  dRAID2:2s across all 48 disks = one massive vdev

ECC RAM Sizing

The rule of thumb: 1 GB of RAM per TB of storage. This is for the ARC (Adaptive Replacement Cache). You can run with less, but the ARC is what makes ZFS fast.

12x 16TB disks = 192TB raw, ~150TB usable → 150 GB RAM (minimum 64 GB)
24x 16TB disks = 384TB raw, ~300TB usable → 300 GB RAM (minimum 128 GB)
48x 16TB disks = 768TB raw, ~600TB usable → 512-768 GB RAM

Always ECC. Non-ECC RAM can silently corrupt data in the ARC and ZFS
will happily write the corrupted data back to disk. ECC is non-negotiable
for production storage.

The ECC requirement is not ZFS being precious — it's physics. ZFS checksums every block on disk, so it detects corruption on read. But if the corruption happens in RAM (a bit flip in the ARC cache), ZFS writes the corrupted block to disk with a valid checksum, because the checksum is computed from the corrupted data. ECC RAM detects and corrects single-bit errors before they reach ZFS. Without ECC, you have end-to-end checksumming with a hole in the middle. For a homelab NAS with 16GB of RAM, the risk is low. For a 48-disk production array with 512GB of RAM handling petabytes of irreplaceable data, ECC is non-negotiable.

NVMe for SLOG and Special Vdev

# SLOG (Separate Log) — accelerates synchronous writes
# Only helps if your workload does sync writes (NFS, databases, iSCSI)
zpool add tank log mirror /dev/nvme0n1p1 /dev/nvme1n1p1

# Special vdev — stores metadata and small blocks on fast NVMe
# Dramatically improves ls, find, stat operations on large pools
zpool add tank special mirror /dev/nvme0n1p2 /dev/nvme1n1p2

# Set the small-block threshold: blocks of this size or smaller go to the special vdev
zfs set special_small_blocks=64k tank

# ALWAYS mirror your special vdev. If you lose it, you lose the pool.

Use Cases

Video Production

4K/8K editing requires sustained sequential reads at 800+ MB/s. dRAID2 across 24 disks delivers aggregate throughput that saturates 10GbE. Take a snapshot before every render and roll back a bad color grade in seconds. ZFS compression on ProRes/DNxHR proxies can save 20-30% at low CPU cost with zstd-1; check the compressratio property on your own footage before counting on it.

Backup Target

Receive ZFS send streams from 100+ machines into one dRAID pool. Each source gets its own dataset with independent snapshot schedules. Compression with lz4 is free throughput. If a disk dies, the resilver completes before your next backup window even opens.

Surveillance NVR

24/7 writes from 50+ cameras at 10-30 Mbps each. dRAID handles the sustained write load. Oldest footage is pruned by destroying expired snapshots on a schedule (ZFS has no built-in expiry, so use a retention tool or a cron job), with no manual cleanup. Resilver in minutes means no surveillance gap.
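To size an NVR pool, convert camera count and bitrate into sustained write load and daily growth. A sketch assuming constant-bitrate streams and decimal units (1 TB = 1,000,000 MB); `camera_ingest` is a hypothetical helper name.

```shell
#!/bin/sh
# Aggregate ingest for N cameras at a given per-camera bitrate.
# Prints "<MB/s> <TB/day>".
# Usage: camera_ingest <cameras> <mbit_per_s_each>
camera_ingest() {
    awk -v n="$1" -v mbps="$2" 'BEGIN {
        mb_s = n * mbps / 8                       # megabits -> megabytes per second
        printf "%.1f %.1f\n", mb_s, mb_s * 86400 / 1e6
    }'
}

camera_ingest 50 10    # -> 62.5 5.4
camera_ingest 50 30    # -> 187.5 16.2
```

Multiply the TB/day figure by your retention period in days to get the capacity the pool has to hold before pruning kicks in.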

Scientific Data

Genomics sequencers produce terabytes per run. Climate models generate petabytes per simulation. Particle physics detectors write continuously. These datasets never shrink. dRAID3 with triple parity protects irreplaceable data while zstd-19 compression can cut storage costs by 40-60% on compressible formats.

CDN Origin

Serve static assets from a massive dRAID pool. Sequential reads dominate. Replicate to edge nodes with zfs send — only changed blocks transfer. A failed disk rebuilds in minutes, not the hours it would take on RAIDZ with multi-terabyte drives.

Limitations

dRAID is not a silver bullet. Be honest about what it costs:

Higher overhead for small random I/O — dRAID's fixed stripe width means small random writes touch more disks than RAIDZ. For database workloads, mirror vdevs remain the correct choice.
Cannot add individual disks — you must add entire vdevs. Plan your disk purchasing in groups that match your vdev width.
Distributed spares cannot be removed once allocated — the spare space is woven into the vdev layout. You choose the spare count at zpool create time and live with it.
Minimum 3 disks per vdev — but realistically, do not use dRAID below 8 disks. Below 12, the resilver time advantage over RAIDZ is marginal.
Fixed redundancy group size — once created, you cannot change the parity level or spare count. Choose wisely. dRAID2:2s is the conservative default for most large arrays.
Requires OpenZFS 2.1.0+ — kldload ships current OpenZFS, so this is not a constraint if you are using this project. On older systems, check your version with zfs --version.
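The small-block cost can be made concrete. Because dRAID uses a fixed stripe width, every allocation is padded up to a full stripe: data disks times sector size. The sketch below computes that minimum allocation; it assumes 4K sectors (ashift=12) unless told otherwise, and `draid_min_alloc_kib` is a hypothetical helper name.

```shell
#!/bin/sh
# Minimum allocation of a dRAID vdev in KiB: data disks x sector size (2^ashift).
# Usage: draid_min_alloc_kib <data_disks> [ashift]
draid_min_alloc_kib() {
    awk -v d="$1" -v a="${2:-12}" 'BEGIN { print d * (2 ^ a) / 1024 }'
}

draid_min_alloc_kib 8       # default 8 data disks, 4K sectors -> 32
draid_min_alloc_kib 9 12    # a 9d layout                      -> 36
```

Anything smaller than that figure, including metadata, still occupies a full stripe. This is why pairing a dRAID pool with a mirrored special vdev for metadata and small blocks is worth considering.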

Complete Build: 24-Disk dRAID2 Storage Server

# Install kldload
cat > /tmp/answers.env << 'EOF'
KLDLOAD_DISTRO=centos
KLDLOAD_DISK=/dev/nvme0n1
KLDLOAD_HOSTNAME=storage01
KLDLOAD_USERNAME=admin
KLDLOAD_PASSWORD=changeme
KLDLOAD_PROFILE=server
KLDLOAD_NET_METHOD=dhcp
EOF
kldload-install-target --config /tmp/answers.env

# After reboot, create the dRAID pool
# 24x 16TB SAS disks on an LSI 9400 HBA
# Double parity, 2 distributed spares
zpool create \
    -o ashift=12 \
    -o autotrim=on \
    -O compression=lz4 \
    -O atime=off \
    -O xattr=sa \
    -O dnodesize=auto \
    -O recordsize=1M \
    tank draid2:2s /dev/sd{a..x}

# Verify the topology
zpool status tank

# Add NVMe special vdev for metadata acceleration
zpool add tank special mirror /dev/nvme1n1p1 /dev/nvme2n1p1
zfs set special_small_blocks=64k tank

# Add mirrored SLOG for synchronous writes (NFS, iSCSI)
zpool add tank log mirror /dev/nvme1n1p2 /dev/nvme2n1p2

# Create the dataset hierarchy
zfs create -o mountpoint=/tank/archive -o compression=zstd tank/archive
zfs create tank/archive/video-raw
zfs create tank/archive/video-edit
zfs create tank/archive/video-final

zfs create -o mountpoint=/tank/backup-targets -o compression=lz4 tank/backup-targets
zfs create tank/backup-targets/site-a
zfs create tank/backup-targets/site-b

zfs create -o mountpoint=/tank/cold-storage -o compression=zstd-19 tank/cold-storage
zfs create -o mountpoint=/tank/scratch -o sync=disabled tank/scratch

# Snapshot schedule
cat > /etc/cron.d/draid-snapshots << 'CRON'
# Hourly snapshots on active datasets (retention target: 48; pruning needs a separate job)
0 * * * * root zfs snapshot -r tank/archive@auto-$(date +\%Y\%m\%d-\%H\%M) 2>&1 | logger -t zfs-snap

# Daily snapshots on backup targets (retention target: 30)
0 2 * * * root zfs snapshot -r tank/backup-targets@auto-$(date +\%Y\%m\%d) 2>&1 | logger -t zfs-snap

# Weekly snapshots on cold storage (retention target: 52)
0 3 * * 0 root zfs snapshot -r tank/cold-storage@auto-$(date +\%Y\%m\%d) 2>&1 | logger -t zfs-snap
CRON

# Enable ZFS event monitoring
systemctl enable --now zfs-zed

echo "Pool ready. $(zpool list -H -o size,free tank | awk '{print "Size: "$1", Free: "$2}')"
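The cron entries above only create snapshots; nothing in them enforces the "keep N" counts. One way to close that gap is a small filter that selects everything older than the newest N. This is a sketch, not a replacement for a full retention tool: `snapshots_to_prune` is a hypothetical helper, and the destroy pipeline in the comment is an assumption you should dry-run first.

```shell
#!/bin/sh
# Read snapshot names on stdin, oldest first, and print the ones that exceed
# the retention count (everything except the newest <keep>).
# Usage: ... | snapshots_to_prune <keep>
snapshots_to_prune() {
    awk -v keep="$1" '{ name[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print name[i] }'
}

# Demo with three sample names, keeping the newest two
printf '%s\n' tank/archive@auto-1 tank/archive@auto-2 tank/archive@auto-3 |
    snapshots_to_prune 2

# Hypothetical use on a live system (review the list before destroying):
#   zfs list -H -t snapshot -o name -s creation -d 1 tank/archive \
#       | grep '@auto-' | snapshots_to_prune 48 \
#       | while read -r snap; do zfs destroy -r "$snap"; done
```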

From here, point your backup agents at /tank/backup-targets, mount /tank/archive via NFS or SMB for your edit bays, and let syncoid handle offsite replication. The dRAID pool handles the rest — fast resilvers, parallel I/O, and distributed spare activation without human intervention.
