Distributed Storage with ZFS dRAID
This is for the large array crowd — 12+ disks, petabyte-scale, where resilver time on traditional RAIDZ would be unacceptable. dRAID distributes parity AND spare capacity across every disk in the vdev. When a disk fails, EVERY surviving disk participates in the rebuild. A 12-hour RAIDZ2 resilver becomes 30 minutes on dRAID.
What dRAID Is
Distributed Parity
Parity is spread across all disks in the vdev, not confined to fixed positions. Similar to Ceph's approach to redundancy, but built into ZFS at the vdev layer.
Distributed Hot Spares
Spare capacity is spread across every disk — not a dedicated idle disk sitting in a slot doing nothing. When a failure occurs, the spare space is already distributed and ready.
Massively Parallel Resilver
All surviving disks read AND write during rebuild. The bottleneck shifts from a single replacement disk to the aggregate bandwidth of the entire vdev.
Available Since OpenZFS 2.1.0
Production-ready since 2021. kldload ships current OpenZFS — dRAID works out of the box on every supported distro.
RAIDZ2 (8 disks):
[D1][D2][D3][D4][D5][D6][P1][P2]
Resilver: replacement disk is the bottleneck
All surviving disks read, but ONE disk writes
Time: hours to days
dRAID2 (8 disks, 1 distributed spare):
[D+P+S scattered across all 8 disks]
Resilver: all 7 surviving disks read AND write to distributed spare space
The rebuild is a shuffle across existing disks, not a sequential copy
Time: minutes
The key insight: in traditional RAIDZ, the replacement disk is a single sequential writer — it can only rebuild as fast as one disk can write. In dRAID, spare capacity is pre-distributed across every disk, so the rebuild is a parallel scatter/gather operation bounded by the aggregate bandwidth of all surviving disks.
When to Use dRAID
Use dRAID when:
- 12+ disks — below that, RAIDZ is simpler and the resilver time difference is negligible
- Resilver time is critical — production systems that cannot afford extended degraded state
- Large sequential workloads — video editing, backup targets, media archives, surveillance
- Disk failure probability compounds — with 48 spinning disks, a second failure during a 12-hour resilver is not theoretical
Do NOT use dRAID for:
- Small random I/O — databases belong on mirror vdevs, period
- Fewer than 8 disks — RAIDZ is simpler, the overhead is not worth it
- Pools where you need to add individual disks — dRAID requires adding whole vdevs
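To put a rough number on the compounding-failure point above, the sketch below estimates the chance that at least one more disk fails while a resilver is running. It assumes independent failures at a constant annualized failure rate (AFR). The function name `second_failure_pct` and the 2% AFR are illustrative assumptions, not measured values, and real arrays often see correlated failures that push the risk higher.

```shell
# second_failure_pct SURVIVING_DISKS AFR RESILVER_HOURS
# Prints the probability (in percent) that at least one more disk fails
# during the resilver window, assuming independent exponential failures.
second_failure_pct() {
  awk -v n="$1" -v afr="$2" -v h="$3" \
    'BEGIN { printf "%.4f\n", (1 - exp(-n * afr * h / 8760)) * 100 }'
}

second_failure_pct 47 0.02 12    # 48-disk array, 12-hour RAIDZ resilver
second_failure_pct 47 0.02 0.5   # same array, 30-minute dRAID resilver
```

The absolute numbers depend entirely on the assumed AFR. What carries over is the ratio: the exposure scales with how long the pool stays degraded.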
dRAID Topologies
dRAID1 — Single Parity (Distributed)
Like RAIDZ1 but with distributed parity and parallel resilver. Use with 12+ disks where single-parity risk is acceptable (non-critical data, or replicated elsewhere).
# 12 disks, single parity, 1 distributed spare
zpool create tank draid1:1s /dev/sd{a..l}
# Usable capacity: ~9.8 disks worth: (12 - 1 spare) x 8/9, because parity costs
# 1 disk per 8-data-disk group (the default group width), before overhead
# Resilver: all 11 surviving disks participate
dRAID2 — Double Parity (Distributed) — Recommended
The sweet spot. Survives two simultaneous disk failures with resilver times measured in minutes instead of hours. Use with 12–48 disks.
# 12 disks, double parity, 1 distributed spare
zpool create tank draid2:1s /dev/sd{a..l}
# 24 disks, double parity, 2 distributed spares
zpool create tank draid2:2s /dev/sd{a..x}
# 36 disks, double parity, 2 distributed spares, 9 data disks per group
# bash cannot expand {a..aj}; list sda..sdz and sdaa..sdaj separately
zpool create tank draid2:2s:9d /dev/sd{a..z} /dev/sda{a..j}
dRAID3 — Triple Parity
Maximum protection for massive arrays. Survives three simultaneous failures. Use with 24+ disks where data loss is unacceptable and resilver overlaps are likely.
# 48 disks, triple parity, 3 distributed spares
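# bash cannot expand {a..av}; list sda..sdz and sdaa..sdav separately
zpool create tank draid3:3s /dev/sd{a..z} /dev/sda{a..v}
# Usable capacity: ~32.7 disks worth with the default 8 data disks per group:
# (48 - 3 spares) x 8/11, because every group of 8 data disks carries 3 parity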
# Can lose 3 disks simultaneously and still serve data
# Resilver after first failure: minutes, not days
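Bash brace expansion only handles single-character ranges, so device lists that run past sdz are easy to get wrong. As a minimal sketch, the hypothetical helper `sd_names` below prints the first N kernel names in sda..sdz, sdaa..sdzz order. It assumes the disks really are enumerated contiguously from sda, which is not guaranteed across reboots; stable /dev/disk/by-id paths are the safer choice for a real pool.

```shell
# sd_names N: print the first N device paths in kernel naming order
# (sda..sdz, then sdaa..sdzz). Illustrative helper, not part of ZFS.
# Valid for N up to 702.
sd_names() {
  awk -v n="$1" 'BEGIN {
    letters = "abcdefghijklmnopqrstuvwxyz"
    for (i = 0; i < n; i++) {
      if (i < 26)
        suffix = substr(letters, i + 1, 1)
      else
        suffix = substr(letters, int((i - 26) / 26) + 1, 1) \
                 substr(letters, (i - 26) % 26 + 1, 1)
      print "/dev/sd" suffix
    }
  }'
}

sd_names 48 | tail -n 1   # last device of a 48-disk list

# Usage sketch (not run here):
#   zpool create tank draid3:3s $(sd_names 48)
```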
Anatomy of the Syntax
zpool create tank draid<parity>:<spares>s[:<data>d][:<children>c] devices...
parity = 1, 2, or 3 (number of parity disks per redundancy group)
spares = number of distributed spares (e.g., 1s, 2s, 3s)
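data = (optional) data disks per group; controls stripe width (defaults to 8 when the vdev is large enough)
children = (optional) expected number of children; zpool create fails if the device count differs, a useful cross-check for long device lists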
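The syntax fields map onto usable capacity in a simple way. The sketch below is a rough estimator, assuming usable space is approximately (children - spares) x data / (data + parity) and that the data width defaults to 8 when not given. The function name `draid_usable` is made up for illustration, and the result ignores padding and metadata overhead.

```shell
# draid_usable CHILDREN PARITY SPARES [DATA]
# Prints the approximate usable capacity of one dRAID vdev, in disks' worth.
draid_usable() {
  awk -v c="$1" -v p="$2" -v s="$3" -v d="${4:-0}" 'BEGIN {
    if (d == 0) {          # no explicit data width given
      d = c - p - s
      if (d > 8) d = 8     # assumed default: 8 data disks per group
    }
    printf "%.1f\n", (c - s) * d / (d + p)
  }'
}

draid_usable 12 2 1      # draid2:1s on 12 disks
draid_usable 36 2 2 9    # draid2:2s:9d on 36 disks
draid_usable 48 3 3      # draid3:3s on 48 disks
```

Widening the data group (for example `:9d` instead of the default) raises the usable fraction, at the cost of a wider stripe for every block.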
Resilver Speed Comparison
Real-world resilver times on spinning rust (7200 RPM SATA). These are representative — your mileage varies with disk speed, fragmentation, and pool utilization. The point is the order of magnitude difference.
| Config | Disks | Rebuild Target | Pool Used | Resilver Time |
|---|---|---|---|---|
| RAIDZ2 | 8x 16TB | 16TB replacement | ~60% | ~12 hours |
| dRAID2:1s | 8x 16TB | distributed | ~60% | ~45 minutes |
| RAIDZ2 | 24x 16TB | 16TB replacement | ~60% | ~18 hours |
| dRAID2:2s | 24x 16TB | distributed | ~60% | ~20 minutes |
| RAIDZ2 | 48x 16TB | 16TB replacement | ~60% | ~24 hours |
| dRAID3:3s | 48x 16TB | distributed | ~60% | ~15 minutes |
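The math is straightforward. In RAIDZ2 with 24 disks, the single replacement disk writes at ~200 MB/s sustained, so rewriting a full 16 TB disk takes 16 TB / 200 MB/s = ~22 hours. In dRAID2 with 24 disks, the 23 surviving disks each write their share of the distributed spare in parallel: 16 TB / (23 * 200 MB/s) = ~58 minutes for a completely full disk. A resilver only rebuilds allocated data, so at ~60% utilization that drops to roughly 35 minutes. These are idealized figures, but the bottleneck clearly shifts from one disk to all disks.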
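The same back-of-envelope estimate can be written as a small function: data to rebuild divided by aggregate write bandwidth. This is an idealized sketch. The name `resilver_minutes` is hypothetical, and the model ignores read contention, fragmentation, and production I/O competing with the rebuild.

```shell
# resilver_minutes TB_TO_REBUILD PARALLEL_WRITERS MB_PER_SEC_PER_DISK
# Prints the estimated rebuild time in whole minutes (decimal TB and MB).
resilver_minutes() {
  awk -v tb="$1" -v n="$2" -v mbps="$3" \
    'BEGIN { printf "%.0f\n", tb * 1000000 / (n * mbps) / 60 }'
}

resilver_minutes 16 1 200     # RAIDZ2: one replacement disk writes everything
resilver_minutes 16 23 200    # dRAID, 24 disks: 23 survivors write in parallel
resilver_minutes 9.6 23 200   # same, with only 60% of the disk allocated
```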
Dataset Layout for Large Storage
tank/
├── archive/ recordsize=1M, compression=zstd
│ ├── video-raw/ # Production footage, 4K/8K masters
│ ├── video-edit/ # Active editing projects
│ └── video-final/ # Delivered, graded finals
├── backup-targets/ recordsize=1M, compression=lz4
│ ├── site-a/ # ZFS recv from remote site
│ └── site-b/ # ZFS recv from second site
├── cold-storage/ recordsize=1M, compression=zstd-19
└── scratch/ recordsize=1M, sync=disabled
# Create the hierarchy
zfs create -o mountpoint=/tank/archive -o recordsize=1M \
-o compression=zstd tank/archive
zfs create tank/archive/video-raw
zfs create tank/archive/video-edit
zfs create tank/archive/video-final
zfs create -o mountpoint=/tank/backup-targets -o recordsize=1M \
-o compression=lz4 tank/backup-targets
zfs create tank/backup-targets/site-a
zfs create tank/backup-targets/site-b
# Cold storage: maximum compression, data is write-once-read-rarely
zfs create -o mountpoint=/tank/cold-storage -o recordsize=1M \
-o compression=zstd-19 tank/cold-storage
# Scratch: fast writes, no sync overhead, data is expendable
zfs create -o mountpoint=/tank/scratch -o recordsize=1M \
-o sync=disabled tank/scratch
Why recordsize=1M everywhere: Large sequential workloads (video, backups, archives) benefit from 1MB records. Fewer metadata operations, better compression ratios, and higher throughput. Do not use 1M recordsize for databases — those belong on mirror vdevs with 8K–16K recordsize, not on dRAID.
lz4 for active data (near-zero CPU, modest compression). zstd for archive data (better ratio, more CPU, but the data is rarely read). zstd-19 for cold storage (maximum compression, significant CPU, but the data is write-once-read-rarely). off for already-compressed data (JPEG, H.264/H.265, compressed archives).
The scratch dataset has sync=disabled: writes aren't guaranteed to survive a crash, but throughput on synchronous writes can be 2-3x higher. Only use this for data you can regenerate (render proxies, temp files, cache).
Monitoring dRAID
Pool and Spare Status
# Show pool status including distributed spare state
zpool status tank
# Example output during resilver:
# NAME STATE READ WRITE CKSUM
# tank DEGRADED 0 0 0
# draid2:8d:12c:1s-0 DEGRADED 0 0 0
# sda ONLINE 0 0 0
# sdb FAULTED 0 0 0 # failed disk
# sdc ONLINE 0 0 0
# ...
# spare-0 ONLINE 0 0 0 # distributed spare active
# scan: resilver in progress, 45.2% done, 00:12:34 to go
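For scripted monitoring, the progress figure can be pulled out of the status text. The sketch below is a hypothetical helper, `resilver_progress`, that reads `zpool status` output on stdin and prints the percentage once it has seen a "resilver in progress" line. It assumes the "% done" wording shown in the example output above; the exact layout varies between OpenZFS versions, so treat the pattern as a starting point.

```shell
# resilver_progress: read `zpool status` text on stdin and print the
# percent-done figure if a resilver is running; print nothing otherwise.
resilver_progress() {
  awk '
    /resilver in progress/ { active = 1 }
    active && match($0, /[0-9.]+% done/) {
      # Strip the trailing "% done" (6 characters) from the match
      print substr($0, RSTART, RLENGTH - 6)
      exit
    }'
}

# Demo on the sample line from the example output
printf 'scan: resilver in progress, 45.2%% done, 00:12:34 to go\n' | resilver_progress

# Usage sketch on a live system (not run here):
#   zpool status tank | resilver_progress
```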
I/O During Resilver
# Watch per-disk I/O during resilver — every disk should be busy
zpool iostat -v tank 5
# All surviving disks should show read AND write activity
# This is the key difference from RAIDZ: in RAIDZ only the replacement
# disk shows heavy writes. In dRAID, every disk writes.
eBPF Latency Monitoring During Rebuild
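# Track I/O latency per disk during resilver
# Ensures no single disk is a bottleneck
bpftrace -e '
tracepoint:block:block_rq_issue {
  @start[args->dev, args->sector] = nsecs;
}
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
  // Latency in microseconds, one histogram per device
  @lat_us[args->dev] = hist((nsecs - @start[args->dev, args->sector]) / 1000);
  delete(@start[args->dev, args->sector]);
}
interval:s:10 { print(@lat_us); clear(@lat_us); }
'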
# Or use kldload built-in monitoring
k-zfs-monitor --resilver
Alerting on Resilver Events
# ZFS Event Daemon (zed) handles this automatically on kldload
# Configure email alerts in /etc/zfs/zed.d/zed.rc:
ZED_EMAIL_ADDR="admin@example.com"
ZED_NOTIFY_VERBOSE=1
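# Key event classes to monitor (as listed by `zpool events`):
# resilver_start: a rebuild has begun (for example, a distributed spare has activated)
# resilver_finish: rebuild complete
# statechange: a vdev changed state (for example, a disk was faulted)
# delay: slow I/O detected (possible failing disk)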
# Track spare capacity utilization
zpool list -v tank | grep spare
Expanding dRAID Pools
You cannot add individual disks to a dRAID vdev. Growth means adding entire vdevs.
# Start with one 12-disk dRAID2 vdev
zpool create tank draid2:1s /dev/sd{a..l}
# Later: add a second 12-disk dRAID2 vdev — pool doubles in size
zpool add tank draid2:1s /dev/sd{m..x}
# The pool now stripes across two dRAID2 vdevs
# Both vdevs resilver independently and in parallel
Planning for growth: Decide your vdev width at build time and stick with it. If you start with 12-disk vdevs, buy disks in groups of 12. Mixing vdev widths works but complicates capacity planning and makes the pool harder to reason about.
Rebalancing After Expansion
# ZFS does not automatically rebalance data across new vdevs.
# New writes go to the vdev with the most free space (roughly).
# To rebalance existing data, copy it through ZFS:
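# Option 1: zfs send/recv within the same pool
# (needs free space for a second copy; pause writers to tank/archive first)
zfs snapshot -r tank/archive@rebalance
# -R carries child datasets and properties; -u keeps the copy unmounted
zfs send -R tank/archive@rebalance | zfs recv -u tank/archive-new
zfs rename tank/archive tank/archive-old
zfs rename tank/archive-new tank/archive
# Verify the new copy before destroying the old one
zfs destroy -r tank/archive-old
# Mount the rebalanced datasets
zfs mount -a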
# Option 2: for small datasets, cp and delete
# Option 3: accept uneven distribution — ZFS handles it fine,
# new writes will naturally fill the new vdev
dRAID + Replication
dRAID protects against disk failure. Replication protects against site failure. You need both. Incremental zfs send makes even petabyte-scale replication practical, because only changed blocks transfer.
# syncoid for offsite backup of dRAID pools
# Incremental send/recv over WireGuard
# Initial full send (will take a while for petabyte-scale)
syncoid -r tank/archive remote-host:backup/archive
# Subsequent runs send only changed blocks
# Run every hour via cron
cat > /etc/cron.d/draid-replicate << 'EOF'
0 * * * * root syncoid -r --no-sync-snap tank/archive remote-host:backup/archive 2>&1 | logger -t zfs-replicate
0 * * * * root syncoid -r --no-sync-snap tank/backup-targets remote-host:backup/targets 2>&1 | logger -t zfs-replicate
EOF
WireGuard Tunnel for Replication
# Dedicated WireGuard tunnel for replication traffic
# See the Firewall & Gateway recipe for full WireGuard setup
cat > /etc/wireguard/wg-repl.conf << EOF
[Interface]
Address = 10.99.0.1/30
PrivateKey = $(cat /etc/wireguard/repl-private.key)
ListenPort = 51821
[Peer]
PublicKey =
AllowedIPs = 10.99.0.2/32
Endpoint = remote-host.example.com:51821
PersistentKeepalive = 25
EOF
systemctl enable --now wg-quick@wg-repl
The 3-2-1 Rule at Petabyte Scale
Three copies of data, two different media types, one offsite. At petabyte scale:
- Copy 1: dRAID pool (production, fast access)
- Copy 2: ZFS replication to a second dRAID pool at a remote site
- Copy 3: Cold storage — tape (LTO-9 at 18TB/tape), or a third site with slower disks and zstd-19 compression
At 1 PB, a full tape backup is ~56 LTO-9 tapes. Incremental ZFS send/recv keeps the remote replicas current without retransmitting the entire dataset.
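The tape count is a ceiling division of dataset size by native tape capacity. A minimal sketch follows, using a hypothetical `tapes_needed` function with whole-terabyte inputs; it deliberately ignores tape-side compression and any headroom you may want per tape.

```shell
# tapes_needed DATA_TB TAPE_TB
# Prints how many tapes hold DATA_TB at TAPE_TB native capacity each
# (integer ceiling division).
tapes_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

tapes_needed 1000 18   # 1 PB on LTO-9 at 18 TB native per tape
```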
Hardware for Large Arrays
HBA Cards
Use HBA (Host Bus Adapter) cards in IT mode — never hardware RAID controllers. ZFS needs direct access to disks. Hardware RAID hides disk errors from ZFS and defeats the entire purpose of end-to-end checksumming.
Recommended HBAs:
Broadcom (LSI) SAS 9300/9400 series: 12 Gbps SAS, IT mode
Broadcom SAS 9500 series: 12 Gbps SAS on PCIe 4.0, tri-mode (SAS/SATA/NVMe); the 9600 series moves to 24 Gbps SAS
Dell HBA330/HBA355i: rebranded LSI, same firmware
Flash to IT mode firmware if the card ships in IR (RAID) mode.
kldload includes mpt3sas and mpi3mr kernel modules for these controllers.
SAS Expanders and Disk Shelves
For 24+ disks:
SAS expander backplane (HP D3600/D3700, Dell MD1200/MD1400, Supermicro JBOD)
One SAS HBA connects to one or more JBODs via SFF-8644 cables
Each JBOD holds 12-60 disks depending on form factor
Topology:
[Server] --SFF-8644--> [SAS Expander JBOD 1: 24 disks]
--SFF-8644--> [SAS Expander JBOD 2: 24 disks]
Total: 48 disks on one HBA, two cables
dRAID2:2s across all 48 disks = one massive vdev
ECC RAM Sizing
The rule of thumb: 1 GB of RAM per TB of storage. This is for the ARC (Adaptive Replacement Cache). You can run with less, but the ARC is what makes ZFS fast.
12x 16TB disks = 192TB raw, ~150TB usable → 150 GB RAM (minimum 64 GB)
24x 16TB disks = 384TB raw, ~300TB usable → 300 GB RAM (minimum 128 GB)
48x 16TB disks = 768TB raw, ~600TB usable → 512-768 GB RAM
Always ECC. Non-ECC RAM can silently corrupt data in the ARC and ZFS
will happily write the corrupted data back to disk. ECC is non-negotiable
for production storage.
NVMe for SLOG and Special Vdev
# SLOG (Separate Log) — accelerates synchronous writes
# Only helps if your workload does sync writes (NFS, databases, iSCSI)
zpool add tank log mirror /dev/nvme0n1p1 /dev/nvme1n1p1
# Special vdev — stores metadata and small blocks on fast NVMe
# Dramatically improves ls, find, stat operations on large pools
zpool add tank special mirror /dev/nvme0n1p2 /dev/nvme1n1p2
# Set small_blk threshold — blocks smaller than this go to special vdev
zfs set special_small_blocks=64k tank
# ALWAYS mirror your special vdev. If you lose it, you lose the pool.
Use Cases
Video Production
4K/8K editing requires sustained sequential reads at 800+ MB/s. dRAID2 across 24 disks delivers aggregate throughput that saturates 10GbE. Take snapshots before every render and roll back a bad color grade in seconds. ZFS compression at zstd-1 can save space on ProRes/DNxHR proxies at low (not zero) CPU cost; already-compressed codecs often gain far less than 20-30%, so check compressratio on your own footage.
Backup Target
Receive ZFS send streams from 100+ machines into one dRAID pool. Each source gets its own dataset with independent snapshot schedules. Compression with lz4 is free throughput. If a disk dies, the resilver completes before your next backup window even opens.
Surveillance NVR
24/7 writes from 50+ cameras at 10-30 Mbps each. dRAID handles the sustained write load. Oldest footage is pruned by destroying expired snapshots on a schedule, so there is no manual cleanup. Resilver in minutes means no surveillance gap.
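For retention planning, the camera figures above translate directly into daily ingest. The sketch below uses a hypothetical `nvr_tb_per_day` function and assumes every camera streams at a constant bitrate around the clock, with decimal units and no compression gain on the already-encoded video.

```shell
# nvr_tb_per_day CAMERAS MBIT_PER_SEC_PER_CAMERA
# Prints the daily ingest in TB: Mbit/s -> MB/s -> MB/day -> TB/day.
nvr_tb_per_day() {
  awk -v cams="$1" -v mbps="$2" \
    'BEGIN { printf "%.1f\n", cams * mbps / 8 * 86400 / 1000000 }'
}

nvr_tb_per_day 50 10   # 50 cameras at the low end of the range
nvr_tb_per_day 50 30   # 50 cameras at the high end of the range
```

Dividing the pool's usable capacity by the daily figure gives a first estimate of how many days of footage fit before pruning has to start.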
Scientific Data
Genomics sequencers produce terabytes per run. Climate models generate petabytes per simulation. Particle physics detectors write continuously. These datasets never shrink. dRAID3 with triple parity protects irreplaceable data while zstd-19 compression can cut storage costs by 40-60% on compressible formats.
CDN Origin
Serve static assets from a massive dRAID pool. Sequential reads dominate.
Replicate to edge nodes with zfs send — only changed blocks transfer.
A failed disk rebuilds in minutes, not the hours it would take on RAIDZ
with multi-terabyte drives.
Limitations
dRAID is not a silver bullet. Be honest about what it costs:
- Higher overhead for small random I/O — dRAID's fixed stripe width means small random writes touch more disks than RAIDZ. For database workloads, mirror vdevs remain the correct choice.
- Cannot add individual disks — you must add entire vdevs. Plan your disk purchasing in groups that match your vdev width.
- Distributed spares cannot be removed once allocated: the spare space is woven into the vdev layout. You choose the spare count at zpool create time and live with it.
- Minimum 3 disks per vdev, but realistically, do not use dRAID below 8 disks. Below 12, the resilver time advantage over RAIDZ is marginal.
- Fixed redundancy group size — once created, you cannot change the parity level or spare count. Choose wisely. dRAID2:2s is the conservative default for most large arrays.
- Requires OpenZFS 2.1.0+: kldload ships current OpenZFS, so this is not a constraint if you are using this project. On older systems, check your version with zfs --version.
Complete Build: 24-Disk dRAID2 Storage Server
# Install kldload
cat > /tmp/answers.env << 'EOF'
KLDLOAD_DISTRO=centos
KLDLOAD_DISK=/dev/nvme0n1
KLDLOAD_HOSTNAME=storage01
KLDLOAD_USERNAME=admin
KLDLOAD_PASSWORD=changeme
KLDLOAD_PROFILE=server
KLDLOAD_NET_METHOD=dhcp
EOF
kldload-install-target --config /tmp/answers.env
# After reboot, create the dRAID pool
# 24x 16TB SAS disks on an LSI 9400 HBA
# Double parity, 2 distributed spares
zpool create \
-o ashift=12 \
-o autotrim=on \
-O compression=lz4 \
-O atime=off \
-O xattr=sa \
-O dnodesize=auto \
-O recordsize=1M \
tank draid2:2s /dev/sd{a..x}
# Verify the topology
zpool status tank
# Add NVMe special vdev for metadata acceleration
zpool add tank special mirror /dev/nvme1n1p1 /dev/nvme2n1p1
zfs set special_small_blocks=64k tank
# Add mirrored SLOG for synchronous writes (NFS, iSCSI)
zpool add tank log mirror /dev/nvme1n1p2 /dev/nvme2n1p2
# Create the dataset hierarchy
zfs create -o mountpoint=/tank/archive -o compression=zstd tank/archive
zfs create tank/archive/video-raw
zfs create tank/archive/video-edit
zfs create tank/archive/video-final
zfs create -o mountpoint=/tank/backup-targets -o compression=lz4 tank/backup-targets
zfs create tank/backup-targets/site-a
zfs create tank/backup-targets/site-b
zfs create -o mountpoint=/tank/cold-storage -o compression=zstd-19 tank/cold-storage
zfs create -o mountpoint=/tank/scratch -o sync=disabled tank/scratch
# Snapshot schedule
cat > /etc/cron.d/draid-snapshots << 'CRON'
# These jobs only create snapshots; the "keep N" targets need a separate pruning job
# Hourly snapshots on active datasets, keep 48
0 * * * * root zfs snapshot -r tank/archive@auto-$(date +\%Y\%m\%d-\%H\%M) 2>&1 | logger -t zfs-snap
# Daily snapshots on backup targets, keep 30
0 2 * * * root zfs snapshot -r tank/backup-targets@auto-$(date +\%Y\%m\%d) 2>&1 | logger -t zfs-snap
# Weekly snapshots on cold storage, keep 52
0 3 * * 0 root zfs snapshot -r tank/cold-storage@auto-$(date +\%Y\%m\%d) 2>&1 | logger -t zfs-snap
CRON
# Enable ZFS event monitoring
systemctl enable --now zfs-zed
echo "Pool ready. $(zpool list -H -o size,free tank | awk '{print "Size: "$1", Free: "$2}')"
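The cron entries in the build create snapshots but never delete them, so the "keep N" targets need a pruning step. As a minimal sketch, the hypothetical filter `snapshots_to_prune` below takes snapshot names on stdin, oldest first, and prints everything except the newest N. It only selects names; destroying them is left as a separate, reviewed step.

```shell
# snapshots_to_prune KEEP
# Reads snapshot names on stdin (oldest first) and prints every name
# except the newest KEEP. Prints nothing if there are KEEP or fewer.
snapshots_to_prune() {
  awk -v keep="$1" '
    { names[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print names[i] }'
}

# Demo on made-up names: keep the newest 2 of 4
printf '%s\n' tank/archive@auto-1 tank/archive@auto-2 \
              tank/archive@auto-3 tank/archive@auto-4 | snapshots_to_prune 2
```

On a live system the input could come from `zfs list -H -t snapshot -o name -s creation -d 1 tank/archive | grep '@auto-'`. Review the resulting list before handing it to `zfs destroy`, and note that `zfs destroy -r` on a snapshot also removes the same-named snapshots on child datasets.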
From here, point your backup agents at /tank/backup-targets, mount
/tank/archive via NFS or SMB for your edit bays, and let
syncoid handle offsite replication. The dRAID pool handles the rest —
fast resilvers, parallel I/O, and distributed spare activation without
human intervention.