Hardware Selection

Hardware Selection — ZFS is only as good as the metal it runs on.

ZFS is highly dependent on disk latency and throughput. The wrong hardware doesn't just slow things down — it can cause data loss. Consumer SSDs, hardware RAID controllers, and SMR drives are the three most common hardware mistakes. This page covers every hardware decision you need to make, from RAM to drives to power protection, with concrete part numbers and real-world trade-offs.

Hardware selection is where most ZFS deployments succeed or fail before they even boot. I've seen more pools lost to bad controllers and SMR drives than to any software bug. The OpenZFS docs cover the software side well but leave hardware guidance vague. This page fills that gap with specific recommendations from years of building ZFS systems in production.

HBA vs RAID controller — the golden rule

NEVER use ZFS on top of hardware RAID. Use IT-mode HBAs only.

ZFS requires direct access to raw disks. It manages its own redundancy, checksumming, and caching. RAID controllers hide disk errors, alter write ordering, inject caching layers, and present virtual volumes that strip ZFS of its ability to detect and correct corruption. ZFS on hardware RAID is not just suboptimal — it is actively dangerous.

A Host Bus Adapter (HBA) in IT (Initiator Target) mode passes raw disks straight to the OS. No RAID logic, no write-back cache, no interference. ZFS sees every disk individually, reads S.M.A.R.T. data directly, detects every bad sector, and handles all redundancy itself. This is the only correct way to attach disks to a ZFS system.

Why hardware RAID is harmful to ZFS

Hidden errors

RAID controllers silently correct errors or return stale data from cache. ZFS can still detect a bad block by its checksum, but it sees a single virtual volume, so it has no second copy to repair from and no way to tell which physical disk is at fault.

Write reordering

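RAID controllers reorder and coalesce writes for performance. ZFS's copy-on-write transaction model depends on cache flushes being honored: everything a transaction group references must be durable before the uberblock that points to it is written. A controller that acknowledges a flush out of its cache breaks that guarantee and can corrupt the pool on power loss.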

S.M.A.R.T. blocked

Most RAID controllers don't pass through S.M.A.R.T. data. You can't monitor drive health, predict failures, or proactively replace dying drives.

Cache interference

RAID write-back cache sits between ZFS and the disks. ZFS thinks a write is on disk when it's in volatile RAID cache. A dead cache battery or a failed power supply means data loss, even with a UPS on the server.

Wasted redundancy

Running ZFS mirrors on top of hardware RAID-1 gives you double redundancy that wastes half your disks. Running ZFS on RAID-0 gives you zero protection from the controller's perspective and full overhead from both layers.

Recommended HBA models

Model	Interface	Ports	Speed	Notes
Broadcom 9500-16i	SAS/SATA/NVMe	16 internal	12 Gb/s SAS, PCIe 4.0	Current generation. Tri-mode: SAS, SATA, and NVMe on the same card. The gold standard for new builds.
Broadcom 9400-16i	SAS/SATA/NVMe	16 internal	12 Gb/s SAS, PCIe 3.1	Tri-mode. Widely available, excellent Linux support. Great for mixed SAS+NVMe pools.
Broadcom 9300-8i	SAS/SATA	8 internal	12 Gb/s SAS, PCIe 3.0	Workhorse. Native IT mode, no crossflashing needed. Abundant and cheap on the used market.
LSI 9211-8i	SAS/SATA	8 internal	6 Gb/s SAS, PCIe 2.0	The classic ZFS HBA. Must be crossflashed to IT mode (ships in IR/RAID mode). Cheap, proven, ubiquitous. PCIe 2.0 limits throughput on all-SSD pools.
LSI 9207-8i	SAS/SATA	8 internal	6 Gb/s SAS, PCIe 3.0	Same role as the 9211-8i, but built on the newer SAS2308 chip with PCIe 3.0 and native IT mode, so no crossflashing. Drop-in replacement.
Dell H310 (crossflashed)	SAS/SATA	8 internal	6 Gb/s SAS	Rebranded LSI 2008. Crossflash to IT mode for ZFS use. Extremely cheap from decommissioned Dell servers.

If you're buying new, the Broadcom 9400-16i or 9500-16i is the right answer. If you're building a homelab on a budget, a used LSI 9207-8i or crossflashed Dell H310 costs $20–40 and works perfectly. The 9211-8i requires crossflashing from IR to IT firmware — it's well-documented but annoying. The 9207-8i ships in IT mode already. Either way, never use the RAID firmware.

Crossflashing a RAID controller to IT mode

Many servers ship with LSI-based RAID controllers (Dell PERC H310/H710, IBM M1015, etc.) that are LSI SAS2008 or SAS2208 chips running RAID firmware. On many of these cards you can replace that firmware with the IT-mode (HBA) firmware to expose raw disks. The process varies by card but follows this pattern:

# General crossflash process (varies by card — research your specific model):
# 1. Boot into EFI shell or DOS
# 2. Erase existing firmware:
sas2flsh -o -e 6
# 3. Flash IT-mode firmware + BIOS (2118it.bin is the 9211-8i image; use the file for your card):
sas2flsh -o -f 2118it.bin -b mptsas2.rom
# 4. Set SAS address (required — use the address from the card's sticker):
sas2flsh -o -sasadd 500605bxxxxxxxxx

# After reboot, verify IT mode:
lspci -v | grep -i "LSI\|Broadcom"
# Should show: "Serial Attached SCSI controller" (not "RAID bus controller")

Crossflashing is a one-time 20-minute process that transforms a $15 Dell PERC into a proper ZFS HBA. There are excellent guides on the ServeTheHome forums for every card variant. Just make sure you record the SAS address from the sticker before you erase the firmware. If you lose it, you'll need to set a new one manually.
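The verification step can be scripted so it is easy to rerun after a firmware change. This is a minimal sketch, not an official tool: `hba_mode` is a hypothetical helper name, and it relies only on the two PCI class strings quoted above. Cards running IR firmware may also report as a SAS controller, so treat the result as a first check rather than proof.

```shell
#!/bin/sh
# hba_mode (hypothetical helper): reads lspci-style lines on stdin and labels
# each LSI/Broadcom device by the PCI class string it reports.
hba_mode() {
    grep -i -e 'LSI' -e 'Broadcom' | while IFS= read -r line; do
        case "$line" in
            *"Serial Attached SCSI controller"*) echo "IT/HBA: $line" ;;
            *"RAID bus controller"*)             echo "RAID: $line" ;;
            *)                                   echo "unknown: $line" ;;
        esac
    done
}

# Usage on a live system:
#   lspci | hba_mode
```

Any line labeled `RAID:` is a card still running RAID firmware and should not be handed to ZFS.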

ECC RAM — the real story

ECC RAM is strongly recommended but not required. This is the most debated topic in the ZFS community, and the nuance matters. ZFS checksums every block on disk and verifies those checksums on every read. If corruption is detected, ZFS corrects it from a redundant copy. This entire chain assumes the data in RAM is correct — because that's where the checksum is computed.

Without ECC, a single bit flip in RAM can corrupt data silently. ZFS computes a checksum over the corrupted data, writes the corrupted block and its now-valid-looking checksum to disk, and the corruption becomes permanent. This is not theoretical: Google's 2009 field study of its server fleet measured 25,000–70,000 correctable errors per billion device-hours per Mbit, far above earlier lab estimates. The errors were heavily concentrated: about 8% of DIMMs saw at least one correctable error in a year, and most saw none. In a server running 24/7 for years, the odds that some DIMM eventually misbehaves are not small.

The counterpoint: every filesystem has this problem, not just ZFS. ext4 and XFS suffer the same RAM corruption risk — they just can't detect it at all. ZFS without ECC is still safer than any other filesystem without ECC, because at least ZFS can detect most corruption (the kind that happens on disk, during transfer, or in firmware). ZFS without ECC is vulnerable only to corruption that happens to be in RAM at the moment of checksum computation.

ZFS co-creator Matt Ahrens has said that ZFS works fine without ECC and needs it no more than any other filesystem does. The internet's "you MUST use ECC or ZFS will eat your data" narrative is overblown. That said, if you're building a server that stores data you care about, use ECC. It's cheap ($10–20 more per DIMM), every server board supports it, and it eliminates an entire class of failure. For a desktop or homelab test box? Non-ECC is fine. For production storage? ECC, always.

When ECC matters most

Scrubbing

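During a scrub, ZFS reads every block and verifies its checksum, so the entire pool passes through RAM. With a stuck bit, good blocks can fail verification and show up as checksum errors on healthy disks. ZFS only rewrites a block from a copy that itself passes the checksum, so a scrub turning good data into bad data is very unlikely, but the spurious errors can fault good drives and degrade the pool. ECC prevents this.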

Deduplication

Dedup compares block hashes in RAM. A bit flip in the dedup table can cause ZFS to merge non-identical blocks. The data loss is permanent and undetectable. Dedup without ECC is reckless.

Send/receive

Replication streams pass through RAM. A bit flip during send corrupts the stream and can corrupt the receiving pool. ECC protects both sides.

Large ARC

Systems with 128GB+ RAM have a larger surface area for bit flips. The more RAM you have, the more ECC matters. If you're running 256GB of ARC, ECC is mandatory.

SAS vs SATA vs NVMe for ZFS

Three interfaces, three different trade-offs. ZFS works equally well with all three — the choice is about performance requirements, budget, and physical infrastructure.

Feature	SATA	SAS	NVMe
Max bandwidth	6 Gb/s (600 MB/s)	12 Gb/s (1.2 GB/s)	PCIe 4.0 x4: 64 Gb/s (~7 GB/s)
Queue depth	32 commands	254 commands	65,535 commands (64K queues x 64K depth)
Dual-port	No	Yes — multipath for HA	Some enterprise models
Hot-swap	Yes (with backplane)	Yes (native)	Yes (U.2/U.3/EDSFF)
Cable length	1m max	10m (SAS cables)	Short (direct PCIe) or via U.2/U.3
Cost per TB (HDD)	$15–25	$18–30	N/A (no NVMe HDDs)
Cost per TB (SSD)	$50–80	$60–100	$50–90
Best for	Budget NAS, homelabs	Enterprise storage, disk shelves	All-flash pools, SLOG, special vdev
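The bandwidth row follows from line rate and encoding overhead, which is worth being able to recompute when you compare links. The sketch below assumes the standard encodings (8b/10b for SATA III and 12 Gb/s SAS, 128b/130b for PCIe 3.0 and later, with PCIe 4.0 running at 16 GT/s per lane); `link_mb_s` is a hypothetical helper, and real drives land somewhat below these ceilings because of protocol overhead.

```shell
#!/bin/sh
# link_mb_s GBIT_PER_LANE LANES PAYLOAD_BITS TOTAL_BITS
# Prints the payload ceiling of a link in MB/s after encoding overhead.
link_mb_s() {
    awk -v g="$1" -v l="$2" -v p="$3" -v t="$4" \
        'BEGIN { printf "%.0f\n", g * 1000 * l * p / t / 8 }'
}

link_mb_s 6 1 8 10        # SATA III, one lane, 8b/10b      -> 600
link_mb_s 12 1 8 10       # SAS 12 Gb/s, one lane, 8b/10b   -> 1200
link_mb_s 16 4 128 130    # PCIe 4.0 x4, 128b/130b          -> 7877
```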

SATA is the budget choice. Every HBA supports it, every case has bays for it, and the drives are cheap. The 6 Gb/s bandwidth is sufficient for HDDs (which max out at ~250 MB/s) but becomes a bottleneck for SSDs. Fine for spinning-rust pools, limiting for all-flash.

SAS is the enterprise choice. Dual-port for multipath failover, longer cable runs for external disk shelves (JBODs), higher queue depth for workloads with many concurrent I/O operations. SAS HBAs accept both SAS and SATA drives, so SAS backplanes give you maximum flexibility. High-capacity SAS HDDs are 7,200 RPM nearline drives, mechanically the same as their SATA siblings. The older 10K and 15K RPM SAS drives seek faster but have much lower capacity and a higher cost per TB, and SSDs have largely replaced them.

NVMe is the performance choice. Direct PCIe connection eliminates the SAS/SATA protocol overhead entirely. Queue depth is orders of magnitude higher. For all-flash ZFS pools, SLOG devices, and special vdevs, NVMe is the right answer. The downside: each NVMe drive typically consumes a PCIe slot or M.2 connector, limiting drive count without an NVMe JBOD or a tri-mode HBA like the Broadcom 9400/9500 series.

For most ZFS builds, the answer is SATA HDDs for bulk storage and NVMe for special vdev/SLOG/L2ARC. SAS only matters if you need external disk shelves (JBODs), multipath, or you're buying used enterprise gear where SAS drives are cheaper than SATA. The tri-mode HBAs (9400/9500) make the SAS-vs-NVMe question moot — they support everything on one card.

Drive selection — enterprise vs consumer, CMR vs SMR

CMR vs SMR — the non-negotiable rule

NEVER use SMR drives with ZFS. CMR only.

SMR (Shingled Magnetic Recording) overlaps write tracks like shingles on a roof to increase density. Random writes require read-modify-write of entire bands (20–50MB), causing catastrophic performance collapse under sustained writes. Resilvers on SMR drives can take 3–5x longer than CMR equivalents. Scrubs crawl. ZFS write patterns — copy-on-write with scattered free-space allocation — are the worst possible workload for SMR.

CMR (Conventional Magnetic Recording) writes tracks independently. Every write is a direct operation with predictable latency. This is what ZFS expects.

How to identify SMR drives

Manufacturers have been caught shipping SMR drives under model names that previously used CMR. The WD Red (non-Plus) debacle in 2020 is the most famous example. Never trust marketing. Always verify.

# Check the drive's data sheet — look for recording technology
# Manufacturer spec sheets will list "CMR" or "PMR" (good) vs "SMR" (bad)

# Known SMR model families (as of 2025 — always verify current models):
# WD Red (WD40EFAX, WD60EFAX) — NOT the Red Plus or Red Pro
# Seagate Barracuda (many 2TB-8TB models, especially ST2000DM008)
# Seagate Archive (all models — designed for SMR)
# Toshiba P300 (some capacity points)

# Known CMR model families:
# WD Red Plus (WD40EFPX, WD80EFPX) — the "Plus" means CMR
# WD Red Pro (all models)
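# WD Ultrastar / HGST (standard models; host-managed SMR variants such as the DC HC620/HC650 are labeled as SMR)
# Seagate Exos (standard models; check the data sheet for host-managed SMR variants)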
# Seagate IronWolf (all models — NAS-rated, CMR)
# Toshiba N300 (NAS-rated, CMR)
# Toshiba MG series (enterprise, CMR)
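A quick lookup against the short list above can catch the best-known offenders before a drive goes into a pool. This is only a sketch: `is_known_smr` and `check_model` are hypothetical helpers, and the pattern list contains nothing beyond the three model numbers named on this page, so a miss proves nothing and the data sheet stays the authority.

```shell
#!/bin/sh
# is_known_smr MODEL: succeeds if MODEL contains one of the SMR model numbers
# named in the list above. Deliberately tiny; extend it from data sheets.
is_known_smr() {
    case "$1" in
        *WD40EFAX*|*WD60EFAX*|*ST2000DM008*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_model MODEL: prints a one-line verdict for a model string.
check_model() {
    if is_known_smr "$1"; then
        echo "$1: listed as SMR, avoid"
    else
        echo "$1: not on the short list, check the data sheet"
    fi
}

# Usage on a live system (lsblk prints the model string of a whole disk):
#   check_model "$(lsblk -dno MODEL /dev/sdX)"
```

The kernel's zoned-device reporting (`lsblk -o NAME,ZONED`) only flags host-managed and host-aware SMR. Drive-managed SMR, the kind sold in consumer and NAS lines, reports as a normal disk, which is why a model lookup is still needed.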

Enterprise vs consumer drives

Feature	Consumer (WD Blue, Barracuda)	NAS (Red Plus, IronWolf)	Enterprise (Exos, Ultrastar)
Recording	Often SMR	CMR	CMR
Workload rating	~55 TB/year	180 TB/year	550 TB/year
MTBF	~300,000 hours	1,000,000 hours	2,500,000 hours
Vibration tolerance	Low	Medium (RV sensors)	High (RV sensors)
Warranty	2 years	3 years	5 years
Error recovery (TLER/ERC)	Uncontrolled (up to 60s)	7 seconds (configurable)	7 seconds (configurable)
ZFS suitability	Not recommended	Good for homelab/NAS	Best for everything

Error recovery time (TLER/ERC/CCTL) is the most overlooked spec. When a consumer drive encounters a read error, it retries aggressively for up to 60 seconds trying to recover the sector. During this time, the drive is unresponsive. ZFS (or the HBA) may declare the drive dead and kick it from the pool. Enterprise and NAS drives limit this to 7 seconds, then return an error and let ZFS handle it. This is exactly what you want — ZFS has a redundant copy and can fix the error instantly, but only if the drive reports the failure promptly.
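On drives that support SCT Error Recovery Control, the timeout can be read and set with smartctl. The sketch below only builds the command and prints it, so it is safe to run anywhere; `scterc_cmd` is a hypothetical helper. smartctl takes the read and write timeouts in units of 100 ms, so 7 seconds is written as 70. Many consumer drives reject the command, and on some drives the setting does not survive a power cycle, so assume it may need reapplying at boot.

```shell
#!/bin/sh
# scterc_cmd SECONDS DEVICE
# Prints the smartctl command that would set read and write ERC to SECONDS.
scterc_cmd() {
    deciseconds=$(awk -v s="$1" 'BEGIN { printf "%d", s * 10 }')
    echo "smartctl -l scterc,$deciseconds,$deciseconds $2"
}

scterc_cmd 7 /dev/sdX     # prints: smartctl -l scterc,70,70 /dev/sdX

# To read the current setting on a live system:
#   smartctl -l scterc /dev/sdX
```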

The sweet spot for most homelab builders is NAS-rated drives: WD Red Plus, Seagate IronWolf, or Toshiba N300. They're CMR, have proper error recovery timing, and cost 10–20% more than consumer drives. For large pools (12+ drives) or production, Seagate Exos or WD Ultrastar. Avoid consumer Barracuda/Blue drives entirely — even if they happen to be CMR, the error recovery behavior will cause problems in multi-disk arrays.

Used enterprise drives — the economics

Decommissioned datacenter drives (WD Ultrastar/HGST, Seagate Exos) are often available at 30–50% of new retail price. A used 14TB Ultrastar HC530 might cost $80–120 versus $250+ new. These drives were designed for 24/7 operation and have years of useful life remaining. The key is to verify their health before trusting them with data.

# Check S.M.A.R.T. data on used drives before building a pool
smartctl -a /dev/sdX

# Key fields to check:
# Reallocated_Sector_Ct — should be 0 or very low
# Current_Pending_Sector — should be 0
# Offline_Uncorrectable — should be 0
# Power_On_Hours — tells you age (8760 hours = 1 year)
# Spin_Retry_Count — should be 0

# Run a full surface scan (takes hours on large drives)
badblocks -b 4096 -wsv /dev/sdX
# WARNING: -w is destructive — writes test patterns. Only on empty drives.

Used enterprise drives are the best value in storage. A datacenter typically decommissions drives at 3–5 years with zero reallocated sectors — they're replaced on schedule, not because they failed. ZFS's checksumming means you'll catch any degradation early. Buy from reputable sellers (serverpartdeals.com, eBay sellers with return policies), check S.M.A.R.T. on arrival, and you'll build a pool at half the cost of new NAS drives.
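When a batch of used drives arrives, the S.M.A.R.T. fields listed above can be screened automatically. This is a rough sketch under two assumptions: the input is the ATA attribute table printed by `smartctl -A` (raw value in the tenth column), and "any non-zero count fails" is stricter than the "very low" allowance above. `smart_verdict` is a hypothetical helper; SAS and NVMe drives report health in a different format and will come back as UNKNOWN.

```shell
#!/bin/sh
# smart_verdict: reads `smartctl -A` output on stdin, prints the drive's age
# and a FAIL line for each watched counter whose raw value is above zero.
smart_verdict() {
    awk '
        $2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" || $2 == "Spin_Retry_Count" {
            seen++
            if ($10 + 0 > 0) { printf "FAIL %s raw=%s\n", $2, $10; bad++ }
        }
        $2 == "Power_On_Hours" { printf "AGE %.1f years\n", $10 / 8760 }
        END {
            if (!seen) print "UNKNOWN no matching attributes"
            else if (!bad) print "PASS"
        }'
}

# Usage on a live system:
#   smartctl -A /dev/sdX | smart_verdict
```

A PASS here is a reason to proceed to the surface scan, not a substitute for it.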

HDD form factors and capacity planning

3.5-inch drives dominate ZFS pools. They offer the highest capacity (up to 30TB+ per drive as of 2026), the best cost per terabyte, and are designed for continuous operation in multi-bay enclosures. All NAS, nearline, and enterprise drives are 3.5-inch.

2.5-inch HDDs are largely irrelevant for ZFS. They max out at 2–5TB, are often SMR, and are being replaced by SSDs in every use case. Avoid them.

Capacity planning tip: Buy the largest drives you can afford. Fewer, larger drives mean fewer points of failure, less power consumption, fewer HBA ports, and simpler cabling. A mirror of 2x 18TB drives gives you 18TB usable with only 2 failure domains. Six 6TB drives in RAIDZ2 give you 24TB usable but with 6 failure domains and roughly 3x the probability of a drive failure.
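The usable-capacity figures in this comparison are simple arithmetic, sketched below so you can try other layouts. `usable_tb` is a hypothetical helper that assumes two-way mirror pairs and a single RAIDZ vdev, and it reports raw data capacity only: ZFS metadata, padding, and reserved space take a further cut that it does not model.

```shell
#!/bin/sh
# usable_tb LAYOUT DRIVES SIZE_TB
# Raw data capacity for two-way mirror pairs or a single RAIDZ vdev.
usable_tb() {
    case "$1" in
        mirror) data=$(( $2 / 2 )) ;;
        raidz1) data=$(( $2 - 1 )) ;;
        raidz2) data=$(( $2 - 2 )) ;;
        raidz3) data=$(( $2 - 3 )) ;;
        *) echo "unknown layout: $1" >&2; return 1 ;;
    esac
    awk -v d="$data" -v s="$3" 'BEGIN { print d * s }'
}

usable_tb mirror 2 18     # 18
usable_tb raidz2 6 6      # 24
usable_tb mirror 4 8      # 16
```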

SSD endurance — DWPD, TBW, and flash types

SSDs wear out as NAND cells degrade from repeated write/erase cycles. Endurance is rated in DWPD (Drive Writes Per Day over the warranty period) or TBW (Terabytes Written over the drive's lifetime). ZFS's copy-on-write design means every write is a new write; there are no in-place updates. This increases write amplification compared to filesystems that update blocks in place. Compression and recordsize tuning mitigate this, but SSD endurance matters more on ZFS than on ext4.

Flash type	Endurance	Cost	ZFS use case
SLC (Single-Level Cell)	100,000 P/E cycles	Very high (rare/obsolete)	SLOG (if you can find it). The gold standard for endurance.
Intel Optane (3D XPoint)	Effectively unlimited for ZFS workloads	High (discontinued, prices rising)	SLOG, special vdev. No NAND at all — byte-addressable, near-zero latency. The perfect SLOG device.
MLC (Multi-Level Cell)	10,000 P/E cycles	High (mostly enterprise)	Enterprise data SSDs, SLOG, special vdev.
TLC (Triple-Level Cell)	1,000–3,000 P/E cycles	Moderate	Data SSDs with DRAM cache. Fine for general-purpose all-flash pools. Acceptable for L2ARC.
QLC (Quad-Level Cell)	100–1,000 P/E cycles	Low	Avoid for ZFS. Terrible write endurance, high write amplification, slow sustained writes. Barely acceptable as read-only L2ARC.

DWPD vs TBW — what the numbers mean

DWPD tells you how many times you can overwrite the entire drive per day for the warranty period (typically 5 years). A 1.6TB drive rated at 3 DWPD can sustain 1.6TB x 3 = 4.8TB of writes per day for 5 years. That's 8,760 TBW total.

TBW is the simpler metric: total terabytes written over the drive's lifetime. A consumer 1TB NVMe rated at 600 TBW can sustain 328 GB/day for 5 years. Sounds like a lot, but a busy ZFS pool with sync writes, scrubs, and resilvers can burn through that faster than you'd expect.
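Converting between the two ratings is a one-line calculation, shown here as a sketch with hypothetical helper names (`dwpd_to_tbw`, `tbw_to_gb_per_day`). It assumes decimal terabytes and a 365-day year, which matches how vendors quote these numbers.

```shell
#!/bin/sh
# dwpd_to_tbw CAPACITY_TB DWPD WARRANTY_YEARS -> total TB written
dwpd_to_tbw() {
    awk -v c="$1" -v d="$2" -v y="$3" 'BEGIN { printf "%.0f\n", c * d * 365 * y }'
}

# tbw_to_gb_per_day TBW YEARS -> sustainable GB written per day
tbw_to_gb_per_day() {
    awk -v t="$1" -v y="$2" 'BEGIN { printf "%.1f\n", t * 1000 / (y * 365) }'
}

dwpd_to_tbw 1.6 3 5         # 8760
tbw_to_gb_per_day 600 5     # 328.8
```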

# Check current SSD wear level
smartctl -a /dev/nvme0n1 | grep -i "percentage used\|data units written"

# Percentage Used: 3%     — drive is at 3% of rated lifetime
# Data Units Written: 12,345,678  (one unit = 1,000 x 512 bytes = 512,000 bytes)

# Convert Data Units Written to TB:
# (data_units * 512000) / 1000000000000
# Example: 12345678 * 512000 / 1e12 = 6.32 TB written
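The same conversion as a reusable function, for scripting wear checks across several drives. `data_units_to_tb` is a hypothetical helper; it accepts the count with or without the thousands separators smartctl prints.

```shell
#!/bin/sh
# data_units_to_tb UNITS
# Converts an NVMe "Data Units Written" count (512,000 bytes each) to TB.
data_units_to_tb() {
    units=$(printf '%s' "$1" | tr -d ',')
    awk -v u="$units" 'BEGIN { printf "%.2f\n", u * 512000 / 1e12 }'
}

data_units_to_tb 12,345,678     # 6.32
```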

SLOG devices — the write intent log accelerator

The ZFS Intent Log (ZIL) records synchronous write transactions so they survive power loss. By default it lives on the data disks. A SLOG (Separate LOG device) moves the ZIL to a dedicated, fast, power-safe device. A SLOG only benefits workloads that issue synchronous writes: databases calling fsync(), NFS exports (clients request stable writes by default, and sync=always forces it for every write), iSCSI targets, and VM storage over NFS/iSCSI.

A SLOG is not a write cache. Asynchronous writes (the default) bypass the ZIL entirely and go straight to the transaction group. Adding a SLOG to an async workload does nothing.

SLOG sizing

A SLOG only needs to hold 5–10 seconds of synchronous writes. That's the interval between ZFS transaction group (TXG) commits (default: 5 seconds, tunable via zfs_txg_timeout). After each TXG commit, the SLOG data is no longer needed. For most workloads, a 16–32GB SLOG is plenty. Even under heavy NFS sync load, you'd need sustained 1+ GB/s of sync writes to need more than 16GB. A 375GB Optane P4801X as SLOG is massive overkill in capacity, but the latency characteristics make it ideal.
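The sizing rule is just sync-write rate multiplied by the seconds of writes the log must hold. A minimal sketch, with `slog_gb` as a hypothetical helper and decimal units assumed:

```shell
#!/bin/sh
# slog_gb SYNC_MB_PER_S SECONDS
# Space in GB needed to hold SECONDS of synchronous writes at the given rate.
slog_gb() {
    awk -v r="$1" -v s="$2" 'BEGIN { printf "%.1f\n", r * s / 1000 }'
}

slog_gb 1000 10     # 1 GB/s held for 10 seconds   -> 10.0
slog_gb 1200 10     # roughly a saturated 10GbE link -> 12.0
```

On Linux the current TXG interval can be read from `/sys/module/zfs/parameters/zfs_txg_timeout` if you want to plug in the real value instead of the default.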

Why consumer NVMe fails as SLOG

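The entire purpose of the ZIL is to survive power loss. Consumer NVMe drives have volatile write-back caches: the drive reports a write as complete when it hits the DRAM cache, before it reaches NAND. ZFS follows each ZIL write with a cache flush, and a drive without power loss protection either pays the full NAND program latency on every flush, which makes it a slow SLOG, or, with some firmware, acknowledges the flush before the data is durable. In that second case, a power loss means the cached data is gone. ZFS thinks the sync write was safely logged on the SLOG. It wasn't. The application (database, NFS client) thinks the write was committed. It wasn't. Data loss follows.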

Power Loss Protection (PLP) means the drive has capacitors that provide enough energy to flush the volatile write cache to NAND during power loss. Enterprise NVMe and Intel Optane have this. Consumer NVMe does not.

SLOG device comparison

Device	PLP	Latency	Endurance	Cost	Verdict
Intel Optane P4801X (100/200/375GB)	Yes	~10 us	60+ DWPD	$100–300 used	Best SLOG ever made. 3D XPoint, near-zero latency, effectively infinite endurance for ZIL workloads.
Intel Optane 900P/905P (280/480/960GB)	Yes	~10 us	10+ DWPD	$80–200 used	Excellent. Originally a consumer product but has PLP and Optane endurance. Overprovisions wonderfully as SLOG.
Samsung PM9A3 (960GB–7.68TB)	Yes	~70 us	1–3 DWPD	$100–400	Very good. Enterprise NVMe with PLP. Higher latency than Optane but widely available and current-gen.
Micron 7450 PRO (480GB–3.84TB)	Yes	~70 us	1 DWPD	$80–300	Good. Enterprise NVMe with PLP. Solid choice if Optane is unavailable.
Intel DC P4510/P5510	Yes	~80 us	1–3 DWPD	$60–200 used	Good. Older enterprise NVMe. Cheap on the used market. Perfectly adequate for SLOG.
Samsung 990 Pro (consumer)	No	~80 us	600 TBW (1TB)	$80–120	Not recommended for SLOG. No PLP. Acceptable only with a UPS and understanding of the risk.

If you can find Intel Optane P4801X or 900P/905P on eBay, buy them. Optane is discontinued and prices are climbing. Nothing else matches its latency for ZIL workloads. A $100 used Optane 900P 280GB as a mirrored SLOG pair transforms NFS/iSCSI performance. If Optane is gone from the market by the time you read this, Samsung PM9A3 or Micron 7450 are the next best. Consumer NVMe + UPS is a valid budget choice — just understand that your SLOG protection depends entirely on that UPS staying healthy.

L2ARC devices — extending read cache to SSD

L2ARC is a read cache that extends the in-memory ARC onto a fast SSD. It helps when your working set exceeds RAM and the workload is read-heavy: file servers, media libraries, build caches. Any decent SSD works for L2ARC — it doesn't need PLP (it's a cache; loss is harmless) and write endurance requirements are moderate (L2ARC writes are sequential and compressible).

L2ARC sizing guidelines

Header overhead

Each cached block consumes ~70 bytes of ARC (RAM) for its L2ARC header. A 1TB L2ARC filled with 4K blocks = 250 million headers = ~17GB of RAM consumed by L2ARC indexing alone. Budget accordingly.

Minimum RAM

Don't add L2ARC if you have less than 64GB RAM. The header overhead steals from ARC, and ARC is faster than L2ARC. Invest in more RAM first.

Persistent L2ARC

OpenZFS 2.0+ supports persistent L2ARC — the cache survives reboots. Previously, every reboot meant a cold cache that took hours to warm. Enable with l2arc_rebuild_enabled=1 (on by default in 2.0+).

Size ratio

A good rule of thumb: L2ARC should be 5–10x your ARC size. With 64GB ARC, a 500GB L2ARC is reasonable. Going beyond 10x rarely helps — the working set usually isn't that large.
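The header overhead depends heavily on block size, so it is worth computing for your own recordsize before buying a cache device. The sketch below uses the ~70 bytes per header quoted above as its default; that figure is an approximation that varies between OpenZFS versions, and `l2arc_header_gb` is a hypothetical helper.

```shell
#!/bin/sh
# l2arc_header_gb L2ARC_BYTES BLOCK_BYTES [HEADER_BYTES]
# RAM in GB consumed by L2ARC headers when the device is full of blocks of
# the given size. HEADER_BYTES defaults to 70.
l2arc_header_gb() {
    awk -v c="$1" -v b="$2" -v h="${3:-70}" \
        'BEGIN { printf "%.1f\n", c / b * h / 1e9 }'
}

l2arc_header_gb 1000000000000 4096       # 1 TB of 4K blocks   -> 17.1
l2arc_header_gb 1000000000000 131072     # 1 TB of 128K blocks -> 0.5
```

The second line is why L2ARC is cheap for media libraries with large records and expensive for small-block workloads.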

Good L2ARC devices: any TLC SSD, preferably with a DRAM cache. Samsung 870 EVO (SATA), Samsung 980 Pro (NVMe), or WD Black SN770 (NVMe, DRAM-less but TLC). Avoid QLC drives such as the Crucial P3 Plus: sustained write performance collapses once the SLC cache is exhausted, and L2ARC is fed by a steady stream of writes. Enterprise SSDs are nice but overkill for L2ARC.

Special vdev devices — metadata on fast storage

The special vdev stores pool metadata (directory entries, block pointers, file attributes) and optionally small file blocks below the special_small_blocks threshold. On a spinning-rust pool, this means ls, find, du, container layer lookups, and database index reads hit the SSD instead of waiting for platter seeks. The improvement is 10–50x for metadata-heavy operations.

The special vdev MUST be mirrored.

If an unmirrored special vdev fails, the pool's metadata is gone. The pool is unrecoverable. ZFS will warn you, but it will let you create an unmirrored special vdev. Never do this.

Ideal devices: Intel Optane (best, with the lowest latency for random reads), enterprise NVMe with PLP (Samsung PM9A3, Micron 7450), or high-endurance consumer NVMe. The special vdev handles both reads and writes, and unlike L2ARC it holds the only copy of the pool's metadata. ZFS cannot rebuild it from the data vdevs, so PLP and mirroring matter far more here than for L2ARC.

# Add a mirrored special vdev: metadata plus data blocks of 64K or smaller go to the SSDs
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
zfs set special_small_blocks=65536 tank

# For databases with 8K pages (caution: if this dataset's recordsize is 16K or less, every data block lands on the special vdev):
zfs set special_small_blocks=16384 tank/postgres

# Verify special vdev allocation:
zpool list -v tank
# Look for the "special" vdev and its allocated/free space
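One trap with special_small_blocks: the threshold is inclusive, so setting it at or above a dataset's recordsize sends every data block to the special vdev and can fill it quickly. A small sanity check, sketched with a hypothetical helper name (`check_small_blocks`) and both values given in bytes:

```shell
#!/bin/sh
# check_small_blocks RECORDSIZE SPECIAL_SMALL_BLOCKS   (both in bytes)
# Describes where data blocks will land for this combination of settings.
check_small_blocks() {
    if [ "$2" -eq 0 ]; then
        echo "metadata only"
    elif [ "$2" -ge "$1" ]; then
        echo "all data blocks go to the special vdev"
    else
        echo "data blocks up to $2 bytes go to the special vdev"
    fi
}

check_small_blocks 131072 65536     # default recordsize with a 64K threshold
check_small_blocks 8192 16384       # 8K recordsize with a 16K threshold

# Usage on a live system (-Hp prints bare numeric values):
#   check_small_blocks "$(zfs get -Hpo value recordsize tank/postgres)" \
#                      "$(zfs get -Hpo value special_small_blocks tank/postgres)"
```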

Boot devices — ESP and boot pool

A ZFS-on-root system needs two things outside the main pool: an EFI System Partition (ESP) for the bootloader, and optionally a small boot pool if your bootloader can't read ZFS directly. kldload uses a direct ZFS boot chain (kernel + initramfs on the root pool), so you only need the ESP.

Mirror your boot devices

If your single boot SSD fails, the system won't boot even though the ZFS pool is healthy on other disks. Mirror the ESP across two devices. For a dedicated boot setup, two small SSDs (120–256GB) in a ZFS mirror work perfectly. For an integrated setup, partition the first two data disks with a small ESP partition each and use efibootmgr to register both.

# kldload automatically mirrors the ESP across available disks
# To verify ESP mirrors:
lsblk -o NAME,SIZE,TYPE,FSTYPE,PARTTYPENAME,MOUNTPOINT | grep -i 'EFI System'

# Manual ESP mirror (if needed):
# 1. Create matching ESP partitions on both disks
# 2. Format both as FAT32
# 3. Copy bootloader files to both
# 4. Register both with efibootmgr:
efibootmgr -c -d /dev/sda -p 1 -L "Linux Boot 1" -l '\EFI\BOOT\BOOTX64.EFI'
efibootmgr -c -d /dev/sdb -p 1 -L "Linux Boot 2" -l '\EFI\BOOT\BOOTX64.EFI'

USB drives as boot devices are tempting but unreliable. USB flash drives have terrible write endurance and fail silently. If you must boot from USB, use a quality USB 3.0 drive and mirror it. Better yet, use two small SATA SSDs. A 120GB Kingston A400 costs $15 and will outlast any USB stick by years. kldload handles ESP mirroring automatically during install.
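For manual setups that don't have an installer keeping the two ESPs in step, the second ESP has to be refreshed whenever the bootloader on the first one changes. A minimal sketch, assuming both partitions are already mounted; `sync_esp` is a hypothetical helper and the mount points in the usage line are placeholders. It copies files across but does not delete stale ones from the destination.

```shell
#!/bin/sh
# sync_esp SRC DST
# Copies the contents of one mounted ESP onto another mounted ESP.
sync_esp() {
    src=$1
    dst=$2
    if [ ! -d "$src" ] || [ ! -d "$dst" ]; then
        echo "sync_esp: both ESPs must be mounted directories" >&2
        return 1
    fi
    # "src/." copies the directory contents, including dotfiles.
    cp -R "$src"/. "$dst"/
}

# Usage (placeholder mount points):
#   sync_esp /boot/efi /boot/efi2
```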

CPU requirements

ZFS is not CPU-intensive for basic operations, but specific features have real CPU requirements. Any modern x86_64 processor handles ZFS fine. The features that demand CPU attention are compression, encryption, and deduplication.

Feature	CPU requirement	Notes
Compression (LZ4)	Minimal, any CPU	LZ4 compresses at roughly 500–800 MB/s per core and decompresses at several GB/s. Free performance. Always enable it.
Compression (ZSTD)	Moderate — 1–2 cores	ZSTD is 2–5x slower than LZ4 but achieves 30–50% better ratios. Use for cold storage, backups.
Encryption (AES-256-GCM)	AES-NI required	Without AES-NI, encryption is 10–50x slower (software fallback). Nearly every x86 CPU since ~2011 has AES-NI, though some low-end Atom, Celeron, and Pentium parts of that era lack it. Verify with: `grep aes /proc/cpuinfo`
Deduplication	SHA256 + many cores	Dedup hashes every block. SHA extensions (SHA-NI, present on Zen 1+ and Ice Lake+) accelerate this 3–5x. Dedup is RAM-hungry more than CPU-hungry, but fast hashing helps.
Checksumming (fletcher4)	Minimal	Default checksum. Hardware-accelerated on all modern CPUs. Negligible overhead.
Checksumming (SHA-256)	SHA-NI helps	The default checksum when dedup is enabled. With SHA-NI: ~2 GB/s per core. Without: ~500 MB/s per core.
Scrub/resilver	1–2 cores	Scrubbing reads + checksums every block. CPU-bound on fast all-flash pools. On HDD pools, the disks are the bottleneck, not the CPU.

# Verify CPU feature support
grep -ow 'aes\|sha_ni\|avx2\|sse4_2' /proc/cpuinfo | sort -u
# aes      : required for native ZFS encryption at full speed
# sha_ni   : accelerates SHA-256 checksumming and dedup
# avx2     : vectorized fletcher4 checksums and RAIDZ parity math
# sse4_2   : SIMD baseline on older CPUs (ZFS checksums do not use CRC32)

CPU is almost never the bottleneck for ZFS. A 4-core Xeon from 2015 handles a 24-disk HDD pool with encryption and LZ4 compression without breaking a sweat. The only time CPU matters is: (1) all-flash pools with ZSTD compression at high levels, (2) dedup on large pools, or (3) encryption without AES-NI. For home NAS builds, literally any modern CPU works. For servers, anything with AES-NI and 4+ cores is plenty.
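For a yes/no report per feature instead of a bare list, the flags line can be checked token by token. This is a sketch for x86 Linux only; `cpu_flag_report` is a hypothetical helper, and it matches whole flag names so that `vaes` is not mistaken for `aes`.

```shell
#!/bin/sh
# cpu_flag_report: reads /proc/cpuinfo-style text on stdin and prints
# "<flag>: yes" or "<flag>: no" for the flags ZFS benefits from.
cpu_flag_report() {
    # Take the first "flags" line and put one token per line.
    flags=$(grep -m1 '^flags' | tr ' \t' '\n\n')
    for f in aes sha_ni avx2 sse4_2; do
        if printf '%s\n' "$flags" | grep -qx "$f"; then
            echo "$f: yes"
        else
            echo "$f: no"
        fi
    done
}

# Usage on a live system:
#   cpu_flag_report < /proc/cpuinfo
```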

Power protection — UPS is mandatory

Every ZFS system needs a UPS. No exceptions.

ZFS's copy-on-write design means a power loss during a write will never corrupt existing data — the old blocks are still intact until the new transaction group commits atomically. However, the ZFS Intent Log (ZIL) assumes the write medium is power-safe. If your ZIL is on data disks with volatile write caches, a power loss can lose acknowledged sync writes. If your SLOG lacks PLP, the same applies. A UPS eliminates this entire class of failure.

Beyond ZIL safety, a UPS protects against:

Drive cache flush

ZFS issues cache flush commands to drives after writing. If power dies before the flush completes, data in the drive's volatile write cache is lost. A UPS gives the flush time to complete.

TXG commit

Transaction group commits write the uberblock pointer — the atomic root of the pool's state. Interrupting this is extremely unlikely to corrupt the pool (ZFS has multiple uberblock copies), but a UPS makes it impossible.

Drive head parking

HDDs park their heads on power loss using rotational inertia. Repeated hard power cuts accelerate mechanical wear. A UPS eliminates hard shutdowns.

Graceful shutdown

The UPS signals the OS via USB/serial. apcupsd or nut triggers a clean shutdown: export pools, sync caches, power off. The pool comes back clean every time.

# Install and configure UPS monitoring (APC example)
dnf install apcupsd     # RHEL/CentOS/Fedora/Rocky
apt install apcupsd     # Debian/Ubuntu

# /etc/apcupsd/apcupsd.conf — key settings:
# UPSTYPE usb
# UPSCABLE usb
# TIMEOUT 60          — shutdown after 60 seconds on battery
# BATTERYLEVEL 15     — shutdown when battery reaches 15%
# MINUTES 5           — shutdown when 5 minutes of runtime remain

# For Network UPS Tools (NUT):
dnf install nut         # supports more UPS brands than apcupsd
# Configure ups.conf, upsd.conf, and upsmon.conf in /etc/ups/ (RHEL family) or /etc/nut/ (Debian/Ubuntu)

A $60 APC Back-UPS 600VA is enough for a home NAS. A $200 CyberPower 1500VA handles a full tower server. The cost of a UPS is trivial compared to the cost of losing a pool. Configure automatic shutdown. A UPS without shutdown scripting is just a very expensive power strip — it delays the inevitable but doesn't prevent it. Connect via USB, install apcupsd or NUT, and set it to shut down cleanly with 5 minutes of battery remaining.

Backplanes, expanders, and disk shelves

When you need more drives than your HBA's port count, you have three options: SAS expanders, multiple HBAs, or external disk shelves (JBODs).

Direct backplane

Each drive bay connects directly to an HBA port. One cable per 4 drives (SFF-8643 to 4x SATA/SAS). Best performance, lowest latency, limited by HBA port count (8–16 per card).

SAS expander

Multiplexes a single SAS connection to 24+ drives. Adds ~1–5 us latency. Ubiquitous in enterprise servers (Supermicro, Dell, HP chassis). Shared bandwidth — fine for HDDs, limiting for SSDs.

External JBOD

A separate disk shelf connected via external SAS (SFF-8644). Supports 12–90 drives per shelf. Use for large ZFS pools that outgrow the server chassis. Common models: Supermicro 847, Dell MD1420, NetApp DS4246.

NVMe JBOD

PCIe fabric shelves for NVMe drives. Expensive but eliminates SAS bandwidth bottleneck. For all-NVMe pools with 12+ drives. Emerging standard (NVMe-oF fabrics).

For homelab builders, a used Supermicro 846 or 847 chassis gives you 24 or 36 hot-swap bays with an integrated SAS2 expander for $200–400 on eBay. They're loud, heavy, and power-hungry, but nothing beats the bay count for the price. For quieter builds, the Fractal Design Define 7 XL fits 18 HDDs, each cabled directly to an HBA port (the case has no backplane). For enterprise, Dell MD1420 JBODs with SAS3 (12 Gb/s) give you 24 2.5-inch SAS bays per shelf.

Drive count vs redundancy

More drives means more failure probability. A pool with 24 drives is far more likely to experience a drive failure in any given year than a pool with 4 drives. The Annualized Failure Rate (AFR) compounds: if each drive has a 1% AFR, a 24-drive pool has a ~21% chance of at least one failure per year. This is why redundancy level must scale with drive count.
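The compounding is easy to reproduce for your own drive count. The sketch below assumes drive failures are independent, which real pools only approximate because drives from one batch in one chassis tend to fail together; `pool_failure_pct` is a hypothetical helper.

```shell
#!/bin/sh
# pool_failure_pct AFR_PERCENT DRIVES
# Probability (in percent) of at least one drive failure in a year:
#   P = 1 - (1 - AFR)^DRIVES
pool_failure_pct() {
    awk -v a="$1" -v n="$2" \
        'BEGIN { printf "%.1f\n", (1 - (1 - a / 100) ^ n) * 100 }'
}

pool_failure_pct 1 24     # 21.4
pool_failure_pct 1 4      # 3.9
```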

Drive count	Recommended minimum	Reasoning
1–2	Mirror	Mirror is the only option with 2 disks. Single disk = no redundancy, test only.
3–4	Mirror pairs	2 mirror pairs (4 disks) give excellent IOPS and simple expansion.
5–8	RAIDZ2 or mirror pairs	RAIDZ2 for capacity; mirror pairs for IOPS. Never RAIDZ1 with drives over 2TB.
9–12	RAIDZ2 or dRAID2	Resilver times start getting long. dRAID2 shines at 12 disks.
13–24	dRAID2 with 1–2 spares	Traditional RAIDZ resilvers take too long. dRAID's parallel rebuild is essential.
25+	dRAID2 or dRAID3 with 2+ spares	At this scale, triple parity may be justified. The probability of coincident failures is non-negligible.

Recommended builds by use case

Home NAS — 4–8 bay, budget-conscious

The $500–1000 NAS

CPU: Any modern 4-core (Intel i3/i5, AMD Ryzen 3/5, or used Xeon E3). AES-NI required if using encryption.
RAM: 16–32GB ECC DDR4 (used server RAM is cheap). 1GB per TB of storage is a reasonable starting point for ARC.
HBA: LSI 9207-8i or Broadcom 9300-8i ($20–60 used).
Drives: 4x WD Red Plus 8TB or 4x Seagate IronWolf 8TB in 2 mirror pairs (16TB usable). Or 4x used Ultrastar 14TB ($100 each) for 28TB usable.
SLOG: Not needed unless the pool serves heavy synchronous writes (NFS, iSCSI, databases) or you force sync=always.
Boot: 2x 120GB SATA SSD (mirrored ESP).
UPS: APC Back-UPS 600VA ($60). Configure apcupsd for auto shutdown.
Case: Fractal Design Define 7 (supports 14 HDDs) or Node 804 (compact, 8 bays).

Total: ~$600–900 with used enterprise drives. 16–28TB usable, mirrored, checksummed, compressed.

Production file server — 12–24 bay, enterprise-grade

The $3,000–8,000 server

CPU: Xeon Silver/Gold or EPYC 7003/9004 series. 8+ cores. AES-NI + SHA-NI.
RAM: 128–256GB ECC DDR4/DDR5. Registered (RDIMM) for higher density.
HBA: Broadcom 9400-16i or 9500-16i. Tri-mode for future NVMe expansion.
Drives: 12x Seagate Exos X18 18TB in dRAID2:1s (one distributed spare plus double parity leaves at most 9x 18TB = ~162TB usable) or 6 mirror pairs (108TB usable, max IOPS).
Special vdev: 2x Intel Optane P4801X 375GB (mirrored). Metadata + small blocks on Optane.
SLOG: 2x Intel Optane P4801X 100GB (mirrored) if serving NFS/iSCSI. Otherwise not needed.
Boot: 2x 240GB enterprise SATA SSD (mirrored ESP).
UPS: CyberPower 2200VA or APC Smart-UPS 1500. Network-managed, NUT monitoring.
Chassis: Supermicro 846 (24 bay) or Dell R740xd (24 bay + 2 NVMe).

Total: ~$4,000–7,000 with new enterprise drives. Roughly 108–162TB usable with enterprise reliability.
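Usable-capacity figures for a build like this are easy to sanity-check. The sketch below assumes mirror pairs yield half the raw capacity, a single RAIDZ2 vdev loses two drives to parity, and dRAID2 capacity is approximately (drives - spares) x d / (d + 2), where d is the number of data disks per redundancy group. It ignores padding, metadata, and TB/TiB differences, and `usable_tb` is a hypothetical helper:

```shell
# Approximate usable capacity in TB, before filesystem overhead.
# Usage: usable_tb LAYOUT DRIVES SIZE_TB [DATA_PER_GROUP] [SPARES]
#   LAYOUT is mirror, raidz2, or draid2; the last two arguments apply to draid2 only.
usable_tb() {
    awk -v layout="$1" -v n="$2" -v size="$3" -v d="${4:-8}" -v s="${5:-0}" 'BEGIN {
        if (layout == "mirror")      tb = int(n / 2) * size
        else if (layout == "raidz2") tb = (n - 2) * size
        else if (layout == "draid2") tb = (n - s) * d / (d + 2) * size
        else { print "unknown layout: " layout > "/dev/stderr"; exit 1 }
        printf "%.1f\n", tb
    }'
}

usable_tb mirror 12 18        # 6 mirror pairs of 18TB drives
usable_tb draid2 12 18 9 1    # dRAID2, 9 data disks per group, 1 distributed spare
```

For twelve 18TB drives this prints 108.0 for mirror pairs and 162.0 for dRAID2 with one spare and 9 data disks per group. With 8 data disks per group the same pool comes out at 158.4.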

All-flash pool — maximum IOPS, minimum latency

The all-NVMe build

CPU: EPYC 9004 or Xeon Sapphire Rapids. PCIe 5.0 for maximum NVMe bandwidth. 16+ cores for ZSTD compression at scale.
RAM: 256–512GB ECC DDR5. All-flash pools benefit enormously from large ARC — SSD reads are fast but RAM reads are 100x faster.
HBA: Direct PCIe NVMe (no HBA needed) or Broadcom 9500-16i for U.2/U.3 drives.
Drives: 8x Samsung PM9A3 3.84TB in 4 mirror pairs (15.36TB usable, ~2M random IOPS). Or 8x Micron 7450 PRO for budget-conscious builds.
Special vdev: Not needed — metadata is already on NVMe. All reads are fast.
SLOG: 2x Intel Optane P4801X (if sync writes are significant). On an all-NVMe pool, the data drives are fast enough that a SLOG provides marginal benefit for most workloads.
UPS: Enterprise rack UPS. Double-conversion for clean power to sensitive NVMe electronics.

Total: ~$8,000–20,000+. For databases, VMs, and latency-sensitive workloads where nothing else will do.

The home NAS build is where 90% of people should start. Four drives in two mirror pairs, 16–32GB ECC RAM, a used HBA, and a UPS. It's simple, expandable (add more mirror pairs later), and rock-solid. Resist the temptation to over-engineer. You can always grow the pool by adding vdevs. You can't easily shrink it or change the topology. Start conservative, expand as needed.

Verify your hardware

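# Verify HBA is in IT mode (not RAID mode)
lspci -v | grep -i "LSI\|Broadcom\|SAS"
# Should show "Serial Attached SCSI controller", NOT "RAID bus controller"
# lspci only reports the PCI device class. To confirm the firmware type, run
# sas2flash -list (SAS2 cards) or sas3flash -list (SAS3 cards) and look for "(IT)" in the Firmware Product ID line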

# Verify disks are directly visible (not behind RAID)
ls -l /dev/disk/by-id/
# Should show scsi-* or ata-* entries for each physical disk

# Verify ECC RAM is active
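edac-util -s        # reports whether EDAC drivers are loaded (install edac-utils)
edac-util -v        # per-controller corrected/uncorrected error counts
dmidecode -t 16 | grep -i "error correction"
# Should show "Multi-bit ECC" or "Single-bit ECC" (type 16 is the memory array; type 17 has no such field)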

# Verify AES-NI and SHA-NI support
grep -o 'aes\|sha_ni' /proc/cpuinfo | sort -u

# Check for SMR drives
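smartctl -i /dev/sdX | grep -i "device model\|rotation rate\|zoned"
# smartctl has no reliable SMR flag: look up the reported model on the manufacturer's spec sheet.
# Recent smartctl versions may print a "Zoned Device" line, but not every drive-managed SMR disk declares it.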

# Verify drive health
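smartctl -H /dev/sdX
# FAILED = replace immediately. PASSED only means no vendor threshold has tripped,
# so also review reallocated and pending sector counts with smartctl -A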

# Check SSD wear
smartctl -a /dev/nvme0n1 | grep -i "percentage used"
# Percentage Used: 3% — drive is at 3% of rated endurance

# Verify UPS is communicating
apcaccess status     # for APC UPS with apcupsd
upsc myups@localhost # for NUT
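The per-drive health check can be wrapped in a loop. A minimal sketch, assuming `smartctl` is installed, the script runs as root, and the result strings match what current smartmontools prints ("test result: PASSED" for ATA/NVMe, "Health Status: OK" for SAS). The helper names are hypothetical:

```shell
# Classify `smartctl -H` output read from stdin as ok, FAILING, or unknown.
smart_verdict() {
    out=$(cat)
    if printf '%s\n' "$out" | grep -Eq 'test result: PASSED|Health Status: OK'; then
        echo "ok"
    elif printf '%s\n' "$out" | grep -q 'test result: FAILED'; then
        echo "FAILING"
    else
        echo "unknown"
    fi
}

# Print one verdict per drive. The globs only cover sda..sdz and the first
# namespace of each NVMe device, so widen them on larger systems.
check_all_drives() {
    for dev in /dev/sd? /dev/nvme?n1; do
        [ -e "$dev" ] || continue
        printf '%s: %s\n' "$dev" "$(smartctl -H "$dev" | smart_verdict)"
    done
}

# Run as root:  check_all_drives
```

An "unknown" verdict usually means the drive sits behind a controller that needs an explicit `-d` device type, which is itself a hint that disks are not being passed through raw.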

Hardware checklist

HBA in IT mode

Never hardware RAID. Broadcom 9300/9400/9500 or crossflashed LSI 9211/Dell H310.

ECC RAM

Strongly recommended. Mandatory for dedup. Budget 1GB per TB of storage as a starting point for ARC.

CMR drives only

Never SMR. Verify recording technology on manufacturer's spec sheet. WD Red Plus / IronWolf / Exos / Ultrastar.

SLOG: PLP required

Intel Optane (best), enterprise NVMe with PLP, or accept the risk of consumer NVMe + UPS.

Special vdev: mirrored

Unmirrored special vdev = unrecoverable pool on SSD failure. Always mirror.

Boot: mirrored

Two small SSDs for ESP. A single boot device is a single point of failure.

AES-NI

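Required for ZFS encryption at full speed. Present on most x86 CPUs since ~2011, but some budget Celeron, Pentium, and Atom parts from that era lack it, so verify on older hardware.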

UPS

Mandatory. Configure auto-shutdown via apcupsd or NUT. A UPS without shutdown scripting is a delay, not a solution.

Error recovery (TLER)

Use NAS or enterprise drives with 7-second error recovery timeout. Consumer drives with 60-second timeouts cause pool problems.
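On drives that support SCT Error Recovery Control, the timeout can be read with `smartctl -l scterc` and set with `smartctl -l scterc,70,70` (values are in tenths of a second, so 70 means 7 seconds). A small sketch that extracts the read timeout from that output; the parsing assumes the line layout current smartmontools prints, and `erc_read_timeout` is a hypothetical helper:

```shell
# Read `smartctl -l scterc` output on stdin and print the read timeout in
# tenths of a second, "disabled", or "unsupported" if no Read: line is found.
erc_read_timeout() {
    awk '
        /Read:/ {
            if ($2 == "Disabled") print "disabled"; else print $2
            found = 1
            exit
        }
        END { if (!found) print "unsupported" }
    '
}

# Typical use (as root; /dev/sdX is a placeholder):
#   smartctl -l scterc /dev/sdX | erc_read_timeout
#   smartctl -l scterc,70,70 /dev/sdX    # set read and write timeouts to 7 seconds
```

On many drives the setting does not survive a power cycle, so it is usually reapplied at boot. A drive that reports "unsupported" cannot be tuned this way.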

If you take away one thing from this page: ZFS does not need exotic hardware. It needs correct hardware. An IT-mode HBA, ECC RAM, CMR drives, and a UPS. That's it. Everything else — Optane SLOGs, special vdevs, all-NVMe pools — is optimization on top of a solid foundation. Get the foundation right first.
