Hardware Selection — ZFS is only as good as the metal it runs on.
ZFS is highly dependent on disk latency and throughput. The wrong hardware doesn't just slow things down — it can cause data loss. Consumer SSDs, hardware RAID controllers, and SMR drives are the three most common hardware mistakes. This page covers every hardware decision you need to make, from RAM to drives to power protection, with concrete part numbers and real-world trade-offs.
HBA vs RAID controller — the golden rule
NEVER use ZFS on top of hardware RAID. Use IT-mode HBAs only.
ZFS requires direct access to raw disks. It manages its own redundancy, checksumming, and caching. RAID controllers hide disk errors, alter write ordering, inject caching layers, and present virtual volumes that strip ZFS of its ability to detect and correct corruption. ZFS on hardware RAID is not just suboptimal — it is actively dangerous.
A Host Bus Adapter (HBA) in IT (Initiator Target) mode passes raw disks straight to the OS. No RAID logic, no write-back cache, no interference. ZFS sees every disk individually, reads S.M.A.R.T. data directly, detects every bad sector, and handles all redundancy itself. This is the only correct way to attach disks to a ZFS system.
Why hardware RAID is harmful to ZFS
Recommended HBA models
| Model | Interface | Ports | Speed | Notes |
|---|---|---|---|---|
| Broadcom 9500-16i | SAS/SATA/NVMe | 16 internal | 12 Gb/s SAS, PCIe 4.0 | Current generation. Tri-mode: SAS, SATA, and NVMe on the same card. The gold standard for new builds. |
| Broadcom 9400-16i | SAS/SATA/NVMe | 16 internal | 12 Gb/s SAS, PCIe 3.1 | Tri-mode. Widely available, excellent Linux support. Great for mixed SAS+NVMe pools. |
| Broadcom 9300-8i | SAS/SATA | 8 internal | 12 Gb/s SAS, PCIe 3.0 | Workhorse. Native IT mode, no crossflashing needed. Abundant and cheap on the used market. |
| LSI 9211-8i | SAS/SATA | 8 internal | 6 Gb/s SAS, PCIe 2.0 | The classic ZFS HBA. Must be crossflashed to IT mode (ships in IR/RAID mode). Cheap, proven, ubiquitous. PCIe 2.0 limits throughput on all-SSD pools. |
| LSI 9207-8i | SAS/SATA | 8 internal | 6 Gb/s SAS, PCIe 3.0 | Successor to the 9211-8i (SAS2308 chip) that ships in IT mode, so no crossflashing is needed. Drop-in replacement. |
| Dell H310 (crossflashed) | SAS/SATA | 8 internal | 6 Gb/s SAS | Rebranded LSI 2008. Crossflash to IT mode for ZFS use. Extremely cheap from decommissioned Dell servers. |
Crossflashing a RAID controller to IT mode
Many servers ship with LSI-based RAID controllers (Dell PERC H310/H710, IBM M1015, etc.) that are actually LSI SAS2008 or SAS2208 chips with RAID firmware. You can replace the firmware with IT-mode (HBA) firmware to expose raw disks. The process varies by card but follows this pattern:
# General crossflash process (varies by card — research your specific model):
# 1. Boot into EFI shell or DOS
# 2. Erase existing firmware:
sas2flsh -o -e 6
# 3. Flash IT-mode firmware + BIOS:
sas2flsh -o -f 2118it.bin -b mptsas2.rom
# 4. Set SAS address (required — use the address from the card's sticker):
sas2flsh -o -sasadd 500605bxxxxxxxxx
# After reboot, verify IT mode:
lspci -v | grep -i "LSI\|Broadcom"
# Should show: "Serial Attached SCSI controller" (not "RAID bus controller")
ECC RAM — the real story
ECC RAM is strongly recommended but not required. This is the most debated topic in the ZFS community, and the nuance matters. ZFS checksums every block on disk and verifies those checksums on every read. If corruption is detected, ZFS corrects it from a redundant copy. This entire chain assumes the data in RAM is correct — because that's where the checksum is computed.
Without ECC, a single bit flip in RAM can corrupt data silently. ZFS computes a checksum over the corrupted data, writes the corrupted block and its now-valid-looking checksum to disk, and the corruption becomes permanent. This is not theoretical: a large-scale Google field study measured DRAM error rates of 25,000–70,000 FIT (failures per billion device-hours) per Mbit, with more than 8% of DIMMs seeing at least one correctable error per year. The errors cluster heavily on a minority of DIMMs, so a 64GB server running 24/7 may see none for years, or a steady stream once a module starts to degrade.
The counterpoint: every filesystem has this problem, not just ZFS. ext4 and XFS suffer the same RAM corruption risk — they just can't detect it at all. ZFS without ECC is still safer than any other filesystem without ECC, because at least ZFS can detect most corruption (the kind that happens on disk, during transfer, or in firmware). ZFS without ECC is vulnerable only to corruption that happens to be in RAM at the moment of checksum computation.
When ECC matters most
SAS vs SATA vs NVMe for ZFS
Three interfaces, three different trade-offs. ZFS works equally well with all three — the choice is about performance requirements, budget, and physical infrastructure.
| Feature | SATA | SAS | NVMe |
|---|---|---|---|
| Max bandwidth | 6 Gb/s (600 MB/s) | 12 Gb/s (1.2 GB/s) | PCIe 4.0 x4: 64 Gb/s (~7 GB/s usable) |
| Queue depth | 32 commands | 254 commands | Up to 64K queues, each up to 64K commands deep |
| Dual-port | No | Yes — multipath for HA | Some enterprise models |
| Hot-swap | Yes (with backplane) | Yes (native) | Yes (U.2/U.3/EDSFF) |
| Cable length | 1m max | 10m (SAS cables) | Short (direct PCIe) or via U.2/U.3 |
| Cost per TB (HDD) | $15–25 | $18–30 | N/A (no NVMe HDDs) |
| Cost per TB (SSD) | $50–80 | $60–100 | $50–90 |
| Best for | Budget NAS, homelabs | Enterprise storage, disk shelves | All-flash pools, SLOG, special vdev |
SATA is the budget choice. Every HBA supports it, every case has bays for it, and the drives are cheap. The 6 Gb/s bandwidth is sufficient for HDDs (which max out at ~250 MB/s) but becomes a bottleneck for SSDs. Fine for spinning-rust pools, limiting for all-flash.
SAS is the enterprise choice. Dual-port for multipath failover, longer cable runs for external disk shelves (JBODs), higher queue depth for workloads with many concurrent I/O operations. SAS HBAs accept both SAS and SATA drives, so SAS backplanes give you maximum flexibility. High-capacity SAS HDDs are 7,200 RPM nearline drives that share their mechanics with enterprise SATA models; the older 10K and 15K RPM SAS drives seek faster but offer lower capacity at higher cost.
NVMe is the performance choice. Direct PCIe connection eliminates the SAS/SATA protocol overhead entirely. Queue depth is orders of magnitude higher. For all-flash ZFS pools, SLOG devices, and special vdevs, NVMe is the right answer. The downside: each NVMe drive typically consumes a PCIe slot or M.2 connector, limiting drive count without an NVMe JBOD or a tri-mode HBA like the Broadcom 9400/9500 series.
Drive selection — enterprise vs consumer, CMR vs SMR
CMR vs SMR — the non-negotiable rule
NEVER use SMR drives with ZFS. CMR only.
SMR (Shingled Magnetic Recording) overlaps write tracks like shingles on a roof to increase density. Random writes require read-modify-write of entire bands (20–50MB), causing catastrophic performance collapse under sustained writes. Resilvers on SMR drives can take 3–5x longer than CMR equivalents. Scrubs crawl. ZFS write patterns — copy-on-write with scattered free-space allocation — are the worst possible workload for SMR.
CMR (Conventional Magnetic Recording) writes tracks independently. Every write is a direct operation with predictable latency. This is what ZFS expects.
How to identify SMR drives
Manufacturers have been caught shipping SMR drives under model names that previously used CMR. The WD Red (non-Plus) debacle in 2020 is the most famous example. Never trust marketing. Always verify.
# Check the drive's data sheet — look for recording technology
# Manufacturer spec sheets will list "CMR" or "PMR" (good) vs "SMR" (bad)
# Known SMR model families (as of 2025 — always verify current models):
# WD Red (WD40EFAX, WD60EFAX) — NOT the Red Plus or Red Pro
# Seagate Barracuda (many 2TB-8TB models, especially ST2000DM008)
# Seagate Archive (all models — designed for SMR)
# Toshiba P300 (some capacity points)
# Known CMR model families:
# WD Red Plus (WD40EFPX, WD80EFPX) — the "Plus" means CMR
# WD Red Pro (all models)
# WD Ultrastar / HGST (standard models; the few host-managed SMR SKUs, such as the DC HC620/HC650, are labeled as SMR)
# Seagate Exos (all models)
# Seagate IronWolf (all models — NAS-rated, CMR)
# Toshiba N300 (NAS-rated, CMR)
# Toshiba MG series (enterprise, CMR)
Enterprise vs consumer drives
| Feature | Consumer (WD Blue, Barracuda) | NAS (Red Plus, IronWolf) | Enterprise (Exos, Ultrastar) |
|---|---|---|---|
| Recording | Often SMR | CMR | CMR |
| Workload rating | ~55 TB/year | 180 TB/year | 550 TB/year |
| MTBF | ~300,000 hours | 1,000,000 hours | 2,500,000 hours |
| Vibration tolerance | Low | Medium (RV sensors) | High (RV sensors) |
| Warranty | 2 years | 3 years | 5 years |
| Error recovery (TLER/ERC) | Uncontrolled (up to 60s) | 7 seconds (configurable) | 7 seconds (configurable) |
| ZFS suitability | Not recommended | Good for homelab/NAS | Best for everything |
Error recovery time (TLER/ERC/CCTL) is the most overlooked spec. When a consumer drive encounters a read error, it retries aggressively for up to 60 seconds trying to recover the sector. During this time, the drive is unresponsive. ZFS (or the HBA) may declare the drive dead and kick it from the pool. Enterprise and NAS drives limit this to 7 seconds, then return an error and let ZFS handle it. This is exactly what you want — ZFS has a redundant copy and can fix the error instantly, but only if the drive reports the failure promptly.
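Some drives let you inspect or set the error recovery limit through SCT Error Recovery Control. The sketch below only builds the `smartctl` command string for a given limit; `scterc_cmd` is a hypothetical helper name, and whether a particular drive accepts the setting (or keeps it across a power cycle) is an assumption you would need to verify on your own hardware.

```shell
# Build (but do not run) the smartctl command that sets SCT Error Recovery Control.
# smartctl takes the read and write limits in tenths of a second.
# scterc_cmd is a hypothetical helper, not part of smartmontools.
scterc_cmd() {
    seconds=$1   # desired recovery limit in whole seconds, e.g. 7
    device=$2    # block device, e.g. /dev/sdX
    deci=$((seconds * 10))
    printf 'smartctl -l scterc,%d,%d %s\n' "$deci" "$deci" "$device"
}

# Print the command for the 7-second limit described above.
scterc_cmd 7 /dev/sdX

# To read the current setting on a real drive, run: smartctl -l scterc /dev/sdX
```

Printing the command first makes it easy to review before running it as root. On many drives the setting is lost at power-off, so it is commonly reapplied at boot.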
Used enterprise drives — the economics
Decommissioned datacenter drives (WD Ultrastar/HGST, Seagate Exos) are often available at 30–50% of new retail price. A used 14TB Ultrastar HC530 might cost $80–120 versus $250+ new. These drives were designed for 24/7 operation and have years of useful life remaining. The key is to verify their health before trusting them with data.
# Check S.M.A.R.T. data on used drives before building a pool
smartctl -a /dev/sdX
# Key fields to check:
# Reallocated_Sector_Ct — should be 0 or very low
# Current_Pending_Sector — should be 0
# Offline_Uncorrectable — should be 0
# Power_On_Hours — tells you age (8760 hours = 1 year)
# Spin_Retry_Count — should be 0
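# Run a full surface scan (takes hours to days on large drives)
badblocks -b 4096 -wsv /dev/sdX
# WARNING: -w is destructive and overwrites the whole drive with test patterns. Only on empty drives.
# badblocks uses a 32-bit block counter, so drives above 16TB may need a larger -b value (for example -b 8192).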
HDD form factors and capacity planning
3.5-inch drives dominate ZFS pools. They offer the highest capacity (up to 30TB+ per drive as of 2026), the best cost per terabyte, and are designed for continuous operation in multi-bay enclosures. Virtually all high-capacity NAS, nearline, and enterprise HDDs are 3.5-inch.
2.5-inch HDDs are largely irrelevant for ZFS. They max out at 2–5TB, are often SMR, and are being replaced by SSDs in every use case. Avoid them.
Capacity planning tip: Buy the largest drives you can afford. Fewer, larger drives mean fewer points of failure, less power consumption, fewer HBA ports, and simpler cabling. A mirror of 2x 18TB drives gives you 18TB usable with only 2 failure domains. Six 6TB drives in RAIDZ2 give you 24TB usable but with 6 failure domains and roughly 3x the probability of a drive failure, although RAIDZ2 survives any two of them.
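The usable-capacity figures in the tip can be reproduced with a small helper. This is a rough sketch: `usable_tb` is a hypothetical function name, it assumes equal-size disks in whole terabytes and two-way mirrors, and it ignores ZFS metadata, padding, and reserved space.

```shell
# usable_tb LAYOUT DISKS SIZE_TB
# Rough usable capacity in TB for one vdev layout (integer math only).
usable_tb() {
    layout=$1; disks=$2; size_tb=$3
    case "$layout" in
        mirror) echo $(( disks / 2 * size_tb )) ;;     # two-way mirror pairs
        raidz1) echo $(( (disks - 1) * size_tb )) ;;   # one disk of parity
        raidz2) echo $(( (disks - 2) * size_tb )) ;;   # two disks of parity
        raidz3) echo $(( (disks - 3) * size_tb )) ;;   # three disks of parity
        *) echo "unknown layout: $layout" >&2; return 1 ;;
    esac
}

usable_tb mirror 2 18   # prints 18
usable_tb raidz2 6 6    # prints 24
```

Real pools report somewhat less than these numbers, so treat the output as an upper bound when comparing layouts.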
SSD endurance — DWPD, TBW, and flash types
SSDs wear out as NAND cells degrade from repeated write/erase cycles. Endurance is rated in DWPD (Drive Writes Per Day over the warranty period) or TBW (Terabytes Written over the drive's lifetime). ZFS's copy-on-write design means every write is a new write: there are no in-place updates. This increases write amplification compared to filesystems that update blocks in place. Compression and recordsize tuning mitigate this, but SSD endurance matters more on ZFS than on ext4.
| Flash type | Endurance | Cost | ZFS use case |
|---|---|---|---|
| SLC (Single-Level Cell) | 100,000 P/E cycles | Very high (rare/obsolete) | SLOG (if you can find it). The gold standard for endurance. |
| Intel Optane (3D XPoint) | Effectively unlimited for ZFS workloads | High (discontinued, prices rising) | SLOG, special vdev. No NAND at all — byte-addressable, near-zero latency. The perfect SLOG device. |
| MLC (Multi-Level Cell) | 10,000 P/E cycles | High (mostly enterprise) | Enterprise data SSDs, SLOG, special vdev. |
| TLC (Triple-Level Cell) | 1,000–3,000 P/E cycles | Moderate | Data SSDs with DRAM cache. Fine for general-purpose all-flash pools. Acceptable for L2ARC. |
| QLC (Quad-Level Cell) | 100–1,000 P/E cycles | Low | Avoid for ZFS. Terrible write endurance, high write amplification, slow sustained writes. Barely acceptable as read-only L2ARC. |
DWPD vs TBW — what the numbers mean
DWPD tells you how many times you can overwrite the entire drive per day for the warranty period (typically 5 years). A 1.6TB drive rated at 3 DWPD can sustain 1.6TB x 3 = 4.8TB of writes per day for 5 years. That's 8,760 TBW total.
TBW is the simpler metric: total terabytes written over the drive's lifetime. A consumer 1TB NVMe rated at 600 TBW can sustain 328 GB/day for 5 years. Sounds like a lot, but a busy ZFS pool with sync writes, scrubs, and resilvers can burn through that faster than you'd expect.
# Check current SSD wear level
smartctl -a /dev/nvme0n1 | grep -i "percentage used\|data units written"
# Percentage Used: 3% — drive is at 3% of rated lifetime
# Data Units Written: 12,345,678 (each unit is 1,000 x 512 bytes = 512,000 bytes)
# Convert Data Units Written to TB:
# (data_units * 512000) / 1000000000000
# Example: 12345678 * 512000 / 1e12 = 6.32 TB written
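The DWPD and TBW arithmetic in this section can be scripted for comparing drives. A minimal sketch, assuming `awk` is available; the function names `tbw_from_dwpd` and `gb_per_day_from_tbw` are hypothetical, and the warranty period is passed in explicitly rather than assumed.

```shell
# tbw_from_dwpd CAPACITY_TB DWPD YEARS
# Total terabytes written implied by a DWPD rating over the warranty period.
tbw_from_dwpd() {
    awk -v c="$1" -v d="$2" -v y="$3" 'BEGIN { printf "%.0f\n", c * d * 365 * y }'
}

# gb_per_day_from_tbw TBW YEARS
# Daily write budget in GB implied by a TBW rating, rounded down.
gb_per_day_from_tbw() {
    awk -v t="$1" -v y="$2" 'BEGIN { printf "%d\n", int(t * 1000 / (365 * y)) }'
}

tbw_from_dwpd 1.6 3 5        # prints 8760
gb_per_day_from_tbw 600 5    # prints 328
```

Comparing the daily budget against the growth of "Data Units Written" over a week gives a quick estimate of how long a drive will last under the current workload.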
SLOG devices — the write intent log accelerator
The ZFS Intent Log (ZIL) records synchronous write transactions so they survive power loss. By default it lives on the data disks. A SLOG (Separate LOG device) moves the ZIL to a dedicated, fast, power-safe device. A SLOG only benefits workloads that issue synchronous writes: databases calling fsync(), NFS exports (and any dataset with sync=always), iSCSI targets, and VM storage over NFS/iSCSI.
A SLOG is not a write cache. Asynchronous writes (the default) bypass the ZIL entirely and go straight to the transaction group. Adding a SLOG to an async workload does nothing.
SLOG sizing
A SLOG only needs to hold 5–10 seconds of synchronous writes. That's the interval between ZFS transaction group (TXG) commits (default: 5 seconds, tunable via zfs_txg_timeout). After each TXG commit, the SLOG data is no longer needed.
For most workloads, a 16–32GB SLOG is plenty. Even under heavy NFS sync load, you'd need sustained 1+ GB/s of sync writes to need more than 16GB. A 400GB Optane P4801X as SLOG is massive overkill in capacity, but the latency characteristics make it ideal.
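The sizing rule reduces to one multiplication: sustained sync write rate times the number of seconds the SLOG must cover. A minimal sketch, assuming `awk` is available; `slog_gb_needed` is a hypothetical helper, and the 10-second default is the conservative end of the 5–10 second window described above.

```shell
# slog_gb_needed RATE_GB_PER_S [SECONDS]
# Space in GB needed to hold SECONDS of sync writes at the given rate.
slog_gb_needed() {
    awk -v r="$1" -v s="${2:-10}" 'BEGIN { printf "%.1f\n", r * s }'
}

slog_gb_needed 1.25      # one saturated 10GbE link for 10 s: prints 12.5
slog_gb_needed 1.6 10    # prints 16.0
```

Even a fully saturated 10GbE link fits within a 16GB SLOG under this estimate, which is why latency and power-loss protection matter more than capacity when choosing the device.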
Why consumer NVMe fails as SLOG
The entire purpose of the ZIL is to survive power loss. Consumer NVMe drives have volatile write-back caches — the drive reports a write as complete when it hits the DRAM cache, before it reaches NAND. On power loss, that cached data is gone. ZFS thinks the sync write was safely logged on the SLOG. It wasn't. The application (database, NFS client) thinks the write was committed. It wasn't. Data loss follows.
Power Loss Protection (PLP) means the drive has capacitors that provide enough energy to flush the volatile write cache to NAND during power loss. Enterprise NVMe and Intel Optane have this. Consumer NVMe does not.
SLOG device comparison
| Device | PLP | Latency | Endurance | Cost | Verdict |
|---|---|---|---|---|---|
| Intel Optane P4801X (100/200/375GB) | Yes | ~10 us | 60+ DWPD | $100–300 used | Best SLOG ever made. 3D XPoint, near-zero latency, effectively infinite endurance for ZIL workloads. |
| Intel Optane 900P/905P (280/480/960GB) | Yes | ~10 us | 10+ DWPD | $80–200 used | Excellent. Originally a consumer product but has PLP and Optane endurance. Overprovisions wonderfully as SLOG. |
| Samsung PM9A3 (960GB–7.68TB) | Yes | ~70 us | 1–3 DWPD | $100–400 | Very good. Enterprise NVMe with PLP. Higher latency than Optane but widely available and current-gen. |
| Micron 7450 PRO (480GB–3.84TB) | Yes | ~70 us | 1 DWPD | $80–300 | Good. Enterprise NVMe with PLP. Solid choice if Optane is unavailable. |
| Intel DC P4510/P5510 | Yes | ~80 us | 1–3 DWPD | $60–200 used | Good. Older enterprise NVMe. Cheap on the used market. Perfectly adequate for SLOG. |
| Samsung 990 Pro (consumer) | No | ~80 us | 600 TBW (1TB) | $80–120 | Not recommended for SLOG. No PLP. Acceptable only with a UPS and understanding of the risk. |
L2ARC devices — extending read cache to SSD
L2ARC is a read cache that extends the in-memory ARC onto a fast SSD. It helps when your working set exceeds RAM and the workload is read-heavy: file servers, media libraries, build caches. Any decent SSD works for L2ARC — it doesn't need PLP (it's a cache; loss is harmless) and write endurance requirements are moderate (L2ARC writes are sequential and compressible).
L2ARC sizing guidelines
Persistent L2ARC (the cache survives reboots) requires l2arc_rebuild_enabled=1 (on by default in 2.0+).
Good L2ARC devices: Any TLC SSD, ideally with a DRAM cache. Samsung 870 EVO (SATA), Samsung 980 Pro (NVMe), WD Black SN770 (NVMe, DRAM-less but TLC). Avoid QLC models such as the Crucial P3 Plus: sustained performance drops sharply as the drive fills. Enterprise SSDs are nice but overkill for L2ARC.
Special vdev devices — metadata on fast storage
The special vdev stores pool metadata (directory entries, block pointers, file attributes) and optionally small file blocks below the special_small_blocks threshold. On a spinning-rust pool, this means ls, find, du, container layer lookups, and database index reads hit the SSD instead of waiting for platter seeks. The improvement is 10–50x for metadata-heavy operations.
The special vdev MUST be mirrored.
If an unmirrored special vdev fails, the pool's metadata is gone. The pool is unrecoverable. ZFS will warn you, but it will let you create an unmirrored special vdev. Never do this.
Ideal devices: Intel Optane (best, with the lowest latency for random reads), enterprise NVMe with PLP (Samsung PM9A3, Micron 7450), or high-endurance consumer NVMe. The special vdev handles both reads and writes and holds the only copy of the metadata it stores, so PLP is more important here than for L2ARC. ZFS cannot rebuild a lost special vdev from the data vdevs.
# Add a mirrored special vdev — metadata + files under 64K go to SSDs
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
zfs set special_small_blocks=65536 tank
# For databases with 8K pages:
zfs set special_small_blocks=16384 tank/postgres
# Verify special vdev allocation:
zpool list -v tank
# Look for the "special" vdev and its allocated/free space
Boot devices — ESP and boot pool
A ZFS-on-root system needs two things outside the main pool: an EFI System Partition (ESP) for the bootloader, and optionally a small boot pool if your bootloader can't read ZFS directly. kldload uses a direct ZFS boot chain (kernel + initramfs on the root pool), so you only need the ESP.
Mirror your boot devices
If your single boot SSD fails, the system won't boot even though the ZFS pool is healthy on other disks. Mirror the ESP across two devices. For a dedicated boot setup, two small SSDs (120–256GB) in a ZFS mirror work perfectly. For an integrated setup, partition the first two data disks with a small ESP partition each and use efibootmgr to register both.
# kldload automatically mirrors the ESP across available disks
# To verify ESP mirrors:
# PARTTYPENAME (util-linux 2.35+) also lists ESPs that are not currently mounted
lsblk -o NAME,SIZE,TYPE,FSTYPE,PARTTYPENAME,MOUNTPOINT | grep -i efi
# Manual ESP mirror (if needed):
# 1. Create matching ESP partitions on both disks
# 2. Format both as FAT32
# 3. Copy bootloader files to both
# 4. Register both with efibootmgr:
efibootmgr -c -d /dev/sda -p 1 -L "Linux Boot 1" -l '\EFI\BOOT\BOOTX64.EFI'
efibootmgr -c -d /dev/sdb -p 1 -L "Linux Boot 2" -l '\EFI\BOOT\BOOTX64.EFI'
CPU requirements
ZFS is not CPU-intensive for basic operations, but specific features have real CPU requirements. Any modern x86_64 processor handles ZFS fine. The features that demand CPU attention are compression, encryption, and deduplication.
| Feature | CPU requirement | Notes |
|---|---|---|
| Compression (LZ4) | Minimal, any CPU | LZ4 compresses at roughly 500+ MB/s per core and decompresses at several GB/s. Close to free performance. Always enable it. |
| Compression (ZSTD) | Moderate — 1–2 cores | ZSTD is 2–5x slower than LZ4 but achieves 30–50% better ratios. Use for cold storage, backups. |
| Encryption (AES-256-GCM) | AES-NI required | Without AES-NI, encryption is 10–50x slower (software fallback). Nearly every x86 CPU since ~2011 has AES-NI, though some low-end Atom, Celeron, and Pentium parts of that era lack it. Verify with: grep aes /proc/cpuinfo |
| Deduplication | SHA256 + many cores | Dedup hashes every block. SHA extensions (SHA-NI, present on Zen 1+ and Ice Lake+) accelerate this 3–5x. Dedup is RAM-hungry more than CPU-hungry, but fast hashing helps. |
| Checksumming (fletcher4) | Minimal | Default checksum. Hardware-accelerated on all modern CPUs. Negligible overhead. |
| Checksumming (SHA-256) | SHA-NI helps | The usual checksum for dedup. With SHA-NI: ~2 GB/s per core. Without: ~500 MB/s per core. |
| Scrub/resilver | 1–2 cores | Scrubbing reads + checksums every block. CPU-bound on fast all-flash pools. On HDD pools, the disks are the bottleneck, not the CPU. |
# Verify CPU feature support
grep -o 'aes\|sha_ni\|avx2\|sse4_2' /proc/cpuinfo | sort -u
# aes — required for native ZFS encryption at full speed
# sha_ni — accelerates SHA-256 checksumming and dedup
# avx2 — accelerates various ZFS operations
# sse4_2 — CRC32 acceleration for checksumming
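When checking several hosts, it can be more convenient to list only the features that are missing. A minimal sketch, assuming an x86 Linux system where `/proc/cpuinfo` has a `flags` line; `missing_cpu_flags` is a hypothetical helper name.

```shell
# missing_cpu_flags "FLAG LIST" WANTED...
# Prints the wanted flags that do not appear as whole words in the flag list.
missing_cpu_flags() {
    flags=" $1 "          # pad with spaces so every flag is space-delimited
    shift
    missing=""
    for f in "$@"; do
        case "$flags" in
            *" $f "*) ;;                      # flag present
            *) missing="$missing $f" ;;       # flag absent
        esac
    done
    echo "${missing# }"
}

# On a live x86 Linux host: read the first "flags" line and report gaps.
cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)
missing_cpu_flags "$cpu_flags" aes sha_ni avx2
```

An empty line of output means all requested features are present. Matching whole words avoids false positives from longer flag names that contain a shorter one.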
Power protection — UPS is mandatory
Every ZFS system needs a UPS. No exceptions.
ZFS's copy-on-write design means a power loss during a write will never corrupt existing data — the old blocks are still intact until the new transaction group commits atomically. However, the ZFS Intent Log (ZIL) assumes the write medium is power-safe. If your ZIL is on data disks with volatile write caches, a power loss can lose acknowledged sync writes. If your SLOG lacks PLP, the same applies. A UPS eliminates this entire class of failure.
Beyond ZIL safety, a UPS protects against:
apcupsd or nut triggers a clean shutdown: export pools, sync caches, power off. The pool comes back clean every time.
# Install and configure UPS monitoring (APC example)
dnf install apcupsd # RHEL/CentOS/Fedora/Rocky
apt install apcupsd # Debian/Ubuntu
# /etc/apcupsd/apcupsd.conf — key settings:
# UPSTYPE usb
# UPSCABLE usb
# TIMEOUT 60 — shutdown after 60 seconds on battery
# BATTERYLEVEL 15 — shutdown when battery reaches 15%
# MINUTES 5 — shutdown when 5 minutes of runtime remain
# For Network UPS Tools (NUT):
dnf install nut # supports more UPS brands than apcupsd
# Configure ups.conf, upsd.conf, and upsmon.conf in /etc/ups/ (RHEL family) or /etc/nut/ (Debian/Ubuntu)
Backplanes, expanders, and disk shelves
When you need more drives than your HBA's port count, you have three options: SAS expanders, multiple HBAs, or external disk shelves (JBODs).
Drive count vs redundancy
More drives means more failure probability. A pool with 24 drives is far more likely to experience a drive failure in any given year than a pool with 4 drives. The Annualized Failure Rate (AFR) compounds: if each drive has a 1% AFR, a 24-drive pool has a ~21% chance of at least one failure per year. This is why redundancy level must scale with drive count.
| Drive count | Recommended minimum | Reasoning |
|---|---|---|
| 1–2 | Mirror | Mirror is the only option with 2 disks. Single disk = no redundancy, test only. |
| 3–4 | Mirror pairs | 2 mirror pairs (4 disks) give excellent IOPS and simple expansion. |
| 5–8 | RAIDZ2 or mirror pairs | RAIDZ2 for capacity; mirror pairs for IOPS. Never RAIDZ1 with drives over 2TB. |
| 9–12 | RAIDZ2 or dRAID2 | Resilver times start getting long. dRAID2 shines at 12 disks. |
| 13–24 | dRAID2 with 1–2 spares | Traditional RAIDZ resilvers take too long. dRAID's parallel rebuild is essential. |
| 25+ | dRAID2 or dRAID3 with 2+ spares | At this scale, triple parity may be justified. The probability of coincident failures is non-negligible. |
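The ~21% figure for a 24-drive pool comes from the complement rule: the chance of at least one failure is one minus the chance that every drive survives. A minimal sketch, assuming `awk` is available and that drive failures are independent (real failures often correlate by batch, age, and environment); `pool_failure_pct` is a hypothetical helper name.

```shell
# pool_failure_pct AFR_PERCENT DRIVE_COUNT
# Probability (in percent) of at least one drive failing within a year.
pool_failure_pct() {
    awk -v afr="$1" -v n="$2" \
        'BEGIN { printf "%.1f\n", (1 - (1 - afr / 100) ^ n) * 100 }'
}

pool_failure_pct 1 24   # prints 21.4
pool_failure_pct 1 4    # prints 3.9
```

Because correlated failures make the true risk higher than this estimate, the result is best read as a lower bound when deciding on parity level and spare count.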
Recommended builds by use case
Home NAS — 4–8 bay, budget-conscious
The $500–1000 NAS
CPU: Any modern 4-core (Intel i3/i5, AMD Ryzen 3/5, or used Xeon E3). AES-NI required if using encryption.
RAM: 16–32GB ECC DDR4 (used server RAM is cheap). 1GB per TB of storage is a reasonable starting point for ARC.
HBA: LSI 9207-8i or Broadcom 9300-8i ($20–60 used).
Drives: 4x WD Red Plus 8TB or 4x Seagate IronWolf 8TB in 2 mirror pairs (16TB usable). Or 4x used Ultrastar 14TB ($100 each) for 28TB usable.
SLOG: Not needed unless serving NFS with sync=always.
Boot: 2x 120GB SATA SSD (mirrored ESP).
UPS: APC Back-UPS 600VA ($60). Configure apcupsd for auto shutdown.
Case: Fractal Design Define 7 (supports 14 HDDs) or Node 804 (compact, 8 bays).
Production file server — 12–24 bay, enterprise-grade
The $3,000–8,000 server
CPU: Xeon Silver/Gold or EPYC 7003/9004 series. 8+ cores. AES-NI + SHA-NI.
RAM: 128–256GB ECC DDR4/DDR5. Registered (RDIMM) for higher density.
HBA: Broadcom 9400-16i or 9500-16i. Tri-mode for future NVMe expansion.
Drives: 12x Seagate Exos X18 18TB in dRAID2:1s (2 parity + 1 distributed spare leaves 9x 18TB = ~162TB usable) or 6 mirror pairs (108TB usable, max IOPS).
Special vdev: 2x Intel Optane P4801X 375GB (mirrored). Metadata + small blocks on Optane.
SLOG: 2x Intel Optane P4801X 100GB (mirrored) if serving NFS/iSCSI. Otherwise not needed.
Boot: 2x 240GB enterprise SATA SSD (mirrored ESP).
UPS: CyberPower 2200VA or APC Smart-UPS 1500. Network-managed, NUT monitoring.
Chassis: Supermicro 846 (24 bay) or Dell R740xd (24 bay + 2 NVMe).
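For dRAID layouts, usable capacity can be approximated from the number of children, the data and parity disks per redundancy group, and the distributed spares. This is a rough sketch: `draid_usable_tb` is a hypothetical helper, and the formula ignores padding, metadata, and reserved space, so real pools report less.

```shell
# draid_usable_tb CHILDREN DATA PARITY SPARES SIZE_TB
# Approximate usable TB: (children - spares) * data / (data + parity) * size.
draid_usable_tb() {
    awk -v c="$1" -v d="$2" -v p="$3" -v s="$4" -v size="$5" \
        'BEGIN { printf "%.1f\n", (c - s) * d / (d + p) * size }'
}

draid_usable_tb 12 9 2 1 18   # 12 disks, 9 data + 2 parity, 1 spare: prints 162.0
draid_usable_tb 12 8 2 1 18   # same disks with 8 data per group: prints 158.4
```

The data-disks-per-group value is a design choice: wider groups give more capacity, narrower groups give better small-block efficiency and IOPS.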
All-flash pool — maximum IOPS, minimum latency
The all-NVMe build
CPU: EPYC 9004 or Xeon Sapphire Rapids. PCIe 5.0 for maximum NVMe bandwidth. 16+ cores for ZSTD compression at scale.
RAM: 256–512GB ECC DDR5. All-flash pools benefit enormously from large ARC — SSD reads are fast but RAM reads are 100x faster.
HBA: Direct PCIe NVMe (no HBA needed) or Broadcom 9500-16i for U.2/U.3 drives.
Drives: 8x Samsung PM9A3 3.84TB in 4 mirror pairs (15.36TB usable, ~2M random IOPS). Or 8x Micron 7450 PRO for budget-conscious builds.
Special vdev: Not needed — metadata is already on NVMe. All reads are fast.
SLOG: 2x Intel Optane P4801X (if sync writes are significant). On an all-NVMe pool, the data drives are fast enough that a SLOG provides marginal benefit for most workloads.
UPS: Enterprise rack UPS. Double-conversion for clean power to sensitive NVMe electronics.
Verify your hardware
# Verify HBA is in IT mode (not RAID mode)
lspci -v | grep -i "LSI\|Broadcom\|SAS"
# Should show "Serial Attached SCSI controller" — NOT "RAID bus controller"
# Verify disks are directly visible (not behind RAID)
ls -l /dev/disk/by-id/
# Should show scsi-* or ata-* entries for each physical disk
# Verify ECC RAM is active
edac-util -s # shows ECC error counts (install edac-utils)
dmidecode -t 16 | grep -i "error correction"
# Should show "Multi-bit ECC" or "Single-bit ECC"
# Verify AES-NI and SHA-NI support
grep -o 'aes\|sha_ni' /proc/cpuinfo | sort -u
# Check for SMR drives
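smartctl -i /dev/sdX | grep -i "device model\|rotation rate\|zoned"
# Look the model number up on the manufacturer's spec sheet. Drive-managed SMR is usually not reported by smartctl, although newer smartmontools releases may print a "Zoned Device" line.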
# Verify drive health
smartctl -H /dev/sdX
# PASSED = no failure predicted (still review the S.M.A.R.T. attributes), FAILED = replace immediately
# Check SSD wear
smartctl -a /dev/nvme0n1 | grep -i "percentage used"
# Percentage Used: 3% — drive is at 3% of rated endurance
# Verify UPS is communicating
apcaccess status # for APC UPS with apcupsd
upsc myups@localhost # for NUT