ZFS vs Everything Else — the middleware graveyard.
Traditional Linux storage is a layer cake: partitions, volume managers, RAID arrays, filesystems, encryption wrappers, snapshot tools, caching layers — each with its own config syntax, failure modes, and on-call runbooks. ZFS replaces all of them with a single, integrated storage platform. This page is a comprehensive, honest comparison of ZFS against every legacy filesystem, volume manager, and RAID system you might be running today.
The philosophical difference
ZFS is not "just a filesystem" the way ext4 is a filesystem. ZFS is a storage platform. It merges the volume manager, RAID controller, filesystem, snapshot engine, replication system, caching layer, compression engine, and encryption subsystem into one coherent, transactional whole. Every component shares the same on-disk format, the same transaction model, and the same checksum tree.
Legacy Linux storage treats each layer as independent: mdadm doesn't know about
ext4. LVM doesn't know about LUKS. ext4
doesn't know about mdadm. When something breaks, you're debugging three or four
tools that have no awareness of each other. When you want a snapshot, you need LVM thin provisioning
or a separate tool like snapper. When you want replication, you need rsync
or borgbackup, which operate at the file level and scale poorly.
ZFS's integrated design means every operation — writes, checksums, compression, encryption,
snapshots, replication — happens in one atomic transaction group (TXG). There is no window
where the filesystem is inconsistent. There is no fsck. There is no "I hope the
RAID rebuild finishes before another disk dies."
The layer cake problem — ext4 + LVM + mdadm
The traditional Linux "enterprise" storage stack looks like this: mdadm assembles
physical disks into a RAID array. LVM carves that array into logical volumes.
LUKS encrypts each volume. ext4 or XFS sits on top.
That's four independent layers, each with its own tools, its own failure modes,
and zero awareness of the layers above or below it.
What goes wrong with the layer cake
Silent corruption propagates. ext4 has no checksums. If a disk returns corrupt data, ext4 stores it faithfully. mdadm has no way to know which copy is correct during a RAID1 rebuild — it picks one arbitrarily. LVM doesn't checksum anything. LUKS encrypts whatever it receives, corrupt or not. The corruption is now encrypted, replicated, and backed up. You discover it six months later when you try to open a file.
Snapshots are painful. LVM snapshots exist, but they're copy-on-write at the block level with severe performance degradation. LVM thin snapshots are better but add another layer of complexity and have their own failure modes. Neither integrates with replication.
Expansion is error-prone. Growing the stack means: growing the mdadm array, then
pvresize, then lvextend, then resize2fs or xfs_growfs.
Miss a step and you've got mismatched sizes. Shrinking is worse — do the steps in reverse order
or lose data.
Here's the same operation — creating redundant, encrypted storage — in both stacks:
# Legacy: mdadm + LUKS + LVM + ext4 (10 commands, plus config-file updates)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --detail --scan >> /etc/mdadm.conf
cryptsetup luksFormat /dev/md0
cryptsetup luksOpen /dev/md0 crypt0
pvcreate /dev/mapper/crypt0
vgcreate vg0 /dev/mapper/crypt0
lvcreate -L 100G -n data vg0
mkfs.ext4 /dev/vg0/data
mkdir -p /data
mount /dev/vg0/data /data
# ...plus fstab, crypttab, mdadm.conf, dracut/initramfs updates
# ZFS: one command
zpool create -o ashift=12 \
-O compression=lz4 -O atime=off -O encryption=aes-256-gcm \
-O keyformat=passphrase -O keylocation=prompt \
tank mirror /dev/sda /dev/sdb
What ZFS replaces — the complete list
/dev/mapper.
zfs mount -a handles the rest. Only EFI needs fstab.
zfs send/recv does block-level incremental replication. Only changed blocks. Delta-aware. Encrypted.
quota=50G on a dataset. One command. Tracks used, referenced, snapshot consumption. Done.
zfs set recordsize=1M tank/media. No remount. No reboot.
ZFS vs ext4 — the default vs the future
ext4 is the default filesystem on most Linux distributions. It's stable, fast, well-understood, and has been in production since 2008. It is also a filesystem and nothing else — no volume management, no RAID, no checksums, no snapshots, no replication.
| Feature | ext4 | ZFS |
|---|---|---|
| Architecture | Journaling filesystem only | Integrated volume manager + filesystem + RAID |
| Data checksums | Metadata only (metadata_csum, CRC32C); no data checksums | fletcher4 (default) or SHA-256 on every block, data and metadata |
| Self-healing | No — corrupt data stays corrupt | Yes — reads from good copy on checksum mismatch (mirrors/RAIDZ) |
| Snapshots | No (requires LVM thin) | Instant, zero-cost, unlimited, atomic |
| Replication | rsync (file-level, slow) | zfs send/recv (block-level, incremental, encrypted) |
| RAID | Requires mdadm or hardware RAID | Built-in mirrors, RAIDZ1/2/3, dRAID |
| Encryption | Requires LUKS wrapper | Native per-dataset AES-256-GCM |
| Compression | No | LZ4, ZSTD, gzip — transparent, per-dataset |
| Max filesystem size | 1 EiB (theoretical) | 256 ZiB (theoretical) |
| Max file size | 16 TiB | 16 EiB |
| Online shrink | No (offline only: unmount, then resize2fs) | No (pools cannot be shrunk) |
| fsck required | Yes — after unclean shutdown, can take hours on large volumes | No — always consistent due to copy-on-write + TXG |
| RAM requirements | Minimal | 1 GB per TB of storage (rule of thumb); more = better ARC |
| Kernel inclusion | In-tree since 2.6.28 | Out-of-tree DKMS module (license incompatibility) |
| Distro support | Universal | Ubuntu native; others via DKMS or kldload |
Where ext4 wins: simplicity, minimal resource usage, universal kernel inclusion, and the ability to shrink filesystems. For a 512 MB embedded device or a throwaway cloud instance that stores nothing important, ext4 is perfectly fine. It boots, it works, it's boring.
Where ext4 loses: everything else. No checksums means silent data corruption goes undetected. No snapshots means no quick rollback. No built-in RAID means you need mdadm. No built-in encryption means you need LUKS. No replication means you need rsync. Each addition is another layer with its own failure modes.
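To make "silent corruption goes undetected" concrete, here is a minimal sketch of what a checksumming layer does: record a hash when data is written, then compare on read. It uses sha256sum from GNU coreutils on a temp file as a stand-in for a per-block checksum; the function names are made up for this example, and this is an illustration of the idea, not how ZFS implements it internally.

```shell
# Return the SHA-256 of a file (stand-in for a per-block checksum).
checksum_of() {
    sha256sum "$1" | cut -d' ' -f1
}

# Succeed only if the file still matches the checksum recorded at write time.
verify_block() {
    [ "$(checksum_of "$1")" = "$2" ]
}

demo_dir=$(mktemp -d)
printf 'important payroll data\n' > "$demo_dir/block"
stored=$(checksum_of "$demo_dir/block")

# Simulate bit rot: overwrite one byte in place, leaving size and mtime-based
# heuristics with nothing obvious to notice.
printf 'X' | dd of="$demo_dir/block" bs=1 seek=3 conv=notrunc 2>/dev/null

if verify_block "$demo_dir/block" "$stored"; then
    echo "block verified"
else
    echo "checksum mismatch: corruption detected"
fi
rm -rf "$demo_dir"
```

A filesystem without data checksums has no equivalent of `stored` to compare against, so the modified byte is returned to the application as if it were valid.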
ZFS vs XFS — the metadata champion
XFS is the default filesystem on RHEL, CentOS, Rocky, and Fedora Server (Fedora Workstation defaults to Btrfs). It was designed by SGI for high-performance, large-scale storage. XFS excels at metadata performance: it handles millions of files in a single directory better than any other legacy filesystem. It's the only legacy filesystem that gives ZFS honest competition in some workloads.
| Feature | XFS | ZFS |
|---|---|---|
| Metadata performance | Excellent — B+ tree allocation groups, delayed allocation | Good — improved dramatically with special vdevs on SSD |
| Data checksums | No (metadata CRCs in v5 format since 2013, but no data checksums) | Yes — every block checksummed |
| Self-healing | No | Yes (with redundancy) |
| Snapshots | No | Yes — instant, unlimited |
| Reflinks / CoW copies | Yes (since kernel 4.9) — instant file copies | Yes — clones and snapshots are CoW |
| RAID | Requires mdadm or hardware RAID | Built-in |
| Compression | No | LZ4, ZSTD, gzip |
| Online grow | Yes (xfs_growfs) | Yes (zpool add or zpool attach) |
| Online shrink | No | No |
| Max filesystem size | 8 EiB | 256 ZiB |
| Parallel I/O | Excellent — allocation groups enable independent parallel writes | Good — multiple vdevs enable parallel I/O |
| Repair tool | xfs_repair — fast and reliable | No fsck needed — zpool scrub for proactive verification |
| Production history | Since 1994 (IRIX), Linux since 2001 | Since 2005 (Solaris), Linux since 2010 (ZoL) |
Where XFS wins: raw metadata throughput on workloads with millions of small files (mail servers, build caches, package repositories). XFS allocation groups allow truly parallel metadata operations across different regions of the disk. XFS is also battle-hardened in enterprise Linux — Red Hat has invested decades into xfs_repair and xfsprogs.
Where XFS loses: no checksums on data (only metadata CRCs), no snapshots, no built-in RAID, no compression, no encryption, no replication. XFS is an excellent filesystem. But it's only a filesystem. You still need the full layer cake around it.
ZFS vs Btrfs — the closest competitor
Btrfs is the only Linux filesystem that honestly competes with ZFS on features. It has copy-on-write, snapshots, checksums, built-in RAID, compression, and subvolumes. It's in-tree in the Linux kernel. On paper, it's everything ZFS is, but GPL-licensed and natively integrated. In practice, the story is more complicated.
| Feature | Btrfs | ZFS |
|---|---|---|
| License | GPL — in-tree kernel module | CDDL — out-of-tree DKMS |
| Copy-on-write | Yes | Yes |
| Checksums | CRC32C (default), SHA-256, BLAKE2b | fletcher4 (default), SHA-256, SHA-512, Skein, Edon-R, BLAKE3 |
| Self-healing | Yes (with redundancy) | Yes (with redundancy) |
| Snapshots | Yes — subvolume snapshots, writable | Yes — dataset snapshots (read-only) + clones (writable) |
| Compression | LZO, ZLIB, ZSTD | LZ4, GZIP, ZSTD, LZJB, ZLE |
| Encryption | No native (fscrypt proposed but unmerged) | Native AES-256-GCM per-dataset |
| RAID 0/1/10 | Stable | Stable (striped vdevs, mirrors) |
| RAID 5/6 (parity) | BROKEN — write hole, data loss risk | RAIDZ1/2/3 — stable since 2005, no write hole |
| Send/receive | Yes — subvolume-based incremental send | Yes — dataset-based incremental send |
| Deduplication | Out-of-band only, via userspace tools such as duperemove or bees | Inline (real-time) but RAM-hungry; block cloning since 2.2 |
| Quotas | qgroups (complex, historically buggy) | Simple per-dataset quota/refquota/reservation |
| Max filesystem size | 16 EiB | 256 ZiB |
| Online shrink | Yes | No |
| Device removal | Yes (btrfs device remove) | Limited (zpool remove works for single-disk and mirror top-level vdevs, not in pools that contain RAIDZ vdevs) |
| RAM requirements | Lower than ZFS | Higher — ARC wants RAM |
| Maturity | Declared stable in 2013; RAID5/6 still not production-ready in 2026 | Production since 2005 (Solaris); OpenZFS on Linux since 2013 |
Where Btrfs wins: kernel inclusion (no DKMS headaches), online shrink, device removal, lower RAM requirements, and writable snapshots by default. Btrfs subvolumes are also more flexible than ZFS datasets for certain container and flatpak workflows. SUSE has run Btrfs as the default root filesystem since 2014 — for RAID1 and single-disk configurations, it's genuinely production-ready.
Where Btrfs loses: the RAID5/6 write hole is the elephant in the room. Btrfs parity RAID has a known bug where a crash during a partial stripe write can produce inconsistent parity. This has been documented since 2013 and remains unfixed in 2026. If you need parity RAID, Btrfs is not an option. ZFS RAIDZ has never had this problem — its full-stripe writes and copy-on-write design make a write hole impossible.
Btrfs also lacks native encryption (you need LUKS underneath, defeating the integrated design). Btrfs qgroups are notoriously complex and have had performance regressions. And Btrfs has a history of data loss bugs in edge cases that has eroded trust, even as the codebase has matured significantly since 2020.
ZFS vs mdadm — software RAID
mdadm is the Linux software RAID implementation. It operates at the block layer,
below the filesystem. It knows nothing about the data it stores — just blocks.
This is both its strength (simplicity, flexibility) and its fatal flaw (no data integrity).
| Feature | mdadm | ZFS |
|---|---|---|
| RAID levels | 0, 1, 4, 5, 6, 10 | Stripe, mirror, RAIDZ1/2/3, dRAID |
| Data checksums | No — relies on disk firmware | Yes — every block |
| Write hole (RAID5/6) | Yes, unless a write journal device or (RAID5 only) the partial parity log is configured | No (copy-on-write eliminates the write hole by design) |
| Rebuild intelligence | Rebuilds entire disk, even empty space | Only rebuilds allocated blocks |
| Hot spare activation | Manual or mdadm.conf-based | Automatic (hot spares or dRAID distributed spares) |
| Scrub | echo check > /sys/block/md0/md/sync_action | zpool scrub tank — verifies checksums, repairs from good copies |
| Monitoring | mdmonitor daemon + email | zpool status, zed daemon, JSON events |
| Filesystem awareness | None — just blocks | Fully integrated — RAID and filesystem are one |
The write hole is mdadm's most dangerous problem. In RAID5/6, a power failure during a write can leave parity inconsistent with data. On next boot, mdadm has no way to know which blocks are correct. A write-intent bitmap only shortens the resync after a crash; it does not close the hole. Closing it takes a dedicated journal device (--write-journal) or, on RAID5, the partial parity log, and both cost write performance. ZFS's copy-on-write design rules out a write hole by construction: new data is always written to new locations, and the uberblock pointer is updated atomically.
# mdadm: create RAID1 + filesystem (multiple tools, multiple steps)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mkfs.ext4 /dev/md0
mount /dev/md0 /data
# ZFS: one command, same result, plus checksums + snapshots + compression
zpool create -o ashift=12 -O compression=lz4 tank mirror /dev/sda /dev/sdb
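For the monitoring row in the table above, a small sketch of a health check built on `zpool status -x`, which prints "all pools are healthy" when nothing needs attention. The `pool_health` function and its OK/UNKNOWN/ALERT labels are hypothetical names for this example; it reads the status text on stdin so it can be exercised without a real pool.

```shell
# Classify the text printed by `zpool status -x`.
# Reads the status text on stdin; prints OK, UNKNOWN, or ALERT.
pool_health() {
    status=$(cat)
    case "$status" in
        "all pools are healthy") echo "OK" ;;
        *"no pools available"*)  echo "UNKNOWN" ;;
        *)                       echo "ALERT" ;;
    esac
}

# Typical use on a host with OpenZFS installed:
#   zpool status -x | pool_health
echo "all pools are healthy" | pool_health
```

Anything other than the healthy message is treated as an alert on purpose: a degraded, faulted, or resilvering pool all print a full status report, and each of those deserves a look.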
ZFS vs hardware RAID — the vendor trap
Hardware RAID controllers (Dell PERC, HP Smart Array, LSI MegaRAID, Broadcom) move RAID computation to a dedicated chip. For decades, this was considered the "enterprise" approach. In 2026, hardware RAID is a liability for ZFS — and increasingly for everything else too.
| Feature | Hardware RAID | ZFS |
|---|---|---|
| Data checksums | No — RAID controller doesn't checksum data | Yes — end-to-end |
| Write hole | Mitigated by BBU/supercap (when battery is healthy) | Impossible by design |
| BBU dependency | Yes — dead battery = write hole returns | No battery needed |
| Controller failure | Need identical replacement controller or data is lost | Import pool on any machine with ZFS |
| Vendor lock-in | Proprietary on-disk format — locked to controller vendor | Open format — portable across any OpenZFS platform |
| Visibility | OS sees one virtual disk — no per-disk SMART, no per-disk errors | Full visibility into every disk — SMART, error counters, I/O stats |
| Snapshots | No | Yes |
| Cost | $200–$2000 per controller + battery | Free (use HBA in IT/JBOD mode) |
| Firmware bugs | Opaque — firmware bugs have caused silent data corruption | Open source — bugs are visible, reported, and fixed publicly |
The controller failure scenario is the killer. If your Dell PERC H740 dies,
you need another H740 (or compatible) to read the array. If that model is discontinued, you're
buying used cards on eBay and praying. With ZFS, you pull the disks, put them in any machine
running OpenZFS, and zpool import tank. Done.
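A rough sketch of that disk-move step, printing the command rather than running it. The only real decision is whether the pool was cleanly exported on the old host; if it was not (the usual case after a hardware failure), `zpool import` needs `-f`. The `import_cmd` helper and the "exported" keyword are invented for this example.

```shell
# Print the import command for a pool moved to a new host.
# $1: pool name
# $2: "exported" if `zpool export` ran on the old host, anything else otherwise
import_cmd() {
    if [ "$2" = "exported" ]; then
        echo "zpool import $1"
    else
        # The pool still looks in use by the old host, so force the import.
        echo "zpool import -f $1"
    fi
}

echo "zpool import"          # with no arguments: list pools found on attached disks
import_cmd tank not-exported
```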
The BBU dependency is the second killer. Hardware RAID controllers rely on a battery backup unit to protect the write cache during power failure. Batteries degrade. When the BBU reports degraded, the controller disables write-back caching and performance falls off a cliff. Or worse: the battery is dead but the controller doesn't report it, and you have an unprotected write cache. ZFS doesn't need a battery because copy-on-write never overwrites live data.
ZFS vs LVM — the volume manager
LVM2 is the standard Linux volume manager. It provides logical volumes, thin provisioning, snapshots (of a sort), and the ability to span or stripe across multiple disks. ZFS replaces LVM entirely — datasets are the equivalent of logical volumes, but with far more capabilities.
| Feature | LVM2 | ZFS |
|---|---|---|
| Thin provisioning | Yes (LVM thin) | Yes — datasets are thin by default |
| Snapshots | CoW snapshots (thick LVs: severe performance penalty; thin: better but complex) | Instant, zero overhead, unlimited |
| Snapshot performance | Classic LVM snapshots degrade performance 30–80% | Zero performance impact |
| Quotas | LV size is the quota | Per-dataset quota, refquota, reservation — granular control |
| Checksums | No | Yes |
| Compression | No | Yes |
| Shrink | Yes (lvreduce + resize2fs) | No |
| Complexity | PV → VG → LV → filesystem (four concepts) | Pool → dataset (two concepts) |
| Replication | No built-in replication | zfs send/recv |
Where LVM wins: online shrink (ZFS pools cannot shrink), deep integration with every Linux distro's installer, and simpler mental model for admins who only need basic volumes. LVM also integrates with LUKS and mdadm in well-documented ways.
Where LVM loses: LVM classic snapshots are notoriously slow — every write to the origin volume triggers a copy-on-write to the snapshot exception store, degrading performance by 30–80%. LVM thin snapshots are better but add significant complexity (thin pools, metadata volumes, autoextend thresholds). ZFS snapshots are free — zero performance impact, zero configuration.
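Because snapshots cost nothing up front, the usual pattern is to take one before every risky change. A dry-run sketch follows; it prints the commands instead of executing them. The `pre-<label>-<timestamp>` naming scheme and both function names are assumptions of this example, not a ZFS convention.

```shell
# Build a timestamped snapshot name: <dataset>@pre-<label>-YYYYmmdd-HHMMSS.
snapshot_name() {
    dataset=$1 label=$2
    echo "${dataset}@pre-${label}-$(date +%Y%m%d-%H%M%S)"
}

# Print the commands for a guarded change; nothing is executed here.
guarded_change_plan() {
    snap=$(snapshot_name "$1" "$2")
    echo "zfs snapshot $snap"
    echo "# ...apply the change, test it..."
    echo "zfs rollback $snap   # only if the change went badly"
}

guarded_change_plan tank/data upgrade
```

Note that `zfs rollback` targets the most recent snapshot by default; rolling back past newer snapshots requires `-r`, which destroys them.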
ZFS vs Ceph — local vs distributed
Ceph is a distributed storage system that provides block (RBD), object (RADOS), and file (CephFS) storage across a cluster of machines. Comparing ZFS to Ceph is comparing a local storage platform to a distributed one — they solve different problems, but the comparison comes up constantly because both are used for "serious" storage.
| Feature | Ceph | ZFS |
|---|---|---|
| Scope | Distributed across multiple nodes | Local to one machine (or replicated via send/recv) |
| Minimum nodes | 3 (for quorum) | 1 |
| Operational complexity | High — MON, OSD, MDS, MGR daemons; CRUSH maps; PG placement | Low — zpool and zfs commands |
| Self-healing | Yes — re-replicates on node failure | Yes — resilvers on disk failure |
| Snapshots | RBD snapshots, CephFS snapshots | Dataset snapshots |
| Scale | Petabytes across hundreds of nodes | Petabytes on a single node (practical limit ~2 PB) |
| Network dependency | Requires dedicated storage network (10GbE minimum, 25GbE recommended) | None — local I/O |
| Latency | Network-bound (100µs–1ms typical) | Disk-bound (10–100µs NVMe, 1–5ms HDD) |
| Use case | Multi-tenant cloud, OpenStack/Kubernetes PVs, geographically distributed data | Single-node servers, NAS, VM hosts, databases, workstations |
Ceph wins when you need data accessible across multiple machines simultaneously, when you need to survive entire node failures without service interruption, or when you're building a cloud platform that serves block storage to hundreds of VMs.
ZFS wins when you need local storage performance, operational simplicity, or you're running on a single machine. The two can be stacked, with Ceph OSDs placed on ZFS zvols, but BlueStore is designed for raw block devices and already does its own checksumming and compression, so this is an unusual layout rather than a common one.
zfs send/recv to a remote backup gives you 90% of the
resilience at 10% of the operational cost. Don't deploy Ceph unless you genuinely need
distributed storage.
ZFS vs DRBD — synchronous replication
DRBD (Distributed Replicated Block Device) provides synchronous block-level replication between two nodes. It's often used for database HA: primary writes to local disk and DRBD simultaneously replicates every write to the secondary. If the primary dies, the secondary has an identical copy.
| Feature | DRBD | ZFS send/recv |
|---|---|---|
| Replication mode | Synchronous (Protocol C) or async | Asynchronous (snapshot-based incremental) |
| RPO | Zero (sync mode — no data loss on failover) | Last snapshot interval (typically 1–15 minutes) |
| Write latency impact | Every write waits for remote acknowledge (adds network RTT) | None — replication is decoupled from writes |
| Bandwidth | Continuous — mirrors every write in real time | Batched — only transfers changed blocks per snapshot |
| Complexity | Moderate — DRBD resource config, Pacemaker/Corosync for failover | Low — cron job or sanoid/syncoid |
| Multi-target | Yes (DRBD 9 supports 2+ secondaries, but complex) | Yes — send to multiple targets trivially |
| Checksums | Network CRC only — no on-disk data checksums | Full on-disk checksums on both sides |
DRBD wins when you absolutely need zero RPO — database HA clusters where losing even one transaction is unacceptable. Synchronous replication guarantees the secondary has every committed write.
ZFS wins when you can tolerate a few minutes of potential data loss (which is
most workloads). zfs send/recv is dramatically simpler to operate, doesn't impact
write latency, and includes checksums on both sides. For most server replication,
syncoid --no-sync-snap tank/data user@remote:backup/data in a cron job is all you need (user, remote, and the dataset names are placeholders).
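Under the hood, snapshot-based replication comes down to one question: what is the newest snapshot both sides already have? That snapshot becomes the base for `zfs send -i`. A minimal sketch of that selection, assuming snapshot names without whitespace and lists ordered oldest first; the function name, host name, and snapshot names are all hypothetical.

```shell
# Print the newest snapshot name present in both lists (empty if none).
# $1: local snapshot names, oldest first, whitespace-separated
# $2: remote snapshot names, whitespace-separated
latest_common_snapshot() {
    common=""
    for snap in $1; do
        for remote in $2; do
            if [ "$snap" = "$remote" ]; then
                common=$snap
            fi
        done
    done
    echo "$common"
}

# On a real system the lists would come from something like:
#   zfs list -H -t snapshot -o name -s creation tank/data
# with the dataset prefix stripped so both sides are comparable.
local_snaps="hourly-01 hourly-02 hourly-03"
remote_snaps="hourly-01 hourly-02"
base=$(latest_common_snapshot "$local_snaps" "$remote_snaps")
echo "zfs send -i tank/data@$base tank/data@hourly-03 | ssh backup-host zfs recv backup/data"
```

If the function prints nothing, the two sides share no snapshot and the first transfer has to be a full send. Tools like syncoid handle this bookkeeping for you.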
The master comparison table
Every major feature across every storage technology in one table. This is the reference.
| Feature | ext4 | XFS | Btrfs | ZFS |
|---|---|---|---|---|
| Data checksums | No | No | Yes | Yes |
| Metadata checksums | CRC32C (metadata_csum) | CRC32C | Yes | Yes |
| Self-healing | No | No | Yes* | Yes |
| Copy-on-write | No | Reflink only | Yes | Yes |
| Snapshots | No | No | Yes | Yes |
| Compression | No | No | Yes | Yes |
| Encryption | No | No | No | Yes |
| Built-in RAID | No | No | Partial** | Yes |
| Volume management | No | No | Yes | Yes |
| Send/receive | No | No | Yes | Yes |
| Online shrink | No (offline only) | No | Yes | No |
| Kernel in-tree | Yes | Yes | Yes | No |
| Parity RAID stable | N/A | N/A | No | Yes |
| RAM hungry | No | No | Moderate | Yes |
| Boot support | Universal | Universal | GRUB only | GRUB or ZFSBootMenu |
* Btrfs self-healing requires RAID1/10 profiles. RAID5/6 is not reliable.
** Btrfs RAID 0/1/10 is stable. RAID 5/6 has a known write hole and is not production-safe.
RAID implementation comparison
| Feature | mdadm | Hardware RAID | Btrfs RAID | ZFS RAID |
|---|---|---|---|---|
| Write hole | Yes (bitmap mitigates) | BBU mitigates | Yes (RAID5/6) | No (CoW) |
| Checksums | No | No | Yes | Yes |
| Scrub intelligence | Block-level check only | Controller-dependent | Checksum-verified | Checksum-verified + auto-repair |
| Rebuild speed (12 disks) | Hours | Hours | Hours | Minutes (dRAID) or hours (RAIDZ) |
| Portability | Any Linux | Same controller model only | Any Linux with Btrfs | Any OS with OpenZFS |
| Per-disk visibility | Yes | No (controller abstracts) | Yes | Yes |
| Mixed disk sizes | Uses smallest | Uses smallest | Flexible | Uses smallest per vdev |
| Hot spare | Configured in mdadm.conf | Controller config | Manual | Pool property + dRAID distributed spares |
Migration paths to ZFS
You can't convert an existing filesystem to ZFS in-place. Migration always involves creating a new ZFS pool and copying data. Here are the practical paths for each legacy system.
From ext4 / XFS (single disk or LVM)
# 1. Create ZFS pool on new disk(s)
zpool create -o ashift=12 -O compression=lz4 -O atime=off tank mirror /dev/sdc /dev/sdd
# 2. Copy data preserving permissions, xattrs, ACLs
rsync -avxHAX --progress /old-mount/ /tank/data/
# 3. Verify
diff -r /old-mount/ /tank/data/
# 4. Update fstab/mountpoints, reboot, decommission old disks
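`diff -r` works for the verify step but stops being pleasant on large trees. One alternative is to compare content-hash manifests of both sides. The sketch below assumes GNU coreutils and file names without newlines, and it only checks file contents, not permissions, xattrs, or ACLs; the function names are made up for this example.

```shell
# Print a "hash  path" manifest, sorted by path, for every regular file
# under a directory.
tree_manifest() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

# Succeed only if both trees hold the same files with the same contents.
trees_match() {
    [ "$(tree_manifest "$1")" = "$(tree_manifest "$2")" ]
}

# Typical use after the rsync step, with the paths from the steps above:
#   trees_match /old-mount /tank/data && echo "copy verified"
```

Saving the manifest of the old filesystem to a file also gives you a record to check against later, after the old disks are gone.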
From mdadm RAID
# 1. Back up mdadm config
mdadm --detail --scan > /root/mdadm-backup.conf
# 2. Create ZFS pool on separate disks
zpool create -o ashift=12 -O compression=lz4 tank mirror /dev/sde /dev/sdf
# 3. Copy data
rsync -avxHAX /old-raid-mount/ /tank/data/
# 4. Stop mdadm array after verification
mdadm --stop /dev/md0
mdadm --zero-superblock /dev/sda1 /dev/sdb1
# 5. Optionally add old disks to the ZFS pool
zpool add tank mirror /dev/sda /dev/sdb
From hardware RAID
# 1. Copy data to external storage (mounted here at /backup)
rsync -avxHAX /old-mount/ /backup/
# 2. Flash RAID controller to IT/HBA mode (or replace with HBA)
# Dell PERC: use Dell firmware utility
# LSI: use sas2flash or sas3flash
# This exposes raw disks to the OS
# 3. Create ZFS pool on the now-exposed raw disks
zpool create -o ashift=12 -O compression=lz4 \
tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd
# 4. Restore data
rsync -avxHAX /backup/ /tank/data/
From Btrfs
# 1. Create ZFS pool on new disks
zpool create -o ashift=12 -O compression=zstd tank mirror /dev/sdc /dev/sdd
# 2. Take a read-only snapshot for a consistent source, then copy it with rsync
# (btrfs send streams can only be received by Btrfs, not ZFS)
zfs create tank/data
btrfs subvolume snapshot -r /btrfs-mount /btrfs-mount/migration-snap
rsync -avxHAX /btrfs-mount/migration-snap/ /tank/data/
# 3. Recreate subvolume structure as ZFS datasets
zfs create tank/data/home
zfs create tank/data/var
rsync -avxHAX /btrfs-mount/home/ /tank/data/home/
rsync -avxHAX /btrfs-mount/var/ /tank/data/var/
When NOT to use ZFS
ZFS is the right answer for most storage needs. But not all. Here are the cases where legacy tools are simpler or more appropriate. Being honest about this makes the rest of the page more credible.
Embedded systems / tiny VMs
ZFS wants RAM. The ARC alone consumes 1–4 GB on a typical system. If you're running a 256 MB container or a resource-constrained IoT device, ext4 is the right choice. Don't force ZFS into environments where RAM is precious and data integrity isn't critical.
Ephemeral cloud instances
A spot instance that lives for 20 minutes to run a batch job doesn't need ZFS. The instance will be destroyed before checksums or snapshots provide any value. Use whatever the AMI ships with (usually ext4 or XFS).
Windows-only environments
OpenZFS on Windows exists but is experimental. If your entire infrastructure is Windows Server and you need a filesystem, NTFS or ReFS is the practical choice. Don't shoehorn ZFS into a Windows shop.
/boot and EFI system partition
UEFI firmware reads the ESP as FAT32. GRUB reads /boot as ext4 (or Btrfs or ZFS, but with caveats). For maximum compatibility, keep /boot on ext4 and ESP on FAT32. These are small, static partitions where ZFS provides no benefit.
Environments where kernel DKMS is forbidden
Some security-hardened or compliance-bound environments prohibit out-of-tree kernel modules. ZFS on Linux ships as one, either built by DKMS or as a kABI-tracking kmod RPM. If your security policy bans out-of-tree modules, you can't run ZFS. Btrfs or ext4 are your options.
When you need online shrink
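ZFS pools cannot be shrunk in the general case: zpool remove can evacuate single-disk and mirror top-level vdevs, but not in pools that contain RAIDZ vdevs. If your workflow requires regularly reclaiming pool space by removing disks, Btrfs or LVM handles this better. Plan ZFS pools to grow, not shrink.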
Operational complexity comparison
One of ZFS's biggest advantages isn't a feature — it's the operational simplicity of having one tool instead of five. Here's what common storage tasks look like in each stack.
| Task | Legacy stack | ZFS |
|---|---|---|
| Create redundant storage | mdadm --create + pvcreate + vgcreate + lvcreate + mkfs (5 commands) | zpool create tank mirror sda sdb (1 command) |
| Take a snapshot | lvcreate --snapshot (LVM) or install snapper/timeshift | zfs snapshot tank/data@now |
| Replicate to remote | rsync -avz /data/ remote:/backup/ (file-level, slow) | zfs send -i @prev tank/data@now \| ssh remote zfs recv backup/data |
| Check data integrity | No tool (ext4/XFS have no data checksums) | zpool scrub tank |
| Replace failed disk | mdadm --manage /dev/md0 --remove + --add + wait for rebuild | zpool replace tank /dev/sda /dev/sde |
| Enable compression | Not possible (ext4/XFS don't support it) | zfs set compression=zstd tank/data |
| Set quota | edquota (per-user) or LV size limit | zfs set quota=100G tank/data |
| Expand storage | mdadm --grow + pvresize + lvextend + resize2fs (4 commands, order matters) | zpool add tank mirror sdc sdd (1 command, instant) |
| Rollback after bad update | Restore from backup (minutes to hours) | zfs rollback tank/root@before-update (seconds) |
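The "order matters" note in the Expand storage row is easier to see written out. This dry-run sketch prints the grow sequence for each stack without executing anything. Device, volume group, and logical volume names are placeholders, and `--size=max` assumes the array members have already been replaced with larger disks.

```shell
# Print the legacy grow sequence, bottom layer first.
# $1: md device, $2: volume group, $3: logical volume
legacy_grow_plan() {
    md=$1 vg=$2 lv=$3
    echo "mdadm --grow $md --size=max"        # 1. grow the array
    echo "pvresize $md"                        # 2. grow the LVM physical volume
    echo "lvextend -l +100%FREE /dev/$vg/$lv"  # 3. grow the logical volume
    echo "resize2fs /dev/$vg/$lv"              # 4. grow ext4 (xfs_growfs for XFS)
}

# Print the ZFS equivalent: add one more mirror vdev to the pool.
# $1: pool name, remaining arguments: the new disks
zfs_grow_plan() {
    pool=$1
    shift
    echo "zpool add $pool mirror $*"
}

legacy_grow_plan /dev/md0 vg0 data
zfs_grow_plan tank /dev/sdc /dev/sdd
```

Each legacy step only sees the layer directly beneath it, which is why running them out of order leaves the layers with mismatched sizes.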
Scalability limits
| Limit | ext4 | XFS | Btrfs | ZFS |
|---|---|---|---|---|
| Max filesystem size | 1 EiB | 8 EiB | 16 EiB | 256 ZiB |
| Max file size | 16 TiB | 8 EiB | 16 EiB | 16 EiB |
| Max files | 4 billion (fixed at mkfs) | 2^64 | 2^64 | 2^48 |
| Max filename length | 255 bytes | 255 bytes | 255 bytes | 255 bytes |
| Max snapshots | N/A | N/A | Unlimited (subvolume-based) | Unlimited (2^64 theoretical) |
| Max disks per pool | N/A | N/A | N/A | Hundreds (practical), limited by memory |
In practice, you'll hit hardware limits (RAM, CPU, disk count) long before you hit ZFS's theoretical limits. The practical ceiling for a single ZFS pool is around 2 PB on current hardware — beyond that, you're looking at distributed solutions like Ceph or Lustre.
Performance characteristics
Raw sequential throughput is not where ZFS shines compared to ext4 or XFS. ZFS's performance advantages come from compression (less data written to disk), ARC caching (hot data served from RAM), and the special vdev (metadata on SSD). Here's an honest assessment.
Sequential writes
ext4 and XFS are 5–15% faster for raw sequential writes on the same hardware. ZFS's copy-on-write overhead and checksum computation add latency. However, with compression enabled (LZ4 compresses faster than disk I/O), ZFS often writes less data to disk, making it faster in practice for compressible workloads.
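A back-of-the-envelope version of that compression argument, as a sketch only. It assumes the disk is the bottleneck and the compressor keeps up, and it ignores copy-on-write and checksum overhead. The 200 MB/s disk and 1.5x ratio are hypothetical numbers, not measurements.

```shell
# Idealized logical write rate when data is compressed before it hits disk.
# $1: raw disk throughput in MB/s
# $2: compression ratio times 100 (150 means 1.5x)
logical_write_mbps() {
    echo $(( $1 * $2 / 100 ))
}

logical_write_mbps 200 150   # hypothetical 200 MB/s disk at 1.5x
logical_write_mbps 200 100   # incompressible data: no gain
```

The second call is the important caveat: on already-compressed data (video, archives) the ratio is close to 1.0 and the advantage disappears.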
Sequential reads
Roughly equivalent across all filesystems. The bottleneck is disk speed, not filesystem overhead. ZFS's ARC gives it an advantage for repeated reads of the same data.
Random IOPS (mirrors)
ZFS mirrors perform comparably to mdadm RAID1 for random I/O. The ARC gives ZFS an edge for read-heavy workloads with a hot working set. For pure random writes, ext4 on mdadm has a slight edge due to less copy-on-write overhead.
Random IOPS (RAIDZ)
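RAIDZ has a significant random I/O penalty: every block is spread across all the disks in the vdev, so a RAIDZ vdev delivers roughly the random IOPS of a single disk. This is not a ZFS bug; it follows from the same full-stripe, copy-on-write writes that eliminate the write hole. For random I/O workloads, use mirrors. See Pool Design for details.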
Metadata operations
XFS wins for raw metadata throughput (millions of creates/deletes in a single directory). ZFS is competitive with a special vdev on SSD. Without a special vdev on spinning disks, ZFS metadata operations can be 2–5x slower than XFS.
Memory tradeoff
ZFS uses RAM aggressively for ARC. This is a feature, not a bug — unused RAM is wasted RAM.
ARC is adaptive and releases memory under pressure. But on systems with 4–8 GB RAM,
the ARC competes with application memory. Set zfs_arc_max to cap it.
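zfs_arc_max takes a value in bytes, which is easy to get wrong by hand. A small sketch that converts a cap in GiB and prints the module option line; the helper names are made up for this example, and /etc/modprobe.d/zfs.conf is the conventional location on Linux rather than a requirement.

```shell
# Convert a cap in GiB to the byte value zfs_arc_max expects.
arc_max_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print the persistent module option line for /etc/modprobe.d/zfs.conf.
arc_max_modprobe_line() {
    echo "options zfs zfs_arc_max=$(arc_max_bytes "$1")"
}

arc_max_modprobe_line 4
# Runtime change without a reboot (as root, with the zfs module loaded):
#   echo "$(arc_max_bytes 4)" > /sys/module/zfs/parameters/zfs_arc_max
```

On an 8 GB machine, a 2 to 4 GiB cap is a common starting point; the right number depends on what else the host runs.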
The verdict
ZFS is not just a filesystem — it's a storage platform. It replaces the entire legacy stack: partitions, volume managers, RAID arrays, encryption wrappers, snapshot tools, caching layers, integrity checkers, and replication systems. One tool. One command syntax. One failure domain.
The legacy tools aren't bad. ext4 is rock-solid. XFS is fast. mdadm works. LVM is flexible. But they're independent layers that don't talk to each other, don't checksum data, and require you to be the integration layer. You are the glue. You are the one who has to remember the right order of operations for expanding storage, the right incantation for LVM snapshots, the right flags for mdadm --grow.
ZFS eliminates the glue. It gives you an integrated system where every component — RAID, volume management, filesystem, checksums, compression, encryption, snapshots, replication — is designed to work together from the ground up. Once you've operated a ZFS system, the legacy layer cake feels like what it is: the past.