Compression & Deduplication — one is your friend, the other is a trap.
ZFS compression is the single best free lunch in storage. It saves disk space, reduces I/O,
and in most cases makes your system faster because the CPU can compress and
decompress data faster than the disk can read and write uncompressed blocks. There is almost
no reason to leave it off. kldload enables compression=lz4 on every pool by default
because not doing so would be negligent.
Deduplication is the opposite. It sounds brilliant — "never store the same block twice!" — and it will eat your RAM alive. This page covers both in depth: how they work, when to use them, and the specific algorithms that matter.
How ZFS compression works
ZFS compression operates per-block, inline, and transparently. Understanding these three properties is the key to understanding why it's safe to leave on everywhere.
Per-block
ZFS compresses each block independently — not files, not extents, not the entire dataset.
A 128K record (the default recordsize) is compressed as a single unit. If it compresses
well, ZFS stores the smaller result. If it doesn't, ZFS stores the original uncompressed block.
There is no cross-block dictionary and no decompression dependency chain.
Inline (write-path)
Compression happens in the write pipeline before the block hits disk. The CPU compresses the data in memory, and only the compressed result is written to the vdev. On reads, ZFS decompresses in memory after reading the (smaller) block from disk. Each block is stored in whichever form ZFS chose at write time, so you never need to "enable" or "run" compression as a separate step. It just happens.
Transparent
Applications see uncompressed data. ls -l shows the logical (uncompressed) file size.
du shows the physical (compressed) space used on disk. No application changes are needed.
No special mount flags. No userspace library. ZFS handles everything in the kernel.
The early abort optimization is what makes LZ4 essentially free. When ZFS compresses a block, it checks whether the compressed output is actually smaller. If the first portion of the block doesn't compress well (the algorithm detects incompressible data early), ZFS aborts the compression attempt and writes the block uncompressed. This means pre-compressed data (JPEG, MP4, encrypted files) incurs only a tiny CPU cost for the failed attempt — not the full compression pass. LZ4's early abort is particularly aggressive: it bails out within microseconds on incompressible data.
The performance difference between compression=off and compression=lz4
on incompressible data is within noise. Meanwhile, the compressible portions of your data get 2x
space savings for free.
Every compression algorithm in OpenZFS
OpenZFS supports seven algorithm families. In practice, you will use LZ4 or zstd. The others exist for historical or edge-case reasons.
zstd (without a level) uses level 3. Decompression is always fast regardless of compression level. Available since OpenZFS 2.0 (2020).
zstd-fast-1 through zstd-fast-1000. Higher numbers = faster compression, lower ratio. zstd-fast-1 is roughly LZ4 speed with slightly better ratio. zstd-fast-500 and above are faster than LZ4 but with minimal compression. Useful for high-throughput streaming workloads where you want marginal compression at near-zero cost.
If a dataset is still set to compression=lzjb, change it to lz4.
LZ4's early abort makes off unnecessary. The only real use case is regulatory environments where on-disk data must be byte-identical to the application's output.
gzip and lzjb
are dead algorithms walking. If you see them on an existing pool, the correct action is
zfs set compression=lz4 poolname (or zstd) and move on. They exist
in the codebase for backward compatibility, not because anyone should choose them in 2025+. zstd
at level 3 beats gzip-6 in both ratio and speed. At level 9, it beats gzip-9 in ratio while being
5–10x faster. There is no contest.
Algorithm comparison — speed, ratio, CPU cost
These are representative numbers from benchmarking on a 2-socket Xeon server (Sapphire Rapids, 64 cores) with mixed data (source code, binaries, logs, databases). Real-world results vary by workload. The relative ordering is consistent across hardware.
| Algorithm | Compress speed | Decompress speed | Typical ratio | CPU cost | Best for |
|---|---|---|---|---|---|
| lz4 | 4,500+ MB/s | 5,500+ MB/s | 2.0–2.5x | Negligible | Everything (default) |
| zstd-1 | 1,800 MB/s | 4,000 MB/s | 2.5–3.0x | Low | General purpose, better ratio |
| zstd-3 (default zstd) | 900 MB/s | 3,800 MB/s | 2.8–3.3x | Low–moderate | NAS, file servers, backups |
| zstd-7 | 350 MB/s | 3,500 MB/s | 3.0–3.8x | Moderate | Archival, cold storage |
| zstd-15 | 30 MB/s | 3,400 MB/s | 3.2–4.2x | High | Deep archival, write-once |
| zstd-19 | 8 MB/s | 3,300 MB/s | 3.3–4.5x | Very high | Maximum compression, rarely written |
| zstd-fast-1 | 3,500 MB/s | 4,200 MB/s | 2.2–2.6x | Negligible | Streaming, high-throughput ingest |
| zstd-fast-100 | 5,000+ MB/s | 5,000+ MB/s | 1.5–1.8x | Negligible | Marginal compression at max speed |
| gzip-1 | 180 MB/s | 700 MB/s | 2.5–3.0x | High | Legacy pools only |
| gzip-9 | 25 MB/s | 700 MB/s | 2.8–3.5x | Very high | Legacy pools only |
| lzjb | 1,200 MB/s | 1,800 MB/s | 1.5–2.0x | Low | Never (legacy only) |
| zle | 6,000+ MB/s | 6,000+ MB/s | 1.0–1.2x | Negligible | Zero-filled sparse data only |
Notice the asymmetry in zstd: decompression is always fast regardless of
the compression level used to write the data. This is a critical property. You can write
archival data with zstd-19 (very slow writes) and read it back at 3,300+ MB/s.
The compression level only penalizes writes. Reads are always cheap.
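To put the asymmetry in concrete terms, here is a minimal sketch that converts the table's throughput figures into wall-clock time for a 1 TB archive. The function name `transfer_seconds` is hypothetical, and the calculation assumes a single stream running at the quoted speed with 1 GB = 1000 MB.

```shell
#!/bin/sh
# transfer_seconds SIZE_GB SPEED_MBPS
# Prints the whole seconds needed to move SIZE_GB at SPEED_MBPS (1 GB = 1000 MB).
transfer_seconds() {
    awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%d\n", (gb * 1000) / mbps }'
}

# 1 TB archive: zstd-19 write (8 MB/s) vs. read-back (3,300 MB/s), per the table.
echo "write: $(transfer_seconds 1000 8) s"
echo "read:  $(transfer_seconds 1000 3300) s"
```

Under those assumptions the write takes about 35 hours and the read about five minutes, which is why the high levels only make sense for write-once data.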
The "always use LZ4" rule — and why it's correct
Rule: set compression=lz4 on every pool, every dataset, every time.
The only exceptions are datasets where you have measured and confirmed that a different algorithm provides meaningful benefit for your specific workload. "Meaningful" means measurable space savings or performance improvement, not theoretical. LZ4 is the correct default until proven otherwise.
Here's why. LZ4 has three properties that make it uniquely suitable as a default:
compression=lz4 is faster than compression=off because the I/O reduction outweighs the (negligible) CPU cost. You are literally getting more performance and more space for free.
# Enable LZ4 on a pool (inherits to all child datasets)
zfs set compression=lz4 tank
# Enable LZ4 at pool creation (the kldload way)
zpool create -o ashift=12 \
-O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
tank mirror /dev/sda /dev/sdb
# Verify compression is set
zfs get compression tank
# NAME PROPERTY VALUE SOURCE
# tank compression lz4 local
When to use zstd instead
zstd earns its place on specific workloads where the better compression ratio justifies the additional (but still modest) CPU cost. The key insight is that zstd's decompression is always fast — so the write penalty is the only cost, and for write-once/read-many data, that's a one-time expense.
zstd-7 or zstd-9 gives 30–50% better compression than LZ4 with decompression speed that's still faster than most storage. Perfect for zfs send | zfs receive targets.
zstd-3 (the default zstd level) compresses logs at 3–5x vs. LZ4's 2–3x. On a system generating 10GB/day of logs (70GB/week logical), that's roughly 10GB less on disk per week.
zstd-15 or zstd-19 squeezes maximum savings. Write speed doesn't matter when you're writing once and storing for years.
zstd-3 is an excellent choice for general file servers. On 1GbE (125 MB/s) the network is far slower than zstd-3's 900 MB/s compression speed, so the network is the bottleneck, not the CPU. On 10GbE (1,250 MB/s) the link can outrun that figure, so measure before assuming the same.
# Per-dataset compression policies
zfs set compression=lz4 tank # pool default
zfs set compression=zstd-3 tank/shares # NAS shares
zfs set compression=zstd-7 tank/backups # backup targets
zfs set compression=zstd-19 tank/archive # deep archive
zfs set compression=lz4 tank/vms # VMs need speed
zfs set compression=lz4 tank/databases # databases need speed
When to turn compression off
Almost never. But there are a few legitimate cases:
Regulatory/forensic requirements
Some compliance frameworks require that on-disk data be byte-identical to the application's output.
ZFS compression changes the on-disk representation. If your auditors require bit-for-bit on-disk
fidelity (rare, but it happens in forensics and some financial contexts), you need compression=off.
All-encrypted-payload datasets
A dataset that stores only AES-256 encrypted blobs (like encrypted backup chunks from Borg
or Restic with client-side encryption) won't compress at all. LZ4's early abort handles this fine, but if
you want to avoid even the ~5 microsecond early-abort overhead per block on a massive ingest pipeline,
compression=off is defensible. In practice, the difference is immeasurable.
zstd-fast levels explained
The zstd-fast family inverts the normal zstd level numbering. Higher numbers mean
less compression, more speed. These levels use zstd's "negative level" mode internally,
trading compression ratio for throughput.
In practice, zstd-fast is a niche tool. If you need speed, LZ4 is simpler and well-tested.
If you need ratio, zstd-1 through zstd-7 is the sweet spot. zstd-fast
occupies the narrow gap between "LZ4 speed" and "faster than LZ4 but worse ratio" — a gap that
rarely matters in real deployments.
Monitoring compression effectiveness
ZFS exposes compression statistics at every level: pool, dataset, and snapshot. Understanding these properties is essential for capacity planning and for deciding whether to change algorithms.
Key properties
A compressratio of 2.00x means you're storing 2GB of data in 1GB of disk space. This is the headline number.
used < logicalused when compression is working.
logicalreferenced / referenced gives you the true compression ratio for this specific dataset.
# Check compression ratio for the entire pool
zfs get compressratio tank
# NAME PROPERTY VALUE SOURCE
# tank compressratio 2.31x -
# Detailed compression stats for all datasets
zfs get compressratio,logicalused,used,compression -r tank
# NAME PROPERTY VALUE SOURCE
# tank compressratio 2.31x -
# tank logicalused 1.82T -
# tank used 812G -
# tank compression lz4 local
# tank/databases compressratio 1.89x -
# tank/databases logicalused 420G -
# tank/databases used 222G -
# tank/databases compression lz4 inherited from tank
# tank/logs compressratio 4.72x -
# tank/logs logicalused 89G -
# tank/logs used 18.8G -
# tank/logs compression zstd-3 local
# Pool-level compression stats
zpool get all tank | grep -i compress
# tank feature@lz4_compress active local
# tank feature@zstd_compress active local
# Quick space savings summary
zfs list -o name,logicalused,used,compressratio -r tank
The compressratio property covers all data currently stored in the dataset,
not just recent writes. If you change the compression algorithm,
the ratio will gradually shift as old data is overwritten with new (differently-compressed) data.
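The ratio can also be recomputed by hand from the byte counters, which is handy in scripts and capacity reports. A minimal sketch follows. The helper names `compress_ratio` and `space_saved_pct` are hypothetical, and the commented usage assumes a dataset called tank/databases.

```shell
#!/bin/sh
# compress_ratio LOGICAL_BYTES PHYSICAL_BYTES
# Prints logical / physical with two decimals, e.g. 1.89.
compress_ratio() {
    awk -v l="$1" -v p="$2" 'BEGIN { if (p <= 0) exit 1; printf "%.2f\n", l / p }'
}

# space_saved_pct LOGICAL_BYTES PHYSICAL_BYTES
# Prints the share of logical space that never had to be stored.
space_saved_pct() {
    awk -v l="$1" -v p="$2" 'BEGIN { if (l <= 0) exit 1; printf "%.1f\n", (1 - p / l) * 100 }'
}

# Usage against a live dataset (not run here). -H drops headers, -p prints exact bytes:
#   logical=$(zfs get -Hp -o value logicalused tank/databases)
#   physical=$(zfs get -Hp -o value used tank/databases)
#   compress_ratio "$logical" "$physical"

# The tank/databases figures from the listing above (420G logical, 222G used):
echo "ratio: $(compress_ratio 420 222)x, saved: $(space_saved_pct 420 222)%"
```

Because both helpers only divide two numbers, any unit works as long as both arguments use the same one.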
Inheritance & changing compression mid-stream
Compression is a dataset property that follows ZFS's standard inheritance model. Setting it on a parent dataset propagates to all children (unless they have a local override). But there's a critical nuance: changing the compression algorithm only affects new writes.
Existing data is not recompressed. When you change from lz4 to
zstd, blocks already on disk stay compressed with LZ4. Only newly written blocks use zstd.
This is by design — recompressing terabytes of data in place would be enormously expensive and
dangerous (power failure mid-recompression could corrupt data).
If you need to force recompression, you must rewrite the data: zfs send | zfs receive
to a new dataset, or copy files manually. Both approaches write new blocks with the new algorithm.
# Change compression on a dataset (only affects new writes)
zfs set compression=zstd-3 tank/shares
# Verify the change and check inheritance
zfs get compression -r tank
# NAME PROPERTY VALUE SOURCE
# tank compression lz4 local
# tank/databases compression lz4 inherited from tank
# tank/shares compression zstd-3 local <-- overridden
# tank/shares/hr compression zstd-3 inherited from tank/shares
# tank/vms compression lz4 inherited from tank
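# Force recompression via send/receive
# A plain send does not carry the source's local compression property,
# so set it explicitly on the receiving side with -o
zfs snapshot tank/shares@recompress
zfs send tank/shares@recompress | zfs receive -o compression=zstd-3 tank/shares-new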
# Then swap: rename tank/shares to tank/shares-old, tank/shares-new to tank/shares
# Verify, then destroy tank/shares-old
ZFS handles mixed-algorithm datasets transparently. A single dataset can contain blocks compressed with LZ4, zstd, gzip, and even uncompressed blocks — all in the same file. Each block's compression metadata is stored in the block pointer, so ZFS always knows how to decompress it. This is why algorithm changes are safe to make at any time.
To recompress existing data, use zfs send | zfs receive. It's explicit, safe, and you can verify
the result before destroying the original.
Real-world compression ratios by workload
These are ratios observed in production across dozens of kldload deployments. Your results will vary based on data content, but these give realistic expectations.
| Workload | LZ4 ratio | zstd-3 ratio | zstd-9 ratio | Notes |
|---|---|---|---|---|
| System logs (syslog, journald) | 3.5–5.0x | 5.0–8.0x | 6.0–10.0x | Highly repetitive text. Compression champion. |
| Application logs (JSON/structured) | 3.0–4.5x | 4.0–6.0x | 4.5–7.0x | JSON structure is very compressible. |
| Source code repositories | 2.5–3.5x | 3.0–4.5x | 3.5–5.0x | Text files compress well. Binary artifacts less so. |
| PostgreSQL / MySQL data | 1.8–3.0x | 2.2–3.5x | 2.5–4.0x | Depends on data types. Text-heavy schemas compress very well. |
| VM disk images (qcow2/raw) | 1.5–2.5x | 1.8–3.0x | 2.0–3.2x | Mixed content. Free space in VMs compresses well. |
| Docker layers | 2.0–3.0x | 2.5–4.0x | 3.0–4.5x | Lots of repeated OS files across layers. |
| Email (Maildir/mbox) | 2.0–3.5x | 2.5–4.0x | 3.0–4.5x | Text-heavy mail compresses well. Attachments less so. |
| Photos (JPEG/HEIF) | 1.00–1.05x | 1.00–1.05x | 1.00–1.05x | Already compressed. LZ4 early abort makes this free. |
| Video (H.264/H.265) | 1.00–1.02x | 1.00–1.02x | 1.00–1.02x | Already compressed. Zero benefit, near-zero cost. |
| Music (FLAC) | 1.00–1.03x | 1.00–1.03x | 1.00–1.03x | Already compressed. MP3/AAC even less compressible. |
| Encrypted volumes (LUKS, VeraCrypt) | 1.00x | 1.00x | 1.00x | Encrypted data is indistinguishable from random. Cannot compress. |
| Mixed NAS (home directories) | 1.5–2.5x | 1.8–3.0x | 2.0–3.2x | Blend of documents, media, and configs. |
Per-dataset compression policies
The correct strategy is: set LZ4 on the pool root, then override specific datasets where zstd provides meaningful benefit. This gives you sane defaults with targeted optimization.
# Recommended dataset layout with compression policies
# Pool root: lz4 (inherits to everything unless overridden)
zfs set compression=lz4 tank
# Hot data: lz4 (maximum speed, good compression)
zfs create -o compression=lz4 tank/vms
zfs create -o compression=lz4 tank/databases
zfs create -o compression=lz4 tank/containers
# Warm data: zstd-3 (better ratio, still fast)
zfs create -o compression=zstd-3 tank/shares
zfs create -o compression=zstd-3 tank/home
zfs create -o compression=zstd-3 tank/mail
# Cold data: zstd-7 or higher (maximum savings, write speed irrelevant)
zfs create -o compression=zstd-7 tank/backups
zfs create -o compression=zstd-7 tank/logs
zfs create -o compression=zstd-19 tank/archive
# Pre-compressed data: lz4 (early abort handles it)
zfs create -o compression=lz4 tank/media
zfs create -o compression=lz4 tank/iso-images
Note that tank/media still uses LZ4, not compression=off. Even on
a media dataset, there will be metadata files, NFO files, subtitle files, and cover art in
non-compressed formats. LZ4's early abort means the compressed media files cost nothing, while
the text files still get compressed. There's no reason to turn it off.
Special considerations
Databases (PostgreSQL, MySQL, MongoDB)
Database compression on ZFS is powerful but requires understanding the interaction between
ZFS's block-level compression and the database's page/record format. The key variable is
recordsize.
recordsize and database page alignment
PostgreSQL uses 8K pages. MySQL/InnoDB uses 16K pages. MongoDB's WiredTiger uses 32K or 64K blocks.
If ZFS's recordsize matches the database's page size, each ZFS block contains exactly
one database page, which compresses and decompresses independently. This avoids read-modify-write
amplification. The tradeoff is that a small record gives the compressor less data to work with, so
expect a somewhat lower ratio than at the 128K default.
Recommendation: Set recordsize=8k for PostgreSQL,
recordsize=16k for MySQL/InnoDB, recordsize=64k for MongoDB. Always use
compression=lz4 for databases — the speed advantage over zstd matters for latency-sensitive
workloads.
# PostgreSQL dataset
zfs create -o compression=lz4 -o recordsize=8k \
-o primarycache=metadata -o atime=off \
-o logbias=throughput tank/postgres
# MySQL/InnoDB dataset
zfs create -o compression=lz4 -o recordsize=16k \
-o primarycache=metadata -o atime=off \
tank/mysql
# MongoDB (WiredTiger) dataset
zfs create -o compression=lz4 -o recordsize=64k \
-o atime=off tank/mongodb
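The read-modify-write cost of a mismatched recordsize is simple arithmetic. The sketch below assumes the simplest model: updating one database page forces ZFS to rewrite the whole record that contains it. The function name `rmw_amplification` is hypothetical.

```shell
#!/bin/sh
# rmw_amplification RECORDSIZE_KB PAGE_KB
# Prints how many KB ZFS rewrites per KB the database changed when a single
# page is updated. Prints 1 when the record is no larger than the page.
rmw_amplification() {
    awk -v r="$1" -v p="$2" 'BEGIN { a = (r > p) ? r / p : 1; printf "%g\n", a }'
}

echo "PostgreSQL 8K page, 128K record: $(rmw_amplification 128 8)x"
echo "PostgreSQL 8K page, 8K record:   $(rmw_amplification 8 8)x"
echo "InnoDB 16K page, 128K record:    $(rmw_amplification 128 16)x"
```

Real workloads batch and cache writes, so treat these factors as an upper bound under this model, not a measurement.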
Virtual machines
VM disk images are mixed-content: the guest OS has system files (compressible), databases (compressible), and potentially media (incompressible). ZFS compresses each block independently, so compressible blocks get compressed and incompressible blocks get the early abort. On a typical Linux guest, expect 1.5–2.5x compression ratio.
# For Proxmox/libvirt with raw volumes: create the parent dataset first
zfs create -o compression=lz4 -o recordsize=64k tank/vms
# VM storage as a zvol with 64K block size (tank/vms must already exist)
zfs create -V 100G -o compression=lz4 -o volblocksize=64k tank/vms/web-01
# Check actual VM compression savings
zfs get compressratio,logicalused,used tank/vms/web-01
# NAME PROPERTY VALUE SOURCE
# tank/vms/web-01 compressratio 2.14x -
# tank/vms/web-01 logicalused 52.3G -
# tank/vms/web-01 used 24.4G -
Media files (photos, video, music)
Pre-compressed media (JPEG, H.264, H.265, FLAC, MP3, AAC) won't compress further. LZ4's early abort detects this within microseconds and writes the block uncompressed. The CPU cost is immeasurable. Leave LZ4 on. The metadata, subtitles, playlists, and thumbnails in the same dataset will still benefit from compression.
Compression + encryption interaction
ZFS native encryption (added in OpenZFS 0.8) has a critical interaction with compression: ZFS compresses first, then encrypts. This is the correct order and it matters enormously.
Compress-then-encrypt: why order matters
If you encrypt first, the output is pseudorandom — completely incompressible. Compression after encryption would waste CPU and save zero space. ZFS does it the right way: compress the plaintext data (which has patterns and redundancy), then encrypt the compressed result. You get full compression savings AND full encryption protection.
This means compression=lz4 and encryption=aes-256-gcm
work perfectly together. Set both. The compression ratio is identical to what you'd get without encryption.
# Create an encrypted dataset with compression (compression applies to plaintext)
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase \
-o compression=lz4 tank/encrypted-data
# Verify both are active
zfs get compression,encryption,compressratio tank/encrypted-data
# NAME PROPERTY VALUE SOURCE
# tank/encrypted-data compression lz4 local
# tank/encrypted-data encryption aes-256-gcm -
# tank/encrypted-data compressratio 2.18x -
Caveat: there is a theoretical information leakage concern with compress-then-encrypt. Because compressed block sizes vary based on content, an attacker who can observe the on-disk block sizes might infer something about the plaintext content's compressibility. This is the same class of attack as CRIME/BREACH in TLS. In practice, for at-rest storage encryption, this risk is negligible — the attacker would need to both access the raw disk and have a model of your data patterns. ZFS's variable block sizes already obscure individual file sizes. But if you're building a system for a three-letter agency, be aware of it.
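The ordering argument is easy to see on any machine, even one without ZFS. In the rough sketch below, gzip stands in for LZ4 and bytes from /dev/urandom stand in for AES ciphertext. Both substitutions are assumptions made for illustration, and `compressed_size` is a hypothetical helper.

```shell
#!/bin/sh
# compressed_size SOURCE_DEVICE BYTES
# Reads BYTES from SOURCE_DEVICE, compresses them with gzip, prints the output size.
compressed_size() {
    head -c "$2" "$1" | gzip -c | wc -c | tr -d ' '
}

plain=$(compressed_size /dev/zero 65536)       # highly redundant "plaintext"
cipher=$(compressed_size /dev/urandom 65536)   # random bytes, like ciphertext

echo "64K redundant block  -> ${plain} bytes after compression"
echo "64K ciphertext-like  -> ${cipher} bytes after compression"
```

The redundant block shrinks to a tiny fraction of its size, while the random block comes out slightly larger than it went in. That is what a compressor would face if encryption ran first.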
Compression + dedup interaction
Compression and deduplication are independent operations in ZFS. When both are enabled, ZFS compresses first, then deduplicates. The dedup table stores checksums of the compressed blocks. This means two blocks with identical uncompressed content will also be identical after compression (for the same algorithm), and dedup will catch them.
Compression does not stop dedup from matching blocks, as long as the copies were written with the same compression setting. Identical data written under different algorithms (for example, partly before and partly after a compression change) produces different on-disk blocks, and dedup will not match them. Likewise, two blocks that differ in a single byte have different checksums and never match. This is correct behavior: dedup is block-level, not file-level.
In practice, use compression instead of dedup whenever possible. Compression gives 1.5–4x savings with near-zero overhead. Dedup gives additional savings only for truly block-identical data (identical VM images, identical backup copies) and costs 1–2GB of RAM per TB. For most workloads, compression alone is sufficient.
ARC and compression (compressed ARC)
The ARC (Adaptive Replacement Cache) can store compressed blocks. Compressed ARC arrived in ZFS on Linux 0.7 (2017), so every OpenZFS 2.x release includes it. This is a major improvement: your RAM cache holds more data per GB of RAM because the cached blocks are compressed.
How compressed ARC works
Before compressed ARC, the ARC stored uncompressed blocks. A 128K record compressed to 64K on disk would still occupy 128K in the ARC after decompression. With compressed ARC, ZFS keeps the 64K compressed version in cache and decompresses on demand when an application reads the data. This effectively doubles (or more) your ARC's effective capacity, because every cached block is smaller.
The tradeoff is a small CPU cost for decompression on every cache hit. With LZ4, this cost is negligible (~5,500 MB/s decompression). With gzip, it's noticeable. This is another reason to use LZ4 or zstd: their decompression speed is high enough that compressed ARC is effectively free.
# Check ARC statistics (compressed vs. uncompressed)
grep -E '^(size|compressed_size|uncompressed_size) ' /proc/spl/kstat/zfs/arcstats
# size 4 34359738368
# compressed_size 4 16800432128
# uncompressed_size 4 33621204992
# In this example:
# ARC is using 32GB of RAM (size)
# The compressed data in ARC is 15.6GB (compressed_size)
# That 15.6GB represents 31.3GB of uncompressed data (uncompressed_size)
# Compression ratio of the cached data: 31.3 / 15.6 = ~2.0x
# So 15.6GB of RAM holds what would take 31.3GB if cached uncompressed
# arc_summary gives a friendlier view
arc_summary | grep -A5 "ARC size"
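For scripting, the same ratio can be pulled straight from the kstat file. This is a minimal sketch: `arc_compress_ratio` is a hypothetical helper, and it assumes the three-column name/type/value layout shown in the sample output above.

```shell
#!/bin/sh
# arc_compress_ratio [ARCSTATS_FILE]
# Prints uncompressed_size / compressed_size from a kstat-format file.
# Defaults to the live arcstats path on Linux.
arc_compress_ratio() {
    awk '$1 == "compressed_size"   { c = $3 }
         $1 == "uncompressed_size" { u = $3 }
         END { if (c > 0) printf "%.2f\n", u / c; else exit 1 }' \
        "${1:-/proc/spl/kstat/zfs/arcstats}"
}

# Try it on the sample values from the listing, without needing a ZFS host.
sample=$(mktemp)
printf '%s\n' \
    'size                            4    34359738368' \
    'compressed_size                 4    16800432128' \
    'uncompressed_size               4    33621204992' > "$sample"
arc_compress_ratio "$sample"
rm -f "$sample"
```

On a live system, call `arc_compress_ratio` with no argument to read the real counters.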
Real-world scenarios
Scenario 1: NAS with 8TB of mixed content
Home NAS: 4x 4TB in mirror pairs, LZ4 default
A typical home NAS stores documents, photos, videos, music, and system backups. Without compression: 8TB raw, ~7.3TB usable. With LZ4: the compressible portion (documents, configs, system backups) compresses at 2–3x. Media files don't compress but LZ4 early abort makes the attempt free. Typical result: 7.3TB usable becomes effectively 10–12TB of logical capacity. That's 2.7–4.7TB of free space you didn't have to buy hardware for.
Scenario 2: PostgreSQL database server
Database server: mirrored NVMe, LZ4, recordsize=8k
A PostgreSQL server with 500GB of data (text-heavy schema: user records, messages, metadata).
With LZ4 and recordsize=8k, the compression ratio is typically 2.0–2.5x. 500GB of
logical data occupies 200–250GB on disk. The I/O reduction means: less data read from NVMe on
every query, more data fits in ARC (compressed ARC doubles effective cache), and write amplification
decreases. The database is both smaller and faster with compression enabled.
Scenario 3: VM host with 50 Linux guests
KVM/Proxmox host: RAIDZ2 on spinning rust, LZ4
50 Linux VMs, each with a 40GB disk. Without compression: 2TB of raw storage. With LZ4: the average Linux guest compresses at 1.8–2.2x (OS files, packages, configs compress well; application data varies). Typical result: 2TB logical = 900GB–1.1TB physical. You just saved 50% of your disk budget. VMs built from the same base image do not compress any better for being alike, because ZFS compresses each block independently. Sharing identical blocks across guests is the job of clones or dedup.
Scenario 4: Log aggregation server
Log server: zstd-3 for maximum savings on highly compressible data
A central log aggregation server ingesting 50GB/day of syslog, JSON application logs, and audit trails.
With zstd-3, log data typically compresses at 5–8x. 50GB/day becomes 6–10GB/day on disk.
Over a 90-day retention period: 4.5TB of logical logs occupies 560–900GB on disk. The write throughput
of zstd-3 (900 MB/s) far exceeds the ingest rate (50GB/day = ~0.6 MB/s average). CPU impact: zero.
Use zstd-3 here instead of LZ4 because the 30–50% better ratio saves hundreds of gigabytes
and the CPU cost is irrelevant at these ingest rates.
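The retention arithmetic in this scenario is worth keeping as a small calculator so it can be rerun with your own ingest rate. A minimal sketch, with hypothetical function names and 1 GB taken as 1000 MB:

```shell
#!/bin/sh
# retained_on_disk_gb DAILY_GB RETENTION_DAYS RATIO
# Prints the physical GB needed to hold RETENTION_DAYS of logs at a given ratio.
retained_on_disk_gb() {
    awk -v d="$1" -v n="$2" -v r="$3" 'BEGIN { printf "%.1f\n", d * n / r }'
}

# ingest_mb_per_s DAILY_GB
# Prints the average ingest rate implied by a daily volume.
ingest_mb_per_s() {
    awk -v d="$1" 'BEGIN { printf "%.2f\n", d * 1000 / 86400 }'
}

echo "90 days at 5x: $(retained_on_disk_gb 50 90 5) GB"
echo "90 days at 8x: $(retained_on_disk_gb 50 90 8) GB"
echo "average ingest: $(ingest_mb_per_s 50) MB/s"
```

Averages hide bursts, so compare peak ingest against the compression throughput as well before settling on a level.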
CPU impact benchmarks
The most common concern about compression is CPU overhead. Here are real numbers from benchmarking on typical server hardware (Xeon Gold 6348, 28 cores, DDR4-3200).
| Algorithm | CPU usage (sequential write) | CPU usage (random 4K write) | Throughput impact |
|---|---|---|---|
| compression=off | Baseline | Baseline | Baseline |
| lz4 | +1–3% CPU | +0.5–1% CPU | +5–15% faster (less I/O) |
| zstd-1 | +3–8% CPU | +2–5% CPU | +0–10% faster (I/O reduction offsets CPU) |
| zstd-3 | +5–15% CPU | +3–8% CPU | 0–5% slower on NVMe, 5–15% faster on HDD |
| zstd-7 | +15–35% CPU | +8–20% CPU | 5–20% slower on NVMe, 0–10% faster on HDD |
| zstd-19 | +50–100% CPU | +30–60% CPU | 30–60% lower write throughput |
| gzip-9 | +80–150% CPU | +40–80% CPU | 40–70% lower write throughput |
The crucial insight: on spinning rust, LZ4 and zstd-3 actually improve throughput because the disk is the bottleneck, not the CPU. Compressing data means fewer bytes traverse the SATA/SAS bus, and the CPU finishes compression faster than the disk can write. On NVMe, the CPU becomes the bottleneck sooner, so higher compression levels (zstd-7+) can reduce write throughput. But even on NVMe, LZ4 is a net positive.
Deduplication: understand it before you enable it
Deduplication requires 1–2GB of RAM per TB of deduplicated data.
This is not a guideline. It's a hard requirement. The dedup table (DDT) stores a checksum for every unique block in the pool. For a pool with 128K recordsize, 1TB of unique data = ~8 million blocks = ~320MB of DDT entries (at ~40 bytes each). With smaller blocks (8K for databases), the DDT is 16x larger. With metadata overhead and ARC caching requirements, the practical number is 1–2GB of RAM per TB.
If the DDT doesn't fit in ARC (RAM), every write requires a random read from disk to check the DDT. Performance doesn't "degrade gradually" — it falls off a cliff. A pool that was doing 500 MB/s drops to 5 MB/s because every write blocks on a DDT lookup from spinning rust.
How dedup works internally
When dedup is enabled (dedup=on), ZFS checksums every block before writing it.
If the checksum already exists in the DDT, ZFS increments a reference count instead of writing
the block. When the last reference to a block is removed (file deleted, snapshot destroyed),
the reference count drops to zero and the block is freed.
The DDT is stored in the pool's metadata and must be accessible for every write operation. ZFS caches the DDT in ARC (RAM). If the DDT is larger than available ARC, portions are evicted, and subsequent writes must fetch DDT entries from disk. This is the death spiral.
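Since the DDT holds one entry per unique block, its size follows directly from recordsize. The sketch below is a back-of-the-envelope estimate, not a measurement. The function names are hypothetical, and the per-entry size is a parameter because this page quotes both ~40 bytes (raw entry) and ~320 bytes (with overhead).

```shell
#!/bin/sh
# ddt_blocks DATA_TIB RECORDSIZE_KIB
# Number of unique blocks (and so DDT entries) for fully unique data.
ddt_blocks() {
    awk -v t="$1" -v k="$2" 'BEGIN { printf "%d\n", t * 1024 * 1024 * 1024 / k }'
}

# ddt_size_mib BLOCKS BYTES_PER_ENTRY
# DDT footprint in MiB for an assumed per-entry size.
ddt_size_mib() {
    awk -v b="$1" -v e="$2" 'BEGIN { printf "%d\n", b * e / 1048576 }'
}

n128=$(ddt_blocks 1 128)
n8=$(ddt_blocks 1 8)
echo "1 TiB @ 128K records: ${n128} entries"
echo "1 TiB @   8K records: ${n8} entries"
echo "128K, 40 B/entry:  $(ddt_size_mib "$n128" 40) MiB"
echo "128K, 320 B/entry: $(ddt_size_mib "$n128" 320) MiB"
```

Whichever per-entry figure applies to your release, the 16x jump in entry count between 128K and 8K records carries over unchanged.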
When dedup is actually valid
zfs clone instead.
Checking dedup viability before enabling
# BEFORE enabling dedup: simulate the dedup ratio without the RAM cost
# This scans the pool and reports what the dedup ratio *would* be
zdb -S tank
# Output looks like:
# Simulated DDT histogram:
#
# refcnt blocks LSIZE PSIZE DSIZE ...
# ------ ------ ----- ----- -----
# 1 8.32M 996G 498G 498G ...
# 2 1.21M 145G 72.5G 72.5G ...
# 4 32.4K 3.89G 1.94G 1.94G ...
# Total 9.56M 1.12T 572G 572G ...
#
# dedup = 1.96, ...
# The "dedup = 1.96" means dedup would save ~49% of space.
# Now calculate RAM needed:
# Total unique blocks: 9.56M
# DDT entry size: ~320 bytes (with overhead)
# RAM needed: 9.56M * 320 = ~3GB
# If you have 3GB+ of ARC to spare, dedup is viable for this pool.
# Check current DDT size on an existing dedup-enabled pool
zpool status -D tank
Run zdb -S before you even think about enabling dedup.
If the simulated ratio is below 2x, dedup isn't worth the RAM. At 2x, compression alone probably
gets you similar savings with zero overhead. Dedup only makes economic sense above 3–5x, and even
then, ask yourself: would ZFS clones or snapshots solve the same problem without the RAM tax?
The answer is almost always yes.
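Turning a simulated ratio into a decision takes two small calculations. The sketch below is one reading of the thresholds in the paragraph above, not an official rule. Both function names and the exact cut-offs (2x and 3x) are assumptions.

```shell
#!/bin/sh
# dedup_saved_pct RATIO
# Share of space a given dedup ratio would save: 1 - 1/ratio.
dedup_saved_pct() {
    awk -v r="$1" 'BEGIN { if (r < 1) exit 1; printf "%.0f\n", (1 - 1 / r) * 100 }'
}

# dedup_verdict RATIO
# Below 2x: skip. From 2x up to 3x: marginal. 3x and above: worth sizing the DDT.
dedup_verdict() {
    awk -v r="$1" 'BEGIN {
        if (r < 2)      print "skip: compression alone is the better deal"
        else if (r < 3) print "marginal: compare against clones and snapshots first"
        else            print "candidate: confirm the DDT fits in RAM"
    }'
}

echo "1.96x saves $(dedup_saved_pct 1.96)% -> $(dedup_verdict 1.96)"
echo "5.00x saves $(dedup_saved_pct 5)% -> $(dedup_verdict 5)"
```

A "candidate" verdict only means the ratio clears the bar. The RAM check still has to pass separately.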
Disabling dedup
If you've enabled dedup and want to undo it: zfs set dedup=off pool/dataset stops
deduplicating new writes, but existing deduplicated blocks remain in the DDT until
overwritten or deleted. The DDT doesn't shrink until the referenced data is gone. To
fully remove dedup, you must rewrite all data: zfs send | zfs receive to a new pool
or dataset with dedup=off.
# Disable dedup on new writes (existing DDT remains)
zfs set dedup=off tank/vdi
# To fully remove the DDT, rewrite the data
zfs snapshot tank/vdi@migrate
zfs send tank/vdi@migrate | zfs receive -o dedup=off tank/vdi-new
# Verify data, then swap datasets
# Fast dedup (OpenZFS 2.3+) is a pool feature, not a value of the dedup property
# Check whether the pool has it enabled
zpool get feature@fast_dedup tank
Fast dedup (OpenZFS 2.3+)
OpenZFS 2.3 introduced fast dedup, a reworked dedup table. It adds a log that batches DDT updates instead of touching the on-disk table on every write, a dedup_table_quota pool property that caps how large the DDT may grow, and zpool ddtprune for discarding old entries that were never matched. Together these bound the table's size and cut the random I/O that made legacy dedup fall off a cliff.
Fast dedup reduces the penalty significantly, especially for workloads where most blocks are unique (low dedup ratio), because never-matched entries can be pruned rather than kept forever. However, the DDT still exists and still needs RAM for the blocks it tracks. Fast dedup is an improvement, not a cure.
kldload defaults & why
kldload sets compression=lz4 on every pool at creation time, across all profiles
(desktop, server, core) and all target distros. dedup=off is the default and kldload
does not expose dedup as an option in the web UI.
# kldload pool creation (from kldload-install-target)
zpool create -o ashift=12 \
-O compression=lz4 \
-O atime=off \
-O xattr=sa \
-O dnodesize=auto \
-O relatime=on \
rpool mirror /dev/disk/by-id/... /dev/disk/by-id/...
# After install, customize per-dataset as needed
zfs set compression=zstd-3 rpool/home
zfs set compression=zstd-7 rpool/var/log