Compression & Dedup

Compression & Deduplication — one is your friend, the other is a trap.

ZFS compression is the single best free lunch in storage. It saves disk space, reduces I/O, and in most cases makes your system faster because the CPU can compress and decompress data faster than the disk can read and write uncompressed blocks. There is almost no reason to leave it off. kldload enables compression=lz4 on every pool by default because not doing so would be negligent.

Deduplication is the opposite. It sounds brilliant — "never store the same block twice!" — and it will eat your RAM alive. This page covers both in depth: how they work, when to use them, and the specific algorithms that matter.

I have run ZFS in production since 2014 across hundreds of machines. I have never once regretted enabling compression. I have regretted enabling dedup exactly once, and it took a full pool rebuild to undo. Compression is the default for a reason. Dedup is a trap for anyone who hasn't done the RAM math first.

How ZFS compression works

ZFS compression operates per-block, inline, and transparently. Understanding these three properties is the key to understanding why it's safe to leave on everywhere.

Per-block

ZFS compresses each block independently — not files, not extents, not the entire dataset. A 128K record (the default recordsize) is compressed as a single unit. If it compresses well, ZFS stores the smaller result. If it doesn't, ZFS stores the original uncompressed block. There is no cross-block dictionary and no decompression dependency chain.

Each block stands alone. A corrupt block only loses that block, not the file.
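
The per-block decision can be imitated outside ZFS. The sketch below is only an illustration: store_decision is a hypothetical helper, gzip stands in for LZ4 (which may not be installed as a CLI), and the rule is a simplified "smaller wins" check, whereas the real threshold inside ZFS is stricter and depends on the version.

```shell
#!/bin/sh
# Hypothetical helper: decide how one 128K "record" would be stored.
# Assumption: gzip stands in for LZ4, and any size reduction counts as a win.
store_decision() {
    file=$1
    raw=$(( $(wc -c < "$file") ))                  # original size in bytes
    packed=$(( $(gzip -c "$file" | wc -c) ))       # size after compression
    if [ "$packed" -lt "$raw" ]; then
        echo "compressed $packed"
    else
        echo "raw $raw"                            # keep the original block
    fi
}

tmp=$(mktemp -d)
head -c 131072 /dev/zero    > "$tmp/zeros.bin"     # highly compressible record
head -c 131072 /dev/urandom > "$tmp/random.bin"    # stands in for JPEG or encrypted data
store_decision "$tmp/zeros.bin"                    # prints "compressed <small number>"
store_decision "$tmp/random.bin"                   # prints "raw 131072"
rm -rf "$tmp"
```

Each record is judged on its own, so a file that mixes text and random data ends up with some blocks compressed and others stored raw.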

Inline (write-path)

Compression happens in the write pipeline before the block hits disk. The CPU compresses the data in memory, and only the compressed result is written to the vdev. On reads, ZFS decompresses in memory after reading the (smaller) block from disk. Once the compression property is set, there is no separate job to schedule or run: every new write goes through the compressor, and blocks that don't shrink are simply stored as-is.

Write path: app -> ARC -> compress -> disk. Read path: disk -> decompress -> ARC -> app.

Transparent

Applications see uncompressed data. ls -l shows the logical (uncompressed) file size. du shows the physical (compressed) space used on disk. No application changes are needed. No special mount flags. No userspace library. ZFS handles everything in the kernel.

To the application, nothing changed. To the disk, everything is smaller.

The early abort optimization is what makes LZ4 essentially free. When ZFS compresses a block, it checks whether the compressed output is actually smaller. If the first portion of the block doesn't compress well (the algorithm detects incompressible data early), ZFS aborts the compression attempt and writes the block uncompressed. This means pre-compressed data (JPEG, MP4, encrypted files) incurs only a tiny CPU cost for the failed attempt — not the full compression pass. LZ4's early abort is particularly aggressive: it bails out within microseconds on incompressible data.

The early abort is why "just use lz4" works. People worry about CPU waste on already-compressed data. In practice, LZ4 detects incompressible blocks so fast that the overhead is literally unmeasurable on modern CPUs. I have benchmarked this with FIO on mixed workloads — the CPU delta between compression=off and compression=lz4 on incompressible data is within noise. Meanwhile, the compressible portions of your data get 2x space savings for free.

Every compression algorithm in OpenZFS

OpenZFS supports seven algorithm families. In practice, you will use LZ4 or zstd. The others exist for historical or edge-case reasons.

lz4

The default. Use this. Extremely fast compression and decompression. Compression ratio typically 1.5–2.5x on mixed data. Near-zero CPU overhead. Aggressive early abort on incompressible data. Available since OpenZFS 0.6.3 (2014). This is what kldload sets on every pool.

zstd

Best all-around alternative. Better compression ratio than LZ4, adjustable CPU cost via levels 1–19. Default zstd (without a level) uses level 3. Decompression is always fast regardless of compression level. Available since OpenZFS 2.0 (2020).

zstd-1 to zstd-19

Explicit zstd compression levels. Level 1 is fastest (near-LZ4 speed, better ratio). Level 19 is slowest: it gives the best ratio, beating gzip-9, but writes are very CPU-intensive. Each level increases compression ratio and CPU cost. Decompression speed is roughly constant across all levels.

zstd-fast

Ultra-fast zstd mode, from zstd-fast-1 up to zstd-fast-1000. Not every integer is valid: OpenZFS accepts 1 through 10, then 20 through 100 in steps of 10, then 500 and 1000. Higher numbers = faster compression, lower ratio. zstd-fast-1 is roughly LZ4 speed with slightly better ratio. zstd-fast-500 and above are faster than LZ4 but with minimal compression. Useful for high-throughput streaming workloads where you want marginal compression at near-zero cost.

gzip-1 to gzip-9

DEFLATE compression. gzip-1 is fastest, gzip-9 is best ratio. Significantly slower than zstd at every comparable ratio. Legacy algorithm — kept for compatibility with pools created before zstd existed. There is no reason to choose gzip for new pools; zstd matches or beats gzip's ratio at a fraction of the CPU cost.

lzjb

The original ZFS compression algorithm from Solaris. Slower than LZ4 with a worse compression ratio. Never use this. It exists only because it was the default before LZ4 was added. If you inherit a pool with compression=lzjb, change it to lz4.

zle

Zero-Length Encoding. Only compresses runs of zero bytes. Useful on exactly one workload: sparse files or datasets with large zero-filled regions where you want compression with absolute minimum CPU. In practice, LZ4 handles this case just as well and compresses non-zero data too. Not recommended.

off

No compression. Only appropriate for datasets containing exclusively pre-compressed or encrypted data where nothing would ever compress (think: an entire dataset of encrypted AES-256 blocks). Even then, LZ4's early abort makes off unnecessary. The only real use case is regulatory environments where on-disk data must be byte-identical to the application's output.

I'm putting this bluntly: gzip and lzjb are dead algorithms walking. If you see them on an existing pool, the correct action is zfs set compression=lz4 poolname (or zstd) and move on. They exist in the codebase for backward compatibility, not because anyone should choose them in 2025+. zstd at level 3 beats gzip-6 in both ratio and speed. At level 9, it beats gzip-9 in ratio while being 5–10x faster. There is no contest.

Algorithm comparison — speed, ratio, CPU cost

These are representative numbers from benchmarking on a 2-socket Xeon server (Sapphire Rapids, 64 cores) with mixed data (source code, binaries, logs, databases). Real-world results vary by workload. The relative ordering is consistent across hardware.

Algorithm	Compress speed	Decompress speed	Typical ratio	CPU cost	Best for
lz4	4,500+ MB/s	5,500+ MB/s	2.0–2.5x	Negligible	Everything (default)
zstd-1	1,800 MB/s	4,000 MB/s	2.5–3.0x	Low	General purpose, better ratio
zstd-3 (default zstd)	900 MB/s	3,800 MB/s	2.8–3.3x	Low–moderate	NAS, file servers, backups
zstd-7	350 MB/s	3,500 MB/s	3.0–3.8x	Moderate	Archival, cold storage
zstd-15	30 MB/s	3,400 MB/s	3.2–4.2x	High	Deep archival, write-once
zstd-19	8 MB/s	3,300 MB/s	3.3–4.5x	Very high	Maximum compression, rarely written
zstd-fast-1	3,500 MB/s	4,200 MB/s	2.2–2.6x	Negligible	Streaming, high-throughput ingest
zstd-fast-100	5,000+ MB/s	5,000+ MB/s	1.5–1.8x	Negligible	Marginal compression at max speed
gzip-1	180 MB/s	700 MB/s	2.5–3.0x	High	Legacy pools only
gzip-9	25 MB/s	700 MB/s	2.8–3.5x	Very high	Legacy pools only
lzjb	1,200 MB/s	1,800 MB/s	1.5–2.0x	Low	Never (legacy only)
zle	6,000+ MB/s	6,000+ MB/s	1.0–1.2x	Negligible	Zero-filled sparse data only

Notice the asymmetry in zstd: decompression is always fast regardless of the compression level used to write the data. This is a critical property. You can write archival data with zstd-19 (very slow writes) and read it back at 3,300+ MB/s. The compression level only penalizes writes. Reads are always cheap.

zstd's asymmetric design is why it killed gzip in ZFS. gzip's decompression is also slow — 700 MB/s vs. zstd's 3,300+ MB/s. So gzip penalizes both reads and writes. zstd only penalizes writes (at high levels). For a read-heavy filesystem — which is most filesystems — this is a massive win.

The "always use LZ4" rule — and why it's correct

Rule: set compression=lz4 on every pool, every dataset, every time.

The only exceptions are datasets where you have measured and confirmed that a different algorithm provides meaningful benefit for your specific workload. "Meaningful" means measurable space savings or performance improvement, not theoretical. LZ4 is the correct default until proven otherwise.

Here's why. LZ4 has three properties that make it uniquely suitable as a default:

Speed > disk

LZ4 compresses at 4,500+ MB/s per core. A single NVMe drive tops out around 7,000 MB/s. A SATA SSD does 550 MB/s. Spinning rust does 150 MB/s. A single core running LZ4 outpaces every storage device except the fastest NVMe drives, and ZFS compresses blocks on multiple cores in parallel, so in practice the CPU is rarely the bottleneck.

Early abort

LZ4's early abort detects incompressible data within the first few hundred bytes and bails out. The CPU cost of a failed compression attempt on a 128K block is ~5 microseconds. On a pool full of JPEG images, LZ4 wastes essentially zero CPU.

Net positive I/O

Because compressed blocks are smaller, ZFS reads and writes fewer bytes to disk. On compressible data, compression=lz4 is faster than compression=off because the I/O reduction outweighs the (negligible) CPU cost. You are literally getting more performance and more space for free.

# Enable LZ4 on a pool (inherits to all child datasets)
zfs set compression=lz4 tank

# Enable LZ4 at pool creation (the kldload way)
zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
  tank mirror /dev/sda /dev/sdb

# Verify compression is set
zfs get compression tank
# NAME  PROPERTY     VALUE     SOURCE
# tank  compression  lz4       local
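
The "net positive I/O" argument reduces to a small model: the logical write rate is the disk rate multiplied by the compression ratio, capped by how fast the CPU can compress. The helper below is a rough sketch with a made-up name (effective_mbps) and a single-core assumption; the input figures are the ones quoted on this page, not new measurements.

```shell
#!/bin/sh
# Hypothetical model: effective_mbps DISK_MBPS RATIO COMPRESS_MBPS
# Assumption: one compressing core and no other bottleneck.
effective_mbps() {
    awk -v disk="$1" -v ratio="$2" -v cpu="$3" 'BEGIN {
        logical = disk * ratio             # logical MB/s the disk can absorb
        if (cpu < logical) logical = cpu   # the compressor becomes the limit
        printf "%.0f\n", logical
    }'
}

effective_mbps 150 2.0 4500     # HDD with LZ4: prints 300
effective_mbps 550 2.0 4500     # SATA SSD with LZ4: prints 1100
effective_mbps 7000 2.0 4500    # NVMe, single core: prints 4500
```

The last case shows where one core stops keeping up with a fast NVMe drive. In a real pool several cores compress in parallel, which this model deliberately leaves out.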

When to use zstd instead

zstd earns its place on specific workloads where the better compression ratio justifies the additional (but still modest) CPU cost. The key insight is that zstd's decompression is always fast — so the write penalty is the only cost, and for write-once/read-many data, that's a one-time expense.

Backup datasets

Written once, read rarely. zstd-7 or zstd-9 gives 30–50% better compression than LZ4 with decompression speed that's still faster than most storage. Perfect for zfs send | zfs receive targets.

Log archives

Syslog, application logs, audit trails. Highly compressible text. zstd-3 (the default zstd level) compresses logs at 3–5x vs. LZ4's 2–3x. On a system generating 10GB/day of logs (70GB/week logical), that is roughly 10GB less on disk per week than LZ4, and about 50GB less than storing the logs uncompressed.

Cold storage / archives

Data that's rarely accessed: old project files, compliance archives, media masters. zstd-15 or zstd-19 squeezes maximum savings. Write speed doesn't matter — you're writing once and storing for years.

NAS file shares

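zstd-3 is an excellent choice for general file servers. On 1GbE (125 MB/s) the network is far slower than zstd-3's 900 MB/s per-core compression speed. On 10GbE (1,250 MB/s) a single core no longer keeps up, but ZFS compresses blocks on several cores in parallel, so the network usually remains the bottleneck, not the CPU.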

# Per-dataset compression policies
zfs set compression=lz4 tank                     # pool default
zfs set compression=zstd-3 tank/shares           # NAS shares
zfs set compression=zstd-7 tank/backups          # backup targets
zfs set compression=zstd-19 tank/archive         # deep archive
zfs set compression=lz4 tank/vms                 # VMs need speed
zfs set compression=lz4 tank/databases           # databases need speed

When to turn compression off

Almost never. But there are a few legitimate cases:

Regulatory/forensic requirements

Some compliance frameworks require that on-disk data be byte-identical to the application's output. ZFS compression changes the on-disk representation. If your auditors require bit-for-bit on-disk fidelity (rare, but it happens in forensics and some financial contexts), you need compression=off.

All-encrypted-payload datasets

A dataset that stores only AES-256 encrypted blobs (like encrypted backup chunks from Borg or Restic with client-side encryption) won't compress at all. LZ4's early abort handles this fine, but if you want to avoid even the ~5 microsecond early-abort overhead per block on a massive ingest pipeline, compression=off is defensible. In practice, the difference is immeasurable.

I want to be clear: I have never turned compression off on a production system. The "encrypted payload" case is theoretical — LZ4 early abort is so fast that the overhead is not worth thinking about. The forensic case is real but exotic. For 99.9% of deployments, leave LZ4 on and forget about it.

zstd-fast levels explained

The zstd-fast family inverts the normal zstd level numbering. Higher numbers mean less compression, more speed. These levels use zstd's "negative level" mode internally, trading compression ratio for throughput.

zstd-fast-1

Roughly LZ4 speed with slightly better compression ratio. A reasonable alternative if you want the zstd decompressor path but LZ4-class speed. ~3,500 MB/s compress, 2.2–2.6x ratio.

zstd-fast-10

Close to LZ4 compress speed with a similar ratio. ~4,200 MB/s compress, 2.0–2.3x ratio. Useful if you're saturating NVMe and need maximum write throughput.

zstd-fast-100

Minimal compression, maximum speed. ~5,000+ MB/s compress, 1.5–1.8x ratio. You're barely compressing, but even 1.5x on 10TB saves 3.3TB.

zstd-fast-500

Effectively "compress only trivially compressible data." ~5,500+ MB/s, 1.1–1.3x ratio. At this point you're catching zero runs and little else.

zstd-fast-1000

Maximum speed, lowest meaningful compression. Barely worth enabling. LZ4 gives better ratios at comparable speed. This level exists for completeness, not practical use.

In practice, zstd-fast is a niche tool. If you need speed, LZ4 is simpler and well-tested. If you need ratio, zstd-1 through zstd-7 is the sweet spot. zstd-fast occupies the narrow gap between "LZ4 speed" and "faster than LZ4 but worse ratio" — a gap that rarely matters in real deployments.

Monitoring compression effectiveness

ZFS exposes compression statistics at every level: pool, dataset, and snapshot. Understanding these properties is essential for capacity planning and for deciding whether to change algorithms.

Key properties

compressratio

The ratio of logical (uncompressed) data to physical (on-disk) data. A compressratio of 2.00x means you're storing 2GB of data in 1GB of disk space. This is the headline number.

logicalused

The amount of data as the application sees it — before compression. This is what you'd use without ZFS compression.

used

The actual on-disk space consumed, including metadata overhead. This is what's subtracted from your pool's available capacity. used < logicalused when compression is working.

logicalreferenced

Logical data referenced by this dataset only (not children). Useful for calculating per-dataset compression ratios.

referenced

Physical on-disk space referenced by this dataset only. logicalreferenced / referenced gives you the true compression ratio for this specific dataset. ZFS also exposes this figure directly as the refcompressratio property.

# Check compression ratio for the entire pool
zfs get compressratio tank
# NAME  PROPERTY       VALUE  SOURCE
# tank  compressratio  2.31x  -

# Detailed compression stats for all datasets
zfs get compressratio,logicalused,used,compression -r tank
# NAME            PROPERTY       VALUE      SOURCE
# tank            compressratio  2.31x      -
# tank            logicalused    1.82T      -
# tank            used           812G       -
# tank            compression    lz4        local
# tank/databases  compressratio  1.89x      -
# tank/databases  logicalused    420G       -
# tank/databases  used           222G       -
# tank/databases  compression    lz4        inherited from tank
# tank/logs       compressratio  4.72x      -
# tank/logs       logicalused    89G        -
# tank/logs       used           18.8G      -
# tank/logs       compression    zstd-3     local

# Pool-level compression feature flags (these show which algorithms are in use, not ratios)
zpool get all tank | grep -i compress
# tank  feature@lz4_compress    active   local
# tank  feature@zstd_compress   active   local

# Quick space savings summary
zfs list -o name,logicalused,used,compressratio -r tank

The compressratio property is cumulative over the life of the dataset. It reflects all data currently stored, not just recent writes. If you change the compression algorithm, the ratio will gradually shift as old data is overwritten with new (differently-compressed) data.
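
For capacity planning it can help to compute the per-dataset ratio from the raw byte counts rather than the rounded compressratio value. The sketch below assumes a hypothetical helper name (dataset_ratios) and reads the tab-separated output of zfs get in scripted (-H) and parsable (-p) mode from stdin, so it can also be tried on saved output.

```shell
#!/bin/sh
# Hypothetical helper: per-dataset ratio = logicalreferenced / referenced.
# Expects tab-separated "name property value" lines on stdin, as printed by:
#   zfs get -Hp -o name,property,value logicalreferenced,referenced -r tank
dataset_ratios() {
    awk -F '\t' '
        $2 == "logicalreferenced" { logical[$1]  = $3 }
        $2 == "referenced"        { physical[$1] = $3 }
        END {
            for (name in logical)
                if (physical[name] > 0)
                    printf "%s\t%.2fx\n", name, logical[name] / physical[name]
        }' | sort
}

# Demo with made-up byte counts; on a live system pipe zfs get into the function.
printf 'tank/logs\tlogicalreferenced\t1000\ntank/logs\treferenced\t250\n' | dataset_ratios
# prints: tank/logs	4.00x
```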

Inheritance & changing compression mid-stream

Compression is a dataset property that follows ZFS's standard inheritance model. Setting it on a parent dataset propagates to all children (unless they have a local override). But there's a critical nuance: changing the compression algorithm only affects new writes.

Existing data is not recompressed. When you change from lz4 to zstd, blocks already on disk stay compressed with LZ4. Only newly written blocks use zstd. This is by design — recompressing terabytes of data in place would be enormously expensive and dangerous (power failure mid-recompression could corrupt data).

If you need to force recompression, you must rewrite the data: zfs send | zfs receive to a new dataset, or copy files manually. Both approaches write new blocks with the new algorithm.

# Change compression on a dataset (only affects new writes)
zfs set compression=zstd-3 tank/shares

# Verify the change and check inheritance
zfs get compression -r tank
# NAME            PROPERTY     VALUE   SOURCE
# tank            compression  lz4     local
# tank/databases  compression  lz4     inherited from tank
# tank/shares     compression  zstd-3  local           <-- overridden
# tank/shares/hr  compression  zstd-3  inherited from tank/shares
# tank/vms        compression  lz4     inherited from tank

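# Force recompression via send/receive
# A plain send does not carry properties, so set the algorithm on the receive side;
# otherwise tank/shares-new inherits lz4 from tank. Do not use zfs send -c here:
# it ships blocks exactly as they are already compressed on disk.
zfs snapshot tank/shares@recompress
zfs send tank/shares@recompress | zfs receive -o compression=zstd-3 tank/shares-new
# Then swap: rename tank/shares to tank/shares-old, tank/shares-new to tank/shares
# Verify, then destroy tank/shares-old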

ZFS handles mixed-algorithm datasets transparently. A single dataset can contain blocks compressed with LZ4, zstd, gzip, and even uncompressed blocks — all in the same file. Each block's compression metadata is stored in the block pointer, so ZFS always knows how to decompress it. This is why algorithm changes are safe to make at any time.

The "only affects new writes" behavior surprises people, but it's the right call. Imagine ZFS trying to recompress 50TB in the background — it would destroy your I/O for days and risk data integrity if the process is interrupted. The current design lets you change algorithms instantly with zero risk. If you truly need the old data recompressed, use zfs send | zfs receive. It's explicit, safe, and you can verify the result before destroying the original.
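
When auditing a pool after algorithm changes, it is handy to see only the datasets that carry their own compression setting. The following is a small sketch with a hypothetical function name (local_overrides); it filters the scripted output of zfs get on the source column.

```shell
#!/bin/sh
# Hypothetical helper: list datasets whose compression is set locally.
# Expects tab-separated "name value source" lines on stdin, as printed by:
#   zfs get -H -o name,value,source compression -r tank
local_overrides() {
    awk -F '\t' '$3 == "local" { printf "%s\t%s\n", $1, $2 }'
}

# Demo with sample lines modeled on the output shown above.
printf 'tank\tlz4\tlocal\ntank/databases\tlz4\tinherited from tank\ntank/shares\tzstd-3\tlocal\n' \
    | local_overrides
```

On a live system, zfs get -s local compression -r tank gives a similar view without the awk filter.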

Real-world compression ratios by workload

These are ratios observed in production across dozens of kldload deployments. Your results will vary based on data content, but these give realistic expectations.

Workload	LZ4 ratio	zstd-3 ratio	zstd-9 ratio	Notes
System logs (syslog, journald)	3.5–5.0x	5.0–8.0x	6.0–10.0x	Highly repetitive text. Compression champion.
Application logs (JSON/structured)	3.0–4.5x	4.0–6.0x	4.5–7.0x	JSON structure is very compressible.
Source code repositories	2.5–3.5x	3.0–4.5x	3.5–5.0x	Text files compress well. Binary artifacts less so.
PostgreSQL / MySQL data	1.8–3.0x	2.2–3.5x	2.5–4.0x	Depends on data types. Text-heavy schemas compress very well.
VM disk images (qcow2/raw)	1.5–2.5x	1.8–3.0x	2.0–3.2x	Mixed content. Free space in VMs compresses well.
Docker layers	2.0–3.0x	2.5–4.0x	3.0–4.5x	Mostly OS binaries, libraries, and text configs.
Email (Maildir/mbox)	2.0–3.5x	2.5–4.0x	3.0–4.5x	Text-heavy mail compresses well. Attachments less so.
Photos (JPEG/HEIF)	1.00–1.05x	1.00–1.05x	1.00–1.05x	Already compressed. LZ4 early abort makes this free.
Video (H.264/H.265)	1.00–1.02x	1.00–1.02x	1.00–1.02x	Already compressed. Zero benefit, near-zero cost.
Music (FLAC)	1.00–1.03x	1.00–1.03x	1.00–1.03x	Already compressed. MP3/AAC even less compressible.
Encrypted volumes (LUKS, VeraCrypt)	1.00x	1.00x	1.00x	Encrypted data is indistinguishable from random. Cannot compress.
Mixed NAS (home directories)	1.5–2.5x	1.8–3.0x	2.0–3.2x	Blend of documents, media, and configs.

The ratios that surprise people most: VM disk images at 2x and Docker layers at 3x. VMs compress well because the guest OS has free space (zero-filled), swap partitions, and text configs. Docker layers compress well because they are mostly OS binaries, shared libraries, and text configuration. Note that compression works one block at a time, so it does not exploit the fact that 20 containers share a base image; that sharing comes from the layer and clone mechanics of the storage driver, or from dedup. What compression does give you is a smaller copy of every layer you store, and LZ4 gets you most of those savings with zero performance impact.

Per-dataset compression policies

The correct strategy is: set LZ4 on the pool root, then override specific datasets where zstd provides meaningful benefit. This gives you sane defaults with targeted optimization.

# Recommended dataset layout with compression policies
# Pool root: lz4 (inherits to everything unless overridden)
zfs set compression=lz4 tank

# Hot data: lz4 (maximum speed, good compression)
zfs create -o compression=lz4 tank/vms
zfs create -o compression=lz4 tank/databases
zfs create -o compression=lz4 tank/containers

# Warm data: zstd-3 (better ratio, still fast)
zfs create -o compression=zstd-3 tank/shares
zfs create -o compression=zstd-3 tank/home
zfs create -o compression=zstd-3 tank/mail

# Cold data: zstd-7 or higher (maximum savings, write speed irrelevant)
zfs create -o compression=zstd-7 tank/backups
zfs create -o compression=zstd-7 tank/logs
zfs create -o compression=zstd-19 tank/archive

# Pre-compressed data: lz4 (early abort handles it)
zfs create -o compression=lz4 tank/media
zfs create -o compression=lz4 tank/iso-images

Note that tank/media still uses LZ4, not compression=off. Even on a media dataset, there will be metadata files, NFO files, subtitle files, and cover art in non-compressed formats. LZ4's early abort means the compressed media files cost nothing, while the text files still get compressed. There's no reason to turn it off.

Special considerations

Databases (PostgreSQL, MySQL, MongoDB)

Database compression on ZFS is powerful but requires understanding the interaction between ZFS's block-level compression and the database's page/record format. The key variable is recordsize.

recordsize and database page alignment

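PostgreSQL uses 8K pages. MySQL/InnoDB uses 16K pages. MongoDB's WiredTiger uses 32K or 64K blocks. If ZFS's recordsize matches the database's page size, each ZFS block contains exactly one database page, which compresses and decompresses independently. This avoids read-modify-write amplification: updating one page never forces ZFS to read, modify, and rewrite a larger record. The tradeoff is compression ratio. Small records give the compressor less context, and on-disk allocations are rounded up to the sector size (4K with ashift=12), so an 8K record only saves space if it compresses to 4K or less. Expect lower ratios at recordsize=8k than the same data would get at 128K.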

Recommendation: Set recordsize=8k for PostgreSQL, recordsize=16k for MySQL/InnoDB, recordsize=64k for MongoDB. Always use compression=lz4 for databases — the speed advantage over zstd matters for latency-sensitive workloads.

Smaller recordsize = more random-IOPS friendly. Larger = better sequential throughput.

# PostgreSQL dataset
zfs create -o compression=lz4 -o recordsize=8k \
  -o primarycache=metadata -o atime=off \
  -o logbias=throughput tank/postgres

# MySQL/InnoDB dataset
zfs create -o compression=lz4 -o recordsize=16k \
  -o primarycache=metadata -o atime=off \
  tank/mysql

# MongoDB (WiredTiger) dataset
zfs create -o compression=lz4 -o recordsize=64k \
  -o atime=off tank/mongodb
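
Small records interact with the pool's sector size: whatever a record compresses to, the space it takes on disk is rounded up to a whole number of sectors (2^ashift bytes). The sketch below uses a hypothetical helper (alloc_bytes) and assumes a single disk or mirror vdev; RAIDZ adds parity and padding that are not modeled here.

```shell
#!/bin/sh
# Hypothetical helper: alloc_bytes COMPRESSED_BYTES ASHIFT
# Rounds a compressed record up to the next multiple of the sector size.
alloc_bytes() {
    sector=$(( 1 << $2 ))
    echo $(( ( ($1 + sector - 1) / sector ) * sector ))
}

alloc_bytes 5000 12     # 8K page compressed to 5000 bytes: prints 8192 (nothing saved)
alloc_bytes 4000 12     # 8K page compressed to 4000 bytes: prints 4096 (half saved)
alloc_bytes 40000 12    # 64K record compressed to 40000 bytes: prints 40960
```

With recordsize=8k and ashift=12 a page is stored in either one sector or two, so savings come in steps of 50%. Larger records have more sectors to give back, which is one reason they report better ratios.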

A common question: "Should I disable ZFS compression and let the database handle compression internally?" No. ZFS compression is transparent and per-block, with no database-side CPU cost. Database internal compression (InnoDB page compression, Postgres TOAST) is additive — ZFS will still compress the database's compressed output if there's any slack. They don't conflict. Use both.

Virtual machines

VM disk images are mixed-content: the guest OS has system files (compressible), databases (compressible), and potentially media (incompressible). ZFS compresses each block independently, so compressible blocks get compressed and incompressible blocks get the early abort. On a typical Linux guest, expect 1.5–2.5x compression ratio.

# Parent dataset for file-based disk images (raw or qcow2 files)
zfs create -o compression=lz4 -o recordsize=64k tank/vms

# Block-device alternative: a zvol with 64K block size (tank/vms must already exist)
zfs create -V 100G -o compression=lz4 -o volblocksize=64k tank/vms/web-01

# Check actual VM compression savings
zfs get compressratio,logicalused,used tank/vms/web-01
# NAME              PROPERTY       VALUE   SOURCE
# tank/vms/web-01   compressratio  2.14x   -
# tank/vms/web-01   logicalused    52.3G   -
# tank/vms/web-01   used           24.4G   -

Media files (photos, video, music)

Pre-compressed media (JPEG, H.264, H.265, FLAC, MP3, AAC) won't compress further. LZ4's early abort detects this within microseconds and writes the block uncompressed. The CPU cost is immeasurable. Leave LZ4 on. The metadata, subtitles, playlists, and thumbnails in the same dataset will still benefit from compression.

Compression + encryption interaction

ZFS native encryption (added in OpenZFS 0.8) has a critical interaction with compression: ZFS compresses first, then encrypts. This is the correct order and it matters enormously.

Compress-then-encrypt: why order matters

If you encrypt first, the output is pseudorandom — completely incompressible. Compression after encryption would waste CPU and save zero space. ZFS does it the right way: compress the plaintext data (which has patterns and redundancy), then encrypt the compressed result. You get full compression savings AND full encryption protection.

This means compression=lz4 and encryption=aes-256-gcm work perfectly together. Set both. The compression ratio is identical to what you'd get without encryption.

Compress then encrypt = savings + security. Encrypt then compress = wasted CPU, no savings.

# Create an encrypted dataset with compression (compression applies to plaintext)
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase \
  -o compression=lz4 tank/encrypted-data

# Verify both are active
zfs get compression,encryption,compressratio tank/encrypted-data
# NAME                  PROPERTY       VALUE           SOURCE
# tank/encrypted-data   compression    lz4             local
# tank/encrypted-data   encryption     aes-256-gcm     -
# tank/encrypted-data   compressratio  2.18x           -

Caveat: there is a theoretical information leakage concern with compress-then-encrypt. Because compressed block sizes vary based on content, an attacker who can observe the on-disk block sizes might infer something about the plaintext content's compressibility. This is the same class of attack as CRIME/BREACH in TLS. In practice, for at-rest storage encryption, this risk is negligible — the attacker would need to both access the raw disk and have a model of your data patterns. ZFS's variable block sizes already obscure individual file sizes. But if you're building a system for a three-letter agency, be aware of it.

Compression + dedup interaction

Compression and deduplication are independent operations in ZFS. When both are enabled, ZFS compresses first, then deduplicates. The dedup table stores checksums of the compressed blocks. This means two blocks with identical uncompressed content will also be identical after compression (for the same algorithm), and dedup will catch them.

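Compression does not stop dedup from matching identical data, but it adds one condition: the blocks must have been written with the same compression setting. Dedup compares checksums of whole blocks as stored on disk, so the same logical block written once with lz4 and once with zstd produces two different on-disk blocks that will not match. Dedup is also block-level, not file-level: two files that differ in one byte still share every block except the one containing that byte, as long as the blocks line up at the same offsets.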

In practice, use compression instead of dedup whenever possible. Compression gives 1.5–4x savings with near-zero overhead. Dedup gives additional savings only for truly block-identical data (identical VM images, identical backup copies) and costs 1–2GB of RAM per TB. For most workloads, compression alone is sufficient.
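
The "RAM math" for dedup is a multiplication: the number of blocks in the pool times the memory each dedup table entry needs. The sketch below uses a hypothetical helper (ddt_ram_gib) and two loud assumptions: roughly 320 bytes of RAM per entry, which is a commonly quoted rule of thumb rather than a figure from this page, and every block being unique, which is the worst case.

```shell
#!/bin/sh
# Hypothetical estimate: ddt_ram_gib DATA_TIB RECORD_KIB [BYTES_PER_ENTRY]
# Assumptions: ~320 bytes per dedup table entry and all blocks unique.
ddt_ram_gib() {
    awk -v tib="$1" -v rec="$2" -v entry="${3:-320}" 'BEGIN {
        blocks = tib * 1024 * 1024 * 1024 / rec          # TiB -> KiB, then records
        printf "%.1f\n", blocks * entry / (1024 * 1024 * 1024)
    }'
}

ddt_ram_gib 1 128    # 1 TiB of 128K records: prints 2.5
ddt_ram_gib 1 8      # 1 TiB of 8K records:   prints 40.0
```

Under these assumptions the cost grows quickly as records get smaller, which is why the estimate should be run against your actual recordsize before dedup is ever enabled.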

ARC and compression (compressed ARC)

Since ZFS on Linux 0.7 (2017), the ARC (Adaptive Replacement Cache) stores blocks compressed, in the same form they have on disk. This is a major improvement: your RAM cache holds more data per GB of RAM because the cached blocks are compressed.

How compressed ARC works

Before compressed ARC, the cache stored uncompressed blocks. A 128K record compressed to 64K on disk would still occupy 128K in the ARC after decompression. With compressed ARC, ZFS keeps the 64K compressed version in cache and decompresses on demand when an application reads the data. This effectively doubles (or more) your ARC's effective capacity, because every cached block is smaller.

The tradeoff is a small CPU cost for decompression on every cache hit. With LZ4, this cost is negligible (~5,500 MB/s decompression). With gzip, it's noticeable. This is another reason to use LZ4 or zstd: their decompression speed is high enough that compressed ARC is effectively free.

Compressed ARC: your 64GB of RAM works like 128GB+ of cache. The CPU cost is ~0 with LZ4.

# Check ARC statistics (compressed vs. uncompressed)
grep -E '^(size|compressed_size|uncompressed_size) ' /proc/spl/kstat/zfs/arcstats
# size                           4    34359738368
# compressed_size                4    16800432128
# uncompressed_size              4    33621204992

# In this example:
# ARC is using 32GB of RAM in total (size)
# The compressed data in ARC is 15.6GB (compressed_size)
# That 15.6GB represents 31.3GB of uncompressed data (uncompressed_size)
# Cached-data ratio: 31.3 / 15.6 = ~2.0x
# The rest of size is other ARC memory (headers, decompressed working buffers, and so on)

# arc_summary gives a friendlier view
arc_summary | grep -A5 "ARC size"
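
To track the cache-side ratio over time, the two counters can be divided directly. This is a minimal sketch with a hypothetical function name (arc_data_ratio); it reads kstat-formatted lines from stdin so it can be tested on saved output, and it only touches the live file if that file exists.

```shell
#!/bin/sh
# Hypothetical helper: ratio of uncompressed_size to compressed_size in the ARC.
# Expects arcstats-style lines ("name type value") on stdin.
arc_data_ratio() {
    awk '$1 == "compressed_size"   { c = $3 }
         $1 == "uncompressed_size" { u = $3 }
         END { if (c > 0) printf "%.2f\n", u / c }'
}

# Demo with the sample values from the output above: prints 2.00
printf 'size 4 34359738368\ncompressed_size 4 16800432128\nuncompressed_size 4 33621204992\n' \
    | arc_data_ratio

# On a Linux host with ZFS loaded, read the live counters instead.
if [ -r /proc/spl/kstat/zfs/arcstats ]; then
    arc_data_ratio < /proc/spl/kstat/zfs/arcstats
fi
```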

Real-world scenarios

Scenario 1: NAS with 8TB of mixed content

Home NAS: 4x 4TB in mirror pairs, LZ4 default

A typical home NAS stores documents, photos, videos, music, and system backups. Without compression: 16TB raw, 8TB after mirroring, which the system reports as ~7.3TB usable. With LZ4: the compressible portion (documents, configs, system backups) compresses at 2–3x. Media files don't compress but LZ4 early abort makes the attempt free. Typical result: 7.3TB usable becomes effectively 10–12TB of logical capacity. That's 2.7–4.7TB of free space you didn't have to buy hardware for.

Cost of LZ4 compression: $0. Space saved: ~$80–150 worth of drives.

Scenario 2: PostgreSQL database server

Database server: mirrored NVMe, LZ4, recordsize=8k

A PostgreSQL server with 500GB of data (text-heavy schema: user records, messages, metadata). With LZ4 and recordsize=8k, the compression ratio is typically 2.0–2.5x. 500GB of logical data occupies 200–250GB on disk. The I/O reduction means: less data read from NVMe on every query, more data fits in ARC when data caching is enabled (primarycache=all rather than metadata-only), because compressed ARC roughly doubles effective cache, and write amplification decreases. The database is both smaller and faster with compression enabled.

Compression makes databases faster, not slower. Counter-intuitive but consistently true.

Scenario 3: VM host with 50 Linux guests

KVM/Proxmox host: RAIDZ2 on spinning rust, LZ4

50 Linux VMs, each with a 40GB disk. Without compression: 2TB of raw storage. With LZ4: the average Linux guest compresses at 1.8–2.2x (OS files, packages, configs compress well; application data varies). Typical result: 2TB logical = 900GB–1.1TB physical. You just saved roughly 50% of your disk budget. For VMs built from the same base image, compression does not improve further: each block is compressed on its own, so two copies of the same OS file are compressed twice and stored twice. Collapsing those identical blocks is the job of clones or dedup.

50 VMs at 2x compression = 25 VMs worth of disk space. No RAM overhead.

Scenario 4: Log aggregation server

Log server: zstd-3 for maximum savings on highly compressible data

A central log aggregation server ingesting 50GB/day of syslog, JSON application logs, and audit trails. With zstd-3, log data typically compresses at 5–8x. 50GB/day becomes 6–10GB/day on disk. Over a 90-day retention period: 4.5TB of logical logs occupies 560–900GB on disk. The write throughput of zstd-3 (900 MB/s) far exceeds the ingest rate (50GB/day = ~0.6 MB/s average), so the CPU impact is negligible. Use zstd-3 here instead of LZ4 because the 30–50% better ratio saves hundreds of gigabytes and the CPU cost is irrelevant at these ingest rates.

Logs are compression gold. zstd-3 turns 4.5TB of retention into 800GB.
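
The retention numbers above are plain division. A small sketch that reproduces them, using integer arithmetic and decimal units (function names are illustrative; the 5x and 8x ratios are the planning figures from the scenario, not measurements):

```shell
#!/usr/bin/env bash
# retention_gb DAILY_GB DAYS RATIO -> on-disk GB, rounded down
retention_gb() {
  local daily_gb=$1 days=$2 ratio=$3
  echo $(( daily_gb * days / ratio ))
}

# ingest_kb_per_s DAILY_GB -> average ingest rate in kB/s
ingest_kb_per_s() {
  local daily_gb=$1
  echo $(( daily_gb * 1000000 / 86400 ))
}

retention_gb 50 90 5     # low end of the 5-8x range
retention_gb 50 90 8     # high end of the 5-8x range
ingest_kb_per_s 50       # average rate for 50GB/day
```

Swap in your own daily volume and a ratio measured with zfs get compressratio to size a real retention window.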

CPU impact benchmarks

The most common concern about compression is CPU overhead. Here are real numbers from benchmarking on typical server hardware (Xeon Gold 6348, 28 cores, DDR4-3200).

Algorithm	Extra CPU (sequential write)	Extra CPU (random 4K write)	Throughput impact
compression=off	Baseline	Baseline	Baseline
lz4	+1–3%	+0.5–1%	5–15% faster (less I/O)
zstd-1	+3–8%	+2–5%	0–10% faster (I/O reduction offsets CPU)
zstd-3	+5–15%	+3–8%	0–5% slower on NVMe, 5–15% faster on HDD
zstd-7	+15–35%	+8–20%	5–20% slower on NVMe, 0–10% faster on HDD
zstd-19	+50–100%	+30–60%	30–60% lower write throughput
gzip-9	+80–150%	+40–80%	40–70% lower write throughput

The crucial insight: on spinning rust, LZ4 and zstd-3 actually improve throughput because the disk is the bottleneck, not the CPU. Compressing data means fewer bytes traverse the SATA/SAS bus, and the CPU finishes compression faster than the disk can write. On NVMe, the CPU becomes the bottleneck sooner, so higher compression levels (zstd-7+) can reduce write throughput. But even on NVMe, LZ4 is a net positive.

If someone tells you "compression hurts performance," they're either using gzip-9 on NVMe (don't) or they're repeating cargo-cult advice from 2005 when CPUs were 10x slower. On any hardware from the last decade, LZ4 compression is faster than not compressing. This is measurable, reproducible, and the reason every ZFS guide written by someone who has actually benchmarked it says "always enable LZ4."
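
The bottleneck argument above can be sketched as a simple model: logical write throughput is capped by the slower of the compressor and the disk multiplied by the compression ratio. The disk speeds and the 2.0x ratio below are hypothetical, the 900 MB/s compressor figure is the zstd-3 number quoted earlier on this page, and the model assumes a single compression stream (ZFS compresses in parallel, so real ceilings are higher):

```shell
#!/usr/bin/env bash
# effective_write_mbps DISK_MBPS RATIO_X10 COMPRESS_MBPS
# RATIO_X10 is the compression ratio times ten (20 means 2.0x).
effective_write_mbps() {
  local disk=$1 ratio_x10=$2 cpu=$3
  # Fewer bytes reach the disk, so it absorbs logical data faster.
  local disk_limit=$(( disk * ratio_x10 / 10 ))
  if [ "$disk_limit" -lt "$cpu" ]; then
    echo "$disk_limit"    # disk-bound
  else
    echo "$cpu"           # compressor-bound
  fi
}

effective_write_mbps 200 20 900     # hypothetical HDD at 200 MB/s
effective_write_mbps 3000 20 900    # hypothetical NVMe at 3000 MB/s
```

In this toy model the HDD stays disk-bound and gains from compression, while the NVMe drive becomes compressor-bound. That is the same pattern the table shows for the heavier zstd levels.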

Deduplication: understand it before you enable it

Deduplication requires 1–2GB of RAM per TB of deduplicated data.

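This is not a loose guideline; treat it as a hard requirement. The dedup table (DDT) stores an entry, keyed by checksum, for every unique block in the pool. For a pool with 128K recordsize, 1TB of unique data = ~8 million blocks = ~320MB of DDT keys (at ~40 bytes each). With smaller blocks (8K for databases), the DDT is 16x larger. Each entry is bigger than its key once block addresses, reference counts, and in-memory overhead are included (the sizing example later on this page uses ~320 bytes per entry), so with metadata overhead and ARC caching requirements the practical number is at least 1–2GB of RAM per TB.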

If the DDT doesn't fit in ARC (RAM), every write requires a random read from disk to check the DDT. Performance doesn't "degrade gradually" — it falls off a cliff. A pool that was doing 500 MB/s drops to 5 MB/s because every write blocks on a DDT lookup from spinning rust.
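
To see how block size and entry size drive the DDT footprint, here is a minimal sizing sketch. The function name is illustrative, and the two per-entry sizes are the ~40 byte and ~320 byte figures that appear on this page, not values read from a live pool:

```shell
#!/usr/bin/env bash
# ddt_mib DATA_GIB RECORD_KIB ENTRY_BYTES -> approximate DDT size in MiB
ddt_mib() {
  local data_gib=$1 record_kib=$2 entry_bytes=$3
  # One DDT entry per unique block.
  local blocks=$(( data_gib * 1024 * 1024 / record_kib ))
  echo $(( blocks * entry_bytes / 1024 / 1024 ))
}

ddt_mib 1024 128 40     # 1 TiB unique, 128K records, key only
ddt_mib 1024 8 40       # same data at 8K records: 16x more entries
ddt_mib 1024 128 320    # 128K records with ~320 bytes per entry
```

The second call shows why small-block datasets are the most expensive to deduplicate: the entry count, and with it the RAM needed, scales inversely with block size.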

How dedup works internally

When dedup is enabled (dedup=on), ZFS checksums every block before writing it. If the checksum already exists in the DDT, ZFS increments a reference count instead of writing the block. When the last reference to a block is removed (file deleted, snapshot destroyed), the reference count drops to zero and the block is freed.

The DDT is stored in the pool's metadata and must be accessible for every write operation. ZFS caches the DDT in ARC (RAM). If the DDT is larger than available ARC, portions are evicted, and subsequent writes must fetch DDT entries from disk. This is the death spiral.
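
The reference-counting logic described above can be illustrated with a toy model. This is not how ZFS stores or looks up the DDT; it is a bash sketch (bash 4+, with sha256sum from coreutils) that uses an associative array as a stand-in table, and every name in it is hypothetical:

```shell
#!/usr/bin/env bash
# Toy dedup table: checksum -> reference count.
declare -A ddt
written=0    # blocks that would actually be written to disk

write_block() {
  local sum
  sum=$(printf '%s' "$1" | sha256sum | cut -d' ' -f1)
  if [[ -n "${ddt[$sum]:-}" ]]; then
    ddt[$sum]=$(( ${ddt[$sum]} + 1 ))   # duplicate: bump the refcount only
  else
    ddt[$sum]=1                          # new block: record it and write it
    written=$(( written + 1 ))
  fi
}

free_block() {
  local sum
  sum=$(printf '%s' "$1" | sha256sum | cut -d' ' -f1)
  [[ -n "${ddt[$sum]:-}" ]] || return 0
  ddt[$sum]=$(( ${ddt[$sum]} - 1 ))
  # Last reference gone: the entry (and the block) can be freed.
  if (( ${ddt[$sum]} == 0 )); then unset "ddt[$sum]"; fi
}

for b in alpha beta alpha alpha gamma; do write_block "$b"; done
echo "entries=${#ddt[@]} written=$written"
```

Every write_block call has to consult the table before it knows whether to write. In this toy the lookup is an in-memory hash; in a real pool it is an ARC hit or a disk read, which is where the death spiral comes from.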

When dedup is actually valid

Identical VM images

100 VMs cloned from the same base image, where the OS partition is genuinely block-identical. Dedup ratios of 10–50x are possible. But ZFS clones are usually better — they give the same savings without the RAM cost. Use zfs clone instead.

Backup dedup

Multiple full backups of the same machine where blocks are unchanged between backups. But again, ZFS snapshots and incremental send/receive are almost always better than dedup for this use case.

VDI (virtual desktops)

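Hundreds of identical Windows desktops. This is the canonical dedup use case. It works if you have the RAM, and the DDT is sized by unique blocks, not logical size: 100TB of logical VDI data at 20:1 is about 5TB unique, which needs roughly 5–10GB of RAM at 128K records and up to 16x that (80–160GB) if the VM volumes use 8K blocks. If you have 256GB+ RAM, it's viable.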

Checking dedup viability before enabling

# BEFORE enabling dedup: simulate the dedup ratio without the RAM cost
# This scans the pool and reports what the dedup ratio *would* be
zdb -S tank

# Output looks like:
# Simulated DDT histogram:
#
# refcnt   blocks   LSIZE   PSIZE   DSIZE   ...
# ------   ------   -----   -----   -----
#      1    8.32M    996G    498G    498G    ...
#      2    1.21M    145G    72.5G   72.5G   ...
#      4    32.4K   3.89G   1.94G   1.94G   ...
#  Total    9.56M   1.12T    572G    572G    ...
#
# dedup = 1.14, ...

# dedup is referenced size divided by allocated size. Most blocks above have
# refcnt 1, so "dedup = 1.14" means dedup would save only ~12% of space.
# Now calculate RAM needed:
# Total unique blocks: 9.56M
# DDT entry size: ~320 bytes (with overhead)
# RAM needed: 9.56M * 320 = ~3GB
# The DDT would fit in 3GB of ARC, but a ratio this low is not worth the RAM.

# Check current DDT size on an existing dedup-enabled pool
zpool status -D tank

Run zdb -S before you even think about enabling dedup. If the simulated ratio is below 2x, dedup isn't worth the RAM. At 2x, compression alone probably gets you similar savings with zero overhead. Dedup only makes economic sense above 3–5x, and even then, ask yourself: would ZFS clones or snapshots solve the same problem without the RAM tax? The answer is almost always yes.
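
The rule of thumb above can be written down as a small helper. This is a sketch with illustrative function names; the thresholds are the 2x and 3x figures from the paragraph above, and the ratio is passed as an integer (ratio times 100) to stay in shell integer arithmetic:

```shell
#!/usr/bin/env bash
# dedup_savings_pct RATIO_X100 -> percent of space saved, rounded to nearest
dedup_savings_pct() {
  local r=$1
  echo $(( ((r - 100) * 100 + r / 2) / r ))
}

# dedup_verdict RATIO_X100 -> verdict based on the thresholds in the text
dedup_verdict() {
  local r=$1
  if [ "$r" -lt 200 ]; then
    echo "skip: not worth the RAM"
  elif [ "$r" -lt 300 ]; then
    echo "marginal: compression likely saves as much"
  else
    echo "candidate: rule out clones and snapshots first"
  fi
}

for r in 150 200 400; do
  echo "$r: $(dedup_savings_pct "$r")% saved, $(dedup_verdict "$r")"
done
```

Feed it the ratio that zdb -S reports (for example 150 for a simulated 1.50x) before committing any RAM to a DDT.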

Disabling dedup

If you've enabled dedup and want to undo it: zfs set dedup=off pool/dataset stops deduplicating new writes, but existing deduplicated blocks remain in the DDT until overwritten or deleted. The DDT doesn't shrink until the referenced data is gone. To fully remove dedup, you must rewrite all data: zfs send | zfs receive to a new pool or dataset with dedup=off.

# Disable dedup on new writes (existing DDT remains)
zfs set dedup=off tank/vdi

# To fully remove the DDT, rewrite the data
zfs snapshot tank/vdi@migrate
zfs send tank/vdi@migrate | zfs receive -o dedup=off tank/vdi-new
# Verify data, then swap datasets

# Fast dedup (OpenZFS 2.3+) is a pool feature, not a dedup= property value
# Enable the feature, then turn dedup on; DDTs created afterwards use the new format
zpool set feature@fast_dedup=enabled tank
zfs set dedup=on tank/vdi

Fast dedup (OpenZFS 2.3+)

OpenZFS 2.3 introduced fast dedup, a pool feature (fast_dedup) that reworks how the DDT is stored and updated. Changes to the table are collected in a log and flushed in batches instead of being applied as scattered random updates, the table can be loaded into ARC ahead of time with zpool prefetch -t ddt, and its size can be capped with the dedup_table_quota pool property. Once the quota is reached, new unique blocks are written without dedup while existing entries keep deduplicating.

Fast dedup also reduces the penalty for workloads where most blocks are unique (low dedup ratio). zpool ddtprune removes older entries that were never duplicated, so the table stops growing with data that will never dedup. However, the DDT still exists and still needs RAM for the entries that remain. Fast dedup is an improvement, not a cure.

Fast dedup makes dedup less awful, but the fundamental economics haven't changed. Compression is still free. Dedup still costs RAM. If you're considering dedup, fast dedup is the way to do it on 2.3+, but make sure you've exhausted every alternative first: ZFS clones for identical VMs, incremental send/receive for backups, compression for general savings. Dedup should be the last tool you reach for, not the first.

kldload defaults & why

kldload sets compression=lz4 on every pool at creation time, across all profiles (desktop, server, core) and all target distros. dedup=off is the default and kldload does not expose dedup as an option in the web UI.

compression=lz4

Set on the pool root at creation. Inherited by all child datasets. Provides 1.5–2.5x space savings on typical workloads with zero measurable performance impact. Cannot cause harm. Always beneficial.

dedup=off

Not exposed in the UI because the RAM requirement makes it dangerous for unprepared users. Enabling dedup on a 4TB pool with 8GB of RAM would destroy performance. We'd rather users not have the footgun.

Why not zstd?

LZ4 is safer as a universal default. zstd-3 is better for NAS/archival workloads, but on latency-sensitive workloads (databases, VMs), LZ4's near-zero CPU overhead is preferred. Users who need zstd can set it per-dataset after install.

# kldload pool creation (from kldload-install-target)
# Note: relatime=on has no effect while atime=off; it only applies if atime is later turned on
zpool create -o ashift=12 \
  -O compression=lz4 \
  -O atime=off \
  -O xattr=sa \
  -O dnodesize=auto \
  -O relatime=on \
  rpool mirror /dev/disk/by-id/... /dev/disk/by-id/...

# After install, customize per-dataset as needed
zfs set compression=zstd-3 rpool/home
zfs set compression=zstd-7 rpool/var/log

The single most impactful thing kldload does for users who don't know ZFS is enabling LZ4 at pool creation. Most people don't think about compression until they run out of space. By then, changing compression only affects new writes. Enabling it from the start means every byte that ever touches the pool is compressed. It's the kind of default you should never have to think about, and it saves real money on real hardware.

Quick reference

zfs set compression=lz4 tank

Enable LZ4 on a dataset (inherits to children).

zfs get compressratio tank

Check compression ratio for a dataset.

zfs get compressratio,logicalused,used -r tank

Space savings for all datasets in a pool.

zfs get compression -r tank

Check compression algorithm for all datasets (shows inheritance).

zdb -S tank

Simulate dedup ratio without enabling dedup (read-only, safe).

zpool status -D tank

Show DDT statistics on a dedup-enabled pool.

arc_summary

Show ARC statistics including compressed ARC size.

zfs list -o name,compress,compressratio -r tank

Compact view of algorithm and ratio per dataset.
