Memory & ARC — kldload

Memory & ARC — the engine that makes ZFS fast.

ARC (Adaptive Replacement Cache) is ZFS's in-memory read cache. It's one of ZFS's biggest strengths — and one of the most common sources of confusion when people see ZFS "eating all their RAM." This page explains how ZFS memory actually works, how to monitor it, how to tune it, and when to panic (almost never).

ZFS using all your RAM is not a bug. It's the entire point. ARC is reclaimable cache — it gives memory back when applications need it. The problem is that free -h and htop show it as "used," which sends people into a panic. Every "ZFS is eating my RAM" forum post is someone misreading free output. Read this page before you start tweaking.

ARC — Adaptive Replacement Cache

ZFS's ARC is a modified variant of the Adaptive Replacement Cache algorithm published by IBM researchers (Megiddo and Modha). It maintains four lists that self-tune based on your actual access patterns. This is what makes it smarter than a simple LRU cache: it learns whether your workload favors recency or frequency and adapts automatically.

MRU — Most Recently Used

Data that was accessed once recently. Sequential scans, newly opened files, first-time reads. This is ARC's "short-term memory." If the data is accessed again, it graduates to MFU.

Think: browser tabs you just opened. May or may not be accessed again.

MFU — Most Frequently Used

Data accessed at least twice. Database indexes, config files, hot application data. This is ARC's "long-term memory." MFU entries survive longer under eviction pressure because ZFS knows you'll probably need them again.

Think: browser tabs you keep going back to. ZFS keeps these warm.

MRU Ghost List

Metadata-only entries for data recently evicted from MRU. ZFS doesn't store the data, just the block pointer. If a ghost entry is accessed, ZFS knows it should have kept that data — and grows the MRU target size to cache more recent data in the future.

Think: "I remember throwing this away. I should have kept it."

MFU Ghost List

Metadata-only entries for data recently evicted from MFU. Same principle: ghost hits tell ARC to grow the MFU target size. The two ghost lists create a feedback loop that continuously optimizes the MRU/MFU split ratio for your actual workload.

Think: ARC uses its own eviction mistakes to get smarter over time.

The genius of ARC is the ghost lists. A pure LRU cache has no memory of what it evicted. ARC's ghosts let it learn from mistakes: "I evicted block X from MFU and something asked for it 1 second later, so I need a bigger MFU." This self-tuning happens on every ghost hit, with no operator intervention. The MRU/MFU balance constantly shifts to match your workload's access pattern.
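To make the feedback loop concrete, here is a toy sketch of the adjustment step. It is not the OpenZFS code: the function name adapt_target, the fixed step size, and counting in blocks are all assumptions for illustration. The real algorithm scales the step rather than using a constant.

```shell
#!/bin/sh
# Toy model of ghost-list feedback (illustrative only, not OpenZFS code).
# p = current target size of the MRU side, c = total cache size, both in blocks.
# Usage: adapt_target P C EVENT [STEP]   -> prints the new MRU target
adapt_target() {
    p=$1; c=$2; event=$3; step=${4:-1}
    case "$event" in
        mru_ghost_hit)
            # A recently-seen block was evicted too early: give MRU more room.
            p=$((p + step))
            if [ "$p" -gt "$c" ]; then p=$c; fi
            ;;
        mfu_ghost_hit)
            # A frequently-used block was evicted too early: give MFU more room.
            p=$((p - step))
            if [ "$p" -lt 0 ]; then p=0; fi
            ;;
    esac
    echo "$p"
}
```

For example, adapt_target 4 8 mru_ghost_hit prints 5, and an MFU ghost hit moves the target back the other way. The target is clamped so neither side can take more than the whole cache.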

The ghost lists are the reason ARC beats a plain LRU or LFU cache. They're also why "just use a bigger page cache" from Linux people misses the point. The Linux page cache is built around an LRU approximation with active and inactive lists; ARC explicitly tracks both recency and frequency and rebalances between them using ghost hits. When ZFS people say "ARC is smarter than the page cache," the ghost lists are why.

Data vs metadata in ARC

ARC caches two fundamentally different things, and understanding the distinction is critical for tuning:

Data

Actual file contents. Database rows, video files, application binaries. This is the bulk of ARC in most workloads. Evicting data means the next read hits disk — slower, but the system keeps working.

Metadata

Directory entries, file attributes, block pointers, dnode structures. Metadata is what ZFS needs to find and navigate to your data. Evicting metadata means even ls has to read from disk. On a spinning-rust pool with millions of files, metadata eviction can make the system feel completely unresponsive.

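The zfs_arc_meta_limit parameter (default: 75% of ARC max) controls how much of ARC can be consumed by metadata. If metadata exceeds this limit, ZFS begins evicting metadata entries. This matters because some workloads (container hosts, build servers, source trees with millions of small files) are metadata-heavy: left unchecked, they can fill ARC with metadata and leave no room for actual data caching. One caveat: OpenZFS 2.2 reworked data/metadata balancing and, as far as I know, replaced this tunable with zfs_arc_meta_balance, so check what actually exists under /sys/module/zfs/parameters on your release.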

# Check metadata vs data usage in ARC
arc_summary | grep -A 5 "ARC size"

# From /proc/spl/kstat/zfs/arcstats:
# data_size    = bytes used for data
# metadata_size = bytes used for metadata
# arc_meta_limit = maximum metadata allowed in ARC
cat /proc/spl/kstat/zfs/arcstats | grep -E "data_size|metadata_size|arc_meta_limit"
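If you want a single number instead of eyeballing the two sizes, a small helper can compute the metadata share. This is a sketch: the function name arc_meta_share is made up, and it assumes the usual three-column "name type value" layout of arcstats with data_size and metadata_size present.

```shell
#!/bin/sh
# Print metadata as a percentage of (data + metadata) held in ARC.
# Usage: arc_meta_share [ARCSTATS_FILE]
arc_meta_share() {
    awk '$1 == "data_size"     { d = $3 }
         $1 == "metadata_size" { m = $3 }
         END {
             if (d + m > 0) printf "%.1f\n", 100 * m / (d + m)
             else print "n/a"
         }' "${1:-/proc/spl/kstat/zfs/arcstats}"
}
```

Run it with no arguments on a ZFS host, or point it at a saved copy of arcstats. A share that keeps climbing toward the metadata limit is the signal that a metadata-heavy workload is squeezing out data.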

Compressed ARC (OpenZFS 0.7+)

Since the 0.7 release, ARC can store data in compressed form. Instead of decompressing blocks before caching them, ZFS keeps the compressed version in ARC and decompresses on read. This effectively multiplies your ARC capacity by your compression ratio.

With LZ4 compression achieving typical ratios of 1.5–2.5x on general data (and 3–10x on text-heavy data like logs and source code), compressed ARC means 32GB of RAM can cache 48–80GB of logical data. This is free performance — the decompression overhead is negligible on modern CPUs.

# Verify compressed ARC is active (on by default since 0.7)
cat /sys/module/zfs/parameters/zfs_compressed_arc_enabled
# 1 = enabled (default)

# Check effective compression in ARC
arc_summary | grep -i compress
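The effective multiplier can also be read straight from the raw counters. A minimal sketch, assuming arcstats exposes compressed_size and uncompressed_size (the helper name arc_compress_ratio is hypothetical):

```shell
#!/bin/sh
# Print the ratio of logical (uncompressed) to physical (compressed) bytes in ARC.
# Usage: arc_compress_ratio [ARCSTATS_FILE]
arc_compress_ratio() {
    awk '$1 == "compressed_size"   { c = $3 }
         $1 == "uncompressed_size" { u = $3 }
         END {
             if (c > 0) printf "%.2f\n", u / c
             else print "n/a"
         }' "${1:-/proc/spl/kstat/zfs/arcstats}"
}
```

A result of 2.00 means ARC is holding twice as much logical data as the RAM it occupies.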

Compressed ARC is one of the best ZFS improvements in years. Before 0.7, ZFS would decompress every block before caching it, so a dataset compressed 2x on disk took up twice its on-disk size in ARC. Now ARC stores compressed blocks and decompresses on demand. If you're somehow still running a pre-0.7 release, upgrade. This alone is worth it.

ARC size tuning

ARC has four key size parameters. On Linux, you almost always need to set at least zfs_arc_max explicitly, because the kernel's memory pressure heuristics fight with ARC's desire to cache everything.

zfs_arc_max

Maximum ARC size in bytes. Default: 50% of RAM (but the kernel can shrink it below this). Set this to the amount of RAM you want ZFS to use for caching. Must leave enough for the OS, applications, and other caches.

zfs_arc_min

Minimum ARC size in bytes. Default: 1/32 of RAM or 64MB (whichever is larger). ARC won't shrink below this even under heavy memory pressure. Set this to prevent ARC from being completely evicted during memory spikes.

zfs_arc_meta_limit

Maximum metadata in ARC. Default: 75% of zfs_arc_max. Prevents metadata from consuming all of ARC. Increase on metadata-heavy workloads (containers, millions of small files). Decrease if you need more room for data caching.

zfs_arc_meta_min

Minimum metadata in ARC. Default: 0. Prevents metadata from being completely evicted. Set to 25–50% of arc_max on workloads where metadata eviction causes severe stalls (e.g., NAS with millions of files).

Setting ARC limits — runtime and persistent

# Runtime — takes effect immediately, lost on reboot
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max    # 8 GB
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_min    # 2 GB
echo 6442450944 > /sys/module/zfs/parameters/zfs_arc_meta_limit  # 6 GB

# Persistent: survives reboot (note: this overwrites any existing zfs.conf)
cat > /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_arc_max=8589934592
options zfs zfs_arc_min=2147483648
options zfs zfs_arc_meta_limit=6442450944
EOF

# Rebuild initramfs so early-boot uses these values
# CentOS/RHEL/Rocky/Fedora:
dracut --force
# Debian/Ubuntu:
update-initramfs -u

Byte values are annoying. Here's a cheat sheet: 1 GB = 1073741824, 2 GB = 2147483648, 4 GB = 4294967296, 8 GB = 8589934592, 16 GB = 17179869184, 32 GB = 34359738368, 64 GB = 68719476736. Or just use Python: python3 -c "print(8 * 1024**3)" for 8 GB.
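If you'd rather stay in the shell, the same conversion is a one-liner. The helper names gib_to_bytes and arc_modprobe_line are invented for this sketch; the second one just formats a line in the style used in /etc/modprobe.d/zfs.conf above.

```shell
#!/bin/sh
# Convert whole GiB to bytes using shell integer arithmetic.
# Usage: gib_to_bytes N
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print a modprobe options line for a ZFS size parameter given in GiB.
# Usage: arc_modprobe_line PARAM GIB
arc_modprobe_line() {
    echo "options zfs $1=$(gib_to_bytes "$2")"
}
```

gib_to_bytes 8 prints 8589934592, matching the cheat sheet, and arc_modprobe_line zfs_arc_max 16 prints a ready-to-paste options line.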

The "1 GB per TB" myth — debunked

You'll see this "rule" repeated everywhere: "ZFS needs 1 GB of RAM per TB of storage." This is wrong. Or rather, it's so oversimplified that following it will lead you to both over-provisioning and under-provisioning depending on the workload.

Where the myth came from: early ZFS on Solaris, with deduplication enabled, the dedup table (DDT) consumed roughly 1 GB of RAM per TB of deduplicated data. Someone generalized this to all ZFS deployments, and it stuck. Without dedup, the storage capacity has almost nothing to do with RAM requirements.

What actually determines ZFS RAM needs:

Working set size

How much data is actively being read. A 100 TB archive where you read 50 GB/day needs far less ARC than a 2 TB database with 200 GB of hot rows.

File count

Each file's metadata (dnode) consumes ARC. A pool with 50 million small files needs significantly more metadata ARC than a pool with 1,000 large files — even at the same total capacity.

Deduplication

The DDT must fit in memory for acceptable performance. ~320 bytes per block. At 128K recordsize, that's ~2.5 GB per TB. At 4K recordsize (databases), it's ~80 GB per TB. Dedup sizing rules like these are where "1 GB per TB" came from, and they only apply with dedup on.

Access pattern

Sequential reads (backups, media streaming) barely benefit from ARC. Random reads (databases, VMs, containers) benefit enormously. Same storage, same capacity, wildly different RAM needs.

I've run 20 TB pools on 8 GB of RAM (backup server, sequential writes, almost no reads) and seen 2 TB pools choke on 32 GB (PostgreSQL with dedup enabled — don't ask). The "1 GB per TB" rule is cargo cult. Size your RAM for your workload, not your capacity.

ARC on Linux — the memory pressure problem

On FreeBSD and Solaris/illumos, ARC and the kernel memory allocator cooperate natively. ARC registers itself as reclaimable memory, and the kernel politely asks ARC to shrink when applications need RAM. ARC complies gracefully, evicting cold entries while keeping hot data warm.

On Linux, it's a fight. The kernel's memory reclaim subsystem treats ARC as a second-class citizen. Under memory pressure, Linux can aggressively evict ARC entries, including hot MFU data that took hours to warm up. The result: unpredictable performance cliffs. ZFS has been caching your database's working set for 6 hours, the kernel comes under memory pressure, the shrinker throws that cache away, and your next query takes 10x longer because it's reading from disk.

Why ZFS shows high memory usage (it's supposed to)

When you run free -h and see 28 GB "used" on a 32 GB system, that's ARC doing its job. ARC is reclaimable: it gives memory back when applications request it. But free reports it as "used," not "buff/cache," because ARC memory is allocated inside the kernel by the ZFS module, not through the page cache. This confuses monitoring tools and humans alike.

The correct way to check actual memory availability:

# MemAvailable usually does NOT count ARC, so treat it as a lower bound
# (reclaimable ARC is roughly size minus c_min from arcstats)
grep -E "MemAvailable|AnonPages|Slab" /proc/meminfo

# This shows ARC size directly
cat /proc/spl/kstat/zfs/arcstats | grep "^size"
# size  4  17179869184
# (that's 16 GB of ARC)
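On many kernel and OpenZFS combinations MemAvailable does not account for ARC at all, so a more realistic headroom figure adds the part of ARC that can be given back. The sketch below assumes that behavior and treats everything above c_min as reclaimable; the function name effective_available_kib is hypothetical.

```shell
#!/bin/sh
# Estimate available memory in KiB as MemAvailable + (ARC size - ARC c_min).
# Usage: effective_available_kib [MEMINFO_FILE] [ARCSTATS_FILE]
effective_available_kib() {
    meminfo=${1:-/proc/meminfo}
    arcstats=${2:-/proc/spl/kstat/zfs/arcstats}
    avail_kib=$(awk '$1 == "MemAvailable:" { print $2 }' "$meminfo")
    arc_size=$(awk '$1 == "size"  { print $3 }' "$arcstats")
    arc_min=$(awk '$1 == "c_min" { print $3 }' "$arcstats")
    # ARC will not shrink below c_min, so only the excess counts as reclaimable.
    reclaim=$(( arc_size - arc_min ))
    if [ "$reclaim" -lt 0 ]; then reclaim=0; fi
    echo $(( avail_kib + reclaim / 1024 ))
}
```

This is an upper-bound estimate: ARC needs time to shrink, so don't plan on all of it being free at the exact moment an application asks.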

ARC memory is like a library that instantly returns books when you need the shelf space. "Used" is technically true. "Unavailable" is not.

Fixing ARC eviction pressure on Linux

# 1. Set explicit ARC max — don't let Linux decide
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max  # 16 GB

# 2. Set ARC min to prevent complete eviction during memory spikes
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_min   # 4 GB

# 3. Make persistent
cat > /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_arc_max=17179869184
options zfs zfs_arc_min=4294967296
EOF

# 4. Rebuild initramfs
dracut --force  # RHEL/CentOS/Rocky/Fedora
# update-initramfs -u  # Debian/Ubuntu

Rule of thumb for Linux: set zfs_arc_max to leave at least 4 GB for the OS and applications (more if running VMs, databases, or Java). Set zfs_arc_min to 25–50% of zfs_arc_max to prevent complete ARC eviction under load.
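That rule of thumb is easy to turn into a calculator. This is a sketch under stated assumptions: the name suggest_arc_limits is made up, the reserve defaults to the 4 GB floor from the rule, and zfs_arc_min defaults to 25% of the maximum (the low end of the suggested range).

```shell
#!/bin/sh
# Print modprobe option lines for zfs_arc_max and zfs_arc_min.
# Usage: suggest_arc_limits TOTAL_GIB [RESERVE_GIB] [MIN_PERCENT]
suggest_arc_limits() {
    total=$1; reserve=${2:-4}; min_pct=${3:-25}
    max_gib=$(( total - reserve ))
    if [ "$max_gib" -lt 1 ]; then
        echo "not enough RAM to reserve ${reserve} GiB" >&2
        return 1
    fi
    max=$(( max_gib * 1024 * 1024 * 1024 ))
    min=$(( max * min_pct / 100 ))
    echo "options zfs zfs_arc_max=$max"
    echo "options zfs zfs_arc_min=$min"
}
```

suggest_arc_limits 32 16 reproduces the 16 GB / 4 GB values used in the example above for a 32 GB host that keeps 16 GB back for VMs and the OS. Review the output before writing it anywhere.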

On FreeBSD, ARC manages itself. On Linux, you manage ARC, or the kernel manages it badly. Every Linux ZFS deployment should have explicit zfs_arc_max in /etc/modprobe.d/zfs.conf. No exceptions.

Monitoring ARC — arcstat and arc_summary

ZFS exposes detailed ARC statistics through /proc/spl/kstat/zfs/arcstats. Two tools make this data human-readable: arc_summary (point-in-time snapshot) and arcstat (live streaming).

arc_summary — the full picture

# Full ARC summary (run as root)
arc_summary

# Key sections in the output:
# ARC Summary
#   ARC size (current):              16.0 GiB
#   Target size (adaptive):          16.0 GiB
#   Min size (hard limit):            4.0 GiB
#   Max size (high water):           16.0 GiB
#
# ARC Efficiency
#   Cache hit ratio:                 94.32%   <-- this is your primary metric
#   Cache miss ratio:                 5.68%
#   Actual hit ratio:                96.71%   <-- includes prefetch hits
#
# Cache Hits by Type
#   Demand data hits:            1,247,831    <-- real application reads from cache
#   Demand metadata hits:          892,441    <-- metadata lookups from cache
#   Prefetch data hits:            124,982    <-- speculative prefetch hits
#   Prefetch metadata hits:         41,220    <-- speculative metadata prefetch hits
#
# Cache Misses by Type
#   Demand data misses:             62,391    <-- had to go to disk
#   Demand metadata misses:         18,442    <-- metadata not in cache

The cache hit ratio is the single most important ARC metric. Above 90% is good. Above 95% is excellent. Below 80% means your working set is larger than ARC — either add RAM, add L2ARC, or reduce the working set. Below 50% means ARC is essentially useless and something is wrong (ARC too small, sequential scan thrashing, or the workload simply doesn't benefit from caching).
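Those thresholds fit in a few lines if you want them in a monitoring script. The function name classify_hit_ratio and the label strings are assumptions; the text above doesn't name the 80–90% band, so it is called "borderline" here.

```shell
#!/bin/sh
# Map an ARC hit ratio (percent) to the bands described above.
# Usage: classify_hit_ratio PERCENT
classify_hit_ratio() {
    awk -v r="$1" 'BEGIN {
        if      (r > 95)  print "excellent"
        else if (r > 90)  print "good"
        else if (r >= 80) print "borderline"
        else if (r >= 50) print "working set exceeds ARC"
        else              print "ARC ineffective"
    }'
}
```

Feed it the percentage from arc_summary with the % sign stripped, for example classify_hit_ratio 94.32.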

arcstat — live monitoring

# Stream ARC stats every 5 seconds
arcstat 5

# Output columns explained:
#   time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size  c
#
#   read   = total ARC reads/sec
#   miss   = total ARC misses/sec
#   miss%  = miss rate (lower is better)
#   dmis   = demand misses (real application reads that missed)
#   dm%    = demand miss rate (the metric you care about most)
#   pmis   = prefetch misses (less critical — prefetch is speculative)
#   pm%    = prefetch miss rate
#   mmis   = metadata misses
#   mm%    = metadata miss rate
#   size   = current ARC size
#   c      = ARC target size (what ARC wants to be)

# Watch for:
# - dm% consistently above 10%: ARC is too small for your workload
# - mm% above 5%: metadata not fitting in ARC, consider special vdev
# - size << c: Linux is evicting ARC below its target (memory pressure)
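The last check in that list can be automated from the raw counters. A rough sketch: arc_below_target is a hypothetical name, and the 90% default threshold is an arbitrary choice for "noticeably below target", not a ZFS constant.

```shell
#!/bin/sh
# Report whether current ARC size is well below its target size c.
# Usage: arc_below_target [ARCSTATS_FILE] [THRESHOLD_PERCENT]
arc_below_target() {
    awk -v t="${2:-90}" '
        $1 == "size" { s = $3 }
        $1 == "c"    { c = $3 }   # exact match, so c_min and c_max are ignored
        END {
            if (c > 0 && 100 * s / c < t) print "below-target"
            else print "ok"
        }' "${1:-/proc/spl/kstat/zfs/arcstats}"
}
```

A single "below-target" right after boot is normal while ARC is still warming up. It's the sustained ones on a long-running box that point at memory pressure.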

/proc/spl/kstat/zfs/arcstats — raw data

# Key fields in arcstats (values in bytes unless noted)
cat /proc/spl/kstat/zfs/arcstats

# size          — current ARC size
# c             — target ARC size (what ARC wants)
# c_min         — minimum ARC size (zfs_arc_min)
# c_max         — maximum ARC size (zfs_arc_max)
# hits          — total cache hits (counter)
# misses        — total cache misses (counter)
# demand_data_hits    — application data reads served from cache
# demand_data_misses  — application data reads that went to disk
# demand_metadata_hits   — metadata lookups from cache
# demand_metadata_misses — metadata lookups from disk
# prefetch_data_hits     — speculative prefetch served from cache
# prefetch_data_misses   — speculative prefetch from disk
# mru_hits      — hits from MRU list
# mfu_hits      — hits from MFU list
# mru_ghost_hits  — hits on MRU ghost list (ARC auto-tuning signal)
# mfu_ghost_hits  — hits on MFU ghost list (ARC auto-tuning signal)
# evict_l2_cached — bytes evicted from ARC that were also in L2ARC
# data_size     — bytes of data in ARC
# metadata_size — bytes of metadata in ARC
# arc_meta_limit — max metadata allowed in ARC

If you're only going to look at one metric, look at the demand data hit ratio: demand_data_hits / (demand_data_hits + demand_data_misses). This tells you how often real application reads are served from memory vs disk. Everything else is details. Prefetch misses are expected and fine — prefetch is speculative by nature.
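Here is that formula as a helper, reading the two counters named above. The function name demand_data_hit_ratio is just for this sketch. Keep in mind the counters are cumulative since module load, so this is a lifetime average rather than a current rate.

```shell
#!/bin/sh
# Print demand_data_hits / (demand_data_hits + demand_data_misses) as a percentage.
# Usage: demand_data_hit_ratio [ARCSTATS_FILE]
demand_data_hit_ratio() {
    awk '$1 == "demand_data_hits"   { h = $3 }
         $1 == "demand_data_misses" { m = $3 }
         END {
             if (h + m > 0) printf "%.2f\n", 100 * h / (h + m)
             else print "n/a"
         }' "${1:-/proc/spl/kstat/zfs/arcstats}"
}
```

To see what the cache is doing right now, sample the two counters twice and apply the same formula to the differences.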

ARC eviction — how data leaves the cache

ARC doesn't evict randomly. The eviction algorithm considers the ghost list feedback:

Under pressure

When ARC hits c_max (or the kernel demands memory on Linux), ARC evicts from the tail of MRU or MFU. Which list loses entries depends on the current ghost list balance — if MRU ghost hits are high, ARC protects MRU and evicts from MFU, and vice versa.

MRU eviction

Data accessed only once is evicted first. This protects the working set (MFU) from sequential scans. A find / or tar won't flush your database cache.

MFU eviction

Only happens when MFU is larger than its target size (driven by ghost list balance). Frequently-accessed data is the last to go.

Metadata eviction

When metadata exceeds arc_meta_limit, ZFS evicts metadata entries regardless of MRU/MFU status. This prevents metadata from starving data caching.

The critical takeaway: ARC is scan-resistant. Unlike a simple LRU cache where a single find / -type f would flush every warm entry, ARC puts scan data into MRU (one-time access). Your database's hot rows stay in MFU. This is why ARC outperforms the Linux page cache for workloads that mix random and sequential access.

L2ARC — the victim cache on SSD

L2ARC is a victim cache: it captures data that ARC is evicting from memory and writes it to an SSD. When a future read misses ARC but hits L2ARC, ZFS reads from the SSD instead of the (much slower) data disks. L2ARC is read-only from the perspective of applications — it does nothing for write performance.

How L2ARC populates

L2ARC is not filled at the instant of eviction. A background feed thread periodically scans the tail ends of ARC's lists (the blocks next in line for eviction), checks whether each block is eligible for L2ARC, and writes the eligible ones to the cache device at a throttled rate (the l2arc_write_max tunable). This means L2ARC is always "behind" ARC: it holds data ARC was about to give up on. L2ARC fills gradually as ARC churns; a new cache device doesn't warm up instantly.

Persistent L2ARC (OpenZFS 2.0+)

Before OpenZFS 2.0, L2ARC was volatile. Every reboot started with a cold, empty L2ARC that had to be repopulated from ARC evictions. This could take hours or days for large L2ARC devices. Persistent L2ARC writes an index to the L2ARC device itself, so the cache survives reboots. On a cold start, ZFS reads the L2ARC index and the cache is immediately warm.

# Check if persistent L2ARC is enabled (default: on in 2.0+)
cat /sys/module/zfs/parameters/l2arc_rebuild_enabled
# 1 = persistent L2ARC active

# Add L2ARC device (no mirror needed — it's a cache)
zpool add tank cache /dev/nvme4n1

# Check L2ARC status
zpool iostat -v tank
arc_summary | grep -A 10 "L2ARC"

The hidden cost: L2ARC headers consume ARC

L2ARC's RAM tax

Every block cached in L2ARC needs a header in ARC (RAM) to index it. Each header is approximately 70–180 bytes depending on the block's metadata. This seems small until you do the math:

# L2ARC with 4K blocks (worst case: small recordsize)
# 1 TiB L2ARC / 4096 bytes per block = ~268 million blocks
# 268 million * ~100 bytes per header = ~27 GB (25 GiB) of ARC consumed by L2ARC headers
#
# L2ARC with 128K blocks (default recordsize)
# 1 TiB L2ARC / 131072 bytes per block = ~8.4 million blocks
# 8.4 million * ~100 bytes per header = ~840 MB (800 MiB) of ARC consumed

# Check L2ARC header overhead:
arc_summary | grep "Header size"
cat /proc/spl/kstat/zfs/arcstats | grep "l2_hdr_size"
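To run the same arithmetic for your own device before buying it, something like this works. The helper name l2arc_header_bytes is invented, and the 100-byte default header size is the rough midpoint used above, not a measured value.

```shell
#!/bin/sh
# Estimate ARC bytes consumed by headers for a full L2ARC device.
# Usage: l2arc_header_bytes L2ARC_BYTES BLOCK_BYTES [HEADER_BYTES]
l2arc_header_bytes() {
    echo $(( $1 / $2 * ${3:-100} ))
}
```

For a 1 TiB device, l2arc_header_bytes 1099511627776 4096 prints 26843545600 (about 25 GiB), while the same device with 128K blocks comes out at 838860800 (about 800 MiB).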

On systems with less than 64 GB RAM, a large L2ARC with small blocks can consume more ARC for its headers than the L2ARC is worth. You're stealing from ARC (nanosecond latency) to index L2ARC (microsecond latency). The net effect is negative.

L2ARC is like building an index for a warehouse. If the index itself fills your desk, you've traded fast desk access for a warehouse lookup. Make sure the warehouse is worth it.

When L2ARC helps vs when it's wasted

L2ARC helps

Read-heavy workloads where the working set exceeds RAM but fits on SSD. File servers, media libraries, build caches, read-replica databases, large source trees. The data pool is on spinning disks. ARC can't hold the full working set. L2ARC on NVMe gives you SSD-speed reads for the overflow.

L2ARC is wasted

Write-heavy workloads (L2ARC does nothing for writes). All-SSD pools (L2ARC on SSD is the same speed as the data disks — no benefit). Small working sets that fit in ARC. Low-RAM systems where L2ARC header overhead exceeds the benefit. Sequential workloads (backups, media streaming) where prefetch is more effective than caching.

L2ARC sizing guidelines

Size

L2ARC should be 5–10x your ARC size. With 16 GB ARC, an 80–160 GB L2ARC is a good starting point. Going larger requires checking that L2ARC header overhead doesn't exceed 10–15% of ARC.

Device

Any SSD works. L2ARC is a cache — losing it means a temporary performance drop, not data loss. No need for enterprise NVMe or power loss protection. Consumer NVMe is fine. Don't waste money here.

Redundancy

Don't try to mirror L2ARC. It's a disposable cache, and ZFS doesn't support mirrored or raidz cache vdevs anyway. If the SSD dies, reads fall back to the pool, and a replacement device is repopulated from ARC over time.

L2ARC is the most over-recommended ZFS feature. Everyone thinks "add an SSD cache, things go faster." But if your data pool is already on SSDs, L2ARC adds nothing. And on low-RAM systems, the header overhead makes things slower. Check l2_hdr_size in arcstats. If it's more than 10% of your ARC, your L2ARC is hurting you.
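The 10% check can be scripted against arcstats. A minimal sketch, assuming the l2_hdr_size and size fields mentioned on this page; the function name l2arc_header_percent is hypothetical.

```shell
#!/bin/sh
# Print L2ARC header overhead as a percentage of current ARC size.
# Usage: l2arc_header_percent [ARCSTATS_FILE]
l2arc_header_percent() {
    awk '$1 == "l2_hdr_size" { h = $3 }
         $1 == "size"        { s = $3 }
         END {
             if (s > 0) printf "%.1f\n", 100 * h / s
             else print "n/a"
         }' "${1:-/proc/spl/kstat/zfs/arcstats}"
}
```

Anything above 10.0 is the point where, by the rule of thumb above, the cache device is costing more RAM than it is likely to earn back.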

ZIL — the ZFS Intent Log (not a write cache)

The ZIL is the most misunderstood component of ZFS. It is not a write cache. It is a write-ahead log that records synchronous write transactions so they survive a crash. Understanding the ZIL requires understanding the difference between sync and async writes.

Sync vs async writes

Async writes

Default for most Linux applications. The application calls write(), the data goes into a transaction group in memory, and the application continues immediately. The data is flushed to disk when the transaction group commits (every 5 seconds by default, per zfs_txg_timeout, or sooner when enough dirty data piles up). The ZIL is not involved at all. SLOG does nothing for async writes.

Sync writes

The application calls write() + fsync() (or opens the file with O_SYNC). The application blocks until ZFS guarantees the data is on stable storage. Without ZIL, this means waiting for the full transaction group commit. With ZIL, ZFS writes the data to the ZIL device and returns immediately — the data is safe on the ZIL, and the transaction group can commit at its leisure.

The ZIL exists on the data pool by default. Every pool has a ZIL. A SLOG (Separate LOG device) moves the ZIL to a dedicated fast device — typically NVMe with power loss protection. The SLOG doesn't change what the ZIL does; it changes where the ZIL lives, making sync writes faster by avoiding seeks on the data disks.

How the ZIL actually works

# Application calls: write() + fsync()

# 1. ZFS writes the data to the ZIL (or SLOG if present)
#    This is a sequential append — very fast on any device
# 2. ZFS returns success to the application
#    The app can continue — the data is on stable storage (ZIL)
# 3. The transaction group commits (within about 5 seconds by default)
#    Data is written to its final location on the data pool
# 4. The ZIL entries are freed
#    The ZIL only holds data between fsync() and TXG commit

# The ZIL is ONLY read during crash recovery:
# After an unclean shutdown, ZFS replays the ZIL to recover
# any sync writes that hadn't been committed to a TXG yet.
# During normal operation, the ZIL is write-only.

ZIL sizing

The ZIL only holds data between fsync() and the next transaction group commit (default: 5 seconds). Even under heavy sync write load, the ZIL rarely exceeds a few hundred MB. A SLOG device of 16–32 GB is more than sufficient for almost any workload. Larger SLOG devices don't help — the data is freed every TXG commit regardless.
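A back-of-the-envelope version of that sizing argument: the most a SLOG ever needs to hold is the sync write rate multiplied by how long data waits for its transaction group. In the sketch below the name slog_estimate_mib and the default multiplier of 3 transaction groups are assumptions picked to be conservative, not values read from ZFS.

```shell
#!/bin/sh
# Rough upper bound on SLOG usage in MiB.
# Usage: slog_estimate_mib SYNC_WRITE_MIB_PER_SEC [TXG_SECONDS] [TXGS_HELD]
slog_estimate_mib() {
    echo $(( $1 * ${2:-5} * ${3:-3} ))
}
```

slog_estimate_mib 1200 prints 18000: even with sync writes arriving at roughly 10 GbE line rate, about 18 GB covers it, which is why 16–32 GB is plenty.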

When you need a SLOG

# Check if your workload generates sync writes
zpool iostat -q tank 5

# Look at the "syncq" columns:
#              syncq_read    syncq_write
# tank              0             47     <-- 47 sync writes queued: SLOG will help
# tank              0              0     <-- no sync writes: SLOG is useless

# Workloads that generate sync writes:
# - Databases (PostgreSQL, MySQL with innodb_flush_log_at_trx_commit=1)
# - NFS server exports (client COMMITs and stable writes become sync writes)
# - iSCSI targets
# - ESXi/Proxmox VM storage over NFS/iSCSI
# - Any application using O_SYNC or fsync()

# Workloads that DON'T generate sync writes:
# - Most local file operations (cp, mv, rsync without --fsync)
# - Web servers serving static files
# - Container registries
# - Media streaming

Do NOT disable the ZIL

zil_disable — don't

You'll find old forum posts recommending zfs set sync=disabled or the kernel parameter zil_disable=1 to improve write performance. This discards the write-ahead log entirely. If the system crashes, any sync writes since the last TXG commit are silently lost. Your database thinks the data was committed. It wasn't.

For databases, this means corrupted transactions. For NFS, this means clients believe writes are safe when they're not. The performance gain is real but the data loss risk is catastrophic.

Disabling the ZIL is like removing your car's seatbelts to save weight. It works fine until it doesn't.

Every few months someone posts "I set sync=disabled on my NFS dataset and writes are 10x faster!" Yes. Because you told ZFS to lie to NFS clients about data durability. If the power goes out, every NFS client that got a "write succeeded" response in the last 5 seconds may have lost data. If you need fast sync writes, buy a SLOG. Don't disable the safety net.

ABD (ARC Buffer Data)

ABD is the buffer abstraction underneath ARC. Before ABD arrived in ZFS on Linux 0.7, ARC allocated large contiguous memory buffers via the SLAB allocator. On long-running systems, memory fragmentation made it impossible to allocate large blocks even when total free memory was sufficient, leading to mysterious allocation failures and performance degradation.

ABD solves this by supporting scattered allocations: instead of requiring a contiguous 128K buffer, ABD can assemble a 128K block from multiple smaller, non-contiguous pages. This dramatically reduces fragmentation pressure and is one of the reasons modern OpenZFS on Linux is far more stable than early versions.

You generally don't need to tune ABD. It's a low-level allocator improvement that works transparently. The main practical impact: modern ZFS on Linux no longer suffers from the "kmem:0" allocation failures that plagued older versions under memory pressure.

Memory pressure and OOM behavior

The Linux OOM (Out of Memory) killer can and does kill ZFS-related processes. Understanding the interaction between ZFS memory usage and the OOM killer is critical for production systems.

ARC is reclaimable

ARC memory is reclaimable under pressure. The kernel can force ARC to shrink. But the reclaim path has latency — if an application allocates a sudden burst of memory, the kernel may trigger OOM before ARC has time to shrink.

ARC shrink latency

ARC cannot release memory instantly. It has to walk its internal data structures and free entries. Under extreme pressure, this takes milliseconds — during which the kernel may decide there's no memory and invoke OOM.

The fix

Always set zfs_arc_max to leave headroom. On a 32 GB system running VMs, set ARC max to 16–20 GB, leaving 12–16 GB for VMs and the OS. On a dedicated NAS with no other workloads, you can give ARC 80–90% of RAM.

Diagnosing "ZFS is eating all my RAM"

Step-by-step procedure when someone reports high memory usage on a ZFS system:

# 1. Check actual ARC size
arc_summary | head -20
# Or: cat /proc/spl/kstat/zfs/arcstats | grep "^size"

# 2. Check if ARC is within its configured limits
cat /sys/module/zfs/parameters/zfs_arc_max
cat /sys/module/zfs/parameters/zfs_arc_min

# 3. Check available memory (MemAvailable usually excludes ARC, so add
#    the reclaimable part of ARC from step 1 when judging headroom)
grep MemAvailable /proc/meminfo

# 4. Check what's actually consuming non-reclaimable memory
# If memory is still tight after accounting for ARC AND ARC is within
# limits, something else is eating memory (VMs, Java heap, leaked processes).
ps aux --sort=-%mem | head -20

# 5. Check ARC hit ratio — is ARC actually helping?
arc_summary | grep "hit ratio"
# If hit ratio < 80%, ARC is too small or the workload is uncacheable.
# If hit ratio > 95%, ARC is working perfectly. Leave it alone.

90% of "ZFS is eating my RAM" reports are misreading free(1) output. 9% are missing zfs_arc_max. 1% are actual problems.

Swap on ZFS — the recursive death spiral

Do not put swap on a ZFS filesystem. Or if you do, understand exactly what can go wrong.

The problem is a deadlock cycle: when the system is under memory pressure, it wants to swap pages to disk. If swap lives on ZFS, ZFS needs to allocate memory to perform the write. But the system is already out of memory — that's why it's swapping. ZFS asks the kernel for memory, the kernel says "I'm swapping to free memory," ZFS says "I need memory to write the swap," and the system deadlocks or the OOM killer starts shooting processes.

zvol swap

A ZFS zvol used as swap (mkswap /dev/zvol/tank/swap). This is the dangerous one. ZFS needs memory to service I/O to the zvol. Under memory pressure, this creates the deadlock cycle. OpenZFS has mitigations, but they're not bulletproof.

Swap file on ZFS

A swap file on a ZFS dataset. Same deadlock risk as zvol swap, plus additional complexity from the POSIX file layer.

Swap on a separate partition

The safe option. Put swap on a dedicated raw swap partition (or, failing that, a swap file on a non-ZFS filesystem such as ext4). ZFS is not involved in swap I/O. No deadlock possible. This is what kldload does by default.

# Safe: swap on a dedicated partition (kldload default)
mkswap /dev/sda2
swapon /dev/sda2

# Risky: swap on a zvol (mitigations exist but not guaranteed)
zfs create -V 8G -b 4096 -o compression=zle \
  -o sync=always -o primarycache=metadata \
  -o secondarycache=none tank/swap
mkswap /dev/zvol/tank/swap
swapon /dev/zvol/tank/swap

# If you MUST use zvol swap, these properties reduce deadlock risk:
# compression=zle        minimal CPU during memory pressure
# sync=always            flush swap writes immediately instead of buffering them in RAM
# primarycache=metadata  don't cache swap data in ARC
# secondarycache=none    don't cache swap in L2ARC
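A quick audit for zvol-backed swap on an existing machine can be done by scanning /proc/swaps. This is a sketch with assumptions: zvols are expected to show up as /dev/zdN or under /dev/zvol/, and the function name zvol_swap_devices is hypothetical. It will not catch a swap file that lives on a ZFS dataset.

```shell
#!/bin/sh
# List active swap devices that look like ZFS zvols.
# Usage: zvol_swap_devices [SWAPS_FILE]
zvol_swap_devices() {
    # NR > 1 skips the header line of /proc/swaps.
    awk 'NR > 1 && ($1 ~ /^\/dev\/zd[0-9]/ || $1 ~ /^\/dev\/zvol\//) { print $1 }' \
        "${1:-/proc/swaps}"
}
```

Empty output means none of the active swap devices match those patterns.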

The ZFS-on-Linux team has put significant work into making zvol swap safe (reserved memory pools, I/O throttling during reclaim). On modern OpenZFS (2.1+) with proper tuning, it usually works. But "usually works" isn't "always works," and the failure mode is a hard deadlock requiring a power cycle. Use a separate partition for swap. It's not worth the risk.

Deduplication memory requirements

ZFS deduplication maintains a Dedup Table (DDT) that maps every block's checksum to its physical location. The DDT must be accessible for every write (to check if the block already exists) and every free (to track reference counts). If the DDT doesn't fit in memory, every write requires a random read from disk to check the DDT — and performance collapses.

DDT memory math

# Each DDT entry: ~320 bytes (checksum + physical address + refcount + padding)

# With 128K recordsize (default):
# 1 TiB of data = ~8.4 million blocks
# 8.4 million * 320 bytes = ~2.5 GiB of DDT
# 10 TiB = ~25 GiB of DDT

# With 4K recordsize (databases, zvols):
# 1 TiB of data = ~268 million blocks
# 268 million * 320 bytes = ~80 GiB (~86 GB) of DDT
# THIS IS WHY DEDUP ON DATABASE ZVOLS IS INSANE

# Check your DDT size:
zpool status -D tank
# Shows DDT entries, size, and on-disk vs in-core statistics

zdb -DDD tank
# Detailed DDT analysis
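The same math as a reusable helper, for sizing before you turn dedup on. The name ddt_bytes is made up, and the 320-byte default is the approximate per-entry figure used above; the result assumes every block is unique, which is the worst case.

```shell
#!/bin/sh
# Estimate in-memory DDT size in bytes (worst case: no duplicate blocks).
# Usage: ddt_bytes DATA_BYTES RECORD_BYTES [ENTRY_BYTES]
ddt_bytes() {
    echo $(( $1 / $2 * ${3:-320} ))
}
```

ddt_bytes 1099511627776 131072 prints 2684354560 (2.5 GiB per TiB at 128K), and the same call with 4096 prints 85899345920 (80 GiB per TiB).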

Special vdev for DDT (OpenZFS 0.8+)

OpenZFS 0.8 introduced allocation classes, which let you store the DDT on a special vdev (or a dedicated dedup vdev). This moves the DDT from main pool spindles to fast SSDs, and, critically, means a DDT lookup that misses ARC costs a microsecond-class SSD read instead of a millisecond-class seek on spinning disks.

This makes dedup far more practical on large pools. It's still expensive (the special vdev must be sized for the DDT), but it softens the "all DDT in RAM" requirement that made dedup impractical for most deployments.

# Add mirrored special vdev for DDT + metadata
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
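# Optional: also place data blocks of 64K or smaller on the special vdev.
# Not required for DDT placement, and it consumes special vdev capacity.
zfs set special_small_blocks=65536 tank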

# Enable dedup (now DDT lives on the special vdev)
zfs set dedup=on tank/datasets-with-duplicates

# Monitor DDT on special vdev
zpool iostat -v tank

Even with a special vdev, dedup is rarely worth it. LZ4 compression typically gives 1.5–2.5x space savings on compressible data, with no DDT-style memory overhead and negligible CPU cost. Dedup only wins when you have many truly identical blocks (VM templates, backup repositories with many identical files). For almost everything else, compression is the better choice.

Memory recommendations by use case

Use Case	Minimum RAM	Recommended RAM	ARC Max Setting	Notes
Desktop / workstation	8 GB	16 GB	50% of RAM	Leave room for applications, browsers, IDEs
NAS / file server	8 GB	32 GB	80–90% of RAM	Dedicated storage appliance; give ARC almost everything
Database server	16 GB	64 GB	25–40% of RAM	Database has its own buffer pool; don't double-cache. `primarycache=metadata` on DB datasets
VM host (KVM/Proxmox)	32 GB	128 GB	15–25% of RAM	VMs need direct RAM allocation. ARC helps VM image reads but can't compete with VM memory
Container host	16 GB	32 GB	40–60% of RAM	Containers share the host kernel; ARC caches shared layers efficiently
Backup server	8 GB	16 GB	4–8 GB fixed	Sequential writes; ARC barely helps. Spend money on disk throughput, not RAM
Build server / CI	16 GB	32 GB	50% of RAM	Build tools need RAM; ARC helps with repeated source file reads
Dedup enabled (128K)	16 GB + 2.5 GB/TB	64 GB + 2.5 GB/TB	As much as possible	DDT must fit in ARC. 10 TB = 25 GB DDT alone. Consider a special vdev
Dedup enabled (4K)	Don't	Don't	Don't	~80 GB DDT per TB. Unless you have a special vdev and accept the cost, just use compression.

The database row deserves emphasis. If you're running PostgreSQL on ZFS, set primarycache=metadata on the database dataset. PostgreSQL has its own shared_buffers cache. If ARC also caches the data blocks, you're double-caching: wasting RAM on two copies of the same data. Let PostgreSQL manage its own cache and let ARC handle metadata (block pointers, directory entries). Same applies to MySQL's InnoDB buffer pool.
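
The "ARC Max Setting" percentages in the table have to become a byte count before zfs_arc_max will accept them. A minimal sketch of that conversion follows. The function name arc_max_bytes is made up here, and it assumes MemTotal in /proc/meminfo is reported in kB, as it is on Linux.

```shell
# Convert "percent of RAM" into a byte value suitable for zfs_arc_max.
# Arguments: percent, and optionally a file in /proc/meminfo format.
arc_max_bytes() {
    percent=$1
    meminfo=${2:-/proc/meminfo}
    # MemTotal is reported in kB; convert to bytes, then take the percentage.
    total_kb=$(awk '/^MemTotal:/ { print $2 }' "$meminfo")
    echo $(( total_kb * 1024 * percent / 100 ))
}

# Example (run as root) for a database server at 25% of RAM:
#   echo "$(arc_max_bytes 25)" > /sys/module/zfs/parameters/zfs_arc_max
```

MemTotal is slightly lower than installed RAM because the kernel reserves some memory at boot, so the result will come out a little under the round number you might expect. That errs on the safe side.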

Real-world: tuning ARC for database servers

Databases are the hardest ZFS memory tuning case because the database and ZFS are both trying to cache the same data. The solution is to divide responsibility:

# PostgreSQL on ZFS — 64 GB RAM system

# 1. ZFS: cache metadata only for the database dataset
zfs set primarycache=metadata tank/pgdata
zfs set recordsize=16K tank/pgdata      # 2x PostgreSQL's default 8K page; use 8K for an exact match
zfs set logbias=throughput tank/pgdata   # bypass the SLOG; sync writes go straight to the main pool

# 2. ZFS: set ARC to 25% of RAM (16 GB)
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max

# 3. PostgreSQL: give it the majority of remaining RAM
# postgresql.conf:
# shared_buffers = 32GB    # 50% of RAM
# effective_cache_size = 48GB  # tell planner about total cache
# work_mem = 256MB
# maintenance_work_mem = 2GB

# 4. Verify no double-caching
# ARC should show mostly metadata, very little data
# (data cached for other datasets still counts toward data_size):
grep -E "^(data_size|metadata_size) " /proc/spl/kstat/zfs/arcstats

# MySQL/InnoDB equivalent:
zfs set primarycache=metadata tank/mysql
zfs set recordsize=16K tank/mysql
# my.cnf:
# innodb_buffer_pool_size = 32G

Monitoring cache hit rates in production

#!/bin/bash
# Quick health check script for ZFS memory
echo "=== ARC Status ==="
arc_summary 2>/dev/null | grep -E "ARC size|Target size|hit ratio|Meta"

echo ""
echo "=== ARC vs System Memory ==="
ARC_SIZE=$(awk '/^size/ {print $3}' /proc/spl/kstat/zfs/arcstats)
TOTAL_MEM=$(awk '/MemTotal/ {print $2 * 1024}' /proc/meminfo)
AVAIL_MEM=$(awk '/MemAvailable/ {print $2 * 1024}' /proc/meminfo)
echo "ARC:       $(( ARC_SIZE / 1073741824 )) GB"
echo "Total RAM: $(( TOTAL_MEM / 1073741824 )) GB"
echo "Available: $(( AVAIL_MEM / 1073741824 )) GB"
echo "ARC %:     $(( ARC_SIZE * 100 / TOTAL_MEM ))%"

echo ""
echo "=== L2ARC Status ==="
arc_summary 2>/dev/null | grep -A 5 "L2ARC" || echo "No L2ARC configured"

echo ""
echo "=== Pool I/O ==="
zpool iostat -v 1 1
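
If you want a single number to feed into a monitoring system, the hit ratio can be computed straight from the kernel counters. This is a minimal sketch that assumes the hits and misses counters in /proc/spl/kstat/zfs/arcstats; the function name arc_hit_ratio is hypothetical.

```shell
# Overall ARC hit ratio (percent) from the cumulative hits/misses counters.
# Optional argument: a file in arcstats format
# (default: /proc/spl/kstat/zfs/arcstats).
arc_hit_ratio() {
    arcstats=${1:-/proc/spl/kstat/zfs/arcstats}
    awk '$1 == "hits"   { hits = $3 }
         $1 == "misses" { misses = $3 }
         END {
             total = hits + misses
             if (total > 0) printf "%.2f\n", hits * 100 / total
             else print "n/a"
         }' "$arcstats"
}
```

The counters accumulate from the moment the ZFS module was loaded, so this is a lifetime average. To see how ARC is doing right now, sample the counters twice and compute the ratio from the differences.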

Summary — the rules

Rule 1

Always set zfs_arc_max on Linux. The kernel will fight ARC for memory. Tell ARC exactly how much it can use.

Rule 2

ZFS using all your RAM is normal. ARC is reclaimable cache. Don't panic when free shows high usage. Check MemAvailable instead.

Rule 3

Monitor hit ratio, not ARC size. A 4 GB ARC with 98% hit ratio is better than a 32 GB ARC with 60% hit ratio. The hit ratio tells you if ARC is actually helping.

Rule 4

Don't double-cache. If the application has its own cache (PostgreSQL, MySQL, MongoDB), set primarycache=metadata on its datasets.

Rule 5

L2ARC needs RAM too. Check l2_hdr_size before adding large L2ARC devices. On low-RAM systems, L2ARC can make things worse.

Rule 6

SLOG is not a write cache. It only helps synchronous writes. Check zpool iostat -q for sync write pressure before buying hardware.

Rule 7

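Never disable the ZIL. sync=disabled acknowledges synchronous writes before they reach stable storage, so a crash or power loss can silently lose the last few seconds of data an application believed was committed. Buy a SLOG instead.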

Rule 8

Swap on a separate partition. Swap on ZFS can deadlock under memory pressure. Use a raw swap partition, or a swap file on ext4.

Rule 9

Dedup requires special attention. The DDT must fit in ARC or on a special vdev. Without either, dedup destroys performance. Use compression instead.

If you read this entire page and came away with one thing, make it this: ARC is ZFS's greatest feature and greatest source of confusion. It makes ZFS faster than any other filesystem for read-heavy workloads. But on Linux, you have to tell it how much RAM to use, or the kernel will make that decision for you — badly. Set zfs_arc_max, monitor your hit ratio, and stop worrying about free -h.
