Memory & ARC — the engine that makes ZFS fast.
ARC (Adaptive Replacement Cache) is ZFS's in-memory read cache. It's one of ZFS's biggest strengths — and one of the most common sources of confusion when people see ZFS "eating all their RAM." This page explains how ZFS memory actually works, how to monitor it, how to tune it, and when to panic (almost never).
ARC memory is reclaimable, but free -h and htop show it as "used," which sends people into a panic.
Every "ZFS is eating my RAM" forum post is someone misreading free output.
Read this page before you start tweaking.
ARC — Adaptive Replacement Cache
ZFS's ARC is a modified implementation of the Adaptive Replacement Cache algorithm developed at IBM. It maintains four lists that self-tune based on your actual access patterns. This is what makes it dramatically smarter than a simple LRU cache: it learns whether your workload favors frequency or recency and adapts automatically.
MRU — Most Recently Used
Data that was accessed once recently. Sequential scans, newly opened files, first-time reads. This is ARC's "short-term memory." If the data is accessed again, it graduates to MFU.
MFU — Most Frequently Used
Data accessed at least twice. Database indexes, config files, hot application data. This is ARC's "long-term memory." MFU entries survive longer under eviction pressure because ZFS knows you'll probably need them again.
MRU Ghost List
Metadata-only entries for data recently evicted from MRU. ZFS doesn't store the data, just the block pointer. If a ghost entry is accessed, ZFS knows it should have kept that data — and grows the MRU target size to cache more recent data in the future.
MFU Ghost List
Metadata-only entries for data recently evicted from MFU. Same principle: ghost hits tell ARC to grow the MFU target size. The two ghost lists create a feedback loop that continuously optimizes the MRU/MFU split ratio for your actual workload.
The genius of ARC is the ghost lists. A pure LRU cache has no memory of what it evicted. ARC's ghosts let it learn from mistakes: "I evicted block X from MFU and something asked for it 1 second later — I need a bigger MFU." This self-tuning happens continuously, thousands of times per second, with no operator intervention. The MRU/MFU balance constantly shifts to match your workload's access pattern.
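The ghost-list feedback can be observed from the mru_ghost_hits and mfu_ghost_hits counters. The following is a minimal sketch, not part of any ZFS tool: kstat_value and ghost_balance are hypothetical helper names, the "recency-leaning" and "frequency-leaning" labels are only an illustrative reading of the counters, and the counters are cumulative since the module was loaded.

```shell
# Read one counter from an arcstats-format file (columns: name, type, value).
# $1 = field name, $2 = path (defaults to the live kstat file)
kstat_value() {
    awk -v name="$1" '$1 == name { print $3 }' "${2:-/proc/spl/kstat/zfs/arcstats}"
}

# Report which ghost list has collected more hits so far.
# More MRU ghost hits suggests ARC has been evicting recent data too early;
# more MFU ghost hits suggests it has been evicting frequent data too early.
ghost_balance() {
    gb_file="${1:-/proc/spl/kstat/zfs/arcstats}"
    gb_mru=$(kstat_value mru_ghost_hits "$gb_file")
    gb_mfu=$(kstat_value mfu_ghost_hits "$gb_file")
    if [ "$gb_mru" -gt "$gb_mfu" ]; then
        echo "recency-leaning: mru_ghost_hits=$gb_mru mfu_ghost_hits=$gb_mfu"
    elif [ "$gb_mfu" -gt "$gb_mru" ]; then
        echo "frequency-leaning: mru_ghost_hits=$gb_mru mfu_ghost_hits=$gb_mfu"
    else
        echo "balanced: mru_ghost_hits=$gb_mru mfu_ghost_hits=$gb_mfu"
    fi
}
```

Calling ghost_balance with no argument reads the live kstat file; passing a saved copy of arcstats lets you compare two points in time.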
Data vs metadata in ARC
ARC caches two fundamentally different things, and understanding the distinction is critical for tuning:
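Data is the contents of your files and volumes. Metadata is what ZFS needs to find that data: block pointers, directory entries, dnodes. When metadata is evicted, even a simple ls has to read from disk. On a spinning-rust pool with millions of files, metadata eviction can make the system feel completely unresponsive.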
The zfs_arc_meta_limit parameter (default: 75% of ARC max) controls how much of ARC can be
consumed by metadata. If metadata exceeds this limit, ZFS begins evicting metadata entries. This matters
because some workloads (container hosts, build servers, source trees with millions of small files) are
metadata-heavy: they can fill ARC entirely with metadata, leaving no room for actual data caching.
This tunable exists in releases before OpenZFS 2.2; from 2.2 on it was replaced by zfs_arc_meta_balance, so check /sys/module/zfs/parameters/ to see what your version exposes.
# Check metadata vs data usage in ARC
arc_summary | grep -A 5 "ARC size"
# From /proc/spl/kstat/zfs/arcstats:
# data_size = bytes used for data
# metadata_size = bytes used for metadata
# arc_meta_limit = maximum metadata allowed in ARC
cat /proc/spl/kstat/zfs/arcstats | grep -E "data_size|metadata_size|arc_meta_limit"
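To turn the two raw byte counts into a single number, the metadata share of ARC can be computed directly. This is a small sketch under the assumption that data_size and metadata_size appear in arcstats as shown above; arc_metadata_percent is a hypothetical helper name.

```shell
# Print metadata as a percentage of (data + metadata) held in ARC.
# $1 = arcstats-format file (defaults to the live kstat file)
arc_metadata_percent() {
    awk '$1 == "data_size"     { d = $3 }
         $1 == "metadata_size" { m = $3 }
         END {
             if (d + m > 0) printf "%.1f\n", 100 * m / (d + m)
             else print "n/a"
         }' "${1:-/proc/spl/kstat/zfs/arcstats}"
}
```

A value that stays close to the metadata limit on a metadata-heavy host is a hint that data caching is being squeezed.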
Compressed ARC (ZFS on Linux 0.7+)
Since ZFS on Linux 0.7, and in every OpenZFS 2.x release, ARC can store data in compressed form. Instead of decompressing blocks before caching them, ZFS keeps the compressed version in ARC and decompresses on read. This effectively multiplies your ARC capacity by your compression ratio.
With LZ4 compression achieving typical ratios of 1.5–2.5x on general data (and 3–10x on text-heavy data like logs and source code), compressed ARC means 32 GB of RAM can cache 48–80 GB of logical data. This is close to free performance: LZ4 decompression overhead is small on modern CPUs.
# Verify compressed ARC is active (enabled by default)
cat /sys/module/zfs/parameters/zfs_compressed_arc_enabled
# 1 = enabled (default)
# Check effective compression in ARC
arc_summary | grep -i compress
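The capacity multiplication is simple arithmetic, sketched here alongside a way to measure the ratio ARC is actually achieving. Both function names are hypothetical, and the compressed_size and uncompressed_size field names are an assumption about what your arcstats exposes; confirm they exist on your system before relying on the second function.

```shell
# Project logical cache capacity from ARC size and a compression ratio.
# $1 = ARC size in GiB, $2 = compression ratio (for example 1.5)
arc_logical_gib() {
    awk -v gib="$1" -v ratio="$2" 'BEGIN { printf "%.0f\n", gib * ratio }'
}

# Measure the ratio achieved inside ARC from an arcstats-format file.
# Assumes the fields compressed_size and uncompressed_size are present.
arc_compress_ratio() {
    awk '$1 == "compressed_size"   { c = $3 }
         $1 == "uncompressed_size" { u = $3 }
         END {
             if (c > 0) printf "%.2f\n", u / c
             else print "n/a"
         }' "${1:-/proc/spl/kstat/zfs/arcstats}"
}
```

For the 32 GB example above, arc_logical_gib 32 1.5 and arc_logical_gib 32 2.5 give the two ends of the 48–80 GB range.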
ARC size tuning
ARC has three key size parameters. On Linux, you almost always need to set at least zfs_arc_max
explicitly, because the kernel's memory pressure heuristics fight with ARC's desire to cache everything.
Setting ARC limits — runtime and persistent
# Runtime — takes effect immediately, lost on reboot
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max # 8 GB
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_min # 2 GB
echo 6442450944 > /sys/module/zfs/parameters/zfs_arc_meta_limit # 6 GB
# Persistent — survives reboot
cat > /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_arc_max=8589934592
options zfs zfs_arc_min=2147483648
options zfs zfs_arc_meta_limit=6442450944
EOF
# Rebuild initramfs so early-boot uses these values
# CentOS/RHEL/Rocky/Fedora:
dracut --force
# Debian/Ubuntu:
update-initramfs -u
All values are in bytes. To compute one, use python3 -c "print(8 * 1024**3)" for 8 GB.
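If Python is not at hand, plain shell arithmetic does the same conversion. A minimal sketch; gib_to_bytes is a hypothetical helper name and it assumes a shell with 64-bit integer arithmetic (bash and dash on 64-bit Linux qualify).

```shell
# Convert a whole number of GiB to bytes for use in zfs_arc_* parameters.
# $1 = size in GiB
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}
```

For example, gib_to_bytes 8 prints the 8589934592 used for zfs_arc_max above.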
The "1 GB per TB" myth — debunked
You'll see this "rule" repeated everywhere: "ZFS needs 1 GB of RAM per TB of storage." This is wrong. Or rather, it's so oversimplified that following it will lead you to both over-provisioning and under-provisioning depending on the workload.
Where the myth came from: on early ZFS on Solaris with deduplication enabled, the dedup table (DDT) consumed roughly 1 GB of RAM per TB of deduplicated data. Someone generalized this to all ZFS deployments, and it stuck. Without dedup, the storage capacity has almost nothing to do with RAM requirements.
What actually determines ZFS RAM needs:
ARC on Linux — the memory pressure problem
On FreeBSD and Solaris/illumos, ARC and the kernel memory allocator cooperate natively. ARC registers itself as reclaimable memory, and the kernel politely asks ARC to shrink when applications need RAM. ARC complies gracefully, evicting cold entries while keeping hot data warm.
On Linux, it's a fight. The kernel's memory reclaim subsystem treats ARC as a second-class citizen. Under memory pressure, Linux can aggressively evict ARC entries, including hot MFU data that took hours to warm up. The result: unpredictable performance cliffs. ZFS has been caching your database's working set for 6 hours, the kernel comes under memory pressure, throws it all away, and your next query takes 10x longer because it's reading from disk.
Why ZFS shows high memory usage (it's supposed to)
When you run free -h and see 28 GB "used" on a 32 GB system, that's ARC doing its job.
ARC is reclaimable — it gives memory back instantly when applications request it. But
free reports it as "used," not "buff/cache," because ARC allocates via the kernel's SLAB
allocator rather than the page cache. This confuses monitoring tools and humans alike.
The correct way to check actual memory availability:
# This includes ARC as available (correct)
grep -E "MemAvailable|AnonPages|Slab" /proc/meminfo
# This shows ARC size directly
cat /proc/spl/kstat/zfs/arcstats | grep "^size"
# size 4 17179869184
# (that's 16 GB of ARC)
Fixing ARC eviction pressure on Linux
# 1. Set explicit ARC max — don't let Linux decide
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max # 16 GB
# 2. Set ARC min to prevent complete eviction during memory spikes
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_min # 4 GB
# 3. Make persistent
cat > /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_arc_max=17179869184
options zfs zfs_arc_min=4294967296
EOF
# 4. Rebuild initramfs
dracut --force # RHEL/CentOS/Rocky/Fedora
# update-initramfs -u # Debian/Ubuntu
Rule of thumb for Linux: set zfs_arc_max to leave at least 4 GB for the OS
and applications (more if running VMs, databases, or Java). Set zfs_arc_min to 25–50%
of zfs_arc_max to prevent complete ARC eviction under load.
On Linux, always set zfs_arc_max in
/etc/modprobe.d/zfs.conf. No exceptions.
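The rule of thumb can be turned into modprobe options mechanically. This is a sketch under stated assumptions: suggest_arc_limits is a hypothetical helper, the 4 GiB default reserve and the 25% default for zfs_arc_min are just the low ends of the ranges given above, and the output is meant to be reviewed before it goes anywhere near /etc/modprobe.d/zfs.conf.

```shell
# Print modprobe options for a given amount of RAM.
# $1 = total RAM in GiB
# $2 = GiB to reserve for the OS and applications (default 4)
# $3 = zfs_arc_min as a percentage of zfs_arc_max (default 25)
suggest_arc_limits() {
    sal_total=$1
    sal_reserve=${2:-4}
    sal_min_pct=${3:-25}
    sal_max_gib=$(( sal_total - sal_reserve ))
    if [ "$sal_max_gib" -lt 1 ]; then
        echo "reserve leaves no room for ARC" >&2
        return 1
    fi
    sal_max_bytes=$(( sal_max_gib * 1073741824 ))
    sal_min_bytes=$(( sal_max_bytes * sal_min_pct / 100 ))
    echo "options zfs zfs_arc_max=$sal_max_bytes"
    echo "options zfs zfs_arc_min=$sal_min_bytes"
}
```

Running suggest_arc_limits 32 16 reproduces the 16 GB max and 4 GB min used in the example above.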
Monitoring ARC — arcstat and arc_summary
ZFS exposes detailed ARC statistics through /proc/spl/kstat/zfs/arcstats. Two tools
make this data human-readable: arc_summary (point-in-time snapshot) and
arcstat (live streaming).
arc_summary — the full picture
# Full ARC summary (run as root)
arc_summary
# Key sections in the output:
# ARC Summary
# ARC size (current): 16.0 GiB
# Target size (adaptive): 16.0 GiB
# Min size (hard limit): 4.0 GiB
# Max size (high water): 16.0 GiB
#
# ARC Efficiency
# Cache hit ratio: 94.32% <-- this is your primary metric
# Cache miss ratio: 5.68%
# Actual hit ratio: 96.71% <-- includes prefetch hits
#
# Cache Hits by Type
# Demand data hits: 1,247,831 <-- real application reads from cache
# Demand metadata hits: 892,441 <-- metadata lookups from cache
# Prefetch data hits: 124,982 <-- speculative prefetch hits
# Prefetch metadata hits: 41,220 <-- speculative metadata prefetch hits
#
# Cache Misses by Type
# Demand data misses: 62,391 <-- had to go to disk
# Demand metadata misses: 18,442 <-- metadata not in cache
The cache hit ratio is the single most important ARC metric. Above 90% is good. Above 95% is excellent. Below 80% means your working set is larger than ARC — either add RAM, add L2ARC, or reduce the working set. Below 50% means ARC is essentially useless and something is wrong (ARC too small, sequential scan thrashing, or the workload simply doesn't benefit from caching).
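One caveat when applying these thresholds: the hits and misses counters are cumulative since the ZFS module was loaded, so a ratio computed from them averages over the whole uptime. A sketch of computing the ratio over an interval instead, from two samples of the counters; interval_hit_ratio is a hypothetical helper name.

```shell
# Hit ratio (percent) between two samples of the cumulative counters.
# $1, $2 = hits and misses at the first sample
# $3, $4 = hits and misses at the second sample
interval_hit_ratio() {
    awk -v h1="$1" -v m1="$2" -v h2="$3" -v m2="$4" 'BEGIN {
        dh = h2 - h1
        dm = m2 - m1
        if (dh + dm <= 0) { print "n/a"; exit }
        printf "%.2f\n", 100 * dh / (dh + dm)
    }'
}
```

Sample the hits and misses fields from arcstats, wait for a representative interval, sample again, and pass the four numbers in.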
arcstat — live monitoring
# Stream ARC stats every 5 seconds
arcstat 5
# Output columns explained:
# time read miss miss% dmis dm% pmis pm% mmis mm% size c
#
# read = total ARC reads/sec
# miss = total ARC misses/sec
# miss% = miss rate (lower is better)
# dmis = demand misses (real application reads that missed)
# dm% = demand miss rate (the metric you care about most)
# pmis = prefetch misses (less critical — prefetch is speculative)
# pm% = prefetch miss rate
# mmis = metadata misses
# mm% = metadata miss rate
# size = current ARC size
# c = ARC target size (what ARC wants to be)
# Watch for:
# - dm% consistently above 10%: ARC is too small for your workload
# - mm% above 5%: metadata not fitting in ARC, consider special vdev
# - size << c: Linux is evicting ARC below its target (memory pressure)
/proc/spl/kstat/zfs/arcstats — raw data
# Key fields in arcstats (values in bytes unless noted)
cat /proc/spl/kstat/zfs/arcstats
# size — current ARC size
# c — target ARC size (what ARC wants)
# c_min — minimum ARC size (zfs_arc_min)
# c_max — maximum ARC size (zfs_arc_max)
# hits — total cache hits (counter)
# misses — total cache misses (counter)
# demand_data_hits — application data reads served from cache
# demand_data_misses — application data reads that went to disk
# demand_metadata_hits — metadata lookups from cache
# demand_metadata_misses — metadata lookups from disk
# prefetch_data_hits — speculative prefetch served from cache
# prefetch_data_misses — speculative prefetch from disk
# mru_hits — hits from MRU list
# mfu_hits — hits from MFU list
# mru_ghost_hits — hits on MRU ghost list (ARC auto-tuning signal)
# mfu_ghost_hits — hits on MFU ghost list (ARC auto-tuning signal)
# evict_l2_cached — bytes evicted from ARC that were also in L2ARC
# data_size — bytes of data in ARC
# metadata_size — bytes of metadata in ARC
# arc_meta_limit — max metadata allowed in ARC
The number that matters most is the demand data hit ratio: demand_data_hits / (demand_data_hits + demand_data_misses). This tells you
how often real application reads are served from memory vs disk. Everything else is details.
Prefetch misses are expected and fine — prefetch is speculative by nature.
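The demand data hit ratio can be computed straight from the two arcstats fields listed above. A minimal sketch; demand_data_hit_ratio is a hypothetical helper name, and the result covers the whole time since the module was loaded.

```shell
# Print demand_data_hits / (demand_data_hits + demand_data_misses) as a percentage.
# $1 = arcstats-format file (defaults to the live kstat file)
demand_data_hit_ratio() {
    awk '$1 == "demand_data_hits"   { h = $3 }
         $1 == "demand_data_misses" { m = $3 }
         END {
             if (h + m > 0) printf "%.2f\n", 100 * h / (h + m)
             else print "n/a"
         }' "${1:-/proc/spl/kstat/zfs/arcstats}"
}
```

With the sample counts from the arc_summary output earlier (1,247,831 hits and 62,391 misses), this works out to roughly 95%.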
ARC eviction — how data leaves the cache
ARC doesn't evict randomly. The eviction algorithm considers the ghost list feedback:
When ARC reaches c_max (or the kernel demands memory on Linux), ARC evicts from the tail of MRU or MFU. Which list loses entries depends on the current ghost list balance: if MRU ghost hits are high, ARC protects MRU and evicts from MFU, and vice versa.
Blocks read once by a scan enter MRU only and never displace MFU, so a find / or tar won't flush your database cache.
When metadata exceeds arc_meta_limit, ZFS evicts metadata entries regardless of MRU/MFU status. This prevents metadata from starving data caching.
The critical takeaway: ARC is scan-resistant. Unlike a simple LRU cache where a single
find / -type f would flush every warm entry, ARC puts scan data into MRU (one-time access).
Your database's hot rows stay in MFU. This is why ARC outperforms the Linux page cache for workloads
that mix random and sequential access.
L2ARC — the victim cache on SSD
L2ARC is a victim cache: it captures data that ARC is about to evict from memory and writes it to an SSD. When a future read misses ARC but hits L2ARC, ZFS reads from the SSD instead of the (much slower) data disks. L2ARC is read-only from the perspective of applications; it does nothing for write performance.
How L2ARC populates
L2ARC is not filled at the instant a block is evicted. A background feed thread periodically scans the tails of the ARC lists, where the blocks closest to eviction sit, and writes eligible blocks to the L2ARC device before they are freed from memory. The write rate is throttled (l2arc_write_max). This means L2ARC is always "behind" ARC: it contains data ARC was about to give up on. L2ARC fills gradually as ARC churns; a newly added cache device doesn't warm up instantly.
Persistent L2ARC (OpenZFS 2.0+)
Before OpenZFS 2.0, L2ARC was volatile. Every reboot started with a cold, empty L2ARC that had to be repopulated from ARC evictions. This could take hours or days for large L2ARC devices. Persistent L2ARC writes an index to the L2ARC device itself, so the cache survives reboots. On a cold start, ZFS reads the L2ARC index and the cache is immediately warm.
# Check if persistent L2ARC is enabled (default: on in 2.0+)
cat /sys/module/zfs/parameters/l2arc_rebuild_enabled
# 1 = persistent L2ARC active
# Add L2ARC device (no mirror needed — it's a cache)
zpool add tank cache /dev/nvme4n1
# Check L2ARC status
zpool iostat -v tank
arc_summary | grep -A 10 "L2ARC"
The hidden cost: L2ARC headers consume ARC
L2ARC's RAM tax
Every block cached in L2ARC needs a header in ARC (RAM) to index it. Each header is approximately 70–180 bytes depending on the block's metadata. This seems small until you do the math:
# L2ARC with 4K blocks (worst case — small recordsize)
# 1 TB L2ARC / 4096 bytes per block = ~268 million blocks
# 268 million * ~100 bytes per header = ~25 GB of ARC consumed by L2ARC headers
#
# L2ARC with 128K blocks (default recordsize)
# 1 TB L2ARC / 131072 bytes per block = 8 million blocks
# 8 million * ~100 bytes per header = ~800 MB of ARC consumed
# Check L2ARC header overhead:
arc_summary | grep "Header size"
cat /proc/spl/kstat/zfs/arcstats | grep "l2_hdr_size"
On systems with less than 64 GB RAM, a large L2ARC with small blocks can consume more ARC for its headers than the L2ARC is worth. You're stealing from ARC (nanosecond latency) to index L2ARC (microsecond latency). The net effect is negative.
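The header arithmetic generalizes to any device size and block size. A sketch, assuming a flat per-header cost (100 bytes by default, the same round figure used above; real headers vary); l2arc_header_bytes is a hypothetical helper name.

```shell
# Estimate ARC bytes consumed by headers for a full L2ARC device.
# $1 = L2ARC size in bytes
# $2 = average block size in bytes
# $3 = bytes per header (default 100, an approximation)
l2arc_header_bytes() {
    lhb_size=$1
    lhb_block=$2
    lhb_hdr=${3:-100}
    echo $(( lhb_size / lhb_block * lhb_hdr ))
}
```

For a 1 TiB device, l2arc_header_bytes 1099511627776 4096 prints 26843545600 (25 GiB), while the same device with 128K blocks needs 838860800 bytes (800 MiB).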
When L2ARC helps vs when it's wasted
L2ARC helps
Read-heavy workloads where the working set exceeds RAM but fits on SSD. File servers, media libraries, build caches, read-replica databases, large source trees. The data pool is on spinning disks. ARC can't hold the full working set. L2ARC on NVMe gives you SSD-speed reads for the overflow.
L2ARC is wasted
Write-heavy workloads (L2ARC does nothing for writes). All-SSD pools (L2ARC on SSD is the same speed as the data disks — no benefit). Small working sets that fit in ARC. Low-RAM systems where L2ARC header overhead exceeds the benefit. Sequential workloads (backups, media streaming) where prefetch is more effective than caching.
L2ARC sizing guidelines
Check l2_hdr_size in arcstats.
If it's more than 10% of your ARC, your L2ARC is hurting you.
ZIL — the ZFS Intent Log (not a write cache)
The ZIL is the most misunderstood component of ZFS. It is not a write cache. It is a write-ahead log that records synchronous write transactions so they survive a crash. Understanding the ZIL requires understanding the difference between sync and async writes.
Sync vs async writes
Async writes: the application calls write(), the data goes into a transaction group in memory, and the application continues immediately. The data is flushed to disk when the transaction group commits (every 5 seconds by default, or sooner if enough dirty data accumulates). The ZIL is not involved at all. A SLOG does nothing for async writes.
Sync writes: the application calls write() + fsync() (or opens the file with O_SYNC). The application blocks until ZFS guarantees the data is on stable storage. Without the ZIL, this would mean waiting for the full transaction group commit. With the ZIL, ZFS writes a log record and returns as soon as that record is durable; the data is safe in the ZIL, and the transaction group can commit at its leisure.
The ZIL exists on the data pool by default. Every pool has a ZIL. A SLOG (Separate LOG device) moves the ZIL to a dedicated fast device, typically NVMe with power loss protection. The SLOG doesn't change what the ZIL does; it changes where the ZIL lives, making sync writes faster by keeping log writes off the data disks.
How the ZIL actually works
# Application calls: write() + fsync()
# 1. ZFS writes the data to the ZIL (or SLOG if present)
# This is a sequential append — very fast on any device
# 2. ZFS returns success to the application
# The app can continue — the data is on stable storage (ZIL)
# 3. The transaction group commits (within about 5 seconds by default)
# Data is written to its final location on the data pool
# 4. The ZIL entries are freed
# The ZIL only holds data between fsync() and TXG commit
# The ZIL is ONLY read during crash recovery:
# After an unclean shutdown, ZFS replays the ZIL to recover
# any sync writes that hadn't been committed to a TXG yet.
# During normal operation, the ZIL is write-only.
ZIL sizing
The ZIL only holds data between fsync() and the next transaction group commit (default: 5 seconds).
Even under heavy sync write load, the ZIL rarely exceeds a few hundred MB. A SLOG device of
16–32 GB is more than sufficient for almost any workload. Larger SLOG devices
don't help — the data is freed every TXG commit regardless.
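A back-of-the-envelope estimate shows why small SLOG devices are enough. This sketch multiplies the sustained sync write rate by the TXG interval and by the number of transaction groups that may hold log data at once; the default of 3 is a conservative sizing assumption of this example, and zil_working_set_mib is a hypothetical helper name.

```shell
# Estimate how much log data can be outstanding at once, in MiB.
# $1 = sustained sync write rate in MiB/s
# $2 = seconds between TXG commits (default 5)
# $3 = transaction groups assumed in flight (default 3, a sizing assumption)
zil_working_set_mib() {
    zws_rate=$1
    zws_seconds=${2:-5}
    zws_txgs=${3:-3}
    echo $(( zws_rate * zws_seconds * zws_txgs ))
}
```

Even at 1000 MiB/s of sustained sync writes, this comes to 15000 MiB, comfortably inside a 16 GB device.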
When you need a SLOG
# Check if your workload generates sync writes
zpool iostat -q tank 5
# Look at the "syncq" columns:
# syncq_read syncq_write
# tank 0 47 <-- 47 sync writes queued: SLOG will help
# tank 0 0 <-- no sync writes: SLOG is useless
# Workloads that generate sync writes:
# - Databases (PostgreSQL, MySQL with innodb_flush_log_at_trx_commit=1)
# - NFS (Linux exports default to the sync option, so client commits become sync writes)
# - iSCSI targets
# - ESXi/Proxmox VM storage over NFS/iSCSI
# - Any application using O_SYNC or fsync()
# Workloads that DON'T generate sync writes:
# - Most local file operations (cp, mv, rsync without --fsync)
# - Web servers serving static files
# - Container registries
# - Media streaming
Do NOT disable the ZIL
zil_disable — don't
You'll find old forum posts recommending zfs set sync=disabled or the kernel
parameter zil_disable=1 to improve write performance. This discards the
write-ahead log entirely. If the system crashes, any sync writes since the last TXG
commit are silently lost. Your database thinks the data was committed. It wasn't.
For databases, this means corrupted transactions. For NFS, this means clients believe writes are safe when they're not. The performance gain is real but the data loss risk is catastrophic.
ABD (ARC Buffer Data)
ABD is the memory allocation layer underneath ARC, built around scatter/gather buffers. Before ABD (ZFS on Linux pre-0.7), ARC allocated large contiguous memory buffers via the SLAB allocator. On long-running systems, memory fragmentation made it impossible to allocate large blocks even when total free memory was sufficient, leading to mysterious allocation failures and performance degradation.
ABD solves this by supporting scattered allocations: instead of requiring a contiguous 128K buffer, ABD can assemble a 128K block from multiple smaller, non-contiguous pages. This dramatically reduces fragmentation pressure and is one of the reasons modern OpenZFS on Linux is far more stable than early versions.
You generally don't need to tune ABD. It's a low-level allocator improvement that works transparently. The main practical impact: modern ZFS on Linux no longer suffers from the "kmem:0" allocation failures that plagued older versions under memory pressure.
Memory pressure and OOM behavior
The Linux OOM (Out of Memory) killer can and does kill ZFS-related processes. Understanding the interaction between ZFS memory usage and the OOM killer is critical for production systems.
Set zfs_arc_max to leave headroom. On a 32 GB system running VMs, set ARC max to 16–20 GB, leaving 12–16 GB for VMs and the OS. On a dedicated NAS with no other workloads, you can give ARC 80–90% of RAM.
Diagnosing "ZFS is eating all my RAM"
Step-by-step procedure when someone reports high memory usage on a ZFS system:
# 1. Check actual ARC size
arc_summary | head -20
# Or: cat /proc/spl/kstat/zfs/arcstats | grep "^size"
# 2. Check if ARC is within its configured limits
cat /sys/module/zfs/parameters/zfs_arc_max
cat /sys/module/zfs/parameters/zfs_arc_min
# 3. Check available memory (includes ARC as available)
grep MemAvailable /proc/meminfo
# 4. Check what's actually consuming non-reclaimable memory
# If MemAvailable is low AND ARC is within limits, something
# else is eating memory (VMs, Java heap, leaked processes).
ps aux --sort=-%mem | head -20
# 5. Check ARC hit ratio — is ARC actually helping?
arc_summary | grep "hit ratio"
# If hit ratio < 80%, ARC is too small or the workload is uncacheable.
# If hit ratio > 95%, ARC is working perfectly. Leave it alone.
Swap on ZFS — the recursive death spiral
Do not put swap on a ZFS filesystem. Or if you do, understand exactly what can go wrong.
The problem is a deadlock cycle: when the system is under memory pressure, it wants to swap pages to disk. If swap lives on ZFS, ZFS needs to allocate memory to perform the write. But the system is already out of memory — that's why it's swapping. ZFS asks the kernel for memory, the kernel says "I'm swapping to free memory," ZFS says "I need memory to write the swap," and the system deadlocks or the OOM killer starts shooting processes.
Swap on a zvol (mkswap /dev/zvol/tank/swap). This is the dangerous one. ZFS needs memory to service I/O to the zvol. Under memory pressure, this creates the deadlock cycle. OpenZFS has mitigations, but they're not bulletproof.
# Safe: swap on a dedicated partition (kldload default)
mkswap /dev/sda2
swapon /dev/sda2
# Risky: swap on a zvol (mitigations exist but not guaranteed)
zfs create -V 8G -b 4096 -o compression=zle \
-o sync=always -o primarycache=metadata \
-o secondarycache=none tank/swap
mkswap /dev/zvol/tank/swap
swapon /dev/zvol/tank/swap
# If you MUST use zvol swap, these properties reduce deadlock risk:
# compression=zle — minimal CPU during memory pressure
# sync=always — prevents data loss on crash
# primarycache=metadata — don't cache swap data in ARC
# secondarycache=none — don't cache swap in L2ARC
Deduplication memory requirements
ZFS deduplication maintains a Dedup Table (DDT) that maps every block's checksum to its physical location. The DDT must be accessible for every write (to check if the block already exists) and every free (to track reference counts). If the DDT doesn't fit in memory, every write requires a random read from disk to check the DDT — and performance collapses.
DDT memory math
# Each DDT entry: ~320 bytes (checksum + physical address + refcount + padding)
# With 128K recordsize (default):
# 1 TB of data = ~8 million blocks
# 8 million * 320 bytes = ~2.5 GB of DDT
# 10 TB = ~25 GB of DDT
# With 4K recordsize (databases, zvols):
# 1 TB of data = ~268 million blocks
# 268 million * 320 bytes = ~80 GB of DDT
# THIS IS WHY DEDUP ON DATABASE ZVOLS IS INSANE
# Check your DDT size:
zpool status -D tank
# Shows DDT entries, size, and on-disk vs in-core statistics
zdb -DDD tank
# Detailed DDT analysis
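The DDT math above can be rerun for any pool size and recordsize. A sketch, assuming every block is unique (the worst case for DDT size) and the ~320 bytes per entry used above; ddt_bytes is a hypothetical helper name.

```shell
# Estimate in-core DDT size in bytes, assuming every block is unique.
# $1 = bytes of data stored
# $2 = recordsize (or volblocksize) in bytes
# $3 = bytes per DDT entry (default 320, an approximation)
ddt_bytes() {
    ddt_data=$1
    ddt_record=$2
    ddt_entry=${3:-320}
    echo $(( ddt_data / ddt_record * ddt_entry ))
}
```

For 1 TiB of data, ddt_bytes 1099511627776 131072 prints 2684354560 (2.5 GiB) and ddt_bytes 1099511627776 4096 prints 85899345920 (80 GiB).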
Special vdev for DDT (ZFS on Linux 0.8+)
ZFS on Linux 0.8 introduced allocation classes, which allow the DDT to be stored on a special vdev. This moves the DDT from main pool spindles to fast SSDs, and, critically, means the DDT no longer needs to fit entirely in ARC. ZFS can read DDT entries from the SSD special vdev at microsecond latency instead of millisecond latency from spinning disks.
This makes dedup feasible for the first time on large pools. It's still expensive (the special vdev must be sized for the DDT), but it removes the "all DDT in RAM" requirement that made dedup impractical for most deployments.
# Add mirrored special vdev for DDT + metadata
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
zfs set special_small_blocks=65536 tank
# Enable dedup (now DDT lives on the special vdev)
zfs set dedup=on tank/datasets-with-duplicates
# Monitor DDT on special vdev
zpool iostat -v tank
Memory recommendations by use case
| Use Case | Minimum RAM | Recommended RAM | ARC Max Setting | Notes |
|---|---|---|---|---|
| Desktop / workstation | 8 GB | 16 GB | 50% of RAM | Leave room for applications, browsers, IDEs |
| NAS / file server | 8 GB | 32 GB | 80–90% of RAM | Dedicated storage appliance; give ARC almost everything |
| Database server | 16 GB | 64 GB | 25–40% of RAM | Database has its own buffer pool; don't double-cache. primarycache=metadata on DB datasets |
| VM host (KVM/Proxmox) | 32 GB | 128 GB | 15–25% of RAM | VMs need direct RAM allocation. ARC helps VM image reads but can't compete with VM memory |
| Container host | 16 GB | 32 GB | 40–60% of RAM | Containers share the host kernel; ARC caches shared layers efficiently |
| Backup server | 8 GB | 16 GB | 4–8 GB fixed | Sequential writes; ARC barely helps. Spend money on disk throughput, not RAM |
| Build server / CI | 16 GB | 32 GB | 50% of RAM | Build tools need RAM; ARC helps with repeated source file reads |
| Dedup enabled (128K) | 16 GB + 2.5 GB/TB | 64 GB + 2.5 GB/TB | As much as possible | DDT must fit in ARC. 10 TB = 25 GB DDT alone. Consider special vdev (0.8+) |
| Dedup enabled (4K) | Don't | Don't | Don't | 80 GB DDT per TB. Unless you have a special vdev and accept the cost, just use compression. |
For databases, set primarycache=metadata on the database dataset. PostgreSQL has its own
shared_buffers cache. If ARC also caches the data blocks, you're double-caching:
wasting RAM on two copies of the same data. Let PostgreSQL manage its own cache and let ARC handle
metadata (block pointers, directory entries). Same applies to MySQL's InnoDB buffer pool.
Real-world: tuning ARC for database servers
Databases are the hardest ZFS memory tuning case because the database and ZFS are both trying to cache the same data. The solution is to divide responsibility:
# PostgreSQL on ZFS — 64 GB RAM system
# 1. ZFS: cache metadata only for the database dataset
zfs set primarycache=metadata tank/pgdata
zfs set recordsize=8K tank/pgdata # match PostgreSQL's default 8 kB page size
zfs set logbias=throughput tank/pgdata # bypass the SLOG; optimize sync writes for pool throughput
# 2. ZFS: set ARC to 25% of RAM (16 GB)
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
# 3. PostgreSQL: give it the majority of remaining RAM
# postgresql.conf:
# shared_buffers = 32GB # 50% of RAM
# effective_cache_size = 48GB # tell planner about total cache
# work_mem = 256MB
# maintenance_work_mem = 2GB
# 4. Verify no double-caching
# ARC should show mostly metadata, very little data:
grep -E "^(data_size|metadata_size) " /proc/spl/kstat/zfs/arcstats
# MySQL/InnoDB equivalent:
zfs set primarycache=metadata tank/mysql
zfs set recordsize=16K tank/mysql
# my.cnf:
# innodb_buffer_pool_size = 32G
Monitoring cache hit rates in production
#!/bin/bash
# Quick health check script for ZFS memory
echo "=== ARC Status ==="
arc_summary 2>/dev/null | grep -E "ARC size|Target size|hit ratio|Meta"
echo ""
echo "=== ARC vs System Memory ==="
ARC_SIZE=$(awk '/^size/ {print $3}' /proc/spl/kstat/zfs/arcstats)
TOTAL_MEM=$(awk '/MemTotal/ {print $2 * 1024}' /proc/meminfo)
AVAIL_MEM=$(awk '/MemAvailable/ {print $2 * 1024}' /proc/meminfo)
echo "ARC: $(( ARC_SIZE / 1073741824 )) GB"
echo "Total RAM: $(( TOTAL_MEM / 1073741824 )) GB"
echo "Available: $(( AVAIL_MEM / 1073741824 )) GB"
echo "ARC %: $(( ARC_SIZE * 100 / TOTAL_MEM ))%"
echo ""
echo "=== L2ARC Status ==="
arc_summary 2>/dev/null | grep -A 5 "L2ARC" || echo "No L2ARC configured"
echo ""
echo "=== Pool I/O ==="
zpool iostat -v 1 1
Summary — the rules
Always set zfs_arc_max on Linux. The kernel will fight ARC for memory. Tell ARC exactly how much it can use.
Don't panic when free shows high usage. Check MemAvailable instead.
If the application has its own cache (databases), set primarycache=metadata on its datasets.
Check l2_hdr_size before adding large L2ARC devices. On low-RAM systems, L2ARC can make things worse.
A SLOG only helps sync writes. Check zpool iostat -q for sync write pressure before buying hardware.
Never disable sync writes. sync=disabled trades data integrity for performance. Buy a SLOG instead.
When in doubt: set zfs_arc_max, monitor your hit ratio, and stop worrying about free -h.