Proxmox ZFS Tuning

Proxmox ZFS Tuning — stop blaming ZFS, start tuning it.

Proxmox VE ships with first-class ZFS support. You can install the hypervisor on ZFS, store VM disks on ZFS, replicate between cluster nodes with ZFS send/receive, and snapshot from the GUI. But the defaults are not optimized for virtualization workloads. People install Proxmox, create VMs on ZFS, performance is terrible, and they blame ZFS. The problem is not ZFS. The problem is that nobody tuned it. This page is the complete guide to making Proxmox and ZFS work together at full speed.

Proxmox is one of the best ways to run ZFS in a hypervisor because it exposes ZFS natively instead of hiding it behind an abstraction layer. But the convenience of the GUI means most admins never look at the underlying ZFS properties. This page teaches you what Proxmox does not. If you are running bare-metal Linux with ZFS (via kldload, for example), you get the same ZFS superpowers without the Proxmox overhead — and you can tune from day one because you own the pool creation command.

Proxmox VE + ZFS architecture

Proxmox VE is a Debian-based hypervisor that manages KVM virtual machines and LXC containers. When you choose ZFS as the storage backend, PVE uses two distinct storage modes:

VM disks (KVM)

Stored as ZFS zvols — block devices carved from the pool. The zvol appears as /dev/zvol/rpool/data/vm-100-disk-0 and is passed directly to QEMU as a raw block device. This is the fast path.

CT rootfs (LXC)

Stored as ZFS datasets (subvolumes). The dataset is mounted at /rpool/data/subvol-101-disk-0 and bind-mounted into the container. Datasets use recordsize (default 128K), which is fine for filesystem workloads.

ISO images

Stored in a regular dataset, typically rpool/data or a dedicated rpool/iso dataset. Large sequential reads — 128K recordsize is ideal here.

Backups (vzdump)

Stored in a dataset (usually /var/lib/vz/dump/ or a dedicated backup dataset). Sequential writes — 1M recordsize is optimal for backup files.

The critical distinction: VM disks are zvols (block devices). Container rootfs are datasets (filesystems). They have completely different tuning requirements. Most Proxmox ZFS problems come from not understanding this split.

Creating ZFS pools in PVE — GUI vs CLI

Proxmox offers a GUI for creating ZFS pools per node under Disks → ZFS. (Datacenter → Storage → Add → ZFS only registers a pool that already exists.) The GUI creates pools with sane defaults, but it does not expose all options. For production pools, always use the CLI.

GUI pool creation (what you get)

The GUI lets you pick disks, choose RAID level (mirror, raidz1/2/3, single), set compression, and pick ashift. That is all. It does not let you set dnodesize, xattr, create special vdevs, add SLOG, or control volblocksize for future zvols. The pool name defaults to rpool for root or your chosen name for data pools.

CLI pool creation (what you should do)

# Create a production VM storage pool — 4 disks, 2 mirror pairs
zpool create -o ashift=12 \
  -O compression=lz4 \
  -O atime=off \
  -O xattr=sa \
  -O dnodesize=auto \
  vmpool \
  mirror /dev/sda /dev/sdb \
  mirror /dev/sdc /dev/sdd

# Create the data dataset that PVE expects
zfs create vmpool/data

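# volblocksize is a per-zvol property fixed at creation time; it cannot be
# set on a pool or filesystem dataset. Set the default for new PVE zvols
# with "blocksize" in /etc/pve/storage.cfg instead.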

The Proxmox GUI pool creation is fine for a homelab where you are experimenting. For production, always SSH in and create the pool by hand. The GUI cannot add special vdevs, cannot set volblocksize defaults, and cannot create complex topologies (like mirror+SLOG+special in one command). With kldload, you get ZFS on root from the installer with all these properties set correctly from the start — no second-guessing what the GUI decided for you.

Storage configuration — /etc/pve/storage.cfg

After creating a ZFS pool, you must register it with Proxmox so the GUI and API know about it. This is done in /etc/pve/storage.cfg, which is automatically synced across cluster nodes via pmxcfs (the Proxmox cluster filesystem).

# /etc/pve/storage.cfg — ZFS pool for VM disks
zfspool: vmpool
    pool vmpool/data
    content images,rootdir
    sparse 1
    blocksize 16k

# Explanation of each line:
# zfspool: vmpool        — storage ID (what PVE calls it)
# pool vmpool/data       — the ZFS dataset path
# content images,rootdir — allowed content types (VM disks + CT rootfs)
# sparse 1               — create thin-provisioned zvols (CRITICAL — see below)
# blocksize 16k          — volblocksize for new zvols created on this storage

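The content field controls what PVE allows on a storage. A zfspool storage accepts only images and rootdir; the remaining types belong on a directory storage, which can itself sit on a ZFS dataset: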

images

VM disk images (zvols). Required for KVM virtual machines.

rootdir

Container rootfs (datasets). Required for LXC containers.

iso

ISO images for VM installation. Stored in a template/iso/ directory on the dataset.

vztmpl

Container templates (.tar.gz). Stored in template/cache/.

backup

Vzdump backup files. Stored in dump/.

snippets

Custom config snippets (hookscripts, cloud-init fragments).

Zvol-based VM disks vs dataset-based CT storage

This is the most important concept for ZFS on Proxmox. Understanding the difference between zvols and datasets determines whether your VMs fly or crawl.

Zvols — block devices for VMs

A zvol is a raw block device managed by ZFS. Proxmox creates one zvol per VM disk. QEMU/KVM reads and writes directly to the block device — no filesystem layer in between. The zvol's volblocksize determines the minimum I/O unit. This is the property that makes or breaks VM performance.

Proxmox naming convention: rpool/data/vm-{VMID}-disk-{N}

Datasets — filesystems for containers

A dataset is a ZFS filesystem with its own mount point. Proxmox creates one dataset per container rootfs. The container sees a normal Linux filesystem. The dataset's recordsize (default 128K) controls the block size. 128K is fine for general filesystem workloads inside containers.

Proxmox naming convention: rpool/data/subvol-{CTID}-disk-{N}

Property	Zvol (VM disk)	Dataset (CT rootfs)
Block size property	volblocksize	recordsize
Default block size	16K (PVE 8) / 8K (older)	128K
Can change after creation	No (immutable)	Yes (new writes only)
I/O path	QEMU → /dev/zvol → ZFS	Bind mount → ZFS POSIX
Compression effective	Yes, but smaller blocks = less ratio	Yes, excellent at 128K
Snapshot granularity	Entire zvol	Entire dataset
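
The two naming conventions above map directly to paths on the host. A minimal sketch, assuming the storage lives under rpool/data and the datasets keep their inherited default mountpoints; the helper names zvol_dev and ct_mount are hypothetical:

```shell
# zvol_dev DATASET VMID DISK: block device path for a VM disk
zvol_dev() {
    printf '/dev/zvol/%s/vm-%s-disk-%s\n' "$1" "$2" "$3"
}

# ct_mount DATASET CTID DISK: mount path for a container rootfs
# (assumes the dataset inherits the default mountpoint from its parent)
ct_mount() {
    printf '/%s/subvol-%s-disk-%s\n' "$1" "$2" "$3"
}

zvol_dev rpool/data 100 0
ct_mount rpool/data 101 0
```

The first call prints the zvol path QEMU opens; the second prints where the container dataset is mounted on the host.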

VM disk formats on ZFS — raw zvols only, never qcow2

Never use qcow2 on ZFS

When PVE creates a VM disk on ZFS storage, it creates a raw zvol. This is correct. But some admins manually create qcow2 files on a ZFS dataset and point QEMU at them. This is catastrophically slow.

qcow2 on ZFS means: QEMU's qcow2 layer does its own copy-on-write inside a file, while ZFS does copy-on-write on the blocks underneath. Double CoW. Every write is amplified twice. Snapshots conflict — qcow2 has its own snapshot mechanism that fights with ZFS snapshots. Thin provisioning conflicts — qcow2 has its own sparse allocation that fights with ZFS's.

qcow2 on ZFS is like running two competing traffic controllers on the same road. They both think they are in charge. Nobody moves.

The correct answer is always raw zvols. ZFS already provides everything qcow2 was designed to add: copy-on-write, snapshots, thin provisioning, compression. Using qcow2 on ZFS duplicates all of that functionality at a massive performance cost.

Thin provisioning with zvols

Proxmox can create zvols in two modes: thick (pre-allocated) or thin/sparse (allocate on write). Thin provisioning is controlled by the sparse flag in storage.cfg and the -s flag on zfs create.

# Thick zvol — 40GB allocated immediately, even if the VM uses 2GB
zfs create -V 40G rpool/data/vm-100-disk-0

# Thin/sparse zvol — 0 bytes allocated, grows as VM writes data
zfs create -V 40G -s rpool/data/vm-100-disk-0

# Check actual vs referenced space
zfs list -o name,volsize,used,refer rpool/data/vm-100-disk-0
# NAME                          VOLSIZE  USED   REFER
# rpool/data/vm-100-disk-0         40G   2.1G   2.0G

Always use thin provisioning (sparse 1 in storage.cfg). There is no performance penalty — ZFS allocates blocks on write either way. Thick provisioning just wastes pool space by reserving it upfront. The only edge case where thick makes sense is when you need guaranteed space reservation and cannot overcommit the pool.
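
Thin provisioning makes it easy to promise more space than the pool has, so it helps to track the ratio. A rough sketch, assuming you feed it the pool size and each zvol's volsize in the same unit; the function name overcommit_pct and the sample sizes are made up for illustration:

```shell
# overcommit_pct POOL_SIZE VOLSIZE...
# Prints provisioned space as a percentage of pool size (integer math).
overcommit_pct() {
    local pool=$1 total=0 v
    shift
    for v in "$@"; do
        total=$(( total + v ))
    done
    echo $(( total * 100 / pool ))
}

overcommit_pct 1000 40 40 200 500   # 780 of 1000 GiB promised
overcommit_pct 1000 500 500 500     # 1500 of 1000 GiB promised
```

Anything above 100 means the guests could collectively fill the pool. On a real host the volsize values could come from zfs list -H -p -t volume -o volsize, which prints bytes.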

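The volmode property controls how the zvol appears to the system. The ZFS default, volmode=default, follows the zvol_volmode module parameter, which on Linux means full unless changed: the zvol shows up as /dev/zvol/pool/name with its partitions exposed. volmode=dev exposes only the whole device and hides the partitions inside it, and volmode=none hides the device entirely (useful for zvols kept only as replication or backup copies).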

The write amplification problem

Why Proxmox VMs feel slow on default ZFS

Older Proxmox versions defaulted to 8K volblocksize for zvols, which actually aligned well with VM I/O. But if you created a dataset-backed VM disk (or used the wrong storage type), you got 128K recordsize. VM disk I/O operates in 4K-8K blocks. When a VM writes 8K to a 128K record, ZFS has to:

Read the full 128K record that contains the 8K block
Decompress it (if compression is on)
Modify the 8K portion
Recompress the full 128K
Write the new 128K record to a new location (CoW)

That is 16x write amplification for every VM I/O operation. Your VMs are not slow because ZFS is slow. They are slow because ZFS is reading and writing 128K to change 8K.

Imagine rewriting an entire chapter of a book to fix one typo. That is what 128K recordsize does to 8K VM I/O.
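
The arithmetic behind that figure is just block size divided by write size. A small sketch of the worst case for a partial-block overwrite, assuming power-of-two sizes in KiB; write_amp is a hypothetical helper:

```shell
# write_amp BLOCK_KIB IO_KIB
# ZFS rewrites the whole block, so a small write costs block/io times its size.
write_amp() {
    local block_kib=$1 io_kib=$2
    if [ "$io_kib" -ge "$block_kib" ]; then
        echo 1
    else
        echo $(( block_kib / io_kib ))
    fi
}

write_amp 128 8   # 8K write into a 128K record
write_amp 16 8    # 8K write into a 16K zvol block
write_amp 16 4    # 4K write into a 16K zvol block
```

The three calls print 16, 2 and 4.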

The fix: correct volblocksize

# For general VM workloads — 16K is the sweet spot
zfs create -V 40G -s \
    -o volblocksize=16K \
    -o compression=lz4 \
    rpool/data/vm-100-disk-0

# For PostgreSQL VMs: match the 8K page size (MySQL/InnoDB uses 16K pages, so keep 16K there)
zfs create -V 40G -s \
    -o volblocksize=8K \
    rpool/data/vm-100-disk-0

# Set the default for all future zvols created by PVE:
# in /etc/pve/storage.cfg, add "blocksize 16k" to the zfspool entry.
# volblocksize cannot be set on the parent dataset; it applies only to
# zvols and only at creation time.

16K volblocksize = 2x amplification instead of 16x. That is an 8x improvement from changing one number.

PVE 8.x improved the defaults significantly (16K volblocksize out of the box), but if you upgraded from PVE 7 or earlier, your existing zvols still have whatever volblocksize they were created with. volblocksize is immutable — you cannot change it after creation. The only fix for existing VMs is to create a new zvol with the correct blocksize and dd the old one over, or use qemu-img convert. With kldload, you control volblocksize from the installer — it is set correctly before the first VM is ever created.

PVE snapshot integration

When you click "Snapshot" in the Proxmox GUI for a VM on ZFS storage, PVE does two things:

Creates a ZFS snapshot of each zvol attached to the VM: zfs snapshot rpool/data/vm-100-disk-0@snap_name
Saves the VM configuration (CPU, RAM, network) and optionally the RAM state to the snapshot metadata

ZFS snapshots are instantaneous and free until data diverges. A PVE snapshot of a 500GB VM takes less than a second and consumes zero additional space at creation time. As the VM writes new data, the snapshot holds references to the old blocks — only the delta consumes space.

# GUI snapshots use the name you entered (for example @snap_name).
# Replication jobs create their own snapshots, named like this:
zfs snapshot rpool/data/vm-100-disk-0@__replicate_100-0_1712345678__
zfs snapshot rpool/data/vm-100-disk-1@__replicate_100-0_1712345678__

# List all snapshots for a VM's disk
zfs list -t snapshot -r rpool/data/vm-100-disk-0

# Check snapshot space consumption
zfs list -t snapshot -o name,used,refer -r rpool/data/vm-100-disk-0

Key limitation: PVE snapshots include RAM state only if you check "Include RAM" in the GUI. Without RAM state, rolling back stops the VM first. With RAM state, the snapshot file is large (equal to VM RAM) and stored on the same ZFS dataset. For production, skip RAM snapshots and use ZFS-only snapshots for speed.

Snapshot cleanup matters. Every snapshot holds references to old blocks, preventing ZFS from freeing space. A VM with 100 old snapshots can consume 10x its actual data size. Set up automated snapshot retention with sanoid/syncoid or PVE's built-in retention policies. Delete old snapshots regularly.
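
A retention rule can be as simple as "keep the newest N". The sketch below only selects candidates and destroys nothing; it assumes snapshot names arrive on stdin sorted oldest first, and prune_candidates is a hypothetical helper, not part of PVE or sanoid:

```shell
# prune_candidates KEEP
# Reads snapshot names (oldest first) on stdin and prints all but the newest KEEP.
prune_candidates() {
    awk -v keep="$1" '
        { line[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print line[i] }
    '
}

printf '%s\n' disk@mon disk@tue disk@wed disk@thu | prune_candidates 2
```

The example prints disk@mon and disk@tue. On a real host, zfs list -H -t snapshot -o name -s creation followed by the dataset name produces input in the right order; review the list before destroying anything, and remove snapshots that PVE created through PVE itself so its configuration stays in sync.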

PVE replication — built-in ZFS send/receive

Proxmox has built-in replication that uses zfs send and zfs receive to copy VM disks between cluster nodes. This is one of PVE's best features for ZFS users. It gives you near-instant failover without shared storage.

How it works

PVE takes a ZFS snapshot on the source node, sends an incremental stream to the target node, and receives it. Only changed blocks are transferred. A 500GB VM that changed 2GB since the last replication sends only 2GB.

Scheduling

Configure under Datacenter → Replication. Minimum interval is 1 minute. Typical production: every 15 minutes. The replication runs in the background — no VM downtime.

Failover

If the source node dies, migrate the VM to the target node. PVE uses the most recent replicated snapshot. Data loss = changes since last replication (RPO = replication interval).

Requirements

Both nodes must have ZFS pools with the same name. Both must be in the same PVE cluster. SSH keys are exchanged automatically by the cluster.
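
To get a feel for why incremental streams matter, compare transfer times for a delta and a full copy. A back-of-the-envelope sketch, assuming the link runs at full line rate with no protocol or SSH overhead; transfer_seconds is a hypothetical helper:

```shell
# transfer_seconds SIZE_MB LINK_MBIT
# Ideal transfer time in whole seconds: megabytes * 8 / megabits per second.
transfer_seconds() {
    local size_mb=$1 link_mbit=$2
    echo $(( size_mb * 8 / link_mbit ))
}

transfer_seconds 2048 1000     # 2GB delta over 1GbE
transfer_seconds 512000 1000   # 500GB full copy over 1GbE
```

The delta needs about 16 seconds, the full copy more than an hour (4096 seconds). Real transfers are slower, but the ratio holds.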

# Create a replication job via CLI (replicate VM 100 to node pve2, every 15 min)
pvesr create-local-job 100-0 pve2 --schedule '*/15'

# List replication jobs
pvesr list

# Check replication status
pvesr status

# Manually trigger replication
pvesr run 100-0

# Under the hood, PVE runs something like:
# zfs send -i rpool/data/vm-100-disk-0@prev rpool/data/vm-100-disk-0@new | \
#   ssh pve2 zfs receive rpool/data/vm-100-disk-0

PVE replication is essentially syncoid with a GUI. It is simpler but less flexible. syncoid gives you custom snapshot naming, bandwidth throttling, recursive replication, and replication to non-PVE targets (any machine with ZFS). If you are using kldload on bare metal, you get syncoid out of the box with sanoid for automated snapshot management — no PVE cluster overhead required. The same zfs send/receive mechanism, but you control every parameter.

Backup interaction — vzdump + ZFS snapshots

PVE's backup tool (vzdump) can back up running guests in snapshot mode. For containers on ZFS storage it uses the storage's own snapshot support, and vzdump:

Creates a ZFS snapshot of the container's datasets (instantaneous, no I/O freeze)
Reads from the snapshot to create the backup file (the container keeps running normally)
Removes the temporary snapshot when the backup completes

For KVM guests, snapshot mode uses QEMU's live backup rather than a storage snapshot: blocks are copied while the VM runs, and a block the guest is about to overwrite is copied out first. In both cases the backup is a frozen-in-time view and the guest keeps running. A busy VM can see some extra write latency while a backup is in progress, especially when the backup target is slow.

# Backup modes for ZFS VMs:
# snapshot — live backup while the guest runs, no downtime (RECOMMENDED)
# suspend  — pauses VM briefly, then snapshots
# stop     — stops VM, snapshots, restarts

# Configure backup job (Datacenter → Backup)
# Or via CLI:
vzdump 100 --mode snapshot --storage backup-pool --compress zstd

# The backup file lands as a .vma.zst file in the backup storage
# Restore:
qmrestore /path/to/vzdump-qemu-100-2026_04_04-12_00_00.vma.zst 100

Backup tip: Use --compress zstd for vzdump backups. zstd is much faster than gzip at a similar or better ratio, and compresses noticeably tighter than lzo. ZFS compression does not carry over into the backup: ZFS decompresses blocks on read, so vzdump sees the raw data and zstd compresses it from scratch, usually tighter than the lz4 copy in the pool. The ratio you get depends on the guest's data.

ARC tuning for PVE — leave room for KVM/QEMU

The Adaptive Replacement Cache (ARC) is ZFS's in-memory read cache. By default, ZFS claims up to 50% of system RAM for ARC. On a dedicated file server, that is fine. On a Proxmox host where KVM/QEMU VMs also need RAM, ARC and VMs compete for the same memory.

The rule: Total VM RAM + ARC + OS overhead must not exceed physical RAM. If you overcommit, the kernel's OOM killer starts killing QEMU processes — your VMs crash randomly with no warning.

Total host RAM	Recommended zfs_arc_max	Available for VMs	Notes
32 GB	4 GB (4294967296)	~26 GB	Tight. 4-6 small VMs. ARC helps but cannot be large.
64 GB	8 GB (8589934592)	~53 GB	Good balance. 8-12 VMs with healthy ARC.
128 GB	16-32 GB	~92-108 GB	Sweet spot. Large ARC + many VMs. Most production setups.
256 GB	32-64 GB	~188-220 GB	Enterprise. Massive ARC caches entire working set.
512 GB	64-128 GB	~380-444 GB	Database-heavy. ARC can cache full DB indexes.
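
The rule above is easy to check before you start another VM. A minimal sketch, assuming all values are whole GiB and roughly 3 GiB of host overhead (an assumption, adjust for your node); ram_headroom_gib is a hypothetical helper:

```shell
# ram_headroom_gib HOST_GIB ARC_MAX_GIB OS_GIB VM_GIB...
# Prints what is left after ARC, host overhead and all VM allocations.
ram_headroom_gib() {
    local host=$1 arc=$2 os=$3 vms=0 v
    shift 3
    for v in "$@"; do
        vms=$(( vms + v ))
    done
    echo $(( host - arc - os - vms ))
}

ram_headroom_gib 64 8 3 16 16 8 8   # 64GB host, 8GB ARC, four VMs
ram_headroom_gib 64 32 3 40         # 64GB host, default 50% ARC, 40GB of VMs
```

The first layout leaves 5 GiB spare; the second prints -11, which is the overcommit that ends with the OOM killer.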

# Set ARC limits — example for 128GB host
# /etc/modprobe.d/zfs.conf (persists across reboots)
# With root on ZFS, run update-initramfs -u -k all afterwards so the values apply at boot
options zfs zfs_arc_max=17179869184
options zfs zfs_arc_min=4294967296

# Apply immediately without reboot
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_min

# Verify current ARC usage
arc_summary | head -30

# Or the quick way:
cat /proc/spl/kstat/zfs/arcstats | grep -E '^size|^c_max'

The zfs_arc_min trap

Setting zfs_arc_min too high is dangerous on PVE. If VMs need more RAM than what remains after arc_min, the kernel cannot reclaim ARC memory below the minimum. VMs start swapping, then OOM-killing. Set zfs_arc_min to 25% of zfs_arc_max as a safe floor. On a 128GB host with arc_max=16GB, set arc_min=4GB.
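
The byte values for the module options are easy to mistype, so it can help to generate them. A small sketch that applies the 25% floor described above; gib_to_bytes is a hypothetical helper and the 16 GiB ceiling is just the 128GB example:

```shell
# gib_to_bytes GIB
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

arc_max=$(gib_to_bytes 16)
arc_min=$(( arc_max / 4 ))   # 25% of arc_max as the floor

printf 'options zfs zfs_arc_max=%s\n' "$arc_max"
printf 'options zfs zfs_arc_min=%s\n' "$arc_min"
```

This prints the same two lines used in the /etc/modprobe.d/zfs.conf example: 17179869184 and 4294967296.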

Proxmox's default ARC behavior is actually reasonable for PVE — it lets the kernel reclaim ARC memory under pressure. The problem is that reclamation is slow and causes latency spikes. Setting an explicit arc_max prevents the thrashing. On a kldload bare-metal server where all RAM belongs to the host (no QEMU VMs competing), you can let ARC have 75% or more — it is just a read cache and shrinks gracefully. The PVE tax is that QEMU pins memory for each VM, so ARC must share.

Swap on ZFS — the memory pressure problem

Never put swap on a ZFS zvol or dataset

When the system runs low on memory, it tries to swap. If the swap file is on ZFS, ZFS needs memory to write the swap data (for its own transaction groups, ARC metadata, and CoW operations). But the system is swapping because it is out of memory. Deadlock. The system hangs completely.

This is not theoretical. It happens on real Proxmox hosts, especially when VMs consume all available RAM and the host tries to swap. The entire node locks up and requires a hard reboot.

Trying to swap on ZFS is like asking an out-of-gas car to drive itself to the gas station.

# The fix: put swap on a dedicated small partition, NOT on ZFS
# During Proxmox install, create a 4-8GB swap partition on the boot disk

# If your root is already on ZFS, put swap on a dedicated
# non-ZFS partition:
mkswap /dev/sda2
swapon /dev/sda2

# Add to /etc/fstab:
# /dev/sda2  none  swap  sw  0 0

# Or disable swap entirely and rely on proper memory sizing:
swapoff -a
sed -i.bak '/\sswap\s/d' /etc/fstab   # only lines with a swap type field; backup in /etc/fstab.bak

# A ZFS-root PVE install creates no swap by default. If swap was added
# later, verify it is NOT on ZFS:
swapon --show
# NAME      TYPE      SIZE  USED  PRIO
# /dev/sda2 partition   8G    0B    -2     ← Good, this is a partition

The Proxmox installer sidesteps this: a ZFS-root install creates no swap on the pool, and you can leave unpartitioned space (hdsize in the installer's advanced options) for a real swap partition. The problem occurs when admins add swap later on a zvol, or when they install Debian manually and put everything including swap on ZFS. kldload takes a similar line: if you install with ZFS root, the installer creates a dedicated swap partition on the EFI system disk, outside the ZFS pool. The kernel never has to ask ZFS for memory just to free memory.

I/O tuning for VMs — the properties that matter

volblocksize — the single most important setting

# 16K — general purpose VMs (Windows, Linux desktop/server)
# A 4K guest write rewrites one 16K block (4:1 amplification)
# Good balance of performance and compression ratio
zfs create -V 100G -s -o volblocksize=16K rpool/data/vm-100-disk-0

# 8K — PostgreSQL VMs (8K pages); MySQL/InnoDB uses 16K pages, so give it 16K
# Matching the page size avoids amplification for page-aligned I/O
zfs create -V 200G -s -o volblocksize=8K rpool/data/vm-200-disk-0

# 64K — bulk storage VMs (file servers, media servers)
# Better compression ratio, acceptable for sequential workloads
zfs create -V 500G -s -o volblocksize=64K rpool/data/vm-300-disk-0

sync — write ordering guarantees

# standard (default) — honor guest fsync() calls, safest option
zfs set sync=standard rpool/data/vm-100-disk-0

# disabled — ignore fsync(), all writes are async
# DANGEROUS: data loss on power failure. Only for throwaway test VMs.
zfs set sync=disabled rpool/data/vm-999-disk-0

# always — force every write to be synchronous
# Overkill for VMs. Only useful for NFS/iSCSI targets.

Never disable sync in production. If you need faster sync writes, add a SLOG device instead of disabling the safety net.

logbias — ZIL write optimization

# latency (default) — optimize ZIL for low-latency commits
# Best for general VM workloads, databases, anything that calls fsync()
zfs set logbias=latency rpool/data/vm-100-disk-0

# throughput — optimize ZIL for large sequential sync writes
# Use for VMs that do large sequential writes (e.g., video processing)
# Sync data is written straight to the pool disks instead of the SLOG
zfs set logbias=throughput rpool/data/vm-300-disk-0

primarycache — what ARC caches

# all (default) — cache both data and metadata in ARC
zfs set primarycache=all rpool/data/vm-100-disk-0

# metadata — only cache metadata, not data blocks
# Use for VMs with huge working sets that thrash ARC
# (e.g., a 2TB database VM on a host with 16GB ARC)
zfs set primarycache=metadata rpool/data/vm-200-disk-0

SLOG for VM workloads

Virtual machines are one of the most SLOG-friendly workloads. Every guest OS issues fsync() calls for journaling, package installs, database commits, and log writes. Without a SLOG, each sync write must be committed to the ZIL on the data disks before the guest can continue, so it pays the latency of those disks and competes with regular pool I/O. With a SLOG, sync writes commit to the fast log device in microseconds and the VM continues immediately; the data still reaches the pool with the next transaction group (every 5 seconds by default).

# Add a SLOG — mirrored enterprise NVMe with power loss protection
zpool add rpool log mirror /dev/nvme2n1p1 /dev/nvme3n1p1

# Verify SLOG is active
zpool status rpool
#   ...
#   logs
#     mirror-1    ONLINE
#       nvme2n1p1  ONLINE
#       nvme3n1p1  ONLINE

# Check if your workload benefits from SLOG — look at sync write queue
zpool iostat -q rpool 5
# If the syncq_write column is consistently > 0, SLOG helps.

# SLOG sizing: 16-32GB is plenty. The ZIL only holds ~10 seconds
# of sync write data. Even at 1GB/s sync write rate, 16GB is overkill.
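
The sizing rule in the comment above is a single multiplication. A quick sketch, assuming the roughly 10 second window stated there and a sustained sync write rate in MB/s; slog_gib is a hypothetical helper that rounds up to whole GiB:

```shell
# slog_gib RATE_MB_S SECONDS
# Log space needed to hold SECONDS of sync writes, rounded up to whole GiB.
slog_gib() {
    echo $(( ($1 * $2 + 1023) / 1024 ))
}

slog_gib 1000 10   # 1GB/s of sync writes
slog_gib 200 10    # 200MB/s of sync writes
```

Even the 1GB/s case needs only about 10 GiB, which is why a 16-32GB partition is enough.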

SLOG must have power loss protection (PLP). Consumer NVMe drives (Samsung 970 EVO, WD Black, etc.) have volatile write caches. On power failure, the SLOG loses uncommitted data — exactly the data it was supposed to protect. Enterprise NVMe (Intel DC P4510/P5800X, Samsung PM9A3, Micron 7450) or Intel Optane are the only correct choices. A consumer SSD as SLOG is worse than no SLOG at all — it gives you a false sense of safety.

Special vdev for VM metadata

When your data pool is on HDDs, every metadata operation (block pointer lookups, free space tracking, dedup tables) hits spinning rust. A mirrored SSD special vdev accelerates all metadata operations for the entire pool. For VM workloads, this means faster snapshot creation, faster zfs list, faster pool scrubs, and faster replication delta calculation.

# Add a mirrored special vdev (MUST be mirrored — losing it kills the pool)
zpool add rpool special mirror /dev/sda /dev/sdb

# Store small blocks (64K and under) on the special vdev too.
# Keep the threshold below a dataset's recordsize, or all of its data lands on the special vdev.
zfs set special_small_blocks=65536 rpool

# The special vdev is most effective when:
# - Data pool is on HDDs (the speed gap between HDD metadata and SSD is huge)
# - You run many VMs/CTs (more metadata operations)
# - You use heavy snapshot/replication workflows (snapshot metadata is on special vdev)

On an all-NVMe pool, a special vdev provides minimal benefit — the data vdevs are already fast enough for metadata. Special vdevs shine when there is a large speed gap between the data disks and the special vdev disks.

Use mirrors, not RAIDZ, for VMs

Why mirrors win for VM workloads

This is the most common mistake on Proxmox. RAIDZ has terrible random write performance. VMs generate random I/O. Mirrors handle random I/O linearly — each mirror pair serves requests independently. With 4 mirror pairs, you get 4x the IOPS of a single mirror. RAIDZ stripes parity across all disks, meaning every write touches every disk in the vdev.

# BAD for VMs — RAIDZ2 across 6 disks
# Every write touches all 6 disks. IOPS = ~1 disk worth.
# zpool create rpool raidz2 /dev/sd{a,b,c,d,e,f}

# GOOD for VMs — 3 mirror pairs
# Each pair handles I/O independently. IOPS = ~3 disks worth.
zpool create -o ashift=12 rpool \
    mirror /dev/sda /dev/sdb \
    mirror /dev/sdc /dev/sdd \
    mirror /dev/sde /dev/sdf

Exception: if your Proxmox host primarily stores backups, ISOs, or media files (sequential, large-block workloads), RAIDZ2 is fine. Separate your VM storage (mirrors) from your bulk storage (RAIDZ) into different pools.
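
The comparison above follows from one rule of thumb: random write IOPS scale with the number of vdevs, not the number of disks. A deliberately crude sketch under that assumption; the 200 IOPS per disk figure is a made-up placeholder and pool_write_iops is a hypothetical helper:

```shell
# pool_write_iops VDEVS PER_DISK_IOPS
# Rule of thumb: each vdev (mirror or RAIDZ) delivers about one disk of random write IOPS.
pool_write_iops() {
    echo $(( $1 * $2 ))
}

pool_write_iops 3 200   # six disks as three mirrors
pool_write_iops 1 200   # the same six disks as one RAIDZ2 vdev
```

Same hardware, 600 versus 200 estimated write IOPS. Measure with a real benchmark before committing to a layout.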

PVE cluster with ZFS — Ceph vs ZFS replication

Proxmox supports two approaches to multi-node storage: Ceph (distributed object storage) and ZFS replication (zfs send/receive between nodes). They solve different problems.

Feature	Ceph	ZFS Replication
Minimum nodes	3 (recommended)	2
Failover speed	Seconds (automatic)	Minutes (manual or HA-managed)
Network requirement	10GbE dedicated (25GbE recommended)	1GbE sufficient
Storage overhead	3x (3-replica) or 1.5x (erasure coding)	2x (mirror on each node)
Complexity	High (MON, OSD, MDS daemons)	Low (just zfs send/receive)
RPO (data loss window)	0 (synchronous replication)	Replication interval (1-15 min)
Write latency	Higher (network round-trip + 3 copies)	Local disk speed
Disk failure handling	Automatic rebalance across cluster	ZFS resilver on local node
Scale-out	Yes (add nodes/OSDs anytime)	Limited (each node is independent)
Best for	Large clusters (5+ nodes), live migration	Small clusters (2-3 nodes), homelabs

For 2-3 node clusters, ZFS replication wins hands down. Ceph requires 3 nodes minimum, a dedicated 10GbE network, and significant operational complexity (monitor quorum, OSD management, CRUSH maps). ZFS replication just works — zfs send over SSH. You lose the zero-RPO guarantee, but you gain simplicity and local-disk write performance. For homelabs and small businesses, this is the right tradeoff. kldload gives you the same ZFS replication capability on bare metal with syncoid — no PVE cluster required. Two kldload boxes with syncoid running on a cron job is a perfectly functional replication setup.

Monitoring ZFS in PVE

The Proxmox GUI shows pool status (online/degraded/faulted) and basic space usage. That is about 10% of what you need to monitor. The rest requires the CLI.

# Pool health — the first command to run when anything seems wrong
zpool status rpool

# Pool I/O statistics — 5-second interval
zpool iostat rpool 5

# Per-vdev I/O breakdown (which disks are slow?)
zpool iostat -v rpool 5

# I/O queue depth (are writes queuing?)
zpool iostat -q rpool 5

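# ARC hit rate (should be > 80% for good performance)
awk '$1 == "hits" { h = $3 } $1 == "misses" { m = $3 }
     END { if (h + m) printf "ARC hit rate: %.1f%%\n", 100 * h / (h + m) }' \
    /proc/spl/kstat/zfs/arcstats
# Raw counters:
cat /proc/spl/kstat/zfs/arcstats | grep -E '^hits|^misses'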

# ARC efficiency breakdown
arcstat 5
# Columns differ between OpenZFS versions; arcstat -v lists every field and its meaning

# Pool fragmentation — high fragmentation = degraded write performance
zpool list -o name,frag,cap rpool
# Keep capacity below 80%. Above 80%, fragmentation accelerates.

# Scrub status — scrubs should run weekly
zpool status rpool | grep scan
# If the last scrub found errors, investigate immediately.

# ZFS I/O latency histogram (where is time being spent?)
zpool iostat -w rpool 5

Set up automated monitoring. Add zpool status checks to your monitoring stack (Prometheus + node_exporter with the ZFS collector, or Zabbix with ZFS templates). Alert on: degraded vdevs, scrub errors, capacity > 80%, ARC hit rate < 70%. The PVE GUI will not alert you on most of these conditions.
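
Until a full monitoring stack is in place, a cron job can cover the capacity alert. A minimal sketch, assuming input lines of the form "name<TAB>capacity%" such as zpool list -H -o name,capacity produces; pools_over_threshold is a hypothetical helper:

```shell
# pools_over_threshold LIMIT
# Reads "name<TAB>capacity" lines on stdin, prints pools above LIMIT percent.
pools_over_threshold() {
    awk -v limit="$1" '{
        cap = $2
        sub(/%/, "", cap)              # strip the percent sign if present
        if (cap + 0 > limit) print $1
    }'
}

printf 'rpool\t83%%\nvmpool\t41%%\n' | pools_over_threshold 80
```

With the sample input this prints rpool. Pipe the real zpool list output into it and mail the result whenever it is non-empty.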

Common mistakes

Pool on a single disk

No redundancy. One disk failure = total data loss. The Proxmox installer allows this. Never do it for production. Always use at least a mirror.

No ECC RAM

ZFS checksums every block to detect corruption. But if RAM itself is corrupt, ZFS writes bad data with a valid checksum. ECC RAM is not optional for ZFS. Every hardware guide says this. Most people ignore it until they get silent corruption.

zfs_arc_max too high

ARC competes with QEMU for RAM. If arc_max is 50% on a 64GB host and you allocate 40GB to VMs, the host has negative free memory. OOM killer fires. VMs crash. Set arc_max conservatively and monitor.

Not setting ashift=12

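Some disks, especially behind USB enclosures or with older firmware, report 512-byte sectors even though the media uses 4K physical sectors. ZFS then picks ashift=9 and issues 512-byte writes to 4K hardware, which forces the drive into read-modify-write cycles and cuts write performance badly. Cannot be fixed after creation. Always pass -o ashift=12 explicitly.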

RAIDZ1 on large drives

A 16TB drive takes 12+ hours to resilver. During resilver, one more failure and the vdev is gone. RAIDZ2 minimum for drives over 2TB. See Pool Design.

Using qcow2 on ZFS

Double copy-on-write. Double snapshot tracking. Double thin provisioning overhead. Always use raw zvols. PVE does this by default — do not override it.

Filling the pool past 80%

ZFS performance degrades sharply above 80% capacity due to fragmentation and reduced CoW efficiency. At 90%+, writes can stall completely. Monitor capacity and expand or delete before reaching 80%.

Swap on ZFS

Memory pressure + swap-on-ZFS = deadlock. The system needs memory to write swap, but is swapping because it has no memory. Use a dedicated swap partition outside ZFS.

Migration between PVE nodes with ZFS

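Proxmox supports migrating VMs between cluster nodes, live or offline. With local ZFS storage the disks have to travel too: offline migration and replication use zfs send/receive, while live migration mirrors the disks through QEMU. Both are efficient, but each has caveats.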

Online migration

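PVE mirrors each local disk to the target node while the VM keeps running (QEMU block mirroring rather than zfs send), then transfers dirty memory pages, then switches the VM. Requires a storage with the same ID on the target node, or an explicit --targetstorage. Downtime is typically 100-500ms.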

Offline migration

VM is stopped. Full zvol sent to target. VM starts on new node. Simple and reliable. Use when downtime is acceptable.

With replication

If replication is configured, migration sends only the delta since the last replication. A 500GB VM that was replicated 5 minutes ago might only need to transfer 200MB. This is the fast path.

# Migrate VM 100 from current node to pve2 (online; local disks need --with-local-disks)
qm migrate 100 pve2 --online --with-local-disks

# Migrate offline (stops VM, transfers, starts on target)
qm migrate 100 pve2

# If migration is slow, check:
# 1. Network bandwidth between nodes (iperf3 -s / iperf3 -c pve2)
# 2. Whether replication is configured (pre-synced data = fast migration)
# 3. zvol size — a 2TB zvol without replication takes a while

# Manual zvol transfer (useful for moving to non-PVE ZFS hosts)
# zfs send works on snapshots, so take one first:
zfs snapshot rpool/data/vm-100-disk-0@snap1
zfs send rpool/data/vm-100-disk-0@snap1 | ssh target zfs receive tank/data/vm-100-disk-0
# Incremental (much faster after initial sync):
zfs snapshot rpool/data/vm-100-disk-0@snap2
zfs send -i @snap1 rpool/data/vm-100-disk-0@snap2 | \
  ssh target zfs receive tank/data/vm-100-disk-0

PVE + ZFS vs PVE + Ceph vs PVE + LVM-thin

Feature	ZFS	Ceph	LVM-thin
Snapshots	Instant, zero-cost	Instant (RBD snapshots)	Slow, degrades write perf
Compression	Built-in (lz4, zstd)	Optional (BlueStore inline compression)	None
Checksums	Per-block (fletcher4 by default, SHA-256 optional)	Per-object CRC	None
Self-healing	Auto-repair on read	Yes (PG repair)	No
Replication	zfs send/receive	Built-in (synchronous)	None (use DRBD)
Thin provisioning	Native (sparse zvols)	Native	Native (the point of LVM-thin)
IOPS (mirrors)	Excellent	Good (network overhead)	Excellent (direct disk)
Write latency	Low (local disk)	Higher (network + 3 writes)	Low (local disk)
Capacity efficiency	50% (mirrors)	33% (3-replica) or 67% (EC)	100% (no redundancy) or 50% (DRBD)
Shared storage	No (per-node)	Yes (cluster-wide)	No (per-node)
Live migration	Yes (disk copy, or replication delta)	Yes (instant, shared storage)	Yes (block-level copy)
Complexity	Low	High	Very low
Minimum disks	2 (mirror)	3 nodes x 1+ OSD	1
Data integrity	Excellent (end-to-end)	Good	None

LVM-thin is what most Proxmox users start with because it is the default and it is simple. It works until you need snapshots (LVM snapshots are slow and degrade performance), replication (you need DRBD), or data integrity (LVM has no checksums). ZFS is the correct upgrade path for anyone who cares about their data. Ceph is the correct upgrade path for anyone who needs shared storage across many nodes. Most homelabs and small businesses should be on ZFS. kldload installs ZFS on root for any distro, so you can run the same ZFS-powered storage without Proxmox's overhead.

Converting existing PVE from LVM-thin to ZFS

There is no in-place conversion. LVM-thin and ZFS are fundamentally different storage architectures. You must migrate VMs disk-by-disk. Here is the procedure:

Add new disks and create a ZFS pool alongside the existing LVM-thin storage
Register the ZFS pool in /etc/pve/storage.cfg
For each VM: shut down, use qm move-disk to move each disk from LVM-thin to ZFS
Boot the VM on ZFS storage, verify everything works
Once all VMs are migrated, remove the LVM-thin storage

# Step 1: Create the ZFS pool (new disks)
zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
  vmpool mirror /dev/sdc /dev/sdd
zfs create vmpool/data

# Step 2: Register in storage.cfg
cat >> /etc/pve/storage.cfg << 'EOF'

zfspool: vmpool
    pool vmpool/data
    content images,rootdir
    sparse 1
    blocksize 16k
EOF

# Step 3: Move VM disks (VM must be stopped)
qm shutdown 100  # clean guest shutdown; qm stop is a hard power-off
qm move-disk 100 scsi0 vmpool --delete 1
# --delete 1 removes the source disk after successful move

# Step 4: Verify and start
qm start 100
qm agent 100 ping  # check guest agent responds

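# Repeat for each VM. For many VMs, script it:
# Assumes each VM has a single disk on scsi0
for vmid in 100 101 102 103 104; do
  # Stop at the first failure so a broken move is never followed by a start
  qm shutdown "$vmid" &&
    qm move-disk "$vmid" scsi0 vmpool --delete 1 &&
    qm start "$vmid" || { echo "VM $vmid failed, stopping" >&2; break; }
  echo "VM $vmid migrated to ZFS"
done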
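VMs often have more than one disk (extra data disks, an EFI disk), so a loop that hardcodes scsi0 can leave disks behind on LVM-thin. The sketch below reads qm config output and prints one move command per disk that still sits on the old storage. The helper names disks_on_storage and print_move_cmds are hypothetical, and local-lvm is assumed to be the ID of the LVM-thin storage. It only prints commands and runs nothing.

```shell
#!/bin/sh
# Print the disk keys (scsi0, virtio1, efidisk0, ...) whose volume lives on
# the given storage. Reads "qm config <vmid>" output on stdin.
disks_on_storage() {
  storage=$1
  awk -v s="$storage" -F': ' '
    $1 ~ /^(scsi|virtio|sata|ide|efidisk|tpmstate)[0-9]+$/ && index($2, s ":") == 1 { print $1 }
  '
}

# Turn those disk keys into qm move-disk commands (dry run: echo only).
print_move_cmds() {
  vmid=$1
  src=$2
  dst=$3
  disks_on_storage "$src" | while read -r disk; do
    echo "qm move-disk $vmid $disk $dst --delete 1"
  done
}

# Usage on a PVE node:
#   qm config 100 | print_move_cmds 100 local-lvm vmpool
```

Review the printed commands, then run them once the VM is shut down.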

Always back up before migrating. Run vzdump for every VM before moving disks. If the move fails or the ZFS pool has issues, you need a way back. Do not skip this.

PVE ZFS boot — root on ZFS

The Proxmox installer supports installing the OS directly on ZFS (root-on-ZFS). This gives you ZFS benefits for the entire system: the OS, logs, container rootfs, and VM disks all live on ZFS. Boot environments, rollback, and system-level snapshots work.

Boot pool

PVE creates rpool on the boot disk with the following layout: rpool/ROOT/pve-1 (the OS), rpool/data (VM/CT storage), and a small EFI partition for systemd-boot or GRUB.

Boot method

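On UEFI systems without Secure Boot, PVE 8.x boots a ZFS root through systemd-boot. Legacy BIOS and Secure Boot systems use GRUB. In both cases proxmox-boot-tool keeps kernels and initramfs images in sync on the EFI system partition, and the zfs-initramfs package lets the initramfs import the pool. Both work.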

Upgrade safety

Snapshot rpool/ROOT/pve-1 before apt upgrade. If the upgrade breaks, rollback to the snapshot. This is the single best reason to run PVE on ZFS root.

# Snapshot before PVE upgrade
zfs snapshot rpool/ROOT/pve-1@before-upgrade-$(date +%Y%m%d)

# Upgrade PVE
apt update && apt dist-upgrade

# If something breaks, rollback (use your snapshot's date; add -r if newer
# snapshots exist, which destroys them), then reboot:
zfs rollback rpool/ROOT/pve-1@before-upgrade-20260404

# Separate data pool for VMs (recommended — keep OS and data pools separate)
# During install, choose ZFS mirror for the boot disk
# After install, create a separate pool on dedicated disks:
zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
  vmpool mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf

Proxmox root-on-ZFS is nice, but it mixes the OS and VM storage on the same pool by default. This is fine for homelabs but bad for production — a runaway VM filling the pool can kill the OS. Always separate the boot pool (small mirror for the OS) from the data pool (large mirror set for VMs). kldload does this by default: the installer creates a dedicated boot pool and a separate data pool. Two pools, two failure domains, no shared fate.

Performance expectations

Real-world numbers from tuned Proxmox ZFS hosts. These assume correct volblocksize, mirrored vdevs, and properly sized ARC. Your results depend on disk hardware.

Configuration	Random 4K read IOPS	Random 4K write IOPS	Sequential read MB/s	Sequential write MB/s
2x SATA SSD mirror	40,000-80,000	20,000-40,000	500-550	400-500
2x NVMe mirror	200,000-500,000	100,000-300,000	3,000-6,000	2,000-4,000
4x NVMe (2 mirror pairs)	400,000-1,000,000	200,000-600,000	6,000-12,000	4,000-8,000
4x HDD mirror pairs	300-600	200-400	300-500	200-400
4x HDD mirrors + SLOG	300-600	200-400 (sync: 10,000+)	300-500	200-400
4x HDD mirrors + special	2,000-5,000 (metadata)	200-400	300-500	200-400
6x HDD RAIDZ2	80-150	40-80	500-800	300-500

Note the massive difference between mirrors and RAIDZ for random I/O. Four HDD mirror pairs deliver 300-600 random read IOPS. A 6-disk RAIDZ2 of the same drives delivers 80-150. For VMs, that difference is the difference between "responsive" and "why is everything so slow."
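A common rule of thumb explains the gap: random IOPS scale with the number of vdevs, not the number of disks, and a RAIDZ vdev behaves roughly like one disk for random I/O. Mirrors get a bonus on reads because either side can serve them. The sketch below encodes that rule. The helper name est_iops and the 100 IOPS per HDD figure are assumptions for illustration, and the results are upper bounds, not measurements.

```shell
#!/bin/sh
# est_iops LAYOUT VDEVS DISKS_PER_VDEV PER_DISK_IOPS
# Prints "read write" random IOPS estimated by the rule of thumb above.
est_iops() {
  layout=$1
  vdevs=$2
  width=$3
  per_disk=$4
  case $layout in
    # Mirror: every disk can serve reads; each write hits all sides of one vdev
    mirror) echo "$(( vdevs * width * per_disk )) $(( vdevs * per_disk ))" ;;
    # RAIDZ: each vdev delivers roughly one disk's worth of random IOPS
    raidz)  echo "$(( vdevs * per_disk )) $(( vdevs * per_disk ))" ;;
    *)      echo "unknown layout: $layout" >&2; return 1 ;;
  esac
}

est_iops mirror 2 2 100   # two mirror pairs: prints "400 200"
est_iops raidz 1 6 100    # one 6-disk RAIDZ vdev: prints "100 100"
```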

ARC dramatically improves reads. The IOPS numbers above are cold-cache (first read from disk). With warm ARC, frequently accessed blocks are served from RAM at millions of IOPS. A properly sized ARC turns HDD-backed ZFS into an SSD-like experience for the working set.
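One way to check whether the ARC is actually warm is to compute the hit ratio from the kernel's counters. This is a minimal sketch that assumes the Linux OpenZFS stats file at /proc/spl/kstat/zfs/arcstats with its usual "name type data" columns. The helper name arc_hit_pct is hypothetical.

```shell
#!/bin/sh
# Print the overall ARC hit ratio as a whole percentage.
# Optional argument: path to an arcstats file (defaults to the live one).
arc_hit_pct() {
  stats=${1:-/proc/spl/kstat/zfs/arcstats}
  awk '$1 == "hits"   { h = $3 }
       $1 == "misses" { m = $3 }
       END {
         if (h + m > 0) printf "%d\n", h * 100 / (h + m)
         else print "n/a"
       }' "$stats"
}

# Usage on a ZFS host:
#   arc_hit_pct
```

The counters are cumulative since boot, so a low number shortly after a reboot mostly reflects a cold cache.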

Quick reference: Proxmox ZFS tuning

Setting	Default	Recommended	Why
volblocksize	16K (PVE 8) / 8K (older)	16K (VMs) / 8K (DBs)	Match guest I/O pattern, reduce amplification
recordsize	128K	128K for CTs, 1M for backup datasets	Use zvols for VMs, not datasets
compression	on (lz4)	lz4 (or zstd for backup datasets)	Nearly free CPU cost, saves I/O bandwidth
atime	on	off	Eliminates useless access-time writes
xattr	on (directory-based before OpenZFS 2.3)	sa	Store extended attrs in the dnode, faster
dnodesize	legacy	auto	Larger dnodes for metadata-heavy workloads
zfs_arc_max	50% RAM (10%, max 16 GiB, on fresh PVE 8.1+ installs)	See ARC table above	Leave room for QEMU VM memory
zfs_arc_min	adaptive	25% of arc_max	Prevent ARC starvation under pressure
sync	standard	standard + SLOG	Never disable sync — add SLOG instead
logbias	latency	latency (VMs) / throughput (bulk)	Match write pattern
primarycache	all	all (most VMs) / metadata (huge DBs)	Control what ARC caches
sparse (storage.cfg)	0	1	Thin provision zvols, no perf penalty
VDEV layout	varies	Mirrors for VMs	RAIDZ kills random I/O
ashift	auto-detect	12 (always specify)	Correct for all modern disks
special vdev	none	Mirrored SSDs (HDD pools)	Accelerates metadata for all VMs
SLOG	none	Enterprise NVMe with PLP	Accelerates sync writes from VMs
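To compare an existing dataset against a few rows of this table, a small audit like the following can help. It is a sketch: the helper name audit_props is hypothetical, it covers only atime, xattr, and dnodesize, and it reads the tab-separated output of zfs get -H -o property,value on stdin. It changes nothing.

```shell
#!/bin/sh
# Compare dataset properties against the recommended values from the table.
# stdin: "property<TAB>value" lines.
audit_props() {
  while IFS="$(printf '\t')" read -r prop value; do
    case $prop in
      atime)     want=off ;;
      xattr)     want=sa ;;
      dnodesize) want=auto ;;
      *)         continue ;;   # ignore properties this sketch does not cover
    esac
    if [ "$value" = "$want" ]; then
      echo "ok: $prop=$value"
    else
      echo "change: $prop=$value -> $want"
    fi
  done
}

# Usage on a ZFS host:
#   zfs get -H -o property,value atime,xattr,dnodesize vmpool/data | audit_props
```

Each "change" line maps to a zfs set command, for example zfs set atime=off vmpool/data.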

Proxmox is not bad. Untuned Proxmox is bad. The same ZFS that runs Netflix's CDN can run your Proxmox cluster — if you tune it for the workload. The defaults are conservative. Your workload is not conservative. Tune accordingly. And if you want ZFS done right from the start without a hypervisor in the way, check out kldload — same ZFS, same tuning, bare metal, any distro.
