Proxmox ZFS Tuning — stop blaming ZFS, start tuning it.
Proxmox VE ships with first-class ZFS support. You can install the hypervisor on ZFS, store VM disks on ZFS, replicate between cluster nodes with ZFS send/receive, and snapshot from the GUI. But the defaults are not optimized for virtualization workloads. People install Proxmox, create VMs on ZFS, performance is terrible, and they blame ZFS. The problem is not ZFS. The problem is that nobody tuned it. This page is the complete guide to making Proxmox and ZFS work together at full speed.
Proxmox VE + ZFS architecture
Proxmox VE is a Debian-based hypervisor that manages KVM virtual machines and LXC containers. When you choose ZFS as the storage backend, PVE uses two distinct storage modes:
- VM disks (zvols): each VM disk is a zvol that appears at /dev/zvol/rpool/data/vm-100-disk-0 and is passed directly to QEMU as a raw block device. This is the fast path.
- Container rootfs (datasets): each container root filesystem is a dataset mounted at /rpool/data/subvol-101-disk-0 and bind-mounted into the container. Datasets use recordsize (default 128K), which is fine for filesystem workloads.
- ISO images and templates: stored on rpool/data or a dedicated rpool/iso dataset. Large sequential reads, so 128K recordsize is ideal here.
- Backups: written to a directory (/var/lib/vz/dump/ or a dedicated backup dataset). Sequential writes, so 1M recordsize is optimal for backup files.
The critical distinction: VM disks are zvols (block devices). Container rootfs are datasets (filesystems). They have completely different tuning requirements. Most Proxmox ZFS problems come from not understanding this split.
Creating ZFS pools in PVE — GUI vs CLI
Proxmox offers a GUI for creating ZFS pools per node under Disks → ZFS; Datacenter → Storage → Add → ZFS then registers an existing pool as storage. The GUI creates pools with sane defaults, but it does not expose all options. For production pools, use the CLI.
GUI pool creation (what you get)
The GUI lets you pick disks, choose RAID level (mirror, raidz1/2/3, single), set compression,
and pick ashift. That is all. It does not let you set dnodesize, xattr,
create special vdevs, add SLOG, or control volblocksize for future zvols. The pool name defaults
to rpool for root or your chosen name for data pools.
CLI pool creation (what you should do)
# Create a production VM storage pool — 4 disks, 2 mirror pairs
zpool create -o ashift=12 \
-O compression=lz4 \
-O atime=off \
-O xattr=sa \
-O dnodesize=auto \
vmpool \
mirror /dev/sda /dev/sdb \
mirror /dev/sdc /dev/sdd
# Create the data dataset that PVE expects
zfs create vmpool/data
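# volblocksize cannot be set on the pool or on vmpool/data: it is a per-zvol
# property, fixed when each zvol is created. Set the default for zvols that PVE
# creates with the blocksize option in /etc/pve/storage.cfg.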
Storage configuration — /etc/pve/storage.cfg
After creating a ZFS pool, you must register it with Proxmox so the GUI and API know about it.
This is done in /etc/pve/storage.cfg, which is automatically synced across cluster
nodes via pmxcfs (the Proxmox cluster filesystem).
# /etc/pve/storage.cfg — ZFS pool for VM disks
zfspool: vmpool
    pool vmpool/data
    content images,rootdir
    sparse 1
    blocksize 16k
# Explanation of each line:
# zfspool: vmpool — storage ID (what PVE calls it)
# pool vmpool/data — the ZFS dataset path
# content images,rootdir — allowed content types (VM disks + CT rootfs)
# sparse 1 — create thin-provisioned zvols (CRITICAL — see below)
# blocksize 16k — volblocksize for new zvols (PVE 7.x+ only)
The content field controls what PVE allows on this storage:
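- images: VM disk images, stored as zvols.
- rootdir: container root filesystems, stored as datasets.
- iso: ISO images, kept in a template/iso/ directory on the dataset.
- vztmpl: container templates, kept in template/cache/.
- backup: vzdump backup files, kept in dump/.
A zfspool storage accepts only images and rootdir. The three file-based types need a directory storage whose path is a dataset mount point.
Zvol-based VM disks vs dataset-based CT storage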
This is the most important concept for ZFS on Proxmox. Understanding the difference between zvols and datasets determines whether your VMs fly or crawl.
Zvols — block devices for VMs
A zvol is a raw block device managed by ZFS. Proxmox creates one zvol per VM disk. QEMU/KVM reads and writes directly to the block device — no filesystem layer in between. The zvol's volblocksize determines the minimum I/O unit. This is the property that makes or breaks VM performance.
Proxmox naming convention: rpool/data/vm-{VMID}-disk-{N}
Datasets — filesystems for containers
A dataset is a ZFS filesystem with its own mount point. Proxmox creates one dataset per container rootfs. The container sees a normal Linux filesystem. The dataset's recordsize (default 128K) controls the block size. 128K is fine for general filesystem workloads inside containers.
Proxmox naming convention: rpool/data/subvol-{CTID}-disk-{N}
| Property | Zvol (VM disk) | Dataset (CT rootfs) |
|---|---|---|
| Block size property | volblocksize | recordsize |
| Default block size | 16K (PVE 8) / 8K (older) | 128K |
| Can change after creation | No (immutable) | Yes (new writes only) |
| I/O path | QEMU → /dev/zvol → ZFS | Bind mount → ZFS POSIX |
| Compression effective | Yes, but smaller blocks = less ratio | Yes, excellent at 128K |
| Snapshot granularity | Entire zvol | Entire dataset |
VM disk formats on ZFS — raw zvols only, never qcow2
Never use qcow2 on ZFS
When PVE creates a VM disk on ZFS storage, it creates a raw zvol. This is correct. But some admins manually create qcow2 files on a ZFS dataset and point QEMU at them. This is catastrophically slow.
qcow2 on ZFS means: QEMU's qcow2 layer does its own copy-on-write inside a file, while ZFS does copy-on-write on the blocks underneath. Double CoW. Every write is amplified twice. Snapshots conflict — qcow2 has its own snapshot mechanism that fights with ZFS snapshots. Thin provisioning conflicts — qcow2 has its own sparse allocation that fights with ZFS's.
The correct answer is always raw zvols. ZFS already provides everything qcow2 was designed to add: copy-on-write, snapshots, thin provisioning, compression. Using qcow2 on ZFS duplicates all of that functionality at a massive performance cost.
Thin provisioning with zvols
Proxmox can create zvols in two modes: thick (space reserved up front)
or thin/sparse (allocate on write). Thin provisioning is controlled by the
sparse flag in storage.cfg and the -s flag on
zfs create.
# Thick zvol: 40GB reserved immediately (refreservation), even if the VM uses 2GB
zfs create -V 40G rpool/data/vm-100-disk-0
# Thin/sparse zvol (alternative to the above): nothing reserved, grows as the VM writes data
zfs create -V 40G -s rpool/data/vm-100-disk-0
# Check actual vs referenced space
zfs list -o name,volsize,used,refer rpool/data/vm-100-disk-0
# NAME VOLSIZE USED REFER
# rpool/data/vm-100-disk-0 40G 2.1G 2.0G
Always use thin provisioning (sparse 1 in storage.cfg).
There is no performance penalty — ZFS allocates blocks on write either way.
Thick provisioning just wastes pool space by reserving it upfront. The only edge case
where thick makes sense is when you need guaranteed space reservation and cannot
overcommit the pool.
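Thin provisioning makes it easy to promise more space than the pool has, so it helps to compare what is provisioned against what is used. The sketch below is one way to do that. It assumes tab-separated input of the kind `zfs list -Hp -t volume -o name,volsize,used` prints; the function name `summarize_zvols` and the two sample rows are made up for illustration.

```shell
#!/bin/sh
# Sum provisioned (volsize) vs. used bytes for a set of zvols.
# Expected input: name<TAB>volsize<TAB>used, one zvol per line, sizes in bytes.
summarize_zvols() {
  awk -F '\t' '
    { provisioned += $2; used += $3 }
    END {
      gib = 1024 * 1024 * 1024
      printf "provisioned=%dGiB used=%dGiB\n", provisioned / gib, used / gib
    }'
}

# Made-up sample: two 40 GiB sparse zvols using 2 GiB and 6 GiB.
printf 'rpool/data/vm-100-disk-0\t42949672960\t2147483648\nrpool/data/vm-101-disk-0\t42949672960\t6442450944\n' |
  summarize_zvols
```

On a real host you would pipe the `zfs list` output into the function instead of the sample. If the provisioned total is larger than the pool, the pool is overcommitted and capacity needs watching.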
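The volmode property controls how the zvol appears to the system.
It defaults to volmode=default, which follows the zvol_volmode module parameter; on Linux
that means full, which exposes /dev/zvol/pool/name along with any partitions inside it.
You can also set volmode=dev to expose only the whole device and hide its partitions from the host,
or volmode=none to hide the device entirely (useful for zvols kept only as backup or replication copies).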
The write amplification problem
Why Proxmox VMs feel slow on default ZFS
Older Proxmox versions defaulted to 8K volblocksize for zvols, which actually aligned well with VM I/O. But if you created a dataset-backed VM disk (or used the wrong storage type), you got 128K recordsize. VM disk I/O operates in 4K-8K blocks. When a VM writes 8K to a 128K record, ZFS has to:
- Read the full 128K record that contains the 8K block
- Decompress it (if compression is on)
- Modify the 8K portion
- Recompress the full 128K
- Write the new 128K record to a new location (CoW)
That is 16x write amplification for every VM I/O operation. Your VMs are not slow because ZFS is slow. They are slow because ZFS is reading and writing 128K to change 8K.
The fix: correct volblocksize
# For general VM workloads — 16K is the sweet spot
zfs create -V 40G -s \
-o volblocksize=16K \
-o compression=lz4 \
rpool/data/vm-100-disk-0
# For database VMs, match the DB page size: 8K for PostgreSQL (MySQL/InnoDB uses 16K pages)
zfs create -V 40G -s \
-o volblocksize=8K \
rpool/data/vm-100-disk-0
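# Set the default for all future zvols created by PVE
# In /etc/pve/storage.cfg, add: blocksize 16k
# Or set the same option from the CLI. volblocksize exists only on zvols, so
# "zfs set volblocksize" on rpool/data fails. local-zfs is the storage ID a
# ZFS-root install creates; substitute your own.
pvesm set local-zfs --blocksize 16k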
16K volblocksize = 2x amplification instead of 16x. That is an 8x improvement from changing one number.
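The 16x and 2x figures are plain division, which a few lines of shell can reproduce. This is a simplified model, not a measurement: it assumes an aligned guest write no larger than one block and ignores compression and parity. The function name `write_amplification` is made up for this example.

```shell
#!/bin/sh
# write_amplification BLOCK_KIB WRITE_KIB
# Prints how many KiB ZFS rewrites per KiB the guest asked to write,
# assuming an aligned write that fits inside a single block.
write_amplification() {
  block_kib=$1
  write_kib=$2
  if [ "$write_kib" -ge "$block_kib" ]; then
    echo 1
  else
    echo $(( block_kib / write_kib ))
  fi
}

echo "128K record, 8K write: $(write_amplification 128 8)x"
echo "16K block, 8K write: $(write_amplification 16 8)x"
echo "8K block, 8K write: $(write_amplification 8 8)x"
```

The last line shows why 8K suits a database that writes 8K pages: block and write are the same size, so nothing extra is rewritten.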
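volblocksize cannot be changed after a zvol is created. To fix an existing VM disk, create a new zvol with the correct volblocksize, then
dd the old one over, or use qemu-img convert. With
kldload, you control volblocksize from the installer, so it is set correctly before the first VM is ever created.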
PVE snapshot integration
When you click "Snapshot" in the Proxmox GUI for a VM on ZFS storage, PVE does two things:
- Creates a ZFS snapshot of each zvol attached to the VM: zfs snapshot rpool/data/vm-100-disk-0@snap_name
- Saves the VM configuration (CPU, RAM, network) and optionally the RAM state to the snapshot metadata
ZFS snapshots are instantaneous and free until data diverges. A PVE snapshot of a 500GB VM takes less than a second and consumes zero additional space at creation time. As the VM writes new data, the snapshot holds references to the old blocks — only the delta consumes space.
# PVE runs zfs snapshot internally. GUI snapshots use the name you entered;
# replication snapshots use the __replicate_ prefix shown here:
zfs snapshot rpool/data/vm-100-disk-0@__replicate_100-0_1712345678
zfs snapshot rpool/data/vm-100-disk-1@__replicate_100-0_1712345678
# List all snapshots for a VM's disk
zfs list -t snapshot -r rpool/data/vm-100-disk-0
# Check snapshot space consumption
zfs list -t snapshot -o name,used,refer -r rpool/data/vm-100-disk-0
Key limitation: PVE snapshots include RAM state only if you check "Include RAM" in the GUI. Without RAM state, rolling back stops the VM first. With RAM state, the snapshot file is large (equal to VM RAM) and stored on the same ZFS dataset. For production, skip RAM snapshots and use ZFS-only snapshots for speed.
PVE replication — built-in ZFS send/receive
Proxmox has built-in replication that uses zfs send and zfs receive
to copy VM disks between cluster nodes. This is one of PVE's best features for ZFS users.
It gives you near-instant failover without shared storage.
# Create a replication job via CLI (replicate VM 100 to node pve2, every 15 min)
pvesr create-local-job 100-0 pve2 --schedule '*/15'
# List replication jobs
pvesr list
# Check replication status
pvesr status
# Manually trigger replication
pvesr run 100-0
# Under the hood, PVE runs something like:
# zfs send -i rpool/data/vm-100-disk-0@prev rpool/data/vm-100-disk-0@new | \
# ssh pve2 zfs receive rpool/data/vm-100-disk-0
zfs send/receive mechanism, but you control every parameter.
Backup interaction — vzdump + ZFS snapshots
PVE's backup tool (vzdump) uses ZFS snapshots when it backs up containers. In snapshot mode,
for a container on ZFS storage, vzdump:
- Creates a ZFS snapshot of the container's volumes (the container is paused only for the instant the snapshot is taken)
- Reads from the snapshot to create the backup file (the container keeps running normally)
- Removes the temporary snapshot when the backup completes
This means container backups on ZFS have near-zero impact on the running workload: the backup reads from a frozen-in-time view while the container continues writing to new blocks. VM backups in snapshot mode do not use a storage snapshot at all. QEMU's live backup copies the disk blocks while the VM keeps running, so VMs on ZFS also back up without downtime.
# Backup modes:
# snapshot: no downtime (RECOMMENDED); ZFS snapshot for CTs, QEMU live backup for VMs
# suspend: suspends the guest first; longer downtime, kept for compatibility
# stop: shuts the guest down for a consistent backup, then restarts it
# Configure backup job (Datacenter → Backup)
# Or via CLI:
vzdump 100 --mode snapshot --storage backup-pool --compress zstd
# The backup file lands as a .vma.zst file in the backup storage
# Restore:
qmrestore /path/to/vzdump-qemu-100-2026_04_04-12_00_00.vma.zst 100
Use --compress zstd for vzdump backups.
zstd compresses much faster than gzip and produces smaller files than lzo. ZFS compression
is transparent, so vzdump reads the guest data uncompressed and zstd does all the work on
the archive. The lz4 setting on the pool does not carry over into the backup file.
ARC tuning for PVE — leave room for KVM/QEMU
The Adaptive Replacement Cache (ARC) is ZFS's in-memory read cache. By default, ZFS claims up to 50% of system RAM for ARC. On a dedicated file server, that is fine. On a Proxmox host where KVM/QEMU VMs also need RAM, ARC and VMs compete for the same memory.
The rule: Total VM RAM + ARC + OS overhead must not exceed physical RAM. If you overcommit, the kernel's OOM killer starts killing QEMU processes — your VMs crash randomly with no warning.
| Total host RAM | Recommended zfs_arc_max | Available for VMs | Notes |
|---|---|---|---|
| 32 GB | 4 GB (4294967296) | ~26 GB | Tight. 4-6 small VMs. ARC helps but cannot be large. |
| 64 GB | 8 GB (8589934592) | ~53 GB | Good balance. 8-12 VMs with healthy ARC. |
| 128 GB | 16-32 GB | ~92-108 GB | Sweet spot. Large ARC + many VMs. Most production setups. |
| 256 GB | 32-64 GB | ~188-220 GB | Enterprise. Massive ARC caches entire working set. |
| 512 GB | 64-128 GB | ~380-444 GB | Database-heavy. ARC can cache full DB indexes. |
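The budget rule behind the table is simple enough to check before you start another VM. A minimal sketch follows, with all values in GiB. The function name `check_ram_budget` is made up, and the 3 GiB OS overhead is an assumption read off the 64 GB row (64 minus 8 for ARC minus roughly 53 for VMs).

```shell
#!/bin/sh
# check_ram_budget HOST_GIB VM_GIB ARC_GIB OS_GIB
# Applies the rule: VM RAM + ARC + OS overhead must not exceed physical RAM.
check_ram_budget() {
  host=$1
  vms=$2
  arc=$3
  os=$4
  needed=$(( vms + arc + os ))
  if [ "$needed" -le "$host" ]; then
    echo "ok: $(( host - needed )) GiB headroom"
  else
    echo "overcommitted by $(( needed - host )) GiB"
  fi
}

check_ram_budget 64 40 8 3    # ARC capped at 8 GiB
check_ram_budget 64 40 32 3   # ARC left at 50% of RAM
```

The second call is the failure mode described above: 40 GiB of VMs fit comfortably on a 64 GiB host until ARC is allowed to grow to half of RAM.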
# Set ARC limits — example for 128GB host
# /etc/modprobe.d/zfs.conf (persists across reboots)
options zfs zfs_arc_max=17179869184
options zfs zfs_arc_min=4294967296
# With root on ZFS the module loads from the initramfs, so rebuild it
# for the new values to apply at boot:
update-initramfs -u -k all
# Apply immediately without reboot
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_min
# Verify current ARC usage
arc_summary | head -30
# Or the quick way:
cat /proc/spl/kstat/zfs/arcstats | grep -E '^size|^c_max'
The zfs_arc_min trap
Setting zfs_arc_min too high is dangerous on PVE. If VMs need more RAM than
what remains after arc_min, the kernel cannot reclaim ARC memory below the minimum.
VMs start swapping, then OOM-killing. Set zfs_arc_min to 25% of zfs_arc_max
as a safe floor. On a 128GB host with arc_max=16GB, set arc_min=4GB.
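Both parameters take bytes, which is where typos creep in. The sketch below generates the two option lines from an ARC size in GiB and applies the 25% floor described above. The function name `arc_options` is made up; redirect its output into /etc/modprobe.d/zfs.conf only after checking the numbers.

```shell
#!/bin/sh
# arc_options ARC_MAX_GIB
# Prints modprobe option lines with zfs_arc_min set to 25% of zfs_arc_max.
arc_options() {
  max_bytes=$(( $1 * 1024 * 1024 * 1024 ))
  min_bytes=$(( max_bytes / 4 ))
  echo "options zfs zfs_arc_max=$max_bytes"
  echo "options zfs zfs_arc_min=$min_bytes"
}

arc_options 16
```

For 16 GiB this prints the same two values used in the 128GB host example.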
Swap on ZFS — the memory pressure problem
Never put swap on a ZFS zvol or dataset
When the system runs low on memory, it tries to swap. If the swap file is on ZFS, ZFS needs memory to write the swap data (for its own transaction groups, ARC metadata, and CoW operations). But the system is swapping because it is out of memory. Deadlock. The system hangs completely.
This is not theoretical. It happens on real Proxmox hosts, especially when VMs consume all available RAM and the host tries to swap. The entire node locks up and requires a hard reboot.
# The fix: put swap on a dedicated small partition, NOT on ZFS
# During Proxmox install, create a 4-8GB swap partition on the boot disk
# If your root is already on ZFS, use a dedicated non-ZFS partition
# (/dev/sda2 here is an example):
mkswap /dev/sda2
swapon /dev/sda2
# Add to /etc/fstab:
# /dev/sda2 none swap sw 0 0
# Or disable swap entirely and rely on proper memory sizing:
swapoff -a
sed -i.bak '/\sswap\s/d' /etc/fstab   # removes swap entries, keeps a backup in /etc/fstab.bak
# On PVE with ZFS root, the installer does not create swap by default.
# If you added swap yourself, verify it is NOT on ZFS:
swapon --show
# NAME TYPE SIZE USED PRIO
# /dev/sda2 partition 8G 0B -2 ← Good, this is a partition
I/O tuning for VMs — the properties that matter
volblocksize — the single most important setting
# 16K — general purpose VMs (Windows, Linux desktop/server)
# A 4K guest write rewrites one 16K block: 4:1 worst-case amplification
# Good balance of performance and compression ratio
zfs create -V 100G -s -o volblocksize=16K rpool/data/vm-100-disk-0
# 8K for PostgreSQL VMs (8K pages); MySQL/InnoDB has 16K pages, so use 16K there
# Minimal amplification for DB page-aligned I/O
zfs create -V 200G -s -o volblocksize=8K rpool/data/vm-200-disk-0
# 64K — bulk storage VMs (file servers, media servers)
# Better compression ratio, acceptable for sequential workloads
zfs create -V 500G -s -o volblocksize=64K rpool/data/vm-300-disk-0
sync — write ordering guarantees
# standard (default) — honor guest fsync() calls, safest option
zfs set sync=standard rpool/data/vm-100-disk-0
# disabled — ignore fsync(), all writes are async
# DANGEROUS: data loss on power failure. Only for throwaway test VMs.
zfs set sync=disabled rpool/data/vm-999-disk-0
# always — force every write to be synchronous
# Overkill for VMs. Only useful for NFS/iSCSI targets.
Never disable sync in production. If you need faster sync writes, add a SLOG device instead of disabling the safety net.
logbias — ZIL write optimization
# latency (default) — optimize ZIL for low-latency commits
# Best for general VM workloads, databases, anything that calls fsync()
zfs set logbias=latency rpool/data/vm-100-disk-0
# throughput — optimize ZIL for large sequential sync writes
# Use for VMs that do large sequential writes (e.g., video processing)
# Data blocks go straight to the pool; the ZIL logs only a pointer, so the SLOG is bypassed
zfs set logbias=throughput rpool/data/vm-300-disk-0
primarycache — what ARC caches
# all (default) — cache both data and metadata in ARC
zfs set primarycache=all rpool/data/vm-100-disk-0
# metadata — only cache metadata, not data blocks
# Use for VMs with huge working sets that thrash ARC
# (e.g., a 2TB database VM on a host with 16GB ARC)
zfs set primarycache=metadata rpool/data/vm-200-disk-0
SLOG for VM workloads
Virtual machines are one of the most SLOG-friendly workloads. Every guest OS issues
fsync() calls for journaling, package installs, database commits, and
log writes. Without a SLOG, each sync write must be committed to the ZIL on the data disks
themselves before the guest can continue, which on HDDs costs a seek per commit and competes with normal pool I/O. With a SLOG,
sync writes commit to the fast log device in microseconds and the VM continues immediately.
# Add a SLOG — mirrored enterprise NVMe with power loss protection
zpool add rpool log mirror /dev/nvme2n1p1 /dev/nvme3n1p1
# Verify SLOG is active
zpool status rpool
# ...
# logs
# mirror-1 ONLINE
# nvme2n1p1 ONLINE
# nvme3n1p1 ONLINE
# Check if your workload benefits from SLOG — look at sync write queue
zpool iostat -q rpool 5
# If the syncq_write column is consistently > 0, SLOG helps.
# SLOG sizing: 16-32GB is plenty. The ZIL only holds ~10 seconds
# of sync write data. At a 1GB/s sync write rate that is about 10GB, so 16GB covers it.
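The sizing rule in the comment above is rate times window. Here is a small sketch of that arithmetic, assuming the roughly 10 second window quoted there and decimal units (1 GB = 1000 MB). The function name `slog_size_gb` is made up for illustration.

```shell
#!/bin/sh
# slog_size_gb SYNC_RATE_MB_PER_S [WINDOW_SECONDS]
# Estimates the log space needed to hold WINDOW_SECONDS of sync writes,
# rounded up to the next whole GB. The window defaults to 10 seconds.
slog_size_gb() {
  rate_mb=$1
  window_s=${2:-10}
  echo $(( (rate_mb * window_s + 999) / 1000 ))
}

echo "1000 MB/s sync writes: $(slog_size_gb 1000) GB"
echo "250 MB/s sync writes: $(slog_size_gb 250) GB"
```

Most VM hosts sustain far less than 1 GB/s of sync writes, which is why a small partition on a fast device is enough.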
Special vdev for VM metadata
When your data pool is on HDDs, every metadata operation (block pointer lookups,
free space tracking, dedup tables) hits spinning rust. A mirrored SSD special vdev
accelerates all metadata operations for the entire pool. For VM workloads, this means
faster snapshot creation, faster zfs list, faster pool scrubs, and
faster replication delta calculation.
# Add a mirrored special vdev (MUST be mirrored — losing it kills the pool)
zpool add rpool special mirror /dev/sda /dev/sdb
# Store small blocks (64K and smaller) on the special vdev too
# Caution: a dataset whose recordsize is at or below this value puts ALL its data there
zfs set special_small_blocks=65536 rpool
# The special vdev is most effective when:
# - Data pool is on HDDs (the speed gap between HDD metadata and SSD is huge)
# - You run many VMs/CTs (more metadata operations)
# - You use heavy snapshot/replication workflows (snapshot metadata is on special vdev)
On an all-NVMe pool, a special vdev provides minimal benefit — the data vdevs are already fast enough for metadata. Special vdevs shine when there is a large speed gap between the data disks and the special vdev disks.
Use mirrors, not RAIDZ, for VMs
Why mirrors win for VM workloads
This is the most common mistake on Proxmox. RAIDZ has terrible random write performance. VMs generate random I/O. Mirrors scale random I/O linearly: each mirror pair serves requests independently. With 4 mirror pairs, you get 4x the IOPS of a single mirror. RAIDZ spreads every block and its parity across the disks of the vdev, so the whole vdev delivers roughly the random IOPS of a single disk.
# BAD for VMs — RAIDZ2 across 6 disks
# Every write touches all 6 disks. IOPS = ~1 disk worth.
# zpool create rpool raidz2 /dev/sd{a,b,c,d,e,f}
# GOOD for VMs — 3 mirror pairs
# Each pair handles I/O independently. IOPS = ~3 disks worth.
zpool create -o ashift=12 rpool \
mirror /dev/sda /dev/sdb \
mirror /dev/sdc /dev/sdd \
mirror /dev/sde /dev/sdf
Exception: if your Proxmox host primarily stores backups, ISOs, or media files (sequential, large-block workloads), RAIDZ2 is fine. Separate your VM storage (mirrors) from your bulk storage (RAIDZ) into different pools.
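The layout comparison comes down to counting vdevs. The sketch below applies the rule of thumb from this section: each vdev, whether a mirror pair or a whole RAIDZ group, contributes about one disk's worth of random write IOPS. It is a rough model, not a benchmark. The function name `pool_write_iops` and the 100 IOPS per-disk figure are assumptions for illustration.

```shell
#!/bin/sh
# pool_write_iops VDEVS PER_DISK_IOPS
# Rough estimate: random write IOPS scale with the number of vdevs,
# not with the number of disks.
pool_write_iops() {
  echo $(( $1 * $2 ))
}

disk_iops=100   # hypothetical figure for a single HDD
echo "6 disks as 3 mirror pairs: $(pool_write_iops 3 "$disk_iops") IOPS"
echo "6 disks as 1 RAIDZ2 vdev: $(pool_write_iops 1 "$disk_iops") IOPS"
```

Same six disks, three times the random write capacity, at the cost of usable space.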
PVE cluster with ZFS — Ceph vs ZFS replication
Proxmox supports two approaches to multi-node storage: Ceph (distributed object storage) and ZFS replication (zfs send/receive between nodes). They solve different problems.
| Feature | Ceph | ZFS Replication |
|---|---|---|
| Minimum nodes | 3 (recommended) | 2 |
| Failover speed | Seconds (automatic) | Minutes (manual or HA-managed) |
| Network requirement | 10GbE dedicated (25GbE recommended) | 1GbE sufficient |
| Storage overhead | 3x (3-replica) or 1.5x (erasure coding) | 2x (mirror on each node) |
| Complexity | High (MON, OSD, MDS daemons) | Low (just zfs send/receive) |
| RPO (data loss window) | 0 (synchronous replication) | Replication interval (1-15 min) |
| Write latency | Higher (network round-trip + 3 copies) | Local disk speed |
| Disk failure handling | Automatic rebalance across cluster | ZFS resilver on local node |
| Scale-out | Yes (add nodes/OSDs anytime) | Limited (each node is independent) |
| Best for | Large clusters (5+ nodes), live migration | Small clusters (2-3 nodes), homelabs |
With ZFS replication, each node keeps its VMs on local disks and ships changes to its peer with zfs send over SSH. You lose the zero-RPO guarantee, but you gain simplicity and
local-disk write performance. For homelabs and small businesses, this is the right tradeoff.
kldload gives you the same ZFS replication capability on bare metal with syncoid — no
PVE cluster required. Two kldload boxes with syncoid running on a cron job is a perfectly
functional replication setup.
Monitoring ZFS in PVE
The Proxmox GUI shows pool status (online/degraded/faulted) and basic space usage. That is about 10% of what you need to monitor. The rest requires the CLI.
# Pool health — the first command to run when anything seems wrong
zpool status rpool
# Pool I/O statistics — 5-second interval
zpool iostat rpool 5
# Per-vdev I/O breakdown (which disks are slow?)
zpool iostat -v rpool 5
# I/O queue depth (are writes queuing?)
zpool iostat -q rpool 5
# ARC hit rate since boot: should be > 80% for good performance
awk '$1=="hits"{h=$3} $1=="misses"{m=$3} END{printf "%.1f%%\n", 100*h/(h+m)}' /proc/spl/kstat/zfs/arcstats
# Raw counters:
grep -E '^(hits|misses) ' /proc/spl/kstat/zfs/arcstats
# ARC efficiency breakdown
arcstat 5
# Output columns: read hits miss hit% l2hits l2miss
# Pool fragmentation — high fragmentation = degraded write performance
zpool list -o name,frag,cap rpool
# Keep capacity below 80%. Above 80%, fragmentation accelerates.
# Scrub status — scrubs should run weekly
zpool status rpool | grep scan
# If the last scrub found errors, investigate immediately.
# ZFS I/O latency histogram (where is time being spent?)
zpool iostat -w rpool 5
Add zpool status checks to your monitoring stack (Prometheus + node_exporter
with the ZFS collector, or Zabbix with ZFS templates). Alert on: degraded vdevs,
scrub errors, capacity > 80%, ARC hit rate < 70%. The PVE GUI will not alert you on
most of these conditions.
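If you do not run a full monitoring stack yet, the thresholds above fit in a few lines of shell that a cron job can call. This is a minimal sketch: the function name `zfs_alerts` is made up, it covers only three of the listed conditions, and it expects you to pass in values you have already collected (pool health, capacity percent, ARC hit percent as whole numbers).

```shell
#!/bin/sh
# zfs_alerts HEALTH CAPACITY_PCT ARC_HIT_PCT
# Prints one line per crossed threshold; prints nothing when all is well.
zfs_alerts() {
  health=$1
  cap=$2
  hit=$3
  [ "$health" = "ONLINE" ] || echo "ALERT: pool health is $health"
  [ "$cap" -le 80 ] || echo "ALERT: capacity ${cap}% is above 80%"
  [ "$hit" -ge 70 ] || echo "ALERT: ARC hit rate ${hit}% is below 70%"
  return 0
}

zfs_alerts DEGRADED 85 65
```

Empty output means healthy, so the function drops straight into anything that mails on non-empty output.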
Common mistakes
Pool on a single disk
No redundancy. One disk failure = total data loss. The Proxmox installer allows this. Never do it for production. Always use at least a mirror.
No ECC RAM
ZFS checksums every block to detect corruption. But if RAM itself is corrupt, ZFS writes bad data with a valid checksum. ECC RAM is not optional for ZFS. Every hardware guide says this. Most people ignore it until they get silent corruption.
zfs_arc_max too high
ARC competes with QEMU for RAM. If arc_max is 50% on a 64GB host and you allocate 40GB to VMs, the host has negative free memory. OOM killer fires. VMs crash. Set arc_max conservatively and monitor.
Not setting ashift=12
Some disks (512e drives and many SSDs, especially behind USB enclosures) report 512-byte sectors.
ZFS then auto-detects ashift=9 and issues 512-byte writes to 4K hardware, forcing a read-modify-write inside the drive and cutting write throughput.
Cannot be fixed after creation. Always pass -o ashift=12 explicitly.
RAIDZ1 on large drives
A 16TB drive takes 12+ hours to resilver. During resilver, one more failure and the vdev is gone. RAIDZ2 minimum for drives over 2TB. See Pool Design.
Using qcow2 on ZFS
Double copy-on-write. Double snapshot tracking. Double thin provisioning overhead. Always use raw zvols. PVE does this by default — do not override it.
Filling the pool past 80%
ZFS performance degrades sharply above 80% capacity due to fragmentation and reduced CoW efficiency. At 90%+, writes can stall completely. Monitor capacity and expand or delete before reaching 80%.
Swap on ZFS
Memory pressure + swap-on-ZFS = deadlock. The system needs memory to write swap, but is swapping because it has no memory. Use a dedicated swap partition outside ZFS.
Migration between PVE nodes with ZFS
Proxmox supports live migration of VMs between cluster nodes. With ZFS, migration uses
zfs send/receive to transfer the disk, which is efficient but has caveats.
zfs send, then transfers dirty memory pages, then switches the VM. Requires the same pool name on both nodes. Downtime is typically 100-500ms.
# Migrate VM 100 from current node to pve2 (online)
qm migrate 100 pve2 --online --with-local-disks   # flag required when disks live on local storage
# Migrate offline (stops VM, transfers, starts on target)
qm migrate 100 pve2
# If migration is slow, check:
# 1. Network bandwidth between nodes (iperf3 -s / iperf3 -c pve2)
# 2. Whether replication is configured (pre-synced data = fast migration)
# 3. zvol size — a 2TB zvol without replication takes a while
# Manual zvol transfer (useful for moving to non-PVE ZFS hosts); send from a snapshot
zfs snapshot rpool/data/vm-100-disk-0@snap1
zfs send rpool/data/vm-100-disk-0@snap1 | ssh target zfs receive tank/data/vm-100-disk-0
# Incremental (much faster after initial sync):
zfs send -i @snap1 rpool/data/vm-100-disk-0@snap2 | \
ssh target zfs receive tank/data/vm-100-disk-0
PVE + ZFS vs PVE + Ceph vs PVE + LVM-thin
| Feature | ZFS | Ceph | LVM-thin |
|---|---|---|---|
| Snapshots | Instant, zero-cost | Instant (RBD snapshots) | Slow, degrades write perf |
| Compression | Built-in (lz4, zstd) | Optional (BlueStore inline, off by default) | None |
| Checksums | Per-block SHA256/fletcher | Per-object CRC | None |
| Self-healing | Auto-repair on read | Yes (PG repair) | No |
| Replication | zfs send/receive | Built-in (synchronous) | None (use DRBD) |
| Thin provisioning | Native (sparse zvols) | Native | Native (the point of LVM-thin) |
| IOPS (mirrors) | Excellent | Good (network overhead) | Excellent (direct disk) |
| Write latency | Low (local disk) | Higher (network + 3 writes) | Low (local disk) |
| Capacity efficiency | 50% (mirrors) | 33% (3-replica) or 67% (EC) | 100% (no redundancy) or 50% (DRBD) |
| Shared storage | No (per-node) | Yes (cluster-wide) | No (per-node) |
| Live migration | Yes (zfs send) | Yes (instant, shared storage) | Yes (block-level copy) |
| Complexity | Low | High | Very low |
| Minimum disks | 2 (mirror) | 3 nodes x 1+ OSD | 1 |
| Data integrity | Excellent (end-to-end) | Good | None |
Converting existing PVE from LVM-thin to ZFS
There is no in-place conversion. LVM-thin and ZFS are fundamentally different storage architectures. You must migrate VMs disk-by-disk. Here is the procedure:
- Add new disks and create a ZFS pool alongside the existing LVM-thin storage
- Register the ZFS pool in /etc/pve/storage.cfg
- For each VM: shut down, use qm move-disk to move each disk from LVM-thin to ZFS
- Boot the VM on ZFS storage, verify everything works
- Once all VMs are migrated, remove the LVM-thin storage
# Step 1: Create the ZFS pool (new disks)
zpool create -o ashift=12 \
-O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
vmpool mirror /dev/sdc /dev/sdd
zfs create vmpool/data
# Step 2: Register in storage.cfg
cat >> /etc/pve/storage.cfg << 'EOF'
zfspool: vmpool
    pool vmpool/data
    content images,rootdir
    sparse 1
    blocksize 16k
EOF
# Step 3: Move VM disks (VM must be stopped)
qm shutdown 100   # clean guest shutdown; qm stop is a hard power-off
qm move-disk 100 scsi0 vmpool --delete 1
# --delete 1 removes the source disk after successful move
# Step 4: Verify and start
qm start 100
qm agent 100 ping # check guest agent responds
# Repeat for each VM. For many VMs, script it:
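# Assumes every VM keeps its disk on scsi0; stops at the first failure
for vmid in 100 101 102 103 104; do
qm shutdown "$vmid" &&
qm move-disk "$vmid" scsi0 vmpool --delete 1 &&
qm start "$vmid" &&
echo "VM $vmid migrated to ZFS" || { echo "VM $vmid failed, stopping" >&2; break; }
done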
Run vzdump for every VM before moving disks. If the move fails or the
ZFS pool has issues, you need a way back. Do not skip this.
PVE ZFS boot — root on ZFS
The Proxmox installer supports installing the OS directly on ZFS (root-on-ZFS). This gives you ZFS benefits for the entire system: the OS, logs, container rootfs, and VM disks all live on ZFS. Boot environments, rollback, and system-level snapshots work.
- The installer creates rpool on the boot disk with the following layout: rpool/ROOT/pve-1 (the OS), rpool/data (VM/CT storage), and a small EFI partition for systemd-boot or GRUB.
- Newer versions boot with systemd-boot and the zfs-initramfs package. Older versions use GRUB. Both work but systemd-boot is simpler and faster.
- Snapshot rpool/ROOT/pve-1 before apt upgrade. If the upgrade breaks, roll back to the snapshot. This is the single best reason to run PVE on ZFS root.
# Snapshot before PVE upgrade
zfs snapshot rpool/ROOT/pve-1@before-upgrade-$(date +%Y%m%d)
# Upgrade PVE
apt update && apt dist-upgrade
# If something breaks, rollback:
zfs rollback rpool/ROOT/pve-1@before-upgrade-20260404
# Separate data pool for VMs (recommended — keep OS and data pools separate)
# During install, choose ZFS mirror for the boot disk
# After install, create a separate pool on dedicated disks:
zpool create -o ashift=12 \
-O compression=lz4 -O atime=off -O xattr=sa -O dnodesize=auto \
vmpool mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf
Performance expectations
Real-world numbers from tuned Proxmox ZFS hosts. These assume correct volblocksize, mirrored vdevs, and properly sized ARC. Your results depend on disk hardware.
| Configuration | Random 4K read IOPS | Random 4K write IOPS | Sequential read MB/s | Sequential write MB/s |
|---|---|---|---|---|
| 2x SATA SSD mirror | 40,000-80,000 | 20,000-40,000 | 500-550 | 400-500 |
| 2x NVMe mirror | 200,000-500,000 | 100,000-300,000 | 3,000-6,000 | 2,000-4,000 |
| 4x NVMe (2 mirror pairs) | 400,000-1,000,000 | 200,000-600,000 | 6,000-12,000 | 4,000-8,000 |
| 4x HDD mirror pairs | 300-600 | 200-400 | 300-500 | 200-400 |
| 4x HDD mirrors + SLOG | 300-600 | 200-400 (sync: 10,000+) | 300-500 | 200-400 |
| 4x HDD mirrors + special | 2,000-5,000 (metadata) | 200-400 | 300-500 | 200-400 |
| 6x HDD RAIDZ2 | 80-150 | 40-80 | 500-800 | 300-500 |
Note the massive difference between mirrors and RAIDZ for random I/O. Four HDD mirror pairs deliver 300-600 random read IOPS. A 6-disk RAIDZ2 of the same drives delivers 80-150. For VMs, that difference is the difference between "responsive" and "why is everything so slow."
Quick reference: Proxmox ZFS tuning
| Setting | Default | Recommended | Why |
|---|---|---|---|
| volblocksize | 16K (PVE 8) / 8K (older) | 16K (VMs) / 8K (DBs) | Match guest I/O pattern, reduce amplification |
| recordsize | 128K | 128K for CTs, 1M for backup datasets | Use zvols for VMs, not datasets |
| compression | on (lz4) | lz4 (or zstd for backup datasets) | Nearly free CPU cost, saves I/O bandwidth |
| atime | on | off | Eliminates useless access-time writes |
| xattr | on (sa in recent OpenZFS releases) | sa | Store extended attrs in the dnode, faster |
| dnodesize | legacy | auto | Larger dnodes for metadata-heavy workloads |
| zfs_arc_max | 50% RAM | See ARC table above | Leave room for QEMU VM memory |
| zfs_arc_min | adaptive | 25% of arc_max | Prevent ARC starvation under pressure |
| sync | standard | standard + SLOG | Never disable sync — add SLOG instead |
| logbias | latency | latency (VMs) / throughput (bulk) | Match write pattern |
| primarycache | all | all (most VMs) / metadata (huge DBs) | Control what ARC caches |
| sparse (storage.cfg) | 0 | 1 | Thin provision zvols, no perf penalty |
| VDEV layout | varies | Mirrors for VMs | RAIDZ kills random I/O |
| ashift | auto-detect | 12 (always specify) | Correct for all modern disks |
| special vdev | none | Mirrored SSDs (HDD pools) | Accelerates metadata for all VMs |
| SLOG | none | Enterprise NVMe with PLP | Accelerates sync writes from VMs |