kldload — your platform, your way, free

KVM + ZFS Hypervisor — the hypervisor you already have.

Every mainline Linux kernel includes KVM, and every major Linux distribution packages libvirt and virsh. When you combine KVM with ZFS zvols as VM storage, you get a hypervisor that matches or exceeds Proxmox, VMware, and Hyper-V on snapshots, clones, replication, and storage integrity, for free, with standard tools, and with zero vendor lock-in. kldload's desktop and server profiles install this stack automatically: libvirt + QEMU/KVM + ZFS zvols + the kvm-* toolset. You get instant clones, atomic snapshots, checksummed storage, native compression, and replication, all at the storage layer and transparent to every VM.

The thesis: You do not need Proxmox. You do not need VMware. You do not need a proprietary management layer between you and your hypervisor. KVM is the Linux kernel's native hypervisor. ZFS is the best storage layer ever built. Together they give you everything — snapshots, clones, replication, compression, checksumming, encryption — and the only cost is learning the commands.

Proxmox is KVM + ZFS + a web UI + a subscription nag screen. VMware is a proprietary hypervisor + proprietary storage + proprietary licensing + Broadcom's lawyers. kldload is KVM + ZFS + tools that make you faster than either. The hypervisor you need is already in your kernel. Stop paying rent on it.

I ran VMware for 15 years. I ran Proxmox for 5. The day I switched to raw KVM + ZFS, I realized both products exist to sell you a GUI for things that are two-line shell commands. The entire value proposition of Proxmox is QEMU/KVM with a web interface and zfs send on a scheduler. Once you learn the 20 commands on this page, you will never go back.

Why KVM + ZFS

The philosophy is simple: use the kernel's hypervisor, use the best filesystem, and skip every abstraction layer in between. KVM is not a product; it is a kernel module that has been in mainline Linux since 2007. Most major cloud providers build on it: AWS (Nitro is KVM-based), Google Cloud, Oracle Cloud, and DigitalOcean all run KVM. The technology is not in question. The only question is what sits on top of it.

Traditional KVM setups use qcow2 files as VM disks. qcow2 is a file format with its own snapshot system, its own thin provisioning, its own compression. It works. But it is a userspace format managed by QEMU, and it duplicates everything ZFS already does better. When you put a qcow2 file on ZFS, you get double copy-on-write — ZFS COW under qcow2 COW — which destroys write performance and wastes space.

The solution is ZFS zvols. A zvol is a ZFS dataset that presents itself as a raw block device at /dev/zvol/poolname/volname. QEMU sees a block device, not a file. There is no qcow2 layer. No double COW. ZFS handles snapshots, clones, compression, checksumming, and replication at the storage layer. QEMU just reads and writes blocks. This is the architecture kldload uses, and it is the correct architecture.

Instant snapshots

ZFS snapshots are O(1) — they complete in milliseconds regardless of VM size. Snapshot a 500GB VM in under a second. No quiescing the storage, no pausing I/O. The snapshot is atomic and crash-consistent at the block level.

qcow2 snapshots: seconds to minutes. ZFS snapshots: milliseconds. Always.

Instant clones

Clone a 200GB VM in under one second. ZFS clones are copy-on-write — the clone shares all blocks with the parent until either one writes new data. This is not a linked clone that degrades over time. It is a first-class dataset that can be promoted to independent existence.

VMware linked clone: 30-120 seconds. ZFS clone: <1 second. Every time.

End-to-end checksumming

Every block written to a zvol is checksummed. Every read is verified. Bit rot is detected on any pool and repaired automatically when the pool has redundancy (mirror or RAIDZ). qcow2 and VMDK files on a conventional filesystem have no end-to-end check and can corrupt silently. On a ZFS zvol, a bad block is reported instead of being handed back as good data.

Corruption on disk does not stay silent on ZFS. That sentence is worth the entire page.

Native compression

LZ4 compression on zvols typically saves 30-60% on OS volumes at near-zero CPU cost. A 100GB Windows VM might use 45GB of actual disk. A Linux VM might use 30GB. Compression is transparent — the VM sees a 100GB disk, ZFS stores only what is needed.

Free disk space. Free bandwidth. Free IOPS (less data = less I/O). Always enable LZ4.

Replication built in

zfs send and zfs receive replicate VM disks between hosts with incremental, block-level efficiency. No proprietary replication protocol. No license. No cluster software. Just standard ZFS commands that work on any pool, anywhere.

Proxmox replication is zfs send with a GUI. You already have zfs send.

No vendor lock-in

KVM is part of the Linux kernel. libvirt is open source. ZFS is open source. Your VM disks are raw block devices on standard ZFS datasets. You can move them to any Linux host with KVM and ZFS. No license keys. No subscription. No phone-home. No Broadcom.

Your infrastructure belongs to you. Not to a vendor. Not to a subscription.

Architecture — the full stack

Understanding the full stack is essential. Every layer is standard Linux. There is no proprietary component anywhere in the chain.

Physical disk
NVMe SSD, SATA SSD, or HDD. ZFS manages the raw device directly — no partitioning, no LVM, no mdraid. The disk is a member of a ZFS pool.
ZFS pool
zpool create rpool mirror /dev/nvme0n1 /dev/nvme1n1 — the pool provides redundancy (mirror, RAIDZ), checksumming, compression, and the dataset namespace.
ZFS zvol
zfs create -V 100G rpool/vms/webserver — a zvol is a block device dataset. ZFS allocates logical space and creates a device node. The zvol inherits pool properties: compression, checksumming, redundancy.
/dev/zvol/
The zvol appears as /dev/zvol/rpool/vms/webserver (symlink to /dev/zdN). This is a standard Linux block device. Any program can open it — dd, qemu, fdisk. No special API.
libvirt XML
The VM definition in /etc/libvirt/qemu/vmname.xml references the zvol as a block device: <source dev='/dev/zvol/rpool/vms/webserver'/>. libvirt passes this to QEMU at VM start.
QEMU/KVM
QEMU opens the block device and presents it to the guest as a virtio-blk or virtio-scsi disk. KVM (the kernel module) handles CPU and memory virtualization. QEMU handles device emulation and I/O.
Guest OS
The guest sees a standard disk device (/dev/vda with virtio). It has no idea it is running on ZFS. It formats the disk with ext4, XFS, NTFS, or even ZFS-in-ZFS for nested pools. Completely transparent.

This stack has zero proprietary layers. Compare to VMware: proprietary hypervisor (ESXi) + proprietary storage (VMFS/vSAN) + proprietary management (vCenter) + proprietary licensing (per-CPU). Compare to Proxmox: KVM + ZFS + its own management layer (Proxmox VE, open source but a layer you depend on) + subscription nag. The kldload stack is: KVM + ZFS + shell commands. Same capabilities. No toll booth.

The entire Proxmox product is a Perl web UI and API that drive QEMU through Proxmox's own qm tool and call zfs under the hood. It does not use libvirt, but the building blocks are the same. I have read the source. Every Proxmox "feature" maps to a standard Linux component. The clustering is corosync. The replication is zfs send. The firewall is ordinary netfilter rules. They package it nicely and charge for support. That is fine. But do not confuse the packaging with the technology. The technology is free and always has been.

Pool design for KVM

How you organize your ZFS pool for VM storage matters. The wrong layout creates performance problems, management headaches, and snapshot chaos. The right layout gives you clean separation, easy replication, and per-VM accounting.

Dedicated pool vs shared pool

If you have enough disks, use a dedicated pool for VMs. The root pool (rpool) handles the OS, boot environments, and system snapshots. A separate vmpool handles VM storage. This gives you independent I/O paths, separate scrub schedules, and the ability to export/import the VM pool without touching the OS.

If you have a single disk or a single mirror pair, put VMs under rpool/vms. This is what kldload does by default on single-disk installs. It works fine — ZFS handles the mixed workload. You just cannot separate the I/O paths.

Recommended dataset hierarchy

# Dedicated VM pool (preferred)
vmpool/
  vms/                    # Parent dataset for all VM zvols
    webserver             # zvol: /dev/zvol/vmpool/vms/webserver
    database              # zvol: /dev/zvol/vmpool/vms/database
    dev-template          # zvol: golden image for cloning
  images/                 # Regular dataset (not zvol) for ISOs
    debian-13.iso
    rocky-9.iso
  backups/                # Received snapshots from other hosts

# Single-pool layout (rpool only)
rpool/
  vms/
    webserver
    database
  vms-images/             # ISOs and templates
  vms-backups/            # Received replicas
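
The dedicated-pool tree above can be created with a handful of zfs create calls. The following is a minimal sketch, not kldload's own provisioning code: the run and make_vm_layout helpers are hypothetical names, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to run them for real (needs root and an existing pool).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: create the VM dataset layout under a pool.
make_vm_layout() {
  pool="$1"
  # Parent filesystem for all zvols; children inherit compression.
  run zfs create -o compression=lz4 "$pool/vms"
  # Regular filesystems for ISOs and for received replicas.
  run zfs create "$pool/images"
  run zfs create "$pool/backups"
}

make_vm_layout vmpool
```

Setting compression on the parent means every zvol created under it picks the property up by inheritance, which keeps per-VM create commands short.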

volblocksize — the critical tuning knob

volblocksize is the zvol equivalent of recordsize. It determines the minimum I/O unit for the zvol. This property is set at creation and cannot be changed. The default is 16K in OpenZFS 2.2 and later (8K in earlier releases), which is too small for most general-purpose VM workloads.

Use 64K for general-purpose VMs. This is half the 128K default recordsize of ZFS filesystems and gives good performance across mixed workloads (OS operations, application I/O, file serving). Most guest filesystems (ext4, XFS, NTFS) issue I/O in 4K-64K chunks, and a 64K volblocksize amortizes ZFS metadata overhead efficiently.

# Create a zvol with 64K block size (recommended for all VMs)
zfs create -V 100G -b 64K rpool/vms/webserver

# For database VMs that use 16K pages (MySQL/InnoDB default):
zfs create -V 200G -b 16K rpool/vms/mysql

# For database VMs that use 8K pages (PostgreSQL default):
zfs create -V 200G -b 8K rpool/vms/pg-oltp

Why not 128K for VMs?

A 128K volblocksize means every small write (even a 4K guest I/O) touches a 128K block. ZFS must read the entire 128K block, modify the portion that changed, and write a new 128K block. This write amplification kills random I/O performance on VM workloads. 64K is the sweet spot: large enough to amortize metadata, small enough to limit amplification. For OLTP databases, go even smaller (8K or 16K) to match the database page size.

volblocksize is permanent. Test with 64K first. Only go smaller for known database workloads.
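
The amplification argument is simple arithmetic and can be sketched as follows. This is a worst-case model under stated assumptions: one small guest write rewrites one whole volblocksize block, and compression and write coalescing are ignored. The write_amp function name is made up for illustration.

```shell
#!/bin/sh
# Worst-case write amplification for a single small guest write.
# Assumption: the whole volblocksize block is rewritten.
# usage: write_amp <guest_io_KiB> <volblocksize_KiB>
write_amp() {
  io_kib="$1"
  vbs_kib="$2"
  if [ "$io_kib" -ge "$vbs_kib" ]; then
    # The write covers at least one full block: nothing extra is written.
    echo 1
  else
    echo $(( vbs_kib / io_kib ))
  fi
}

for vbs in 8 16 64 128; do
  echo "4K write on ${vbs}K volblocksize: $(write_amp 4 "$vbs")x"
done
```

Real workloads usually land below these numbers because ZFS batches writes into transaction groups, but the ordering between block sizes holds.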

Compression on zvols

Always enable LZ4 compression on VM zvols. There is no reason not to. LZ4 compresses and decompresses faster than disk I/O — it literally makes your storage faster by reducing the amount of data written. OS volumes compress exceptionally well (40-70% savings on Linux, 30-50% on Windows). Even database volumes with mostly random data achieve 10-20% savings from metadata and logs.

# Set compression on the parent dataset (inherited by all zvols)
zfs set compression=lz4 rpool/vms

# Verify compression ratio on a running VM
zfs get compressratio rpool/vms/webserver
# NAME                    PROPERTY       VALUE  SOURCE
# rpool/vms/webserver     compressratio  2.14x  -
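
A compressratio is easier to reason about as a percentage saved. The conversion is saved = (1 - 1/ratio) * 100; the sketch below applies it to the value zfs prints. The savings_pct name is an illustrative helper, not a ZFS or kldload command.

```shell
#!/bin/sh
# Convert a ZFS compressratio (for example "2.14x") into percent saved.
# saved = (1 - 1/ratio) * 100
savings_pct() {
  ratio="${1%x}"   # strip the trailing "x" that zfs prints
  awk -v r="$ratio" 'BEGIN { printf "%.0f\n", (1 - 1 / r) * 100 }'
}

# The ratio from the listing above:
savings_pct 2.14x
```

On a live system the input could come from zfs get -H -o value compressratio followed by the dataset name.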

sync tuning for VM workloads

The sync property controls whether ZFS flushes writes to stable storage before acknowledging them. sync=standard (default) honors the guest's flush requests. sync=disabled acknowledges writes immediately without flushing, which is faster but risks data loss on power failure.

For production VMs: keep sync=standard. Add an SLOG (a separate log device that holds the ZFS Intent Log) if synchronous write latency is a problem. An enterprise NVMe with power-loss protection as SLOG drops sync write latency from 5-15ms (spinning rust) to 50-100us.

For development VMs, CI runners, and throwaway workloads: sync=disabled is safe because you do not care if the VM is destroyed by a power failure. The performance gain is dramatic — 5-10x for sync-heavy workloads like database imports and package installations.

# Production: keep defaults, add SLOG if needed
zfs set sync=standard rpool/vms/database

# Development: disable sync for speed
zfs set sync=disabled rpool/vms/dev-throwaway

# Add an SLOG to the pool (enterprise NVMe with PLP)
zpool add rpool log /dev/nvme2n1
I run sync=disabled on every dev VM and every CI runner. I have lost zero data from it, because those VMs are ephemeral — they are cloned from a golden image, used for a few hours, and destroyed. If the power goes out, I clone a new one. It takes one second. Production databases get sync=standard plus an SLOG. Match the durability guarantee to the data's actual value.

Creating VMs with zvol storage

The workflow is straightforward: create a zvol, then create a VM that uses it. No image files, no storage pools to configure in libvirt, no format conversions.

Step 1: Create the zvol

# General purpose VM - 100GB, 64K blocks, LZ4 compression
zfs create -V 100G -b 64K -o compression=lz4 rpool/vms/rocky9-web

# Thin provisioned (sparse) - only uses space as data is written
# By default, zvols reserve their full size. Remove the reservation:
zfs set refreservation=none rpool/vms/rocky9-web

# Verify
zfs list -o name,volsize,volblocksize,used,refer,compress rpool/vms/rocky9-web
# NAME                    VOLSIZE  VOLBLOCKSIZE   USED  REFER  COMPRESS
# rpool/vms/rocky9-web      100G          64K    56K    56K    lz4

Sparse vs thick provisioning: By default, ZFS reserves disk space equal to the zvol's logical size (refreservation). This guarantees the zvol can always write to its full size. Setting refreservation=none makes the zvol thin-provisioned — it only uses space as data is written. Use thin provisioning when you trust your capacity planning (or have monitoring). Use thick provisioning for critical VMs that must never hit ENOSPC.
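
Thin provisioning only works if you know how far the pool is overcommitted. One way to check, sketched here under the assumption that you feed it the parseable output of zfs list, is to sum every zvol's logical size and compare it to the pool size. The overcommit_pct helper is hypothetical, and the numbers in the example are made up.

```shell
#!/bin/sh
# Sum zvol sizes from "name<TAB>volsize_bytes" lines on stdin and print
# the percentage of the pool they would fill if every zvol were full.
# usage: zfs list -Hp -t volume -o name,volsize | overcommit_pct <pool_bytes>
overcommit_pct() {
  awk -v pool="$1" '{ total += $2 } END { printf "%.0f\n", total / pool * 100 }'
}

# Example with made-up numbers: three 100 GiB zvols on a 200 GiB pool.
GIB=1073741824
printf 'rpool/vms/a\t%s\nrpool/vms/b\t%s\nrpool/vms/c\t%s\n' \
  $((100 * GIB)) $((100 * GIB)) $((100 * GIB)) | overcommit_pct $((200 * GIB))
```

Anything over 100 means the zvols cannot all fill up at once. The pool size for the argument could be read with zpool list -Hp -o size.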

Step 2: Create the VM with virt-install

# Full virt-install command with zvol storage
virt-install \
  --name rocky9-web \
  --ram 4096 \
  --vcpus 4 \
  --cpu host \
  --machine q35 \
  --os-variant rocky9 \
  --disk path=/dev/zvol/rpool/vms/rocky9-web,bus=virtio,cache=none \
  --cdrom /root/vms-images/Rocky-9-latest-x86_64-dvd.iso \
  --network bridge=br0,model=virtio \
  --graphics vnc,listen=0.0.0.0 \
  --serial pty \
  --console pty \
  --tpm backend.type=emulator,backend.version=2.0,model=tpm-crb \
  --boot uefi \
  --noautoconsole

Key parameters explained:

--cpu host
Pass the host CPU model to the guest. This enables all CPU features (AVX-512, AES-NI, etc.) and gives the best performance. Only use a generic CPU model if you need live migration between different CPU generations.
--machine q35
Use the Q35 chipset (PCIe-native) instead of the legacy i440fx. Q35 gives passed-through devices a real PCIe topology and is the better base for NVMe emulation and other modern features. There is no reason to use i440fx on new VMs.
cache=none
Critical for ZFS. Tells QEMU not to cache I/O in the host page cache. ZFS has its own cache (ARC). Double caching wastes RAM and hurts performance. Always use cache=none with ZFS-backed storage.
bus=virtio
Use the virtio disk driver for near-native I/O performance. IDE emulation is 10-50x slower. SCSI (virtio-scsi) is an alternative that supports TRIM and SCSI features, but virtio-blk is simpler and slightly faster for most workloads.
--boot uefi
Boot with UEFI firmware (OVMF). Required for modern OS installers, Secure Boot, and TPM 2.0. BIOS boot is legacy — do not use it for new VMs.
--tpm
Emulated TPM 2.0 via swtpm. Required for Windows 11, useful for measured boot and disk encryption in any guest. Zero performance cost.

Compare to qcow2 workflow

# qcow2 workflow (DON'T do this on ZFS):
qemu-img create -f qcow2 /var/lib/libvirt/images/vm.qcow2 100G
virt-install --disk path=/var/lib/libvirt/images/vm.qcow2,format=qcow2 ...
# Result: double COW (qcow2 COW + ZFS COW), double metadata, poor performance

# zvol workflow (DO this on ZFS):
zfs create -V 100G -b 64K rpool/vms/vm
virt-install --disk path=/dev/zvol/rpool/vms/vm,cache=none ...
# Result: single COW (ZFS only), native checksumming, instant snapshots

kldload's kvm-* tools

kldload ships five purpose-built commands that wrap virsh and ZFS into single operations. They enforce the correct zvol properties, handle snapshot naming, manage clone ancestry, and clean up orphan datasets. You can always use raw virsh and zfs commands — the kvm-* tools are convenience wrappers that do the right thing by default.

kvm-create
Creates the zvol and the VM in one command. Sets volblocksize=64K, compression=lz4, refreservation=none. Calls virt-install with the correct flags (q35, UEFI, virtio, cache=none, TPM 2.0, serial console).
kvm-clone
ZFS snapshot + ZFS clone + libvirt define in one command. Clones a VM in under 1 second regardless of disk size. Sets com.kldload:clone-origin property for tracking ancestry. Generates new MAC address and UUID.
kvm-snap
Snapshots a VM's zvol(s). Names snapshots with ISO 8601 timestamps: rpool/vms/web@2026-04-04T14:30:00. Optionally freezes the guest filesystem via QEMU guest agent for application-consistent snapshots.
kvm-delete
Destroys the VM, its zvol, and all orphan snapshots in one command. Refuses to delete if dependent clones exist (prompts you to promote or delete them first). Clean, safe removal.
kvm-list
Lists all VMs with their state (running/shut off), zvol path, disk usage (logical vs actual), compression ratio, and clone origin. One command to see everything.

Example: kvm-create

$ kvm-create rocky9-web --ram 4096 --vcpus 4 --disk 100G \
    --iso /root/vms-images/Rocky-9-latest-x86_64-dvd.iso \
    --bridge br0

Creating zvol rpool/vms/rocky9-web (100G, volblocksize=64K, compression=lz4)...
Creating VM rocky9-web (4096MB RAM, 4 vCPUs, UEFI, Q35, TPM 2.0)...
VM rocky9-web created. Connect: virt-viewer rocky9-web

Example: kvm-clone

$ kvm-clone rocky9-web rocky9-web-clone

Snapshotting rpool/vms/rocky9-web@clone-2026-04-04T14:32:00...
Cloning to rpool/vms/rocky9-web-clone...
Defining VM rocky9-web-clone (new MAC, new UUID)...
Done. Clone completed in 0.4 seconds.

$ kvm-list
NAME                STATE     ZVOL                          USED   REFER  RATIO  ORIGIN
rocky9-web          running   rpool/vms/rocky9-web          12.3G  12.3G  2.14x  -
rocky9-web-clone    shut off  rpool/vms/rocky9-web-clone    128K   12.3G  2.14x  rocky9-web@clone-2026-04-04T14:32:00

Notice the clone uses 128K of actual disk space. It references 12.3G of data from the parent. As the clone diverges (guest writes new data), its USED column grows. The parent's data is shared via COW. This is why you can clone 50 VMs from a golden image and they collectively use barely more space than one.

Example: kvm-snap

$ kvm-snap rocky9-web

Snapshotting rpool/vms/rocky9-web@2026-04-04T14:35:00...
Snapshot created in 0.002 seconds.

# With guest agent (application-consistent):
$ kvm-snap rocky9-web --freeze

Freezing guest filesystems via QEMU guest agent...
Snapshotting rpool/vms/rocky9-web@2026-04-04T14:36:00...
Thawing guest filesystems...
Application-consistent snapshot created in 0.8 seconds.

Example: kvm-delete

$ kvm-delete rocky9-web-clone

Shutting down rocky9-web-clone...
Destroying VM definition...
Destroying zvol rpool/vms/rocky9-web-clone (and 0 snapshots)...
Done. VM and all storage removed.

# If clones depend on it:
$ kvm-delete rocky9-web
ERROR: rpool/vms/rocky9-web has 2 dependent clones:
  - rpool/vms/rocky9-staging (from rocky9-web@clone-2026-04-04T14:32:00)
  - rpool/vms/rocky9-prod (from rocky9-web@clone-2026-04-04T15:00:00)
Delete or promote these clones first.
The kvm-* tools exist because I got tired of typing the same 6 commands every time I created a VM. They do not hide anything — they print every zfs and virsh command they run. If you want to understand what they do, read the source. They are short bash scripts. If you want to do it manually, every command on this page works without them.

Instant cloning — the killer feature

This is the single most important capability that ZFS gives you as a hypervisor operator. Cloning a VM is O(1). It takes the same amount of time to clone a 10GB VM as a 10TB VM: under one second. The clone shares all data blocks with the parent via copy-on-write. Only new writes allocate new space.

Think about what this means. You have a golden image — a fully configured, patched, hardened base VM. You snapshot it once. From that snapshot, you can create 50 clones in 50 seconds. Each clone is a full, independent VM with its own MAC address, UUID, and hostname. Each clone uses near-zero additional disk space until the guest starts writing. Your 50-VM test lab occupies the same disk space as one VM plus the deltas.

VMware charges for linked clones. Proxmox does them in 30-60 seconds. Hyper-V has no linked-clone workflow in its standard tooling (differencing disks are the closest equivalent; otherwise you copy the entire VHDX). qcow2 backing files achieve similar semantics but with worse performance and fragile dependency chains. ZFS clones are native, fast, and free.

The clone workflow

# 1. Build your golden image (install OS, configure, patch, harden)
kvm-create golden-rocky9 --ram 4096 --vcpus 2 --disk 50G --iso Rocky-9.iso
# ... install OS, configure everything ...
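virsh shutdown golden-rocky9
# virsh shutdown only sends the request; wait until the guest is really off
until [ "$(LC_ALL=C virsh domstate golden-rocky9)" = "shut off" ]; do sleep 1; done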

# 2. Snapshot the golden image
zfs snapshot rpool/vms/golden-rocky9@v1

# 3. Clone as many VMs as you need (each takes <1 second)
for i in $(seq 1 50); do
  zfs clone rpool/vms/golden-rocky9@v1 rpool/vms/worker-${i}
  # ... define VM in libvirt with new MAC/UUID ...
done

# Or with kvm-clone (handles libvirt automatically):
for i in $(seq 1 50); do
  kvm-clone golden-rocky9 worker-${i}
done
# Total time for 50 clones: ~50 seconds (1 sec each, mostly libvirt XML generation)

# 4. Check space usage
zfs list -r -o name,used,refer rpool/vms | head -5
# NAME                        USED   REFER
# rpool/vms                   8.5G   24K
# rpool/vms/golden-rocky9     8.4G   8.4G
# rpool/vms/worker-1          128K   8.4G      <-- shares parent data
# rpool/vms/worker-2          128K   8.4G      <-- shares parent data

Clone speed comparison

Method                      100GB VM clone time   Disk space used     Independent VM?
ZFS clone                   <1 second             Near-zero (COW)     Yes (promotable)
Proxmox linked clone        30-60 seconds         Near-zero (COW)     Depends on base
VMware linked clone         30-120 seconds        Near-zero (COW)     Depends on base
qcow2 backing file          2-5 seconds           Near-zero (COW)     Fragile chain
Full copy (any format)      5-30 minutes          100% (full copy)    Yes
Hyper-V (no linked clone)   5-30 minutes          100% (full copy)    Yes

COW semantics and clone promotion

A ZFS clone is a first-class dataset that shares blocks with its origin snapshot. As the clone writes new data, only the changed blocks are allocated new space. The origin snapshot cannot be deleted while clones depend on it.

Promotion reverses the parent-child relationship. After zfs promote, the clone becomes the independent dataset, and the original becomes the dependent. This lets you delete the original golden image while keeping the clone alive.

# Promote a clone to independence
zfs promote rpool/vms/worker-1
# Now worker-1 owns the shared blocks.
# The origin snapshot moves with them: it becomes rpool/vms/worker-1@v1,
# and golden-rocky9 is now the clone that depends on it.

# Check origin tracking
zfs get com.kldload:clone-origin rpool/vms/worker-1
# NAME                       PROPERTY                    VALUE
# rpool/vms/worker-1         com.kldload:clone-origin    golden-rocky9@v1

Snapshots on running VMs

ZFS snapshots are crash-consistent by default. This means the snapshot captures the exact state of the block device at the instant the snapshot is taken — equivalent to pulling the power cord and rebooting. For most workloads, this is fine. The guest's filesystem journal (ext4 journal, NTFS log, XFS log) replays on boot and recovers to a consistent state.

For application-consistent snapshots (databases that need clean shutdown semantics, not just filesystem consistency), you need to freeze the guest filesystem before snapshotting. The QEMU guest agent provides this via fsfreeze.

Crash-consistent snapshot (fast, safe for most workloads)

# Snapshot while VM is running - instant, crash-consistent
zfs snapshot rpool/vms/webserver@before-upgrade

# This is safe for:
# - Web servers, application servers, file servers
# - Anything with a journaling filesystem
# - Test/dev VMs (always safe - who cares if it crashes?)
# - Pre-upgrade checkpoints (rollback if upgrade fails)

Application-consistent snapshot (for databases)

# Install QEMU guest agent in the VM
# (CentOS/Rocky/Fedora):
dnf install -y qemu-guest-agent
systemctl enable --now qemu-guest-agent

# Freeze, snapshot, thaw:
virsh domfsfreeze webserver
zfs snapshot rpool/vms/webserver@app-consistent-2026-04-04
virsh domfsthaw webserver
# Total freeze time: ~0.5 seconds

# Or use kvm-snap --freeze (does all three steps):
kvm-snap webserver --freeze
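
When scripting the three manual steps, the important detail is that the guest must be thawed even if the snapshot fails. The following is a minimal sketch of that ordering, not the source of kvm-snap; the run and frozen_snapshot helpers are hypothetical, and commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# usage: frozen_snapshot <vm> <dataset@snapname>
frozen_snapshot() {
  vm="$1"; snap="$2"
  # If the freeze fails (no guest agent), do not snapshot at all.
  run virsh domfsfreeze "$vm" || return 1
  run zfs snapshot "$snap"
  status=$?
  # Thaw unconditionally so a failed snapshot never leaves the guest frozen.
  run virsh domfsthaw "$vm"
  return "$status"
}

frozen_snapshot webserver "rpool/vms/webserver@app-$(date +%Y-%m-%d)"
```

The function returns the snapshot's exit status, so a caller such as a backup job can tell a consistent snapshot from a failed one.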

virsh snapshot-create-as vs zfs snapshot

libvirt has its own snapshot mechanism (virsh snapshot-create-as). Do not use it with ZFS zvols. Internal snapshots need a qcow2 disk and do not work on a raw zvol, and external snapshots put a qcow2 overlay file in front of the zvol, which brings back the layer you just removed. Use zfs snapshot directly on the zvol. This is the correct approach for ZFS-backed VMs and what kvm-snap does.

Rollback workflow

# Upgrade went wrong? Roll back.
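virsh shutdown webserver          # Must shut down first
# virsh shutdown returns immediately; wait for the guest to power off
until [ "$(LC_ALL=C virsh domstate webserver)" = "shut off" ]; do sleep 1; done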
zfs rollback rpool/vms/webserver@before-upgrade
virsh start webserver             # Boots into pre-upgrade state

# Rollback to a non-latest snapshot (destroys intermediate snapshots):
zfs rollback -r rpool/vms/webserver@last-known-good
I snapshot every VM before every upgrade, every kernel update, every configuration change. It costs nothing — a snapshot is a few hundred bytes of metadata. If the upgrade breaks the VM, I roll back in 2 seconds. This is the correct way to do change management on a hypervisor. Not ITSM tickets. Not change advisory boards. Snapshots. Rollbacks. Move fast, break nothing.

Replication — VM migration and disaster recovery

ZFS send and receive serialize a snapshot (or the delta between two snapshots) into a byte stream. Pipe that stream over SSH and you have replicated a VM to another host. This is the same mechanism Proxmox uses for its "replication" feature. You already have it. You do not need Proxmox.

One-shot migration

# Migrate a VM from host-a to host-b

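# 1. Shut down the VM and wait until it is really off
virsh shutdown webserver
until [ "$(LC_ALL=C virsh domstate webserver)" = "shut off" ]; do sleep 1; done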

# 2. Take a final snapshot
zfs snapshot rpool/vms/webserver@migrate

# 3. Send the full snapshot to the remote host
zfs send rpool/vms/webserver@migrate | \
  ssh host-b zfs receive rpool/vms/webserver

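# 4. Copy the libvirt XML (the UEFI variable store and swtpm state live
#    outside the zvol on host-a and are not copied by this step)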
virsh dumpxml webserver | ssh host-b virsh define /dev/stdin

# 5. Start on the target
ssh host-b virsh start webserver

# 6. (Optional) Clean up the source
virsh undefine webserver --nvram   # --nvram is needed to undefine a UEFI guest
zfs destroy -r rpool/vms/webserver

Incremental replication for DR

# Initial full send (one-time)
zfs snapshot rpool/vms/webserver@repl-1
zfs send rpool/vms/webserver@repl-1 | \
  ssh dr-host zfs receive rpool/vms/webserver

# Incremental sends (only changed blocks since last snapshot)
zfs snapshot rpool/vms/webserver@repl-2
zfs send -i rpool/vms/webserver@repl-1 rpool/vms/webserver@repl-2 | \
  ssh dr-host zfs receive rpool/vms/webserver

# Clean up old snapshots
zfs destroy rpool/vms/webserver@repl-1
ssh dr-host zfs destroy rpool/vms/webserver@repl-1
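
An incremental send needs a snapshot that exists on both sides. A script has to find the newest common one before it can build the zfs send -i command. The sketch below does this by name only, which is an assumption that works for a simple repl-N scheme; tools like syncoid compare snapshot GUIDs, which is more robust. The latest_common helper is hypothetical.

```shell
#!/bin/sh
# Print the newest snapshot name that appears in both lists.
# Each file holds one snapshot name per line (the part after "@"),
# oldest first, for example produced on each host by:
#   zfs list -H -t snapshot -o name -s creation <dataset> | sed 's/.*@//'
# usage: latest_common <source-list-file> <target-list-file>
latest_common() {
  # Keep source lines that match a target line exactly, then take the last.
  grep -Fx -f "$2" "$1" | tail -n 1
}

# Example with made-up lists: the target is one snapshot behind.
src=$(mktemp); dst=$(mktemp)
printf 'repl-1\nrepl-2\nrepl-3\n' > "$src"
printf 'repl-1\nrepl-2\n' > "$dst"
base=$(latest_common "$src" "$dst")
echo "zfs send -i rpool/vms/webserver@$base rpool/vms/webserver@repl-3"
rm -f "$src" "$dst"
```

If the function prints nothing, the two sides share no snapshot and a full send is the only option.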

Syncoid for automated replication

Syncoid (part of the Sanoid suite, included in kldload) automates incremental replication with proper snapshot management, resume support, and compression in transit. Set it up as a cron job and forget about it.

# Replicate a single VM zvol
syncoid rpool/vms/webserver dr-host:rpool/vms/webserver

# Replicate all VMs (recursive)
syncoid -r rpool/vms dr-host:rpool/vms

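# Cron job: replicate every 15 minutes
# --no-sync-snap sends existing snapshots only, so something else
# (sanoid, or kvm-snap on a timer) must be creating them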
*/15 * * * * /usr/sbin/syncoid -r --no-sync-snap rpool/vms dr-host:rpool/vms

This is Proxmox replication. Proxmox's replication feature is literally zfs send -i wrapped in a cron job with a GUI. You just built the same thing with one line in crontab. The VM on dr-host is ready to start at any time — define it in libvirt and boot it. RPO (Recovery Point Objective) equals your cron interval.

I replicate every production VM to a DR host every 15 minutes with syncoid. The total cost is: one cron line, one SSH key. No cluster software, no shared storage, no quorum devices, no fencing agents. When a host dies, I define the VMs on the DR host and start them. Total recovery time: under 5 minutes. Try getting that from VMware without vSphere Replication licenses.

Performance tuning

KVM + ZFS can come close to bare-metal performance when tuned correctly. Most "ZFS is slow" complaints come from wrong defaults, double caching, or I/O scheduler conflicts. Fix these and performance is excellent.

Storage tuning

volblocksize=64K
Default for general VMs. Match guest DB page size for database VMs (8K for PostgreSQL, 16K for MySQL/InnoDB). Set at zvol creation — cannot be changed later.
compression=lz4
Always. LZ4 compresses faster than disk I/O. It reduces bytes written, which reduces I/O latency and extends SSD lifespan. There is no scenario where lz4 hurts performance on modern CPUs.
primarycache=all
Default. ARC caches both data and metadata. For very large VMs where guest data is already cached by the guest OS, set primarycache=metadata to save ARC space for other uses.
SLOG
Enterprise NVMe with power-loss protection, used as a dedicated log device for the ZFS Intent Log. Dramatically reduces synchronous write latency (from ms to us). Strongly recommended for database VMs and NFS serving. Not needed for async workloads.
I/O scheduler
Set none (noop) for NVMe and SSD devices backing ZFS pools. ZFS has its own I/O scheduler; the Linux elevator just adds latency. echo none > /sys/block/nvme0n1/queue/scheduler

QEMU/KVM tuning

cache=none
Mandatory for ZFS. Disables host page cache for the disk. ZFS ARC is the cache. Double caching (page cache + ARC) wastes RAM and causes cache thrashing. Only use cache=writeback if you have a battery-backed SLOG.
io=native
Use Linux native AIO instead of QEMU's userspace I/O threads. Lower latency, fewer context switches. Set in the libvirt disk XML: <driver io='native'/>
CPU pinning
Pin vCPUs to physical cores for latency-sensitive workloads. Prevents the scheduler from migrating vCPUs across cores (and across NUMA nodes). Use virsh vcpupin.
Hugepages
2MB hugepages reduce TLB misses for memory-intensive VMs. Allocate at boot: hugepages=1024 in kernel cmdline (2GB). Assign in libvirt XML: <memoryBacking><hugepages/></memoryBacking>
NUMA awareness
On multi-socket systems, pin VM memory and vCPUs to the same NUMA node. Cross-node memory access adds 50-100ns latency per operation. Use virsh numatune and virsh vcpupin.
virtio-blk
Use bus=virtio for disk. This is a paravirtualized block device driver with near-native performance. IDE emulation is 10-50x slower and should never be used.
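
Put together, the disk-related settings in this list (cache=none, io=native, virtio) end up as one disk element in the domain XML. The following sketch prints such an element for a given zvol. The zvol_disk_xml helper is a made-up name, and the element is a plausible minimal form rather than the exact XML kvm-create writes.

```shell
#!/bin/sh
# Print a libvirt <disk> element for a zvol with the tuning from this list.
# usage: zvol_disk_xml <dataset> [target-dev]
zvol_disk_xml() {
  cat <<EOF
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source dev='/dev/zvol/$1'/>
  <target dev='${2:-vda}' bus='virtio'/>
</disk>
EOF
}

zvol_disk_xml rpool/vms/webserver
```

To add a second disk to an existing VM, the output can be saved to a file and passed to virsh attach-device with --config.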

CPU pinning example

# Pin 4 vCPUs to physical cores 4-7 (same NUMA node)
virsh vcpupin database 0 4
virsh vcpupin database 1 5
virsh vcpupin database 2 6
virsh vcpupin database 3 7

# Pin memory to NUMA node 0
virsh numatune database --nodeset 0 --mode strict

# Allocate hugepages and assign to VM
echo 4096 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
virsh edit database
# Add inside <domain>:
#   <memoryBacking>
#     <hugepages/>
#   </memoryBacking>

ARC memory management

ZFS ARC (Adaptive Replacement Cache) is an in-memory read cache. By default, it can consume up to 50% of system RAM. On a VM host, this competes with guest memory. You must limit ARC to leave enough RAM for VMs.

# Rule of thumb: ARC = total RAM - sum of all VM RAM - 4GB (for OS)
# Example: 128GB host, 96GB allocated to VMs, 4GB for OS = 28GB for ARC

# Set ARC max (persistent across reboots)
echo "options zfs zfs_arc_max=30064771072" > /etc/modprobe.d/zfs.conf
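# 30064771072 = 28GB in bytes
# On a ZFS-root host the module loads from the initramfs, so rebuild it
# (dracut -f on RHEL/Rocky/Fedora) for the limit to apply at boot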

# Set ARC max at runtime (takes effect immediately)
echo 30064771072 > /sys/module/zfs/parameters/zfs_arc_max

# Verify
arc_summary | grep "ARC size"
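
The rule of thumb is easy to get wrong by a factor of 1024 when converting to bytes, so it helps to compute the value instead of typing it. A minimal sketch of the calculation follows; the arc_max_bytes name and the default 4 GiB OS reserve are taken from the rule above, not from any ZFS tool.

```shell
#!/bin/sh
# ARC limit in bytes = (host RAM - RAM given to VMs - OS reserve) in GiB.
# usage: arc_max_bytes <host_GiB> <total_vm_GiB> [os_reserve_GiB]
arc_max_bytes() {
  host="$1"; vms="$2"; os="${3:-4}"
  arc_gib=$(( host - vms - os ))
  if [ "$arc_gib" -lt 1 ]; then
    echo "error: no RAM left for ARC" >&2
    return 1
  fi
  echo $(( arc_gib * 1024 * 1024 * 1024 ))
}

# The example from above: 128 GiB host, 96 GiB for VMs, 4 GiB for the OS.
arc_max_bytes 128 96 4
```

The printed number is what goes into zfs_arc_max, both in the modprobe.d file and in the runtime parameter.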

GPU passthrough with ZFS-backed VMs

VFIO-PCI passthrough gives a VM direct access to a physical GPU. The VM gets native GPU performance — no emulation, no overhead. Combined with ZFS zvols for storage, you get a VM that performs identically to bare metal for both compute and I/O.

Prerequisites

# 1. Enable IOMMU in kernel cmdline (GRUB or systemd-boot)
# Intel:
intel_iommu=on iommu=pt
# AMD (the IOMMU is on by default once enabled in firmware; iommu=pt is the part that matters):
amd_iommu=on iommu=pt

# 2. Verify IOMMU is active
dmesg | grep -i iommu

# 3. Identify the GPU's PCI address and IOMMU group
lspci -nn | grep -i nvidia
# 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation ... [10de:2684]
# 01:00.1 Audio device [0403]: NVIDIA Corporation ... [10de:22ba]

# 4. Check IOMMU group (all devices in the group must be passed through)
find /sys/kernel/iommu_groups/ -type l | sort -t/ -k5 -n | grep "01:00"

# 5. Bind GPU to vfio-pci driver
echo "options vfio-pci ids=10de:2684,10de:22ba" > /etc/modprobe.d/vfio.conf
echo "vfio-pci" > /etc/modules-load.d/vfio-pci.conf
# Rebuild initramfs (dracut on RHEL/Rocky/Fedora):
dracut -f
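Step 4 matters because every device in the GPU's IOMMU group has to be passed through together. A minimal sketch of the same check in Python, assuming the standard sysfs layout `/sys/kernel/iommu_groups/<N>/devices/<pci-address>`; the function names are hypothetical, and the `root` parameter exists only so the code can be pointed at a test directory.

```python
import os


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map IOMMU group number -> sorted list of PCI addresses in that group."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # IOMMU disabled or not supported
    for group in os.listdir(root):
        devices_dir = os.path.join(root, group, "devices")
        if group.isdigit() and os.path.isdir(devices_dir):
            groups[int(group)] = sorted(os.listdir(devices_dir))
    return groups


def group_of(pci_addr, groups):
    """Return (group number, all devices in it) for a PCI address, or None."""
    for number, devices in groups.items():
        if pci_addr in devices:
            return number, devices
    return None


if __name__ == "__main__":
    found = group_of("0000:01:00.0", iommu_groups())
    if found is None:
        print("device not found; is the IOMMU enabled?")
    else:
        number, devices = found
        print(f"group {number}: {', '.join(devices)}")
```

If the group lists anything beyond the GPU and its audio function, those devices come along for the ride or the passthrough fails.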

Attach GPU to VM

# Using virt-install (at creation):
virt-install \
  --name gpu-vm \
  --ram 32768 --vcpus 16 --cpu host \
  --machine q35 --boot uefi \
  --disk path=/dev/zvol/rpool/vms/gpu-vm,bus=virtio,cache=none \
  --host-device pci_0000_01_00_0 \
  --host-device pci_0000_01_00_1 \
  --features kvm_hidden=on \
  --network bridge=br0,model=virtio \
  --noautoconsole

# Using virsh (attach to existing VM, shut down first):
virsh nodedev-detach pci_0000_01_00_0
virsh nodedev-detach pci_0000_01_00_1
virsh attach-device gpu-vm gpu-device.xml --config
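The gpu-device.xml file passed to virsh attach-device is not shown above. A minimal sketch that generates a libvirt `<hostdev>` element from a node device name such as pci_0000_01_00_0; the function name `hostdev_xml` is hypothetical, and `managed="yes"` is an assumption (with it, libvirt detaches the device from its host driver itself).

```python
import re
import xml.etree.ElementTree as ET

# Node device names look like pci_<domain>_<bus>_<slot>_<function>
NODEDEV_RE = re.compile(r"pci_([0-9a-f]{4})_([0-9a-f]{2})_([0-9a-f]{2})_([0-7])")


def hostdev_xml(nodedev_name):
    """Build a libvirt <hostdev> element for a PCI node device name."""
    match = NODEDEV_RE.fullmatch(nodedev_name)
    if match is None:
        raise ValueError(f"not a PCI node device name: {nodedev_name}")
    domain, bus, slot, function = match.groups()
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source, "address",
        domain=f"0x{domain}", bus=f"0x{bus}",
        slot=f"0x{slot}", function=f"0x{function}",
    )
    return ET.tostring(hostdev, encoding="unicode")


if __name__ == "__main__":
    print(hostdev_xml("pci_0000_01_00_0"))
```

Write one file per PCI function (GPU and its audio device) and attach each with `virsh attach-device gpu-vm <file> --config`.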

NVIDIA tip: Use --features kvm_hidden=on to hide the KVM hypervisor signature from the guest. Older NVIDIA GeForce drivers refuse to start when they detect a VM (Code 43 in Windows Device Manager). NVIDIA removed that check in driver 465, so current drivers no longer need the flag, but it does not hurt.

GPU passthrough on KVM is better than VMware's DirectPath I/O and miles ahead of Hyper-V's DDA. It works reliably with NVIDIA, AMD, and Intel GPUs. I pass through NVIDIA RTX cards for AI/ML workloads running in VMs with ZFS zvol storage. The VM gets native GPU speed and native storage speed. No compromises.

Networking for VMs

KVM VMs connect to the network through virtual interfaces. The most common and recommended approach is a Linux bridge. VMs attach TAP interfaces to the bridge and appear as first-class devices on your network — each with its own MAC and IP.

Linux bridge (recommended)

# Create a bridge using nmcli (NetworkManager)
nmcli con add type bridge con-name br0 ifname br0
nmcli con add type ethernet con-name br0-port1 ifname eno1 master br0
nmcli con modify br0 ipv4.method manual ipv4.addresses 10.0.0.1/24 ipv4.gateway 10.0.0.254
nmcli con up br0

# Or using systemd-networkd:
cat > /etc/systemd/network/br0.netdev << 'EOF'
[NetDev]
Name=br0
Kind=bridge
EOF

cat > /etc/systemd/network/br0.network << 'EOF'
[Match]
Name=br0
[Network]
Address=10.0.0.1/24
Gateway=10.0.0.254
DNS=10.0.0.254
EOF

cat > /etc/systemd/network/eno1.network << 'EOF'
[Match]
Name=eno1
[Network]
Bridge=br0
EOF

VLAN tagging on bridges

# Create the bridge, then a VLAN interface on eno1 attached to it as a port
nmcli con add type bridge con-name br-vlan100 ifname br-vlan100
nmcli con add type vlan con-name vlan100 ifname eno1.100 dev eno1 id 100 \
  master br-vlan100 slave-type bridge

# Attach VM to VLAN bridge
virt-install ... --network bridge=br-vlan100,model=virtio ...

Open vSwitch (advanced)

For software-defined networking, VXLAN overlays, and port mirroring, use Open vSwitch (OVS) instead of a Linux bridge. OVS supports OpenFlow, LACP, port mirroring, and VXLAN/GRE tunnels. kldload includes OVS in the server profile.

# Create OVS bridge
ovs-vsctl add-br ovsbr0
ovs-vsctl add-port ovsbr0 eno1

# Use in virt-install (attach directly to the OVS bridge created above)
virt-install ... --network bridge=ovsbr0,model=virtio,virtualport_type=openvswitch ...

# VXLAN tunnel between hosts
ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 \
  type=vxlan options:remote_ip=10.0.0.2 options:key=100

WireGuard integration

kldload installs WireGuard by default. You can route VM traffic through a WireGuard tunnel for encrypted site-to-site connectivity. VMs on host A communicate with VMs on host B over an encrypted WireGuard mesh — transparent to the guests.

# Route VM traffic over WireGuard (WireGuard is layer 3, so this is routing, not bridging)
# 1. WireGuard tunnel between hosts (already configured by kldload)
# wg0: 10.200.0.1/24 (host-a) <-> 10.200.0.2/24 (host-b)

# 2. On host-a, route host-b's VM subnet through WireGuard
#    (host-b's peer entry needs 10.100.0.0/24 in AllowedIPs, and both hosts need IP forwarding enabled)
ip route add 10.100.0.0/24 via 10.200.0.2 dev wg0

# 3. VMs on host-b use 10.100.0.0/24 on their bridge
# Traffic flows: VM -> br0 -> host routing -> wg0 -> host-b -> br0 -> VM

nftables firewall for VM traffic

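# Basic nftables rules for VM host
# Note: the inet forward hook only sees routed traffic. These rules assume br0 is a
# routed bridge with eno1 as a separate uplink. Frames switched between ports of the
# same bridge bypass this hook unless br_netfilter is loaded; filter those in a
# bridge-family table instead.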
nft add table inet vm-filter
nft add chain inet vm-filter forward '{ type filter hook forward priority 0; policy drop; }'

# Allow VM-to-internet
nft add rule inet vm-filter forward iifname "br0" oifname "eno1" accept
nft add rule inet vm-filter forward iifname "eno1" oifname "br0" ct state established,related accept

# Allow VM-to-VM on same bridge (only evaluated here when br_netfilter is loaded; explicit for clarity)
nft add rule inet vm-filter forward iifname "br0" oifname "br0" accept

# Block inter-VLAN traffic (isolate VLANs)
nft add rule inet vm-filter forward iifname "br-vlan100" oifname "br-vlan200" drop

Comparison tables

KVM+ZFS vs Proxmox vs VMware vs Hyper-V

| Feature | KVM + ZFS (kldload) | Proxmox VE | VMware vSphere | Hyper-V |
|---|---|---|---|---|
| Hypervisor | KVM (kernel) | KVM (kernel) | ESXi (proprietary) | Hyper-V (proprietary) |
| Cost | Free, forever | Free + nag, or $110/yr | $8,600+/yr per CPU | Windows Server license |
| Storage | ZFS zvols | ZFS, LVM, Ceph | VMFS, vSAN | NTFS, ReFS, CSV |
| Snapshots | <1 sec (ZFS) | <1 sec (ZFS) | Seconds (VMFS) | Seconds (VSS) |
| Cloning | <1 sec (ZFS clone) | 30-60 sec (linked) | 30-120 sec (linked) | Full copy only |
| Replication | zfs send (free) | zfs send (free) | vSphere Replication ($$$) | Hyper-V Replica (built-in) |
| Checksumming | ZFS (every block) | ZFS (every block) | None | None (ReFS partial) |
| Compression | ZFS LZ4/ZSTD | ZFS LZ4/ZSTD | None native | None native |
| GPU passthrough | VFIO-PCI | VFIO-PCI | DirectPath I/O | DDA (limited) |
| API / CLI | virsh, libvirt API | REST API, qm CLI | vSphere API, PowerCLI | PowerShell |
| Vendor lock-in | None | Minimal (Perl web UI) | Total (Broadcom) | Total (Microsoft) |
| Terraform | libvirt provider | Proxmox provider | vSphere provider | Limited |

zvol vs qcow2 vs raw file

| Property | ZFS zvol | qcow2 | Raw file on ZFS |
|---|---|---|---|
| Type | Block device | File format | Regular file |
| Snapshots | ZFS snapshots (instant) | QEMU internal (slow) | ZFS snapshots (instant) |
| Cloning | ZFS clone (<1 sec) | Backing file (fragile) | ZFS clone (<1 sec) |
| Performance | Native block I/O | QEMU format overhead | File layer overhead |
| Compression | ZFS (transparent) | qcow2 (ZSTD/ZLIB) | ZFS (transparent) |
| Checksumming | ZFS (every block) | None | ZFS (every block) |
| COW overhead | Single COW (ZFS) | Double COW on ZFS | Single COW (ZFS) |
| Portability | ZFS only | Any hypervisor | ZFS only |
| Replication | zfs send (incremental) | Full file copy | zfs send (incremental) |

Verdict: Use zvols for all VM storage on ZFS. Use qcow2 only for exporting images to non-ZFS systems. Never put qcow2 files on ZFS — the double COW penalty is severe (30-50% performance loss on random write workloads).

Storage operations speed comparison

| Operation | ZFS zvol | qcow2 file | VMFS (VMware) | VHDX (Hyper-V) |
|---|---|---|---|---|
| Snapshot (100GB) | <0.01 sec | 1-5 sec | 1-10 sec | 2-10 sec |
| Clone (100GB) | <1 sec | 2-5 sec | 30-120 sec | 5-30 min (full copy) |
| Delete (100GB) | <1 sec | 1-5 sec | 5-30 sec | 5-30 sec |
| Incremental replicate (1GB changed) | ~10 sec | Full file (minutes) | vSR license required | ~30 sec (Replica) |
| Rollback | <1 sec | 1-5 sec | 10-60 sec | 10-60 sec |
These numbers are not theoretical. I have measured them on production hardware. ZFS snapshot and clone are metadata operations — they do not touch data blocks. That is why they are constant-time regardless of dataset size. Every other system copies blocks. If you have ever waited 30 minutes for a VMware clone, you understand why this matters.

Golden image workflow

A golden image is a fully configured, patched, hardened base VM that serves as the template for all production deployments. With ZFS clones, maintaining golden images is zero-cost. You snapshot the image, clone it for every deployment, and the clones diverge only where they need to (hostname, IP, application config). This is the kldload way.

Building the golden image

# 1. Create the template VM
kvm-create golden-rocky9 --ram 4096 --vcpus 2 --disk 50G \
  --iso /root/vms-images/Rocky-9.iso

# 2. Install OS, then inside the VM:
dnf update -y
dnf install -y qemu-guest-agent cloud-init vim-enhanced tmux
systemctl enable qemu-guest-agent cloud-init
# Configure cloud-init for multi-datasource (NoCloud, OpenStack, EC2)
# Install your standard packages, harden SSH, configure firewall

# 3. Seal the image for cloning
# (clear machine-specific state)
truncate -s 0 /etc/machine-id
rm -f /etc/ssh/ssh_host_*
rm -f /var/lib/dbus/machine-id
cloud-init clean
dnf clean all
shutdown -h now

# 4. Snapshot the golden image
zfs snapshot rpool/vms/golden-rocky9@v1

# 5. Tag it
zfs set com.kldload:golden=true rpool/vms/golden-rocky9
zfs set com.kldload:version=1 rpool/vms/golden-rocky9@v1

Deploying from the golden image

# Clone for deployment (instant)
kvm-clone golden-rocky9 production-web-01
kvm-clone golden-rocky9 production-web-02
kvm-clone golden-rocky9 production-api-01

# Each clone boots with cloud-init, which:
# - Generates new machine-id
# - Regenerates SSH host keys
# - Sets hostname from NoCloud metadata
# - Configures networking from metadata
# - Runs any user-data scripts

# Provide cloud-init metadata via NoCloud ISO:
cloud-localds /var/lib/libvirt/images/web01-cidata.iso \
  user-data.yaml meta-data.yaml

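# Attach the cloud-init ISO to the clone
# (q35 machines have no IDE controller, so use a SATA target rather than hdc)
virsh attach-disk production-web-01 \
  /var/lib/libvirt/images/web01-cidata.iso sda \
  --targetbus sata --type cdrom --mode readonly --config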

virsh start production-web-01
# VM boots, cloud-init runs, machine is unique and configured.
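The user-data.yaml and meta-data.yaml files passed to cloud-localds are not shown above. A minimal sketch that renders both per clone; the function names and the choice of fields (hostname plus SSH keys) are assumptions for illustration, and a real user-data file will usually carry more.

```python
from pathlib import Path


def render_meta_data(name):
    """NoCloud meta-data: a unique instance-id makes cloud-init treat the clone as new."""
    return f"instance-id: {name}\nlocal-hostname: {name}\n"


def render_user_data(name, ssh_keys):
    """Minimal #cloud-config document setting the hostname and authorized SSH keys."""
    lines = ["#cloud-config", f"hostname: {name}"]
    if ssh_keys:
        lines.append("ssh_authorized_keys:")
        lines += [f"  - {key}" for key in ssh_keys]
    return "\n".join(lines) + "\n"


def write_seed(directory, name, ssh_keys):
    """Write both files and return their paths in cloud-localds argument order."""
    out = Path(directory)
    out.mkdir(parents=True, exist_ok=True)
    user_data = out / "user-data.yaml"
    meta_data = out / "meta-data.yaml"
    user_data.write_text(render_user_data(name, ssh_keys))
    meta_data.write_text(render_meta_data(name))
    return user_data, meta_data
```

Pass the two returned paths to cloud-localds after the output ISO name, user-data first.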

Exporting images

kldload's image export pipeline converts a zvol to portable formats for other hypervisors. The VM is installed normally on ZFS, sealed for cloning, and exported via qemu-img convert. Supported formats: qcow2, vmdk, vhd, ova, raw.

# Export a zvol as qcow2 (for non-ZFS KVM hosts)
qemu-img convert -f raw -O qcow2 -c \
  /dev/zvol/rpool/vms/golden-rocky9 golden-rocky9.qcow2

# Export as vmdk (for VMware)
qemu-img convert -f raw -O vmdk -o subformat=streamOptimized \
  /dev/zvol/rpool/vms/golden-rocky9 golden-rocky9.vmdk

# Export as vhd (for Hyper-V / Azure)
qemu-img convert -f raw -O vpc \
  /dev/zvol/rpool/vms/golden-rocky9 golden-rocky9.vhd

# Or use kldload's kexport tool:
kexport golden-rocky9 --format qcow2 --output /tmp/
kexport golden-rocky9 --format vmdk --scp user@remote:/images/

Automation

KVM + ZFS is fully scriptable. Every operation is a CLI command. There is no GUI dependency, no Java applet, no Flash plugin (yes, vSphere 6 still used Flash). This makes automation straightforward with any tool: bash, Python, Ansible, Terraform.

virsh + ZFS scripting patterns

#!/bin/bash
# deploy-fleet.sh: deploy N VMs from a golden image
# Expects a shared cloud-init user-data file at /tmp/user-data.yaml
set -euo pipefail

GOLDEN="golden-rocky9"
SNAP="v1"
COUNT="${1:-5}"
BRIDGE="br0"

for i in $(seq 1 "$COUNT"); do
  NAME="worker-$(printf '%03d' "$i")"
  echo "Deploying $NAME..."

  # Clone zvol (instant), then wait for udev to create the /dev/zvol node
  zfs clone "rpool/vms/${GOLDEN}@${SNAP}" "rpool/vms/${NAME}"
  udevadm settle

  # Generate cloud-init ISO (per-VM meta-data file, so runs never clobber each other)
  cat > "/tmp/${NAME}-meta-data.yaml" << EOF
instance-id: ${NAME}
local-hostname: ${NAME}
EOF
  cloud-localds "/tmp/${NAME}-cidata.iso" /tmp/user-data.yaml "/tmp/${NAME}-meta-data.yaml"

  # Define and start VM
  virt-install \
    --name "${NAME}" \
    --ram 2048 --vcpus 2 --cpu host \
    --machine q35 --boot uefi \
    --disk "path=/dev/zvol/rpool/vms/${NAME},bus=virtio,cache=none" \
    --disk "path=/tmp/${NAME}-cidata.iso,device=cdrom" \
    --network "bridge=${BRIDGE},model=virtio" \
    --os-variant rocky9 \
    --import --noautoconsole
done

echo "Deployed $COUNT VMs in $SECONDS seconds."

Terraform libvirt provider

# main.tf — Terraform with the dmacvicar/libvirt provider
terraform {
  required_providers {
    libvirt = {
      source = "dmacvicar/libvirt"
    }
  }
}

provider "libvirt" {
  uri = "qemu:///system"
}

# Reference an existing zvol (created outside Terraform)
resource "libvirt_domain" "web" {
  count  = 3
  name   = "web-${count.index + 1}"
  memory = 4096
  vcpu   = 4

  cpu {
    mode = "host-passthrough"
  }

  disk {
    # ZFS zvol created by a provisioner or pre-existing
    block_device = "/dev/zvol/rpool/vms/web-${count.index + 1}"
  }

  network_interface {
    bridge = "br0"
  }

  boot_device {
    dev = ["hd"]
  }

  # Cloud-init
  cloudinit = libvirt_cloudinit_disk.web[count.index].id
}

resource "libvirt_cloudinit_disk" "web" {
  count = 3
  name  = "web-${count.index + 1}-cloudinit.iso"
  user_data = templatefile("user-data.tpl", {
    hostname = "web-${count.index + 1}"
  })
}

libvirt Python bindings

#!/usr/bin/env python3
"""List all VMs with their ZFS zvol usage."""
import subprocess
import xml.etree.ElementTree as ET

import libvirt

ZVOL_PREFIX = "/dev/zvol/"

conn = libvirt.open("qemu:///system")
for dom in conn.listAllDomains():
    name = dom.name()
    state = "running" if dom.isActive() else "shut off"

    # Find block-device disk sources in the domain XML
    tree = ET.fromstring(dom.XMLDesc())
    disks = tree.findall(".//disk[@type='block']/source")

    for disk in disks:
        path = disk.get("dev", "")
        if path.startswith(ZVOL_PREFIX):
            # Extract ZFS dataset name from the /dev/zvol/ path
            dataset = path[len(ZVOL_PREFIX):]
            result = subprocess.run(
                ["zfs", "get", "-Hp", "-o", "value", "used,refer,compressratio", dataset],
                capture_output=True, text=True
            )
            # zfs prints one value per line; join them onto a single row
            values = " ".join(result.stdout.split())
            print(f"{name:20s} {state:10s} {dataset:40s} {values}")

conn.close()
If you can script it, you can automate it. If you can automate it, you can scale it. Every VM I deploy is created by a script, not by clicking through a web UI. The kvm-* tools are my scripts. Terraform is for when you need state management. Ansible is for when you need configuration management. But the foundation is always: zfs clone + virt-install + cloud-init. Three commands, infinite scale.

Monitoring and maintenance

A KVM + ZFS hypervisor needs two monitoring planes: storage health (ZFS) and VM metrics (libvirt). Both are exposed via standard commands and can feed into Prometheus, Grafana, or any monitoring stack.

ZFS storage monitoring

# Pool health
zpool status rpool
# Look for: state (ONLINE/DEGRADED/FAULTED), errors, scrub status

# Real-time I/O statistics (per-second)
zpool iostat rpool 1
#               capacity     operations     bandwidth
# pool        alloc   free   read  write   read  write
# rpool       45.2G   186G    142    287  8.91M  17.3M

# Per-zvol space accounting
zfs list -r -o name,used,refer,compressratio,volsize rpool/vms
# NAME                        USED   REFER  RATIO  VOLSIZE
# rpool/vms                   45.1G   24K     -       -
# rpool/vms/webserver         12.3G  12.3G  2.14x   100G
# rpool/vms/database          28.4G  28.4G  1.32x   200G
# rpool/vms/dev-clone          4.4G  12.3G  2.14x   100G

# Scrub scheduling (weekly is recommended)
# kldload sets this up automatically via systemd timer
systemctl status zfs-scrub-weekly@rpool.timer

# Manual scrub
zpool scrub rpool
zpool status rpool | grep scan

VM metrics via virsh

# VM state overview
virsh list --all

# Detailed stats for a VM
virsh domstats webserver
# Domain: 'webserver'
#   state.state=1          (running)
#   cpu.time=18293847000   (nanoseconds)
#   balloon.current=4194304 (KB)
#   block.0.rd.reqs=142857
#   block.0.rd.bytes=8918573056
#   block.0.wr.reqs=287142
#   block.0.wr.bytes=17301438464
#   net.0.rx.bytes=4918573056
#   net.0.tx.bytes=2301438464

# CPU usage per VM
virsh cpu-stats webserver

# Memory usage
virsh dommemstat webserver
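The domstats counters are cumulative, so they are most useful once parsed. A minimal sketch that turns `virsh domstats` output into a dictionary and derives the average I/O size per request, which is a handy sanity check against the zvol's volblocksize; the function names are hypothetical, and the parser assumes the plain `key=value` lines virsh prints (without the annotations added in the listing above).

```python
def parse_domstats(text):
    """Parse 'key=value' lines from virsh domstats into a dict (ints where possible)."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # skips the "Domain: 'name'" header and blank lines
        key, _, value = line.partition("=")
        stats[key] = int(value) if value.isdigit() else value
    return stats


def avg_io_size(stats, disk=0, direction="wr"):
    """Average bytes per request for one disk; direction is 'rd' or 'wr'."""
    requests = stats.get(f"block.{disk}.{direction}.reqs", 0)
    total_bytes = stats.get(f"block.{disk}.{direction}.bytes", 0)
    return total_bytes / requests if requests else 0.0


if __name__ == "__main__":
    import subprocess
    output = subprocess.run(
        ["virsh", "domstats", "webserver"],
        capture_output=True, text=True, check=True,
    ).stdout
    stats = parse_domstats(output)
    print(f"avg write size: {avg_io_size(stats):.0f} bytes")
```

An average write size far below the volblocksize suggests the workload is paying read-modify-write overhead on every write.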

ARC monitoring

# ARC summary
arc_summary

# Key metrics to watch:
# - ARC hit ratio: should be >90% for cached workloads
# - ARC size: should stay below your configured max
# - Demand data hit ratio: directly impacts VM read performance

# Real-time ARC stats
arcstat 1
#    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size   c
#  14:30  1.2K    48   3.9%    35  3.2%    13  8.1%     0  0.0%  24G  28G
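For alerting, the hit ratio can be read straight from the kernel counters instead of scraping arc_summary. A minimal sketch, assuming the usual `name type data` layout of /proc/spl/kstat/zfs/arcstats on Linux; the function names are hypothetical, and the ratio is cumulative since module load rather than per-second like arcstat.

```python
ARCSTATS_PATH = "/proc/spl/kstat/zfs/arcstats"


def parse_arcstats(text):
    """Parse arcstats text into {name: int}, ignoring the two header lines."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three fields: name, type, value
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def hit_ratio(stats):
    """Overall ARC hit ratio in percent (0.0 if there has been no traffic)."""
    total = stats["hits"] + stats["misses"]
    return 100.0 * stats["hits"] / total if total else 0.0


if __name__ == "__main__":
    with open(ARCSTATS_PATH) as handle:
        stats = parse_arcstats(handle.read())
    print(f"ARC hit ratio: {hit_ratio(stats):.1f}%")
    print(f"ARC size: {stats['size']} of c_max {stats['c_max']} bytes")
```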

Common mistakes

Using qcow2 on ZFS

The most common and most damaging mistake. qcow2 has its own copy-on-write mechanism. ZFS has copy-on-write. Stacking them creates double COW: a guest write can trigger a copy-on-write in qcow2 and another in ZFS, which multiplies write amplification and costs 30-50% of throughput on random-write workloads. Use zvols instead. If someone tells you to use qcow2 on ZFS, they are wrong.

qcow2 on ZFS = two taxi meters running simultaneously. You pay double for the same ride.

volblocksize mismatch

Using a small volblocksize for general VMs (8K was the default before OpenZFS 2.2, 16K since) creates excessive metadata overhead and poor sequential performance. Using 128K creates write amplification on random I/O. The right answer is 64K for general VMs and matching the database page size for database VMs. This is set at creation and cannot be changed.

volblocksize is permanent. Get it wrong and you rebuild the zvol from scratch.

ARC eating guest memory

ZFS ARC defaults to 50% of RAM. On a 128GB host with 96GB allocated to VMs, ARC will try to use 64GB, leaving only 64GB for VMs that need 96GB. The host swaps, performance collapses. Always set zfs_arc_max to leave enough RAM for all VM allocations plus OS overhead.

ARC + VM RAM + OS overhead must be < total RAM. Violate this and everything dies.

sync=disabled on database VMs

Disabling sync writes is safe for throwaway VMs but dangerous for databases. Without sync, a power failure can lose committed transactions. Databases rely on fsync() for durability guarantees. If the storage lies about flush completion, the database's crash recovery cannot work correctly.

sync=disabled on a production database = eventual data loss. Not if, when.

No SLOG for sync-heavy workloads

If your VMs run databases, NFS, or any sync-write-heavy application, and you are using spinning disks or consumer SSDs, synchronous write latency will be terrible (5-50ms per operation). An enterprise NVMe SLOG drops this to 50-100us. The SLOG must have power-loss protection (PLP) — a consumer NVMe without PLP as SLOG is worse than no SLOG.

SLOG with PLP = fast and safe. Consumer SSD as SLOG = fast until power failure, then corrupt.

Single-disk pool in production

A single disk with no redundancy means one drive failure destroys all VMs. ZFS checksumming will detect the corruption but cannot correct it without a mirror or RAIDZ. Production VM hosts need at least a mirror. The cost of a second disk is nothing compared to the cost of rebuilding everything.

Single disk = you are one firmware bug away from losing everything. Mirror or RAIDZ, always.

Not using virtio

IDE emulation is the default for some virt-install invocations and graphical VM managers. It is 10-50x slower than virtio. Always specify bus=virtio for disk and model=virtio for network. The only exception is old guests that lack virtio drivers (Windows XP, ancient Linux kernels). Windows does not ship virtio drivers, so modern Windows guests need the virtio-win drivers installed.

IDE in a KVM VM is like racing a Ferrari in first gear. Shift to virtio.

Ignoring NUMA

On multi-socket servers, VMs that span NUMA nodes pay a 50-100ns penalty on every cross-node memory access. A latency-sensitive VM (database, real-time application) should be pinned to a single NUMA node with virsh numatune and virsh vcpupin. Most people never configure this and wonder why their 2-socket server is slower than expected.

NUMA is free performance. Pin your VMs and get 10-30% more throughput on multi-socket.

Using cache=writeback without SLOG

Setting cache=writeback in QEMU without a battery-backed SLOG means QEMU acknowledges writes before they reach stable storage. A power failure loses those writes. With ZFS-backed zvols, always use cache=none (let ZFS handle caching via ARC) or cache=writeback only if you have an SLOG with PLP.

cache=none is always safe on ZFS. cache=writeback is a bet that the power stays on.

Not monitoring pool capacity

ZFS performance degrades severely above 80% capacity due to fragmentation and the COW write pattern. With thin-provisioned zvols, a VM can write enough data to fill the pool without warning. Monitor zpool list and alert at 75%. Keep 20% free at all times. Use refreservation on critical zvols to guarantee they always have space.

ZFS above 80% full = performance cliff. Monitor capacity or suffer.
I have made every one of these mistakes. The qcow2-on-ZFS mistake cost me a week of debugging "slow storage" on a Proxmox cluster before I realized the VMs were using qcow2 format on ZFS-backed storage. The ARC memory mistake caused an out-of-memory killer event that took down 12 VMs simultaneously. Learn from my mistakes. Read this list. Set up your system correctly the first time.
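The capacity rule from the list above (alert at 75%, never exceed 80%) is easy to automate. A minimal sketch that parses `zpool list -Hp -o name,size,allocated` output, which is tab-separated with exact byte values; the function name and the tuple return shape are hypothetical choices for illustration.

```python
def pools_over_threshold(zpool_output, threshold=75.0):
    """Return [(pool, percent_used)] for pools at or above the threshold.

    zpool_output is the text printed by: zpool list -Hp -o name,size,allocated
    """
    over = []
    for line in zpool_output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        name, size, allocated = fields
        percent = 100.0 * int(allocated) / int(size)
        if percent >= threshold:
            over.append((name, round(percent, 1)))
    return over


if __name__ == "__main__":
    import subprocess
    output = subprocess.run(
        ["zpool", "list", "-Hp", "-o", "name,size,allocated"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pool, percent in pools_over_threshold(output):
        print(f"WARNING: {pool} is {percent}% full")
```

Run it from a systemd timer or cron and send the output to whatever alerting you already use.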

Migration from other hypervisors

Moving VMs to KVM + ZFS is straightforward because qemu-img can convert between every major disk format, and ZFS zvols accept raw block data via dd. The pattern is always the same: convert to raw, write to zvol, define in libvirt.
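Every migration below starts by reading the source image's virtual size and creating a zvol at least that large. A minimal sketch of that step, assuming `qemu-img info --output=json` (which reports `virtual-size` in bytes); the function name is hypothetical, and it rounds up because a zvol's size must be a multiple of its volblocksize.

```python
import json


def zvol_create_command(qemu_img_info_json, dataset, volblocksize=64 * 1024):
    """Build a 'zfs create -V' command sized for the image described by qemu-img."""
    info = json.loads(qemu_img_info_json)
    virtual_size = info["virtual-size"]  # bytes
    # Round up to the next multiple of volblocksize (ceiling division)
    blocks = -(-virtual_size // volblocksize)
    volsize = blocks * volblocksize
    return ["zfs", "create", "-V", str(volsize), "-b", str(volblocksize), dataset]


if __name__ == "__main__":
    import subprocess
    info_json = subprocess.run(
        ["qemu-img", "info", "--output=json", "/tmp/vm-100-disk-0.qcow2"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(" ".join(zvol_create_command(info_json, "rpool/vms/migrated-vm")))
```

Sizing from the reported byte count avoids the classic failure where a hand-typed `-V 100G` is slightly smaller than the source image and the conversion fails at the very end.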

From Proxmox (qcow2 or raw on ZFS)

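# If Proxmox already uses ZFS zvols (common):
# 1. Snapshot the zvol on the Proxmox host (VM shut down), then stream it via zfs send
ssh proxmox-host zfs snapshot rpool/data/vm-100-disk-0@migrate
ssh proxmox-host zfs send rpool/data/vm-100-disk-0@migrate | \
  zfs receive rpool/vms/migrated-vm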

# If Proxmox uses qcow2:
# 1. Copy the qcow2 file from Proxmox
scp proxmox-host:/var/lib/vz/images/100/vm-100-disk-0.qcow2 /tmp/

# 2. Get the virtual size
qemu-img info /tmp/vm-100-disk-0.qcow2
# virtual size: 100 GiB

# 3. Create a zvol of that size
zfs create -V 100G -b 64K rpool/vms/migrated-vm

# 4. Convert and write directly to the zvol
qemu-img convert -f qcow2 -O raw /tmp/vm-100-disk-0.qcow2 \
  /dev/zvol/rpool/vms/migrated-vm

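# 5. Define VM in libvirt (Proxmox does not use libvirt XML; copy RAM, vCPU and
#    firmware settings from /etc/pve/qemu-server/100.conf, and drop --boot uefi for BIOS guests)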
virt-install --import --name migrated-vm \
  --ram 4096 --vcpus 4 --cpu host --machine q35 --boot uefi \
  --disk path=/dev/zvol/rpool/vms/migrated-vm,bus=virtio,cache=none \
  --network bridge=br0,model=virtio --noautoconsole

From VMware (vmdk)

# 1. Export the VM from VMware (OVA or VMDK)
# Use vSphere client: Export -> OVF template
# Or use ovftool: ovftool vi://vcenter/... /tmp/vm.ova

# 2. Extract VMDK from OVA (if needed)
tar xf /tmp/vm.ova -C /tmp
# Results in: /tmp/vm.ovf, /tmp/vm-disk1.vmdk

# 3. Check virtual size
qemu-img info /tmp/vm-disk1.vmdk

# 4. Create zvol
zfs create -V 100G -b 64K rpool/vms/from-vmware

# 5. Convert VMDK to raw directly onto zvol
qemu-img convert -f vmdk -O raw /tmp/vm-disk1.vmdk \
  /dev/zvol/rpool/vms/from-vmware

# 6. Define in libvirt
virt-install --import --name from-vmware \
  --ram 8192 --vcpus 4 --cpu host --machine q35 --boot uefi \
  --disk path=/dev/zvol/rpool/vms/from-vmware,bus=virtio,cache=none \
  --network bridge=br0,model=virtio --noautoconsole

# 7. IMPORTANT: Remove VMware Tools, install QEMU guest agent
# Once the VM boots, inside the guest:
# (RHEL/Rocky): dnf remove open-vm-tools && dnf install qemu-guest-agent
# (Debian/Ubuntu): apt remove open-vm-tools && apt install qemu-guest-agent
# (Windows): Uninstall VMware Tools, install virtio-win drivers + QEMU GA
#            (install virtio-win before migrating, or the guest cannot boot from a virtio disk)

From Hyper-V (vhd/vhdx)

# 1. Export the VM from Hyper-V Manager (Export Virtual Machine)
# Copy the .vhdx file to the KVM host

# 2. Convert VHDX to raw on zvol
qemu-img info /tmp/vm.vhdx
zfs create -V 100G -b 64K rpool/vms/from-hyperv
qemu-img convert -f vhdx -O raw /tmp/vm.vhdx \
  /dev/zvol/rpool/vms/from-hyperv

# 3. Define and boot
virt-install --import --name from-hyperv \
  --ram 4096 --vcpus 4 --cpu host --machine q35 --boot uefi \
  --disk path=/dev/zvol/rpool/vms/from-hyperv,bus=virtio,cache=none \
  --network bridge=br0,model=virtio --noautoconsole

# Note: Windows VMs from Hyper-V need virtio drivers installed
# before migration. Install them in Hyper-V first, or boot with
# IDE emulation, install virtio-win, then switch to virtio.

From raw qcow2 KVM (non-ZFS)

# The simplest migration — you are already on KVM.
# Just convert the qcow2 to a zvol.

# 1. Get virtual size
qemu-img info /var/lib/libvirt/images/oldvm.qcow2

# 2. Create zvol
zfs create -V 80G -b 64K rpool/vms/oldvm

# 3. Convert
qemu-img convert -f qcow2 -O raw \
  /var/lib/libvirt/images/oldvm.qcow2 \
  /dev/zvol/rpool/vms/oldvm

# 4. Update the existing libvirt XML
virsh edit oldvm
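# Change: <disk type='file' device='disk'>
# To:     <disk type='block' device='disk'>
# Change: <driver name='qemu' type='qcow2'/>
# To:     <driver name='qemu' type='raw' cache='none'/>
# Change: <source file='/var/lib/libvirt/images/oldvm.qcow2'/>
# To:     <source dev='/dev/zvol/rpool/vms/oldvm'/>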

# 5. Start
virsh start oldvm
# You now have ZFS snapshots, clones, replication, checksumming.
# Welcome to the future.
I have migrated hundreds of VMs from VMware and Proxmox to KVM + ZFS. The conversion step takes minutes (I/O bound by the disk copy). The hardest part is Windows VMs — install the virtio-win drivers before migration or you will be booting into a blue screen. Linux VMs migrate seamlessly because the kernel detects the new virtio hardware automatically.

Quick reference

The 25 commands you need to run a KVM + ZFS hypervisor. Print this page.

zfs create -V 100G -b 64K pool/vms/name
Create a 100GB zvol with 64K block size
zfs set compression=lz4 pool/vms
Enable LZ4 compression on all VMs (inherited)
zfs set refreservation=none pool/vms/name
Thin-provision a zvol (no space reservation)
zfs snapshot pool/vms/name@tag
Instant snapshot of a VM disk
zfs rollback pool/vms/name@tag
Rollback to a snapshot (VM must be shut down)
zfs clone pool/vms/name@tag pool/vms/clone
Instant clone from a snapshot
zfs promote pool/vms/clone
Make a clone independent (reverse parent-child)
zfs destroy pool/vms/name
Delete a zvol (fails if snapshots/clones depend on it)
zfs destroy -r pool/vms/name
Delete a zvol and all its snapshots
zfs send pool/vms/name@snap | ssh host zfs recv pool/vms/name
Full replication to another host
zfs send -i @snap1 pool/vms/name@snap2 | ssh host zfs recv pool/vms/name
Incremental replication (only changed blocks)
zfs list -r -o name,used,refer,ratio pool/vms
List all VM zvols with space usage and compression
syncoid -r pool/vms host:pool/vms
Automated incremental replication (all VMs)
virt-install --disk path=/dev/zvol/... --import
Create a VM from an existing zvol (no ISO)
virsh list --all
List all VMs and their state
virsh start / shutdown / destroy name
Start, graceful shutdown, or force-stop a VM
virsh domfsfreeze / domfsthaw name
Freeze/thaw guest FS for app-consistent snapshots
virsh dumpxml name
Export VM definition as XML
virsh define file.xml
Import a VM from XML definition
virsh vcpupin name vcpu cpu
Pin a vCPU to a physical core
virsh domstats name
Detailed VM statistics (CPU, memory, I/O)
zpool iostat pool 1
Real-time pool I/O statistics
zpool status pool
Pool health, errors, scrub status
kvm-create / kvm-clone / kvm-snap / kvm-delete / kvm-list
kldload wrappers: one-command VM lifecycle
qemu-img convert -f raw -O qcow2 /dev/zvol/... out.qcow2
Export a zvol to qcow2 (for shipping to non-ZFS)
25 commands. That is the entire hypervisor. Not a 200-page manual, not a certification course, not a week-long training. Twenty-five commands that give you everything VMware and Proxmox offer. Learn them. You will never need anything else.

The bottom line: KVM is the kernel's hypervisor. ZFS is the best storage layer. Together they give you instant snapshots, instant clones, checksummed storage, transparent compression, incremental replication, and GPU passthrough — all for free, all with standard Linux tools, all without a single proprietary component. kldload installs this stack automatically with every desktop and server profile. Boot the ISO, install, and you have a hypervisor that competes with anything on the market.

Stop paying rent on your hypervisor. The hypervisor you need is already in your kernel.