KVM + ZFS Hypervisor — the hypervisor you already have.
Every mainline Linux kernel includes KVM. Every major Linux distribution packages libvirt and virsh.
When you combine KVM with ZFS zvols as VM storage, you get a hypervisor that matches or
exceeds Proxmox, VMware, and Hyper-V in every measurable dimension — for free, with
standard tools, and zero vendor lock-in. kldload's desktop and server profiles install
this stack automatically: libvirt + QEMU/KVM + ZFS zvols + the kvm-* toolset.
You get instant clones, atomic snapshots, checksummed storage, native compression, and
replication — all at the storage layer, transparent to every VM.
The thesis: You do not need Proxmox. You do not need VMware.
You do not need a proprietary management layer between you and your hypervisor.
KVM is the Linux kernel's native hypervisor. ZFS is the best storage layer ever built.
Together they give you everything — snapshots, clones, replication, compression,
checksumming, encryption — and the only cost is learning the commands.
Proxmox is KVM + ZFS + a web UI + a subscription nag screen. VMware is a proprietary
hypervisor + proprietary storage + proprietary licensing + Broadcom's lawyers.
kldload is KVM + ZFS + tools that make you faster than either. The hypervisor you need
is already in your kernel. Stop paying rent on it.
I ran VMware for 15 years. I ran Proxmox for 5.
The day I switched to raw KVM + ZFS, I realized both products exist to sell you a GUI
for things that are two-line shell commands. The entire value proposition of Proxmox is
virsh with a web interface and zfs send on a cron job.
Once you learn the 20 commands on this page, you will never go back.
Why KVM + ZFS
The philosophy is simple: use the kernel's hypervisor, use the best filesystem,
and skip every abstraction layer in between. KVM is not a product — it is
a kernel module. It has been in mainline Linux since 2007. Most major cloud providers
run KVM: AWS (Nitro is KVM-based), Google Cloud, Oracle Cloud, and DigitalOcean.
The technology is not in question. The only question is what sits on top of it.
Traditional KVM setups use qcow2 files as VM disks. qcow2 is a file format
with its own snapshot system, its own thin provisioning, its own compression. It works. But
it is a userspace format managed by QEMU, and it duplicates everything ZFS already does better.
When you put a qcow2 file on ZFS, you get double copy-on-write — ZFS COW under
qcow2 COW — which destroys write performance and wastes space.
The solution is ZFS zvols. A zvol is a ZFS dataset that presents itself as a
raw block device at /dev/zvol/poolname/volname. QEMU sees a block device, not a
file. There is no qcow2 layer. No double COW. ZFS handles snapshots, clones, compression,
checksumming, and replication at the storage layer. QEMU just reads and writes blocks. This
is the architecture kldload uses, and it is the correct architecture.
Instant snapshots
ZFS snapshots are O(1) — they complete in milliseconds regardless of VM size.
Snapshot a 500GB VM in under a second. No quiescing the storage, no pausing I/O.
The snapshot is atomic and crash-consistent at the block level.
qcow2 snapshots: seconds to minutes. ZFS snapshots: milliseconds. Always.
Instant clones
Clone a 200GB VM in under one second. ZFS clones are copy-on-write — the clone
shares all blocks with the parent until either one writes new data. This is not a linked
clone that degrades over time. It is a first-class dataset that can be promoted to
independent existence.
Every block written to a zvol is checksummed. Every read is verified. Bit rot is detected
on any pool and repaired automatically on pools with redundancy (mirror or RAIDZ).
Neither qcow2 nor VMDK does this on its own: both can silently corrupt. A zvol read returns
verified data or an I/O error, never quietly damaged blocks.
Silent on-disk corruption does not get past ZFS. That sentence is worth the entire page.
Native compression
LZ4 compression on zvols typically saves 30-60% on OS volumes at near-zero CPU cost.
A 100GB Windows VM might use 45GB of actual disk. A Linux VM might use 30GB. Compression
is transparent — the VM sees a 100GB disk, ZFS stores only what is needed.
Free disk space. Free bandwidth. Free IOPS (less data = less I/O). Always enable LZ4.
Replication built in
zfs send and zfs receive replicate VM disks between hosts with
incremental, block-level efficiency. No proprietary replication protocol. No license.
No cluster software. Just standard ZFS commands that work on any pool, anywhere.
Proxmox replication is zfs send with a GUI. You already have zfs send.
No vendor lock-in
KVM is part of the Linux kernel. libvirt is open source. ZFS is open source. Your VMs are
standard QEMU disk images on standard ZFS datasets. You can move them to any Linux host
with KVM and ZFS. No license keys. No subscription. No phone-home. No Broadcom.
Your infrastructure belongs to you. Not to a vendor. Not to a subscription.
Architecture — the full stack
Understanding the full stack is essential. Every layer is standard Linux. There is no
proprietary component anywhere in the chain.
Physical disk
NVMe SSD, SATA SSD, or HDD. ZFS manages the raw device directly — no partitioning, no LVM, no mdraid. The disk is a member of a ZFS pool.
ZFS pool
zpool create rpool mirror /dev/nvme0n1 /dev/nvme1n1 — the pool provides redundancy (mirror, RAIDZ), checksumming, compression, and the dataset namespace.
ZFS zvol
zfs create -V 100G rpool/vms/webserver — a zvol is a block device dataset. ZFS allocates logical space and creates a device node. The zvol inherits pool properties: compression, checksumming, redundancy.
/dev/zvol/
The zvol appears as /dev/zvol/rpool/vms/webserver (symlink to /dev/zdN). This is a standard Linux block device. Any program can open it — dd, qemu, fdisk. No special API.
libvirt XML
The VM definition in /etc/libvirt/qemu/vmname.xml references the zvol as a block device: <source dev='/dev/zvol/rpool/vms/webserver'/>. libvirt passes this to QEMU at VM start.
QEMU/KVM
QEMU opens the block device and presents it to the guest as a virtio-blk or virtio-scsi disk. KVM (the kernel module) handles CPU and memory virtualization. QEMU handles device emulation and I/O.
Guest OS
The guest sees a standard disk device (/dev/vda with virtio). It has no idea it is running on ZFS. It formats the disk with ext4, XFS, NTFS, or even ZFS-in-ZFS for nested pools. Completely transparent.
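Because every layer uses plain paths, moving between the libvirt view (a device path) and the ZFS view (a dataset name) is simple string handling. A minimal sketch, assuming the /dev/zvol/ naming described above; the helper names zvol_to_dataset and dataset_to_zvol are made up for illustration and are not kldload commands:

```shell
# Strip the /dev/zvol/ prefix to get the dataset name ZFS commands expect.
zvol_to_dataset() { printf '%s\n' "${1#/dev/zvol/}"; }

# Build the block device path that libvirt and QEMU expect.
dataset_to_zvol() { printf '/dev/zvol/%s\n' "$1"; }

# On a real host, "virsh domblklist webserver --details" lists the device paths.
zvol_to_dataset /dev/zvol/rpool/vms/webserver
dataset_to_zvol rpool/vms/webserver
```

The first call prints the dataset name you would hand to zfs snapshot; the second prints the path you would put in the libvirt disk definition.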
This stack has zero proprietary layers. Compare to VMware: proprietary hypervisor
(ESXi) + proprietary storage (VMFS/vSAN) + proprietary management (vCenter) + proprietary
licensing (per-CPU). Compare to Proxmox: KVM + ZFS + its own open-source management layer (Proxmox VE) +
subscription nag. The kldload stack is: KVM + ZFS + shell commands. Same capabilities.
No toll booth.
The entire Proxmox product is a Perl web UI that calls
QEMU, zfs, and its own qm tool under the hood (it does not use libvirt or virsh). I have read
the source. Every Proxmox "feature" maps to a standard Linux command. The clustering is
corosync. The replication is zfs send. The firewall is iptables or nftables. They package
it nicely and charge for support. That is fine. But do not confuse the packaging with the technology.
The technology is free and always has been.
Pool design for KVM
How you organize your ZFS pool for VM storage matters. The wrong layout creates performance
problems, management headaches, and snapshot chaos. The right layout gives you clean
separation, easy replication, and per-VM accounting.
Dedicated pool vs shared pool
If you have enough disks, use a dedicated pool for VMs. The root pool
(rpool) handles the OS, boot environments, and system snapshots. A separate
vmpool handles VM storage. This gives you independent I/O paths, separate
scrub schedules, and the ability to export/import the VM pool without touching the OS.
If you have a single disk or a single mirror pair, put VMs under rpool/vms.
This is what kldload does by default on single-disk installs. It works fine — ZFS
handles the mixed workload. You just cannot separate the I/O paths.
Recommended dataset hierarchy
# Dedicated VM pool (preferred)
vmpool/
  vms/                # Parent dataset for all VM zvols
    webserver         # zvol: /dev/zvol/vmpool/vms/webserver
    database          # zvol: /dev/zvol/vmpool/vms/database
    dev-template      # zvol: golden image for cloning
  images/             # Regular dataset (not zvol) for ISOs
    debian-13.iso
    rocky-9.iso
  backups/            # Received snapshots from other hosts

# Single-pool layout (rpool only)
rpool/
  vms/
    webserver
    database
  vms-images/         # ISOs and templates
  vms-backups/        # Received replicas
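The dedicated-pool layout above can be created with three zfs create calls. A minimal sketch that only prints the commands by default; the function name create_vm_layout and the RUN switch are illustrative assumptions, and setting compression on the parent follows the advice given later on this page:

```shell
RUN=${RUN-echo}   # dry run by default; invoke the script with RUN= to execute

# Create the parent datasets for zvols, ISOs, and received replicas.
create_vm_layout() {
    pool=$1
    $RUN zfs create -o compression=lz4 "$pool/vms"
    $RUN zfs create "$pool/images"
    $RUN zfs create "$pool/backups"
}

create_vm_layout vmpool
```

Zvols created under the vms dataset inherit its compression setting, so per-VM commands stay short.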
volblocksize — the critical tuning knob
volblocksize is the zvol equivalent of recordsize. It determines
the minimum I/O unit for the zvol. This property is set at creation and cannot be
changed. The default is 16K on OpenZFS 2.2 and later (8K on older releases), which is too small for most general-purpose VMs.
Use 64K for general-purpose VMs. This is half the default recordsize (128K) of ZFS
datasets and gives good performance across mixed workloads (OS operations, application
I/O, file serving). Most guest filesystems (ext4, XFS, NTFS) issue I/O in 4K-64K chunks,
and a 64K volblocksize amortizes ZFS metadata overhead efficiently.
# Create a zvol with 64K block size (recommended for all VMs)
zfs create -V 100G -b 64K rpool/vms/webserver
# For database VMs that use 16K pages (MySQL/InnoDB):
zfs create -V 200G -b 16K rpool/vms/mysql
# For database VMs that use 8K pages (PostgreSQL default):
zfs create -V 200G -b 8K rpool/vms/pg-oltp
Why not 128K for VMs?
A 128K volblocksize means every small write (even a 4K guest I/O) touches a 128K block.
ZFS must read the entire 128K block, modify the portion that changed, and write a new 128K
block. This write amplification kills random I/O performance on VM workloads.
64K is the sweet spot: large enough to amortize metadata, small enough to limit amplification.
For OLTP databases, go even smaller (8K or 16K) to match the database page size.
volblocksize is permanent. Test with 64K first. Only go smaller for known database workloads.
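The sizing advice and the amplification argument above reduce to a lookup and a division. A minimal sketch; the function names and workload labels are illustrative assumptions, and the amplification figure is the worst case for a single small write, not a benchmark:

```shell
# Suggest a volblocksize for a workload class (values taken from the text above).
suggest_volblocksize() {
    case $1 in
        general)      echo 64K ;;
        mysql|innodb) echo 16K ;;
        postgres|pg)  echo 8K ;;
        *) echo "unknown workload: $1" >&2; return 1 ;;
    esac
}

# Worst-case write amplification: ZFS rewrites one whole volblocksize block
# even when the guest changed only a small part of it.
write_amplification() {   # args: volblocksize in KiB, guest write size in KiB
    echo $(( $1 / $2 ))
}

suggest_volblocksize general
write_amplification 128 4   # 32
write_amplification 64 4    # 16
```

A 4K guest write costs 32x on a 128K zvol and 16x on a 64K zvol, which is the trade-off the 64K recommendation is balancing.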
Compression on zvols
Always enable LZ4 compression on VM zvols. There is no reason not to. LZ4
compresses and decompresses faster than disk I/O — it literally makes your storage
faster by reducing the amount of data written. OS volumes compress exceptionally well
(40-70% savings on Linux, 30-50% on Windows). Even database volumes with mostly random
data achieve 10-20% savings from metadata and logs.
# Set compression on the parent dataset (inherited by all zvols)
zfs set compression=lz4 rpool/vms
# Verify compression ratio on a running VM
zfs get compressratio rpool/vms/webserver
# NAME PROPERTY VALUE SOURCE
# rpool/vms/webserver compressratio 2.14x -
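A compressratio is easier to reason about as a percentage of space saved. A minimal sketch; space_saved_pct is a hypothetical helper, and it assumes the value has the "2.14x" shape shown above:

```shell
# Convert a compressratio such as "2.14x" into percent of space saved.
space_saved_pct() {
    ratio=${1%x}   # drop the trailing "x"
    awk -v r="$ratio" 'BEGIN { printf "%.1f\n", (1 - 1 / r) * 100 }'
}

# Real input would come from:
#   zfs get -H -o value compressratio rpool/vms/webserver
space_saved_pct 2.14x   # 53.3
```

A ratio of 2.14x means the zvol occupies a little under half of its logical size.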
sync tuning for VM workloads
The sync property controls whether ZFS flushes writes to stable storage before
acknowledging them. sync=standard (default) honors the guest's flush requests.
sync=disabled acknowledges writes immediately without flushing, which is faster
but risks data loss on power failure.
For production VMs: keep sync=standard. Add an SLOG (a separate log device that holds the ZFS Intent Log)
if synchronous write latency is a problem. An enterprise NVMe with power-loss protection
as SLOG drops sync write latency from 5-15ms (spinning rust) to 50-100us.
For development VMs, CI runners, and throwaway workloads: sync=disabled is safe because you do not care if the VM is destroyed by a
power failure. The performance gain is dramatic — 5-10x for sync-heavy workloads
like database imports and package installations.
# Production: keep defaults, add SLOG if needed
zfs set sync=standard rpool/vms/database
# Development: disable sync for speed
zfs set sync=disabled rpool/vms/dev-throwaway
# Add an SLOG to the pool (enterprise NVMe with PLP)
zpool add rpool log /dev/nvme2n1
I run sync=disabled on every dev VM and every
CI runner. I have lost zero data from it, because those VMs are ephemeral — they
are cloned from a golden image, used for a few hours, and destroyed. If the power goes
out, I clone a new one. It takes one second. Production databases get sync=standard
plus an SLOG. Match the durability guarantee to the data's actual value.
Creating VMs with zvol storage
The workflow is straightforward: create a zvol, then create a VM that uses it. No image
files, no storage pools to configure in libvirt, no format conversions.
Step 1: Create the zvol
# General purpose VM - 100GB, 64K blocks, LZ4 compression
zfs create -V 100G -b 64K -o compression=lz4 rpool/vms/rocky9-web
# Thin provisioned (sparse) - only uses space as data is written
# By default, zvols reserve their full size. Remove the reservation:
zfs set refreservation=none rpool/vms/rocky9-web
# Verify
zfs list -o name,volsize,volblocksize,used,refer,compress rpool/vms/rocky9-web
# NAME VOLSIZE VOLBLOCKSIZE USED REFER COMPRESS
# rpool/vms/rocky9-web 100G 64K 56K 56K lz4
Sparse vs thick provisioning: By default, ZFS reserves disk space equal to
the zvol's logical size (refreservation). This guarantees the zvol can always
write to its full size. Setting refreservation=none makes the zvol thin-provisioned
— it only uses space as data is written. Use thin provisioning when you trust your
capacity planning (or have monitoring). Use thick provisioning for critical VMs that must
never hit ENOSPC.
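Thin provisioning is only safe if you know how far you have overcommitted. A minimal monitoring sketch; overcommit_pct is a hypothetical helper, and comparing against the raw pool size is a simplification (on RAIDZ the usable capacity is lower):

```shell
# Sum zvol sizes (bytes, one per line on stdin) and print them as a
# percentage of pool capacity (bytes, first argument).
overcommit_pct() {
    awk -v cap="$1" '{ total += $1 } END { printf "%d\n", total * 100 / cap }'
}

# On a real host the inputs would come from:
#   zfs list -Hp -t volume -o volsize -r rpool/vms | overcommit_pct "$(zpool list -Hp -o size rpool)"
# Example: a 100 GiB and a 200 GiB zvol on a 200 GiB pool.
printf '%s\n' 107374182400 214748364800 | overcommit_pct 214748364800   # 150
```

Anything above 100 means the zvols could collectively ask for more space than the pool has, so alert well before the pool fills.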
Pass the host CPU model to the guest. This enables all CPU features (AVX-512, AES-NI, etc.) and gives the best performance. Only use a generic CPU model if you need live migration between different CPU generations.
--machine q35
Use the Q35 chipset (PCIe-native) instead of the legacy i440fx. Required for PCIe passthrough, NVMe emulation, and modern features. There is no reason to use i440fx on new VMs.
cache=none
Critical for ZFS. Tells QEMU not to cache I/O in the host page cache. ZFS has its own cache (ARC). Double caching wastes RAM and hurts performance. Always use cache=none with ZFS-backed storage.
bus=virtio
Use the virtio disk driver for near-native I/O performance. IDE emulation is 10-50x slower. SCSI (virtio-scsi) is an alternative that supports TRIM and SCSI features, but virtio-blk is simpler and slightly faster for most workloads.
--boot uefi
Boot with UEFI firmware (OVMF). Required for modern OS installers, Secure Boot, and TPM 2.0. BIOS boot is legacy — do not use it for new VMs.
--tpm
Emulated TPM 2.0 via swtpm. Required for Windows 11, useful for measured boot and disk encryption in any guest. Zero performance cost.
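Put together, the options above form one virt-install call. A minimal sketch that prints the command instead of running it. The function name, memory and vCPU sizes, the br0 bridge, the ISO path, and the exact spellings --cpu host-passthrough and the --tpm sub-options are assumptions for illustration; recent virt-install versions may also ask for an --osinfo value:

```shell
# Print (not run) a virt-install command that combines the options described above.
virt_install_cmd() {   # args: vm-name zvol-dataset iso-path
    printf '%s \\\n' \
        "virt-install --name $1" \
        "  --memory 4096 --vcpus 2" \
        "  --cpu host-passthrough" \
        "  --machine q35" \
        "  --boot uefi" \
        "  --tpm backend.type=emulator,backend.version=2.0,model=tpm-crb" \
        "  --disk path=/dev/zvol/$2,bus=virtio,cache=none" \
        "  --network bridge=br0,model=virtio"
    printf '  --cdrom %s\n' "$3"
}

virt_install_cmd rocky9-web rpool/vms/rocky9-web /var/lib/libvirt/images/Rocky-9.iso
```

Review the printed command, then paste it into a shell once the zvol exists.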
Compare to qcow2 workflow
# qcow2 workflow (DON'T do this on ZFS):
qemu-img create -f qcow2 /var/lib/libvirt/images/vm.qcow2 100G
virt-install --disk path=/var/lib/libvirt/images/vm.qcow2,format=qcow2 ...
# Result: double COW (qcow2 COW + ZFS COW), double metadata, poor performance
# zvol workflow (DO this on ZFS):
zfs create -V 100G -b 64K rpool/vms/vm
virt-install --disk path=/dev/zvol/rpool/vms/vm,cache=none ...
# Result: single COW (ZFS only), native checksumming, instant snapshots
kldload's kvm-* tools
kldload ships five purpose-built commands that wrap virsh and ZFS into single operations.
They enforce the correct zvol properties, handle snapshot naming, manage clone ancestry,
and clean up orphan datasets. You can always use raw virsh and zfs
commands — the kvm-* tools are convenience wrappers that do the right thing by default.
kvm-create
Creates the zvol and the VM in one command. Sets volblocksize=64K, compression=lz4, refreservation=none. Calls virt-install with the correct flags (q35, UEFI, virtio, cache=none, TPM 2.0, serial console).
kvm-clone
ZFS snapshot + ZFS clone + libvirt define in one command. Clones a VM in under 1 second regardless of disk size. Sets com.kldload:clone-origin property for tracking ancestry. Generates new MAC address and UUID.
kvm-snap
Snapshots a VM's zvol(s). Names snapshots with ISO 8601 timestamps: rpool/vms/web@2026-04-04T14:30:00. Optionally freezes the guest filesystem via QEMU guest agent for application-consistent snapshots.
kvm-delete
Destroys the VM, its zvol, and all orphan snapshots in one command. Refuses to delete if dependent clones exist (prompts you to promote or delete them first). Clean, safe removal.
kvm-list
Lists all VMs with their state (running/shut off), zvol path, disk usage (logical vs actual), compression ratio, and clone origin. One command to see everything.
Example: kvm-clone and kvm-list
$ kvm-clone rocky9-web rocky9-web-clone
Snapshotting rpool/vms/rocky9-web@clone-2026-04-04T14:32:00...
Cloning to rpool/vms/rocky9-web-clone...
Defining VM rocky9-web-clone (new MAC, new UUID)...
Done. Clone completed in 0.4 seconds.
$ kvm-list
NAME STATE ZVOL USED REFER RATIO ORIGIN
rocky9-web running rpool/vms/rocky9-web 12.3G 12.3G 2.14x -
rocky9-web-clone shut off rpool/vms/rocky9-web-clone 128K 12.3G 2.14x rocky9-web@clone-2026-04-04T14:32:00
Notice the clone uses 128K of actual disk space. It references 12.3G of data
from the parent. As the clone diverges (guest writes new data), its USED column
grows. The parent's data is shared via COW. This is why you can clone 50 VMs from a golden
image and they collectively use barely more space than one.
Example: kvm-snap
$ kvm-snap rocky9-web
Snapshotting rpool/vms/rocky9-web@2026-04-04T14:35:00...
Snapshot created in 0.002 seconds.
# With guest agent (application-consistent):
$ kvm-snap rocky9-web --freeze
Freezing guest filesystems via QEMU guest agent...
Snapshotting rpool/vms/rocky9-web@2026-04-04T14:36:00...
Thawing guest filesystems...
Application-consistent snapshot created in 0.8 seconds.
Example: kvm-delete
$ kvm-delete rocky9-web-clone
Shutting down rocky9-web-clone...
Destroying VM definition...
Destroying zvol rpool/vms/rocky9-web-clone (and 0 snapshots)...
Done. VM and all storage removed.
# If clones depend on it:
$ kvm-delete rocky9-web
ERROR: rpool/vms/rocky9-web has 2 dependent clones:
- rpool/vms/rocky9-staging (from rocky9-web@clone-2026-04-04T14:32:00)
- rpool/vms/rocky9-prod (from rocky9-web@clone-2026-04-04T15:00:00)
Delete or promote these clones first.
The kvm-* tools exist because I got tired of typing the same
6 commands every time I created a VM. They do not hide anything — they print every
zfs and virsh command they run. If you want to understand what
they do, read the source. They are short bash scripts. If you want to do it manually,
every command on this page works without them.
Instant cloning — the killer feature
This is the single most important capability that ZFS gives you as a hypervisor operator.
Cloning a VM is O(1). It takes the same amount of time to clone a 10GB VM
as a 10TB VM: under one second. The clone shares all data blocks with the parent via
copy-on-write. Only new writes allocate new space.
Think about what this means. You have a golden image — a fully
configured, patched, hardened base VM. You snapshot it once. From that snapshot, you
can create 50 clones in 50 seconds. Each clone is a full, independent VM with its own
MAC address, UUID, and hostname. Each clone uses near-zero additional disk space until
the guest starts writing. Your 50-VM test lab occupies the same disk space as one VM
plus the deltas.
VMware charges for linked clones. Proxmox does them in 30-60 seconds. Hyper-V does not
support them at all (you must copy the entire VHDX). qcow2 backing files achieve similar
semantics but with worse performance and fragile dependency chains. ZFS clones are native,
fast, and free.
The clone workflow
# 1. Build your golden image (install OS, configure, patch, harden)
kvm-create golden-rocky9 --ram 4096 --vcpus 2 --disk 50G --iso Rocky-9.iso
# ... install OS, configure everything ...
virsh shutdown golden-rocky9
# 2. Snapshot the golden image
zfs snapshot rpool/vms/golden-rocky9@v1
# 3. Clone as many VMs as you need (each takes <1 second)
for i in $(seq 1 50); do
    zfs clone rpool/vms/golden-rocky9@v1 rpool/vms/worker-${i}
    # ... define VM in libvirt with new MAC/UUID ...
done
# Or with kvm-clone (handles libvirt automatically):
for i in $(seq 1 50); do
    kvm-clone golden-rocky9 worker-${i}
done
# Total time for 50 clones: ~50 seconds (1 sec each, mostly libvirt XML generation)
# 4. Check space usage
zfs list -r -o name,used,refer rpool/vms | head -5
# NAME USED REFER
# rpool/vms 8.5G 24K
# rpool/vms/golden-rocky9 8.4G 8.4G
# rpool/vms/worker-1 128K 8.4G <-- shares parent data
# rpool/vms/worker-2 128K 8.4G <-- shares parent data
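The manual loop leaves the libvirt definition as an exercise. One way to do it is to rewrite the golden image's XML. A minimal sketch, not how kvm-clone is actually implemented: retarget_domain_xml is a hypothetical helper, it assumes the single-quoted attribute style that virsh dumpxml emits, and it does not handle UEFI NVRAM files or TPM state, which a real clone also needs to separate:

```shell
# Rewrite a libvirt domain XML (stdin) for a clone: new name and new disk path.
# The uuid and mac lines are dropped so libvirt generates fresh ones on define.
retarget_domain_xml() {   # args: parent-dataset source-name clone-name
    sed -e "s|<name>$2</name>|<name>$3</name>|" \
        -e '/<uuid>/d' \
        -e '/<mac address=/d' \
        -e "s|/dev/zvol/$1/$2'|/dev/zvol/$1/$3'|"
}

# On a real host (not executed here):
#   virsh dumpxml golden-rocky9 \
#     | retarget_domain_xml rpool/vms golden-rocky9 worker-1 \
#     | virsh define /dev/stdin
```

Run it once per clone inside the loop, right after the zfs clone line.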
Clone speed comparison
| Method | 100GB VM clone time | Disk space used | Independent VM? |
|---|---|---|---|
| ZFS clone | <1 second | Near-zero (COW) | Yes (promotable) |
| Proxmox linked clone | 30-60 seconds | Near-zero (COW) | Depends on base |
| VMware linked clone | 30-120 seconds | Near-zero (COW) | Depends on base |
| qcow2 backing file | 2-5 seconds | Near-zero (COW) | Fragile chain |
| Full copy (any format) | 5-30 minutes | 100% (full copy) | Yes |
| Hyper-V (no linked clone) | 5-30 minutes | 100% (full copy) | Yes |
COW semantics and clone promotion
A ZFS clone is a first-class dataset that shares blocks with its origin
snapshot. As the clone writes new data, only the changed blocks are allocated new space.
The origin snapshot cannot be deleted while clones depend on it.
Promotion reverses the parent-child relationship. After zfs promote,
the clone becomes the independent dataset, and the original becomes the dependent. This
lets you delete the original golden image while keeping the clone alive.
# Promote a clone to independence
zfs promote rpool/vms/worker-1
# Now worker-1 owns the shared blocks, and the snapshot moves with them (it becomes worker-1@v1).
# golden-rocky9 is now a clone of worker-1@v1, not the other way around.
# Check origin tracking
zfs get com.kldload:clone-origin rpool/vms/worker-1
# NAME PROPERTY VALUE
# rpool/vms/worker-1 com.kldload:clone-origin golden-rocky9@v1
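Before destroying a golden image it helps to know how many clones still hang off its snapshot. ZFS exposes this as the read-only clones property on the snapshot. A minimal sketch; count_clones is a hypothetical helper that only parses the property value:

```shell
# Count the entries in a snapshot's "clones" property value.
# An empty value or "-" means no clone depends on the snapshot.
count_clones() {
    case $1 in
        ''|-) echo 0 ;;
        *) printf '%s\n' "$1" | tr ',' '\n' | wc -l | tr -d ' ' ;;
    esac
}

# Real input would come from:
#   zfs get -H -o value clones rpool/vms/golden-rocky9@v1
count_clones "rpool/vms/worker-1,rpool/vms/worker-2"   # 2
```

A non-zero count means you must promote or destroy those clones before the snapshot can go.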
Snapshots on running VMs
ZFS snapshots are crash-consistent by default. This means the snapshot
captures the exact state of the block device at the instant the snapshot is taken —
equivalent to pulling the power cord and rebooting. For most workloads, this is fine.
The guest's filesystem journal (ext4 journal, NTFS log, XFS log) replays on boot and
recovers to a consistent state.
For application-consistent snapshots (databases that need clean shutdown
semantics, not just filesystem consistency), you need to freeze the guest filesystem
before snapshotting. The QEMU guest agent provides this via fsfreeze.
Crash-consistent snapshot (fast, safe for most workloads)
# Snapshot while VM is running - instant, crash-consistent
zfs snapshot rpool/vms/webserver@before-upgrade
# This is safe for:
# - Web servers, application servers, file servers
# - Anything with a journaling filesystem
# - Test/dev VMs (always safe - who cares if it crashes?)
# - Pre-upgrade checkpoints (rollback if upgrade fails)
Application-consistent snapshot (for databases)
# Install QEMU guest agent in the VM
# (CentOS/Rocky/Fedora):
dnf install -y qemu-guest-agent
systemctl enable --now qemu-guest-agent
# Freeze, snapshot, thaw:
virsh domfsfreeze webserver
zfs snapshot rpool/vms/webserver@app-consistent-2026-04-04
virsh domfsthaw webserver
# Total freeze time: ~0.5 seconds
# Or use kvm-snap --freeze (does all three steps):
kvm-snap webserver --freeze
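If the snapshot step fails between freeze and thaw, the guest stays frozen. A minimal sketch of the three steps with a guaranteed thaw; frozen_snapshot and the RUN switch are illustrative assumptions, not the kvm-snap source:

```shell
RUN=${RUN-echo}   # dry run by default; invoke the script with RUN= to execute

# Freeze the guest, snapshot the zvol, and thaw even if the snapshot failed.
frozen_snapshot() (
    vm=$1; snap=$2
    $RUN virsh domfsfreeze "$vm" || exit 1
    rc=0
    $RUN zfs snapshot "$snap" || rc=$?
    $RUN virsh domfsthaw "$vm" || rc=1
    exit "$rc"
)

frozen_snapshot webserver rpool/vms/webserver@app-consistent
```

The function returns non-zero if any step failed, but the thaw always runs once the freeze succeeded.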
virsh snapshot-create-as vs zfs snapshot
libvirt has its own snapshot mechanism (virsh snapshot-create-as). Do
not use it with ZFS zvols. Internal snapshots require qcow2 and are not supported on a raw
zvol, and external snapshots put a qcow2 overlay file on top of the zvol, which brings back
the double COW. Use zfs snapshot directly on the zvol. This is the correct
approach for ZFS-backed VMs and what kvm-snap does.
Rollback workflow
# Upgrade went wrong? Roll back.
virsh shutdown webserver # Must shut down first; returns immediately, so wait until virsh domstate reports "shut off"
zfs rollback rpool/vms/webserver@before-upgrade
virsh start webserver # Boots into pre-upgrade state
# Rollback to a non-latest snapshot (destroys intermediate snapshots):
zfs rollback -r rpool/vms/webserver@last-known-good
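virsh shutdown only asks the guest to power off and returns at once, so a rollback script should wait for the domain to stop. A minimal sketch; wait_for_shutoff and its 120-second default timeout are assumptions for illustration:

```shell
# Poll until a domain reports "shut off", or give up after a timeout in seconds.
wait_for_shutoff() {   # args: domain [timeout-seconds]
    remaining=${2:-120}
    while [ "$(virsh domstate "$1")" != "shut off" ]; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$(( remaining - 1 ))
    done
}

# Usage on a real host (not executed here):
#   virsh shutdown webserver && wait_for_shutoff webserver \
#     && zfs rollback rpool/vms/webserver@before-upgrade \
#     && virsh start webserver
```

The same wait belongs in front of the final snapshot in a migration.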
I snapshot every VM before every upgrade, every kernel
update, every configuration change. It costs nothing — a snapshot is a few hundred
bytes of metadata. If the upgrade breaks the VM, I roll back in 2 seconds. This is the
correct way to do change management on a hypervisor. Not ITSM tickets. Not change advisory
boards. Snapshots. Rollbacks. Move fast, break nothing.
Replication — VM migration and disaster recovery
ZFS send and receive serialize a snapshot (or the delta between
two snapshots) into a byte stream. Pipe that stream over SSH and you have replicated a VM
to another host. This is the same mechanism Proxmox uses for its "replication" feature.
You already have it. You do not need Proxmox.
One-shot migration
# Migrate a VM from host-a to host-b
# 1. Shut down the VM (wait until virsh domstate reports "shut off" before step 2)
virsh shutdown webserver
# 2. Take a final snapshot
zfs snapshot rpool/vms/webserver@migrate
# 3. Send the full snapshot to the remote host
zfs send rpool/vms/webserver@migrate | \
ssh host-b zfs receive rpool/vms/webserver
# 4. Copy the libvirt XML
virsh dumpxml webserver | ssh host-b virsh define /dev/stdin
# 5. Start on the target
ssh host-b virsh start webserver
# 6. (Optional) Clean up the source
virsh undefine webserver # add --nvram if the VM boots with UEFI
zfs destroy -r rpool/vms/webserver
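Once the target holds a full copy, later transfers only need the blocks that changed between two snapshots. A minimal sketch that prints the incremental pipeline; incremental_send_cmd is a hypothetical helper, and it assumes the target still has the earlier snapshot and has not been written to since:

```shell
# Print the pipeline that sends only the delta between two snapshots.
incremental_send_cmd() {   # args: dataset previous-snap new-snap target-host
    printf 'zfs send -i %s@%s %s@%s | ssh %s zfs receive %s\n' \
        "$1" "$2" "$1" "$3" "$4" "$1"
}

incremental_send_cmd rpool/vms/webserver migrate migrate-2 host-b
```

This lets you pre-seed a large VM while it is still running and keep the final offline transfer short.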
Syncoid (part of the Sanoid suite, included in kldload) automates incremental replication with
proper snapshot management, resume support, and compression in transit. Set it up as a
cron job and forget about it.
# Replicate a single VM zvol
syncoid rpool/vms/webserver dr-host:rpool/vms/webserver
# Replicate all VMs (recursive)
syncoid -r rpool/vms dr-host:rpool/vms
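# Cron job: replicate every 15 minutes
# (--no-sync-snap sends existing snapshots only, so sanoid or kvm-snap must be creating them on a schedule)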
*/15 * * * * /usr/sbin/syncoid -r --no-sync-snap rpool/vms dr-host:rpool/vms
This is Proxmox replication. Proxmox's replication feature is literally
zfs send -i wrapped in a cron job with a GUI. You just built the same thing
with one line in crontab. The VM on dr-host is ready to start at any time —
define it in libvirt and boot it. RPO (Recovery Point Objective) equals your cron interval.
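The RPO claim only holds while replication keeps succeeding, so check the age of the newest snapshot on the DR host. A minimal sketch; replica_lag and within_rpo are hypothetical helpers that work on epoch seconds:

```shell
# Seconds between "now" and the newest snapshot's creation time (both epoch seconds).
replica_lag() { echo $(( $1 - $2 )); }

# Succeeds when the lag is within the allowed RPO in seconds.
within_rpo() {   # args: now newest-creation rpo-seconds
    [ "$(replica_lag "$1" "$2")" -le "$3" ]
}

# On the DR host the newest creation time could be read with:
#   zfs list -Hp -d 1 -t snapshot -o creation -s creation rpool/vms/webserver | tail -n 1
within_rpo 1000900 1000000 900 && echo "replica is within RPO"
```

Wire the failure case into whatever alerting you already have; a silent cron job that stopped a week ago is worse than no replication.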
I replicate every production VM to a DR host every 15 minutes
with syncoid. The total cost is: one cron line, one SSH key. No cluster software, no shared
storage, no quorum devices, no fencing agents. When a host dies, I define the VMs on the DR
host and start them. Total recovery time: under 5 minutes. Try getting that from VMware
without vSphere Replication licenses.
Performance tuning
KVM + ZFS can match or exceed bare-metal performance when tuned correctly. Most "ZFS is slow"
complaints come from wrong defaults, double caching, or I/O scheduler conflicts. Fix these
and performance is excellent.
Storage tuning
volblocksize=64K
Default for general VMs. Match guest DB page size for database VMs (8K for PostgreSQL, 16K for MySQL/InnoDB). Set at zvol creation — cannot be changed later.
compression=lz4
Always. LZ4 compresses faster than disk I/O. It reduces bytes written, which reduces I/O latency and extends SSD lifespan. There is no scenario where lz4 hurts performance on modern CPUs.
primarycache=all
Default. ARC caches both data and metadata. For very large VMs where guest data is already cached by the guest OS, set primarycache=metadata to save ARC space for other uses.
SLOG
Enterprise NVMe with power-loss protection, used as ZFS Intent Log. Dramatically reduces synchronous write latency (from ms to us). Required for database VMs and NFS serving. Not needed for async workloads.
I/O scheduler
Set none (noop) for NVMe and SSD devices backing ZFS pools. ZFS has its own I/O scheduler; the Linux elevator just adds latency. echo none > /sys/block/nvme0n1/queue/scheduler
QEMU/KVM tuning
cache=none
Mandatory for ZFS. Disables host page cache for the disk. ZFS ARC is the cache. Double caching (page cache + ARC) wastes RAM and causes cache thrashing. Only use cache=writeback if you have a battery-backed SLOG.
io=native
Use Linux native AIO instead of QEMU's userspace I/O threads. Lower latency, fewer context switches. Set in the libvirt disk XML: <driver io='native'/>
CPU pinning
Pin vCPUs to physical cores for latency-sensitive workloads. Prevents the scheduler from migrating vCPUs across cores (and across NUMA nodes). Use virsh vcpupin.
Hugepages
2MB hugepages reduce TLB misses for memory-intensive VMs. Allocate at boot: hugepages=1024 in kernel cmdline (2GB). Assign in libvirt XML: <memoryBacking><hugepages/></memoryBacking>
NUMA awareness
On multi-socket systems, pin VM memory and vCPUs to the same NUMA node. Cross-node memory access adds 50-100ns latency per operation. Use virsh numatune and virsh vcpupin.
virtio-blk
Use bus=virtio for disk. This is a paravirtualized block device driver with near-native performance. IDE emulation is 10-50x slower and should never be used.
CPU pinning example
# Pin 4 vCPUs to physical cores 4-7 (same NUMA node)
virsh vcpupin database 0 4
virsh vcpupin database 1 5
virsh vcpupin database 2 6
virsh vcpupin database 3 7
# Pin memory to NUMA node 0
virsh numatune database --nodeset 0 --mode strict
# Allocate hugepages and assign to VM
echo 4096 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
virsh edit database
# Add inside <domain>:
#   <memoryBacking>
#     <hugepages/>
#   </memoryBacking>
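For VMs with many vCPUs, generating the pin commands is less error-prone than typing them. A minimal sketch; pin_cmds and hugepages_for are hypothetical helpers, and the one-to-one mapping onto consecutive cores assumes those cores sit on the same NUMA node:

```shell
# Print one vcpupin command per vCPU, mapping vCPU i to host core first+i.
pin_cmds() {   # args: domain first-core vcpu-count
    i=0
    while [ "$i" -lt "$3" ]; do
        echo "virsh vcpupin $1 $i $(( $2 + i ))"
        i=$(( i + 1 ))
    done
}

# Number of 2 MiB hugepages needed to back a guest with the given RAM in MiB.
hugepages_for() { echo $(( $1 / 2 )); }

pin_cmds database 4 4
hugepages_for 8192   # 4096
```

The first call reproduces the four vcpupin lines above; the second shows that the 4096 pages allocated above back an 8 GiB guest.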
ARC memory management
ZFS ARC (Adaptive Replacement Cache) is an in-memory read cache. By default, it can consume
up to 50% of system RAM. On a VM host, this competes with guest memory. You must
limit ARC to leave enough RAM for VMs.
# Rule of thumb: ARC = total RAM - sum of all VM RAM - 4GB (for OS)
# Example: 128GB host, 96GB allocated to VMs, 4GB for OS = 28GB for ARC
# Set ARC max (persistent across reboots)
echo "options zfs zfs_arc_max=30064771072" > /etc/modprobe.d/zfs.conf
# 30064771072 = 28 GiB in bytes
# If root is on ZFS, rebuild the initramfs afterwards (dracut -f) so the limit applies at boot
# Set ARC max at runtime (takes effect immediately)
echo 30064771072 > /sys/module/zfs/parameters/zfs_arc_max
# Verify
arc_summary | grep "ARC size"
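The rule of thumb is simple arithmetic, so compute the byte value instead of copying it. A minimal sketch; arc_max_bytes is a hypothetical helper that takes all three inputs in GiB:

```shell
# zfs_arc_max in bytes: total RAM minus VM RAM minus OS reserve, all in GiB.
arc_max_bytes() {   # args: total-gib vm-gib os-gib
    gib=$(( $1 - $2 - $3 ))
    [ "$gib" -gt 0 ] || { echo "no RAM left for ARC" >&2; return 1; }
    echo $(( gib * 1024 * 1024 * 1024 ))
}

arc_max_bytes 128 96 4   # 30064771072
```

The example reproduces the 128GB host above. If the function reports that no RAM is left, the host is overcommitted before ARC even enters the picture.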
GPU passthrough with ZFS-backed VMs
VFIO-PCI passthrough gives a VM direct access to a physical GPU. The VM gets native GPU
performance — no emulation, no overhead. Combined with ZFS zvols for storage, you
get a VM that performs identically to bare metal for both compute and I/O.
Prerequisites
# 1. Enable IOMMU in kernel cmdline (GRUB or systemd-boot)
# Intel:
intel_iommu=on iommu=pt
# AMD:
amd_iommu=on iommu=pt
# 2. Verify IOMMU is active
dmesg | grep -i iommu
# 3. Identify the GPU's PCI address and IOMMU group
lspci -nn | grep -i nvidia
# 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation ... [10de:2684]
# 01:00.1 Audio device [0403]: NVIDIA Corporation ... [10de:22ba]
# 4. Check IOMMU group (all devices in the group must be passed through)
find /sys/kernel/iommu_groups/ -type l | sort -t/ -k5 -n | grep "01:00"
# 5. Bind GPU to vfio-pci driver
echo "options vfio-pci ids=10de:2684,10de:22ba" > /etc/modprobe.d/vfio.conf
echo "vfio-pci" > /etc/modules-load.d/vfio-pci.conf
# Rebuild initramfs (dracut on RHEL/Rocky/Fedora):
dracut -f
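The vendor:device IDs for vfio.conf can be pulled straight out of the lspci output instead of being copied by hand. A minimal sketch; vfio_ids is a hypothetical helper that assumes the bracketed ID is the last xxxx:xxxx pair on each line:

```shell
# Extract vendor:device IDs from "lspci -nn" lines on stdin, joined with commas.
vfio_ids() {
    sed -n 's/.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1/p' | paste -sd, -
}

# Real input would come from: lspci -nn -s 01:00 | vfio_ids
printf '%s\n' \
    '01:00.0 VGA compatible controller [0300]: NVIDIA Corporation ... [10de:2684]' \
    '01:00.1 Audio device [0403]: NVIDIA Corporation ... [10de:22ba]' | vfio_ids
```

The printed list is what goes after ids= in the vfio-pci options line.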
NVIDIA tip: Use --features kvm_hidden=on to hide the KVM
hypervisor signature from the NVIDIA driver. Older NVIDIA drivers refuse to run inside a
detected VM (Error 43). This flag tells the driver it is running on bare metal. Modern
drivers (535+) no longer need this, but it does not hurt.
GPU passthrough on KVM is better than VMware's DirectPath
I/O and miles ahead of Hyper-V's DDA. It works reliably with NVIDIA, AMD, and Intel GPUs.
I pass through NVIDIA RTX cards for AI/ML workloads running in VMs with ZFS zvol storage.
The VM gets native GPU speed and native storage speed. No compromises.
Networking for VMs
KVM VMs connect to the network through virtual interfaces. The most common and recommended
approach is a Linux bridge. VMs attach TAP interfaces to the bridge and
appear as first-class devices on your network — each with its own MAC and IP.
Linux bridge (recommended)
# Create a bridge using nmcli (NetworkManager)
nmcli con add type bridge con-name br0 ifname br0
nmcli con add type ethernet con-name br0-port1 ifname eno1 master br0
nmcli con modify br0 ipv4.method manual ipv4.addresses 10.0.0.1/24 ipv4.gateway 10.0.0.254
nmcli con up br0
# Or using systemd-networkd:
cat > /etc/systemd/network/br0.netdev << 'EOF'
[NetDev]
Name=br0
Kind=bridge
EOF
cat > /etc/systemd/network/br0.network << 'EOF'
[Match]
Name=br0
[Network]
Address=10.0.0.1/24
Gateway=10.0.0.254
DNS=10.0.0.254
EOF
cat > /etc/systemd/network/eno1.network << 'EOF'
[Match]
Name=eno1
[Network]
Bridge=br0
EOF
VLAN tagging on bridges
# Create the bridge, then a VLAN interface attached to it as a port
nmcli con add type bridge con-name br-vlan100 ifname br-vlan100
nmcli con add type vlan con-name vlan100 ifname eno1.100 dev eno1 id 100 master br-vlan100 slave-type bridge
# Attach VM to VLAN bridge
virt-install ... --network bridge=br-vlan100,model=virtio ...
Open vSwitch (advanced)
For software-defined networking, VXLAN overlays, and port mirroring, use Open vSwitch (OVS)
instead of a Linux bridge. OVS supports OpenFlow, LACP, port mirroring, and VXLAN/GRE tunnels.
kldload includes OVS in the server profile.
# Create OVS bridge
ovs-vsctl add-br ovsbr0
ovs-vsctl add-port ovsbr0 eno1
# Use in virt-install (attach directly to the OVS bridge created above)
virt-install ... --network bridge=ovsbr0,model=virtio,virtualport_type=openvswitch ...
# VXLAN tunnel between hosts
ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 \
type=vxlan options:remote_ip=10.0.0.2 options:key=100
WireGuard integration
kldload installs WireGuard by default. You can route VM traffic through a WireGuard tunnel
for encrypted site-to-site connectivity. VMs on host A communicate with VMs on host B
over an encrypted WireGuard mesh — transparent to the guests.
# Route VM traffic over WireGuard
# 1. WireGuard tunnel between hosts (already configured by kldload)
# wg0: 10.200.0.1/24 (host-a) <-> 10.200.0.2/24 (host-b)
# 2. On host-a, route host-b's VM subnet through WireGuard
# (host-b's peer entry on host-a must include 10.100.0.0/24 in AllowedIPs)
ip route add 10.100.0.0/24 via 10.200.0.2 dev wg0
# 3. VMs on host-b use 10.100.0.0/24 on their bridge
# Traffic flows: VM -> br0 -> host routing -> wg0 -> host-b -> br0 -> VM
Verdict: Use zvols for all VM storage on ZFS. Use qcow2 only for exporting
images to non-ZFS systems. Never put qcow2 files on ZFS — the double COW penalty is
severe (30-50% performance loss on random write workloads).
Storage operations speed comparison
| Operation | ZFS zvol | qcow2 file | VMFS (VMware) | VHDX (Hyper-V) |
| --- | --- | --- | --- | --- |
| Snapshot (100GB) | <0.01 sec | 1-5 sec | 1-10 sec | 2-10 sec |
| Clone (100GB) | <1 sec | 2-5 sec | 30-120 sec | 5-30 min (full copy) |
| Delete (100GB) | <1 sec | 1-5 sec | 5-30 sec | 5-30 sec |
| Incremental replicate (1GB changed) | ~10 sec | Full file (minutes) | vSR license required | ~30 sec (Replica) |
| Rollback | <1 sec | 1-5 sec | 10-60 sec | 10-60 sec |
These numbers are not theoretical. I have measured them on
production hardware. ZFS snapshot and clone are metadata operations — they do not
touch data blocks. That is why they are constant-time regardless of dataset size. Every
other system copies blocks. If you have ever waited 30 minutes for a VMware clone, you
understand why this matters.
Golden image workflow
A golden image is a fully configured, patched, hardened base VM that serves as the template
for all production deployments. With ZFS clones, maintaining golden images is zero-cost.
You snapshot the image, clone it for every deployment, and the clones diverge only where
they need to (hostname, IP, application config). This is the kldload way.
Building the golden image
# 1. Create the template VM
kvm-create golden-rocky9 --ram 4096 --vcpus 2 --disk 50G \
--iso /root/vms-images/Rocky-9.iso
# 2. Install OS, then inside the VM:
dnf update -y
dnf install -y qemu-guest-agent cloud-init vim-enhanced tmux
systemctl enable qemu-guest-agent cloud-init
# Configure cloud-init for multi-datasource (NoCloud, OpenStack, EC2)
# Install your standard packages, harden SSH, configure firewall
# 3. Seal the image for cloning
# (clear machine-specific state)
truncate -s 0 /etc/machine-id
rm -f /etc/ssh/ssh_host_*
rm -f /var/lib/dbus/machine-id
cloud-init clean
dnf clean all
shutdown -h now
# 4. Snapshot the golden image
zfs snapshot rpool/vms/golden-rocky9@v1
# 5. Tag it
zfs set com.kldload:golden=true rpool/vms/golden-rocky9
zfs set com.kldload:version=1 rpool/vms/golden-rocky9@v1
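When the golden image is rebuilt after patching, the snapshot tag and the version property need to advance together. Here is a small sketch of that bookkeeping, assuming the `vN` tag scheme used above; `next_version` and `release_commands` are hypothetical helpers, and the commands are only built as argument lists, not executed.

```python
def next_version(snapshot_names, dataset):
    """Return the next 'vN' tag for dataset, given names like 'pool/ds@v3'."""
    versions = []
    for name in snapshot_names:
        ds, _, tag = name.partition("@")
        if ds == dataset and tag.startswith("v") and tag[1:].isdigit():
            versions.append(int(tag[1:]))
    return f"v{max(versions, default=0) + 1}"


def release_commands(dataset, tag):
    """Build the snapshot and tagging commands as argv lists."""
    snap = f"{dataset}@{tag}"
    return [
        ["zfs", "snapshot", snap],
        ["zfs", "set", f"com.kldload:version={tag[1:]}", snap],
    ]


# Sample input standing in for: zfs list -H -t snapshot -o name
existing = [
    "rpool/vms/golden-rocky9@v1",
    "rpool/vms/golden-rocky9@v2",
    "rpool/vms/other@v9",
]
tag = next_version(existing, "rpool/vms/golden-rocky9")
for argv in release_commands("rpool/vms/golden-rocky9", tag):
    print(" ".join(argv))
```

Passing each list to `subprocess.run(argv, check=True)` would apply the release; keeping the builder separate makes it easy to review the commands first.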
Deploying from the golden image
# Clone for deployment (instant)
kvm-clone golden-rocky9 production-web-01
kvm-clone golden-rocky9 production-web-02
kvm-clone golden-rocky9 production-api-01
# Each clone boots with cloud-init, which:
# - Generates new machine-id
# - Regenerates SSH host keys
# - Sets hostname from NoCloud metadata
# - Configures networking from metadata
# - Runs any user-data scripts
# Provide cloud-init metadata via NoCloud ISO:
cloud-localds /var/lib/libvirt/images/web01-cidata.iso \
user-data.yaml meta-data.yaml
# Attach the cloud-init ISO to the clone
virsh attach-disk production-web-01 \
/var/lib/libvirt/images/web01-cidata.iso hdc \
--type cdrom --config
virsh start production-web-01
# VM boots, cloud-init runs, machine is unique and configured.
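The two files passed to cloud-localds are short enough to generate per clone. This is a minimal sketch of what they might contain, assuming a NoCloud datasource and a single SSH key for the default user; the function names and the sample key are illustrative only.

```python
import json


def meta_data(name):
    """NoCloud meta-data: a fresh instance-id makes cloud-init treat the clone as new."""
    return f"instance-id: {name}\nlocal-hostname: {name}\n"


def user_data(name, ssh_keys=()):
    """Minimal #cloud-config user-data. JSON-quoted strings are valid YAML scalars."""
    lines = ["#cloud-config", f"hostname: {json.dumps(name)}"]
    if ssh_keys:
        lines.append("ssh_authorized_keys:")
        lines += [f"  - {json.dumps(key)}" for key in ssh_keys]
    return "\n".join(lines) + "\n"


print(meta_data("production-web-01"), end="")
print(user_data("production-web-01", ["ssh-ed25519 AAAAexample admin@example"]), end="")
```

Writing the two strings to `meta-data.yaml` and `user-data.yaml` gives cloud-localds its inputs. The instance-id matters most: if two clones share one, cloud-init on the second may skip its first-boot steps.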
Exporting images
kldload's image export pipeline converts a zvol to portable formats for other hypervisors.
The VM is installed normally on ZFS, sealed for cloning, and exported via
qemu-img convert. Supported formats: qcow2, vmdk, vhd, ova, raw.
# Export a zvol as qcow2 (for non-ZFS KVM hosts)
qemu-img convert -f raw -O qcow2 -c \
/dev/zvol/rpool/vms/golden-rocky9 golden-rocky9.qcow2
# Export as vmdk (for VMware)
qemu-img convert -f raw -O vmdk -o subformat=streamOptimized \
/dev/zvol/rpool/vms/golden-rocky9 golden-rocky9.vmdk
# Export as vhd (for Hyper-V / Azure)
qemu-img convert -f raw -O vpc \
/dev/zvol/rpool/vms/golden-rocky9 golden-rocky9.vhd
# Or use kldload's kexport tool:
kexport golden-rocky9 --format qcow2 --output /tmp/
kexport golden-rocky9 --format vmdk --scp user@remote:/images/
Automation
KVM + ZFS is fully scriptable. Every operation is a CLI command. There is no GUI dependency,
no Java applet, no Flash plugin (yes, vSphere 6 still used Flash). This makes automation
straightforward with any tool: bash, Python, Ansible, Terraform.
virsh + ZFS scripting patterns
#!/bin/bash
# deploy-fleet.sh: deploy N VMs from a golden image
# Assumes /tmp/user-data.yaml already exists.
set -euo pipefail
GOLDEN="golden-rocky9"
SNAP="v1"
COUNT="${1:-5}"
BRIDGE="br0"
for i in $(seq 1 "$COUNT"); do
    NAME="worker-$(printf '%03d' "$i")"
    echo "Deploying $NAME..."
    # Clone zvol (instant), then wait for /dev/zvol/... to appear
    zfs clone "rpool/vms/${GOLDEN}@${SNAP}" "rpool/vms/${NAME}"
    udevadm settle
    # Generate per-VM cloud-init metadata and ISO
    printf 'instance-id: %s\nlocal-hostname: %s\n' "$NAME" "$NAME" \
        > "/tmp/${NAME}-meta-data.yaml"
    cloud-localds "/tmp/${NAME}-cidata.iso" \
        /tmp/user-data.yaml "/tmp/${NAME}-meta-data.yaml"
    # Define and start VM
    virt-install \
        --name "${NAME}" \
        --ram 2048 --vcpus 2 --cpu host \
        --machine q35 --boot uefi \
        --disk "path=/dev/zvol/rpool/vms/${NAME},bus=virtio,cache=none" \
        --disk "path=/tmp/${NAME}-cidata.iso,device=cdrom" \
        --network "bridge=${BRIDGE},model=virtio" \
        --os-variant rocky9 \
        --import --noautoconsole
done
echo "Deployed $COUNT VMs in ${SECONDS} seconds."
Terraform libvirt provider
# main.tf: Terraform with the dmacvicar/libvirt provider
terraform {
  required_providers {
    libvirt = {
      source = "dmacvicar/libvirt"
    }
  }
}

provider "libvirt" {
  uri = "qemu:///system"
}

# Reference an existing zvol (created outside Terraform)
resource "libvirt_domain" "web" {
  count  = 3
  name   = "web-${count.index + 1}"
  memory = 4096
  vcpu   = 4

  cpu {
    mode = "host-passthrough"
  }

  disk {
    # ZFS zvol created by a provisioner or pre-existing
    block_device = "/dev/zvol/rpool/vms/web-${count.index + 1}"
  }

  network_interface {
    bridge = "br0"
  }

  boot_device {
    dev = ["hd"]
  }

  # Cloud-init
  cloudinit = libvirt_cloudinit_disk.web[count.index].id
}

resource "libvirt_cloudinit_disk" "web" {
  count = 3
  name  = "web-${count.index + 1}-cloudinit.iso"
  user_data = templatefile("user-data.tpl", {
    hostname = "web-${count.index + 1}"
  })
}
libvirt Python bindings
#!/usr/bin/env python3
"""List all VMs with their ZFS zvol usage."""
import subprocess
import xml.etree.ElementTree as ET

import libvirt

conn = libvirt.open("qemu:///system")
for dom in conn.listAllDomains():
    name = dom.name()
    state = "running" if dom.isActive() else "shut off"
    # Get block-device disk paths from the domain XML
    tree = ET.fromstring(dom.XMLDesc())
    for disk in tree.findall(".//disk[@type='block']/source"):
        path = disk.get("dev", "")
        if "/zvol/" not in path:
            continue
        # Extract ZFS dataset name from /dev/zvol/ path
        dataset = path.replace("/dev/zvol/", "")
        result = subprocess.run(
            ["zfs", "get", "-Hp", "-o", "value", "used,refer,compressratio", dataset],
            capture_output=True, text=True,
        )
        # zfs prints one value per line; join them onto a single row
        values = " ".join(result.stdout.split())
        print(f"{name:20s} {state:10s} {dataset:40s} {values}")
conn.close()
If you can script it, you can automate it. If you can
automate it, you can scale it. Every VM I deploy is created by a script, not by clicking
through a web UI. The kvm-* tools are my scripts. Terraform is for when you need state
management. Ansible is for when you need configuration management. But the foundation is
always: zfs clone + virt-install + cloud-init.
Three commands, infinite scale.
Monitoring and maintenance
A KVM + ZFS hypervisor needs two monitoring planes: storage health (ZFS)
and VM metrics (libvirt). Both are exposed via standard commands and can
feed into Prometheus, Grafana, or any monitoring stack.
ZFS storage monitoring
# Pool health
zpool status rpool
# Look for: state (ONLINE/DEGRADED/FAULTED), errors, scrub status
# Real-time I/O statistics (per-second)
zpool iostat rpool 1
# capacity operations bandwidth
# pool alloc free read write read write
# rpool 45.2G 186G 142 287 8.91M 17.3M
# Per-zvol space accounting
zfs list -r -o name,used,refer,compressratio,volsize rpool/vms
# NAME USED REFER RATIO VOLSIZE
# rpool/vms 45.1G 24K - -
# rpool/vms/webserver 12.3G 12.3G 2.14x 100G
# rpool/vms/database 28.4G 28.4G 1.32x 200G
# rpool/vms/dev-clone 4.4G 12.3G 2.14x 100G
# Scrub scheduling (weekly is recommended)
# kldload sets this up automatically via systemd timer
systemctl status zfs-scrub-weekly@rpool.timer
# Manual scrub
zpool scrub rpool
zpool status rpool | grep scan
VM metrics via virsh
# VM state overview
virsh list --all
# Detailed stats for a VM
virsh domstats webserver
# Domain: 'webserver'
# state.state=1 (running)
# cpu.time=18293847000 (nanoseconds)
# balloon.current=4194304 (KB)
# block.0.rd.reqs=142857
# block.0.rd.bytes=8918573056
# block.0.wr.reqs=287142
# block.0.wr.bytes=17301438464
# net.0.rx.bytes=4918573056
# net.0.tx.bytes=2301438464
# CPU usage per VM
virsh cpu-stats webserver
# Memory usage
virsh dommemstat webserver
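To feed these counters into a monitoring stack, the domstats text has to be turned into structured data first. A minimal parsing sketch follows, assuming the `Domain: 'name'` header and `key=value` line format shown above; `parse_domstats` is a hypothetical helper.

```python
def parse_domstats(text):
    """Parse `virsh domstats` output into {domain: {key: value}}."""
    stats, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Domain:"):
            current = line.split(":", 1)[1].strip().strip("'")
            stats[current] = {}
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            # Counters are integers; anything else is kept as a string.
            stats[current][key] = int(value) if value.isdigit() else value
    return stats


sample_domstats = """Domain: 'webserver'
  state.state=1
  cpu.time=18293847000
  block.0.rd.bytes=8918573056
  net.0.rx.bytes=4918573056
"""
print(parse_domstats(sample_domstats)["webserver"]["cpu.time"])
```

Most of these values are cumulative counters, so a collector would sample twice and divide the difference by the interval to get a rate.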
ARC monitoring
# ARC summary
arc_summary
# Key metrics to watch:
# - ARC hit ratio: should be >90% for cached workloads
# - ARC size: should stay below your configured max
# - Demand data hit ratio: directly impacts VM read performance
# Real-time ARC stats
arcstat 1
# time read miss miss% dmis dm% pmis pm% mmis mm% size c
# 14:30 1.2K 48 3.9% 35 3.2% 13 8.1% 0 0.0% 24G 28G
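The hit ratio can also be computed directly from the kernel counters for alerting. This sketch assumes the three-column `name type data` layout of `/proc/spl/kstat/zfs/arcstats`; the sample text is made up, and both function names are hypothetical.

```python
def parse_arcstats(text):
    """Return {counter_name: value} from arcstats-style text."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three columns with a numeric third column;
        # the two header lines do not match and are skipped.
        if len(fields) == 3 and fields[2].isdigit():
            counters[fields[0]] = int(fields[2])
    return counters


def hit_ratio(counters):
    """ARC hit ratio in percent, from the cumulative hits and misses counters."""
    total = counters["hits"] + counters["misses"]
    return 100.0 * counters["hits"] / total if total else 0.0


sample_arcstats = """9 1 0x01 147 39984 4177901 8823456
name                            type data
hits                            4    1152
misses                          4    48
size                            4    25769803776
"""
print(f"{hit_ratio(parse_arcstats(sample_arcstats)):.1f}%")
```

These counters accumulate since module load, so the result is a long-run average. Comparing two samples taken a few seconds apart gives the per-interval view that arcstat prints.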
Common mistakes
Using qcow2 on ZFS
The most common and most damaging mistake. qcow2 has its own copy-on-write mechanism.
ZFS has copy-on-write. Stacking them creates double COW — every guest
write triggers two COW operations, doubling write amplification and halving throughput.
Use zvols instead. If someone tells you to use qcow2 on ZFS, they are wrong.
qcow2 on ZFS = two taxi meters running simultaneously. You pay double for the same ride.
volblocksize mismatch
Using a small default volblocksize (8K before OpenZFS 2.2, 16K since) for general VMs creates excessive metadata overhead and
poor sequential performance. Using 128K creates write amplification on random I/O. The right
answer is 64K for general VMs and matching the database page size for database VMs.
This is set at creation and cannot be changed.
volblocksize is permanent. Get it wrong and you rebuild the zvol from scratch.
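Because the choice is permanent, it helps to encode the rule once and reuse it whenever a zvol is created. A minimal sketch, assuming stock page sizes (8K for PostgreSQL, 16K for MySQL InnoDB); the workload labels and function names are hypothetical, so check the real page size of your database before relying on the table.

```python
# Stock page sizes for each engine; an assumption about your deployment.
DB_PAGE_BYTES = {"postgresql": 8 * 1024, "mysql-innodb": 16 * 1024}
GENERAL_BYTES = 64 * 1024  # the 64K recommendation for general-purpose VMs


def pick_volblocksize(workload="general"):
    """Return a volblocksize string such as '64K' for the given workload."""
    if workload == "general":
        size = GENERAL_BYTES
    elif workload in DB_PAGE_BYTES:
        size = DB_PAGE_BYTES[workload]
    else:
        raise ValueError(f"unknown workload: {workload}")
    return f"{size // 1024}K"


def create_command(dataset, volsize, workload="general"):
    """Build (but do not run) the zfs create command for a new zvol."""
    return ["zfs", "create", "-V", volsize, "-b", pick_volblocksize(workload), dataset]


print(" ".join(create_command("rpool/vms/database", "200G", "postgresql")))
```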
ARC eating guest memory
ZFS ARC defaults to 50% of RAM. On a 128GB host with 96GB allocated to VMs, ARC will try to
use 64GB, leaving only 64GB for VMs that need 96GB: a 32GB shortfall. The host swaps, performance collapses.
Always set zfs_arc_max to leave enough RAM for all VM allocations plus OS overhead.
ARC + VM RAM + OS overhead must be < total RAM. Violate this and everything dies.
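The budget is simple arithmetic: whatever is left after VM allocations and OS overhead is the most ARC may use. The sketch below works it through for the 128GB host described above, assuming 8GB of OS overhead (an illustrative figure, not a kldload default).

```python
GIB = 1024 ** 3


def arc_max_bytes(total_gib, vm_gib, os_overhead_gib):
    """Largest ARC size that keeps ARC + VM RAM + OS overhead within total RAM."""
    remaining = total_gib - vm_gib - os_overhead_gib
    if remaining <= 0:
        raise ValueError("VM allocations plus OS overhead already exceed total RAM")
    return remaining * GIB


def modprobe_line(arc_bytes):
    """Line for a file such as /etc/modprobe.d/zfs.conf."""
    return f"options zfs zfs_arc_max={arc_bytes}"


# 128 GiB host, 96 GiB of VMs, 8 GiB assumed for the OS -> 24 GiB for ARC.
print(modprobe_line(arc_max_bytes(128, 96, 8)))
```

The same byte value can be written to `/sys/module/zfs/parameters/zfs_arc_max` to apply the limit without a reboot.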
sync=disabled on database VMs
Disabling sync writes is safe for throwaway VMs but dangerous for databases.
Without sync, a power failure can lose committed transactions. Databases rely on
fsync() for durability guarantees. If the storage lies about flush completion,
the database's crash recovery cannot work correctly.
sync=disabled on a production database = eventual data loss. Not if, when.
No SLOG for sync-heavy workloads
If your VMs run databases, NFS, or any sync-write-heavy application, and you are using
spinning disks or consumer SSDs, synchronous write latency will be terrible (5-50ms per operation).
An enterprise NVMe SLOG drops this to 50-100us. The SLOG must have power-loss protection
(PLP) — a consumer NVMe without PLP as SLOG is worse than no SLOG.
SLOG with PLP = fast and safe. Consumer SSD as SLOG = fast until power failure, then corrupt.
Single-disk pool in production
A single disk with no redundancy means one drive failure destroys all VMs. ZFS checksumming
will detect the corruption but cannot correct it without a mirror or RAIDZ.
Production VM hosts need at least a mirror. The cost of a second disk is nothing compared to
the cost of rebuilding everything.
Single disk = you are one firmware bug away from losing everything. Mirror or RAIDZ, always.
Not using virtio
IDE emulation is the default for some virt-install invocations and graphical VM managers.
It is 10-50x slower than virtio. Always specify bus=virtio for
disk and model=virtio for network. The only exception is old guests that lack
virtio drivers (Windows XP, ancient Linux kernels). Modern Windows supports virtio but needs the
virtio-win drivers installed first.
IDE in a KVM VM is like racing a Ferrari in first gear. Shift to virtio.
Ignoring NUMA
On multi-socket servers, VMs that span NUMA nodes pay a 50-100ns penalty on every cross-node
memory access. A latency-sensitive VM (database, real-time application) should be pinned to a
single NUMA node with virsh numatune and virsh vcpupin. Most people
never configure this and wonder why their 2-socket server is slower than expected.
NUMA is free performance. Pin your VMs and get 10-30% more throughput on multi-socket.
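Pinning comes down to one numatune call plus one vcpupin call per vCPU. The sketch below only builds those commands so they can be reviewed before running. The host CPU list for the node is an input you supply (for example from `/sys/devices/system/node/node0/cpulist`), and `pin_commands` is a hypothetical helper.

```python
def pin_commands(domain, vcpus, node, node_cpus):
    """Build virsh commands that pin a VM's memory and vCPUs to one NUMA node."""
    if vcpus > len(node_cpus):
        raise ValueError("VM has more vCPUs than the node has host CPUs")
    # Keep guest memory on the chosen node.
    cmds = [["virsh", "numatune", domain, "--mode", "strict",
             "--nodeset", str(node), "--config"]]
    # Pin each vCPU to its own host CPU on that node.
    for vcpu in range(vcpus):
        cmds.append(["virsh", "vcpupin", domain, str(vcpu),
                     str(node_cpus[vcpu]), "--config"])
    return cmds


for argv in pin_commands("database", vcpus=4, node=0, node_cpus=[0, 1, 2, 3, 4, 5, 6, 7]):
    print(" ".join(argv))
```

With `--config` the settings take effect at the next VM start. One-to-one pinning is a conservative choice; a looser variant would pin every vCPU to the node's whole CPU list.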
Using cache=writeback without SLOG
Setting cache=writeback in QEMU without a battery-backed SLOG means QEMU
acknowledges writes before they reach stable storage. A power failure loses those writes.
With ZFS-backed zvols, always use cache=none (let ZFS handle caching via ARC)
or cache=writeback only if you have an SLOG with PLP.
cache=none is always safe on ZFS. cache=writeback is a bet that the power stays on.
Not monitoring pool capacity
ZFS performance degrades severely above 80% capacity due to fragmentation and the COW write
pattern. With thin-provisioned zvols, a VM can write enough data to fill the pool without warning.
Monitor zpool list and alert at 75%. Keep 20% free at all times. Use
reservation on critical zvols to guarantee they always have space.
ZFS above 80% full = performance cliff. Monitor capacity or suffer.
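A capacity alert needs only two numbers per pool. This is a minimal sketch, assuming tab-separated input from `zpool list -Hp -o name,size,allocated`; the function name and sample figures are illustrative.

```python
def pools_over_threshold(zpool_list_output, threshold_pct=75.0):
    """Return [(pool, percent_used)] for pools at or above the threshold."""
    alerts = []
    for line in zpool_list_output.splitlines():
        if not line.strip():
            continue
        name, size, allocated = line.split("\t")
        pct = 100.0 * int(allocated) / int(size)
        if pct >= threshold_pct:
            alerts.append((name, round(pct, 1)))
    return alerts


# Sample input: one pool at 80% and one at 10%.
sample_pools = "rpool\t1000000000000\t800000000000\ntank\t2000000000000\t200000000000\n"
for pool, pct in pools_over_threshold(sample_pools):
    print(f"WARNING: {pool} is {pct}% full")
```

Run from cron or a systemd timer, the warning lines can be mailed or pushed to whatever alerting channel you already use.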
I have made every one of these mistakes. The qcow2-on-ZFS
mistake cost me a week of debugging "slow storage" on a Proxmox cluster before I realized
the VMs were using qcow2 format on ZFS-backed storage. The ARC memory mistake caused an
out-of-memory killer event that took down 12 VMs simultaneously. Learn from my mistakes.
Read this list. Set up your system correctly the first time.
Migration from other hypervisors
Moving VMs to KVM + ZFS is straightforward because qemu-img can convert
between every major disk format, and ZFS zvols accept raw block data via dd.
The pattern is always the same: convert to raw, write to zvol, define in libvirt.
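The step most often done by hand is sizing the zvol to match the source image. This sketch reads the `virtual-size` field from `qemu-img info --output=json` and rounds it up to a multiple of the 64K block size, since a zvol's size must be a multiple of its volblocksize. `zvol_create_command` is a hypothetical helper, and the command is built but not run.

```python
import json

BLOCK = 64 * 1024  # matches the -b 64K used throughout this page


def zvol_create_command(qemu_img_info_json, dataset):
    """Build a zfs create command sized to the image's virtual size."""
    info = json.loads(qemu_img_info_json)
    size = info["virtual-size"]  # bytes
    rounded = -(-size // BLOCK) * BLOCK  # round up to a whole number of blocks
    return ["zfs", "create", "-V", str(rounded), "-b", "64K", dataset]


# Sample JSON standing in for: qemu-img info --output=json vm-100-disk-0.qcow2
sample_info = '{"virtual-size": 107374182400, "format": "qcow2"}'
print(" ".join(zvol_create_command(sample_info, "rpool/vms/migrated-vm")))
```

Sizing from the image metadata avoids creating a zvol that is slightly too small, which would make the later qemu-img convert step fail partway through.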
From Proxmox (qcow2 or raw on ZFS)
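# If Proxmox already uses ZFS zvols (common):
# 1. Shut down the VM, snapshot the zvol, then stream it with zfs send
ssh proxmox-host zfs snapshot rpool/data/vm-100-disk-0@migrate
ssh proxmox-host zfs send rpool/data/vm-100-disk-0@migrate | \
    zfs receive rpool/vms/migrated-vm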
# If Proxmox uses qcow2:
# 1. Copy the qcow2 file from Proxmox
scp proxmox-host:/var/lib/vz/images/100/vm-100-disk-0.qcow2 /tmp/
# 2. Get the virtual size
qemu-img info /tmp/vm-100-disk-0.qcow2
# virtual size: 100 GiB
# 3. Create a zvol of that size
zfs create -V 100G -b 64K rpool/vms/migrated-vm
# 4. Convert and write directly to the zvol
qemu-img convert -f qcow2 -O raw /tmp/vm-100-disk-0.qcow2 \
/dev/zvol/rpool/vms/migrated-vm
# 5. Define VM in libvirt (adjust the Proxmox XML or create new)
virt-install --import --name migrated-vm \
--ram 4096 --vcpus 4 --cpu host --machine q35 --boot uefi \
--disk path=/dev/zvol/rpool/vms/migrated-vm,bus=virtio,cache=none \
--network bridge=br0,model=virtio --noautoconsole
From VMware (vmdk)
# 1. Export the VM from VMware (OVA or VMDK)
# Use vSphere client: Export -> OVF template
# Or use ovftool: ovftool vi://vcenter/... /tmp/vm.ova
# 2. Extract VMDK from OVA (if needed)
tar xf /tmp/vm.ova
# Results in: vm.ovf, vm-disk1.vmdk
# 3. Check virtual size
qemu-img info /tmp/vm-disk1.vmdk
# 4. Create zvol
zfs create -V 100G -b 64K rpool/vms/from-vmware
# 5. Convert VMDK to raw directly onto zvol
qemu-img convert -f vmdk -O raw /tmp/vm-disk1.vmdk \
/dev/zvol/rpool/vms/from-vmware
# 6. Define in libvirt
virt-install --import --name from-vmware \
    --ram 8192 --vcpus 4 --cpu host --machine q35 --boot uefi \
    --disk path=/dev/zvol/rpool/vms/from-vmware,bus=virtio,cache=none \
    --network bridge=br0,model=virtio --noautoconsole
# 7. IMPORTANT: Remove VMware Tools, install QEMU guest agent
# Once the VM has booted, inside the guest:
# (RHEL/Rocky): dnf remove open-vm-tools && dnf install qemu-guest-agent
# (Debian/Ubuntu): apt remove open-vm-tools && apt install qemu-guest-agent
# (Windows): Uninstall VMware Tools, install virtio-win drivers + QEMU GA
From Hyper-V (vhd/vhdx)
# 1. Export the VM from Hyper-V Manager (Export Virtual Machine)
# Copy the .vhdx file to the KVM host
# 2. Convert VHDX to raw on zvol
qemu-img info /tmp/vm.vhdx
zfs create -V 100G -b 64K rpool/vms/from-hyperv
qemu-img convert -f vhdx -O raw /tmp/vm.vhdx \
/dev/zvol/rpool/vms/from-hyperv
# 3. Define and boot
virt-install --import --name from-hyperv \
--ram 4096 --vcpus 4 --cpu host --machine q35 --boot uefi \
--disk path=/dev/zvol/rpool/vms/from-hyperv,bus=virtio,cache=none \
--network bridge=br0,model=virtio --noautoconsole
# Note: Windows VMs from Hyper-V need virtio drivers installed
# before migration. Install them in Hyper-V first, or boot with
# IDE emulation, install virtio-win, then switch to virtio.
From plain KVM with qcow2 (non-ZFS)
# The simplest migration — you are already on KVM.
# Just convert the qcow2 to a zvol.
# 1. Get virtual size
qemu-img info /var/lib/libvirt/images/oldvm.qcow2
# 2. Create zvol
zfs create -V 80G -b 64K rpool/vms/oldvm
# 3. Convert
qemu-img convert -f qcow2 -O raw \
/var/lib/libvirt/images/oldvm.qcow2 \
/dev/zvol/rpool/vms/oldvm
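# 4. Update the existing libvirt XML
virsh edit oldvm
# Change:
#   <disk type='file' device='disk'>
#     <driver name='qemu' type='qcow2'/>
#     <source file='/var/lib/libvirt/images/oldvm.qcow2'/>
# To:
#   <disk type='block' device='disk'>
#     <driver name='qemu' type='raw' cache='none'/>
#     <source dev='/dev/zvol/rpool/vms/oldvm'/>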
# 5. Start
virsh start oldvm
# You now have ZFS snapshots, clones, replication, checksumming.
# Welcome to the future.
I have migrated hundreds of VMs from VMware and Proxmox to
KVM + ZFS. The conversion step takes minutes (I/O bound by the disk copy). The hardest
part is Windows VMs — install the virtio-win drivers before migration or you will
be booting into a blue screen. Linux VMs migrate seamlessly because the kernel detects
the new virtio hardware automatically.
Quick reference
The 25 commands you need to run a KVM + ZFS hypervisor. Print this page.
| Command | What it does |
| --- | --- |
| zfs create -V 100G -b 64K pool/vms/name | Create a 100GB zvol with 64K block size |
| zfs set compression=lz4 pool/vms | Enable LZ4 compression on all VMs (inherited) |
| zfs set refreservation=none pool/vms/name | Thin-provision a zvol (no space reservation) |
| zfs snapshot pool/vms/name@tag | Instant snapshot of a VM disk |
| zfs rollback pool/vms/name@tag | Rollback to a snapshot (VM must be shut down) |
| zfs clone pool/vms/name@tag pool/vms/clone | Instant clone from a snapshot |
| zfs promote pool/vms/clone | Make a clone independent (reverse parent-child) |
| zfs destroy pool/vms/name | Delete a zvol (fails if snapshots/clones depend on it) |
| qemu-img convert -f raw -O qcow2 /dev/zvol/... out.qcow2 | Export a zvol to qcow2 (for shipping to non-ZFS) |
25 commands. That is the entire hypervisor. Not a 200-page
manual, not a certification course, not a week-long training. Twenty-five commands that
give you everything VMware and Proxmox offer. Learn them. You will never need anything else.
The bottom line: KVM is the kernel's hypervisor. ZFS is the best storage
layer. Together they give you instant snapshots, instant clones, checksummed storage,
transparent compression, incremental replication, and GPU passthrough — all for free,
all with standard Linux tools, all without a single proprietary component. kldload installs
this stack automatically with every desktop and server profile. Boot the ISO, install,
and you have a hypervisor that competes with anything on the market.
Stop paying rent on your hypervisor. The hypervisor you need is already in your kernel.