KVM & Hypervisor Masterclass

This guide covers the full stack of Linux-native virtualization: KVM architecture, QEMU device emulation, libvirt management, ZFS zvol storage design, golden image workflows, CPU and memory tuning, virtio devices, networking topologies, live migration, multi-host orchestration, Proxmox integration, GPU passthrough, monitoring, and troubleshooting. By the end you will understand not just how to run VMs on Linux, but why KVM works the way it does — and how to build a production hypervisor that rivals any commercial offering.

The premise: Linux is not a host operating system that runs a hypervisor. Linux is the hypervisor. KVM turns the Linux kernel into a Type 1 hypervisor — every VM is a process, every vCPU is a thread, and the kernel's scheduler, memory manager, and I/O subsystem serve the VMs directly. This means every tool you already know — top, perf, cgroups, nftables, systemd — works on VMs too. No separate management OS. No proprietary vSphere. Just Linux.

What this page covers: KVM kernel module internals, QEMU emulation, libvirt domain management, ZFS zvol storage design for VM disks, golden image build-once-deploy-many workflows, virt-install and virsh lifecycle commands, CPU topology and pinning, hugepages and NUMA binding, virtio device tuning, bridge and SR-IOV networking, live migration strategies, multi-host fleet management with kvm-clone and kvm-deploy, Proxmox API integration, VFIO GPU passthrough, performance monitoring with eBPF and virt-top, and a comprehensive troubleshooting reference.

Prerequisites: a running kldload system with the server or desktop profile. The KVM tutorials assume Intel VT-x or AMD-V hardware. Nested virtualization works for learning but is not suitable for production workloads.

The hypervisor market is dominated by VMware vSphere and Microsoft Hyper-V: both proprietary, both expensive, both lock you into ecosystems that treat you as a revenue source. KVM is different. It was merged into the Linux kernel in 2007 and has been the hypervisor behind AWS (whose Nitro hypervisor is built on KVM), Google Cloud (via KVM directly), and most OpenStack deployments since 2010. When you run KVM on a kldload node with ZFS, you are running the same hypervisor technology as the largest clouds on Earth, except you own it, you control it, and you can see every line of code. This masterclass teaches you to use it properly.

1. KVM Architecture

KVM (Kernel-based Virtual Machine) is a Linux kernel module that turns the kernel into a hypervisor. It was merged into mainline Linux in version 2.6.20 (February 2007) and has been the default virtualization technology for Linux ever since. Understanding its architecture is essential to understanding why Linux virtualization works so well.

The hypervisor classification problem

Traditional hypervisor taxonomy divides the world into Type 1 (bare metal) and Type 2 (hosted). Type 1 hypervisors — VMware ESXi, Microsoft Hyper-V, Xen — run directly on hardware with no host OS underneath. Type 2 hypervisors — VirtualBox, VMware Workstation — run as applications inside a host OS. KVM breaks this taxonomy because it is both: the Linux kernel runs on bare metal (Type 1), and KVM is a kernel module that turns that kernel into a hypervisor. The guest VMs run as regular Linux processes (which sounds Type 2), but they execute directly on the CPU via hardware virtualization extensions (which is Type 1 behavior). The correct answer is that KVM is a Type 1 hypervisor that happens to share its kernel with a general-purpose OS.

The /dev/kvm interface

KVM exposes itself as a character device at /dev/kvm. Userspace programs (QEMU) open this device and use ioctl() calls to create VMs, configure vCPUs, map memory regions, and enter guest mode. The kernel handles VM exits, instruction emulation, and interrupt injection.

// /dev/kvm is the API. QEMU is the client. The kernel is the server.

Hardware virtualization extensions

Intel VT-x (vmx) and AMD-V (svm) provide hardware support for guest execution. The CPU gains a second execution mode: on Intel, VMX root (host) and VMX non-root (guest); AMD-V has equivalent host and guest modes. Guests execute natively on the CPU. When they do something that needs the hypervisor (port I/O, access to an emulated device, certain privileged instructions), the CPU traps back into host mode, a "VM exit", and KVM handles it.

// The CPU itself has a "guest mode." No software emulation of privileged instructions.

VMs are processes, vCPUs are threads

Each VM is a QEMU process visible in ps. Each vCPU is a thread within that process. The Linux scheduler (CFS, or EEVDF on kernel 6.6 and later) schedules vCPU threads alongside all other processes. This means cgroups, nice, cpuset, and taskset all work on VMs directly.

// kill -9 a VM? Yes. It's just a process. renice a vCPU? Yes. It's just a thread.
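
To see the threads for yourself, a small filter over ps output is enough. This is a minimal sketch, not a kldload tool: the function names are hypothetical, and it assumes QEMU was started with debug-threads=on (which libvirt enables), so that vCPU threads are named "CPU <n>/KVM".

```shell
# Keep only vCPU threads from lines of the form: TID PSR COMMAND
# Assumption: vCPU threads are named "CPU <n>/KVM" (QEMU debug-threads=on).
filter_vcpu_threads() {
  awk '$3 == "CPU" && $4 ~ /^[0-9]+\/KVM$/ { print $1, $2, $3 " " $4 }'
}

# Show thread ID and current physical CPU for each vCPU of one QEMU process.
# Usage: vcpu_threads <qemu-pid>
vcpu_threads() {
  ps -L -p "$1" -o tid= -o psr= -o comm= | filter_vcpu_threads
}
```

Each output line is a thread ID you can pass to taskset or renice, followed by the physical CPU it last ran on.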

Memory is mmap'd

Guest RAM is allocated as anonymous memory in the QEMU process address space, typically via mmap(). The kernel's memory manager handles paging, NUMA placement, and hugepage backing. KVM configures EPT (Intel) or NPT (AMD) — hardware-assisted nested page tables — so the guest's virtual-to-physical translations happen in hardware, not software.

// Guest RAM = host pages. EPT/NPT = hardware nested (two-dimensional) page tables. No shadow page tables needed.

kvm_intel / kvm_amd modules

The KVM subsystem consists of a core module (kvm) and a CPU-specific module (kvm_intel or kvm_amd). The CPU module handles the hardware-specific VMX/SVM instructions. Load them with modprobe kvm_intel or modprobe kvm_amd. Verify with lsmod | grep kvm.

// Two modules: kvm (generic) + kvm_intel or kvm_amd (hardware-specific).

VM exit and re-entry

When a guest does something the CPU cannot handle in guest mode (I/O port access, an intercepted MSR write, an external interrupt), the CPU performs a VM exit back to KVM. KVM inspects the exit reason, handles it (or delegates to QEMU for device emulation), and re-enters guest mode with VMRESUME on Intel or VMRUN on AMD. Minimizing VM exits is the key to performance.

// VM exit = the guest asked for something the hardware cannot fake. KVM answers, then sends the guest back.

# Verify KVM is available
grep -cE '(vmx|svm)' /proc/cpuinfo
# > 0 means hardware virtualization is supported

# Load KVM modules
modprobe kvm
modprobe kvm_intel  # or kvm_amd

# Verify
lsmod | grep kvm
# kvm_intel  380928  0
# kvm        1130496 1 kvm_intel

# Check /dev/kvm exists
ls -la /dev/kvm
# crw-rw-rw- 1 root kvm 10, 232 ... /dev/kvm
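
The checks above can be folded into a pair of helper functions. This is a sketch under stated assumptions: the function names and the optional file argument (useful for testing against a saved copy of cpuinfo) are illustrative and not part of kldload.

```shell
# Print which hardware virtualization flag the CPU advertises:
# "vmx" (Intel VT-x), "svm" (AMD-V), or "none".
# Optional argument (hypothetical): a cpuinfo-format file to inspect.
kvm_cpu_flag() {
  cpuinfo="${1:-/proc/cpuinfo}"
  if grep -qw vmx "$cpuinfo"; then
    echo vmx
  elif grep -qw svm "$cpuinfo"; then
    echo svm
  else
    echo none
  fi
}

# Map the flag to the CPU-specific KVM module named in this section.
kvm_module_for_flag() {
  case "$1" in
    vmx) echo kvm_intel ;;
    svm) echo kvm_amd ;;
    *)   return 1 ;;
  esac
}
```

With these defined, something like modprobe "$(kvm_module_for_flag "$(kvm_cpu_flag)")" would load the matching module, and the function returns non-zero when neither flag is present.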

The single most important thing to understand about KVM is that it does not reinvent operating system primitives. VMware ESXi has its own scheduler, its own memory manager, its own filesystem (VMFS), its own networking stack. KVM uses the Linux kernel's scheduler, Linux's memory manager, ZFS or ext4 or whatever you want, and Linux's networking stack. This means every improvement to the Linux kernel — every scheduler optimization, every memory management enhancement, every network driver update — automatically improves your hypervisor. KVM gets better every time you update your kernel. ESXi gets better when VMware ships a new release.

2. libvirt & QEMU

KVM provides the hardware virtualization interface. QEMU provides device emulation — the virtual disks, NICs, USB controllers, VGA adapters, and serial ports that the guest OS sees. libvirt sits above both and provides a stable management API: creating VMs, starting/stopping them, managing storage pools, snapshots, and networks. Together they form the standard Linux virtualization stack.

The QEMU layer

QEMU (Quick EMUlator) is a full system emulator that can emulate entire machines in software. When combined with KVM, it delegates CPU execution to the hardware and handles only device emulation. A QEMU process with KVM acceleration runs guest code at near-native speed — the CPU instructions execute directly on the hardware — while QEMU emulates the devices the guest interacts with (disk controllers, network cards, display adapters).

# Direct QEMU invocation (rarely done manually — libvirt generates this)
qemu-system-x86_64 \
  -enable-kvm \
  -cpu host \
  -smp 4,sockets=1,cores=4,threads=1 \
  -m 8G \
  -drive file=/dev/zvol/tank/vms/web01,format=raw,if=virtio,cache=none \
  -netdev bridge,id=net0,br=br0 \
  -device virtio-net-pci,netdev=net0,mac=52:54:00:aa:bb:01 \
  -vnc :1 \
  -daemonize

The libvirt layer

libvirt is a C library and daemon (libvirtd or virtqemud) that provides a uniform API for managing VMs across multiple hypervisors — KVM/QEMU, Xen, LXC, bhyve. It translates high-level operations ("create a VM with 4 CPUs, 8 GB RAM, a virtio disk, and a bridged NIC") into the correct QEMU command-line invocation. It also manages persistent XML domain definitions, storage pools, networks, and secrets.

# Install libvirt on kldload (CentOS/RHEL/Rocky)
dnf install -y libvirt libvirt-daemon-kvm qemu-kvm virt-install virt-top

# Enable and start
systemctl enable --now libvirtd

# Verify the hypervisor connection
virsh uri
# qemu:///system

# List all VMs
virsh list --all

# Dump a VM's XML definition (the source of truth)
virsh dumpxml web01

XML domain definitions

Every VM in libvirt is defined by an XML document that describes its hardware: CPU model, memory, disks, NICs, boot order, console, serial ports, TPM, and more. This XML is the declarative specification of the VM. virsh define registers it, virsh start instantiates it, and virsh dumpxml shows the current running configuration. Understanding the XML is essential — it is the equivalent of a Terraform resource definition for VMs.

<domain type='kvm'>
  <name>web01</name>
  <uuid>a1b2c3d4-e5f6-7890-abcd-ef1234567890</uuid>
  <memory unit='GiB'>8</memory>
  <vcpu placement='static'>4</vcpu>

  <os>
    <type arch='x86_64' machine='q35'>hvm</type>
    <boot dev='hd'/>
  </os>

  <cpu mode='host-passthrough' check='none' migratable='on'/>

  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>

    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
      <source dev='/dev/zvol/tank/vms/web01'/>
      <target dev='vda' bus='virtio'/>
    </disk>

    <interface type='bridge'>
      <source bridge='br0'/>
      <model type='virtio'/>
      <mac address='52:54:00:aa:bb:01'/>
    </interface>

    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>

    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
    </channel>

    <tpm model='tpm-crb'>
      <backend type='emulator' version='2.0'/>
    </tpm>
  </devices>
</domain>

Essential virsh commands

# Define a VM from XML
virsh define web01.xml

# Start / stop / reboot
virsh start web01
virsh shutdown web01       # graceful ACPI shutdown
virsh destroy web01        # force power off (like pulling the plug)
virsh reboot web01

# Autostart on host boot
virsh autostart web01
virsh autostart --disable web01

# Console access (serial)
virsh console web01
# Press Ctrl+] to detach

# Edit the persistent XML (opens $EDITOR); changes apply at the next VM start
virsh edit web01

# Snapshot (for qcow2 disks — zvol snapshots are done via ZFS)
virsh snapshot-create-as web01 --name "before-upgrade" --description "pre-patch"

# List networks and storage pools
virsh net-list --all
virsh pool-list --all

People ask "should I use Proxmox or bare KVM?" The answer depends on what you value. Proxmox gives you a web UI, clustering, HA fencing, backup integration, and a lower learning curve. Bare KVM with libvirt gives you complete control, no abstraction layer between you and the hypervisor, and the ability to automate everything with virsh and shell scripts. If you understand libvirt XML definitions and virsh, you can build anything Proxmox builds — and you understand exactly how it works. This masterclass teaches the bare-metal approach because understanding the primitives makes you effective with any tool built on top of them, including Proxmox.

3. ZFS zvol Storage for VMs

The most impactful decision in hypervisor design is the storage backend. The standard choice is qcow2 files on ext4 or XFS. The kldload choice is ZFS zvols — block devices backed by the ZFS storage pool. This section explains why zvols are superior and how to design your storage layout.

Why zvols over qcow2

Block device, not a file

A zvol is a block device (/dev/zvol/tank/vms/web01) that the guest accesses through virtio-blk or virtio-scsi. There is no filesystem layer between QEMU and the storage — no double caching, no filesystem metadata overhead, no fragmentation. QEMU opens the block device with cache=none,io=native and every I/O goes straight to ZFS.

// qcow2-on-ext4 = file inside a filesystem inside a volume manager. zvol = block device. One layer, not three.

Atomic, near-instant snapshots

ZFS snapshots are copy-on-write metadata operations, so they complete almost instantly regardless of dataset size. Snapshotting a 500 GB zvol takes the same time as snapshotting a 5 GB zvol. No pausing writes, no snapshot file chain to manage. The result is crash-consistent; for an application-consistent snapshot, freeze the guest filesystems first (for example through the QEMU guest agent).

// zfs snapshot tank/vms/web01@before-upgrade: near-instant, regardless of disk size.

Instant cloning

zfs clone creates a new zvol that shares all blocks with its parent snapshot. A clone of a 100 GB zvol takes zero additional space and completes instantly. This is the foundation of the golden image workflow — build one image, clone it 100 times, each clone only stores its deltas.

// zfs clone tank/vms/golden@sealed tank/vms/node04 — instant, zero-copy. The clone IS the deployment.

Compression and checksums

ZFS compresses data at the block level with lz4 (default) or zstd. A 100 GB Windows VM zvol might only consume 40 GB on disk. Every block is checksummed — silent data corruption is detected and corrected automatically if the pool has redundancy.

// Free compression + free integrity verification. qcow2 has neither unless you add layers.

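Thin provisioning with sparse zvols

A zvol created with the -s (sparse) flag carries no space reservation and consumes pool space only as the guest writes to it. Without -s, ZFS reserves the full volume size up front. A sparse 500 GB zvol might consume only 2 GB after a minimal OS install. No pre-allocation, no zeroing, no waiting.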

// zfs create -s -V 500G tank/vms/web01 — "500 GB" disk that uses 0 bytes until written.

ZFS send/receive for migration

zfs send serializes a snapshot (or the incremental delta between two snapshots) into a byte stream that zfs receive can consume on any ZFS host. This is native and efficient, and it lets you move VM storage between hosts without shared storage infrastructure: replicate while the VM runs, then send a final small increment once it is stopped.

// zfs send -i tank/vms/web01@snap1 tank/vms/web01@snap2 | ssh host2 zfs recv tank/vms/web01 (incremental storage migration over SSH)

Storage layout design

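# Create a dedicated VM dataset with tuned properties
# Note: recordsize only affects files stored in this dataset;
# zvols use volblocksize instead (set per zvol at creation, see below)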
zfs create -o compression=lz4 \
           -o primarycache=metadata \
           -o recordsize=64k \
           -o redundant_metadata=most \
           tank/vms

# Create a zvol for a VM (thin-provisioned)
zfs create -s -V 100G tank/vms/web01

# The zvol appears as a block device
ls -la /dev/zvol/tank/vms/web01
# lrwxrwxrwx 1 root root 11 ... /dev/zvol/tank/vms/web01 -> ../../zd0

# Set volblocksize for VM workloads (set at creation — cannot be changed)
zfs create -s -V 100G -o volblocksize=64k tank/vms/db01

# Snapshot before an upgrade
zfs snapshot tank/vms/web01@before-upgrade

# Rollback if the upgrade fails (shut the VM down first)
zfs rollback tank/vms/web01@before-upgrade

# Clone a golden image
zfs snapshot tank/vms/golden@v1.0
zfs clone tank/vms/golden@v1.0 tank/vms/web02

volblocksize tuning

Workload	volblocksize	Rationale
General-purpose VM	64k	Larger than the 4k blocks most guest filesystems use, but compresses well and keeps metadata overhead low
Database VM (PostgreSQL, MySQL)	16k	Matches database page size (8k/16k); avoids read-modify-write amplification
Windows VM	64k	NTFS defaults to 4k clusters; 64k favors compression and sequential throughput
Large sequential I/O (media)	128k	Maximizes throughput for sequential reads/writes
Random I/O heavy (OLTP)	8k	Minimizes write amplification for small random writes

The volblocksize decision is permanent: you cannot change it after zvol creation. It is the single most important tuning parameter for VM storage performance. The rule is simple: match the volblocksize to the guest's dominant I/O size. For a database VM running PostgreSQL (8k pages), use 8k or 16k. For a general-purpose Linux VM, 64k is a reasonable sweet spot because it gives good compression ratios and low metadata overhead. If you get it wrong, you pay a write amplification penalty on every small I/O for the life of the zvol. Measure twice, create once.
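
The write amplification penalty can be estimated with simple arithmetic. The sketch below is a deliberately simplified model (the function name is hypothetical): it assumes aligned writes and ignores compression, RAIDZ padding, and metadata, so treat the result as a rough lower bound rather than a measurement.

```shell
# Ratio of bytes ZFS must rewrite to bytes the guest asked to write,
# for one aligned guest write. Sizes in KiB; result rounded up.
write_amplification() {
  io_kib="$1"
  vbs_kib="$2"
  # Number of volblocks the write touches (ceiling division).
  blocks=$(( (io_kib + vbs_kib - 1) / vbs_kib ))
  written=$(( blocks * vbs_kib ))
  echo $(( (written + io_kib - 1) / io_kib ))
}

write_amplification 8 64   # 8k database page on a 64k zvol: prints 8
write_amplification 8 8    # matched sizes: prints 1
```

An 8k page written to a 64k zvol forces ZFS to rewrite the whole 64k block, which is where the factor of 8 comes from.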

4. Golden Image Workflow

The golden image pattern is the foundation of scalable VM deployment: build one perfectly configured VM, seal it for cloning, snapshot it, and clone it to create new VMs. Each clone inherits the entire disk state of the golden image but stores only its own deltas. This is how cloud providers deploy VMs — and with ZFS, you can do it in seconds on a single host.

The build-seal-clone pipeline

Step 1 — Build the golden image

Install the OS into a zvol using kldload's installer, Packer, or manual virt-install. Configure everything that should be identical across all clones: base packages, security hardening, monitoring agents, WireGuard keys (template), SSH config, kernel parameters, ZFS inside the guest if needed.

// The golden image is your template. Get it right once. Every clone inherits it.

Step 2 — Seal for cloning

Remove machine-specific identity: clear /etc/machine-id, delete SSH host keys, remove persistent network rules, enable cloud-init or a first-boot script. The kldload function k_seal_image_for_clone() does this automatically. A sealed image boots with a new identity every time.

// Sealing = stripping unique identity. Like wiping the serial number so each clone gets its own.

Step 3 — Snapshot and clone

Snapshot the sealed zvol, then clone it for each new VM. The clone is instant and zero-copy. Boot the clone and cloud-init (or a first-boot script) assigns a new hostname, generates SSH host keys, configures networking, and sets the machine-id.

// zfs clone = instant VM deployment. No copying 100 GB. No waiting. Just metadata.

Sealing a golden image

# Inside the golden image VM, before shutdown:

# Clear machine-id (systemd regenerates on boot)
truncate -s 0 /etc/machine-id
rm -f /var/lib/dbus/machine-id

# Remove SSH host keys (sshd-keygen.service regenerates on boot)
rm -f /etc/ssh/ssh_host_*

# Remove persistent network rules
rm -f /etc/udev/rules.d/70-persistent-net.rules
rm -f /etc/NetworkManager/system-connections/*.nmconnection

# Clean dnf/apt cache
dnf clean all 2>/dev/null || apt clean 2>/dev/null

# Remove shell history
rm -f /root/.bash_history /home/*/.bash_history

# Enable cloud-init for identity assignment
systemctl enable cloud-init cloud-init-local cloud-config cloud-final

# Shutdown cleanly
poweroff
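
Before snapshotting, it helps to verify that the sealing steps actually took effect. The following is a minimal sketch and not the implementation of k_seal_image_for_clone(); the function name and the idea of pointing it at a mounted image root are assumptions for illustration.

```shell
# Report whether a root filesystem tree looks sealed.
# Argument (hypothetical): path to the image's root, "/" by default.
check_sealed() {
  root="${1:-/}"
  rc=0
  if [ -s "$root/etc/machine-id" ]; then
    echo "machine-id is not empty"
    rc=1
  fi
  for key in "$root"/etc/ssh/ssh_host_*; do
    if [ -e "$key" ]; then
      echo "SSH host key present: ${key##*/}"
      rc=1
    fi
  done
  if [ -e "$root/etc/udev/rules.d/70-persistent-net.rules" ]; then
    echo "persistent net rules present"
    rc=1
  fi
  [ "$rc" -eq 0 ] && echo "sealed"
  return "$rc"
}
```

Run it inside the guest just before poweroff; a non-zero exit status means at least one piece of machine identity is still in place.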

Clone deployment with kvm-clone

# kldload's kvm-clone automates the clone workflow
# It snapshots the source, clones the zvol, creates a new VM definition,
# and assigns unique identity

# Clone the golden image to create 4 web servers
for i in $(seq 1 4); do
  kvm-clone golden web0${i}
done

# Each clone gets:
# - A new zvol (tank/vms/web0N) cloned from tank/vms/golden@clone-web0N
# - A new libvirt domain with unique UUID and MAC address
# - A new UEFI NVRAM file (copied, not shared — avoids SELinux conflicts)
# - cloud-init generates unique SSH keys, hostname, machine-id on first boot

# Start all clones
for i in $(seq 1 4); do
  virsh start web0${i}
done

# Verify they're running
virsh list
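
Every clone needs its own MAC address. How kvm-clone picks one is not shown here; as an illustration only, the sketch below generates a random address in the 52:54:00 prefix conventionally used for QEMU/KVM guests. The function name is hypothetical.

```shell
# Generate a random MAC address in the 52:54:00 prefix.
# Reads three random bytes from /dev/urandom and formats them as hex.
random_kvm_mac() {
  od -An -N3 -tx1 /dev/urandom |
    awk '{ printf "52:54:00:%s:%s:%s\n", $1, $2, $3 }'
}

random_kvm_mac
```

With only three random bytes, collisions are unlikely but possible across a large fleet, so a real tool would also check the result against existing domains.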

Cloud-init configuration for clones

# /etc/cloud/cloud.cfg.d/99-kldload.cfg
# This configuration handles identity assignment on clone boot

datasource_list: [NoCloud, ConfigDrive, None]

# Generate new SSH host keys
ssh_deletekeys: true
ssh_genkeytypes: [ed25519, ecdsa, rsa]

# Set hostname from instance metadata or DHCP
preserve_hostname: false

# Disable phone-home to cloud providers
phone_home:
  url: ""

# Run custom first-boot scripts
runcmd:
  - systemd-machine-id-setup
  - hostnamectl set-hostname $(cat /etc/hostname)
  - systemctl restart systemd-journald
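
With NoCloud first in the datasource list, each clone can be handed its identity through a small seed. The sketch below writes the two files the NoCloud datasource reads; the function name and directory layout are assumptions, and packaging the files (for example into an ISO labeled cidata with a tool such as cloud-localds) is left out.

```shell
# Write NoCloud seed files for one clone.
# Arguments: output directory, hostname for the clone.
make_nocloud_seed() {
  dir="$1"
  name="$2"
  mkdir -p "$dir"
  # A fresh instance-id makes cloud-init treat the clone as a new machine.
  printf 'instance-id: %s\nlocal-hostname: %s\n' "$name" "$name" > "$dir/meta-data"
  # user-data must start with the #cloud-config marker line.
  printf '#cloud-config\nhostname: %s\n' "$name" > "$dir/user-data"
}
```

For example, make_nocloud_seed /tmp/seed-web03 web03 produces a seed that names the clone web03 on first boot.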

The golden image workflow is the most powerful pattern in virtualization, and ZFS makes it trivially cheap. Without ZFS, cloning a VM means copying the entire disk image: 100 GB copy, 5 minutes, 100 GB more storage. With ZFS, cloning means creating a metadata pointer to an existing snapshot: zero bytes copied, sub-second completion, zero additional storage until the clone diverges. This is not just faster; it changes what is economically feasible. You can clone 50 test VMs from a golden image in seconds, boot them, run your test suite, and destroy them again at almost no storage cost. That kind of agility is usually reserved for containers. ZFS gives it to full VMs.

5. VM Creation & Lifecycle

This section covers the practical commands for creating, running, and managing VMs throughout their lifecycle — from initial creation through daily operations to eventual decommissioning.

Creating a VM with virt-install

# Create a zvol first
zfs create -s -V 100G -o volblocksize=64k tank/vms/web01

# Install from ISO using virt-install
virt-install \
  --name web01 \
  --memory 8192 \
  --vcpus 4 \
  --cpu host-passthrough \
  --machine q35 \
  --os-variant centos-stream9 \
  --disk /dev/zvol/tank/vms/web01,bus=virtio,cache=none,io=native,discard=unmap \
  --cdrom /root/kldload-1.0.3.iso \
  --network bridge=br0,model=virtio \
  --graphics vnc,listen=0.0.0.0,port=5901 \
  --serial pty \
  --console pty,target_type=serial \
  --tpm backend.type=emulator,backend.version=2.0,model=tpm-crb \
  --boot uefi \
  --autostart \
  --noautoconsole

# Install from kldload ISO with unattended answers file
virt-install \
  --name web01 \
  --memory 8192 \
  --vcpus 4 \
  --cpu host-passthrough \
  --machine q35 \
  --os-variant centos-stream9 \
  --disk /dev/zvol/tank/vms/web01,bus=virtio,cache=none,io=native \
  --cdrom /root/kldload-1.0.3.iso \
  --disk /root/answers.iso,device=cdrom \
  --network bridge=br0,model=virtio \
  --serial pty \
  --console pty,target_type=serial \
  --boot uefi \
  --noautoconsole

Lifecycle management commands

# --- Power management ---
virsh start web01                    # Power on
virsh shutdown web01                 # Graceful ACPI shutdown
virsh destroy web01                  # Force power off (immediate)
virsh reboot web01                   # Graceful reboot
virsh reset web01                    # Hard reset (like pressing reset button)

# --- State management ---
virsh suspend web01                  # Pause vCPUs (VM frozen in RAM)
virsh resume web01                   # Unpause
virsh save web01 /tmp/web01.state    # Save VM state to file (like hibernate)
virsh restore /tmp/web01.state       # Restore from saved state

# --- Configuration ---
virsh autostart web01                # Start on host boot
virsh autostart --disable web01      # Disable autostart
virsh setmem web01 16G --live        # Raise balloon target (VM's maximum memory must be >= 16G; guest needs the balloon driver)
virsh setvcpus web01 8 --live        # Hot-add vCPUs (VM's maximum vCPU count must be >= 8; guest kernel must support it)

# --- Information ---
virsh dominfo web01                  # Basic info (state, memory, CPUs)
virsh domblklist web01               # List block devices
virsh domiflist web01                # List network interfaces
virsh domstats web01                 # Detailed statistics
virsh vcpuinfo web01                 # vCPU-to-pCPU mapping
virsh dumpxml web01                  # Full XML definition

# --- Console access ---
virsh console web01                  # Serial console (Ctrl+] to detach)
virt-viewer web01                    # Graphical console (SPICE/VNC)

# --- Deletion ---
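virsh shutdown web01                 # Shutdown first (returns immediately)
virsh domstate web01                 # Wait until this reports "shut off"
virsh undefine web01 --nvram         # Remove VM definition and NVRAM
zfs destroy tank/vms/web01           # Destroy the zvol (add -r if it has snapshots)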
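
Because virsh shutdown only sends the ACPI request and returns before the guest has powered off, scripted deletion needs to wait for the domain to stop. A minimal polling sketch, with a hypothetical function name and an arbitrary default timeout:

```shell
# Poll `virsh domstate` until the domain reports "shut off" or we time out.
# Arguments: domain name, timeout in seconds (default 120).
wait_for_shutoff() {
  dom="$1"
  remaining="${2:-120}"
  while [ "$remaining" -gt 0 ]; do
    if [ "$(virsh domstate "$dom" 2>/dev/null)" = "shut off" ]; then
      return 0
    fi
    sleep 1
    remaining=$(( remaining - 1 ))
  done
  return 1
}

# Usage (commented out so the block is safe to source):
# virsh shutdown web01 && wait_for_shutoff web01 60 && virsh undefine web01 --nvram
```

If the guest ignores ACPI and the wait times out, the function returns non-zero and the caller can decide whether to fall back to virsh destroy.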

Attach and detach devices at runtime

# Hot-add a second disk
zfs create -s -V 50G tank/vms/web01-data
virsh attach-disk web01 /dev/zvol/tank/vms/web01-data vdb \
  --driver qemu --subdriver raw --cache none --persistent

# Hot-add a network interface
virsh attach-interface web01 bridge br1 \
  --model virtio --persistent

# Hot-remove a network interface
virsh detach-interface web01 bridge --mac 52:54:00:xx:xx:xx --persistent

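# Attach a USB device from the host
# (attach-device needs a file argument; /dev/stdin lets it read the here-document)
virsh attach-device web01 /dev/stdin --live <<'EOF'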
<hostdev mode='subsystem' type='usb'>
  <source>
    <vendor id='0x1234'/>
    <product id='0x5678'/>
  </source>
</hostdev>
EOF

6. CPU Topology & Pinning

CPU configuration is one of the highest-impact tuning areas for KVM performance. The wrong CPU model or topology can halve your VM's performance. The right configuration — with proper pinning and NUMA awareness — can match bare-metal speeds.

CPU models

host-passthrough

Exposes the exact host CPU to the guest, including all features, model name, and microcode level. Gives the best performance because the guest can use every CPU instruction the hardware supports. The tradeoff: VMs cannot be live-migrated to hosts with different CPU models.

// "Give the guest the real CPU." Best performance. Pins you to identical hardware for migration.

host-model

libvirt reads the host CPU and selects the closest QEMU CPU model, adjusting feature flags to match. Slightly less performant than host-passthrough (some obscure instructions might be hidden) but allows migration between hosts with similar CPUs.

// "Give the guest a close approximation of the real CPU." Good performance. More migration flexibility.

Named models (Cascadelake-Server, EPYC-Rome)

A specific CPU model defined in QEMU's CPU database. The guest sees exactly those features, regardless of the host hardware. Useful for migration across heterogeneous hardware — as long as every host is at least as capable as the named model.

// "Give the guest a standardized CPU definition." Maximum migration compatibility. Some features hidden.

CPU pinning

By default, the Linux scheduler can run a VM's vCPU threads on any physical CPU. This means vCPUs can bounce between cores, losing L1/L2 cache state on every migration. CPU pinning binds each vCPU to a specific physical core, eliminating cache thrashing and providing deterministic performance.

# View host CPU topology
lscpu -e
# CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
# 0   0    0      0    0:0:0:0        yes
# 1   0    0      1    1:1:1:0        yes
# 2   0    0      2    2:2:2:0        yes
# ...

# Pin vCPUs in the VM's XML definition
virsh edit web01

<!-- CPU pinning configuration -->
<vcpu placement='static'>4</vcpu>
<!-- Declare the I/O threads that iothreadpin refers to -->
<iothreads>2</iothreads>
<cputune>
  <!-- Pin each vCPU to a specific physical core -->
  <vcpupin vcpu='0' cpuset='4'/>
  <vcpupin vcpu='1' cpuset='5'/>
  <vcpupin vcpu='2' cpuset='6'/>
  <vcpupin vcpu='3' cpuset='7'/>

  <!-- Pin QEMU emulator threads to separate cores -->
  <emulatorpin cpuset='0-1'/>

  <!-- Pin I/O threads -->
  <iothreadpin iothread='1' cpuset='2'/>
  <iothreadpin iothread='2' cpuset='3'/>
</cputune>

<!-- Expose correct topology to the guest -->
<cpu mode='host-passthrough' check='none' migratable='on'>
  <topology sockets='1' dies='1' cores='4' threads='1'/>
</cpu>
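
Writing vcpupin elements by hand gets tedious for larger VMs. The helper below is a sketch with a hypothetical name that prints one element per host CPU given on the command line, in vCPU order.

```shell
# Emit one <vcpupin> element per argument: vCPU 0 gets the first
# host CPU listed, vCPU 1 the second, and so on.
gen_vcpupin() {
  vcpu=0
  for cpu in "$@"; do
    printf "  <vcpupin vcpu='%d' cpuset='%s'/>\n" "$vcpu" "$cpu"
    vcpu=$(( vcpu + 1 ))
  done
}

gen_vcpupin 4 5 6 7
```

The output can be pasted into the cputune element. For a running VM, virsh vcpupin web01 0 4 --live --config applies a single pin without editing XML.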

NUMA topology

On multi-socket systems, each CPU socket has its own memory controller and local memory. Accessing memory attached to the local socket (local access) is fast. Accessing memory on a remote socket (remote access) crosses the interconnect (QPI/UPI for Intel, Infinity Fabric for AMD) and is 1.5-2x slower. For VM performance, all of a VM's vCPUs and memory should live on the same NUMA node.

# View NUMA topology
numactl --hardware
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
# node 0 size: 65536 MB
# node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
# node 1 size: 65536 MB
# node distances:
# node   0   1
#   0:  10  21
#   1:  21  10

<!-- NUMA-aware VM configuration -->
<vcpu placement='static'>8</vcpu>
<cputune>
  <!-- Pin all vCPUs to NUMA node 0 -->
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/>
  <vcpupin vcpu='5' cpuset='5'/>
  <vcpupin vcpu='6' cpuset='6'/>
  <vcpupin vcpu='7' cpuset='7'/>
  <emulatorpin cpuset='16,17'/>
</cputune>

<!-- Bind memory to NUMA node 0 -->
<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>
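
To choose pinning targets, you need the individual CPU numbers that belong to a NUMA node. The kernel exposes them in list form such as "0-7,16-23". A sketch that expands that format follows; the function names are hypothetical, while the sysfs path is the standard per-node cpulist file.

```shell
# Expand a kernel-style CPU list ("0-7,16-23") into one CPU number per line.
expand_cpulist() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r first last; do
    [ -n "$first" ] || continue
    last="${last:-$first}"   # a single CPU like "16" is a range of one
    i="$first"
    while [ "$i" -le "$last" ]; do
      echo "$i"
      i=$(( i + 1 ))
    done
  done
}

# CPUs local to a NUMA node, read from sysfs.
# Usage: node_cpus 0
node_cpus() {
  expand_cpulist "$(cat "/sys/devices/system/node/node$1/cpulist")"
}
```

On the example host above, node_cpus 0 would list 0 through 7 and 16 through 23, which are the only valid pinning targets for a VM whose memory is bound to node 0.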

CPU pinning should be treated as mandatory for latency-sensitive production workloads. Without pinning, a database VM's vCPU threads can migrate between cores whenever the scheduler rebalances, arriving each time at cold L1 and L2 caches. With pinning, the vCPU stays on one core, its working set stays hot in cache, and latency can drop by 30-50%. The key insight is that the Linux scheduler is designed for fairness across all processes, but a VM that owns a physical core does not need fairness. It needs isolation. Pinning gives it isolation. On a 32-core hypervisor running 4 VMs, first reserve a few cores for the host and QEMU emulator threads, then dedicate the rest. For example: cores 0-3 for the host, 4-10 for VM1, 11-17 for VM2, 18-24 for VM3, 25-31 for VM4.

7. Memory — Hugepages, Ballooning & KSM

Memory configuration has an outsized impact on VM performance because the translation lookaside buffer (TLB) — the CPU's cache for virtual-to-physical address translations — is small and expensive. With standard 4K pages, an 8 GB VM has 2 million pages. The TLB cannot cache all those translations, so TLB misses cause expensive page table walks. Hugepages solve this by using larger pages (2 MB or 1 GB), reducing the number of translations by 512x or 262,144x.

Hugepage configuration

# Check current hugepage status
grep -i huge /proc/meminfo
# HugePages_Total:       0
# HugePages_Free:        0
# Hugepagesize:       2048 kB

# Allocate 2 MB hugepages at boot by adding these parameters to the kernel command line
# For 64 GB of hugepages: 64 * 1024 / 2 = 32768 pages
#   hugepagesz=2M hugepages=32768

# Or allocate at runtime (may fail due to fragmentation)
echo 32768 > /proc/sys/vm/nr_hugepages

# For 1 GB hugepages, allocate at boot (runtime allocation usually fails due to fragmentation)
# Kernel command line parameters:
#   hugepagesz=1G hugepages=64
# Allocates 64 x 1 GB = 64 GB of hugepages

# Verify allocation
grep HugePages_ /proc/meminfo
# HugePages_Total:   32768
# HugePages_Free:    32768

# Make persistent via sysctl
echo "vm.nr_hugepages = 32768" >> /etc/sysctl.d/90-hugepages.conf
sysctl -p /etc/sysctl.d/90-hugepages.conf
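
The page-count arithmetic used in the comments above generalizes to any amount of guest RAM. A one-line helper, with a hypothetical name, makes it reusable in provisioning scripts:

```shell
# Number of hugepages needed to back a given amount of guest RAM.
# Arguments: guest RAM in GiB, hugepage size in MiB (2 or 1024).
hugepages_needed() {
  echo $(( $1 * 1024 / $2 ))
}

hugepages_needed 64 2      # prints 32768
hugepages_needed 64 1024   # prints 64
```

Sum the result over every hugepage-backed VM on the host, and leave ordinary memory for the host itself, since hugepages cannot be used for anything else once reserved.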

VM configuration for hugepages

<!-- Enable hugepages for a VM -->
<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB'/>
  </hugepages>
  <locked/>       <!-- Prevent host from swapping VM memory -->
  <nosharepages/> <!-- Disable KSM for this VM (optional) -->
</memoryBacking>

<!-- For 1 GB hugepages -->
<memoryBacking>
  <hugepages>
    <page size='1048576' unit='KiB'/>
  </hugepages>
  <locked/>
</memoryBacking>

Memory ballooning

The virtio-balloon device lets the host reclaim memory from a guest dynamically. The balloon driver inside the guest "inflates" (allocates memory the guest cannot use) to return pages to the host, or "deflates" to give memory back to the guest. This allows overcommit — running VMs whose total configured memory exceeds physical RAM — as long as they don't all use their full allocation simultaneously.

# Check balloon status
virsh dommemstat web01
# actual 8388608        (8 GB allocated)
# rss 4194304           (4 GB actually used by host)
# usable 3145728        (3 GB free inside guest)

# Set balloon target (reduce available memory to 4 GB)
virsh setmem web01 4G --live

# Restore to full allocation
virsh setmem web01 8G --live

Kernel Same-page Merging (KSM)

KSM is a kernel feature that scans memory pages across processes and merges identical pages into a single copy-on-write page. For VMs running the same OS, KSM can save significant memory — the kernel code pages, library pages, and zero pages are identical across VMs and get merged. The tradeoff is CPU overhead for the scanning and a side-channel attack surface (KSM timing can leak information about guest memory contents).

# Enable KSM
echo 1 > /sys/kernel/mm/ksm/run

# Tune scanning aggressiveness
echo 200 > /sys/kernel/mm/ksm/sleep_millisecs  # Lower = more aggressive
echo 1000 > /sys/kernel/mm/ksm/pages_to_scan   # Pages per scan cycle

# Check KSM statistics
cat /sys/kernel/mm/ksm/pages_shared    # Merged pages currently in use (one copy each)
cat /sys/kernel/mm/ksm/pages_sharing   # Additional references deduplicated onto those pages
# Memory saved ≈ pages_sharing * 4096 (4 KiB pages)

# Disable KSM (recommended for security-sensitive environments)
echo 0 > /sys/kernel/mm/ksm/run
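
A quick way to judge whether KSM is earning its CPU cost is the ratio of pages_sharing to pages_shared: the kernel documentation describes a high ratio as a sign of good sharing. A minimal sketch, assuming a hypothetical helper named ksm_sharing_ratio:

```shell
# ksm_sharing_ratio PAGES_SHARED PAGES_SHARING
# Prints pages_sharing / pages_shared with two decimals (0.00 if nothing is shared).
# Hypothetical helper, for illustration only.
ksm_sharing_ratio() {
  awk -v shared="$1" -v sharing="$2" 'BEGIN {
    if (shared == 0) { print "0.00"; exit }
    printf "%.2f\n", sharing / shared
  }'
}

# Literal sample values; on a real host you would pass
#   "$(cat /sys/kernel/mm/ksm/pages_shared)" "$(cat /sys/kernel/mm/ksm/pages_sharing)"
ksm_sharing_ratio 1000 4500
```

A ratio close to zero after the scanner has had time to run means the VMs have little in common and KSM is mostly burning CPU.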

Feature	Best For	Tradeoff
2 MB hugepages	Most VM workloads	Cannot be swapped; must pre-allocate
1 GB hugepages	Large VMs (64+ GB), database workloads	Must allocate at boot; very coarse granularity
Ballooning	Overcommit; dynamic memory sharing	Incompatible with hugepages; adds latency on inflate
KSM	Many identical VMs (VDI, testing)	CPU overhead; side-channel risk; incompatible with hugepages
Memory locking	Latency-sensitive VMs	Prevents host from reclaiming; must have enough physical RAM

The choice between hugepages and ballooning is a choice between performance and flexibility. Hugepages can give a 10-30% performance improvement on memory-intensive workloads by sharply reducing TLB misses, but the memory is reserved up front and cannot be shared or reclaimed. Ballooning lets you overcommit memory across VMs, but every balloon inflation or deflation adds latency, and the balloon cannot reclaim hugepage-backed memory. For production: use hugepages with locked memory. For dev/test labs where you want to pack 20 VMs on a host with 64 GB: use ballooning without hugepages. Never try to use both at the same time.

8. Virtio Devices

Virtio is a standardized interface for paravirtualized devices in KVM. Instead of emulating real hardware (which requires the guest to use a driver designed for physical hardware, and QEMU to translate those I/O operations), virtio defines a simple, efficient interface that both the host and guest understand natively. The result is dramatically better performance — virtio-blk is 2-5x faster than emulated IDE, and virtio-net is 3-10x faster than emulated e1000.

virtio-blk

A simple, fast block device. The guest sees /dev/vda. Historically single-queue by default; it supports multiqueue, and recent QEMU releases create one queue per vCPU unless told otherwise. Lower overhead than virtio-scsi. Best for VMs that need 1-2 disks with maximum throughput. Does not support SCSI commands (no sg_* tools, no SCSI reservations).

// virtio-blk = the fast path for simple disk access. One disk, one queue, maximum speed.

virtio-scsi

A full SCSI controller emulation over virtio transport. Supports hundreds of disks per controller (vs. limited PCI slots for virtio-blk), SCSI commands, persistent reservations, UNMAP/TRIM, and device hotplug. Slightly more overhead than virtio-blk. Best for VMs with many disks or SCSI-dependent workloads.

// virtio-scsi = the Swiss Army knife. More features, more disks, slightly more overhead.

virtio-net

Paravirtualized network interface. Supports multiqueue (one queue per vCPU), checksum offload, TCP segmentation offload, and vhost-net kernel acceleration. With vhost-net, network packets bypass QEMU entirely — they go from the guest directly to the host kernel's network stack.

// virtio-net + vhost-net = near bare-metal networking. Packets skip QEMU completely.

virtio-fs (virtiofs)

Shared filesystem between host and guest. The guest speaks the FUSE protocol over a virtio transport to a virtiofsd daemon running on the host. Much faster than 9p or NFS for host-guest file sharing. With the optional DAX (direct access) window enabled, host page cache is mapped into the guest, avoiding data copies. Ideal for development workflows where the guest needs access to host files.

// virtio-fs = shared folders done right. Direct memory mapping, not network protocols.

vhost-user

Moves device emulation out of QEMU entirely, into a separate userspace process. Used primarily with DPDK for high-performance networking — the vhost-user backend process handles packets directly in userspace with kernel bypass. Achieves line-rate 10/25/100 GbE performance.

// vhost-user = DPDK's path into VMs. For when you need 10+ million packets per second.

Multiqueue

Both virtio-blk and virtio-net support multiqueue — multiple submission/completion queue pairs that can be processed in parallel by different vCPUs. Without multiqueue, all I/O funnels through a single queue, creating a bottleneck on multi-vCPU VMs. Enable it: queues=N where N equals the number of vCPUs.

// Single queue = one checkout lane. Multiqueue = one lane per vCPU. Eliminates the bottleneck.

Optimal disk configuration

<!-- virtio-blk with multiqueue and I/O threads -->
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'
         discard='unmap' iothread='1' queues='4'/>
  <source dev='/dev/zvol/tank/vms/web01'/>
  <target dev='vda' bus='virtio'/>
</disk>
<!-- <iothreads> is a child of <domain>, not of <devices> -->
<iothreads>2</iothreads>

<!-- virtio-scsi with multiqueue for many-disk VMs -->
<controller type='scsi' model='virtio-scsi'>
  <driver queues='4' iothread='1'/>
</controller>
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
  <source dev='/dev/zvol/tank/vms/db01'/>
  <target dev='sda' bus='scsi'/>
</disk>
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
  <source dev='/dev/zvol/tank/vms/db01-log'/>
  <target dev='sdb' bus='scsi'/>
</disk>

Optimal network configuration

<!-- virtio-net with multiqueue and vhost -->
<interface type='bridge'>
  <source bridge='br0'/>
  <model type='virtio'/>
  <driver name='vhost' queues='4'/>
  <mac address='52:54:00:aa:bb:01'/>
</interface>

# Inside the guest: enable multiqueue on the NIC
ethtool -L eth0 combined 4

# Verify multiqueue is active
ethtool -l eth0
# Current hardware settings:
#   Combined: 4
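
The "queues = vCPU count" rule has one catch inside the guest: ethtool refuses a combined count above the NIC's pre-set maximum. The sketch below picks the smaller of the two from ethtool -l output. The helper name pick_queue_count is an assumption, and the sample input is hand-written to mimic the ethtool layout.

```shell
# pick_queue_count VCPUS: reads `ethtool -l` output on stdin and prints
# min(VCPUS, pre-set maximum of combined channels).
# Hypothetical helper, for illustration only.
pick_queue_count() {
  awk -v vcpus="$1" '
    # The first "Combined:" line belongs to the "Pre-set maximums" section
    $1 == "Combined:" && !seen { max = $2; seen = 1 }
    END {
      n = (vcpus < max) ? vcpus : max
      print n
    }'
}

# Hand-written sample; inside a guest you would run something like:
#   ethtool -L eth0 combined "$(ethtool -l eth0 | pick_queue_count "$(nproc)")"
printf 'Pre-set maximums:\nCombined:\t8\nCurrent hardware settings:\nCombined:\t1\n' \
  | pick_queue_count 4
```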

Device	Use Case	cache	io	Notes
virtio-blk on zvol	General VM disk	none	native	Best performance; add `discard=unmap` for TRIM
virtio-scsi on zvol	Database VM, many disks	none	native	Use iothread for each controller
virtio-blk on qcow2	Snapshot-capable (non-ZFS)	writeback	threads	Only if you must use qcow2
virtio-net + vhost	All VMs	n/a	n/a	Always enable vhost; multiqueue = vCPU count

9. Networking for VMs

VM networking on Linux is built on largely the same primitives as container networking: bridges, macvtap, OVS, and nftables, with tap devices taking the place of veth pairs. The difference is that VMs present virtio-net (or e1000) NICs to the guest, and the backend connects to a host-side network device. This section covers the major networking topologies for KVM.

Bridge mode (standard)

A Linux bridge is a Layer 2 switch implemented in the kernel. VMs connect their virtio-net backends to the bridge. The bridge forwards frames between VMs and between VMs and the physical network. This is the default and most common networking mode.

# Create a bridge with NetworkManager
nmcli connection add type bridge ifname br0 con-name br0 \
  ipv4.method manual ipv4.addresses 10.0.0.1/24 ipv4.gateway 10.0.0.254 \
  ipv4.dns 10.0.0.1

# Enslave the physical NIC to the bridge
nmcli connection add type bridge-slave ifname eno1 master br0

# Bring up the bridge
nmcli connection up br0

# Verify
bridge link show
# 2: eno1: <BROADCAST,MULTICAST,UP> mtu 1500 master br0
# 5: vnet0: <BROADCAST,MULTICAST,UP> mtu 1500 master br0 (web01's NIC)

The libvirt default NAT network (virbr0)

When libvirt starts for the first time it creates a NAT-backed network named default: a bridge virbr0 on the 192.168.122.0/24 subnet with the host at 192.168.122.1 acting as gateway, DNS forwarder (dnsmasq), and DHCP server handing out .2–.254. This is the fallback attachment for any VM created without an explicit network — virt-install, virsh, and every tool in the kldload KVM lab all land on this bridge unless told otherwise.

# Config lives at /etc/libvirt/qemu/networks/default.xml
virsh net-dumpxml default
# <network>
#   <name>default</name>
#   <forward mode='nat'/>
#   <bridge name='virbr0' stp='on' delay='0'/>
#   <ip address='192.168.122.1' netmask='255.255.255.0'>
#     <dhcp><range start='192.168.122.2' end='192.168.122.254'/></dhcp>
#   </ip>
# </network>

# Change the subnet or gateway
virsh net-edit default       # opens the XML in $EDITOR
virsh net-destroy default
virsh net-start default      # re-read the updated config

kldload reserves 192.168.122.0/24 and the 192.168.122.1 gateway for the KVM lab. The kldload TLS CA includes 192.168.122.1 as a Subject Alternative Name on the webui, proxy, Grafana, and ttyd certificates so browsers hitting the lab over that bridge do not see certificate errors. If you change the default network to a different subnet or gateway you must re-issue the kldload service certs so the new address is in the SAN list:

sudo kldload-ca renew webui
sudo systemctl restart kldload-proxy kldload-webui grafana-server ttyd-k9s

Without the re-issue, connecting to https://<new-gateway>:8443/ from inside the lab yields SSLV3_ALERT_CERTIFICATE_UNKNOWN and the browser rejects the cert. The drift check in kldload-tls-cert catches new IPs that appear on the wire but can't read a subnet you've just changed manually — hence the explicit re-issue step.

macvtap mode

macvtap creates a virtual NIC directly attached to a physical NIC, bypassing the bridge. Each VM gets its own MAC address on the physical network. Simpler than a bridge and slightly lower latency, but the host cannot communicate with its own macvtap guests over that NIC. Guest-to-guest traffic on the same host depends on the mode: in bridge mode (used below) guests reach each other directly, while in VEPA mode frames go out to the physical switch and come back only if the switch supports hairpin forwarding.

<!-- macvtap in bridge mode -->
<interface type='direct'>
  <source dev='eno1' mode='bridge'/>
  <model type='virtio'/>
</interface>

SR-IOV

Single Root I/O Virtualization (SR-IOV) creates hardware-level virtual NICs (VFs) from a single physical NIC (PF). Each VF is a real PCIe device that can be passed directly to a VM via VFIO. The VM communicates directly with the NIC hardware — zero host CPU overhead for packet processing. This is how cloud providers achieve line-rate networking for VMs.

# Enable SR-IOV VFs on an Intel NIC
echo 4 > /sys/class/net/eno1/device/sriov_numvfs

# List VFs
lspci | grep "Virtual Function"
# 03:10.0 Ethernet controller: Intel Corporation ... Virtual Function
# 03:10.2 Ethernet controller: Intel Corporation ... Virtual Function
# 03:10.4 Ethernet controller: Intel Corporation ... Virtual Function
# 03:10.6 Ethernet controller: Intel Corporation ... Virtual Function

# Pass a VF to a VM
virsh edit web01

<!-- SR-IOV VF passthrough -->
<interface type='hostdev' managed='yes'>
  <source>
    <address type='pci' domain='0x0000' bus='0x03' slot='0x10' function='0x0'/>
  </source>
  <mac address='52:54:00:aa:bb:01'/>
</interface>
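
Translating an lspci address such as 03:10.0 into the libvirt address element by hand is error-prone when you have dozens of VFs. A minimal sketch of the conversion follows; pci_to_libvirt_address is a hypothetical helper, and it assumes PCI domain 0000 when lspci prints the short form.

```shell
# pci_to_libvirt_address [DDDD:]BB:SS.F
# Prints the libvirt <address> element for a PCI address as shown by lspci.
# Hypothetical helper, for illustration only.
pci_to_libvirt_address() (
  addr=$1
  # Prepend the default PCI domain when given the short BB:SS.F form
  case $addr in
    *:*:*) ;;
    *) addr="0000:$addr" ;;
  esac
  domain=${addr%%:*}
  rest=${addr#*:}
  bus=${rest%%:*}
  rest=${rest#*:}
  slot=${rest%%.*}
  func=${rest#*.}
  printf "<address type='pci' domain='0x%s' bus='0x%s' slot='0x%s' function='0x%s'/>\n" \
    "$domain" "$bus" "$slot" "$func"
)

pci_to_libvirt_address 03:10.0
```

For 03:10.0 this prints the same address line used in the interface definition above.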

WireGuard integration for VMs

# Option 1: WireGuard on the host — VMs route through the host's WireGuard tunnel
# The bridge is on the WireGuard subnet; VMs use the host as gateway
# Simplest approach — VMs don't need to know about WireGuard

# Option 2: WireGuard inside each VM — each VM has its own tunnel
# More secure (end-to-end encryption) but more management overhead

# Option 3: WireGuard on the host with per-VM routing
# Host runs WireGuard, nftables marks packets per VM, policy routing
# sends each VM's traffic through the appropriate tunnel

# Host-side nftables for VM traffic through WireGuard
nft add table ip nat
nft add chain ip nat postrouting '{ type nat hook postrouting priority 100 ; }'
# oifname matches by name, so the rule survives wg0 being torn down and recreated
nft add rule ip nat postrouting oifname "wg0" masquerade

nftables firewall for VMs

# Filter traffic between VMs on the bridge
nft add table bridge filter
nft add chain bridge filter forward '{ type filter hook forward priority 0 ; policy drop ; }'

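# Allow ARP so VMs can resolve each other's MAC addresses (policy is drop)
nft add rule bridge filter forward ether type arp accept

# Allow established connections
nft add rule bridge filter forward ct state established,related accept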

# Allow web01 to reach db01 on PostgreSQL port
nft add rule bridge filter forward \
  ether saddr 52:54:00:aa:bb:01 ether daddr 52:54:00:aa:bb:02 \
  ip daddr 10.0.0.5 tcp dport 5432 accept

# Allow all VMs to reach the gateway
nft add rule bridge filter forward ip daddr 10.0.0.254 accept

# Log and drop everything else
nft add rule bridge filter forward log prefix "VM-DROP: " drop

Most people use bridge mode and stop there. It works, and for a homelab or small deployment it is fine. But for production, you should understand the performance hierarchy: SR-IOV gives you line-rate hardware-accelerated networking with zero host CPU overhead. macvtap with vhost-net gives you near-line-rate with minimal overhead. Bridge mode with vhost-net gives you good performance with maximum flexibility. Plain bridge mode without vhost gives you the worst performance. The jump from bridge-without-vhost to bridge-with-vhost is often 3-5x in throughput — and it is a one-line XML change. Always enable vhost.

10. Live Migration

Live migration moves a running VM from one physical host to another without noticeable downtime. The guest keeps running during the migration and sees only a brief pause (typically 10-100 ms) at the final switchover. This is the foundation of maintenance without downtime: drain a host of its VMs, patch and reboot the host, then migrate the VMs back.

Requirements for live migration

Shared or migrated storage

Both hosts must access the same storage (NFS, iSCSI, Ceph, GlusterFS) or you must migrate the storage alongside the VM. With ZFS, you can use zfs send/receive for storage migration — the VM's zvol is replicated to the target host before or during the migration.

// The disk must be on both hosts. Shared storage makes this automatic. ZFS send makes it possible without shared storage.

Compatible CPU models

The destination host must support all CPU features the guest is using. With host-passthrough, this means identical hardware. With a named model (Cascadelake-Server), any host with at least those features works. Plan your CPU model before you need to migrate.

// Migration fails if the destination CPU cannot do what the source CPU promised the guest.

Network connectivity

The guest's network must work on both hosts — same bridge name, same VLAN, same subnet. libvirt uses a separate TCP connection for the migration data stream (or TLS-encrypted tunnel). A dedicated migration network (10 GbE+) reduces migration time and avoids impacting production traffic.

// When the VM lands on the destination, its NIC must plug into the same network. Plan bridge names accordingly.

Pre-copy vs. post-copy migration

Pre-copy (default): iteratively copies guest memory to the destination while the VM runs. Each round copies pages that changed since the last round. When the remaining dirty pages are small enough, the VM is paused, the final pages are transferred, and the VM resumes on the destination. Works well when the guest's memory write rate is lower than the network transfer rate.

Post-copy: pauses the VM, transfers CPU state and a minimal memory set, resumes on the destination, then faults in remaining pages on demand. Total migration time is shorter but the VM runs slower until all pages are transferred (network page faults). If the network fails mid-migration, the VM is lost — it has pages on both hosts and neither has a complete copy. Use with caution.
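
Whether pre-copy converges comes down to the ratio between the guest's dirty rate and the migration bandwidth. The sketch below uses a deliberately simplified model (constant dirty rate, constant bandwidth, each round re-sends what was dirtied during the previous one) to estimate how many extra rounds are needed before the remaining data fits in the allowed pause. The helper name precopy_rounds and the model itself are assumptions for illustration; QEMU's real heuristics are more involved.

```shell
# precopy_rounds MEM_MIB DIRTY_MIB_S BW_MIB_S MAX_DOWNTIME_MS
# Prints the number of copy rounds before the remainder fits in the
# allowed downtime, or "does not converge" if the guest dirties memory
# at least as fast as the link can move it.
# Hypothetical helper implementing a simplified model.
precopy_rounds() {
  awk -v m="$1" -v d="$2" -v b="$3" -v dt="$4" 'BEGIN {
    if (d >= b) { print "does not converge"; exit }
    limit = b * dt / 1000      # MiB that can be sent during the final pause
    remaining = m
    rounds = 0
    while (remaining > limit) {
      # Time to send `remaining` is remaining/b; the guest dirties d MiB/s meanwhile
      remaining = remaining * d / b
      rounds++
    }
    print rounds
  }'
}

precopy_rounds 8192 45 500 300    # 8 GiB guest, 45 MiB/s dirty, 500 MiB/s link
precopy_rounds 8192 600 500 300   # dirty rate exceeds bandwidth
```

The second case is where post-copy (or throttling the guest) becomes the only way to finish.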

Migration commands

# Basic live migration (shared storage assumed)
virsh migrate --live --persistent --undefinesource \
  web01 qemu+ssh://host2.internal/system

# Live migration with bandwidth limit (MiB/s)
virsh migrate --live --persistent --undefinesource \
  --bandwidth 500 \
  web01 qemu+ssh://host2.internal/system

# Tunneled migration (encrypted, does not require direct QEMU connection; needs --p2p)
virsh migrate --live --persistent --undefinesource \
  --p2p --tunnelled \
  web01 qemu+ssh://host2.internal/system

# Post-copy migration
virsh migrate --live --persistent --undefinesource \
  --postcopy --postcopy-after-precopy \
  web01 qemu+ssh://host2.internal/system

# Monitor migration progress
virsh domjobinfo web01
# Job type:     Unbounded
# Time elapsed: 12453 ms
# Data processed: 4.2 GiB
# Data remaining: 1.8 GiB
# Memory processed: 4.2 GiB
# Memory remaining: 1.8 GiB
# Dirty rate: 45 MiB/s
# Iteration: 3
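
For scripting, the processed and remaining figures can be reduced to a single progress number. A minimal sketch, assuming the "Data processed" and "Data remaining" lines look like the sample above; migration_percent is a hypothetical helper name.

```shell
# migration_percent: reads `virsh domjobinfo` output on stdin and prints
# processed / (processed + remaining) as a whole percentage.
# Hypothetical helper, for illustration only.
migration_percent() {
  awk '
    # Normalise a value and its unit to MiB (unknown units are taken as MiB)
    function to_mib(v, u) {
      if (u == "B")   return v / 1048576
      if (u == "KiB") return v / 1024
      if (u == "GiB") return v * 1024
      if (u == "TiB") return v * 1024 * 1024
      return v
    }
    $1 == "Data" && $2 == "processed:" { proc = to_mib($3, $4) }
    $1 == "Data" && $2 == "remaining:" { rem  = to_mib($3, $4) }
    END { if (proc + rem > 0) printf "%.0f\n", 100 * proc / (proc + rem) }'
}

# Sample values from the listing above; on a real host:
#   virsh domjobinfo web01 | migration_percent
printf 'Data processed: 4.2 GiB\nData remaining: 1.8 GiB\n' | migration_percent
```

During pre-copy the number can move backwards, because pages the guest dirties are added back to the remaining total.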

ZFS send for storage migration (no shared storage)

# When you don't have shared storage, migrate the zvol first, then live-migrate

# Step 1: Create an initial snapshot and send it to the destination
zfs snapshot tank/vms/web01@migrate-base
zfs send tank/vms/web01@migrate-base | ssh host2 zfs receive tank/vms/web01

# Step 2: While the VM is still running, create an incremental snapshot
zfs snapshot tank/vms/web01@migrate-incr1
zfs send -i @migrate-base tank/vms/web01@migrate-incr1 | \
  ssh host2 zfs receive tank/vms/web01

# Step 3: Repeat incremental sends until the delta is small
zfs snapshot tank/vms/web01@migrate-incr2
zfs send -i @migrate-incr1 tank/vms/web01@migrate-incr2 | \
  ssh host2 zfs receive tank/vms/web01

# Step 4: Pause the VM, send the final delta, live-migrate CPU/memory state
virsh suspend web01
zfs snapshot tank/vms/web01@migrate-final
zfs send -i @migrate-incr2 tank/vms/web01@migrate-final | \
  ssh host2 zfs receive tank/vms/web01

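# Step 5: Migrate CPU and memory state (storage is already on host2).
# --offline would copy only the domain definition, so use --live; a suspended
# domain arrives suspended on the destination and is resumed there.
virsh migrate --live --persistent --undefinesource \
  web01 qemu+ssh://host2.internal/system
ssh host2 virsh resume web01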

# Step 6: Clean up migration snapshots
zfs destroy tank/vms/web01@migrate-base
zfs destroy tank/vms/web01@migrate-incr1
zfs destroy tank/vms/web01@migrate-incr2
zfs destroy tank/vms/web01@migrate-final
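
Every incremental round has the same shape: take a new snapshot, then send the difference between the previous snapshot and the new one. The sketch below only prints the commands for one round so you can review them before running anything. print_migration_round is a hypothetical helper, and note that zfs send -i takes the older snapshot as the argument to -i and the full name of the newer snapshot last.

```shell
# print_migration_round DATASET PREV_SNAP NEXT_SNAP DEST_HOST
# Prints (does not run) the snapshot and incremental send for one round.
# Hypothetical helper, for illustration only.
print_migration_round() (
  ds=$1 prev=$2 next=$3 host=$4
  echo "zfs snapshot ${ds}@${next}"
  echo "zfs send -i @${prev} ${ds}@${next} | ssh ${host} zfs receive ${ds}"
)

print_migration_round tank/vms/web01 migrate-base migrate-incr1 host2
```

Printing first and executing later is a cheap safeguard when the dataset on the other end of the pipe is somebody's production disk.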

Live migration without shared storage is the most underappreciated capability of ZFS + KVM. Traditional live migration requires NFS, iSCSI, or Ceph: shared storage infrastructure that adds complexity, cost, and failure domains. With ZFS send/receive, you can migrate a VM between any two hosts that have ZFS, with no shared storage required. The initial send is a full copy, but subsequent incremental sends transfer only the blocks that changed. By the time you do the final pause-and-send, the delta is tiny (seconds of writes). The total downtime is the time to send that final delta plus the time to hand the guest's memory and CPU state to the destination, so it grows with guest RAM and shrinks with a faster migration link. This makes every ZFS host a potential migration target without any shared storage infrastructure.

11. Multi-Host Orchestration

Managing VMs across multiple hosts requires consistent tooling, predictable naming, and automated workflows. The kldload toolkit provides kvm-deploy, kvm-clone, and kvm-delete for this purpose — thin shell wrappers around virsh and ZFS that enforce naming conventions, handle zvol creation, and manage the golden image clone lifecycle.

Fleet patterns

Golden image + clone fleet

Build one golden image per OS/role combination. Seal it. Snapshot it. Clone it to every host that needs VMs of that type. Use cloud-init to differentiate each clone (hostname, IP, role). This is the simplest and most efficient pattern for homogeneous fleets.

// One golden image, N clones. Cloud-init makes each one unique. ZFS makes each clone free.

Packer + deploy pipeline

Use Packer to build golden images from code (HCL template + shell provisioners). Packer boots a VM, runs the provisioners, seals the image, and exports it. Then deploy the image to hosts via zfs send or scp. This adds reproducibility — the golden image is defined in code and can be rebuilt from scratch.

// Packer = "build golden images from code." Version-controlled, reproducible, auditable.

kvm-deploy for ISO-based installs

For hosts where you want a fresh install from the kldload ISO (not a clone), kvm-deploy creates the zvol, generates the virt-install command, boots the ISO, and optionally attaches an answers file for unattended installation.

// kvm-deploy = "install a fresh VM from ISO." For when you need something the golden image does not have.

kvm-clone across hosts

# On the golden image host, export the sealed snapshot
zfs send tank/vms/golden@sealed | ssh host2 zfs receive tank/vms/golden
zfs send tank/vms/golden@sealed | ssh host3 zfs receive tank/vms/golden

# On each host, clone from the local golden image
ssh host2 "kvm-clone golden web01 && kvm-clone golden web02"
ssh host3 "kvm-clone golden web03 && kvm-clone golden web04"

# Or use a simple deployment loop
HOSTS="host2 host3"   # only hosts that already received golden@sealed
VM_PREFIX="web"
CLONES_PER_HOST=4

for host in $HOSTS; do
  for i in $(seq 1 $CLONES_PER_HOST); do
    idx=$(( (${host##host} - 2) * CLONES_PER_HOST + i ))
    vm="${VM_PREFIX}$(printf '%02d' $idx)"
    ssh "$host" "kvm-clone golden $vm && virsh start $vm"
  done
done
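
Before letting a loop like this create VMs, it is worth printing the plan. The sketch below numbers clones by position in the host list instead of parsing digits out of the hostname, so it also works for hosts that are not named hostN. plan_clones is a hypothetical helper; it only prints host and VM name pairs and never calls kvm-clone.

```shell
# plan_clones "HOST1 HOST2 ..." PREFIX CLONES_PER_HOST
# Prints one "host vmname" pair per line with a fleet-wide running index.
# Hypothetical helper, for illustration only.
plan_clones() (
  hosts=$1 prefix=$2 per_host=$3
  idx=0
  for host in $hosts; do
    i=1
    while [ "$i" -le "$per_host" ]; do
      idx=$(( idx + 1 ))
      printf '%s %s%02d\n' "$host" "$prefix" "$idx"
      i=$(( i + 1 ))
    done
  done
)

plan_clones "host2 host3" web 4
```

The output runs from "host2 web01" to "host3 web08" and can be fed to a while-read loop that performs the actual clone once you are happy with it.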

Consistent naming and inventory

# Generate an inventory of all VMs across all hosts
for host in host1 host2 host3; do
  echo "=== $host ==="
  ssh "$host" "virsh list --all --name" | while read -r vm; do
    [ -z "$vm" ] && continue
    # -n keeps the inner ssh from swallowing the rest of the VM list on stdin
    state=$(ssh -n "$host" "virsh domstate '$vm'" 2>/dev/null)
    mem=$(ssh -n "$host" "virsh dominfo '$vm'" 2>/dev/null | awk '/Max memory/{print $3}')
    echo "  $vm  state=$state  memory=${mem}KiB"
  done
done

# Push a configuration change to all VMs on all hosts
for host in host1 host2 host3; do
  for vm in $(ssh "$host" "virsh list --name"); do
    ssh "$host" "virsh setmem '$vm' 4G --config"
  done
done

kvm-delete and orphan cleanup

# kvm-delete handles the full teardown:
# 1. Graceful shutdown (with timeout)
# 2. Force destroy if shutdown fails
# 3. Undefine the domain (including NVRAM)
# 4. Destroy the zvol
# 5. Check if this was the last clone of a snapshot — if so, destroy the orphaned snapshot

kvm-delete web01

# The orphan cleanup is important: when you clone golden@sealed to web01, web02, web03,
# and then delete all three, the snapshot golden@clone-web01 etc. become orphans.
# kvm-delete checks for this and cleans up automatically.

12. Proxmox Integration

Proxmox VE is a Debian-based open-source virtualization platform built on KVM and LXC. It adds a web UI, clustering, HA fencing, backup (Proxmox Backup Server), and an API. When you understand bare KVM (which this masterclass teaches), Proxmox becomes a productivity accelerator rather than a black box. This section covers when to use Proxmox vs. bare KVM and how to integrate kldload workflows with Proxmox's API.

When to use Proxmox vs. bare KVM

Criterion	Bare KVM + libvirt	Proxmox VE
Web UI	Cockpit (optional)	Built-in, production-ready
Clustering	Manual (corosync + scripts)	Built-in (corosync + pmxcfs)
HA fencing	Manual (fence agents)	Built-in (watchdog + STONITH)
Backup	ZFS snapshots + zfs send	Proxmox Backup Server (incremental, dedup)
API	libvirt API + virsh	REST API (pvesh, curl)
Storage	Any (ZFS, LVM, dir)	ZFS, LVM, Ceph, NFS, GlusterFS
LXC containers	Separate (lxc, systemd-nspawn)	Integrated (same UI, same API)
Learning curve	Higher (must understand all layers)	Lower (abstractions handle plumbing)
Control	Total (you own every config file)	Good (but Proxmox opinions sometimes conflict)
Best for	Single hosts, automation-heavy, learning	Multi-host clusters, teams, mixed VM+LXC

API-driven deployment on Proxmox

# Authenticate and get a ticket
DATA=$(curl -s -k -d "username=root@pam&password=yourpassword" \
  https://proxmox.internal:8006/api2/json/access/ticket)
TICKET=$(echo "$DATA" | jq -r '.data.ticket')
CSRF=$(echo "$DATA" | jq -r '.data.CSRFPreventionToken')

# Create a VM via the API
curl -s -k \
  -H "Cookie: PVEAuthCookie=$TICKET" \
  -H "CSRFPreventionToken: $CSRF" \
  -X POST "https://proxmox.internal:8006/api2/json/nodes/pve1/qemu" \
  -d "vmid=200" \
  -d "name=web01" \
  -d "memory=8192" \
  -d "cores=4" \
  -d "cpu=host" \
  -d "machine=q35" \
  -d "scsihw=virtio-scsi-single" \
  -d "scsi0=tank:100,discard=on,iothread=1" \
  -d "net0=virtio,bridge=vmbr0" \
  -d "serial0=socket" \
  -d "ide2=local:iso/kldload-1.0.3.iso,media=cdrom" \
  --data-urlencode "boot=order=ide2;scsi0" \
  -d "tpmstate0=tank:1,version=v2.0" \
  -d "bios=ovmf" \
  -d "efidisk0=tank:1"

# Start the VM
curl -s -k \
  -H "Cookie: PVEAuthCookie=$TICKET" \
  -H "CSRFPreventionToken: $CSRF" \
  -X POST "https://proxmox.internal:8006/api2/json/nodes/pve1/qemu/200/status/start"

# Or use pvesh from the Proxmox host directly
pvesh create /nodes/pve1/qemu/200/status/start

ZFS on Proxmox

# Proxmox natively supports ZFS storage pools
# Create a ZFS pool in the Proxmox UI or CLI
zpool create -f tank mirror /dev/sda /dev/sdb

# Add it as a Proxmox storage backend
pvesm add zfspool tank -pool tank -content images,rootdir -sparse 1

# Proxmox stores VM disks as zvols:
# tank/vm-200-disk-0  (the first disk of VMID 200)

# Snapshots via the Proxmox API create ZFS snapshots underneath
pvesh create /nodes/pve1/qemu/200/snapshot -snapname before-upgrade

# This creates: zfs snapshot tank/vm-200-disk-0@before-upgrade

The dirty secret of Proxmox is that it is just KVM + LXC + corosync + a web UI + a REST API, all running on Debian. The VMs are QEMU processes. The storage is ZFS zvols (or LVM or Ceph). The networking is Linux bridges. Everything this masterclass teaches about KVM, zvols, virtio tuning, CPU pinning, and hugepages applies directly to Proxmox — you just configure it through the web UI or API instead of editing XML files. Understanding bare KVM makes you a better Proxmox operator because when something breaks, you know which layer to look at. The Proxmox web UI is a convenience, not a mystery.

13. GPU Passthrough for VMs

GPU passthrough assigns a physical GPU directly to a VM using VFIO (Virtual Function I/O). The VM gets bare-metal GPU performance — it can run CUDA workloads, machine learning training, video transcoding, or a full desktop compositor at native speed. The host gives up access to the GPU entirely; the VM owns it exclusively.

IOMMU and VFIO setup

# Step 1: Enable IOMMU on the kernel command line (append to the existing parameters)
# Intel:
#   intel_iommu=on iommu=pt

# AMD (the IOMMU is on by default when enabled in firmware):
#   iommu=pt

# 'iommu=pt' enables passthrough mode — devices not assigned to VMs
# use the native DMA path, avoiding IOMMU overhead for host devices.

# Step 2: Verify IOMMU is active
dmesg | grep -i iommu
# DMAR: IOMMU enabled
# DMAR-IR: IOMMU DMAR x enabled

# Step 3: Find the GPU's IOMMU group
for d in /sys/kernel/iommu_groups/*/devices/*; do
  n=$(basename "$d")
  echo "IOMMU Group $(basename $(dirname $(dirname "$d"))): $n $(lspci -nns "$n")"
done | grep -i nvidia
# IOMMU Group 14: 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation ... [10de:2684]
# IOMMU Group 14: 01:00.1 Audio device [0403]: NVIDIA Corporation ... [10de:22ba]

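# Step 4: Bind the GPU to vfio-pci (all devices in the IOMMU group must be bound)
# new_id takes "vendor device" separated by a space; a device already claimed
# by another driver must be unbound from it first
modprobe vfio-pci
echo "10de 2684" > /sys/bus/pci/drivers/vfio-pci/new_id
echo "10de 22ba" > /sys/bus/pci/drivers/vfio-pci/new_id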

# Or permanently via modprobe.d:
echo "options vfio-pci ids=10de:2684,10de:22ba" > /etc/modprobe.d/vfio.conf
echo "vfio-pci" > /etc/modules-load.d/vfio-pci.conf

# Ensure vfio-pci loads before the nvidia driver
echo "softdep nvidia pre: vfio-pci" >> /etc/modprobe.d/vfio.conf
dracut --force
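
After a reboot, the thing to verify is that every device in the GPU's IOMMU group ended up on vfio-pci. A minimal sketch that walks the group in sysfs and reports stragglers follows. unbound_in_group is a hypothetical helper, and the optional second argument (an alternative sysfs root) exists only so the function can be exercised against a test directory.

```shell
# unbound_in_group GROUP [SYSFS_ROOT]
# Prints "ADDRESS DRIVER" for each device in the IOMMU group that is not
# bound to vfio-pci ("none" means no driver is bound at all).
# Hypothetical helper, for illustration only.
unbound_in_group() (
  group=$1
  root=${2:-/sys}
  for dev in "$root/kernel/iommu_groups/$group/devices/"*; do
    [ -e "$dev" ] || continue          # group missing or empty
    driver=none
    if [ -L "$dev/driver" ]; then
      driver=$(basename "$(readlink "$dev/driver")")
    fi
    [ "$driver" = "vfio-pci" ] || echo "$(basename "$dev") $driver"
  done
)

# Prints nothing when group 14 is fully bound to vfio-pci (or does not exist)
unbound_in_group 14
```

PCI bridges that share the group stay on their own driver and will show up in the output; they do not need to be bound to vfio-pci.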

VM configuration for GPU passthrough

<!-- GPU passthrough in libvirt XML -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0' multifunction='on'/>
</hostdev>

<!-- Also pass the GPU's audio device -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x1'/>
</hostdev>

<!-- Hide the hypervisor from the guest (some NVIDIA drivers check for this) -->
<features>
  <kvm>
    <hidden state='on'/>
  </kvm>
</features>

Mediated devices (vGPU)

Instead of passing the entire GPU to one VM, mediated devices (mdev) split a physical GPU into multiple virtual GPUs. Each VM gets a slice of the GPU's compute and VRAM. NVIDIA vGPU requires a commercial license. Intel GVT-g provides free mediated device support for older integrated GPUs (roughly Broadwell through Comet Lake). AMD has experimental SR-IOV support on datacenter GPUs.

# List available mdev types (Intel GVT-g example)
ls /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/
# i915-GVTg_V5_4  — 1 vGPU, 256 MB VRAM
# i915-GVTg_V5_8  — 2 vGPUs, 128 MB each

# Create a mediated device (keep the UUID; the VM XML below needs it)
MDEV_UUID=$(uuidgen)
echo "$MDEV_UUID" > /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/create

# Assign to a VM in libvirt XML
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on'>
  <source>
    <address uuid='the-uuid-you-generated'/>
  </source>
</hostdev>

14. Monitoring & Performance

Monitoring KVM VMs uses the same tools as monitoring any Linux workload — plus some hypervisor-specific tools. Since VMs are processes and vCPUs are threads, standard Linux performance tools work directly. libvirt adds VM-aware statistics, and eBPF can trace hypervisor-specific events.

virt-top

# Real-time VM resource usage (like top for VMs)
virt-top
# ID  Name       State  CPU(s) CPU%  Mem  Mem%  Time   Net   Block
# 1   web01      R      4      12.3  8G   25.0  4:23   eth0  vda
# 2   db01       R      8      45.2  32G  50.0  12:01  eth0  sda,sdb
# 3   web02      R      4      8.1   8G   25.0  2:45   eth0  vda

libvirt statistics

# Per-VM statistics
virsh domstats web01
# Domain: 'web01'
#   state.state=1
#   state.reason=1
#   cpu.time=245678900000
#   cpu.user=123456700000
#   cpu.system=67890100000
#   balloon.current=8388608
#   balloon.maximum=8388608
#   block.count=1
#   block.0.name=vda
#   block.0.rd.reqs=456789
#   block.0.rd.bytes=12345678900
#   block.0.wr.reqs=234567
#   block.0.wr.bytes=6789012300
#   net.count=1
#   net.0.name=vnet0
#   net.0.rx.bytes=9876543210
#   net.0.rx.pkts=12345678
#   net.0.tx.bytes=8765432100
#   net.0.tx.pkts=11234567

# Cumulative CPU time per VM (seconds, not a percentage)
virsh cpu-stats web01 --total
# Total:
#   cpu_time    45.678000000 seconds
#   user_time   23.456000000 seconds
#   system_time 12.345000000 seconds
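
These counters are cumulative, so a usage percentage has to be derived from two samples. A minimal sketch, assuming two readings of cpu.time (nanoseconds, as reported by virsh domstats) taken a known number of seconds apart; cpu_percent is a hypothetical helper name.

```shell
# cpu_percent T1_NS T2_NS INTERVAL_S
# Prints host CPU usage over the interval as a percentage of one core.
# Hypothetical helper, for illustration only.
cpu_percent() {
  awk -v t1="$1" -v t2="$2" -v secs="$3" 'BEGIN {
    printf "%.1f\n", (t2 - t1) / (secs * 1e9) * 100
  }'
}

# Literal sample: the VM consumed 0.5 s of CPU time during a 1 s interval
cpu_percent 245678900000 246178900000 1
```

The result is relative to a single host core, so a busy 4-vCPU guest can legitimately report up to 400.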

eBPF for hypervisor tracing

# Trace KVM VM exits using bpftrace
bpftrace -e '
tracepoint:kvm:kvm_exit {
  @exits[args->exit_reason] = count();
}
interval:s:5 {
  print(@exits);
  clear(@exits);
}
'
# Exit reason numbers are vendor-specific; these are Intel VMX codes:
# exit_reason 48 = EPT violation (guest memory access that needs host handling)
# exit_reason 30 = I/O instruction
# exit_reason 28 = CR access
# exit_reason 1  = external interrupt

# Trace VM exit latency (time spent handling each exit)
bpftrace -e '
tracepoint:kvm:kvm_exit { @start[tid] = nsecs; }
tracepoint:kvm:kvm_entry {
  if (@start[tid]) {
    @exit_latency_ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
  }
}'

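# Count virtio interrupts per queue (run inside the guest, where the
# virtio IRQs are registered under names such as virtio0-input.0)
bpftrace -e '
tracepoint:irq:irq_handler_entry /strncmp(str(args->name), "virtio", 6) == 0/ {
  @[str(args->name)] = count();
}'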

Prometheus integration

# libvirt-exporter exposes VM metrics to Prometheus
# Install
curl -Lo /usr/local/bin/libvirt-exporter \
  https://github.com/prometheus-community/libvirt_exporter/releases/latest/download/libvirt_exporter-linux-amd64
chmod +x /usr/local/bin/libvirt-exporter

# Run as a systemd service
cat > /etc/systemd/system/libvirt-exporter.service <<'EOF'
[Unit]
Description=Prometheus libvirt exporter
After=libvirtd.service

[Service]
ExecStart=/usr/local/bin/libvirt-exporter --web.listen-address=:9177
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl enable --now libvirt-exporter

# Metrics available at http://localhost:9177/metrics
# libvirt_domain_info_cpu_time_seconds_total
# libvirt_domain_info_maximum_memory_bytes
# libvirt_domain_info_memory_usage_bytes
# libvirt_domain_info_virtual_cpus
# libvirt_domain_block_stats_read_bytes_total
# libvirt_domain_block_stats_write_bytes_total
# libvirt_domain_interface_stats_receive_bytes_total
# libvirt_domain_interface_stats_transmit_bytes_total

# Prometheus scrape config
# - job_name: 'libvirt'
#   static_configs:
#     - targets: ['hypervisor1:9177', 'hypervisor2:9177']

ZFS I/O monitoring for VM zvols

# Watch zvol I/O in real time
zpool iostat -v tank 2
#              capacity     operations     bandwidth
# pool        alloc  free   read  write   read  write
# ----------  -----  ----  -----  -----  -----  -----
# tank        120G   880G   1.2K  456    48.5M  18.3M
#   mirror    120G   880G   1.2K  456    48.5M  18.3M
#     sda         -      -  623    228   24.3M  9.15M
#     sdb         -      -  612    228   24.2M  9.15M

# Per-dataset I/O is not shown by zpool iostat (requires zfs_latency kstat or eBPF)
# ARC stats show how well reads are being served from cache
grep -E '^(hits|misses|size)[[:space:]]' /proc/spl/kstat/zfs/arcstats
# hits  4  789012345
# misses 4  12345678
# size  4  34359738368
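
The raw counters are easier to judge as a ratio. As a rough sketch, the hit percentage can be derived from the `hits` and `misses` rows with awk. The function name `arc_hit_ratio` is hypothetical, and the optional file argument exists only so the helper can be run against a saved copy of arcstats.

```shell
# Hypothetical helper: compute the ARC hit ratio (percent) from an arcstats file.
# Defaults to the OpenZFS kstat path used above; pass another file to override.
arc_hit_ratio() {
  awk '
    $1 == "hits"   { hits = $3 }       # column 3 holds the counter value
    $1 == "misses" { misses = $3 }
    END {
      total = hits + misses
      if (total == 0) { print "n/a"; exit }
      printf "%.2f\n", 100 * hits / total
    }' "${1:-/proc/spl/kstat/zfs/arcstats}"
}
```

With the sample counters shown above (789012345 hits, 12345678 misses) this prints 98.46. These are cumulative counters since module load, so sampling twice and comparing the deltas gives a better picture of current behavior.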

The monitoring story for KVM is stronger than that of commercial hypervisors because everything is observable through standard Linux interfaces. VMware gives you vCenter metrics: a curated view of what VMware has chosen to expose. KVM gives you raw access to process-level CPU accounting, per-thread scheduling, block I/O statistics, network counters, and the kernel tracepoints inside the KVM module, which eBPF tools can attach to. You can trace individual VM exits, measure the latency of each exit handler, count EPT violations, and correlate them with guest workload patterns. All of this works with tools that ship with the kernel and your distribution, with no licensed management suite or support contract required.

15. Troubleshooting Reference

This section collects the most common KVM problems and their solutions. When something breaks, start here.

Problem	Symptom	Cause	Solution
VM won't start	`error: internal error: process exited while connecting to monitor`	QEMU crashed on startup — usually bad XML, missing image, or permission issue	Check `/var/log/libvirt/qemu/VMNAME.log` for the QEMU error. Fix the XML or permissions.
No KVM acceleration	`Could not access KVM kernel module: No such file or directory`	KVM modules not loaded or hardware virtualization disabled in BIOS	`modprobe kvm_intel` (or `kvm_amd`). Enable VT-x/AMD-V in BIOS. Check `grep -cE '(vmx|svm)' /proc/cpuinfo`; a result of 0 means the CPU flags are not exposed.
Poor disk performance	High latency, low throughput on virtio-blk	Using `cache=writethrough` or `io=threads` with a zvol	Set `cache=none,io=native` for zvol-backed disks. Add `iothread` for dedicated I/O processing.
VM network unreachable	Guest has no connectivity	Bridge not configured, vnet device not added to bridge, or nftables blocking	`bridge link show` to verify vnet is attached. Check `nft list ruleset` for FORWARD drops.
SELinux blocks zvol access	`Permission denied` opening zvol in QEMU log	SELinux svirt policy does not allow QEMU to open the zvol device	Confirm the denial with `ausearch -m avc -ts recent`, then relabel the zvol so the sVirt policy permits access. Setting `security_driver = "none"` in `/etc/libvirt/qemu.conf` also works but disables sVirt confinement (less secure).
Shared NVRAM conflict	Second clone fails to start with "NVRAM file already in use"	Multiple VMs sharing the same UEFI NVRAM file	Copy the NVRAM file per VM: `cp /usr/share/OVMF/OVMF_VARS.fd /var/lib/libvirt/qemu/nvram/VMNAME_VARS.fd`. kvm-clone does this automatically.
Migration fails	`Unable to resolve address` or `Timed out during operation`	Destination host unreachable, firewall blocking migration port, or DNS resolution failure	Verify SSH connectivity and name resolution between hosts. Open TCP ports 49152-49215 (QEMU migration range). Use `--p2p --tunnelled` to carry migration data over the libvirt connection instead of a direct QEMU connection.
CPU feature mismatch on migration	`unsupported configuration: guest CPU ... is not compatible`	Using `host-passthrough` with different CPU generations	Switch to a named CPU model or `host-model`. Or use `host-passthrough` with `migratable='on'` and matching hardware.
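GPU passthrough fails	`VFIO error: device is not in a valid IOMMU group`	IOMMU not enabled, or GPU shares an IOMMU group with other devices	Enable the IOMMU (VT-d/AMD-Vi) in firmware and add `intel_iommu=on` to the kernel command line on Intel hosts; AMD hosts enable the IOMMU by default once firmware support is on. Use ACS override patch if IOMMU groups are too large (security tradeoff).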
NVIDIA driver detects hypervisor	`Error 43` in Windows guest with NVIDIA GPU	NVIDIA consumer drivers older than 465 refuse to run in a VM	Add `<kvm><hidden state='on'/></kvm>` to the XML features section, or update the guest to driver 465 or later, which supports GeForce passthrough. Datacenter GPUs (Tesla/A100) do not have this restriction.
Hugepage allocation fails	`Cannot allocate memory` on VM start with hugepages	Not enough free hugepages, usually because memory fragmentation prevents allocation at runtime	Allocate hugepages at boot via kernel command line. Check `grep HugePages /proc/meminfo`.
Slow clone boot	Clone takes 2+ minutes to boot	cloud-init waiting for network metadata that does not exist	Configure cloud-init with `datasource_list: [NoCloud, None]` and disable `cloud-init-network` if not using cloud metadata.
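
Two of the checks in the table lend themselves to a quick preflight script. The helpers below are a sketch under stated assumptions: the function names `virt_cpu_count` and `hugepages_fit` are hypothetical, and each accepts an optional file path (defaulting to the real `/proc` file) so it can be exercised against saved copies.

```shell
# Count logical CPUs whose "flags" line advertises VT-x (vmx) or AMD-V (svm).
# Anchoring on "flags" avoids double-counting the separate "vmx flags" line
# that newer kernels print in /proc/cpuinfo.
virt_cpu_count() {
  grep -cE '^flags[[:space:]]*:.*[[:space:]](vmx|svm)([[:space:]]|$)' \
    "${1:-/proc/cpuinfo}"
}

# Succeed (exit 0) if the free hugepages can back a guest of the given size.
# $1 = guest memory in MiB, $2 = optional meminfo path.
hugepages_fit() {
  awk -v need_mib="$1" '
    $1 == "HugePages_Free:" { free = $2 }
    $1 == "Hugepagesize:"   { size_kb = $2 }
    END { exit !(size_kb > 0 && free * size_kb >= need_mib * 1024) }
  ' "${2:-/proc/meminfo}"
}
```

A typical use would be `hugepages_fit 8192 || echo "not enough free hugepages for an 8 GiB guest"` before starting a VM that is configured for hugepage backing.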

Log locations

# QEMU VM logs (most useful for startup failures)
/var/log/libvirt/qemu/VMNAME.log

# libvirt daemon log
journalctl -u libvirtd -f

# KVM kernel messages
dmesg | grep -i kvm

# IOMMU/VFIO messages
dmesg | grep -iE '(iommu|vfio|dmar)'

# ZFS zvol I/O errors
dmesg | grep -i zfs
zpool status -v tank

Performance diagnostics checklist

# 1. Verify KVM acceleration is in use (not TCG emulation)
virsh dumpxml web01 | grep -A2 '<domain'
# type='kvm' means hardware acceleration is active

# 2. Check CPU steal time inside the guest
top
# If %st (steal) is high, the host is overcommitting CPU

# 3. Verify virtio drivers are loaded in the guest
lsmod | grep virtio
# virtio_blk, virtio_net, virtio_balloon should be present

# 4. Check for excessive VM exits
perf kvm stat live
# Shows VM exit reasons and frequency

# 5. Verify disk I/O path
virsh dumpxml web01 | grep -A5 '<disk'
# For zvol-backed disks, expect cache='none' and io='native' on the <driver> line
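
The disk I/O path check can be automated against the `cache=none,io=native` recommendation from the troubleshooting table. This is a minimal sketch: the function name `disk_driver_warnings` is hypothetical, and it assumes the single-quoted, one-element-per-line XML that `virsh dumpxml` prints, matching only `<driver name='qemu' ...>` elements.

```shell
# Hypothetical helper: read domain XML on stdin and print every qemu <driver>
# element that does not combine cache='none' with io='native'.
# Usage: virsh dumpxml web01 | disk_driver_warnings
disk_driver_warnings() {
  grep -E "<driver name='qemu'" \
    | grep -vE "cache='none'.*io='native'|io='native'.*cache='none'" \
    | sed 's/^[[:space:]]*//'
  return 0   # an empty result is not an error
}
```

No output means every matched driver line already uses the recommended settings. Line-based matching is a shortcut; a real XML parser would be more robust against reformatted input.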




Related Pages


  KVM Virtual Machines Tutorial — getting started with KVM on kldload
  Kubernetes on KVM — running a Kubernetes cluster on KVM VMs
  Proxmox & ZFS Tutorial — ZFS integration with Proxmox VE
  Cloud & Packer — building golden images with Packer
  ZFS Masterclass — deep dive into ZFS storage architecture
  GPU & NVIDIA Masterclass — GPU passthrough, CUDA, driver management
  Backplane Networks Masterclass — high-speed interconnects for hypervisor clusters
  eBPF Masterclass — kernel tracing and observability including KVM tracepoints
  ZFS Wiki: KVM + ZFS Hypervisor — zvol tuning reference
  Packer & IaC Masterclass — infrastructure as code for VM images
  Storage & ZFS Platform — kldload storage architecture overview


      