KVM & Hypervisor Masterclass
This guide covers the full stack of Linux-native virtualization: KVM architecture, QEMU device emulation, libvirt management, ZFS zvol storage design, golden image workflows, CPU and memory tuning, virtio devices, networking topologies, live migration, multi-host orchestration, Proxmox integration, GPU passthrough, monitoring, and troubleshooting. By the end you will understand not just how to run VMs on Linux, but why KVM works the way it does — and how to build a production hypervisor that rivals any commercial offering.
The premise: Linux is not a host operating system that runs a hypervisor. Linux is the hypervisor. KVM turns the Linux kernel into a Type 1 hypervisor — every VM is a process, every vCPU is a thread, and the kernel's scheduler, memory manager, and I/O subsystem serve the VMs directly. This means every tool you already know — top, perf, cgroups, nftables, systemd — works on VMs too. No separate management OS. No proprietary vSphere. Just Linux.
What this page covers: KVM kernel module internals, QEMU emulation, libvirt domain management, ZFS zvol storage design for VM disks, golden image build-once-deploy-many workflows, virt-install and virsh lifecycle commands, CPU topology and pinning, hugepages and NUMA binding, virtio device tuning, bridge and SR-IOV networking, live migration strategies, multi-host fleet management with kvm-clone and kvm-deploy, Proxmox API integration, VFIO GPU passthrough, performance monitoring with eBPF and virt-top, and a comprehensive troubleshooting reference.
Prerequisites: a running kldload system with the server or desktop profile. The KVM tutorials assume Intel VT-x or AMD-V hardware. Nested virtualization works for learning but is not suitable for production workloads.
1. KVM Architecture
KVM (Kernel-based Virtual Machine) is a Linux kernel module that turns the kernel into a hypervisor. It was merged into mainline Linux in version 2.6.20 (February 2007) and has been the default virtualization technology for Linux ever since. Understanding its architecture is essential to understanding why Linux virtualization works so well.
The hypervisor classification problem
Traditional hypervisor taxonomy divides the world into Type 1 (bare metal) and Type 2 (hosted). Type 1 hypervisors — VMware ESXi, Microsoft Hyper-V, Xen — run directly on hardware with no host OS underneath. Type 2 hypervisors — VirtualBox, VMware Workstation — run as applications inside a host OS. KVM breaks this taxonomy because it is both: the Linux kernel runs on bare metal (Type 1), and KVM is a kernel module that turns that kernel into a hypervisor. The guest VMs run as regular Linux processes (which sounds Type 2), but they execute directly on the CPU via hardware virtualization extensions (which is Type 1 behavior). The correct answer is that KVM is a Type 1 hypervisor that happens to share its kernel with a general-purpose OS.
The /dev/kvm interface
KVM exposes itself as a character device at /dev/kvm. Userspace programs (QEMU) open this device and use ioctl() calls to create VMs, configure vCPUs, map memory regions, and enter guest mode. The kernel handles VM exits, instruction emulation, and interrupt injection.
Hardware virtualization extensions
Intel VT-x (vmx) and AMD-V (svm) provide hardware support for guest execution. The CPU gains two operating modes: root mode for the host and non-root mode for the guest (Intel calls these VMX root and VMX non-root; AMD calls them host and guest mode). Guests execute natively on the CPU. When a guest does something the hypervisor must intercept (port I/O, access to certain control registers or MSRs), the CPU traps back into root mode. This is a "VM exit", and KVM handles it.
VMs are processes, vCPUs are threads
Each VM is a QEMU process visible in ps. Each vCPU is a thread within that process. The Linux scheduler (CFS, or EEVDF on kernel 6.6 and later) schedules vCPU threads alongside all other processes. This means cgroups, nice, cpuset, and taskset all work on VMs directly.
Memory is mmap'd
Guest RAM is allocated as anonymous memory in the QEMU process address space, typically via mmap(). The kernel's memory manager handles paging, NUMA placement, and hugepage backing. KVM configures EPT (Intel) or NPT (AMD), the hardware-assisted nested page tables, so the translation from guest-physical to host-physical addresses happens in hardware rather than through software shadow page tables.
kvm_intel / kvm_amd modules
The KVM subsystem consists of a core module (kvm) and a CPU-specific module (kvm_intel or kvm_amd). The CPU module handles the hardware-specific VMX/SVM instructions. Load them with modprobe kvm_intel or modprobe kvm_amd. Verify with lsmod | grep kvm.
VM exit and re-entry
When a guest does something the CPU cannot handle in guest mode (I/O port access, an intercepted MSR write, an external interrupt), the CPU performs a VM exit back to KVM. KVM inspects the exit reason, handles it (or delegates to QEMU for device emulation), and re-enters guest mode (VMRESUME on Intel, VMRUN on AMD). Minimizing VM exits is the key to performance.
# Verify KVM is available
grep -cE '(vmx|svm)' /proc/cpuinfo
# > 0 means hardware virtualization is supported
# Load KVM modules
modprobe kvm
modprobe kvm_intel # or kvm_amd
# Verify
lsmod | grep kvm
# kvm_intel 380928 0
# kvm 1130496 1 kvm_intel
# Check /dev/kvm exists
ls -la /dev/kvm
# crw-rw-rw- 1 root kvm 10, 232 ... /dev/kvm
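The check-then-load steps above can be folded into one small helper. This is a minimal sketch: kvm_module_for_cpu is a hypothetical name (not a kldload command), and it assumes the vmx and svm flags appear as whole words in a cpuinfo-style file.

```shell
#!/bin/sh
# Print the KVM CPU module that matches the virtualization flag found in a
# cpuinfo-style file (defaults to /proc/cpuinfo). Hypothetical helper.
kvm_module_for_cpu() {
    cpuinfo="${1:-/proc/cpuinfo}"
    if grep -qw vmx "$cpuinfo"; then
        echo kvm_intel      # Intel VT-x
    elif grep -qw svm "$cpuinfo"; then
        echo kvm_amd        # AMD-V
    else
        echo none           # no hardware virtualization flag present
        return 1
    fi
}
```

On a host this could be used as mod=$(kvm_module_for_cpu) && modprobe "$mod", which avoids hard-coding the vendor module in provisioning scripts.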
2. libvirt & QEMU
KVM provides the hardware virtualization interface. QEMU provides device emulation — the virtual disks, NICs, USB controllers, VGA adapters, and serial ports that the guest OS sees. libvirt sits above both and provides a stable management API: creating VMs, starting/stopping them, managing storage pools, snapshots, and networks. Together they form the standard Linux virtualization stack.
The QEMU layer
QEMU (Quick EMUlator) is a full system emulator that can emulate entire machines in software. When combined with KVM, it delegates CPU execution to the hardware and handles only device emulation. A QEMU process with KVM acceleration runs guest code at near-native speed — the CPU instructions execute directly on the hardware — while QEMU emulates the devices the guest interacts with (disk controllers, network cards, display adapters).
# Direct QEMU invocation (rarely done manually — libvirt generates this)
qemu-system-x86_64 \
-enable-kvm \
-cpu host \
-smp 4,sockets=1,cores=4,threads=1 \
-m 8G \
-drive file=/dev/zvol/tank/vms/web01,format=raw,if=virtio,cache=none \
-netdev bridge,id=net0,br=br0 \
-device virtio-net-pci,netdev=net0,mac=52:54:00:aa:bb:01 \
-vnc :1 \
-daemonize
The libvirt layer
libvirt is a C library and daemon (libvirtd or virtqemud) that provides a uniform API for managing VMs across multiple hypervisors: KVM/QEMU, Xen, LXC, bhyve. It translates high-level operations ("create a VM with 4 CPUs, 8 GB RAM, a virtio disk, and a bridged NIC") into the correct QEMU command-line invocation. It also manages persistent XML domain definitions, storage pools, networks, and secrets.
# Install libvirt on kldload (CentOS/RHEL/Rocky)
dnf install -y libvirt libvirt-daemon-kvm qemu-kvm virt-install virt-top
# Enable and start
systemctl enable --now libvirtd
# Verify the hypervisor connection
virsh uri
# qemu:///system
# List all VMs
virsh list --all
# Dump a VM's XML definition (the source of truth)
virsh dumpxml web01
XML domain definitions
Every VM in libvirt is defined by an XML document that describes its hardware: CPU model, memory, disks, NICs, boot order, console, serial ports, TPM, and more. This XML is the declarative specification of the VM. virsh define registers it, virsh start instantiates it, and virsh dumpxml shows the current running configuration. Understanding the XML is essential: it is the equivalent of a Terraform resource definition for VMs.
<domain type='kvm'>
<name>web01</name>
<uuid>a1b2c3d4-e5f6-7890-abcd-ef1234567890</uuid>
<memory unit='GiB'>8</memory>
<vcpu placement='static'>4</vcpu>
<os>
<type arch='x86_64' machine='q35'>hvm</type>
<boot dev='hd'/>
</os>
<cpu mode='host-passthrough' check='none' migratable='on'/>
<devices>
<emulator>/usr/libexec/qemu-kvm</emulator>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
<source dev='/dev/zvol/tank/vms/web01'/>
<target dev='vda' bus='virtio'/>
</disk>
<interface type='bridge'>
<source bridge='br0'/>
<model type='virtio'/>
<mac address='52:54:00:aa:bb:01'/>
</interface>
<serial type='pty'>
<target port='0'/>
</serial>
<console type='pty'>
<target type='serial' port='0'/>
</console>
<channel type='unix'>
<target type='virtio' name='org.qemu.guest_agent.0'/>
</channel>
<tpm model='tpm-crb'>
<backend type='emulator' version='2.0'/>
</tpm>
</devices>
</domain>
Essential virsh commands
# Define a VM from XML
virsh define web01.xml
# Start / stop / reboot
virsh start web01
virsh shutdown web01 # graceful ACPI shutdown
virsh destroy web01 # force power off (like pulling the plug)
virsh reboot web01
# Autostart on host boot
virsh autostart web01
virsh autostart --disable web01
# Console access (serial)
virsh console web01
# Press Ctrl+] to detach
# Edit the persistent XML (opens $EDITOR); changes apply at the next VM start
virsh edit web01
# Snapshot (for qcow2 disks — zvol snapshots are done via ZFS)
virsh snapshot-create-as web01 --name "before-upgrade" --description "pre-patch"
# List networks and storage pools
virsh net-list --all
virsh pool-list --all
3. ZFS zvol Storage for VMs
The most impactful decision in hypervisor design is the storage backend. The standard choice is qcow2 files on ext4 or XFS. The kldload choice is ZFS zvols — block devices backed by the ZFS storage pool. This section explains why zvols are superior and how to design your storage layout.
Why zvols over qcow2
Block device, not a file
A zvol is a block device (/dev/zvol/tank/vms/web01) that the guest accesses through virtio-blk or virtio-scsi. There is no filesystem layer between QEMU and the storage — no double caching, no filesystem metadata overhead, no fragmentation. QEMU opens the block device with cache=none,io=native and every I/O goes straight to ZFS.
Atomic snapshots in microseconds
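ZFS snapshots are copy-on-write metadata operations, so their cost does not depend on dataset size: snapshotting a 500 GB zvol takes about as long as snapshotting a 5 GB zvol. The snapshot is atomic and crash-consistent, so there is no need to pause writes and no snapshot file chain to manage. For an application-consistent snapshot, flush or freeze the guest filesystems first.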
Instant cloning
zfs clone creates a new zvol that shares all blocks with its parent snapshot. A clone of a 100 GB zvol takes zero additional space and completes instantly. This is the foundation of the golden image workflow — build one image, clone it 100 times, each clone only stores its deltas.
Compression and checksums
ZFS compresses data at the block level with lz4 (default) or zstd. A 100 GB Windows VM zvol might only consume 40 GB on disk. Every block is checksummed — silent data corruption is detected and corrected automatically if the pool has redundancy.
Thin provisioning by default
A zvol with -s (sparse) flag allocates space only as the guest writes to it. A 500 GB zvol might consume only 2 GB until the guest installs an OS. No pre-allocation, no zeroing, no waiting.
ZFS send/receive for migration
zfs send serializes a snapshot (or incremental delta between snapshots) into a byte stream that zfs receive can consume on any ZFS host. This is native and efficient, and repeated incremental sends let you replicate or move VM storage between hosts without shared storage infrastructure.
Storage layout design
# Create a dedicated VM dataset with tuned properties
# (recordsize only affects files stored in tank/vms; zvols use volblocksize)
zfs create -o compression=lz4 \
-o primarycache=metadata \
-o recordsize=64k \
-o redundant_metadata=most \
tank/vms
# Create a zvol for a VM (thin-provisioned)
zfs create -s -V 100G tank/vms/web01
# The zvol appears as a block device
ls -la /dev/zvol/tank/vms/web01
# lrwxrwxrwx 1 root root 11 ... /dev/zvol/tank/vms/web01 -> ../../zd0
# Set volblocksize for VM workloads (set at creation — cannot be changed)
zfs create -s -V 100G -o volblocksize=64k tank/vms/db01
# Snapshot before an upgrade
zfs snapshot tank/vms/web01@before-upgrade
# Rollback if the upgrade fails (shut the VM down first)
zfs rollback tank/vms/web01@before-upgrade
# Clone a golden image
zfs snapshot tank/vms/golden@v1.0
zfs clone tank/vms/golden@v1.0 tank/vms/web02
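Rolling back a zvol underneath a running guest changes blocks the guest still has cached, so it is worth guarding the rollback. The following is a minimal sketch, not a kldload tool: rollback_vm_disk is a hypothetical helper, and it assumes the libvirt domain and its dataset are passed in explicitly.

```shell
#!/bin/sh
# rollback_vm_disk DOMAIN DATASET SNAPSHOT
# Roll the zvol back only when libvirt reports the domain as "shut off".
# Hypothetical helper; virsh and zfs are the real CLIs used in this guide.
rollback_vm_disk() {
    dom="$1"; dataset="$2"; snap="$3"
    state=$(virsh domstate "$dom" 2>/dev/null | head -n 1)
    if [ "$state" != "shut off" ]; then
        echo "refusing rollback: $dom is '$state'" >&2
        return 1
    fi
    zfs rollback "$dataset@$snap"
}
```

For example, rollback_vm_disk web01 tank/vms/web01 before-upgrade does nothing until virsh shutdown web01 has completed.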
volblocksize tuning
| Workload | volblocksize | Rationale |
|---|---|---|
| General-purpose VM | 64k | Matches most filesystem cluster sizes; good compression ratio |
| Database VM (PostgreSQL, MySQL) | 16k | Matches database page size (8k/16k); avoids read-modify-write amplification |
| Windows VM | 64k | NTFS defaults to 4k clusters, so 64k trades some read-modify-write overhead on small writes for better compression and less metadata |
| Large sequential I/O (media) | 128k | Maximizes throughput for sequential reads/writes |
| Random I/O heavy (OLTP) | 8k | Minimizes write amplification for small random writes |
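The "read-modify-write amplification" in the table can be made concrete with a little arithmetic. This sketch uses a simplified model (aligned writes, no compression, no write aggregation) and a hypothetical helper name, zvol_write_amp; real pools will behave differently, so treat the numbers as a guide only.

```shell
#!/bin/sh
# zvol_write_amp IO_KIB VOLBLOCK_KIB
# Simplified model: ZFS rewrites whole volblocks, so a guest write of IO_KIB
# costs ceil(io / vb) * vb KiB on the pool. Prints the ratio written/requested.
zvol_write_amp() {
    awk -v io="$1" -v vb="$2" 'BEGIN {
        blocks = int((io + vb - 1) / vb)      # volblocks touched, rounded up
        printf "%.2f\n", (blocks * vb) / io
    }'
}

zvol_write_amp 8 64    # 8 KiB database page on a 64k zvol: prints 8.00
zvol_write_amp 8 8     # same page on an 8k zvol: prints 1.00
```

Under this model an 8 KiB page written to a 64k zvol rewrites eight times the data, which is why the database rows in the table use a smaller volblocksize.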
4. Golden Image Workflow
The golden image pattern is the foundation of scalable VM deployment: build one perfectly configured VM, seal it for cloning, snapshot it, and clone it to create new VMs. Each clone inherits the entire disk state of the golden image but stores only its own deltas. This is how cloud providers deploy VMs — and with ZFS, you can do it in seconds on a single host.
The build-seal-clone pipeline
Step 1 — Build the golden image
Install the OS into a zvol using kldload's installer, Packer, or manual virt-install. Configure everything that should be identical across all clones: base packages, security hardening, monitoring agents, WireGuard keys (template), SSH config, kernel parameters, ZFS inside the guest if needed.
Step 2 — Seal for cloning
Remove machine-specific identity: clear /etc/machine-id, delete SSH host keys, remove persistent network rules, enable cloud-init or a first-boot script. The kldload function k_seal_image_for_clone() does this automatically. A sealed image boots with a new identity every time.
Step 3 — Snapshot and clone
Snapshot the sealed zvol, then clone it for each new VM. The clone is instant and zero-copy. Boot the clone and cloud-init (or a first-boot script) assigns a new hostname, generates SSH host keys, configures networking, and sets the machine-id.
Sealing a golden image
# Inside the golden image VM, before shutdown:
# Clear machine-id (systemd regenerates on boot)
truncate -s 0 /etc/machine-id
rm -f /var/lib/dbus/machine-id
# Remove SSH host keys (sshd-keygen.service regenerates on boot)
rm -f /etc/ssh/ssh_host_*
# Remove persistent network rules
rm -f /etc/udev/rules.d/70-persistent-net.rules
rm -f /etc/NetworkManager/system-connections/*.nmconnection
# Clean dnf/apt cache
dnf clean all 2>/dev/null || apt clean 2>/dev/null
# Remove shell history
rm -f /root/.bash_history /home/*/.bash_history
# Enable cloud-init for identity assignment
systemctl enable cloud-init cloud-init-local cloud-config cloud-final
# Shutdown cleanly
poweroff
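Before snapshotting, it helps to confirm the seal actually took. The sketch below checks only the two items that most often cause duplicate identities (machine-id and SSH host keys); check_sealed is a hypothetical helper and is not the kldload k_seal_image_for_clone() function.

```shell
#!/bin/sh
# check_sealed [ROOT]
# Report leftovers that would give every clone the same identity.
# ROOT is the root of the golden image filesystem (defaults to /).
check_sealed() {
    root="${1:-/}"
    rc=0
    # systemd regenerates machine-id only when the file is empty or missing
    if [ -s "$root/etc/machine-id" ]; then
        echo "machine-id is not empty"
        rc=1
    fi
    # any surviving host key would be shared by all clones
    for key in "$root"/etc/ssh/ssh_host_*; do
        if [ -e "$key" ]; then
            echo "SSH host key present: $key"
            rc=1
        fi
    done
    return "$rc"
}
```

Run check_sealed inside the guest just before poweroff; a non-zero exit status means the image is not ready to clone.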
Clone deployment with kvm-clone
# kldload's kvm-clone automates the clone workflow
# It snapshots the source, clones the zvol, creates a new VM definition,
# and assigns unique identity
# Clone the golden image to create 4 web servers
for i in $(seq 1 4); do
kvm-clone golden web0${i}
done
# Each clone gets:
# - A new zvol (tank/vms/web0N) cloned from tank/vms/golden@clone-web0N
# - A new libvirt domain with unique UUID and MAC address
# - A new UEFI NVRAM file (copied, not shared — avoids SELinux conflicts)
# - cloud-init generates unique SSH keys, hostname, machine-id on first boot
# Start all clones
for i in $(seq 1 4); do
virsh start web0${i}
done
# Verify they're running
virsh list
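The internals of kvm-clone are not shown in this guide, but the "unique MAC address" step can be illustrated. This is a sketch under assumptions: gen_kvm_mac is a hypothetical helper, and it uses the 52:54:00 prefix seen in this guide's examples with three random bytes from /dev/urandom.

```shell
#!/bin/sh
# gen_kvm_mac
# Print a MAC address with the 52:54:00 prefix followed by three random bytes.
# Hypothetical helper; it does not check for collisions with existing domains.
gen_kvm_mac() {
    # od emits three random bytes as hex; strip spaces and newlines
    hex=$(od -An -N3 -tx1 /dev/urandom | tr -d ' \n')
    # insert colons between the three byte pairs
    suffix=$(echo "$hex" | sed 's/\(..\)\(..\)\(..\)/\1:\2:\3/')
    printf '52:54:00:%s\n' "$suffix"
}
```

With only 24 random bits, collisions are unlikely but possible in a large fleet, so a real tool would compare the result against virsh domiflist output before using it.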
Cloud-init configuration for clones
# /etc/cloud/cloud.cfg.d/99-kldload.cfg
# This configuration handles identity assignment on clone boot
datasource_list: [NoCloud, ConfigDrive, None]
# Generate new SSH host keys
ssh_deletekeys: true
ssh_genkeytypes: [ed25519, ecdsa, rsa]
# Set hostname from instance metadata or DHCP
preserve_hostname: false
# Disable phone-home to cloud providers
phone_home:
url: ""
# Run custom first-boot scripts
runcmd:
- systemd-machine-id-setup
- hostnamectl set-hostname $(cat /etc/hostname)
- systemctl restart systemd-journald
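With NoCloud first in datasource_list, each clone needs a small seed that carries its identity. The sketch below writes the two files the NoCloud datasource reads; write_nocloud_seed is a hypothetical helper, and the instance-id format is an arbitrary choice made for this example.

```shell
#!/bin/sh
# write_nocloud_seed DIR HOSTNAME
# Create meta-data and user-data for the cloud-init NoCloud datasource.
write_nocloud_seed() {
    dir="$1"; host="$2"
    mkdir -p "$dir"
    # A fresh instance-id makes cloud-init treat the clone as a first boot.
    printf 'instance-id: %s-%s\nlocal-hostname: %s\n' \
        "$host" "$(date +%s)" "$host" > "$dir/meta-data"
    printf '#cloud-config\npreserve_hostname: false\n' > "$dir/user-data"
}

# Packaging step (assumes genisoimage is installed; NoCloud looks for the
# volume label "cidata"):
#   genisoimage -output seed.iso -volid cidata -joliet -rock user-data meta-data
```

The resulting ISO can be attached to the clone as a CD-ROM so cloud-init picks up the hostname on first boot.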
5. VM Creation & Lifecycle
This section covers the practical commands for creating, running, and managing VMs throughout their lifecycle — from initial creation through daily operations to eventual decommissioning.
Creating a VM with virt-install
# Create a zvol first
zfs create -s -V 100G -o volblocksize=64k tank/vms/web01
# Install from ISO using virt-install
virt-install \
--name web01 \
--ram 8192 \
--vcpus 4 \
--cpu host-passthrough \
--machine q35 \
--os-variant centos-stream9 \
--disk /dev/zvol/tank/vms/web01,bus=virtio,cache=none,io=native,discard=unmap \
--cdrom /root/kldload-1.0.3.iso \
--network bridge=br0,model=virtio \
--graphics vnc,listen=0.0.0.0,port=5901 \
--serial pty \
--console pty,target_type=serial \
--tpm backend.type=emulator,backend.version=2.0,model=tpm-crb \
--boot uefi \
--autostart \
--noautoconsole
# Install from kldload ISO with unattended answers file
virt-install \
--name web01 \
--ram 8192 \
--vcpus 4 \
--cpu host-passthrough \
--machine q35 \
--os-variant centos-stream9 \
--disk /dev/zvol/tank/vms/web01,bus=virtio,cache=none,io=native \
--cdrom /root/kldload-1.0.3.iso \
--disk /root/answers.iso,device=cdrom \
--network bridge=br0,model=virtio \
--serial pty \
--console pty,target_type=serial \
--boot uefi \
--noautoconsole
Lifecycle management commands
# --- Power management ---
virsh start web01 # Power on
virsh shutdown web01 # Graceful ACPI shutdown
virsh destroy web01 # Force power off (immediate)
virsh reboot web01 # Graceful reboot
virsh reset web01 # Hard reset (like pressing reset button)
# --- State management ---
virsh suspend web01 # Pause vCPUs (VM frozen in RAM)
virsh resume web01 # Unpause
virsh save web01 /tmp/web01.state # Save VM state to file (like hibernate)
virsh restore /tmp/web01.state # Restore from saved state
# --- Configuration ---
virsh autostart web01 # Start on host boot
virsh autostart --disable web01 # Disable autostart
virsh setmem web01 16G --live # Raise the balloon target (up to the domain's maximum memory; needs the guest balloon driver)
virsh setvcpus web01 8 --live # Hot-add vCPUs (up to the domain's maximum vCPU count; needs guest kernel support)
# --- Information ---
virsh dominfo web01 # Basic info (state, memory, CPUs)
virsh domblklist web01 # List block devices
virsh domiflist web01 # List network interfaces
virsh domstats web01 # Detailed statistics
virsh vcpuinfo web01 # vCPU-to-pCPU mapping
virsh dumpxml web01 # Full XML definition
# --- Console access ---
virsh console web01 # Serial console (Ctrl+] to detach)
virt-viewer web01 # Graphical console (SPICE/VNC)
# --- Deletion ---
virsh shutdown web01 # Shutdown first
virsh undefine web01 --nvram # Remove VM definition and NVRAM
zfs destroy tank/vms/web01 # Destroy the zvol
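virsh shutdown only sends the ACPI request and returns immediately, so the deletion steps above can race with a guest that is still shutting down. A minimal sketch of a wait loop follows; wait_for_shutoff is a hypothetical helper and the 120-second default is an arbitrary choice.

```shell
#!/bin/sh
# wait_for_shutoff DOMAIN [TIMEOUT_SECONDS]
# Poll libvirt once per second until the domain state is "shut off".
# Returns 0 on success, 1 if the timeout expires first.
wait_for_shutoff() {
    dom="$1"; timeout="${2:-120}"; waited=0
    while [ "$waited" -lt "$timeout" ]; do
        state=$(virsh domstate "$dom" 2>/dev/null | head -n 1)
        [ "$state" = "shut off" ] && return 0
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}

# Typical use before removing a VM:
#   virsh shutdown web01
#   wait_for_shutoff web01 120 || virsh destroy web01
#   virsh undefine web01 --nvram
```

Falling back to virsh destroy after the timeout is a policy decision; for VMs with important state you may prefer to stop and investigate instead.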
Attach and detach devices at runtime
# Hot-add a second disk
zfs create -s -V 50G tank/vms/web01-data
virsh attach-disk web01 /dev/zvol/tank/vms/web01-data vdb \
--driver qemu --subdriver raw --cache none --persistent
# Hot-add a network interface
virsh attach-interface web01 bridge br1 \
--model virtio --persistent
# Hot-remove a network interface
virsh detach-interface web01 bridge --mac 52:54:00:xx:xx:xx --persistent
# Attach a USB device from the host
virsh attach-device web01 /dev/stdin --live <<'EOF'
<hostdev mode='subsystem' type='usb'>
<source>
<vendor id='0x1234'/>
<product id='0x5678'/>
</source>
</hostdev>
EOF
6. CPU Topology & Pinning
CPU configuration is one of the highest-impact tuning areas for KVM performance. The wrong CPU model or topology can halve your VM's performance. The right configuration — with proper pinning and NUMA awareness — can match bare-metal speeds.
CPU models
host-passthrough
Exposes the exact host CPU to the guest, including all features, model name, and microcode level. Gives the best performance because the guest can use every CPU instruction the hardware supports. The tradeoff: VMs cannot be live-migrated to hosts with different CPU models.
host-model
libvirt reads the host CPU and selects the closest QEMU CPU model, adjusting feature flags to match. Slightly less performant than host-passthrough (some obscure instructions might be hidden) but allows migration between hosts with similar CPUs.
Named models (Cascadelake-Server, EPYC-Rome)
A specific CPU model defined in QEMU's CPU database. The guest sees exactly those features, regardless of the host hardware. Useful for migration across heterogeneous hardware — as long as every host is at least as capable as the named model.
CPU pinning
By default, the Linux scheduler can run a VM's vCPU threads on any physical CPU. This means vCPUs can bounce between cores, losing L1/L2 cache state on every migration. CPU pinning binds each vCPU to a specific physical core, eliminating cache thrashing and providing deterministic performance.
# View host CPU topology
lscpu -e
# CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
# 0 0 0 0 0:0:0:0 yes
# 1 0 0 1 1:1:1:0 yes
# 2 0 0 2 2:2:2:0 yes
# ...
# Pin vCPUs in the VM's XML definition
virsh edit web01
<!-- CPU pinning configuration -->
<vcpu placement='static'>4</vcpu>
<cputune>
<!-- Pin each vCPU to a specific physical core -->
<vcpupin vcpu='0' cpuset='4'/>
<vcpupin vcpu='1' cpuset='5'/>
<vcpupin vcpu='2' cpuset='6'/>
<vcpupin vcpu='3' cpuset='7'/>
<!-- Pin QEMU emulator threads to cores not used by vCPUs -->
<emulatorpin cpuset='0-1'/>
<!-- Pin I/O threads (requires <iothreads>2</iothreads> at the domain level) -->
<iothreadpin iothread='1' cpuset='2'/>
<iothreadpin iothread='2' cpuset='3'/>
</cputune>
<!-- Expose correct topology to the guest -->
<cpu mode='host-passthrough' check='none' migratable='on'>
<topology sockets='1' dies='1' cores='4' threads='1'/>
</cpu>
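For VMs with many vCPUs, writing the vcpupin elements by hand is error-prone. The sketch below generates them for a run of consecutive host cores; gen_vcpupin is a hypothetical helper, and it assumes consecutive core numbers are what you want (check lscpu -e for SMT siblings first).

```shell
#!/bin/sh
# gen_vcpupin FIRST_CORE COUNT
# Print one <vcpupin> element per vCPU, mapping vCPU n to host core FIRST_CORE + n.
gen_vcpupin() {
    first="$1"; count="$2"; n=0
    while [ "$n" -lt "$count" ]; do
        printf "<vcpupin vcpu='%d' cpuset='%d'/>\n" "$n" "$((first + n))"
        n=$((n + 1))
    done
}

gen_vcpupin 4 4   # reproduces the four vcpupin lines from the example above
```

The output can be pasted into the cputune block during virsh edit. For a running VM, virsh vcpupin web01 0 4 --live applies a single pin without editing XML.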
NUMA topology
On multi-socket systems, each CPU socket has its own memory controller and local memory. Accessing memory attached to the local socket (local access) is fast. Accessing memory on a remote socket (remote access) crosses the interconnect (QPI/UPI for Intel, Infinity Fabric for AMD) and is 1.5-2x slower. For VM performance, all of a VM's vCPUs and memory should live on the same NUMA node.
# View NUMA topology
numactl --hardware
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
# node 0 size: 65536 MB
# node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
# node 1 size: 65536 MB
# node distances:
# node 0 1
# 0: 10 21
# 1: 21 10
<!-- NUMA-aware VM configuration -->
<vcpu placement='static'>8</vcpu>
<cputune>
<!-- Pin all vCPUs to NUMA node 0 -->
<vcpupin vcpu='0' cpuset='0'/>
<vcpupin vcpu='1' cpuset='1'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='3'/>
<vcpupin vcpu='4' cpuset='4'/>
<vcpupin vcpu='5' cpuset='5'/>
<vcpupin vcpu='6' cpuset='6'/>
<vcpupin vcpu='7' cpuset='7'/>
<emulatorpin cpuset='16,17'/>
</cputune>
<!-- Bind memory to NUMA node 0 -->
<numatune>
<memory mode='strict' nodeset='0'/>
</numatune>
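Before pinning, it is worth confirming that every core in a cpuset really belongs to the node named in numatune. This sketch parses numactl --hardware output from stdin; numa_node_of_cpu is a hypothetical helper and assumes the "node N cpus:" line format shown above.

```shell
#!/bin/sh
# numa_node_of_cpu CPU
# Read `numactl --hardware` output on stdin and print the node whose
# "cpus:" list contains CPU. Exits non-zero if the CPU is not listed.
numa_node_of_cpu() {
    awk -v cpu="$1" '
        $1 == "node" && $3 == "cpus:" {
            for (i = 4; i <= NF; i++)
                if ($i == cpu) { print $2; found = 1; exit }
        }
        END { if (!found) exit 1 }
    '
}

# Usage on a host:
#   numactl --hardware | numa_node_of_cpu 17
```

With the topology shown above, CPUs 16 and 17 used for emulatorpin resolve to node 0, the same node as the vCPUs and the memory binding.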
7. Memory — Hugepages, Ballooning & KSM
Memory configuration has an outsized impact on VM performance because the translation lookaside buffer (TLB) — the CPU's cache for virtual-to-physical address translations — is small and expensive. With standard 4K pages, an 8 GB VM has 2 million pages. The TLB cannot cache all those translations, so TLB misses cause expensive page table walks. Hugepages solve this by using larger pages (2 MB or 1 GB), reducing the number of translations by 512x or 262,144x.
Hugepage configuration
# Check current hugepage status
cat /proc/meminfo | grep -i huge
# HugePages_Total: 0
# HugePages_Free: 0
# Hugepagesize: 2048 kB
# Allocate 2 MB hugepages at boot (add to kernel command line)
# For 64 GB of hugepages: 64 * 1024 / 2 = 32768 pages
hugepagesz=2M hugepages=32768
# Or allocate at runtime (may fail due to fragmentation)
echo 32768 > /proc/sys/vm/nr_hugepages
# For 1 GB hugepages, allocate at boot (runtime allocation usually fails because of memory fragmentation)
hugepagesz=1G hugepages=64
# Allocates 64 x 1 GB = 64 GB of hugepages
# Verify allocation
cat /proc/meminfo | grep HugePages_
# HugePages_Total: 32768
# HugePages_Free: 32768
# Make persistent via sysctl
echo "vm.nr_hugepages = 32768" >> /etc/sysctl.d/90-hugepages.conf
sysctl -p /etc/sysctl.d/90-hugepages.conf
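The page counts used above follow from simple division, which a small helper makes repeatable for other sizes. hugepages_needed is a hypothetical name; the sketch rounds up so a partial page still gets a full hugepage.

```shell
#!/bin/sh
# hugepages_needed MEM_GIB PAGE_KIB
# Number of pages of PAGE_KIB needed to cover MEM_GIB of memory (rounded up).
hugepages_needed() {
    mem_kib=$(( $1 * 1024 * 1024 ))
    echo $(( (mem_kib + $2 - 1) / $2 ))
}

hugepages_needed 64 2048       # 2 MB pages for 64 GB: prints 32768
hugepages_needed 64 1048576    # 1 GB pages for 64 GB: prints 64
hugepages_needed 8 4           # 4K pages in an 8 GB VM: prints 2097152
```

The last call reproduces the "2 million pages" figure for an 8 GB VM, which is the translation count hugepages are meant to shrink.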
VM configuration for hugepages
<!-- Enable hugepages for a VM -->
<memoryBacking>
<hugepages>
<page size='2048' unit='KiB'/>
</hugepages>
<locked/> <!-- Prevent host from swapping VM memory -->
<nosharepages/> <!-- Disable KSM for this VM (optional) -->
</memoryBacking>
<!-- For 1 GB hugepages -->
<memoryBacking>
<hugepages>
<page size='1048576' unit='KiB'/>
</hugepages>
<locked/>
</memoryBacking>
Memory ballooning
The virtio-balloon device lets the host reclaim memory from a guest dynamically. The balloon driver inside the guest "inflates" (allocates memory the guest cannot use) to return pages to the host, or "deflates" to give memory back to the guest. This allows overcommit — running VMs whose total configured memory exceeds physical RAM — as long as they don't all use their full allocation simultaneously.
# Check balloon status
virsh dommemstat web01
# actual 8388608 (8 GB allocated)
# rss 4194304 (4 GB actually used by host)
# usable 3145728 (3 GB free inside guest)
# Set balloon target (reduce available memory to 4 GB)
virsh setmem web01 4G --live
# Restore to full allocation
virsh setmem web01 8G --live
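Overcommit is easier to reason about as a ratio of configured guest memory to host RAM. The sketch below computes it from plain numbers; overcommit_ratio is a hypothetical helper, and the VM sizes in the example are made up for illustration.

```shell
#!/bin/sh
# overcommit_ratio HOST_GIB VM_GIB...
# Sum of configured guest memory divided by physical host memory.
overcommit_ratio() {
    host="$1"; shift
    total=0
    for vm in "$@"; do
        total=$((total + vm))
    done
    awk -v t="$total" -v h="$host" 'BEGIN { printf "%.2f\n", t / h }'
}

overcommit_ratio 64 8 8 16 16 32   # 80 GiB configured on a 64 GiB host: prints 1.25
```

A ratio above 1.00 means the host relies on ballooning (or swap) if every guest uses its full allocation at once.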
Kernel Same-page Merging (KSM)
KSM is a kernel feature that scans memory pages across processes and merges identical pages into a single copy-on-write page. For VMs running the same OS, KSM can save significant memory — the kernel code pages, library pages, and zero pages are identical across VMs and get merged. The tradeoff is CPU overhead for the scanning and a side-channel attack surface (KSM timing can leak information about guest memory contents).
# Enable KSM
echo 1 > /sys/kernel/mm/ksm/run
# Tune scanning aggressiveness
echo 200 > /sys/kernel/mm/ksm/sleep_millisecs # Lower = more aggressive
echo 1000 > /sys/kernel/mm/ksm/pages_to_scan # Pages per scan cycle
# Check KSM statistics
cat /sys/kernel/mm/ksm/pages_shared # Merged pages currently in use (one copy each)
cat /sys/kernel/mm/ksm/pages_sharing # Additional mappings that reuse those pages
# Memory saved ≈ pages_sharing * 4096 bytes (with 4 KiB pages)
# Disable KSM (recommended for security-sensitive environments)
echo 0 > /sys/kernel/mm/ksm/run
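The kernel's KSM documentation describes pages_sharing as the number of additional mappings that share merged pages, that is, how much is saved. The sketch below turns that counter into MiB; ksm_saved_mib is a hypothetical helper, and the optional arguments exist only so it can be pointed at a different directory or page size.

```shell
#!/bin/sh
# ksm_saved_mib [KSM_DIR] [PAGE_BYTES]
# Approximate memory saved by KSM, in MiB: pages_sharing * page size.
ksm_saved_mib() {
    dir="${1:-/sys/kernel/mm/ksm}"
    page="${2:-$(getconf PAGESIZE)}"
    sharing=$(cat "$dir/pages_sharing")
    echo $(( sharing * page / 1024 / 1024 ))
}

# Usage on a host with KSM running:
#   ksm_saved_mib
```

Comparing this figure with the CPU time consumed by the ksmd kernel thread is a reasonable way to decide whether KSM is worth leaving on.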
| Feature | Best For | Tradeoff |
|---|---|---|
| 2 MB hugepages | Most VM workloads | Cannot be swapped; must pre-allocate |
| 1 GB hugepages | Large VMs (64+ GB), database workloads | Must allocate at boot; very coarse granularity |
| Ballooning | Overcommit; dynamic memory sharing | Incompatible with hugepages; adds latency on inflate |
| KSM | Many identical VMs (VDI, testing) | CPU overhead; side-channel risk; incompatible with hugepages |
| Memory locking | Latency-sensitive VMs | Prevents host from reclaiming; must have enough physical RAM |
8. Virtio Devices
Virtio is a standardized interface for paravirtualized devices in KVM. Instead of emulating real hardware (which requires the guest to use a driver designed for physical hardware, and QEMU to translate those I/O operations), virtio defines a simple, efficient interface that both the host and guest understand natively. The result is substantially better performance: as a rough guide, virtio-blk can be 2-5x faster than emulated IDE and virtio-net 3-10x faster than emulated e1000, depending on the workload.
virtio-blk
A simple, fast block device. The guest sees /dev/vda. Single-queue by default but supports multiqueue. Lower overhead than virtio-scsi. Best for VMs that need 1-2 disks with maximum throughput. Does not support SCSI commands (no sg_* tools, no SCSI reservations).
virtio-scsi
A full SCSI controller emulation over virtio transport. Supports hundreds of disks per controller (vs. limited PCI slots for virtio-blk), SCSI commands, persistent reservations, UNMAP/TRIM, and device hotplug. Slightly more overhead than virtio-blk. Best for VMs with many disks or SCSI-dependent workloads.
virtio-net
Paravirtualized network interface. Supports multiqueue (one queue per vCPU), checksum offload, TCP segmentation offload, and vhost-net kernel acceleration. With vhost-net, network packets bypass QEMU entirely — they go from the guest directly to the host kernel's network stack.
virtio-fs (virtiofs)
Shared filesystem between host and guest. The guest kernel speaks the FUSE protocol over a virtio transport to a virtiofsd daemon on the host. Much faster than 9p or NFS for host-guest file sharing. Where supported, DAX (direct access) maps the host page cache into the guest and avoids data copies. Ideal for development workflows where the guest needs access to host files.
vhost-user
Moves device emulation out of QEMU entirely, into a separate userspace process. Used primarily with DPDK for high-performance networking: the vhost-user backend process handles packets directly in userspace with kernel bypass. Can reach line-rate 10/25/100 GbE performance on suitable hardware.
Multiqueue
Both virtio-blk and virtio-net support multiqueue — multiple submission/completion queue pairs that can be processed in parallel by different vCPUs. Without multiqueue, all I/O funnels through a single queue, creating a bottleneck on multi-vCPU VMs. Enable it: queues=N where N equals the number of vCPUs.
Optimal disk configuration
<!-- virtio-blk with multiqueue and a dedicated I/O thread -->
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native'
discard='unmap' iothread='1' queues='4'/>
<source dev='/dev/zvol/tank/vms/web01'/>
<target dev='vda' bus='virtio'/>
</disk>
<!-- <iothreads> is a child of <domain>, not of <devices> -->
<iothreads>2</iothreads>
<!-- virtio-scsi with multiqueue for many-disk VMs -->
<controller type='scsi' model='virtio-scsi'>
<driver queues='4' iothread='1'/>
</controller>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
<source dev='/dev/zvol/tank/vms/db01'/>
<target dev='sda' bus='scsi'/>
</disk>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
<source dev='/dev/zvol/tank/vms/db01-log'/>
<target dev='sdb' bus='scsi'/>
</disk>
Optimal network configuration
<!-- virtio-net with multiqueue and vhost -->
<interface type='bridge'>
<source bridge='br0'/>
<model type='virtio'/>
<driver name='vhost' queues='4'/>
<mac address='52:54:00:aa:bb:01'/>
</interface>
# Inside the guest: enable multiqueue on the NIC
ethtool -L eth0 combined 4
# Verify multiqueue is active
ethtool -l eth0
# Current hardware settings:
# Combined: 4
| Device | Use Case | cache | io | Notes |
|---|---|---|---|---|
| virtio-blk on zvol | General VM disk | none | native | Best performance; add discard=unmap for TRIM |
| virtio-scsi on zvol | Database VM, many disks | none | native | Use iothread for each controller |
| virtio-blk on qcow2 | Snapshot-capable (non-ZFS) | writeback | threads | Only if you must use qcow2 |
| virtio-net + vhost | All VMs | n/a | n/a | Always enable vhost; multiqueue = vCPU count |
9. Networking for VMs
VM networking on Linux is built on the same kernel primitives as container networking: bridges, virtual interfaces (tap devices for VMs, veth pairs for containers), macvtap, OVS, and nftables. The difference is that VMs present virtio-net (or e1000) NICs to the guest, and the backend connects to a host-side network device. This section covers the major networking topologies for KVM.
Bridge mode (standard)
A Linux bridge is a Layer 2 switch implemented in the kernel. VMs connect their virtio-net backends to the bridge. The bridge forwards frames between VMs and between VMs and the physical network. This is the default and most common networking mode.
# Create a bridge with NetworkManager
nmcli connection add type bridge ifname br0 con-name br0 \
ipv4.method manual ipv4.addresses 10.0.0.1/24 ipv4.gateway 10.0.0.254 \
ipv4.dns 10.0.0.1
# Enslave the physical NIC to the bridge
nmcli connection add type bridge-slave ifname eno1 master br0
# Bring up the bridge
nmcli connection up br0
# Verify
bridge link show
# 2: eno1: <BROADCAST,MULTICAST,UP> mtu 1500 master br0
# 5: vnet0: <BROADCAST,MULTICAST,UP> mtu 1500 master br0 (web01's NIC)
The libvirt default NAT network (virbr0)
When libvirt starts for the first time it creates a NAT-backed network named default: a bridge virbr0 on the 192.168.122.0/24 subnet with the host at 192.168.122.1 acting as gateway, DNS forwarder (dnsmasq), and DHCP server handing out .2–.254. This is the fallback attachment for any VM created without an explicit network: virt-install, virsh, and every tool in the kldload KVM lab all land on this bridge unless told otherwise.
# Config lives at /etc/libvirt/qemu/networks/default.xml
virsh net-dumpxml default
# <network>
# <name>default</name>
# <forward mode='nat'/>
# <bridge name='virbr0' stp='on' delay='0'/>
# <ip address='192.168.122.1' netmask='255.255.255.0'>
# <dhcp><range start='192.168.122.2' end='192.168.122.254'/></dhcp>
# </ip>
# </network>
# Change the subnet or gateway
virsh net-edit default # opens the XML in $EDITOR
virsh net-destroy default
virsh net-start default # re-read the updated config
kldload reserves 192.168.122.0/24 and the
192.168.122.1 gateway for the KVM lab. The kldload TLS CA
includes 192.168.122.1 as a Subject Alternative Name on the webui, proxy,
Grafana, and ttyd certificates so browsers hitting the lab over that bridge do not
see certificate errors. If you change the default network to a different subnet or
gateway you must re-issue the kldload service certs so the new address is in the SAN
list:
sudo kldload-ca renew webui
sudo systemctl restart kldload-proxy kldload-webui grafana-server ttyd-k9s
Without the re-issue, connecting to https://<new-gateway>:8443/
from inside the lab yields SSLV3_ALERT_CERTIFICATE_UNKNOWN and the browser
rejects the cert. The drift check in kldload-tls-cert catches new IPs
that appear on the wire but can't read a subnet you've just changed manually —
hence the explicit re-issue step.
macvtap mode
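macvtap creates a virtual NIC directly attached to a physical NIC, bypassing the bridge. Each VM gets its own MAC address on the physical network. It is simpler than a bridge and has slightly lower latency, but the host cannot communicate with its own macvtap guests over that NIC. Whether guests on the same host can reach each other depends on the mode: in bridge mode (shown below) they can; in VEPA mode their traffic goes out to the physical switch, which must support hairpin forwarding or it drops frames between MACs on the same port.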
<!-- macvtap in bridge mode -->
<interface type='direct'>
<source dev='eno1' mode='bridge'/>
<model type='virtio'/>
</interface>
SR-IOV
Single Root I/O Virtualization (SR-IOV) creates hardware-level virtual NICs (VFs) from a single physical NIC (PF). Each VF is a real PCIe device that can be passed directly to a VM via VFIO. The VM communicates directly with the NIC hardware, so the host CPU does almost no per-packet work. This is one of the ways cloud providers achieve line-rate networking for VMs.
# Enable SR-IOV VFs on an Intel NIC
echo 4 > /sys/class/net/eno1/device/sriov_numvfs
# List VFs
lspci | grep "Virtual Function"
# 03:10.0 Ethernet controller: Intel Corporation ... Virtual Function
# 03:10.2 Ethernet controller: Intel Corporation ... Virtual Function
# 03:10.4 Ethernet controller: Intel Corporation ... Virtual Function
# 03:10.6 Ethernet controller: Intel Corporation ... Virtual Function
# Pass a VF to a VM
virsh edit web01
<!-- SR-IOV VF passthrough -->
<interface type='hostdev' managed='yes'>
<source>
<address type='pci' domain='0x0000' bus='0x03' slot='0x10' function='0x0'/>
</source>
<mac address='52:54:00:aa:bb:01'/>
</interface>
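Translating the lspci address of a VF into the attribute form that libvirt expects is easy to get wrong by hand. The sketch below does the conversion; the helper name pci_to_libvirt_address is hypothetical, and it assumes PCI domain 0000 when the address has no domain prefix.

```shell
# pci_to_libvirt_address ADDR: convert "03:10.2" or "0000:03:10.2" into a
# libvirt <address> element (hypothetical helper).
pci_to_libvirt_address() {
    addr=$1
    case $addr in
        *:*:*) ;;                  # already has a domain prefix
        *) addr="0000:$addr" ;;    # assumption: domain 0000 when omitted
    esac
    domain=${addr%%:*}             # "0000"
    rest=${addr#*:}                # "03:10.2"
    bus=${rest%%:*}                # "03"
    devfn=${rest#*:}               # "10.2"
    slot=${devfn%.*}               # "10"
    func=${devfn#*.}               # "2"
    printf "<address type='pci' domain='0x%s' bus='0x%s' slot='0x%s' function='0x%s'/>\n" \
        "$domain" "$bus" "$slot" "$func"
}
```

For example, pci_to_libvirt_address 03:10.2 produces the source address for the second VF in the listing above.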
WireGuard integration for VMs
# Option 1: WireGuard on the host — VMs route through the host's WireGuard tunnel
# The bridge is on the WireGuard subnet; VMs use the host as gateway
# Simplest approach — VMs don't need to know about WireGuard
# Option 2: WireGuard inside each VM — each VM has its own tunnel
# More secure (end-to-end encryption) but more management overhead
# Option 3: WireGuard on the host with per-VM routing
# Host runs WireGuard, nftables marks packets per VM, policy routing
# sends each VM's traffic through the appropriate tunnel
# Host-side nftables for VM traffic through WireGuard
nft add table ip nat
nft add chain ip nat postrouting '{ type nat hook postrouting priority 100 ; }'
# oifname matches by name, so the rule survives wg0 being torn down and recreated
nft add rule ip nat postrouting oifname wg0 masquerade
nftables firewall for VMs
# Filter traffic between VMs on the bridge
nft add table bridge filter
nft add chain bridge filter forward '{ type filter hook forward priority 0 ; policy drop ; }'
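# Allow ARP so VMs can resolve each other and the gateway (policy drop would block it)
nft add rule bridge filter forward ether type arp accept
# Allow established connections
nft add rule bridge filter forward ct state established,related accept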
# Allow web01 to reach db01 on PostgreSQL port
nft add rule bridge filter forward \
ether saddr 52:54:00:aa:bb:01 ether daddr 52:54:00:aa:bb:02 \
ip daddr 10.0.0.5 tcp dport 5432 accept
# Allow all VMs to reach the gateway
nft add rule bridge filter forward ip daddr 10.0.0.254 accept
# Log and drop everything else (the inner double quotes must reach nft, so wrap them in single quotes)
nft add rule bridge filter forward log prefix '"VM-DROP: "' drop
10. Live Migration
Live migration moves a running VM from one physical host to another with no perceptible downtime. The guest keeps running during the migration and sees only a brief pause (typically 10-100 ms) at the final switchover. This is the foundation of maintenance without downtime: drain a host of its VMs, patch and reboot the host, then migrate the VMs back.
Requirements for live migration
Shared or migrated storage
Both hosts must access the same storage (NFS, iSCSI, Ceph, GlusterFS) or you must migrate the storage alongside the VM. With ZFS, you can use zfs send/receive for storage migration — the VM's zvol is replicated to the target host before or during the migration.
Compatible CPU models
The destination host must support all CPU features the guest is using. With host-passthrough, this means identical hardware. With a named model (Cascadelake-Server), any host with at least those features works. Plan your CPU model before you need to migrate.
Network connectivity
The guest's network must work on both hosts: same bridge name, same VLAN, same subnet. By default the migration data stream uses a separate TCP connection between the two hosts; it can instead be tunnelled through the libvirt connection or encrypted with TLS. A dedicated migration network (10 GbE+) reduces migration time and avoids impacting production traffic.
Pre-copy vs. post-copy migration
Pre-copy (default): iteratively copies guest memory to the destination while the VM runs. Each round copies pages that changed since the last round. When the remaining dirty pages are small enough, the VM is paused, the final pages are transferred, and the VM resumes on the destination. Works well when the guest's memory write rate is lower than the network transfer rate.
Post-copy: pauses the VM, transfers CPU state and a minimal memory set, resumes on the destination, then faults in remaining pages on demand. Total migration time is shorter but the VM runs slower until all pages are transferred (network page faults). If the network fails mid-migration, the VM is lost — it has pages on both hosts and neither has a complete copy. Use with caution.
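Whether pre-copy converges comes down to simple arithmetic: each round sends what was left over from the previous round, and meanwhile the guest dirties a fraction dirty_rate / bandwidth of that amount. The sketch below applies that simplified model. The function name and the example figures are hypothetical, and QEMU's real algorithm works from a downtime limit rather than a fixed size threshold.

```shell
# precopy_rounds MEM_MIB BW_MIBS DIRTY_MIBS THRESHOLD_MIB
# Simplified model: remaining(n+1) = remaining(n) * dirty / bandwidth.
# Prints the number of rounds until the remainder is at or below the threshold.
precopy_rounds() {
    remaining=$1
    bw=$2
    dirty=$3
    threshold=$4
    if [ "$dirty" -ge "$bw" ]; then
        echo "no-convergence"      # guest dirties memory as fast as it is sent
        return 1
    fi
    rounds=0
    while [ "$remaining" -gt "$threshold" ]; do
        remaining=$(( remaining * dirty / bw ))   # integer MiB left after this round
        rounds=$(( rounds + 1 ))
    done
    echo "$rounds"
}
```

With 8192 MiB of guest memory, a 1000 MiB/s link, a 100 MiB/s dirty rate and a 64 MiB threshold, the model needs three rounds. If the dirty rate meets or exceeds the link speed it reports no-convergence, which is the case where post-copy or a bandwidth increase is worth considering.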
Migration commands
# Basic live migration (shared storage assumed)
virsh migrate --live --persistent --undefinesource \
web01 qemu+ssh://host2.internal/system
# Live migration with bandwidth limit (MiB/s)
virsh migrate --live --persistent --undefinesource \
--bandwidth 500 \
web01 qemu+ssh://host2.internal/system
# Tunneled migration (encrypted, does not require direct QEMU connection;
# --tunnelled only works together with --p2p)
virsh migrate --live --persistent --undefinesource \
--p2p --tunnelled \
web01 qemu+ssh://host2.internal/system
# Post-copy migration
virsh migrate --live --persistent --undefinesource \
--postcopy --postcopy-after-precopy \
web01 qemu+ssh://host2.internal/system
# Monitor migration progress
virsh domjobinfo web01
# Job type: Unbounded
# Time elapsed: 12453 ms
# Data processed: 4.2 GiB
# Data remaining: 1.8 GiB
# Memory processed: 4.2 GiB
# Memory remaining: 1.8 GiB
# Dirty rate: 45 MiB/s
# Iteration: 3
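The domjobinfo fields can be turned into a single progress figure for scripts or dashboards. This is a sketch only: the helper name migration_progress is hypothetical, and it assumes the "Memory processed" and "Memory remaining" lines end in a value followed by a unit, as in the sample output above.

```shell
# migration_progress: read `virsh domjobinfo` output on stdin and print the
# percentage of guest memory already transferred (hypothetical helper).
migration_progress() {
    awk '
        function to_mib(value, unit) {
            if (unit == "B")   return value / 1048576
            if (unit == "KiB") return value / 1024
            if (unit == "GiB") return value * 1024
            if (unit == "TiB") return value * 1048576
            return value                      # treat anything else as MiB
        }
        /Memory processed:/ { sent = to_mib($(NF - 1), $NF) }
        /Memory remaining:/ { left = to_mib($(NF - 1), $NF) }
        END {
            if (sent + left > 0)
                printf "%.0f%%\n", 100 * sent / (sent + left)
        }'
}
```

On a host, run it as: virsh domjobinfo web01 | migration_progress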
ZFS send for storage migration (no shared storage)
# When you don't have shared storage, migrate the zvol first, then live-migrate
# Step 1: Create an initial snapshot and send it to the destination
zfs snapshot tank/vms/web01@migrate-base
zfs send tank/vms/web01@migrate-base | ssh host2 zfs receive tank/vms/web01
# Step 2: While the VM is still running, create an incremental snapshot
zfs snapshot tank/vms/web01@migrate-incr1
zfs send -i @migrate-base tank/vms/web01@migrate-incr1 | \
ssh host2 zfs receive tank/vms/web01
# Step 3: Repeat incremental sends until the delta is small
zfs snapshot tank/vms/web01@migrate-incr2
zfs send -i @migrate-incr1 tank/vms/web01@migrate-incr2 | \
ssh host2 zfs receive tank/vms/web01
# Step 4: Pause the VM so the disk stops changing, then send the final delta
virsh suspend web01
zfs snapshot tank/vms/web01@migrate-final
zfs send -i @migrate-incr2 tank/vms/web01@migrate-final | \
ssh host2 zfs receive tank/vms/web01
# Step 5: Stop the paused source copy and move the domain definition to host2.
# --offline transfers only the definition, not CPU or memory state, so the
# guest cold-boots on the destination from a crash-consistent disk.
virsh destroy web01
virsh migrate --offline --persistent web01 qemu+ssh://host2.internal/system
# Then start on destination:
ssh host2 virsh start web01
# Step 6: Clean up migration snapshots
zfs destroy tank/vms/web01@migrate-base
zfs destroy tank/vms/web01@migrate-incr1
zfs destroy tank/vms/web01@migrate-incr2
zfs destroy tank/vms/web01@migrate-final
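Step 3 above is repeated until the delta is small, so it helps to generate each round rather than type it. The sketch below prints the two commands for one incremental round without running them; the helper name incr_round is hypothetical, and the snapshot names are whatever you pass in.

```shell
# incr_round DATASET PREV NEW DEST_HOST: print (do not run) the commands for one
# incremental replication round (hypothetical dry-run helper).
incr_round() {
    ds=$1
    prev=$2
    new=$3
    host=$4
    printf 'zfs snapshot %s@%s\n' "$ds" "$new"
    # -i takes the older snapshot, followed by the full name of the newer one
    printf 'zfs send -i @%s %s@%s | ssh %s zfs receive %s\n' \
        "$prev" "$ds" "$new" "$host" "$ds"
}
```

Review the output first, then pipe it to sh when it looks right: incr_round tank/vms/web01 migrate-incr2 migrate-incr3 host2 | sh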
11. Multi-Host Orchestration
Managing VMs across multiple hosts requires consistent tooling, predictable naming,
and automated workflows. The kldload toolkit provides kvm-deploy, kvm-clone,
and kvm-delete for this purpose — thin shell wrappers around virsh and ZFS that
enforce naming conventions, handle zvol creation, and manage the golden image
clone lifecycle.
Fleet patterns
Golden image + clone fleet
Build one golden image per OS/role combination. Seal it. Snapshot it. Clone it to every host that needs VMs of that type. Use cloud-init to differentiate each clone (hostname, IP, role). This is the simplest and most efficient pattern for homogeneous fleets.
Packer + deploy pipeline
Use Packer to build golden images from code (HCL template + shell provisioners). Packer boots a VM, runs the provisioners, seals the image, and exports it. Then deploy the image to hosts via zfs send or scp. This adds reproducibility — the golden image is defined in code and can be rebuilt from scratch.
kvm-deploy for ISO-based installs
For hosts where you want a fresh install from the kldload ISO (not a clone), kvm-deploy creates the zvol, generates the virt-install command, boots the ISO, and optionally attaches an answers file for unattended installation.
kvm-clone across hosts
# On the golden image host, export the sealed snapshot
zfs send tank/vms/golden@sealed | ssh host2 zfs receive tank/vms/golden
zfs send tank/vms/golden@sealed | ssh host3 zfs receive tank/vms/golden
# On each host, clone from the local golden image
ssh host2 "kvm-clone golden web01 && kvm-clone golden web02"
ssh host3 "kvm-clone golden web03 && kvm-clone golden web04"
# Or use a simple deployment loop
HOSTS="host2 host3 host4"
VM_PREFIX="web"
CLONES_PER_HOST=4
for host in $HOSTS; do
for i in $(seq 1 $CLONES_PER_HOST); do
idx=$(( (${host##host} - 2) * CLONES_PER_HOST + i ))
vm="${VM_PREFIX}$(printf '%02d' $idx)"
ssh "$host" "kvm-clone golden $vm && virsh start $vm"
done
done
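The index arithmetic in the loop above silently assumes hosts are named hostN and that numbering starts at host2. Pulling it into a function makes that assumption explicit and testable. This is a sketch: clone_name is a hypothetical helper, and it assumes the host number has no leading zeros.

```shell
# clone_name PREFIX HOST CLONES_PER_HOST I [FIRST_HOST_NUM]
# Print a fleet-wide unique VM name such as "web06" (hypothetical helper).
clone_name() {
    prefix=$1
    host=$2
    per_host=$3
    i=$4
    first=${5:-2}                   # assumption: the first clone host is host2
    num=${host##*[!0-9]}            # trailing digits of the host name
    idx=$(( (num - first) * per_host + i ))
    printf '%s%02d\n' "$prefix" "$idx"
}
```

Inside the loop this would replace the idx and vm lines: vm=$(clone_name "$VM_PREFIX" "$host" "$CLONES_PER_HOST" "$i")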
Consistent naming and inventory
# Generate an inventory of all VMs across all hosts
for host in host1 host2 host3; do
echo "=== $host ==="
ssh "$host" "virsh list --all --name" | while read -r vm; do
[ -z "$vm" ] && continue
# -n stops the inner ssh from swallowing the rest of the VM list on stdin
state=$(ssh -n "$host" "virsh domstate '$vm'" 2>/dev/null)
mem=$(ssh -n "$host" "virsh dominfo '$vm'" 2>/dev/null | awk '/Max memory/{print $3}')
echo " $vm state=$state memory=${mem}KiB"
done
done
# Push a configuration change to all VMs on all hosts
for host in host1 host2 host3; do
for vm in $(ssh "$host" "virsh list --name"); do
ssh "$host" "virsh setmem '$vm' 4G --config"
done
done
kvm-delete and orphan cleanup
# kvm-delete handles the full teardown:
# 1. Graceful shutdown (with timeout)
# 2. Force destroy if shutdown fails
# 3. Undefine the domain (including NVRAM)
# 4. Destroy the zvol
# 5. Check if this was the last clone of a snapshot — if so, destroy the orphaned snapshot
kvm-delete web01
# The orphan cleanup is important: when you clone golden@sealed to web01, web02, web03,
# and then delete all three, the snapshot golden@clone-web01 etc. become orphans.
# kvm-delete checks for this and cleans up automatically.
12. Proxmox Integration
Proxmox VE is a Debian-based open-source virtualization platform built on KVM and LXC. It adds a web UI, clustering, HA fencing, backup (Proxmox Backup Server), and an API. When you understand bare KVM (which this masterclass teaches), Proxmox becomes a productivity accelerator rather than a black box. This section covers when to use Proxmox vs. bare KVM and how to integrate kldload workflows with Proxmox's API.
When to use Proxmox vs. bare KVM
| Criterion | Bare KVM + libvirt | Proxmox VE |
|---|---|---|
| Web UI | Cockpit (optional) | Built-in, production-ready |
| Clustering | Manual (corosync + scripts) | Built-in (corosync + pmxcfs) |
| HA fencing | Manual (fence agents) | Built-in (watchdog + STONITH) |
| Backup | ZFS snapshots + zfs send | Proxmox Backup Server (incremental, dedup) |
| API | libvirt API + virsh | REST API (pvesh, curl) |
| Storage | Any (ZFS, LVM, dir) | ZFS, LVM, Ceph, NFS, GlusterFS |
| LXC containers | Separate (lxc, systemd-nspawn) | Integrated (same UI, same API) |
| Learning curve | Higher (must understand all layers) | Lower (abstractions handle plumbing) |
| Control | Total (you own every config file) | Good (but Proxmox opinions sometimes conflict) |
| Best for | Single hosts, automation-heavy, learning | Multi-host clusters, teams, mixed VM+LXC |
API-driven deployment on Proxmox
# Authenticate and get a ticket
DATA=$(curl -s -k -d "username=root@pam&password=yourpassword" \
https://proxmox.internal:8006/api2/json/access/ticket)
TICKET=$(echo "$DATA" | jq -r '.data.ticket')
CSRF=$(echo "$DATA" | jq -r '.data.CSRFPreventionToken')
# Create a VM via the API
curl -s -k \
-H "Cookie: PVEAuthCookie=$TICKET" \
-H "CSRFPreventionToken: $CSRF" \
-X POST "https://proxmox.internal:8006/api2/json/nodes/pve1/qemu" \
-d "vmid=200" \
-d "name=web01" \
-d "memory=8192" \
-d "cores=4" \
-d "cpu=host" \
-d "machine=q35" \
-d "scsihw=virtio-scsi-single" \
-d "scsi0=tank:100,discard=on,iothread=1" \
-d "net0=virtio,bridge=vmbr0" \
-d "serial0=socket" \
-d "ide2=local:iso/kldload-1.0.3.iso,media=cdrom" \
--data-urlencode "boot=order=ide2;scsi0" \
-d "tpmstate0=tank:1,version=v2.0" \
-d "bios=ovmf" \
-d "efidisk0=tank:1"
# Start the VM
curl -s -k \
-H "Cookie: PVEAuthCookie=$TICKET" \
-H "CSRFPreventionToken: $CSRF" \
-X POST "https://proxmox.internal:8006/api2/json/nodes/pve1/qemu/200/status/start"
# Or use pvesh from the Proxmox host directly
pvesh create /nodes/pve1/qemu/200/status/start
ZFS on Proxmox
# Proxmox natively supports ZFS storage pools
# Create a ZFS pool in the Proxmox UI or CLI
zpool create -f tank mirror /dev/sda /dev/sdb
# Add it as a Proxmox storage backend
pvesm add zfspool tank -pool tank -content images,rootdir -sparse 1
# Proxmox stores VM disks as zvols:
# tank/vm-200-disk-0 (the first disk of VMID 200)
# Snapshots via the Proxmox API create ZFS snapshots underneath
pvesh create /nodes/pve1/qemu/200/snapshot -snapname before-upgrade
# This creates: zfs snapshot tank/vm-200-disk-0@before-upgrade
13. GPU Passthrough for VMs
GPU passthrough assigns a physical GPU directly to a VM using VFIO (Virtual Function I/O). The VM gets bare-metal GPU performance — it can run CUDA workloads, machine learning training, video transcoding, or a full desktop compositor at native speed. The host gives up access to the GPU entirely; the VM owns it exclusively.
IOMMU and VFIO setup
# Step 1: Enable IOMMU in the kernel command line
# Intel:
intel_iommu=on iommu=pt
# AMD:
amd_iommu=on iommu=pt
# 'iommu=pt' enables passthrough mode — devices not assigned to VMs
# use the native DMA path, avoiding IOMMU overhead for host devices.
# Step 2: Verify IOMMU is active
dmesg | grep -i iommu
# DMAR: IOMMU enabled
# DMAR-IR: IOMMU DMAR x enabled
# Step 3: Find the GPU's IOMMU group
for d in /sys/kernel/iommu_groups/*/devices/*; do
n=$(basename "$d")
echo "IOMMU Group $(basename $(dirname $(dirname "$d"))): $n $(lspci -nns "$n")"
done | grep -i nvidia
# IOMMU Group 14: 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation ... [10de:2684]
# IOMMU Group 14: 01:00.1 Audio device [0403]: NVIDIA Corporation ... [10de:22ba]
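# Step 4: Bind the GPU to vfio-pci (all devices in the IOMMU group must be bound)
# new_id takes "vendor device" separated by a space, not the colon form; a device
# already claimed by another driver must be unbound from it first
echo "10de 2684" > /sys/bus/pci/drivers/vfio-pci/new_id
echo "10de 22ba" > /sys/bus/pci/drivers/vfio-pci/new_id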
# Or permanently via modprobe.d:
echo "options vfio-pci ids=10de:2684,10de:22ba" > /etc/modprobe.d/vfio.conf
echo "vfio-pci" > /etc/modules-load.d/vfio-pci.conf
# Ensure vfio-pci loads before the nvidia driver
echo "softdep nvidia pre: vfio-pci" >> /etc/modprobe.d/vfio.conf
dracut --force
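Because every device in the GPU's IOMMU group has to be bound to vfio-pci, it is worth listing the whole group rather than only the lines that match nvidia. The sketch below filters the Step 3 listing by group number; the helper name iommu_group_members is hypothetical, and it assumes the "IOMMU Group N: address ..." line format produced by that loop.

```shell
# iommu_group_members GROUP: read the Step 3 listing (without the grep filter)
# on stdin and print the PCI address of every device in GROUP.
iommu_group_members() {
    awk -v g="$1" '$1 == "IOMMU" && $2 == "Group" && $3 == (g ":") { print $4 }'
}
```

If the output for the GPU's group contains anything besides the GPU and its audio function, those devices would be handed to the VM as well.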
VM configuration for GPU passthrough
<!-- GPU passthrough in libvirt XML -->
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
</source>
<address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
</hostdev>
<!-- Also pass the GPU's audio device -->
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
</source>
<address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x1'/>
</hostdev>
<!-- Hide the hypervisor from the guest (some NVIDIA drivers check for this) -->
<features>
<kvm>
<hidden state='on'/>
</kvm>
</features>
Mediated devices (vGPU)
Instead of passing the entire GPU to one VM, mediated devices (mdev) split a physical GPU into multiple virtual GPUs. Each VM gets a slice of the GPU's compute and VRAM. NVIDIA vGPU requires a commercial license. Intel GVT-g provides free mediated device support for integrated GPUs on the generations that support it (roughly Broadwell through Comet Lake). AMD has experimental SR-IOV support on datacenter GPUs.
# List available mdev types (Intel GVT-g example)
ls /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/
# i915-GVTg_V5_4 — 1 vGPU, 256 MB VRAM
# i915-GVTg_V5_8 — 2 vGPUs, 128 MB each
# Create a mediated device (keep the UUID; the libvirt XML below needs it)
UUID=$(uuidgen)
echo "$UUID" > /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/create
# Assign to a VM in libvirt XML
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on'>
<source>
<address uuid='the-uuid-you-generated'/>
</source>
</hostdev>
14. Monitoring & Performance
Monitoring KVM VMs uses the same tools as monitoring any Linux workload — plus some hypervisor-specific tools. Since VMs are processes and vCPUs are threads, standard Linux performance tools work directly. libvirt adds VM-aware statistics, and eBPF can trace hypervisor-specific events.
virt-top
# Real-time VM resource usage (like top for VMs)
virt-top
# ID Name State CPU(s) CPU% Mem Mem% Time Net Block
# 1 web01 R 4 12.3 8G 25.0 4:23 eth0 vda
# 2 db01 R 8 45.2 32G 50.0 12:01 eth0 sda,sdb
# 3 web02 R 4 8.1 8G 25.0 2:45 eth0 vda
libvirt statistics
# Per-VM statistics
virsh domstats web01
# Domain: 'web01'
# state.state=1
# state.reason=1
# cpu.time=245678900000
# cpu.user=123456700000
# cpu.system=67890100000
# balloon.current=8388608
# balloon.maximum=8388608
# block.count=1
# block.0.name=vda
# block.0.rd.reqs=456789
# block.0.rd.bytes=12345678900
# block.0.wr.reqs=234567
# block.0.wr.bytes=6789012300
# net.count=1
# net.0.name=vnet0
# net.0.rx.bytes=9876543210
# net.0.rx.pkts=12345678
# net.0.tx.bytes=8765432100
# net.0.tx.pkts=11234567
# Cumulative CPU time per host CPU (seconds, not a percentage)
virsh cpu-stats web01
# CPU0:
# cpu_time 45.678000000 seconds
# user_time 23.456000000 seconds
# system_time 12.345000000 seconds
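Both domstats and cpu-stats report cumulative CPU time, so a utilization percentage has to be derived from two samples. The sketch below does the arithmetic on two cpu.time values in nanoseconds; the helper name cpu_percent is hypothetical, and the sample values in the comment are made up.

```shell
# cpu_percent T0_NS T1_NS INTERVAL_S: percentage of one host CPU consumed between
# two cpu.time samples (nanoseconds) taken INTERVAL_S seconds apart.
# A guest with several busy vCPUs can exceed 100.
cpu_percent() {
    awk -v t0="$1" -v t1="$2" -v dt="$3" \
        'BEGIN { printf "%.1f\n", (t1 - t0) / (dt * 1e9) * 100 }'
}

# Example sampling on a host (not executed here):
#   t0=$(virsh domstats --cpu-total web01 | awk -F= '/cpu.time/ {print $2}')
#   sleep 5
#   t1=$(virsh domstats --cpu-total web01 | awk -F= '/cpu.time/ {print $2}')
#   cpu_percent "$t0" "$t1" 5
```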
eBPF for hypervisor tracing
# Trace KVM VM exits using bpftrace
bpftrace -e '
tracepoint:kvm:kvm_exit {
@exits[args->exit_reason] = count();
}
interval:s:5 {
print(@exits);
clear(@exits);
}
'
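# Exit reason numbers below are Intel VMX codes (AMD SVM uses different values)
# exit_reason 48 = EPT violation (guest-physical page fault; emulated MMIO usually shows up as 49, EPT misconfiguration)
# exit_reason 30 = I/O instruction
# exit_reason 28 = CR access
# exit_reason 1 = external interrupt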
# Trace VM exit latency (time spent handling each exit)
bpftrace -e '
tracepoint:kvm:kvm_exit { @start[tid] = nsecs; }
tracepoint:kvm:kvm_entry {
if (@start[tid]) {
@exit_latency_ns = hist(nsecs - @start[tid]);
delete(@start[tid]);
}
}'
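# Count virtio interrupts by IRQ name (run inside the guest; IRQ names look like virtio0-input.0,
# so match on the prefix rather than the exact string)
bpftrace -e '
tracepoint:irq:irq_handler_entry /strncmp(str(args->name), "virtio", 6) == 0/ {
@[str(args->name)] = count();
}'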
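The @exits map printed by the first script is keyed by number, which is hard to read at a glance. A small lookup covers the codes that usually dominate. This is a sketch: vmx_exit_name is a hypothetical helper, the table is deliberately partial, and the numbers are Intel VMX basic exit reasons only.

```shell
# vmx_exit_name CODE: map an Intel VMX basic exit reason to a short name.
# Partial table; anything not listed is reported as UNKNOWN(code).
vmx_exit_name() {
    case $1 in
        1)  echo "EXTERNAL_INTERRUPT" ;;
        10) echo "CPUID" ;;
        12) echo "HLT" ;;
        28) echo "CR_ACCESS" ;;
        30) echo "IO_INSTRUCTION" ;;
        31) echo "MSR_READ" ;;
        32) echo "MSR_WRITE" ;;
        48) echo "EPT_VIOLATION" ;;
        49) echo "EPT_MISCONFIG" ;;
        *)  echo "UNKNOWN($1)" ;;
    esac
}
```

A guest that mostly sits idle tends to show HLT near the top; a guest doing heavy emulated device I/O tends to show IO_INSTRUCTION and EPT_MISCONFIG.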
Prometheus integration
# libvirt-exporter exposes VM metrics to Prometheus
# Install
curl -Lo /usr/local/bin/libvirt-exporter \
https://github.com/prometheus-community/libvirt_exporter/releases/latest/download/libvirt_exporter-linux-amd64
chmod +x /usr/local/bin/libvirt-exporter
# Run as a systemd service
cat > /etc/systemd/system/libvirt-exporter.service <<'EOF'
[Unit]
Description=Prometheus libvirt exporter
After=libvirtd.service
[Service]
ExecStart=/usr/local/bin/libvirt-exporter --web.listen-address=:9177
Restart=always
[Install]
WantedBy=multi-user.target
EOF
systemctl enable --now libvirt-exporter
# Metrics available at http://localhost:9177/metrics
# libvirt_domain_info_cpu_time_seconds_total
# libvirt_domain_info_maximum_memory_bytes
# libvirt_domain_info_memory_usage_bytes
# libvirt_domain_info_virtual_cpus
# libvirt_domain_block_stats_read_bytes_total
# libvirt_domain_block_stats_write_bytes_total
# libvirt_domain_interface_stats_receive_bytes_total
# libvirt_domain_interface_stats_transmit_bytes_total
# Prometheus scrape config
# - job_name: 'libvirt'
# static_configs:
# - targets: ['hypervisor1:9177', 'hypervisor2:9177']
ZFS I/O monitoring for VM zvols
# Watch zvol I/O in real time
zpool iostat -v tank 2
# capacity operations bandwidth
# pool alloc free read write read write
# ---------- ----- ---- ----- ----- ----- -----
# tank 120G 880G 1.2K 456 48.5M 18.3M
# mirror 120G 880G 1.2K 456 48.5M 18.3M
# sda - - 623 228 24.3M 9.15M
# sdb - - 612 228 24.2M 9.15M
# Per-dataset I/O (requires zfs_latency kstat or eBPF)
# Use the ZFS ARC stats to monitor caching efficiency
cat /proc/spl/kstat/zfs/arcstats | grep -E '^(hits|misses|size)'
# hits 4 789012345
# misses 4 12345678
# size 4 34359738368
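The raw hit and miss counters are more useful as a ratio. The sketch below computes the ARC hit percentage from arcstats-formatted input (name, type, value columns, as shown above); the helper name arc_hit_ratio is hypothetical.

```shell
# arc_hit_ratio: read arcstats lines on stdin and print the ARC hit
# percentage since boot (hypothetical helper).
arc_hit_ratio() {
    awk '
        $1 == "hits"   { hits = $3 }
        $1 == "misses" { misses = $3 }
        END {
            if (hits + misses > 0)
                printf "%.2f\n", 100 * hits / (hits + misses)
        }'
}
```

On a host, run it as: arc_hit_ratio < /proc/spl/kstat/zfs/arcstats. The counters are cumulative since boot, so compare two readings if you need the ratio for a recent interval.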
15. Troubleshooting Reference
This section collects the most common KVM problems and their solutions. When something breaks, start here.
| Problem | Symptom | Cause | Solution |
|---|---|---|---|
| VM won't start | error: internal error: process exited while connecting to monitor | QEMU crashed on startup (usually bad XML, missing image, or permission issue) | Check /var/log/libvirt/qemu/VMNAME.log for the QEMU error. Fix the XML or permissions. |
| No KVM acceleration | Could not access KVM kernel module: No such file or directory | KVM modules not loaded or hardware virtualization disabled in BIOS | modprobe kvm_intel (or kvm_amd). Enable VT-x/AMD-V in BIOS. Check grep -c -e vmx -e svm /proc/cpuinfo. |
| Poor disk performance | High latency, low throughput on virtio-blk | Using cache=writethrough or io=threads with a zvol | Set cache=none,io=native for zvol-backed disks. Add iothread for dedicated I/O processing. |
| VM network unreachable | Guest has no connectivity | Bridge not configured, vnet device not added to bridge, or nftables blocking | bridge link show to verify vnet is attached. Check nft list ruleset for FORWARD drops. |
| SELinux blocks zvol access | Permission denied opening zvol in QEMU log | SELinux svirt policy does not allow QEMU to open the zvol device | Relabel the zvol so the sVirt policy permits access (preferred). As a last resort, set security_driver = "none" in /etc/libvirt/qemu.conf (less secure). |
| Shared NVRAM conflict | Second clone fails to start with "NVRAM file already in use" | Multiple VMs sharing the same UEFI NVRAM file | Copy the NVRAM file per VM: cp /usr/share/OVMF/OVMF_VARS.fd /var/lib/libvirt/qemu/nvram/VMNAME_VARS.fd. kvm-clone does this automatically. |
| Migration fails | Unable to resolve address or Timed out during operation | Destination host unreachable, firewall blocking migration port, or DNS resolution failure | Verify SSH connectivity. Open TCP ports 49152-49215 (QEMU's default migration range). Use --p2p --tunnelled to avoid direct QEMU connections. |
| CPU feature mismatch on migration | unsupported configuration: guest CPU ... is not compatible | Using host-passthrough with different CPU generations | Switch to a named CPU model or host-model. Or use host-passthrough with migratable='on' and matching hardware. |
| GPU passthrough fails | VFIO error: device is not in a valid IOMMU group | IOMMU not enabled, or GPU shares an IOMMU group with other devices | Enable intel_iommu=on or amd_iommu=on. Use ACS override patch if IOMMU groups are too large (security tradeoff). |
| NVIDIA driver detects hypervisor | Error 43 in Windows guest with NVIDIA GPU | NVIDIA consumer drivers older than release 465 refuse to run in a VM | Add <kvm><hidden state='on'/></kvm> to the XML features section, or update the guest driver. Datacenter GPUs (Tesla/A100) do not have this restriction. |
| Hugepage allocation fails | Cannot allocate memory on VM start with hugepages | Not enough free hugepages; fragmentation prevents allocation at runtime | Allocate hugepages at boot via kernel command line. Check grep HugePages /proc/meminfo. |
| Slow clone boot | Clone takes 2+ minutes to boot | cloud-init waiting for network metadata that does not exist | Configure cloud-init with datasource_list: [NoCloud, None] and disable cloud-init-network if not using cloud metadata. |
Log locations
# QEMU VM logs (most useful for startup failures)
/var/log/libvirt/qemu/VMNAME.log
# libvirt daemon log
journalctl -u libvirtd -f
# KVM kernel messages
dmesg | grep -i kvm
# IOMMU/VFIO messages
dmesg | grep -iE '(iommu|vfio|dmar)'
# ZFS zvol I/O errors
dmesg | grep -i zfs
zpool status -v tank
Performance diagnostics checklist
# 1. Verify KVM acceleration is in use (not TCG emulation)
virsh dumpxml web01 | grep -A2 '<domain'
# type='kvm' means hardware acceleration is active
# 2. Check CPU steal time inside the guest
top
# If %st (steal) is high, the host is overcommitting CPU
# 3. Verify virtio drivers are loaded in the guest
lsmod | grep virtio
# virtio_blk, virtio_net, virtio_balloon should be present
# 4. Check for excessive VM exits
perf kvm stat live
# Shows VM exit reasons and frequency
# 5. Verify disk I/O path
virsh dumpxml web01 | grep -A5 '
Related Pages
- KVM Virtual Machines Tutorial — getting started with KVM on kldload
- Kubernetes on KVM — running a Kubernetes cluster on KVM VMs
- Proxmox & ZFS Tutorial — ZFS integration with Proxmox VE
- Cloud & Packer — building golden images with Packer
- ZFS Masterclass — deep dive into ZFS storage architecture
- GPU & NVIDIA Masterclass — GPU passthrough, CUDA, driver management
- Backplane Networks Masterclass — high-speed interconnects for hypervisor clusters
- eBPF Masterclass — kernel tracing and observability including KVM tracepoints
- ZFS Wiki: KVM + ZFS Hypervisor — zvol tuning reference
- Packer & IaC Masterclass — infrastructure as code for VM images
- Storage & ZFS Platform — kldload storage architecture overview