GPU & NVIDIA Masterclass

This guide covers the full GPU stack on kldload: from PCIe fundamentals and NVIDIA driver installation through VFIO passthrough, vGPU sharing, CUDA development, container GPU access, Kubernetes GPU scheduling, LLM inference workloads, multi-GPU topologies, monitoring, power management, SELinux integration, and ZFS storage tuning for AI workloads. By the end you will know how to make every GPU in your infrastructure available to bare metal, VMs, containers, and Kubernetes pods — and how to monitor, manage, and troubleshoot every layer.

The premise: GPUs are the most expensive and most underutilized resource in modern infrastructure. A single A100 costs more than most servers, yet organizations routinely leave them idle 90% of the time because they cannot share them across workloads. The GPU stack — drivers, CUDA, container runtimes, device plugins, passthrough — is one of the most fragile and poorly documented areas of Linux systems engineering. This masterclass makes it deterministic.

What this page covers: GPU hardware fundamentals, NVIDIA driver installation with DKMS on kldload, CUDA toolkit setup, VFIO passthrough for KVM virtual machines, vGPU and MIG sharing, NVIDIA Container Toolkit for Podman and Docker, Kubernetes GPU scheduling, LLM inference engines (Ollama, vLLM, TGI), multi-GPU NVLink topologies, GPU monitoring with Prometheus, power and thermal management, SELinux policy for GPU devices, ZFS recordsize tuning for model weights — and a comprehensive troubleshooting reference.

Prerequisites: a running kldload system with at least one NVIDIA GPU. The Kubernetes sections assume a cluster from the Kubernetes on KVM guide. The VFIO sections assume a host with IOMMU-capable hardware.

The GPU ecosystem on Linux is genuinely difficult. Not because the concepts are hard, but because the stack is deep and every layer has its own version matrix. The kernel needs the right NVIDIA module. The NVIDIA module needs the right firmware. CUDA needs a driver version at or above a minimum. The container runtime needs a specific CDI config. The Kubernetes device plugin needs to match the driver. And if any single layer is wrong, the error messages are usually useless — "no CUDA-capable device is detected" tells you nothing about which layer broke. This masterclass gives you a mental model for every layer so that when something breaks, you know where to look.

1. GPU Fundamentals

Before touching drivers or CUDA, you need to understand what a GPU actually is at the hardware level, why it exists as a separate device, and what the PCIe bus topology means for performance. This section gives you the mental model that makes everything else click.

What a GPU actually does

A CPU has a small number of powerful cores (8-128) optimized for sequential logic with deep branch prediction, out-of-order execution, and large caches. A GPU has thousands of simple cores (the NVIDIA A100 has 6,912 CUDA cores) optimized for the same operation on many data elements simultaneously. The GPU does not replace the CPU — it accelerates workloads that are embarrassingly parallel: matrix multiplication, convolution, element-wise transforms, sorting, and reduction. This is why GPUs dominate machine learning: neural network training and inference are fundamentally matrix operations.

CUDA Cores

The basic compute units in NVIDIA GPUs. Each CUDA core can execute one floating-point or integer operation per clock cycle. They are organized into Streaming Multiprocessors (SMs), each containing 64-128 CUDA cores plus shared memory, registers, and schedulers. An RTX 4090 has 16,384 CUDA cores across 128 SMs.

// A CPU core is a PhD — brilliant but one at a time. A CUDA core is a factory worker — simple but you have 16,000 of them.

Tensor Cores

Specialized matrix-multiply-accumulate units introduced in Volta (V100). A single Tensor Core performs a 4x4 matrix multiply-and-accumulate in one clock cycle, work that would take 64 CUDA core operations. Supported precisions grew with each generation: FP16 on Volta, INT8 on Turing, BF16 and TF32 on Ampere, and FP8 on Ada Lovelace and Hopper. Tensor Cores are what make modern LLM inference fast enough to be practical.

// CUDA cores do arithmetic. Tensor Cores do linear algebra. LLMs are linear algebra.

GPU Memory (VRAM)

GPU memory (HBM2e on data center cards, GDDR6X on consumer cards) is separate from system RAM. The A100 has 80 GB HBM2e at 2 TB/s bandwidth. The RTX 4090 has 24 GB GDDR6X at 1 TB/s. VRAM capacity determines the maximum model size you can run — a 70B parameter model at FP16 needs ~140 GB, which means you need at least two A100-80GB cards.

// VRAM is the GPU's own RAM. If the model doesn't fit, it doesn't run. Period.
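
The sizing rule above is plain arithmetic: weights occupy roughly the parameter count times the bytes per parameter. A minimal sketch follows; the function names `vram_gb` and `cards_needed` are hypothetical, and the estimate covers weights only (KV cache and activations need extra headroom).

```shell
#!/bin/sh
# Estimate VRAM (in GB) needed for model weights only.
# vram_gb <params_in_billions> <bytes_per_param>
vram_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { print p * b }'
}

# Minimum number of cards of a given size needed to hold the weights.
# cards_needed <weights_gb> <vram_per_card_gb>
cards_needed() {
  awk -v w="$1" -v c="$2" 'BEGIN { n = int(w / c); if (n * c < w) n++; print n }'
}

vram_gb 70 2          # 70B parameters at FP16 (2 bytes each) -> 140
cards_needed 140 80   # A100-80GB cards needed for 140 GB     -> 2
```

Passing 0.5 as the bytes-per-parameter value gives a rough figure for 4-bit quantized weights.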

PCIe Topology

GPUs connect to the CPU via PCIe lanes. A GPU in a x16 PCIe 4.0 slot gets ~32 GB/s of bandwidth in each direction. PCIe 5.0 doubles this to ~64 GB/s per direction. The NUMA node the GPU is attached to matters: data transferred from the wrong NUMA node crosses the CPU interconnect, adding latency. Use nvidia-smi topo -m to see the topology.

// PCIe is the highway between CPU and GPU. NUMA is which highway entrance you use.
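
The bandwidth figures come from the per-lane signalling rate, the lane count, and the 128b/130b line encoding used since PCIe 3.0. A small sketch of that arithmetic, with a hypothetical function name `pcie_gbps`; the result is the theoretical per-direction ceiling, and real transfers land somewhat below it.

```shell
#!/bin/sh
# Theoretical PCIe bandwidth per direction, in GB/s.
# pcie_gbps <generation 3|4|5> <lanes>
pcie_gbps() {
  awk -v gen="$1" -v lanes="$2" 'BEGIN {
    rate[3] = 8; rate[4] = 16; rate[5] = 32        # GT/s per lane
    if (!(gen in rate)) { print "unsupported generation" > "/dev/stderr"; exit 1 }
    # 128b/130b encoding: 128 payload bits per 130 bits on the wire; 8 bits per byte
    printf "%.1f\n", rate[gen] * lanes * (128 / 130) / 8
  }'
}

pcie_gbps 4 16   # -> 31.5
pcie_gbps 5 16   # -> 63.0
```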

NVLink

NVIDIA's proprietary GPU-to-GPU interconnect. NVLink 4.0 (Hopper) provides 900 GB/s of bidirectional bandwidth per GPU, about 7x the ~128 GB/s bidirectional bandwidth of a PCIe 5.0 x16 slot. NVLink enables multi-GPU workloads to share memory and communicate without the PCIe bottleneck. Current consumer GPUs (RTX 40 series and later) do not have NVLink; data center GPUs (A100, H100) do.

// PCIe is a two-lane road. NVLink is a 14-lane superhighway between GPUs.

Compute Capability

A version number identifying the GPU architecture's feature set: 7.0 = Volta, 7.5 = Turing, 8.0 = Ampere, 8.6 = Ampere consumer, 8.9 = Ada Lovelace, 9.0 = Hopper. CUDA code compiled for one compute capability may not run on a different one. This is the GPU equivalent of a CPU instruction set.

// Compute capability = GPU instruction set version. Compile for the wrong one and nothing runs.
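
As a quick lookup aid, the version-to-architecture mapping listed above can be written as a small shell function. This is only a sketch: `cc_to_arch` is a hypothetical helper, and it covers just the values named in this section.

```shell
#!/bin/sh
# Map a compute capability to its architecture name.
# cc_to_arch <compute_capability>
cc_to_arch() {
  case "$1" in
    7.0)     echo "Volta" ;;
    7.5)     echo "Turing" ;;
    8.0|8.6) echo "Ampere" ;;
    8.9)     echo "Ada Lovelace" ;;
    9.0)     echo "Hopper" ;;
    *)       echo "unknown" ;;
  esac
}

cc_to_arch 8.9   # -> Ada Lovelace

# On a GPU host, recent drivers can report the value directly (assumption:
# the compute_cap query field is available in your driver version):
#   cc_to_arch "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
```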

GPU architecture generations

Architecture	Year	Compute Capability	Key Feature	Data Center Card	Consumer Card
Pascal	2016	6.0 / 6.1	Unified memory, NVLink 1.0	P100	GTX 1080 Ti
Volta	2017	7.0	First Tensor Cores	V100	Titan V
Turing	2018	7.5	RT Cores, INT8 Tensor	T4	RTX 2080 Ti
Ampere	2020	8.0 / 8.6	BF16, TF32, MIG, 3rd gen NVLink	A100	RTX 3090
Ada Lovelace	2022	8.9	FP8 Tensor, DLSS 3	L40S	RTX 4090
Hopper	2022	9.0	Transformer Engine, NVLink 4.0	H100	N/A
Blackwell	2024	10.0 / 12.0	FP4, 5th gen NVLink	B200	RTX 5090

The most important number for LLM inference is VRAM, not CUDA core count. A 24 GB RTX 4090 with 16,384 CUDA cores cannot run a 70B model at FP16. Two 80 GB A100s with "only" 6,912 CUDA cores each can. For inference, memory capacity and bandwidth dominate. For training, compute (FLOPS) matters more. Know which problem you are solving before you buy hardware.

2. NVIDIA Driver Installation on kldload

The NVIDIA kernel module is the foundation of the entire GPU stack. Every other layer — CUDA, containers, VMs, monitoring — depends on the driver loading correctly. kldload uses DKMS (Dynamic Kernel Module Support) so the driver rebuilds automatically when the kernel updates. This section covers a clean, reproducible driver installation.

Pre-installation checks

# Verify GPU is visible on PCIe bus
lspci | grep -i nvidia
# 41:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)

# Check current kernel and architecture
uname -r
# 5.14.0-503.el9.x86_64

# Verify DKMS is installed (kldload includes it)
rpm -q dkms
# dkms-3.0.13-1.el9.noarch

# Check for conflicting nouveau driver
lsmod | grep nouveau
# If loaded, it must be blacklisted before installing NVIDIA drivers

Blacklist nouveau

The open-source nouveau driver ships with every Linux kernel. It must be disabled before the NVIDIA proprietary driver can load. kldload handles this automatically during installation, but if you are retrofitting a system, do it manually:

# Create blacklist file
cat > /etc/modprobe.d/blacklist-nouveau.conf <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF

# Rebuild initramfs to exclude nouveau
dracut --force

# Reboot to take effect
systemctl reboot

Install NVIDIA driver via RPM (CentOS/RHEL/Rocky)

# Add NVIDIA CUDA repository
dnf config-manager --add-repo \
  https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo

# Install the driver meta-package (latest branch)
dnf install -y nvidia-driver nvidia-driver-devel

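# Or pin a specific driver branch. On RHEL-family systems the NVIDIA repo has
# exposed branches as DNF module streams; confirm with: dnf module list nvidia-driver
dnf module install -y nvidia-driver:550-dkms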

# Verify DKMS built the module
dkms status
# nvidia/550.127.05, 5.14.0-503.el9.x86_64, x86_64: installed

# Load the module
modprobe nvidia

# Verify
nvidia-smi

Install NVIDIA driver via APT (Debian/Ubuntu)

# Install kernel headers and DKMS (needed to build the module)
apt-get install -y linux-headers-$(uname -r) dkms

# Add NVIDIA keyring and repository
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get update

# Install driver
apt-get install -y nvidia-driver-550

# Verify
dkms status
nvidia-smi

Persistence mode

By default, the NVIDIA driver tears down GPU state when the last client process exits and re-initializes the GPU when the next process opens it. This causes a 1-3 second delay on first access. Persistence mode keeps the GPU initialized at all times, eliminating cold-start latency. On servers, always enable it:

# Enable persistence mode (survives until reboot)
nvidia-smi -pm 1

# Enable via systemd for persistence across reboots
systemctl enable --now nvidia-persistenced

Driver version compatibility

CUDA Version	Minimum Driver (Linux)	Architecture Support
CUDA 12.6	560.28+	Maxwell through Hopper
CUDA 12.4	550.54+	Maxwell through Hopper
CUDA 12.2	535.54+	Maxwell through Hopper
CUDA 12.0	525.60+	Maxwell through Hopper
CUDA 11.8	520.61+	Kepler (3.5+) through Hopper
CUDA 11.7	515.43+	Kepler (3.5+) through Ampere
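
The minimum-driver column can be checked mechanically before installing a toolkit. The sketch below hard-codes the minimums from the table; `min_driver_for_cuda` and `driver_supports_cuda` are hypothetical helper names, and the comparison relies on `sort -V` (version sort) being available.

```shell
#!/bin/sh
# Minimum Linux driver for a CUDA release (values copied from the table above).
# min_driver_for_cuda <cuda_version>
min_driver_for_cuda() {
  case "$1" in
    12.6) echo 560.28 ;;
    12.4) echo 550.54 ;;
    12.2) echo 535.54 ;;
    12.0) echo 525.60 ;;
    11.8) echo 520.61 ;;
    11.7) echo 515.43 ;;
    *)    return 1 ;;
  esac
}

# Succeeds if the installed driver is at or above the minimum for that CUDA release.
# driver_supports_cuda <installed_driver_version> <cuda_version>
driver_supports_cuda() {
  min=$(min_driver_for_cuda "$2") || return 2
  # If the minimum sorts first (or is equal), the installed driver is new enough
  lowest=$(printf '%s\n%s\n' "$min" "$1" | sort -V | head -n 1)
  [ "$lowest" = "$min" ]
}

driver_supports_cuda 550.127.05 12.4 && echo "driver 550.127.05 can run CUDA 12.4"
driver_supports_cuda 550.127.05 12.6 || echo "driver 550.127.05 is too old for CUDA 12.6"
```

On a live host the first argument could come from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.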

The single most common GPU problem in production is a driver/CUDA version mismatch after a kernel update. DKMS is supposed to rebuild the module automatically, but it fails silently if the kernel headers package is missing. After every kernel update, run dkms status and verify the nvidia module is listed as "installed" for the new kernel. If it says "added" but not "installed," the build failed — check /var/lib/dkms/nvidia/*/build/make.log for the real error. The fix is almost always: install the correct kernel-devel/headers package and re-run dkms autoinstall.
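
The post-update check described above is easy to script. A minimal sketch, assuming `dkms status` prints lines in the format shown earlier in this section; `nvidia_dkms_ok` is a hypothetical helper that reads that output on stdin.

```shell
#!/bin/sh
# Succeeds only if the nvidia module is listed as "installed" for the given kernel.
# nvidia_dkms_ok <kernel_release>    (reads `dkms status` output on stdin)
nvidia_dkms_ok() {
  awk -v k="$1" '
    /^nvidia[\/,]/ && index($0, k) && /: installed/ { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Typical use after a kernel update (run on the GPU host):
#   dkms status | nvidia_dkms_ok "$(uname -r)" \
#     || echo "nvidia module is not built for $(uname -r); check the DKMS make.log"
```

A line that says "added" or "built" but not "installed" makes the function fail, which matches the failure mode described above.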

3. CUDA Toolkit

The CUDA Toolkit provides the compiler (nvcc), runtime libraries, math libraries (cuBLAS, cuFFT, cuSPARSE), and development headers. cuDNN ships as a separate package and is installed later in this section. Application code that runs on the GPU is compiled with nvcc, which splits the code into host (CPU) and device (GPU) portions. The runtime libraries are what frameworks like PyTorch and TensorFlow link against.

Installation on CentOS/RHEL/Rocky

# Install the CUDA toolkit (the cuda-toolkit-* packages do not pull in the driver; install it first as shown in section 2)
dnf install -y cuda-toolkit-12-6

# Or install a specific version alongside the existing driver
dnf install -y cuda-toolkit-12-4

# Set environment variables
cat > /etc/profile.d/cuda.sh <<'EOF'
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
EOF
source /etc/profile.d/cuda.sh

# Verify
nvcc --version
# nvcc: NVIDIA (R) Cuda compiler driver
# Cuda compilation tools, release 12.6, V12.6.77

Installation on Debian/Ubuntu

# Install CUDA toolkit
apt-get install -y cuda-toolkit-12-6

# Same environment setup as RPM-based systems
cat > /etc/profile.d/cuda.sh <<'EOF'
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
EOF
source /etc/profile.d/cuda.sh

nvcc --version

Multiple CUDA versions

The CUDA toolkit installs to /usr/local/cuda-12.6/ (versioned path) with a symlink at /usr/local/cuda pointing to the default version. You can install multiple versions side by side and switch the symlink:

# Install two versions
dnf install -y cuda-toolkit-12-4 cuda-toolkit-12-6

# List installed versions
ls -d /usr/local/cuda-*
# /usr/local/cuda-12.4  /usr/local/cuda-12.6

# Switch active version (replace the symlink in a single step)
ln -sfn /usr/local/cuda-12.4 /usr/local/cuda

# Verify
nvcc --version
# Cuda compilation tools, release 12.4

cuDNN installation

cuDNN (CUDA Deep Neural Network library) provides optimized primitives for convolutions, pooling, normalization, and attention. PyTorch and TensorFlow require cuDNN for GPU-accelerated training and inference. It must match the CUDA version:

# RPM-based
dnf install -y libcudnn9-cuda-12 libcudnn9-devel-cuda-12

# APT-based
apt-get install -y libcudnn9-cuda-12 libcudnn9-dev-cuda-12

# Verify from PyTorch (pip wheels bundle their own cuDNN, so this reports the copy PyTorch loads)
python3 -c "import torch; print(torch.backends.cudnn.version())"
# 90100

CUDA verification

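# Compile and run the CUDA samples (newer cuda-samples releases build with CMake; if make fails, check out a release tag that matches your toolkit)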
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
make
./deviceQuery

# Expected output includes:
# Device 0: "NVIDIA A100-PCIE-80GB"
#   CUDA Capability Major/Minor version number: 8.0
#   Total amount of global memory: 81920 MBytes
#   (108) Multiprocessors, (64) CUDA Cores/MP: 6912 CUDA Cores

A common mistake is installing the CUDA toolkit version that matches the driver version number instead of the CUDA version number. Driver 550 supports CUDA 12.4, not "CUDA 550." The driver version and the CUDA version are completely different numbering schemes. Always check the compatibility matrix. Another mistake: installing CUDA from the runfile installer instead of the package manager. The runfile works, but it bypasses DKMS and package management, making upgrades fragile. Use the RPM or DEB packages — they integrate with DKMS and can be managed by dnf/apt.

4. GPU Passthrough with VFIO

VFIO (Virtual Function I/O) allows you to pass a physical GPU directly to a KVM virtual machine, giving the VM exclusive, near-native access to the hardware. The host gives up the GPU entirely — it cannot use it while the VM is running. This is the gold standard for GPU performance in VMs: the VM sees the real GPU with no emulation overhead.

IOMMU prerequisites

# Verify IOMMU is enabled in the kernel
dmesg | grep -i iommu
# [    0.000000] Intel-IOMMU: enabled
# or
# [    0.000000] AMD-Vi: AMD IOMMUv2 enabled

# If not enabled, add kernel parameter
# For Intel:
grubby --update-kernel=ALL --args="intel_iommu=on iommu=pt"
# For AMD:
grubby --update-kernel=ALL --args="amd_iommu=on iommu=pt"

# Reboot and verify
systemctl reboot
dmesg | grep -i iommu

Identify IOMMU groups

VFIO operates on IOMMU groups, not individual devices. An IOMMU group is a set of devices that share the same IOMMU translation — you must pass all devices in a group to the VM, or none. If your GPU shares an IOMMU group with the SATA controller, you have a problem. Check the groups:

# List all IOMMU groups and their devices
for g in /sys/kernel/iommu_groups/*/devices/*; do
  echo "IOMMU Group $(basename $(dirname $(dirname $g))): $(lspci -nns ${g##*/})"
done

# Example output — clean isolation:
# IOMMU Group 30: 41:00.0 3D controller [0302]: NVIDIA Corporation GA100 [10de:20b2] (rev a1)
# IOMMU Group 30: 41:00.1 Audio device [0403]: NVIDIA Corporation GA100 High Definition Audio [10de:1aef] (rev a1)

# Both the GPU and its audio device are in the same group — pass both.

IOMMU group isolation is the number one reason GPU passthrough fails. On consumer motherboards, all PCIe slots often share a single IOMMU group because the chipset doesn't implement ACS (Access Control Services). Server motherboards (Supermicro, Dell PowerEdge, HPE) have proper ACS support and give each slot its own group. If you are stuck with bad grouping, the pcie_acs_override=downstream,multifunction kernel parameter can force isolation — but it weakens DMA protection, which is a security risk. For production, use server hardware with proper ACS.
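
To make the isolation check concrete, the group listing printed by the loop above can be filtered for devices that share the GPU's group but are not functions of the GPU itself. This is a sketch: `iommu_group_strangers` is a hypothetical helper, and it flags every foreign device, including PCIe bridges that may be harmless in practice.

```shell
#!/bin/sh
# Print devices that share an IOMMU group with the GPU but sit in a different slot.
# Exit status: 0 = clean group, 1 = strangers found, 2 = GPU not in the listing.
# iommu_group_strangers <gpu_bdf>    (reads "IOMMU Group N: BDF ..." lines on stdin)
iommu_group_strangers() {
  awk -v gpu="$1" '
    { grp = $3; sub(/:$/, "", grp); group_of[$4] = grp; line[$4] = $0 }
    END {
      if (!(gpu in group_of)) exit 2
      slot = gpu; sub(/\.[0-7]$/, "", slot)            # strip the function number
      for (d in group_of) {
        s = d; sub(/\.[0-7]$/, "", s)
        if (group_of[d] == group_of[gpu] && s != slot) { print line[d]; bad = 1 }
      }
      exit (bad ? 1 : 0)
    }'
}

# Typical use: save the loop output to a file, then
#   iommu_group_strangers 41:00.0 < iommu-groups.txt && echo "group is clean"
```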

Bind GPU to vfio-pci

# Get the GPU's vendor:device IDs
lspci -nn -s 41:00.0
# 41:00.0 3D controller [0302]: NVIDIA Corporation GA100 [10de:20b2] (rev a1)

# Unbind from nvidia driver (if loaded)
echo "0000:41:00.0" > /sys/bus/pci/devices/0000:41:00.0/driver/unbind
echo "0000:41:00.1" > /sys/bus/pci/devices/0000:41:00.1/driver/unbind

# Bind to vfio-pci
modprobe vfio-pci
echo "10de 20b2" > /sys/bus/pci/drivers/vfio-pci/new_id
echo "10de 1aef" > /sys/bus/pci/drivers/vfio-pci/new_id

# Verify
lspci -k -s 41:00.0
# Kernel driver in use: vfio-pci

Persistent vfio-pci binding

# Create modprobe config to claim devices at boot
cat > /etc/modprobe.d/vfio.conf <<'EOF'
options vfio-pci ids=10de:20b2,10de:1aef
softdep nvidia pre: vfio-pci
EOF

# Rebuild initramfs to include vfio-pci early
dracut --force

# After reboot, the GPU will be bound to vfio-pci before nvidia can claim it
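
Typing vendor:device IDs by hand is error-prone, so the `options` line can be derived from `lspci -nn` output instead. A small sketch, where `vfio_ids_line` is a hypothetical helper that reads lspci lines on stdin:

```shell
#!/bin/sh
# Build the "options vfio-pci ids=..." line from `lspci -nn` output on stdin.
vfio_ids_line() {
  # The vendor:device pair is the bracketed token of the form [xxxx:xxxx]
  sed -n 's/.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1/p' |
    sort -u |
    paste -sd, - |
    sed 's/^/options vfio-pci ids=/'
}

# Typical use on the host (review the output before writing it to /etc/modprobe.d/vfio.conf):
#   { lspci -nn -s 41:00.0; lspci -nn -s 41:00.1; } | vfio_ids_line
```

The IDs come out sorted and de-duplicated, which matters when several identical GPUs share one device ID.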

KVM VM XML configuration

<domain type='kvm'>
  <name>gpu-worker</name>
  <memory unit='GiB'>64</memory>
  <vcpu placement='static'>16</vcpu>
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' dies='1' cores='8' threads='2'/>
    <numa>
      <cell id='0' cpus='0-15' memory='65536' unit='MiB'/>
    </numa>
  </cpu>
  <os>
    <type arch='x86_64' machine='q35'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.fd</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/gpu-worker_VARS.fd</nvram>
  </os>
  <features>
    <kvm>
      <hidden state='on'/>
    </kvm>
  </features>
  <devices>
    <!-- GPU passthrough -->
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    <!-- GPU audio (same IOMMU group) -->
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x41' slot='0x00' function='0x1'/>
      </source>
    </hostdev>
    <!-- Boot disk on ZFS zvol -->
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/zvol/rpool/vms/gpu-worker'/>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>

The kvm:hidden flag

The <kvm><hidden state='on'/></kvm> flag in the XML matters for consumer GPUs (GeForce/RTX). Older NVIDIA drivers (before the 465 release) detect the hypervisor and refuse to initialize those cards inside a VM; Windows guests report this as error code 43. The hidden flag hides the KVM signature from the guest's CPUID, making the VM appear as bare metal. Data center GPUs (Tesla, A100, H100) do not have this restriction.

GPU passthrough gives you 98-99% of bare-metal performance because the VM talks directly to the hardware via IOMMU — there is no emulation, no translation, no paravirtualization. The 1-2% overhead comes from the extra address translation layer (GPA to HPA via IOMMU page tables). For ML training workloads that are compute-bound on the GPU, this overhead is unmeasurable. The real cost of passthrough is exclusivity: one GPU, one VM. If you need to share a GPU across multiple VMs or containers, you need vGPU or MIG — covered in the next section.

5. vGPU and MIG Sharing

VFIO passthrough gives one GPU to one VM. But most workloads don't saturate a GPU 100% of the time. GPU sharing lets multiple VMs or containers use the same physical GPU, trading maximum performance for better utilization. NVIDIA provides four sharing mechanisms, each with different tradeoffs.

MIG (Multi-Instance GPU)

Available on A100, A30, H100, and newer data center GPUs. MIG physically partitions a GPU into up to 7 independent instances, each with dedicated compute, memory, and memory bandwidth. Each instance is hardware-isolated: a crash in one instance cannot affect another. This is the strongest isolation model short of separate physical GPUs.

// MIG = physically dividing the GPU into independent rooms with separate doors.

Time-Slicing

The GPU rapidly switches between workloads, giving each a time slice. No memory isolation — all workloads share the full VRAM. No compute isolation — a workload can starve others. Simple to set up (just oversubscribe), but unsuitable for production multi-tenant environments. Good for development clusters where fairness isn't critical.

// Time-slicing = everyone takes turns in the same room. No walls, no locks.

MPS (Multi-Process Service)

A CUDA feature that allows multiple CUDA processes to share a GPU concurrently, with their kernels actually running simultaneously on different SMs. Better GPU utilization than time-slicing for small workloads, but isolation is weak: all clients submit work through one shared MPS server, so a fatal fault in one client can affect the others. Best for MPI workloads.

// MPS = multiple workers in the same room at the same time, sharing all the equipment.

NVIDIA vGPU (licensed)

A proprietary, licensed solution that creates virtual GPUs backed by a physical GPU. Each vGPU presents as a full GPU to the guest VM with its own driver instance. Requires an NVIDIA vGPU license and a supported hypervisor. Provides better isolation than time-slicing but is expensive and complex to manage.

// vGPU = hotel rooms: each guest sees their own room, managed by the front desk (hypervisor).

MIG configuration

# Enable MIG mode (requires reboot or GPU reset)
nvidia-smi -i 0 -mig 1

# List available MIG profiles
nvidia-smi mig -lgip

# Create instances (example: two 3g.40gb instances on an A100-80GB)
# Profile IDs vary by GPU model; confirm them in the -lgip output above
# Profile 9 = 3g.40gb (3 compute slices, 40 GB memory)
nvidia-smi mig -cgi 9,9 -C
# Profile 19 = 1g.10gb (1 compute slice, 10 GB memory)
# Two 3g.40gb instances use all of the memory, so free one before adding these
# nvidia-smi mig -cgi 19 -C

# List created instances
nvidia-smi mig -lgi
# +-------------------------------------------------------+
# | GPU instances:                                         |
# | GPU   Name   Profile   Instance ID   Placement        |
# |   0   MIG 3g.40gb  9      1           [0:3]           |
# |   0   MIG 3g.40gb  9      2           [4:7]           |
# +-------------------------------------------------------+

# Access to MIG instances is mediated by capability device nodes, not separate /dev/nvidiaN files
ls /dev/nvidia-caps/
# nvidia-cap1  nvidia-cap2

When to use each

Mechanism	Isolation	Memory Isolation	GPU Support	License	Use Case
VFIO Passthrough	Full (hardware)	Full	Any	None	Single-tenant VMs, maximum performance
MIG	Full (hardware)	Full	A100, A30, H100 and newer	None	Multi-tenant inference, CI/CD GPU pools
Time-slicing	None	None	Any	None	Dev clusters, non-critical sharing
MPS	Partial	None	Volta+	None	MPI workloads, small concurrent kernels
vGPU	Strong (software)	Configurable	Selected	Paid	VDI, managed multi-tenant VMs

MIG is the underappreciated game-changer for GPU infrastructure. Before MIG, you had two options: give one GPU to one workload (wasteful), or share via time-slicing (unreliable). MIG gives you hardware-isolated partitions with guaranteed memory bandwidth, so each partition acts like a smaller, independent GPU. An A100-80GB can become seven 1g.10gb instances, each suitable for a small inference workload. For organizations running many small models (recommendation engines, fraud detection, NLP classifiers), MIG can replace seven separate GPUs with one. The catch: only data center GPUs from the A100 and A30 generation onward support it. If you are buying data center GPUs today, MIG support should be a hard requirement.
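
When planning a partition layout it helps to check the slice budget first. The sketch below assumes the A100-80GB geometry of 7 compute slices and 8 memory slices; `mig_fits` is a hypothetical helper, and passing this check is necessary but not sufficient, because the driver also enforces placement rules that it does not model.

```shell
#!/bin/sh
# Check whether a set of MIG profiles fits the A100-80GB slice budget.
# mig_fits <profile>...    e.g. mig_fits 3g.40gb 2g.20gb 1g.10gb
mig_fits() {
  compute=0; memory=0
  for p in "$@"; do
    case "$p" in
      1g.10gb) c=1; m=1 ;;
      2g.20gb) c=2; m=2 ;;
      3g.40gb) c=3; m=4 ;;   # 40 GB is half the card: 4 of 8 memory slices
      4g.40gb) c=4; m=4 ;;
      7g.80gb) c=7; m=8 ;;
      *) echo "unknown profile: $p" >&2; return 2 ;;
    esac
    compute=$((compute + c)); memory=$((memory + m))
  done
  echo "compute=$compute/7 memory=$memory/8"
  [ "$compute" -le 7 ] && [ "$memory" -le 8 ]
}

mig_fits 3g.40gb 2g.20gb 1g.10gb || echo "does not fit"
mig_fits 3g.40gb 3g.40gb 1g.10gb || echo "does not fit"
```

The second call fails on memory, not compute: two 3g.40gb instances already consume all eight memory slices.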

6. NVIDIA Container Toolkit

The NVIDIA Container Toolkit enables GPU access from inside containers. It configures the container runtime (Podman, Docker, containerd) to mount the NVIDIA driver, device files, and libraries into the container at launch time. The container itself does not need NVIDIA drivers installed — they are injected from the host.

Installation

# Add NVIDIA container toolkit repository (RPM)
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
  | tee /etc/yum.repos.d/nvidia-container-toolkit.repo

dnf install -y nvidia-container-toolkit

# For Debian/Ubuntu:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update && apt-get install -y nvidia-container-toolkit

Configure for Podman (rootless)

# Generate CDI (Container Device Interface) specification
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Verify CDI devices
nvidia-ctk cdi list
# nvidia.com/gpu=0
# nvidia.com/gpu=1
# nvidia.com/gpu=all

# Run a GPU container with Podman
podman run --rm --device nvidia.com/gpu=all \
  nvidia/cuda:12.6.0-runtime-ubi9 nvidia-smi

# Output shows all host GPUs visible inside the container

Configure for Docker

# Configure Docker runtime
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker

# Run a GPU container with Docker
docker run --rm --gpus all nvidia/cuda:12.6.0-runtime-ubi9 nvidia-smi

# Specific GPU selection
docker run --rm --gpus '"device=0"' nvidia/cuda:12.6.0-runtime-ubi9 nvidia-smi
docker run --rm --gpus '"device=0,1"' nvidia/cuda:12.6.0-runtime-ubi9 nvidia-smi

CDI (Container Device Interface)

CDI is the modern, runtime-agnostic way to expose devices to containers. Instead of bind-mounting /dev/nvidia* files and specific library paths (fragile, breaks on driver updates), CDI generates a YAML (or JSON) specification that describes all the devices, mounts, and environment variables a container needs. The container runtime reads the CDI spec at launch time and does the right thing automatically.

# The generated CDI spec lives at /etc/cdi/nvidia.yaml
# It contains entries like:
# devices:
#   - name: "0"
#     containerEdits:
#       deviceNodes:
#         - path: /dev/nvidia0
#         - path: /dev/nvidiactl
#         - path: /dev/nvidia-uvm
#       mounts:
#         - hostPath: /usr/lib64/libnvidia-ml.so.550.127.05
#           containerPath: /usr/lib64/libnvidia-ml.so.1

# Regenerate after driver update
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

The old way of giving containers GPU access was to map every device file and every library by hand:

--device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-uvm -v /usr/lib64/libnvidia-ml.so:/usr/lib64/libnvidia-ml.so

It broke every time the driver version changed because the library filenames include the version number. CDI abstracts all of this and is the correct way to do container GPU access today. If you see tutorials telling you to bind-mount individual nvidia libraries, they are outdated.
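
Because the generated spec embeds versioned library filenames, a stale spec after a driver update can be detected with a simple text match. A minimal sketch, assuming the spec contains paths like the libnvidia-ml example shown earlier; `cdi_spec_matches_driver` is a hypothetical helper.

```shell
#!/bin/sh
# Succeeds if the CDI spec references libraries for the given driver version.
# cdi_spec_matches_driver <spec_file> <driver_version>
cdi_spec_matches_driver() {
  grep -q -F "so.$2" "$1"
}

# Typical use on a GPU host: regenerate the spec only when it is stale.
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   cdi_spec_matches_driver /etc/cdi/nvidia.yaml "$drv" \
#     || nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```

Running this from a boot-time unit or a package-manager hook is one way to keep the spec in step with the driver.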

7. Kubernetes GPU Scheduling

Kubernetes does not understand GPUs natively. The NVIDIA device plugin runs as a DaemonSet on GPU nodes, discovers NVIDIA GPUs, and advertises them to the Kubernetes scheduler as extended resources (nvidia.com/gpu). Pods request GPUs in their resource limits, and the scheduler places them on nodes with available GPUs. The NVIDIA GPU Operator automates the entire stack — drivers, toolkit, device plugin, monitoring — as a single Helm chart.

Manual device plugin deployment

# Prerequisites: NVIDIA driver + container toolkit installed on all GPU nodes
# The device plugin container uses CDI to access GPUs

# Deploy the device plugin DaemonSet
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.0/deployments/static/nvidia-device-plugin.yml

# Verify GPUs are advertised
kubectl describe node gpu-worker-01 | grep nvidia
#   nvidia.com/gpu:     2
#   nvidia.com/gpu:     2

GPU pod specification

apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  containers:
  - name: inference
    image: nvcr.io/nvidia/pytorch:24.07-py3
    resources:
      limits:
        nvidia.com/gpu: 1    # Request exactly 1 GPU
    command: ["python3", "-c", "import torch; print(torch.cuda.get_device_name(0))"]
  nodeSelector:
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

GPU Operator (automated full-stack deployment)

# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator
# This deploys: driver (optional), container toolkit, device plugin,
# DCGM exporter, MIG manager, and node feature discovery
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=true

# Verify all components are running
kubectl -n gpu-operator get pods
# NAME                                       READY   STATUS
# gpu-feature-discovery-xxxxx                1/1     Running
# nvidia-container-toolkit-xxxxx             1/1     Running
# nvidia-dcgm-exporter-xxxxx                1/1     Running
# nvidia-device-plugin-xxxxx                1/1     Running
# nvidia-operator-validator-xxxxx           1/1     Running

MIG in Kubernetes

# The GPU Operator's MIG manager can partition GPUs automatically
# Configure via ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      all-3g.40gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2
      mixed:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1
            "1g.10gb": 1

# Apply a MIG configuration by labeling the node
kubectl label node gpu-worker-01 nvidia.com/mig.config=all-1g.10gb --overwrite

Time-slicing in Kubernetes

# ConfigMap for time-slicing (oversubscription)
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4    # Each physical GPU appears as 4 virtual GPUs

# Pods request nvidia.com/gpu: 1 as usual
# The scheduler can place 4 pods per physical GPU

A critical subtlety of Kubernetes GPU scheduling: GPUs are not compressible resources. CPU and memory can be overcommitted — the kernel time-slices CPU and can swap memory. GPUs cannot. If a pod requests nvidia.com/gpu: 1 and gets it, that GPU is exclusively reserved even if the pod uses 0% of it. The scheduler will not place another pod on it. This is why GPU utilization in Kubernetes clusters is often abysmal — teams request GPUs "just in case" and leave them idle. MIG and time-slicing exist to solve this, but both require deliberate configuration. If you run Kubernetes with GPUs and don't configure sharing, you are wasting the most expensive resource in your cluster.

8. LLM Inference

Large Language Model inference is the workload that has made GPUs ubiquitous in infrastructure. Running a 7B to 405B parameter model requires loading the model weights into VRAM, then executing matrix multiplications for each token. The inference engine handles batching, KV-cache management, and quantization. This section covers the three major inference engines and how to run them on kldload with ZFS-backed model storage.

Ollama

Ollama is the simplest way to run LLMs locally. It wraps llama.cpp with a model management layer, providing its own REST API alongside OpenAI-compatible endpoints. It handles model downloading, quantization selection, and GPU memory management automatically.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Ollama auto-detects NVIDIA GPUs if the driver is installed
# Verify GPU detection: run a small model, then check where it was loaded
ollama run llama3.2:1b "hello"
ollama ps
# The PROCESSOR column should report GPU rather than CPU

# Pull and run a model
ollama pull llama3.1:70b-instruct-q4_K_M
ollama run llama3.1:70b-instruct-q4_K_M

# Serve the API
ollama serve &
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b-instruct-q4_K_M",
  "prompt": "Explain ZFS snapshots in one paragraph."
}'

# Ollama stores models in ~/.ollama/models by default
# Move them to ZFS for snapshots (compression stays off because model weights compress poorly)
zfs create -o recordsize=1M -o compression=off rpool/ollama
mv ~/.ollama/models /rpool/ollama/
ln -s /rpool/ollama/models ~/.ollama/models

vLLM

vLLM is a high-throughput inference engine designed for production serving. It implements PagedAttention — a technique that manages the KV cache like virtual memory pages, reducing memory waste from 60-80% (in naive implementations) to near zero. vLLM can serve 2-4x more concurrent requests than naive inference on the same hardware.

# Install vLLM
pip install vllm

# Serve a model with the OpenAI-compatible API
python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --port 8000

# Query the API
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
  "prompt": "Explain VFIO passthrough:",
  "max_tokens": 200
}'

# vLLM in a container with Podman
# --ipc=host lets the tensor-parallel worker processes use host shared memory
podman run --rm --device nvidia.com/gpu=all \
  --ipc=host \
  -v /rpool/models:/models \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /models/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2

Text Generation Inference (TGI)

# TGI by Hugging Face: production-grade inference with continuous batching
# meta-llama models are gated on the Hugging Face Hub, so pass an access token in HF_TOKEN
podman run --rm --device nvidia.com/gpu=all \
  --shm-size 1g \
  -e HF_TOKEN="$HF_TOKEN" \
  -v /rpool/models:/data \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3.1-8B-Instruct \
  --num-shard 1 \
  --max-input-tokens 4096 \
  --max-total-tokens 8192

# TGI exposes a /generate endpoint
curl http://localhost:8080/generate -H "Content-Type: application/json" -d '{
  "inputs": "What is ZFS?",
  "parameters": {"max_new_tokens": 200}
}'

Model storage on ZFS

# Create a dedicated dataset for model weights
zfs create -o recordsize=1M \
           -o compression=off \
           -o atime=off \
           -o xattr=sa \
           rpool/models

# Why recordsize=1M and compression=off?
# Model weight files are large (2-140 GB), read sequentially, and effectively
# incompressible: safetensors and GGUF hold dense floating-point or quantized
# values with very little redundancy, so ZFS compression burns CPU cycles for
# almost no space savings. Large recordsize = fewer metadata operations
# for sequential reads.

# Snapshot the dataset Ollama writes to before pulling a new model version
zfs snapshot rpool/ollama@before-llama3.1-70b
ollama pull llama3.1:70b-instruct-q4_K_M

# Roll back if the new model is bad (discards everything written since the snapshot)
zfs rollback rpool/ollama@before-llama3.1-70b

VRAM requirements for common models

Model	Parameters	FP16 VRAM	Q4_K_M VRAM	Min GPU
Llama 3.2 1B	1.2B	2.4 GB	1.0 GB	Any 4 GB+
Llama 3.2 3B	3.2B	6.4 GB	2.5 GB	Any 8 GB+
Llama 3.1 8B	8.0B	16 GB	5.5 GB	FP16: 24 GB card; Q4_K_M: 8 GB card
Mistral 7B	7.2B	14.4 GB	5.0 GB	FP16: RTX 3090 (24 GB); Q4_K_M: 8 GB card
Llama 3.1 70B	70.6B	141 GB	42 GB	FP16: 2x A100-80GB; Q4_K_M: 2x RTX 4090
Llama 3.1 405B	405B	810 GB	240 GB	FP16: 16x A100-80GB (multi-node); Q4_K_M: 4x H100-80GB
Mixtral 8x22B	141B (39B active)	282 GB	86 GB	FP16: 4x A100-80GB; Q4_K_M: 2x A100-80GB
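
The FP16 column follows directly from the parameter count at two bytes per weight. The sketch below reproduces that arithmetic. It assumes decimal gigabytes and an effective ~4.8 bits per weight for Q4_K_M; that figure is an approximation introduced here for illustration and varies slightly from model to model.

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM for model weights only, in decimal GB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Assumed effective bits per weight for Q4_K_M (approximation, varies by model)
Q4_K_M_BITS = 4.8

for name, params in [("Llama 3.1 8B", 8.0),
                     ("Llama 3.1 70B", 70.6),
                     ("Llama 3.1 405B", 405.0)]:
    fp16 = weight_vram_gb(params, 16)
    q4 = weight_vram_gb(params, Q4_K_M_BITS)
    print(f"{name}: FP16 ~{fp16:.0f} GB, Q4_K_M ~{q4:.0f} GB")
```

These figures cover weights only. KV cache, activations, and runtime overhead come on top, so leave headroom when matching a model to a GPU.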

Quantization is the single most impactful optimization for LLM inference on real hardware. A 70B model at FP16 needs 141 GB of VRAM for weights alone, far beyond any single consumer GPU. The same model quantized to Q4_K_M needs 42 GB, which two RTX 4090s (24 GB each) can hold with a short context. The quality loss from Q4_K_M is measurable on benchmarks but invisible in practice for most applications. Always start with Q4_K_M quantization unless you have specific quality requirements that demand higher precision. The other critical number is context length: the KV cache grows linearly with both context length and the number of concurrent sequences, and it can easily consume more VRAM than the model weights themselves. For Llama 3.1 70B with an FP16 cache, a single sequence at 128K context needs an additional ~40 GB for the KV cache alone.
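
The ~40 GB KV cache figure can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the Llama 3.1 70B shape of 80 layers, 8 KV heads (grouped-query attention), and a head dimension of 128, with an FP16 cache. Those shape values are not stated in this guide, so substitute the numbers from your model's config file.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """KV cache size: one key and one value vector per layer, KV head, and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size

# Assumed Llama 3.1 70B shape (check config.json for your model)
LLAMA_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)

for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(context_tokens=ctx, **LLAMA_70B) / 2**30
    print(f"{ctx:>7} tokens: {gib:.1f} GiB of FP16 KV cache per sequence")
```

Multiply by the number of concurrent sequences to size a serving deployment; this is why engines such as vLLM expose a maximum model length setting.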

9. Multi-GPU Configurations

When a single GPU doesn't have enough VRAM or compute, you scale to multiple GPUs. But multi-GPU is not "plug in more cards and it works." The interconnect topology, NUMA placement, and software framework all determine whether you get linear scaling or contention-limited degradation.

PCIe topology and NUMA awareness

# Display GPU-to-GPU interconnect topology
nvidia-smi topo -m
#         GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
# GPU0     X      NV12    SYS     SYS     0-31            0
# GPU1    NV12     X      SYS     SYS     0-31            0
# GPU2    SYS     SYS      X      NV12    32-63           1
# GPU3    SYS     SYS     NV12     X      32-63           1
#
# Legend:
#   NV12 = Bonded set of 12 NVLinks (high bandwidth, low latency)
#   SYS  = PCIe plus the inter-socket link between NUMA nodes (high latency)
#   PHB  = PCIe through a PCIe host bridge (typically the CPU)
#   PIX  = At most a single PCIe bridge (same PCIe switch)

# Key insight: GPU0 and GPU1 are NVLink-connected on NUMA 0
#              GPU2 and GPU3 are NVLink-connected on NUMA 1
#              GPU0-GPU2 communication goes through the CPU interconnect (slow)

# Pin a process to the correct NUMA node
numactl --cpunodebind=0 --membind=0 python3 train.py --gpus 0,1

# For inference with vLLM, specify tensor parallelism across NVLink-paired GPUs
CUDA_VISIBLE_DEVICES=0,1 python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2
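
Picking NVLink-paired GPUs by eye works for four devices but gets error-prone on larger hosts. The following sketch parses a topology matrix and lists the NVLink-connected pairs. It runs against the sample matrix shown above; real nvidia-smi topo -m output can include extra columns, NIC rows, and terminal escape codes, so treat the parser as a starting point rather than a complete implementation.

```python
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV12    SYS     SYS     0-31            0
GPU1    NV12     X      SYS     SYS     0-31            0
GPU2    SYS     SYS      X      NV12    32-63           1
GPU3    SYS     SYS     NV12     X      32-63           1
"""

def nvlink_pairs(topo_text: str) -> list[tuple[int, int]]:
    """Return (i, j) GPU index pairs, i < j, whose link type starts with NV."""
    lines = [ln.split() for ln in topo_text.splitlines() if ln.strip()]
    # Header cells such as GPU0, GPU1 name the matrix columns
    gpus = [cell for cell in lines[0] if cell.startswith("GPU")]
    pairs = []
    for row in lines[1:]:
        if row[0] not in gpus:
            continue  # skip NIC rows, legend text, and anything else
        i = gpus.index(row[0])
        for j, link in enumerate(row[1:1 + len(gpus)]):
            if j > i and link.startswith("NV"):
                pairs.append((i, j))
    return pairs

print(nvlink_pairs(SAMPLE))
```

Each pair is a candidate value for CUDA_VISIBLE_DEVICES when running with a tensor-parallel size of 2.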

GPU affinity with environment variables

# Restrict CUDA to specific GPUs
export CUDA_VISIBLE_DEVICES=0,1       # Only GPUs 0 and 1 are visible
export CUDA_VISIBLE_DEVICES=GPU-UUID  # Select by UUID (stable across reboots)

# Get GPU UUIDs
nvidia-smi -L
# GPU 0: NVIDIA A100-PCIE-80GB (UUID: GPU-12345678-abcd-efgh-ijkl-123456789abc)

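# MIG instance visibility (legacy format: MIG-<GPU-UUID>/<GI>/<CI>;
# recent drivers list MIG devices as MIG-<UUID>, so copy the exact string from nvidia-smi -L)
export CUDA_VISIBLE_DEVICES=MIG-GPU-12345678-abcd/1/0

# Order CUDA devices by PCIe bus ID so indices match nvidia-smi
# (the CUDA default, FASTEST_FIRST, can number GPUs differently)
export CUDA_DEVICE_ORDER=PCI_BUS_ID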

NVLink bandwidth test

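# Use the CUDA samples p2pBandwidthLatencyTest
# (path and build steps vary by cuda-samples release; newer releases build with CMake)
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest

# Approximate per-direction figures (bidirectional is roughly double):
# NVLink result:     ~300 GB/s per direction (A100 NVLink 3.0, 600 GB/s bidirectional)
# PCIe 4.0 x16:      ~25 GB/s per direction
# Cross-socket:      ~15 GB/s (via QPI/UPI)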

Multi-GPU training strategies

Strategy	Description	Interconnect Requirement	Framework Support
Data Parallel (DP)	Same model on each GPU, different data batches, gradient sync	PCIe sufficient	PyTorch DDP, Horovod
Tensor Parallel (TP)	Each layer's weight matrices are split across GPUs, each GPU computes a slice	NVLink strongly recommended	Megatron-LM, vLLM, DeepSpeed
Pipeline Parallel (PP)	Different model layers on different GPUs, data flows through	PCIe sufficient	GPipe, DeepSpeed
FSDP	Model parameters sharded, gathered on demand per layer	PCIe sufficient, NVLink better	PyTorch FSDP

The number one multi-GPU mistake is tensor-parallelizing across GPUs that are connected only by PCIe. Tensor parallelism requires an all-reduce across every participating GPU at every layer. If that traffic goes over PCIe at 25 GB/s instead of NVLink at 300 GB/s, you spend more time communicating than computing. The rule is simple: tensor parallelism across NVLink-connected GPUs, pipeline or data parallelism across PCIe-connected GPUs. Check your topology with nvidia-smi topo -m before setting parallelism. For inference with vLLM, the --tensor-parallel-size should match the number of NVLink-connected GPUs on the same NUMA node.
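
To put rough numbers on that communication cost, the sketch below estimates per-pass all-reduce time from bandwidth alone. It assumes a ring all-reduce, a hidden size of 8192, FP16 activations, 80 layers with two all-reduces each, and a 4096-token prefill batch. All of these workload values are illustrative assumptions, and the model ignores latency and any overlap with compute.

```python
def allreduce_seconds(message_bytes: float, num_gpus: int, link_gb_per_s: float) -> float:
    """Bandwidth-only ring all-reduce time: each GPU moves 2*(N-1)/N of the message."""
    traffic = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic / (link_gb_per_s * 1e9)

# Assumed workload shape (illustrative, not measured)
hidden, layers, tokens, gpus = 8192, 80, 4096, 2
message = tokens * hidden * 2   # bytes per all-reduce with FP16 activations
calls = 2 * layers              # all-reduces per forward pass

for name, bandwidth in [("PCIe 4.0", 25), ("NVLink", 300)]:
    total_ms = calls * allreduce_seconds(message, gpus, bandwidth) * 1e3
    print(f"{name:8s}: ~{total_ms:.1f} ms of all-reduce per forward pass")
```

Under these assumptions the PCIe path spends twelve times longer in communication than the NVLink path, which is simply the ratio of the two bandwidths.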

10. GPU Monitoring

You cannot manage GPUs you cannot see. NVIDIA provides three monitoring layers: nvidia-smi for interactive queries, DCGM (Data Center GPU Manager) for programmatic health monitoring, and the DCGM Exporter for Prometheus metrics. This section sets up all three.

nvidia-smi essentials

# Full status of all GPUs
nvidia-smi

# Continuous monitoring (refresh every 1 second)
nvidia-smi dmon -s puc -d 1
# -s selects metric groups: p = power/temperature, u = utilization, c = clocks
# (add m for framebuffer memory columns)
# gpu   pwr  gtemp  mtemp    sm   mem   enc   dec  mclk  pclk
#   0    72    45     40     87    34     0     0  1593  1410

# Query specific metrics in machine-readable format
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw \
  --format=csv,noheader,nounits
# 0, NVIDIA A100-PCIE-80GB, 42, 87, 45321, 81920, 250

# Process-level GPU usage
nvidia-smi pmon -c 1
# gpu   pid    type    sm    mem    enc    dec    command
#   0   12345   C      87    34      0      0    python3

# Check ECC error counts (data center GPUs)
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total \
  --format=csv,noheader
# 0, 0
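
The machine-readable CSV format is meant to be consumed by scripts. As a small illustration, the sketch below parses one line of the --query-gpu output shown above and derives the fraction of VRAM in use. The field list mirrors that query; the helper name is hypothetical, and values such as [N/A] on GPUs that do not report a field would need extra handling.

```python
import csv
import io

# Same field order as the --query-gpu example
FIELDS = ["index", "name", "temperature.gpu", "utilization.gpu",
          "memory.used", "memory.total", "power.draw"]

def parse_gpu_csv(text: str) -> list[dict]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for cells in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not cells:
            continue
        gpu = dict(zip(FIELDS, (c.strip() for c in cells)))
        gpu["memory.fraction"] = float(gpu["memory.used"]) / float(gpu["memory.total"])
        rows.append(gpu)
    return rows

# Sample line copied from the example output
sample = "0, NVIDIA A100-PCIE-80GB, 42, 87, 45321, 81920, 250\n"
for gpu in parse_gpu_csv(sample):
    print(f"GPU {gpu['index']} ({gpu['name']}): {gpu['memory.fraction']:.0%} VRAM in use")
```

On a live system the text would come from running nvidia-smi through subprocess.run with capture_output=True and text=True.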

DCGM (Data Center GPU Manager)

# Install DCGM
dnf install -y datacenter-gpu-manager

# Start the DCGM service
systemctl enable --now nvidia-dcgm

# Run diagnostics
dcgmi diag -r 3
# Level 3 = long diagnostic: memory, PCIe, compute, stress test
# Can take up to ~30 minutes; use -r 1 for a quick check

# Health monitoring
dcgmi health -s mpi   # Watch memory, PCIe, and InfoROM health
dcgmi health -c       # Check current health status

# Group management (for multi-GPU monitoring)
dcgmi group -c "inference-gpus" -a 0,1
# The create command prints the new group ID; substitute it for <group-id> below
dcgmi stats -g <group-id> --enable
dcgmi stats -g <group-id> --pid <pid> -v   # Per-process statistics for a given PID

Prometheus exporter

# Deploy DCGM exporter as a container
# (SYS_ADMIN is needed for the DCGM_FI_PROF_* profiling metrics)
podman run -d --name dcgm-exporter \
  --device nvidia.com/gpu=all \
  --cap-add SYS_ADMIN \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04

# Verify metrics
curl -s http://localhost:9400/metrics | head -20
# DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-...",device="nvidia0"} 42
# DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-..."} 72.5
# DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-..."} 87
# DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0",UUID="GPU-..."} 34
# DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-..."} 45321

# Prometheus scrape config
# prometheus.yml:
scrape_configs:
  - job_name: 'dcgm'
    static_configs:
      - targets: ['gpu-node-01:9400', 'gpu-node-02:9400']
    scrape_interval: 15s
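
Before wiring up Prometheus it can help to confirm that the exporter's output parses the way you expect. The sketch below reads labeled samples from the text exposition format. The scrape body is a hypothetical excerpt modeled on the sample metrics, and the regular expressions cover only the simple labeled form used by the exporter, not the full Prometheus grammar.

```python
import re

LINE = re.compile(r'^(?P<name>[A-Za-z_:][\w:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

def parse_metrics(text: str) -> list[tuple[str, dict, float]]:
    """Parse labeled Prometheus samples; comments and unlabeled lines are skipped."""
    samples = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            labels = dict(LABEL.findall(m["labels"]))
            samples.append((m["name"], labels, float(m["value"])))
    return samples

# Hypothetical scrape body for illustration
body = '''# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-example"} 42
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-example"} 45321
'''
for name, labels, value in parse_metrics(body):
    print(name, labels["gpu"], value)
```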

Grafana dashboard

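The DCGM exporter provides ~50 metrics per GPU. Which fields are exported is controlled by the exporter's counters CSV, so some of the fields below (profiling and ECC counters in particular) may need to be enabled there. The most useful Grafana panels track: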

Metric	PromQL	Purpose
GPU utilization	`DCGM_FI_DEV_GPU_UTIL`	Percent of time at least one kernel was running
Memory used	`DCGM_FI_DEV_FB_USED`	VRAM consumption in MiB
Temperature	`DCGM_FI_DEV_GPU_TEMP`	Die temperature, alert at 85C
Power draw	`DCGM_FI_DEV_POWER_USAGE`	Watts, for capacity planning
PCIe throughput	`DCGM_FI_DEV_PCIE_TX_THROUGHPUT`	Data to/from host memory
ECC errors	`DCGM_FI_DEV_ECC_DBE_VOL_TOTAL`	Uncorrectable errors, replace GPU if > 0
Clock throttle	`DCGM_FI_DEV_CLOCK_THROTTLE_REASONS`	Why clocks are reduced
Tensor active	`DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`	Tensor Core utilization (0-1)

# Alert rule: GPU temperature critical
groups:
  - name: gpu_alerts
    rules:
      - alert: GPUTemperatureCritical
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU {{ $labels.gpu }} temperature {{ $value }}C on {{ $labels.instance }}"

      - alert: GPUMemoryExhausted
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} memory >95% used on {{ $labels.instance }}"

      - alert: GPUECCError
        expr: DCGM_FI_DEV_ECC_DBE_VOL_TOTAL > 0
        labels:
          severity: critical
        annotations:
          summary: "Uncorrectable ECC error on GPU {{ $labels.gpu }} — schedule replacement"

GPU monitoring is where most teams fail silently. They deploy GPUs, run workloads, and never look at utilization. Months later they buy more GPUs because "inference is slow" — when the real problem is that their existing GPUs are 20% utilized because the inference engine is CPU-bottlenecked or the batch size is too small. The DCGM_FI_PROF_PIPE_TENSOR_ACTIVE metric is the most telling: if your LLM inference is running and Tensor Core utilization is below 30%, you are leaving performance on the table. It usually means the batch size is too small or the model is memory-bandwidth-bound rather than compute-bound. Fix the configuration before buying more hardware.

11. Power Management

A single NVIDIA H100 SXM can draw up to 700 W under full load (the PCIe variant is rated lower). Eight of them in a single server draw up to 5.6 kW for the GPUs alone, before counting CPUs, memory, and fans. Power management is not optional at scale. NVIDIA provides fine-grained control over power limits, clock frequencies, and thermal throttling behavior.

Power limit management

# Query current power state
nvidia-smi -q -d POWER
# Power Management : Supported
# Power Draw       : 72.50 W
# Current Power Limit : 300.00 W
# Default Power Limit : 300.00 W
# Min Power Limit  : 100.00 W
# Max Power Limit  : 350.00 W

# Set power limit (persists until reboot)
nvidia-smi -pl 250  # Limit all GPUs to 250W
nvidia-smi -i 0 -pl 200  # Limit GPU 0 to 200W

# Persistent power limit via systemd
cat > /etc/systemd/system/nvidia-power-limit.service <<'EOF'
[Unit]
Description=Set NVIDIA GPU power limits
After=nvidia-persistenced.service
Requires=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pl 250
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
systemctl enable nvidia-power-limit

Clock frequency management

# Query current clocks
nvidia-smi -q -d CLOCK
# Graphics : 1410 MHz
# SM       : 1410 MHz
# Memory   : 1593 MHz

# Lock clocks to specific frequencies (useful for reproducible benchmarks)
nvidia-smi -lgc 1200,1200  # Lock graphics clock to 1200 MHz (min,max)
nvidia-smi -lmc 1593,1593  # Lock memory clock (min,max; not supported on every GPU)

# Reset to default
nvidia-smi -rgc
nvidia-smi -rmc

# Set application-specific clocks
nvidia-smi -ac 1593,1410  # memory_clock,graphics_clock

Thermal management

# Query thermal status
nvidia-smi -q -d TEMPERATURE
# GPU Current Temp            : 42 C
# GPU T.Limit Temp            : 83 C
# GPU Shutdown Temp           : 92 C
# GPU Slowdown Temp           : 89 C
# GPU Max Operating Temp      : 83 C

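# Throttle reasons bitmap (newer drivers also accept clocks_event_reasons.active)
nvidia-smi --query-gpu=clocks_throttle_reasons.active --format=csv
# The value is a bitmask, so several reasons can be set at once:
#   0x0000000000000000 = No throttling
#   0x0000000000000001 = GPU Idle
#   0x0000000000000002 = Applications Clocks Setting
#   0x0000000000000004 = SW Power Cap
#   0x0000000000000008 = HW Slowdown (thermal or power brake)
#   0x0000000000000010 = Sync Boost
#   0x0000000000000020 = SW Thermal Slowdown
#   0x0000000000000040 = HW Thermal Slowdown
#   0x0000000000000080 = HW Power Brake Slowdown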
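
Because the value is a bitmask, more than one reason can be active at the same time. The sketch below decodes a mask into reason names. The bit assignments follow NVML's nvmlClocksThrottleReason constants; the two entries for 0x02 and 0x10 come from that header rather than from the list in this guide.

```python
# Bit values follow NVML's nvmlClocksThrottleReason* constants
THROTTLE_REASONS = {
    0x01: "GPU Idle",
    0x02: "Applications Clocks Setting",
    0x04: "SW Power Cap",
    0x08: "HW Slowdown",
    0x10: "Sync Boost",
    0x20: "SW Thermal Slowdown",
    0x40: "HW Thermal Slowdown",
    0x80: "HW Power Brake Slowdown",
}

def decode_throttle(mask_text: str) -> list[str]:
    """Turn a hex bitmask string such as '0x0000000000000044' into reason names."""
    mask = int(mask_text.strip(), 16)
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]

# 0x44 = SW Power Cap (0x04) combined with HW Thermal Slowdown (0x40)
print(decode_throttle("0x0000000000000044"))
```

An empty list means no throttling. A GPU that reports only "GPU Idle" is simply not running work.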

Power-performance tradeoffs

Power Limit (% of TDP)	Typical Performance Impact	Use Case
100%	Baseline	Maximum throughput, cost irrelevant
85%	-3 to -5%	Best efficiency — most power savings per lost performance
70%	-10 to -15%	Power-constrained environments, dense GPU servers
50%	-25 to -35%	Idle/standby, minimal workload

The efficiency sweet spot for NVIDIA GPUs is about 80-85% of TDP. At that power level you lose only 3-5% of throughput but save 15-20% on power draw. This is because GPU power consumption rises much faster than linearly with clock frequency (dynamic power scales with voltage squared times frequency, and higher clocks need higher voltage), while performance scales at best linearly. Dropping from 300W to 250W doesn't drop clocks by 17%; it drops them by maybe 5-8%, and the throughput loss is even smaller because many workloads are memory-bandwidth-bound, not compute-bound. For inference workloads specifically, power limiting to 80% TDP is almost always the right default. You save real money on electricity and cooling with negligible latency impact.
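
One way to see why the 85% row is the sweet spot is to compute how much power each step saves per point of throughput given up. The sketch below uses the midpoints of the ranges in the table as assumed performance values and assumes the GPU draws exactly its power limit; both are simplifications for illustration.

```python
def savings_per_loss(power_fraction: float, perf_fraction: float) -> float:
    """Points of power saved for each point of throughput given up."""
    return (1 - power_fraction) / (1 - perf_fraction)

# (power limit as a fraction of TDP, assumed midpoint of the table's performance range)
table = [(0.85, 0.960), (0.70, 0.875), (0.50, 0.700)]

for power, perf in table:
    ratio = savings_per_loss(power, perf)
    print(f"{power:.0%} TDP: {perf:.1%} throughput, "
          f"{ratio:.2f} points of power saved per point of throughput lost")
```

The ratio falls as the limit drops, which matches the table: the first 15% of power is the cheapest to give up.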

12. SELinux and GPU

On kldload systems running CentOS, RHEL, or Rocky, SELinux is enforcing by default. The NVIDIA driver creates device nodes (/dev/nvidia*) and uses /proc and /sys interfaces that need correct SELinux contexts. Most GPU problems on SELinux-enforcing systems are not driver bugs — they are policy denials.

Required device contexts

# Check current contexts on NVIDIA devices
ls -laZ /dev/nvidia*
# crw-rw-rw-. root root system_u:object_r:device_t:s0 /dev/nvidia0
# crw-rw-rw-. root root system_u:object_r:device_t:s0 /dev/nvidiactl
# crw-rw-rw-. root root system_u:object_r:device_t:s0 /dev/nvidia-uvm

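# device_t is the generic fallback label. The stock policy typically maps
# /dev/nvidia* to xserver_misc_device_t, so try restorecon -v /dev/nvidia* first.
# If containers are still denied, create a custom module. Note that the module
# below allows access to every node labeled device_t, not only NVIDIA devices.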
cat > nvidia_device.te <<'EOF'
module nvidia_device 1.0;

require {
    type device_t;
    type container_t;
    type svirt_lxc_net_t;
    class chr_file { open read write ioctl map getattr };
}

# Allow containers to access nvidia device nodes
allow container_t device_t:chr_file { open read write ioctl map getattr };
allow svirt_lxc_net_t device_t:chr_file { open read write ioctl map getattr };
EOF

# Compile and install the module
checkmodule -M -m -o nvidia_device.mod nvidia_device.te
semodule_package -o nvidia_device.pp -m nvidia_device.mod
semodule -i nvidia_device.pp

Container GPU access under SELinux

# Check for AVC denials related to nvidia
ausearch -m avc -ts recent | grep nvidia

# Common denial: container_t trying to access /dev/nvidia*
# type=AVC msg=audit(...): avc: denied { read write } for pid=12345
#   comm="python3" path="/dev/nvidia0" dev="devtmpfs"
#   tcontext=system_u:object_r:device_t:s0 tclass=chr_file

# Quick fix: set permissive for container domain (development only!)
semanage permissive -a container_t

# Proper fix: create targeted policy (see module above)

# For Podman with --device, SELinux needs the container_use_devices boolean
setsebool -P container_use_devices on

# Verify
getsebool container_use_devices
# container_use_devices --> on

Persistence daemon SELinux context

# nvidia-persistenced needs its own context
# If you see denials for nvidia-persistenced:
ausearch -m avc -c nvidia-persistenced

# Allow nvidia-persistenced to manage device files
cat > nvidia_persist.te <<'EOF'
module nvidia_persist 1.0;

require {
    type init_t;
    type device_t;
    type tmpfs_t;
    class chr_file { open read write ioctl getattr setattr create unlink };
    class dir { open read write search add_name remove_name };
    class file { open read write create unlink getattr };
}

allow init_t device_t:chr_file { open read write ioctl getattr setattr create unlink };
allow init_t tmpfs_t:dir { open read write search add_name remove_name };
allow init_t tmpfs_t:file { open read write create unlink getattr };
EOF

checkmodule -M -m -o nvidia_persist.mod nvidia_persist.te
semodule_package -o nvidia_persist.pp -m nvidia_persist.mod
semodule -i nvidia_persist.pp

SELinux + NVIDIA is one of those combinations that works perfectly once configured but is extremely frustrating to debug without understanding both systems. The core issue: NVIDIA's kernel module dynamically creates device nodes in /dev, and SELinux's default policy treats all of /dev as device_t — a generic type that most confined domains cannot access. The NVIDIA RPM packages do not ship SELinux policy modules (as of 2025). This means every kldload deployment on RHEL/CentOS/Rocky needs a custom SELinux module for GPU container access. The container_use_devices boolean helps, but for production you want a targeted module that grants exactly the access needed and nothing more. Never leave a domain in permissive mode in production — it logs denials but doesn't enforce, which means your security boundary has a hole.

13. ZFS Storage for AI Workloads

AI workloads have distinct storage patterns: model weights are large, sequentially read, and incompressible. Training datasets are large, randomly accessed, and may be compressible. Checkpoints are large, write-once, and benefit from snapshots. ZFS lets you tune each dataset independently — the right configuration can improve GPU utilization by reducing I/O wait.

Storage layout for AI infrastructure

# Create the dataset hierarchy
zfs create rpool/ai

# Model weights — large sequential reads, incompressible
zfs create -o recordsize=1M \
           -o compression=off \
           -o atime=off \
           -o primarycache=metadata \
           rpool/ai/models

# Training datasets — random reads, possibly compressible
zfs create -o recordsize=128K \
           -o compression=lz4 \
           -o atime=off \
           -o primarycache=all \
           rpool/ai/datasets

# Training checkpoints — write-once, snapshot-friendly
zfs create -o recordsize=1M \
           -o compression=lz4 \
           -o atime=off \
           -o snapdir=visible \
           rpool/ai/checkpoints

# Hugging Face cache
zfs create -o recordsize=1M \
           -o compression=off \
           -o atime=off \
           rpool/ai/hf-cache

# Set HF_HOME to ZFS
echo 'export HF_HOME=/rpool/ai/hf-cache' >> /etc/profile.d/ai.sh

Why recordsize matters for GPU workloads

When a training script loads a checkpoint (for example with torch.load() before calling model.load_state_dict()), it reads the entire weight file sequentially. With ZFS's default 128K recordsize, a 140 GB model file is split into ~1.1 million records, each requiring a metadata lookup. With 1M recordsize, it's ~143,000 records: 8x fewer metadata operations. On NVMe storage, this difference is the gap between saturating the drive's bandwidth and being metadata-limited.
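
The record counts quoted here are easy to reproduce. This sketch treats the 140 GB file as 140 GiB and counts how many records each recordsize needs; the helper name is hypothetical.

```python
KIB, MIB, GIB = 2**10, 2**20, 2**30

def record_count(file_bytes: int, recordsize_bytes: int) -> int:
    """Number of ZFS records needed to hold a file (ceiling division)."""
    return -(-file_bytes // recordsize_bytes)

model = 140 * GIB
for label, recordsize in [("128K", 128 * KIB), ("512K", 512 * KIB), ("1M", MIB)]:
    print(f"recordsize={label}: {record_count(model, recordsize):,} records")
```

Going from 128K to 1M cuts the record count by exactly a factor of eight, and with it the number of block pointers ZFS has to walk during a sequential read.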

# Benchmark: sequential read throughput at different recordsizes
# 128K recordsize:  2.1 GB/s (metadata overhead)
# 512K recordsize:  3.0 GB/s
# 1M recordsize:    3.4 GB/s (near NVMe line rate)

# Verify current recordsize
zfs get recordsize rpool/ai/models
# NAME              PROPERTY    VALUE   SOURCE
# rpool/ai/models   recordsize  1M      local

Snapshot strategy for model development

# Snapshot before each fine-tuning run
zfs snapshot rpool/ai/checkpoints@run-2025-04-05-baseline

# Training writes checkpoints to /rpool/ai/checkpoints/run-001/
# Each checkpoint is ~280 GB for a 70B model

# Snapshot after training completes
zfs snapshot rpool/ai/checkpoints@run-2025-04-05-finetuned

# Compare space used by fine-tuning checkpoints
zfs list -t snapshot -o name,used,refer rpool/ai/checkpoints
# NAME                                          USED   REFER
# rpool/ai/checkpoints@run-2025-04-05-baseline  128K   256K
# rpool/ai/checkpoints@run-2025-04-05-finetuned 280G   560G

# Clone a checkpoint for A/B evaluation
zfs clone rpool/ai/checkpoints@run-2025-04-05-finetuned rpool/ai/eval-a
zfs clone rpool/ai/checkpoints@run-2025-04-05-baseline rpool/ai/eval-b

# Clones share blocks with their origin snapshot: a new clone consumes almost
# no extra space and grows only as its contents diverge from the origin

# Send checkpoint to another node for distributed evaluation
zfs send rpool/ai/checkpoints@run-2025-04-05-finetuned | \
  ssh gpu-node-02 zfs receive rpool/ai/checkpoints

L2ARC for dataset caching

# If you have a spare NVMe drive, use it as L2ARC (read cache)
# Ideal for training datasets that are read repeatedly across epochs
zpool add rpool cache /dev/nvme1n1

# Watch cache-device I/O
zpool iostat -v rpool 5
# Check the L2ARC hit rate with arc_summary
arc_summary | grep -A5 "L2ARC"

# L2ARC is most effective when:
# - Dataset size > ARC size (RAM) but < L2ARC size (NVMe)
# - Access pattern is repeated (training epochs)
# - Record size is 128K or less and reads are random (prefetched streaming reads
#   are not cached by default, and every cached record costs a header in RAM)

The ZFS + AI storage interaction is one of those places where understanding both systems pays enormous dividends. Most teams store model weights on ext4 or XFS and never think about it. But ZFS clone is a superpower for model management: you can have ten variants of a 140 GB base model, and they share every block that stays byte-identical. The actual disk usage is the base model plus the deltas. The savings are largest when variants leave most of the base untouched (frozen layers, or adapters stored next to the base weights); a full fine-tune that rewrites every tensor shares almost nothing. ZFS send/receive lets you replicate models between GPU nodes with incremental transfers. And snapshots before fine-tuning runs give you instant rollback if a run produces garbage. The key insight: treat model weights like data, not files. They have lifecycle (versioning), need replication (multi-node), and benefit from block sharing (shared base weights). ZFS handles all of this natively.
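
The "base plus deltas" accounting can be made concrete with a small calculation. The sketch below compares full copies against clones for a given fraction of rewritten blocks. The 5% figure is a hypothetical value chosen for illustration; measure the real delta on your pool with zfs list.

```python
def pool_usage_gb(base_gb: float, variants: int, changed_fraction: float) -> dict:
    """Compare full copies with ZFS clones where only changed blocks are stored."""
    full_copies = base_gb * (1 + variants)
    clones = base_gb + variants * base_gb * changed_fraction
    return {"full_copies_gb": full_copies, "clones_gb": clones}

# Hypothetical: ten variants of a 140 GB base, each rewriting 5% of its blocks
print(pool_usage_gb(140, 10, 0.05))

# A full fine-tune rewrites every tensor, so nothing is shared with the base
print(pool_usage_gb(140, 10, 1.0))
```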

14. Troubleshooting Reference

GPU issues fall into predictable categories. This reference table covers the most common problems, their root causes, and the exact commands to diagnose and fix them.

Symptom	Likely Cause	Diagnostic	Fix
`nvidia-smi` returns "No devices found"	Driver not loaded or GPU not visible on PCIe	`lspci \| grep -i nvidia` and `lsmod \| grep nvidia`	If lspci shows nothing: reseat GPU, check BIOS. If lspci shows GPU but lsmod empty: `modprobe nvidia`. If modprobe fails: check `dkms status`, rebuild module.
"NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver"	Driver/kernel version mismatch after kernel update	`dkms status` — module should say "installed" for current kernel	`dnf install kernel-devel-$(uname -r) && dkms autoinstall && modprobe nvidia`
"CUDA error: no CUDA-capable device is detected"	CUDA/driver version mismatch or `CUDA_VISIBLE_DEVICES` empty	`nvidia-smi` (driver works?) then `echo $CUDA_VISIBLE_DEVICES`	Check compatibility matrix. Upgrade driver or downgrade CUDA toolkit.
"CUDA out of memory"	Model + KV cache exceeds VRAM	`nvidia-smi` during run — check memory used	Reduce batch size, use quantization (Q4_K_M), reduce context length, add more GPUs with tensor parallelism.
GPU passthrough: VM gets error code 43	NVIDIA driver detects hypervisor on consumer GPU (enforced by drivers older than 465)	Check VM XML for `<kvm><hidden state='on'/>`	Add `<kvm><hidden state='on'/></kvm>` to `<features>` in VM XML, or update the guest driver.
VFIO: "device is not in a group assigned to a container"	Not all devices in the IOMMU group are bound to vfio-pci	Check all devices in the group: IOMMU group enumeration script above	Bind all devices in the IOMMU group to vfio-pci, or use ACS override.
Container: "could not select device driver: nvidia"	NVIDIA container toolkit not configured	`nvidia-ctk cdi list` (should show nvidia.com/gpu devices)	Podman: `nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml`. Docker: `nvidia-ctk runtime configure --runtime=docker`, then restart Docker.
Kubernetes: pods stuck in Pending with "insufficient nvidia.com/gpu"	Device plugin not running or GPUs already allocated	`kubectl describe node \| grep nvidia` and `kubectl -n gpu-operator get pods`	Verify device plugin DaemonSet is running. Check if other pods hold GPU allocations.
Performance degrades over time	Thermal throttling	`nvidia-smi -q -d PERFORMANCE` (lists active throttle reasons)	Improve airflow, reduce power limit, clean dust from heatsinks.
Xid errors in `dmesg`	Hardware fault or driver bug	`dmesg \| grep -i "xid"` — note the error code	Xid 48: double-bit ECC — replace GPU. Xid 79: GPU fallen off bus — reseat or replace. Xid 13: GR exception — update driver.
SELinux AVC denial for /dev/nvidia*	Missing SELinux policy for GPU devices	`ausearch -m avc -ts recent \| grep nvidia`	Install custom SELinux module (see section 12) or set `setsebool -P container_use_devices on`.
nouveau conflicts with nvidia module	nouveau not blacklisted	`lsmod \| grep nouveau`	Blacklist nouveau (see section 2), rebuild initramfs with `dracut --force`, reboot.

Xid error reference

Xid errors are NVIDIA's kernel-level diagnostic codes, logged to dmesg. They are the GPU equivalent of machine-check exceptions. Some are informational, some indicate hardware failure.

Xid Code	Meaning	Severity	Action
13	Graphics Engine Exception	Medium	Usually a driver bug. Update driver. If persistent, RMA GPU.
31	GPU memory page fault	Medium	Application bug (out-of-bounds access) or driver bug.
43	GPU stopped processing	High	GPU hang. Reset with `nvidia-smi -r`. If frequent, RMA.
48	Double-bit ECC error	Critical	Uncorrectable memory error. Replace GPU immediately.
63	ECC page retirement or row remapping recorded	Medium	A failing memory page or row was flagged. Reset the GPU to apply the remap. Schedule replacement if events recur or remapping fails (Xid 64).
79	GPU has fallen off the bus	Critical	PCIe link failure. Reseat GPU. Check PSU. Replace if persistent.
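94	Contained ECC error	High	Uncorrectable error contained to one application, which is terminated. Other workloads continue. Monitor frequency.
95	Uncontained ECC error	Critical	Uncorrectable error that could not be contained, so every application on the GPU is affected. Reset the GPU. Replace it if the error recurs.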
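
Scanning the kernel log for Xid codes by hand gets tedious on a fleet. The sketch below extracts the PCI bus ID and Xid code from log text with a regular expression. The sample lines are a hypothetical dmesg excerpt, and the exact message layout can differ between driver versions, so verify the pattern against your own logs.

```python
import re

# Matches kernel log lines such as "NVRM: Xid (PCI:0000:41:00): 79, ..."
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+)")

# A subset of the codes this guide treats as critical
CRITICAL = {48, 79, 95}

def find_xids(log_text: str) -> list[tuple[str, int]]:
    """Return (pci_bus_id, xid_code) for every Xid line in the log text."""
    return [(m["bus"], int(m["code"])) for m in XID_RE.finditer(log_text)]

# Hypothetical dmesg excerpt for illustration
sample = """\
[ 1234.5] NVRM: Xid (PCI:0000:41:00): 79, pid=4242, GPU has fallen off the bus.
[ 1300.1] NVRM: Xid (PCI:0000:41:00): 31, pid=4242, Ch 00000008, MMU Fault
"""
for bus, code in find_xids(sample):
    level = "CRITICAL" if code in CRITICAL else "check table"
    print(f"{bus}: Xid {code} ({level})")
```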

Diagnostic one-liner collection

# Full GPU status dump
nvidia-smi -q

# Check all GPU health in one line
nvidia-smi --query-gpu=index,name,driver_version,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total,ecc.errors.uncorrected.volatile.total --format=csv

# Watch GPU utilization in real time
watch -n 1 nvidia-smi

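# Check PCIe link generation and width, current vs. maximum
# (the current generation can drop while the GPU is idle)
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv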

# Check if persistence mode is enabled
nvidia-smi -q | grep "Persistence Mode"

# List all NVIDIA kernel modules
lsmod | grep -E "nvidia|nouveau"

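# Check DKMS build logs after failure
tail -n 50 /var/lib/dkms/nvidia/*/build/make.log

# Test GPU compute with deviceQuery
# (CUDA 11.6+ no longer bundles samples; build it from the NVIDIA cuda-samples
#  repository, and adjust the path to wherever your build places the binary)
cuda-samples/Samples/1_Utilities/deviceQuery/deviceQuery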

# Generate bug report for NVIDIA support
nvidia-bug-report.sh

The most underused diagnostic tool is nvidia-bug-report.sh. It generates a gzip-compressed log (nvidia-bug-report.log.gz in the current directory) containing driver version, kernel version, dmesg output, lspci output, nvidia-smi output, Xid errors, DKMS status, and modprobe configuration: everything NVIDIA support needs and everything you need to debug 90% of issues. Run it before you start debugging. Read it before you file a support ticket. It is installed with the driver and lives at /usr/bin/nvidia-bug-report.sh.

15. Advanced Topics

GPU reset and recovery

# Reset a hung GPU (requires persistence mode disabled first)
nvidia-smi -pm 0 -i 0
nvidia-smi -r -i 0
nvidia-smi -pm 1 -i 0

# If nvidia-smi hangs, try a PCI function-level reset
echo 1 > /sys/bus/pci/devices/0000:41:00.0/reset
# or unbind the device from the driver and bind it again
echo "0000:41:00.0" > /sys/bus/pci/devices/0000:41:00.0/driver/unbind
echo "0000:41:00.0" > /sys/bus/pci/drivers/nvidia/bind

# Last resort: remove the device and rescan the PCIe bus
echo 1 > /sys/bus/pci/devices/0000:41:00.0/remove
echo 1 > /sys/bus/pci/rescan

GPU-Direct Storage (GDS)

GPU-Direct Storage allows data to flow directly from NVMe storage to GPU memory via PCIe, bypassing the CPU and system RAM entirely. This eliminates the CPU-mediated copy that normally happens (NVMe -> system RAM -> GPU VRAM) and can improve dataset loading throughput by 2-3x for I/O-bound training workloads.

# Install GDS support
dnf install -y nvidia-gds

# Verify GDS is available
/usr/local/cuda/gds/tools/gdscheck -p

# GDS requires:
# - NVIDIA Magnum IO GPUDirect Storage driver
# - ext4 or xfs on the storage path (ZFS not yet supported for GDS)
# - NVIDIA driver 525+ with CUDA 12.0+
# - Supported NVMe controllers (check NVIDIA compatibility list)

# For ZFS-based infrastructure, use a small ext4 staging area for GDS:
# 1. Pre-load dataset from ZFS to ext4 NVMe staging
# 2. Train with GDS reading from ext4 staging
# 3. Write checkpoints back to ZFS

NCCL tuning for multi-node training

# NCCL (NVIDIA Collective Communications Library) handles multi-GPU/multi-node
# communication for distributed training

# Key environment variables
export NCCL_DEBUG=INFO                    # Enable debug logging
export NCCL_IB_DISABLE=0                  # Enable InfiniBand if available
export NCCL_SOCKET_IFNAME=eth0            # Network interface for TCP fallback
export NCCL_P2P_LEVEL=NVL                 # P2P via NVLink (fastest)
export NCCL_NET_GDR_LEVEL=5               # GPU-Direct RDMA level

# Verify NCCL communication paths
# In your training script, NCCL_DEBUG=INFO will log:
# NCCL INFO Using network IB
# NCCL INFO Channel 00/02: 0[0] -> 1[1] via NET/IB/0
# NCCL INFO Connected all trees and rings

# For multi-node training over WireGuard (kldload mesh):
export NCCL_SOCKET_IFNAME=wg0
export NCCL_IB_DISABLE=1
# Note: WireGuard adds encryption overhead — fine for small clusters,
# use InfiniBand or RoCE for production multi-node training

Fabric Manager for NVSwitch systems

# Systems with NVSwitch (DGX, HGX) need the Fabric Manager service
# NVSwitch provides all-to-all NVLink connectivity between 8 GPUs

dnf install -y nvidia-fabricmanager
systemctl enable --now nvidia-fabricmanager

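# The Fabric Manager version must match the installed driver version

# Verify NVLink link state
nvidia-smi nvlink -s
# Prints each GPU followed by its per-link speeds (for example "Link 0: 26.562 GB/s");
# links reported as inactive point to a fabric or hardware problem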

eBPF for GPU observability

# Use bpftrace to trace GPU driver activity
# Trace all nvidia ioctl calls
bpftrace -e 'tracepoint:syscalls:sys_enter_ioctl /comm == "nvidia-smi"/ {
  printf("pid=%d ioctl cmd=%x\n", pid, args->cmd);
}'

# Trace UVM mmap calls. Kernel symbol names differ between driver versions,
# so list the candidates first and attach to one that exists on your system
bpftrace -l 'kprobe:*uvm*mmap*'
bpftrace -e 'kprobe:uvm_mmap {
  printf("pid=%d comm=%s UVM mmap\n", pid, comm);
}'

# PCIe traffic is not exposed through kernel tracepoints;
# read the GPU's own PCIe RX/TX counters (MB/s) instead
nvidia-smi dmon -s t -i 0

GPU-Direct Storage not supporting ZFS is a real limitation for kldload infrastructure. The workaround — staging data on ext4 for GDS-accelerated reads — adds operational complexity. In practice, most inference workloads load the model once at startup and then serve from VRAM, so GDS matters primarily for training with large datasets that don't fit in RAM. If your workload is inference-only (which is the majority of production GPU use), ZFS's inability to do GDS is irrelevant. For training workloads where GDS matters, the ext4 staging approach is well-established in HPC environments.

16. Golden Image with GPU Support

When building golden images with kldload's export workflow, GPU support requires additional steps to ensure the NVIDIA driver survives cloning and cloud-init re-identification.

# In the kldload web UI, select your OS and profile as usual
# After installation, before exporting:

# 1. Install NVIDIA driver with DKMS (will survive kernel updates)
dnf install -y nvidia-driver nvidia-driver-devel cuda-toolkit-12-6

# 2. Verify DKMS module is built
dkms status
# nvidia/550.127.05, 5.14.0-503.el9.x86_64, x86_64: installed

# 3. Enable persistence daemon
systemctl enable nvidia-persistenced

# 4. Pre-generate CDI config (regenerated at boot by the service in step 5)
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# 5. Create a service that regenerates CDI on every boot, after cloud-init
cat > /etc/systemd/system/nvidia-cdi-regen.service <<'EOF'
[Unit]
Description=Regenerate NVIDIA CDI config
After=nvidia-persistenced.service cloud-final.service
Wants=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
systemctl enable nvidia-cdi-regen

# 6. Export via kldload web UI. The driver, DKMS source, and services
#    are all in the image. If a clone later boots a different kernel,
#    DKMS rebuilds the module; on different GPU hardware, the CDI service
#    from step 5 regenerates the device list at boot.
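Before exporting, it can help to check step 2 mechanically rather than by eye. The following is a minimal sketch that reads dkms status output on stdin and succeeds only if an nvidia module is installed for a given kernel release; dkms_nvidia_installed is a hypothetical helper name, and the line format is assumed to match the sample shown in step 2.

```shell
#!/usr/bin/env bash
# Sketch: succeed if `dkms status` output (on stdin) shows an nvidia module
# in the "installed" state for kernel release $1.
# dkms_nvidia_installed is a hypothetical helper, not a dkms subcommand.
dkms_nvidia_installed() {
  local kernel="$1"
  awk -v k="$kernel" '
    # Accept both "nvidia/<ver>, ..." and the older "nvidia, <ver>, ..." form
    $0 ~ "^nvidia[/,]" && index($0, k) > 0 && $0 ~ ": installed" { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage (commented out so that sourcing this file has no side effects):
# dkms status | dkms_nvidia_installed "$(uname -r)" \
#   || echo "nvidia DKMS module is not installed for $(uname -r)" >&2
```

The kernel release is matched as a literal substring, so a module built for a different kernel does not pass the check.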

Cloud-init GPU detection

# Add a cloud-init runcmd to configure GPUs at first boot
cat > /etc/cloud/cloud.cfg.d/99-nvidia.cfg <<'EOF'
runcmd:
  - nvidia-smi -pm 1
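  # Example power cap in watts; it must lie within the GPU's min/max limits
  # (see: nvidia-smi -q -d POWER), otherwise the command fails
  - nvidia-smi -pl 250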
  - nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
  - systemctl restart nvidia-persistenced
EOF
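A fixed power limit is only valid when it lies inside each GPU's supported range, which differs between models. The following is a minimal sketch that clamps a requested limit to a given range; clamp_power_limit is a hypothetical helper name, and the commented usage assumes the power.min_limit and power.max_limit query fields of nvidia-smi return numeric values on your GPUs.

```shell
#!/usr/bin/env bash
# Sketch: clamp a requested power limit (watts) to a GPU's supported range.
# clamp_power_limit is a hypothetical helper, not an nvidia-smi feature.
#   $1: requested limit, $2: minimum allowed, $3: maximum allowed
clamp_power_limit() {
  awk -v w="$1" -v lo="$2" -v hi="$3" 'BEGIN {
    if (w < lo) w = lo   # raise to the minimum the GPU accepts
    if (w > hi) w = hi   # lower to the maximum the GPU accepts
    printf "%d\n", w     # nvidia-smi -pl takes whole watts here
  }'
}

# Usage on a real host (commented out; requires nvidia-smi and root):
# nvidia-smi --query-gpu=index,power.min_limit,power.max_limit \
#            --format=csv,noheader,nounits |
# while IFS=', ' read -r idx min max; do
#   nvidia-smi -i "$idx" -pl "$(clamp_power_limit 250 "$min" "$max")"
# done
```

With this approach the same golden image can boot on a 72 W card and a 700 W card without the power-limit step failing on either.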

17. Capacity Planning

GPU capacity planning is different from CPU capacity planning. GPUs are hard to share efficiently (time-slicing gives no memory isolation, and MIG exists only on certain data-center GPUs), VRAM is a hard limit, and power/cooling is a real constraint. This section gives you the numbers you need.

Inference capacity estimation

Model Size	Quantization	GPU	Tokens/sec (batch=1)	Tokens/sec (batch=32)	Concurrent Users
7-8B	Q4_K_M	RTX 4090 (24 GB)	~80	~1,200	~40-60
7-8B	FP16	A100-80GB	~120	~2,400	~80-100
70B	Q4_K_M	2x RTX 4090	~15	~180	~6-10
70B	FP16	2x A100-80GB	~35	~700	~20-30
70B	FP16	4x H100-80GB	~90	~2,800	~80-100
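The first question behind every row in this table is whether the weights fit in VRAM at all. The following is a rough sketch of that arithmetic: weight size in GB is parameters (in billions) times bits per weight divided by 8. The function names are hypothetical, and the result ignores KV cache, activations, and runtime overhead, all of which must fit in the remaining headroom.

```shell
#!/usr/bin/env bash
# Sketch: estimate weight size and remaining VRAM for a model.
# weights_gb and headroom_gb are hypothetical helper names.
# Assumption: 1 GB = 10^9 bytes; KV cache and activations are NOT included.

# $1: parameters in billions, $2: bits per weight (16 for FP16)
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# $1: parameters in billions, $2: bits per weight, $3: total VRAM in GB
# A negative result means the weights alone do not fit.
headroom_gb() {
  awk -v p="$1" -v b="$2" -v v="$3" 'BEGIN { printf "%.1f\n", v - p * b / 8 }'
}

# Usage:
# weights_gb 70 16        # 70B model at FP16
# headroom_gb 70 16 160   # same model on 2x 80 GB GPUs
```

For quantized formats, pass the effective bits per weight of the format you use; that value depends on the quantization scheme and is an input you must supply.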

Power and cooling budget

GPU	TDP (W)	Typical Inference (W)	BTU/hr at TDP	Cooling (tons)
RTX 4090	450	200-280	1,535	0.13
L40S	350	150-220	1,194	0.10
A100-80GB PCIe	300	150-200	1,024	0.09
A100-80GB SXM	400	200-280	1,365	0.11
H100-80GB PCIe	350	180-250	1,194	0.10
H100-80GB SXM	700	350-500	2,388	0.20
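The BTU/hr and cooling columns follow directly from the wattage: 1 W is about 3.412 BTU/hr, and one ton of refrigeration is 12,000 BTU/hr. The following is a minimal sketch of the conversion for GPUs not listed in the table; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: convert electrical load in watts to heat load.
# btu_per_hr and cooling_tons are hypothetical helper names.
# Constants: 1 W = 3.412 BTU/hr (rounded); 1 ton of cooling = 12000 BTU/hr.

# $1: watts -> BTU/hr, rounded to a whole number
btu_per_hr() {
  awk -v w="$1" 'BEGIN { printf "%.0f\n", w * 3.412 }'
}

# $1: watts -> tons of refrigeration, two decimal places
cooling_tons() {
  awk -v w="$1" 'BEGIN { printf "%.2f\n", w * 3.412 / 12000 }'
}

# Usage:
# btu_per_hr 450      # RTX 4090 at TDP
# cooling_tons 450
```

To size cooling for a whole node, pass the total node wattage rather than the per-GPU figure, since CPUs, fans, and power supplies add to the heat load.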

Capacity planning for GPU infrastructure requires thinking in three dimensions simultaneously: VRAM (does the model fit?), compute (is throughput sufficient?), and power (can the facility handle the load?). Most teams only think about the first two and discover the third when their circuit breaker trips. A single 8-GPU HGX H100 node draws 10+ kW under full load. The GPUs alone account for 5.6 kW at TDP (8 × 700 W), which nearly fills the 5.76 kW continuous rating of a 30 A, 240 V circuit (80% of 7.2 kW), so the full node needs more than one such circuit. If you are building a GPU cluster, start with the power budget, work backward to how many GPUs fit, then check if that's enough compute. The constraint is almost always power and cooling, not budget for the GPUs themselves.
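Working backward from the power budget can be sketched as a single division. The example below assumes the common practice of loading a circuit to 80% of its breaker rating for continuous loads and counts GPU wattage only; gpus_per_circuit is a hypothetical helper name, and the derating factor is a parameter you should set to match your electrical code.

```shell
#!/usr/bin/env bash
# Sketch: how many GPUs fit on one circuit, counting GPU wattage only.
# gpus_per_circuit is a hypothetical helper name.
#   $1: breaker amps, $2: volts, $3: watts per GPU
#   $4: continuous-load derating factor (assumed default: 0.8)
gpus_per_circuit() {
  awk -v a="$1" -v v="$2" -v g="$3" -v d="${4:-0.8}" 'BEGIN {
    usable = a * v * d          # continuous watts available on the circuit
    printf "%d\n", int(usable / g)
  }'
}

# Usage:
# gpus_per_circuit 30 240 700   # H100 SXM at TDP on a 30 A, 240 V circuit
```

The result is an upper bound: CPUs, memory, fans, NICs, and power-supply losses share the same circuit, so the real GPU count per circuit is lower.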

NVIDIA Tutorial — quick-start NVIDIA setup on kldload
KVM Virtual Machines — VM creation and management with ZFS zvols
Docker & Podman on ZFS — container storage with ZFS datasets
Kubernetes on KVM — building a K8s cluster with GPU worker nodes
systemd Masterclass — unit files for GPU persistence and monitoring services
ZFS Masterclass — deep dive on recordsize, compression, and storage tuning
eBPF Masterclass — tracing GPU driver and PCIe activity
Observability Masterclass — Prometheus, Grafana, and alerting stack
Security Hardening Masterclass — SELinux policy and device security
Keycloak & SELinux Masterclass — identity and access control for GPU resources
Build AI — Getting Started — AI workload setup on kldload
AI for Kubernetes — GPU scheduling in Kubernetes clusters
