The Full Stack
What does a kldload platform look like when you go all-out? Every layer filled in, every connection encrypted, every packet observed, every secret managed, every disk checksummed, every service authenticated, every deployment reversible. This is that document — the reference architecture for a fully deployed kldload stack, from bare metal to production workloads.
This is not a tutorial. This page is a map. It shows every technology in the stack, why it is there, what it connects to, and what it replaces. Each component links to its dedicated masterclass for the deep dive. Read this page first to understand the full picture, then drill into the individual masterclasses for implementation.
The premise: Most platforms are assembled from disconnected decisions — a firewall here, a container runtime there, certificates from somewhere, monitoring bolted on later. A kldload full stack is different. Every layer is chosen to reinforce every other layer. ZFS protects storage. WireGuard protects the network. eBPF observes the kernel. Cilium enforces policy. Keycloak authenticates users. Vault manages secrets. Sanoid snapshots everything. There are no gaps, no duct tape, and no vendor lock-in. You own every layer.
Prerequisites: none. This page is the starting point. Follow the links to go deeper.
1. The Map — Every Layer at a Glance
A fully deployed kldload platform has seven layers. Each layer depends on the ones below it and enables the ones above it. From bottom to top:
┌─────────────────────────────────────────────────────────────────┐
│ WORKLOADS │
│ Kubernetes pods, databases, application containers, AI/LLM │
├─────────────────────────────────────────────────────────────────┤
│ ORCHESTRATION │
│ Kubernetes (Cilium CNI, CoreDNS), blue/green deploys, Packer │
├─────────────────────────────────────────────────────────────────┤
│ OBSERVABILITY │
│ eBPF tracing, Prometheus, Grafana, Loki, Alertmanager │
├─────────────────────────────────────────────────────────────────┤
│ SECURITY & IDENTITY │
│ Keycloak SSO, Vault secrets, step-ca PKI, SELinux, nftables │
├─────────────────────────────────────────────────────────────────┤
│ NETWORKING │
│ WireGuard backplane, BIRD BGP, VXLAN/EVPN, DNS, IPsec, HAProxy │
├─────────────────────────────────────────────────────────────────┤
│ COMPUTE │
│ KVM hypervisor, libvirt, QEMU, NVIDIA GPU passthrough │
├─────────────────────────────────────────────────────────────────┤
│ STORAGE & OS │
│ ZFS on root, sanoid snapshots, zfs-send replication, systemd │
└─────────────────────────────────────────────────────────────────┘
Bare metal — kldload ISO — one install — all of this
2. Layer 1 — Storage & Operating System
Everything starts with the disk. Every byte on this platform lives on ZFS — the operating system, the virtual machines, the databases, the container images, the logs, the backups. ZFS is the foundation because it is the only filesystem that provides atomic snapshots, built-in replication, transparent compression, per-block checksumming, and native encryption in a single coherent package.
STORAGE ZFS on Root
Every kldload node boots from a ZFS pool. The root filesystem, /home, /var, swap — all ZFS datasets with independent snapshot, compression, and quota policies. Boot environments let you snapshot before upgrades and roll back in seconds if anything breaks. There is no ext4, no XFS, no LVM. One filesystem, one tool, one set of commands.
STORAGE Pool Design
Boot pool: mirror vdev across two NVMe drives. Data pool: RAIDZ2 or dRAID across the remaining drives. SLOG: Optane or high-endurance NVMe for synchronous write acceleration (databases, NFS). L2ARC: fast SSD as a read cache for datasets that exceed ARC. Every pool has ashift=12, compression=zstd, atime=off, xattr=sa.
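As a rough illustration of the data-pool layout above, the sketch below assembles the matching `zpool create` command line. The pool name `tank` and the short device names are hypothetical placeholders, and the helper only builds the argument list; it does not run anything.

```python
import shlex

# Settings quoted in the text above.
POOL_PROPS = {"ashift": "12"}  # pool property, passed with -o
FS_PROPS = {"compression": "zstd", "atime": "off", "xattr": "sa"}  # dataset properties, passed with -O


def zpool_create_cmd(pool, vdev_type, disks, slog=None, l2arc=None):
    """Return the argv list for a `zpool create` call; nothing is executed."""
    cmd = ["zpool", "create"]
    for key, value in POOL_PROPS.items():
        cmd += ["-o", f"{key}={value}"]
    for key, value in FS_PROPS.items():
        cmd += ["-O", f"{key}={value}"]
    cmd += [pool, vdev_type, *disks]
    if slog:
        cmd += ["log", slog]      # separate intent log device (SLOG)
    if l2arc:
        cmd += ["cache", l2arc]   # L2ARC read cache device
    return cmd


# Hypothetical device names; real hardware should use /dev/disk/by-id paths.
data_pool = zpool_create_cmd(
    "tank", "raidz2", ["sda", "sdb", "sdc", "sdd"],
    slog="nvme2n1", l2arc="nvme3n1",
)
print(shlex.join(data_pool))
```

Keeping the properties in one place means every pool on every node is created with the same defaults, which is the point of the layout rule.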
STORAGE Sanoid — Automated Snapshots
Sanoid runs on every node. It takes hourly, daily, and monthly snapshots of every dataset according to retention policy. Syncoid replicates snapshots to a remote node for disaster recovery. zfs send is incremental — only changed blocks cross the wire. A full DR replica of a 10 TB pool adds minutes of transfer per day, not hours.
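The "minutes, not hours" figure follows from simple arithmetic once a change rate is assumed. The sketch below assumes 1% of the pool changes per day and a dedicated 1 or 10 Gbit/s link; both numbers are illustrative assumptions, and compression, protocol overhead, and tunnel overhead are ignored.

```python
def incremental_transfer_seconds(pool_bytes, daily_change_fraction, link_bits_per_second):
    """Seconds needed to ship one day's changed blocks over the link."""
    changed_bits = pool_bytes * daily_change_fraction * 8
    return changed_bits / link_bits_per_second


TB = 10**12  # decimal terabyte

for label, link in (("1 Gbit/s", 10**9), ("10 Gbit/s", 10**10)):
    seconds = incremental_transfer_seconds(10 * TB, 0.01, link)
    print(f"{label}: {seconds / 60:.1f} minutes per day")
```

Under these assumptions the daily increment is about 100 GB, which takes roughly 13 minutes at 1 Gbit/s and a little over a minute at 10 Gbit/s. A workload with a much higher change rate shifts the result proportionally.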
STORAGE ZFS Encryption
Datasets containing secrets, user data, or compliance-scoped information use native ZFS encryption (encryption=aes-256-gcm, keyformat=passphrase or keyformat=raw with a key in Vault). Encryption happens below the snapshot layer — snapshots and replication work identically whether the dataset is encrypted or not. Keys are loaded at boot from Vault or a local keyfile.
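For the `keyformat=raw` case, a minimal sketch of key generation and dataset creation might look like this. The dataset name and key path are hypothetical, and how the key travels between Vault and the node is outside the sketch; the helper only builds the command.

```python
import secrets

RAW_KEY_BYTES = 32  # ZFS raw keys must be exactly 32 bytes


def new_raw_key():
    """Generate a random key suitable for keyformat=raw."""
    return secrets.token_bytes(RAW_KEY_BYTES)


def encrypted_dataset_cmd(dataset, key_path):
    """Return the argv list for creating an encrypted dataset; nothing is executed."""
    return [
        "zfs", "create",
        "-o", "encryption=aes-256-gcm",
        "-o", "keyformat=raw",
        "-o", f"keylocation=file://{key_path}",
        dataset,
    ]


key = new_raw_key()
# Hypothetical dataset and key path, for illustration only.
cmd = encrypted_dataset_cmd("tank/secure", "/run/keys/tank-secure.key")
print(len(key), " ".join(cmd))
```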
STORAGE systemd
Every service, timer, and mount on the platform is managed by systemd. No cron jobs, no init scripts, no screen sessions. Service dependencies are explicit. Restart policies are defined. Resource limits (cgroups) are set per unit. Journal logs are structured and queryable. This is not optional — systemd is the control plane for the operating system, and treating it as such is what makes the platform manageable.
Masterclass deep dives: ZFS · systemd
3. Layer 2 — Compute
Bare metal runs the hypervisor. KVM is built into the Linux kernel — every kldload node is a hypervisor by default. Virtual machines run on ZFS zvols. GPU passthrough is native. No VMware, no Proxmox required (though Proxmox is supported). The hypervisor is the OS.
COMPUTE KVM & libvirt
KVM provides hardware-accelerated virtualisation. libvirt manages VM lifecycle. QEMU handles device emulation. VMs are defined as XML, stored on ZFS, snapshotted atomically, and cloned instantly with zfs clone. A golden image workflow — install once, seal for cloning, deploy hundreds — reduces provisioning to seconds per VM.
COMPUTE Golden Images & Packer
Packer builds machine images from code. A single Packer template produces identical images for KVM (qcow2), Proxmox (template), AWS (AMI), and bare metal. cloud-init handles first-boot personalisation: hostname, SSH keys, network config, ZFS pool import. Every node in the fleet boots from the same image. Drift is handled by replacement, not repair: the image is the source of truth, and a node that diverges from it is redeployed.
COMPUTE GPU Passthrough
NVIDIA GPUs are passed through to VMs or containers via VFIO (for full passthrough) or vGPU (for sharing). AI inference workloads — Ollama, vLLM, text-generation-inference — run in containers with --gpus all on ZFS-backed storage. The NVIDIA driver, CUDA toolkit, and container toolkit are pre-configured in the desktop profile. DKMS rebuilds the driver on kernel updates, just like ZFS.
COMPUTE Containers — Podman & Firecracker
Podman is the container runtime. It is daemonless, rootless-capable, and uses the ZFS storage driver for copy-on-write image layers. For microVM isolation, Firecracker provides VM-level isolation with startup times close to a container's. Containers run directly on the host or inside KVM VMs; the architecture supports both. SELinux MCS labels isolate containers at the kernel level.
You can snapshot a VM with zfs snapshot, you can firewall its traffic with nftables, you can trace its I/O with eBPF. The hypervisor and the platform are the same thing. This is the fundamental architectural advantage over a traditional split between hypervisor and management plane.
Masterclass deep dives: Packer & IaC · Containers · CI/CD & GitOps · KVM Tutorial · NVIDIA Tutorial
4. Layer 3 — Networking
Networking on a full kldload stack is multi-plane. Different types of traffic travel on different encrypted planes, each with its own keys, its own routing, and its own policies. Nothing crosses planes without an explicit decision. This is the backplane architecture.
NETWORK WireGuard Backplane — Multi-Plane Mesh
Every node runs multiple WireGuard interfaces, each serving a different plane: management (SSH, Ansible, monitoring), storage (ZFS replication, NFS, iSCSI), workload (pod-to-pod, service mesh), and external (public-facing traffic, IPsec to partners). Each plane is a separate WireGuard interface with separate keys. Compromise of one plane does not expose the others.
NETWORK BIRD BGP — Dynamic Routing
BIRD runs on every node and exchanges routes via BGP. No static routes, no hardcoded IPs in config files. When a new node joins, BIRD announces its networks and every other node learns the routes automatically. When a node goes down, BGP withdraws its routes and traffic reroutes. ECMP (Equal-Cost Multi-Path) distributes traffic across multiple paths. This is how hyperscalers route — BGP on every host.
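ECMP spreads flows, not individual packets: every packet of one connection must take the same path to avoid reordering. The sketch below shows the general idea with a hash over the flow's 5-tuple. It is a conceptual model only; the Linux kernel uses its own hash function and fields, and the addresses are made up.

```python
import hashlib


def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
    """Pick one of several equal-cost next hops from a hash of the flow 5-tuple."""
    flow = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(flow).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical next hops learned via BGP for the same prefix.
paths = ["10.250.0.2", "10.250.0.3", "10.250.0.4"]

first = ecmp_next_hop("10.252.0.10", "10.252.0.20", "tcp", 40000, 5432, paths)
again = ecmp_next_hop("10.252.0.10", "10.252.0.20", "tcp", 40000, 5432, paths)
print(first, again)  # the same flow always maps to the same next hop
```

When BGP withdraws a route, the next-hop list shrinks and affected flows rehash onto the remaining paths.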
NETWORK VXLAN & EVPN — Overlay Networks
For workloads that need Layer 2 adjacency across Layer 3 boundaries — VM migration, multi-site clusters, legacy applications that assume broadcast — VXLAN encapsulates Ethernet frames in UDP. EVPN (via BIRD or FRRouting) provides control-plane MAC/IP learning so VXLAN does not flood. The overlay runs on top of the WireGuard backplane, so it is encrypted end-to-end without VXLAN knowing.
NETWORK DNS — CoreDNS + Unbound
CoreDNS serves internal zone records (forward and reverse) for every host, VM, and service on the platform, backed by a simple zone file or etcd. Unbound provides recursive resolution with DNSSEC validation for external queries. Every node's /etc/resolv.conf points to the local Unbound instance, which forwards internal queries to CoreDNS. DNS is not optional infrastructure — it is how services find each other.
NETWORK IPsec — External Connections
When you need to connect to a cloud VPN gateway (AWS, Azure, GCP), a partner's network, or a government system that requires FIPS-validated encryption, IPsec provides the tunnel. strongSwan handles IKEv2 negotiation. XFRM interfaces make IPsec tunnels route-based so they integrate with BIRD BGP. The WireGuard backplane handles internal traffic; IPsec handles the outside world.
NETWORK HAProxy & keepalived — Load Balancing
HAProxy distributes traffic across backends with health checking, TLS termination, and connection draining. keepalived provides a virtual IP (VIP) that floats between HAProxy instances — if the primary dies, the VIP moves to the standby in under a second. For Kubernetes, MetalLB or Cilium's LB IPAM announces service IPs via BGP directly.
Masterclass deep dives: WireGuard · BIRD & BGP · VXLAN & EVPN · DNS · IPsec Tunnels · Backplane Networks · Load Balancing & HA
5. Layer 4 — Security & Identity
Security is not a layer you bolt on. It is woven into every other layer. But the identity layer — who you are, what you are allowed to do, what certificates you hold, what secrets you can access — is concentrated here. Every authentication decision and every secret on the platform flows through these components.
SECURITY Keycloak — Identity & SSO
Keycloak is the single source of identity. Every user, every service account, every role assignment lives in Keycloak's realm. Grafana, Vault, Kubernetes, the kldload web UI — all authenticate via OIDC tokens issued by Keycloak. One login, one set of credentials, one MFA prompt. No per-application passwords. No shared accounts. Active Directory and LDAP federate through Keycloak, so existing corporate identity works without migration.
SECURITY step-ca — Internal PKI
step-ca is the internal Certificate Authority. It issues short-lived X.509 certificates to every service via ACME (the same protocol Let's Encrypt uses). Every internal connection — database, API, metrics scrape, gRPC — is mTLS. Certificates rotate automatically every 24 hours. The CA root key lives on ZFS-encrypted storage, backed by Vault. No self-signed certificates, no curl -k, no "we'll add TLS later."
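Short-lived certificates only work if renewal happens well before expiry. A minimal sketch of the timing logic, assuming renewal is attempted once two-thirds of the lifetime has passed (that fraction is an assumption chosen for illustration, not a documented kldload setting):

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, not_after):
    """Return the moment renewal should start: two-thirds into the validity window."""
    lifetime = not_after - not_before
    return not_before + (lifetime * 2) / 3  # integer arithmetic keeps the result exact


def should_renew(not_before, not_after, now):
    """True once the renewal point has been reached."""
    return now >= renewal_time(not_before, not_after)


issued = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
expires = issued + timedelta(hours=24)
print(renewal_time(issued, expires))  # 16 hours after issuance for a 24-hour certificate
```

With a 24-hour certificate this leaves an eight-hour window in which renewal can fail and be retried before any connection is affected.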
SECURITY Let's Encrypt — Public TLS
Public-facing services (the web UI, the API gateway, any external endpoint) get certificates from Let's Encrypt via certbot with DNS-01 challenges. Wildcard certificates cover subdomains. Renewal is automatic via systemd timer. Internal services use step-ca. Public services use Let's Encrypt. There is no overlap and no gap.
SECURITY HashiCorp Vault — Secrets Management
Vault stores every secret on the platform: database credentials, API keys, Keycloak client secrets, ZFS encryption keys, TLS private keys, WireGuard private keys. Applications access secrets via Vault's API, authenticated by their Keycloak OIDC token or Kubernetes service account. Secrets are never in config files, environment variables, or git. Vault's storage backend is a ZFS-backed Raft cluster — snapshottable, replicable, encrypted.
SECURITY SELinux — Mandatory Access Control
Every node runs SELinux in enforcing mode. Every confined service (httpd, sshd, named, Java, containers) operates within its labelled domain and cannot escape it, even as root. Custom policy modules cover kldload-specific services. MCS categories isolate containers at the kernel level. semanage export captures all customisations for reproducible deployment.
SECURITY nftables — Firewall
nftables provides stateful packet filtering on every node. The base policy is default-deny: only explicitly allowed traffic passes. Each WireGuard interface has its own nftables chain scoped to the plane's allowed services. Management plane allows SSH and Prometheus. Workload plane allows pod traffic. Storage plane allows ZFS send and NFS. No cross-plane traffic unless a rule says so.
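The per-plane chain layout can be sketched as a small generator that renders nftables ruleset text from a table of allowed ports. The port choices (22 for SSH and ZFS send over SSH, 9100 for a metrics exporter, 2049 for NFS) and the chain names are assumptions for illustration. A real ruleset also has to admit the WireGuard UDP ports on the underlay and handle the workload plane.

```python
# Hypothetical per-plane TCP ports; adjust to what each plane really carries.
PLANE_TCP_PORTS = {
    "wg-mgmt": [22, 9100],     # SSH, metrics scrape
    "wg-storage": [22, 2049],  # ZFS send over SSH, NFS
}


def chain_name(iface):
    """Derive a chain name such as wg_mgmt_in from an interface name."""
    return iface.replace("-", "_") + "_in"


def render_ruleset(plane_ports):
    """Render a default-deny input policy with one chain per plane."""
    lines = ["table inet filter {"]
    # Per-plane chains are emitted first so the jumps below refer to existing chains.
    for iface, ports in plane_ports.items():
        port_set = ", ".join(str(p) for p in sorted(ports))
        lines += [
            f"    chain {chain_name(iface)} {{",
            f"        tcp dport {{ {port_set} }} accept",
            "    }",
        ]
    lines += [
        "    chain input {",
        "        type filter hook input priority 0; policy drop;",
        "        ct state established,related accept",
        '        iifname "lo" accept',
    ]
    for iface in plane_ports:
        lines.append(f'        iifname "{iface}" jump {chain_name(iface)}')
    lines += ["    }", "}"]
    return "\n".join(lines)


ruleset = render_ruleset(PLANE_TCP_PORTS)
print(ruleset)
```

Anything arriving on a plane interface that its chain does not accept falls through to the input chain's drop policy, which is the default-deny behaviour described above.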
SECURITY FIPS 140-3 Compliance
For regulated environments: RHEL's FIPS mode is enabled at the kernel level. OpenSSL and GnuTLS use only FIPS-validated algorithms. Libreswan provides FIPS-validated IPsec. Vault's seal mechanism uses FIPS-approved KMS. The entire cryptographic stack — from disk encryption to TLS to tunnel encryption — uses only algorithms with NIST validation certificates. This matters for government, finance, and healthcare.
Masterclass deep dives: Keycloak & SELinux · TLS & PKI · Vault & Secrets · Security Hardening · nftables · FIPS 140-3 Compliance
6. Layer 5 — Observability
You cannot operate what you cannot see. Observability on a full kldload stack covers three pillars — metrics, logs, and traces — plus a fourth that most platforms lack: kernel-level instrumentation via eBPF.
OBSERVE eBPF — Kernel Instrumentation
eBPF programs attach to kernel tracepoints, kprobes, and network hooks to observe syscalls, network flows, disk I/O, and scheduler events — without modifying the kernel or restarting anything. On a full stack deployment, eBPF provides: Cilium's network policy enforcement, Hubble's network flow visibility, custom latency histograms via bpftrace, and Falco's runtime security detection. eBPF is the nervous system of the platform.
OBSERVE Prometheus & Alertmanager — Metrics
Prometheus scrapes metrics from every component: node_exporter (hardware/OS), ZFS exporter (pool health, I/O), kube-state-metrics (Kubernetes objects), Keycloak metrics endpoint, Vault telemetry, WireGuard exporter, HAProxy stats, and application-level metrics. Alertmanager routes alerts to Slack, PagerDuty, or email. Alert rules cover: ZFS scrub errors, pool capacity >80%, certificate expiry <7 days, node down >5 minutes, SELinux AVC denials, WireGuard handshake failures.
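The thresholds listed above reduce to a handful of comparisons. The sketch below models them in plain Python to make the conditions explicit; the alert names and the `NodeSample` fields are hypothetical, and in a real deployment the same logic would be written as Prometheus alerting rules against the exporters' actual metric names.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    pool_used_fraction: float      # 0.0 to 1.0
    cert_days_left: float          # days until the nearest certificate expiry
    seconds_since_last_scrape: float
    scrub_errors: int


def firing_alerts(sample):
    """Return the names of the alerts whose threshold is crossed."""
    alerts = []
    if sample.scrub_errors > 0:
        alerts.append("ZFSScrubErrors")
    if sample.pool_used_fraction > 0.80:
        alerts.append("ZFSPoolCapacity")
    if sample.cert_days_left < 7:
        alerts.append("CertificateExpiringSoon")
    if sample.seconds_since_last_scrape > 5 * 60:
        alerts.append("NodeDown")
    return alerts


healthy = NodeSample(0.42, 30, 15, 0)
degraded = NodeSample(0.85, 3, 15, 0)
print(firing_alerts(healthy), firing_alerts(degraded))
```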
OBSERVE Grafana — Dashboards
Grafana provides the visual layer. Pre-built dashboards cover: ZFS pool status, WireGuard tunnel health, Kubernetes cluster overview, node resource utilisation, database query latency, Keycloak login metrics, certificate expiry timeline, and eBPF network flow maps. Grafana authenticates via Keycloak OIDC — role-based access controls who sees what. Data sources: Prometheus for metrics, Loki for logs, Tempo for traces.
OBSERVE Loki — Log Aggregation
Loki collects logs from every node and container. Promtail (or the Grafana Agent) ships systemd journal entries, container stdout, and application log files to Loki. Logs are indexed by labels (node, service, namespace) not by full text — storage-efficient and query-fast. In Grafana, you can click from a metric spike directly to the logs for that service at that time. Storage backend: ZFS dataset with compression.
OBSERVE Hubble — Network Flow Visibility
Hubble is Cilium's observability layer. It captures every network flow in the Kubernetes cluster — source pod, destination pod, protocol, port, verdict (allowed/denied), latency. Hubble UI shows a real-time service map. Hubble metrics feed into Prometheus. When a network policy blocks something it should not (or allows something it should not), Hubble shows you exactly what happened, which policy matched, and why.
Masterclass deep dives: eBPF · Cilium · Observability
7. Layer 6 — Orchestration
Orchestration is how workloads get deployed, scaled, updated, and rolled back. On a full kldload stack, this means Kubernetes for containers and blue/green deployments for everything else.
WORKLOAD Kubernetes on KVM
The Kubernetes cluster runs on KVM VMs, not on bare metal. Control plane nodes are three VMs on separate physical hosts for HA. Worker nodes are VMs cloned from a golden image. ZFS zvols back the VM disks — snapshotting an entire worker node before a Kubernetes upgrade is one command. If the upgrade fails, roll back the zvol. The cluster is disposable. The data is not.
WORKLOAD Cilium CNI — eBPF Networking
Cilium replaces kube-proxy and provides the Container Network Interface (CNI). Pod-to-pod networking uses eBPF programs attached directly to the kernel's network stack — no iptables chains, no netfilter overhead. Network policies are enforced at the eBPF level. Cilium provides: pod networking, service load balancing, network policy, transparent encryption (WireGuard or IPsec between nodes), bandwidth management, and Hubble observability.
WORKLOAD Blue/Green Deployments
Stateless services use Kubernetes rolling deployments. Stateful infrastructure (databases, message queues, the Kubernetes cluster itself) uses blue/green: deploy the new version alongside the old, verify it works, switch traffic, keep the old version for instant rollback. ZFS snapshots make blue/green trivial for VMs — the "green" environment is a clone of the "blue" snapshot. If green fails, destroy it and blue is untouched.
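The blue/green flow for a ZFS-backed VM comes down to three commands: snapshot blue, clone the snapshot as green, and destroy green if it fails. A minimal sketch that builds those command lines (the dataset names and snapshot tag are hypothetical, and nothing is executed):

```python
def green_from_blue(blue_dataset, green_dataset, tag):
    """Commands that create a green clone from a fresh snapshot of blue."""
    snapshot = f"{blue_dataset}@{tag}"
    return [
        ["zfs", "snapshot", snapshot],
        ["zfs", "clone", snapshot, green_dataset],
    ]


def discard_green(green_dataset):
    """Command that throws the green clone away; blue and its snapshot are untouched."""
    return [["zfs", "destroy", green_dataset]]


# Hypothetical zvol names for a Kubernetes worker VM.
for argv in green_from_blue("tank/vm/worker-blue", "tank/vm/worker-green", "pre-upgrade"):
    print(" ".join(argv))
for argv in discard_green("tank/vm/worker-green"):
    print(" ".join(argv))
```

Because the clone shares blocks with the snapshot, creating green costs almost no space or time until the two environments diverge.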
WORKLOAD GitOps & Packer Pipeline
Infrastructure changes flow through git. Packer builds golden images from committed code. Kubernetes manifests are applied via Flux or ArgoCD watching a git repo. Terraform or OpenTofu manages VM provisioning. Nothing is configured by hand. If a node drifts, destroy it and redeploy from the golden image. The git repo is the source of truth. The running infrastructure is a projection of it.
Masterclass deep dives: Kubernetes · Cilium · Blue/Green & SRE · Packer & IaC · Containers · CI/CD & GitOps · Construction Kit
8. Layer 7 — Workloads
This is what everything exists to serve. The workloads are the applications, databases, APIs, and services that deliver value. Everything below this point is infrastructure. The infrastructure's job is to make workloads reliable, secure, observable, and deployable.
WORKLOAD Databases on ZFS
PostgreSQL, MySQL, Redis, and etcd run on dedicated ZFS datasets with tuned recordsize (8K for PostgreSQL, 16K for MySQL InnoDB), synchronous writes to SLOG, and hourly snapshots via Sanoid. Recovery to a snapshot is fast: zfs rollback returns the dataset to its most recent snapshot, and zfs rollback -r reaches an earlier one at the cost of destroying the snapshots taken after it. Replication uses zfs send to a standby node. Client connections use mTLS certificates from step-ca. Credentials live in Vault.
WORKLOAD AI & LLM Inference
Ollama or vLLM serves language models from NVIDIA GPUs passed through to KVM VMs or exposed to containers via the NVIDIA container toolkit. Model weights are stored on ZFS datasets with recordsize=1M and compression=off (already compressed). Inference APIs authenticate via Keycloak OIDC tokens. GPU utilisation metrics flow to Prometheus. This is the same infrastructure as everything else — no special snowflake.
WORKLOAD Application Containers
Stateless microservices run in Kubernetes pods. They pull configuration from Vault, authenticate users via Keycloak, serve traffic behind HAProxy or Cilium's load balancer, emit metrics to Prometheus, send logs to Loki, and store persistent data on ZFS-backed PersistentVolumes. Network policies (Cilium) restrict which pods can talk to which. Every container runs under SELinux with a unique MCS label.
WORKLOAD NFS & iSCSI Shared Storage
For workloads that need shared filesystem access (legacy applications, some AI training frameworks), NFS is served from a ZFS dataset with NFS kernel server. iSCSI provides block-level access for VMs that need raw devices. Both run over the storage plane WireGuard interface, encrypted in transit. ZFS quotas, reservations, and snapshots apply.
Masterclass deep dives: Databases on ZFS · Load Balancing & HA · Operations Guide Upgrades & Boot Environments · Labeling & Assets
9. How a Request Flows Through the Stack
To make the architecture concrete, here is what happens when an external user hits an API endpoint on a fully deployed kldload platform. Every layer participates.
User's browser
│
│ HTTPS (Let's Encrypt certificate)
▼
HAProxy (TLS termination, keepalived VIP)
│
│ nftables: allow inbound 443, rate limit
│ eBPF: Cilium captures flow metadata for Hubble
▼
Kubernetes Ingress (Cilium)
│
│ Cilium network policy: only allow traffic to this namespace
│ Encryption: Cilium encrypts pod-to-pod traffic with WireGuard
▼
Application Pod
│
│ Keycloak OIDC: validates JWT access token (local signature check)
│ Token contains: user identity, roles, client scope
│ SELinux: pod runs as container_t:s0:c123,c456
▼
Application queries database
│
│ Vault: application fetched DB credentials at startup (dynamic secret, 1hr TTL)
│ step-ca: mTLS client certificate authenticates to PostgreSQL
│ WireGuard: DB connection travels on storage plane, not workload plane
▼
PostgreSQL on ZFS
│
│ ZFS: recordsize=8K, SLOG for synchronous writes
│ Sanoid: hourly snapshots, 30-day retention
│ SELinux: postgresql_t domain, cannot access /home, /tmp, or other services
▼
Response travels back up the same path
│
│ eBPF: latency histogram recorded by bpftrace
│ Prometheus: request count and duration metric incremented
│ Loki: structured log entry with request ID, user, latency
│ Hubble: full network flow recorded (src pod → dst pod, port, verdict)
▼
User sees the response
10. Disaster Recovery & Backup
On a full kldload stack, disaster recovery is not a separate system. It is a property of the architecture.
ZFS Snapshots — Point-in-Time Recovery
Sanoid takes hourly snapshots of every dataset. Accidentally delete a file? Copy it back from .zfs/snapshot/, or zfs rollback the whole dataset. Corrupt a database? Roll back the dataset to before the corruption. Need the state from three weeks ago? The snapshot is there. Cost: near-zero (copy-on-write, only changed blocks consume space).
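Choosing the recovery point is a matter of finding the newest snapshot taken before the incident. The sketch below assumes Sanoid-style snapshot names of the form `autosnap_YYYY-MM-DD_HH:MM:SS_hourly`; the dataset name and timestamps are made up for illustration.

```python
from datetime import datetime


def snapshot_time(name):
    """Parse the timestamp out of a name like tank/db@autosnap_2024-05-01_13:00:01_hourly."""
    stamp = name.split("@autosnap_")[1].rsplit("_", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%d_%H:%M:%S")


def last_good_snapshot(names, incident):
    """Return the newest snapshot taken strictly before the incident, or None."""
    candidates = [n for n in names if snapshot_time(n) < incident]
    return max(candidates, key=snapshot_time) if candidates else None


snapshots = [
    "tank/db@autosnap_2024-05-01_11:00:01_hourly",
    "tank/db@autosnap_2024-05-01_12:00:01_hourly",
    "tank/db@autosnap_2024-05-01_13:00:01_hourly",
]
corrupted_at = datetime(2024, 5, 1, 12, 40)
print(last_good_snapshot(snapshots, corrupted_at))
```

With hourly snapshots the chosen recovery point is never more than an hour older than the incident, which is where the sub-hour RPO comes from.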
Syncoid Replication — Off-Site Backup
Syncoid sends incremental snapshots to a remote node via zfs send | zfs receive over the storage WireGuard plane. The remote node has a complete, consistent, up-to-date copy of every dataset. If the primary site burns down, the remote site has everything up to the last sync (typically 15–60 minutes).
Boot Environments — Safe Upgrades
Before any OS upgrade, a boot environment snapshot is created. If the upgrade breaks boot, select the previous boot environment from the bootloader and the system comes up exactly as it was. ZFS makes this atomic — the rollback is a metadata operation, not a file copy.
Golden Image Rebuilds — Immutable Infrastructure
If a node is compromised or corrupt beyond repair, do not fix it. Destroy it and redeploy from the golden image. Packer builds the image, cloud-init personalises it, ZFS receive restores the data from the replica. A complete node rebuild takes minutes, not hours. The infrastructure is cattle, not pets.
RPO and RTO
| Scenario | RPO (data loss) | RTO (downtime) | Mechanism |
|---|---|---|---|
| Accidental file deletion | <1 hour | Seconds | zfs rollback or browse .zfs/snapshot/ |
| Database corruption | <1 hour | Minutes | zfs rollback dataset to pre-corruption snapshot |
| Bad OS upgrade | Zero | 1 reboot | Boot environment rollback |
| Node hardware failure | <1 hour | Minutes | Redeploy golden image + zfs receive from replica |
| Full site loss | <1 hour | Hours | DR site has full ZFS replica. Promote and repoint DNS. |
| Ransomware / compromise | <1 hour | Hours | Destroy all nodes. Redeploy from golden images. Restore from ZFS replicas (read-only, attacker cannot encrypt them). |
11. The Numbers — What This Looks Like in Practice
Reference deployment: 6-node cluster
| Node | Role | Hardware | Runs |
|---|---|---|---|
| infra-1 | Infrastructure | 64 GB RAM, 2x NVMe (mirror), 4x SSD (RAIDZ2) | Keycloak, Vault, step-ca, CoreDNS, HAProxy (primary) |
| infra-2 | Infrastructure (HA) | 64 GB RAM, 2x NVMe (mirror), 4x SSD (RAIDZ2) | Keycloak replica, Vault (standby), CoreDNS, HAProxy (standby) |
| compute-1 | Hypervisor | 256 GB RAM, 2x NVMe (mirror), 8x SSD (dRAID), NVIDIA A4000 | KVM VMs: K8s control plane, workers, GPU workloads |
| compute-2 | Hypervisor | 256 GB RAM, 2x NVMe (mirror), 8x SSD (dRAID), NVIDIA A4000 | KVM VMs: K8s workers, databases, application VMs |
| observe-1 | Observability | 128 GB RAM, 2x NVMe (mirror), 6x HDD (RAIDZ2) | Prometheus, Grafana, Loki, Alertmanager |
| dr-1 | Disaster recovery | 64 GB RAM, 2x NVMe (mirror), 8x HDD (RAIDZ2) | ZFS replicas (syncoid target), cold standby for all services |
Network plane layout
┌─────────────────────────────────────────────────────────────────┐
│ MANAGEMENT PLANE (wg-mgmt) 10.250.0.0/24 │
│ SSH, Ansible, Prometheus scrapes, Grafana │
│ Keys: per-node Curve25519 keypair │
├─────────────────────────────────────────────────────────────────┤
│ STORAGE PLANE (wg-storage) 10.251.0.0/24 │
│ ZFS send/receive, NFS, iSCSI, database replication │
│ Keys: per-node Curve25519 keypair (different from mgmt) │
├─────────────────────────────────────────────────────────────────┤
│ WORKLOAD PLANE (wg-workload) 10.252.0.0/24 │
│ Pod-to-pod (Cilium), service traffic, API calls │
│ Keys: per-node Curve25519 keypair (different from both above) │
├─────────────────────────────────────────────────────────────────┤
│ EXTERNAL PLANE (eth0 / ipsec0) │
│ Public-facing services, IPsec tunnels to partners/cloud │
│ nftables: strict ingress filtering, DDoS mitigation │
└─────────────────────────────────────────────────────────────────┘
BIRD BGP runs on ALL planes, exchanging routes per-plane.
Each plane is a full mesh of WireGuard tunnels.
No traffic crosses planes without an explicit nftables FORWARD rule.
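To make the plane layout concrete, the sketch below derives a per-node address on each WireGuard plane from the subnets above and counts the tunnels a full mesh needs. The node-to-address assignment (first host address to the first node, and so on) is an assumption for illustration.

```python
import ipaddress
from itertools import combinations

PLANES = {
    "wg-mgmt": "10.250.0.0/24",
    "wg-storage": "10.251.0.0/24",
    "wg-workload": "10.252.0.0/24",
}
NODES = ["infra-1", "infra-2", "compute-1", "compute-2", "observe-1", "dr-1"]


def address_plan(planes, nodes):
    """Assign each node the next free host address on every plane."""
    plan = {}
    for iface, cidr in planes.items():
        hosts = iter(ipaddress.ip_network(cidr).hosts())
        plan[iface] = {node: str(next(hosts)) for node in nodes}
    return plan


def mesh_tunnels(nodes):
    """Every unordered pair of nodes is one tunnel in a full mesh."""
    return list(combinations(nodes, 2))


plan = address_plan(PLANES, NODES)
print(plan["wg-storage"]["dr-1"], len(mesh_tunnels(NODES)))
```

For the six-node reference deployment that is 15 tunnels per plane, each with its own keypairs, which is why the configuration is better generated than written by hand.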
Certificate hierarchy
Let's Encrypt (public)
├── *.example.com (wildcard, 90-day, auto-renewed)
│ ├── api.example.com (HAProxy TLS termination)
│ ├── grafana.example.com (Grafana)
│ └── auth.example.com (Keycloak)
step-ca (internal)
├── Root CA (10-year, offline, ZFS-encrypted dataset in Vault)
│ └── Intermediate CA (3-year, online, step-ca server on infra-1)
│ ├── *.internal (24-hour leaf certs, ACME auto-renewed)
│ │ ├── postgres.internal (database mTLS)
│ │ ├── vault.internal (Vault API mTLS)
│ │ ├── k8s-api.internal (Kubernetes API server)
│ │ └── prometheus.internal (metrics scrape mTLS)
│ └── client certificates (service-to-service mTLS)
strongSwan IPsec CA
├── IPsec CA (intermediate issued by step-ca)
│ ├── gateway-a.example.com (site-to-site tunnel certs)
│ └── gateway-b.example.com
Secrets management map
Vault (infra-1, HA with infra-2)
├── secret/keycloak/ — admin password, DB credentials, client secrets
├── secret/postgres/ — superuser password, replication credentials
├── secret/grafana/ — Keycloak OIDC client secret
├── secret/wireguard/ — private keys for all nodes and planes
├── secret/step-ca/ — intermediate CA private key
├── secret/zfs/ — encryption passphrases per dataset
├── secret/ipsec/ — PSKs or certificate private keys
├── pki/internal/ — Vault PKI engine (alternative to step-ca)
├── database/postgres/ — dynamic credentials (Vault generates per-app creds)
└── transit/ — encryption-as-a-service (envelope encryption for apps)
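Reading one of the static secrets in the map is a single authenticated HTTP call. The sketch below assumes the `secret/` mount is a KV version 2 engine (so the API path contains `data/`), that Vault listens on its default port 8200, and that the client already holds a token. It builds the request and parses a response body without contacting any server.

```python
import json
import urllib.request


def kv2_read_request(vault_addr, mount, path, token):
    """Build a GET request for a KV v2 secret; the caller decides when to send it."""
    url = f"{vault_addr.rstrip('/')}/v1/{mount}/data/{path.strip('/')}"
    return urllib.request.Request(url, headers={"X-Vault-Token": token}, method="GET")


def extract_secret(response_body):
    """KV v2 wraps the key/value pairs in data.data alongside version metadata."""
    return json.loads(response_body)["data"]["data"]


# Placeholder token and a hand-written sample body, for illustration only.
request = kv2_read_request("https://vault.internal:8200", "secret", "postgres", "example-token")
sample_body = '{"data": {"data": {"replication_user": "repl"}, "metadata": {"version": 3}}}'
print(request.full_url, extract_secret(sample_body))
```

Sending the request with `urllib.request.urlopen(request)` would additionally need the step-ca root in the TLS trust store, since `vault.internal` presents an internal certificate.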
12. Why Each Technology — The Decision Table
| Component | What it Does | Why This One | What it Replaces |
|---|---|---|---|
| ZFS | Filesystem + volume manager | Checksums, snapshots, replication, encryption, compression — one tool | ext4 + LVM + mdadm + rsync + LUKS (5 tools for one job) |
| WireGuard | Encrypted tunnels | 4K lines of code, in-kernel, Curve25519, no configuration complexity | OpenVPN, IPsec for internal traffic |
| BIRD | BGP routing daemon | Lightweight, config-file driven, BGP on every host like hyperscalers | Static routes, OSPF, proprietary routing |
| Cilium | Kubernetes CNI + network policy + LB | eBPF-native, replaces kube-proxy + iptables, Hubble observability | Calico, Flannel, kube-proxy |
| eBPF | Kernel instrumentation | Programmable kernel observation without patching or modules | strace, dtrace, kernel modules, tcpdump |
| Keycloak | Identity & SSO | Open source, OIDC + SAML, federation, MFA, full-featured | Okta, Auth0, Dex, per-app auth |
| Vault | Secrets management | Dynamic secrets, PKI engine, transit encryption, audit log | Ansible Vault, .env files, hardcoded credentials |
| step-ca | Internal Certificate Authority | ACME protocol, short-lived certs, auto-renewal, lightweight | Self-signed certs, manual OpenSSL, CFSSL |
| SELinux | Mandatory access control | Kernel-enforced, survives root compromise, MCS for containers | AppArmor, nothing (most skip MAC entirely) |
| nftables | Firewall | Successor to iptables, atomic ruleset loads, sets/maps, faster | iptables, firewalld |
| strongSwan | IPsec VPN | IKEv2, certificate auth, XFRM interfaces, interoperable | OpenVPN for external connections |
| Prometheus | Metrics collection | Pull-based, PromQL, de facto standard, massive exporter ecosystem | Nagios, Zabbix, Datadog |
| Grafana | Dashboards & visualisation | Multi-datasource, Keycloak SSO, alerting, open source | Kibana, proprietary dashboards |
| Sanoid | Snapshot policy + replication | Purpose-built for ZFS, policy-driven, syncoid for send/receive | cron + zfs snapshot scripts, Bacula, Borg |
| Packer | Machine image builds | Multi-platform (KVM, cloud, bare metal), code-defined images | Manual installs, custom scripts, Kickstart alone |
13. How to Get There — Build Order
You do not deploy all of this at once. The stack builds bottom-up, layer by layer. Each step is usable on its own. You can stop at any layer and have a functional platform.
Phase 1 — Foundation (Day 1)
├── Install kldload (ZFS on root, WireGuard, eBPF, nftables, SELinux)
├── Configure ZFS pool layout and Sanoid snapshot policies
├── Set up WireGuard management plane between nodes
├── Enable SELinux enforcing, configure booleans
└── Result: encrypted, snapshotted, firewalled bare metal
Phase 2 — Networking (Day 2-3)
├── Deploy BIRD BGP on all nodes
├── Add storage and workload WireGuard planes
├── Configure CoreDNS for internal name resolution
├── Set up nftables per-plane policies
└── Result: multi-plane routed network, no static routes
Phase 3 — Security (Day 4-5)
├── Deploy step-ca (internal PKI)
├── Deploy Keycloak (SSO)
├── Deploy Vault (secrets management)
├── Configure mTLS between all services
├── Move all secrets to Vault
└── Result: every connection authenticated and encrypted
Phase 4 — Compute (Day 6-7)
├── Build golden images with Packer
├── Deploy KVM VMs for Kubernetes control plane and workers
├── Install Kubernetes with Cilium CNI
├── Configure Kubernetes OIDC with Keycloak
└── Result: container orchestration on encrypted, snapshotted VMs
Phase 5 — Observability (Day 8-9)
├── Deploy Prometheus + Alertmanager
├── Deploy Grafana (Keycloak SSO)
├── Deploy Loki for log aggregation
├── Configure eBPF tracing and Hubble
├── Build dashboards, set up alert rules
└── Result: full visibility into every layer
Phase 6 — Workloads (Day 10+)
├── Deploy databases on tuned ZFS datasets
├── Deploy application containers
├── Configure load balancing (HAProxy or Cilium LB)
├── Set up blue/green deployment pipeline
├── Configure syncoid replication to DR node
└── Result: production workloads with full DR
14. The Point
A fully deployed kldload platform is not a product you buy. It is a platform you build. Every component is open source. Every configuration is a text file in a git repo. Every node boots from the same ISO. You understand every layer because you built every layer.
The result is infrastructure that is encrypted at every layer (ZFS, WireGuard, mTLS, IPsec), authenticated everywhere (Keycloak SSO, Vault dynamic secrets, certificate-based mTLS), observable to the kernel (eBPF, Prometheus, Grafana, Hubble), recoverable to any point in time (ZFS snapshots, syncoid replication, boot environments), auditable (SELinux, Vault audit log, Keycloak login events, structured logging), and entirely yours.
No vendor lock-in. No licence fees. No phone-home telemetry. No cloud dependency. It runs in your rack, on your hardware, under your control. That is the full stack.
Related pages
- First-Class Infrastructure — the philosophy behind the stack
- ZFS Masterclass — Layer 1 deep dive
- WireGuard Masterclass — backplane encryption
- BIRD & BGP Masterclass — dynamic routing
- Keycloak & SELinux Masterclass — identity and access control
- TLS & PKI Masterclass — certificate infrastructure
- Vault & Secrets Masterclass — secrets management
- eBPF Masterclass — kernel observability
- Cilium Masterclass — Kubernetes networking
- Kubernetes Masterclass — container orchestration
- IPsec Tunnels Masterclass — external connectivity
- Operations Guide Upgrades & Boot Environments — day-2 operations
- Build Your Own — getting started with custom deployments