Production Cloud — Build Your Own AWS

AWS has 200+ services. You need about 12 of them. The rest exist to lock you in, bill you for breathing, and make your infrastructure so entangled with proprietary APIs that leaving feels like open-heart surgery. Every single critical service AWS provides has an open-source equivalent that runs on bare metal. VPC? That's VXLAN + Open vSwitch. Route 53? PowerDNS. ELB? HAProxy. EC2? KVM. S3? MinIO. The difference is: when you build it yourself, there's no egress fee, no surprise bill, and no vendor who can raise prices 30% because they feel like it.

This recipe is the advanced tier. Build Your Own Cloud got your services running. Multi-Site Cloud replicated them across regions. This page turns that into a production-grade cloud platform with enterprise networking, dynamic routing, overlay networks, multi-tenant isolation, load balancing, and an API-driven control plane. The full stack. No training wheels.

This is the capstone recipe. It ties together every masterclass on the site: ZFS for storage, WireGuard for encrypted transport, BGP for dynamic routing, VXLAN/EVPN for overlay networking, Cilium for K8s networking, eBPF for observability, nftables for firewalling, DNS for service discovery, systemd for service management, Packer for image automation, and backplane networking for the invisible infrastructure underneath. If you've read the masterclasses, this page shows you what it looks like when they're all running together as one platform.

Prerequisites: You should have completed the Multi-Site Cloud recipe first. This builds on that foundation — WireGuard mesh, ZFS replication, and multi-node infrastructure are assumed to be in place.

What you're replacing

AWS Service	Open-Source Replacement	What it actually does
VPC / Subnets	VXLAN + Open vSwitch	Virtual network overlays with tenant isolation
Route Tables / Transit Gateway	FRRouting (BGP + OSPF)	Dynamic routing between sites and networks
ELB / ALB / NLB	HAProxy + keepalived	Layer 4/7 load balancing with health checks
Route 53	PowerDNS + CoreDNS	Authoritative + internal DNS with API
EC2	KVM + libvirt	Virtual machines on bare metal
ECS / Fargate	Nomad or Kubernetes	Container orchestration
S3	MinIO	S3-compatible object storage on ZFS
CloudWatch	Prometheus + Grafana + Loki	Metrics, dashboards, log aggregation
IAM	Keycloak	Identity, SSO, RBAC, OIDC
ACM (certificates)	step-ca + ACME	Internal PKI and automatic cert issuance
CloudFormation	Terraform + Ansible	Infrastructure as code
API Gateway	Kong or Traefik	API routing, rate limiting, auth

The equivalent cloud bill for this stack across 3 regions is significant. On bare metal, the same capabilities cost a fraction. And you own it.

Architecture

This is no longer "three servers with WireGuard." This is a proper cloud fabric — overlay networks carrying tenant traffic, underlay networks carrying control plane traffic, dynamic routing protocols making forwarding decisions, and a load balancer tier accepting traffic from the internet. Every component you'd find in an AWS region, except it's open source and you can actually read the config files.

┌─────────────────────────────────────────────────────────────────────────────────────┐
│                        kldload Production Cloud                                   │
│                                                                                     │
│  ┌── Internet ──────────────────────────────────────────────────────────────────┐    │
│  │  Floating IP (anycast or DNS failover)                                      │    │
│  │  HAProxy (L4/L7 load balancer) + keepalived (VRRP failover)                 │    │
│  └──────────┬──────────────────────────────────────────────────────────────────┘    │
│             │                                                                       │
│  ┌── Control Plane ─────────────────────────────────────────────────────────────┐    │
│  │  FRRouting        — BGP/OSPF dynamic routing between all nodes              │    │
│  │  PowerDNS         — authoritative DNS (public zones)                         │    │
│  │  CoreDNS          — internal service discovery (*.cloud.internal)            │    │
│  │  Keycloak         — identity / SSO / RBAC                                    │    │
│  │  step-ca          — internal PKI / automatic TLS certs                       │    │
│  │  Consul           — service mesh / health checking / KV store                │    │
│  └──────────────────────────────────────────────────────────────────────────────┘    │
│                                                                                     │
│  ┌── Data Plane (VXLAN overlay) ────────────────────────────────────────────────┐    │
│  │                                                                              │    │
│  │  Open vSwitch bridges — per-tenant VXLAN segments (VNIs)                     │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │    │
│  │  │ VNI 100     │  │ VNI 200     │  │ VNI 300     │  │ VNI 400     │        │    │
│  │  │ tenant-a    │  │ tenant-b    │  │ staging     │  │ management  │        │    │
│  │  │ 10.100.0/24 │  │ 10.200.0/24 │  │ 10.30.0/24  │  │ 10.40.0/24  │        │    │
│  │  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘        │    │
│  │                                                                              │    │
│  └──────────────────────────────────────────────────────────────────────────────┘    │
│                                                                                     │
│  ┌── Compute ───────────────────────────────────────────────────────────────────┐    │
│  │  KVM/libvirt VMs   │  Nomad/K8s containers   │  Firecracker microVMs        │    │
│  └──────────────────────────────────────────────────────────────────────────────┘    │
│                                                                                     │
│  ┌── Storage ───────────────────────────────────────────────────────────────────┐    │
│  │  ZFS pools (block/file)  │  MinIO (S3 object)  │  NFS/iSCSI (shared)        │    │
│  │  Sanoid snapshots → Syncoid replication across all sites                     │    │
│  └──────────────────────────────────────────────────────────────────────────────┘    │
│                                                                                     │
│  ┌── Underlay (physical / WireGuard) ───────────────────────────────────────────┐    │
│  │  Site A (Montreal) ◄──WireGuard──► Site B (Frankfurt) ◄──WG──► Site C (Home) │    │
│  │  BGP AS 65001          BGP AS 65002           BGP AS 65003                   │    │
│  └──────────────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────────────┘

Underlay vs. overlay — the two networks

The underlay is the physical network (or WireGuard tunnels between sites). It carries control plane traffic: BGP route advertisements, ZFS replication, SSH management. The overlay is VXLAN — virtual Layer 2 networks that ride on top of the underlay. Tenant VMs and containers live on the overlay. They think they're on their own private LAN, but they're actually encapsulated in UDP packets flying between sites.

The underlay is the highway system. The overlay is the postal system — letters (packets) ride inside trucks (VXLAN tunnels) on the highway, but the sender and receiver only see addresses.

Why not just use more WireGuard tunnels?

WireGuard is perfect for site-to-site and remote access. But it's point-to-point: a full mesh of N endpoints needs N(N-1)/2 tunnels, and it doesn't do multi-tenancy, broadcast domains, or dynamic membership. VXLAN + OVS gives you thousands of isolated virtual networks, dynamic VTEP discovery via BGP EVPN, and the ability to live-migrate VMs between hosts without reconfiguring the network. It's the same overlay approach the cloud providers use under the hood, even where their encapsulation is proprietary.
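To put a number on the tunnel growth, here is a minimal sketch of the full-mesh arithmetic. The function name `mesh_tunnels` is made up for illustration, and it assumes every endpoint peers directly with every other one.

```shell
#!/bin/sh
# Number of point-to-point tunnels needed to fully mesh N endpoints.
# Each pair of endpoints needs one tunnel: N * (N - 1) / 2.
mesh_tunnels() {
    n="$1"
    echo $(( n * (n - 1) / 2 ))
}

# Compare a 3-site mesh with larger deployments.
for nodes in 3 10 30; do
    echo "$nodes endpoints -> $(mesh_tunnels "$nodes") tunnels"
done
```

Three sites are easy to mesh by hand. Thirty are not, which is the point at which an overlay with a real control plane starts to pay off.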

Step 1: FRRouting — dynamic routing with BGP and OSPF

Static routes are fine for three servers. They're a nightmare for thirty. FRRouting is the open-source routing suite (a fork of Quagga) that ISPs, data centers, and network operating systems such as Cumulus Linux and SONiC run in production. It speaks BGP, OSPF, IS-IS, BFD, EVPN: the same protocols that route the actual internet. We're going to use it for two things: OSPF for fast internal convergence within a site, and BGP for policy-based routing between sites.

OSPF vs. BGP — when to use which

OSPF (Open Shortest Path First) is a link-state protocol. Every router knows the complete topology and calculates shortest paths itself. It converges fast (sub-second with BFD) and is perfect for internal routing within a site or campus. BGP (Border Gateway Protocol) is a path-vector protocol. It's what connects autonomous systems on the internet — and it's what connects your sites. BGP gives you policy control: prefer one path over another, prepend AS paths to influence traffic, and gracefully drain a site before maintenance.

OSPF is GPS navigation within a city — it knows every street and picks the fastest route. BGP is the highway system between cities — it knows which highways exist and lets you choose based on policy (toll roads, speed, congestion).

# Install FRRouting on all nodes (CentOS/RHEL/Rocky)
dnf install -y frr frr-pythontools

# Enable the daemons we need
sed -i 's/bgpd=no/bgpd=yes/' /etc/frr/daemons
sed -i 's/ospfd=no/ospfd=yes/' /etc/frr/daemons
sed -i 's/bfdd=no/bfdd=yes/' /etc/frr/daemons
sed -i 's/zebra=no/zebra=yes/' /etc/frr/daemons

systemctl enable --now frr

OSPF — internal routing within each site

OSPF discovers neighbors automatically and builds a complete map of the internal network. When a link goes down, every router knows within milliseconds (with BFD) and recalculates paths. No manual route updates. No "oh, someone forgot to add a static route" at 3am.

# Site A — /etc/frr/frr.conf (OSPF section)
cat >> /etc/frr/frr.conf << 'FRR'
!
! ─── OSPF: internal routing ───────────────────────────
router ospf
 ospf router-id 10.10.0.1
 ! Advertise all internal networks
 network 10.10.0.0/24 area 0.0.0.0
 network 10.100.0.0/16 area 0.0.0.0
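 ! Passive by default; form adjacencies only on wg0 and br-mgmt
 passive-interface default
 no passive-interface wg0
 no passive-interface br-mgmt
!
! Tie OSPF to BFD on the inter-node link for fast failure detection
interface wg0
 ip ospf bfd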
!
! BFD — sub-second failure detection
bfd
 peer 10.10.0.2
  no shutdown
 !
 peer 10.10.0.3
  no shutdown
 !
!
FRR

BGP — inter-site routing with policy

Each site gets its own private ASN (65001, 65002, 65003). BGP peers over the WireGuard mesh. This gives you fine-grained control over which site handles which traffic, the ability to drain a site for maintenance, and automatic failover when a site goes down.

# Site A (AS 65001) — /etc/frr/frr.conf (BGP section)
cat >> /etc/frr/frr.conf << 'FRR'
!
! ─── BGP: inter-site routing ──────────────────────────
router bgp 65001
 bgp router-id 10.10.0.1
 bgp log-neighbor-changes
 bgp bestpath as-path multipath-relax
 !
 ! Neighbors — peer over WireGuard
 neighbor 10.10.0.2 remote-as 65002
 neighbor 10.10.0.2 description Site-B-Frankfurt
 neighbor 10.10.0.2 bfd
 neighbor 10.10.0.2 timers 10 30
 !
 neighbor 10.10.0.3 remote-as 65003
 neighbor 10.10.0.3 description Site-C-HomeLab
 neighbor 10.10.0.3 bfd
 neighbor 10.10.0.3 timers 10 30
 !
 ! Address family — advertise service networks
 address-family ipv4 unicast
  network 10.100.0.0/16
  network 172.20.0.0/14
  ! Prefer local exit (lower MED = preferred)
  neighbor 10.10.0.2 route-map SITE-B-OUT out
  neighbor 10.10.0.3 route-map SITE-C-OUT out
  ! Accept only internal prefixes from peers (see prefix-list INTERNAL)
  neighbor 10.10.0.2 route-map ACCEPT-INTERNAL in
  neighbor 10.10.0.3 route-map ACCEPT-INTERNAL in
 exit-address-family
 !
 ! EVPN address family: for VXLAN overlay routing
 address-family l2vpn evpn
  neighbor 10.10.0.2 activate
  neighbor 10.10.0.3 activate
  advertise-all-vni
 exit-address-family
!
! ─── Route maps ───────────────────────────────────────
route-map ACCEPT-INTERNAL permit 10
 match ip address prefix-list INTERNAL
!
route-map SITE-B-OUT permit 10
 set metric 100
!
route-map SITE-C-OUT permit 10
 set metric 200
!
! ─── Prefix lists (safety) ────────────────────────────
ip prefix-list INTERNAL seq 10 permit 10.0.0.0/8 le 24
ip prefix-list INTERNAL seq 20 permit 172.16.0.0/12 le 24
ip prefix-list INTERNAL seq 100 deny any
!
FRR

# Reload FRR without restarting
systemctl reload frr

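# Site B (AS 65002): same structure, swap ASNs and IPs
cat >> /etc/frr/frr.conf << 'FRR'
router bgp 65002
 bgp router-id 10.10.0.2
 bgp log-neighbor-changes
 ! This abbreviated example attaches no route-maps. FRR 7.4 and later will
 ! not exchange eBGP routes without policy, so either copy Site A's
 ! route-maps or relax the check:
 no bgp ebgp-requires-policy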
 neighbor 10.10.0.1 remote-as 65001
 neighbor 10.10.0.1 description Site-A-Montreal
 neighbor 10.10.0.1 bfd
 neighbor 10.10.0.3 remote-as 65003
 neighbor 10.10.0.3 description Site-C-HomeLab
 neighbor 10.10.0.3 bfd
 address-family ipv4 unicast
  network 10.200.0.0/16
  network 172.24.0.0/14
 exit-address-family
 address-family l2vpn evpn
  neighbor 10.10.0.1 activate
  neighbor 10.10.0.3 activate
  advertise-all-vni
 exit-address-family
!
FRR

# Verify BGP peering
vtysh -c "show bgp summary"
vtysh -c "show bgp ipv4 unicast"
vtysh -c "show ip ospf neighbor"
vtysh -c "show bfd peers"
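The recipe picks 65001 to 65003 because they sit in the private-use ASN space. As a small guard before templating configs for more sites, the check below is a sketch of that rule. The function name `is_private_asn` is illustrative, and the ranges are the private-use blocks reserved by RFC 6996.

```shell
#!/bin/sh
# Succeed when the ASN falls in a private-use range (RFC 6996).
is_private_asn() {
    asn="$1"
    # 16-bit private range
    if [ "$asn" -ge 64512 ] && [ "$asn" -le 65534 ]; then
        return 0
    fi
    # 32-bit private range
    if [ "$asn" -ge 4200000000 ] && [ "$asn" -le 4294967294 ]; then
        return 0
    fi
    return 1
}

# Check the three site ASNs used in this recipe.
for asn in 65001 65002 65003; do
    if is_private_asn "$asn"; then
        echo "AS $asn: private, fine for internal peering"
    else
        echo "AS $asn: not private, needs a registered allocation"
    fi
done
```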

Maintenance mode — drain a site with BGP

Need to reboot Site A for maintenance? Don't just pull the plug. Use BGP to gracefully drain traffic first. Prepend the AS path to make Site A's routes less preferred, and traffic shifts to Site B in seconds. Do your work. Remove the prepend. Traffic flows back. Zero downtime. This works for prefixes that more than one site announces; a prefix only Site A originates has no alternate path to shift to. AS-path prepending is standard practice among network operators.

It's like putting up a "lane closed ahead" sign. Traffic merges to the other lanes before you start construction, not after.

# Drain Site A before maintenance
vtysh << 'DRAIN'
configure terminal
route-map DRAIN-OUT permit 10
 set as-path prepend 65001 65001 65001
!
router bgp 65001
 address-family ipv4 unicast
  neighbor 10.10.0.2 route-map DRAIN-OUT out
  neighbor 10.10.0.3 route-map DRAIN-OUT out
 exit-address-family
!
end
clear bgp * soft out
DRAIN

echo "Site A drained — traffic is now flowing through Site B"
echo "Wait 30 seconds for convergence, then do your maintenance"

# After maintenance: restore normal routing
# Re-attach the normal route-maps first, then delete DRAIN-OUT, so no
# neighbor ever references a route-map that no longer exists.
vtysh << 'RESTORE'
configure terminal
router bgp 65001
 address-family ipv4 unicast
  neighbor 10.10.0.2 route-map SITE-B-OUT out
  neighbor 10.10.0.3 route-map SITE-C-OUT out
 exit-address-family
exit
no route-map DRAIN-OUT
end
clear bgp * soft out
RESTORE
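The drain block above is hard-coded for Site A. A minimal sketch of a generator for the other sites follows. The function name `drain_commands` is hypothetical; it only prints the same vtysh commands with the ASN and neighbors substituted, so you can review them before piping them into vtysh.

```shell
#!/bin/sh
# Print the vtysh commands that drain a site by AS-path prepending.
# Usage: drain_commands <local-asn> <neighbor-ip>...
drain_commands() {
    asn="$1"
    shift
    echo "configure terminal"
    echo "route-map DRAIN-OUT permit 10"
    echo " set as-path prepend $asn $asn $asn"
    echo "exit"
    echo "router bgp $asn"
    echo " address-family ipv4 unicast"
    for peer in "$@"; do
        echo "  neighbor $peer route-map DRAIN-OUT out"
    done
    echo " exit-address-family"
    echo "end"
    echo "clear bgp * soft out"
}

# Preview the commands for Site B. Once they look right:
#   drain_commands 65002 10.10.0.1 10.10.0.3 | vtysh
drain_commands 65002 10.10.0.1 10.10.0.3
```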

Step 2: VXLAN + Open vSwitch — the network fabric

This is the heart of the cloud. VXLAN (Virtual Extensible LAN) creates isolated Layer 2 overlay networks on top of your physical/WireGuard underlay. Each tenant, environment, or workload gets its own VXLAN segment identified by a VNI (VXLAN Network Identifier). There are 16 million possible VNIs. AWS calls these "VPCs." We call them "a few OVS commands."

What VXLAN actually does

Take an Ethernet frame from a VM. Wrap it in a UDP packet. Send it across the underlay to another host. Unwrap it. Deliver it to the destination VM. The VMs think they're on the same Layer 2 switch, even if they're on different continents. The encapsulation uses UDP port 4789, and each VXLAN segment is identified by a 24-bit VNI in the header, giving you 16,777,216 isolated networks. AWS caps you at five VPCs per region until you ask for a quota increase. You get 16 million segments for free.

VXLAN is a tunnel that makes two switches on different continents look like they're the same switch. The VNI is the VLAN tag, but with 16 million possible values instead of 4,096.
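Encapsulation is not free: every overlay frame carries extra headers, so the overlay MTU has to be smaller than the underlay MTU. The sketch below works through that arithmetic and the 24-bit VNI range. The function names are illustrative. It assumes an IPv4 underlay, and the 1420-byte figure is WireGuard's usual default interface MTU, so check yours with `ip link show wg0`.

```shell
#!/bin/sh
# Bytes VXLAN adds on an IPv4 underlay:
# outer IPv4 (20) + UDP (8) + VXLAN header (8) + inner Ethernet header (14)
VXLAN_OVERHEAD=50

# Largest MTU the overlay interfaces can use for a given underlay MTU.
overlay_mtu() {
    echo $(( $1 - VXLAN_OVERHEAD ))
}

# Succeed when the VNI fits in the 24-bit field (this sketch skips 0).
valid_vni() {
    [ "$1" -ge 1 ] && [ "$1" -le 16777215 ]
}

echo "VNI space: $(( 1 << 24 )) segments"
echo "Overlay MTU on plain Ethernet (1500): $(overlay_mtu 1500)"
echo "Overlay MTU over WireGuard (1420):    $(overlay_mtu 1420)"
```

If the overlay MTU is left at 1500 on top of a 1420-byte tunnel, large packets get fragmented or dropped, which usually shows up as connections that establish fine and then hang on bulk transfers.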

# Install Open vSwitch on all nodes
dnf install -y openvswitch libibverbs
systemctl enable --now openvswitch

# Verify
ovs-vsctl show

Create the OVS bridge and VXLAN tunnels

# On Site A (10.10.0.1) — create the main OVS bridge
ovs-vsctl add-br br-overlay

# Add VXLAN tunnel ports to other sites
# key=flow means VNI is determined per-flow, not per-tunnel
ovs-vsctl add-port br-overlay vxlan-site-b -- \
    set interface vxlan-site-b type=vxlan \
    options:remote_ip=10.10.0.2 \
    options:key=flow \
    options:dst_port=4789

ovs-vsctl add-port br-overlay vxlan-site-c -- \
    set interface vxlan-site-c type=vxlan \
    options:remote_ip=10.10.0.3 \
    options:key=flow \
    options:dst_port=4789

# Verify tunnel setup
ovs-vsctl show

Create tenant networks (VNIs)

Each tenant or environment gets its own VNI. OpenFlow rules on the OVS bridge enforce isolation — traffic from VNI 100 can never reach VNI 200 unless you explicitly route between them.

# /usr/local/bin/cloud-network
cat > /usr/local/bin/cloud-network << 'SCRIPT'
#!/bin/bash
set -euo pipefail

ACTION="${1:-help}"
VNI="${2:-}"
NAME="${3:-}"
SUBNET="${4:-}"

case "$ACTION" in
    create)
        if [ -z "$VNI" ] || [ -z "$NAME" ] || [ -z "$SUBNET" ]; then
            echo "Usage: $0 create <vni> <name> <subnet>"
            echo "  e.g. $0 create 100 tenant-a 10.100.0.0/24"
            exit 1
        fi

        echo "Creating network: $NAME (VNI $VNI, subnet $SUBNET)"

        # Create internal OVS port for this VNI
        ovs-vsctl add-port br-overlay "vni-$VNI" \
            tag="$VNI" -- \
            set interface "vni-$VNI" type=internal

        # Assign the gateway IP (first usable address)
        GW_IP=$(echo "$SUBNET" | sed 's|0/|1/|')
        ip addr add "$GW_IP" dev "vni-$VNI"
        ip link set "vni-$VNI" up

        # Add OpenFlow rules for VXLAN encap/decap
        # Incoming: match VNI, deliver to local port
        ovs-ofctl add-flow br-overlay \
            "table=0,priority=100,tun_id=$VNI,actions=output:vni-$VNI"

        # Outgoing: tag with VNI, send to all VXLAN tunnels
        # (ask OVS for the OpenFlow port number instead of parsing text output)
        PORT_NUM=$(ovs-vsctl get Interface "vni-$VNI" ofport)
        ovs-ofctl add-flow br-overlay \
            "table=0,priority=100,in_port=$PORT_NUM,actions=set_field:$VNI->tun_id,output:vxlan-site-b,output:vxlan-site-c"

        # Enable DHCP for this network via dnsmasq
        cat > "/etc/dnsmasq.d/vni-$VNI.conf" << EOF
interface=vni-$VNI
dhcp-range=${SUBNET%.*}.10,${SUBNET%.*}.250,255.255.255.0,12h
dhcp-option=option:router,${SUBNET%.*}.1
dhcp-option=option:dns-server,${SUBNET%.*}.1
EOF
        systemctl restart dnsmasq

        echo "Network $NAME (VNI $VNI) created"
        echo "  Gateway: $GW_IP"
        echo "  DHCP: ${SUBNET%.*}.10 - ${SUBNET%.*}.250"
        ;;

    list)
        echo "=== Active overlay networks ==="
        ovs-vsctl list-ports br-overlay | grep "^vni-" | while read -r port; do
            VNI_NUM="${port#vni-}"
            IP=$(ip -4 addr show "$port" 2>/dev/null | grep inet | awk '{print $2}')
            echo "  VNI $VNI_NUM: $IP ($port)"
        done
        ;;

    delete)
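        [ -z "$VNI" ] && { echo "Usage: $0 delete <vni>"; exit 1; }
        echo "Deleting network VNI $VNI"
        ovs-ofctl del-flows br-overlay "tun_id=$VNI"
        # Also remove the outgoing flow, which matches on in_port, not tun_id
        PORT_NUM=$(ovs-vsctl get Interface "vni-$VNI" ofport 2>/dev/null || true)
        if [ -n "$PORT_NUM" ]; then
            ovs-ofctl del-flows br-overlay "in_port=$PORT_NUM"
        fi
        ovs-vsctl del-port br-overlay "vni-$VNI" 2>/dev/null || true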
        rm -f "/etc/dnsmasq.d/vni-$VNI.conf"
        systemctl restart dnsmasq
        echo "Network VNI $VNI deleted"
        ;;

    *)
        echo "Usage: $0 {create|list|delete} [VNI] [name] [subnet]"
        echo ""
        echo "Examples:"
        echo "  $0 create 100 production 10.100.0.0/24"
        echo "  $0 create 200 staging    10.200.0.0/24"
        echo "  $0 create 300 dev-team   10.30.0.0/24"
        echo "  $0 list"
        echo "  $0 delete 300"
        ;;
esac
SCRIPT
chmod +x /usr/local/bin/cloud-network

# Create the network fabric
cloud-network create 100 production  10.100.0.0/24
cloud-network create 200 staging     10.200.0.0/24
cloud-network create 300 development 10.30.0.0/24
cloud-network create 900 management  10.90.0.0/24

# Verify
cloud-network list
ovs-ofctl dump-flows br-overlay
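Two assumptions are baked into the cloud-network script: the subnet is a /24 ending in .0, and the VNI is reused as the OVS VLAN tag, which only has 12 bits. The helpers below are a sketch of input checks you could run before calling it. The names `gateway_ip` and `vni_fits_vlan_tag` are made up for this example.

```shell
#!/bin/sh
# First usable host address of an IPv4 network given in CIDR form.
# Assumes the address part is already the network address.
gateway_ip() {
    net="${1%/*}"                      # strip the prefix length
    IFS=. read -r a b c d <<EOF
$net
EOF
    echo "$a.$b.$c.$(( d + 1 ))"
}

# The script passes the VNI as an OVS VLAN tag, so it must be a valid
# 802.1Q VLAN ID (1 to 4094), not just a valid 24-bit VNI.
vni_fits_vlan_tag() {
    [ "$1" -ge 1 ] && [ "$1" -le 4094 ]
}

for vni in 100 900 5000; do
    if vni_fits_vlan_tag "$vni"; then
        echo "VNI $vni: usable as a VLAN tag"
    else
        echo "VNI $vni: too large for tag=, needs a different mapping"
    fi
done
echo "Gateway for 10.100.0.0/24: $(gateway_ip 10.100.0.0/24)"
```

Unlike the `sed` substitution in the script, `gateway_ip` also handles networks that do not end in .0, such as 10.0.0.64/26.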

Attach VMs to overlay networks

# When creating a KVM VM, attach it to a VXLAN network:
# 1. Create an OVS port for the VM
ovs-vsctl add-port br-overlay "vm-web-01" tag=100 -- \
    set interface "vm-web-01" type=internal

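# 2. Use the port as the VM's network interface in libvirt XML:
#    <interface type='bridge'>
#      <source bridge='br-overlay'/>
#      <vlan>
#        <tag id='100'/>
#      </vlan>
#      <virtualport type='openvswitch'/>
#    </interface>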

# The VM lands on VNI 100 (production network), gets DHCP,
# and can talk to other VNI 100 VMs across all sites.

Step 3: BGP EVPN — distributed VXLAN control plane

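So far, VXLAN tunnels are statically configured between sites. That works for 3 nodes. For 30, you need a control plane that automatically discovers which VMs are on which hosts and populates MAC/IP tables accordingly. That's BGP EVPN (Ethernet VPN), the same protocol that data centers use to scale VXLAN to thousands of hosts. One caveat: FRR learns VNIs from kernel VXLAN interfaces attached to a Linux bridge, so it does not see the OVS-managed tunnels from Step 2. Segments you want EVPN to manage need kernel VXLAN devices.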

What EVPN actually solves

Without EVPN, every VXLAN host floods broadcast traffic to every other host to learn MAC addresses — just like a physical switch, but across your WAN links. That's expensive. EVPN uses BGP to advertise MAC/IP bindings: "VM with MAC aa:bb:cc:dd:ee:ff and IP 10.100.0.5 is reachable via VTEP 10.10.0.1, VNI 100." Every other host installs that entry in its forwarding table. No flooding. No wasted bandwidth. Just targeted, unicast delivery.

Without EVPN, finding a VM is like shouting in a crowded room. With EVPN, it's like checking a phone book.

# FRRouting EVPN config: add to /etc/frr/frr.conf on each node
# (shown for Site A; change the ASN, neighbors, and RD on the other sites)
cat >> /etc/frr/frr.conf << 'FRR'
!
! ─── EVPN: VXLAN control plane ────────────────────────
router bgp 65001
 address-family l2vpn evpn
  neighbor 10.10.0.2 activate
  neighbor 10.10.0.3 activate
  advertise-all-vni
  ! Advertise IP prefixes for inter-VNI routing
  advertise ipv4 unicast
  !
  ! Per-VNI configuration (must sit inside the EVPN address family)
  vni 100
   rd 10.10.0.1:100
   route-target import 65000:100
   route-target export 65000:100
  exit-vni
  !
  vni 200
   rd 10.10.0.1:200
   route-target import 65000:200
   route-target export 65000:200
  exit-vni
 exit-address-family
!
FRR

systemctl reload frr

# Verify EVPN routes
vtysh -c "show bgp l2vpn evpn summary"
vtysh -c "show bgp l2vpn evpn route"
vtysh -c "show evpn vni"
vtysh -c "show evpn mac vni all"
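The per-VNI stanzas differ between nodes only in the route distinguisher, which embeds the local VTEP address. A minimal sketch of a generator is shown below. The function name `evpn_vni_stanza` is hypothetical, the shared route-target ASN 65000 follows the config above, and the output is meant to be pasted inside `address-family l2vpn evpn`.

```shell
#!/bin/sh
# Print a per-VNI EVPN stanza for one node.
# Usage: evpn_vni_stanza <vtep-ip> <vni> [route-target-asn]
evpn_vni_stanza() {
    vtep="$1"
    vni="$2"
    rt_asn="${3:-65000}"    # shared across sites so imports and exports match
    echo "  vni $vni"
    echo "   rd $vtep:$vni"
    echo "   route-target import $rt_asn:$vni"
    echo "   route-target export $rt_asn:$vni"
    echo "  exit-vni"
}

# Stanzas for Site B (VTEP 10.10.0.2)
for vni in 100 200; do
    evpn_vni_stanza 10.10.0.2 "$vni"
done
```

Keeping the route targets identical on every site matters here because each site has its own ASN, so route targets derived automatically from the local ASN would not match across sites.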

Step 4: HAProxy + keepalived — production load balancing

Traffic from the internet needs to reach your services. HAProxy is the load balancer that many high-traffic sites run behind the scenes. It handles Layer 4 (TCP) and Layer 7 (HTTP) load balancing, TLS termination, health checks, rate limiting, and connection draining. Keepalived provides VRRP failover: if the primary HAProxy node dies, the floating IP moves to the standby within a few seconds (about three missed advertisements at the default 1-second interval).

Why HAProxy instead of nginx/Caddy?

Nginx and Caddy are great reverse proxies. HAProxy is a great load balancer. The difference matters at scale: HAProxy has connection-aware health checks (not just HTTP pings), graceful connection draining (finish in-flight requests before removing a backend), sticky sessions, circuit breakers, and a runtime API that lets you add or remove servers without reloading config. It also scales to very high connection counts on modest hardware. There's a reason it has been an industry standard for two decades.

# Install HAProxy and keepalived
dnf install -y haproxy keepalived

cat > /etc/haproxy/haproxy.cfg << 'HAPROXY'
# ═══════════════════════════════════════════════════════
# kldload Production Cloud — HAProxy configuration
# ═══════════════════════════════════════════════════════

global
    log         /dev/log local0
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     50000
    user        haproxy
    group       haproxy
    daemon
    # Modern TLS only
    ssl-default-bind-ciphersuites TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256
    ssl-default-bind-options ssl-min-ver TLSv1.2 no-tls-tickets
    # Runtime API — add/remove backends without reload
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s

defaults
    mode        http
    log         global
    option      httplog
    option      dontlognull
    option      http-server-close
    option      forwardfor except 127.0.0.0/8
    retries     3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         5s
    timeout client          30s
    timeout server          30s
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn     10000
    # Health check defaults
    default-server inter 3s fall 3 rise 2

# ─── Stats dashboard ──────────────────────────────────
listen stats
    bind *:8404
    mode http
    stats enable
    stats uri /
    stats refresh 10s
    # Allow admin actions only from internal addresses
    stats admin if { src 10.0.0.0/8 }
    stats show-legends

# ─── HTTPS frontend ───────────────────────────────────
frontend https-in
    bind *:443 ssl crt /etc/haproxy/certs/ alpn h2,http/1.1
    bind *:80
    # Redirect HTTP to HTTPS
    http-request redirect scheme https unless { ssl_fc }

    # Route by hostname
    acl host_app     hdr(host) -i app.example.com
    acl host_api     hdr(host) -i api.example.com
    acl host_grafana hdr(host) -i grafana.example.com
    acl host_minio   hdr(host) -i s3.example.com

    use_backend app-servers     if host_app
    use_backend api-servers     if host_api
    use_backend grafana         if host_grafana
    use_backend minio           if host_minio
    default_backend app-servers

    # Rate limiting — 100 requests/10s per IP
    stick-table type ip size 100k expire 30s store http_req_rate(10s)
    http-request track-sc0 src
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 100 }

# ─── Backend: application servers ─────────────────────
backend app-servers
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200
    # Sticky sessions via cookie
    cookie SERVERID insert indirect nocache
    # Slow start: ramp traffic up over 60s when a server comes back
    default-server inter 5s fall 3 rise 2 slowstart 60s
    server app-a-1 10.100.0.10:8080 check cookie a1
    server app-a-2 10.100.0.11:8080 check cookie a2
    server app-b-1 10.200.0.10:8080 check cookie b1 backup

# ─── Backend: API servers ────────────────────────────
backend api-servers
    balance leastconn
    option httpchk GET /api/health
    http-check expect status 200
    server api-a-1 10.100.0.20:3000 check
    server api-a-2 10.100.0.21:3000 check
    server api-b-1 10.200.0.20:3000 check backup

# ─── Backend: Grafana ────────────────────────────────
backend grafana
    balance roundrobin
    option httpchk GET /api/health
    server grafana-1 10.90.0.10:3000 check

# ─── Backend: MinIO ──────────────────────────────────
backend minio
    balance leastconn
    option httpchk GET /minio/health/live
    http-check expect status 200
    server minio-a-1 10.100.0.30:9000 check
    server minio-a-2 10.100.0.31:9000 check

# ─── TCP frontend: PostgreSQL ────────────────────────
frontend postgres-in
    mode tcp
    bind *:5432
    default_backend postgres-servers

backend postgres-servers
    mode tcp
    option pgsql-check user haproxy
    server pg-primary 10.100.0.40:5432 check
    server pg-standby 10.200.0.40:5432 check backup
HAPROXY

systemctl enable --now haproxy

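# Runtime API examples: no config reload needed
# Add a new backend server (dynamic servers with checks need HAProxy 2.5+)
echo "add server app-servers/app-a-3 10.100.0.12:8080 check" | \
    socat stdio /run/haproxy/admin.sock
# Dynamic servers start in maintenance with checks off; enable both
echo "enable health app-servers/app-a-3" | socat stdio /run/haproxy/admin.sock
echo "enable server app-servers/app-a-3" | socat stdio /run/haproxy/admin.sock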

# Drain a server before maintenance (finish in-flight, reject new)
echo "set server app-servers/app-a-1 state drain" | \
    socat stdio /run/haproxy/admin.sock

# Check backend health
echo "show servers state" | socat stdio /run/haproxy/admin.sock
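Keepalived is installed above but not configured. The generator below is a sketch of the VRRP side, not the recipe's own configuration. The instance name `VI_1`, `virtual_router_id 51`, the interface `eth0`, the priorities, and the use of 203.0.113.50 as the floating IP are all assumptions to replace with your own values.

```shell
#!/bin/sh
# Print a keepalived.conf for one load balancer node.
# Usage: keepalived_conf <MASTER|BACKUP> <priority> <interface> <vip/cidr>
keepalived_conf() {
    state="$1"; priority="$2"; iface="$3"; vip="$4"
    cat <<EOF
# Give up the floating IP when HAProxy is not running
vrrp_script chk_haproxy {
    script "/usr/bin/systemctl is-active --quiet haproxy"
    interval 2
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state $state
    interface $iface
    virtual_router_id 51
    priority $priority
    advert_int 1
    virtual_ipaddress {
        $vip
    }
    track_script {
        chk_haproxy
    }
}
EOF
}

# Primary node; on the standby use: keepalived_conf BACKUP 90 eth0 203.0.113.50/32
keepalived_conf MASTER 100 eth0 203.0.113.50/32
```

Redirect the output to /etc/keepalived/keepalived.conf on each node and run `systemctl enable --now keepalived`. The node with the higher priority holds the address while its HAProxy check passes.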

Step 5: PowerDNS + CoreDNS — the naming layer

AWS charges you per million DNS queries. Per million. For looking up names. PowerDNS handles authoritative DNS (your public zones) with a PostgreSQL backend and an HTTP API for dynamic updates. CoreDNS handles internal service discovery — every VM and container gets a DNS name automatically.

# Install PowerDNS (authoritative); CoreDNS is installed from a release tarball further down
dnf install -y pdns pdns-backend-pgsql

# PowerDNS config — PostgreSQL backend
cat > /etc/pdns/pdns.conf << 'PDNS'
launch=gpgsql
gpgsql-host=127.0.0.1
gpgsql-dbname=pdns
gpgsql-user=pdns
gpgsql-password=changeme-pdns-password

# API for dynamic updates
api=yes
api-key=changeme-api-key
webserver=yes
webserver-address=0.0.0.0
webserver-port=8081
webserver-allow-from=10.0.0.0/8,172.16.0.0/12

# Default SOA record for newly created zones
default-soa-content=ns1.example.com hostmaster.example.com 0 10800 3600 604800 3600

# Logging
loglevel=4
log-dns-queries=no
PDNS

systemctl enable --now pdns

# CoreDNS — internal service discovery
# Resolves *.cloud.internal to overlay network IPs

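# Release assets carry the version in the file name, so pin one explicitly
COREDNS_VERSION=1.11.1
curl -fsSL "https://github.com/coredns/coredns/releases/download/v${COREDNS_VERSION}/coredns_${COREDNS_VERSION}_linux_amd64.tgz" | \
    tar -xz -C /usr/local/bin/

mkdir -p /etc/coredns
cat > /etc/coredns/Corefile << 'COREFILE'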
# Internal zone: answered by Consul's DNS interface on port 8600
# (Consul serves .consul by default; set its "domain" option to cloud.internal)
cloud.internal {
    forward . 127.0.0.1:8600
    log
    errors
    cache 30
}

# Reverse DNS for overlay networks
10.in-addr.arpa {
    forward . 127.0.0.1:8600
    log
    cache 60
}

# Everything else: forward to public DNS over TLS
. {
    forward . tls://1.1.1.1 tls://1.0.0.1 {
        tls_servername cloudflare-dns.com
    }
    cache 300
    log
}
COREFILE

# Systemd unit for CoreDNS
cat > /etc/systemd/system/coredns.service << 'UNIT'
[Unit]
Description=CoreDNS DNS server
After=network.target

[Service]
ExecStart=/usr/local/bin/coredns -conf /etc/coredns/Corefile
Restart=always
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
UNIT

systemctl daemon-reload
systemctl enable --now coredns

# Dynamic DNS updates via PowerDNS API
# Create a zone
curl -s -X POST http://localhost:8081/api/v1/servers/localhost/zones \
    -H "X-API-Key: changeme-api-key" \
    -H "Content-Type: application/json" \
    -d '{
        "name": "example.com.",
        "kind": "Master",
        "nameservers": ["ns1.example.com.", "ns2.example.com."],
        "rrsets": [
            {
                "name": "app.example.com.",
                "type": "A",
                "ttl": 60,
                "records": [{"content": "203.0.113.50", "disabled": false}]
            }
        ]
    }'

# Update a record (automated failover can call this)
curl -s -X PATCH http://localhost:8081/api/v1/servers/localhost/zones/example.com. \
    -H "X-API-Key: changeme-api-key" \
    -H "Content-Type: application/json" \
    -d '{
        "rrsets": [{
            "name": "app.example.com.",
            "type": "A",
            "ttl": 60,
            "changetype": "REPLACE",
            "records": [{"content": "203.0.113.51", "disabled": false}]
        }]
    }'
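For automated failover it helps to build the PATCH body from variables instead of editing JSON by hand. The helper below is a minimal sketch. The function name `a_record_patch` is made up, and it emits the same rrset structure as the request above.

```shell
#!/bin/sh
# Build the JSON body for a PowerDNS PATCH that replaces one A record.
# Usage: a_record_patch <fqdn-with-trailing-dot> <ipv4> [ttl]
a_record_patch() {
    name="$1"; ip="$2"; ttl="${3:-60}"
    printf '{"rrsets":[{"name":"%s","type":"A","ttl":%s,"changetype":"REPLACE","records":[{"content":"%s","disabled":false}]}]}\n' \
        "$name" "$ttl" "$ip"
}

# Preview the payload. To apply it:
#   curl -s -X PATCH http://localhost:8081/api/v1/servers/localhost/zones/example.com. \
#       -H "X-API-Key: changeme-api-key" -H "Content-Type: application/json" \
#       -d "$(a_record_patch app.example.com. 203.0.113.51)"
a_record_patch app.example.com. 203.0.113.51
```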

Step 6: step-ca — internal PKI (your own certificate authority)

Every service in your cloud needs TLS. You're not going to get Let's Encrypt certs for postgres-primary.cloud.internal. You need your own CA. step-ca from Smallstep is an ACME-compatible certificate authority that runs on your infrastructure. Services request certs automatically via the ACME protocol — the same protocol Let's Encrypt uses — but against your internal CA.

# Install step CLI and step-ca
curl -fsSL https://dl.smallstep.com/gh-release/cli/docs-cli-install/v0.25.0/step-cli_amd64.rpm -o /tmp/step-cli.rpm
curl -fsSL https://dl.smallstep.com/gh-release/certificates/docs-ca-install/v0.25.0/step-ca_amd64.rpm -o /tmp/step-ca.rpm
dnf install -y /tmp/step-cli.rpm /tmp/step-ca.rpm

# Initialize the CA
step ca init \
    --name "kldload Production CA" \
    --provisioner admin \
    --dns "ca.cloud.internal" \
    --dns "10.90.0.1" \
    --address ":8443" \
    --deployment-type standalone

# Enable ACME provisioner (Let's Encrypt-compatible)
step ca provisioner add acme --type ACME

# Start the CA
systemctl enable --now step-ca

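# Any service can now request a certificate.
# Note: step-ca caps certificate lifetime at 24h by default, so --not-after 720h
# requires raising the provisioner's maximum duration first.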
step ca certificate "postgres.cloud.internal" \
    server.crt server.key \
    --ca-url https://ca.cloud.internal:8443 \
    --root /root/.step/certs/root_ca.crt \
    --not-after 720h

# Or use ACME (works with Caddy, HAProxy, Traefik, etc.):
# In Caddy:
#   tls {
#       ca https://ca.cloud.internal:8443/acme/acme/directory
#   }

# Auto-renew with a cron job
cat > /etc/cron.d/cert-renew << 'EOF'
0 */12 * * * root step ca renew /etc/ssl/server.crt /etc/ssl/server.key --force 2>&1 | logger -t cert-renew
EOF
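The cron job above forces a renewal every 12 hours regardless of how much lifetime is left. If you would rather renew only when a certificate is getting old, one possible policy is to renew once two thirds of the lifetime has elapsed. That threshold is an assumption made for this sketch, not something the recipe prescribes, and `renewal_due` is a hypothetical helper name.

```shell
#!/bin/sh
# Decide whether a certificate is due for renewal.
# Arguments are Unix timestamps: not-before, not-after, now.
# Succeeds once two thirds of the lifetime has elapsed.
renewal_due() {
    not_before="$1"; not_after="$2"; now="$3"
    lifetime=$(( not_after - not_before ))
    threshold=$(( not_before + lifetime * 2 / 3 ))
    [ "$now" -ge "$threshold" ]
}

# A 720h (2592000 s) certificate issued at t=0, checked at t=1800000 s
if renewal_due 0 2592000 1800000; then
    echo "renew now"
else
    echo "still fresh"
fi
```

The timestamps can come from `openssl x509 -noout -startdate -enddate -in server.crt` converted with `date -d`, and the current time from `date +%s`.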

Step 7: Keycloak — identity and access management

AWS IAM is the thing that makes grown engineers cry. Keycloak does the same job — SSO, RBAC, OIDC, SAML — but you can actually read the documentation without needing a decoder ring. One login for Grafana, MinIO, Gitea, your apps, and the cloud management API.

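# Deploy Keycloak on the management network
# Generate the bootstrap admin password once and save it; generated inline
# in the docker command it would never be shown to you
KC_ADMIN_PW="$(openssl rand -base64 32)"
( umask 077; echo "$KC_ADMIN_PW" > /root/keycloak-admin.pw )

# Pin an image tag in production instead of latest: recent Keycloak releases
# replace KEYCLOAK_ADMIN* with KC_BOOTSTRAP_ADMIN_* and KC_PROXY with KC_PROXY_HEADERS
docker run -d \
    --name keycloak \
    --network host \
    -e KEYCLOAK_ADMIN=admin \
    -e KEYCLOAK_ADMIN_PASSWORD="$KC_ADMIN_PW" \
    -e KC_DB=postgres \
    -e KC_DB_URL=jdbc:postgresql://10.90.0.40:5432/keycloak \
    -e KC_DB_USERNAME=keycloak \
    -e KC_DB_PASSWORD=changeme-keycloak-db \
    -e KC_HOSTNAME=auth.example.com \
    -e KC_PROXY=edge \
    -v /srv/keycloak:/opt/keycloak/data \
    quay.io/keycloak/keycloak:latest \
    start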

echo "Keycloak admin console: https://auth.example.com"
echo "Create a realm, add OIDC clients for each service"

# Configure Grafana to use Keycloak SSO
cat >> /etc/grafana/grafana.ini << 'INI'
[auth.generic_oauth]
enabled = true
name = kldload SSO
client_id = grafana
client_secret = your-client-secret
scopes = openid profile email
auth_url = https://auth.example.com/realms/cloud/protocol/openid-connect/auth
token_url = https://auth.example.com/realms/cloud/protocol/openid-connect/token
api_url = https://auth.example.com/realms/cloud/protocol/openid-connect/userinfo
role_attribute_path = contains(realm_access.roles[*], 'admin') && 'Admin' || 'Viewer'
INI

# Configure MinIO to use Keycloak
mc admin config set homelab identity_openid \
    config_url="https://auth.example.com/realms/cloud/.well-known/openid-configuration" \
    client_id="minio" \
    claim_name="policy" \
    scopes="openid"

# The new identity settings take effect after a restart
mc admin service restart homelab
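
Grafana's role_attribute_path is a JMESPath expression evaluated against the claims it receives, so the mapping only works if realm_access.roles is actually in the token. When a login lands as Viewer unexpectedly, decoding a token is the quickest check. A debugging sketch, assuming a hypothetical helper named jwt_payload and a base64 binary that accepts -d:

```shell
# jwt_payload TOKEN
# Prints the decoded JSON payload (second dot-separated segment) of a JWT.
# It does NOT verify the signature; use it for debugging only.
jwt_payload() {
    # JWTs use base64url: map "_" and "-" back to "/" and "+".
    seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
    # base64url omits padding; restore it so base64 -d accepts the input.
    case $(( ${#seg} % 4 )) in
        2) seg="$seg==" ;;
        3) seg="$seg=" ;;
    esac
    printf '%s' "$seg" | base64 -d
    echo
}

# Usage, with a token copied from a browser session or a token request:
#   jwt_payload "$ACCESS_TOKEN"
```

Look for "realm_access":{"roles":[...]} in the output. If it is missing, the fix is in the Keycloak client's mappers, not in grafana.ini.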

Step 8: Consul — service mesh and discovery

You have services running across three sites on overlay networks. How does HAProxy know which backends are healthy? How does CoreDNS know which IP belongs to postgres.cloud.internal? Consul. It's the glue — service registration, health checking, KV store, and service mesh in one binary.

# Install Consul on all nodes
dnf install -y consul

# Server config (run on 3 or 5 nodes for quorum)
cat > /etc/consul.d/consul.hcl << 'HCL'
datacenter = "site-a"
data_dir = "/opt/consul"
server = true
bootstrap_expect = 3
bind_addr = "10.10.0.1"
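client_addr = "0.0.0.0"
# Script checks (the args = [...] form used below) are refused unless enabled
enable_local_script_checks = true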
ui_config {
  enabled = true
}
# WAN federation between sites
retry_join_wan = ["10.10.0.2", "10.10.0.3"]
# DNS interface for CoreDNS integration
ports {
  dns = 8600
}
# Enable service mesh (Connect)
connect {
  enabled = true
}
# TLS via step-ca
tls {
  defaults {
    ca_file = "/etc/consul.d/certs/ca.pem"
    cert_file = "/etc/consul.d/certs/server.pem"
    key_file = "/etc/consul.d/certs/server-key.pem"
    verify_incoming = true
    verify_outgoing = true
  }
}
HCL

systemctl enable --now consul

# Register a service with health check
# (Consul does not load subdirectories of its config dir, so the file goes at the top level)
cat > /etc/consul.d/postgres.hcl << 'HCL'
service {
  name = "postgres"
  port = 5432
  tags = ["primary", "production"]

  check {
    id       = "postgres-tcp"
    name     = "PostgreSQL TCP"
    tcp      = "localhost:5432"
    interval = "5s"
    timeout  = "2s"
  }

  check {
    id       = "postgres-query"
    name     = "PostgreSQL Query"
    args     = ["/usr/local/bin/pg-health-check"]
    interval = "10s"
    timeout  = "5s"
  }
}
HCL

consul reload

# Query services
consul catalog services
consul catalog nodes -service=postgres
dig @127.0.0.1 -p 8600 postgres.service.consul SRV
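
The service definition points at /usr/local/bin/pg-health-check but never shows it. Consul maps a script check's exit code to a state: 0 is passing, 1 is warning, anything else is critical. A minimal sketch of such a script, assuming the probe prints 1 on success. The function name and the psql invocation are illustrative, not the author's script:

```shell
# pg_health_check CMD [ARGS...]
# Runs the probe command and translates the result into Consul's
# script-check exit codes: 0 = passing, 1 = warning, other = critical.
pg_health_check() {
    if ! probe_out=$("$@" 2>&1); then
        echo "critical: probe failed: $probe_out"
        return 2
    fi
    if [ "$probe_out" != "1" ]; then
        echo "warning: unexpected result: $probe_out"
        return 1
    fi
    echo "passing: query returned 1"
    return 0
}

# Real invocation (assumes local psql access as the postgres user):
#   pg_health_check psql -h localhost -U postgres -Atc 'SELECT 1'
```

Whatever the script prints ends up as the check output in the Consul UI, so keep the message short and specific.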

Step 9: Production observability stack

You can't run a cloud you can't see. The production stack is three pillars: metrics (Prometheus), logs (Loki), and traces (Tempo). All feeding into Grafana. Metrics targets are discovered automatically through Consul, so there is no more manually adding targets to prometheus.yml.

# Prometheus config — auto-discover services via Consul
cat > /etc/prometheus/prometheus.yml << 'PROM'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# ─── Auto-discovery from Consul ───────────────────────
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: []
    relabel_configs:
      # Use Consul service name as job label
      - source_labels: [__meta_consul_service]
        target_label: job
      # Use Consul node name as instance label
      - source_labels: [__meta_consul_node]
        target_label: instance
      # Add site label from Consul datacenter
      - source_labels: [__meta_consul_dc]
        target_label: site
      # Only scrape services tagged 'metrics'
      - source_labels: [__meta_consul_tags]
        regex: .*,metrics,.*
        action: keep

  - job_name: 'node-exporter'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['node-exporter']
    relabel_configs:
      - source_labels: [__meta_consul_node]
        target_label: instance

  # ─── ZFS-specific metrics ──────────────────────────
  - job_name: 'zfs-exporter'
    static_configs:
      - targets: ['10.10.0.1:9134', '10.10.0.2:9134', '10.10.0.3:9134']
        labels:
          tier: 'storage'

  # ─── HAProxy stats ────────────────────────────────
  - job_name: 'haproxy'
    static_configs:
      - targets: ['localhost:8404']
PROM

systemctl restart prometheus
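
The keep rule deserves a second look. Prometheus exposes a service's Consul tags as one string, joined by commas with a leading and trailing comma, and its regexes are fully anchored. That is why the pattern is .*,metrics,.* and not just metrics. The sketch below imitates that matching in plain shell; both function names are made up for illustration:

```shell
# consul_tags_label TAG...
# Mimics how Prometheus builds __meta_consul_tags: tags joined by "," with a
# leading and trailing separator, e.g. ",primary,metrics,".
consul_tags_label() {
    if [ "$#" -eq 0 ]; then
        printf ',,\n'
        return 0
    fi
    label=","
    for tag in "$@"; do
        label="$label$tag,"
    done
    printf '%s\n' "$label"
}

# would_scrape TAG...
# Succeeds when the keep rule (regex .*,metrics,.*) would retain the target.
would_scrape() {
    case "$(consul_tags_label "$@")" in
        *,metrics,*) return 0 ;;
        *) return 1 ;;
    esac
}

would_scrape primary production && echo keep || echo drop    # drop
would_scrape primary metrics && echo keep || echo drop       # keep
would_scrape metrics-v2 && echo keep || echo drop            # drop
```

Note the first case: the postgres service registered in Step 8 carries only primary and production, so it is dropped until you add a metrics tag to it.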

# Loki — log aggregation (the Grafana alternative to Elasticsearch)
docker run -d \
    --name loki \
    --network host \
    -v /srv/loki:/loki \
    grafana/loki:latest \
    -config.file=/etc/loki/local-config.yaml

# Promtail on every node — ships logs to Loki
cat > /etc/promtail/config.yml << 'PROMTAIL'
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://10.90.0.10:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: ['localhost']
        labels:
          job: syslog
          __path__: /var/log/*.log

  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'

  - job_name: frr
    static_configs:
      - targets: ['localhost']
        labels:
          job: frr
          __path__: /var/log/frr/*.log

  - job_name: haproxy
    static_configs:
      - targets: ['localhost']
        labels:
          job: haproxy
          __path__: /var/log/haproxy.log
PROMTAIL

Step 10: The control plane API

The difference between "a bunch of servers" and "a cloud" is an API. You need a way to say "create a VM on VNI 100 with 4 CPUs and 8GB RAM" and have it happen. Here's a minimal control plane that ties together everything we've built.

# /usr/local/bin/cloud-ctl — the cloud management CLI
cat > /usr/local/bin/cloud-ctl << 'SCRIPT'
#!/bin/bash
set -euo pipefail

CMD="${1:-help}"
shift || true

case "$CMD" in
    # ─── Network operations ──────────────────────────
    network)
        cloud-network "$@"
        ;;

    # ─── VM operations ───────────────────────────────
    vm-create)
        NAME="${1:?Usage: cloud-ctl vm-create <name> <vni> [cpus] [ram] [disk]}"
        VNI="${2:?}"
        CPUS="${3:-2}"
        RAM="${4:-4}"
        DISK="${5:-50}"

        echo "=== Creating VM: $NAME ==="
        echo "  Network: VNI $VNI"
        echo "  Resources: ${CPUS} vCPUs, ${RAM}GB RAM, ${DISK}GB disk"

        # Create a ZFS volume (zvol) for the VM disk
        zfs create -V "${DISK}G" "rpool/vms/$NAME"

        # Create OVS port
        ovs-vsctl add-port br-overlay "tap-$NAME" tag="$VNI" -- \
            set interface "tap-$NAME" type=internal

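        # Generate libvirt XML (a minimal domain; the device list is an assumption, adjust to your hosts)
        cat > "/tmp/$NAME.xml" << VMXML
<domain type='kvm'>
  <name>$NAME</name>
  <memory unit='GiB'>$RAM</memory>
  <vcpu>$CPUS</vcpu>
  <os><type arch='x86_64'>hvm</type></os>
  <features><acpi/><apic/></features>
  <cpu mode='host-passthrough'/>
  <devices>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/zvol/rpool/vms/$NAME'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <interface type='bridge'>
      <source bridge='br-overlay'/>
      <virtualport type='openvswitch'/>
      <vlan><tag id='$VNI'/></vlan>
      <model type='virtio'/>
    </interface>
    <console type='pty'/>
    <graphics type='vnc' autoport='yes' listen='127.0.0.1'/>
  </devices>
</domain>
VMXML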

        virsh define "/tmp/$NAME.xml"
        virsh start "$NAME"

        # Register with Consul
        consul services register -name="vm-$NAME" \
            -tag="vni-$VNI" -tag="compute" \
            -meta="cpus=$CPUS" -meta="ram=${RAM}G"

        echo "=== VM $NAME is running ==="
        virsh dominfo "$NAME"
        ;;

    vm-list)
        echo "=== Virtual Machines ==="
        virsh list --all
        echo ""
        echo "=== ZFS VM Volumes ==="
        zfs list -r rpool/vms -o name,volsize,used 2>/dev/null || echo "No VM volumes"
        ;;

    vm-destroy)
        NAME="${1:?Usage: cloud-ctl vm-destroy <name>}"
        echo "Destroying VM: $NAME"
        virsh destroy "$NAME" 2>/dev/null || true
        virsh undefine "$NAME" 2>/dev/null || true
        # -r also removes the volume's snapshots, which would otherwise block the destroy
        zfs destroy -r "rpool/vms/$NAME" 2>/dev/null || true
        ovs-vsctl del-port br-overlay "tap-$NAME" 2>/dev/null || true
        consul services deregister -id="vm-$NAME" 2>/dev/null || true
        echo "VM $NAME destroyed"
        ;;

    vm-snapshot)
        NAME="${1:?Usage: cloud-ctl vm-snapshot <name> [label]}"
        LABEL="${2:-manual-$(date +%s)}"
        echo "Snapshotting VM $NAME as $LABEL"
        # ZFS snapshots are instant and crash-consistent; the VM keeps running
        zfs snapshot "rpool/vms/$NAME@$LABEL"
        echo "Snapshot created: rpool/vms/$NAME@$LABEL"
        ;;

    vm-clone)
        SRC="${1:?Usage: cloud-ctl vm-clone <source> <dest> <vni>}"
        DST="${2:?}"
        VNI="${3:?}"
        echo "Cloning $SRC to $DST"
        # Snapshot source, then clone — instant, copy-on-write
        zfs snapshot "rpool/vms/$SRC@clone-$DST"
        zfs clone "rpool/vms/$SRC@clone-$DST" "rpool/vms/$DST"
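        # Only the disk is cloned here; a domain for $DST on VNI $VNI still has to be defined
        echo "Clone disk ready: rpool/vms/$DST (near-zero space until data diverges)"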
        ;;

    # ─── Status ──────────────────────────────────────
    status)
        echo "=============================="
        echo "  Production Cloud Status"
        echo "  $(date '+%Y-%m-%d %H:%M:%S')"
        echo "=============================="
        echo ""
        echo "--- Routing ---"
        vtysh -c "show bgp summary" 2>/dev/null || echo "FRR not running"
        echo ""
        echo "--- Overlay Networks ---"
        cloud-network list
        echo ""
        echo "--- VMs ---"
        virsh list --all 2>/dev/null || echo "libvirt not running"
        echo ""
        echo "--- Services (Consul) ---"
        consul catalog services 2>/dev/null || echo "Consul not running"
        echo ""
        echo "--- HAProxy ---"
        echo "show stat" | socat stdio /run/haproxy/admin.sock 2>/dev/null | \
            awk -F, '{printf "  %-20s %-12s %s\n", $1, $2, $18}' | head -20 || \
            echo "HAProxy not running"
        echo ""
        echo "--- ZFS ---"
        zpool status -x
        echo ""
        echo "--- Storage ---"
        zfs list -o name,used,avail,compressratio -r rpool | head -20
        ;;

    *)
        echo "cloud-ctl — kldload Production Cloud Management"
        echo ""
        echo "Usage: cloud-ctl <command> [args]"
        echo ""
        echo "Network:"
        echo "  network create <vni> <name> <subnet>    Create overlay network"
        echo "  network list                            List overlay networks"
        echo "  network delete <vni>                    Delete overlay network"
        echo ""
        echo "Compute:"
        echo "  vm-create <name> <vni> [cpus] [ram] [disk]    Create a VM"
        echo "  vm-list                                       List all VMs"
        echo "  vm-destroy <name>                             Destroy a VM"
        echo "  vm-snapshot <name> [label]                    Snapshot a VM"
        echo "  vm-clone <source> <dest> <vni>                Clone a VM (instant)"
        echo ""
        echo "Status:"
        echo "  status                                  Full cloud status"
        ;;
esac
SCRIPT
chmod +x /usr/local/bin/cloud-ctl
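
cloud-ctl trusts its arguments, which is fine until someone passes a name that is too long for a network interface or a VNI that OVS can't use as a tag. A pre-flight check is cheap. The sketch below is a hypothetical addition, not part of the script above, and rests on two assumptions: the OVS port is named tap-$NAME (Linux caps interface names at 15 characters), and the VNI is passed to ovs-vsctl as tag=, which is an 802.1Q VLAN ID.

```shell
# validate_vm_args NAME VNI
# Hypothetical pre-flight check for cloud-ctl vm-create.
validate_vm_args() {
    name=$1
    vni=$2
    # Names end up in paths, XML, and interface names: keep them boring.
    case "$name" in
        ""|*[!a-z0-9-]*) echo "invalid name: use a-z, 0-9 and -" >&2; return 1 ;;
    esac
    # "tap-" is 4 characters, so the name may use at most 11 of the 15.
    if [ "${#name}" -gt 11 ]; then
        echo "name too long: tap-$name exceeds 15 characters" >&2
        return 1
    fi
    case "$vni" in
        ""|*[!0-9]*) echo "VNI must be a number" >&2; return 1 ;;
    esac
    # Usable 802.1Q VLAN IDs are 1-4094.
    if [ "$vni" -lt 1 ] || [ "$vni" -gt 4094 ]; then
        echo "VNI $vni is outside 1-4094" >&2
        return 1
    fi
}
```

Calling validate_vm_args "$NAME" "$VNI" || exit 1 at the top of vm-create would reject bad input before any zvol or port is created.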

# Example workflow — deploy a web application
cloud-ctl network create 100 production 10.100.0.0/24
cloud-ctl vm-create web-01 100 4 8 50
cloud-ctl vm-create web-02 100 4 8 50
cloud-ctl vm-create db-01  100 8 32 200

# Clone a production VM for staging in under a second
cloud-ctl vm-clone web-01 staging-web-01 200
cloud-ctl vm-clone db-01  staging-db-01  200

# Snapshot before deploying
cloud-ctl vm-snapshot web-01 pre-deploy-v2.1
cloud-ctl vm-snapshot db-01  pre-deploy-v2.1

# Something broke? Stop the VM first, then rollback takes seconds.
# virsh destroy web-01 && zfs rollback rpool/vms/web-01@pre-deploy-v2.1 && virsh start web-01
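
Rollback is the one operation cloud-ctl doesn't wrap. Two details matter: the guest should be stopped before its disk changes underneath it, and zfs rollback only goes back past newer snapshots when given -r, which destroys them. A sketch of a vm-rollback subcommand under those assumptions; the helper names and the DRY_RUN switch are hypothetical:

```shell
# run CMD [ARGS...]
# Executes the command, or only prints it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# vm_rollback NAME LABEL
# Hypothetical helper, not part of cloud-ctl above.
vm_rollback() {
    name=$1
    label=$2
    # Stop the guest first: rolling back a zvol under a running VM
    # changes the disk beneath its caches and filesystem.
    run virsh destroy "$name" || true
    # -r also destroys any snapshots newer than $label; without it,
    # zfs refuses to roll back past the most recent snapshot.
    run zfs rollback -r "rpool/vms/$name@$label"
    run virsh start "$name"
}

DRY_RUN=1
vm_rollback web-01 pre-deploy-v2.1
```

With DRY_RUN=1 the function only prints the three commands, which is a reasonable way to review the plan before running it for real.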

The complete stack — AWS to open source translation

Every layer of this stack is open source, battle-tested in production at companies far larger than yours, and runs on commodity hardware. There is no proprietary component. No license key. No "contact sales for pricing." No vendor who can hold your infrastructure hostage.

The total cost: $135–400/month in bare metal rentals + your home lab. The AWS equivalent: $3,000–8,000/month, plus the invisible cost of being locked into a platform that gets more expensive every year and harder to leave every quarter. You're not just saving money. You're buying freedom.
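
To put those ranges on an annual footing, here is the arithmetic as a quick sketch. It uses only the monthly figures quoted above and leaves out home lab hardware, power, and your own time, so treat the result as a rough bound rather than a forecast.

```shell
# Monthly figures from the paragraph above (USD).
self_min=135; self_max=400
aws_min=3000; aws_max=8000

# Smallest gap: cheapest AWS bill against the priciest bare-metal bill.
low=$(( (aws_min - self_max) * 12 ))
# Largest gap: priciest AWS bill against the cheapest bare-metal bill.
high=$(( (aws_max - self_min) * 12 ))

echo "Annual difference: \$$low to \$$high"
# prints: Annual difference: $31200 to $94380
```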

Layer	Tool	AWS Equivalent	Covered in
Network fabric	VXLAN + Open vSwitch	VPC	Step 2
Dynamic routing	FRRouting (BGP + OSPF)	Route Tables / TGW	Step 1
VXLAN control plane	BGP EVPN	VPC Peering	Step 3
Load balancing	HAProxy + keepalived	ELB / ALB / NLB	Step 4
DNS	PowerDNS + CoreDNS	Route 53	Step 5
PKI / Certificates	step-ca (ACME)	ACM	Step 6
Identity / SSO	Keycloak	IAM / Cognito	Step 7
Service mesh	Consul	App Mesh / Cloud Map	Step 8
Observability	Prometheus + Loki + Grafana	CloudWatch	Step 9
Control plane	cloud-ctl (custom)	AWS Console / CLI	Step 10
Compute	KVM + libvirt	EC2	Multi-Site recipe
Object storage	MinIO on ZFS	S3	Homelab recipe
Block storage	ZFS zvols	EBS	Built in
Snapshots	ZFS snapshots	EBS Snapshots	Built in
Replication	Syncoid over WireGuard	Cross-Region Replication	Multi-Site recipe
Encryption	ZFS native + WireGuard	KMS + VPN	Built in

Is this actually production-ready?

Every component in this stack runs in production at scale. FRRouting powers ISP edge networks. HAProxy handles billions of requests per day at companies like GitHub, Stack Overflow, and Airbnb. Open vSwitch is the virtual switch behind many large OpenStack and cloud deployments. Consul runs at massive scale at HashiCorp's customers. The question isn't whether these tools are production-ready; they've been production-ready for a decade. The question is whether you're ready to stop paying someone else to run them for you.

The ingredients are the same ones the restaurants use. You're just cooking at home.

Where to go from here

Container orchestration — Add Nomad or Kubernetes for container workloads alongside KVM VMs. Nomad is simpler; Kubernetes has the ecosystem. Both integrate with Consul.
Firecracker microVMs — For serverless workloads, Firecracker can boot a microVM in as little as 125 ms. See the Serverless / Firecracker guide.
Ceph for distributed storage — When ZFS replication isn't enough and you need active-active storage across sites, Ceph provides distributed block/object/file storage. It's complex but proven.
Terraform provider — Wrap cloud-ctl in a Terraform provider for declarative infrastructure management. Libvirt already has one.
Multi-tenant billing — If you're selling this as a service, add usage metering with Prometheus and export to a billing system.
GPU passthrough — For ML workloads, pass NVIDIA GPUs through to KVM VMs. See the NVIDIA guide.

You just built an open-source AWS. Not a toy version. Not a demo. A production cloud platform with overlay networking, dynamic routing, load balancing, service discovery, internal PKI, identity management, and observability. On hardware you own. With data sovereignty you control. For a fraction of the cost.

The cloud isn't a place. It's a set of patterns. VPCs are just VXLAN. Route 53 is just DNS. ELB is just HAProxy. IAM is just Keycloak. EC2 is just KVM. S3 is just MinIO. The cloud providers packaged these patterns, put a web console on top, and charge you $8,000/month for the privilege. Now you know how the trick works. And you can do it yourself.
