Multi-Site Cloud — Bare Metal as a Service

AWS charges you to move data out of their cloud. They literally penalize you for leaving. They call it "egress fees." We call it a protection racket. What if your infrastructure was just... computers? Computers in different buildings, connected by encrypted tunnels, replicating data between them automatically. No egress fees. No API charges. No 47-page pricing calculator. Just ZFS, WireGuard, and bare metal.

Three or more bare-metal nodes across multiple regions, connected by WireGuard, replicated with ZFS. Ephemeral environments that spin up at 8am and tear down at 4pm. A different workload every night. This isn't a toy — this is the architecture that billion-dollar companies pay millions to run on AWS, except you're doing it on rented bare metal plus your home lab.

This recipe is the homelab cloud recipe scaled to multiple sites. It uses everything from the masterclass collection at once: WireGuard for the encrypted mesh between sites, ZFS send/receive for block-level replication, backplane networking for invisible infrastructure, BGP for dynamic route exchange between sites, and nftables for per-site firewalling. The BMaaS concept is the unique part: treat bare metal servers like cloud instances. Build your environment from a ZFS snapshot at 8am. Work all day. Replicate the delta home at 4pm. Destroy the environment. Tomorrow, build a different one. The hardware is interchangeable. The data lives in ZFS.

What this enables:

True multi-region failover — data replicated across sites, any site can become primary. Not "we have backups somewhere" — "the other site is already running"
Bare Metal as a Service (BMaaS) — build at 8am, tear down at 4pm, rent your idle capacity overnight or shut it down entirely. Reduce datacenter emissions by turning hardware off when you're done with it. A novel concept.
Hourly cloud rental — use OVH/Hetzner bare metal, replicate your data home every night, and return the server when you don't need it. Try doing that on AWS without selling a kidney to cover the egress bill.
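Ransomware recovery: ZFS snapshots are read-only, so an attacker who encrypts your files cannot touch the history. Roll back to the snapshot from an hour ago. In seconds. On all three sites. (An attacker with root can still destroy a node's snapshots, which is what the off-site replicas are for.)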

Architecture

Three sites. Three ZFS pools. One WireGuard mesh connecting them all. Site A is production. Site B is a hot standby with 15-minute replication lag. Site C is your home lab — the cold replica that survives even if both rented servers get hit by a meteor. Or more realistically, if OVH has another fire.

┌─────────────────────┐     WireGuard      ┌─────────────────────┐
│  SITE A - Primary   │<==================>│  SITE B - Secondary │
│  OVH Montreal       │     encrypted       │  Hetzner Frankfurt  │
│  ┌───────────────┐  │     mesh            │  ┌───────────────┐  │
│  │ rpool/services│  │                     │  │ rpool/services│  │
│  │ rpool/vms     │  │                     │  │ rpool/vms     │  │
│  │ rpool/data    │  │                     │  │ rpool/data    │  │
│  └───────────────┘  │                     │  └───────────────┘  │
└─────────┬───────────┘                     └─────────┬───────────┘
          │              WireGuard                     │
          └──────────────────┬─────────────────────────┘
                             │
                    ┌────────┴────────────┐
                    │  SITE C - Home Lab  │
                    │  Your hardware      │
                    │  ┌───────────────┐  │
                    │  │ Cold replica  │  │
                    │  │ rpool/backup  │  │
                    │  └───────────────┘  │
                    └─────────────────────┘

Why three sites?

Two sites protect you from hardware failure. Three sites protect you from provider failure. If OVH Montreal burns down (OVH lost a Strasbourg datacenter to fire in 2021), your Hetzner node takes over in seconds. If both rented servers vanish simultaneously, your home lab has a nightly copy of everything. Three sites means no single point of failure, not even your wallet.

Two copies is a backup. Three copies across three providers is a survival strategy.

The WireGuard mesh is the foundation. Every other layer (ZFS replication, failover, BMaaS) runs on top of it. The full design is three sites, full mesh, four WireGuard planes (management, control, monitoring, data). The data plane carries ZFS replication traffic. The control plane carries SSH and orchestration. Each plane is its own encrypted tunnel, so they can be firewalled and rate-limited separately. The steps below configure a single interface, wg0, which carries all of that traffic. If you've read the Backplane Masterclass, this is that architecture deployed across three physical locations.

Step 1: WireGuard mesh between sites

WireGuard is the backbone. Every site gets a tunnel to every other site, forming a fully meshed overlay network on 10.10.0.0/24. All inter-site traffic (replication, monitoring, SSH, failover heartbeats) flows through these encrypted tunnels. The public internet only ever sees encrypted WireGuard packets on UDP 51820. Everything else is invisible.

# On each node: generate WireGuard keys
# umask 077 makes both files owner-only from the moment they are created
umask 077
wg genkey | tee /etc/wireguard/private.key | wg pubkey > /etc/wireguard/public.key
chmod 600 /etc/wireguard/private.key

Site A — OVH Montreal (10.10.0.1)

cat > /etc/wireguard/wg0.conf << 'EOF'
[Interface]
Address = 10.10.0.1/24
ListenPort = 51820
PrivateKey = <contents of Site A /etc/wireguard/private.key>
# Enable forwarding for inter-site routing
PostUp = sysctl -w net.ipv4.ip_forward=1

# Site B — Hetzner Frankfurt
[Peer]
PublicKey = <contents of Site B /etc/wireguard/public.key>
AllowedIPs = 10.10.0.2/32
Endpoint = site-b.example.com:51820
PersistentKeepalive = 25

# Site C — Home Lab
[Peer]
PublicKey = <contents of Site C /etc/wireguard/public.key>
AllowedIPs = 10.10.0.3/32
Endpoint = site-c.example.com:51820
PersistentKeepalive = 25
EOF

systemctl enable --now wg-quick@wg0

Site B — Hetzner Frankfurt (10.10.0.2)

cat > /etc/wireguard/wg0.conf << 'EOF'
[Interface]
Address = 10.10.0.2/24
ListenPort = 51820
PrivateKey = <contents of Site B /etc/wireguard/private.key>
PostUp = sysctl -w net.ipv4.ip_forward=1

# Site A — OVH Montreal
[Peer]
PublicKey = <contents of Site A /etc/wireguard/public.key>
AllowedIPs = 10.10.0.1/32
Endpoint = site-a.example.com:51820
PersistentKeepalive = 25

# Site C — Home Lab
[Peer]
PublicKey = <contents of Site C /etc/wireguard/public.key>
AllowedIPs = 10.10.0.3/32
Endpoint = site-c.example.com:51820
PersistentKeepalive = 25
EOF

systemctl enable --now wg-quick@wg0

Site C — Home Lab (10.10.0.3)

cat > /etc/wireguard/wg0.conf << 'EOF'
[Interface]
Address = 10.10.0.3/24
ListenPort = 51820
PrivateKey = <contents of Site C /etc/wireguard/private.key>
PostUp = sysctl -w net.ipv4.ip_forward=1

# Site A — OVH Montreal
[Peer]
PublicKey = <contents of Site A /etc/wireguard/public.key>
AllowedIPs = 10.10.0.1/32
Endpoint = site-a.example.com:51820
PersistentKeepalive = 25

# Site B — Hetzner Frankfurt
[Peer]
PublicKey = <contents of Site B /etc/wireguard/public.key>
AllowedIPs = 10.10.0.2/32
Endpoint = site-b.example.com:51820
PersistentKeepalive = 25
EOF

systemctl enable --now wg-quick@wg0

Verify the mesh:

# From any node — all peers should show a recent handshake
wg show wg0
# Test connectivity
ping -c 3 10.10.0.1  # Site A
ping -c 3 10.10.0.2  # Site B
ping -c 3 10.10.0.3  # Site C
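
The "recent handshake" check can also be scripted. This is a minimal sketch, assuming the two-column output of `wg show wg0 latest-handshakes` (public key, then the Unix timestamp of the last handshake, 0 if there has never been one). The function name `stale_peers` and the 180-second threshold in the usage comment are illustrative choices, not WireGuard defaults.

```shell
# stale_peers NOW MAX_AGE
# Reads "<public-key> <unix-timestamp>" lines on stdin and prints the key of
# every peer whose last handshake is older than MAX_AGE seconds, or that has
# never completed a handshake (timestamp 0).
stale_peers() {
    awk -v now="$1" -v max="$2" '
        NF >= 2 && ($2 == 0 || now - $2 > max) { print $1 }
    '
}

# Typical use on a node (needs root to read the interface state):
#   wg show wg0 latest-handshakes | stale_peers "$(date +%s)" 180
```

An empty result means every peer has completed a handshake within the window. Any key that is printed is a tunnel worth looking at before trusting replication over it.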

Step 2: ZFS replication topology

This is where ZFS earns its keep. syncoid sends only the changed blocks since the last snapshot — not the whole dataset, not the whole file, just the deltas. A 2TB dataset with 50MB of changes sends 50MB. Over an encrypted WireGuard tunnel. On a schedule. Automatically.

The replication topology

Site A pushes to Site B every 15 minutes (hot standby). Both sites push to Site C nightly (cold archive). Any site can become primary by starting services on its latest replicated snapshot. The math: if Site A dies, you lose roughly one replication interval, about 15 minutes of data. If both A and B die simultaneously, you lose at most 24 hours. If all three die, you have bigger problems than data recovery.

Think of it as a relay race with the baton being copied, not passed. Every runner has their own baton.

Site A (primary)
  │
  ├── every 15 min ──→ Site B (hot standby)
  ├── nightly ────────→ Site C (cold archive)
  │
Site B (secondary)
  │
  └── nightly ────────→ Site C (cold archive)

ZFS dataset layout (all sites)

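# Create the dataset hierarchy on Site A (the primary)
# On Sites B and C, let the first syncoid run create the replicated datasets:
# receiving into a pre-existing dataset with no common snapshot fails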
zfs create -o mountpoint=none rpool/services
zfs create -o mountpoint=/srv/services/web rpool/services/web
zfs create -o mountpoint=/srv/services/db rpool/services/db
zfs create -o mountpoint=/srv/services/app rpool/services/app

zfs create -o mountpoint=none rpool/vms
zfs create -o mountpoint=none rpool/data
zfs create -o mountpoint=/srv/data/shared rpool/data/shared

# Enable encryption at rest
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase \
    -o mountpoint=/srv/secrets rpool/secrets

Real-time replication: Site A to Site B (every 15 min)

# Install sanoid/syncoid (included in kldload)
# On Site A — /etc/sanoid/sanoid.conf
cat > /etc/sanoid/sanoid.conf << 'EOF'
[rpool/services]
  use_template = production
  recursive = yes

[rpool/data]
  use_template = production
  recursive = yes

[rpool/vms]
  use_template = production
  recursive = yes

[template_production]
  frequently = 4
  hourly = 48
  daily = 30
  monthly = 6
  yearly = 0
  autosnap = yes
  autoprune = yes
EOF

systemctl enable --now sanoid.timer

# Cron: replicate to Site B every 15 minutes over WireGuard
cat > /etc/cron.d/replicate-site-b << 'CRON'
# Site A → Site B: hot replication every 15 minutes
*/15 * * * * root /usr/local/bin/replicate-to-site-b 2>&1 | logger -t zfs-replicate
CRON

# /usr/local/bin/replicate-to-site-b
cat > /usr/local/bin/replicate-to-site-b << 'SCRIPT'
#!/bin/bash
set -euo pipefail

SITE_B="10.10.0.2"
LOG_TAG="replicate-site-b"

log() { logger -t "$LOG_TAG" "$*"; }

# Check WireGuard tunnel is up
if ! ping -c 1 -W 3 "$SITE_B" > /dev/null 2>&1; then
    log "ERROR: Site B ($SITE_B) unreachable — skipping replication"
    exit 1
fi

log "Starting replication to Site B"

# Replicate each top-level dataset recursively; keep going if one fails
rc=0
for ds in rpool/services rpool/data rpool/vms; do
    log "Replicating $ds"
    if ! syncoid --recursive --no-sync-snap \
        --sendoptions="w" \
        "$ds" "$SITE_B:$ds" 2>&1 | while read -r line; do
            log "$ds: $line"
        done
    then
        log "ERROR: replication of $ds failed"
        rc=1
    fi
done

if [ "$rc" -eq 0 ]; then
    log "Replication to Site B complete"
fi
exit "$rc"
SCRIPT
chmod +x /usr/local/bin/replicate-to-site-b
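
A 15-minute cron job can overlap with itself if one run takes longer than 15 minutes, for example during the initial full send. One way to guard against that is a lock. The sketch below uses an atomic `mkdir` as the lock. The helper name `with_lock`, the exit code 75, and the lock path in the usage comment are all assumptions made for illustration.

```shell
# with_lock LOCKDIR COMMAND [ARGS...]
# Runs COMMAND only if LOCKDIR can be created. mkdir either creates the
# directory or fails, atomically, so two overlapping runs cannot both win.
with_lock() {
    _lock=$1
    shift
    if ! mkdir "$_lock" 2>/dev/null; then
        echo "another run holds $_lock; skipping" >&2
        return 75
    fi
    _rc=0
    "$@" || _rc=$?
    rmdir "$_lock"
    return "$_rc"
}

# Inside replicate-to-site-b, the syncoid loop would move into a function
# (here called replicate_all, a hypothetical name) and run as:
#   with_lock /run/replicate-site-b.lock replicate_all
```

A directory lock stays behind if the process is killed mid-run, so it needs manual cleanup after a crash. `flock(1)` from util-linux releases its lock automatically and is the usual choice where it is available.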

Nightly backup: both sites to Site C

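# One-time, on Site C: create the parent datasets the replicas land under
#   zfs create -p rpool/backup/site-a && zfs create -p rpool/backup/site-b
# On Site A: nightly push to Site C (home lab)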
cat > /etc/cron.d/replicate-site-c << 'CRON'
# Site A → Site C: nightly cold backup at 02:00
0 2 * * * root syncoid --recursive --no-sync-snap rpool/services 10.10.0.3:rpool/backup/site-a/services 2>&1 | logger -t replicate-site-c
15 2 * * * root syncoid --recursive --no-sync-snap rpool/data 10.10.0.3:rpool/backup/site-a/data 2>&1 | logger -t replicate-site-c
30 2 * * * root syncoid --recursive --no-sync-snap rpool/vms 10.10.0.3:rpool/backup/site-a/vms 2>&1 | logger -t replicate-site-c
CRON

# On Site B — nightly push to Site C at 03:00 (staggered)
cat > /etc/cron.d/replicate-site-c << 'CRON'
# Site B → Site C: nightly cold backup at 03:00
0 3 * * * root syncoid --recursive --no-sync-snap rpool/services 10.10.0.3:rpool/backup/site-b/services 2>&1 | logger -t replicate-site-c
15 3 * * * root syncoid --recursive --no-sync-snap rpool/data 10.10.0.3:rpool/backup/site-b/data 2>&1 | logger -t replicate-site-c
30 3 * * * root syncoid --recursive --no-sync-snap rpool/vms 10.10.0.3:rpool/backup/site-b/vms 2>&1 | logger -t replicate-site-c
CRON

Site C (home lab) dataset layout after replication:

rpool/backup/
├── site-a/
│   ├── services/
│   ├── data/
│   └── vms/
└── site-b/
    ├── services/
    ├── data/
    └── vms/

ZFS replication is the core of multi-site reliability. The 15-minute replication cadence puts your RPO (Recovery Point Objective) at roughly 15 minutes. Budget for closer to 30 in the worst case: --no-sync-snap ships only snapshots sanoid has already taken, so a run that fires just before the next snapshot leaves the replica one interval behind. The nightly backup to Site C means even if both production sites burn down, you have last night's data at home. The key insight: ZFS replication is incremental at the block level. A 2TB dataset that changed 500MB in the last 15 minutes sends 500MB, not 2TB. Over a WireGuard tunnel, that's seconds on a gigabit link.
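
To put a number on the transfer time, here is a back-of-the-envelope sketch. It assumes decimal units (1 MB = 8 megabits) and ignores compression, SSH and WireGuard overhead, and link contention, so treat the result as a lower bound. The function name `transfer_seconds` is made up for this example.

```shell
# transfer_seconds SIZE_MB LINK_MBPS
# Whole seconds (rounded up) to move SIZE_MB megabytes over a LINK_MBPS link.
transfer_seconds() {
    echo $(( ($1 * 8 + $2 - 1) / $2 ))
}

transfer_seconds 500 1000   # 500 MB delta over 1 Gbps   -> prints 4
transfer_seconds 500 100    # same delta over 100 Mbps   -> prints 40
```

The home-lab uplink is usually the slow leg, so the nightly push to Site C is the one worth sizing this way.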

Step 3: Failover strategy

The whole point of multi-site is that when one site dies, the other takes over. Two modes: automated (keepalived detects failure and reassigns the floating IP in seconds — your users might not even notice) or manual (SSH in, run a script, update DNS, done in under 5 minutes). Pick your adventure based on how much downtime your boss/spouse/conscience will tolerate.

Automated: keepalived + floating IP

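OVH and Hetzner both let you move an additional IP between your servers via API, but only inside the same provider: an address cannot float from OVH to Hetzner. With Site A and Site B on different providers, keepalived detects the failure and the notify script does the real work by updating DNS. The floating-IP call only helps when both nodes sit with one provider.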

# Install on both Site A and Site B
dnf install -y keepalived  # CentOS/RHEL
# apt install -y keepalived  # Debian

# Site A — /etc/keepalived/keepalived.conf (MASTER)
cat > /etc/keepalived/keepalived.conf << 'EOF'
global_defs {
    router_id SITE_A
    script_user root
    enable_script_security
}

vrrp_script check_services {
    script "/usr/local/bin/check-site-health"
    interval 5
    weight -20
    fall 3
    rise 2
}

vrrp_instance MULTISITE {
    state MASTER
    interface wg0
    virtual_router_id 51
    priority 100
    advert_int 1
    unicast_src_ip 10.10.0.1
    unicast_peer {
        10.10.0.2
    }
    authentication {
        auth_type PASS
        auth_pass changeme_secret
    }
    track_script {
        check_services
    }
    notify_master "/usr/local/bin/failover-become-master"
    notify_backup "/usr/local/bin/failover-become-backup"
}
EOF

# Site B — /etc/keepalived/keepalived.conf (BACKUP)
cat > /etc/keepalived/keepalived.conf << 'EOF'
global_defs {
    router_id SITE_B
    script_user root
    enable_script_security
}

vrrp_script check_services {
    script "/usr/local/bin/check-site-health"
    interval 5
    weight -20
    fall 3
    rise 2
}

vrrp_instance MULTISITE {
    state BACKUP
    interface wg0
    virtual_router_id 51
    priority 90
    advert_int 1
    unicast_src_ip 10.10.0.2
    unicast_peer {
        10.10.0.1
    }
    authentication {
        auth_type PASS
        auth_pass changeme_secret
    }
    track_script {
        check_services
    }
    notify_master "/usr/local/bin/failover-become-master"
    notify_backup "/usr/local/bin/failover-become-backup"
}
EOF

systemctl enable --now keepalived
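
The priority numbers in these two configs decide who wins. With a negative `weight`, keepalived subtracts that amount from a node's priority while its tracked script is failing. The sketch below only reproduces that arithmetic for the values above. `effective_priority` is a hypothetical helper, not a keepalived command.

```shell
# effective_priority BASE WEIGHT STATE
# BASE: configured priority, WEIGHT: negative track_script weight,
# STATE: "ok" if the health check passes, anything else if it fails.
effective_priority() {
    if [ "$3" = "ok" ]; then
        echo "$1"
    else
        echo $(( $1 + $2 ))
    fi
}

effective_priority 100 -20 fail   # Site A, check failing -> prints 80
effective_priority 90 -20 ok      # Site B, check passing -> prints 90
```

Site B at 90 outranks a degraded Site A at 80, so B takes over. The same arithmetic shows a trap: if check-site-health also fails on Site B (for instance because nginx is not running on the standby), B drops to 70 and will not take over from a degraded A that is still sending adverts.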

# /usr/local/bin/check-site-health
cat > /usr/local/bin/check-site-health << 'SCRIPT'
#!/bin/bash
# Return 0 if healthy, 1 if not

# Check ZFS pool is healthy
zpool status rpool | grep -q "state: ONLINE" || exit 1

# Check critical services are running
systemctl is-active --quiet nginx || exit 1
systemctl is-active --quiet postgresql || exit 1

# Check disk space (fail if < 10% free)
AVAIL=$(zfs get -Hp -o value available rpool)
USED=$(zfs get -Hp -o value used rpool)
TOTAL=$((AVAIL + USED))
PCT=$((AVAIL * 100 / TOTAL))
[ "$PCT" -lt 10 ] && exit 1

exit 0
SCRIPT
chmod +x /usr/local/bin/check-site-health

# /usr/local/bin/failover-become-master
cat > /usr/local/bin/failover-become-master << 'SCRIPT'
#!/bin/bash
set -euo pipefail

LOG_TAG="failover"
log() { logger -t "$LOG_TAG" "$*"; echo "$*"; }

log "=== BECOMING MASTER ==="

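# Step 1: Reassign floating IP via provider API
# The $OVH_*, $HETZNER_* and $CF_* variables must be set before this point
# (keepalived does not inherit your login shell's environment, and set -u
# aborts on the first unset one).
# OVH example, simplified: real OVH API calls also need an X-Ovh-Signature header.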
log "Reassigning floating IP to this node..."
curl -s -X POST "https://api.ovh.com/1.0/ip/203.0.113.50/move" \
    -H "X-Ovh-Application: $OVH_APP_KEY" \
    -H "X-Ovh-Consumer: $OVH_CONSUMER_KEY" \
    -H "X-Ovh-Timestamp: $(date +%s)" \
    -d '{"to": "ns12345.ip-XX-XX-XX.eu"}' || true

# Hetzner example (uncomment if using Hetzner):
# curl -s -X POST "https://api.hetzner.cloud/v1/floating_ips/12345/actions/assign" \
#     -H "Authorization: Bearer $HETZNER_API_TOKEN" \
#     -d '{"server": 67890}' || true

# Step 2: Update DNS (Cloudflare example)
log "Updating DNS to point to this node..."
curl -s -X PUT "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records/$CF_RECORD_ID" \
    -H "Authorization: Bearer $CF_API_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"type\":\"A\",\"name\":\"app.example.com\",\"content\":\"$(curl -s ifconfig.me)\",\"ttl\":60}" || true

# Step 3: Start services
log "Starting services..."
systemctl start nginx postgresql app-server

log "=== MASTER TRANSITION COMPLETE ==="
SCRIPT
chmod +x /usr/local/bin/failover-become-master
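
The keepalived configs also reference /usr/local/bin/failover-become-backup, which this recipe does not define. What follows is a guess at a reasonable body, written as a function so it can be read on its own. The service list mirrors failover-become-master. Stopping services on demotion is an assumption about the desired behavior, not something the recipe states.

```shell
# Possible body for /usr/local/bin/failover-become-backup (assumed behavior).
become_backup() {
    logger -t failover "=== BECOMING BACKUP ==="
    # Stop in reverse start order so the app goes away before its database.
    for _svc in app-server postgresql nginx; do
        systemctl stop "$_svc" || logger -t failover "WARN: could not stop $_svc"
    done
    logger -t failover "=== BACKUP TRANSITION COMPLETE ==="
}

# As a standalone script, the file would end with a call to: become_backup
```

If the standby is meant to keep its services running (a warm standby), the loop would be dropped and the script would only log the transition.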

Manual failover: 5 minutes, tops

# On Site B — manual failover script
cat > /usr/local/bin/manual-failover << 'SCRIPT'
#!/bin/bash
set -euo pipefail

echo "=== Manual Failover to Site B ==="
echo ""

# Step 1: Check replication status
echo "Last snapshot received:"
zfs list -t snapshot -r rpool/services -o name,creation -s creation | tail -5
echo ""

# Step 2: Promote this site to primary
echo "Starting services on Site B..."
systemctl start nginx postgresql app-server

# Step 3: Update DNS
echo "Update DNS to point to Site B's IP:"
echo "  app.example.com → $(curl -s ifconfig.me)"
echo ""
echo "Or use the Cloudflare API:"
echo "  curl -X PUT 'https://api.cloudflare.com/client/v4/zones/ZONE/dns_records/RECORD' \\"
echo "    -H 'Authorization: Bearer TOKEN' \\"
echo "    -d '{\"type\":\"A\",\"name\":\"app.example.com\",\"content\":\"$(curl -s ifconfig.me)\",\"ttl\":60}'"
echo ""
echo "=== Site B is now primary ==="
SCRIPT
chmod +x /usr/local/bin/manual-failover

BMaaS is the concept that makes multi-site kldload unique. Traditional cloud: you rent a VM 24/7 and pay whether you use it or not. BMaaS: you rent bare metal by the hour (OVH, Hetzner), zfs send your environment to it in the morning, work all day, zfs send the delta home at night, return the server. Tomorrow, send a different environment to a different server. The hardware is interchangeable — your data lives in ZFS snapshots, not on any particular machine. This inverts the cloud model: instead of your data living in someone else's data center permanently, your data lives at home and you temporarily deploy it to rented compute.

Step 4: BMaaS — Bare Metal as a Service

This is the part that makes cloud architects do a double-take. Use ZFS snapshots to run completely different environments on the same hardware at different times of day. Production during business hours. ML training overnight. Clean slate every morning. One server, two completely different workloads, zero waste. The cloud providers charge you for 24 hours even when you only use 8. We just... turn it off.

The daily cycle

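# Create base snapshots (one-time setup)
# Production base: OS + services configured, no customer data
zfs snapshot -r rpool@clean-base

# ML training base: CUDA drivers, training frameworks, no data
zfs snapshot -r rpool@ml-training-base
# Caveat: "zfs rollback -r" does not recurse into child datasets, and it
# destroys every snapshot newer than its target. Rolling back to clean-base
# therefore deletes ml-training-base if that snapshot was taken later. Keep
# the two bases on separate datasets (or clone them) if you alternate.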

# /usr/local/bin/bmaas-scheduler
cat > /usr/local/bin/bmaas-scheduler << 'SCRIPT'
#!/bin/bash
set -euo pipefail

LOG_TAG="bmaas"
log() { logger -t "$LOG_TAG" "$*"; echo "[$(date '+%H:%M:%S')] $*"; }

ACTION="${1:-status}"

case "$ACTION" in
    deploy-production)
        log "=== Deploying production environment ==="

        # Rollback to clean base
        log "Rolling back to clean-base snapshot..."
        zfs rollback -r rpool@clean-base

        # Deploy current production config
        log "Running Ansible deployment..."
        ansible-playbook -i /srv/ansible/inventory \
            /srv/ansible/deploy-production.yml

        # Start services
        log "Starting production services..."
        systemctl start nginx postgresql redis app-server

        # Update DNS / floating IP
        /usr/local/bin/failover-become-master

        log "=== Production environment ready ==="
        ;;

    teardown-production)
        log "=== Tearing down production ==="

        # Stop services gracefully
        log "Stopping services..."
        systemctl stop app-server redis postgresql nginx

        # Replicate today's data to home lab before teardown
        log "Replicating data to home lab..."
        syncoid --recursive --no-sync-snap \
            rpool/data "10.10.0.3:rpool/archive/$(date +%Y%m%d)/data"
        syncoid --recursive --no-sync-snap \
            rpool/services/db "10.10.0.3:rpool/archive/$(date +%Y%m%d)/db"

        # Rollback to clean base
        log "Rolling back to clean-base..."
        zfs rollback -r rpool@clean-base

        log "=== Teardown complete — hardware is clean ==="
        ;;

    deploy-ml)
        log "=== Deploying ML training environment ==="

        zfs rollback -r rpool@ml-training-base

        # Pull latest training data from home lab
        log "Syncing training data..."
        syncoid --recursive --no-sync-snap \
            "10.10.0.3:rpool/ml/datasets" rpool/ml/datasets

        # Start training
        log "Launching training jobs..."
        /srv/ml/start-training.sh

        log "=== ML environment running ==="
        ;;

    teardown-ml)
        log "=== Tearing down ML environment ==="

        # Export results to home lab
        log "Exporting ML results..."
        syncoid --recursive --no-sync-snap \
            rpool/ml/results "10.10.0.3:rpool/ml/results/$(date +%Y%m%d)"

        zfs rollback -r rpool@clean-base

        log "=== ML teardown complete ==="
        ;;

    status)
        echo "=== BMaaS Status ==="
        echo "Current snapshots:"
        zfs list -t snapshot -o name,creation -s creation | grep "rpool@"
        echo ""
        echo "Dataset usage:"
        zfs list -r -o name,used,avail rpool
        ;;

    *)
        echo "Usage: $0 {deploy-production|teardown-production|deploy-ml|teardown-ml|status}"
        exit 1
        ;;
esac
SCRIPT
chmod +x /usr/local/bin/bmaas-scheduler

The automated schedule

# /etc/cron.d/bmaas-schedule
cat > /etc/cron.d/bmaas-schedule << 'CRON'
# === Weekday BMaaS Schedule ===

# 08:00 — Deploy production
0 8 * * 1-5 root /usr/local/bin/bmaas-scheduler deploy-production 2>&1 | logger -t bmaas

# 16:00 - Tear down production, replicate data home
0 16 * * 1-5 root /usr/local/bin/bmaas-scheduler teardown-production 2>&1 | logger -t bmaas

# 18:00 — Deploy ML training (uses cheap overnight hours)
0 18 * * 1-5 root /usr/local/bin/bmaas-scheduler deploy-ml 2>&1 | logger -t bmaas

# 06:00 — Tear down ML, export results
0 6 * * 2-6 root /usr/local/bin/bmaas-scheduler teardown-ml 2>&1 | logger -t bmaas

# === Weekend: ML training runs 24h ===
0 8 * * 6 root /usr/local/bin/bmaas-scheduler deploy-ml 2>&1 | logger -t bmaas
0 6 * * 1 root /usr/local/bin/bmaas-scheduler teardown-ml 2>&1 | logger -t bmaas
CRON

How is this even possible?

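ZFS rollback is a metadata operation: it doesn't copy data, it discards everything written since the snapshot and repoints the dataset at it. Rolling back a 2TB production environment to a clean base takes seconds, not hours. Two details matter in practice: rollback acts on one dataset at a time (child datasets need their own rollback), and the -r flag destroys any snapshots newer than the target. Then you deploy a completely different stack on top. When you're done, rollback again. The hardware doesn't care what it's running. It's just blocks.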

It's like having a whiteboard that can save and restore its entire contents instantly. Erase, restore yesterday's drawing, erase, restore last week's. The whiteboard doesn't get tired.
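
`zfs rollback` works on a single dataset, so restoring a whole hierarchy means rolling back each dataset that carries the snapshot. This is a minimal sketch of that loop. The helper name `rollback_tree` is hypothetical, and it assumes dataset names contain no whitespace.

```shell
# rollback_tree ROOT LABEL
# Roll back ROOT and every descendant dataset that has a snapshot named @LABEL.
rollback_tree() {
    for _snap in $(zfs list -H -t snapshot -o name -r "$1" |
                   awk -v s="@$2" 'substr($0, length($0) - length(s) + 1) == s'); do
        # -r here means "destroy snapshots newer than the target", not "recurse"
        zfs rollback -r "$_snap" || return 1
    done
}

# Usage on a node (destructive: discards everything written after the snapshot):
#   rollback_tree rpool/services clean-base
```

Because `-r` removes newer snapshots, a base that has to survive repeated rollbacks should be the oldest snapshot on its dataset, or should live on a dataset of its own.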

Step 5: Ephemeral environments

Need a staging environment with a full copy of production data? On AWS, that's a multi-hour snapshot restore and a significant chunk of your monthly bill. With ZFS clones, it's one command and it's ready in under a second. Clones are copy-on-write forks of a snapshot — they cost nearly zero space until data diverges. Spin up ten of them. Nobody cares. They're free.

# Create a production snapshot to clone from
zfs snapshot -r rpool/services@prod-latest

# Developer needs a staging environment — ready in under a second
# -p creates the rpool/staging parent dataset if it does not exist yet
zfs clone -p rpool/services/web@prod-latest rpool/staging/web-$(date +%s)
zfs clone -p rpool/services/db@prod-latest rpool/staging/db-$(date +%s)
zfs clone -p rpool/services/app@prod-latest rpool/staging/app-$(date +%s)

# Full copy of production data, writable, instant, near-zero space

# /usr/local/bin/ephemeral-env
cat > /usr/local/bin/ephemeral-env << 'SCRIPT'
#!/bin/bash
set -euo pipefail

ACTION="${1:-help}"
ENV_NAME="${2:-}"

case "$ACTION" in
    create)
        [ -z "$ENV_NAME" ] && { echo "Usage: $0 create <name>"; exit 1; }
        STAMP=$(date +%s)
        echo "Creating ephemeral environment: $ENV_NAME"

        # Snapshot current production state
        zfs snapshot -r "rpool/services@ephemeral-$ENV_NAME-$STAMP"

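        # Clone all service datasets. -p creates rpool/ephemeral/$ENV_NAME if
        # missing; the explicit mountpoint matches the paths printed below.
        for ds in web db app; do
            zfs clone -p -o mountpoint="/srv/ephemeral/$ENV_NAME/$ds" \
                "rpool/services/$ds@ephemeral-$ENV_NAME-$STAMP" \
                "rpool/ephemeral/$ENV_NAME/$ds"
        done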

        echo "Environment ready:"
        echo "  Web: /srv/ephemeral/$ENV_NAME/web"
        echo "  DB:  /srv/ephemeral/$ENV_NAME/db"
        echo "  App: /srv/ephemeral/$ENV_NAME/app"
        echo ""
        echo "Space used: $(zfs list -H -o used rpool/ephemeral/$ENV_NAME)"
        ;;

    destroy)
        [ -z "$ENV_NAME" ] && { echo "Usage: $0 destroy <name>"; exit 1; }
        echo "Destroying ephemeral environment: $ENV_NAME"

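        # Destroy clones
        zfs destroy -r "rpool/ephemeral/$ENV_NAME"

        echo "Environment destroyed. Clone space reclaimed; the origin snapshots"
        echo "(rpool/services@ephemeral-$ENV_NAME-*) remain until you destroy them."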
        ;;

    list)
        echo "=== Active ephemeral environments ==="
        zfs list -r -o name,used,creation rpool/ephemeral 2>/dev/null || \
            echo "No ephemeral environments"
        ;;

    *)
        echo "Usage: $0 {create|destroy|list} [name]"
        echo ""
        echo "Examples:"
        echo "  $0 create staging-v2    # Clone production for testing"
        echo "  $0 create demo-client   # Clone for client demo"
        echo "  $0 destroy staging-v2   # Clean up when done"
        echo "  $0 list                 # Show all environments"
        ;;
esac
SCRIPT
chmod +x /usr/local/bin/ephemeral-env

A developer breaks the staging environment? zfs destroy and zfs clone again. Fresh copy of production, ready in under a second. No VMs to rebuild, no containers to repull, no databases to restore from a 4-hour-old dump. Just clone. Every developer gets their own full copy of prod. Every QA run starts from a known-good state. This is the workflow that makes people stop mid-sentence when you explain it.
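
Because the environment name ends up inside dataset and snapshot names, it is worth rejecting anything odd before it reaches zfs. This is a minimal sketch of such a check. The allowed character set (lowercase letters, digits, hyphens, not starting with a hyphen) is an assumption chosen to be conservative. ZFS itself accepts more.

```shell
# valid_env_name NAME
# Succeeds only for non-empty names made of a-z, 0-9 and "-" that do not
# start with "-" (which a command could mistake for an option).
valid_env_name() {
    case "$1" in
        "" | -* | *[!a-z0-9-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# In ephemeral-env this could run right after the empty-name check:
#   valid_env_name "$ENV_NAME" || { echo "invalid name: $ENV_NAME"; exit 1; }
```

This rejects names containing "/" or "@", either of which would silently change which dataset or snapshot the script operates on.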

Step 6: Why this works

No egress fees. No API charges. No surprise bills. No vendor lock-in. No "we're raising prices 20% because we can." No "this service is being deprecated, migrate by March." No 200-page compliance questionnaire about where your data lives. Your data lives on your hardware.

ZFS replication is just zfs send piped over SSH through WireGuard — a kernel primitive, not a billable service. The only dependency is bandwidth between your sites. And bandwidth is cheap. Two bare metal servers with unmetered gigabit from OVH or Hetzner cost a fraction of equivalent cloud infrastructure.

The real math

OVH Advance-1: ~$90/month. Hetzner AX41: ~$45/month. Home lab: electricity only. Total: $135/month for a multi-region, auto-failover, ZFS-replicated infrastructure with bare metal performance. The equivalent on AWS — three regions, dedicated instances, cross-region replication, data transfer — would cost $2,000-5,000/month. And you'd still be renting.

AWS is a luxury hotel. This is buying three houses for the price of one hotel room.

Step 7: OVH/Hetzner setup

Practical steps to go from zero to a running multi-site cluster.

Order the hardware

OVH Advance-1 (Montreal):
  - Intel Xeon E-2386G (6c/12t)
  - 32 GB ECC DDR4
  - 2x 512GB NVMe (ZFS mirror)
  - Unmetered 1 Gbps
  - ~$90/month

Hetzner AX41-NVMe (Frankfurt):
  - AMD Ryzen 5 3600 (6c/12t)
  - 64 GB ECC DDR4
  - 2x 512GB NVMe (ZFS mirror)
  - 20 TB traffic included
  - ~$45/month

Home Lab (your hardware):
  - Any x86_64 with 16GB+ RAM
  - 2+ disks for ZFS mirror
  - $0/month (electricity only)

Install kldload on each node

# Boot from kldload ISO via IPMI/iLO/iDRAC (OVH/Hetzner provide KVM-over-IP)
# Or mount ISO via their rescue console

# Unattended install — same on all nodes, just change hostname
cat > /tmp/answers.env << 'EOF'
KLDLOAD_DISTRO=centos
KLDLOAD_DISK=/dev/nvme0n1
KLDLOAD_DISK2=/dev/nvme1n1
KLDLOAD_POOL_TYPE=mirror
KLDLOAD_HOSTNAME=site-a       # Change per node: site-a, site-b, site-c
KLDLOAD_USERNAME=admin
KLDLOAD_PASSWORD=changeme
KLDLOAD_PROFILE=server
KLDLOAD_NET_METHOD=dhcp
EOF
kldload-install-target --config /tmp/answers.env

# After reboot — install WireGuard (included in kldload darksite)
dnf install -y wireguard-tools

# Configure WireGuard mesh (see Step 1 above)
# Configure ZFS replication (see Step 2 above)
# Deploy services via Ansible or manually

Step 8: Monitoring across sites

# /usr/local/bin/multisite-status
cat > /usr/local/bin/multisite-status << 'SCRIPT'
#!/bin/bash
# Run from any site to check the health of all three

SITES=("10.10.0.1:site-a" "10.10.0.2:site-b" "10.10.0.3:site-c")

echo "=============================="
echo "  Multi-Site Cloud Status"
echo "  $(date '+%Y-%m-%d %H:%M:%S')"
echo "=============================="
echo ""

for entry in "${SITES[@]}"; do
    IP="${entry%%:*}"
    NAME="${entry##*:}"

    echo "--- $NAME ($IP) ---"

    # WireGuard tunnel
    if ping -c 1 -W 2 "$IP" > /dev/null 2>&1; then
        RTT=$(ping -c 3 -W 2 "$IP" 2>/dev/null | tail -1 | awk -F'/' '{print $5}')
        echo "  WireGuard: UP (${RTT}ms avg)"
    else
        echo "  WireGuard: DOWN"
        echo ""
        continue
    fi

    # ZFS pool health (via SSH over WireGuard)
    POOL_STATE=$(ssh -o ConnectTimeout=5 "$IP" "zpool status -x" 2>/dev/null)
    echo "  ZFS Pool: ${POOL_STATE:-unreachable}"

    # Last snapshot
    LAST_SNAP=$(ssh -o ConnectTimeout=5 "$IP" \
        "zfs list -t snapshot -H -o name,creation -s creation | tail -1" 2>/dev/null)
    echo "  Last Snapshot: ${LAST_SNAP:-unknown}"

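    # Replication lag (skip the local node, identified by its wg0 address;
    # hostname -I would return the public address first)
    if [ "$IP" != "$(ip -4 -o addr show dev wg0 | awk '{print $4}' | cut -d/ -f1)" ]; then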
        LOCAL_SNAP=$(zfs list -t snapshot -H -o name -s creation -r rpool/services | tail -1)
        REMOTE_SNAP=$(ssh -o ConnectTimeout=5 "$IP" \
            "zfs list -t snapshot -H -o name -s creation -r rpool/services 2>/dev/null | tail -1")
        if [ "$LOCAL_SNAP" = "$REMOTE_SNAP" ]; then
            echo "  Replication: IN SYNC"
        else
            echo "  Replication: LAGGING"
            echo "    Local:  $LOCAL_SNAP"
            echo "    Remote: $REMOTE_SNAP"
        fi
    fi

    # Disk usage
    DISK=$(ssh -o ConnectTimeout=5 "$IP" \
        "zfs list -H -o used,avail rpool" 2>/dev/null)
    echo "  Disk: ${DISK:-unknown}"

    echo ""
done

echo "--- Failover Readiness ---"
echo "  keepalived: $(systemctl is-active keepalived 2>/dev/null || echo 'not installed')"
echo "  WireGuard peers: $(wg show wg0 2>/dev/null | grep -c 'peer:') configured"
echo "  Sanoid timer: $(systemctl is-active sanoid.timer 2>/dev/null || echo 'not running')"
SCRIPT
chmod +x /usr/local/bin/multisite-status
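
One wrinkle in the replication check: zfs list prints full snapshot names, and the dataset path differs on Site C (rpool/backup/site-a/services/... rather than rpool/services/...). Comparing whole names therefore reports LAGGING even when the same snapshot is present. A possible refinement compares only the part after the @. The helper names `snap_label` and `same_snapshot` and the snapshot names in the example are made up for this sketch.

```shell
# snap_label NAME: print the snapshot label, i.e. everything after the last "@".
snap_label() {
    printf '%s\n' "${1##*@}"
}

# same_snapshot LOCAL REMOTE: succeed when both names carry the same label.
same_snapshot() {
    [ "$(snap_label "$1")" = "$(snap_label "$2")" ]
}

# Example with illustrative names:
if same_snapshot "rpool/services/web@autosnap_2024-01-01_00:00:00_frequently" \
                 "rpool/backup/site-a/services/web@autosnap_2024-01-01_00:00:00_frequently"; then
    echo "IN SYNC"
else
    echo "LAGGING"
fi
```

In multisite-status, the `[ "$LOCAL_SNAP" = "$REMOTE_SNAP" ]` test would become `same_snapshot "$LOCAL_SNAP" "$REMOTE_SNAP"`, with the remote listing pointed at the replica's own dataset path.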

Grafana dashboard

# Install node_exporter on all sites (included in kldload)
systemctl enable --now node_exporter

# Install Prometheus + Grafana on Site A (or your monitoring site)
dnf install -y grafana prometheus

# Prometheus config — scrape all three sites over WireGuard
cat > /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'site-a'
    static_configs:
      - targets: ['10.10.0.1:9100']
        labels:
          site: 'montreal'
          role: 'primary'

  - job_name: 'site-b'
    static_configs:
      - targets: ['10.10.0.2:9100']
        labels:
          site: 'frankfurt'
          role: 'secondary'

  - job_name: 'site-c'
    static_configs:
      - targets: ['10.10.0.3:9100']
        labels:
          site: 'homelab'
          role: 'archive'
EOF

systemctl enable --now prometheus grafana-server

Key metrics to watch across sites:

ZFS replication lag — time since last successful syncoid to each target
WireGuard tunnel status — handshake age, bytes transferred, packet loss
Pool health — zpool status on every site, alert on DEGRADED or FAULTED
Failover readiness — keepalived state, service health checks passing
Inter-site latency — eBPF-based RTT monitoring between WireGuard peers
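
The first metric in that list, replication lag, can be exported the same way as any other textfile metric. This is a minimal sketch. The metric name `zfs_replication_lag_seconds`, the helper `lag_metric`, and the output filename in the usage comment are assumptions, not existing exporter conventions.

```shell
# lag_metric TARGET NOW CREATED
# TARGET: label for the replica; NOW and CREATED: Unix timestamps.
# Prints one Prometheus sample with the age of the newest replicated snapshot.
lag_metric() {
    printf 'zfs_replication_lag_seconds{target="%s"} %d\n' "$1" "$(( $2 - $3 ))"
}

# On a node, CREATED could come from the newest snapshot on the replica, e.g.:
#   created=$(ssh 10.10.0.2 zfs get -Hp -o value creation "$snap")
#   lag_metric site-b "$(date +%s)" "$created" \
#       > /var/lib/node_exporter/textfile/zfs_lag.prom
```

An alert on this value exceeding two or three replication intervals catches a stalled syncoid job long before anyone needs the replica.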

eBPF latency monitoring

# Use bpftrace to measure WireGuard tunnel latency in real time
# This traces ICMP round-trip time through the wg0 interface
cat > /usr/local/bin/wg-latency-monitor << 'SCRIPT'
#!/bin/bash
# Quick latency check across all sites — runs every minute via cron
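OUT=/var/lib/node_exporter/textfile/wg_latency.prom
TMP=$(mktemp "$OUT.XXXXXX")
for site in 10.10.0.1 10.10.0.2 10.10.0.3; do
    RTT=$(ping -c 5 -W 2 -I wg0 "$site" 2>/dev/null | tail -1 | awk -F'/' '{print $5}')
    if [ -n "$RTT" ]; then
        echo "wg_rtt_ms{target=\"$site\"} $RTT" >> "$TMP"
    fi
done
# Swap the file in atomically: appending to the live file would accumulate
# duplicate samples on every run
chmod 644 "$TMP" && mv "$TMP" "$OUT"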
SCRIPT
chmod +x /usr/local/bin/wg-latency-monitor

# Expose to Prometheus via node_exporter textfile collector
# (node_exporter must run with --collector.textfile.directory=/var/lib/node_exporter/textfile)
mkdir -p /var/lib/node_exporter/textfile
echo "*/1 * * * * root /usr/local/bin/wg-latency-monitor" > /etc/cron.d/wg-latency
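
Handshake age, the other tunnel metric from the list, can ride the same textfile collector. A rough sketch, assuming the tab-separated "public key, epoch" lines that wg show wg0 latest-handshakes prints. The metric name wg_handshake_age_seconds is made up here:

```shell
# Read "<public-key> <last-handshake-epoch>" lines on stdin and print one
# handshake-age sample per peer.
# $1 = current time in epoch seconds (defaults to date +%s)
wg_handshake_metrics() {
    now=${1:-$(date +%s)}
    awk -v now="$now" '
        NF == 2 {
            # A timestamp of 0 means the peer has never completed a handshake;
            # report -1 so alerts can tell "never" apart from "stale".
            age = ($2 == 0) ? -1 : now - $2
            printf "wg_handshake_age_seconds{peer=\"%s\"} %d\n", $1, age
        }'
}

# Example (write to a temp file and rename it, as in wg-latency-monitor):
#   wg show wg0 latest-handshakes | wg_handshake_metrics > "$TMP" && mv "$TMP" "$OUT"
```

Peers that are exchanging traffic normally renew the handshake every couple of minutes, so an age of many minutes on a busy tunnel is worth a look. An idle tunnel can legitimately show an old handshake.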

Step 9: Security

Every inter-site byte travels through WireGuard. No exceptions. The only port open to the public internet on any node is UDP 51820, the WireGuard listen port. Everything else (SSH, monitoring, replication, management) runs over the encrypted overlay. WireGuard stays silent on packets that fail its cryptographic check, so an attacker scanning your IP gets no answer from that port either. Good luck with that.

# nftables firewall — lock down each site
cat > /etc/nftables.conf << 'NFTEOF'
#!/usr/sbin/nft -f

flush ruleset

table inet multisite {
    chain input {
        type filter hook input priority 0; policy drop;

        # Loopback
        iif lo accept

        # Established connections
        ct state established,related accept

        # WireGuard handshake (only port open to the internet)
        udp dport 51820 accept

        # Everything over WireGuard is trusted
        iifname "wg0" accept

        # ICMP for diagnostics
        ip protocol icmp accept
        ip6 nexthdr icmpv6 accept

        # Log and drop everything else
        limit rate 5/minute log prefix "nft-drop: "
        drop
    }

    chain forward {
        type filter hook forward priority 0; policy drop;

        # Only forward between WireGuard peers
        iifname "wg0" oifname "wg0" accept
        ct state established,related accept
        drop
    }

    chain output {
        type filter hook output priority 0; policy accept;

        # Allow all outbound (we control the server)
    }
}
NFTEOF

nft -f /etc/nftables.conf
systemctl enable nftables
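
The firewall is the first layer. A second one is making sure services don't listen on public addresses to begin with. A small sketch that filters the output of ss -H -lntu for wildcard listeners. The function name public_listeners is hypothetical, and the column layout is the one iproute2's ss prints (local address in the fifth column):

```shell
# Read `ss -H -lntu` output on stdin and print every listener bound to a
# wildcard address, except the WireGuard port itself.
# $1 = WireGuard UDP port (defaults to 51820)
public_listeners() {
    wg_port=${1:-51820}
    awk -v wg_port="$wg_port" '
        {
            proto = $1; local_addr = $5
            port = local_addr; sub(/.*:/, "", port)        # text after the last colon
            addr = local_addr; sub(/:[^:]*$/, "", addr)    # text before the last colon
            wildcard = (addr == "0.0.0.0" || addr == "[::]" || addr == "*")
            if (wildcard && !(proto == "udp" && port == wg_port))
                print proto, addr, port
        }'
}

# Example:
#   ss -H -lntu | public_listeners
```

Anything it prints is still shielded by the drop policy above. Binding those services to the site's 10.10.0.x address means a firewall mistake doesn't expose them.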

Security posture:

Only UDP port 51820 is exposed to the internet (WireGuard)
All management (SSH, monitoring, replication) runs over the WireGuard overlay
ZFS encryption at rest on all sites — data is encrypted even if disks are stolen
Each site has independent nftables rules — compromise of one doesn't open the others
SSH keys only, no password authentication, and only reachable over WireGuard
Sanoid snapshots provide ransomware rollback: snapshots can't be modified in place, and the replicas on other sites survive even if an attacker with root destroys the local ones

# SSH hardening — only listen on WireGuard interface
cat >> /etc/ssh/sshd_config << 'EOF'

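# Only accept SSH over WireGuard
# Change the address per site: .1, .2, .3
# (kept on its own line; some sshd versions reject trailing comments)
ListenAddress 10.10.0.1
PasswordAuthentication no
PermitRootLogin prohibit-password
MaxAuthTries 3
EOF

# sshd keeps the first value it reads for each keyword, so check that none of
# these are already set earlier in sshd_config, then validate before restarting
sshd -t && systemctl restart sshd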

Step 10: Disaster recovery runbook

Four scenarios, four procedures. Print this out and tape it to the rack. Seriously — when Site A is down and your phone is blowing up, you don't want to be scrolling through a wiki. You want a laminated card that says "do this, then this, then this."

Scenario 1: Site A fails — promote Site B (5 minutes)

# Site A is down. Site B has data from ≤15 minutes ago.

# 1. Verify Site A is actually down (not just a WireGuard blip)
ping -c 5 10.10.0.1     # No response
ssh 10.10.0.1 hostname   # Connection refused/timeout

# 2. On Site B — start services
systemctl start nginx postgresql app-server

# 3. Reassign floating IP (or update DNS)
/usr/local/bin/failover-become-master

# 4. Verify
curl -s https://app.example.com/health | jq .

# Total time: ~5 minutes
# Data loss: ≤15 minutes (last replication cycle)
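
Step 1 is the one people skip at 3am. A sketch of a guard that only reports "down" when every probe in a row fails, so a single dropped ping doesn't trigger a failover. The function name confirm_down and the attempt and interval values are illustrative assumptions:

```shell
# Return success only if the probe command fails on every attempt.
# $1 = number of attempts, $2 = seconds between attempts, rest = probe command
confirm_down() {
    attempts=$1; pause=$2; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 1        # the site answered once, so treat this as a blip
        fi
        i=$((i + 1))
        [ "$i" -le "$attempts" ] && sleep "$pause"
    done
    return 0
}

# Example (three probes, 20 seconds apart):
#   confirm_down 3 20 ping -c 3 -W 2 10.10.0.1 && /usr/local/bin/failover-become-master
```

The probe is passed in as a command, so the same guard works with an SSH check or an HTTP health check instead of ping.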

Scenario 2: Site A + B both fail — promote Site C (15 minutes)

# Both rented servers are down. Home lab has nightly backup.

# 1. On Site C — check latest backup
zfs list -t snapshot -r rpool/backup -o name,creation -s creation | tail -10

# 2. Clone backup datasets to primary mountpoints
#    (autosnap_latest is a placeholder: use the newest snapshot name from step 1)
zfs clone rpool/backup/site-a/services@autosnap_latest rpool/services
zfs clone rpool/backup/site-a/data@autosnap_latest rpool/data

# 3. Install and start services (Site C may not have them running)
dnf install -y nginx postgresql-server
systemctl start nginx postgresql app-server

# 4. Update DNS to point to home lab's public IP
#    (or set up a new WireGuard tunnel to expose services)

# Total time: ~15 minutes
# Data loss: ≤24 hours (last nightly backup)
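
Picking the newest snapshot by eye works until you are doing it under pressure. A minimal sketch that pulls the newest snapshot of one dataset out of a creation-sorted listing. The helper name latest_snapshot is hypothetical:

```shell
# Read snapshot names on stdin, oldest first (as printed by
# `zfs list -H -t snapshot -o name -s creation -r <dataset>`), and print the
# newest snapshot that belongs to exactly the dataset given as $1.
latest_snapshot() {
    awk -v ds="$1" 'index($0, ds "@") == 1 { last = $0 } END { if (last != "") print last }'
}

# Example:
#   snap=$(zfs list -H -t snapshot -o name -s creation -r rpool/backup \
#            | latest_snapshot rpool/backup/site-a/services)
#   zfs clone "$snap" rpool/services
```

Matching on the full "dataset@" prefix keeps snapshots of child datasets, or datasets with similar names, out of the result.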

Scenario 3: Full rebuild from scratch (30 minutes)

# Everything is gone. You have the kldload ISO and a ZFS backup.

# 1. Boot kldload ISO on new hardware
# 2. Install kldload (server profile)
kldload-install-target --config /tmp/answers.env

# 3. After reboot — import backup pool (if you have the physical disks)
zpool import backup-pool
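# Or receive from remote backup, one dataset at a time (needs the WireGuard
# tunnel from step 4 up first; @latest is a placeholder for a real snapshot name).
# Don't receive with -F into the pool root: that would overwrite the fresh install.
ssh 10.10.0.3 "zfs send -R rpool/backup/site-a/services@latest" | zfs receive -F rpool/services
ssh 10.10.0.3 "zfs send -R rpool/backup/site-a/data@latest" | zfs receive -F rpool/data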

# 4. Configure WireGuard, start services
# 5. Update DNS

# Total time: ~30 minutes (mostly waiting for ZFS receive)
# Data loss: depends on backup age

Scenario 4: Ransomware — rollback to clean snapshot (seconds)

# Attacker encrypted your files. ZFS snapshots are read-only.

# 1. Identify the last clean snapshot
zfs list -t snapshot -r rpool/services -o name,creation -s creation

# 2. Stop services, then roll back (this destroys everything after the snapshot, which is what we want)
systemctl stop nginx postgresql app-server
zfs rollback -r rpool/services@autosnap_2026-03-27_00:00:00_daily
zfs rollback -r rpool/data@autosnap_2026-03-27_00:00:00_daily

# 3. Start services
systemctl start nginx postgresql app-server

# 4. Investigate how they got in, patch the hole

# Total time: seconds (rollback is instant)
# Data loss: only changes since the snapshot
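
"The last clean snapshot" usually means "the newest one taken before the intrusion started". A sketch of that selection, assuming you have an estimate of when the attack began. The helper name last_snapshot_before is hypothetical:

```shell
# Read "<snapshot-name><TAB><creation-epoch>" lines on stdin (the format of
# `zfs list -Hp -t snapshot -o name,creation`) and print the newest snapshot
# created strictly before the cutoff.
# $1 = cutoff in epoch seconds (for example, the first sign of the intrusion)
last_snapshot_before() {
    awk -F '\t' -v cutoff="$1" '
        $2 < cutoff && $2 > best { best = $2; name = $1 }
        END { if (name != "") print name }'
}

# Example (GNU date assumed for parsing the timestamp):
#   cutoff=$(date -d '2026-03-27 09:30' +%s)
#   zfs list -Hp -t snapshot -o name,creation -d 1 rpool/services | last_snapshot_before "$cutoff"
```

If it prints nothing, no local snapshot predates the cutoff and the clean copy has to come from Site B or Site C.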

Why ZFS snapshots beat ransomware

ZFS snapshots are immutable. An attacker with root can zfs destroy them, but they can't modify them in place: there's no way to encrypt a snapshot without destroying it first. And your off-site replicas (Site B, Site C) have their own independent snapshot chains on completely separate hardware. Even if an attacker compromises Site A and destroys all local snapshots, Site B and Site C still have clean copies. The attacker would need to simultaneously compromise all three sites to actually destroy your data. At that point, they're not a script kiddie. They're a nation-state, and you have bigger problems.

Traditional backups are like keeping a spare key under the doormat. ZFS replication is like having three houses in different countries, each with a copy of everything you own.
