Multi-Site Cloud — Bare Metal as a Service
AWS charges you to move data out of their cloud. They literally penalize you for leaving. They call it "egress fees." We call it a protection racket. What if your infrastructure was just... computers? Computers in different buildings, connected by encrypted tunnels, replicating data between them automatically. No egress fees. No API charges. No 47-page pricing calculator. Just ZFS, WireGuard, and bare metal.
Three or more bare-metal nodes across multiple regions, connected by WireGuard, replicated with ZFS. Ephemeral environments that spin up at 8am and tear down at 4pm. A different workload every night. This isn't a toy — this is the architecture that billion-dollar companies pay millions to run on AWS, except you're doing it on rented bare metal plus your home lab.
What this enables:
- True multi-region failover — data replicated across sites, any site can become primary. Not "we have backups somewhere" — "the other site is already running"
- Bare Metal as a Service (BMaaS) — build at 8am, tear down at 4pm, rent your idle capacity overnight or shut it down entirely. Reduce datacenter emissions by turning hardware off when you're done with it. A novel concept.
- Hourly cloud rental — use OVH/Hetzner bare metal, replicate your data home every night, and return the server when you don't need it. Try doing that on AWS without selling a kidney to cover the egress bill.
- Ransomware immunity — ZFS snapshots are read-only. Attacker encrypts your files? Roll back to the snapshot from an hour ago. In seconds. On all three sites.
Architecture
Three sites. Three ZFS pools. One WireGuard mesh connecting them all. Site A is production. Site B is a hot standby with 15-minute replication lag. Site C is your home lab — the cold replica that survives even if both rented servers get hit by a meteor. Or more realistically, if OVH has another fire.
┌─────────────────────┐     WireGuard      ┌─────────────────────┐
│  SITE A - Primary   │<==================>│  SITE B - Secondary │
│  OVH Montreal       │     encrypted      │  Hetzner Frankfurt  │
│  ┌───────────────┐  │       mesh         │  ┌───────────────┐  │
│  │ rpool/services│  │                    │  │ rpool/services│  │
│  │ rpool/vms     │  │                    │  │ rpool/vms     │  │
│  │ rpool/data    │  │                    │  │ rpool/data    │  │
│  └───────────────┘  │                    │  └───────────────┘  │
└─────────┬───────────┘                    └─────────┬───────────┘
          │                WireGuard                 │
          └────────────────────┬─────────────────────┘
                               │
                      ┌────────┴────────────┐
                      │  SITE C - Home Lab  │
                      │  Your hardware      │
                      │  ┌───────────────┐  │
                      │  │ Cold replica  │  │
                      │  │ rpool/backup  │  │
                      │  └───────────────┘  │
                      └─────────────────────┘
Why three sites?
Two sites protect you from hardware failure. Three sites protect you from provider failure. If your OVH datacenter burns down (it happened in Strasbourg in 2021), your Hetzner node takes over in seconds. If both rented servers vanish simultaneously, your home lab has a nightly copy of everything. Three sites means no single point of failure — not even your wallet.
Step 1: WireGuard mesh between sites
WireGuard is the backbone. Every site gets a tunnel to every other site, forming
a fully meshed overlay network on 10.10.0.0/24. All inter-site traffic —
replication, monitoring, SSH, failover heartbeats — flows through these encrypted
tunnels. The public internet only ever sees WireGuard handshakes on UDP 51820.
Everything else is invisible.
# On each node — generate WireGuard keys
wg genkey | tee /etc/wireguard/private.key | wg pubkey > /etc/wireguard/public.key
chmod 600 /etc/wireguard/private.key
Site A — OVH Montreal (10.10.0.1)
cat > /etc/wireguard/wg0.conf << 'EOF'
[Interface]
Address = 10.10.0.1/24
ListenPort = 51820
PrivateKey =
# Enable forwarding for inter-site routing
PostUp = sysctl -w net.ipv4.ip_forward=1
# Site B — Hetzner Frankfurt
[Peer]
PublicKey =
AllowedIPs = 10.10.0.2/32
Endpoint = site-b.example.com:51820
PersistentKeepalive = 25
# Site C — Home Lab
[Peer]
PublicKey =
AllowedIPs = 10.10.0.3/32
Endpoint = site-c.example.com:51820
PersistentKeepalive = 25
EOF
systemctl enable --now wg-quick@wg0
Site B — Hetzner Frankfurt (10.10.0.2)
cat > /etc/wireguard/wg0.conf << 'EOF'
[Interface]
Address = 10.10.0.2/24
ListenPort = 51820
PrivateKey =
PostUp = sysctl -w net.ipv4.ip_forward=1
# Site A — OVH Montreal
[Peer]
PublicKey =
AllowedIPs = 10.10.0.1/32
Endpoint = site-a.example.com:51820
PersistentKeepalive = 25
# Site C — Home Lab
[Peer]
PublicKey =
AllowedIPs = 10.10.0.3/32
Endpoint = site-c.example.com:51820
PersistentKeepalive = 25
EOF
systemctl enable --now wg-quick@wg0
Site C — Home Lab (10.10.0.3)
cat > /etc/wireguard/wg0.conf << 'EOF'
[Interface]
Address = 10.10.0.3/24
ListenPort = 51820
PrivateKey =
PostUp = sysctl -w net.ipv4.ip_forward=1
# Site A — OVH Montreal
[Peer]
PublicKey =
AllowedIPs = 10.10.0.1/32
Endpoint = site-a.example.com:51820
PersistentKeepalive = 25
# Site B — Hetzner Frankfurt
[Peer]
PublicKey =
AllowedIPs = 10.10.0.2/32
Endpoint = site-b.example.com:51820
PersistentKeepalive = 25
EOF
systemctl enable --now wg-quick@wg0
Verify the mesh:
# From any node — all peers should show a recent handshake
wg show wg0
# Test connectivity
ping -c 3 10.10.0.1 # Site A
ping -c 3 10.10.0.2 # Site B
ping -c 3 10.10.0.3 # Site C
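The three configs above differ only in the local address and the peer list, so the peer stanzas can be generated instead of hand-copied. A minimal sketch, assuming a small site table kept in a shell variable; the helper name `wg_peers_for` and the `PLACEHOLDER_` public keys are hypothetical and not part of wireguard-tools.

```shell
#!/usr/bin/env bash
# Site table: name, overlay IP, public endpoint (values taken from the configs above)
SITES="site-a 10.10.0.1 site-a.example.com
site-b 10.10.0.2 site-b.example.com
site-c 10.10.0.3 site-c.example.com"

# wg_peers_for SELF
# Print one [Peer] stanza for every site except SELF.
# Public keys are left as PLACEHOLDER_<name>; paste the real ones in by hand.
wg_peers_for() {
    local self=$1 name ip endpoint
    while read -r name ip endpoint; do
        [ "$name" = "$self" ] && continue   # a node never peers with itself
        printf '[Peer]\n'
        printf '# %s\n' "$name"
        printf 'PublicKey = PLACEHOLDER_%s\n' "$name"
        printf 'AllowedIPs = %s/32\n' "$ip"
        printf 'Endpoint = %s:51820\n' "$endpoint"
        printf 'PersistentKeepalive = 25\n\n'
    done <<< "$SITES"
}

# Example: the peer section for Site A's wg0.conf
wg_peers_for site-a
```

Adding a fourth site then means adding one line to the table and regenerating the peer section on every node.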
Step 2: ZFS replication topology
This is where ZFS earns its keep. syncoid sends only the changed blocks
since the last snapshot — not the whole dataset, not the whole file, just the
deltas. A 2TB dataset with 50MB of changes sends 50MB. Over an encrypted WireGuard
tunnel. On a schedule. Automatically.
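What syncoid automates is, roughly, an incremental zfs send piped into zfs receive on the other side. The sketch below only prints that pipeline rather than running it, so it is safe to try anywhere. The helper name `incremental_send_cmd` and the snapshot names `snap1` and `snap2` are hypothetical; syncoid works out the matching snapshots for you.

```shell
#!/usr/bin/env bash
# incremental_send_cmd DATASET PREV_SNAP CURR_SNAP HOST TARGET
# Print (do not run) the pipeline that ships only the blocks changed
# between PREV_SNAP and CURR_SNAP to TARGET on HOST.
incremental_send_cmd() {
    local dataset=$1 prev=$2 curr=$3 host=$4 target=$5
    printf 'zfs send -i %s@%s %s@%s | ssh %s zfs receive %s\n' \
        "$dataset" "$prev" "$dataset" "$curr" "$host" "$target"
}

# Example: delta of rpool/data from Site A to Site B over the WireGuard address
incremental_send_cmd rpool/data snap1 snap2 10.10.0.2 rpool/data
```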
The replication topology
Site A pushes to Site B every 15 minutes (hot standby). Both sites push to Site C nightly (cold archive). Any site can become primary by promoting its latest snapshot. The math: worst case, you lose 15 minutes of data if Site A dies. If both A and B die simultaneously, you lose at most 24 hours. If all three die, you have bigger problems than data recovery.
Site A (primary)
│
├── every 15 min ──→ Site B (hot standby)
├── nightly ────────→ Site C (cold archive)
│
Site B (secondary)
│
└── nightly ────────→ Site C (cold archive)
ZFS dataset layout (all sites)
# Create the dataset hierarchy on each site
zfs create -o mountpoint=none rpool/services
zfs create -o mountpoint=/srv/services/web rpool/services/web
zfs create -o mountpoint=/srv/services/db rpool/services/db
zfs create -o mountpoint=/srv/services/app rpool/services/app
zfs create -o mountpoint=none rpool/vms
zfs create -o mountpoint=none rpool/data
zfs create -o mountpoint=/srv/data/shared rpool/data/shared
# Enable encryption at rest
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase \
-o mountpoint=/srv/secrets rpool/secrets
Real-time replication: Site A to Site B (every 15 min)
# Install sanoid/syncoid (included in kldload)
# On Site A — /etc/sanoid/sanoid.conf
cat > /etc/sanoid/sanoid.conf << 'EOF'
[rpool/services]
use_template = production
recursive = yes
[rpool/data]
use_template = production
recursive = yes
[rpool/vms]
use_template = production
recursive = yes
[template_production]
frequently = 4
hourly = 48
daily = 30
monthly = 6
yearly = 0
autosnap = yes
autoprune = yes
EOF
systemctl enable --now sanoid.timer
# Cron: replicate to Site B every 15 minutes over WireGuard
cat > /etc/cron.d/replicate-site-b << 'CRON'
# Site A → Site B: hot replication every 15 minutes
*/15 * * * * root /usr/local/bin/replicate-to-site-b 2>&1 | logger -t zfs-replicate
CRON
# /usr/local/bin/replicate-to-site-b
cat > /usr/local/bin/replicate-to-site-b << 'SCRIPT'
#!/bin/bash
set -euo pipefail
SITE_B="10.10.0.2"
LOG_TAG="replicate-site-b"
log() { logger -t "$LOG_TAG" "$*"; }
# Check WireGuard tunnel is up
if ! ping -c 1 -W 3 "$SITE_B" > /dev/null 2>&1; then
log "ERROR: Site B ($SITE_B) unreachable — skipping replication"
exit 1
fi
log "Starting replication to Site B"
# Replicate each top-level dataset recursively
for ds in rpool/services rpool/data rpool/vms; do
log "Replicating $ds"
syncoid --recursive --no-sync-snap \
--sendoptions="w" \
"$ds" "$SITE_B:$ds" 2>&1 | while read -r line; do
log "$ds: $line"
done
done
log "Replication to Site B complete"
SCRIPT
chmod +x /usr/local/bin/replicate-to-site-b
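A 15-minute cron has one failure mode worth guarding against: if a replication run takes longer than 15 minutes, the next one starts on top of it. A minimal sketch of a lock wrapper, assuming a lock directory path of your choosing; the helper name `run_locked` is hypothetical. It relies on mkdir being atomic. A lock left behind by a crash has to be removed by hand, which is the trade-off against flock(1).

```shell
#!/usr/bin/env bash
# run_locked LOCKDIR CMD [ARGS...]
# Run CMD only if LOCKDIR can be created; mkdir is atomic, so only one
# caller wins. Returns 75 when another run already holds the lock,
# otherwise the exit status of CMD.
run_locked() {
    local lockdir=$1 status
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "another run holds $lockdir, skipping" >&2
        return 75
    fi
    "$@"
    status=$?
    rmdir "$lockdir"
    return "$status"
}

# Usage in the cron line (not executed here):
#   run_locked /run/lock/replicate-site-b.lock /usr/local/bin/replicate-to-site-b
```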
Nightly backup: both sites to Site C
# On Site A — nightly push to Site C (home lab)
cat > /etc/cron.d/replicate-site-c << 'CRON'
# Site A → Site C: nightly cold backup at 02:00
0 2 * * * root syncoid --recursive --no-sync-snap rpool/services 10.10.0.3:rpool/backup/site-a/services 2>&1 | logger -t replicate-site-c
15 2 * * * root syncoid --recursive --no-sync-snap rpool/data 10.10.0.3:rpool/backup/site-a/data 2>&1 | logger -t replicate-site-c
30 2 * * * root syncoid --recursive --no-sync-snap rpool/vms 10.10.0.3:rpool/backup/site-a/vms 2>&1 | logger -t replicate-site-c
CRON
# On Site B — nightly push to Site C at 03:00 (staggered)
cat > /etc/cron.d/replicate-site-c << 'CRON'
# Site B → Site C: nightly cold backup at 03:00
0 3 * * * root syncoid --recursive --no-sync-snap rpool/services 10.10.0.3:rpool/backup/site-b/services 2>&1 | logger -t replicate-site-c
15 3 * * * root syncoid --recursive --no-sync-snap rpool/data 10.10.0.3:rpool/backup/site-b/data 2>&1 | logger -t replicate-site-c
30 3 * * * root syncoid --recursive --no-sync-snap rpool/vms 10.10.0.3:rpool/backup/site-b/vms 2>&1 | logger -t replicate-site-c
CRON
Site C (home lab) dataset layout after replication:
rpool/backup/
├── site-a/
│   ├── services/
│   ├── data/
│   └── vms/
└── site-b/
    ├── services/
    ├── data/
    └── vms/
Step 3: Failover strategy
The whole point of multi-site is that when one site dies, the other takes over. Two modes: automated (keepalived detects failure and reassigns the floating IP in seconds — your users might not even notice) or manual (SSH in, run a script, update DNS, done in under 5 minutes). Pick your adventure based on how much downtime your boss/spouse/conscience will tolerate.
Automated: keepalived + floating IP
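OVH and Hetzner both let you move an address between servers via API (OVH calls it an
Additional IP, Hetzner a failover or floating IP). That address can only move between
servers at the same provider, so when Site A and Site B sit at different providers, the
DNS update in the failover script is what redirects traffic. keepalived monitors the
primary over wg0 and runs the failover script when it goes down.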
# Install on both Site A and Site B
dnf install -y keepalived # CentOS/RHEL
# apt install -y keepalived # Debian
# Site A — /etc/keepalived/keepalived.conf (MASTER)
cat > /etc/keepalived/keepalived.conf << 'EOF'
global_defs {
router_id SITE_A
script_user root
enable_script_security
}
vrrp_script check_services {
script "/usr/local/bin/check-site-health"
interval 5
weight -20
fall 3
rise 2
}
vrrp_instance MULTISITE {
state MASTER
interface wg0
virtual_router_id 51
priority 100
advert_int 1
unicast_src_ip 10.10.0.1
unicast_peer {
10.10.0.2
}
authentication {
auth_type PASS
auth_pass changeme_secret
}
track_script {
check_services
}
notify_master "/usr/local/bin/failover-become-master"
notify_backup "/usr/local/bin/failover-become-backup"
}
EOF
# Site B — /etc/keepalived/keepalived.conf (BACKUP)
cat > /etc/keepalived/keepalived.conf << 'EOF'
global_defs {
router_id SITE_B
script_user root
enable_script_security
}
vrrp_script check_services {
script "/usr/local/bin/check-site-health"
interval 5
weight -20
fall 3
rise 2
}
vrrp_instance MULTISITE {
state BACKUP
interface wg0
virtual_router_id 51
priority 90
advert_int 1
unicast_src_ip 10.10.0.2
unicast_peer {
10.10.0.1
}
authentication {
auth_type PASS
auth_pass changeme_secret
}
track_script {
check_services
}
notify_master "/usr/local/bin/failover-become-master"
notify_backup "/usr/local/bin/failover-become-backup"
}
EOF
systemctl enable --now keepalived
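The priorities and the weight in these configs decide who becomes master, and the arithmetic is easy to check by hand. A small sketch of it, using the numbers from the configs above; the function names are hypothetical, and tie-breaking between equal priorities is not modelled.

```shell
#!/usr/bin/env bash
# effective_priority BASE WEIGHT HEALTHY
# With a negative weight, keepalived lowers the priority by that amount
# while the tracked script is failing.
effective_priority() {
    local base=$1 weight=$2 healthy=$3
    if [ "$healthy" = yes ]; then
        echo "$base"
    else
        echo $((base + weight))
    fi
}

# elected A_PRIO B_PRIO: the higher effective priority wins (ties not modelled)
elected() {
    if [ "$1" -ge "$2" ]; then echo site-a; else echo site-b; fi
}

a=$(effective_priority 100 -20 no)    # Site A failing its health check
b=$(effective_priority 90 -20 yes)    # Site B healthy
echo "A=$a B=$b master=$(elected "$a" "$b")"   # prints: A=80 B=90 master=site-b
```

One thing the numbers show: if the standby also fails its own check (for example because nginx is stopped there), it drops to 70 and no longer outranks a failing primary at 80. In that case Site B only takes over once Site A stops sending advertisements altogether.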
# /usr/local/bin/check-site-health
cat > /usr/local/bin/check-site-health << 'SCRIPT'
#!/bin/bash
# Return 0 if healthy, 1 if not
# Check ZFS pool is healthy
zpool status rpool | grep -q "state: ONLINE" || exit 1
# Check critical services are running
systemctl is-active --quiet nginx || exit 1
systemctl is-active --quiet postgresql || exit 1
# Check disk space (fail if < 10% free)
AVAIL=$(zfs get -Hp -o value available rpool)
USED=$(zfs get -Hp -o value used rpool)
TOTAL=$((AVAIL + USED))
PCT=$((AVAIL * 100 / TOTAL))
[ "$PCT" -lt 10 ] && exit 1
exit 0
SCRIPT
chmod +x /usr/local/bin/check-site-health
# /usr/local/bin/failover-become-master
cat > /usr/local/bin/failover-become-master << 'SCRIPT'
#!/bin/bash
set -euo pipefail
LOG_TAG="failover"
log() { logger -t "$LOG_TAG" "$*"; echo "$*"; }
log "=== BECOMING MASTER ==="
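# Step 1: Reassign floating IP via provider API
# OVH example (skeleton): real OVH API calls also need an X-Ovh-Signature header
# computed from your application secret. OVH_APP_KEY and OVH_CONSUMER_KEY must be
# set in this script's environment, or set -u aborts the script here.
log "Reassigning floating IP to this node..."
curl -s -X POST "https://api.ovh.com/1.0/ip/203.0.113.50/move" \
-H "X-Ovh-Application: $OVH_APP_KEY" \
-H "X-Ovh-Consumer: $OVH_CONSUMER_KEY" \
-H "X-Ovh-Timestamp: $(date +%s)" \
-H "Content-Type: application/json" \
-d '{"to": "ns12345.ip-XX-XX-XX.eu"}' || true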
# Hetzner example (uncomment if using Hetzner):
# curl -s -X POST "https://api.hetzner.cloud/v1/floating_ips/12345/actions/assign" \
# -H "Authorization: Bearer $HETZNER_API_TOKEN" \
# -d '{"server": 67890}' || true
# Step 2: Update DNS (Cloudflare example)
log "Updating DNS to point to this node..."
curl -s -X PUT "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records/$CF_RECORD_ID" \
-H "Authorization: Bearer $CF_API_TOKEN" \
-H "Content-Type: application/json" \
-d "{\"type\":\"A\",\"name\":\"app.example.com\",\"content\":\"$(curl -s ifconfig.me)\",\"ttl\":60}" || true
# Step 3: Start services
log "Starting services..."
systemctl start nginx postgresql app-server
log "=== MASTER TRANSITION COMPLETE ==="
SCRIPT
chmod +x /usr/local/bin/failover-become-master
Manual failover: 5 minutes, tops
# On Site B — manual failover script
cat > /usr/local/bin/manual-failover << 'SCRIPT'
#!/bin/bash
set -euo pipefail
echo "=== Manual Failover to Site B ==="
echo ""
# Step 1: Check replication status
echo "Last snapshot received:"
zfs list -t snapshot -r rpool/services -o name,creation -s creation | tail -5
echo ""
# Step 2: Promote this site to primary
echo "Starting services on Site B..."
systemctl start nginx postgresql app-server
# Step 3: Update DNS
echo "Update DNS to point to Site B's IP:"
echo " app.example.com → $(curl -s ifconfig.me)"
echo ""
echo "Or use the Cloudflare API:"
echo " curl -X PUT 'https://api.cloudflare.com/client/v4/zones/ZONE/dns_records/RECORD' \\"
echo " -H 'Authorization: Bearer TOKEN' \\"
echo " -d '{\"type\":\"A\",\"name\":\"app.example.com\",\"content\":\"$(curl -s ifconfig.me)\",\"ttl\":60}'"
echo ""
echo "=== Site B is now primary ==="
SCRIPT
chmod +x /usr/local/bin/manual-failover
Rent a server, zfs send your environment to it in the morning, work all day, zfs send the delta home at night, return the server. Tomorrow, send a different environment to a different server. The hardware is interchangeable — your data lives in ZFS snapshots, not on any particular machine. This inverts the cloud model: instead of your data living in someone else's data center permanently, your data lives at home and you temporarily deploy it to rented compute.
Step 4: BMaaS — Bare Metal as a Service
This is the part that makes cloud architects do a double-take. Use ZFS snapshots to run completely different environments on the same hardware at different times of day. Production during business hours. ML training overnight. Clean slate every morning. One server, two completely different workloads, zero waste. The cloud providers charge you for 24 hours even when you only use 8. We just... turn it off.
The daily cycle
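# Create base snapshots (one-time setup)
# Note: zfs rollback works on one dataset at a time. Its -r flag destroys newer
# snapshots of that dataset; it does not recurse into child datasets. Rolling back
# to the older of these two bases therefore destroys the newer one, so keep each
# environment in its own dataset (or clone) if both bases must survive.
# Production base: OS + services configured, no customer data
zfs snapshot -r rpool@clean-base
# ML training base: CUDA drivers, training frameworks, no data
zfs snapshot -r rpool@ml-training-base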
# /usr/local/bin/bmaas-scheduler
cat > /usr/local/bin/bmaas-scheduler << 'SCRIPT'
#!/bin/bash
set -euo pipefail
LOG_TAG="bmaas"
log() { logger -t "$LOG_TAG" "$*"; echo "[$(date '+%H:%M:%S')] $*"; }
ACTION="${1:-status}"
case "$ACTION" in
deploy-production)
log "=== Deploying production environment ==="
# Rollback to clean base
log "Rolling back to clean-base snapshot..."
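# zfs rollback is per dataset, so roll back every dataset carrying this snapshot
zfs list -H -o name -t snapshot -r rpool | grep '@clean-base$' | xargs -r -n1 zfs rollback -r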
# Deploy current production config
log "Running Ansible deployment..."
ansible-playbook -i /srv/ansible/inventory \
/srv/ansible/deploy-production.yml
# Start services
log "Starting production services..."
systemctl start nginx postgresql redis app-server
# Update DNS / floating IP
/usr/local/bin/failover-become-master
log "=== Production environment ready ==="
;;
teardown-production)
log "=== Tearing down production ==="
# Stop services gracefully
log "Stopping services..."
systemctl stop app-server redis postgresql nginx
# Replicate today's data to home lab before teardown
log "Replicating data to home lab..."
syncoid --recursive --no-sync-snap \
rpool/data "10.10.0.3:rpool/archive/$(date +%Y%m%d)/data"
syncoid --recursive --no-sync-snap \
rpool/services/db "10.10.0.3:rpool/archive/$(date +%Y%m%d)/db"
# Rollback to clean base
log "Rolling back to clean-base..."
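# zfs rollback is per dataset, so roll back every dataset carrying this snapshot
zfs list -H -o name -t snapshot -r rpool | grep '@clean-base$' | xargs -r -n1 zfs rollback -r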
log "=== Teardown complete — hardware is clean ==="
;;
deploy-ml)
log "=== Deploying ML training environment ==="
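# zfs rollback is per dataset, so roll back every dataset carrying this snapshot
zfs list -H -o name -t snapshot -r rpool | grep '@ml-training-base$' | xargs -r -n1 zfs rollback -r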
# Pull latest training data from home lab
log "Syncing training data..."
syncoid --recursive --no-sync-snap \
"10.10.0.3:rpool/ml/datasets" rpool/ml/datasets
# Start training
log "Launching training jobs..."
/srv/ml/start-training.sh
log "=== ML environment running ==="
;;
teardown-ml)
log "=== Tearing down ML environment ==="
# Export results to home lab
log "Exporting ML results..."
syncoid --recursive --no-sync-snap \
rpool/ml/results "10.10.0.3:rpool/ml/results/$(date +%Y%m%d)"
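# zfs rollback is per dataset, so roll back every dataset carrying this snapshot
zfs list -H -o name -t snapshot -r rpool | grep '@clean-base$' | xargs -r -n1 zfs rollback -r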
log "=== ML teardown complete ==="
;;
status)
echo "=== BMaaS Status ==="
echo "Current snapshots:"
zfs list -t snapshot -o name,creation -s creation | grep "rpool@"
echo ""
echo "Dataset usage:"
zfs list -r -o name,used,avail rpool
;;
*)
echo "Usage: $0 {deploy-production|teardown-production|deploy-ml|teardown-ml|status}"
exit 1
;;
esac
SCRIPT
chmod +x /usr/local/bin/bmaas-scheduler
The automated schedule
# /etc/cron.d/bmaas-schedule
cat > /etc/cron.d/bmaas-schedule << 'CRON'
# === Weekday BMaaS Schedule ===
# 08:00 — Deploy production
0 8 * * 1-5 root /usr/local/bin/bmaas-scheduler deploy-production 2>&1 | logger -t bmaas
# 17:00 — Tear down production, replicate data home
0 17 * * 1-5 root /usr/local/bin/bmaas-scheduler teardown-production 2>&1 | logger -t bmaas
# 18:00 — Deploy ML training (uses cheap overnight hours)
0 18 * * 1-5 root /usr/local/bin/bmaas-scheduler deploy-ml 2>&1 | logger -t bmaas
# 06:00 — Tear down ML, export results
0 6 * * 2-6 root /usr/local/bin/bmaas-scheduler teardown-ml 2>&1 | logger -t bmaas
# === Weekend: ML training runs 24h ===
0 8 * * 6 root /usr/local/bin/bmaas-scheduler deploy-ml 2>&1 | logger -t bmaas
0 6 * * 1 root /usr/local/bin/bmaas-scheduler teardown-ml 2>&1 | logger -t bmaas
CRON
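When something looks off at 7am, it helps to know what the schedule thinks should be running. A minimal sketch that mirrors the cron entries above as a pure function; the name `bmaas_env_for` is hypothetical, and it assumes the deploy and teardown jobs finish within their hour.

```shell
#!/usr/bin/env bash
# bmaas_env_for DOW HOUR
#   DOW:  1=Monday .. 7=Sunday (as printed by `date +%u`)
#   HOUR: 0..23 (leading zeros are accepted)
# Prints which environment the cron schedule expects: production, ml, or idle.
bmaas_env_for() {
    local dow=$1 hour=$((10#$2))
    if [ "$dow" -eq 7 ]; then echo ml; return; fi        # Sunday: weekend ML run
    if [ "$hour" -lt 6 ]; then echo ml; return; fi       # overnight run still active
    if [ "$hour" -lt 8 ]; then echo idle; return; fi     # 06:00 teardown, 08:00 deploy
    if [ "$dow" -eq 6 ]; then echo ml; return; fi        # Saturday from 08:00: ML
    if [ "$hour" -lt 17 ]; then echo production; return; fi
    if [ "$hour" -lt 18 ]; then echo idle; return; fi    # 17:00 teardown, 18:00 deploy
    echo ml
}

# What should be running right now?
bmaas_env_for "$(date +%u)" "$(date +%H)"
```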
How is this even possible?
ZFS rollback is a metadata operation. It copies nothing; it discards every block written after the snapshot and makes the snapshot's state live again. Rolling back a 2TB production environment to a clean base takes seconds, not hours. Then you deploy a completely different stack on top. When you're done, roll back again. One caveat: rollback works per dataset, so each child dataset has to be rolled back to its own copy of the snapshot. The hardware doesn't care what it's running. It's just blocks.
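Because zfs rollback acts on a single dataset (its -r flag destroys newer snapshots, it does not descend into children), rolling back a whole tree means visiting each dataset that carries the snapshot. A minimal sketch of that loop; the helper name `rollback_tree` is hypothetical, and it assumes the snapshot was taken with zfs snapshot -r so every child has one of the same name.

```shell
#!/usr/bin/env bash
# rollback_tree ROOT SNAPNAME
# Roll back every dataset under ROOT that has a snapshot called SNAPNAME.
# Datasets without that snapshot are left untouched.
rollback_tree() {
    local root=$1 snap=$2 ds
    zfs list -H -o name -t snapshot -r "$root" |
    while IFS= read -r ds; do
        case "$ds" in
            *"@$snap") zfs rollback -r "$ds" ;;   # -r: drop snapshots newer than SNAPNAME
        esac
    done
}

# Usage (destructive, so not executed here):
#   rollback_tree rpool clean-base
```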
Step 5: Ephemeral environments
Need a staging environment with a full copy of production data? On AWS, that's a multi-hour snapshot restore and a significant chunk of your monthly bill. With ZFS clones, it's one command and it's ready in under a second. Clones are copy-on-write forks of a snapshot — they cost nearly zero space until data diverges. Spin up ten of them. Nobody cares. They're free.
# Create a production snapshot to clone from
zfs snapshot -r rpool/services@prod-latest
# Developer needs a staging environment — ready in under a second
# (-p creates the missing rpool/staging parent)
zfs clone -p rpool/services/web@prod-latest rpool/staging/web-$(date +%s)
zfs clone -p rpool/services/db@prod-latest rpool/staging/db-$(date +%s)
zfs clone -p rpool/services/app@prod-latest rpool/staging/app-$(date +%s)
# Full copy of production data, writable, instant, near-zero space
# /usr/local/bin/ephemeral-env
cat > /usr/local/bin/ephemeral-env << 'SCRIPT'
#!/bin/bash
set -euo pipefail
ACTION="${1:-help}"
ENV_NAME="${2:-}"
case "$ACTION" in
create)
[ -z "$ENV_NAME" ] && { echo "Usage: $0 create "; exit 1; }
STAMP=$(date +%s)
echo "Creating ephemeral environment: $ENV_NAME"
# Snapshot current production state
zfs snapshot -r "rpool/services@ephemeral-$ENV_NAME-$STAMP"
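# Clone all service datasets (-p creates missing parents; the explicit
# mountpoint makes the paths printed below accurate)
for ds in web db app; do
zfs clone -p -o mountpoint="/srv/ephemeral/$ENV_NAME/$ds" \
"rpool/services/$ds@ephemeral-$ENV_NAME-$STAMP" \
"rpool/ephemeral/$ENV_NAME/$ds"
done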
echo "Environment ready:"
echo " Web: /srv/ephemeral/$ENV_NAME/web"
echo " DB: /srv/ephemeral/$ENV_NAME/db"
echo " App: /srv/ephemeral/$ENV_NAME/app"
echo ""
echo "Space used: $(zfs list -H -o used rpool/ephemeral/$ENV_NAME)"
;;
destroy)
[ -z "$ENV_NAME" ] && { echo "Usage: $0 destroy "; exit 1; }
echo "Destroying ephemeral environment: $ENV_NAME"
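# Destroy clones
zfs destroy -r "rpool/ephemeral/$ENV_NAME"
# Destroy the origin snapshots too; until they are gone they keep pinning space
zfs list -H -o name -t snapshot -r rpool/services \
| { grep -E "@ephemeral-$ENV_NAME-[0-9]+\$" || true; } | xargs -r -n1 zfs destroy
echo "Environment destroyed. Space reclaimed."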
;;
list)
echo "=== Active ephemeral environments ==="
zfs list -r -o name,used,creation rpool/ephemeral 2>/dev/null || \
echo "No ephemeral environments"
;;
*)
echo "Usage: $0 {create|destroy|list} [name]"
echo ""
echo "Examples:"
echo " $0 create staging-v2 # Clone production for testing"
echo " $0 create demo-client # Clone for client demo"
echo " $0 destroy staging-v2 # Clean up when done"
echo " $0 list # Show all environments"
;;
esac
SCRIPT
chmod +x /usr/local/bin/ephemeral-env
A developer breaks the staging environment? zfs destroy and zfs clone again.
Fresh copy of production, ready in under a second. No VMs to rebuild, no containers to
repull, no databases to restore from a 4-hour-old dump. Just clone. Every developer gets
their own full copy of prod. Every QA run starts from a known-good state. This is the
workflow that makes people stop mid-sentence when you explain it.
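Since the environment name ends up inside a ZFS dataset name, it is worth rejecting names ZFS would refuse (or that would quietly nest datasets) before any snapshot is taken. A minimal sketch, assuming the usual ZFS rule that a name component may contain letters, digits, underscore, hyphen, colon, and period; the function name `valid_env_name` is hypothetical.

```shell
#!/usr/bin/env bash
# valid_env_name NAME
# Succeeds if NAME is non-empty and contains only characters allowed in a
# single ZFS dataset name component. "/" and "@" are rejected because they
# are separators, not name characters.
valid_env_name() {
    case "$1" in
        ''|*[!A-Za-z0-9_.:-]*) return 1 ;;
        *) return 0 ;;
    esac
}

for name in staging-v2 'demo client' 'a/b'; do
    if valid_env_name "$name"; then
        echo "ok: $name"
    else
        echo "rejected: $name"
    fi
done
```

Calling this at the top of the create branch keeps a typo from producing a half-created environment.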
Step 6: Why this works
No egress fees. No API charges. No surprise bills. No vendor lock-in. No "we're raising prices 20% because we can." No "this service is being deprecated, migrate by March." No 200-page compliance questionnaire about where your data lives. Your data lives on your hardware.
ZFS replication is just zfs send piped over SSH through WireGuard —
a kernel primitive, not a billable service. The only dependency is bandwidth
between your sites. And bandwidth is cheap. Two bare metal servers with unmetered
gigabit from OVH or Hetzner cost a fraction of equivalent cloud infrastructure.
The real math
OVH Advance-1: ~$90/month. Hetzner AX41: ~$45/month. Home lab: electricity only. Total: $135/month for a multi-region, auto-failover, ZFS-replicated infrastructure with bare metal performance. The equivalent on AWS — three regions, dedicated instances, cross-region replication, data transfer — would cost $2,000-5,000/month. And you'd still be renting.
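The arithmetic behind those numbers, spelled out so you can plug in your own prices. This is a sketch using only the figures quoted above; the AWS range is the article's estimate, not a quote, and the function name `monthly_total` is hypothetical.

```shell
#!/usr/bin/env bash
# monthly_total PRICE...
# Sum a list of whole-dollar monthly prices.
monthly_total() {
    local sum=0 n
    for n in "$@"; do
        sum=$((sum + n))
    done
    echo "$sum"
}

total=$(monthly_total 90 45 0)     # OVH + Hetzner + home lab
aws_low=2000
aws_high=5000

echo "bare metal: \$$total/month"
echo "difference vs quoted AWS range: \$$((aws_low - total)) to \$$((aws_high - total)) per month"
echo "over a year: \$$(( (aws_low - total) * 12 )) to \$$(( (aws_high - total) * 12 ))"
```

With the prices above this prints $135/month, a difference of $1865 to $4865 per month, and $22380 to $58380 over a year.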
Step 7: OVH/Hetzner setup
Practical steps to go from zero to a running multi-site cluster.
Order the hardware
OVH Advance-1 (Montreal):
- Intel Xeon E-2386G (6c/12t)
- 32 GB ECC DDR4
- 2x 512GB NVMe (ZFS mirror)
- Unmetered 1 Gbps
- ~$90/month
Hetzner AX41-NVMe (Frankfurt):
- AMD Ryzen 5 3600 (6c/12t)
- 64 GB ECC DDR4
- 2x 512GB NVMe (ZFS mirror)
- 20 TB traffic included
- ~$45/month
Home Lab (your hardware):
- Any x86_64 with 16GB+ RAM
- 2+ disks for ZFS mirror
- $0/month (electricity only)
Install kldload on each node
# Boot from kldload ISO via IPMI/iLO/iDRAC (OVH/Hetzner provide KVM-over-IP)
# Or mount ISO via their rescue console
# Unattended install — same on all nodes, just change hostname
cat > /tmp/answers.env << 'EOF'
KLDLOAD_DISTRO=centos
KLDLOAD_DISK=/dev/nvme0n1
KLDLOAD_DISK2=/dev/nvme1n1
KLDLOAD_POOL_TYPE=mirror
KLDLOAD_HOSTNAME=site-a # Change per node: site-a, site-b, site-c
KLDLOAD_USERNAME=admin
KLDLOAD_PASSWORD=changeme
KLDLOAD_PROFILE=server
KLDLOAD_NET_METHOD=dhcp
EOF
kldload-install-target --config /tmp/answers.env
# After reboot — install WireGuard (included in kldload darksite)
dnf install -y wireguard-tools
# Configure WireGuard mesh (see Step 1 above)
# Configure ZFS replication (see Step 2 above)
# Deploy services via Ansible or manually
Step 8: Monitoring across sites
# /usr/local/bin/multisite-status
cat > /usr/local/bin/multisite-status << 'SCRIPT'
#!/bin/bash
# Run from any site to check the health of all three
SITES=("10.10.0.1:site-a" "10.10.0.2:site-b" "10.10.0.3:site-c")
echo "=============================="
echo " Multi-Site Cloud Status"
echo " $(date '+%Y-%m-%d %H:%M:%S')"
echo "=============================="
echo ""
for entry in "${SITES[@]}"; do
IP="${entry%%:*}"
NAME="${entry##*:}"
echo "--- $NAME ($IP) ---"
# WireGuard tunnel
if ping -c 1 -W 2 "$IP" > /dev/null 2>&1; then
RTT=$(ping -c 3 -W 2 "$IP" 2>/dev/null | tail -1 | awk -F'/' '{print $5}')
echo " WireGuard: UP (${RTT}ms avg)"
else
echo " WireGuard: DOWN"
echo ""
continue
fi
# ZFS pool health (via SSH over WireGuard)
POOL_STATE=$(ssh -o ConnectTimeout=5 "$IP" "zpool status -x" 2>/dev/null)
echo " ZFS Pool: ${POOL_STATE:-unreachable}"
# Last snapshot
LAST_SNAP=$(ssh -o ConnectTimeout=5 "$IP" \
"zfs list -t snapshot -H -o name,creation -s creation | tail -1" 2>/dev/null)
echo " Last Snapshot: ${LAST_SNAP:-unknown}"
# Replication lag (skip the local node, identified by its wg0 address)
if [ "$IP" != "$(ip -4 -o addr show wg0 | awk '{print $4}' | cut -d/ -f1)" ]; then
LOCAL_SNAP=$(zfs list -t snapshot -H -o name -s creation -r rpool/services | tail -1)
REMOTE_SNAP=$(ssh -o ConnectTimeout=5 "$IP" \
"zfs list -t snapshot -H -o name -s creation -r rpool/services 2>/dev/null | tail -1")
if [ "$LOCAL_SNAP" = "$REMOTE_SNAP" ]; then
echo " Replication: IN SYNC"
else
echo " Replication: LAGGING"
echo " Local: $LOCAL_SNAP"
echo " Remote: $REMOTE_SNAP"
fi
fi
# Disk usage
DISK=$(ssh -o ConnectTimeout=5 "$IP" \
"zfs list -H -o used,avail rpool" 2>/dev/null)
echo " Disk: ${DISK:-unknown}"
echo ""
done
echo "--- Failover Readiness ---"
echo " keepalived: $(systemctl is-active keepalived 2>/dev/null || echo 'not installed')"
echo " WireGuard peers: $(wg show wg0 2>/dev/null | grep -c 'peer:') configured"
echo " Sanoid timer: $(systemctl is-active sanoid.timer 2>/dev/null || echo 'not running')"
SCRIPT
chmod +x /usr/local/bin/multisite-status
Grafana dashboard
# Install node_exporter on all sites (included in kldload)
systemctl enable --now node_exporter
# Install Prometheus + Grafana on Site A (or your monitoring site)
dnf install -y grafana prometheus
# Prometheus config — scrape all three sites over WireGuard
cat > /etc/prometheus/prometheus.yml << 'EOF'
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'site-a'
static_configs:
- targets: ['10.10.0.1:9100']
labels:
site: 'montreal'
role: 'primary'
- job_name: 'site-b'
static_configs:
- targets: ['10.10.0.2:9100']
labels:
site: 'frankfurt'
role: 'secondary'
- job_name: 'site-c'
static_configs:
- targets: ['10.10.0.3:9100']
labels:
site: 'homelab'
role: 'archive'
EOF
systemctl enable --now prometheus grafana-server
Key metrics to watch across sites:
- ZFS replication lag — time since last successful syncoid to each target
- WireGuard tunnel status — handshake age, bytes transferred, packet loss
- Pool health — zpool status on every site, alert on DEGRADED or FAULTED
- Failover readiness — keepalived state, service health checks passing
- Inter-site latency — eBPF-based RTT monitoring between WireGuard peers
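Replication lag is the metric most likely to catch a silently broken cron job, and it is just a subtraction. A minimal sketch that turns a snapshot's creation time into a Prometheus text-format line. The metric name `zfs_replication_lag_seconds` and both function names are hypothetical; the creation time is assumed to come from `zfs get -Hp -o value creation <snapshot>` on the target, which prints epoch seconds.

```shell
#!/usr/bin/env bash
# replication_lag_metric TARGET LAST_SNAPSHOT_EPOCH NOW_EPOCH
# Print one Prometheus text-format sample: seconds since the newest
# replicated snapshot on TARGET was created.
replication_lag_metric() {
    local target=$1 last=$2 now=$3
    printf 'zfs_replication_lag_seconds{target="%s"} %d\n' "$target" $((now - last))
}

# lag_ok LAG_SECONDS MAX_SECONDS
# Succeeds while the lag is within the allowed window.
lag_ok() {
    [ "$1" -le "$2" ]
}

# Example with fixed timestamps: newest snapshot is 10 minutes old
replication_lag_metric site-b 1700000000 1700000600

# A 15-minute schedule with some slack: alert past 30 minutes
if lag_ok 600 1800; then echo "site-b within window"; fi
```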
eBPF latency monitoring
# Measure WireGuard tunnel latency between sites.
# This script uses plain ping through wg0 as a simple stand-in; a bpftrace
# probe could replace it later without changing the metric name.
mkdir -p /var/lib/node_exporter/textfile
cat > /usr/local/bin/wg-latency-monitor << 'SCRIPT'
#!/bin/bash
# Quick latency check across all sites — runs every minute via cron
OUT=/var/lib/node_exporter/textfile/wg_latency.prom
TMP=$(mktemp "$OUT.XXXXXX")
for site in 10.10.0.1 10.10.0.2 10.10.0.3; do
RTT=$(ping -c 5 -W 2 -I wg0 "$site" 2>/dev/null | tail -1 | awk -F'/' '{print $5}')
if [ -n "$RTT" ]; then
echo "wg_rtt_ms{target=\"$site\"} $RTT" >> "$TMP"
fi
done
# Replace the file atomically so old samples never pile up
chmod 644 "$TMP"
mv "$TMP" "$OUT"
SCRIPT
chmod +x /usr/local/bin/wg-latency-monitor
# Expose to Prometheus via node_exporter textfile collector
# (node_exporter must run with --collector.textfile.directory pointing here)
echo "*/1 * * * * root /usr/local/bin/wg-latency-monitor" > /etc/cron.d/wg-latency
Step 9: Security
Every inter-site byte travels through WireGuard. No exceptions. The only port open to the public internet on any node is UDP 51820 — the WireGuard handshake. Everything else — SSH, monitoring, replication, management — runs over the encrypted overlay. An attacker scanning your IP sees one open UDP port and nothing else. Good luck with that.
# nftables firewall — lock down each site
cat > /etc/nftables.conf << 'NFTEOF'
#!/usr/sbin/nft -f
flush ruleset
table inet multisite {
chain input {
type filter hook input priority 0; policy drop;
# Loopback
iif lo accept
# Established connections
ct state established,related accept
# WireGuard handshake (only port open to the internet)
udp dport 51820 accept
# Everything over WireGuard is trusted
iifname "wg0" accept
# ICMP for diagnostics
ip protocol icmp accept
ip6 nexthdr icmpv6 accept
# Log (rate-limited) and drop everything else
limit rate 5/minute log prefix "nft-drop: "
drop
}
chain forward {
type filter hook forward priority 0; policy drop;
# Only forward between WireGuard peers
iifname "wg0" oifname "wg0" accept
ct state established,related accept
drop
}
chain output {
type filter hook output priority 0; policy accept;
# Allow all outbound (we control the server)
}
}
NFTEOF
nft -f /etc/nftables.conf
systemctl enable nftables
Security posture:
- Only UDP port 51820 is exposed to the internet (WireGuard)
- All management (SSH, monitoring, replication) runs over the WireGuard overlay
- ZFS encryption at rest on all sites — data is encrypted even if disks are stolen
- Each site has independent nftables rules — compromise of one doesn't open the others
- SSH keys only, no password authentication, and only reachable over WireGuard
- Sanoid snapshots provide ransomware rollback — snapshots cannot be modified in place, and even if an attacker with root destroys the local ones, the replicas on the other sites survive
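# SSH hardening — only listen on WireGuard interface
# sshd uses the first value it finds for each keyword, so make sure nothing
# earlier in sshd_config already sets these options.
cat >> /etc/ssh/sshd_config << 'EOF'
# Only accept SSH over WireGuard
# Change per site: .1, .2, .3
ListenAddress 10.10.0.1
PasswordAuthentication no
PermitRootLogin prohibit-password
MaxAuthTries 3
EOF
# Validate the config before restarting so a typo cannot lock you out
sshd -t && systemctl restart sshd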
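"Only UDP 51820 is exposed" is a claim you can check rather than trust. A minimal sketch that filters the output of `ss -H -lntu` and prints any listener bound outside loopback and the 10.10.0.0/24 overlay, other than WireGuard itself. The function name `exposed_listeners` is hypothetical, and the sample input below is made up for illustration. Note that a listener on 0.0.0.0 may still be blocked by nftables, so treat hits as things to review, not proof of exposure.

```shell
#!/usr/bin/env bash
# exposed_listeners
# Reads `ss -H -lntu` output on stdin (columns: netid, state, recv-q,
# send-q, local address:port, peer address:port) and prints every listener
# that is not on loopback, not on the WireGuard overlay, and not UDP 51820.
exposed_listeners() {
    awk '
    {
        proto = $1; addr = $5
        port = addr; sub(/.*:/, "", port)        # text after the last colon
        host = addr; sub(/:[^:]*$/, "", host)    # text before the last colon
        if (host ~ /^127\./ || host == "[::1]" || host ~ /^10\.10\.0\./) next
        if (proto == "udp" && port == "51820") next
        print proto, addr
    }'
}

# Demo with canned input; on a real node run: ss -H -lntu | exposed_listeners
exposed_listeners << 'SAMPLE'
udp UNCONN 0 0 0.0.0.0:51820 0.0.0.0:*
tcp LISTEN 0 128 10.10.0.1:22 0.0.0.0:*
tcp LISTEN 0 511 0.0.0.0:80 0.0.0.0:*
tcp LISTEN 0 128 127.0.0.1:5432 0.0.0.0:*
SAMPLE
```

For the sample above, only the listener on 0.0.0.0:80 is reported.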
Step 10: Disaster recovery runbook
Four scenarios, four procedures. Print this out and tape it to the rack. Seriously — when Site A is down and your phone is blowing up, you don't want to be scrolling through a wiki. You want a laminated card that says "do this, then this, then this."
Scenario 1: Site A fails — promote Site B (5 minutes)
# Site A is down. Site B has data from ≤15 minutes ago.
# 1. Verify Site A is actually down (not just a WireGuard blip)
ping -c 5 10.10.0.1 # No response
ssh 10.10.0.1 hostname # Connection refused/timeout
# 2. On Site B — start services
systemctl start nginx postgresql app-server
# 3. Reassign floating IP (or update DNS)
/usr/local/bin/failover-become-master
# 4. Verify
curl -s https://app.example.com/health | jq .
# Total time: ~5 minutes
# Data loss: ≤15 minutes (last replication cycle)
Scenario 2: Site A + B both fail — promote Site C (15 minutes)
# Both rented servers are down. Home lab has nightly backup.
# 1. On Site C — check latest backup
zfs list -t snapshot -r rpool/backup -o name,creation -s creation | tail -10
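# 2. Clone backup datasets to primary mountpoints
# (autosnap_latest is a placeholder: substitute a real snapshot name from step 1.
# Clones are per dataset, so repeat for each child such as services/web, and
# destroy or rename any empty dataset already occupying the target name.)
zfs clone rpool/backup/site-a/services@autosnap_latest rpool/services
zfs clone rpool/backup/site-a/data@autosnap_latest rpool/data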
# 3. Install and start services (Site C may not have them running)
dnf install -y nginx postgresql
systemctl start nginx postgresql app-server
# 4. Update DNS to point to home lab's public IP
# (or set up a new WireGuard tunnel to expose services)
# Total time: ~15 minutes
# Data loss: ≤24 hours (last nightly backup)
Scenario 3: Full rebuild from scratch (30 minutes)
# Everything is gone. You have the kldload ISO and a ZFS backup.
# 1. Boot kldload ISO on new hardware
# 2. Install kldload (server profile)
kldload-install-target --config /tmp/answers.env
# 3. After reboot — import backup pool (if you have the physical disks)
zpool import backup-pool
# Or receive from remote backup:
ssh 10.10.0.3 "zfs send -R rpool/backup/site-a@latest" | zfs receive -F rpool
# 4. Configure WireGuard, start services
# 5. Update DNS
# Total time: ~30 minutes (mostly waiting for ZFS receive)
# Data loss: depends on backup age
Scenario 4: Ransomware — rollback to clean snapshot (seconds)
# Attacker encrypted your files. ZFS snapshots are read-only.
# 1. Identify the last clean snapshot
zfs list -t snapshot -r rpool/services -o name,creation -s creation
# 2. Rollback (destroys everything after the snapshot — that's what we want)
# Rollback is per dataset, so loop over the children where the data lives
SNAP=autosnap_2026-03-27_00:00:00_daily
for ds in $(zfs list -H -o name -r rpool/services rpool/data); do
zfs rollback -r "$ds@$SNAP"
done
# 3. Restart services
systemctl restart nginx postgresql app-server
# 4. Investigate how they got in, patch the hole
# Total time: seconds (rollback is instant)
# Data loss: only changes since the snapshot
Why ZFS snapshots beat ransomware
ZFS snapshots are immutable. An attacker with root can zfs destroy them,
but they can't modify them in place — there's no way to encrypt a snapshot without
destroying it first. And your off-site replicas (Site B, Site C) have their own
independent snapshot chains on completely separate hardware. Even if an attacker
compromises Site A and destroys all local snapshots, Site B and Site C still have
clean copies. That holds only if the SSH key Site A uses for replication cannot
destroy snapshots on the other sites, so restrict it with zfs allow or have B and C
pull instead. With that in place, the attacker would need to simultaneously
compromise all three sites to actually destroy your data. At that point, they're
not a script kiddie — they're a nation-state, and you have bigger problems.