
VXLAN & EVPN Masterclass

The Networking tutorial showed you how to build a VXLAN tunnel by hand: create the interface, set a remote peer, done. The BIRD/BGP Masterclass showed you how BGP distributes routes across sites. This page combines both: EVPN is what happens when you give BGP control of your VXLAN fabric. Manual tunnels disappear. Peer discovery becomes automatic. MAC addresses propagate via a routing protocol. This is the standards-based way to build overlay networking in a modern data center, and it follows the same control-plane model that cloud providers use for their virtual networks.

What this page covers: VXLAN encapsulation recap, why static VXLAN fails at scale, BGP EVPN route types, FRRouting EVPN configuration on kldload, BIRD 2.x EVPN, multi-site EVPN over WireGuard, Cilium VXLAN integration, GENEVE, performance tuning, VNI design, and a complete troubleshooting reference.

Prerequisites: the Networking tutorial (manual VXLAN) and the BIRD/BGP Masterclass (BGP fundamentals). You should be comfortable with ip link, bridge fdb, and basic BGP peering before continuing.


1. From Manual VXLAN to an EVPN Fabric

The networking tutorial showed manual VXLAN: create a tunnel, set a remote IP, done. That works for two or three nodes. At scale, you need a control plane — something that automatically discovers peers, learns MAC addresses, and distributes reachability information. That is EVPN: Ethernet VPN. It uses BGP to turn VXLAN from a point-to-point hack into a real network fabric.

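This is how most modern data center fabrics are built. The public clouds (AWS VPC, Azure VNet, Google Cloud VPC) run their own proprietary control planes rather than standard BGP EVPN, but they solve the same problem in the same way: a control plane distributes "which MAC and IP lives behind which tunnel endpoint" so that nothing has to be learned by flooding. When you understand EVPN, you understand the model that clouds use below the API surface.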

EVPN is not a new encapsulation format. The data plane is still VXLAN — same UDP port 4789, same VNI field, same outer/inner Ethernet frames. What EVPN adds is a control plane: a BGP address family (L2VPN EVPN, AFI 25 / SAFI 70) that carries MAC and IP reachability information between VTEPs. Instead of you manually maintaining peer lists and flood tables, BGP does it automatically.

EVPN is the open, standards-based counterpart to the fabrics behind AWS VPC, Azure VNet, and similar services. Those clouds run proprietary control planes, but the shape is the same: a Layer 2 or Layer 3 domain whose forwarding table is maintained by a control plane, not by flooding. In an EVPN fabric the VXLAN VNI identifies the segment, and the BGP EVPN speakers (often route reflectors) are the control plane. Large data center fabrics run EVPN across thousands of VTEPs. The same protocol runs on your three-node kldload cluster.

2. VXLAN Recap — The Data Plane

Before adding EVPN, be clear on what VXLAN does and does not do by itself.

Encapsulation

VXLAN wraps an entire Ethernet frame inside a UDP packet. The outer packet routes across the underlay network (your physical LAN or WireGuard mesh). The inner packet is the original Layer 2 frame, including the MAC header. The receiving VTEP strips the outer headers and delivers the inner frame to a local bridge interface as if it arrived on a physical cable.

[ Physical NIC ]
  Outer Ethernet header  (VTEP-A MAC → VTEP-B MAC)
  Outer IP header        (VTEP-A IP  → VTEP-B IP)
  Outer UDP header       (src port ephemeral → dst port 4789)
  VXLAN header           (8 bytes: flags + 24-bit VNI)
    Inner Ethernet header  (VM-A MAC → VM-B MAC)
    Inner IP header        (10.0.100.2 → 10.0.100.5)
    Inner payload          (TCP, UDP, ICMP ...)

The 24-bit VNI (Virtual Network Identifier) is the VXLAN equivalent of a VLAN ID. It identifies which virtual network this frame belongs to. With 24 bits you get 16,777,216 possible VNIs — compared to 4,094 for 802.1Q VLANs.
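The header layout and the VNI arithmetic above can be checked from a shell. This is a minimal sketch: the helper name vxlan_header_hex is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Sketch: build the 8-byte VXLAN header for a given VNI as a hex string.
# Layout (RFC 7348): flags (8 bits, 0x08 = "VNI present"), reserved (24 bits),
# VNI (24 bits), reserved (8 bits).
# vxlan_header_hex is a hypothetical helper name used only for this example.
vxlan_header_hex() {
    vni=$1
    if [ "$vni" -lt 0 ] || [ "$vni" -gt 16777215 ]; then
        echo "VNI out of range (0..16777215): $vni" >&2
        return 1
    fi
    # 08 000000 <6 hex digits of VNI> 00
    printf '08000000%06x00\n' "$vni"
}

# Size of the VNI space: 2^24 identifiers
echo "VNI space: $((1 << 24))"

# Header for VNI 100 (0x000064)
vxlan_header_hex 100
```

The sixteen hex digits printed for VNI 100 are the eight bytes that sit between the outer UDP header and the inner Ethernet frame in the diagram above.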

Manual VXLAN (from the networking tutorial)

# Static VXLAN — the way you learned it in the Networking tutorial
# Node A: 192.168.1.10
ip link add vxlan100 type vxlan id 100 dstport 4789 local 192.168.1.10
ip link set vxlan100 up
ip addr add 10.0.100.1/24 dev vxlan100

# Manually add the remote peer (Node B at 192.168.1.20)
bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 192.168.1.20

# Node B: 192.168.1.20
ip link add vxlan100 type vxlan id 100 dstport 4789 local 192.168.1.20
ip link set vxlan100 up
ip addr add 10.0.100.2/24 dev vxlan100
bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 192.168.1.10

That 00:00:00:00:00:00 entry is the "flood entry" — any frame whose destination MAC is not yet known gets sent to that remote IP. It works for two nodes. The problem emerges the moment you add a third.

The scaling problem

With static VXLAN, every node must know about every other node. In a 50-node cluster, each node has 49 flood entries. When you add node 51, you log in to all 50 existing nodes and add a new entry. When a node changes IP, you update 49 entries. When a node fails, the flood entries remain until you manually remove them — so traffic keeps going to a dead VTEP, causing timeouts.

This is N×N configuration management. It is fragile, error-prone, and has no failure detection. It is the VXLAN equivalent of static routes in a dynamic network — fine for a lab, unusable in production.

Manual VXLAN with static flood-and-learn works for a homelab with three nodes. In a 50-node data center, maintaining N×N peer lists is operationally impossible. You need a control plane. Multicast-based flood-and-learn (the original VXLAN design from RFC 7348) is one option — all VTEPs join a multicast group and flood BUM (Broadcast, Unknown-unicast, Multicast) traffic to the group. But multicast requires multicast-capable underlay switches, and most modern networks (including anything over WireGuard) do not support multicast. EVPN solves this cleanly: no multicast, no flood groups, no per-node config updates. BGP distributes everything.
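To make the N×N growth concrete, the sketch below prints (but does not execute) the flood entries one node needs and counts the total for a mesh. The helper names gen_flood_entries and mesh_entries are hypothetical; the interface name and addresses follow the manual example above.

```shell
#!/bin/sh
# Sketch: dry-run generator for static VXLAN flood entries.
# gen_flood_entries SELF PEER... prints one "bridge fdb append" command
# for every peer that is not SELF. Nothing is applied to the system.
gen_flood_entries() {
    self=$1
    shift
    for peer in "$@"; do
        if [ "$peer" != "$self" ]; then
            echo "bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst $peer"
        fi
    done
}

# Total flood entries across an N-node static mesh: N * (N - 1)
mesh_entries() {
    echo $(( $1 * ($1 - 1) ))
}

# Commands Node A would need in a three-node mesh
gen_flood_entries 192.168.1.10 192.168.1.10 192.168.1.20 192.168.1.30

# Entries to maintain by hand in a 50-node cluster
mesh_entries 50
```

Every one of those entries is a line somebody has to add, change, or remove by hand when the membership changes, which is the work EVPN hands to BGP.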

3. What EVPN Adds to VXLAN

EVPN (RFC 7432, extended for VXLAN in RFC 8365) defines a BGP address family that carries Layer 2 reachability information. The key insight: instead of learning MAC addresses by flooding frames and watching who responds, VTEPs advertise their MAC and IP bindings into BGP. Every VTEP in the fabric learns the entire MAC table via BGP route updates.

MAC learning via BGP

When a VM comes online on VTEP-A, VTEP-A advertises a BGP EVPN Type 2 route: "MAC aa:bb:cc:dd:ee:ff, IP 10.0.100.5 is reachable via VTEP 192.168.1.10 in VNI 100." Every other VTEP receives this advertisement and programs it into its local VXLAN forwarding database. Traffic to a known MAC never has to be flooded to find it.

// Traditional: send frame, wait for response, learn MAC
// EVPN: receive BGP update, program FDB, traffic goes direct

ARP suppression

Without EVPN, when VM-A sends ARP "who has 10.0.100.5?", the request floods to every VTEP. With EVPN, the local VTEP already knows (from a BGP Type 2 route) that 10.0.100.5 is at MAC aa:bb:cc:dd:ee:ff on VTEP-B. The local VTEP answers the ARP directly — the request never leaves the originating node.

// ARP flood suppression: your local VTEP is the ARP proxy
// Zero broadcast traffic crossing VXLAN tunnels

Type 2 route — MAC/IP Advertisement

The core EVPN route. Carries: RD (Route Distinguisher), MAC address, optional IP address, VTEP tunnel endpoint, VNI. This is how VTEPs learn where each MAC lives. The IP field enables host-route distribution — L3 reachability derived directly from the MAC/IP binding.

// BGP EVPN Type 2: "I have MAC X, IP Y, reach me at VTEP Z, VNI W"
// One advertisement covers both L2 (MAC) and L3 (IP) reachability

Type 3 route — Inclusive Multicast Ethernet Tag

Announces a VTEP's participation in a VNI and how it handles BUM (Broadcast, Unknown-unicast, Multicast) traffic. Instead of a multicast group, VTEPs exchange Type 3 routes to build an ingress replication list — the list of remote VTEPs that BUM traffic should be unicasted to.

// Type 3: "I participate in VNI 100, send me BUM traffic here"
// Replaces multicast groups with a BGP-maintained VTEP list

Type 5 route — IP Prefix

Carries IP prefixes (like normal BGP routes) but tagged with EVPN context. Used for L3 routing between EVPN segments — where a gateway VTEP advertises an entire subnet rather than individual host routes. Essential for inter-VNI routing and for connecting EVPN to external BGP speakers.

// Type 5: "Prefix 10.0.200.0/24 reachable via VTEP 192.168.1.30"
// This is how EVPN connects to the outside world

Route Distinguisher & Route Target

RD (Route Distinguisher) makes EVPN routes globally unique in the BGP table — prepended to the route key. RT (Route Target) controls import/export policy: which VRFs/VNIs import which routes. A VTEP exports routes with RT 65000:100 and other VTEPs import routes matching that RT into VNI 100.

// RD: makes the route unique (namespace)
// RT: controls who imports it (policy)

In traditional VXLAN, when a VM sends an ARP "who has 10.0.100.5?", the request floods to every VTEP in the VNI. Each VTEP delivers the broadcast to every local VM on that VNI. At 100 VMs across 20 VTEPs, every ARP request becomes 19 VXLAN unicast packets plus 100 VM-delivered broadcasts. At 10,000 VMs, this is catastrophic. With EVPN ARP suppression, the local VTEP already knows (via BGP Type 2 routes) that 10.0.100.5 is at MAC aa:bb:cc:dd:ee:ff on VTEP 192.168.1.20. It generates an ARP reply locally, and the ARP request never enters the VXLAN fabric at all. Requests for hosts the fabric already knows cause zero flooding. This is how you scale to thousands of nodes without broadcast storms.
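On Linux, ARP suppression is a per-port bridge setting rather than something BGP switches on by itself. The fragment below is a sketch that assumes the vxlan100 interface is already a port of the br100 bridge, as set up in section 4; adjust the names to your own interfaces.

```shell
# Sketch: enable ARP/ND suppression on the VXLAN bridge port.
# Assumes vxlan100 is already enslaved to br100 (see section 4).
# "learning off" stops the bridge from learning remote MACs from traffic,
# leaving that job to the EVPN control plane.
bridge link set dev vxlan100 neigh_suppress on learning off

# Confirm the flag is set (look for "neigh_suppress on" in the detailed output)
bridge -d link show dev vxlan100
```

With the flag on, the bridge answers ARP requests from the neighbor entries that the EVPN daemon installs, which is the proxy behavior described above.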

4. EVPN with FRRouting on kldload

FRRouting is the natural choice for EVPN on kldload. FRR is a fork of Quagga, and much of its EVPN implementation was contributed by Cumulus Networks, the company behind Cumulus Linux and an early driver of BGP EVPN on commodity Linux hardware. FRR handles both the BGP control plane (advertising and receiving EVPN routes) and the kernel programming side (populating the VXLAN FDB and ARP tables via netlink). You configure BGP + EVPN in vtysh; FRR and the kernel do the rest.

Install FRRouting

# CentOS Stream 9 / Rocky / RHEL
dnf install frr frr-pythontools

# Debian 13 / Ubuntu 24.04
apt install frr

# Enable the BGP and Zebra daemons
sed -i 's/bgpd=no/bgpd=yes/' /etc/frr/daemons
sed -i 's/zebra=no/zebra=yes/' /etc/frr/daemons

systemctl enable --now frr

Example topology

Three kldload nodes, each a VTEP. They peer in a full-mesh iBGP (or use a route reflector for larger deployments). All nodes share VNI 100 (the 10.0.100.0/24 overlay network).

Node 1: underlay 192.168.1.10, loopback 10.255.0.1  (VTEP IP)
Node 2: underlay 192.168.1.20, loopback 10.255.0.2
Node 3: underlay 192.168.1.30, loopback 10.255.0.3
BGP AS: 65000 (iBGP full mesh)
VNI:    100   (overlay 10.0.100.0/24)

Kernel setup — identical on all three nodes (adjust IPs)

# Node 1 example
# Loopback address used as the VTEP source IP
ip addr add 10.255.0.1/32 dev lo

# VXLAN interface — no remote peer, no flood entry
# FRR will populate the FDB via netlink
ip link add vxlan100 type vxlan id 100 dstport 4789 \
  local 10.255.0.1 nolearning
ip link set vxlan100 up

# Bridge the VXLAN interface
ip link add br100 type bridge
ip link set vxlan100 master br100
ip link set br100 up
ip addr add 10.0.100.1/24 dev br100

The nolearning flag on the VXLAN interface is critical. It tells the kernel to not learn MACs by snooping traffic — that job belongs to FRR/EVPN now. Without nolearning, you get a race between kernel flood-and-learn and EVPN BGP learning, which causes forwarding inconsistencies.
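The BGP sessions and VXLAN tunnels in this topology terminate on the loopback addresses, so every node must be able to route to the other nodes' loopbacks over the underlay. The page does not show how that reachability is provided. One minimal way, assuming the flat 192.168.1.0/24 underlay from the topology above and no IGP, is a pair of static host routes per node:

```shell
# Sketch for Node 1: host routes to the other VTEP loopbacks over the flat underlay.
# Assumption: no OSPF/IS-IS or IPv4-unicast BGP is advertising the loopbacks.
ip route add 10.255.0.2/32 via 192.168.1.20
ip route add 10.255.0.3/32 via 192.168.1.30

# Check that the VTEP source address can reach a peer loopback
ping -c 3 -I 10.255.0.1 10.255.0.2
```

Nodes 2 and 3 need the mirror-image routes. In a larger fabric an IGP or an IPv4 unicast BGP session would normally carry the loopbacks instead.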

FRR configuration — Node 1

! /etc/frr/frr.conf — Node 1 (10.255.0.1)

frr version 9.1
frr defaults traditional
hostname node1
log syslog informational
no ipv6 forwarding

interface lo
 ip address 10.255.0.1/32
!
interface eth0
 ip address 192.168.1.10/24
!

router bgp 65000
 bgp router-id 10.255.0.1
 no bgp default ipv4-unicast
 neighbor 10.255.0.2 remote-as 65000
 neighbor 10.255.0.2 update-source lo
 neighbor 10.255.0.3 remote-as 65000
 neighbor 10.255.0.3 update-source lo

 address-family l2vpn evpn
  neighbor 10.255.0.2 activate
  neighbor 10.255.0.3 activate
  advertise-all-vni
 exit-address-family
!
line vty
!

The key directive is advertise-all-vni. This tells FRR to scan all VXLAN interfaces on the system, discover their VNIs, and automatically generate EVPN advertisements (Type 2 for local MACs/IPs, Type 3 for BUM handling). You do not need to enumerate each VNI manually — FRR discovers them from the kernel.
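With advertise-all-vni, FRR also derives an RD and RTs for each VNI on its own. If you prefer to pin the values described in section 3, the sketch below sets them explicitly for VNI 100 on Node 1. The RD and RT values are the ones used as examples on this page; treat them as assumptions and substitute your own.

```shell
# Sketch: set the RD and RTs for VNI 100 explicitly instead of relying on auto-derivation.
vtysh \
  -c 'configure terminal' \
  -c 'router bgp 65000' \
  -c 'address-family l2vpn evpn' \
  -c 'vni 100' \
  -c 'rd 10.255.0.1:100' \
  -c 'route-target import 65000:100' \
  -c 'route-target export 65000:100' \
  -c 'exit-vni' \
  -c 'end' \
  -c 'write memory'

# Show the RD and RTs FRR is using for the VNI
vtysh -c 'show bgp l2vpn evpn vni 100'
```

Explicit values are mostly useful when you need a non-default import policy; for a single-AS fabric the automatic values are usually enough.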

FRR configuration — Node 2 (10.255.0.2) and Node 3 (10.255.0.3)

! Node 2 — change router-id and neighbor IPs accordingly
router bgp 65000
 bgp router-id 10.255.0.2
 no bgp default ipv4-unicast
 neighbor 10.255.0.1 remote-as 65000
 neighbor 10.255.0.1 update-source lo
 neighbor 10.255.0.3 remote-as 65000
 neighbor 10.255.0.3 update-source lo

 address-family l2vpn evpn
  neighbor 10.255.0.1 activate
  neighbor 10.255.0.3 activate
  advertise-all-vni
 exit-address-family
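
The topology note above mentions a route reflector as the alternative to a full mesh. As a rough sketch, and assuming Node 1 is chosen as the reflector, the change is two extra statements on Node 1 and one fewer neighbor on each client:

```shell
# Sketch: make Node 1 a route reflector for the EVPN address family.
vtysh \
  -c 'configure terminal' \
  -c 'router bgp 65000' \
  -c 'address-family l2vpn evpn' \
  -c 'neighbor 10.255.0.2 route-reflector-client' \
  -c 'neighbor 10.255.0.3 route-reflector-client' \
  -c 'end'

# On Node 2 the direct session to Node 3 is then no longer needed
# (run the mirror-image command on Node 3):
vtysh \
  -c 'configure terminal' \
  -c 'router bgp 65000' \
  -c 'no neighbor 10.255.0.3 remote-as 65000' \
  -c 'end'
```

With a reflector, adding a fourth node means one new neighbor on Node 1 and one on the new node, instead of a new neighbor on every existing node.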

Verify the EVPN fabric is working

# Check BGP peers are up
vtysh -c "show bgp summary"

# Show all EVPN routes received
vtysh -c "show bgp l2vpn evpn"

# Show Type 2 (MAC/IP) routes for VNI 100
vtysh -c "show bgp l2vpn evpn route type macip"

# Show Type 3 (multicast/BUM) routes
vtysh -c "show bgp l2vpn evpn route type multicast"

# Show the EVPN MAC table as seen by FRR
vtysh -c "show evpn mac vni 100"

# Show the kernel VXLAN FDB (should be populated by FRR)
bridge fdb show dev vxlan100

# Ping between overlay IPs to verify end-to-end
# (run a VM or namespace on the bridge on each node)
ping 10.0.100.2 -c 3

When it is working, bridge fdb show dev vxlan100 will show entries like:

aa:bb:cc:dd:ee:ff dst 10.255.0.2 self permanent
00:00:00:00:00:00 dst 10.255.0.2 self permanent
00:00:00:00:00:00 dst 10.255.0.3 self permanent

The specific MAC entry for aa:bb:cc:dd:ee:ff was populated by FRR from the BGP Type 2 route received from Node 2. The 00:00:00:00:00:00 entries (one per remote VTEP) are the ingress replication list built from Type 3 routes — BUM traffic is unicasted to each remote VTEP individually, no multicast needed.
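The ingress replication list can be pulled out of the FDB with a small filter. The helper name flood_vteps is hypothetical; the sample input is the three lines shown above, so the demonstration runs without a VXLAN interface.

```shell
#!/bin/sh
# flood_vteps is a hypothetical helper: it reads `bridge fdb show` output on stdin
# and prints each remote VTEP that has an all-zero (BUM flood) entry.
flood_vteps() {
    awk '$1 == "00:00:00:00:00:00" {
        # The VTEP address is the field that follows the "dst" keyword
        for (i = 2; i < NF; i++) if ($i == "dst") print $(i + 1)
    }' | sort -u
}

# Live usage (needs the vxlan100 interface from this page):
#   bridge fdb show dev vxlan100 | flood_vteps

# Offline demonstration using the sample lines shown above
printf '%s\n' \
    'aa:bb:cc:dd:ee:ff dst 10.255.0.2 self permanent' \
    '00:00:00:00:00:00 dst 10.255.0.2 self permanent' \
    '00:00:00:00:00:00 dst 10.255.0.3 self permanent' | flood_vteps
```

If a remote VTEP is missing from this list, its Type 3 route has not been received or imported, which is the first thing to check when broadcast traffic such as ARP does not reach a node.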

FRRouting is the easiest path to EVPN on Linux. It handles both the BGP control plane (EVPN route exchange) and programs the kernel's VXLAN FDB (forwarding database) automatically via netlink. You configure BGP + EVPN in vtysh, FRR learns the remote VTEPs and MACs, and the kernel VXLAN interface just works. The advertise-all-vni directive is the magic: FRR scans all VXLAN interfaces, discovers VNI 100, and starts advertising Type 2 routes for every MAC it learns locally and Type 3 routes for BUM handling. Adding a new VNI is as simple as creating a new VXLAN interface — FRR picks it up automatically. This is the same behavior as Cumulus Linux, Arista EOS, and Cisco NX-OS with EVPN enabled. The protocol is identical; only the CLI differs.

5. EVPN with BIRD

BIRD's L2VPN EVPN support is recent and is not included in every packaged 2.x release, so check the release notes for your version before relying on this section. The configuration is more explicit than FRRouting's: BIRD does not auto-discover VNIs, so you define each one. BIRD also does not program the kernel FDB directly; you need a helper daemon (typically evpn-helper or a custom script using birdc and bridge fdb) to translate BIRD's EVPN table into kernel FDB entries. For most deployments, FRRouting is simpler. BIRD's strength is its flexible policy language: if you need complex route filtering and manipulation in addition to EVPN, BIRD can handle both in one config file.

BIRD 2.x EVPN configuration

# /etc/bird/bird.conf — BIRD 2.x EVPN example
# Node 1: router-id 10.255.0.1

router id 10.255.0.1;
log syslog all;

protocol device { scan time 10; }
protocol direct { ipv4; }

# iBGP with EVPN address family
protocol bgp node2 {
  local 10.255.0.1 as 65000;
  neighbor 10.255.0.2 as 65000;
  hold time 90;

  l2vpn evpn {
    import all;
    export all;
  };
}

protocol bgp node3 {
  local 10.255.0.1 as 65000;
  neighbor 10.255.0.3 as 65000;
  hold time 90;

  l2vpn evpn {
    import all;
    export all;
  };
}

# EVPN instance for VNI 100
protocol evpn vni100 {
  vni 100;
  rd 10.255.0.1:100;
  rt both 65000:100;

  interface "vxlan100";

  l2vpn evpn {
    import all;
    export all;
  };
}

When to use BIRD vs FRRouting for EVPN

Criterion | FRRouting | BIRD 2.x
EVPN maturity | Production-grade, with much of the EVPN code contributed by Cumulus Networks | Newer, functional, less battle-tested
Kernel FDB programming | Automatic via netlink (built-in) | Requires external helper
VNI discovery | Automatic (advertise-all-vni) | Manual per-VNI config
Route policy flexibility | Route maps, prefix lists | BIRD filter language (more expressive)
Best for | Pure EVPN fabric, fast setup | Complex mixed routing + EVPN policies

FRRouting is more mature for EVPN. Much of its EVPN implementation came from Cumulus Networks, which pioneered EVPN on Linux and drove a large share of the kernel-side VXLAN and bridge work that EVPN depends on. BIRD's EVPN support is newer. For production EVPN, FRRouting is the safer choice today. Use BIRD for EVPN only if you are already running BIRD for other routing protocols and need to keep everything in one daemon. If you are starting fresh, use FRR for EVPN.

6. Multi-Site EVPN over WireGuard

This is the full overlay stack: WireGuard encrypts the transport between sites, VXLAN provides the virtual Layer 2, and BGP EVPN automates peer discovery and MAC learning. VMs on different sites share the same Layer 2 segment without a single static peer entry.

Topology

Site A                              Site B
192.168.1.0/24                      192.168.2.0/24
Node A1: 192.168.1.10               Node B1: 192.168.2.10
VTEP loopback: 10.255.0.1           VTEP loopback: 10.255.1.1

WireGuard tunnel:
  wg0 on A1: 10.200.0.1/30
  wg0 on B1: 10.200.0.2/30

Overlay VNI 100: 10.0.100.0/24
  A1 bridge: 10.0.100.1
  B1 bridge: 10.0.100.2
  VMs: 10.0.100.10, 10.0.100.20 (anywhere on either site)

BGP: eBGP between sites (AS 65000 on Site A, AS 65001 on Site B)
     iBGP within each site (if multiple nodes per site)

Step 1 — WireGuard tunnel between sites

# On Node A1
cat /etc/wireguard/wg0.conf
# [Interface]
# Address = 10.200.0.1/30
# PrivateKey = <A1 private key>
# ListenPort = 51820
#
# [Peer]
# PublicKey = <B1 public key>
# Endpoint = <B1 public IP>:51820
# AllowedIPs = 10.200.0.2/32, 10.255.1.0/24
# PersistentKeepalive = 25

wg-quick up wg0

# On Node B1
# [Interface]
# Address = 10.200.0.2/30
# PrivateKey = <B1 private key>
# ListenPort = 51820
#
# [Peer]
# PublicKey = <A1 public key>
# Endpoint = <A1 public IP>:51820
# AllowedIPs = 10.200.0.1/32, 10.255.0.0/24
# PersistentKeepalive = 25

wg-quick up wg0

# Verify WireGuard is up
ping 10.200.0.2 -c 3   # from A1
ping 10.200.0.1 -c 3   # from B1

The AllowedIPs entries include the remote loopback subnet (10.255.1.0/24 for Site B, 10.255.0.0/24 for Site A). This ensures the VTEP loopback addresses — which are used as BGP update-source and VXLAN tunnel endpoints — are reachable over WireGuard.
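wg-quick installs a kernel route for each AllowedIPs prefix, so a quick check on Node A1 is to confirm that the remote loopback is both permitted by WireGuard and routed through the tunnel. This is a sketch using the addresses from the topology above.

```shell
# On Node A1, after `wg-quick up wg0`

# The peer's allowed-ips should include 10.255.1.0/24
wg show wg0 allowed-ips

# The kernel should route the remote VTEP loopback via wg0
ip route get 10.255.1.1
```

If `ip route get` reports a different device, the VXLAN packets addressed to the remote VTEP will leave unencrypted or be dropped, and the EVPN session can still look healthy because it runs on the tunnel addresses.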

Step 2 — VXLAN interfaces (both sites)

# Node A1
ip addr add 10.255.0.1/32 dev lo
ip link add vxlan100 type vxlan id 100 dstport 4789 local 10.255.0.1 nolearning
ip link set vxlan100 up
ip link add br100 type bridge
ip link set vxlan100 master br100
ip link set br100 up
ip addr add 10.0.100.1/24 dev br100

# Node B1
ip addr add 10.255.1.1/32 dev lo
ip link add vxlan100 type vxlan id 100 dstport 4789 local 10.255.1.1 nolearning
ip link set vxlan100 up
ip link add br100 type bridge
ip link set vxlan100 master br100
ip link set br100 up
ip addr add 10.0.100.2/24 dev br100

Step 3 — FRRouting EVPN with eBGP between sites

! Node A1 — AS 65000
router bgp 65000
 bgp router-id 10.255.0.1
 no bgp default ipv4-unicast

 ! eBGP peer to Site B
 neighbor 10.200.0.2 remote-as 65001
 neighbor 10.200.0.2 update-source wg0
 neighbor 10.200.0.2 ebgp-multihop 2

 address-family l2vpn evpn
  neighbor 10.200.0.2 activate
  advertise-all-vni
 exit-address-family
!

! Node B1 — AS 65001
router bgp 65001
 bgp router-id 10.255.1.1
 no bgp default ipv4-unicast

 ! eBGP peer to Site A
 neighbor 10.200.0.1 remote-as 65000
 neighbor 10.200.0.1 update-source wg0
 neighbor 10.200.0.1 ebgp-multihop 2

 address-family l2vpn evpn
  neighbor 10.200.0.1 activate
  advertise-all-vni
 exit-address-family
!

Step 4 — Verify cross-site EVPN

# On Node A1: check BGP peer to Site B
vtysh -c "show bgp summary"

# See Type 2 routes from Site B (MAC/IP of VMs on B1)
vtysh -c "show bgp l2vpn evpn route type macip"

# Confirm kernel FDB has remote Site B VTEP
bridge fdb show dev vxlan100 | grep 10.255.1.1

# Ping a VM on Site B from a VM on Site A
# (both VMs on 10.0.100.0/24, bridged to their local br100)
ping 10.0.100.20 -c 3

This is the full stack: WireGuard encrypts the transport. VXLAN provides the virtual Layer 2. BGP EVPN automates peer discovery and MAC learning. The result: VMs on different continents think they are on the same switch, without a single static peer entry. Add a third site? Configure a WireGuard tunnel to an existing site, peer FRR in BGP, run advertise-all-vni, and the new site learns all existing MAC bindings automatically via BGP. Only the site it peers with gains a new neighbor statement; every other node is untouched. This is the fundamental advantage of a control-plane-driven overlay over a manually configured one: the cost of adding a participant no longer grows with the size of the fabric. The peering site (or a route reflector, in a larger iBGP design) receives the new VTEP's advertisements and re-advertises them to everyone else, and withdrawn routes are removed the same way when a VTEP goes away.
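As a sketch of the third-site step, the fragment below adds a hypothetical Site C to Node A1. The AS number 65002 and the tunnel address 10.200.1.2 are invented for illustration, and the example assumes a second WireGuard tunnel from A1 to Site C is already up. It also assumes that Sites B and C can route to each other's VTEP loopbacks (for example through A1, with forwarding enabled and matching AllowedIPs), since VXLAN traffic flows between the VTEPs directly.

```shell
# Sketch: peer Node A1 with a hypothetical Site C (AS 65002, tunnel address 10.200.1.2).
vtysh \
  -c 'configure terminal' \
  -c 'router bgp 65000' \
  -c 'neighbor 10.200.1.2 remote-as 65002' \
  -c 'address-family l2vpn evpn' \
  -c 'neighbor 10.200.1.2 activate' \
  -c 'end'

# Confirm the new session is established in the EVPN address family
vtysh -c 'show bgp l2vpn evpn summary'
```

Site B needs no BGP change in this design; it learns Site C's routes through A1.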

7. VXLAN + EVPN + Cilium

Cilium's VXLAN mode and BGP EVPN are complementary but operate at different layers. Understanding where each lives — and when to use each — is key to building a coherent multi-site Kubernetes networking architecture.

How Cilium uses VXLAN

In VXLAN mode, Cilium creates a single VXLAN interface (cilium_vxlan) shared by all pods on the node. Every pod's traffic to another node is encapsulated in VXLAN when it leaves the node, and Cilium uses the VNI field to carry the sender's security identity rather than a tenant or segment ID. The Cilium agent decides which pod CIDRs live on which node from its own internal state (derived from Kubernetes node information, not BGP EVPN). This is distinct from the EVPN control plane you configured in section 4: Cilium's VXLAN is self-contained within the cluster.

# Install Cilium in VXLAN tunnel mode
helm install cilium cilium/cilium \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set routingMode=tunnel \
  --set tunnelProtocol=vxlan \
  --set k8sServiceHost=10.80.0.1 \
  --set k8sServicePort=6443

# Verify tunnel mode
cilium status | grep -i tunnel
# Tunnel:         vxlan

VXLAN mode vs native routing mode

VXLAN tunnel mode

Cilium encapsulates all pod-to-pod traffic in VXLAN. Works across any IP underlay — nodes do not need to be on the same L2 segment. Pod CIDRs are private to the cluster (no external route advertisement needed). The trade-off: encapsulation overhead and an extra MTU constraint.

// Use when: nodes are on different subnets, multi-site,
// or you don't control the underlay routing

Native routing mode

Cilium does not encapsulate. Each node's pod CIDR is advertised as a real route (via BGP, or directly to the gateway). The underlay must be able to route pod CIDRs between nodes. Best performance — zero encapsulation overhead. Requires that you control the underlay routing.

// Use when: nodes are on the same subnet, or you have
// BGP (FRR/BIRD) advertising pod CIDRs to upstream routers

VXLAN over WireGuard (multi-site)

For cross-site Kubernetes clusters (or Cluster Mesh), run Cilium in VXLAN mode and configure WireGuard as the underlay transport between sites. Pod traffic is VXLAN-encapsulated by Cilium, then encrypted by WireGuard at the node level. Two layers of encapsulation — plan your MTU accordingly.

// Pod frame → VXLAN (Cilium) → WireGuard (host) → physical
// MTU: 1500 - 50 (VXLAN) - 60 (WireGuard) = 1390 inner MTU

Cilium Cluster Mesh

Cilium's native multi-cluster solution. Each cluster runs its own Cilium, and Cluster Mesh connects them for cross-cluster service discovery and policy. Pod traffic between clusters uses the same VXLAN or Geneve tunneling (or native routing) as traffic inside a cluster. Works on top of the WireGuard mesh you built in the WireGuard Masterclass.

// Cluster Mesh: shared service and identity state, tunnels for pods
// across clusters, conceptually similar to EVPN multi-site

Cilium VXLAN over a WireGuard mesh

# 1. Deploy your WireGuard mesh first (from the WireGuard Masterclass)
# All Kubernetes nodes have a wg0 interface and can reach each other
# via WireGuard tunnel IPs (e.g., 10.200.0.0/24)

# 2. Install Cilium with VXLAN and WireGuard transparent encryption
helm install cilium cilium/cilium \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set routingMode=tunnel \
  --set tunnelProtocol=vxlan \
  --set encryption.enabled=true \
  --set encryption.type=wireguard \
  --set k8sServiceHost=10.200.0.1 \
  --set k8sServicePort=6443

# Cilium manages its own WireGuard keys (per-node, auto-rotated)
# This is SEPARATE from your host WireGuard mesh
# You end up with: host WireGuard (site-to-site) + Cilium WireGuard (pod-to-pod)

# 3. Verify WireGuard encryption is active
cilium status | grep -i wireguard
# Encryption:     Wireguard

# Check which nodes have WireGuard peerings
cilium encrypt status

Cilium's VXLAN mode encapsulates pod traffic in VXLAN between nodes. This means pods on different subnets (different sites, different clouds) can talk as if they are on the same network. Combined with WireGuard encryption, you get an encrypted multi-site Kubernetes overlay on hardware you own. The important distinction: Cilium manages its own VXLAN control plane (using Kubernetes state, not BGP EVPN). Your FRR EVPN fabric and Cilium VXLAN are parallel systems. Cilium handles pod-to-pod encapsulation within and between clusters. FRR EVPN handles non-Kubernetes VM overlay networking on the same nodes. They share the same physical NIC and WireGuard underlay, but they are logically independent. You can run both simultaneously on the same kldload node.

8. GENEVE — VXLAN's Successor

GENEVE (Generic Network Virtualization Encapsulation, RFC 8926) is the evolution of VXLAN. The wire format is similar — UDP encapsulation, 24-bit VNI — but GENEVE's header is variable-length and carries arbitrary TLV (Type-Length-Value) metadata. This extensibility is what VXLAN's fixed 8-byte header cannot provide.
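For comparison with VXLAN, the sketch below builds the fixed 8-byte GENEVE base header as hex. The field layout follows RFC 8926: version (2 bits), option length in 4-byte words (6 bits), the O and C flags plus reserved bits (8 bits), protocol type (16 bits, 0x6558 for an Ethernet payload), VNI (24 bits), reserved (8 bits). The helper name geneve_header_hex is hypothetical.

```shell
#!/bin/sh
# Sketch: build the GENEVE base header for a VNI, with no flags set.
# geneve_header_hex VNI [OPT_WORDS]
#   OPT_WORDS = total length of the options that follow, in 4-byte words (0..63)
# geneve_header_hex is a hypothetical helper name used only for this example.
geneve_header_hex() {
    vni=$1
    opt_words=${2:-0}
    if [ "$vni" -lt 0 ] || [ "$vni" -gt 16777215 ]; then
        echo "VNI out of range (0..16777215): $vni" >&2
        return 1
    fi
    if [ "$opt_words" -lt 0 ] || [ "$opt_words" -gt 63 ]; then
        echo "option length out of range (0..63 words): $opt_words" >&2
        return 1
    fi
    # Version is 0, so the first byte is just the option length.
    # Then: flags byte 00, protocol type 6558, VNI, reserved 00.
    printf '%02x006558%06x00\n' "$opt_words" "$vni"
}

geneve_header_hex 100      # no options
geneve_header_hex 100 2    # followed by 8 bytes of options
```

The only thing that changes when options are attached is the length field in the first byte; that is the hook VXLAN's header does not have.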

Fixed VXLAN header

8 bytes total: 8-bit flags, 24-bit reserved, 24-bit VNI, 8-bit reserved. That is all you get. You cannot carry security tags, policy IDs, tracing context, or any other per-packet metadata. The VNI is the only identifier.

// VXLAN header: [flags 8b][reserved 24b][VNI 24b][reserved 8b]
// Total: 8 bytes. Immutable. No extensions.

Variable GENEVE header

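Base header is 8 bytes (same as VXLAN), followed by zero or more TLV options. Each option starts with a 4-byte header (a 16-bit option class that namespaces the vendor, an 8-bit type, and a 5-bit length) followed by up to 124 bytes of data, and all options together can occupy at most 252 bytes. The receiving VTEP processes the options it understands and skips the rest, unless an option is flagged critical, in which case a receiver that does not understand it must drop the packet.

// GENEVE header: [base 8 bytes][opt1 TLV][opt2 TLV]...
// Carry security group ID, flow hash, trace ID, or other metadata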

Why cloud providers use GENEVE

Cloud providers favor encapsulations that can carry per-packet metadata, such as security-group or tenant context, alongside the payload, so that policy can be evaluated from the header instead of a separate lookup. A public example is AWS Gateway Load Balancer, which uses GENEVE on UDP port 6081 and passes flow metadata to appliances in GENEVE options. VXLAN's fixed header has no room for this; GENEVE does.

// GENEVE options carry metadata next to the packet
// The receiver can evaluate policy per packet from the header

GENEVE on Linux

Linux has supported GENEVE since kernel 3.18. The ip link interface closely mirrors VXLAN: change the type and the port. Cilium supports GENEVE as an alternative tunnel protocol to its default VXLAN. FRRouting's EVPN integration is built around VXLAN interfaces, so do not assume it will manage a geneve device.

// ip link add geneve0 type geneve id 100 remote 192.168.1.20
// Same style of API as VXLAN, different default dstport (6081 for GENEVE)

GENEVE on Linux

# Create a GENEVE interface (note: GENEVE default port is 6081, not 4789)
ip link add geneve0 type geneve id 100 remote 192.168.1.20 dstport 6081
ip link set geneve0 up
ip addr add 10.0.100.1/24 dev geneve0

# FRRouting's EVPN integration discovers VNIs from VXLAN interfaces
# Do not assume advertise-all-vni will pick up a geneve device;
# treat the GENEVE interface above as a statically configured tunnel

# Cilium can be installed with GENEVE instead of VXLAN:
helm install cilium cilium/cilium \
  --namespace kube-system \
  --set routingMode=tunnel \
  --set tunnelProtocol=geneve

GENEVE is VXLAN with extensibility. VXLAN's header is fixed: 8 bytes, VNI only. GENEVE's header can carry arbitrary metadata: security tags, policy IDs, tracing context, tenant identifiers, QoS hints. For most on-premises deployments, VXLAN is fine, because you do not need per-packet metadata. If you are building something that needs metadata in the encapsulation header (like a platform that evaluates policy from the packet header), GENEVE is the correct choice. Cilium carries its numeric security identity in the tunnel header, so the receiving node can apply identity-based policy to a packet without looking the source up again. The same idea, policy metadata travelling with the packet, is available to you on bare metal kldload nodes.

9. Performance Tuning

MTU — the most common VXLAN mistake

VXLAN adds 50 bytes of overhead to every packet (outer IP 20 + outer UDP 8 + VXLAN header 8 + inner Ethernet header 14 = 50 bytes counted against the underlay MTU). WireGuard over IPv4 adds a further 60 bytes (outer IP 20 + UDP 8 + WireGuard data header 16 + Poly1305 tag 16); between IPv6 endpoints it is 80. If your physical MTU is 1500 and your inner MTU is also 1500, every full-size packet is either fragmented or dropped. Fragmentation is expensive: it requires the kernel to split and reassemble packets, burning CPU cycles and adding latency.

Physical MTU:            1500 bytes (standard Ethernet)

VXLAN only:
  VXLAN overhead:        50 bytes
  Inner MTU:             1450 bytes
  Set on bridge/VM:      ip link set br100 mtu 1450

VXLAN over WireGuard:
  VXLAN overhead:        50 bytes
  WireGuard overhead:    60 bytes
  Inner MTU:             1390 bytes
  Set on bridge/VM:      ip link set br100 mtu 1390

Jumbo frames (recommended for performance):
  Physical MTU:          9000 bytes (configure on switch + NICs)
  VXLAN overhead:        50 bytes
  Inner MTU:             8950 bytes
  Fragmentation:         never

# Set MTU on VXLAN interface and bridge
ip link set vxlan100 mtu 1450
ip link set br100 mtu 1450

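# For VXLAN over WireGuard (IPv4 between the WireGuard endpoints)
ip link set wg0 mtu 1440         # 1500 - 60 bytes of WireGuard overhead
ip link set vxlan100 mtu 1390    # 1440 - 50 bytes of VXLAN overhead
ip link set br100 mtu 1390
# With the wg-quick default of 1420 (sized for IPv6 endpoints), use 1370 instead of 1390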

# If your NIC and switch support jumbo frames:
ip link set eth0 mtu 9000
ip link set vxlan100 mtu 8950
ip link set br100 mtu 8950

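# Verify the overlay path MTU end to end
# DF bit set, payload = inner MTU - 28 (IP + ICMP headers); 1422 matches an inner MTU of 1450
ping -M do -s 1422 -c 3 10.0.100.2
# Inspect drop and error counters (the numbers sit on the line below each header row)
ip -s link show vxlan100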

Hardware offloading

# Check if your NIC supports UDP tunnel (VXLAN/GENEVE) offloads
ethtool -k eth0 | grep udp_tnl
# tx-udp_tnl-segmentation: on
# tx-udp_tnl-csum-segmentation: on

# Most modern NICs (Intel X550, Mellanox ConnectX, Broadcom BCM57504)
# can segment and checksum VXLAN-encapsulated traffic in hardware
# With these offloads on, the NIC does per-packet tunnel work the CPU would otherwise do
# CPU overhead drops significantly for high-throughput workloads

# Outer UDP checksum: do not disable the NIC's TX checksum offload to save cycles
# (that pushes checksum work back onto the CPU and usually lowers throughput)
# The outer UDP checksum is a per-interface VXLAN option (udpcsum / noudpcsum),
# chosen when the interface is created

# Inspect the options the VXLAN interface was created with
ip -d link show vxlan100

Kernel tuning for high VXLAN throughput

# Increase UDP socket buffer sizes for high-throughput VXLAN
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.core.rmem_default=16777216
sysctl -w net.core.wmem_default=16777216

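# Spread RSS (Receive Side Scaling) across RX queues if your NIC supports it
# VXLAN's outer UDP source port is hashed per-flow, enabling RSS
# The value must not exceed the NIC's RX queue count (check with: ethtool -l eth0)
ethtool -X eth0 equal $(nproc)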

# Persist in /etc/sysctl.d/10-vxlan.conf
cat > /etc/sysctl.d/10-vxlan.conf <<EOF
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
EOF
sysctl -p /etc/sysctl.d/10-vxlan.conf

The most common VXLAN performance problem is MTU. If your physical MTU is 1500 and your inner MTU is also 1500, every maximum-size packet fragments, because the 50 bytes of VXLAN overhead push it over the physical MTU. The kernel silently fragments and reassembles these packets, burning CPU cycles and adding latency on every bulk transfer. The fix is trivial: set the inner MTU to 1450 (VXLAN only) or 1390 (VXLAN over WireGuard). The better fix is jumbo frames: set the physical MTU to 9000 on your NICs and switches. The inner MTU becomes 8950, ordinary 1500-byte traffic never comes close to fragmenting, and larger frames mean fewer packets per byte transferred. If your hardware supports it, use jumbo frames in a VXLAN deployment. It takes a switch configuration change and a one-line ip link set command.
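
One way to confirm that the underlay can carry a full-size encapsulated frame is a ping to the remote VTEP with the don't-fragment flag set, sized to the inner MTU plus the VXLAN overhead. The following is a sketch only; the helper names are illustrative, and the VTEP address in the usage comment is the example address used elsewhere on this page.

```shell
VXLAN_OVERHEAD=50    # inner Ethernet 14 + VXLAN 8 + UDP 8 + outer IPv4 20
ICMP_IP_HEADERS=28   # IPv4 header 20 + ICMP header 8

# probe_size INNER_MTU
# ICMP payload size that makes the ping's IP packet as large as a
# full-size VXLAN packet carrying the given inner MTU.
probe_size() {
    echo $(( $1 + VXLAN_OVERHEAD - ICMP_IP_HEADERS ))
}

# probe_underlay VTEP_IP INNER_MTU
# Sends three pings with DF set; "message too long" or 100% loss means the
# underlay path cannot carry the encapsulated frame unfragmented.
probe_underlay() {
    ping -M do -c 3 -s "$(probe_size "$2")" "$1"
}

# Usage on a live node:
#   probe_underlay 10.255.0.2 1450
```

For an inner MTU of 1450 the probe payload is 1472 bytes, which yields a 1500-byte packet on the wire. For 1390 it is 1412 bytes, which yields a 1440-byte packet.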

10. VNI Design — Mapping Networks to Identifiers

With 16,777,216 available VNIs, there is room for a structured allocation scheme. A thoughtful scheme makes VNI numbers self-documenting and simplifies troubleshooting — a packet capture showing VNI 203 immediately tells you it is Site B application traffic.

VNI allocation strategies

VNI per tenant (multi-tenant)

Each customer or organizational unit gets its own VNI range. Traffic isolation is enforced at the VXLAN layer. Route leaking between tenants requires explicit policy. Used in public cloud and managed service environments.

// VNI 1000-1999: Tenant A
// VNI 2000-2999: Tenant B
// VNI 3000-3999: Tenant C

VNI per purpose

Separate VNIs for management, storage, application, and backup traffic. Allows per-segment QoS and security policies. All tenants share the same VNIs but traffic is logically separated at Layer 2.

// VNI 1: management (IPMI, SSH, monitoring)
// VNI 2: storage (NFS, iSCSI, Ceph)
// VNI 3: application (VMs, containers)
// VNI 4: backup (replication, snapshots)

VNI per site + purpose

Combine site encoding with purpose encoding in the VNI number. Makes multi-site deployments self-documenting. The VNI itself encodes both where the traffic belongs and what kind of traffic it is.

// First digit: site (1xx = Site A, 2xx = Site B)
// Last two digits: purpose (x01 = mgmt, x02 = storage, x03 = app)
// VNI 101 = Site A management. VNI 203 = Site B application.
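
Because the site + purpose scheme is purely positional, decoding a VNI is integer division and remainder. A minimal sketch, assuming three-digit VNIs and only the two sites and three purposes listed above (the function name describe_vni is illustrative):

```shell
# describe_vni VNI
# Decodes a three-digit VNI of the form <site digit><two-digit purpose>.
describe_vni() {
    site_digit=$(( $1 / 100 ))
    purpose_code=$(( $1 % 100 ))
    case "$site_digit" in
        1) site="Site A" ;;
        2) site="Site B" ;;
        *) site="unknown site" ;;
    esac
    case "$purpose_code" in
        1) purpose="management" ;;
        2) purpose="storage" ;;
        3) purpose="application" ;;
        *) purpose="unknown purpose" ;;
    esac
    echo "$site $purpose"
}

describe_vni 101
describe_vni 203
```

The two calls print "Site A management" and "Site B application". A helper like this is mostly useful in log post-processing, where VNIs appear as bare numbers.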

Concrete VNI scheme for a kldload multi-site deployment

VNI | Name           | Subnet        | Purpose
101 | site-a-mgmt    | 10.101.0.0/24 | Site A: IPMI, SSH, monitoring agents
102 | site-a-storage | 10.102.0.0/24 | Site A: NFS, Ceph, ZFS replication
103 | site-a-app     | 10.103.0.0/24 | Site A: VMs, containers, workloads
201 | site-b-mgmt    | 10.201.0.0/24 | Site B: IPMI, SSH, monitoring agents
202 | site-b-storage | 10.202.0.0/24 | Site B: NFS, Ceph, ZFS replication
203 | site-b-app     | 10.203.0.0/24 | Site B: VMs, containers, workloads
900 | k8s-pods       | 10.244.0.0/16 | Kubernetes pod overlay (Cilium)
999 | quarantine     | 10.99.0.0/24  | Isolated network for remediation

With 16 million possible VNIs, you have room for a sane allocation scheme. Use the first digit for site (1xx = Site A, 2xx = Site B, 3xx = Site C) and the last two digits for purpose (01 = management, 02 = storage, 03 = application). VNI 101 = Site A management. VNI 203 = Site B application. When you see VNI 203 in a packet capture, you immediately know what you are looking at without a lookup table. Self-documenting network IDs reduce the cognitive load of operating the fabric at 3am. Reserve VNI ranges above 9000 for transient or lab use: they are visually distinct from production VNIs and easy to spot in logs.
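
A scheme like this lends itself to generating interface configuration rather than typing it. The sketch below only prints commands and does not run them. It assumes the vxlanN / brN naming used on this page and a local VTEP address of 10.255.0.1; the variable and function names are illustrative.

```shell
# Site VNIs from the scheme above, one "VNI name" pair per line.
VNI_SCHEME='101 site-a-mgmt
102 site-a-storage
103 site-a-app
201 site-b-mgmt
202 site-b-storage
203 site-b-app'

# Assumed loopback address of this node's VTEP.
LOCAL_VTEP=10.255.0.1

# print_vni_commands
# Emits (does not execute) the commands that create one VXLAN interface
# and one bridge per VNI, and labels the VXLAN interface with its name.
print_vni_commands() {
    echo "$VNI_SCHEME" | while read -r vni name; do
        echo "ip link add vxlan$vni type vxlan id $vni local $LOCAL_VTEP dstport 4789 nolearning"
        echo "ip link add br$vni type bridge"
        echo "ip link set vxlan$vni master br$vni"
        echo "ip link set vxlan$vni alias $name"
    done
}

print_vni_commands
```

Review the output before applying it. Piping it to a root shell would create the interfaces, which still need to be brought up and given the MTU chosen earlier.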

11. Troubleshooting

VXLAN/EVPN failures fall into a small number of categories: the BGP session is down (EVPN routes not exchanged), the FDB is not populated (routes not translated to kernel state), the VTEP is unreachable (underlay connectivity problem), or MTU fragmentation is silently dropping large packets. Work through these in order.

Check BGP and EVPN state

# FRRouting: overall BGP status
vtysh -c "show bgp summary"
# Look for: State/PfxRcd — should show a number (routes received), not "Active" or "Idle"

# All EVPN routes in the table
vtysh -c "show bgp l2vpn evpn"

# Type 2 routes (MAC/IP bindings from remote VTEPs)
vtysh -c "show bgp l2vpn evpn route type macip"

# Type 3 routes (remote VTEP list for BUM traffic)
vtysh -c "show bgp l2vpn evpn route type multicast"

# EVPN VNI summary
vtysh -c "show evpn vni"

# EVPN MAC table for a specific VNI
vtysh -c "show evpn mac vni 100"

# EVPN neighbor (ARP) table — IP-to-MAC mappings
vtysh -c "show evpn arp-cache vni 100"
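
The State/PfxRcd rule above is simple enough to script: a purely numeric value means the session is established and shows the number of prefixes received, while anything else is a state name. A minimal sketch, where the function name peer_status is illustrative and the field value is passed in by hand rather than parsed from vtysh output:

```shell
# peer_status STATE_OR_PFXRCD
# Classifies the State/PfxRcd column of "show bgp summary".
peer_status() {
    case "$1" in
        ''|*[!0-9]*) echo "not exchanging routes (state: $1)" ;;
        *)           echo "established, $1 prefixes received" ;;
    esac
}

peer_status 12
peer_status Active
```

The first call prints "established, 12 prefixes received" and the second prints "not exchanging routes (state: Active)". An established session that reports 0 prefixes is still worth investigating, because it usually means the EVPN address family is not activated or nothing is being advertised.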

Check the kernel VXLAN state

# VXLAN forwarding database — populated by FRR from EVPN routes
bridge fdb show dev vxlan100
# Expected output:
# aa:bb:cc:dd:ee:ff dst 10.255.0.2 self permanent    ← specific MAC from Type 2 route
# 00:00:00:00:00:00 dst 10.255.0.2 self permanent    ← BUM entry from Type 3 route
# 00:00:00:00:00:00 dst 10.255.0.3 self permanent    ← BUM entry for another VTEP

# VXLAN interface details — verify VNI, local IP, port
ip -d link show vxlan100
# Look for: id 100, local 10.255.0.1, dstport 4789, nolearning

# Bridge MAC table — local MACs learned from attached VMs/namespaces
bridge fdb show br br100

# ARP/neighbor table for the bridge IP
ip neigh show dev br100
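
The all-zero FDB entries double as a list of the remote VTEPs that receive BUM traffic. As a sketch, a small awk filter can extract them. The function name remote_vteps is illustrative, and the sample input is copied from the expected output above rather than captured from a live node.

```shell
# remote_vteps
# Reads "bridge fdb show dev <vxlan>" output on stdin and prints the
# unique destination addresses of the all-zero (BUM) entries.
remote_vteps() {
    awk '$1 == "00:00:00:00:00:00" {
             for (i = 2; i < NF; i++) if ($i == "dst") print $(i + 1)
         }' | sort -u
}

# Sample input taken from the expected output shown above.
sample_fdb='aa:bb:cc:dd:ee:ff dst 10.255.0.2 self permanent
00:00:00:00:00:00 dst 10.255.0.2 self permanent
00:00:00:00:00:00 dst 10.255.0.3 self permanent'

echo "$sample_fdb" | remote_vteps

# On a live node:
#   bridge fdb show dev vxlan100 | remote_vteps
```

With the sample input this prints 10.255.0.2 and 10.255.0.3, one per line. Comparing that list against the VTEPs you expect is a quick way to spot a peer whose Type 3 route never arrived.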

Check underlay connectivity

# Can you reach the remote VTEP IP?
ping 10.255.0.2 -c 3

# If using WireGuard — is the WireGuard tunnel up?
wg show
# Look for: latest handshake within last few minutes
# If "latest handshake" is empty, the tunnel is down

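# Is UDP 4789 (VXLAN) reachable between VTEP IPs?
# On the sending side (UDP is connectionless: nc only reports failure if an
# ICMP port-unreachable comes back, so success is not proof of delivery):
nc -u -z 10.255.0.2 4789
# On the receiving side (to verify the kernel VXLAN socket is bound):
ss -ulnp | grep 4789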

Packet capture — see VXLAN headers

# Capture VXLAN traffic on the physical interface
# This shows the OUTER headers (VTEP-to-VTEP)
tcpdump -i eth0 -n -v udp port 4789

# To see the inner Ethernet frames decoded
# (-e prints MAC addresses, -c 20 stops after 20 packets):
tcpdump -i eth0 -n -e -v -c 20 udp port 4789

# Capture on the bridge interface: see the INNER frames (post-decap)
tcpdump -i br100 -n -v

# If using WireGuard, capture on wg0 to see decrypted VXLAN traffic
tcpdump -i wg0 -n -v udp port 4789

Common failure modes and fixes

Symptom | Likely cause | Fix
BGP state: Active or Idle | TCP port 179 blocked, wrong update-source | telnet <peer> 179; check update-source matches loopback IP
BGP up, no EVPN routes | Missing advertise-all-vni or VXLAN interface not up | Verify VXLAN interface is up; verify advertise-all-vni in FRR config
EVPN routes in FRR, FDB empty | VNI mismatch between peers, or Zebra not running | vtysh -c "show evpn vni"; verify zebra=yes in /etc/frr/daemons
FDB populated, pings fail | VTEP unreachable (underlay routing), MTU fragmentation | Ping VTEP IPs directly; check path MTU with ping -M do -s 1472 <vtep> (payload = underlay MTU - 28)
Small pings work, large transfers fail | MTU fragmentation silently dropping oversized packets | Reduce inner MTU; or increase physical MTU to 9000
ARP resolution slow or failing | ARP suppression not working; Type 2 routes missing IP field | vtysh -c "show evpn arp-cache vni 100"; verify VMs have IPs before FRR runs

12. The Complete Overlay Stack

Every concept in this masterclass is a layer in a single coherent architecture. Understanding each layer independently is useful. Understanding how they compose is what lets you build and operate a real fabric.

Application (VM / container / pod)
       |
   [ Bridge / br100 ]            ← Layer 2 domain for the VNI
       |
   [ VXLAN interface ]           ← Encapsulation: inner Ethernet in outer UDP/4789
       |   ↑ FDB programmed by FRR EVPN (MAC/VTEP bindings)
       |   ↑ ARP suppression: local VTEP answers ARPs from BGP knowledge
       |
   [ BGP EVPN (FRRouting) ]      ← Control plane: Type 2/3/5 route exchange
       |   ↑ distributes MAC, IP, VTEP reachability automatically
       |
   [ WireGuard (wg0) ]           ← Encrypted transport: inter-site and/or inter-node
       |   ↑ VTEP loopbacks reachable via WireGuard AllowedIPs
       |
   [ Physical NIC (eth0) ]       ← Underlay: your LAN, DC fabric, or internet

Cilium slots into this stack above the bridge, handling pod-to-pod policy and encapsulation within a Kubernetes cluster:

Pod (container)
       |
   [ Cilium eBPF dataplane ]     ← Identity-based L3/L4/L7 policy, kube-proxy replacement
       |   ↑ VXLAN or GENEVE encapsulation (Cilium-managed VNI)
       |   ↑ WireGuard per-node encryption (Cilium-managed keys)
       |
   [ Host network stack ]        ← Where FRR EVPN and host WireGuard live
       |
   [ Physical NIC ]

Comparison: manual VXLAN vs EVPN vs Cilium VXLAN vs cloud VPC

Feature | Manual VXLAN | EVPN (FRR) | Cilium VXLAN | Cloud VPC
Peer discovery | Manual | BGP automatic | Kubernetes API | Cloud control plane
MAC learning | Flood and learn | BGP Type 2 | VTEP FDB via agent | BGP EVPN internally
ARP suppression | No | Yes (local proxy) | Yes (eBPF) | Yes
Encryption | None (add WireGuard) | WireGuard underlay | WireGuard (built-in) | Provider-managed
Policy enforcement | iptables | iptables / nftables | eBPF (identity-based) | Security groups
Scales to N nodes | ~10 (N×N config) | Thousands | Thousands (per cluster) | Millions
You control it | Yes | Yes | Yes | No

The complete picture: physical NIC carries the underlay. WireGuard encrypts it between sites. VXLAN rides on top as the virtual Layer 2. BGP EVPN (FRRouting) automates every table that would otherwise be manual: the VTEP peer list, the MAC-to-VTEP mapping, the ARP suppression cache, the BUM replication list. Cilium sits above all of this for Kubernetes clusters, adding identity-based eBPF policy and its own VXLAN or GENEVE layer for pod traffic.

You now understand the overlay stack that underpins cloud-style virtual networking. The difference between your kldload deployment and a major cloud provider's VPC is mostly scale, not architecture. Providers operate their own control planes, but the design patterns are the same: an encapsulated overlay, a control plane that maps endpoints to hosts, and policy enforced at the edge. You just have fewer nodes, and you own the hardware.

Related pages