🧵 Today I learned about NAT collisions in my Kubernetes cluster, which uses WireGuard (KubeSpan) to mesh the network between my home nodes and an edge node.
I run a Talos Linux cluster with several nodes at home and a single bare-metal edge node at OVH, connected via KubeSpan (WireGuard mesh).
All home nodes share the same public IP and advertise the same endpoint `<home_isp_ip>:51820` to the remote peer.
The mesh mostly works because each home node initiates its connection outbound and NAT assigns each one a different ephemeral source port, so WireGuard can tell them apart.
But when a tunnel drops and the OVH node needs to re-establish the connection, it only knows `<home_isp_ip>:51820` for all peers. It can't distinguish them, so recovery is unreliable and causes flapping.
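To make the collision concrete, here is a minimal Python sketch. It is only an illustration, not how KubeSpan stores peers: the node names are hypothetical, and `203.0.113.10` is a documentation-range placeholder standing in for `<home_isp_ip>`.

```python
from collections import defaultdict

HOME_ISP_IP = "203.0.113.10"  # placeholder for <home_isp_ip> (documentation range)
DEFAULT_PORT = 51820          # the port every home node advertises

# Hypothetical node names; all of them advertise the same public endpoint.
home_nodes = ["home-1", "home-2", "home-3", "home-4"]
advertised = {node: f"{HOME_ISP_IP}:{DEFAULT_PORT}" for node in home_nodes}


def peers_by_endpoint(endpoints):
    """Group peer names by the endpoint they advertise."""
    groups = defaultdict(list)
    for peer, endpoint in endpoints.items():
        groups[endpoint].append(peer)
    return dict(groups)


collisions = peers_by_endpoint(advertised)
for endpoint, peers in collisions.items():
    # One endpoint, several peers: the remote side cannot pick the right one.
    print(endpoint, "->", peers)
```

Seen from the OVH node, one address maps to four different peers, so a packet sent there can land on at most one of them.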
Fix: a unique port forward per node (51821-51824) and Talos endpoint filters to stop advertising the default :51820.
Now when a tunnel drops, the OVH node has a dedicated port to reach each home node directly.
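A rough sketch of the resulting port plan, in Python. The node names and LAN addresses are made up for illustration, `203.0.113.10` stands in for `<home_isp_ip>`, and this only models the mapping; the actual forwards live on the router and the announced endpoints in the Talos config.

```python
HOME_ISP_IP = "203.0.113.10"  # placeholder for <home_isp_ip> (documentation range)
WIREGUARD_PORT = 51820        # KubeSpan's fixed listen port on every node
FIRST_FORWARD_PORT = 51821    # first of the per-node forwarded ports

# Hypothetical home nodes and their LAN addresses.
home_nodes = {
    "home-1": "192.168.1.11",
    "home-2": "192.168.1.12",
    "home-3": "192.168.1.13",
    "home-4": "192.168.1.14",
}


def plan_forwards(nodes, first_port=FIRST_FORWARD_PORT):
    """Give each node its own public port, forwarded to its WireGuard port."""
    plan = {}
    for offset, (name, lan_ip) in enumerate(sorted(nodes.items())):
        wan_port = first_port + offset
        plan[name] = {
            "announce": f"{HOME_ISP_IP}:{wan_port}",       # endpoint the node advertises
            "forward_to": f"{lan_ip}:{WIREGUARD_PORT}",    # router port-forward target
        }
    return plan


plan = plan_forwards(home_nodes)
for name, entry in plan.items():
    print(name, entry["announce"], "->", entry["forward_to"])
```

Every announced endpoint is now unique, which is what lets the remote peer reach one specific home node.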
This would have been much simpler if KubeSpan allowed overriding WireGuard's listenPort per node.
Instead of the whole `extraAnnouncedEndpoints` + filters workaround, I could just set WireGuard's ListenPort to 5182x per node and add a simple port forward. But KubeSpan hardcodes it to 51820.
There's an existing issue about this: https://github.com/siderolabs/talos/issues/9038