I just discovered something really subtle about WireGuard... TL;DR if you are adjusting interface MTUs precisely, and you have mismatched MTUs between peers in some cases, make sure your smallest MTU is always a multiple of 16!

WireGuard header overhead is said to be 32 bytes + UDP + IP, so 80 bytes for IPv6 and 60 bytes for IPv4. That's where you get the default MTU of 1420 (1500 - 80, so it works with IPv6).

But that's not precisely true! Actually, WireGuard will add up to 15 bytes of padding to the data, to make it a multiple of 16, as long as it doesn't exceed the MTU on that side of the connection.
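The rule can be sketched in a few lines of Python (a hypothetical `padded_len` helper modeling the behavior as described, not WireGuard's actual code):

```python
def padded_len(plaintext_len: int, mtu: int) -> int:
    # Round the inner packet up to a multiple of 16 bytes,
    # but never pad past the sending interface's MTU.
    return min((plaintext_len + 15) // 16 * 16, mtu)

# A 100-byte packet is padded to 112 bytes on a 1420-MTU interface.
# A full 1420-byte packet has no headroom, so it stays 1420 bytes.
```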

So let's say you have a server with the MTU set at 1440, but you also have a client that is using IPv4 over PPPoE. So you set its MTU to 1432, subtracting the PPPoE overhead of 8 bytes. That should be fine, since the client will figure out the right path MTU for any connections, right?

Wrong!

The TCP client and server will negotiate an MSS that yields 1432-byte IP packets within the tunnel. But 1432 is not a multiple of 16! The client WireGuard instance knows it has no headroom, so it will send 1432 + 60 = 1492 byte packets, the maximum PPPoE MTU. On the way back, though, the server thinks it can go up to 1440! Since 1432 % 16 == 8, it rounds the payload up to 1440 and sends 1500 byte packets, which don't fit in PPPoE!

The fix is to either set both the client and server MTU to 1432, or to round the client MTU down to 1424 (a multiple of 16).
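The whole failure mode can be worked through numerically (a sketch: `padded_len` and `outer_len` are hypothetical helpers modeling the padding rule, with the 60-byte WireGuard-over-IPv4 overhead from above):

```python
PPPOE_MTU = 1492      # 1500 Ethernet minus 8 bytes of PPPoE overhead
WG_V4_OVERHEAD = 60   # 32 WireGuard + 8 UDP + 20 IPv4

def padded_len(inner_len: int, sender_mtu: int) -> int:
    # Pad up to a multiple of 16, capped at the sender's interface MTU.
    return min((inner_len + 15) // 16 * 16, sender_mtu)

def outer_len(inner_len: int, sender_mtu: int) -> int:
    return padded_len(inner_len, sender_mtu) + WG_V4_OVERHEAD

# Client (MTU 1432) sends a full-size packet: no room to pad, fits PPPoE.
assert outer_len(1432, 1432) == 1492
# Server (MTU 1440) replies: 1432 is padded up to 1440, giving 1500 bytes.
assert outer_len(1432, 1440) == 1500   # > 1492, doesn't fit in PPPoE
# Fix: a client MTU of 1424 (a multiple of 16) never gets padded.
assert outer_len(1424, 1440) == 1484   # fits
```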

@lina Sounds like a failure of the implementation of the autonegotiations, or not truly complying with the standards themselves somewhere.
@raulinbonn As far as I know WireGuard does not do PMTU discovery for the upper layer at all. You have to set the right MTU or you get fragmentation/etc issues.
@lina @raulinbonn WireGuard is, at many points, too simple. 😦
@Conan_Kudo @lina Yes, too simple an implementation --> poor or nonexistent autonegotiation. Burden is on the administrator/user setting up things, instead of on the software itself at runtime.

@raulinbonn @Conan_Kudo It's not actually possible to implement this properly. WireGuard supports roaming and TCP essentially does not. If MTU were autoconfigured it could change when roaming, and that would break TCP on the inner connection.

So there is really nothing "correct" that WireGuard could do. The only real solution is for the user to manually configure the lowest expected MTU.

@lina @raulinbonn @Conan_Kudo it is possible to implement this correctly. Do PMTU discovery on the outside of the tunnel and then emit the correct ICMP messages inside.
@lina @Conan_Kudo But could the user make better guesses about the lowest expected MTU than the software itself? And if the user can make a good guess, the software ought to do it at least as well, either by asking the user or by making an even better guess. But my point: I think this type of configuration burden should hardly ever be left on the user. The user should not have to worry about, or even know about, any lowest expected MTU or any of the tech stack and standards/protocols being used.

@raulinbonn @Conan_Kudo The software can't magically guess what kind of networks the user might roam on. The user is already "asked" since MTU is a configurable setting. You already have to configure tunnels manually in general. Some settings just have defaults, like this one.

If anything, perhaps GUI WireGuard tools should default to something lower like 1392, so that the most common roaming/mobile use cases (GUI machines like laptops) have a more conservative default (this value works for DS-Lite over PPPoE connections). There isn't much else you can do...
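For what it's worth, 1392 is itself a multiple of 16. My guess at the arithmetic behind that value (assuming WireGuard over IPv4, carried inside DS-Lite's IPv4-in-IPv6 tunnel, over PPPoE):

```python
ethernet = 1500
pppoe = 8      # PPPoE header overhead
dslite = 40    # DS-Lite: IPv4 carried inside an IPv6 tunnel
wg_v4 = 60     # 32 WireGuard + 8 UDP + 20 IPv4
mtu = ethernet - pppoe - dslite - wg_v4
assert mtu == 1392 and mtu % 16 == 0
```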

@lina @Conan_Kudo The user also can't magically know what kind of networks he/she might roam on. If there are issues, the user has to lower settings manually. That's something the software could lift from the user's shoulders, to some extent at least. For starters, just assume a worst-case scenario, like, let's say, 1000. Then, if a techy user knows what's going on, OK, let that user configure/optimize things manually. But for the userbase at large (just my personal philosophical stand) I think the software and its developers should lift all, or as many as possible, burdens from the user.

@raulinbonn @Conan_Kudo That's what I'm saying, that the default for systems likely to roam should be conservative. This is not necessary for setups that are non-roaming, like servers talking to each other or other fixed systems. It doesn't make sense to have a really conservative MTU for those by default.

1000 is invalid BTW. The minimum for IPv6 to work is 1280.

@lina that sounds like an utter pain to find out, lol
@lina i remember reading that bit of code for the padding.. i did not think it would create such confusing situations 
@lina Huh, I didn't know that (and the current MTU on my WireGuard connections is 1420 πŸ™„).
@lina oh, wait, that could finally explain why I had to set a different #MTU than what I had calculated/expected for #WireGuard tunnel for #dn42 behind a PPPoE host...
@lina we had a 1476-byte MTU on the PPPoE interface. I would have expected a 1396-byte MTU on the #WireGuard tunnel interface, but pings with various sizes showed it had to be 1392 instead. Does that make sense? Would the padding explain that?
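Working through the numbers, the padding does seem to explain it (a sketch, assuming the tunnel transport is IPv6, so 80 bytes of overhead, and that the remote peer's own MTU leaves it room to pad):

```python
PPPOE_MTU = 1476
WG_V6_OVERHEAD = 80   # 32 WireGuard + 8 UDP + 40 IPv6

def padded(inner_len: int) -> int:
    # A remote peer with headroom pads up to a multiple of 16.
    return (inner_len + 15) // 16 * 16

# The expected 1396-byte MTU: padding pushes the outer packet past PPPoE.
assert padded(1396) + WG_V6_OVERHEAD == 1488   # > 1476, dropped
# 1392 is already a multiple of 16, so nothing is added.
assert padded(1392) + WG_V6_OVERHEAD == 1472   # <= 1476, fits
```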
@lina thank you =^__^;;=
@lina That explains the MTU weirdness I've seen before that went away after downsizing it.
@lina interesting, I would have assumed rounding down would be more sensible if it were a requirement for the crypto to work. I guess this is also why Tailscale uses an MTU of 1280, to ensure it doesn't run into these issues.
@shironeko The rounding up to a multiple of 16 is to make traffic analysis more difficult by not leaking the exact packet lengths. The crypto itself can handle arbitrary lengths.

Will reducing the client MTU actually solve this problem in general?

I guess what happens when you do that is that the client, knowing this lower MTU, will advertise a smaller MSS, so the server sends packets which are small enough to fit in the tunnel.
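Roughly like this (a sketch; the MSS is just the tunnel MTU minus the IP and TCP headers, assuming no TCP options):

```python
def tcp_mss(tunnel_mtu: int, ipv6: bool = False) -> int:
    # MSS = MTU minus IP header (20 for IPv4, 40 for IPv6)
    # minus the 20-byte TCP header.
    return tunnel_mtu - (60 if ipv6 else 40)

# With the client's tunnel MTU lowered to 1424, it advertises an MSS
# that keeps the server's inner IPv4 packets at 1424 bytes or less.
assert tcp_mss(1424) == 1384
```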

However, I can think of a couple of scenarios where that would not help. If the protocol being used isn't TCP but some other protocol, such as one based on UDP, then you don't get the MSS negotiation. An application-layer negotiation could help if the application-layer protocol supports that. But then the application protocol also has to keep UDP payloads small enough; it couldn't take advantage of IP fragmentation, since the IP layer is not going to know about the negotiated size.

Another scenario that won't work is if the client side is more than a single node. For example, we need clients to run a virtual network segment for Docker which communicates through a VPN connection. In that scenario the MSS sent by the client would be based on the Docker network, not the VPN tunnel. So the client will advertise an MSS of 1440, because that fits on the Docker network, and the server will accept that as well. Neither side knows about the VPN in advance, so PMTU discovery will need to work.

I think what's supposed to happen is that when the server sends a packet which won't fit in the PPPoE connection, a packet-too-big error is returned to the server, and WireGuard would then need to take that into account for future packets. But if a misconfigured network drops the packet-too-big messages, that won't work.

@kasperd WireGuard allows fragmentation on the upper layer, so it's supposed to work with narrower upper layers (just with lower performance). The problem is that this doesn't always work. Some networks seem to just drop fragments on the floor (while doing MSS clamping, so normal TCP traffic still works) or otherwise have problems with fragments.

If you are routing things through a narrower WireGuard tunnel then either you need to make sure PMTU discovery works there, or do MSS clamping (or both).

If you absolutely need to tunnel UDP stuff that might exceed one side's MTU then I think the only real solution is to lower both peers' MTUs to a safe value for everyone, so normal IP fragmentation works in both directions within the tunnel.

In my case I do MSS clamping within the tunnel on the server end (which does route to other hosts). It's not needed on the client end since I don't use Docker or anything like that there.

RFC 9347: Aggregation and Fragmentation Mode for Encapsulating Security Payload (ESP) and Its Use for IP Traffic Flow Security (IP-TFS)

This document describes a mechanism for aggregation and fragmentation of IP packets when they are being encapsulated in Encapsulating Security Payload (ESP). This new payload type can be used for various purposes, such as decreasing encapsulation overhead for small IP packets; however, the focus in this document is to enhance IP Traffic Flow Security (IP-TFS) by adding Traffic Flow Confidentiality (TFC) to encrypted IP-encapsulated traffic. TFC is provided by obscuring the size and frequency of IP traffic using a fixed-size, constant-send-rate IPsec tunnel. The solution allows for congestion control, as well as nonconstant send-rate usage.
