random thought: I'm kinda surprised that Linux doesn't yet have an API where you can call sendmsg() with some flag that says "if you lose this data, I can regenerate the exact same data and give it to you again". Then the TCP stack could discard that data as soon as it is transmitted, or discard untransmitted data if it wants to shrink the TX window because the peer has suddenly gotten way slower.
Pair that with some mechanism where, when the kernel notices that it needs data it has already discarded (e.g. for a retransmission), it can post a message to the socket's error queue saying "please give me that data again", and a sendmsg() API addition for saying "this is a new copy of data I already gave you before". Userspace could then use SOF_TIMESTAMPING_TX_ACK to figure out how long it has to stay ready to regenerate any given chunk of data.

That might make it possible for servers sending mostly-static content to use larger transmit windows without having to worry so much about the memory usage of socket send buffers (since only HTTP headers and dynamic content would still consume send buffer space while the data is in flight). And if the TLS implementation were designed for use with this feature (storing mappings between TCP byteranges and the corresponding plaintext file ranges, IVs, and MACs), that might even work for content served over HTTPS...


@jann I wonder if this would make sense to prototype in a QUIC stack in userspace.
@alwayscurious yeah, it might be easier to prototype that way, without having to worry about creating new APIs between kernel and userspace...

@jann I suspect that is one of the reasons QUIC was invented in the first place.

Instead of constantly adding new APIs, what about providing a generic, unprivileged API for sending pre-built TCP packets that match a socket one owns? That would let one implement congestion control and the socket send queue entirely in userspace.

@alwayscurious one downside might be that you'd have to switch between userspace processes more often to push a steady stream of packets into the kernel's packet scheduler and to generate ACKs?
I guess how well that works might depend a lot on whether the system is running one big server or it has a bunch of processes doing networking concurrently...
@jann I wonder how Apple’s Network.framework does it. It seems they use DMA directly to user buffers but I’m not sure how they implement that.
@alwayscurious you mean for sending (the easy case) or for receiving (the hard case that maybe works if you only want to do it for a tiny number of connections at once and the network card is sufficiently fancy; or if you're willing to do TLB flushes very frequently)?