random thought: I'm kinda surprised that Linux doesn't yet have an API where you can call sendmsg() with some flag that says "if you lose this data I can regenerate the exact same data and give it to you again". then the TCP stack could discard that data as soon as it is transmitted, or also discard untransmitted data if it wants to shrink the TX window because the peer has suddenly gotten way slower.
Combined with some mechanism where, when the kernel notices that it needs to recover data it has already discarded, it can post a message to the socket's error queue saying "please give me that data again", and a sendmsg() API addition so that you can specify "this is a new copy of data I already gave you before". Userspace could then use SOF_TIMESTAMPING_TX_ACK to figure out how long it needs to stay ready to regenerate the data.

That might make it possible for servers sending mostly-static content to use larger transmit windows without having to worry so much about the memory usage of socket send buffers, since only HTTP headers and dynamic content would still consume send buffer space while the data is in flight. And if the TLS implementation were designed for use with this feature (storing mappings between TCP byte ranges and the corresponding plaintext file ranges, IVs, and MACs), that might even work for content served over HTTPS...


That might even improve the efficiency of the TCP stack on the happy path, because you wouldn't be constantly maintaining a gigantic pile of usually-unneeded socket buffers for unacknowledged data that might have to be written to RAM if they don't fit into caches...
Linux even already allows userspace to mess with the contents of the RX/TX queues of TCP sockets! But only in the special "repair" mode enabled by TCP_REPAIR + TCP_REPAIR_QUEUE, which is designed for CRIU, not for anything like this.
@jann I wonder if this would make sense to prototype in a QUIC stack in userspace.
@alwayscurious yeah, it might be easier to prototype that way, without having to worry about creating new APIs between kernel and userspace...

@jann I suspect that is one of the reasons QUIC was invented in the first place.

Instead of constantly adding new APIs, what about providing a generic, unprivileged API to send pre-built TCP packets matching any socket one owns? That would let one implement congestion control and the socket queue entirely in userspace.

@alwayscurious one downside is that you might have to switch between userspace processes more often to push a steady stream of packets into the kernel's packet scheduler and to generate ACKs?
I guess how well that works might depend a lot on whether the system is running one big server or it has a bunch of processes doing networking concurrently...
@jann I wonder how Apple’s Network.framework does it. It seems they use DMA directly to user buffers but I’m not sure how they implement that.
@alwayscurious you mean for sending (the easy case) or for receiving (the hard case that maybe works if you only want to do it for a tiny number of connections at once and the network card is sufficiently fancy; or if you're willing to do TLB flushes very frequently)?

@jann I assume you’re well aware, but just in case: I think sendfile() should get you some of that? (“If out_fd refers to a socket or pipe with zero-copy support…”)

Of course, your proposal is more flexible; one wouldn’t want to do TLS via sendfile()…

@JoachimSchipper yeah, sendfile() would get you some of that for plain HTTP, especially if you're mostly serving a small number of hot files to lots of clients. But sendfile() still keeps the file pages that are part of the retransmit queue pinned in RAM.
(And if you use kTLS with the fancy hardware offload that works with very fancy network cards, sendfile() might even give you that same behavior for HTTPS?)

@jann I assume, with no familiarity with the Linux implementation, that the pages are shared between network buffers and the page cache for disk? That seems fairly close to optimal; are you in a setting where your data in-flight is a sizable fraction of your memory?

(Back of the napkin, I get ~1.25 GB of data in-flight for a 100 Gbps server with 100ms RTT; that’s a lot, but not a large fraction of the sort of server I’d expect to find hooked up to a 100 Gbps link.)

Of course, most things can’t use sendfile() because it’s not flexible enough… but thanks for entertaining my curiosity!

@JoachimSchipper i'm not actually in any setting at all 😛, I'm just idly wondering about things. and yeah, as long as you don't have a series of clients that each successively start downloading at maximum bandwidth and then suddenly stop acking packets, I guess you're right that that probably wouldn't be a practical concern...