2AM

Brain: "hey, what if we needed to build an NTP server to handle 100k qps?"

Me: "What?"

@kwf good brain.
But still, dear brain: that's the domain of a not-very-large FPGA impl, I'd say. Leave the poor network engineer alone.
@kwf I know someone you need to talk to if you want to do this :)
@kwf then you have the excuse I'm always looking for to buy a https://www.leobodnar.com/shop/index.php?main_page=product_info&cPath=120&products_id=365 😁
LeoNTP Time Server 1200 : Leo Bodnar Electronics

LeoNTP model 1200 is a Stratum 1 NTP time server with a GPS-synchronised reference clock source, a custom design by Leo Bodnar Electronics. Key feature: maximum performance, reaching 100% of 100 Mbps network speed at more than 100,000 time requests per second.

@kwf Always interested in a 2am thought exercise, but ... I thought NTP's client backoff strategy would make sustaining this level of qps unnecessary, by design
@tychotithonus But what if you're handling tens of millions of clients?

@kwf Hmm, fair. Thundering herd (massive, forced synchronized restart) aside -- which should be extremely rare, and only happen if your tens of millions of clients were recovering from a massive power/comms failure that forced them all back to minpoll simultaneously ...

... after each individual peer's initial burst/minpoll flurry, settling down to maxpoll (1024 seconds, which most clients would be running at most of the time) ... I'd expect 10k qps to handle 10m peers, and 100k qps could handle 100m peers ... but would be near capacity.
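The steady-state estimate above is easy to sanity-check: clients sitting at maxpoll (2^10 = 1024 seconds) each contribute one query per 1024 seconds, so aggregate qps is roughly clients / 1024. A minimal sketch (assumed round numbers, not measurements):

```python
# Back-of-the-envelope check of the maxpoll math: N clients all polling
# every 2**10 = 1024 seconds generate about N / 1024 requests per second.

MAXPOLL_INTERVAL = 2 ** 10  # 1024 seconds, the common NTP maxpoll default


def steady_state_qps(clients: int, poll_interval: int = MAXPOLL_INTERVAL) -> float:
    """Average queries/second from `clients` all at the same poll interval."""
    return clients / poll_interval


print(steady_state_qps(10_000_000))   # ~9.8k qps for 10M peers
print(steady_state_qps(100_000_000))  # ~97.7k qps for 100M peers
```

Which is why 100k qps is "near capacity" for 100M peers: there's almost no headroom left for clients bursting at minpoll.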

But also, since local drift offset is calculated and stored on each client, and since I would expect most clients to support that quenching / Kiss-of-Death thingie ... I'd expect near-capacity conditions to be brief, absorbable, and very low impact for actual time synchronization.
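The two mechanisms mentioned here are the client's exponential poll backoff (minpoll to maxpoll) and the RFC 5905 "RATE" Kiss-of-Death code, which tells a client to slow down immediately. A hedged sketch of that client-side behavior (not ntpd's actual implementation; class and method names are illustrative):

```python
# Sketch of NTP client poll management: back off exponentially from
# minpoll toward maxpoll while the clock is stable, and jump straight
# to maxpoll if the server sends a "RATE" Kiss-of-Death code.

MINPOLL = 6   # 2**6  = 64 s
MAXPOLL = 10  # 2**10 = 1024 s


class PollState:
    def __init__(self) -> None:
        self.poll_exp = MINPOLL  # start at minpoll after restart

    def interval(self) -> int:
        return 2 ** self.poll_exp

    def on_good_sync(self) -> None:
        # Stable clock: double the poll interval, halving server load each step.
        if self.poll_exp < MAXPOLL:
            self.poll_exp += 1

    def on_kiss_of_death(self, code: str) -> None:
        # "RATE" kiss code (RFC 5905): the server is asking us to quench now.
        if code == "RATE":
            self.poll_exp = MAXPOLL


state = PollState()
for _ in range(3):
    state.on_good_sync()
print(state.interval())  # 512 (2**9) after three good syncs
state.on_kiss_of_death("RATE")
print(state.interval())  # 1024, immediately at maxpoll
```

This is why a briefly overloaded server recovers quickly: every quenched or backed-off client multiplies the headroom for the rest.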

In other words: Dr. Mills thought about this pretty hard. 😁

@kwf Hmm, though now that I think about it, persistence of local drift, vs VM instantiation and behavior, shifts this traditional assumption I'm making, too!
@tychotithonus Then layer on all sorts of abuse you see on the public WAN, and the burst capacity you want on top of a 100k qps baseline starts looking less trivial.
@kwf Totally agreed - and you have more experience with that problem surface than I do!
@kwf It turns out that's surprisingly little hardware.