For more than four days, a server at the very core of the Internet’s domain name system was out of sync with its 12 root server peers due to an unexplained glitch that could have caused stability and security problems worldwide. This server, maintained by Internet carrier Cogent Communications, is one of the 13 root servers that provision the Internet’s root zone, which sits at the top of the hierarchical distributed database known as the domain name system, or DNS.

Given the crucial role root servers play in ensuring that one device can find any other device on the Internet, there are 13 of them geographically dispersed around the world. Normally, the 13 root servers, each operated by a different entity, march in lockstep. When a change is made to the contents they host, it generally propagates to all of them within a few seconds, or minutes at most.
Strange events at the C-root name server

This tight synchronization is crucial for ensuring stability. If one root server directed lookups to one intermediate server and another root server sent lookups to a different one, the Internet as we know it could collapse. More important still, root servers store the cryptographic keys necessary to authenticate some of the intermediate servers under a mechanism known as DNSSEC. If the keys aren't identical across all 13 root servers, there's an increased risk of attacks such as DNS cache poisoning.
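
As a rough illustration of that consistency requirement, the DNSKEY RRset served by two different roots can be compared directly. A minimal sketch, assuming Python with the third-party dnspython package installed (the addresses are the servers' published IPv4 addresses):

```python
# Compare the DNSKEY RRset served by two root servers; identical answers
# across all 13 are what the consistency argument above relies on.
# Assumes the third-party dnspython package: pip install dnspython
import dns.message
import dns.query

def dnskey_set(addr):
    """Fetch the root zone's DNSKEY RRset from one server as a set of strings."""
    # EDNS with a 1232-byte payload so the ~1 KB answer isn't truncated over UDP.
    query = dns.message.make_query(".", "DNSKEY", use_edns=0, payload=1232)
    response = dns.query.udp(query, addr, timeout=5)
    return {rdata.to_text() for rrset in response.answer for rdata in rrset}

a_root = dnskey_set("198.41.0.4")   # a.root-servers.net (Verisign)
c_root = dnskey_set("192.33.4.12")  # c.root-servers.net (Cogent)
print("keys match" if a_root == c_root else "keys differ!")
```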

For reasons that remain unclear outside of Cogent—which declined to comment for this post—the c-root it’s responsible for maintaining suddenly stopped updating on Saturday. Stéphane Bortzmeyer, a French engineer who was among the first to flag the problem in a Tuesday post, noted then that the c-root was three days behind the rest of the root servers.
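
The lag itself is easy to spot, because the root zone's SOA serial encodes the date on which the zone was generated (YYYYMMDDNN). A sketch of that check, again assuming dnspython:

```python
# Ask several root servers for the root zone's SOA serial; a server several
# days behind its peers stands out immediately. Assumes dnspython is installed.
import dns.message
import dns.query

ROOTS = {  # published IPv4 addresses of three of the 13 root servers
    "a.root-servers.net": "198.41.0.4",
    "c.root-servers.net": "192.33.4.12",
    "k.root-servers.net": "193.0.14.129",
}

def root_zone_serial(addr):
    query = dns.message.make_query(".", "SOA")
    response = dns.query.udp(query, addr, timeout=5)
    return response.answer[0][0].serial  # the SOA rdata carries the zone serial

for name, addr in ROOTS.items():
    print(f"{name}: serial {root_zone_serial(addr)}")
```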

https://arstechnica.com/security/2024/05/dns-glitch-that-threatened-internet-stability-fixed-cause-remains-unclear/

A root-server at the Internet’s core lost touch with its peers. We still don’t know why. (Ars Technica)

@dangoodin @bert_hubert It might be worth noting that each root "server" is actually between 6 and 345 distinct instances — but from what I've read so far, all 12 of Cogent's were impacted in the same fashion.
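
For anyone curious which anycast instance actually answered, the NSID EDNS option (RFC 5001) asks a server to identify itself. A hedged sketch, assuming dnspython:

```python
# Query c.root-servers.net with the NSID EDNS option so the answering
# anycast instance identifies itself. Assumes dnspython is installed.
import dns.edns
import dns.message
import dns.query

NSID = 3  # EDNS option code for NSID (RFC 5001)

query = dns.message.make_query(".", "SOA", use_edns=0,
                               options=[dns.edns.GenericOption(NSID, b"")])
response = dns.query.udp(query, "192.33.4.12", timeout=5)  # c.root-servers.net

for option in response.options:
    if option.otype == NSID:
        print("answering instance:", option.data.decode(errors="replace"))
```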

https://root-servers.org

Root Server Technical Operations Association

The 13 root name servers are operated by 12 independent organisations. You can find more information about each of these organisations by visiting their homepage.

@dangoodin Slight nit, there are many, many F-root instances around the world.

@dangoodin "the Internet as we know it could collapse" might be a slight over-dramatisation.

- #DNS roots are massively anycast, so mostly only Cogent's customers and peers are affected.

- root responses are cached for days, so the internet is *always* out of sync - by design! (a quick TTL check is sketched after this list)

- Update delays and caching implications are routinely monitored by DNS operators, which is why this was noticed.

- Many significant domains (apple.com, au.) haven't changed in years, so they were unaffected.
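
A quick way to see the caching point above in practice is to look at the TTLs the roots hand out; the NS set for the root zone itself carries a TTL of 518400 seconds, i.e. six days. A sketch, assuming dnspython:

```python
# Print the TTLs on the root NS RRset; long TTLs are why resolvers keep
# working from cache without revisiting the roots. Assumes dnspython.
import dns.message
import dns.query
import dns.rdatatype

query = dns.message.make_query(".", "NS")
response = dns.query.udp(query, "198.41.0.4", timeout=5)  # a.root-servers.net

for rrset in response.answer:
    print(f"{rrset.name} {dns.rdatatype.to_text(rrset.rdtype)} "
          f"TTL={rrset.ttl}s (~{rrset.ttl / 86400:.1f} days)")
```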

@markd @dangoodin

It is likely happening much more often than people think. In 2023, I had an elusive DNS error on my hands and tracked it down to 2 out of 4 "physical" servers behind 1 out of 13 root server names being out-of-date in a _single_ geography (Japan).

Any query hitting the two unfortunate hosts would get an out-of-date response, which included an expired DNSSEC signature. With an invalid DNSSEC signature, well-behaved recursive resolvers would throw any response coming from that server on the ground. In other words, DNS resolution was completely broken.
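
That expired-signature failure mode can be checked for directly: fetch a record with DNSSEC data and compare the RRSIG's expiration timestamp with the clock, which is essentially the test a validating resolver applies. A sketch, assuming dnspython:

```python
# Fetch the root SOA with DNSSEC records and check whether its RRSIG has
# expired; validating resolvers SERVFAIL on expired signatures. Assumes dnspython.
import time

import dns.message
import dns.query
import dns.rdatatype

query = dns.message.make_query(".", "SOA", want_dnssec=True)
response = dns.query.udp(query, "192.33.4.12", timeout=5)  # c.root-servers.net

for rrset in response.answer:
    if rrset.rdtype == dns.rdatatype.RRSIG:
        for rrsig in rrset:
            remaining = rrsig.expiration - int(time.time())
            print(f"RRSIG expires in {remaining}s:",
                  "valid" if remaining > 0 else "EXPIRED")
```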

Guess what? Almost nobody saw that problem. DNS is so redundant. The query would generally be retried until it hit a healthy root server, multiple times at the DNS layer and then potentially more times at the application layer. The rest of the world was oblivious to it, and even in Japan it would have been perceived as connectivity noise.

After discovering the issue, I contacted the operators of that root and they resolved it in a few hours. 0 apocalypses took place.

@tie @dangoodin Quite so.

The flip side of all the redundancy, roll-over strategies, local root instances and global DNS services such as 8.8.8.8 is that diagnosing inconsistencies is devilishly difficult and impossible from a single vantage point on the internet when anycast comes into play.

The temporal effect is also a huge complication. What we saw five minutes ago may be completely different to what we see now.

Not sure anyone has good answers to this beyond widespread monitoring.

@dangoodin @CliftonR My money is on some fraction of the Indians who can’t connect normally to many places in the US because of the Tata/Cogent spat having skillz & malice in their hearts…
It was suggested somewhere public (NANOG?) recently that Cogent’s peering practices should threaten their operation of C-Root.

(NOTE: I DO NOT AGREE)

@grumpybozo @dangoodin @CliftonR
Maybe they laid off the person in India who was delegated the manual task of updating when they shut down Tata. (They calculated it was cheaper to pay someone there than write automation code.)

@dangoodin
Lol, six days earlier, a Cogent DC had water get inside... with our server inside the DC...
Lots of services were down.

It would be very funny if there were some connection...!

@wargreen @dangoodin root servers are anycast so that's definitely not it.
@dangoodin <remembers Postel moving the roots> https://www.wired.com/2012/10/joe-postel/ (sic; wow, how did Wired botch that URL so badly?)

@dangoodin going on a little rant here, please bear with me.

If I remember correctly (and my cursory search was right; @joeklein, keep me honest here), Cogent is still refusing to peer IPv6 with Hurricane Electric without a paid relationship. Cogent and HE are two very large transit providers, especially in v6 space. (more)

@dangoodin @joeklein The HE vs Cogent battle has been going on for over a decade (maybe close to two now) and is an infrastructure problem in and of itself, since both companies are major network transit providers. I'll admit that it is somewhat motivated by the fact that HE has made it both very cheap and very easy to gain access to their network for peering and transit. (more)

@dangoodin @joeklein But the fact that Cogent is not being transparent about a piece of the global shared infrastructure (c.root-servers.net) that they control is... rather disturbing. This is something that they hold in trust for the entire community, and unlike the peering dispute, people have no real way to avoid using it.

ICANN/IANA need to hold Cogent to account for this: at least a public explanation of what happened, and of what steps are being taken to avoid it in the future. (more)

@dangoodin @joeklein Question: what does the agreement between ICANN/IANA and Cogent look like for their custodianship of c.root-servers.net? What kind of performance guarantees is Cogent supposed to make, and what kind of penalties are they subject to for non-compliance?

And it sounds like this issue was stumbled on by accident. Are there any major monitoring efforts for the root servers (especially their anycast instances) out there?

Hoo boy, this is a lot to unpack. (fin)

@bluknight @dangoodin Cogent has had 17 peering disputes since 2003, which in many cases led to disruptions for their customers and longer routes for anyone using IPv6 (he.net, Google, NTT, TATA, and many others). The blast radius includes domain name registries (domains), ISPs (routing, email [SPF, DKIM, DMARC], public resolvers, web hosting/registration), cloud providers, large enterprise/tech companies, government & education, and security/privacy services (Protonmail, NextDNS, etc.). How long are the domain names/IPv4/IPv6 records cached?