A bit of an unofficial post-mortem on #Optus #outage yesterday (please BOOST for visibility!) I have no insider knowledge, all I can do is look at what Optus's networking gear told the rest of the world through #BGP, and make some informed guesses based on that.

The problem yesterday started at about 4am, when Optus told the world 'I no longer have any internet connectivity', and 'Do not send any internet traffic to me, at all'. The technical description is that they withdrew ALL of their routes from the #DFZ (the Default-Free Zone, which is "The Internet" as seen by all the core routers that ACTUALLY control the internet).

However, as a precursor at about 3am there was a hint that things weren't perfect, as there was a flurry of changes from Optus to the outside world saying, roughly, 'Something has changed inside my network, but you can still keep sending me stuff'.

Now, two final bits of possibly relevant information: the default for maximum-prefix on #Cisco #ASR9000 is 1048576 (this number is 'the number of routes that can be accepted by this router'), and MOST IMPORTANTLY the DFZ ("the internet") has about 980,000 routes in it at the moment. That's only about 68k routes LESS than the default maximum.

I'd be amazed if Optus has fewer than 100k internal routes that aren't visible to the internet, but are visible internally. Add those to the DFZ routes and you're over the limit.
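To make that arithmetic concrete, here's a rough sketch using the figures quoted above. The 100k internal-route count is purely my guess from this thread, not a measurement, and `headroom` is just a helper name made up for illustration.

```python
# Figures quoted in the thread; INTERNAL_ROUTES is a guess, not a measurement.
MAX_PREFIX_LIMIT = 1_048_576   # limit quoted above for the ASR9000
DFZ_ROUTES = 980_000           # approximate size of the global table
INTERNAL_ROUTES = 100_000      # assumed internal-only routes


def headroom(limit: int, *route_counts: int) -> int:
    """Return how many more routes fit under the limit (negative = over)."""
    return limit - sum(route_counts)


print(headroom(MAX_PREFIX_LIMIT, DFZ_ROUTES))                   # 68576
print(headroom(MAX_PREFIX_LIMIT, DFZ_ROUTES, INTERNAL_ROUTES))  # -31424
```

A negative result means the router would be asked to accept more routes than the limit allows.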

So here's what I think happened. At 3am, the first core #router was upgraded, and a new config was put in place. It did not join the network correctly, and things were half broken (my guess at the cause: more than 1mil routes were visible, causing it to shut down). What SHOULD have happened is that all the changes stopped, and were either rolled back or put on hold for further investigation.

However, someone decided 'Well, maybe if we upgrade the SECOND one, that'll fix the first one' at 4am. That broke the SECOND one, and took Optus completely off the internet.

(Continued, see next for why this is far worse than it should have been)

These things that were being upgraded are called #Route #Reflectors and they are things that are SUPER CRITICAL to a big network, but also SUPER SIMPLE to have redundancy on - you just add another and tell it to talk to the other one(s). Zero complexity.

They listen to all the OTHER routers that say 'I can get to 1.1.1.1 via x.y.z then a.b.c' or 'I can get to 1.1.1.1 via 8 different paths', and consolidate them all together, and tell every OTHER router the single best path to (in this example) 1.1.1.1
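As a toy illustration of that consolidation step, here's a minimal sketch that keeps one path per destination by preferring the shortest AS path. This is a big simplification: real BGP best-path selection has many more tie-breakers (local preference, origin, MED and so on). The `Path` type, the `best_paths` function, the AS numbers and the next-hop addresses are all made up for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Path(NamedTuple):
    prefix: str               # destination, e.g. "1.1.1.0/24"
    next_hop: str             # router that advertised this path
    as_path: tuple[int, ...]  # AS numbers the path traverses


def best_paths(advertised: list[Path]) -> dict[str, Path]:
    """Pick one path per prefix, preferring the shortest AS path.

    Ties are broken on next_hop only to keep the result deterministic.
    """
    by_prefix: dict[str, list[Path]] = defaultdict(list)
    for path in advertised:
        by_prefix[path.prefix].append(path)
    return {
        prefix: min(paths, key=lambda p: (len(p.as_path), p.next_hop))
        for prefix, paths in by_prefix.items()
    }


# Two routers advertise different ways to reach the same destination.
learned = [
    Path("1.1.1.0/24", "192.0.2.1", (64496, 64497, 64498)),
    Path("1.1.1.0/24", "192.0.2.2", (64499, 64498)),
]
best = best_paths(learned)
print(best["1.1.1.0/24"].next_hop)  # 192.0.2.2
```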

They ALSO take all the internal network routes and squash them together into something that is presented to the outside world via #bgp

Basically, if you have a MASSIVE network, you need a couple of these, and they need to be reliable, and they need to be redundant, but because they're *technically simple*, it's usually not that much of a big deal to upgrade them, as long as you do them one at a time, and *THIS IS THE IMPORTANT BIT THAT I THINK THEY MISSED*, make sure that the one you have just upgraded IS ACTUALLY WORKING.
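The 'one at a time, then verify' rule can be sketched roughly like this. It is not anyone's actual tooling: `rolling_upgrade`, the `upgrade` and `is_healthy` callbacks, and the names `rr1`/`rr2` are all hypothetical stand-ins.

```python
from typing import Callable


def rolling_upgrade(
    reflectors: list[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Upgrade reflectors one at a time; stop at the first unhealthy one.

    Returns the reflectors that were upgraded AND verified healthy.
    """
    verified: list[str] = []
    for name in reflectors:
        upgrade(name)
        if not is_healthy(name):
            # Halt here: the remaining reflectors are still on the old,
            # working config, so the network keeps running.
            break
        verified.append(name)
    return verified


# Simulated run: rr1 fails its health check after the upgrade.
touched: list[str] = []
done = rolling_upgrade(
    ["rr1", "rr2"],
    upgrade=touched.append,
    is_healthy=lambda name: name != "rr1",
)
print(touched)  # ['rr1']
print(done)     # []
```

In the simulated run the second reflector is never touched, which is the whole point: a failed check on the first one leaves the second as a working fallback.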

So, at 4am, they all failed. This is a pretty serious failure, because if they're ALL DOWN, you have no network at all. You have to use your #OOB (Out of Band) network to access the routers, and fix them.

But what if your OOB network is using Optus? Well that's an issue if Optus is down, as you can't use the Optus network to fix the Optus network!

So then you need to get someone to physically attach themselves to the OOB network. But here's the NEXT problem - all Optus networking is offshore. There's almost no-one in Australia who can physically fix it.

So what do you do when your offshore outsourced network guys break your core network infrastructure, and you've retrenched everyone who can fix it locally?

You have a 7 hour outage, that's what you do.

Feel free to ask questions or tell me I'm wrong!

@xrobau to be honest, with any large multi-network outage I always just assume configuration issues at the BGP level, no matter what the provider's statement says.

but it would be nice if they were transparent about who broke it and how long it took them to identify it and start resolving it, rather than just giving start and stop times and a 'so sad it happened'.