Single point, meet failure.
Eggs, meet basket.

#SPOF

1/

For those unfamiliar with the term #SPOF, it's an acronym techies use, meaning "single point of failure".

That is: if a system, no matter how distributed, duplicated, and/or redundant elsewhere, depends on a single service, person, piece of equipment, etc., to function, then when that element goes down, the entire system goes down.

It's the idea behind the old expression, "don't put all your eggs in one basket".
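
For the code-minded, here's a minimal sketch of that definition in Python. It assumes the system can be drawn as a directed graph of "A depends on reaching B" edges. The topology and the names in it (`user`, `dns`, `lb-a`, `lb-b`, `app`) are invented for illustration and don't describe any real network. A component counts as a SPOF if removing it alone cuts every path from the user to the service.

```python
from collections import deque


def reachable(graph, start, goal, removed=None):
    """Breadth-first search from start to goal, pretending `removed` is down."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, start, goal):
    """Return every intermediate node whose loss alone disconnects start from goal."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    return sorted(
        n for n in nodes - {start, goal}
        if not reachable(graph, start, goal, removed=n)
    )


# Hypothetical topology: two load balancers, but only one DNS service.
graph = {
    "user": ["dns"],
    "dns": ["lb-a", "lb-b"],
    "lb-a": ["app"],
    "lb-b": ["app"],
    "app": [],
}

print(single_points_of_failure(graph, "user", "app"))
```

The load balancers are redundant, so losing either one is survivable; the lone `dns` node is the basket holding all the eggs. Brute-force removal like this is fine for a toy graph. A real audit would also have to count people, procedures, and physical access, which no graph of servers captures.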

Facebook's concentration of its Internet infrastructure behind a single entity responsible for identifying who it is on the Internet (that's what BGP and DNS are, both of which Facebook serves itself, as well as owning its own registrar) meant that, to get to the servers in the cage at their facility, some guy had to show up with an angle-grinder.

On the one hand, that's a failure mode which may be acceptable.

Though the idea of blowing iron or aluminium filings all through my server racks gives me some cause to pause. Conductive dust and electronics tend not to play well together.

https://nitter.eu/cullend/status/1445156376934862848

#SPOF

2/end/

Cullen (@cullend)

Lmao. Friend at Facebook confirmed they ended up bringing in a guy with an angle grinder to get access to the server cage

@dredmorbius I don't think the single point of failure was them managing their own infrastructure. I think here it was pushing out a change without testing it first.
@redeagle And if there's an out-of-band secondary, how catastrophic would that failure have been?
@dredmorbius Depending on the core issue, I'm not sure that would have prevented it

@redeagle In the alternative scenario:

  • How long does the outage last?
  • Is the angle-grinder still required?
  • Bonus question: As a CTO, what level of SPOF risk analysis have you performed on your own systems and procedures, and what recommendations have you made?

    @dredmorbius The thing is that we can't answer those questions because we don't have all the information. I just find that most major outages are an operational issue rather than a technical one.

    Bonus: that's a long answer. We're a small company, and most of our operational software is SaaS. I've been working to resolve the remaining issues and improve our processes through automation.

    @redeagle My experience differs. Systems in which resilience and redundancy are factored in, where ops is seen as a risk-management practice, in which scenarios are drilled and incidents are investigated, lessons are learned, fixes and preventives deployed, and documentation updated and distributed, in time become remarkably free of spectacular foot-gunning incidents.

    Over two decades ago, I was taught not to have all my DNS, routing, monitoring, and messaging on a single system or domain.

    Move fast and break the world seems to still rule in Menlo Park.