I'm very late to comment on this, but holy shit I'm shocked at Cloudflare's recent postmortem.

First off, the finger-pointy tone of the post doesn't sit well with me. While their provider clearly made mistakes in terms of communications, you have to own your own availability in the way you engineer around your providers' limitations and mistakes.

1/2

Second, I would expect Cloudflare to do a MUCH better job insulating(?) their architectural decisions from assumptions about the underlying DC infra and power grid.

Finally, for the love of glob, why the FUCK would you locate three datacenters within such a tight geographic radius AND WITHIN A SUBDUCTION ZONE and claim this prevents you from natural disasters AND meets the definition of HA???

2/2

@obfuscurity "Unfortunately, we discovered that a subset of services that were supposed to be on the high availability cluster had dependencies on services exclusively running in PDX-04."
@obfuscurity the power stuff sucked, but that's what they should be focusing on.
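
The failure mode in that quote (an "HA" service quietly depending on something that only runs in one facility) is the kind of thing a dependency audit can catch. Here is a minimal sketch, assuming you have a service dependency map and a per-service list of sites. The service names, `site-A`, `site-B`, and the function name are all hypothetical; only PDX-04 comes from the postmortem quote.

```python
from collections import deque


def single_site_dependencies(deps, sites, ha_services):
    """For each HA service, list transitive dependencies that run in fewer than two sites."""
    findings = {}
    for service in ha_services:
        seen = {service}
        queue = deque([service])
        risky = set()
        # Breadth-first walk over the dependency graph, starting at the HA service.
        while queue:
            current = queue.popleft()
            for dep in deps.get(current, ()):
                if dep in seen:
                    continue
                seen.add(dep)
                queue.append(dep)
                # A dependency living in only one site undermines the HA claim.
                if len(sites.get(dep, ())) < 2:
                    risky.add(dep)
        if risky:
            findings[service] = sorted(risky)
    return findings


# Hypothetical inventory: "dashboard" is deployed to three sites,
# but one of its transitive dependencies only runs in PDX-04.
deps = {
    "dashboard": ["auth", "config-store"],
    "auth": ["session-db"],
    "config-store": [],
    "session-db": [],
}
sites = {
    "dashboard": {"site-A", "site-B", "PDX-04"},
    "auth": {"site-A", "site-B", "PDX-04"},
    "config-store": {"site-A", "site-B", "PDX-04"},
    "session-db": {"PDX-04"},
}

print(single_site_dependencies(deps, sites, ["dashboard"]))
```

Something like this only helps if the dependency map is accurate, which is probably the hard part in practice.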
@mattray I'm genuinely surprised at how much detail was shared that reveals just how bad they are at engineering HA infra.

@obfuscurity

It fits a pattern I've seen many times. (Not saying this is what happened, just that it fits.)

The initial engineering is done by someone competent, who understands the overview.

Bits and pieces get pushed down to less knowledgeable engineers. The competent folks leave or get promoted.

You're left with folks who lack vision of the whole. They might be competent engineers, but without the overview they make bad decisions.

Sigh.

@mwl @obfuscurity exactly this, a thousand times this. I personally observed this pattern in six different national scale banks in previous jobs, and I'm watching it happen again in slow motion at my current job. (Only s/leave/are being laid off in descending order of seniority/g)

@obfuscurity I’m shocked there’s 3 data centers in Hillsboro that are considered three different locations. There aren’t 3 McDonald’s in Hillsboro.

Also they had the ability to switch services to Europe and they did but also that didn’t fix it but also did fix it? They don’t have more data centers in the US?!?

It’s a weird post mortem.

@robotdeathsquad @obfuscurity They said they have the 3 DCs "around" Hillsboro. I took that to mean "near" Hillsboro. If they really meant "in Hillsboro" then wow. It's not that big.

Even if Hillsboro was as big as Portland proper, they're all relying on PGE for power so there's a nice little single point of failure right there.

It was windy here last night and it MADE THE NEWS because it caused power outages so I dunno how anyone could think PGE could manage to keep three DCs online...

Cascadia subduction zone - Wikipedia

@obfuscurity @robotfactory right, the important part is west of the Cascades. Amazon, Google, Apple and Meta all have Oregon DCs east of the Cascades and within a very short hop from one of the biggest hydro power sources.

Now, Hillsboro does have some special infrastructure for power because of the Intel fabs there, you can see crazy power lines everywhere, but it’s still an odd place to put seemingly *all* of your US infra.

@robotfactory @robotdeathsquad Sorry, I shouldn't assume you don't already know where Hillsboro is in relation to Portland, obviously you do. I'm just shocked that regardless of which municipality the DCs are located in, they're clearly all in that SZ.
@obfuscurity @robotdeathsquad To not only place them in earthquake land but also put them within a few miles of each other AND rely on the same power provider AND same DC provider... Wow.
@obfuscurity Also worth noting: ISPs in Portland don't have many connections out to the Internet. Most backhaul up to Seattle. AND they share a lot of fiber to do it.
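
The points in this part of the thread come down to shared failure domains: sites that look independent but share the same power utility, DC provider, seismic zone, or network path. A rough sketch of how you might flag that, assuming each site is described by a small attribute map. The site names, attribute keys, and the `provider-x` value are made up for illustration; PGE, the subduction zone, and the Seattle backhaul are from the posts above.

```python
def shared_failure_domains(sites):
    """Return the attributes (and values) on which every site agrees."""
    if not sites:
        return {}
    first, *rest = list(sites.values())
    return {
        key: value
        for key, value in first.items()
        if all(other.get(key) == value for other in rest)
    }


# Hypothetical description of three nearby facilities.
sites = {
    "dc-1": {
        "power": "PGE",
        "dc_provider": "provider-x",
        "seismic_zone": "Cascadia subduction zone",
        "backhaul": "Seattle",
    },
    "dc-2": {
        "power": "PGE",
        "dc_provider": "provider-x",
        "seismic_zone": "Cascadia subduction zone",
        "backhaul": "Seattle",
    },
    "dc-3": {
        "power": "PGE",
        "dc_provider": "provider-x",
        "seismic_zone": "Cascadia subduction zone",
        "backhaul": "Seattle",
    },
}

# Every attribute returned here is a candidate single point of failure.
for attribute, value in shared_failure_domains(sites).items():
    print(f"all sites share {attribute}: {value}")
```

Any attribute that shows up in the result means the "three locations" are really one location for that class of failure.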

@obfuscurity

Shifting all non-critical engineering functions to ensuring that the control plane is fully HA will surely not cause any problems...