Buried in this nicely detailed RCA is a pretty damning fact:

Cloudflare left .unwrap() in mission-critical Rust code.

For non-Rustaceans: .unwrap() is a method on Rust's Result type, which is either Ok with a value or Err with an error. The whole point of Result is to make you handle errors gracefully so panics never reach production code. But .unwrap() just assumes there's a value to extract, and panics if there isn't.
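To make that concrete, here's a minimal sketch (the `parse_port` function and the fallback value are invented for illustration): `.unwrap()` would panic on bad input, while matching on the `Result` lets the service log and keep running.

```rust
use std::num::ParseIntError;

// A fallible parse: returns a Result instead of panicking.
fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    s.parse::<u16>()
}

fn main() {
    // .unwrap() panics on Err -- fine in a throwaway script,
    // fatal in a long-running service:
    // let port = parse_port("not-a-number").unwrap(); // would panic!

    // Graceful handling: log the error and fall back to a default.
    let port = match parse_port("not-a-number") {
        Ok(p) => p,
        Err(e) => {
            eprintln!("bad port in config ({e}), falling back to 8080");
            8080
        }
    };
    assert_eq!(port, 8080);
}
```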

I use .unwrap() sometimes! Usually when there's a logical guarantee that the result can never be an error. But I make sure to purge it from critical processes for exactly this reason.

https://blog.cloudflare.com/18-november-2025-outage/

@mttaggart
Isn't crashing the right choice when your rules enforcement engine can't fit all the rules into memory?
@EndlessMason It absolutely is not, especially on a service that must remain operational. Graceful handling of the error, including logging, is what should occur, not a panic.
@mttaggart
How can you gracefully handle not loading your rules table when you are a rules engine?
@EndlessMason That's not what was going on here. The load wasn't the issue. Note the append. There are any number of ways you could handle such an error, including truncating the list if you got a bounds error. But regardless, this was plainly not just a rules engine, was it? It was a critical piece of software that needed to remain up, despite external dependencies. This was not how failure should have been handled, and I don't know why you're arguing that point. The consequences of failure were plainly undesirable.
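That "truncate instead of panic" alternative could look something like this sketch (the function name and the limit are invented, not Cloudflare's actual code):

```rust
// Assumed capacity, standing in for the real preallocated limit.
const MAX_RULES: usize = 200;

// Append new rules up to the capacity; drop (and count) the excess
// instead of panicking on a bounds error.
fn append_rules(table: &mut Vec<String>, new: Vec<String>) -> usize {
    let mut dropped = 0;
    for rule in new {
        if table.len() < MAX_RULES {
            table.push(rule);
        } else {
            dropped += 1; // real code would log a warning here
        }
    }
    dropped
}

fn main() {
    let mut table = Vec::new();
    let new: Vec<String> = (0..250).map(|i| format!("rule-{i}")).collect();
    let dropped = append_rules(&mut table, new);
    assert_eq!((table.len(), dropped), (200, 50));
}
```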

@mttaggart
if you want to say reading a file is loading it, but putting the parsed results into a structure is not loading it, then yes, you win via pun

If you want to say that the service just flipping a coin when it doesn't have its bot-or-not table doesn't still count as an outage, then I don't know what to tell you.

How in the world is cf going to do 90% of what it's for if it can't decide if a ua is a bot?

@EndlessMason @mttaggart You load the new rules into a "new rule processing context"; if loading that fails, that's fine, you still have the old one. Unless you're starting cold, where one of "do nothing" or "crash" may be appropriate. Which is why you should also check the rules against a "cold-start rules engine" before sending them from generation to production, and NOT send them if parsing fails.
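The "keep the last good config" pattern described above could be sketched like this (names, the size check, and the limit are all illustrative assumptions): parse the new rules into a fresh structure, and only swap it in if parsing succeeded.

```rust
// Stand-in for the real rule representation.
struct RulesEngine {
    rules: Vec<String>,
}

const MAX_RULES: usize = 200; // assumed limit, standing in for the real one

// Parse the raw config into a fresh structure; reject it if it's too big.
fn parse_rules(raw: &str) -> Result<Vec<String>, String> {
    let rules: Vec<String> = raw.lines().map(str::to_owned).collect();
    if rules.len() > MAX_RULES {
        return Err(format!("{} rules exceeds limit of {MAX_RULES}", rules.len()));
    }
    Ok(rules)
}

impl RulesEngine {
    fn reload(&mut self, raw: &str) {
        match parse_rules(raw) {
            Ok(new_rules) => self.rules = new_rules, // swap in the good config
            Err(e) => eprintln!("keeping old rules: {e}"), // log and keep serving
        }
    }
}

fn main() {
    let mut engine = RulesEngine { rules: vec!["allow *".to_string()] };
    engine.reload(&"deny bad-bot\n".repeat(300)); // oversized: rejected
    assert_eq!(engine.rules.len(), 1); // still serving the last good rules
}
```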

@vatine @mttaggart
In this specific case the new config file was twice its normal size, and exceeded the size limit isabot is willing to load.

With your setup you have two states in production:
- all new instances of the service fail to start
- old instances of the service with just "lol, whichever rules were loaded at the time".

The old instances with the old rules are now handling more and more of the traffic as time goes on, any debugging attempt will take out a running instance, and any release that involves restarting the isabot service causes outages wherever it lands. A rollback of that change won't restore the target of that rollout, because the file is still too big.

If you roll out the service that generates the config, the running instances might pick up the new config, but only if it's small enough again.

This sounds more difficult to operate than the service going down hard and saying why it went down.

Also, having your config generation routine fail to ship updates to prod because some pipeline "kinda guesses this might be a problem" relies on massive amounts of hindsight for a start, and presents a huge risk of config propagation stalling because of some version/config mismatch between prod and the pipeline. It sounds like one of the most "can we just turn it off please" false-positive-laden annoyance factories imaginable.

@EndlessMason @mttaggart Ideally, the "end of the config pipeline" is "load it with the exact code that will load it in production". The alternative is that you ship the configuration in stages, first to a small slice of production, where you're sort-of OK with things exploding (because the blast radius is limited), and only once it's good there does it get shipped to the rest.

I have most definitely worked with production systems that used this approach, but I guess they were kind of small (I don't think we ever went north of 1.75 Mreq/s while I was working on it).