Mastodawn

Buried in this nicely-detailed RCA is a pretty damning fact:

Cloudflare left .unwrap() in mission-critical Rust code.

For non-Rustaceans, .unwrap() handles a type called Result that can either be Ok with a value, or an Err with an Error. The whole point is to gracefully handle errors and not let panics make it to production code. But unwrap() assumes there's a value to extract without safeguards.
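A minimal sketch of the difference described above. The function names and the port-parsing scenario are made up for illustration and have nothing to do with the Cloudflare code:

```rust
use std::num::ParseIntError;

/// Parsing can fail, so the signature says so with a `Result`.
fn parse_port(raw: &str) -> Result<u16, ParseIntError> {
    raw.trim().parse::<u16>()
}

/// Panics on bad input: `unwrap()` turns an `Err` into a crash.
fn port_or_crash(raw: &str) -> u16 {
    parse_port(raw).unwrap()
}

/// Handles both variants explicitly and falls back to a default.
fn port_or_default(raw: &str, default: u16) -> u16 {
    match parse_port(raw) {
        Ok(port) => port,
        Err(err) => {
            eprintln!("invalid port {raw:?}: {err}; using {default}");
            default
        }
    }
}
```

Both helpers return the same value for valid input. They only differ in what happens on the `Err` path.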

I use .unwrap() sometimes! Usually when there's a logical guarantee that the result can never be an error. But I make sure to purge it from critical processes for exactly this reason.

https://blog.cloudflare.com/18-november-2025-outage/

The Duke of Fall

@mttaggart One of the first impressions I got while digging into Rust was that *some* unwrapping was useful but LOTS of unwrapping was a code smell.

Encouraged to see that understanding appears to track.

Ted Mielczarek Nov 19

@valthonis @mttaggart my policy is to always avoid it in production code.

Mattias Eriksson 🦀🚵‍♂️⛵ Nov 20

@tedmielczarek @valthonis @mttaggart
I avoid it in production code. I still think I have one or two, where it is clear it cannot panic due to a nearby check (and other ways to handle it would just add extra complexity). But even in those cases, I always leave a comment next to it saying `// This is safe since...` for the next person editing this.

Nelson Lopez Nov 19

@mttaggart so it's basically like "try" or "catch unreachable" in zig

evil.

Flounder Nov 19

@nelson @mttaggart closer to `catch unreachable` I think. My understanding is `try` is like Rust's `?` operator, which returns the error if the preceding expression evaluates to one
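For readers coming from Zig, a small sketch of the two Rust forms being compared. The function names are hypothetical:

```rust
use std::num::ParseIntError;

/// `?` returns early with the `Err` if parsing fails, so the caller
/// decides what to do. Nothing panics here.
fn double_with_question_mark(raw: &str) -> Result<i32, ParseIntError> {
    let n: i32 = raw.parse()?;
    Ok(n * 2)
}

/// `unwrap()` has no error path: a bad input panics on the spot.
fn double_with_unwrap(raw: &str) -> i32 {
    let n: i32 = raw.parse().unwrap();
    n * 2
}
```

Note that `?` only compiles inside a function whose return type can carry the error, which is part of why it nudges code toward propagating failures.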

Nelson Lopez Nov 19

@fl0und3r @mttaggart ooo, i don’t know much about rust, thanks!

skategoat 🐐 🇵🇸 Nov 19

@mttaggart could "crash and* burn" apply here?

@mttaggart I find most unwraps are opportunities to refactor code to remove the necessity to unwrap (basically removing impossible code paths / states)

@jerome Yep, completely agreed. It's actually one of my favorite parts of revision. It really feels like you're shoring up the code.
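One way to picture that kind of refactor, as a sketch with hypothetical function names. The first version relies on a nearby check plus a comment; the second removes the impossible state so there is nothing left to unwrap:

```rust
/// Before: the check and the `unwrap()` are separate, so a later edit can
/// remove the check and leave a latent panic behind.
fn largest_checked_then_unwrapped(values: &[u32]) -> Option<u32> {
    if values.is_empty() {
        return None;
    }
    // This is safe since we returned early for the empty slice above.
    Some(*values.iter().max().unwrap())
}

/// After: the `Option` produced by `max()` is passed along directly.
fn largest(values: &[u32]) -> Option<u32> {
    values.iter().max().copied()
}
```

Both functions return the same results; the second simply has no code path that can panic.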

Space Invader Nov 19

@mttaggart TIL. Do linters flag this?

(I do not write mission critical Rust. Or even important Rust. Just learning the language, slowly. )

@spaceinvader Yes they do!

https://rust-lang.github.io/rust-clippy/master/index.html#unwrap_used

Clippy Lints

A collection of lints to catch common mistakes and improve your Rust code.
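The lint linked above is `clippy::unwrap_used`. It lives in Clippy's `restriction` group, so it is off unless a project opts in. A sketch of opting in and of a per-item override, using hypothetical function names:

```rust
// Crate-wide, the same lint is usually enabled once at the top of
// main.rs or lib.rs with `#![deny(clippy::unwrap_used)]`.
// Here it is scoped to single functions so the snippet stands alone.

/// `cargo clippy` rejects any `.unwrap()` added to this function.
#[deny(clippy::unwrap_used)]
fn parse_limit(raw: &str) -> Result<usize, std::num::ParseIntError> {
    raw.trim().parse::<usize>()
}

/// A deliberate, documented exception: the per-item override.
#[allow(clippy::unwrap_used)]
fn default_limit() -> usize {
    // This is safe since the literal below is a valid usize.
    "200".parse::<usize>().unwrap()
}
```

Plain `rustc` accepts these attributes but does not enforce them; the check only runs under `cargo clippy`, which is why it tends to belong in CI.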

Space Invader Nov 19

@mttaggart Very cool. I can barely code without a linter, and the tips have definitely made me a better coder. Things usually compile by the second try 😛.

But also… it’s concerning that something so easily detected by the simplest of static analysis tools doesn’t seem to have been in use for this very important code.

CounterVariable Nov 19

@mttaggart ROFL, it looks like they had a return type and everything.

@counterVariable append_with_names() almost certainly returned a Result. But fetch_features() returned a unit type. Not great.

CounterVariable Nov 22

@mttaggart I didn't mean to laugh at their expense. I don't think I would be immune to this either. These mistakes happen, it seems like an easy fix at least. It definitely reinforces the idea that unwraps are essentially trip mines in a codebase that will bring your entire system down.

@mttaggart If people aren't supposed to use it then it should be removed from tutorials and linters should default to alerting, otherwise the language will get the same reputation as, say, PHP because practically every tutorial I've seen about Rust teaches it the wrong way by default and rarely mentions the different correct ways.

I would consider going as far as deprecating it since it's such a massive foot-gun.

@zimzat It's a design choice to panic sometimes. It has its uses, but not in code that must never panic.

Linters absolutely have this config, and as far as tutorials, I don't think the Book could be clearer.

It sounds like you're asking a language to prevent introspection into an enum's values, which doesn't make a lot of sense to me.

https://doc.rust-lang.org/book/ch09-02-recoverable-errors-with-result.html#shortcuts-for-panic-on-error-unwrap-and-expect

Recoverable Errors with Result - The Rust Programming Language

@mttaggart To repeat:

The linters don't have it _on_ by default, meaning it's a foot-gun every newbie will hit directly into production. If it takes the wizened knowledge of the tribe to prevent that then the language needs to improve. Occasionally valid usage is no excuse, especially given the language has per-line overrides (double opt-in).

When I evaluated Rust initially and saw `unwrap` everywhere it seemed like a hot mess. Calling it `unwrap_or_panic` would make it obviously problematic.

Khleedril Nov 19

@zimzat @mttaggart I'm absolutely with this. Calling it unwrap_or_panic solves all problems.

@zimzat @mttaggart consider however that PHP ‘done wrong’ almost immediately leads to bizarre, potentially undefined, behaviours, often ripe for various exploitations.

But unwrap? It's entirely well defined, we know exactly what it'll do, and you explicitly invoke it. Is something labeled ‘crash this program if an error happened’ really a foot gun?

Should PHP ban use of `exit(1)` in a `catch` block?

@zbrown @mttaggart

> Is something labeled ‘crash this program if an error happened’ really a foot gun?

If it were actually labeled that then no, except that's only what it does and not what it's called, so yes it's a foot-gun. The lack of a default-on lint calling it out makes it doubly so.

The word `unwrap` has no connotations of "or panic", but saying `unwrap_or_panic` would make it explicit and easier to call out at a glance.
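The renaming idea can be tried in a codebase today with an extension trait. This is only a sketch: `UnwrapOrPanic` and `unwrap_or_panic` are hypothetical names and are not part of the standard library.

```rust
use std::fmt::Debug;

/// Hypothetical extension trait: not part of the standard library.
trait UnwrapOrPanic<T> {
    fn unwrap_or_panic(self) -> T;
}

impl<T, E: Debug> UnwrapOrPanic<T> for Result<T, E> {
    /// Same behavior as `Result::unwrap`, under a name that says what
    /// happens on `Err`.
    fn unwrap_or_panic(self) -> T {
        match self {
            Ok(value) => value,
            Err(err) => panic!("unwrap_or_panic called on an Err value: {err:?}"),
        }
    }
}
```

Paired with a lint that denies plain `unwrap()`, this would make every remaining panic site searchable by name, at the cost of a non-standard idiom.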

EndlessMason Nov 19

@zimzat @zbrown @mttaggart
People also got upset about "or die" being everywhere in perl, so you can't really win.

> Should PHP ban use of `exit(1)` in a `catch` block?

Probably, ideally, yes.

In a modern PHP application there should only be one `exit` in the entire application: within the equivalent of the `fn main` entry point. All other exceptions should either be handled or bubble up to that point so their trace can be logged and `main` decides what to do. There definitely shouldn't be any `exit` calls in deeply nested code.

Rust's Result is supposed to facilitate that same pattern.
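A rough sketch of that pattern in Rust, with made-up names throughout. Errors travel upward with `?`, and a single top-level function (standing in for `fn main`) decides what to log and which exit status to use:

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
enum AppError {
    MissingConfig,
    BadValue(String),
}

impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AppError::MissingConfig => write!(f, "no config value supplied"),
            AppError::BadValue(v) => write!(f, "config value {v:?} is not a number"),
        }
    }
}

/// Deeply nested code reports failure instead of exiting.
fn read_setting(raw: Option<&str>) -> Result<u32, AppError> {
    let text = raw.ok_or(AppError::MissingConfig)?;
    text.parse::<u32>()
        .map_err(|_| AppError::BadValue(text.to_string()))
}

/// Intermediate layers just pass the error along with `?`.
fn run(raw: Option<&str>) -> Result<u32, AppError> {
    let setting = read_setting(raw)?;
    Ok(setting * 2)
}

/// Stands in for `fn main`: the only place that turns an error into an
/// exit status.
fn exit_status(raw: Option<&str>) -> u8 {
    match run(raw) {
        Ok(_) => 0,
        Err(err) => {
            eprintln!("fatal: {err}");
            1
        }
    }
}
```

In a real binary, `fn main() -> std::process::ExitCode` could return `ExitCode::from(exit_status(...))`.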

@mttaggart I feel vindicated for adding a check for this in CI.

@mttaggart unwrap is for when you just wrapped something but it's too annoying to prove it to the type system, and even then it's better to just return a "this should never happen" error. those have always had a funny way of happening despite being impossible

@mttaggart Do you use `assert!`?

@lnicola In tests, but not really for production code. assert! doesn't provide graceful handlers for failure modes.

Matt Palmer Nov 19

@mttaggart oh dear. Unwrap should only be used at Christmas.

EndlessMason Nov 19

@mttaggart
Isn't crashing the right choice when your rules enforcement engine can't fit all the rules into memory?

@EndlessMason It absolutely is not, especially on a service that must remain operational. Graceful handling of the error, including logging, is what should occur, not a panic.

EndlessMason Nov 19

@mttaggart
How can you gracefully handle not loading your rules table when you are a rules engine?

@EndlessMason That's not what was going on here. The load wasn't the issue. Note the append. There are any number of ways you could handle such an error, including truncating the list if you got a bounds error. But regardless, this was plainly not just a rules engine, was it? It was a critical piece of software that needed to remain up, despite external dependencies. This was not how failure should have been handled, and I don't know why you're arguing that point. The consequences of failure were plainly undesirable.
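To illustrate the "truncate on a bounds error" option mentioned above, here is a sketch. Every name and the limit of 4 are hypothetical; this is not Cloudflare's code, and whether truncation is acceptable depends on the service:

```rust
/// Hypothetical capacity limit, chosen small for illustration.
const MAX_FEATURES: usize = 4;

#[derive(Debug, PartialEq)]
struct TooManyFeatures {
    offered: usize,
    limit: usize,
}

#[derive(Debug, Default)]
struct Features {
    names: Vec<String>,
}

impl Features {
    /// Refuses to grow past the limit and reports that as an error.
    fn append_names(&mut self, new: &[&str]) -> Result<(), TooManyFeatures> {
        let total = self.names.len() + new.len();
        if total > MAX_FEATURES {
            return Err(TooManyFeatures { offered: total, limit: MAX_FEATURES });
        }
        self.names.extend(new.iter().map(|s| s.to_string()));
        Ok(())
    }

    /// One possible handler: log, keep what fits, and carry on.
    /// Returns how many names were dropped.
    fn append_names_truncating(&mut self, new: &[&str]) -> usize {
        match self.append_names(new) {
            Ok(()) => 0,
            Err(err) => {
                eprintln!("feature list over limit ({err:?}); truncating");
                let room = MAX_FEATURES.saturating_sub(self.names.len());
                self.names.extend(new.iter().take(room).map(|s| s.to_string()));
                new.len() - room
            }
        }
    }
}
```

The point of the sketch is only that the `Err` branch is a place where a decision can be made; logging and rejecting the update would fit the same shape.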

EndlessMason Nov 19

@mttaggart
if you want to say reading a file is loading it, but putting the parsed results into a structure is not loading it, then yes, you win via pun

If you want to say that the service just flipping a coin when it doesn't have its bot-or-not table doesn't still count as an outage then I don't know what to tell you.

How in the world is cf going to do 90% of what it's for if it can't decide if a ua is a bot?

@EndlessMason @mttaggart You load the new rules into a "new rule processing context", if loading that fails, that's fine, you still have the old one. Unless you're starting cold, where one of "do nothing" or "crash" may be appropriate. Which is why you should also check the rules against a "cold-start rules engine" before sending them from generation to production, and NOT send them, if parsing them fails.
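A sketch of the "new rule processing context" idea under stated assumptions: `RuleSet`, `Engine`, and the limit of 3 are invented for illustration. The candidate is built completely before it replaces the active set, so a failed reload leaves the old rules in place:

```rust
/// Hypothetical size limit, chosen small for illustration.
const MAX_RULES: usize = 3;

#[derive(Debug, Clone, PartialEq)]
struct RuleSet {
    rules: Vec<String>,
}

impl RuleSet {
    /// Builds a fresh rule set, or fails without touching any live state.
    fn parse(source: &str) -> Result<RuleSet, String> {
        let rules: Vec<String> = source
            .lines()
            .map(str::trim)
            .filter(|line| !line.is_empty())
            .map(String::from)
            .collect();
        if rules.len() > MAX_RULES {
            return Err(format!("{} rules exceeds limit of {}", rules.len(), MAX_RULES));
        }
        Ok(RuleSet { rules })
    }
}

struct Engine {
    active: RuleSet,
}

impl Engine {
    /// Cold start: there is no previous rule set to fall back on, so the
    /// caller has to choose between "do nothing" and "crash".
    fn cold_start(source: &str) -> Result<Engine, String> {
        Ok(Engine { active: RuleSet::parse(source)? })
    }

    /// Hot reload: swap only on success; on failure keep the old rules.
    fn reload(&mut self, source: &str) -> Result<(), String> {
        let candidate = RuleSet::parse(source)?;
        self.active = candidate;
        Ok(())
    }
}
```

Running `RuleSet::parse` in the config pipeline before shipping would be the "cold-start rules engine" check from the same post, since it is the exact code path production uses.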

EndlessMason Nov 19

@vatine @mttaggart
In this specific case the new config file was twice the normal size, and exceeded the size limit isabot is willing to load.

With your setup you have two states in production:
- all new instances of the service fail to start
- old instances of the service with just "lol, whichever rules were loaded at the time".

The old instances with the old rules are now handling more and more of the traffic as time goes on, any debugging attempt will take out a running instance, and any release that involves restarting the isabot service causes outages where it lands. A rollback of that change won't restore the target of that rollout because the file is still too big.

If you roll out the service that generates the config the running instance might pick up the new config, but only if it's small enough again.

This sounds more difficult to drive than it going down hard and saying why it went down.

Also, having your config generation routine fail to ship updates to prod because some pipeline "kinda guesses this might be a problem" relies on massive amounts of hindsight for a start, and presents a huge risk of config propagation stalling because of some version/config mismatch between prod and the pipeline. It sounds like one of the most "can we just turn it off please" false-positive laden annoyance factories imaginable.

@EndlessMason @mttaggart Ideally, the "end of the config pipeline" is "load it into the exact code that loads it into production". The alternative is that you do the "ship the configuration" in stages, first to a small amount of production, where you're sort-of OK with things exploding (because it's limited), and only once it's good there does it get shipped to the rest.

I have most definitely worked with production systems that used this approach, but I guess they were kind of small (I don't think we ever breached north of 1.75 Mreq/s while I was working on it).

@EndlessMason @mttaggart or you could... treat it as an error instead of just crashing out?

EndlessMason Nov 19

@yaleman @mttaggart
How do you mean?

@EndlessMason @mttaggart

It had an error option in the result code, they could have just had a nice polite error instead of shitting the bed and dying?

EndlessMason Nov 27

@yaleman
Even if it politely logs "[error] isabot cant load the rules to decide if things are a bot or nah" (and it did that in the article) the next thing the service has to do is serve requests, which it can not do without the table...

It can either serve 500s until the table reloads in the background or actually be down until it restarts and loads the table successfully (CrashLoop-style).

Both play nice with health checks.

Both are still an outage.

Flounder Nov 19

@mttaggart I feel like all programming languages toe the line between being too permissive and everything breaks or too conservative and nobody uses it. For what it's worth I think rust does a good job striking this balance.

This is Cloudflare's fault and I really hope no one focuses their "this must never happen again" energy at Rust.

@fl0und3r 100% agreed. This is not a Rust problem and anyone who thinks Rust is supposed to prevent all programmer error misunderstands how programming works

Jippi 🇩🇰 Nov 19

@mttaggart think it's kinda stretching it to say it was "Buried" - it was highlighted and explained what it did and that it was the issue

Hartmut Seichter Nov 19

@mttaggart love #rust, but as with any programming language there is always an obvious and insanely practical way to shoot yourself in the foot or head or both.

@mttaggart it really was the laziest of lazy, they're literally in a function with an error return option and yolo'ing it with unwrap was sad to see. This is why the clippy lints denying usage of it (and expect) in production code are mandatory in my eyes.

Also https://nounwrap.yaleman.org for why I think unwrap should be removed and leave expect for when you really want to panic your code 😃

No Unwrap! - Use expect() Instead

@yaleman Ehh, I think both have their place. I sometimes really want the traceback from unwrap() more than the business logic.

@mttaggart which you can get with RUST_BACKTRACE=1 🤷
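For comparison, a sketch of the two calls being debated, with hypothetical function names and a made-up `region` config key. Both panic on a missing key, and both print a backtrace when `RUST_BACKTRACE=1` is set; the difference is the message:

```rust
use std::collections::HashMap;

/// `unwrap()` panics with the generic message
/// "called `Option::unwrap()` on a `None` value".
fn region_unwrap(config: &HashMap<String, String>) -> &str {
    config.get("region").unwrap()
}

/// `expect()` panics too, but the message states the broken assumption.
fn region_expect(config: &HashMap<String, String>) -> &str {
    config
        .get("region")
        .expect("config must contain a `region` key")
}
```

Neither form handles the error; `expect()` only documents why the author believed it could not happen.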

Tim van der Lippe Nov 20

@mttaggart we ran into the same issue with @servo where many panics are because of unwrap. We now have an effort to reduce these: https://github.com/servo/servo/issues/40744

Servo should use `unwrap()` less · Issue #40744 · servo/servo

We should gradually phase out our use of unwrap() for graceful error handling or expect(). expect() should only be used for cases where an internal invariant has been violated. All values that go t...

GitHub

gunstick Nov 20

@mttaggart idea for rust environments: a feature one can enable telling that it's in production. Then the program should drop warnings about all the unsafe parts.