Corrosion

Corrosion is a distributed service discovery system built on Rust, SQLite, and CRDTs.
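
As a rough illustration of why CRDTs let replicas agree without running consensus, here is a minimal last-writer-wins (LWW) register in Python. It is a generic textbook CRDT, not Corrosion's implementation. The class name, the (timestamp, node_id) tie-break, and the region-like node names are assumptions for illustration.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class LWWRegister:
    """Hypothetical last-writer-wins register: a value tagged with a timestamp
    and the id of the node that wrote it."""
    value: str
    timestamp: int
    node_id: str  # tie-breaker so concurrent writes resolve identically everywhere

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        # Keep the write with the larger (timestamp, node_id) pair. This merge is
        # commutative, associative, and idempotent, so replicas converge no matter
        # in which order (or how many times) they receive the same updates.
        if (other.timestamp, other.node_id) > (self.timestamp, self.node_id):
            return other
        return self


def merge_all(updates: list[LWWRegister]) -> LWWRegister:
    """Fold a non-empty list of updates into a single state."""
    return reduce(lambda acc, update: acc.merge(update), updates)


a = LWWRegister("started", 1, "ams")
b = LWWRegister("stopped", 2, "ams")
c = LWWRegister("started", 2, "nrt")  # concurrent with b; node_id breaks the tie
print(merge_all([a, b, c]) == merge_all([c, b, a]))  # True
```

The trade-off is that "last writer" is decided by timestamps rather than by agreement, so a replica can briefly hold a stale value until the newer write reaches it.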

Fly
In case people don't read all the way to the end, the important takeaway is "you simply can't afford to do instant global state distribution." You can apply formal methods, write Rust, test, and watchdog yourself as much as you want, but you simply have to stop doing it, or the unknown unknowns will just keep taking you down.
I mean, the thing we're saying is that instant global state with database-style consensus is unworkable. Instant state distribution, though, is kind of just... necessary for a platform like ours. You bring up an app in Europe, and proxies in Asia need to know about it to route to it. So you say, "OK, they can wait a minute to learn about the app; it's not the end of the world." Now that same European instance goes down. Proxies in Asia need to know about that right away, and this time you can't afford to wait.
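
A minimal sketch of why consensus-free dissemination can still be fast: in a generic push-gossip model (an assumption here; the thread doesn't say how Fly propagates updates), each informed node forwards an event such as "instance down" to a few random peers per round. The function name `gossip_rounds` and its parameters are hypothetical, and the model ignores message loss and regional latency differences.

```python
import random


def gossip_rounds(num_nodes: int, fanout: int, seed: int = 0) -> int:
    """Simulate push gossip and return the rounds needed to inform every node.

    Node 0 observes the event. In each round, every informed node sends the
    event to `fanout` peers chosen uniformly at random.
    """
    if not 1 <= fanout <= num_nodes:
        raise ValueError("fanout must be between 1 and num_nodes")
    rng = random.Random(seed)  # seeded so runs are reproducible
    informed = {0}
    rounds = 0
    while len(informed) < num_nodes:
        newly_informed = set()
        for _sender in informed:
            newly_informed.update(rng.sample(range(num_nodes), fanout))
        informed |= newly_informed
        rounds += 1
    return rounds


for n in (100, 10_000):
    print(n, gossip_rounds(n, fanout=3))
```

In this model the round count grows roughly with the logarithm of fleet size, so going from 100 to 10,000 nodes adds only a handful of rounds. In a real network each round also costs inter-region latency, which is why "right away" still means seconds rather than zero.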

> Proxies in Asia need to know about that, right away, and this time you can't afford to wait.

Did you ever consider Envoy xDS?

There are a lot of really cool things in Envoy, like outlier detection, circuit breakers, load shedding, etc.

Nope. Can you talk a little about how Envoy's service discovery would scale to millions of apps in a global network? There's no way we found the only possible point in the solution space. Do they do something clever here?

What we (think we) know won't work is a topologically centralized database that uses distributed consensus algorithms to synchronize. Running consensus transcontinentally is very painful, and keeping the servers central, so that update proposals are local and the protocol can run quickly, subjects large portions of the network to partition risk. The natural response (what I think a lot of people do, in fact) is just to run multiple consensus clusters, but our UX includes a global namespace for customer workloads.
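
The cost of transcontinental consensus can be made concrete with a little arithmetic. In a leader-based protocol such as Raft or Multi-Paxos, a write commits only after the leader hears back from enough followers to form a majority, so commit latency is bounded below by the round-trip time (RTT) to the slowest follower in the fastest majority. The function below is a hypothetical sketch of that bound; the RTT figures are made up for illustration.

```python
def commit_latency_ms(follower_rtts_ms: list[float]) -> float:
    """Network lower bound on committing one write in a leader-based consensus
    protocol (Raft / Multi-Paxos style).

    `follower_rtts_ms` holds the leader's round-trip time to each follower.
    The leader counts as one vote, so it needs acks from enough followers to
    reach a majority of the whole cluster, and the commit waits for the
    slowest of the fastest such followers.
    """
    cluster_size = len(follower_rtts_ms) + 1  # followers plus the leader
    majority = cluster_size // 2 + 1
    acks_needed = majority - 1  # the leader's own vote costs nothing
    if acks_needed == 0:
        return 0.0  # single-node "cluster": nothing to wait for
    return sorted(follower_rtts_ms)[acks_needed - 1]


same_datacenter = [1.0, 1.0, 2.0, 2.0]          # hypothetical RTTs, ms
transcontinental = [70.0, 150.0, 200.0, 230.0]  # hypothetical RTTs, ms
print(commit_latency_ms(same_datacenter))   # 1.0
print(commit_latency_ms(transcontinental))  # 150.0
```

Under these assumed numbers, spreading a five-node cluster across continents makes every write wait at least 150 ms, before counting the client's own trip to the leader, disk syncs, or retries during a partition.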

Is it actually necessary to run transcontinental consensus? Apps in a given location are not movable, so it would seem that for a given app it's known which part of the network writes can come from. That would require partitioning the namespace, but given that apps are not movable, does that matter? It feels like there are other areas, like docs and tooling, that would benefit from relatively higher prioritization.
Apps in a given location are extremely movable! That's the point of the service!
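
The partitioning proposal in the question above can be sketched as a directory that assigns each app a single home region that serializes its writes. Everything here (the class name, the first-writer-becomes-home rule, the region codes) is hypothetical. It only shows where the reply's objection lands: writes stay local only while they originate in the home region, and every move is an ownership transfer that itself needs coordination.

```python
from dataclasses import dataclass, field


@dataclass
class HomeRegionDirectory:
    # Hypothetical: maps each app name to the one region that serializes its writes.
    home: dict[str, str] = field(default_factory=dict)

    def owner(self, app: str, origin: str) -> str:
        """Return the region that must apply a write to `app` coming from `origin`.
        The first region to write an app becomes its home."""
        return self.home.setdefault(app, origin)

    def is_local_write(self, app: str, origin: str) -> bool:
        # Cheap only while writes keep coming from the home region; anything
        # else pays a cross-region round trip to the owner.
        return self.owner(app, origin) == origin

    def move(self, app: str, new_home: str) -> None:
        # Re-homing is itself an ownership handoff; a real system would need
        # the old and new owners to agree on it, which is the hard part.
        self.home[app] = new_home


directory = HomeRegionDirectory()
print(directory.is_local_write("web", "ams"))  # True: ams becomes the home
print(directory.is_local_write("web", "nrt"))  # False: must forward to ams
```

If apps really are "extremely movable," the `move` path stops being rare, and the coordination the partitioning was meant to avoid comes back through the handoffs.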