cgb_ Oct 23

Corrosion

https://fly.io/blog/corrosion/

Corrosion is distributed service discovery based on Rust, SQLite, and CRDTs.

Fly

bananapub Oct 27

in case people don't read all the way to the end, the important takeaway is "you simply can't afford to do instant global state distribution" - you can formal method and Rust and test and watchdog yourself as much as you want, but you simply have to stop doing that or the unknown unknowns will just keep taking you down.

tptacek Oct 27

I mean, the thing we're saying is that instant global state with database-style consensus is unworkable. Instant state distribution though is kind of just... necessary? for a platform like ours. You bring up an app in Europe, proxies in Asia need to know about it to route to it. So you say, "ok, well, they can wait a minute to learn about the app, not the end of the world". Now: that same European instance goes down. Proxies in Asia need to know about that, right away, and this time you can't afford to wait.

__turbobrew__ Oct 27

> Proxies in Asia need to know about that, right away, and this time you can't afford to wait.

Did you ever consider envoy xDS?

There are a lot of really cool things in envoy like outlier detection, circuit breakers, load shedding, etc…

tptacek Oct 27

Nope. Talk a little about how Envoy's service discovery would scale to millions of apps in a global network? There's no way we found the only possible point in the solution space. Do they do something clever here?

What we (think we) know won't work is a topologically centralized database that uses distributed consensus algorithms to synchronize. Running consensus transcontinentally is very painful, and keeping the servers central, so that update proposals are local and the protocol can run quickly, subjects large portions of the network to partition risk. The natural response (what I think a lot of people do, in fact) is just to run multiple consensus clusters, but our UX includes a global namespace for customer workloads.

__turbobrew__ Oct 27

I haven’t personally worked on envoy xds, but it is what I have seen several BigCo’s use for routing from the edge to internal applications.

> Running consensus transcontinentally is very painful

You don’t necessarily have to do that, you can keep your quorum nodes (let’s assume we are talking about etcd) far enough apart to be in separate failure domains (fires, power loss, natural disasters) but close enough that network latency isn’t unbearably high between the replicas.

I have seen the following scheme work for millions of workloads:

1. Etcd quorum across 3 close, but independent regions

2. On startup, the app registers itself under a prefix that all other replicas of the app also register under

3. All clients to that app issue etcd watches for that prefix and almost instantly will be notified when there is a change. This is baked as a plugin within grpc clients.

4. A custom grpc resolver is used to do lookups by service name
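The register-under-a-prefix and watch-that-prefix steps above can be sketched roughly as follows. This is a toy in-memory stand-in, not etcd or any real client library: `PrefixRegistry`, `Resolver`, the `/services/<name>/` key layout, and the addresses are all hypothetical names chosen for illustration. In a real etcd deployment the registration would typically be tied to a lease so that a crashed replica's key expires on its own; that part is left out here.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# (kind, key, value), where kind is "PUT" or "DELETE"
Event = Tuple[str, str, str]


class PrefixRegistry:
    """Hypothetical in-memory stand-in for a key-value store with prefix watches."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._watchers: Dict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value
        self._notify(("PUT", key, value))

    def delete(self, key: str) -> None:
        if key in self._data:
            value = self._data.pop(key)
            self._notify(("DELETE", key, value))

    def get_prefix(self, prefix: str) -> Dict[str, str]:
        return {k: v for k, v in self._data.items() if k.startswith(prefix)}

    def watch_prefix(self, prefix: str, callback: Callable[[Event], None]) -> None:
        self._watchers[prefix].append(callback)

    def _notify(self, event: Event) -> None:
        key = event[1]
        # Deliver the event to every watcher whose prefix matches the key.
        for prefix, callbacks in self._watchers.items():
            if key.startswith(prefix):
                for callback in callbacks:
                    callback(event)


class Resolver:
    """Client-side view of one service, kept current by a prefix watch."""

    def __init__(self, registry: PrefixRegistry, service: str) -> None:
        prefix = f"/services/{service}/"  # assumed key layout
        # Load the current replicas once, then rely on watch events.
        self._endpoints = dict(registry.get_prefix(prefix))
        registry.watch_prefix(prefix, self._on_event)

    def _on_event(self, event: Event) -> None:
        kind, key, value = event
        if kind == "PUT":
            self._endpoints[key] = value
        else:
            self._endpoints.pop(key, None)

    def addresses(self) -> List[str]:
        return sorted(self._endpoints.values())


if __name__ == "__main__":
    registry = PrefixRegistry()
    registry.put("/services/billing/replica-1", "10.0.0.1:443")
    resolver = Resolver(registry, "billing")
    registry.put("/services/billing/replica-2", "10.0.0.2:443")
    registry.delete("/services/billing/replica-1")
    print(resolver.addresses())
```

The sketch only shows the data flow: a lookup by service name reads a local map, and the map changes when the watched prefix changes. It says nothing about how the quorum behind the store behaves across regions, which is the part under debate in this thread.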

tptacek Oct 27

I'm thrilled to have people digging into this, because I think it's a super interesting problem, but: no, keeping quorum nodes close-enough-but-not-too-close doesn't solve our problem, because we support a unified customer namespace that runs from Tokyo to Sydney to São Paulo to Northern Virginia to London to Frankfurt to Johannesburg.

Two other details that are super important here:

This is a public cloud. There is no real correlation between apps/regions and clients. Clients are public Internet users. When you bring an app up, it just needs to work, for completely random browsers on completely random continents. Users can and do move their instances (or, more likely, reallocate instances) between regions with no notice.

The second detail is that no matter what DX compromise you make to scale global consensus up, you still need reliable realtime update of instances going down. Not knowing about a new instance that just came up isn't that big a deal! You just get less optimal routing for the request. Not knowing that an instance went down is a very big deal: you end up routing requests to dead instances.
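The asymmetry between a missed "up" and a missed "down" can be illustrated with a small toy model. This is only a sketch under made-up assumptions: the `Instance` type, the `route` function, the region names, and the latency numbers are hypothetical and are not how any real proxy is implemented.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Instance:
    name: str
    latency_ms: int  # hypothetical latency from this proxy to the instance


def route(known: Iterable[Instance], alive: Set[str]) -> Tuple[Optional[Instance], bool]:
    """Pick the lowest-latency instance the proxy knows about.

    `known` is the proxy's (possibly stale) view; `alive` is ground truth.
    Returns the chosen instance and whether the request would succeed.
    """
    candidates = list(known)
    if not candidates:
        return None, False
    choice = min(candidates, key=lambda inst: inst.latency_ms)
    return choice, choice.name in alive


if __name__ == "__main__":
    fra = Instance("fra-1", latency_ms=250)
    nrt = Instance("nrt-1", latency_ms=20)

    # Missed "up": nrt-1 just started, but this proxy has not heard yet.
    # The request still succeeds, just on a slower path.
    print(route(known=[fra], alive={"fra-1", "nrt-1"}))

    # Missed "down": nrt-1 just died, but this proxy still believes in it.
    # The request is sent to a dead instance.
    print(route(known=[fra, nrt], alive={"fra-1"}))
```

In the first case stale state costs latency; in the second it costs the request, which is why propagation of "down" events is the part that has to be fast.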

The deployment strategy you're describing is in fact what we used to do! We had a Consul cluster in North America and ran the global network off it.

__turbobrew__

> I'm thrilled to have people digging into this, because I think it's a super interesting problem

Yes, somehow this is a problem all the big companies have, but it seems like there is no standard solution and nobody has open sourced their stuff (except you)!

Taking a step back, and thinking about the AWS outage last week which was caused by a buggy bespoke system built on top of DNS, it seems like we need an IETF standard for service discovery. DNS++ if you will. I have seen lots of (ab)use of DNS for dynamic service discovery and it seems like we need a better solution which is either push based or gossip based to more quickly disseminate service discovery updates.

otterley Oct 27

I work for AWS; opinions are my own and I’m not affiliated with the service team in question.

That a DNS record was deleted is tangential to the proximate cause of the incident. It was a latent bug in the control plane that updated the records, not the data plane. If the discovery protocol were DNS++ or /etc/hosts files, the same problem could have happened.

DNS has a lot of advantages: it’s a dirt cheap protocol to serve (both in terms of bytes over the wire and CPU utilization), is reasonably flexible (new RR types are added as needs warrant), isn’t filtered by middleboxes, has separate positive and negative caching, and server implementations are very robust. If you’re going to replace DNS, you’re going to have a steep hill to climb.

tptacek Oct 27

I'm nodding my head to this but have to call out that DNS with "interesting" RRs is extensively filtered by middleboxes --- just none of the middleboxes AWS would deploy or allow to be deployed anywhere it peers.