Corrosion

Corrosion is distributed service discovery based on Rust, SQLite, and CRDTs.

Fly
in case people don't read all the way to the end, the important takeaway is "you simply can't afford to do instant global state distribution" - you can formal method and Rust and test and watchdog yourself as much as you want, but you simply have to stop doing that or the unknown unknowns will just keep taking you down.
I mean, the thing we're saying is that instant global state with database-style consensus is unworkable. Instant state distribution though is kind of just... necessary? for a platform like ours. You bring up an app in Europe, proxies in Asia need to know about it to route to it. So you say, "ok, well, they can wait a minute to learn about the app, not the end of the world". Now: that same European instance goes down. Proxies in Asia need to know about that, right away, and this time you can't afford to wait.

> Proxies in Asia need to know about that, right away, and this time you can't afford to wait.

Did you ever consider envoy xDS?

There are a lot of really cool things in envoy like outlier detection, circuit breakers, load shedding, etc…

Nope. Talk a little about how Envoy's service discovery would scale to millions of apps in a global network? There's no way we found the only possible point in the solution space. Do they do something clever here?

What we (think we) know won't work is a topologically centralized database that uses distributed consensus algorithms to synchronize. Running consensus transcontinentally is very painful, and keeping the servers central, so that update proposals are local and the protocol can run quickly, subjects large portions of the network to partition risk. The natural response (what I think a lot of people do, in fact) is just to run multiple consensus clusters, but our UX includes a global namespace for customer workloads.

I haven’t personally worked on envoy xds, but it is what I have seen several BigCo’s use for routing from the edge to internal applications.

> Running consensus transcontinentally is very painful

You don’t necessarily have to do that. You can keep your quorum nodes (let's assume we are talking about etcd) far enough apart to be in separate failure domains (fires, power loss, natural disasters) but close enough that network latency isn’t unbearably high between the replicas.

I have seen the following scheme work for millions of workloads:

1. Etcd quorum across 3 close, but independent regions

2. On startup, the app registers itself under a prefix that all other replicas of the app also register under

3. All clients to that app issue etcd watches for that prefix and almost instantly will be notified when there is a change. This is baked as a plugin within grpc clients.

4. A custom grpc resolver is used to do lookups by service name
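The four steps above can be sketched with an in-memory stand-in for etcd (all names here are invented for illustration; a real deployment would use the etcd client's watch API rather than this toy registry):

```python
# Toy sketch of the register-then-watch pattern: an in-memory stand-in
# for an etcd prefix keyspace. Names and structure are illustrative only.
import threading
from collections import defaultdict

class PrefixRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}                  # key -> address
        self._watchers = defaultdict(list)  # prefix -> callbacks

    def register(self, prefix, replica_id, addr):
        """Step 2: a replica registers itself under its app's prefix."""
        key = f"{prefix}/{replica_id}"
        with self._lock:
            self._entries[key] = addr
            callbacks = list(self._watchers[prefix])
        for cb in callbacks:                # step 3: watchers hear changes
            cb("PUT", key, addr)

    def deregister(self, prefix, replica_id):
        key = f"{prefix}/{replica_id}"
        with self._lock:
            self._entries.pop(key, None)
            callbacks = list(self._watchers[prefix])
        for cb in callbacks:
            cb("DELETE", key, None)

    def watch(self, prefix, callback):
        """Step 3: clients watch a prefix for near-instant updates."""
        with self._lock:
            self._watchers[prefix].append(callback)

    def resolve(self, prefix):
        """Step 4: what a custom resolver would do -- lookup by name."""
        with self._lock:
            return [a for k, a in self._entries.items()
                    if k.startswith(prefix + "/")]

events = []
reg = PrefixRegistry()
reg.watch("/services/payments", lambda op, k, v: events.append((op, k)))
reg.register("/services/payments", "replica-1", "10.0.0.1:8080")
reg.register("/services/payments", "replica-2", "10.0.0.2:8080")
reg.deregister("/services/payments", "replica-1")
```

After the deregistration, `resolve("/services/payments")` returns only the surviving replica, and the watcher saw all three events.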

I'm thrilled to have people digging into this, because I think it's a super interesting problem, but: no, keeping quorum nodes close-enough-but-not-too-close doesn't solve our problem, because we support a unified customer namespace that runs from Tokyo to Sydney to São Paulo to Northern Virginia to London to Frankfurt to Johannesburg.

Two other details that are super important here:

This is a public cloud. There is no real correlation between apps/regions and clients. Clients are public Internet users. When you bring an app up, it just needs to work, for completely random browsers on completely random continents. Users can and do move their instances (or, more likely, reallocate instances) between regions with no notice.

The second detail is that no matter what DX compromise you make to scale global consensus up, you still need reliable realtime update of instances going down. Not knowing about a new instance that just came up isn't that big a deal! You just get less optimal routing for the request. Not knowing that an instance went down is a very big deal: you end up routing requests to dead instances.
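The asymmetry can be made concrete with a toy simulation (instance names and numbers are made up): a proxy routing from a stale view that has missed both a new instance coming up and an old one going down.

```python
# Toy illustration of why a stale "up" is benign but a stale "down" is not:
# a proxy picking targets from an out-of-date local view of instance state.
import random

def route(proxy_view, actually_live):
    """Pick an instance from the proxy's (possibly stale) view.
    Returns True if the request lands on a live instance."""
    target = random.choice(sorted(proxy_view))
    return target in actually_live

actually_live = {"fra-1", "fra-2"}   # fra-2 just came up, iad-1 just died
stale_view    = {"iad-1", "fra-1"}   # proxy has heard about neither change

random.seed(0)
outcomes = [route(stale_view, actually_live) for _ in range(1000)]
failure_rate = 1 - sum(outcomes) / len(outcomes)
# Not knowing fra-2 exists only costs optimal placement. Still believing
# iad-1 is alive sends roughly half of all requests to a dead instance.
```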

The deployment strategy you're describing is in fact what we used to do! We had a Consul cluster in North America and ran the global network off it.

> I'm thrilled to have people digging into this, because I think it's a super interesting problem

Yes, somehow this is a problem all the big companies have, but it seems like there is no standard solution and nobody has open sourced their stuff (except you)!

Taking a step back, and thinking about the AWS outage last week which was caused by a buggy bespoke system built on top of DNS, it seems like we need an IETF standard for service discovery. DNS++ if you will. I have seen lots of (ab)use of DNS for dynamic service discovery and it seems like we need a better solution which is either push based or gossip based to more quickly disseminate service discovery updates.

I work for AWS; opinions are my own and I’m not affiliated with the service team in question.

That a DNS record was deleted is tangential to the proximate cause of the incident. It was a latent bug in the control plane that updated the records, not the data plane. If the discovery protocol were DNS++ or /etc/hosts files, the same problem could have happened.

DNS has a lot of advantages: it’s a dirt cheap protocol to serve (both in terms of bytes over the wire and CPU utilization), is reasonably flexible (new RR types are added as needs warrant), isn’t filtered by middleboxes, has separate positive and negative caching, and server implementations are very robust. If you’re going to replace DNS, you’re going to have a steep hill to climb.
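The "separate positive and negative caching" point can be sketched in a few lines (a simplified model in the spirit of RFC 2308 negative caching; the names and TTLs are invented, and a real resolver derives the negative TTL from the zone's SOA record):

```python
# Minimal sketch of a resolver cache with separate positive and negative
# entries: a remembered NXDOMAIN is served from cache with its own TTL,
# just like a remembered answer.
class DnsCache:
    def __init__(self):
        self._positive = {}  # name -> (records, expiry)
        self._negative = {}  # name -> expiry

    def get(self, name, now):
        if name in self._negative and now < self._negative[name]:
            return "NXDOMAIN"            # cached negative answer
        if name in self._positive:
            records, expiry = self._positive[name]
            if now < expiry:
                return records           # cached positive answer
        return None                      # miss: caller queries upstream

    def put_positive(self, name, records, ttl, now):
        self._positive[name] = (records, now + ttl)

    def put_negative(self, name, ttl, now):
        """Negative answers get their own (SOA-derived) TTL."""
        self._negative[name] = now + ttl

cache = DnsCache()
cache.put_positive("app.example", ["203.0.113.7"], ttl=30, now=0)
cache.put_negative("gone.example", ttl=300, now=0)
```

Both kinds of entry expire independently, which is what keeps repeated lookups of nonexistent names from hammering the authoritative servers.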

I'm nodding my head to this but have to call out that DNS with "interesting" RRs is extensively filtered by middleboxes --- just none of the middleboxes AWS would deploy or allow to be deployed anywhere it peers.

> you still need reliable realtime update of instances going down

The way I have seen this implemented is through a cluster of service watchers that ping all services once every X seconds and deregister the service when the pings fail.

Additionally you can use grpc with keepalives which will detect on the client side when a service goes down and automatically remove it from the subset. Grpc also has client side outlier detection so the clients can also automatically remove slow servers from the subset as well. This only works for grpc though, so not generally useful if you are creating a cloud for HTTP servers…
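The client-side outlier-detection idea can be illustrated with a simple stand-in (this is not Envoy's or gRPC's actual algorithm; the ejection rule and thresholds here are invented): eject any backend whose average latency is far above the pool's typical latency.

```python
# Simplified stand-in for client-side outlier detection: eject backends
# whose mean latency exceeds `factor` times the median backend latency.
# (Real implementations use success rates, consecutive errors, and
# bounded ejection percentages; this just shows the core idea.)
from statistics import mean, median

def eject_slow_backends(latencies_ms, factor=5.0):
    """Return the set of backends that look like outliers."""
    means = {backend: mean(samples)
             for backend, samples in latencies_ms.items()}
    threshold = factor * median(means.values())
    return {backend for backend, m in means.items() if m > threshold}

samples = {
    "10.0.0.1": [12, 15, 11, 14],
    "10.0.0.2": [13, 12, 16, 12],
    "10.0.0.3": [480, 510, 495, 505],  # overloaded or deadlocked backend
}
# eject_slow_backends(samples) flags only 10.0.0.3 for removal.
```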

Detecting that the service went down is easy. Notifying every proxy in the fleet that it's down is not. Every proxy in the fleet cannot directly probe every application on the platform.

I believe it is possible within envoy to detect a bad backend and automatically remove it from the load balancing pool, so why can the proxy not determine that certain backend instances are unavailable and remove them from the pool? No coordination needed and it also handles other cases where the backend is bad such as overload or deadlock?

It also seems like part of your pain point is that there is an any-to-any relationship between proxy and backend, but that doesn’t need to be the case necessarily; cell-based architecture with shuffle sharding of backends between cells can help alleviate that fundamental pain. Part of the advantage of this is that config and code changes can then be rolled out cell by cell, which is much safer: if your code or configs cause a fault in a cell, it only affects a subset of the infrastructure. And if you did shuffle sharding correctly, a single cell going down should have a negligible effect.

Ok, again: this isn't a cluster of load balancers in front of a discrete collection of app servers in a data center. It's thousands of load balancers handling millions of applications scattered all over the world, with instances going up and down constantly.

The interesting part of this problem isn't noticing that an instance is down. Any load balancer can do that. The interesting problem is noticing that and then informing every proxy in the world.

I feel like a lot of what's happening in these threads is people using a mental model that they'd use for hosting one application globally, or, if not one, then a collection of applications they manage. These are customer applications. We can't assume anything about their request semantics.

> The interesting problem is noticing that and then informing every proxy in the world.

Yes, and that is why I suggested that the any-to-any relationship of proxy to application is a decision you made, and part of the pain point that led you to this solution. The fact that any proxy box can proxy to any backend is a choice, and that choice created the structure and mental model you are working within. You could batch your proxies into say 1024 cells and then assign a customer app to say 4/1024 cells using shuffle sharding. Then that decomposes the problem into maintaining state within a cell instead of globally.
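The 4-of-1024 assignment described above can be sketched deterministically (the hashing scheme here is invented for illustration; real shuffle-sharding implementations make stronger guarantees about overlap between tenants):

```python
# Sketch of shuffle-shard assignment: deterministically map an app to
# SHARD_SIZE of NUM_CELLS cells by hashing the app id with a counter.
import hashlib

NUM_CELLS = 1024
SHARD_SIZE = 4

def shuffle_shard(app_id):
    """Pick SHARD_SIZE distinct cells for an app, stable across lookups."""
    cells = []
    i = 0
    while len(cells) < SHARD_SIZE:
        digest = hashlib.sha256(f"{app_id}:{i}".encode()).digest()
        cell = int.from_bytes(digest[:4], "big") % NUM_CELLS
        if cell not in cells:
            cells.append(cell)
        i += 1
    return sorted(cells)

cells_a = shuffle_shard("app-alpha")
cells_b = shuffle_shard("app-beta")
# Two apps rarely share all four cells, so losing one cell degrades each
# tenant by at most a quarter of its capacity rather than taking it out.
```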

I'm not saying what you did was wrong or dumb; I am saying you are working within a framework that maybe you are not even consciously aware of.

Again: it's the premise of the platform. If you're saying "you picked a hard problem to work on", I guess I agree.

We cannot in fact assign our customers' apps to 0.3% of our proxies! When you deploy an app in Chicago on Fly.io, it has to work from a Sydney edge. I mean, that's part of the DX; there are deeper reasons why it would have to work that way (due to BGP4), but we don't even get there before becoming a different platform.

I think the impedance mismatch here is I am assuming we are talking about a hyperscaler cloud where it would be reasonable to have say 1024 proxies per region. Each app would be assigned to 4/1024 proxies in each region.

I have no idea how big of a compute footprint fly.io is, and maybe due to that the design I am suggesting makes no sense for you.

The design you are suggesting makes no sense for us. That's OK! It's an interesting conversation. But no, you can't fix the problem we're trying to solve with shuffle shard.