Like, the interesting story for me here is path dependence.
I don’t think we set out to write a bidding scheduler design so much as we set out not to have dependencies on Raft consensus (run a single global high-volume Raft cluster some time and see how you end up feeling about distributed consensus).
Like, the inception of `flyd` was literally: “pull the driver code out of Nomad, and make it not depend on Raft”.
But once you do that, you can’t easily do a central planning scheduler anymore. You become Orchestration Milton Friedman. A totally different tech tree.
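(For the curious: the bid-based placement idea is small enough to sketch. None of this is flyd's or Diego's actual code — the names, types, and scoring rule are all made up — but the general shape is roughly this: each worker bids on a task using only its own local state, and the best bid wins, no central plan or consensus required.)

```go
package main

import "fmt"

// Worker is a hypothetical node that bids on incoming work.
type Worker struct {
	Name    string
	FreeMem int // MB of free memory
	FreeCPU int // free CPU shares
}

// Task is a hypothetical unit of work to be placed.
type Task struct {
	Name string
	Mem  int
	CPU  int
}

// Bid scores how well t fits on w; higher means more headroom left
// after placement. A worker that can't fit the task declines by
// returning ok=false. Only local state is consulted.
func (w Worker) Bid(t Task) (score int, ok bool) {
	if w.FreeMem < t.Mem || w.FreeCPU < t.CPU {
		return 0, false
	}
	return (w.FreeMem - t.Mem) + (w.FreeCPU - t.CPU), true
}

// Place runs a one-shot auction: every worker bids, and the
// highest-headroom bidder wins. There is no global plan to keep
// consistent, which is what lets you drop the consensus dependency.
func Place(workers []Worker, t Task) (winner string, ok bool) {
	bestScore := -1
	for _, w := range workers {
		if s, ok := w.Bid(t); ok && s > bestScore {
			winner, bestScore = w.Name, s
		}
	}
	return winner, winner != ""
}

func main() {
	workers := []Worker{
		{"small", 512, 2},
		{"big", 4096, 8},
	}
	winner, _ := Place(workers, Task{"api", 1024, 2})
	fmt.Println(winner) // "big" is the only worker that fits the task
}
```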
@tqbf This is fascinating!
Cloud Foundry had a similar journey in its orchestration system
It started with a very fancy pub/sub based system without a central orchestration node. This was hard to debug, fragile, etc.
Then it was rewritten with an auction-based central coordinator in Go, called Diego, that used etcd and Consul for state.
Then, finally, it migrated from etcd and Consul to SQL because GOOD LORD those things were a pain to run.
@tqbf The SQL layer, I should note, still runs a consensus algorithm in most production deployments, since by default it uses Galera.
A very carefully managed, hardened Galera that is not allowed to get up to any SHIT.
And getting it there took years and many painful outages and data loss incidents.
But, Cloud Foundry is designed to run workloads in data centers that can't access the internet so it's gotta bring and manage its own SQL DB.