Like, the interesting story for me here is path dependence.
I don’t think we set out to write a bidding scheduler design so much as we set out not to have dependencies on Raft consensus (run a single global high-volume Raft cluster some time and see how you end up feeling about distributed consensus).
Like, the inception of `flyd` was literally: “pull the driver code out of Nomad, and make it not depend on Raft”.
But once you do that, you can’t easily do a central planning scheduler anymore. You become Orchestration Milton Friedman. A totally different tech tree.
@tqbf This is fascinating!
Cloud Foundry had a similar journey in its orchestration system.
It started with a very fancy pub/sub-based system without a central orchestration node. This was hard to debug, fragile, etc.
Then they rewrote it with an auction-based central coordinator in Go, called Diego, that used etcd and Consul for state.
Then, finally, they migrated from etcd and Consul to SQL because GOOD LORD those things were a pain to run.
@tqbf If you haven't taken a look at Diego you might find it interesting as an example of a pretty production-hardened orchestrator that also makes very different choices from k8s.
https://github.com/cloudfoundry/diego-release
https://github.com/cloudfoundry/diego-design-notes
(The notes are out of date but the fundamentals haven't changed *that* much since that period. IIRC the big thing that's changed is mostly that more logic got moved into it out of the CF API.)
@tqbf Your thread triggered me to pass your hiring page on to several folks who have worked on that system and *especially* on its CLI. (Which, for reasons you noted in your article, is where a lot of the complex logic that makes Cloud Foundry powerful lives.)
Your interview process is very well-targeted for the kinds of folks I suspect you're looking for.
@tqbf The SQL, I should note, still has a consensus algorithm for most production deployments, since by default it uses Galera.
A very carefully managed, hardened Galera that is not allowed to get up to any SHIT.
And getting it there took years and many painful outages and data loss incidents.
But Cloud Foundry is designed to run workloads in data centers that can't access the internet, so it's gotta bring and manage its own SQL DB.