Like, the interesting story for me here is path dependence.
I don’t think we set out to write a bid-based scheduler design so much as we set out not to have dependencies on Raft consensus (run a single global high-volume Raft cluster some time and see how you end up feeling about distributed consensus).
Like, the inception of `flyd` was literally: “pull the driver code out of Nomad, and make it not depend on Raft”.
But once you do that, you can’t easily do a central planning scheduler anymore. You become Orchestration Milton Friedman. A totally different tech tree.
If you were just scheduling whole apps, I think the Omega designs would have kept scaling indefinitely (we’d have ended up federating somehow).
But as a consequence of chasing this design and becoming orchestration libertarians, we’re not just scheduling apps anymore; the same scheduler design makes it super easy for us to let customers spin random VMs up, to sandbox code, to respond to web requests, to run background jobs, that sort of thing.
I don’t like hyping what we do up in articles, but I’ll do it here, apparently. :)
Here’s a paper we just should have cited in this post: Sparrow.
https://cs.stanford.edu/~matei/papers/2013/sosp_sparrow.pdf
Motivation: schedule jobs on clusters in response to HTTP queries: ✅.
Deliver sub-second scheduling by relaxing constraints, running many schedulers w/o a complete picture of available resources: ✅.
Optimize scheduling with P2C (“power of two choices”): ❌ (Sparrow does this; we don’t. We should consider it!)
Run diverse jobs without a single long-running queuing executor (i.e., running arbitrary Docker containers): ❌ (Sparrow explicitly doesn’t do this, and we have to.)
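For anyone who hasn’t read the paper: the P2C idea is small enough to sketch in a few lines. Instead of polling every worker (or none), you probe two workers chosen at random and place the job on whichever has the shorter queue. This is a toy illustration under my own naming, not Sparrow’s or fly.io’s actual code:

```python
import random

def p2c_pick(queue_lengths, rng=random):
    """Power-of-two-choices placement: probe two distinct random
    workers and return the index of the less-loaded one.

    queue_lengths: list of current queue depths, one per worker.
    (All names here are illustrative, not from any real scheduler.)
    """
    a, b = rng.sample(range(len(queue_lengths)), 2)
    return a if queue_lengths[a] <= queue_lengths[b] else b

# With only two workers, both get probed, so the choice is deterministic:
print(p2c_pick([10, 0]))  # -> 1
```

The appeal is that two probes per job get you most of the load-balancing benefit of probing everyone, which is why it composes well with many schedulers running without a complete picture of the cluster.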
@tqbf This is fascinating!
Cloud Foundry had a similar journey in its orchestration system
It started with a very fancy pub/sub-based system without a central orchestration node. This was hard to debug, fragile, etc.
Then it was rewritten with an auction-based central coordinator in Go, called Diego, that used etcd and Consul for state.
Then, finally, migrated from etcd and consul to SQL because GOOD LORD those things were a pain to run.
@tqbf If you haven't taken a look at Diego you might find it interesting as an example of a pretty production-hardened orchestrator that also makes very different choices from k8s.
https://github.com/cloudfoundry/diego-release
https://github.com/cloudfoundry/diego-design-notes
(The notes are out of date but the fundamentals haven't changed *that* much since that period. IIRC the big thing that's changed is mostly that more logic got moved into it out of the CF API.)
@tqbf Your thread prompted me to pass your hiring page on to several folks who have worked on that system and *especially* on its CLI. (Which, for reasons you noted in your article, is where a lot of the complex logic that makes Cloud Foundry powerful lives.)
Your interview process is very well-targeted for the kinds of folks I suspect you're looking for.
@tqbf The SQL, I should note, still has a consensus algorithm for most production deployments, since by default it uses Galera.
A very carefully managed, hardened Galera that is not allowed to get up to any SHIT.
And getting it there took years and many painful outages and data loss incidents.
But, Cloud Foundry is designed to run workloads in data centers that can't access the internet so it's gotta bring and manage its own SQL DB.
@tqbf Great stuff! I would really like to see a middle ground: more orchestrators with no schedulers.
Reading your description of the new stuff, I was wondering if you’ve read the Join-Idle-Queue paper. Because, while it’s not the same, it reminds me of it quite a bit.
There has been limited chat about it, but I think it has been one of the most impressive advances this past decade in scheduling/load balancing. The original paper is quite approachable, just ignore the proof.
The prevalence of dynamic-content web services, exemplified by search and online social networking, has motivated an increasingly wide web-facing front end. Horizontal scaling in the Cloud is favored for its elasticity, and distributed design of load balancers is highly desirable. Existing algorithms with a centralized design, such as Join-the-Shortest-Queue (JSQ), incur high communication overhead for […]
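The core JIQ trick is also easy to sketch: workers report themselves to a dispatcher when they go idle, so the dispatcher can hand a new job to a known-idle worker in O(1) without probing any queue lengths on the hot path, falling back to random placement only when no idle worker is known. A toy illustration (all names are mine, not the paper’s):

```python
import random
from collections import deque

class JIQDispatcher:
    """Join-Idle-Queue sketch (hypothetical API, not from the paper).

    Workers call worker_idle() when they drain their queue; dispatch()
    pops a known-idle worker if one exists, else picks at random.
    """
    def __init__(self, n_workers, rng=random):
        self.idle = deque()   # FIFO of worker ids that reported idle
        self.n = n_workers
        self.rng = rng

    def worker_idle(self, worker_id):
        self.idle.append(worker_id)

    def dispatch(self):
        if self.idle:
            return self.idle.popleft()  # O(1), no queue probing
        return self.rng.randrange(self.n)

d = JIQDispatcher(4)
d.worker_idle(2)
print(d.dispatch())  # -> 2 (the known-idle worker)
```

The nice property, as the paper argues, is that the idle-reporting happens off the critical path of job arrival, which is what makes it compelling for load balancing at web-request rates.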
Is this platform engineering?! ;)
The last time I broke orchestration, just a few weeks ago, while healing my burns by answering complaints from the people I’d disrupted, I finally asked myself:
What if CLI/GUI/API orchestrations had an unavoidable 'show/query/demo/simulate' intermediate step before any real execution?
@tqbf this is shockingly[*] similar to how Heroku’s orchestrator/scheduler (“railgun”, because originally it only deployed Rails apps) worked.
[*] not actually shocking that two groups of smart people facing a similar problem would come to similar solutions
@tqbf I've been super excited to read this write up ever since you cryptically mentioned dropping Nomad on Twitter a while back.
Having recently had a bit more production experience with Nomad and a big chunk more with Kubernetes (and having previously run a bunch of very heterogeneous workloads on ECS), I'm particularly interested in the trade-offs between the big three orchestration platforms from an end-user perspective, while also wishing I were in a role where I could afford to look beyond off-the-shelf options.