Carving The Scheduler Out Of Our Orchestrator

A deep dive into container scheduling and Flyd, our new orchestrator.


Like, the interesting story for me here is path dependence.

I don’t think we set out to write a bidding scheduler design so much as we set out not to have dependencies on Raft consensus (run a single global high-volume Raft cluster some time and see how you end up feeling about distributed consensus).

Like, the inception of `flyd` was literally: “pull the driver code out of Nomad, and make it not depend on Raft”.

But once you do that, you can’t easily do a central planning scheduler anymore. You become Orchestration Milton Friedman. A totally different tech tree.

(everything past “extract driver, lose Raft” is JP’s, who I’d link to if I understood Mastodon. The design his team came up with is very elegant, and also, what’s the word I’m looking for, “rigid in a good way”, oh right RIGOROUS, which is not something you can say about any of the code I wrote in nomad-firecracker).

If you were just scheduling whole apps, I think the Omega designs would have kept scaling indefinitely (we’d have ended up federating somehow).

But as a consequence of chasing this design and becoming orchestration libertarians, we’re not just scheduling apps anymore; the same scheduler design makes it super easy for us to let customers spin random VMs up, to sandbox code, to respond to web requests, to run background jobs, that sort of thing.

I don’t like hyping what we do up in articles, but I’ll do it here, apparently. :)

Here’s a paper we really should have cited in this post: Sparrow.

https://cs.stanford.edu/~matei/papers/2013/sosp_sparrow.pdf

Motivation: schedule jobs on clusters in response to HTTP queries: ✅.

Deliver sub-second scheduling by relaxing constraints, running many schedulers w/o a complete picture of available resources: ✅.

Optimize scheduling with P2C: ❌ (Sparrow does this, we don’t. We should consider it!)

Run diverse jobs without a single long-running queuing executor (ie, running arbitrary Docker containers): ❌ (Sparrow explicitly doesn’t do this, and we have to.)
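
For anyone who hasn't met P2C (power of two choices): a minimal sketch, assuming each scheduler only knows a rough free-memory number per worker. The names `p2c_pick`, `place`, and `free_mem` are made up for illustration; this is not flyd's code, and Sparrow's actual scheme layers batch sampling and late binding on top of this basic idea.

```python
import random

def p2c_pick(free_mem, rng=random):
    """Sample two distinct workers at random; return the one with more free memory."""
    a, b = rng.sample(list(free_mem), 2)
    return a if free_mem[a] >= free_mem[b] else b

def place(free_mem, job_mem, rng=random):
    """Place a job on the better of two random workers, or return None if it won't fit."""
    worker = p2c_pick(free_mem, rng)
    if free_mem[worker] < job_mem:
        return None  # hypothetical policy: caller retries or gives up
    free_mem[worker] -= job_mem
    return worker

if __name__ == "__main__":
    fleet = {"w1": 512, "w2": 2048, "w3": 1024}  # made-up free memory in MB
    print(place(fleet, 256), fleet)
```

The point is that nobody needs a complete view of the fleet: two random probes per job are enough to avoid the worst hot spots.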

@tqbf reminds me of that old Erlang on Xen demo where they tried to build an Erlang unikernel. The demo would boot a new VM, reply to an HTTP request, and shut down in no time, back in 2009.
@tqbf The path dependence here is not the idea of pulling the driver out of Nomad, but the fact that earlier in your career you built Stockfighter.

@tqbf This is fascinating!

Cloud Foundry had a similar journey in its orchestration system.

It started with a very fancy pub/sub based system without a central orchestration node. This was hard to debug, fragile, etc.

Then they rewrote it with an auction-based central coordinator in Go, called Diego, that used etcd and Consul for state.

Then, finally, migrated from etcd and consul to SQL because GOOD LORD those things were a pain to run.

@tqbf The Auctioneer AFAIK also started out pretty sophisticated and ended up being pretty simple, because it turns out the job is *mostly* about balancing memory across the cell fleets with the kind of workloads Diego runs. It doesn't have your requirements around global distribution, though.
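
To make "mostly about balancing memory" concrete, here is a minimal sketch of a memory-only auction. It is an illustration under my own assumptions, not Diego's actual Auctioneer: the `Cell` class, the `bid` scoring rule, and the `auction` function are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    name: str
    total_mb: int
    used_mb: int = 0

    def bid(self, need_mb):
        """Score = fraction of memory in use after placement; None if the job won't fit."""
        if self.used_mb + need_mb > self.total_mb:
            return None
        return (self.used_mb + need_mb) / self.total_mb

def auction(cells, need_mb):
    """Collect bids from every cell and award the job to the lowest (least loaded) bid."""
    bids = [(cell.bid(need_mb), cell) for cell in cells]
    bids = [(score, cell) for score, cell in bids if score is not None]
    if not bids:
        return None  # nobody can host the job
    _, winner = min(bids, key=lambda pair: pair[0])
    winner.used_mb += need_mb
    return winner

if __name__ == "__main__":
    fleet = [Cell("cell-a", 1024, used_mb=512), Cell("cell-b", 1024)]
    print(auction(fleet, 256).name)
```

Anything fancier (placement tags, spreading instances of one app across cells) would show up as extra terms in the score, which is roughly where the sophistication would creep back in.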

@tqbf If you haven't taken a look at Diego you might find it interesting as an example of a pretty production-hardened orchestrator that also makes very different choices from k8s.

https://github.com/cloudfoundry/diego-release
https://github.com/cloudfoundry/diego-design-notes

(The notes are out of date but the fundamentals haven't changed *that* much since that period. IIRC the big thing that's changed is mostly that more logic got moved into it out of the CF API.)

@tqbf I can't find it now but somewhere in there they've got a set of "simulation" tests that they used to figure out the implications of choices they were making with the auctioneer.
@nat This is very cool, thank you!

@tqbf Your thread triggered me to pass your hiring page on to several folks who have worked on that system and *especially* on its CLI. (Which, for reasons you noted in your article, is where a lot of the complex logic that makes Cloud Foundry powerful lives.)

Your interview process is very well-targeted for the kinds of folks I suspect you're looking for.

@tqbf The SQL, I should note, still has a consensus algorithm for most production deployments, since by default it uses Galera.

A very carefully managed, hardened Galera that is not allowed to get up to any SHIT.

And getting it there took years and many painful outages and data loss incidents.

But, Cloud Foundry is designed to run workloads in data centers that can't access the internet so it's gotta bring and manage its own SQL DB.

@tqbf that's pretty neato! Simplistic design and a clear analogy.
@tqbf I enjoyed reading this. Thank you.
@tqbf and I finally properly understand what orchestration is, thanks for that!
@tqbf Of course you did. I have to tell you I have quite enjoyed your posts at Fly. Thank you. Very entertaining and informative.
@tqbf it may just be me, but I love the way you write with a sarcastic sort of self deprecating, but not really, voice in these.
@zaphar You’re one of the ones who likes it; it drives other people nuts.
@tqbf I can imagine it's a polarizing style.

@tqbf Great stuff! I would really like to have a middle ground of more orchestrators with no schedulers.

Reading your description of the new stuff, I was wondering if you read the Join Idle Queue paper. Cause, while not the same, it reminds me of it quite a bit.

There has been limited chat about it, but I think it has been one of the most impressive advances this past decade in scheduling/load balancing. The original paper is quite approachable, just ignore the proof.

https://www.microsoft.com/en-us/research/publication/join-idle-queue-a-novel-load-balancing-algorithm-for-dynamically-scalable-web-services/
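
A minimal sketch of the Join-Idle-Queue idea with a single dispatcher, in case it helps. The `Dispatcher` class and its method names are hypothetical; the paper has several dispatchers and discusses how an idle worker chooses which one to report to, which this sketch leaves out.

```python
import random
from collections import deque

class Dispatcher:
    """Workers tell the dispatcher when they go idle; jobs go to a known-idle worker first."""

    def __init__(self, workers, rng=random):
        self.workers = list(workers)
        self.idle = deque()  # the "I-queue" of workers that reported idle
        self.rng = rng

    def report_idle(self, worker):
        """Called by a worker when it runs out of work."""
        if worker not in self.idle:
            self.idle.append(worker)

    def assign(self):
        """Hand the next job to an idle worker if one is queued, else to a random worker."""
        if self.idle:
            return self.idle.popleft()
        return self.rng.choice(self.workers)

if __name__ == "__main__":
    d = Dispatcher(["w1", "w2", "w3"])
    d.report_idle("w2")
    print(d.assign())  # the worker that reported idle
    print(d.assign())  # idle queue is empty, so a random worker
```

The load information travels from workers to the dispatcher off the request path, so assigning a job never waits on a fleet-wide query.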


@tqbf

Is this platform engineering?! ;)

@teixi Extremely yes.

@tqbf

The last time I broke orchestration, just a few weeks ago, I was nursing my burns and answering complaints about the disruption when I finally asked myself:

What if CLI/GUI/API orchestration had a mandatory 'show/query/demo/simulate' step before any real execution?
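
A minimal sketch of what that could look like, assuming state is just a mapping from service name to version. The `plan` and `execute` functions are hypothetical and not any real orchestrator's API; the only point is that execution always goes through the plan, and nothing changes until it is confirmed.

```python
def plan(current, desired):
    """Return the actions needed to get from current to desired, without doing anything."""
    actions = []
    for name in sorted(desired.keys() - current.keys()):
        actions.append(("start", name))
    for name in sorted(current.keys() - desired.keys()):
        actions.append(("stop", name))
    for name in sorted(desired.keys() & current.keys()):
        if current[name] != desired[name]:
            actions.append(("replace", name))
    return actions

def execute(current, desired, confirmed=False):
    """Always compute the plan first; only change state when the caller confirms it."""
    actions = plan(current, desired)
    if not confirmed:
        return actions, dict(current)  # dry run: show the plan, leave state alone
    return actions, dict(desired)

if __name__ == "__main__":
    running = {"web": "v1", "worker": "v1"}
    wanted = {"web": "v2", "cron": "v1"}
    print(execute(running, wanted))                  # simulate
    print(execute(running, wanted, confirmed=True))  # really do it
```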

@tqbf this is shockingly[*] similar to how Heroku’s orchestrator/scheduler (“railgun”, because originally it only deployed Rails apps) worked.

[*] not actually shocking that two groups of smart people facing a similar problem would come to similar solutions

@tqbf this is a really good post. Stop making me want to work at fly.
@tqbf NuMad sounds great! Awesome post.
@mitchellh I cheated and just wrote what JP told me to. :)
@tqbf thank you for writing this. No thank you for making me want a 🥪
@tqbf @mitchellh holy shit!! How are you solving the service mesh problem? There’s very few shops that solve this problem and I’d love to chat about it 🙂
@sienna @mitchellh So, you can run an Envoy sidecar for all your apps if you really want to, but we use IPv6, WireGuard, and BPF to do many of the things K8s would use a “service mesh stack” to accomplish. We’re profoundly allergic to mTLS. https://fly.io/blog/incoming-6pn-private-networks/
@tqbf not sure I grok everything but always an entertaining read!
@tqbf Thanks so much for publishing these! Can’t tell you what help it is in navigating this space. Cheers from Australia!

@tqbf I've been super excited to read this write up ever since you cryptically mentioned dropping Nomad on Twitter a while back.

Having recently had a bit more production experience with Nomad and a big chunk more with Kubernetes (and having previously run a bunch of very heterogeneous workloads on ECS), I'm particularly interested in the trade-offs between the big 3 orchestration platforms from an end-user perspective, while also wishing I were in a role where I could afford to look beyond off-the-shelf.

@tqbf great article, thanks for sharing! I'm fulfilling my contractual obligations here by submitting "caremad" as a dad-joke name for flyd.
@tqbf I love the verve and panache of Fly’s blog posts but even as I admire your wordsmithing and analogies, I can’t help but wince at how needlessly difficult my non-native-English-speaking colleagues would find the posts.
@22 I don’t write that way on purpose to be panache-y or whatever; I just wouldn’t enjoy writing any other way, and if I don’t enjoy writing, I’m not doing it. I don’t think there’s a way to get this kind of post out of me without the dorky dad in-jokes.
@tqbf ah understood, nor would I ask Terry Pratchett to stop with his puns :). Loved the article!
@tqbf You had me at "You can run a Docker image as VM. You’re almost done! Time to draw the rest of the owl." 🤣 Great read. Many thanks!