I am looking at Dagobah today, a Data-centric Meta-scheduler from @NetflixEng. This is a 🧵 summarising it.

P.S: This is part of the reading I am doing into schedulers these days. Watch this space for more.

This one is a bit old (from 2015), and was never open-sourced. But it has some great ideas. I am relying on @MattBossenbroek's talk here to understand it:
https://www.youtube.com/watch?v=V2E1PdboYLk
Dagobah, a Data centric Meta scheduler - Matt Bossenbroek

On the Netflix search team we had many data pipelines, patched together using different technologies, which made it difficult to integrate and monitor system...

Motivations:

1. Need for data-centric pipelines emphasizing traceability, predictability, provenance, and structural sharing.

2. Encourage fine-grained, testable, functional computations. (a.k.a. transformations as pure functions).

3. Compose pipelines from multiple resources (Docker, Spark, Hive, etc).

4. Version-controlled, API centric configuration.

5. Resilience against job failures and restarts.

Salient Features:

1. Pipelines are initiated as user requests for materializing leaf nodes in a DAG.

2. Each node has an associated logical address. This address is a vector containing a target date & keywords describing the data. The address uniquely represents the node's output.

3. A node describes a data-dependency by listing the address of the data it needs.

4. The node points to the code to run (could be Spark code, a PigPen script, etc). The code itself is not part of the node definition, thus separating the concerns of orchestration and computation.

5. The request to a node can explicitly point to specific versions of its dependencies (i.e., to specific logical addresses). This gives the user fine-grained control over the exact inputs used for a computation.
6. Resiliency is ensured by using node outputs as checkpoints. So retries need not reprocess everything starting from the root nodes; instead, they resume from the last checkpoints.
Although it is not modeled explicitly in Dagobah, one can think of the entire execution triggered by a user request as being part of a transaction, with checkpoints stored against that transaction. On retry, the transaction can be restarted from the checkpoint.
The interesting part is the structural sharing transactions can have with each other. If a given transaction is a subset of (or shares part of its computation with) another transaction, Dagobah is quick to reuse those computations.
A transaction T is a subset of another transaction U if the logical address of T is materialized as part of the materialization of the logical address of U.
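Putting points 2-6 together, here is roughly how I picture the node/address/checkpoint machinery. Dagobah was never open-sourced, so every name and signature below is hypothetical; this is a sketch of the idea, not the actual implementation:

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# A logical address: a vector of target date + keywords, uniquely naming
# a node's output. (Hypothetical encoding of what the talk describes.)
Address = Tuple[str, ...]  # e.g. ("2015-06-01", "search", "clickstream")

@dataclass
class Node:
    address: Address        # uniquely represents this node's output
    dependencies: list      # logical addresses of the inputs it needs
    run: Callable           # pointer to the computation (Spark, PigPen, ...)

# Checkpoints: already-materialized outputs, keyed by logical address.
# Sharing this map across transactions is what enables structural sharing:
# if transaction T's address was materialized while running U, T reuses it.
checkpoints = {}

def materialize(node_for, address):
    """Materialize a logical address, reusing checkpointed outputs."""
    if address in checkpoints:       # built earlier, or by another transaction
        return checkpoints[address]
    node = node_for(address)
    inputs = [materialize(node_for, dep) for dep in node.dependencies]
    output = node.run(*inputs)
    checkpoints[address] = output    # checkpoint: retries resume from here
    return output
```

Note how retries fall out for free: a restarted request walks the same DAG, hits the checkpoints, and only recomputes what was lost.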

The other add-on features provided by Dagobah are:
- Gates: conditional materialization when error thresholds, etc., are hit; these rely on user input.

- Actions: Another way to control conditional materialization; this one does not need user intervention.
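A gate can be pictured as a predicate evaluated before a node's output is published. This is only my reading of the talk; the function and field names are made up:

```python
def error_rate_gate(stats, threshold=0.01):
    """Gate (hypothetical shape): hold materialization for manual review
    when the error rate crosses a user-configured threshold."""
    if stats["errors"] / stats["records"] > threshold:
        return "HOLD_FOR_USER"   # a human decides whether to publish
    return "PUBLISH"
```

An action would be the same check wired to an automatic response (e.g., retry or skip) instead of waiting on a human.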