I am looking at Dagobah today, a Data-centric Meta-scheduler from @NetflixEng. This is a 🧵 summarising it.

P.S: This is part of the reading I am doing into schedulers these days. Watch this space for more.

This one is a bit old (from 2015), and was never open-sourced. But it has some great ideas. I am relying on @MattBossenbroek's talk here to understand it:
https://www.youtube.com/watch?v=V2E1PdboYLk
Dagobah, a Data centric Meta scheduler - Matt Bossenbroek

On the Netflix search team we had many data pipelines, patched together using different technologies, which made it difficult to integrate and monitor system...

Motivations:

1. Need for data-centric pipelines emphasizing traceability, predictability, provenance, and structural sharing.

2. Encourage fine-grained, testable, functional computations. (a.k.a. transformations as pure functions).

3. Compose pipelines from multiple resources (Docker, Spark, Hive, etc).

4. Version-controlled, API centric configuration.

5. Resilience against job failures and restarts.

Salient Features:

1. Pipelines are initiated as user requests for materializing leaf nodes in a DAG.

2. Each node has an associated logical address. This address is a vector containing a target date & keywords describing the data. The address uniquely represents the node's output.

3. A node describes a data-dependency by listing the address of the data it needs.

4. The node points to the code to run (could be Spark code, a PigPen script, etc). The code itself is not part of the node definition, thus separating the concerns of orchestration and computation.

5. The request to a node can explicitly point to specific versions of its dependencies (i.e., to specific logical addresses). This gives the user fine-grained control over the exact inputs used for a computation.
6. Resiliency is ensured by using node outputs as checkpoints. So retries need not reprocess everything starting from the root nodes; instead, they resume from the last checkpoints.
Although it is not modeled explicitly in Dagobah, one can think of the entire execution triggered by a user request as being part of a transaction, with checkpoints stored against that transaction. On retry, the transaction can be restarted from the checkpoint.
The interesting part is the structural sharing transactions can have with each other. If a given transaction is a subset of (or shares part of its computation with) another transaction, Dagobah is quick to reuse those computations.
A transaction T is a subset of another transaction U if the logical address of T is materialized as part of the materialization of the logical address of U.
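Putting points 2-6 together, here is roughly how I picture the node/address/checkpoint machinery. Dagobah was never open-sourced, so every name and signature below is hypothetical; this is a sketch of the idea, not the actual implementation:

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# A logical address: a vector of target date + keywords, uniquely naming
# a node's output. (Hypothetical encoding of what the talk describes.)
Address = Tuple[str, ...]  # e.g. ("2015-06-01", "search", "clickstream")

@dataclass
class Node:
    address: Address        # uniquely represents this node's output
    dependencies: list      # logical addresses of the inputs it needs
    run: Callable           # pointer to the computation (Spark, PigPen, ...)

# Checkpoints: already-materialized outputs, keyed by logical address.
# Sharing this map across transactions is what enables structural sharing:
# if transaction T's address was materialized while running U, T reuses it.
checkpoints = {}

def materialize(node_for, address):
    """Materialize a logical address, reusing checkpointed outputs."""
    if address in checkpoints:       # built earlier, or by another transaction
        return checkpoints[address]
    node = node_for(address)
    inputs = [materialize(node_for, dep) for dep in node.dependencies]
    output = node.run(*inputs)
    checkpoints[address] = output    # checkpoint: retries resume from here
    return output
```

Note how retries fall out for free: a restarted request walks the same DAG, hits the checkpoints, and only recomputes what was lost.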

The other add-on features provided by Dagobah are:
- Gates: conditional materialization when error thresholds, etc., are hit; these rely on user input.

- Actions: Another way to control conditional materialization; this one does not need user intervention.
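A gate can be pictured as a predicate evaluated before a node's output is published. This is only my reading of the talk; the function and field names are made up:

```python
def error_rate_gate(stats, threshold=0.01):
    """Gate (hypothetical shape): hold materialization for manual review
    when the error rate crosses a user-configured threshold."""
    if stats["errors"] / stats["records"] > threshold:
        return "HOLD_FOR_USER"   # a human decides whether to publish
    return "PUBLISH"
```

An action would be the same check wired to an automatic response (e.g., retry or skip) instead of waiting on a human.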