Remix.run Logo
stuartaxelowen 3 days ago

I'm working on a partition-oriented declarative data build system. The inspiration comes from working with systems like Airflow and AWS step functions, where data orchestration is described explicitly, and the dependency relationships between input and produced data partitions is complex. Put simply, writing orchestration code for this case sucks - the goal of the project is to enable whole data platforms to be made up of jobs that declare their input and output partition deps, so that they can be automatically fulfilled, enabling kubernetes-like continuous reconciliation of desired partitions.

This means, instead of the answer to "how do we produce this output data" being "trigger and pray everything upstream is still working", we can answer with "the system was asked to produce this output data partition and its dependencies were automatically built for it". My hope is that this allows the interface with the system to instead be continuously telling it what partitions we want to exist, and letting it figure out the rest, instead of the byzantine DAGs that get built in airflow/etc.

This comes out of a big feeling that even more recent orchestrators like Prefect, Dagster, etc are still solving the wrong problem, and not internalizing the right complexity.

efromvt 3 days ago | parent | next [-]

Very much agree that to this is the direction data orchestration platforms should go towards - the basic DAG creation can be straightforward, depending on how you do the authoring - (parsing SQL is always the wrong answer, but is tempting) - but backfills, code updates, etc are when it starts to get spicy.

stuartaxelowen 3 days ago | parent [-]

I think this is where it gets interesting. With partition dependency propagation, backfills are just “hey this range of partitions should exist”. Or, your “wants” partitions are probably still active, and you can just taint the existing partitions. This invalidates the existing partitions, so the wants trigger builds again, and existing consumers don’t see the tainted partitions as live. I think things actually get a lot simpler when you stop trying to reason about those data relationships manually!

hosh 3 days ago | parent | prev [-]

Wow, neat idea!

Is this going to be proprietary or OSS?

stuartaxelowen 3 days ago | parent [-]

Definitely open source!

hosh 2 days ago | parent [-]

I’ll be on the lookout for a Show HN!