thomasingalls 2 days ago

What do people do to curate/version/transform their raw datasets these days? I'm vaguely aware of the "chuck it all into S3" strategy for hanging onto raw data, and related strategies where instead of S3 it's a database of some flavor. What are folks doing for record-keeping on what today's raw data contains vs. tomorrow's?
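One common convention for the "chuck it all into S3" approach is to make raw dumps immutable and date-partitioned, so each day's snapshot lives under its own prefix and never overwrites yesterday's. A minimal sketch of that key layout (the bucket layout, source name, and file name here are made up for illustration, not from any particular tool):

```python
from datetime import date


def raw_key(source: str, ingest_date: date, filename: str) -> str:
    """Build a date-partitioned object key: each day's dump lands under
    its own dt= prefix, so today's raw data and tomorrow's coexist
    instead of overwriting each other."""
    return f"raw/source={source}/dt={ingest_date.isoformat()}/{filename}"


# Example: where the June 1st dump of an "events" feed would land.
key = raw_key("events", date(2024, 6, 1), "part-000.json")
print(key)  # raw/source=events/dt=2024-06-01/part-000.json
```

With keys like this, "what did the raw data contain on date X" reduces to listing one prefix, and downstream jobs can pin themselves to a specific `dt=` partition.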

And the next step: a curated dataset has time-bound provenance. What are folks doing to keep track of the transformation/cleaning steps that make the raw data useful at the time it's processed? Does this bit fall under the purview of Metaflow, or is this different tooling?
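Even without a dedicated tool, the core of that record-keeping is small: pin the exact raw input by content hash and write down the ordered steps that produced the curated output. A minimal stdlib sketch of such a provenance manifest (file names, step strings, and the manifest schema are all made up for illustration):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Content hash of the raw file, so the manifest pins an exact snapshot."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(raw_path: Path, steps: list[str], out_path: Path) -> dict:
    """Record what the curated dataset was derived from and how."""
    manifest = {
        "raw_file": raw_path.name,
        "raw_sha256": sha256_of(raw_path),
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "steps": steps,  # ordered cleaning/transformation steps
    }
    out_path.write_text(json.dumps(manifest, indent=2))
    return manifest


# Example: pin a raw dump and the cleaning applied to it.
raw = Path("raw_dump.csv")
raw.write_text("id,value\n1,10\n2,\n")
m = write_manifest(
    raw,
    ["drop rows with null value", "cast value to int"],
    Path("manifest.json"),
)
```

Pipeline tools like Metaflow effectively automate this kind of record alongside each run; the manifest just makes explicit what needs capturing either way: which raw snapshot, which steps, and when.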

Or maybe my assumptions are off base! Curious about what other teams are doing with their datasets.

patcon 2 days ago

I'm exploring Kedro and Kedro-Viz lately, in case that's in the vicinity of your question. It ties in most closely with MLflow for artifact tracking, but storing artifacts locally works fine too.