zamalek 4 days ago

One thing I've seen through my more recent exposure to experienced data engineers is a lack of repeatability rigor (CI/CD, IaC, etc.). There's a lot of doing things in notebooks and calling that production-ready. Databricks has git integration (GitHub only, from what I can tell), but that's just checking out and committing directly to trunk - if it's in git then we have SDLC, right? Right? It's fucking nuts.

Anyone have workflows or tooling that are highly compatible with the entrenched notebook approach and easy to adopt? I want to prevent these people from learning well-trodden lessons the hard way.

faxmeyourcode 4 days ago | parent | next [-]

This is insane to read as a data engineer who actually builds software. To be perfectly honest, these sound like amateurs, not experienced data engineers.

There are plenty of us out here with many repos, dozens of contributors, and thousands of lines of Terraform and Python, custom GitHub Actions, k8s deployments running Airflow, internal full-stack web apps that we build, EMR Spark clusters, etc. - all living in our own Snowflake/AWS accounts that we manage ourselves.

The data scientists we service use notebooks extensively, but it's my team's job to clean that up and make it testable and efficient. You can't develop real software in a notebook; it sounds like they need to upskill into a real orchestration platform like Airflow and run everything through it.
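Roughly, the shape is something like this (package, function, and DAG names here are all made up): a thin Airflow DAG that only wires up scheduling, while the actual transform lives in a normal, unit-testable Python package.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Hypothetical: the transform lives in a tested Python package,
    # not in the DAG file and not in a notebook.
    from my_pipeline.transforms import clean_orders

    with DAG(
        dag_id="orders_daily",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",  # "schedule_interval" on Airflow releases before 2.4
        catchup=False,
    ):
        PythonOperator(task_id="clean_orders", python_callable=clean_orders)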

Unit test the utility functions and helpers, data quality test the data flowing in and out, and build diff reports to understand big swings in the data before signing off on changes.
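As a sketch of the "data quality test" and "diff report" ideas (column names and thresholds are invented): compare the new output of a job against the previous run and flag big swings before signing off.

    import pandas as pd

    def diff_report(previous: pd.DataFrame, current: pd.DataFrame, key: str) -> dict:
        """Crude comparison of two versions of the same dataset."""
        return {
            "row_count_change_pct": (len(current) - len(previous)) / max(len(previous), 1) * 100,
            "new_keys": len(set(current[key]) - set(previous[key])),
            "dropped_keys": len(set(previous[key]) - set(current[key])),
            "null_rate_current": current.isna().mean().to_dict(),
        }

    def test_orders_output_is_sane():
        # Hypothetical fixtures standing in for yesterday's and today's output.
        previous = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
        current = pd.DataFrame({"order_id": [1, 2, 3, 4], "amount": [10.0, 20.0, None, 40.0]})
        report = diff_report(previous, current, key="order_id")
        assert abs(report["row_count_change_pct"]) < 50      # flag big swings
        assert report["null_rate_current"]["amount"] < 0.5   # flag sudden null spikes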

My email is in my profile; I'm happy to discuss further! :-)

RobinL 4 days ago | parent | prev | next [-]

I think this may be a Databricks thing? From what I've seen there's a gap between data engineers forced to use Databricks and everyone else. At least as it's used in practice, Databricks seems to result in a mess of notebooks with poor dependency and version management.

zamalek 4 days ago | parent [-]

Interesting - Databricks has been my first exposure to DE at scale, and it does seem to solve many problems (even though it sounds like it's causing some). So what does everyone else do? Run Spark etc. themselves?

sdairs 4 days ago | parent | next [-]

tbh I see just as much notebook hell outside of dbx; it's certainly not contained to them. There are some folks doing good SDLC with Spark jobs in Java/Scala, but I've never found it to be overly common - I see "dump it on the shared drive" just as much, lol. IME data has always been a bit behind in this area.

personally you couldn't pay me to run Spark myself these days (and I used to work for the biggest Hadoop vendor in the mid 2010s doing a lot of Spark!)

RobinL 4 days ago | parent | prev [-]

We use AWS Glue for Spark (but are increasingly moving towards DuckDB because it's faster for our workloads and easier to test and deploy).
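To illustrate "easier to test" (table and column names are made up): a DuckDB-backed transform can be exercised in a plain pytest with an in-memory connection, no cluster required.

    import duckdb
    import pandas as pd

    def latest_orders(con: duckdb.DuckDBPyConnection) -> pd.DataFrame:
        # Keep only the most recent row per order_id.
        return con.sql("""
            SELECT *
            FROM orders
            QUALIFY row_number() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1
        """).df()

    def test_latest_orders():
        con = duckdb.connect()  # in-memory database
        con.register("orders", pd.DataFrame({
            "order_id":   [1, 1, 2],
            "updated_at": ["2024-01-01", "2024-01-02", "2024-01-01"],
        }))
        result = latest_orders(con)
        assert len(result) == 2
        assert result.loc[result.order_id == 1, "updated_at"].item() == "2024-01-02"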

For Spark, Glue works quite well. We use it as 'Spark as a service', keeping our code as close to vanilla PySpark as possible. This leaves us free to write our code in normal Python files, write our own (tested) libraries that are used in our jobs, use GitHub for version control and CI, and so on.
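A sketch of that split, with hypothetical names and paths: the Glue job script stays a thin, vanilla-PySpark entrypoint, and everything worth testing lives in an importable library.

    # glue_jobs/orders_job.py - thin entrypoint submitted to AWS Glue (hypothetical layout)
    from pyspark.sql import SparkSession

    # Tested separately with plain pytest against a local SparkSession.
    from our_lib.transforms import enrich_orders

    def main() -> None:
        spark = SparkSession.builder.getOrCreate()
        orders = spark.read.parquet("s3://example-bucket/raw/orders/")  # hypothetical path
        enrich_orders(orders).write.mode("overwrite").parquet("s3://example-bucket/curated/orders/")

    if __name__ == "__main__":
        main()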

jochem9 3 days ago | parent | prev | next [-]

Last time I worked with Databricks you could just create branches in their interface. PRs etc. happened in your git provider, which for us was Azure DevOps back then. We also managed some CI/CD.

You're still dealing with notebooks, though. Back then there was also a tool to connect your IDE to a Databricks cluster; that got killed, and I'm not sure whether they have something new.

esafak 4 days ago | parent | prev | next [-]

For CI, try Dagger. It's code-based and runs locally too, so you can write tests. But it's a moving target and more complex than Docker.
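For a rough idea, here's a minimal sketch with the Dagger Python SDK as it looked around the 0.x releases (the API is, as noted, a moving target, so treat this as illustrative only): run the test suite inside a container, the same way locally and in CI.

    import sys
    import anyio
    import dagger

    async def run_tests() -> None:
        # Connect to the Dagger engine; log_output streams build logs to stderr.
        async with dagger.Connection(dagger.Config(log_output=sys.stderr)) as client:
            src = client.host().directory(".")  # project source from the host
            out = await (
                client.container()
                .from_("python:3.11-slim")
                .with_directory("/src", src)
                .with_workdir("/src")
                .with_exec(["pip", "install", "-e", ".[test]"])  # hypothetical extras name
                .with_exec(["pytest", "-q"])
                .stdout()
            )
            print(out)

    if __name__ == "__main__":
        anyio.run(run_tests)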

ViewTrick1002 4 days ago | parent | prev [-]

That is what dbt solves. Version your SQL and continuously rehydrate the data to match the most recent models.
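dbt models are usually plain versioned SQL files, though it also supports Python models; a minimal sketch of the idea (model and column names are hypothetical): the model definition lives in git, and `dbt run` rebuilds the table to match whatever is currently checked in.

    # models/marts/completed_orders.py - a dbt Python model (hypothetical)
    def model(dbt, session):
        dbt.config(materialized="table")
        # On the Databricks adapter, dbt.ref returns a PySpark DataFrame.
        orders = dbt.ref("stg_orders")  # upstream model, also versioned in git
        return orders.where(orders.status == "completed")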