▲ | zamalek 4 days ago
One thing I've seen through my more recent exposure to experienced data engineers is the lack of repeatability rigor (CI/CD, IaC, etc.). There's a lot of doing things in notebooks and calling that production-ready. Databricks has git integration (GitHub only, from what I can tell), but that's just checking out and committing directly to trunk; if it's in git then we have SDLC, right? It's fucking nuts. Does anyone have workflows or tooling that are highly compatible with the entrenched notebook approach and are easy to adopt? I want to prevent these people from learning well-trodden lessons the hard way.
▲ | faxmeyourcode 4 days ago | parent | next [-]
This is insane to read as a data engineer who actually builds software. These sound like amateurs, not experienced data engineers, to be perfectly honest.

There are plenty of us out here with many repos, dozens of contributors, and thousands of lines of Terraform, Python, custom GitHub Actions, k8s deployments running Airflow, internal full-stack web apps, EMR Spark clusters, etc., all living in our own Snowflake/AWS accounts that we manage ourselves. The data scientists we serve use notebooks extensively, but it's my team's job to clean that work up and make it testable and efficient.

You can't develop real software in a notebook. It sounds like they need to upskill into a real orchestration platform like Airflow and run everything through it: unit test the utility functions and helpers, data quality test the data flowing in and out, and build diff reports for understanding big swings in the data before signing off on changes.

My email is in my profile, I'm happy to discuss further! :-)
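To make the testing side concrete, here's a rough sketch of the kind of thing I mean (pandas + pytest assumed; the function, table, and column names are made up):

    import pandas as pd

    # Hypothetical helper that pipeline tasks import and reuse.
    def normalize_order_totals(df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        out["total"] = out["quantity"] * out["unit_price"]
        return out

    # Unit test for the helper (run with `pytest`).
    def test_normalize_order_totals():
        df = pd.DataFrame({"quantity": [2, 3], "unit_price": [1.5, 4.0]})
        result = normalize_order_totals(df)
        assert result["total"].tolist() == [3.0, 12.0]

    # Data quality check on data flowing in/out, callable from an orchestrator task.
    def check_orders(df: pd.DataFrame) -> None:
        assert not df["order_id"].isna().any(), "null order_id found"
        assert df["order_id"].is_unique, "duplicate order_id found"
        assert (df["total"] >= 0).all(), "negative order total found"

None of that is exotic, it's just ordinary software discipline applied to data code.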
▲ | RobinL 4 days ago | parent | prev | next [-]
I think this may be a Databricks thing? From what I've seen there's a gap between data engineers forced to use Databricks and everyone else. At least as it's used in practice, Databricks seems to result in a mess of notebooks with poor dependency and version management.
| |||||||||||||||||||||||
▲ | jochem9 3 days ago | parent | prev | next [-]
Last time I worked with Databricks you could just create branches in their interface. PRs etc. happened in your git provider, which for us was Azure DevOps back then. We also managed some CI/CD. You're still dealing with notebooks, though. Back then there was also a tool to connect your IDE to a Databricks cluster; that got killed, and I'm not sure if they have something new.
▲ | esafak 4 days ago | parent | prev | next [-]
For CI, try Dagger. It's code-based and runs locally too, so you can write tests. But it is a moving target and more complex than Docker.
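A rough sketch of what the code-based approach looks like with Dagger's Python SDK (this is the older Connection-style API, and since the SDK has changed between releases, plus the image and commands here are just placeholders, treat it as illustrative):

    import sys
    import anyio
    import dagger

    async def main():
        cfg = dagger.Config(log_output=sys.stderr)
        async with dagger.Connection(cfg) as client:
            # Build a container, mount the repo, and run the test suite inside it.
            result = await (
                client.container()
                .from_("python:3.11-slim")
                .with_directory("/src", client.host().directory("."))
                .with_workdir("/src")
                .with_exec(["pip", "install", "-r", "requirements.txt"])
                .with_exec(["pytest", "-q"])
                .stdout()
            )
            print(result)

    # The same script runs on a laptop and in CI, which is the main appeal.
    anyio.run(main)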
▲ | ViewTrick1002 4 days ago | parent | prev [-]
That is what dbt solves. Version your SQL and continuously rehydrate the data to match the most recent models. |
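For the unfamiliar: a dbt model is just a SQL file kept in version control that dbt materializes in the warehouse on each run. A made-up example (model and column names are hypothetical):

    -- models/order_totals.sql -- a hypothetical dbt model, kept in git like any other code
    {{ config(materialized='table') }}

    select
        customer_id,
        count(*)         as order_count,
        sum(order_total) as lifetime_value
    from {{ ref('stg_orders') }}  -- ref() wires up dependencies between models
    group by customer_id

Because the models are plain files, the usual PR review, testing, and CI machinery applies to them directly.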