RobinL 4 days ago

I think this may be a Databricks thing? From what I've seen there's a gap between data engineers forced to use Databricks and everyone else. At least as it's used in practice, Databricks seems to result in a mess of notebooks with poor dependency and version management.
|
zamalek 4 days ago

Interesting. Databricks has been my first exposure to DE at scale, and it does seem to solve many problems (even though it sounds like it's causing some). So what does everyone else do? Run Spark etc. themselves?
sdairs 4 days ago

tbh I see just as much notebook-hell outside of dbx; it's certainly not contained to just them. There are some folks doing good SDLC with Spark jobs in Java/Scala, but I've never found it to be overly common; I see "dump it on the shared drive" just as much, lol. IME data has always been a bit behind in this area. Personally, you couldn't pay me to run Spark myself these days (and I used to work for the biggest Hadoop vendor in the mid-2010s, doing a lot of Spark!)
RobinL 4 days ago

We use AWS Glue for Spark (but are increasingly moving towards DuckDB because it's faster for our workloads and easier to test and deploy). For Spark, Glue works quite well. We use it as "Spark as a service", keeping our code as close to vanilla PySpark as possible. This leaves us free to write our code in normal Python files, write our own (tested) libraries which are used in our jobs, use GitHub for version control and CI, and so on.
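
As a rough illustration of the pattern described above (not RobinL's actual code; the file names, S3 paths, and the add_full_name transform are all hypothetical), the transform logic lives in a plain, unit-testable Python module, and the Glue job is just a thin entry point around vanilla PySpark:

    # my_etl/transforms.py -- plain library code, no Glue-specific imports,
    # unit-testable against a local SparkSession
    from pyspark.sql import DataFrame
    import pyspark.sql.functions as F

    def add_full_name(df: DataFrame) -> DataFrame:
        # pure DataFrame -> DataFrame function, easy to assert on in tests
        return df.withColumn("full_name", F.concat_ws(" ", "first_name", "surname"))

    # job.py -- the thin Glue entry point; everything above stays vanilla PySpark
    from pyspark.sql import SparkSession
    from my_etl.transforms import add_full_name

    if __name__ == "__main__":
        spark = SparkSession.builder.getOrCreate()
        df = spark.read.parquet("s3://example-bucket/input/")  # hypothetical path
        add_full_name(df).write.mode("overwrite").parquet("s3://example-bucket/output/")

A test can then build a tiny DataFrame on a local SparkSession and assert on add_full_name's output with no Glue involved, which is the "easier to test" part.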
|