Remix.run Logo
jochem9 3 days ago

One thing that I don't see mentioned but that does bug me: data engineers often use a lot of Python and SQL, even the ones that have heavily adopted software engineering best practices. Yet both languages are not great for this.

Python is dynamically typed, which you can patch a bit with type hints, but it's still easy to go to production with incompatible types, leading to panics in prod. It's uncompiled nature also makes it very slow.

SQL is pretty much impossible to unit test, yet often you will end up with logic that you want to test. E.g. to optimize a query.

For SQL I don't have a solution. It's a 50 year old language that lacks a lot of features you would expect. It's also the defacto standard for database access.

For Python I would say that we should start adopting statically typed compiled languages. Rust has polars as dataframe package, but the language itself isn't that easy to pick up. Go is very easy to learn, but has no serious dataframe package, so you end up doing a lot of that work yourself in goroutines. Maybe there are better options out there.

orochimaaru 3 days ago | parent | next [-]

If you’re using some variety of spark for your data engineering then scala is an option too.

In general, choice of language isn’t important - again if you’re using spark your data frame structure schema defines that structure Python or not.

Most folks confuse pandas with “data engineering”. It’s not. Most data engineering is spark.

rovr138 3 days ago | parent [-]

in spark, doesn't pyspark and sql both still get translated to scala?

orochimaaru 3 days ago | parent [-]

Yes. But with pyspark there is a Python gateway, the sql I think is translated natively in spark.

But when you create a dataframe in spark, that schema needs to be defined - or if it’s sql takes the form of the columns returned.

Use of Python can create hotspots with data transfers between spark and the Python gateway. Python UDFs are a common culprit.

Either way, my point is there are architectural and design points to your data solution that can cause many more problems than choice of language.

sbrother 3 days ago | parent | prev | next [-]

When I was most recently at Google (2021-ish) my team owned a bunch of SQL Pipelines that had fairly effective SQL tests. Not my favorite thing to work on, but it was a productive way to transform data. There are lots of open source versions of the same idea, but I have yet to see them accompanied with ergonomic testing. Any recommendations or pointers to open source SQL testing frameworks?

physicles 3 days ago | parent [-]

Could you describe what made those tests effective? I just wrote some tools to write concise tests for some analytics queries, and some principles I stumbled on are:

- input data should be pseudorandom, so the chance of a test being “accidentally correct” is minimized

- you need a way to verify only part of the result set. Or, at the very least, a way to write tests so that if you add a column to the result set, your test doesn’t automatically break

In addition, I added CSV exports so you can verify the results by hand, and hot-reload for queries with CTEs — if you change a .sql file then it will immediately rerun each CTE incrementally and show you which ones’ output changed.

greekorich 3 days ago | parent | prev [-]

I've been a professional java dev for a decade. I've written a little python, clojure, lots of JS/TS/Node.

SQL is the most beautiful, expressive, get stuff done language I've used.

It is perfect for whatever data engineering is defined as.

antupis 3 days ago | parent [-]

SQL is beautiful when it works but when it doesn’t you end up with some abomination eg if you need some kind dynamic query.

omgwtfbyobbq a day ago | parent | next [-]

Can you share more detail? Every pain point I've seen can be handled by a more suitable/performant/flexible DB and/or software that makes working with SQL less painful.

mrugge 3 days ago | parent | prev [-]

The two most helpful things with SQL are (1) always use set-based operations (never a cursor) and (2) break up your queries into smallest possible reusable chunks (CTEs). Sprinkle in tests to taste. Without some discipline SQL can get out of hand. This is what made dbt popular.