orochimaaru 3 days ago

If you’re using some variety of Spark for your data engineering, then Scala is an option too.

In general, choice of language isn’t important. Again, if you’re using Spark, the DataFrame schema defines the structure of your data, Python or not.

Most folks confuse pandas with “data engineering”. They’re not the same thing. Most data engineering is Spark.

rovr138 3 days ago | parent [-]

In Spark, don't PySpark and SQL both still get translated to Scala?

orochimaaru 3 days ago | parent [-]

Yes, more or less: both end up running on Spark’s JVM engine. But with PySpark there is a Python gateway (Py4J) in between; the SQL, I think, is parsed and planned natively in Spark.

But when you create a DataFrame in Spark, its schema needs to be defined; if it comes from SQL, the schema takes the form of the columns the query returns.

Using Python can create hotspots from data transfers between Spark and the Python gateway. Python UDFs are a common culprit.
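To make the transfer cost concrete, here is a toy sketch in plain Python. It is not Spark code: it assumes pickle as a stand-in for whatever serialization sits between the engine and the Python side, and the names `apply_via_gateway` and `apply_native` are made up for illustration. The point is only that every row crossing the boundary gets serialized and deserialized twice, while the "native" path moves nothing.

```python
import pickle


def apply_via_gateway(rows, udf):
    """Toy model of a Python UDF: each row is serialized, handed over,
    processed, and the result is serialized back again."""
    results = []
    bytes_moved = 0
    for row in rows:
        payload = pickle.dumps(row)              # engine -> Python side
        bytes_moved += len(payload)
        result = udf(pickle.loads(payload))      # the UDF runs on a copy
        reply = pickle.dumps(result)             # Python side -> engine
        bytes_moved += len(reply)
        results.append(pickle.loads(reply))
    return results, bytes_moved


def apply_native(rows, fn):
    """Toy model of a built-in expression: no boundary, nothing serialized."""
    return [fn(row) for row in rows], 0


def upper_name(row):
    # Hypothetical transformation: upper-case the second field.
    return (row[0], row[1].upper())


rows = [(i, f"name_{i}") for i in range(1000)]

gateway_out, gateway_bytes = apply_via_gateway(rows, upper_name)
native_out, native_bytes = apply_native(rows, upper_name)

# Same answer either way; only the amount of data shuffled differs.
print(gateway_out == native_out, gateway_bytes, native_bytes)
```

In real Spark the usual way around this is to prefer built-in column functions over Python UDFs where one exists, so the work stays inside the engine.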

Either way, my point is there are architectural and design points to your data solution that can cause many more problems than choice of language.