orochimaaru 3 days ago

Yes. But PySpark goes through a Python gateway, whereas the SQL, I think, is translated natively in Spark.

But when you create a DataFrame in Spark, its schema needs to be defined; if it comes from SQL, the schema takes the form of the columns returned.

Using Python can create hotspots from data transfers between Spark and the Python gateway. Python UDFs are a common culprit.
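To make the UDF point concrete, here is a rough sketch, assuming PySpark is installed and a DataFrame with a string column called `name` (the column name, the `normalize` helper, and the sample rows are made up for illustration). It builds the same derived column twice: once through a Python UDF, which has to send rows out to Python worker processes, and once with Spark's built-in column functions, which stay on the JVM side.

```python
def normalize(s):
    """Plain Python logic. Wrapped as a UDF, it runs in Python workers,
    so every value is serialized out of the JVM and back."""
    return s.strip(" ").lower() if s is not None else None


def build_columns(df):
    """Return (slow, fast) DataFrames that add the same `clean` column.

    Assumes `df` has a string column named `name` (hypothetical).
    """
    # Imported lazily so the pure-Python helper above works without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Python UDF: the likely hotspot, since data crosses the JVM/Python boundary.
    normalize_udf = F.udf(normalize, StringType())
    slow = df.withColumn("clean", normalize_udf(F.col("name")))

    # Built-in functions: same result, planned and executed natively by Spark.
    fast = df.withColumn("clean", F.lower(F.trim(F.col("name"))))
    return slow, fast


def demo():
    """Run locally and print both physical plans for comparison."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("udf-sketch").getOrCreate()
    # Schema given explicitly as a DDL string.
    df = spark.createDataFrame([("  Alice ",), ("BOB",)], "name string")
    slow, fast = build_columns(df)
    slow.explain()  # the UDF plan should include a Python evaluation step
    fast.explain()  # the built-in plan should not
    spark.stop()
```

Calling `demo()` prints the two plans side by side. How much the UDF version actually costs depends on data volume and cluster setup, so treat this as a way to spot the pattern, not as a benchmark.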

Either way, my point is that architectural and design decisions in your data solution can cause many more problems than the choice of language.