mynameisash 4 days ago

The comments here are... interesting, as they indicate a strong split between analysts and those engineers that can operationalize things. I see another dimension to it all.

My title is senior data engineer at GAMMA/FAANG/whatever we're calling them. I have a CS degree and am firmly in the engineering camp. My passion, though, is in using software engineering and computer science principles to make very large-scale data processing as stupid fast as we can. To the extent I can ignore it, I don't personally care much about the tooling and frameworks and such (CI/CD, Airflow, Kafka, whatever). I care about how we're affinitizing our data, how we index it, whether and when we can use data sketches to achieve a good tradeoff between accuracy and compute/memory, and so on.

While there are plenty of folks in this thread bashing analysts, one could also bash other "proper" engineers that can do the CI/CD but don't know shit about how to be efficient with petabyte-scale processing.
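For readers unfamiliar with the "data sketches" tradeoff mentioned above, here is a minimal illustration, not anything from the commenter's actual stack. It is a K-minimum-values (KMV) distinct-count sketch: it keeps only the `k` smallest hash values seen, so memory stays fixed no matter how many rows stream through, at the cost of an approximate answer (relative error on the order of 1/sqrt(k)). The class name `KMVSketch` and the choice of SHA-1 as the hash are assumptions made for the example.

```python
import hashlib
import heapq


class KMVSketch:
    """K-minimum-values sketch: approximate distinct count in O(k) memory."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []         # negated hashes, so heap[0] is the largest kept hash
        self._members = set()   # hashes currently kept, used to skip duplicates

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))   # fewer than k distinct items: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest  # standard KMV estimator


if __name__ == "__main__":
    sketch = KMVSketch(k=1024)
    for i in range(100_000):
        sketch.add(f"user-{i % 50_000}")    # 50,000 distinct values, each seen twice
    print(round(sketch.estimate()))         # close to 50000, not exact
```

Raising `k` buys accuracy with memory; production systems would more likely reach for an existing HyperLogLog or theta-sketch implementation than hand-roll this.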

kentm 3 days ago | parent | next [-]

People who can utilize the tooling to process petabytes of data efficiently aren’t the ones that are catching flak. The people I’m thinking of basically run massive inefficient SQL queries and then throw their hands up when it runs slowly or gets an OOM error. They don’t even know how to do an explain plan. And if you try to explain to them things like partitioning, indexes, sketches, etc., then they are not able to comprehend and argue that it’s not their job to learn, and that it’s the “proper engineers” job to scale the processing.

CalRobert 3 days ago | parent | next [-]

My boss at a large company years ago wrote a query for daily stats and then proceeded to run it on the entire event history every day for the life of the company just to get DAU, etc. The solution was to just keep paying more for Redshift until the bill was a few million a year. Suggestions to fix his crap were met with disdain.

That job taught me a lot.

itsoktocry 3 days ago | parent | prev [-]

>And if you try to explain to them things like partitioning, indexes, sketches, etc then they are not able to comprehend and argue that it’s not their job to learn, and that it’s the “proper engineers” job to scale the processing.

Make up a person and attack him, literal strawman. You sound pleasant to work with.

kentm 3 days ago | parent [-]

I’m referring to actual people I have worked and interacted with, so no, not made up.

They’re not engineers and shouldn’t have been labeled data engineers. They have some other value to the company, presumably, but trying to repackage them as data engineers does cause issues. That’s the topic of this thread.

VirusNewbie 4 days ago | parent | prev | next [-]

>one could also bash other "proper" engineers that can do the CI/CD but don't know shit about how to be efficient with petabyte-scale processing.

But that would be SWEs no?

I was a 'data engineer' (until they changed the terrible title) at a startup and I ended up having to fight with Spark and Apache Beam at times, eventually contributing back to improve throughput for our use cases.

That's not the same thing as a Business Analyst who can run a little PySpark query.

tdb7893 4 days ago | parent | prev | next [-]

I mean this very sincerely, but I'm a little lost as to how data engineering is distinct from software engineering. It seems like just a subset of it; my title was software engineer and I've done what sounds like very similar work.

briankelly 4 days ago | parent [-]

I’m pretty sure the term came from Google (at least that is where I heard it first described) and just referred to a backend engineer with speciality in this area. Now usually these roles have “distributed systems” in the title, even if you aren’t really on the inside of the systems. That or “systems and infrastructure”, “data infrastructure”, or “AI/ML infrastructure” or sometimes “MLE” for those kinds of orgs. Or back to good ole “big data” now that it’s no longer tacked on everything.

food4u 3 days ago | parent | prev [-]

[dead]