dinobones 3 days ago

I always see these fancy DB engines and data lake blog posts and I am curious… why?

Everywhere I’ve worked, this is a solved problem: Hive+Spark, just keep everything sharded across a ton of machines.

It’s cheaper to pay for a Hive cluster that does dumb queries than to pay for expensive DB licenses, data engineers building arbitrary indices, etc… just throw compute at the problem, who cares. 1TB of RAM/flash is so cheap these days.

Even working on the world’s “biggest platforms,” a daily partition of user data is like 2TB.
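
For what it’s worth, the whole approach fits in a few lines of PySpark. This is a minimal sketch; the user_events table, its ds partition column, and the other column names are all hypothetical:

    from pyspark.sql import SparkSession

    # Hive-backed Spark session: reads tables straight from the metastore.
    spark = (
        SparkSession.builder
        .appName("dumb-daily-scan")
        .enableHiveSupport()
        .getOrCreate()
    )

    # A "dumb" query over one daily partition. Partition pruning on ds
    # means Spark scans that day's ~2TB, not the whole table, and the
    # work is just spread across however many machines you have.
    daily = spark.sql("""
        SELECT country, COUNT(DISTINCT user_id) AS dau
        FROM user_events
        WHERE ds = '2024-01-01'
        GROUP BY country
    """)
    daily.show()

No indices, no query planner tuning: the cluster brute-forces the scan.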

You’re telling me a F500 can’t buy a 5 machine/40TB cluster for like $40k and basically be set?
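
The arithmetic behind that figure is trivial (the per-machine price here is my own assumption, not a quote):

    # Back-of-envelope for the 5-machine / 40TB / $40k claim.
    machines = 5
    tb_per_machine = 8        # 40 TB total spread across 5 boxes
    usd_per_machine = 8_000   # assumed commodity server price

    print(machines * tb_per_machine, "TB")    # 40 TB
    print(machines * usd_per_machine, "USD")  # 40,000 USD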

pragmatic 2 days ago | parent

A fellow data swamp enjoyer.

“Just dump it in Hadoop” became an anti-pattern, and everyone yearned for real databases, clean data, and not having to deal with internal IT and the cluster “admins.”