Great article. Hadoop (and other similar tools) are for datasets so huge they don't fit on one machine.

vjerancrnjak 8 hours ago | parent | next [-]

https://www.scylladb.com/2019/12/12/how-scylla-scaled-to-one...

I like this one where they put a dataset on 80 machines only then for someone to put the same dataset on 1 Intel NUC and outperform in query time.

https://altinity.com/blog/2020-1-1-clickhouse-cost-efficienc...

Datasets never become big enough…

▲

saberience 7 hours ago | parent | next [-]

Well, at my old company we had some datasets in the 6-8 PB range, so tell me how we would run analytics on that dataset on an Intel NUC.

Just because you don't have experience of these situations, it doesn't mean they don't exist. There's a reason Hadoop and Spark became synonymous with "big data."

	▲	dapperdrake 5 hours ago \| parent \| next [-]
		These situations are rare not difficult. The solutions are well known even to many non-programmers who actually have that problem: There are also sensor arrays that write 100,000 data points per millisecond. But again, that is a hardware problem not a software problem.
	▲	7 hours ago \| parent \| prev [-]
		[deleted]

▲

literalAardvark 8 hours ago | parent | prev | next [-]

Well yeah, but that's a _very_ different engineering decision with different constraints, it's not fully apples to apples.

Having materialised views increases insert load for every view, so if you want to slice your data in a way that wasn't predicted, or that would have increased ingress load beyond what you've got to spare, say, find all devices with a specific model and year+month because there's a dodgy lot, you'll really wish you were on a DB that can actually run that query instead of only being able to return your _precalculated_ results.

▲

DetroitThrow 7 hours ago | parent | prev [-]

>Datasets never become big enough…

Not only is this a contrived non-comparison, but the statement itself is readily disproven by the limitations basically _everyone_ using single instance ClickHouse often run into if they actually have a large dataset.

Spark and Hadoop have their place, maybe not in rinky dink startup land, but definitely in the world of petabyte and exabyte data processing.

	▲	zX41ZdbW 5 hours ago \| parent [-]
		When a single server is not enough, you deploy ClickHouse on a cluster, up to thousands of machines, e.g., https://clickhouse.com/blog/how-clickhouse-powers-ahrefs-the...

▲

PunchyHamster 8 hours ago | parent | prev [-]

And we can have pretty fucking big single machines right now

	▲	dapperdrake 5 hours ago \| parent [-]
		https://yourdatafitsinram.net/