KolmogorovComp 7 hours ago

> The first thing to do is get a lot of game data. This proved more difficult than I thought it would be, but after some looking around online I found a git repository on GitHub from rozim that had plenty of games. I used this to compile a set of 3.46GB of data, which is about twice what Tom used in his test. The next step is to get all that data into our pipeline.

It would be interesting to redo the benchmark but with a (much) larger database.

Nowadays the biggest open dataset for chess must come from Lichess (https://database.lichess.org), with ~7B games at ~2.34 TB compressed, ~14 TB uncompressed.

Would Hadoop win here?

woooooo 5 hours ago | parent | next [-]

If you get all the data on fast SSDs in a single chassis, you probably still beat EMR over S3. But then you have a whole dedicated server to manage your 14TB of chess games.

The "EMR over S3" paradigm is based on the assumption that the data isn't read all that frequently, 1-10x a day typically, so you want your cheap S3 storage but once in a while you'll want to crank up the parallelism to run a big report over longer time periods.

dapperdrake 5 hours ago | parent | prev [-]

Probably not.

The compressed data can fit onto a local SSD. Decompression can definitely be streamed.
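For example, a minimal streaming sketch in Python using the `zstandard` package against one of the monthly Lichess `.pgn.zst` dumps. The filename and the `max_window_size` setting are assumptions to verify:

```python
# Sketch: stream-decompress a Lichess .pgn.zst dump and count games
# without ever materializing the uncompressed text on disk.
import io
import zstandard as zstd

def count_games(path: str) -> int:
    games = 0
    with open(path, "rb") as fh:
        # Lichess dumps are compressed with a large window; raising
        # max_window_size avoids frame-size errors (assumption to verify).
        reader = zstd.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
        text = io.TextIOWrapper(reader, encoding="utf-8", errors="replace")
        for line in text:
            # Every game in a PGN dump starts with an [Event "..."] tag.
            if line.startswith("[Event "):
                games += 1
    return games

if __name__ == "__main__":
    # Hypothetical filename for one monthly dump.
    print(count_games("lichess_db_standard_rated_2013-01.pgn.zst"))
```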