andyg_blog 2 hours ago

This reminds me a bit of "Command-line Tools can be 235x Faster than your Hadoop Cluster" from 2014 https://adamdrake.com/command-line-tools-can-be-235x-faster-...

The constraint I have, and I bet many here share, is the sheer amount of data. 3GB, like in the 2014 article, is a single .pdf.

Enterprise-level data stores are measured in hundreds of GB for a single customer, and you'll get murdered on data egress costs if you try to search an entire corpus. That's if you can even get through it all before the request times out, or before the customer decides after 5 minutes that enough is enough.
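
For a rough sense of scale, here's a back-of-the-envelope sketch. The corpus size, egress price, and read throughput are numbers I made up for illustration, not measurements or any provider's real pricing, so plug in your own.

```python
# Back-of-the-envelope estimate for one full-corpus search.
# Every constant passed in below is an assumed, illustrative value,
# not a measured figure or a quoted price.

def scan_estimate(corpus_gb: float, egress_usd_per_gb: float, throughput_mb_s: float):
    """Return (egress cost in USD, scan time in seconds) for one full pass."""
    cost_usd = corpus_gb * egress_usd_per_gb
    seconds = corpus_gb * 1000 / throughput_mb_s  # 1 GB = 1000 MB (decimal units)
    return cost_usd, seconds

# Hypothetical inputs: 500 GB corpus, $0.09/GB egress, 200 MB/s sustained reads.
cost, secs = scan_estimate(corpus_gb=500, egress_usd_per_gb=0.09, throughput_mb_s=200)
print(f"egress ~${cost:.2f}, scan ~{secs / 60:.1f} min per full search")
```

With those made-up inputs a single search lands around $45 and about 42 minutes, which is well past the 5-minute mark. The point is only that both numbers scale linearly with corpus size, so "just grep it" stops working long before the data stops growing.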

You'd need a true distributed filesystem to even start attempting what the authors suggest at any scale outside of your local machine.