destroycom 8 hours ago

This doesn't read like an article written with proper research or sincerity.

The claim is that there is a million times more data left to feed LLMs, citing a few articles. Those articles estimate that there are 180-200 zettabytes (the number mentioned in TFA) of data in the world in total, including all cloud services, all personal computers, etc. The vast majority of that data is not useful for training LLMs at all: it is movies, games, databases. There is a massive amount of duplication in that data (see the sketch below); only a tiny fraction will be useful.
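As a rough illustration of the duplication point, here is a minimal exact-deduplication sketch in Python (real pretraining pipelines also catch near-duplicates with MinHash/LSH; the documents here are made up):

    import hashlib

    def dedup(documents):
        # Keep only the first occurrence of each exact duplicate.
        # Normalize case/whitespace so trivial reposts hash identically.
        seen, unique = set(), []
        for doc in documents:
            key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
            if key not in seen:
                seen.add(key)
                unique.append(doc)
        return unique

    docs = ["Hello  world", "hello world", "something else"]
    print(len(dedup(docs)))  # 2 -- the first two collapse into one

Run anything like this over a web-scale crawl and the surviving fraction is far smaller than the raw byte count suggests.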

> Think of today’s AI like a giant blender where, once you put your data in, it gets mixed with everyone else’s, and you lose all control over it. This is why hospitals, banks, and research institutions often refuse to share their valuable data with AI companies, even when that data could advance critical AI capabilities.

This is not the reason; the reason is that this data is private. LLMs do not just learn from data, they can often reproduce it verbatim, so you cannot feed in the medical or bank records of real people: that would put them at very real risk.
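The verbatim-reproduction risk is easy to probe: prompt a model with the start of a suspected training record and check whether greedy decoding regurgitates the rest. A toy sketch using the HuggingFace transformers API (the model and record are placeholders, and string slicing is only an approximation of token alignment):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Stand-in model; imagine one fine-tuned on the private records.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    record = "Patient: John Doe, DOB 1970-01-01, diagnosis: ..."  # hypothetical training example
    prefix, suffix = record[:24], record[24:]

    inputs = tok(prefix, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False)  # greedy decoding
    continuation = tok.decode(out[0], skip_special_tokens=True)[len(prefix):]

    # A memorized record shows up as the private suffix reappearing verbatim.
    print("memorized?", continuation.strip().startswith(suffix.strip()[:20]))

This is essentially the setup behind published training-data-extraction attacks, which is why hospitals and banks cannot safely pool raw records no matter how the sharing is framed.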

Not to mention that a lot of that data will be well-structured, yes, but completely useless for LLM training. You will not improve the perceived "intellect" of a model by overfitting it on terabytes of bank-transaction tables.