pmdr 2 days ago

> The Internet was turned into a slop warehouse well before LLMs became a thing

I suppose that's thanks to Google and their search algorithms favoring ad-ridden SEO spam. LLMs are indeed more appealing and convenient. But I fear that legitimate websites (ad-supported or otherwise) that actually provide useful information will be on the decline. Let's just hope, then, that up-to-date information still finds its way into LLMs once such websites are gone.

TeMPOraL 2 days ago | parent

In terms of utility as training data, the Internet is a poisoned well now, and the poison is becoming more potent over time. Part of it is the SEO spam and content marketing slop, both of which keep growing and accumulating. Part of it is even more slop produced by LLMs, especially by cheap (= weak) models, but also by LLMs in general (any LLM used to produce content does a worse job of it than a model from the next generation would, so it's kinda always suboptimal for training purposes). And now part of it is people mass-producing bullshit out of spite, just to screw with AI companies. SNR on the web is dropping like a brick falling into a black hole.

It's a bit of a gamble at this point - will larger models, new architectures, or new training protocols be able to reject all that noise and extract the signal? If yes, then training on the Internet is still safe. If not, it's probably better for them to freeze the datasets blindly scraped from the Internet now, and focus on mining less poisoned sources (like books, academic papers, and other publications not yet ravaged by the marketing communications cancer[0], ideally published more than two years ago).

I don't know which is more likely - but I'm not dismissing the possibility that the models will be able to process increasingly poisoned data sets just fine, if the data sets are large enough, because of a very basic and powerful idea: self-consistency. True information is always self-consistent, because it reflects the underlying reality. Falsehoods may be consistent in the small, but at scale they're not.
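
To make that concrete, here's a toy sketch of what consistency-based filtering could look like (purely illustrative - no lab publishes its actual pipeline, and real signals would be far messier than exact-match claims): treat each document as a bag of (subject, attribute) -> value assertions and drop documents that contradict the corpus-wide majority.

    # Toy illustration of consistency-based filtering (hypothetical, not any
    # real training pipeline): each "document" asserts (subject, attribute)
    # -> value claims; keep documents whose claims agree with the corpus-wide
    # majority, drop those that contradict it.
    from collections import Counter, defaultdict

    docs = {
        "doc_a": {("water", "boiling_point_c"): "100", ("earth", "moons"): "1"},
        "doc_b": {("water", "boiling_point_c"): "100", ("earth", "moons"): "1"},
        "doc_c": {("water", "boiling_point_c"): "250", ("earth", "moons"): "3"},  # slop
    }

    # Tally every asserted value per (subject, attribute) across the corpus.
    votes = defaultdict(Counter)
    for claims in docs.values():
        for key, value in claims.items():
            votes[key][value] += 1

    # A document survives if none of its claims contradict a clear majority.
    def is_consistent(claims, min_agreement=0.5):
        for key, value in claims.items():
            winner, count = votes[key].most_common(1)[0]
            if value != winner and count / sum(votes[key].values()) >= min_agreement:
                return False
        return True

    kept = [name for name, claims in docs.items() if is_consistent(claims)]
    print(kept)  # ['doc_a', 'doc_b'] - doc_c gets rejected as inconsistent

The point isn't this particular heuristic - it's that as the corpus grows, independently generated slop rarely agrees on the same wrong value, so consensus-style checks get more discriminating rather than less.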