PeterisP 2 days ago

The key missing step, which breaks the loop, is that while an ever larger portion of the web is indeed written by language models, that data isn't being used to train new models. In the early days of LLMs people did want to train on "all the web", but that's no longer done: you either restrict yourself to old pre-LLM data, pay for new 'clean' data, or apply extensive filtering to avoid accidentally ingesting synthetic data (a sketch of that last approach follows below).
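To make the filtering point concrete, here's a minimal sketch of the kind of pipeline I mean. Everything here is illustrative: the document fields, the cutoff date, and especially the looks_synthetic() heuristic are hypothetical stand-ins; a real pipeline would use provenance metadata and trained classifiers, not keyword matching.

    from datetime import date

    # A common cutoff choice: data crawled before the ChatGPT release
    # is treated as pre-LLM and kept without synthetic-text screening.
    LLM_ERA_START = date(2022, 11, 30)

    def looks_synthetic(text: str) -> bool:
        """Hypothetical placeholder: flags text containing obvious
        LLM boilerplate. A real filter would be a trained classifier."""
        tells = ("as an ai language model", "i cannot assist with")
        lowered = text.lower()
        return any(t in lowered for t in tells)

    def keep_for_training(doc: dict) -> bool:
        """Keep pre-LLM-era documents outright; screen newer ones."""
        if doc["crawl_date"] < LLM_ERA_START:
            return True
        return not looks_synthetic(doc["text"])

    corpus = [
        {"crawl_date": date(2021, 5, 1), "text": "A 2021 post about baking."},
        {"crawl_date": date(2024, 3, 9), "text": "As an AI language model, I..."},
    ]
    clean = [d for d in corpus if keep_for_training(d)]
    print(len(clean))  # -> 1: the 2024 document is dropped

The point isn't that this specific heuristic works (it's deliberately naive), but that the training-data loop has an explicit gate in it, which is exactly what the collapse argument assumes doesn't exist.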

The central claim of the title, "model collapse is happening", is untrue and unsubstantiated in the article. All of the article's true statements concern a hypothetical problem: they warn of the bad consequences that would likely follow if the makers of major models did something they aren't doing, and they aren't doing it precisely because it's a known issue they actively avoid. It's like writing an article titled "Foot-shooting epidemic is happening" with a long, solid (and true!) proof that if you shoot yourself in the foot, it will indeed cause serious injury...