theturtletalks 8 days ago
The Veritasium video brought up an interesting point about how LLMs, if trained too heavily on their own content, can fall into Markov chain-style collapse and just repeat the same thing over and over. Is this still possible with the latest models being trained on synthetic data? And if it is possible, what would that one phrase be?
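A toy way to see the failure mode I mean (my own sketch, not anything from the video): re-estimate a token distribution from its own samples every "generation". Rare tokens drop out and can never come back, and the process usually ends up stuck emitting one token.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["the", "cat", "sat", "on", "a", "mat"]
    probs = np.array([0.40, 0.25, 0.15, 0.10, 0.07, 0.03])   # "real" token distribution

    n_samples = 50   # size of each generation's synthetic training set
    for gen in range(500):
        sample = rng.choice(len(vocab), size=n_samples, p=probs)
        counts = np.bincount(sample, minlength=len(vocab))
        probs = counts / counts.sum()        # the next "model" is just these frequencies
        alive = [w for w, p in zip(vocab, probs) if p > 0]
        if gen % 50 == 0:
            print(f"gen {gen:3d}: surviving tokens = {alive}")
        if len(alive) == 1:
            print(f"gen {gen:3d}: collapsed, the model only ever says '{alive[0]}'")
            break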
roadside_picnic 8 days ago
That original model-collapse paper has largely been misunderstood: in practice, the collapse only happens if you're not curating the generated data at all. The original paper even specifies (emphasis mine):

> We find that *indiscriminate* use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. [0]

In practice nobody is "indiscriminately" using model output to fine-tune models, since that doesn't even make sense. Even if you're harvesting web data generated by LLMs, that data has in fact been curated: its acceptance on whatever platform you found it on is a form of curation.

There was a very recent paper, "Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)" [1], whose content is pretty well summarized by the title. So long as the data is curated in some way, you are providing more information to the model and the results should improve somewhat (toy sketch at the end of this comment).

0. https://www.nature.com/articles/s41586-024-07566-y

1. https://www.arxiv.org/pdf/2507.12856

edit: updated based on cooksnoot's comment
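Here's the sketch I mentioned: a hedged illustration of the curation point, where generate, quality_score, and fine_tune are toy stand-ins rather than any real library API. The acceptance filter is what injects new information, which is why curated SFT ends up behaving like a crude form of RL.

    import random

    def generate(model, prompt):
        # stand-in for sampling a completion from the model
        return prompt + " -> " + random.choice(["good answer", "ok answer", "junk"])

    def quality_score(completion):
        # stand-in for the curator: a reward model, human votes, a test suite, etc.
        return {"good answer": 1.0, "ok answer": 0.6, "junk": 0.0}[completion.split(" -> ")[1]]

    def fine_tune(model, pairs):
        # stand-in: in reality this is ordinary supervised fine-tuning on `pairs`
        print(f"fine-tuning on {len(pairs)} curated examples")
        return model

    def curate_and_train(model, prompts, k=8, threshold=0.8):
        curated = []
        for prompt in prompts:
            candidates = [generate(model, prompt) for _ in range(k)]
            best = max(candidates, key=quality_score)    # acceptance is the curation step
            if quality_score(best) >= threshold:
                curated.append((prompt, best))
        # "indiscriminate" use would skip the filter and train on every candidate;
        # that is the regime the Nature result actually describes
        return fine_tune(model, curated)

    curate_and_train("toy-model", ["q1", "q2", "q3"])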