The Veritasium video brought up an interesting point about how LLMs, if trained too much on their own content, will fall victim to the Markov Chain and just repeat the same thing over and over.

Is this still possible with the latest models being trained on synthetic data? And if it possible, what would that one phrase be?

▲

roadside_picnic 8 days ago | parent [-]

That original model collapse paper has largely been misunderstood and in practice, this is only true if you're not curating the generated data at all. The original paper even specifies (emphasis mine):

> We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. [0]

In practice nobody is "indiscriminately" using model output to fine-tune models since that doesn't even make sense. Even if you're harvesting web data generated by LLMs, that data has in fact been curated by it's acceptance on whatever platform you found it on is a form of curation.

There was a very recent paper Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved) [1] whose content is pretty well summarized by the title. So long as the data is curated in some way, you are providing more information to the model and the results should improve somewhat.

0. https://www.nature.com/articles/s41586-024-07566-y

1. https://www.arxiv.org/pdf/2507.12856

edit: updated based on cooksnoot's comment

▲

cootsnuck 8 days ago | parent [-]

There's multiple papers on model collapse. Being able to avoid model collapse is different from it "being disproven".

If you just mean its risk has been over exaggerated and/or over simplified then yea, you'd have a point.

	▲	roadside_picnic 8 days ago \| parent [-]
		Fair point, I've updated the post to highlight that even the original paper specifics "indiscriminate" use of model outputs. Having spent quite a bit of time diving into many questionable "research" papers (the original model collapse paper is not actually one of these, it's a solid paper), there's a very common pattern of showing that something does or does not work under special conditions but casually making generalized claims about those results. It's so easy with LLMs to find a way to get the result you want that there are far too many papers out there that people quickly take as fact when the claims are much, much weaker than the papers let on. So I tend to get a bit reactionary when addressing many of these "facts" about LLMs. But you are correct that with the model collapse paper this is much more the public misunderstanding the claims of the original paper than any fault with that paper itself.