I think you're confused about the training steps for LLMs. What the industry generally calls pre-training is when the LLM learns the job of predicting the most probable next token given a huge volume of data. A large percentage of that data has not been cleaned at all because it just comes directly from web crawling. It's not uncommon to open up a web crawl dataset that is used for pretraining and immediately read something sexual, nonsensical, or both really.

LLMs really do find the signal in this noise because even just pre-training alone reveals incredible language capabilities but that's about it. They don't have any of the other skills you would expect and they most certainly aren't "safe". You can't even really talk to a pre-trained model because they haven't been refined into the chat-like interface that we're so used to.

The hard part after that for AI labs was getting together high quality data that transforms them from raw language machines into conversational agents. That's post-training and it's where the armies of humans have worked tirelessly to generate the refinement for the model. That's still valuable signal, sure, but it's not the signal that's found in the pre-training noise. The model doesn't learn much, if any, of its knowledge during post-training. It just learns how to wield it.

To be fair, some of the pre-training data is more curated. Like collections of math or code.

▲

imiric 12 hours ago | parent [-]

No, I think you're confused, and doubling down on it, for some reason.

Base models (after pre-training) have zero practical value. They're absolutely useless when it comes to separating signal from noise, using any practical definition of those terms. As you said yourself, their output can be nonsensical, based solely on token probability in the original raw data.

The actual value of LLMs comes after the post-training phase, where the signal is injected into the model from relatively smaller amounts of high quality data. This is the data processed by armies of humans, without which LLMs would be completely worthless.

So whatever capability you think LLMs have to separate signal from noise is exclusively the product of humans. When that job becomes harder, the quality of LLMs will go down. Unless we figure out a way to automate data cleaning/labeling, which seems like an unsolvable problem, or for models to filter it during inference, which is what you're wrongly implying they already do. LLMs could assist humans with cleaning/labeling tasks, but that in itself has many challenges, and is not a solution to the model collapse problem.

▲

sosodev 11 hours ago | parent [-]

I'm not saying that pre-trained only models are useless. They've clearly extracted a ton of knowledge from the corpus. The interface may seem strange because it's not what we're accustom to but they still prove valuable. Code completion models, for example, are just LLMs that have pre-trained exclusively on code. They work very well despite their simplicity because... the model has extracted the signal from the noise.

	▲	imiric 7 hours ago \| parent [-]
		You have a strange definition of "signal" and "noise". Code completion models can be useful because they output the most probable chain of tokens given a specific input, same as any LLM. There is no "signal" there besides probability. Besides, even those models are fine-tuned to follow best practices, specific language idioms, etc. When we talk about "signal" in the context of general knowledge we refer to information that is meaningful and accurate for a specific context and input. So that if the user asks proof of the Earth being flat, the model doesn't give them false information from a random blog. Of course, LLMs still fall short at this, but post-training is crucial to boost the signal away from the noise. There's nothing inherent in the way LLMs work to make them do this. It is entirely based on the quality of the training data.