▲ | thom 5 days ago |
An existing trained LLM is an enormous amount of 'data', however it might be encoded. AlphaZero didn't start with Stockfish or a database of games.
▲ | tucnak 5 days ago | parent | next [-]
AlphaZero is often dragged out to ridicule so-called "self-play LLM training" techniques, but I don't find these arguments terribly convincing. You can think of AlphaZero's games as synthetic data generated in an adversarial setting; they are easy to produce and verify because the rules of chess are verifiable, so on paper it doesn't require much data. That is not the case for most text, with some notable exceptions in verifiable domains, which is where, not coincidentally, self-play is applied most successfully. Thus you could argue that the pre-existing trained LLM merely functions as a verifier proxy, analogous to the well-defined chess verifier in AlphaZero.
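To make the analogy concrete, here is a minimal toy sketch (my own illustration, not anything from AlphaZero or any actual self-play LLM pipeline) of generating synthetic training data gated by a verifier. The `exact_verifier` stands in for chess rules, where correctness is cheap and unambiguous; the `proxy_verifier` stands in for a pre-trained LLM acting as an imperfect judge. All names and the arithmetic domain are made up for illustration.

```python
import random

def exact_verifier(problem, answer):
    # Like the rules of chess: correctness is unambiguous and cheap to check.
    a, b = problem
    return answer == a + b

def proxy_verifier(problem, answer):
    # Stand-in for a pre-trained LLM used as a judge: agrees with the
    # exact verifier most of the time, but occasionally gets it wrong.
    correct = exact_verifier(problem, answer)
    return correct if random.random() < 0.9 else not correct

def generate_synthetic_data(n, verifier, noise=3):
    # A weak "policy" guesses near the truth; only samples the verifier
    # accepts are kept as synthetic training data.
    data = []
    for _ in range(n):
        problem = (random.randint(0, 9), random.randint(0, 9))
        guess = problem[0] + problem[1] + random.randint(-noise, noise)
        if verifier(problem, guess):
            data.append((problem, guess))
    return data

random.seed(0)
clean = generate_synthetic_data(1000, exact_verifier)
noisy = generate_synthetic_data(1000, proxy_verifier)

# Every sample kept by the exact verifier is truly correct; the proxy
# verifier lets some incorrect samples through.
print(len(clean), len(noisy))
```

The point of the sketch is that the quality of the synthetic data is bounded by the quality of the verifier, which is exactly why an LLM-as-verifier is a weaker foundation than a rule-based one.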
▲ | magicalhippo 5 days ago | parent | prev [-]
As I understand it, the point of the article isn't to train an LLM from scratch; it's to teach a non-reasoning model to reason without additional explicit training data.