falcor84 5 days ago

What am I missing? From my skimming, there's zero external data beyond what is needed for the Challenger to generate questions.

thom 5 days ago | parent [-]

An existing trained LLM is an enormous amount of 'data' however it might be encoded. AlphaZero didn't start with Stockfish or a database of games.

tucnak 5 days ago | parent | next [-]

AlphaZero is often dragged out to ridicule so-called "self-play LLM training" techniques, but I don't find those arguments terribly convincing. You can think of AlphaZero's games as synthetic data generated in an adversarial setting; they're easy to produce and verify because the rules of chess are verifiable, so on paper the approach requires little data. That's not the case for most text, with some notable exceptions in verifiable domains, where, not coincidentally, self-play has been applied most successfully. Thus, you could argue that the pre-existing trained LLM merely functions as a verifier proxy, analogous to the well-defined chess verifier in AlphaZero.
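The verifier-proxy point can be made concrete with a toy sketch (the arithmetic domain and every name below are illustrative stand-ins, not anything from AlphaZero or the paper): self-play only yields usable training pairs because an exact, rule-based checker filters them, just as the rules of chess score AlphaZero's games.

```python
import random

rng = random.Random(0)

def propose_task():
    # Challenger stand-in: emits a synthetic arithmetic question.
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    return f"{a}+{b}"

def attempt(question):
    # Solver stand-in: deliberately imperfect, off by one 20% of the time.
    a, b = map(int, question.split("+"))
    return a + b if rng.random() > 0.2 else a + b + 1

def verify(question, answer):
    # Rule-based verifier: exact and cheap, like the rules of chess.
    a, b = map(int, question.split("+"))
    return answer == a + b

# Keep only the (question, answer) pairs the verifier accepts.
dataset = []
for _ in range(100):
    q = propose_task()
    ans = attempt(q)
    if verify(q, ans):
        dataset.append((q, ans))

print(f"kept {len(dataset)} of 100 self-played pairs")
```

Swap in a domain where `verify` can't be written as a rule, and the loop has nothing trustworthy to filter with, which is where a pre-trained LLM gets drafted as the proxy.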

nerpderp82 5 days ago | parent [-]

Thank you for your mature, intelligent answer.

magicalhippo 5 days ago | parent | prev [-]

As I understand it, the point of the article isn't to train an LLM from scratch; it's to teach a non-reasoning model to reason without additional explicit training data.

YeGoblynQueenne 5 days ago | parent [-]

The abstract does use the term "from scratch":

>> To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch.

Giving them the benefit of the doubt, they're simply using the term loosely, but the way they use it reads like a claim that they found a way to initialise LLMs with zero data. Only the absurdity of that claim protects the reader from the misunderstanding, and relying on that is never a good thing in a research paper.

magicalhippo 5 days ago | parent [-]

If you include the previous and following sentences, it's clear (at least to me) what they mean:

> However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence.

> To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch.

> Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver.

Training an LLM is a multi-stage process[1], and they're tackling the final stage, which is where fine-tuning or reinforcement learning happens. They're not training an LLM from scratch; they explicitly state that they start from a base LLM, i.e. a pretrained but not fine-tuned model.

As I understand it, and as they mention, training data for the later stages has typically required large numbers of high-quality, human-curated samples, even if those are augmented using LLMs, say by generating multiple variations of each human-curated sample.

Their proposal is to have an adversarial setup, in the spirit of a generative adversarial network, generate that data without any initial human input, i.e. from scratch.
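One way to picture such an adversarial data generator (a toy sketch under my own assumptions, not the paper's actual algorithm; every name, probability, and threshold here is made up): a Challenger proposes tasks, the Solver answers each one several times, the majority answer becomes a pseudo-label, and only tasks of middling agreement (neither trivial nor hopeless) are kept as training data.

```python
import random
from collections import Counter

rng = random.Random(42)

def challenger():
    # Hypothetical Challenger: proposes a task with a latent difficulty in [0, 1].
    return {"question": f"task-{rng.randrange(10**6)}",
            "difficulty": rng.random()}

def solver(task):
    # Hypothetical Solver: answers correctly with probability 1 - difficulty,
    # and otherwise produces one consistent wrong answer for that task.
    if rng.random() < 1.0 - task["difficulty"]:
        return "right"
    return "wrong:" + task["question"]

def pseudo_label(task, n_samples=8):
    # Self-consistency voting: sample the Solver several times; the majority
    # answer is the pseudo-label, and its vote share proxies for difficulty.
    votes = Counter(solver(task) for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    return answer, count / n_samples

# Keep tasks the Solver neither aces nor fails outright: the informative band.
curriculum = []
for _ in range(200):
    t = challenger()
    ans, agreement = pseudo_label(t)
    if 0.3 <= agreement <= 0.8:
        curriculum.append((t["question"], ans))

print(f"kept {len(curriculum)} of 200 generated tasks")
```

The filtering step is the crux: no human ever labels anything, yet the Solver's own disagreement with itself acts as the difficulty signal that steers what the Challenger's output contributes to training.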

[1]: https://snorkel.ai/blog/large-language-model-training-three-...

YeGoblynQueenne 4 days ago | parent [-]

That's a fair reading, but when you write a technical paper you must try to minimise the number of possible readings of each sentence; otherwise different readers will come away with different understandings, and that's exactly what you need to avoid.

magicalhippo 4 days ago | parent [-]

> but when you write a technical paper you must try to minimise the number of different possible readings of each sentence

Fair point. It would indeed have been much clearer had they written something like this instead:

a fully autonomous framework that generates its own fine-tuning/RL training data from scratch.