Conceptually, it's effectively a GAN

frumiousirc 5 days ago | parent | next [-]

My initial thought as well. But what is the "Discriminator" here? What grounds the training in reality? The "Challenger" vs. "Solver" adversarial setup alone can only serve to amplify hallucination.

Ahh, GPT-4o is the arbiter.

So, basically, this is a way to perform LLM model compression (GPT-4o to qwen3) while maximizing the in-distribution domain size. As such, it seems reasonable and useful.

However, the reliance on an arbiter LLM makes unreasonable the claim that it will overcome the lack of training data. Once the target LLM is scaled up to reach the in-distribution domain size of the arbiter, it seems to me it will turn back into a hallucination amplifier.

djoldman 5 days ago | parent [-]

See Figure 2.

The solver/challenger is the GAN discriminator/generator.

The challenger is trained to create difficult questions. The solver is trained to strengthen pathways that correctly solve the questions like so:

> To guide the Challenger toward producing challenging yet solvable questions, we first define an uncertainty score. For a generated question x, we query the current Solver... The most frequent response is treated as the pseudo-label ỹ(x), and we compute the Solver’s empirical accuracy.... The uncertainty reward is then defined.... This function incentivizes questions where the Solver is maximally uncertain (accuracy approaches 50%).
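The quoted steps can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the quote elides the actual reward definition, so `uncertainty_reward` below is an assumed function chosen only because it peaks at 50% accuracy, and the names `pseudo_label_and_accuracy` and `samples` are hypothetical.

```python
from collections import Counter


def pseudo_label_and_accuracy(answers):
    """Return the majority-vote pseudo-label and the fraction of
    sampled answers that agree with it (the "empirical accuracy")."""
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)


def uncertainty_reward(accuracy):
    # ASSUMPTION: the quote elides the real formula. This stand-in is
    # 1.0 at accuracy 0.5 and falls linearly to 0.0 at accuracy 0 or 1.
    return 1.0 - 2.0 * abs(accuracy - 0.5)


# Hypothetical Solver samples for one Challenger question.
samples = ["12", "12", "15", "12", "15", "15", "12", "9"]
label, acc = pseudo_label_and_accuracy(samples)
reward = uncertainty_reward(acc)
print(label, acc, reward)
```

Note that nothing here checks the pseudo-label against a ground truth; the "accuracy" is only agreement with the Solver's own most frequent answer.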

Identifying the best pseudo-label seems like it would be the limitation of the approach.

frumiousirc 4 days ago | parent [-]

> Identifying the best pseudo-label seems like it would be the limitation of the approach.

Yes, I think this says in a different way what I'm trying to express. In a GAN, the Discriminator pegs the training to some chosen reality (assuming the "real" data set is truly real). In Challenger/Solver alone, there is no peg. The Solver could hallucinate consistently and "win" the race, because consistency is the only goal. GPT-4o, as arbiter of the Challenger/Solver training, provides the reality peg (or rather, a peg that biases toward GPT-4o's training set).
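A toy illustration of the "no peg" point, under the assumption that the pseudo-label is a plain majority vote over the Solver's own samples (the names `self_agreement`, `truth`, and `consistent_hallucinator` are made up for the example):

```python
from collections import Counter


def self_agreement(answers):
    """Majority-vote label and the fraction of answers matching it."""
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)


# Hypothetical question "2 + 2 = ?" with a Solver that is
# consistently wrong: it answers "5" every time.
truth = "4"
consistent_hallucinator = ["5"] * 8

label, agreement = self_agreement(consistent_hallucinator)
true_accuracy = sum(a == truth for a in consistent_hallucinator) / len(
    consistent_hallucinator
)
print(label, agreement, true_accuracy)
```

Self-agreement is perfect while true accuracy is zero, so a signal built only from agreement cannot tell the two apart; some external reference is needed to do that.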

magicalhippo 5 days ago | parent | prev | next [-]

For those not in the know, that's Generative Adversarial Networks[1], where two neural networks are trained in a competitive way.

One network typically generates tasks for the other, and is rewarded if it manages to make the other network fail the task. The other network is rewarded if it successfully completes the task.

Thus the adversarial network tries to find weaknesses to exploit, and the combined training makes the solving network much stronger. Or at least that's the idea.

[1]: https://en.wikipedia.org/wiki/Generative_adversarial_network

torginus 5 days ago | parent | prev [-]

GANs are a supervised training method, not really self-improving (after converging to the point of being able to reproduce the training set).