frumiousirc 5 days ago
My initial thought as well. But what is the "Discriminator" here? What grounds the training toward reality? The "Challenger" and "Solver" adversarial loop alone can only amplify hallucination. Ahh, GPT-4o is the arbiter. So, basically, this is a way to perform LLM model compression (GPT-4o to Qwen3) while maximizing the in-distribution domain size. As such, it seems reasonable and useful. However, the reliance on an arbiter LLM makes the claim that this will overcome the lack of training data unreasonable. Once the target LLM is scaled up to match the in-distribution domain size of the arbiter, it seems to me it will turn back into a hallucination amplifier.
djoldman 5 days ago | parent
See Figure 2. The solver/challenger is the GAN discriminator/generator. The challenger is trained to create difficult questions; the solver is trained to strengthen pathways that correctly solve them:

> To guide the Challenger toward producing challenging yet solvable questions, we first define an uncertainty score. For a generated question x, we query the current Solver... The most frequent response is treated as the pseudo-label ỹ(x), and we compute the Solver's empirical accuracy.... The uncertainty reward is then defined.... This function incentivizes questions where the Solver is maximally uncertain (accuracy approaches 50%)

Identifying the best pseudo-label seems like it would be the limitation of the approach.
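The quoted mechanism can be sketched in a few lines. This is a hedged illustration, not the paper's code: the pseudo-label is the majority answer among sampled Solver responses, empirical accuracy is that answer's frequency, and I assume a reward of the common tent shape 1 - 2|acc - 1/2|, which matches the stated property of peaking when accuracy approaches 50%:

```python
from collections import Counter

def uncertainty_reward(solver_answers):
    """Sketch of the quoted uncertainty score (assumed form, not the
    paper's exact formula).

    solver_answers: answers sampled from the current Solver for one
    generated question x. The most frequent answer is taken as the
    pseudo-label ỹ(x); accuracy is its empirical frequency; the reward
    1 - 2|acc - 0.5| is maximal when the Solver is split 50/50.
    """
    counts = Counter(solver_answers)
    pseudo_label, count = counts.most_common(1)[0]
    acc = count / len(solver_answers)
    return 1.0 - 2.0 * abs(acc - 0.5)
```

Note the circularity the parent comment points at: the "accuracy" here is measured against a label the Solver itself voted for, so a confidently wrong consensus scores the same as a confidently right one.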