dmos62 3 days ago
How do you propose that would work? A pipeline that goes through query-response pairs to deduce response quality and then uses the low-quality responses for further training? Wouldn't you need a model that's already smart enough to tell that the previous model's responses weren't smart enough? Sounds like a chicken-and-egg problem.
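For concreteness, a rough Python sketch of what such a pipeline might look like; `judge_model`, `finetune`, and `logged_pairs` are hypothetical placeholders, and the circularity shows up in the judge call, which has to be at least as good as the model it is grading:

    def mine_training_data(query_response_pairs, judge_model, threshold=0.5):
        """Collect pairs whose responses the judge scores as low quality."""
        low_quality = []
        for query, response in query_response_pairs:
            # chicken-and-egg: scoring requires an already-smarter judge
            score = judge_model.score(query, response)
            if score < threshold:
                low_quality.append((query, response))
        return low_quality

    # hypothetical usage: retrain on the mined pairs
    # finetune(base_model, mine_training_data(logged_pairs, judge_model))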
irthomasthomas 3 days ago | parent
It just means that once you send your test questions to a model API, that company now has your test. So "private" benchmarks take it on faith that the companies won't look at those requests and tune their models or prompts to beat them.
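A minimal sketch of why that leak is unavoidable when evaluating against a hosted API: every benchmark question goes out verbatim in the request body. This assumes an OpenAI-style chat-completions endpoint and the `requests` library; the exact URL and payload shape vary by provider.

    import requests

    def run_benchmark(questions, api_url, api_key, model):
        answers = []
        for q in questions:
            resp = requests.post(
                api_url,
                headers={"Authorization": f"Bearer {api_key}"},
                # the provider receives q in full, whether or not it logs it
                json={"model": model,
                      "messages": [{"role": "user", "content": q}]},
            )
            answers.append(resp.json()["choices"][0]["message"]["content"])
        return answers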