NitpickLawyer | 2 days ago |
True. And it's even worse than that, because each test probably gets talked about a lot in various places, people come up with variants, those variants get ingested, and the whole thing becomes a mess. This was noticeable with the early Phi models. They were originally trained entirely on synthetic data (a cool experiment, tbh), but the downside was that GPT-3/4 was distilling benchmark "hacks" into them. It became apparent once new benchmarks were released after the models' publication date: one of them measured contamination at around 20+%, just from distillation. |