antupis 2 days ago

Also, even if you want to be honest, at this point, probably every public or semipublic benchmark is part of CommonCrawl.

NitpickLawyer 2 days ago | parent [-]

True. And it's even worse than that, because each test probably gets "talked about" a lot in various places. And people come up with variants. And those variants get ingested. And then the whole thing becomes a mess.

This was noticeable with the early Phi models. They were originally trained fully on synthetic data (cool experiment tbh), but the downside was that GPT-3/4 was "distilling" benchmark "hacks" into them. It became apparent when new benchmarks were released after the publication date, and there was one that measured "contamination" of 20+%. Just from distillation.