▲ | antupis 2 days ago | |
Also, even if you want to be honest, at this point, probably every public or semipublic benchmark is part of CommonCrawl. | ||
▲ | NitpickLawyer 2 days ago | parent [-] | |
True. And it's even worse than that, because each test probably gets "talked about" a lot in various places. And people come up with variants. And those variants get ingested. And then the whole thing becomes a mess. This was noticeable with the early Phi models. They were originally trained fully on synthetic data (cool experiment tbh), but the downside was that GPT-3 / 4 was "distilling" benchmark "hacks" into them. It became apparent when new benchmarks were released after the publication date, and there was one that measured "contamination" of 20+%. Just from distillation. |