BoorishBears 3 days ago
I feel like public datasets are something we're holding onto in LLM benchmarks for historical reasons, but need to move on from.

Older, non-instruction-tuned models needed post-training on public datasets to even reliably produce meaningful answers. Now we're testing tasks so complex that an LLM should reasonably be expected to answer them without additional post-training.

Once you have a public dataset, even feeding those examples to an LLM and producing synthetic variations is enough to game the benchmark. And the worst part is you don't need to be unethical to do this: some people would say it's just a good way to expand your training data, even though it incidentally lets you overfit on the task without overfitting on the public dataset. So everyone's doing stuff like that, and we're getting models that are increasingly overfit to a few narrow tasks.

The alternative is just giving detailed plain-English descriptions of the tasks in question. Those can still be used to generate synthetic tasks, but they won't match the benchmark's "shape" perfectly (as long as the actual questions stay hidden), and that alone is enough to ensure some level of generalization takes place.
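To make the "synthetic variations" point concrete, here's a rough sketch of the pipeline I'm describing, assuming the official OpenAI Python client; the model name, prompt wording, and file names are just placeholders, not anyone's actual setup:

    import json
    from openai import OpenAI  # assumes the official OpenAI Python client

    client = OpenAI()

    def make_variations(example: dict, n: int = 5) -> list[str]:
        """Ask an LLM to paraphrase one public benchmark item into n new training items."""
        prompt = (
            "Rewrite the following benchmark question as a new question that tests "
            "the same skill but uses different wording, entities, and numbers.\n\n"
            f"Question: {example['question']}\nAnswer: {example['answer']}"
        )
        out = []
        for _ in range(n):
            resp = client.chat.completions.create(
                model="gpt-4o-mini",   # placeholder model
                messages=[{"role": "user", "content": prompt}],
                temperature=1.0,       # high temperature for variety
            )
            out.append(resp.choices[0].message.content)
        return out

    # Run every public benchmark item through the generator and keep the output
    # as extra fine-tuning data -- this is the step that quietly fits a model to
    # the benchmark's task distribution without ever training on the test items.
    with open("public_benchmark.jsonl") as f:
        examples = [json.loads(line) for line in f]

    synthetic = [v for ex in examples for v in make_variations(ex)]

Nothing in that loop touches the held-out answers directly, which is exactly why people can tell themselves it's just "data augmentation" rather than benchmark gaming.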