briga 4 hours ago:
With every big new model release, we see benchmarks like ARC and Humanity's Last Exam climbing higher and higher. My question is: how do we know these benchmarks are not part of the training set used for these models? A model could easily have been trained to memorize the answers. Even if the datasets haven't been copy-pasted directly, I'm sure they have leaked onto the internet to some extent. Still, I'm looking forward to trying it out. I find Gemini to be great at handling large-context tasks, and Google's inference costs seem to be among the cheapest.
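The standard mitigation is for labs to scan training data for overlap with benchmark items. Here's a minimal sketch of an exact n-gram overlap check, in the spirit of the 13-gram deduplication described in the GPT-3 paper; the function names and the n=13 default are illustrative, not any lab's actual pipeline:

    # Hedged sketch of an n-gram contamination check. Flags a benchmark
    # item if it shares any exact word n-gram with a training document.

    def word_ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
        """Lowercased word n-grams of a document."""
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def is_contaminated(benchmark_item: str, training_doc: str, n: int = 13) -> bool:
        """True if the item and the training doc share any exact n-gram."""
        return bool(word_ngrams(benchmark_item, n) & word_ngrams(training_doc, n))

At corpus scale this is done with hashed n-grams rather than pairwise comparison, and flagged items are typically either dropped from training or reported as a "contaminated" subset in the model card, which is roughly how the GPT-3 and GPT-4 reports handled it.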
stephc_int13 4 hours ago:
Even if the benchmarks themselves are kept secret, the process for creating them is not that difficult, and anyone with a small team of engineers could build a replica in their own lab to train their models on. Given the nature of how these models work, you don't need exact replicas.
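To make that last point concrete: a paraphrased replica shares no exact n-grams with the original item, so the kind of exact-match decontamination sketched upthread passes it straight through. The benchmark items below are invented for illustration:

    # Made-up example showing why replicas evade exact-match filters.

    def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    original = "What is the capital city of Australia?"
    replica = "Which city serves as Australia's capital?"

    # Even with a loose n=3 (vs. the n=13 often used for dedup), the
    # overlap is empty, so the replica sails past the filter.
    print(word_ngrams(original, 3) & word_ngrams(replica, 3))  # set()

Catching that kind of semantic duplicate would take embedding-similarity or model-based checks, which are much noisier and, as far as I know, rarely disclosed.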