Lerc 13 hours ago

I'm not sure of the benefit of keeping particular forms of problems secret.

Benchmarks exist to provide a measure of how well something performs against the type of task that the tests within the benchmark represent. When a model has already seen a particular problem, it is that exposure that makes its answers no longer representative of the general class of problem.

It should be easy to find another representative problem. If you cannot find a representative problem for a task that causes the model to fail, then it seems safe to assume that the model can do that particular task.

If you cannot easily replace the problem, I think it would be hard to say exactly what ability the problem was supposed to be measuring.