bee_rider 15 hours ago

Hmm. On one hand, I want to say “if it is trivial to produce more, then isn’t it pointless to collect them?”

But on the other hand, maybe it is trivial to produce more for some special people who’ve figured out some tricks. So maybe looking at their examples can teach us something.

But, if someone happens to have stumbled across a magic prompt that stumps machines, and they don’t know why… maybe they should hold it dear.

Lerc 12 hours ago

I'm not sure of the benefit of keeping particular forms of problems secret.

Benchmarks exist to measure how well something performs on the type of task that the benchmark's tests represent. When a test problem becomes public, it is the model's exposure to that particular problem that makes its answers unrepresentative of the general class of problem.

It should be easy to find another representative problem. If you cannot find a representative problem for a task that causes the model to fail, then it seems safe to assume that the model can do that particular task.

If you cannot easily replace the problem, I think it would be hard to say exactly what ability the problem was supposed to be measuring.