criemen 3 hours ago

Partially. Stage W2 of Section 2.2 (Submission workflow) deals with this:

> Stage W2 The five project-active models, see Table 2, attempted the question. Their answers were compared to the original answer by an LLM judge. If at most three models answered correctly, the contributor could proceed.

So questions that are "trivially contained in the training data" are excluded, since in that case all models could/should easily come up with the solution, and the question would fail this check.
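
As a rough sketch of that gate (my own illustration, not code from the paper): the function and constant names are made up, and I'm assuming the LLM judge's per-model verdicts are already available as booleans.

```python
# Hypothetical sketch of the Stage W2 filter described in the quote above.
# Names are illustrative; only the "at most three correct" rule comes from the quote.

MAX_CORRECT = 3  # "If at most three models answered correctly, the contributor could proceed."


def passes_w2(judged_correct: list[bool], max_correct: int = MAX_CORRECT) -> bool:
    """Return True if the question may proceed past W2.

    judged_correct holds one verdict per model: True if the LLM judge
    considered that model's answer to match the original answer.
    """
    return sum(judged_correct) <= max_correct


# A question every model solves (e.g. one memorized from training data) is rejected.
print(passes_w2([True, True, True, True, True]))
# A question only two of five models solve may proceed.
print(passes_w2([True, False, True, False, False]))
```

Note this only screens out questions that most of the models get right; it says nothing about questions that are in the training data but that the models still fail on, hence "partially".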