| ▲ | strongpigeon 2 hours ago | |
Something that I don't understand after reading the technical report is: Why is having access to a python interpreter as part of the harness not allowed (like the Duke harness), but using one hidden behind the model API (as a built-in tool) considered kosher? | ||
| ▲ | cdetrio 14 minutes ago | parent [-] | |
The Duke harness was specifically designed for these puzzles, that's why they don't want to measure it. My reading of that part in the technical report (models "could be using their own tools behind the model’s API, which is a blackbox"), is that there's no way to prevent it. But from fchollet's comment here, using tools and harnesses is encouraged, as long as they are generic and not arc-agi specific. In that case, the models should be benchmarked by prompting through claude code and codex, rather than the through API (as from the api we only expect raw LLM output, and no tool use). | ||