cdetrio an hour ago
The Duke harness was specifically designed for these puzzles, which is why they don't want to measure it. My reading of that part of the technical report (models "could be using their own tools behind the model's API, which is a blackbox") is that there's no way to prevent it. But from fchollet's comment here, using tools and harnesses is encouraged, as long as they are generic and not ARC-AGI-specific. In that case, the models should be benchmarked by prompting through Claude Code and Codex rather than through the API (since from the API we only expect raw LLM output, with no tool use).
FINDarkside an hour ago | parent
OpenAI does have Python execution behind its general-purpose API, but it has to be enabled with a flag, so I don't think it was used.
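A minimal sketch of what that opt-in looks like: the request body only gains a code-execution tool when the caller asks for it, so a plain API call yields raw model output. The field names below are modeled loosely on OpenAI's tool-enabling request shape, but the exact schema, model name, and helper function here are assumptions for illustration, not the verified API.

```python
# Sketch: code execution is opt-in per request, not on by default.
# Field names and the model name are illustrative assumptions.
def build_request(prompt: str, enable_python: bool = False) -> dict:
    """Assemble a hypothetical request body; without the tools entry,
    the API is expected to return plain LLM text with no tool use."""
    body = {"model": "some-model", "input": prompt}
    if enable_python:
        # The explicit flag the comment refers to: the caller must
        # request the interpreter tool for Python execution to happen.
        body["tools"] = [{"type": "code_interpreter"}]
    return body

# Default request carries no tool — raw output only.
plain = build_request("solve this puzzle")
# Opted-in request advertises the interpreter tool.
tooled = build_request("solve this puzzle", enable_python=True)
```

The point of the sketch is just the asymmetry: a benchmark run that never sets the flag should not get server-side Python execution, which is why the default API path is treated as tool-free.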