languid-photic | 5 hours ago
Naively tested a set of agents on this task. Each ran the same spec headlessly in their native harness (one shot). Results:
Clearly none beat Anthropic's target, but gpt-5-2 did slightly better in much less time than "Claude Opus 4 after many hours in the test-time compute harness".
lawrencechen | 4 hours ago
codex cli + gpt-5-2-codex-xhigh got to 1606 with the prompt "beat 1487 cycles. go." ~53 minutes.
raphaelj | an hour ago
Could you try some open-weight models, e.g. Qwen3-coder, GLM-4.7, or Devstral-2?
ponyous | 4 hours ago
Very interesting, thanks! I wonder what would happen if you kept running Gemini in a loop for a while. Considering how much faster it finished, it seems like there's a lot more potential.
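A keep-best retry loop around the agent would be a cheap way to test that. A minimal sketch; the agent command, the `make bench` scorer, and the cycle-count regex are all placeholders that would have to be adapted to whatever the actual task exposes:

    import pathlib
    import re
    import shutil
    import subprocess

    REPO = pathlib.Path("spec-repo")  # working checkout the agent edits in place
    AGENT_CMD = ["gemini", "-p", "improve the current cycle count"]  # hypothetical flags

    def measure(repo: pathlib.Path) -> int | None:
        """Placeholder scorer: run the task's own benchmark and parse the cycle count."""
        out = subprocess.run(["make", "bench"], cwd=repo, capture_output=True, text=True)
        match = re.search(r"(\d+)\s+cycles", out.stdout)
        return int(match.group(1)) if match else None

    best = None
    for attempt in range(10):  # fixed retry budget
        subprocess.run(AGENT_CMD, cwd=REPO)
        cycles = measure(REPO)
        # Higher appears to be better for this task, going by the numbers upthread.
        if cycles is not None and (best is None or cycles > best):
            best = cycles
            shutil.copytree(REPO, pathlib.Path(f"best-{attempt}"))  # snapshot the improvement
            print(f"attempt {attempt}: new best, {best} cycles")

Snapshotting on each improvement matters because the agent mutates the checkout in place, so a later, worse attempt would otherwise clobber the best solution.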
a24j | 3 hours ago
Can you share the agent-comparison harness code, or point to something similar? I'd like to learn the basics of benchmarking models in practice.
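To make the question concrete, I imagine something roughly like this: each agent gets an identical copy of the task repo, runs headlessly once, and gets timed. The per-agent commands below are hypothetical and would need to match each CLI's actual headless invocation:

    import pathlib
    import shutil
    import subprocess
    import tempfile
    import time

    # Hypothetical headless invocations; the real flags differ per CLI and model.
    AGENTS = {
        "codex": ["codex", "exec", "beat 1487 cycles. go."],
        "claude": ["claude", "-p", "beat 1487 cycles. go."],
        "gemini": ["gemini", "-p", "beat 1487 cycles. go."],
    }

    SPEC_REPO = pathlib.Path("spec-repo")  # the task checkout every agent starts from

    results = {}
    for name, cmd in AGENTS.items():
        workdir = pathlib.Path(tempfile.mkdtemp(prefix=f"{name}-")) / "repo"
        shutil.copytree(SPEC_REPO, workdir)  # identical starting state for each agent
        start = time.monotonic()
        proc = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
        results[name] = (time.monotonic() - start, proc.returncode, workdir)

    for name, (secs, code, path) in results.items():
        print(f"{name}: {secs / 60:.1f} min, exit {code}, solution left in {path}")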
forgotpwd16 | 4 hours ago
Could you make a repo with each model's solution in its own dir/branch, for comparison?
giancarlostoro | 4 hours ago
I do wonder how Grok would compare, specifically their Grok Code Fast model.