yorwba 7 hours ago

SWE-Bench Pro consists of 1865 tasks (https://arxiv.org/abs/2509.16941). Qwen3-Coder-Next solved 44.3% of these tasks (826 or 827). Solving a single task took between ≈50 and ≈280 agent turns, ≈150 on average. In other words, a single pass through the dataset took ≈280,000 agent turns (1865 × ≈150). Kimi-K2.5 solved ≈84 fewer tasks, but also took only about a third as many agent turns.
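
For anyone who wants to check the arithmetic, here is a rough sketch in Python. The constant and variable names are mine, and the inputs are the approximate figures quoted above, so treat the results as ballpark numbers rather than exact measurements.

```python
# Back-of-the-envelope check of the figures in this comment.
# All inputs are the approximate values quoted above; names are illustrative.
TOTAL_TASKS = 1865        # tasks in SWE-Bench Pro
REPORTED_RATE = 44.3      # percent solved, reported to one decimal place
AVG_TURNS_PER_TASK = 150  # approximate average agent turns per task

# Which solved-task counts are consistent with a rate that rounds to 44.3%?
candidates = [
    n for n in range(TOTAL_TASKS + 1)
    if round(100 * n / TOTAL_TASKS, 1) == REPORTED_RATE
]

# Approximate number of agent turns for one pass through the dataset.
total_turns = TOTAL_TASKS * AVG_TURNS_PER_TASK

print(candidates)   # solved-task counts compatible with 44.3%
print(total_turns)  # on the order of 280,000
```

The rate is only reported to one decimal place, which is why two solved-task counts fit it.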

zamadatix 5 hours ago | parent | next [-]

Ah, a spread of the individual tests makes plenty of sense! Many thanks (same goes to the other comments).

regularfry 6 hours ago | parent | prev [-]

If this is genuinely better than K2.5 even at a third the speed then my OpenRouter credits are going to go unused.