bisonbear 4 hours ago

I'm currently working on benchmarking the Opus 4.7 reasoning curve against real-world tasks, and have found that reasoning effort does not seem to improve results monotonically (at least on the slice I'm looking at). I've been puzzling over this, but perhaps the fact that Claude Code has adaptive thinking explains some of it: even at medium reasoning effort, it can use more thinking tokens when needed to solve a complex problem.
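For context, each effort level in my harness pins a thinking-token budget per request. A minimal sketch of that step, assuming the Anthropic SDK's extended-thinking parameter; the model id and the effort-to-budget mapping are placeholders, not my actual config:

    import anthropic

    client = anthropic.Anthropic()

    # Hypothetical effort -> fixed thinking budget mapping; the real Claude
    # Code levels are adaptive, which is exactly the confound noted above.
    THINKING_BUDGETS = {
        "low": 2_000,
        "medium": 6_000,
        "high": 10_000,
        "xhigh": 14_000,
        "max": 20_000,
    }

    def run_task(prompt: str, effort: str) -> str:
        budget = THINKING_BUDGETS[effort]
        response = client.messages.create(
            model="claude-opus-4-5",     # placeholder model id
            max_tokens=budget + 8_000,   # must exceed the thinking budget
            thinking={"type": "enabled", "budget_tokens": budget},
            messages=[{"role": "user", "content": prompt}],
        )
        # Drop thinking blocks; keep only the final text answer.
        return "".join(b.text for b in response.content if b.type == "text")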

Snapshot of the results:

Opus 4.7 on graphql-go-tools:

  Effort   Pass   Equivalent  Review-pass  Custom avg  Cost/task  Time/task
  Low      23/29  10/29        5/29        2.598       $2.50      384s
  Medium   28/29  14/29       10/29        2.759       $3.15      451s
  High     26/29  12/29        7/29        2.670       $5.01      716s
  Xhigh    25/29  11/29        4/29        2.669       $6.51      804s
  Max      27/29  13/29        8/29        2.690       $8.84      997s

(Custom avg is an LLM-as-a-judge score over a set of rubrics, graded out of 4.)
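For anyone curious, the judging step is roughly this shape. A sketch only: the rubric text and judge model here are hypothetical stand-ins, not my actual rubrics:

    import anthropic

    client = anthropic.Anthropic()

    # Hypothetical rubric dimensions; each is graded 0-4 and averaged.
    RUBRICS = [
        "Correctness: does the change implement the requested behavior?",
        "Code quality: is the change idiomatic for this codebase?",
        "Scope: does the change avoid unrelated edits?",
    ]

    def judge(task: str, diff: str) -> float:
        scores = []
        for rubric in RUBRICS:
            response = client.messages.create(
                model="claude-opus-4-5",  # placeholder judge model
                max_tokens=16,
                messages=[{
                    "role": "user",
                    "content": (
                        f"Task:\n{task}\n\nDiff:\n{diff}\n\n"
                        f"Rubric: {rubric}\n"
                        "Grade from 0 to 4. Reply with a single integer."
                    ),
                }],
            )
            scores.append(int(response.content[0].text.strip()))
        # The custom avg above is the mean across rubric dimensions.
        return sum(scores) / len(scores)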

Practically, the results indicate that medium delivers outcomes as good as or better than the higher reasoning efforts (within variance) at much lower cost and latency: high costs ~60% more per task ($5.01 vs $3.15) and takes ~60% longer (716s vs 451s) while scoring lower on every metric here.