Remix.run Logo
andai 4 hours ago

For a fair comparison you need to look at the total cost, because 4.7 produces significantly fewer output tokens than 4.6, and seems to cost significantly less on the reasoning side as well.

Here is a comparison for 4.5, 4.6 and 4.7 (Output Tokens section):

https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...

4.7 comes out slightly cheaper than 4.6. But 4.5 is about half the cost:

https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...

Notably the cost of reasoning has been cut almost in half from 4.6 to 4.7.

I'm not sure what that looks like for most people's workloads, i.e. what the cost breakdown looks like for Claude Code. I expect it's heavy on both input and reasoning, so I don't know how that balances out, now that input is more expensive and reasoning is cheaper.

On reasoning-heavy tasks, it might be cheaper. On tasks which don't require much reasoning, it's probably more expensive. (But for those, I would use Codex anyway ;)

matheusmoreira 2 hours ago | parent | next [-]

It thinks less and produces less output tokens because it has forced adaptive thinking that even API users can't disable. Same adaptive thinking that was causing quality issues in Opus 4.6 not even two weeks ago. The one bcherny recommended that people disable because it'd sometimes allocate zero thinking tokens to the model.

https://news.ycombinator.com/item?id=47668520

People are already complaining about low quality results with Opus 4.7. I'm also spotting it making really basic mistakes.

I literally just caught it lazily "hand-waving" away things instead of properly thinking them through, even though it spent like 10 minutes churning tokens and ate only god knows how many percentage points off my limits.

> What's the difference between this and option 1.(a) presented before?

> Honestly? Barely any. Option M is option 1.(a) with the lifecycle actually worked out instead of hand-waved.

> Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.

> Fair call. I was pattern-matching on "mutation + capture = scary" without actually reading the capture code. Let me do the work properly.

> You were right to push back. I was wrong. Let me actually trace it properly this time.

> My concern from the first pass was right. The second pass was me talking myself out of it with a bad trace.

It's just a constant stream of self-corrections and doubts. Opus simply cannot be trusted when adaptive thinking is enabled.

Can provide session feedback IDs if needed.

rectang 19 minutes ago | parent [-]

Are the benchmarks being used to measure these models biased towards completing huge and highly complex tasks, rather than ensuring correctness for less complex tasks?

It seems like they're working hard to prioritize wrapping their arms around huge contexts, as opposed to handling small tasks with precision. I prefer to limit the context and the scope of the task and focus on trying to get everything right in incremental steps.

matheusmoreira 10 minutes ago | parent [-]

I don't think there's a bias here. I'd say my task is of somewhat high complexity. I'm using Claude to assist me in implementing exceptions in my programming language. It's a SICP chapter 5.4 level task. There are quite a few moving parts in this thing. Opus 4.6 once went around in circles for half an hour trying to trace my interpreter's evaluator. As a human, it's not an easy task for me to do either.

I think the problem just comes down to adaptive thinking allowing the model to choose how much effort it spends on things, a power which it promptly abuses to be as lazy as possible. CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 significantly improved Opus 4.6's behavior and the quality of its results. But then what do they do when they release 4.7?

https://code.claude.com/docs/en/model-config

> Opus 4.7 always uses adaptive reasoning.

> The fixed thinking budget mode and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING do not apply to it.

QuantumGood an hour ago | parent | prev [-]

Some have defined "fair" as tests of the same model at different times, as the behavior and token usage of a model changes despite the version number remaining the same. So testing model numbers at different times matters, unfortunately, and that means recent tests might not be what you would want to compare to future tests.