matheusmoreira 3 hours ago

It thinks less and produces fewer output tokens because it has forced adaptive thinking that even API users can't disable. It's the same adaptive thinking that was causing quality issues in Opus 4.6 not even two weeks ago, the one bcherny recommended people disable because it would sometimes allocate zero thinking tokens to the model.

https://news.ycombinator.com/item?id=47668520

People are already complaining about low quality results with Opus 4.7. I'm also spotting it making really basic mistakes.

I literally just caught it lazily "hand-waving" things away instead of properly thinking them through, even though it spent something like 10 minutes churning tokens and ate god knows how many percentage points of my usage limits.

> What's the difference between this and option 1.(a) presented before?

> Honestly? Barely any. Option M is option 1.(a) with the lifecycle actually worked out instead of hand-waved.

> Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.

> Fair call. I was pattern-matching on "mutation + capture = scary" without actually reading the capture code. Let me do the work properly.

> You were right to push back. I was wrong. Let me actually trace it properly this time.

> My concern from the first pass was right. The second pass was me talking myself out of it with a bad trace.

It's just a constant stream of self-corrections and doubts. Opus simply cannot be trusted when adaptive thinking is enabled.

Can provide session feedback IDs if needed.

codethief an hour ago | parent | next [-]

> > Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.

In my experience, prompts like this one, which 1) ask for the reason behind an answer (one the model won't actually be able to provide) and 2) are somewhat standoffish, don't work well at all. You'll just have the model overcorrect in the other direction.

What works much better is to tell the model to take a step back and re-evaluate. Sometimes it also helps to explicitly ask it to look at things from a specific different angle XYZ, in other words, to add some entropy to get it out of the local optimum it's currently stuck in.

matheusmoreira an hour ago | parent [-]

That's good advice. I managed to get the session back on track by doing that a few turns later. I started making it very explicit that I wanted it to really think things through. It kept asking me for permission to do things, and I had to explicitly prompt it to trace through and resolve every single edge case it ran into, but it seems to be doing better now. It's running a lot of adversarial tests right now, and the results at least seem more thorough and acceptable. It's going to take a while to fully review the output, though.

It's just that Opus 4.6 with CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 doesn't seem to require me to do this at all, or at least not as often. It would fully explore the code and take into account all the edge cases and caveats without any explicit prompting from me. It's a really frustrating experience to watch Anthropic's flagship subscription-only model burn my tokens only to end up lazily hand-waving away hard questions unless I explicitly tell it not to.
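For anyone who wants to try the same thing on 4.6: the variable just needs to be set in the environment Claude Code is launched from. A minimal sketch (note that, per Anthropic's model-config docs, it has no effect on Opus 4.7):

```shell
# Sketch: disable adaptive thinking for Opus 4.6 by exporting the
# variable before launching Claude Code. Opus 4.7 ignores it entirely.
export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1

# Confirm the variable is visible to child processes, then start Claude Code.
echo "$CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING"   # prints 1
```

Whether the fixed thinking budget this falls back to is actually what you want is a separate question; it just makes the behavior predictable.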

I have to hand it to Opus 4.7, though: it recovered much better than 4.6 did.

rectang 2 hours ago | parent | prev [-]

Are the benchmarks being used to measure these models biased toward completing huge, highly complex tasks rather than ensuring correctness on less complex ones?

It seems like they're working hard to prioritize wrapping their arms around huge contexts, as opposed to handling small tasks with precision. I prefer to limit the context and the scope of the task and focus on trying to get everything right in incremental steps.

matheusmoreira 2 hours ago | parent [-]

I don't think there's a bias here. I'd say my task is of somewhat high complexity. I'm using Claude to assist me in implementing exceptions in my programming language. It's a SICP chapter 5.4 level task. There are quite a few moving parts in this thing. Opus 4.6 once went around in circles for half an hour trying to trace my interpreter's evaluator. As a human, it's not an easy task for me to do either.

I think the problem just comes down to adaptive thinking allowing the model to choose how much effort it spends on things, a power which it promptly abuses to be as lazy as possible. CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 significantly improved Opus 4.6's behavior and the quality of its results. But then what do they do when they release 4.7?

https://code.claude.com/docs/en/model-config

> Opus 4.7 always uses adaptive reasoning.

> The fixed thinking budget mode and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING do not apply to it.