rectang 2 hours ago
Are the benchmarks being used to measure these models biased towards completing huge and highly complex tasks, rather than ensuring correctness for less complex tasks? It seems like they're working hard to prioritize wrapping their arms around huge contexts, as opposed to handling small tasks with precision. I prefer to limit the context and the scope of the task and focus on trying to get everything right in incremental steps.
matheusmoreira 2 hours ago | parent
I don't think there's a bias here. I'd say my task is fairly complex: I'm using Claude to assist me in implementing exceptions in my programming language. It's a SICP chapter 5.4 level task, and there are quite a few moving parts in this thing. Opus 4.6 once went around in circles for half an hour trying to trace my interpreter's evaluator. As a human, it's not an easy task for me to do either.

I think the problem just comes down to adaptive thinking allowing the model to choose how much effort it spends on things, a power which it promptly abuses to be as lazy as possible. Setting CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 significantly improved Opus 4.6's behavior and the quality of its results.

But then what do they do when they release 4.7? https://code.claude.com/docs/en/model-config

> Opus 4.7 always uses adaptive reasoning.

> The fixed thinking budget mode and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING do not apply to it.
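For anyone on Opus 4.6 who wants to try the same workaround, a minimal sketch of how I set it, assuming the claude CLI picks the variable up from its environment (the launch command itself is just illustrative):

    # disable adaptive thinking for this shell session, then launch Claude Code as usual
    export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1
    claude

Per the docs linked above, this only applies to models that still support fixed thinking budgets, so it's a stopgap rather than a fix.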