ai-tamer 4 days ago

Same. The numbers match your feel. Going from 4.6 to 4.7: +14.6 on MCP-Atlas, +10.9 on SWE-bench Pro, and tool errors cut by two-thirds. But BrowseComp dropped by 4.7 points. Anthropic's own announcement says 4.7 "takes the instructions literally" where 4.6 interpreted them loosely, and recommends re-tuning prompts accordingly. In a conversational loop with an opinionated developer, that can mean lower quality because there is less reasoning: the model executes instead of thinking things through. https://llm-stats.com/blog/research/claude-opus-4-7-vs-opus-... https://www.anthropic.com/news/claude-opus-4-7
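
The re-tuning itself is not dramatic; it is mostly spelling out the reasoning step you used to get for free. A rough sketch with the Anthropic Python SDK, where the model id and the exact system-prompt wording are my own assumptions, not something from the announcement:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # 4.6 would second-guess a terse instruction on its own; 4.7 follows it
    # literally, so the analysis has to be requested explicitly.
    SYSTEM_PROMPT = (
        "You are reviewing and executing developer instructions. Before acting, "
        "state the trade-offs and flag anything that looks wrong or ambiguous. "
        "Only then carry the instruction out."
    )

    response = client.messages.create(
        model="claude-opus-4-7",  # assumed id; check the models list for the real one
        max_tokens=2048,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": "Rename the config loader and update every call site."}],
    )
    print(response.content[0].text)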

siva7 4 days ago | parent

So it became GPT-5.4 xhigh, but at ten times the cost?

ai-tamer 4 days ago | parent | next

We're rich ;-)

More seriously, in a multi-agent setup the per-token cost matters less: a bit of Claude, a bit of Codex, a bit of Gemini-CLI, ... No single model carries the full bill, and having three different training sets catches more "green tests, wrong code" than any single xhigh pass would. Even at 10x per token, one well-placed Opus in the reviewer seat beats one full Opus session on everything.
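
To make the reviewer-seat idea concrete, here is a toy routing sketch. call_model is a hypothetical stub and the model names are just labels, not real client code:

    from dataclasses import dataclass

    @dataclass
    class Step:
        role: str     # "draft", "test" or "review"
        model: str    # label only; each tool really has its own CLI/API
        prompt: str

    def call_model(model: str, prompt: str) -> str:
        """Hypothetical stub standing in for the real Claude/Codex/Gemini clients."""
        return f"[{model}] output for: {prompt}"

    def run_task(task: str) -> list[str]:
        # Cheaper models draft and cross-check; the 10x model only reviews,
        # so it sees one step's worth of tokens instead of the whole loop.
        steps = [
            Step("draft", "codex", f"Implement: {task}"),
            Step("test", "gemini-cli", f"Write tests that would fail on a wrong implementation of: {task}"),
            Step("review", "opus", f"Review the diff against the tests for: {task}; flag any disagreement"),
        ]
        return [f"{s.role}: {call_model(s.model, s.prompt)}" for s in steps]

    if __name__ == "__main__":
        for line in run_task("rename the config loader and update all call sites"):
            print(line)

Even at 10x per token, Opus only pays for the review step, a fraction of running the whole draft/test/review loop on it, and the disagreement between independently trained models is what surfaces the "green tests, wrong code" cases.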
