MrCheeze 4 hours ago
In my experience with the models (watching Claude play Pokemon), they are similar in intelligence but very different in how they approach problems: Opus 4.5 hyperfocuses on completing its original plan, far more than any older or newer version of Claude. Opus 4.6 gets bored quickly and constantly changes its approach if it doesn't get results fast. This makes it waste more time on "easy" tasks where the first approach would have worked, but makes it faster by an order of magnitude on "hard" tasks that require trying different approaches. For this reason, it started off slower than 4.5, but ultimately got as far in 9 days as 4.5 got in 59 days.
KronisLV 4 hours ago
I got the Max subscription and have been using Opus 4.6 since. The model is way above pretty much everything else I've tried for dev work. I'd love for Anthropic to let me (easily) build a hostable server-side solution for parallel tasks without having to go the API key route and pay per token, but I will say that the Claude Code desktop app (more convenient than the TUI one) gets me most of the way there too.
Jach 2 hours ago
I haven't kept up with the Claude plays stuff. Did it ever actually beat the game? I was under the impression that the harness was artificially hampering it, considering how much more easily various versions of ChatGPT and Gemini had beaten the game and even moved on to beating Pokemon Crystal.
DaKevK 3 hours ago
Genuinely one of the more interesting model evals I've seen described. The sunk cost framing makes sense -- 4.5 doubles down, 4.6 cuts losses faster. 9 days vs 59 is a wild result. Makes me wonder how many of the regression complaints are from people hitting 4.6 on tasks where the first approach was obviously correct.