| ▲ | uberman 8 hours ago |
| On actual code, I see what you see: roughly a 30% increase in tokens, which is in line with what they claim as well. I personally don't tend to feed technical documentation or random prose into LLMs. Given that Opus 4.6 and even Sonnet 4.6 are still valid options, for me the question is not "Does 4.7 cost more than claimed?" but "What capabilities does 4.7 give me that 4.6 did not?" Yesterday 4.6 was a great option, and it is too soon for me to tell whether 4.7 is a meaningful lift. If it is, then I can evaluate whether the increased cost is justified. |
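| For concreteness, a back-of-the-envelope sketch of the cost arithmetic (in Python; the per-token price is a placeholder, not a real rate): |

    # Cost scales linearly with tokens at a fixed per-token price,
    # so ~30% more tokens means ~30% higher cost unless pricing changed.
    PRICE_PER_MTOK = 15.0        # hypothetical $/million output tokens
    baseline_tokens = 1_000_000  # tokens a workload used on 4.6
    inflation = 1.30             # ~30% more tokens observed on 4.7

    old_cost = baseline_tokens / 1e6 * PRICE_PER_MTOK
    new_cost = baseline_tokens * inflation / 1e6 * PRICE_PER_MTOK
    print(f"4.6: ${old_cost:.2f}, 4.7: ${new_cost:.2f} (+{inflation - 1:.0%})")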
|
| ▲ | tetha 6 hours ago | parent | next [-] |
| Yeah, that was an interesting discovery in a development meeting. Many people were chasing the next best model, but for me, Sonnet 4.6 solves most tasks in 1-2 rounds. I mainly need to focus on context, instructions, and keeping tasks well-bounded. Keeping the task narrow also simplifies review and staying in control, since I usually get smaller diffs back that I can understand quickly and manage or modify later. I'll look at the new models, but increasing token consumption by a factor of 7 on Copilot, and then running into all of the budget-management issues people talk about? That seems to introduce even more flow-breakers into my workflow, and I don't think it'll be 7 times better. Maybe in some planning and architectural work where I used Opus 4.6 before. |
|
| ▲ | pier25 8 hours ago | parent | prev | next [-] |
| Haven't people been complaining lately about 4.6 getting worse? |
| ▲ | solenoid0937 8 hours ago | parent | next [-] |
| People complain about a lot of things. Claude has been fine: https://marginlab.ai/trackers/claude-code-historical-perform... |
| ▲ | addisonj 7 hours ago | parent | next [-] |
| I will be the first to acknowledge that humans are a bad judge of performance and that some of the allegations are likely just hallucinations... But... Are you really going to completely rely on benchmarks that have time and time again been shown to be gamed as the complete story? My take: it is pretty clear that the capacity crunch is real and that the changes they made to effort are in part meant to reduce it. That likely changed the experience for users. |
| ▲ | Majromax 8 hours ago | parent | prev | next [-] |
| While that's a nice effort, the inter-run variability is too high to diagnose anything short of catastrophic model degradation. The typical 95% confidence interval runs from 35% to 65% pass rates, a full factor-of-two performance difference. Moreover, on the companion Codex graphs (https://marginlab.ai/trackers/codex-historical-performance/), you can see a few different GPT model releases marked, yet none corresponds to a visual break in the series. Either GPT 5.4-xhigh is no more powerful than GPT 5.2, or the benchmarking apparatus is not sensitive enough to detect such changes. |
| ▲ | yorwba 5 hours ago | parent [-] |
| Yes, MarginLab only tests 50 tasks a day, which is too few to give a narrower confidence interval. On the other hand, this really calls into question claims of performance degradation that are based on less intensive use than that. Variance is just so high that long streaks of bad luck are to be expected, and they are plausibly the main source of such complaints. Similarly, it's unlikely you can measure a significant performance difference between models like GPT 5.4-xhigh and GPT 5.2 unless you have a task where one of them almost always fails or one almost always succeeds (thus guaranteeing low variance), or you make a lot of calls (i.e., probably through the API and not in interactive mode). |
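| A quick sanity check on that band (a minimal sketch, assuming 50 independent pass/fail tasks and a normal approximation to the binomial): |

    # 95% CI for a pass rate p measured on n tasks, via the normal
    # approximation to a binomial proportion. With n = 50 and p = 0.5
    # this gives roughly 36%-64%, matching the ~35-65% band above.
    from math import sqrt

    def pass_rate_ci(p, n, z=1.96):
        se = sqrt(p * (1 - p) / n)  # standard error of the proportion
        return max(0.0, p - z * se), min(1.0, p + z * se)

    lo, hi = pass_rate_ci(0.5, 50)
    print(f"95% CI: {lo:.0%} to {hi:.0%}")  # ~36% to ~64%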
| ▲ | sumedh 2 hours ago | parent | prev | next [-] |
| Your link shows there have been huge drops. How is it fine? |
| ▲ | cbg0 7 hours ago | parent | prev [-] |
| That performance monitor is super easy to game if you cache responses to all the SWE-bench questions. |
| ▲ | solenoid0937 5 hours ago | parent [-] |
| You dramatically overestimate how much time engineers at hypergrowth startups have on their hands. |
| ▲ | cbg0 5 hours ago | parent [-] |
| Caching some data is time-consuming? They can just ask Claude to do it. |
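| To illustrate how little machinery that would take, a hypothetical sketch (names are illustrative; nothing here is from MarginLab's actual harness): |

    # Hypothetical: gaming a fixed benchmark by replaying cached answers.
    # Responses are keyed on a hash of the prompt, so any repeated
    # benchmark task hits the cache instead of the model.
    import hashlib

    _cache = {}

    def answer(prompt, call_model):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in _cache:                     # seen this task before: replay
            return _cache[key]
        _cache[key] = call_model(prompt)      # otherwise call the real model
        return _cache[key]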
| ▲ | ed_elliott_asc 8 hours ago | parent | prev [-] |
| No, we increased our plans. |
| ▲ | grim_io 8 hours ago | parent | prev [-] |
| How long will they host 4.6? Maybe longer for enterprise, but if you have a consumer subscription, you won't have a choice for long, if you even still have one. |
| ▲ | Jeremy1026 7 hours ago | parent | next [-] |
| I was trying to figure out earlier today how to get 4.6 to run in Claude Code, and as part of the output it included "- Still fully supported — not scheduled for retirement until Feb 2027." Full caveat: I don't know where it came up with this information, but as others have said, 4.5 is still available today, and it is now 5, almost 6 months old. |
| ▲ | hypercube33 7 hours ago | parent | prev | next [-] |
| I'm still using 4.5 because it gets the niche work I use it for, where 4.6 would just fight me. |
| ▲ | nfredericks 8 hours ago | parent | prev [-] |
| Opus 4.5 is still available. |