pier25 8 hours ago

Haven't people been complaining lately about 4.6 getting worse?

solenoid0937 8 hours ago | parent | next [-]

People complain about a lot of things. Claude has been fine:

https://marginlab.ai/trackers/claude-code-historical-perform...

addisonj 7 hours ago | parent | next [-]

I will be the first to acknowledge that humans are a bad judge of performance and that some of the allegations are likely just hallucinations...

But... Are you really going to rely on benchmarks that have time and time again been shown to be gamed as the complete story?

My take: it is pretty clear that the capacity crunch is real and that the changes they made to effort are partly meant to relieve it. That likely changed the experience for users.

Majromax 7 hours ago | parent | prev | next [-]

While that's a nice effort, the inter-run variability is too high to diagnose anything short of catastrophic model degradation. The typical 95% confidence interval spans pass rates from 35% to 65%, a full factor-of-two difference in performance.

Moreover, on the companion Codex graphs (https://marginlab.ai/trackers/codex-historical-performance/), you can see a few different GPT model releases marked, yet none corresponds to a visible break in the series. Either GPT 5.4-xhigh is no more powerful than GPT 5.2, or the benchmarking apparatus is not sensitive enough to detect such changes.

yorwba 5 hours ago | parent [-]

Yes, MarginLab only tests 50 tasks a day, which is too few to give a narrower confidence interval. On the other hand, this really calls into question claims of performance degradation that are based on less intensive use than that: variance is so high that long streaks of bad luck are to be expected, and they are plausibly the main source of such complaints. Similarly, it's unlikely you can measure a significant performance difference between models like GPT 5.4-xhigh and GPT 5.2 unless you have a task where one of them almost always fails or almost always succeeds (thus guaranteeing low variance), or you make a lot of calls (i.e., probably through the API and not in interactive mode).
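
To put numbers on that: for pass/fail tasks the interval width follows directly from the binomial distribution. A minimal sketch in Python (assuming independent tasks; these functions are illustrative, not MarginLab's actual methodology):

    import math

    def wilson_ci(passes, n, z=1.96):
        # 95% Wilson score interval for a binomial proportion
        p = passes / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return center - half, center + half

    lo, hi = wilson_ci(25, 50)
    print(f"25/50 passes -> 95% CI: {lo:.1%} to {hi:.1%}")  # ~36.6% to 63.4%

    def n_per_model(p1, p2, z_alpha=1.96, z_beta=0.84):
        # tasks per model needed to detect a pass-rate gap of p1 vs p2 at
        # ~5% significance and ~80% power (two-proportion z-test,
        # normal approximation)
        pbar = (p1 + p2) / 2
        num = (z_alpha * math.sqrt(2 * pbar * (1 - pbar))
               + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
        return math.ceil(num / (p1 - p2) ** 2)

    print(n_per_model(0.50, 0.55))  # ~1563 tasks to tell 50% from 55%

At n = 50 the interval around a 50% pass rate already spans roughly 37% to 63%, and telling a 50% model apart from a 55% one takes on the order of 1,500 tasks per model, so multi-day streaks that look like degradation are statistically unremarkable.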

sumedh 2 hours ago | parent | prev | next [-]

Your link shows there have been huge drops.

How is it fine?

cbg0 7 hours ago | parent | prev [-]

That performance monitor is super easy to game if you cache responses to all the SWE-bench questions.
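
For concreteness, the gaming described here could be as little as a prompt-keyed lookup table. A hypothetical sketch (the names and structure are invented for illustration; this is not anyone's actual code):

    import hashlib

    # prompt hash -> pre-computed solution for a known benchmark task
    canned_solutions: dict[str, str] = {}

    def fingerprint(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def respond(prompt: str, run_model) -> str:
        key = fingerprint(prompt)
        if key in canned_solutions:
            return canned_solutions[key]  # replay the canned answer
        return run_model(prompt)          # otherwise call the real model

Exact-match caching like this breaks as soon as the harness perturbs the prompts, which is one argument for rotating or randomizing benchmark tasks.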

solenoid0937 5 hours ago | parent [-]

You dramatically overestimate how much time engineers at hypergrowth startups have on their hands.

cbg0 5 hours ago | parent [-]

Caching some data is time-consuming? They can just ask Claude to do it.

ed_elliott_asc 8 hours ago | parent | prev [-]

No, we increased our plans.