verdverm 6 hours ago

Some of the benchmarks went down; has that happened before?

andy12_ 6 hours ago | parent | next [-]

If you mean for Anthropic in particular, I don't think so. But it's not the first time a major AI lab has published an incremental update of a model that is worse at some benchmarks. I remember that a particular update of Gemini 2.5 Pro improved its LiveCodeBench results but scored lower on most other benchmarks.

https://news.ycombinator.com/item?id=43906555

grandinquistor 6 hours ago | parent | prev | next [-]

Probably deprioritizing other areas to focus on SWE capabilities, since I reckon most of their revenue comes from enterprise coding usage.

cmrdporcupine 6 hours ago | parent [-]

It's frankly becoming difficult for me to imagine what the next level of coding excellence looks like, though.

By which I mean, I don't find that these latest models have huge cognitive gaps. There are few problems I throw at them that they can't solve.

And it feels to me like the gap now isn't model performance, it's the agentic harnesses they're running in.
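
To make "harness" concrete, here's roughly the loop I mean, with the model and tool calls stubbed out as placeholders (not any vendor's actual API):

    import json

    def call_model(messages):
        # Placeholder for an LLM API call; returns either a tool
        # request or a final answer. Stubbed so the sketch runs.
        return {"type": "final", "content": "done"}

    def run_tool(name, args):
        # Placeholder for tool execution (shell, file edits, tests).
        return f"ran {name} with {json.dumps(args)}"

    def agent_loop(task, max_steps=10):
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = call_model(messages)
            if reply["type"] == "final":
                return reply["content"]
            # Model asked for a tool: run it, feed the result back.
            result = run_tool(reply["tool"], reply["args"])
            messages.append({"role": "tool", "content": result})
        return "step budget exhausted"

    print(agent_loop("fix the failing test"))

Model quality lives entirely inside call_model; everything else (step budget, tool set, how results are fed back) is harness design, which is where I think the remaining headroom is.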

nothinkjustai 5 hours ago | parent [-]

Ask it to create an iOS app which natively runs Gemma via LiteRT-LM.

It's trivial to find tasks outside their capabilities. In fact, most of what I want AI to do it simply can't do, and what it can do doesn't interest me.

ACCount37 6 hours ago | parent | prev | next [-]

Constantly. Minor revisions can easily "wobble" on benchmarks the training didn't explicitly optimize for.

Whether it's genuine loss of capability or just measurement noise is typically unclear.
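
For a rough sense of scale: on a benchmark of, say, 200 items (a hypothetical count), a few points of run-to-run wobble is within sampling noise. A back-of-the-envelope check using the normal approximation to the binomial:

    import math

    def score_ci(p, n, z=1.96):
        # 95% CI for a benchmark pass rate via the normal
        # approximation: p +/- z * sqrt(p*(1-p)/n).
        se = math.sqrt(p * (1 - p) / n)
        return p - z * se, p + z * se

    # Hypothetical: a 78% score on a 200-item benchmark.
    lo, hi = score_ci(0.78, 200)
    print(f"95% CI: {lo:.1%} - {hi:.1%}")  # roughly 72% - 84%

So a score moving a few points between revisions tells you very little on its own; sampling alone produces that much spread.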

grandinquistor 5 hours ago | parent | prev [-]

Looking at the system card for Opus 4.7, the MRCR benchmark used for long-context tasks dropped significantly, from 78% to 32%.

I wonder what caused such a large regression on this benchmark.
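
Whatever the cause, a drop that size can't be measurement noise. A quick two-proportion z-test, assuming a hypothetical 100 items per run (the actual item count isn't stated here):

    import math

    def two_prop_z(p1, p2, n1, n2):
        # z-statistic for the difference between two pass rates,
        # using the pooled-proportion standard error.
        pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        return (p1 - p2) / se

    # Hypothetical: 78% vs 32% on 100 items each.
    print(f"z = {two_prop_z(0.78, 0.32, 100, 100):.1f}")  # about 6.5

At six-plus sigma even with only 100 items, something actually changed, whether in the model itself, the eval harness, or the long-context configuration; it isn't the kind of wobble described above.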