grandinquistor 9 hours ago

Quite a big improvement in coding benchmarks, doesn’t seem like progress is plateauing as some people predicted.

msavara 8 hours ago | parent | next [-]

Only in benchmarks. After a couple of minutes of use it feels just as dumb as the nerfed 4.6.

solenoid0937 8 hours ago | parent [-]

It's a lot better for me, especially on xhigh.

cpan22 7 hours ago | parent | prev | next [-]

But it majorly regressed in long-context retrieval, which is arguably getting more and more important?

verdverm 9 hours ago | parent | prev | next [-]

Some of the benchmarks went down, has that happened before?

andy12_ 9 hours ago | parent | next [-]

If you mean for Anthropic in particular, I don't think so. But it's not the first time a major AI lab has published an incremental update of a model that is worse at some benchmarks. I remember that a particular update of Gemini 2.5 Pro improved results on LiveCodeBench but scored lower on most other benchmarks.

https://news.ycombinator.com/item?id=43906555

grandinquistor 9 hours ago | parent | prev | next [-]

Probably deprioritizing other areas to focus on SWE capabilities, since I reckon most of their revenue comes from enterprise coding usage.

cmrdporcupine 9 hours ago | parent [-]

It's frankly becoming difficult for me to imagine what the next level of coding excellence looks like, though.

By which I mean, I don't find these latest models really have huge cognitive gaps. There are few problems I throw at them that they can't solve.

And it feels to me like the gap now isn't model performance, it's the agentic harnesses they're running in.

nothinkjustai 8 hours ago | parent [-]

Ask it to create an iOS app which natively runs Gemma via LiteRT-LM.

It's incredibly easy to find stuff outside their capabilities. In fact, most of the stuff I want AI to do it just can't, and the stuff it can do isn't interesting to me.
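For concreteness, here is a minimal sketch of what the Swift side of such an app might look like. It goes through MediaPipe's LLM Inference API (which runs on LiteRT under the hood) rather than LiteRT-LM directly, since that is the documented iOS path; the model filename and sampling parameters below are assumptions.

    import Foundation
    import MediaPipeTasksGenai

    // Assumption: a Gemma checkpoint converted for on-device inference is
    // bundled with the app; the filename here is illustrative.
    guard let modelPath = Bundle.main.path(forResource: "gemma-2b-it-cpu-int4",
                                           ofType: "bin") else {
        fatalError("Gemma model bundle missing from app resources")
    }

    // Configure the on-device LLM: model path plus basic sampling parameters.
    let options = LlmInference.Options(modelPath: modelPath)
    options.maxTokens = 512
    options.temperature = 0.8

    do {
        // Initialize the inference engine and run a single blocking generation.
        let llmInference = try LlmInference(options: options)
        let response = try llmInference.generateResponse(
            inputText: "Summarize the benefits of on-device inference.")
        print(response)
    } catch {
        print("LLM inference failed: \(error)")
    }

Note this only covers the single-shot inference call; the checkpoint conversion, dependency setup, and streaming tokens into the UI are where the real work is.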

ACCount37 9 hours ago | parent | prev | next [-]

Constantly. Minor revisions can easily "wobble" on benchmarks that the training didn't explicitly push them for.

Whether it's genuine loss of capability or just measurement noise is typically unclear.

grandinquistor 8 hours ago | parent | prev [-]

Looking at the system card for Opus 4.7, the MCRC benchmark used for long-context tasks dropped significantly, from 78% to 32%.

I wonder what caused such a large regression on this benchmark.

ACCount37 9 hours ago | parent | prev | next [-]

People have been "predicting" the plateau since GPT-1. By now, it would take extraordinary evidence for me to take such "predictions" seriously.

William_BB 6 hours ago | parent | prev [-]

Are you one of those naive people who still take these coding benchmarks seriously?