grandinquistor 9 hours ago

Quite a big improvement in coding benchmarks, doesn’t seem like progress is plateauing as some people predicted.

msavara 8 hours ago | parent | next [-]

Only in benchmarks. After a couple of minutes of use it feels just as dumb as the nerfed 4.6.

solenoid0937 8 hours ago | parent [-]

It's a lot better for me, especially on xhigh.

cpan22 7 hours ago | parent | prev | next [-]

But it majorly regressed in long-context retrieval, which is arguably getting more and more important?

verdverm 9 hours ago | parent | prev | next [-]

Some of the benchmarks went down, has that happened before?

andy12_ 9 hours ago | parent | next [-]

If you mean for Anthropic in particular, I don't think so. But it's not the first time a major AI lab has published an incremental update of a model that is worse at some benchmarks. I remember that a particular update of Gemini 2.5 Pro improved results on LiveCodeBench but scored lower on most other benchmarks.

https://news.ycombinator.com/item?id=43906555

grandinquistor 9 hours ago | parent | prev | next [-]

Probably deprioritizing other areas to focus on SWE capabilities, since I reckon most of their revenue comes from enterprise coding usage.

cmrdporcupine 9 hours ago | parent [-]

It's frankly becoming difficult for me to imagine what the next level of coding excellence looks like, though.

By which I mean, I don't find these latest models really have huge cognitive gaps. There are few problems I throw at them that they can't solve.

And it feels to me like the gap now isn't model performance, it's the agentic harnesses they're running in.

nothinkjustai 8 hours ago | parent [-]

Ask it to create an iOS app which natively runs Gemma via LiteRT-LM.

It's incredibly easy to find stuff outside their capabilities. In fact, most of the stuff I want AI to do it just can't, and the stuff it can do isn't interesting to me.
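For concreteness, here is a minimal sketch of what the Swift side of such an app might look like. It goes through MediaPipe's LLM Inference API (which runs on LiteRT under the hood) rather than LiteRT-LM directly, since that is the documented iOS path; the model filename and sampling parameters below are assumptions.

    import Foundation
    import MediaPipeTasksGenai

    // Assumption: a Gemma checkpoint converted for on-device inference is
    // bundled with the app; the filename here is illustrative.
    guard let modelPath = Bundle.main.path(forResource: "gemma-2b-it-cpu-int4",
                                           ofType: "bin") else {
        fatalError("Gemma model bundle missing from app resources")
    }

    // Configure the on-device LLM: model path plus basic sampling parameters.
    let options = LlmInference.Options(modelPath: modelPath)
    options.maxTokens = 512
    options.temperature = 0.8

    do {
        // Initialize the inference engine and run a single blocking generation.
        let llmInference = try LlmInference(options: options)
        let response = try llmInference.generateResponse(
            inputText: "Summarize the benefits of on-device inference.")
        print(response)
    } catch {
        print("LLM inference failed: \(error)")
    }

Note this only covers the single-shot inference call; the checkpoint conversion, dependency setup, and streaming tokens into the UI are where the real work is.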

ACCount37 9 hours ago | parent | prev | next [-]

Constantly. Minor revisions can easily "wobble" on benchmarks that the training didn't explicitly push them for.

Whether it's genuine loss of capability or just measurement noise is typically unclear.

grandinquistor 8 hours ago | parent | prev [-]

Looking at the system card for Opus 4.7, the MCRC benchmark used for long-context tasks dropped significantly, from 78% to 32%.

I wonder what caused such a large regression on this benchmark.

ACCount37 9 hours ago | parent | prev | next [-]

People have been "predicting" the plateau since GPT-1. By now, it would take extraordinary evidence for me to take such "predictions" seriously.

William_BB 6 hours ago | parent | prev [-]

Are you one of those naive people who still take these coding benchmarks seriously?