| ▲ | grandinquistor 9 hours ago |
Quite a big improvement in coding benchmarks; it doesn't seem like progress is plateauing as some people predicted.
|
| ▲ | msavara 8 hours ago | parent | next [-] |
Only in benchmarks. After a couple of minutes of use, it feels just as dumb as the nerfed 4.6.
|
| ▲ | cpan22 7 hours ago | parent | prev | next [-] |
But it majorly regressed in long-context retrieval, which is arguably becoming more and more important?
|
| ▲ | verdverm 9 hours ago | parent | prev | next [-] |
Some of the benchmarks went down; has that happened before?

| ▲ | andy12_ 9 hours ago | parent | next [-] |
If you mean for Anthropic in particular, I don't think so. But it's not the first time a major AI lab has published an incremental update of a model that is worse at some benchmarks. I remember that a particular update of Gemini 2.5 Pro improved results on LiveCodeBench but scored lower on most other benchmarks. https://news.ycombinator.com/item?id=43906555

| ▲ | grandinquistor 9 hours ago | parent | prev | next [-] |
Probably deprioritizing other areas to focus on SWE capabilities, since I reckon most of their revenue comes from enterprise coding usage.

| ▲ | cmrdporcupine 9 hours ago | parent [-] |
It's frankly becoming difficult for me to imagine what the next level of coding excellence looks like, though. By which I mean, I don't find these latest models really have huge cognitive gaps. There are few problems I throw at them that they can't solve. And it feels to me like the gap now isn't model performance; it's the agentic harnesses they're running in.

| ▲ | nothinkjustai 8 hours ago | parent [-] |
Ask it to create an iOS app that natively runs Gemma via LiteRT-LM. It's trivially easy to find tasks outside their capabilities. In fact, most of what I want AI to do, it just can't, and the stuff it can do isn't interesting to me.
|
| ▲ | ACCount37 9 hours ago | parent | prev | next [-] |
Constantly. Minor revisions can easily "wobble" on benchmarks that the training didn't explicitly push them for. Whether it's genuine loss of capability or just measurement noise is typically unclear.

| ▲ | grandinquistor 8 hours ago | parent | prev [-] |
Looking at the system card for Opus 4.7, the MCRC benchmark used for long-context tasks dropped significantly, from 78% to 32%. I wonder what caused such a large regression on this benchmark.
|
|
| ▲ | ACCount37 9 hours ago | parent | prev | next [-] |
People have been "predicting" the plateau since GPT-1. By now, it would take extraordinary evidence for me to take such "predictions" seriously.
|
| ▲ | William_BB 6 hours ago | parent | prev [-] |
Are you one of those naive people who still take these coding benchmarks seriously?