kqr 8 hours ago

It was never that great, it seems. For all of 2025 there was virtually no improvement in the rate at which models produced quality code. They only got better at passing automated tests.

https://entropicthoughts.com/no-swe-bench-improvement

civvv 5 hours ago | parent | next [-]

This is likely true. I think model quality has stagnated and that it's likely a non-trivial task to find a new improvement vector. Scaling the width of the model (which has been the driving force behind the speed of improvement thus far) seems to have reached its limit.

It will be interesting to see the implications of this. Tooling can only do so much in the long term.

mxwsn 5 hours ago | parent | next [-]

How do you know that width scaling has been the driving force of improvement?

waterTanuki 10 minutes ago | parent [-]

I mean, it's not exactly a PhD-level question. One can infer from the extreme demand for GPUs and DRAM, plus new data center construction, that all the providers are banking on width.

cjsaltlake 5 hours ago | parent | prev | next [-]

But, that's an enormous source of coding productivity, and it's why Anthropic is worth billions... The reason SWE-bench has been so successful and useful for coding is that software engineering has a ton of tradition and infrastructure for making and using automated tests.

greenchair an hour ago | parent | prev [-]

Maybe this is why these companies' pricing plans are getting more limited and expensive.