martinald 11 hours ago

I thought that, but it does do a lot better on other benchmarks.

Perhaps SWE-bench just doesn't capture a lot of the improvement? If the web design improvements people have been posting on Twitter hold up, I suspect this will be a huge boon for developers. SWE-bench is really testing bugfixing/feature dev more.

Anyway let's see. I'm still hyped!

camdenreslink 9 hours ago

It seems the benchmarks that had a big jump were the ones tied to visual capabilities. I wonder how that will translate into improvements for the workloads LLMs are currently used for (or maybe it will introduce new workloads).

rfoo 10 hours ago

SWE-bench doesn't even test bugfixing / feature dev properly after you achieve roughly 70% if you don't benchmaxx it.

catigula 11 hours ago

That would be great! But AI is a bubble if these models can’t do serious engineering work.