| ▲ | martinald 11 hours ago | |
I thought that, but it does do a lot better on other benchmarks. Perhaps SWE-bench just doesn't capture a lot of the improvement? If the web design improvements people have been posting on Twitter are anything to go by, I suspect this will be a huge boon for developers. SWE-bench is really testing bugfixing/feature dev more. Anyway, let's see. I'm still hyped! | ||
| ▲ | camdenreslink 9 hours ago | parent | next [-] | |
It seems the benchmarks that had a big jump had to do with visual capabilities. I wonder how that will translate to improvements to the workloads LLMs are currently used for (or maybe it will introduce new workloads). | ||
| ▲ | rfoo 10 hours ago | parent | prev | next [-] | |
SWE-bench doesn't even test bugfixing / feature dev properly after you achieve roughly 70% if you don't benchmaxx it. | ||
| ▲ | catigula 11 hours ago | parent | prev [-] | |
That would be great! But AI is a bubble if these models can’t do serious engineering work. | ||