▲ | arvindh-manian 6 days ago | |
Interesting link. Worth noting that the pull requests were judged by o3-mini. Further, I'm not sure that 55% vs 45% is a huge difference.
▲ | marsh_mellow 6 days ago | |
Good point. They said they validated the results by testing with other models (including Claude), as well as with manual sanity checks. 55% to 45% definitely isn't a blowout, but it is meaningful: in Elo terms it equates to a difference of about 35 points. So not in a different league, but definitely a clear edge.
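For anyone who wants to check the Elo figure, here is a minimal sketch. It assumes the standard logistic Elo model, where expected score is 1 / (1 + 10^(-d/400)), and treats the 55% as a head-to-head win rate with no ties. The function name `elo_gap` is made up for illustration.

```python
import math

def elo_gap(win_rate: float) -> float:
    """Rating difference implied by a head-to-head win rate.

    Assumes the standard logistic Elo model (scale of 400, base 10)
    and ignores ties. Inverts: win_rate = 1 / (1 + 10 ** (-d / 400)).
    """
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))

# 55% vs 45% split from the thread
print(f"{elo_gap(0.55):.1f} Elo points")
```

Under those assumptions the gap lands in the mid-30s, which is a small but real edge on the usual Elo scale.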
▲ | servercobra 5 days ago | |
Maybe not as much to us, but for people building these tools, 4.1 being significantly cheaper than Claude 3.7 is a huge difference.
▲ | elAhmo 6 days ago | |
I first read it as 55% better, which sounds significantly higher than the ~22% they report here. Sounds misleading.
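The ~22% figure is consistent with reading it as the relative lift of a 55% share over a 45% share. A quick sketch of that arithmetic follows; this interpretation is an assumption on my part rather than something the linked post spells out, and `relative_lift` is a hypothetical helper name.

```python
def relative_lift(winner_share: float, loser_share: float) -> float:
    """Relative improvement of one share over another, as a fraction."""
    if loser_share <= 0:
        raise ValueError("loser_share must be positive")
    return (winner_share - loser_share) / loser_share

# 55% vs 45% preference split from the thread
lift = relative_lift(0.55, 0.45)
print(f"{lift:.1%} relative lift")
```

So "55%" is the share of comparisons won, while "~22%" is how much larger that share is than the other model's.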