freediddy 6 hours ago
Is 51% good enough to use reliably? There's no world in which I use an AI agent that gets even 15% of the code wrong. That's as bad as Tesla FSD, where you need to pay attention to the road while FSD is engaged. What's the point? My attention is what I'm trying to relieve, not mostly correct functionality. The only thing that matters is whether you can one-shot code like Claude or Codex; I'm not interested in a small but mostly-okay-but-annoyingly-buggy-every-now-and-then AI.
VygmraMGVl 6 hours ago
Claude Opus 4.6 scores 51.9% on the same benchmark. Microsoft's result is quite good.
IanCal 5 hours ago
51% does not mean it randomly gets things wrong half the time. These things can be useful if you can accurately predict which tasks they will reliably do, and which they will usually fail on. Then you can get much more reliable work from them.