ilaksh 10 hours ago

The only one it doesn't win is SWE-bench, where it is significantly behind Claude Sonnet. You just can't take down Sonnet.

svantana 10 hours ago

One percentage point is not significant, in either the colloquial or the scientific sense[1].

[1] The binomial formula gives a 95% confidence interval of about ±3.7 percentage points, using p=0.77, N=500.
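
For anyone who wants to check the footnote, here is a minimal sketch. It assumes "binomial formula" means the normal-approximation (Wald) interval for a proportion, with z=1.96 for 95% confidence; the function name `binomial_margin` is made up for illustration.

```python
import math

def binomial_margin(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation (Wald) interval for a proportion.

    p: observed success rate, n: number of trials,
    z: normal quantile (1.96 is assumed here for 95% confidence).
    """
    return z * math.sqrt(p * (1 - p) / n)

# Values from the footnote: p=0.77, N=500
margin = binomial_margin(0.77, 500)
print(f"+/- {margin:.1%}")  # prints "+/- 3.7%"
```

Under those assumptions, a one-point gap on a 500-task benchmark sits well inside the margin.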

stavros 10 hours ago

Codex has been much better than Sonnet for me.

dotancohen 10 hours ago

On what types of tasks?