trunch 11 hours ago

Which of the LiveCodeBench Pro and SWE-Bench Verified benchmarks comes closer to everyday coding assistant tasks?

I ask because it seems to lead by a decent margin on the former and trail on the latter.

veselin 10 hours ago

I do a lot of testing, including on SWE-bench Verified. In my opinion, this benchmark is now mainly good for catching regressions on the agent side.

However, once scores go above 75%, the results are likely about the same. The remaining instances are likely underspecified, despite the effort of the authors who made the benchmark "verified". From what I have seen, these are often cases where the problem statement says to implement X for Y, but the agent simply has to guess whether to implement the same for another case Y', and that guess decides whether it wins or loses the instance.

Snuggly73 10 hours ago

Neither :(

LCB Pro is LeetCode-style questions, and SWE-bench Verified is heavily benchmaxxed, very old Python tasks.