trunch (11 hours ago):
Which of the LiveCodeBench Pro and SWE-bench Verified benchmarks comes closer to everyday coding assistant tasks? Because it seems to lead by a decent margin on the former and trails behind on the latter.
veselin (10 hours ago), in reply:
I do a lot of testing, including on SWE-bench Verified. In my opinion this benchmark is now mainly useful for catching regressions on the agent side. Above roughly 75%, scores are likely about the same: the remaining instances are probably underspecified, despite the effort the authors put into making the benchmark "verified". From what I have seen, these are often cases where the problem statement says "implement X for Y", and the agent simply has to guess whether to implement the same for another case Y', which decides whether the instance is won or lost.
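A minimal hypothetical sketch of that Y vs. Y' ambiguity (the function names and the task are invented for illustration, not taken from an actual SWE-bench instance):

```python
# Hypothetical SWE-bench-style task. Problem statement:
# "make parse_date() accept ISO strings with a trailing 'Z'".
# Hidden tests may also exercise parse_datetime(), which the
# statement never mentions.

from datetime import date, datetime

def parse_date(s: str) -> date:
    # Case Y: the statement explicitly asks for this fix.
    return datetime.fromisoformat(s.replace("Z", "+00:00")).date()

def parse_datetime(s: str) -> datetime:
    # Case Y': should the agent apply the same fix here?
    # The statement is silent, so passing or failing the
    # instance comes down to a guess.
    return datetime.fromisoformat(s.replace("Z", "+00:00"))
```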
Snuggly73 (10 hours ago), in reply:
Neither :( LCB Pro consists of LeetCode-style questions, and SWE-bench Verified is a set of very old, heavily benchmaxxed Python tasks.