ActionHank 4 days ago
That is a weak argument. The point is to benchmark against a human solving a problem. Typically these problems are posed as a question or a blank project, without that history. You are arguing for an apples-to-oranges comparison because the LLM performs better, rather than for a realistic comparison.
sigmoid10 4 days ago
You apparently don't know what SWE-bench is [1]. First of all, it tries to evaluate skills that explicitly go beyond blank-project questions with given solutions. Secondly, its solutions are not necessarily "optimal", and sometimes not even correct. That's because it uses real-world software development examples from actual PRs in popular repos. The humans who wrote those PRs very likely used all the tools at their disposal as well (e.g. web search, git commands, code execution). Assuming an LLM could have solved these just by looking at a piece of code turns out to be very myopic.