sigmoid10 4 days ago
You apparently don't know what SWE-bench is [1]. First of all, it tries to evaluate skills that explicitly go beyond blank project questions with given solutions. Secondly, it does not contain "optimal" or sometimes even correct solutions. That's because it uses real-world software development examples from actual PRs in popular repos. The humans who wrote these very likely used all the tools at their disposal as well (e.g. web search, git commands, code execution). Assuming an LLM could have solved these just by looking at a piece of code turns out to be very myopic.
zelphirkalt 4 days ago
This does not show a good understanding of how to test machine learning models. If you let the "labels" be visible at test time, your test is worthless. Your analogy of "real-world software development" doesn't hold up. In real-world software development, I don't have magical commits from the future that I merely need to check out and look at to see the solution.