ActionHank 4 days ago

That is a weak argument.

The point is to benchmark against a human solving a problem. Typically these problems are posed as a question or a blank project, without that history.

You are arguing for an apples-to-oranges comparison because the LLM performs better, rather than for a realistic comparison.

sigmoid10 4 days ago | parent [-]

You apparently don't know what SWE-bench is [1]. First, it explicitly tries to evaluate skills that go beyond blank-project questions with given solutions. Second, it does not contain "optimal", or in some cases even correct, solutions, because it is built from real-world software development examples: actual PRs in popular repos. The humans behind those PRs very likely used all the tools at their disposal as well (e.g. web search, git commands, code execution). Assuming an LLM should be able to solve these just by looking at a piece of code is very myopic.
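For reference, a SWE-bench task instance looks roughly like this (field names follow the public dataset; loading it via the Hugging Face "datasets" library is just one way to inspect it, so treat the snippet as a sketch rather than the official harness):

    from datasets import load_dataset

    # Each instance pairs a GitHub issue with the repo state before the fix.
    ds = load_dataset("princeton-nlp/SWE-bench", split="test")
    ex = ds[0]

    print(ex["repo"], ex["base_commit"])   # repo snapshot the model starts from
    print(ex["problem_statement"][:300])   # the issue text given to the model
    print(ex["patch"][:300])               # gold fix, held out for scoring only
    print(ex["test_patch"][:300])          # tests that must pass after the fix

The model is only supposed to see the issue text plus the checkout at base_commit; the gold patch and test patch are used for evaluation.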

[1] https://arxiv.org/html/2310.06770v3

zelphirkalt 4 days ago | parent [-]

This does not reflect a good understanding of how to test machine learning models. If you let the "labels" be visible at test time, your test is worthless. The "real world software development" analogy doesn't hold up either: in real-world development, I don't have magical commits from the future that I merely need to check out to see the solution.
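Concretely, that's why any sensible harness has to reset the clone to the pre-fix commit and strip the later history before the agent runs; otherwise the gold patch is one `git log --all` away. A rough sketch of that step (not the official SWE-bench harness; the function name is made up):

    import subprocess

    def prepare_workspace(repo_dir: str, base_commit: str) -> None:
        """Pin a clone to the pre-fix state so later commits aren't reachable."""
        def run(*cmd):
            subprocess.run(cmd, cwd=repo_dir, check=True)

        def lines(*cmd):
            out = subprocess.run(cmd, cwd=repo_dir, check=True,
                                 capture_output=True, text=True)
            return out.stdout.split()

        # Move HEAD to the snapshot the issue was filed against.
        run("git", "checkout", "--detach", base_commit)
        # Delete local branches whose tips already include the merged fix.
        for b in lines("git", "for-each-ref", "--format=%(refname:short)", "refs/heads"):
            run("git", "branch", "-D", b)
        # Removing the remote also removes its remote-tracking refs.
        run("git", "remote", "remove", "origin")
        # Drop tags that may point past base_commit.
        for t in lines("git", "tag"):
            run("git", "tag", "-d", t)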

sigmoid10 4 days ago | parent [-]

This was never in question, and an actual developer has already explained in another comment why it is not as big an issue as the title might suggest. The question in this chain was about the value of the additional data, which is perfectly clear if you're familiar with this particular benchmark.