zelphirkalt 4 days ago

Can anyone tell me what the difficulty is in simply not having .git at all during a benchmark run? Why not simply remove anything that is not the code the benchmark runs on? Or was it just a simple oversight?

sigmoid10 4 days ago | parent | next [-]

Coding agents are so powerful because they are not just looking at static code. Looking through git histories is a valid method for humans to solve certain kinds of bugs, so it makes sense that models should be able to do that too. And realistically, a lot of modern production code will have git information, so it's not like this wouldn't be a common real-world application.

ActionHank 4 days ago | parent | next [-]

That is a weak argument.

The point is to benchmark against a human solving a problem. Typically these problems are posed as a question or a blank project, without that history.

You are arguing for an apples-to-oranges comparison because the LLM performs better, rather than for a realistic comparison.

sigmoid10 4 days ago | parent [-]

You apparently don't know what SWE-bench is [1]. First of all, it tries to evaluate skills that explicitly go beyond blank project questions with given solutions. Secondly, it does not contain "optimal" or sometimes even correct solutions. That's because it uses real world software development examples from actual PRs in popular repos. These very likely had humans use all the tools at their disposal as well (e.g. web search, git commands, code execution). Assuming an LLM could have solved these just by looking at a piece of code turns out to be very myopic.

[1] https://arxiv.org/html/2310.06770v3

zelphirkalt 4 days ago | parent [-]

This does not show a good understanding of how to test machine learning models. If you let the "labels" be visible at test time, your test is worthless. Your analogy of "real world software development" doesn't hold up. In real-world software development, I don't have magical commits from the future that I merely need to check out and look at to see the solution.

sigmoid10 4 days ago | parent [-]

This was never in question and an actual developer has already answered in another comment why this is not as big of an issue as the title might suggest. The question in this chain was about the value of additional data, which is perfectly clear if you're familiar with this particular benchmark.

diggan 4 days ago | parent | prev | next [-]

I think this issue is specifically about the agents looking at "future repository state" (according to the linked issue at least), so while looking at the history might be a normal method for solving issues, running `git log --all` to take a peek at the future which already includes the fix isn't very typical (yet?).
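
To make the distinction concrete, here is a rough sketch in plain Python (no git involved). As far as I know, `git log <commit>` walks only the ancestors of that commit, while `git log --all` starts from every ref, so it can surface commits made after the task's base commit. The toy history, the commit names and the helper functions below are all hypothetical and only illustrate the idea; this is not how SWE-bench or any harness actually works.

```python
from collections import deque


def ancestors(parents, base):
    """Return base plus every commit reachable from it via parent links."""
    seen = {base}
    queue = deque([base])
    while queue:
        commit = queue.popleft()
        for parent in parents.get(commit, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def future_commits(parents, base):
    """Commits stored in the repository that are not part of base's history."""
    return set(parents) - ancestors(parents, base)


# Hypothetical toy history: a <- b <- c (task base) <- d (the fix, on another ref).
toy_history = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}

# Everything outside the base commit's ancestry is "future state" for the task.
leaked = future_commits(toy_history, "c")
print(sorted(leaked))
```

Under that assumption, a benchmark setup would want `leaked` to be empty before the agent starts, for example by dropping every ref that points past the base commit.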

fp64 4 days ago | parent | prev [-]

Well, there's legacy code and/or horrible git history that also needs fixing at some point. Also I have witnessed how the history can send you down a wrong path. I don't agree that this is a good argument.
