piotrgrabowski 4 days ago

Author here.

So far in this benchmark we based the tasks on a couple of open-source projects (like curl, jq, GNU Coreutils).

Even on those "simple" projects we managed to make the tasks difficult - Claude Opus 4.1 was the only model to correctly cross-compile curl for arm64 (and link it statically) [1].

In the future we'd like to test with projects like FFmpeg or Chromium - those should be much more difficult.

[1] https://www.compilebench.com/curl-ssl-arm64-static/
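For readers wondering how "arm64 and statically linked" can be checked mechanically, here is a minimal sketch. It is an illustration only, not CompileBench's actual checker: the function name `inspect_elf` is hypothetical, it handles only 64-bit little-endian ELF files, and it treats "no PT_INTERP program header" as a heuristic for static linking.

```python
import struct

EM_AARCH64 = 183  # e_machine value for 64-bit ARM in the ELF header
PT_INTERP = 3     # program header type that names the dynamic loader


def inspect_elf(data: bytes) -> dict:
    """Report whether an ELF image targets arm64 and lacks a dynamic loader.

    Hypothetical helper for illustration; heuristic only.
    """
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("sketch only handles 64-bit little-endian ELF")

    # Fixed offsets in the ELF64 header.
    (e_machine,) = struct.unpack_from("<H", data, 18)
    (e_phoff,) = struct.unpack_from("<Q", data, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)

    # p_type is the first 4 bytes of each program header entry.
    p_types = [
        struct.unpack_from("<I", data, e_phoff + i * e_phentsize)[0]
        for i in range(e_phnum)
    ]
    return {
        "arm64": e_machine == EM_AARCH64,
        "static": PT_INTERP not in p_types,
    }
```

Usage would be along the lines of `inspect_elf(open("curl", "rb").read())`, where the path to the built binary is an assumption.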

fuhsnn 4 days ago | parent | next [-]

You didn't make the tasks difficult, you made them easier.

The entire coreutils is reduced to one utility (sha1sum), and the test doesn't even try to feed a real file to it (just a stdin string) [0]. The same goes for the jq task: there isn't even a JSON file fed to it, and what's being verified [1] is barely a calculator.

These projects ship with "make check"; please tell the AI to use it.

[0] https://github.com/QuesmaOrg/CompileBench/blob/86d9aeda88a16...

[1] https://github.com/QuesmaOrg/CompileBench/blob/86d9aeda88a16...
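A minimal sketch of what a file-based check could look like, as opposed to a single stdin string. It assumes Python is available in the test harness; the helper names and the binary path are hypothetical and not part of CompileBench.

```python
import hashlib
import subprocess
from pathlib import Path


def expected_sha1(path: Path) -> str:
    """Reference digest computed independently of the binary under test."""
    return hashlib.sha1(path.read_bytes()).hexdigest()


def parse_sha1sum_output(output: str) -> str:
    """sha1sum prints '<hex digest>  <filename>'; keep only the digest."""
    return output.split()[0].lower()


def check_sha1sum(binary: str, path: Path) -> bool:
    """Run the built sha1sum on a real file and compare with the reference.

    `binary` is an assumed path to the freshly compiled utility.
    """
    result = subprocess.run(
        [binary, str(path)], capture_output=True, text=True, check=True
    )
    return parse_sha1sum_output(result.stdout) == expected_sha1(path)
```

Something like `check_sha1sum("./src/sha1sum", Path("some-large-file.bin"))` would then exercise file I/O as well as the hashing itself. Running the project's own "make check" would still cover far more than this.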

jcranmer 4 days ago | parent | prev | next [-]

A long time ago, I did a project where I downloaded a year's worth of nightly builds for Thunderbird so that I could collect nightly code coverage information. Over the course of doing so, I discovered that there was one dependency (pango, I think?) such that no version could support the entire year's worth of source--the newer version didn't work with the older builds, and the older version didn't work with the newer builds.

Come to think of it, in terms of trying to get old code building, the CVS days of Firefox should be interesting... because the first command in that build step is "download the source code" and that CVS server isn't running anymore. And some of the components are downloaded from a CVS tag rather than trunk, and the converted CVS repositories I'm aware of all only converted the trunk and none of the branches or tags.

OtherShrezzing 4 days ago | parent | prev [-]

For the _reviving 20-year-old code_ type of tasks, are the tested outcomes things we'd expect to be in the public domain? For example, the 'SWEBenchVerified' tests are poisoned in this way, because the LLMs are able to look up bug fixes in the project git repository.

criemen 4 days ago | parent [-]

> because the LLMs are able to look up bug fixes in the project git repository

That's not the (only) problem: even if you take the internet away, we know/assume that all LLMs are heavily trained on public GitHub repositories. Therefore, they know/remember details of the code and its organization in a way they can't for your private code (or for new code written after the knowledge cut-off date).