zaptheimpaler 5 days ago

It's honestly ridiculous that they left git history lying around during a benchmark, and that this benchmark made it to ICLR in Jan 2024 without anyone detecting the issue until now. I don't really trust any benchmarks, tools, or claims from this space when they can make such huge, basic errors.

dolmen 5 days ago | parent | next [-]

The next models will use a zero-day to escape the sandbox and access the answer.

Nijikokun 5 days ago | parent | prev | next [-]

There was a lot of speculation about whether the models would use the git history, or even attempt to, and they noted this months ago. Now they have clear evidence of models doing so. Seems reasonable.

lieret 5 days ago | parent | prev [-]

[On the SWE-bench team] We read and analyzed a lot of trajectories, but it seems that models have only recently started to exploit this, and only in a small fraction of instances. But yes, this clearly shouldn't have happened (and it is now fixed in the new container versions).