| ▲ | ripvanwinkle 8 hours ago | |
>>In our analysis we found that all frontier models we tested were able to reproduce the original, human-written bug fix used as the ground-truth reference, known as the gold patch, or verbatim problem statement specifics for certain tasks, indicating that all of them have seen at least some of the problems and solutions during training this statement alone seems to invalidate the SWE-bench tests | ||