threepts an hour ago

That is why we have SWE-Bench Pro: it tests architecture design too. Turns out 1000 dollars of tokens outperforms 10k dollars of labor in meta design.

SpicyLemonZest an hour ago | parent [-]

That's just not accurate. I haven't studied SWE-Bench Pro in detail, so I can't tell you exactly what the flaw is, but SOTA models routinely make bad architectural choices that I have to intervene to fix.

threepts 17 minutes ago | parent [-]

You can read the paper here: https://labs.scale.com/papers/swe_bench_pro

TL;DR: it's very effective, as it directly tests models on REAL codebases: "The benchmark is constructed from GPL-style copyleft repositories and private proprietary codebases". The use case is very real.