Remix.run Logo
willtemperley 5 hours ago

Yes, how do we know Opus 4.8 hasn't been trained on the SWE-Bench examples?

With a squillion dollars at stake per bench point, someone will have figured out a plausibly deniable way to game these benchmarks.