Yes, how do we know Opus 4.8 hasn't been trained on the SWE-bench examples?
With a squillion dollars at stake per bench point, someone will have figured out a plausibly deniable way to game these benchmarks.