jansan 3 hours ago

So you believe one marketing department more than the other?

NitpickLawyer 3 hours ago | parent [-]

The Brits have a step-based benchmark that they use for this - https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5...

They seem pretty close, in both average and "best run" scores. And, in a highly verifiable domain, "best run" or pass@n is what you're looking for.

aesthesia 2 hours ago | parent [-]

Worth looking at the follow-up post that evaluates the current version of Mythos, which solves one of the main tasks that GPT-5.5-Cyber does not. https://www.aisi.gov.uk/blog/how-fast-is-autonomous-ai-cyber...

827a an hour ago | parent [-]

I believe the correct way to interpret AISI’s findings is that both Mythos and 5.5-Cyber are capable of solving their full benchmark (the only two models that can); Mythos does it with fewer tokens and more consistently.

Two things of note: 5.5-Cyber is likely to be substantially cheaper than Mythos, given it is priced around Opus. Additionally, AISI has never tested OpenAI’s best public model and actual Mythos competitor: 5.5-Pro.