lebovic 2 days ago
I think the third chart is the most notable: Mythos is the first model to saturate that eval from the UK AISI [1]. Personally, I think we crossed the threshold of meaningfully useful capabilities for autonomous hacking with Opus 4.6 [2], mostly because its behaviors and persistence are useful for finding vulnerabilities out of the box [3]. But Mythos still seems like another step up.

[1]: https://cdn.prod.website-files.com/663bd486c5e4c81588db7a48/...
[2]: https://www.noahlebovic.com/testing-an-autonomous-hacker/