| ▲ | METR can barely measure Claude Mythos – 50% task horizon now exceeds 16 hours (hugonomy.com) | |
| 1 points by GlyphWeaver_a 11 hours ago | 2 comments | ||
| ▲ | overthinker_jp 10 hours ago | |
Capability benchmarks may become less meaningful once agents operate across long execution horizons with external tools and permissions. The governance problem starts shifting toward execution boundaries and observability.
| ▲ | GlyphWeaver_a 11 hours ago | |
[dead] | ||