Remix.run Logo
METR can barely measure Claude Mythos – 50% task horizon now exceeds 16 hours(hugonomy.com)
1 points by GlyphWeaver_a 11 hours ago | 2 comments
overthinker_jp 10 hours ago | parent | next [-]

Capability benchmarks may become less meaningful once agents operate across long execution horizons with external tools and permissions. The governance problem starts shifting toward execution boundaries and observability.

GlyphWeaver_a 11 hours ago | parent | prev [-]

[dead]