sigmar 2 hours ago
The system card is 319 pages. At what point do we call it a "book" instead of a "card"? There's a quote from a METR report on page 52:

> We ran [Mythos 5] on 38 of our hardest software tasks, including tasks centered around R&D. [Mythos 5] generally outperformed an early checkpoint of Claude Mythos Preview in these, including by succeeding on some tasks that had not been solved by any public model we have previously evaluated. However, we still observed the model occasionally failing to correctly interpret nuanced instructions in difficult tasks... Based on the available evidence, we believe [Mythos 5] is likely unable to fully and reliably automate R&D for frontier projects spanning multiple weeks. We believe that a better, more confident assessment would require more time, evaluations, and information from the model developer.
baq 2 hours ago
> we believe [Mythos 5] is likely unable to fully and reliably automate R&D for frontier projects spanning multiple weeks

this is good news, right? right...?
romanovcode an hour ago
But did it mention the developer in the park eating the sandwich? That is the most important question!