jasondigitized 3 hours ago
A single 8h task? I'm sorry, but that's just asking for trouble.
queuebert 2 hours ago
I don't understand how some of y'all use these things. I get garbage unless I give them very specific concrete tasks with as much context as possible. Anything that takes more than 30 min is usually a waste because the scope was too large.

maxall4 2 hours ago
Indeed, according to METR, Mythos only achieved an 80% success rate on 3-hour tasks. https://metr.org/time-horizons/

int_19h 19 minutes ago
My record for a single uninterrupted session (albeit with Codex, not Claude) is 80+ hours. It was very productive, too. The trick is having large, extensive test suites and forcing the agent to run them regularly.
yalok an hour ago
If there are specific tests/evals to satisfy that an agent can run by itself, it can easily iterate for hours. And that time also includes running those tests/evals, which may not be small.
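For anyone who wants to picture the loop being described, here is a rough sketch, not any particular agent's actual implementation. The names `iterate_until_green`, `run_tests`, and `apply_change` are hypothetical: `run_tests` stands in for whatever runs the suite or evals, and `apply_change` stands in for the agent making its next edit based on the failure report. The round limit and time budget are assumptions too.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence, Tuple


def run_test_command(cmd: Sequence[str]) -> Tuple[bool, str]:
    """Run a test command; return (passed, combined stdout and stderr)."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def iterate_until_green(
    run_tests: Callable[[], Tuple[bool, str]],
    apply_change: Callable[[str], None],
    max_rounds: int = 50,
    budget_seconds: float = 8 * 3600,
) -> Optional[int]:
    """Alternate between running the tests and applying a change.

    Returns the round number on which the tests passed, or None if the
    round limit or the time budget ran out first. Both callables are
    hypothetical stand-ins supplied by the caller.
    """
    deadline = time.monotonic() + budget_seconds
    for round_no in range(1, max_rounds + 1):
        # Test time counts against the budget, as noted above.
        passed, report = run_tests()
        if passed:
            return round_no
        if round_no == max_rounds or time.monotonic() >= deadline:
            break
        # Hand the failure report to the agent for its next attempt.
        apply_change(report)
    return None
```

In practice `run_tests` could be something like `lambda: run_test_command(["pytest", "-q"])`, assuming the project uses pytest. The point is only that the stop condition is checked by the test suite, not by the agent's own judgment.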