jasondigitized 3 hours ago

A single 8h task? I'm sorry, but that's just asking for trouble.

queuebert 2 hours ago | parent | next [-]

I don't understand how some of y'all use these things. I get garbage unless I give them very specific concrete tasks with as much context as possible. Anything that takes more than 30 min is usually a waste because the scope was too large.

standardUser an hour ago | parent [-]

You have to build up a context, or otherwise seed the memory, to get anything useful out of these LLMs on a large or existing project.

maxall4 2 hours ago | parent | prev | next [-]

Indeed, according to METR, Mythos only achieved an 80% success rate on 3-hour tasks. https://metr.org/time-horizons/

jwood27 2 hours ago | parent [-]

Those are tasks that would take a human 3 hours to complete, not tasks that the model works on for 3 hours.

jadar 38 minutes ago | parent [-]

That’s even smaller then!

int_19h 19 minutes ago | parent | prev | next [-]

My record for a single uninterrupted session (albeit with Codex, not Claude) is 80+ hours. It was very productive, too.

The trick is having large, extensive test suites and forcing the agent to run them regularly.

yalok an hour ago | parent | prev | next [-]

If there are specific tests/evals to satisfy that an agent can run by itself, it can easily iterate for hours. That time also includes running those tests/evals, which may not be quick.
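
A rough sketch of that kind of loop, in case it helps. Everything here is hypothetical: `agent_step` stands in for whatever call asks your agent to make another attempt, and `test_cmd` is whatever command runs your tests/evals. It is not any particular tool's API, just the shape of the idea: attempt, run the suite, feed the output back, stop when green or out of budget.

```python
import subprocess
import sys
import time
from typing import Callable, Optional, Sequence


def run_tests(cmd: Sequence[str]) -> tuple[bool, str]:
    """Run the test command; return (passed, combined stdout and stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def iterate_until_green(
    agent_step: Callable[[str], None],  # hypothetical: one agent attempt, given test output
    test_cmd: Sequence[str],            # assumption: exit code 0 means all tests passed
    max_rounds: int = 50,
    budget_seconds: float = 3600.0,
) -> Optional[int]:
    """Alternate agent attempts and test runs.

    Returns the round number on which the tests first passed,
    or None if the round limit or time budget ran out first.
    """
    start = time.monotonic()
    feedback = ""  # the first attempt gets no test output
    for round_no in range(1, max_rounds + 1):
        if time.monotonic() - start > budget_seconds:
            break
        agent_step(feedback)
        # The test run counts against the same clock as the agent's work.
        passed, feedback = run_tests(test_cmd)
        if passed:
            return round_no
    return None
```

Usage would be something like `iterate_until_green(my_agent_step, [sys.executable, "-m", "pytest", "-q"])`, assuming pytest is what you run. Note that the wall-clock budget covers the test runs too, so a slow suite eats into the number of attempts you get.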
