Maybe I’m naive, but the longest single workflow I've run was about 15 minutes. How do you steer agents to run “overnight”? And what is the quality of such execution?

dregitsky 9 minutes ago

To add to what @nab said, the longest ("overnight") runs usually come after going back and forth to build out a big multi-phase plan doc -- especially when each phase has an extensive manual test plan (agent runs the app in a browser, clicks through the workflow, watches logs, confirms behavior, etc.).

These can go for many hours because of all the manual testing and debugging. Quality really depends on how much you spec things out beforehand, and how you define the test plan / "success" gates. If the agent can't even run the app to test it, then things can definitely go off the rails!
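
A rough sketch of what a "success" gate per phase can look like, not any particular tool's API. `Phase`, `run_plan`, and `attempt_fix` are made-up names for illustration, and the gate commands stand in for whatever actually runs your test plan:

```python
import subprocess
import sys
from typing import Callable, Dict, List, NamedTuple, Optional


class Phase(NamedTuple):
    # Hypothetical structure: one entry per phase of the plan doc.
    name: str
    gate_cmd: List[str]  # command that must exit 0 for the phase to count as done


def run_plan(
    phases: List[Phase],
    max_attempts: int = 3,
    attempt_fix: Optional[Callable[[Phase, str], None]] = None,
) -> Dict[str, bool]:
    """Run each phase's gate in order and stop at the first phase that never passes."""
    results: Dict[str, bool] = {}
    for phase in phases:
        passed = False
        for attempt in range(1, max_attempts + 1):
            proc = subprocess.run(phase.gate_cmd, capture_output=True, text=True)
            if proc.returncode == 0:
                passed = True
                break
            # Assumption: this hook is where the agent would get the logs and retry.
            # It is skipped after the final attempt, since no rerun follows.
            if attempt_fix is not None and attempt < max_attempts:
                attempt_fix(phase, proc.stdout + proc.stderr)
        results[phase.name] = passed
        if not passed:
            break  # later phases are not started on top of a failing one
    return results


if __name__ == "__main__":
    # Stand-in gates: the first exits 0, the second always exits 1.
    plan = [
        Phase("phase-1", [sys.executable, "-c", "pass"]),
        Phase("phase-2", [sys.executable, "-c", "raise SystemExit(1)"]),
        Phase("phase-3", [sys.executable, "-c", "pass"]),
    ]
    print(run_plan(plan))
```

The point is just that a later phase never starts until the earlier gate passes, which is what keeps a long run from piling work on top of something broken.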

notrealyme123 2 hours ago

Usually coding tasks where the closed-loop evaluation takes time.

E.g., code debugging.

	nab 2 hours ago
		This. Very few people are doing this right now (probably because it sucks having 5 copies of your app running in parallel on your laptop), but in the past few months models have gotten really good at testing your running app live. If you have an environment where you can run your full app and models can get at it via Playwright and Chromium, they can click around, take actions, and actually verify that their code works. With boxes.dev I've started pushing agents harder to run the full app, test their work end to end, and send me screenshots as proof. This takes time, sometimes up to 30-40 minutes, but the result is much more likely to be bug-free at the end of the day.

ai_slop_hater 42 minutes ago

I think they are just bullshitting.

FergusArgyll an hour ago

In Codex, if you use /goal it can go for a while. I've never seen overnight, but > 1 hr is common.

smrtinsert 14 minutes ago

"build me a 10 million dollar MRR saas, make no mistakes"