nyellin a day ago

There's a bit more to it!

For example, the agent in the post will demonstrate 'early stopping', where it finishes before the task is really done. You'd think reasoning models would solve this, but in practice even SOTA models still do it.

To fix 'early stopping' you need extra features in the agent harness. Claude Code does this with TODOs that are injected back into every prompt to remind the LLM which tasks remain open. (If you're curious, somewhere in the public repo for HolmesGPT we have benchmarks with all the experiments we ran to solve this - from hypothesis tracking to other exotic approaches - but TODOs always performed best.)
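
Roughly, the idea looks something like this (just a sketch, not Claude Code's or HolmesGPT's actual code; call_model is a hypothetical stand-in for whatever LLM API you use):

    # Hypothetical sketch of TODO re-injection; call_model is a stand-in.
    todos = ["reproduce the bug", "write a failing test", "fix and re-run tests"]
    history = [{"role": "system", "content": "You are a coding agent."}]

    while todos:
        reminder = ("Open TODOs, keep going until none remain:\n"
                    + "\n".join(f"- {t}" for t in todos))
        # The reminder is appended to *every* turn, so the model can't
        # quietly forget that work is still outstanding.
        reply = call_model(history + [{"role": "user", "content": reminder}])
        history.append({"role": "assistant", "content": reply})
        # Ask the model to mark finished items, e.g. by emitting "DONE: <item>".
        finished = {line[len("DONE:"):].strip() for line in reply.splitlines()
                    if line.startswith("DONE:")}
        todos = [t for t in todos if t not in finished]

The point is just that the open list rides along on every single turn, so the model can't quietly drift into "looks done to me".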

Still, good article. Agents really are just tools in a loop. It's not rocket science.

d4rkp4ttern a day ago | parent | next [-]

Yes, this “premature termination” becomes particularly evident when you swap Opus/Sonnet out for a weaker LLM, and it also happens more often in Codex CLI with GPT-5.

Since one of the replies asked for an example: the agent works for a bit and then just stops. We’ve all seen cases where it says “ok, let me read blah.py to understand the context better” and then nothing happens; it has essentially forgotten to make the tool call for its next read or edit.

rtgfhyuj a day ago | parent | prev [-]

why would it early stop? examples?

mickeyp 21 hours ago | parent | next [-]

Models just naturally arrive at the conclusion that they're done. TODO hints can help, but they're not infallible: Claude will stop and happily report there's more work to be done and "you just say the word, Mister, and I'll continue". It's an RL problem where you have to balance the risk of an infinite loop (the model keeps thinking there's a little more to do when there isn't) against the opposite failure, stopping short of actual completion.

wxce 11 hours ago | parent [-]

> It's an RL problem where you have to balance the risk of an infinite loop (the model keeps thinking there's a little more to do when there isn't) against the opposite failure, stopping short of actual completion.

Any idea why the other end of the spectrum happens, where the model thinks it always has something left to do?

My pet theory on the early stopping (which could be extremely wrong): positive tool responses and the like bias it towards thinking the task is complete.

yencabulator 6 hours ago | parent [-]

> Any idea why the other end of the spectrum happens, where the model thinks it always has something left to do?

Who said anything about "thinking"? Smaller models were notorious for getting stuck repeating a single word over and over, or just "eeeeeee" forever. Larger models only change probabilities, not the fundamental nature of the machine.

embedding-shape a day ago | parent | prev [-]

Not all models are trained for long, one-shot task-following on their own; many of them seem to prefer closer back-and-forth with the user. You can always add another layer/abstraction above or below to work around it.
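
For example, a layer above might look roughly like this (sketch only; run_agent and call_model are hypothetical stand-ins for your inner agent and your model call):

    # Hypothetical outer layer: when the inner agent stops, check whether the
    # task is actually finished and nudge it to continue if not.
    def supervised_run(task, max_rounds=5):
        transcript = run_agent(task)   # inner agent runs until it decides to stop
        for _ in range(max_rounds):
            verdict = call_model([{
                "role": "user",
                "content": f"Task: {task}\nTranscript:\n{transcript}\n"
                           "Is the task fully complete? Answer YES or NO."}])
            if verdict.strip().upper().startswith("YES"):
                break
            # The agent stopped early: resume it with an explicit reminder.
            transcript += run_agent(task, resume=transcript,
                                    hint="The task is not finished. Continue.")
        return transcript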

fastball a day ago | parent [-]

Can't this just be a Ralph Wiggum loop (i.e. while True)?

embedding-shape 18 hours ago | parent [-]

Sure, but I think just about everyone wants the agent to eventually say "done" in one way or another.
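
Something like the below, say, where the loop is nominally unbounded but the agent has an explicit way to signal completion (sketch; agent_turn is a hypothetical stand-in for one agent iteration):

    # "Ralph Wiggum" loop with an escape hatch: keep prompting the agent,
    # but give it an explicit way to say "done" so the loop can end.
    # (In practice you'd also want a hard turn cap as a safety net.)
    while True:
        reply = agent_turn("Keep working on the task. "
                           "Say exactly TASK_COMPLETE when nothing is left to do.")
        if "TASK_COMPLETE" in reply:
            break    # the agent finally said it's done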