Agents push the full conversation history into context every turn

Why?

Maybe this is a dumb question, but why wouldn't an agent "keep the conversation going", like I do when interacting with an LLM through a web page? (I understand how it's impractical for long-running tasks where the agent has to wait days for the next input, but assume that's not the majority of use cases)

▲

sosodev 2 hours ago | parent | next [-]

I’m not sure I understand your question. Every interaction you have with a model in a web page does the same thing in the backend. It feeds the whole conversation history, perhaps with a bit of processing, into the model so it can process the next generation. Filling the context window is how these models retain coherence.

▲

eknkc an hour ago | parent | prev | next [-]

BTW, the openai responses api has a store parameter and a thread id input. Makes it possible to send a thread id and append a new message, ask for completion. So it feels like keeping the conversation going.

Technically it does retrieve the entire history and reevaulate it since the LLM is stateless. Just more ergonomic for the developer.

And prompt caching helps cut the costs down when a conversation drags on.

▲

isbvhodnvemrwvn 2 hours ago | parent | prev | next [-]

LLMs are stateless, to predict next tokens they need the history. When you write your own agents you will be very selective and might trim context and heavily segment different tasks, but generic ones don't do that (at best they spawn subjects to handle smaller tasks)

	▲	lxgr 5 minutes ago \| parent [-]
		That said, the KV cache is very much not stateless, so internally inference APIs will be highly incentivized to route requests to instances with as much a shared prefix cached as possible.

▲

BoredPositron 2 hours ago | parent | prev [-]

The "web page" does the same you just don't see it.