Remix.run Logo
isbvhodnvemrwvn 2 hours ago

LLMs are stateless, to predict next tokens they need the history. When you write your own agents you will be very selective and might trim context and heavily segment different tasks, but generic ones don't do that (at best they spawn subjects to handle smaller tasks)

lxgr 6 minutes ago | parent [-]

That said, the KV cache is very much not stateless, so internally inference APIs will be highly incentivized to route requests to instances with as much a shared prefix cached as possible.