edg5000 (4 hours ago):
There is nothing wrong with the HTTP layer; it's just a way to get a string into the model. The problem is the industry's obsession with concatenating messages into a conversation stream. There is no reason to do it this way: every time you run inference on the model, the client gets to compose the context however it wants, and there are more options than just concatenating prompts and LLM outputs. (A drawback is that caching won't help much if most of the context window is composed dynamically.)

Coding CLIs, like web chat, work well because the agent can pull information into the session at will (read a file, do a web search). The pain point is that if you're appending messages to a stream, you're just slowly filling up the context. The fix is to keep the message-stream concept for informal communication with the prompter, but add an external, persistent message system that the agent can interact with (a bit like email). The agent can decide which messages it wants to pull into the context, and which ones are no longer relevant. The key is to give the agent not just the ability to pull things into context, but also to remove them from it. That gives you the eternal context needed for permanent, daemonized agents.
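A minimal sketch of the email-like store described above, with everything (class names, `pull`/`drop` tool names) invented for illustration: the store persists outside the context window, and each inference call composes its context from scratch out of whatever the agent has currently pulled in.

```python
import itertools

class MessageStore:
    """External, persistent message store. The agent pulls messages into
    its working context and drops them when no longer relevant; the full
    store is never sent to the model wholesale."""

    def __init__(self):
        self._messages = {}           # id -> text, persists across turns
        self._ids = itertools.count(1)
        self.in_context = set()       # ids currently pulled into context

    def deliver(self, text):
        """A new message arrives (user note, file summary, tool result)."""
        mid = next(self._ids)
        self._messages[mid] = text
        return mid

    def pull(self, mid):
        """Agent tool: bring a stored message into the working context."""
        self.in_context.add(mid)

    def drop(self, mid):
        """Agent tool: remove a message from context (it stays in the store)."""
        self.in_context.discard(mid)

    def compose_context(self, system_prompt):
        """Build this call's prompt from scratch instead of appending to
        an ever-growing stream."""
        parts = [system_prompt]
        parts += [self._messages[m] for m in sorted(self.in_context)]
        return "\n\n".join(parts)

store = MessageStore()
spec = store.deliver("Spec: the parser must accept trailing commas.")
note = store.deliver("Old note: parser rewrite postponed.")
store.pull(spec)
store.pull(note)
store.drop(note)   # no longer relevant; the context shrinks
```

Note the trade-off the comment already flags: because `compose_context` rebuilds the prompt each turn, prefix-based KV caching gets little to reuse.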
vanviegen (2 hours ago):
I've been working on a coding agent that does this, on and off, for about a year. Here's my latest attempt: https://github.com/vanviegen/maca#maca - this one allows agents to request (and later drop) 'views' on functions and other logical pieces of code, and always see the latest version of them (with some heuristics to avoid destroying kv-caches at every turn). The problem is that the models are not trained for this, nor for any other non-standard agentic approach. It's like fighting their 'instincts' at every step, and the results I've been getting were not great.
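One way to read the 'views' idea is as a function that re-extracts a named piece of code from disk on every turn, so the agent always sees the current version rather than a stale copy frozen into the transcript. This is a guess at the mechanism, not maca's actual implementation (its kv-cache heuristics in particular are not modeled):

```python
import ast
import pathlib
import tempfile

def function_view(path, name):
    """Hypothetical 'view': return the *current* source of one
    top-level function, re-read from disk each time it is rendered."""
    source = pathlib.Path(path).read_text()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            return ast.get_source_segment(source, node)
    return None

# Demo: the view tracks edits to the file automatically.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("def greet():\n    return 'hi'\n")
    path = f.name

v1 = function_view(path, "greet")                              # original body
pathlib.Path(path).write_text("def greet():\n    return 'hello'\n")
v2 = function_view(path, "greet")                              # reflects the edit
```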
zknill (3 hours ago):
> "and which ones are no longer relevant."

This is absolutely the hardest bit. I guess the shortcut is to include all the chat history, so that if it contains "do X" followed by "no, actually do Y instead", the LLM can figure that out. But isn't it fairly tricky for the agent harness to figure that out itself: to work out relevancy and decide what context to keep? Perhaps this is why the industry defaults to concatenating messages into a conversation stream.
alehlopeh (an hour ago):
As you briefly noted, a big drawback is not getting to take advantage of the cache. That seems like a pretty big one.
sourcecodeplz (3 hours ago):
Yeah, opencode was/is like this, and they never got caching right. Caching is a BIG DEAL to get right.
zahlman (an hour ago):
> the industry obsession

Or maybe they haven't thought about it? Or they tried some simple alternatives and didn't find clear benefits?

> The key is to give the agent not just the ability to pull things into context, but also remove from it.

But then you need rules to figure out what to remove, which probably means feeding the whole thing to a(nother?) model anyway to make that fuzzy heuristic judgment of what's important and what's a distraction. And simply removing messages doesn't add any structure; you still just have a sequence of whatever remains.
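The pruning step being debated can be sketched as a scoring pass: some judge rates each message against the current task and only the top-k survive, in their original order (which is indeed all the structure a flat list gives you). Here a trivial word-overlap heuristic stands in for the extra model call the comment suspects you'd need; the function and its signature are illustrative, not any harness's real API.

```python
def prune_context(messages, task, score=None, keep=2):
    """Keep only the `keep` messages most relevant to `task`,
    preserving their original order. `score` defaults to a crude
    word-overlap heuristic standing in for a judge-model call."""
    if score is None:
        def score(msg):
            return len(set(msg.lower().split()) & set(task.lower().split()))
    ranked = sorted(range(len(messages)),
                    key=lambda i: score(messages[i]), reverse=True)
    survivors = set(ranked[:keep])
    return [m for i, m in enumerate(messages) if i in survivors]

msgs = [
    "Refactor the parser module",
    "Lunch is at noon",
    "Parser tests are failing on trailing commas",
]
kept = prune_context(msgs, "fix the parser tests", keep=2)
```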
raincole (2 hours ago):
> The key is to give the agent not just the ability to pull things into context, but also remove from it

Of course Anthropic/OpenAI can do it. And the next day everyone will be complaining about how much Claude/Codex has been dumbed down: "they don't even comply with the context anymore!"
asixicle (3 hours ago):
To be utterly shameless, this is what I've been building: https://github.com/ASIXicle/persMEM

Three persistent Claude instances share AMQ, with an additional memory index they query via an embedding model (which I'm literally upgrading to Voyage 4 nano as I type). It's working well so far: I have an instance, Wren, "alive" and functioning very well for 12 days and counting, swapping things in and out of context via the MCP without relying on any of Anthropic's tools. And it runs on a cheap LXC with 8 GB of RAM on an N97.
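The embedding-queried memory index amounts to nearest-neighbour retrieval over stored (embedding, text) pairs. A toy version, with everything invented for illustration (persMEM's actual schema, AMQ wiring, and Voyage embeddings are not reproduced; `toy_embed` is a stand-in for a real embedding model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

class MemoryIndex:
    """Toy memory index: store (embedding, text) pairs and return the
    memories closest to a query embedding."""
    def __init__(self, embed):
        self.embed = embed
        self.items = []          # list of (embedding, text)
    def add(self, text):
        self.items.append((self.embed(text), text))
    def query(self, text, k=1):
        q = self.embed(text)
        ranked = sorted(self.items,
                        key=lambda item: cosine(q, item[0]), reverse=True)
        return [t for _, t in ranked[:k]]

# Stand-in embedding: word counts over a tiny fixed vocabulary.
VOCAB = ["deploy", "parser", "memory", "cache"]
def toy_embed(text):
    words = text.lower().split()
    return [words.count(w) + 1e-6 for w in VOCAB]  # epsilon avoids zero vectors

index = MemoryIndex(toy_embed)
index.add("parser accepts trailing commas now")
index.add("cache invalidation still flaky")
```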
ElFitz (4 hours ago):
Hmm. Maybe there’s a way to play around with this idea in pi. I’ll dig into it.