▲ storus 8 hours ago
Why not replace the context tokens on the GPU during inference once they're no longer relevant? E.g., a tool reads a 50k-token document, the LLM processes it, and then you flush those document tokens out of the active context, rebuild the QKV caches, and keep just a short log entry in the context: "I already did this ... with this result."
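The compaction step described above can be sketched at the message level (all names here are hypothetical, not any real inference API; the KV-cache rebuild is implied because everything after the edited message would need to be re-encoded):

```python
# Sketch: replace bulky tool results in the message list with a short summary
# entry, so the prefill after this point only pays for a one-line log entry
# instead of the full document.

def compact_context(messages, summarize, threshold=1000):
    """Return a copy of `messages` with large tool results summarized."""
    compacted = []
    for msg in messages:
        if msg["role"] == "tool" and len(msg["content"]) > threshold:
            compacted.append({
                "role": "tool",
                "content": f"[compacted] {summarize(msg['content'])}",
            })
        else:
            compacted.append(msg)
    return compacted

messages = [
    {"role": "user", "content": "Summarize report.pdf"},
    {"role": "tool", "content": "x" * 50_000},  # stand-in for a 50k-token doc
    {"role": "assistant", "content": "Done."},
]
small = compact_context(messages, lambda t: f"read {len(t)} chars, summarized")
```

In a real serving stack the cost of this is the cache rebuild: the KV entries for every token after the replaced span are invalid and must be recomputed on the next forward pass.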
▲ killerstorm 7 hours ago
Anthropic added features like this in the 4.5 release: https://claude.com/blog/context-management

> Context editing automatically clears stale tool calls and results from within the context window when approaching token limits.

> The memory tool enables Claude to store and consult information outside the context window through a file-based system.

But it looks like nobody has it as part of the inference loop yet: I guess it's hard to train (i.e. you need a training set that's a good match for how people use context in practice) and it makes inference more complicated. Higher-level context management is just easier to implement, and it's one of the things "GPT wrapper" companies can do, so why bother?
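A file-based memory tool like the one quoted above can be sketched in a few lines (the `FileMemory` class and its methods are invented for illustration, not the actual Anthropic API): the model stores notes in a file outside the context window and reads them back on demand.

```python
# Sketch of a file-based memory tool: store/recall key-value notes on disk,
# outside the model's context window. Names are hypothetical.
import json
from pathlib import Path

class FileMemory:
    def __init__(self, path="agent_memory.json"):
        self.path = Path(path)

    def _load(self):
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def store(self, key, value):
        data = self._load()
        data[key] = value
        self.path.write_text(json.dumps(data))

    def recall(self, key):
        return self._load().get(key)

mem = FileMemory("/tmp/agent_memory_demo.json")
mem.store("report.pdf", "Q3 revenue up 12%; see section 4 for risks")
```

The point is that only the short recalled note re-enters the context, not the original document.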
▲ zozbot234 8 hours ago
This is what agent calls do under the hood, yes.