| tontinton 2 hours ago | |||||||
Is it similar to rtk, where the output of tool calls is compressed? Or does it actively compress your history once in a while? If it's the latter, then users will pay uncached rates for every token in the history from the point of the change onward: https://platform.claude.com/docs/en/build-with-claude/prompt... How is this better? | ||||||||
| BloondAndDoom an hour ago | |||||||
This is a bit more akin to distill (https://github.com/samuelfaj/distill). The advantage of putting a small language model (SLM) in between is that some outputs cannot be compressed naively without losing context, so a small model does that job. It works, but most of these solutions still involve some tradeoff in real-world applications. | ||||||||
| thebeas an hour ago | |||||||
We do both. We compress tool outputs at each step, so the cache isn't broken during the run. Once context usage reaches 85% of the window, we preemptively trigger a summarization step and load that summary when the context window fills up. | ||||||||
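
For readers trying to picture the two-stage approach described above, here is a rough Python sketch. It is not the actual implementation: `AgentContext`, `compress`, `summarize`, and the word-count tokenizer are all hypothetical stand-ins, and only the 85% threshold comes from the comment. It assumes the prompt cache is prefix-based, so appending compressed outputs keeps earlier messages cacheable, while swapping in the summary is the one step that rewrites the prefix.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


@dataclass
class AgentContext:
    window: int                             # context-window size, in tokens
    compress: Callable[[str], str]          # hypothetical tool-output compressor
    summarize: Callable[[List[str]], str]   # hypothetical history summarizer
    threshold: float = 0.85                 # fraction of the window that triggers summarization
    messages: List[str] = field(default_factory=list)
    summary: Optional[str] = None           # summary prepared ahead of time
    summarized_upto: int = 0                # number of messages the summary covers

    def used(self) -> int:
        return sum(count_tokens(m) for m in self.messages)

    def add_tool_output(self, output: str) -> None:
        # Compress before appending: earlier messages are never rewritten,
        # so a prefix-based prompt cache stays valid for them.
        self.messages.append(self.compress(output))

        if self.summary is None and self.used() >= self.threshold * self.window:
            # Prepare the summary early, but do not swap it in yet.
            self.summary = self.summarize(self.messages)
            self.summarized_upto = len(self.messages)

        if self.summary is not None and self.used() >= self.window:
            # Window is full: replace the summarized messages with the summary.
            # This is the one point where the cached prefix is invalidated.
            self.messages = [self.summary] + self.messages[self.summarized_upto:]
            self.summary = None
            self.summarized_upto = 0
```

In this sketch, messages that arrive between the 85% mark and the swap are kept verbatim after the summary, which is one plausible reading of "load that when the context window fills up"; a real system might instead re-summarize at swap time.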