| ▲ | GodelNumbering 4 hours ago |
| Another neat thing is, they publish hourly caching states for ALL model/provider combinations. I did some research on it to come up with a provider tiers list and found a bunch of open-source 3rd party hosts are simply trash tier https://dirac.run/posts/cache-hit-rates-agents |
|
| ▲ | kflansburg 2 hours ago | parent | next [-] |
| I would recommend tracking this data over time. I work on Cloudflare's KV cache for Kimi K2.6, and while there are periods where our cache rate is low, we are frequently in the 80-90% range. OpenRouter shows us at 87.3% at the time of this post. We observe cache rates change quite a bit from hour to hour. |
| |
| ▲ | GodelNumbering an hour ago | parent [-] | | True for Kimi, but the results I published are average across the models (CF has over 10 models on openrouter). Your current Kimi K2.6 is over 80% but Gemma 4 26B A4B is 0%. https://openrouter.ai/google/gemma-4-26b-a4b-it This is also the reason providers like Anthropic scored lower because while Opus 4.7 is close to 90%, Opus 4.5 is 45% |
|
|
| ▲ | rkagerer 2 hours ago | parent | prev | next [-] |
| Agents push the full conversation history into context every turn Why? Maybe this is a dumb question, but why wouldn't an agent "keep the conversation going", like I do when interacting with an LLM through a web page? (I understand how it's impractical for long-running tasks where the agent has to wait days for the next input, but assume that's not the majority of use cases) |
| |
| ▲ | sosodev 2 hours ago | parent | next [-] | | I’m not sure I understand your question. Every interaction you have with a model in a web page does the same thing in the backend. It feeds the whole conversation history, perhaps with a bit of processing, into the model so it can process the next generation. Filling the context window is how these models retain coherence. | |
| ▲ | eknkc an hour ago | parent | prev | next [-] | | BTW, the openai responses api has a store parameter and a thread id input. Makes it possible to send a thread id and append a new message, ask for completion. So it feels like keeping the conversation going. Technically it does retrieve the entire history and reevaulate it since the LLM is stateless. Just more ergonomic for the developer. And prompt caching helps cut the costs down when a conversation drags on. | |
| ▲ | isbvhodnvemrwvn 2 hours ago | parent | prev | next [-] | | LLMs are stateless, to predict next tokens they need the history. When you write your own agents you will be very selective and might trim context and heavily segment different tasks, but generic ones don't do that (at best they spawn subjects to handle smaller tasks) | | |
| ▲ | lxgr 5 minutes ago | parent [-] | | That said, the KV cache is very much not stateless, so internally inference APIs will be highly incentivized to route requests to instances with as much a shared prefix cached as possible. |
| |
| ▲ | BoredPositron 2 hours ago | parent | prev [-] | | The "web page" does the same you just don't see it. |
|
|
| ▲ | gnulinux 4 hours ago | parent | prev [-] |
| Thank you so much for this! I've been working on exactly this problem this week (which OpenRouter providers have the highest cache rate on average) because cache cost is sometimes half your cost: I'd much rather use a provider with more input caching with a more expensive/better LLM. Your results and lists seem more comprehensive than what I've done so far. Very helpful! |