ohcmon 7 hours ago

Boris, wait, wait, wait,

Why not use a tiered cache?

Obviously storage is way cheaper than recomputing the KV cache all the way from the very beginning of the session.

No matter how you put this explanation, it still sounds strange. Hell, you can even store the cache on the client if you must.

Please, tell me I'm misunderstanding what is going on...

otherwise you really need to hire someone to look at this!)

krackers 6 hours ago | parent | next [-]

Same question I had in https://news.ycombinator.com/item?id=47819914

I still don't understand it. Yes, it's a lot of data, and presumably they're already shunting it to CPU RAM instead of keeping it in precious VRAM, but they could go further and put it on SSD, at which point it's no longer in the hot path for their inference.

solarkraft 7 hours ago | parent | prev | next [-]

I assume they are already storing the cache on flash storage instead of keeping it all in VRAM. KV caches are huge; that's why it's impractical to transfer them to/from the client. Shipping the cache out would also let someone figure out a lot about the underlying model, though I guess you could encrypt it.

What would be an interesting option would be to let the user pay more for longer caching, but if the base length is 1 hour I assume that would become expensive very quickly.

tonyarkles 7 hours ago | parent | next [-]

Just to contextualize this... https://lmcache.ai/kv_cache_calculator.html. They only have smaller open models, but for Qwen3-32B with 50k tokens it comes up with 7.62 GB for the KV cache. Imagining a 900k-token session with, say, Opus, I think it'd be pretty unreasonable to flush that to the client after it's been idle for an hour.
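For anyone who wants the back-of-envelope version of that calculator: per token, a KV cache holds one K and one V vector per layer per KV head. The Qwen3-32B-style numbers below (64 layers, 8 KV heads under GQA, head dim 128, fp16) are my assumptions, and the calculator linked above evidently uses a somewhat different config or dtype, so the exact figure differs; either way you land in the same multi-GB ballpark.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: 2 tensors (K and V), one per layer
    per KV head, each head_dim elements of dtype_bytes each."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Assumed Qwen3-32B-like config: 64 layers, 8 KV heads (GQA), head_dim 128, fp16
per_token = kv_bytes_per_token(layers=64, kv_heads=8, head_dim=128)
total_gib = per_token * 50_000 / 2**30
print(f"{per_token} bytes/token, ~{total_gib:.1f} GiB at 50k tokens")
```

The point stands regardless of the exact config: at hundreds of KB per token, a near-million-token session is a multi-hundred-GB blob, which is why "just send it to the client" doesn't fly.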

2001zhaozhao 4 hours ago | parent | prev | next [-]

I wonder whether prompt caches would be the perfect use case for something like Optane.

The data is kept long enough that it's expensive to hold in RAM, but short enough that the writes are frequent and would wear down SSD storage.

ohcmon 7 hours ago | parent | prev [-]

Yes, encryption is the solution for client-side caching.

But even if it's not, I can't build a scenario in my head where recalculating it on real GPUs is cheaper or faster than retrieving it from some kind of slower cache tier.
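The "slower cache tier" idea is easy to sketch: a hot in-memory tier backed by a warm on-disk tier, with a full recompute only on a miss in both. This is a toy illustration of the pattern, not how any provider actually runs it; the tier layout, the missing eviction logic, and the `compute` stand-in for GPU prefill are all simplifying assumptions.

```python
import hashlib
import os

class TieredCache:
    """Toy two-tier cache: hot tier = in-memory dict, warm tier = files on
    disk. A real KV-cache tier would also need eviction and TTLs."""

    def __init__(self, warm_dir: str):
        self.hot: dict[str, bytes] = {}
        self.warm_dir = warm_dir

    def _path(self, key: str) -> str:
        return os.path.join(self.warm_dir,
                            hashlib.sha256(key.encode()).hexdigest())

    def get(self, key: str, compute) -> bytes:
        if key in self.hot:                  # fastest: RAM hit
            return self.hot[key]
        path = self._path(key)
        if os.path.exists(path):             # slower: disk hit, promote to RAM
            with open(path, "rb") as f:
                value = f.read()
            self.hot[key] = value
            return value
        value = compute(key)                 # slowest: recompute from scratch
        self.hot[key] = value                # then populate both tiers
        with open(path, "wb") as f:
            f.write(value)
        return value
```

Even a disk hit here is orders of magnitude cheaper than a fresh prefill over the whole context, which is exactly the commenter's point.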

rkuska 7 hours ago | parent | prev [-]

I don't think you can store the cache on the client, given that the thinking happens server-side and you only get summaries in your client (and even those are disabled by default).

sargunv 7 hours ago | parent [-]

If they really need to guard the thinking output, they could encrypt it and store it client side. Later it'd be sent back and decrypted on their server.
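The "opaque blob" pattern being described is encrypt-then-MAC: the server seals the state with keys only it holds, hands the blob to the client, and later verifies and decrypts whatever comes back. The sketch below is a toy, assuming a SHAKE-256 keystream for confidentiality and an HMAC tag for integrity so it stays stdlib-only; a real system would use a vetted AEAD such as AES-GCM instead.

```python
import hashlib
import hmac
import secrets

# Keys held server-side only; the client never sees them.
ENC_KEY = secrets.token_bytes(32)
MAC_KEY = secrets.token_bytes(32)

def _keystream(nonce: bytes, length: int) -> bytes:
    # Toy stream cipher: SHAKE-256 as a keyed keystream generator.
    return hashlib.shake_256(ENC_KEY + nonce).digest(length)

def seal(plaintext: bytes) -> bytes:
    """Encrypt-then-MAC: returns nonce || ciphertext || tag for the client."""
    nonce = secrets.token_bytes(16)
    ct = bytes(a ^ b for a, b in zip(plaintext,
                                     _keystream(nonce, len(plaintext))))
    tag = hmac.new(MAC_KEY, nonce + ct, hashlib.sha256).digest()
    return nonce + ct + tag

def open_blob(blob: bytes) -> bytes:
    """Verify the tag, then decrypt. Rejects any client-side tampering."""
    nonce, ct, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(MAC_KEY, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("blob was modified")
    return bytes(a ^ b for a, b in zip(ct, _keystream(nonce, len(ct))))
```

With this shape the client can store and return the thinking state but can neither read it nor alter it, which is the property the comment is after.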

But they used to return thinking output directly in the API, and that was _the_ reason I liked Claude over OpenAI's reasoning models.