jasonjmcghee · 8 months ago
I am excitedly waiting for the first company (guessing / hoping it'll be Anthropic) to invest heavily in improvements to caching. The big ones that come to mind are cheap long-term caching, innovations in compaction, and differential approaches: is there a way to reuse only the parts of the cached input context we need?
manmal · 8 months ago · parent
Isn't the problem there that a cache would be model-specific, with cached items only valid for exactly the same weights and inference engine? I think both of those are heavily iterated on.
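To make the invalidation point concrete, here's a minimal sketch (my own illustration, not any provider's actual scheme) of how a KV-cache key might be derived. The names `weights_rev` and `engine_rev` are hypothetical; the point is that if either changes, the key changes, so every cached entry from the old deployment becomes a miss:

```python
import hashlib

def cache_key(model_id: str, weights_rev: str, engine_rev: str,
              prefix_tokens: list[int]) -> str:
    """Derive a cache key for a tokenized prompt prefix.

    Hypothetical sketch: any change to the weights revision or the
    inference-engine revision produces a different key, so cached KV
    entries are only reusable on exactly the same deployment.
    """
    h = hashlib.sha256()
    h.update(model_id.encode())
    h.update(weights_rev.encode())
    h.update(engine_rev.encode())
    # Delimit tokens so e.g. [12, 3] and [1, 23] hash differently.
    h.update(b",".join(str(t).encode() for t in prefix_tokens))
    return h.hexdigest()

same_prompt = [101, 2023, 2003, 1037]
k_old = cache_key("model-x", "weights-v1", "engine-1.3", same_prompt)
k_new = cache_key("model-x", "weights-v2", "engine-1.3", same_prompt)
print(k_old != k_new)  # new weights revision -> cache miss for identical prompt
```

This also hints at why "differential" reuse is hard: the cached values are activations computed by one specific set of weights, not something weight-agnostic like raw text.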