Remix clone Hacker News


	▲	xg15 2 hours ago
		Isn't it also, most fundamentally, dependent on the model weights? My understanding is that the KV cache stores nothing other than the "activations" of the W_k and W_v matrices (the key and value projections) of each attention module for a given input sequence. So I don't quite understand how this is supposed to work:

		> Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill.

		Would a publisher have to precompute the cache for every popular model out there?
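To make the concern concrete, here is a minimal sketch of why a cache computed for one model would not transfer to another. It is a toy, not how any real inference stack stores its cache: `make_weights`, `kv_cache`, and `cache_key` are hypothetical names, the layer sizes are made up, and the hidden states are fixed by hand (in a real transformer they also depend on the embeddings and every earlier layer, which only makes the dependence on the weights stronger).

```python
import hashlib
import random


def make_weights(seed, d_model=4, d_head=2):
    """Toy stand-in for one attention layer's W_k and W_v (hypothetical sizes)."""
    rng = random.Random(seed)

    def rand_matrix():
        return [[rng.uniform(-1, 1) for _ in range(d_head)] for _ in range(d_model)]

    return {"W_k": rand_matrix(), "W_v": rand_matrix()}


def matmul(x, w):
    """Multiply a (tokens x d_model) matrix by a (d_model x d_head) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def kv_cache(hidden_states, weights):
    """Project per-token hidden states into keys and values: K = X W_k, V = X W_v."""
    return {
        "K": matmul(hidden_states, weights["W_k"]),
        "V": matmul(hidden_states, weights["W_v"]),
    }


def cache_key(model_id, document):
    """A stored cache is only valid for one (model, document) pair."""
    doc_hash = hashlib.sha256(document.encode("utf-8")).hexdigest()
    return (model_id, doc_hash)


# Three tokens with hand-picked hidden states (assumption: same input for both models).
hidden = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]

model_a = make_weights(seed=1)
model_b = make_weights(seed=2)

cache_a = kv_cache(hidden, model_a)
cache_b = kv_cache(hidden, model_b)

# Same document, different weights: the cached tensors differ.
caches_match = cache_a == cache_b

document = "some published document"
key_a = cache_key("model-a", document)
key_b = cache_key("model-b", document)
```

Under these assumptions `caches_match` is false, and the publisher would end up with one stored entry per `(model, document)` key, which is exactly the "precompute for every popular model" problem.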