Mikealcl 5 hours ago
Could you explain why prompt processing is the bottleneck, please? I've seen this behavior but I don't understand why.
zozbot234 5 hours ago | parent
You should be able to save a lot on prefill by stashing KV-cache shared prefixes to near-line bulk storage and fetching them back in as needed (this works because the KV-cache for plain transformers is an append-only structure). Not sure why local AI engines don't do this already, since it's a natural extension of session save/restore and of what's usually called prompt caching.
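The scheme described above can be sketched roughly as follows. This is a minimal illustration, not any engine's actual implementation: the class and helper names are hypothetical, and a string stands in for the real KV tensors. The key property it relies on is the one stated above: since the KV-cache is append-only, the cached KV for a token prefix is a valid starting point for any prompt sharing that prefix, so prefill only needs to run on the remaining tokens.

```python
import hashlib
import pickle
from pathlib import Path

def _key(tokens):
    # Hash the token-id prefix to get a stable on-disk cache key.
    h = hashlib.sha256()
    for t in tokens:
        h.update(t.to_bytes(4, "little", signed=False))
    return h.hexdigest()

class PrefixKVStore:
    """Hypothetical store: stash KV-cache entries for token prefixes on
    disk (standing in for "near-line bulk storage") and fetch the longest
    cached prefix back in on demand."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, tokens, kv):
        # Because the KV-cache is append-only, the KV for tokens[:n] is
        # reusable by any later prompt that shares that prefix.
        (self.root / _key(tokens)).write_bytes(pickle.dumps(kv))

    def longest_prefix(self, tokens):
        # Walk from the full prompt down to length 1; return
        # (n_cached, kv) for the longest stored prefix, so prefill only
        # has to process tokens[n_cached:].
        for n in range(len(tokens), 0, -1):
            path = self.root / _key(tokens[:n])
            if path.exists():
                return n, pickle.loads(path.read_bytes())
        return 0, None
```

For example, saving the KV for a shared system-prompt prefix once means a later, longer prompt that starts with the same tokens only pays prefill for its suffix. (A real engine would store per-layer key/value tensors and use a trie or block-hash lookup rather than this linear scan, but the caching logic is the same.)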