enjeyw 2 days ago
Overly specific LLM research into KV cache eviction. The vast majority of tokens in a sequence are irrelevant to the attention mechanism outside of a very small window. Right now, however, we tend to either keep all cache entries forever or dump them all once they hit a certain age. My theory is that you can train a model to look at the key vectors and, from that information alone, work out how long to keep each token in the cache. Results so far look promising, and it's easy to add after the fact without retraining the core model itself.
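To make the idea concrete, here's a minimal sketch of what such a learned eviction policy could look like, assuming a small per-token time-to-live (TTL) predictor trained on key vectors; the class and function names (`KeyTTLPredictor`, `evict`) and the TTL formulation are my own illustration, not the commenter's actual implementation:

```python
import torch
import torch.nn as nn

class KeyTTLPredictor(nn.Module):
    """Hypothetical head mapping a key vector to a time-to-live, measured in tokens."""
    def __init__(self, head_dim: int, max_ttl: int = 4096):
        super().__init__()
        self.max_ttl = max_ttl
        self.mlp = nn.Sequential(
            nn.Linear(head_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, keys: torch.Tensor) -> torch.Tensor:
        # keys: (num_tokens, head_dim) -> predicted TTL per token, in [0, max_ttl]
        return torch.sigmoid(self.mlp(keys)).squeeze(-1) * self.max_ttl


def evict(keys, values, positions, current_pos, predictor):
    """Keep only cache entries whose predicted TTL has not yet expired."""
    with torch.no_grad():
        ttl = predictor(keys)            # (num_tokens,) predicted lifetimes
    age = current_pos - positions        # tokens elapsed since each key was written
    keep = age <= ttl                    # boolean mask of surviving entries
    return keys[keep], values[keep], positions[keep]


# Toy usage: a cache of 10 tokens for one attention head of dimension 64.
predictor = KeyTTLPredictor(head_dim=64, max_ttl=128)
keys = torch.randn(10, 64)
values = torch.randn(10, 64)
positions = torch.arange(10)
keys, values, positions = evict(keys, values, positions, current_pos=50, predictor=predictor)
print(f"{keys.shape[0]} of 10 entries retained")
```

Because the predictor only reads the key vectors already in the cache, it can in principle be bolted onto a frozen base model and trained separately, which matches the comment's point that no retraining of the core model is needed.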