Remix clone Hacker News

new | show | ask | jobs Github

	▲	saagarjha 4 hours ago
		Sure, but any classical attention mechanism is quadratic in context length.
	▲	zozbot234 2 hours ago \| parent [-]
		But text generation is quadratic after the KV cache optimization. If every decode step now has to recompute KV cache including its latest and most expensive tokens (even with a quick, "draft" model) that's even worse.