The problem with this approach is that even recomputing a "draft" of the KV cache is still quadratic in context length. Maybe you can get some constant savings by always recomputing the earliest tokens, but it's not a good tradeoff as context sizes grow.

saagarjha 4 hours ago

Sure, but any classical attention mechanism is quadratic in context length.

	zozbot234 2 hours ago
		But text generation is only quadratic in total because of the KV cache optimization. If every decode step now has to recompute the KV cache, including its latest and most expensive tokens (even with a quick "draft" model), that's even worse.
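
To put rough numbers on that, here is a minimal sketch that counts query-key dot products across a whole generation, with and without a KV cache. It assumes plain causal self-attention and ignores layers, heads, and the MLP; the function names are made up for illustration and not taken from any library.

```python
def attention_pairs_with_cache(n: int) -> int:
    """Total query-key dot products over n decode steps when K/V are cached."""
    # Step t (1-indexed) scores one new query against t cached keys.
    return sum(t for t in range(1, n + 1))  # n(n+1)/2, quadratic


def attention_pairs_without_cache(n: int) -> int:
    """Same total when every step recomputes attention for the whole prefix."""
    # Step t redoes causal attention over all t tokens: t(t+1)/2 pairs.
    return sum(t * (t + 1) // 2 for t in range(1, n + 1))  # n(n+1)(n+2)/6, cubic


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        cached = attention_pairs_with_cache(n)
        uncached = attention_pairs_without_cache(n)
        print(f"n={n}: cached={cached}, uncached={uncached}, "
              f"ratio={uncached / cached:.1f}")
```

Under these assumptions the gap between the two totals grows linearly with the number of generated tokens, which is why recomputing the cache at every step gets worse as context sizes grow.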