saagarjha 4 hours ago

Sure, but any classical attention mechanism is quadratic in context length.

zozbot234 2 hours ago | parent [-]

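But text generation is only quadratic because of the KV cache optimization: each decode step attends over cached keys and values instead of reprocessing the whole prefix. If every decode step now has to recompute the KV cache, including its latest and most expensive tokens (even with a quick "draft" model), that's even worse.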
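
To put rough numbers on the cost argument above, here is a minimal sketch that counts query-key dot products for plain causal attention, with and without a KV cache. The function names are made up for illustration, and the count ignores heads, layers, the MLP, and any draft model, so treat it as a back-of-the-envelope model rather than a benchmark.

```python
def attention_scores_with_cache(n_tokens: int) -> int:
    """Query-key dot products when past keys/values are cached."""
    # Decode step t scores one new query against t keys (itself included).
    return sum(t for t in range(1, n_tokens + 1))


def attention_scores_without_cache(n_tokens: int) -> int:
    """Query-key dot products when each step recomputes the whole prefix."""
    # Decode step t redoes causal attention for all t positions: 1 + 2 + ... + t.
    return sum(t * (t + 1) // 2 for t in range(1, n_tokens + 1))


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        cached = attention_scores_with_cache(n)
        uncached = attention_scores_without_cache(n)
        print(f"n={n}: cached={cached}, uncached={uncached}, "
              f"ratio={uncached / cached:.1f}")
```

Under these assumptions the cached total is n(n+1)/2 and the recomputed total is n(n+1)(n+2)/6, so doubling the context roughly quadruples the first and multiplies the second by about eight.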