| ▲ | saagarjha 4 hours ago | |
Sure, but any classical attention mechanism is quadratic in context length. | ||
| ▲ | zozbot234 2 hours ago | parent [-] | |
But text generation is only quadratic overall once you have the KV cache optimization, since each decode step is then linear in context length. If every decode step now has to recompute the KV cache, including its latest and most expensive tokens (even with a quick "draft" model), that's even worse. | ||
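To put rough numbers on the point above, here is a back-of-the-envelope sketch. It assumes a unit cost per query-key score and ignores heads, layers, MLPs and the K/V projections themselves; the function names are made up for illustration and do not come from any library.

```python
def attention_cost_with_cache(n_tokens: int) -> int:
    """Query-key scores computed while generating n_tokens with a KV cache.

    Assumption: at step t, one new query attends over t cached keys
    (itself included), so each step is linear in context length.
    """
    return sum(t for t in range(1, n_tokens + 1))


def attention_cost_without_cache(n_tokens: int) -> int:
    """Query-key scores computed when every step recomputes the whole prefix.

    Assumption: at step t, full causal attention over the t-token prefix
    is redone, which is t * (t + 1) / 2 query-key pairs.
    """
    return sum(t * (t + 1) // 2 for t in range(1, n_tokens + 1))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        cached = attention_cost_with_cache(n)
        recomputed = attention_cost_without_cache(n)
        print(f"n={n}: cached={cached}, recomputed={recomputed}, "
              f"ratio={recomputed / cached:.1f}")
```

Under these assumptions the cached total grows as roughly n^2/2 and the recomputed total as roughly n^3/6, so the gap itself widens linearly with context length.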