zozbot234 5 hours ago
The problem with this approach is that even recomputing a "draft" of the KV cache is still quadratic in context length. Maybe you can get some constant savings by always recomputing the earliest tokens, but it's not a good tradeoff as context sizes grow.
saagarjha 4 hours ago
Sure, but any classical attention mechanism is quadratic in context length.
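A minimal sketch of where the quadratic term comes from, assuming standard causal self-attention in which each token attends to itself and every earlier token. The function name `causal_attention_score_count` is hypothetical and only counts query-key dot products for a single head; it does not model any particular library or the caching scheme discussed above.

```python
def causal_attention_score_count(context_length: int) -> int:
    """Count query-key dot products for one causal attention head
    when processing a full context of the given length."""
    total = 0
    for position in range(context_length):
        # The token at `position` attends to itself and all earlier tokens.
        total += position + 1
    return total


if __name__ == "__main__":
    # Doubling the context length roughly quadruples the work.
    for n in (1_000, 2_000, 4_000):
        print(n, causal_attention_score_count(n))
```

The count is n(n+1)/2, so skipping or reusing a fixed fraction of positions changes only the constant factor, not the growth rate.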