cubefox · 4 hours ago
I think DeepSeek V3.2 is sub-n^2, yet it clearly performs quite well, which would refute the alleged lower bounds in the paper.
andy12_ · 3 hours ago
It really isn't sub-N^2. The main attention is only O(Nk), but only thanks to a lightning indexer that is still O(N^2). So overall it has the same asymptotic complexity, just with a smaller constant factor [1]:

> DSA reduces the core attention complexity of the main model from O(L^2) to O(Lk), where k (<< L) is the number of selected tokens. Although the lightning indexer still has a complexity of O(L^2), it requires much less computation compared with MLA in DeepSeek-V3.1-Terminus
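To make the shape of that argument concrete, here's a minimal NumPy sketch of indexer-guided sparse attention (my own toy illustration, not DeepSeek's implementation; the names Qi, Ki and sparse_attention_sketch are made up). The indexer still scores all L^2 pairs, but in a tiny dimension, while the full-dimensional attention only touches the top_k selected tokens per query, which is where the O(Lk) main-attention cost comes from:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def sparse_attention_sketch(Q, K, V, Qi, Ki, top_k):
        """Toy sketch, not DeepSeek's code. Qi, Ki are small "indexer"
        projections: scoring them over all pairs is still O(L^2) but cheap.
        The full-dimensional attention then only looks at the top_k tokens
        per query, i.e. O(L * top_k)."""
        L, d = Q.shape
        # Indexer scores: all pairs, but in a tiny dimension (cheap O(L^2))
        index_scores = Qi @ Ki.T                              # (L, L)
        # Causal mask so each query only sees itself and earlier tokens
        mask = np.tril(np.ones((L, L), dtype=bool))
        index_scores = np.where(mask, index_scores, -np.inf)
        # Pick the top_k highest-scoring keys per query
        sel = np.argsort(-index_scores, axis=1)[:, :top_k]    # (L, top_k)
        out = np.zeros_like(V)
        for i in range(L):
            j = sel[i]
            j = j[index_scores[i, j] > -np.inf]               # drop masked picks
            scores = (Q[i] @ K[j].T) / np.sqrt(d)             # O(top_k * d) per query
            out[i] = softmax(scores) @ V[j]
        return out

    # Tiny usage example with random tensors
    rng = np.random.default_rng(0)
    L, d, d_idx, top_k = 16, 32, 8, 4
    Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
    Qi, Ki = (rng.standard_normal((L, d_idx)) for _ in range(2))
    print(sparse_attention_sketch(Q, K, V, Qi, Ki, top_k).shape)  # (16, 32)

The point is visible in the sketch: the only O(L^2) work left is the cheap indexer matmul, so the asymptotic exponent doesn't change, only the constant in front of it.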
| ||||||||