| ▲ | adastra22 5 hours ago | |
Attention is quadratic, so you have to pick a cutoff for context window size. In addition, the error/noise in state space increases with longer contexts, resulting in poorer performance. So even if you're willing to take the O(n^2) slowdown of a larger context window, it still won't work. | ||
| ▲ | fancy_pantser 4 hours ago | parent [-] | |
> Attention is quadratic Exactly. Standard Multi-Head Attention uses a matrix that grows to 4B parameters for a 64K sequence as a starting place. FlashAttention v2 helps slightly, but as you grow to 128K context length, you still need over 1TB/s memory bandwidth to stay compute-bound in practice even with this optimization. So there has been a lot of research in this area and model architectures released this year are showing some promising improvements. Sliding windows lose context fidelity and if you go fully linear, you sacrifice math, logic, and long multi-turn (agentic) capabilities, so everyone is searching for a good alternative compromise. MiniMax-M1 had lightning attention to scale up to 1M context lengths. It's "I/O aware" via tiling and calculates attention two ways block-wise (intra-block traditional attention and inter-block linear attention), thereby avoiding the speed-inhibiting cumulative summation. DeepSeek V3.2 uses DeepSeek Sparse Attention (DSA), which is sub-linear by only computing "interesting" pairs. For example, in 128K context lengths this requires only 10-20% of attention pairs to be materialized. Both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which is borrowed from Mamba2. In Qwen3-Next it alternates three Gated DeltaNet (linear attention) layers for every one gated [full] attention. The speedup is from a delta rule, which basically amounts to caching in a hand-wavy way. There's no universally-adopted solution yet, as these are all pretty heavy-duty compromises, but the search is going strong right now for linear or better attention mechanisms that still perform well. | ||