wxw 5 hours ago
> SSA replaces the O(n²) dense attention pass with a learned sparse formulation that scales linearly with context length.

> At 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2.

Awesome stuff. Solving context at the model architecture layer rather than trying to bolt on extra memory is the right direction IMO.