Remix clone Hacker News

	wxw 5 hours ago
		> SSA replaces the O(n²) dense attention pass with a learned sparse formulation that scales linearly with context length.

		> At 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2.

		Awesome stuff. Solving context at the model architecture layer rather than trying to bolt on extra memory is the right direction IMO.