Remix clone Hacker News

	▲	magicalhippo 11 hours ago
		> the practical section covers top-k approximations that capture most of the probability mass cheaply.

		You say cheaply, but top-k with k=20 still means 20 forward passes for each position in the predicted KV cache vector, no? So to compute the residual at position i+1 you need another 20 passes? It's late, perhaps I'm missing something.