Remix clone Hacker News

new | show | ask | jobs Github

	▲	EGreg 10 hours ago
		the k forward passes for different candidate tokens share all their prefix computation -- the KV cache up to position i-1 is identical for all candidates, so you run one pass through the shared layers and then k cheap single-token extensions. At long context lengths the shared prefix dominates the cost. This is also structurally what speculative decoding already does, so the infrastructure largely exists.