joliu 9 days ago:
It does run inference, but on the whole batch of drafted tokens at once, akin to the prefill phase. So your draft model decodes N new tokens one at a time, then the real model does a single forward pass to score all N drafted tokens in parallel. Prefill is compute-bound whereas decode is memory-bandwidth-bound, so in practice doing one prefill-style pass over N tokens is much cheaper than doing N sequential decode passes.