joliu 9 days ago:
It does run inference, but on the whole batch of drafted tokens at once, akin to the prefill phase. So your draft model decodes N new tokens one at a time, then the real model does a single forward pass to score all N drafted tokens in parallel. Prefill is compute-bound whereas decode is memory-bandwidth-bound, so in practice doing one prefill-style pass over N tokens is much cheaper than doing N sequential decode passes.