Klaus23 3 hours ago
It really is. This is because LLMs serving a single output/user are strongly memory-bandwidth limited. Although the hardware can process multiple tokens simultaneously, it is slowed down when the tokens depend on each other, as they do in regular text generation. The draft model essentially predicts the next token quickly, letting you start generating the subsequent token in parallel. If the guess is right, the second generated token is correct. If it is wrong, the second generated token is also potentially wrong, so it must be regenerated using the correct prior token obtained from the big model. A poor draft model simply slows down the process without affecting the output.
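Roughly, a greedy version of that loop looks like the sketch below (the `draft`/`target` callables and the helper name are made up for illustration; real implementations batch the verification pass on the GPU):

    import torch

    def speculative_step(target, draft, prefix, k=4):
        # 1. Draft model proposes k tokens serially (cheap, small model).
        proposed = []
        ctx = list(prefix)
        for _ in range(k):
            tok = int(torch.argmax(draft(torch.tensor([ctx]))[0, -1]))
            proposed.append(tok)
            ctx.append(tok)

        # 2. Target model scores the prefix plus all proposed tokens in ONE
        #    forward pass, so its weights are read from memory only once.
        logits = target(torch.tensor([prefix + proposed]))[0]

        # 3. Keep proposed tokens as long as they match what the target model
        #    itself would have picked (greedy check for simplicity).
        accepted = []
        for i, tok in enumerate(proposed):
            expected = int(torch.argmax(logits[len(prefix) - 1 + i]))
            if tok != expected:
                accepted.append(expected)  # fix up with the target's own token
                break
            accepted.append(tok)
        else:
            # All k matched; take one bonus token from the target's last position.
            accepted.append(int(torch.argmax(logits[-1])))
        return prefix + accepted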
furyofantares 3 hours ago | parent
> If the guess is right

This is the crux. What makes the guess "right"? I think the acceptance criterion is not that the token is exactly the token the big model would have produced. It's accepted if the big model verifies that the probability of that token was high enough. How close you get to the same output (or the same distribution of outputs) as running the big model alone would then depend on temperature, top-k, top-p, or other inference parameters.
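For reference, the acceptance rule described in the speculative-sampling papers is probabilistic rather than a fixed threshold: accept draft token x with probability min(1, p_big(x)/p_draft(x)), and on rejection resample from the normalized positive part of p_big - p_draft, which makes the output distributed exactly as the big model's (after whatever temperature/top-k/top-p warping is applied to both distributions). A minimal sketch, taking the two already-warped probability vectors as inputs (function name is illustrative):

    import numpy as np

    def accept_or_resample(p_big, p_draft, token, rng=np.random.default_rng()):
        # p_big, p_draft: probability vectors over the vocab (after any
        # temperature / top-k / top-p warping). token: the draft model's sample.
        # Accept with probability min(1, p_big[token] / p_draft[token]).
        if rng.random() < min(1.0, p_big[token] / p_draft[token]):
            return token, True
        # On rejection, resample from the normalized positive residual
        # max(0, p_big - p_draft); overall the samples follow p_big exactly.
        residual = np.maximum(p_big - p_draft, 0.0)
        residual /= residual.sum()
        return int(rng.choice(len(p_big), p=residual)), False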
| ||||||||||||||||||||