tomp | 4 days ago
No, the parent is wrong. Checking a token costs the same as generating it. The benefit, however, is in the next (third) token. After generating tokens 1 and 2 (in one turn), you start generating token 3 (and a speculative token 4). You also get the "real" prediction for token 2. If the "real" prediction matches the MTP (Multi-Token Prediction) draft from the previous turn, you have just generated 3 correct tokens (and another speculative one). If not, you've now corrected token 2, but token 3 is wrong (it follows the wrong token 2), so you need to generate it again.
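The accept/reject bookkeeping described above can be sketched with a toy model. Everything here is a hypothetical stand-in, not a real LLM API: `forward` fakes a transformer pass with a "next token = last + 1" rule, and its MTP head is deliberately made to guess wrong sometimes so the rejection path gets exercised.

```python
def forward(seq):
    """Hypothetical single forward pass. For every position i it yields the
    model's "real" prediction for the token following seq[:i+1], plus one
    MTP draft for the position after the last prediction. Toy rule: the
    true next token is always last + 1, and the MTP head guesses wrong
    whenever its guess would be a multiple of 5."""
    preds = [t + 1 for t in seq]
    draft = preds[-1] + 1
    if draft % 5 == 0:
        draft += 1  # deliberately wrong draft, to force a rejection
    return preds, draft

def generate(prompt, n_new):
    out = list(prompt)
    passes = accepted = rejected = 0
    draft = None  # speculative token for the next position, if any
    while len(out) - len(prompt) < n_new:
        if draft is None:
            # Plain step: one pass gives the next token plus an MTP draft.
            preds, draft = forward(out)
            out.append(preds[-1])
        else:
            # Speculative step: the pass over out + [draft] both verifies
            # the draft (preds[-2] is the "real" token at that position)
            # and, on a match, hands us the token after it (preds[-1])
            # from the very same pass.
            preds, next_draft = forward(out + [draft])
            real = preds[-2]
            out.append(real)
            if real == draft:
                accepted += 1
                if len(out) - len(prompt) < n_new:
                    out.append(preds[-1])  # free extra token
                draft = next_draft
            else:
                rejected += 1
                draft = None  # everything after the wrong token is stale
        passes += 1
    return out, passes, accepted, rejected
```

With this toy acceptance pattern, generating 10 new tokens from the prompt `[0]` takes 6 forward passes instead of 10, and every emitted token is still one the "real" model would have produced, which is the whole point: speculation changes latency, not output.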
bigwheels | 4 days ago | parent
Thanks for the clarification. Your comment made me connect the similarity (in spirit) of Speculative Decoding to Speculative Execution [1] in CPUs. Very cool and clever optimization strategy for LLMs, IMHO. Does it work to predict tokens 3 and 4 (or 5, 6) in the same way? I wonder how extreme the hit-rate drop-off is.

[1] https://en.wikipedia.org/wiki/Speculative_execution
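On the drop-off question, a common back-of-envelope (under the simplifying, not-really-true assumption that each extra draft token is accepted independently with probability p) says a pass that drafts k tokens ahead yields (1 - p^(k+1)) / (1 - p) accepted tokens on average:

```python
def expected_tokens_per_pass(p, k):
    """Expected tokens gained per forward pass when k draft tokens are
    verified, each accepted independently with probability p. This
    independence assumption is an idealization; real acceptance rates
    are position- and context-dependent."""
    return sum(p ** i for i in range(k + 1))

# Diminishing returns as the draft gets longer, e.g. at p = 0.8:
for k in (1, 2, 4, 8):
    print(k, round(expected_tokens_per_pass(0.8, k), 2))
```

At p = 0.8 the expected yield goes from about 1.8 tokens per pass with one draft token to about 4.3 with eight, so each extra draft position buys less than the last, and the curve flattens quickly as p falls.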
jychang | 2 days ago | parent
To clarify, I should have stated: "Instead of generating tokens one at a time, you generate the second one as well WITH MTP, and then use speculative decoding on that second token (instead of having the second token be produced by a draft model like Qwen 0.6b). If the FIRST MTP token is checked and is correct, then the second token gets generated MUCH faster." |