el_isma · 3 hours ago
How is this different from the speculative decoding we had before? You could pair a big and a small model, like Qwen 32B with Qwen 4B, and get that same dynamic of the small model generating tokens and the big one certifying them. The blog says something about re-using the big model's data?
adrian_b · 3 hours ago (reply)
Multi-token prediction is the same thing as speculative decoding; this is mentioned in the Google pages describing their MTP implementation. Google now provides a small model for each of the previous Gemma 4 models, e.g. "gemma-4-26B-A4B-it-assistant" for "gemma-4-26B-A4B-it". The difference vs. Qwen is that here each small model is not some general-purpose smaller model, but a model optimized specifically for this task: predicting the output of the bigger model it is paired with. This specialization ensures that the Google "gemma-4-*-assistant" models can be much smaller, and thus much faster, than general-purpose small models.
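For readers unfamiliar with the mechanism being discussed: the draft-and-verify dynamic works roughly like this toy greedy-decoding sketch. Both "models" below are stand-in functions (not real LLMs), and the token arithmetic is invented purely for illustration; the point is the accept/reject loop, which guarantees the output matches what the big model alone would have produced.

```python
def draft_model(context):
    # Hypothetical cheap small model: predicts the next token id.
    return (sum(context) + 1) % 50

def target_model(context):
    # Hypothetical big model: the output we must exactly reproduce.
    # It usually agrees with the draft, but not always.
    return (sum(context) + 1) % 50 if sum(context) % 7 else (sum(context) + 2) % 50

def speculative_step(context, k=4):
    # 1. Draft k tokens autoregressively with the small model.
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft_model(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2. Verify with the big model. In a real system this is ONE batched
    #    forward pass over all k positions, which is where the speedup
    #    comes from; here we just check token by token.
    accepted, ctx = [], list(context)
    for t in proposed:
        expected = target_model(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            # First mismatch: take the big model's token and stop.
            accepted.append(expected)
            break
    else:
        # All k drafts accepted; the verify pass yields one extra token free.
        accepted.append(target_model(ctx))
    return accepted

print(speculative_step([3, 1, 4]))
```

Every accepted run of tokens is identical to what greedy decoding with the big model alone would emit; the small model only decides how many tokens per big-model pass you get.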
OneDeuxTriSeiGo · 3 hours ago (reply)
As far as I can tell, MTP differs from regular speculative decoding in that the small model is trained to consume and operate on the big model's hidden state when making its predictions.
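The distinction described above can be sketched as follows. This is a toy illustration only: the shapes, the single linear projection, and the way the hidden state is concatenated are assumptions for demonstration, not any specific model's architecture. A standalone draft model sees only the token ids, while an MTP-style head additionally conditions on the target model's hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 100
emb = rng.normal(size=(vocab, d_model))       # shared toy token embeddings
W_out = rng.normal(size=(d_model, vocab))     # target model's output head
W_mtp = rng.normal(size=(2 * d_model, vocab)) # MTP head weights (illustrative)

def big_model_forward(token_ids):
    # Stand-in for the target model: returns logits AND its final hidden
    # state. A real transformer exposes its last-layer hidden state too.
    h = np.tanh(emb[token_ids].sum(axis=0))   # fake hidden state, (d_model,)
    return h @ W_out, h

def standalone_draft(token_ids):
    # Ordinary speculative decoding: the draft model only sees token ids
    # and must independently approximate the big model.
    h_small = np.tanh(emb[token_ids].sum(axis=0))
    return int(np.argmax(h_small @ W_out))

def mtp_draft(token_ids, big_hidden):
    # MTP-style head: takes the big model's hidden state as an input, so
    # it can be trained to track that specific model's behavior rather
    # than approximating it from scratch.
    x = np.concatenate([emb[token_ids[-1]], big_hidden])
    return int(np.argmax(x @ W_mtp))

logits, h = big_model_forward([1, 2, 3])
draft_tok = mtp_draft([1, 2, 3], h)
```

Because the MTP head re-uses the expensive representation the big model has already computed, it can be far smaller than a standalone draft model for the same acceptance rate, which matches the specialization argument in the reply above.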