	adrian_b 3 hours ago
		Multi-token prediction (MTP) is the same thing as speculative decoding; the Google pages describing their MTP implementation say as much. Google has now released a small companion model for each of the previously released Gemma 4 models, e.g. "gemma-4-26B-A4B-it-assistant" for "gemma-4-26B-A4B-it". The difference from Qwen is that each small model here is not a general-purpose smaller model but one optimized specifically for this task: predicting the output of the bigger model it is paired with. That specialization is what lets the Google "gemma-4-*-assistant" models be much smaller, and thus much faster, than general-purpose small models.
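For anyone who has not seen the draft-and-verify loop spelled out, here is a minimal sketch of greedy speculative decoding. It is not Google's or Qwen's implementation. `toy_target` and `toy_draft` are hypothetical stand-ins for the big model and its small paired drafter, and the sketch assumes greedy (argmax) decoding so that the output is identical to what the big model would produce on its own.

```python
from typing import Callable, List

Token = int
# A "model" here is just a greedy next-token function: context -> next token.
NextToken = Callable[[List[Token]], Token]


def greedy_decode(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Plain autoregressive decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Draft up to k tokens with the small model, then verify with the big one."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The cheap draft model proposes a short run of tokens.
        proposal: List[Token] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))

        # 2. The target checks each proposed position. A real system scores all
        #    positions in one batched forward pass; this sketch calls the target
        #    once per position for clarity.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            accepted.append(expected)  # always keep the target's own token
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every draft token matched, so the verification pass also yields
            # one extra target token (if there is room for it).
            if len(out) + len(accepted) < limit:
                accepted.append(target(out + accepted))
        out.extend(accepted)
    return out


# Hypothetical toy models: the draft agrees with the target except after a 7.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] * 2 + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 7 else toy_target(ctx)


if __name__ == "__main__":
    spec = speculative_decode(toy_target, toy_draft, [1], max_new=8)
    assert spec == greedy_decode(toy_target, [1], max_new=8)
    print(spec)
```

The speedup comes entirely from how often the draft matches the target, which is why a drafter trained specifically to imitate its paired model can be smaller than a general-purpose one and still be accepted often.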