sigmar 4 hours ago
> publish these incredible papers explaining how they achieved their gains - something the American labs no longer do unfortunately.

Google is still releasing a lot of LLM architecture research. They introduced speculative decoding of LLMs in 2022 [1], then released the code to perform speculative decoding for their Gemma 4 model this year [2].

[1] https://arxiv.org/abs/2211.17192
[2] https://github.com/google-gemma/cookbook/blob/main/docs/mtp/...
kamranjon 3 hours ago
Thanks for the clarification - Google does publish more than others - and I actually really appreciate the work they are doing with the Gemma models, which are truly competitive open models. I do wish they’d publish more in-depth papers on their Gemma models, but I appreciate that they are open weights.
DiabloD3 3 hours ago
They weren't the first to do MTP like this, and arguably did it wrong: the MTP heads are kept in a separate file and have to be welded in by the inference engine. Qwen 3.6 shipped with working MTP first, and had working MTP in llama.cpp first.