albertzeyer a day ago

> Google's first LLM to use diffusion in place of transformers.

But this is a wrong statement, and Google never made it. You can have Transformer diffusion models; in fact, Transformers are the standard backbone for discrete diffusion language models, so I would expect Gemini Diffusion also uses a Transformer.

Edit: Ah sorry, I missed that this was already addressed, also linked in the post: https://news.ycombinator.com/item?id=44057939 Maybe the rest of my post is still useful to some.

The difference is that it's an encoder-only Transformer, not a decoder-only Transformer. I.e. it gets fed a full (but noisy/corrupted) sequence and predicts the full corrected sequence, and then you can iterate on that. All positions in the sequence can be computed in parallel, and if you only need a few iterations, this is faster than the sequential decoding in decoder-only models (although speculative decoding also gets you some speedup for similar reasons). These discrete diffusion models / encoder-only Transformers are usually trained with BERT-like masking, but that's an active field of research. It's really a pity that they don't provide any details here (on training and modeling).
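
Roughly, the iterative parallel decoding could look like the toy sketch below (a MaskGIT-style re-masking schedule with a stand-in denoiser; all names and the schedule are my assumptions, since Google doesn't describe the actual Gemini Diffusion procedure):

    import numpy as np

    VOCAB = 1000        # hypothetical vocabulary size
    MASK = 0            # hypothetical id of the [MASK] token
    SEQ_LEN = 32
    NUM_STEPS = 8       # number of parallel refinement iterations

    rng = np.random.default_rng(0)

    def denoiser(tokens):
        # Stand-in for an encoder-only Transformer: given the full
        # (partially masked) sequence, return per-position logits over
        # the vocabulary. Random here so the loop actually runs.
        return rng.normal(size=(len(tokens), VOCAB))

    def decode():
        tokens = np.full(SEQ_LEN, MASK)              # start fully masked
        for step in range(NUM_STEPS):
            logits = denoiser(tokens)                # all positions in parallel
            probs = np.exp(logits - logits.max(-1, keepdims=True))
            probs /= probs.sum(-1, keepdims=True)
            pred = probs.argmax(-1)
            conf = probs.max(-1)
            # Commit the most confident predictions, re-mask the rest so
            # later iterations can revise them.
            keep = int((step + 1) / NUM_STEPS * SEQ_LEN)
            order = np.argsort(-conf)
            tokens = np.full(SEQ_LEN, MASK)
            tokens[order[:keep]] = pred[order[:keep]]
        return tokens

    print(decode())

The point of the sketch is just that every position is predicted in one forward pass per iteration, so the total number of model calls is NUM_STEPS rather than SEQ_LEN.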

I wonder how this relates to Gemini. Does it use the same modeling? Was the model checkpoint even imported from Gemini, and then further finetuned for discrete diffusion? Or knowledge distillation? Or is it just branding?