quantadev 7 months ago
Anyone able to summarize the current 'hold up' with diffusion models? I know exactly how Transformers work, but I'm not a diffusion expert. From what I know, diffusion is so much more powerful that it seems like it should already be beating Transformers. Why isn't it?
boroboro4 7 months ago
Diffusion is about what goes into the model and what comes out (in this case, denoising of the content), as opposed to autoregressive models (where the process is to predict a continuation from a prefix). It's orthogonal to the model architecture, which can be a transformer or (for example) Mamba; I'm pretty sure Gemini Diffusion is a transformer too. Diffusion brings a different set of trade-offs: as you can see it improves speed, but I would expect it to increase the compute required for generation. That's hard to say for sure without knowing their exact sampling process, though. Interestingly, we see the opposite direction with GPT-4o: OpenAI made an autoregressive image generation model, and it seems to work great.
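To make the contrast concrete, here's a minimal sketch of the two sampling loops. Everything here is illustrative: `toy_model` is a dummy stand-in for the trained network (which could be a transformer, Mamba, etc.), and the masked-denoising schedule is one simple choice, not Gemini's actual sampling process. The point is that autoregression spends one forward pass per token, while diffusion spends one forward pass per denoising step but predicts every position each time.

```python
import random

VOCAB = list(range(100))  # toy vocabulary of token ids
MASK = -1                 # hypothetical mask token for discrete diffusion

def toy_model(tokens):
    """Stand-in for a trained network. Returns a 'predicted' token id
    for every position; here it's just random."""
    return [random.choice(VOCAB) for _ in tokens]

def autoregressive_sample(prefix, n_new):
    """Left-to-right: one forward pass per generated token."""
    seq = list(prefix)
    for _ in range(n_new):
        preds = toy_model(seq)
        seq.append(preds[-1])  # take the prediction for the next position
    return seq

def diffusion_sample(length, n_steps):
    """Masked-denoising style: start fully masked, then commit a chunk of
    positions per step, so all positions are predicted in parallel."""
    seq = [MASK] * length
    masked = list(range(length))
    per_step = max(1, length // n_steps)
    while masked:
        preds = toy_model(seq)  # one pass denoises every position at once
        chosen, masked = masked[:per_step], masked[per_step:]
        for i in chosen:
            seq[i] = preds[i]   # commit only a few positions per step
    return seq

print(autoregressive_sample([1, 2, 3], n_new=8))  # 8 forward passes
print(diffusion_sample(length=8, n_steps=4))      # ~4 forward passes
```

Fewer sequential steps is where the speed comes from, but each diffusion pass computes predictions for the whole sequence, which is why total compute per generation can end up higher depending on the step count.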