▲ | atq2119 2 days ago |
Diffusion could potentially be more efficient for local inference. With auto-regressive models, generation happens one token at a time, so it is barely compute intensive at all -- it's bandwidth bound. With diffusion, you always run the model on a decently sized batch of tokens, so you should be (close to) compute bound even for local inference. If the "output quality per compute" is roughly the same for diffusion and auto-regression (is it? I have no idea...), then diffusion will be much more efficient for local inference, because the same amount of compute can be packed into a much shorter time period.
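A back-of-the-envelope sketch of the bandwidth-bound point above (the parameter count, memory bandwidth, and FLOPS figures are assumed, laptop-class numbers, not measurements):

    # Rough ceilings for batch-1 autoregressive decoding (all numbers assumed).
    params = 8e9               # assumed 8B-parameter model
    bytes_per_param = 2        # fp16/bf16 weights
    mem_bw = 100e9             # assumed ~100 GB/s memory bandwidth
    peak_flops = 50e12         # assumed ~50 TFLOPS of fp16 compute

    weight_bytes = params * bytes_per_param    # ~16 GB streamed per decode step
    flops_per_token = 2 * params               # ~2 FLOPs per weight per token

    tokens_per_s_bandwidth = mem_bw / weight_bytes       # ~6 tok/s ceiling
    tokens_per_s_compute = peak_flops / flops_per_token  # ~3000 tok/s ceiling

    print(f"bandwidth-limited: {tokens_per_s_bandwidth:.1f} tok/s")
    print(f"compute-limited:   {tokens_per_s_compute:.0f} tok/s")

With those (assumed) numbers the compute ceiling is roughly 500x higher than the bandwidth ceiling at batch size 1; that gap is the headroom a batched diffusion denoising step could, in principle, use.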
▲ | boroboro4 a day ago | parent |
Yeah, it might be a win for local inference. I think "output quality per compute" will be worse for diffusion models, but it might be similar (or even better?) for "output quality per number of parameters", which would still make diffusion better for local inference. However, autoregressive models also have their own way of dealing with low compute utilization: speculative decoding. You can use a smaller (and faster) model to generate a bunch of possible continuations and verify all of them with the big model at once. I think Eagle3, for example, achieves a speedup of ~8 tokens per iteration this way (and to be frank, I believe it can be even better).
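For readers unfamiliar with speculative decoding, here is a minimal greedy sketch of the "draft cheaply, verify in one pass" idea (illustrative only, not Eagle3; draft_model and target_model are hypothetical stand-ins that take a list of token ids and return a [seq_len, vocab] array of next-token logits):

    import numpy as np

    def speculative_step(target_model, draft_model, prompt, k=8):
        # 1. Cheaply draft k tokens with the small model, one at a time.
        seq = list(prompt)
        for _ in range(k):
            seq.append(int(np.argmax(draft_model(seq)[-1])))
        proposed = seq[len(prompt):]

        # 2. Verify all k drafts with a single forward pass of the big model:
        #    its logits at position i predict the token at position i + 1.
        logits = target_model(seq)
        verified = [int(np.argmax(logits[len(prompt) - 1 + i]))
                    for i in range(k + 1)]

        # 3. Keep the longest prefix where the big model agrees with the draft,
        #    then append one "free" token from the big model itself.
        accepted = []
        for i, tok in enumerate(proposed):
            if tok != verified[i]:
                break
            accepted.append(tok)
        accepted.append(verified[len(accepted)])

        # Each big-model pass now yields between 1 and k + 1 tokens instead of 1.
        return prompt + accepted

The win comes from step 2: verifying k drafted tokens costs one big-model pass that reads the weights once, which is exactly the regime where plain batch-1 decoding was leaving compute idle.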