boroboro4 a day ago

Yeah, it might be a win for local inference.

I think "output quality per compute" will be loss for diffusion models, but it might be similar (or even better?) for "output quality per number of parameters". Which will still make it better for local inference.

However, autoregressive models also have their own way of dealing with low compute utilization: speculative decoding. You can use a smaller (and faster) draft model to generate a bunch of possible continuations and verify all of them with the large model at once. I think Eagle3, for example, achieves a speedup of ~8 tokens per iteration this way (and to be frank, I believe it can be even better).
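To make the idea concrete, here is a minimal sketch of the greedy variant of speculative decoding. It assumes two hypothetical stand-in callables, `draft_model` and `target_model` (toy lookup tables here, not real networks), and skips the rejection-sampling acceptance rule that production schemes like Eagle use; it only shows the core loop: draft k tokens cheaply, verify them in one target-model forward pass, and keep the longest matching prefix plus one token from the target.

```python
# Minimal greedy speculative-decoding sketch.
# `draft_model` and `target_model` are hypothetical toy stand-ins that map a
# token sequence to next-token logits for every position.
import torch

VOCAB = 100
torch.manual_seed(0)
_draft_table = torch.randn(VOCAB, VOCAB)
_target_table = torch.randn(VOCAB, VOCAB)

def draft_model(tokens: torch.Tensor) -> torch.Tensor:
    # Toy stand-in for the small model: logits of shape (len(tokens), VOCAB).
    return _draft_table[tokens]

def target_model(tokens: torch.Tensor) -> torch.Tensor:
    # Toy stand-in for the large model; same interface as draft_model.
    return _target_table[tokens]

def speculative_step(tokens: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One round: draft k tokens cheaply, then verify them all with a single
    target-model forward pass and keep the longest accepted prefix."""
    draft = tokens.clone()
    for _ in range(k):  # cheap autoregressive drafting
        next_tok = draft_model(draft)[-1].argmax()
        draft = torch.cat([draft, next_tok.view(1)])

    # One target forward pass scores the prompt plus all k drafted tokens.
    target_choice = target_model(draft).argmax(dim=-1)  # greedy target prediction per position

    n = tokens.numel()
    accepted = tokens
    for i in range(k):
        proposed = draft[n + i]
        verified = target_choice[n + i - 1]  # target's prediction for position n + i
        if proposed == verified:
            accepted = torch.cat([accepted, proposed.view(1)])
        else:
            # First mismatch: take the target's own token and stop this round.
            return torch.cat([accepted, verified.view(1)])
    # All drafts accepted: the same forward pass also yields one bonus token.
    return torch.cat([accepted, target_choice[-1].view(1)])

prompt = torch.tensor([1, 2, 3])
print(speculative_step(prompt, k=4))  # emits 1 to k+1 new tokens per target pass
```

The point of the trick is visible in the shapes: the expensive model is called once per round regardless of how many drafted tokens get accepted, so when the draft model guesses well you amortize that single forward pass over several output tokens.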