Remix clone Hacker News

new | show | ask | jobs Github

	▲	samuelknight 2 hours ago
		Some of these comments miss the advantage of diffusion. This is will have a big impact on edge devices, such as your phone or the GPU in your computer. An LLM's decoder computes tokens one-at-a-time because attention has to account for each previous token. The existing LLM decoders scale well when you have enough load to batch many inferences together. Diffusion of limited benefit there. On edge you have a different problem: your inference accelerator is starved while sloshing GB of weights back and forth from RAM. That's because the consumer RAM like LPDDRx/GDDRx is lower bandwidth than HBM, and the requests are serial so you can't batch compute common weights. Diffusion can compute tokens in parallel which relieves the memory bandwidth bottle neck.