transformi, 2 days ago:
It would be interesting to see whether Groq hardware can run this diffusion architecture. It could be two orders of magnitude faster than currently known speeds :O
randomgoogler1, 2 days ago (parent):
(Disclaimer: Googler, but I don't have any specific knowledge of this architecture.)

My understanding of Groq is that it's fast because all the weights are kept in SRAM, and since SRAM <-> compute bandwidth is much higher than HBM <-> compute bandwidth, you can generate tokens faster (during autoregressive generation, the main bottleneck is just bringing the weights + KV cache into compute). If diffusion models instead do multiple unmasked forward passes through a transformer, then the activations x weights computation (plus the attention computation) becomes the bottleneck, making each denoising step compute-bound. In that case there's no advantage to keeping the weights in SRAM, since you can overlap the HBM -> compute transfer with the compute itself.

But my knowledge of diffusion is non-existent, so take this with a truck of salt.
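To make the memory-bound vs. compute-bound distinction concrete, here's a back-of-envelope arithmetic-intensity sketch. The model size, precision, and sequence length are illustrative assumptions (a 7B-parameter model in fp16), not numbers from the thread:

```python
# Roofline-style back-of-envelope: FLOPs per byte of weights moved.
# Illustrative assumptions: 7B params, fp16 (2 bytes/param).
params = 7e9
weight_bytes = params * 2  # fp16

# Autoregressive decode, batch size 1: ~2 FLOPs per parameter per token,
# but every weight byte must be streamed in -> intensity ~1 FLOP/byte,
# i.e. memory-bound, which is where Groq's SRAM-resident weights help.
ar_intensity = (2 * params) / weight_bytes

# Diffusion-style denoising step: one forward pass over all seq_len
# positions at once reuses each weight seq_len times -> intensity
# scales with sequence length, i.e. compute-bound.
seq_len = 1024
diff_intensity = (2 * params * seq_len) / weight_bytes

print(f"autoregressive intensity: {ar_intensity:.0f} FLOPs/byte")
print(f"diffusion-step intensity: {diff_intensity:.0f} FLOPs/byte")
```

Under these assumptions the per-step intensity is roughly `seq_len` times higher for the diffusion pass, which is the sense in which weight bandwidth stops being the bottleneck.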