ac29 3 hours ago
> I'm curious what the downside for this speed is here

"DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs"