littlestymaar 9 days ago
> Inference is massively bottlenecked by memory bandwidth to the tune of one or two orders of magnitude compared to compute, so this helps a lot.

Nitpick: it's only bottlenecked by memory bandwidth when the batch size is too low (that is, when you don't have many users calling the same model in parallel). Speculative decoding is just a way of running a single query as if it were several parallel queries.
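To make the "single query as parallel queries" point concrete, here is a minimal sketch of the greedy-verification variant of speculative decoding (real systems verify against the target model's full distribution with rejection sampling). The names `speculative_decode`, `big_model`, and `small_model` are hypothetical toy stand-ins, not any library's API; the key idea is that the k verification calls in step 2 would, in a real system, share one batched forward pass, so the target model's weights are streamed from memory once instead of k times, which is exactly the effect batching across many users has.

```python
def speculative_decode(prefix, target_model, draft_model, k=4, steps=16):
    """Greedy-verification sketch of speculative decoding (toy)."""
    out = list(prefix)
    while len(out) - len(prefix) < steps:
        # 1. Draft: the small model proposes k tokens autoregressively.
        draft, ctx = [], list(out)
        for _ in range(k):
            tok = draft_model(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2. Verify: the target model checks every drafted position.
        #    In a real system these k checks share ONE batched forward
        #    pass -- the weights are read from memory once, not k times.
        base = list(out)
        accepted = 0
        for i in range(k):
            want = target_model(base + draft[:i])
            if draft[i] == want:
                out.append(draft[i])
                accepted += 1
            else:
                # First mismatch: keep the target's token, re-draft.
                out.append(want)
                break
        if accepted == k:
            # All drafts matched: the same verification pass also
            # yields one extra token from the target model for free.
            out.append(target_model(out))
    return out


# Toy deterministic "models" over a 50-token vocabulary, purely for
# illustration (hypothetical; real models return distributions).
def big_model(ctx):
    return (sum(ctx) * 7 + 3) % 50

def small_model(ctx):
    # Agrees with big_model most of the time, like a distilled draft.
    s = sum(ctx)
    return (s * 7 + 3) % 50 if s % 5 else (s + 1) % 50

print(speculative_decode([1, 2, 3], big_model, small_model))
```

The wall-clock speedup depends entirely on how often the draft tokens are accepted: each target-model pass produces between one and k+1 tokens, so a well-matched draft model converts memory-bandwidth-bound sequential decoding into fewer, fatter, compute-bound passes.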