littlestymaar 9 days ago
> Inference is massively bottlenecked by memory bandwidth to the tune of one or two orders of magnitude compared to compute, so this helps a lot.

Nitpick: it's only bottlenecked by memory bandwidth when the batch size is too low (that is, when you don't have many users calling the same model in parallel). Speculative decoding is just a way of running a single query as if it were several parallel queries.
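To make the "single query as parallel queries" point concrete, here is a minimal sketch of the greedy-verification variant of speculative decoding (real systems verify against the target model's full distribution with rejection sampling). The names `speculative_decode`, `big_model`, and `small_model` are hypothetical toy stand-ins, not any library's API; the key idea is that the k verification calls in step 2 would, in a real system, share one batched forward pass, so the target model's weights are streamed from memory once instead of k times, which is exactly the effect batching across many users has.

```python
def speculative_decode(prefix, target_model, draft_model, k=4, steps=16):
    """Greedy-verification sketch of speculative decoding (toy)."""
    out = list(prefix)
    while len(out) - len(prefix) < steps:
        # 1. Draft: the small model proposes k tokens autoregressively.
        draft, ctx = [], list(out)
        for _ in range(k):
            tok = draft_model(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2. Verify: the target model checks every drafted position.
        #    In a real system these k checks share ONE batched forward
        #    pass -- the weights are read from memory once, not k times.
        base = list(out)
        accepted = 0
        for i in range(k):
            want = target_model(base + draft[:i])
            if draft[i] == want:
                out.append(draft[i])
                accepted += 1
            else:
                # First mismatch: keep the target's token, re-draft.
                out.append(want)
                break
        if accepted == k:
            # All drafts matched: the same verification pass also
            # yields one extra token from the target model for free.
            out.append(target_model(out))
    return out


# Toy deterministic "models" over a 50-token vocabulary, purely for
# illustration (hypothetical; real models return distributions).
def big_model(ctx):
    return (sum(ctx) * 7 + 3) % 50

def small_model(ctx):
    # Agrees with big_model most of the time, like a distilled draft.
    s = sum(ctx)
    return (s * 7 + 3) % 50 if s % 5 else (s + 1) % 50

print(speculative_decode([1, 2, 3], big_model, small_model))
```

The wall-clock speedup depends entirely on how often the draft tokens are accepted: each target-model pass produces between one and k+1 tokens, so a well-matched draft model converts memory-bandwidth-bound sequential decoding into fewer, fatter, compute-bound passes.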