bhaney | 9 days ago
> How does the target model validate the draft tokens without running the inference as normal?

It does run the inference as normal, just in parallel with the other inferences.

> if it is doing just that, I don't get the point

Running inferences in parallel allows you to read the model weights out of memory only once for N parallel inferences, as opposed to reading them out of memory N times for N serial inferences. Inference is massively bottlenecked by memory bandwidth, to the tune of one or two orders of magnitude compared to compute, so this helps a lot.
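To make the "validate in parallel" part concrete, here is a rough sketch of greedy speculative decoding in Python. It assumes a HuggingFace-style causal LM interface (model(input_ids).logits); the names draft_model and target_model are placeholders, not anything from the thread, and it only shows greedy acceptance rather than the full rejection-sampling scheme.

  # Rough sketch, assuming HuggingFace-style models where model(ids).logits
  # returns [batch, seq_len, vocab] scores. Greedy acceptance only.
  import torch

  def speculative_step(target_model, draft_model, input_ids, k=4):
      # 1. The small draft model proposes k tokens, one at a time (cheap).
      draft_ids = input_ids
      for _ in range(k):
          logits = draft_model(draft_ids).logits[:, -1, :]
          next_tok = logits.argmax(dim=-1, keepdim=True)
          draft_ids = torch.cat([draft_ids, next_tok], dim=-1)

      # 2. The target model scores the prompt plus all k draft tokens in ONE
      #    forward pass: its weights are read from memory once, and each
      #    position's logits say what the target would have generated there.
      target_preds = target_model(draft_ids).logits.argmax(dim=-1)

      # 3. Accept the longest prefix of draft tokens the target agrees with.
      n_prompt = input_ids.shape[1]
      accepted = input_ids
      for i in range(k):
          target_tok = target_preds[:, n_prompt - 1 + i]  # target's choice for position n_prompt + i
          draft_tok = draft_ids[:, n_prompt + i]
          if (target_tok == draft_tok).all():
              accepted = torch.cat([accepted, draft_tok.unsqueeze(-1)], dim=-1)
          else:
              # First mismatch: take the target's token instead and stop.
              accepted = torch.cat([accepted, target_tok.unsqueeze(-1)], dim=-1)
              break
      return accepted

The key step is 2: verifying k draft tokens costs one pass over the target model's weights instead of k, which is exactly the serial-vs-parallel memory-read saving described above.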
littlestymaar | 9 days ago
> Inference is massively bottlenecked by memory bandwidth to the tune of one or two orders of magnitude compared to compute, so this helps a lot.

Nitpick: it's only bottlenecked by memory bandwidth if the batch size is too low (that is, if you don't have many users calling the same model in parallel). Speculative decoding is just a way of running a single query as if it were parallel queries.
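For a rough sense of where that crossover sits, here's a back-of-envelope sketch; the 7B-parameter model and the hardware numbers are illustrative assumptions, not measurements from the thread.

  # Back-of-envelope, with made-up but plausible numbers: a 7B-parameter
  # model in fp16 on an accelerator with ~1000 GB/s of memory bandwidth
  # and ~300 TFLOP/s of fp16 compute.
  params = 7e9
  weight_bytes = params * 2          # ~14 GB of weights read per decode step
  mem_bw = 1000e9                    # bytes/s
  compute = 300e12                   # FLOP/s

  def step_time(batch_size):
      # Each decode step streams all weights once (shared across the batch)
      # and does roughly 2 FLOPs per parameter per sequence in the batch.
      t_mem = weight_bytes / mem_bw
      t_compute = 2 * params * batch_size / compute
      return max(t_mem, t_compute)

  for b in (1, 8, 64, 512):
      print(b, f"{step_time(b) * 1e3:.1f} ms/step")

On those assumed numbers, a decode step is dominated by the ~14 GB weight read until the batch size reaches the hundreds, which is why a single query looks bandwidth-bound while a well-batched server does not.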