	tssge an hour ago
		The reason speculative decoding shows diminishing returns in batched workloads is that both rely on the same principle. Speculative decoding predicts a group of tokens and verifies the group with the main model in one pass instead of decoding each token separately. That is, the weights are loaded from memory once per group instead of once per token: the computation is roughly the same, but the memory movement (and other overhead such as kernel launches) is not. Batching exploits the same mechanism, so speculative decoding is essentially an attempt to batch a single stream using prediction. It is only an attempt because verification may reject some tokens if the prediction was inaccurate.
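
To put rough numbers on this, here is a toy cost model, not a benchmark. Everything in it is an assumption made for illustration: a forward pass costs whichever is larger, loading the weights once or computing over every token in the pass; each drafted token is accepted independently with probability `accept_prob`; the draft model is free. The function names and the constants `weight_load_time` and `compute_time_per_token` are made up, not taken from any real system.

```python
def expected_tokens_per_pass(k: int, accept_prob: float) -> float:
    """Expected tokens emitted by one verification pass over k drafted tokens.

    Assumption: each drafted token is accepted independently with probability
    accept_prob, and the pass always yields one token from the main model
    (a correction after a rejection, or a bonus token if all k are accepted).
    """
    if accept_prob >= 1.0:
        return float(k + 1)
    # Geometric sum: 1 + p + p^2 + ... + p^k
    return (1.0 - accept_prob ** (k + 1)) / (1.0 - accept_prob)


def pass_time(tokens_in_pass: int,
              weight_load_time: float = 1.0,
              compute_time_per_token: float = 0.02) -> float:
    """Toy roofline: a pass is bound by weight loading or by compute.

    Both constants are arbitrary illustrative units, not measurements.
    """
    return max(weight_load_time, compute_time_per_token * tokens_in_pass)


def speculative_speedup(batch: int, k: int, accept_prob: float) -> float:
    """Throughput of speculative decoding relative to plain batched decoding.

    Assumption: the draft model costs nothing, which flatters speculation.
    """
    # Plain decoding: one token per stream per pass.
    plain_throughput = batch / pass_time(batch)
    # Speculative decoding: each stream verifies k drafted tokens plus one.
    spec_tokens = batch * expected_tokens_per_pass(k, accept_prob)
    spec_throughput = spec_tokens / pass_time(batch * (k + 1))
    return spec_throughput / plain_throughput


if __name__ == "__main__":
    for batch in (1, 4, 16, 64):
        print(batch, round(speculative_speedup(batch, k=4, accept_prob=0.8), 2))
```

Under these assumptions the speedup is largest for a single stream, where the pass is bound by weight loading and the extra tokens ride along for free. It shrinks as the batch grows, because batching has already amortized the weight loads and the extra verified tokens become plain extra compute.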