dot_treo an hour ago

It is all about moving the bottleneck. During prompt processing everything can be computed in parallel, while during token generation you produce one token at a time, since each token depends on the ones before it. For example, on an RTX 4000 Ada I'm getting 2700 t/s for prompt processing but only 48 t/s for token generation with an 8B-class model.
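To put numbers on that: with those speeds, generation dominates end-to-end latency even for modest outputs (the 2000-token prompt and 500-token output below are just illustrative):

    # Back-of-the-envelope latency split using the speeds above.
    pp_speed = 2700.0        # prompt processing, tokens/s (parallel)
    tg_speed = 48.0          # token generation, tokens/s (sequential)

    prompt_tokens = 2000     # illustrative, not measured
    output_tokens = 500      # illustrative, not measured

    print(f"prompt:     {prompt_tokens / pp_speed:.2f}s")   # ~0.74s
    print(f"generation: {output_tokens / tg_speed:.2f}s")   # ~10.42s
    # Generating 500 tokens takes ~14x longer than ingesting 2000.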

Their approach is essentially speculative decoding: multiple tokens are predicted at once and then verified in a single parallel pass, so tokens get generated at a speed closer to the prompt processing speed.
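Schematically, generic speculative decoding looks like this (a minimal Python sketch; target_probs, draft_probs, and k=4 are placeholder names and values, not their actual setup, and accept_token / resample_residual are defined in the next snippet):

    import random

    def speculative_step(target_probs, draft_probs, ctx, k=4):
        # target_probs / draft_probs: callables mapping a token sequence
        # to a next-token probability list (stand-ins for real models).
        # 1. The cheap draft model proposes k tokens, one at a time.
        draft = []
        for _ in range(k):
            q = draft_probs(ctx + draft)
            draft.append(random.choices(range(len(q)), weights=q)[0])
        # 2. The big model checks the drafted tokens. On a real
        #    transformer, scoring all k positions is ONE batched forward
        #    pass (prompt-processing speed), not k sequential passes.
        out = []
        for tok in draft:
            p = target_probs(ctx + out)
            q = draft_probs(ctx + out)
            if accept_token(p, q, tok):
                out.append(tok)               # verified, keep it
            else:
                out.append(resample_residual(p, q))
                break                         # rejected, drop the rest
        # (A full implementation also samples a free bonus token from
        # the target model when all k drafts are accepted.)
        return out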

What seems to make it special is that their approach yields the exact same output distribution as the base model while taking only a negligible amount of additional memory.
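The losslessness comes from the verification step, which uses the standard speculative-sampling rejection rule: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized leftover probability mass. Completing the sketch above:

    def accept_token(p, q, tok):
        # Keep the drafted token with probability min(1, p/q); where the
        # draft over-proposed (q > p) this thins it back down exactly.
        return random.random() < min(1.0, p[tok] / q[tok])

    def resample_residual(p, q):
        # On rejection, sample from the positive part of (p - q).
        # Acceptance plus this residual reproduces p exactly, which is
        # why the output distribution matches the base model's.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        if sum(residual) == 0.0:              # numerical edge case
            residual = p
        return random.choices(range(len(p)), weights=residual)[0]

    # Toy demo: 3-token vocab, context-independent distributions.
    target = lambda seq: [0.6, 0.3, 0.1]
    draft  = lambda seq: [0.4, 0.4, 0.2]
    print(speculative_step(target, draft, ctx=[]))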

The main catch is that if your prompt processing speed is already bad, it will not help you all that much.

For example, the M-series Macs (up to the M4) have a relatively high generation speed compared to their prompt processing speed, so they will not benefit as much (if at all). With the M5 the prompt processing speed has increased 4x, so those can expect a good uplift.
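Rough intuition for that ceiling: speculative decoding can at best push generation toward verification speed, and verification runs at roughly prompt-processing speed, so the pp/tg ratio bounds the possible uplift. (The Mac-like numbers below are purely hypothetical; only the RTX ones are from my measurements above.)

    def max_uplift(pp_speed, tg_speed):
        # Upper bound on speculative decoding speedup: generation can at
        # best approach (roughly) prompt-processing speed.
        return pp_speed / tg_speed

    print(max_uplift(2700, 48))  # RTX 4000 Ada above: ~56x of headroom
    print(max_uplift(400, 40))   # hypothetical low pp/tg ratio: only 10x
    # Real gains sit well below the ceiling (acceptance rate, draft
    # cost), but a small ceiling caps them however good the draft is.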