
	▲	phamilton 5 hours ago
		Generation is basically just memory bandwidth math. Each generated token has to read all the active weights. I think that's around 40B active parameters. At a 4-bit quant that's 20GB per token. With 100GB/s (replace with whatever your bandwidth is), you get about 5 tokens per second.
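The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope upper bound, assuming decode is purely memory-bandwidth bound and every token streams all active weights exactly once; the function name and parameters are made up for illustration, and it ignores KV cache reads, compute, and other overhead.

```python
def decode_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_per_s):
    """Rough upper bound on decode speed for a bandwidth-bound model.

    Hypothetical helper: assumes each token reads all active weights once
    and nothing else limits throughput.
    """
    # GB read per token: billions of params * bits per weight / 8 bits per byte.
    gb_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_per_s / gb_per_token


# Numbers from the comment: 40B active params, 4-bit quant, 100 GB/s.
print(decode_tokens_per_second(40, 4, 100))  # 5.0
```

Swapping in your own bandwidth figure (for example, a GPU's or a unified-memory system's) scales the result linearly.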
	▲	SlavikCA an hour ago
		And with MTP (multi-token prediction) or other speculative decoding techniques, you can roughly double that.
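One rough way to see where a ~2x could come from: with speculation, each full pass over the weights can emit more than one token. The sketch below assumes each drafted token is accepted independently with probability `accept_rate` and that producing the drafts is free. Both are simplifying assumptions, and the 0.9 acceptance rate is an illustrative value, not a measurement.

```python
def expected_tokens_per_pass(accept_rate, draft_tokens):
    """Expected tokens emitted per verification pass.

    Assumes independent acceptance: draft i is kept only if all earlier
    drafts were kept, and the pass always yields at least one token.
    """
    return sum(accept_rate ** i for i in range(draft_tokens + 1))


def speculative_tokens_per_second(baseline_tps, accept_rate, draft_tokens):
    """Scale a bandwidth-bound baseline by the expected tokens per pass."""
    return baseline_tps * expected_tokens_per_pass(accept_rate, draft_tokens)


# Baseline of 5 tokens/s from the parent comment, one drafted token per pass,
# and a hypothetical 90% acceptance rate: a bit under double the baseline.
print(speculative_tokens_per_second(5.0, 0.9, 1))
```

Lower acceptance rates or a non-trivial drafting cost pull the speedup below this estimate.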