Remix clone Hacker News

	▲	dannyw 2 hours ago
		Correct. The main bottleneck with LLM inference is, and has always been, memory bandwidth. TPS ≈ your memory bandwidth in GB/s / active weights in GB. That’s it for decode. That’s all.
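
		A back-of-the-envelope sketch of that rule of thumb, assuming single-stream decode where every token has to read all active weights from memory once. The function name and the numbers (a 7B-parameter model at 2 bytes per weight, 1000 GB/s of bandwidth) are made up for illustration, not measurements:

```python
def decode_tps_upper_bound(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Rough ceiling on decode tokens/sec for one stream.

    Assumes each generated token streams all active weights from memory once,
    so time per token is about active_weights_gb / bandwidth_gb_s.
    """
    if bandwidth_gb_s <= 0 or active_weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / active_weights_gb


# Hypothetical example: 7e9 parameters * 2 bytes each = 14 GB of active weights.
active_gb = 7e9 * 2 / 1e9
tps = decode_tps_upper_bound(bandwidth_gb_s=1000.0, active_weights_gb=active_gb)
print(f"{tps:.1f} tokens/s")
```

		Treat the result as a ceiling rather than a prediction: KV cache reads, kernel overhead, and batching all move the real number.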