pornel 2 hours ago

LLM generation is bottlenecked by RAM bandwidth and latency. You can get almost linear scaling by evaluating more prompts in parallel, because the GPU has nothing to do for the relative eternity it takes to read all of the weights from DRAM for every layer, for every token.
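
A rough back-of-envelope sketch of why (the numbers below are illustrative assumptions, not measurements):

    # Single-stream decode speed is roughly memory_bandwidth /
    # bytes_of_weights, since every generated token must stream
    # all weights through the memory bus.
    weights_gb = 35.0       # e.g. a ~70B model at ~4 bits/weight (assumed)
    bandwidth_gbps = 800.0  # e.g. M2 Ultra-class memory bandwidth (assumed)

    tokens_per_sec = bandwidth_gbps / weights_gb
    print(f"~{tokens_per_sec:.0f} tok/s for one stream")  # ~23 tok/s

    # With a batch of B prompts, the same weight read is amortized
    # over B tokens, so aggregate throughput scales ~linearly in B
    # until the GPU becomes compute-bound.
    batch = 8
    print(f"~{batch * tokens_per_sec:.0f} tok/s aggregate at batch={batch}")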

On Apple Silicon you can get 4x-8x more total tokens per second by running multiple queries in parallel (as long as your inference server supports batching, and has enough spare RAM for the extra KV caches).
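
For example, something like this against a local OpenAI-compatible server (the URL and model name are placeholders; llama.cpp's server is one option that exposes this kind of API when started with parallel slots enabled):

    import concurrent.futures
    import requests

    # Hypothetical local OpenAI-compatible endpoint; adjust URL/model.
    URL = "http://localhost:8080/v1/chat/completions"

    def ask(prompt: str) -> str:
        r = requests.post(URL, json={
            "model": "local",  # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
        }, timeout=300)
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]

    prompts = [f"Summarize topic #{i} in one sentence." for i in range(8)]

    # Issuing requests concurrently lets the server batch them, so
    # aggregate tokens/sec is far higher than running them one by one.
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        for answer in pool.map(ask, prompts):
            print(answer)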

When inference is done at datacenter scale, with generation distributed across multiple GPUs and kernels carefully tuned to specific hardware, the compute-to-DRAM-bandwidth ratio gets absurd, something like 200:1. That's why everyone offers batch inference at a steep discount.
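
You can see where a ratio like that comes from by dividing peak math throughput by memory bandwidth (spec-sheet-style numbers below are approximate assumptions for an H100-class GPU):

    peak_flops = 989e12  # ~989 TFLOPS dense BF16 (approximate)
    mem_bw = 3.35e12     # ~3.35 TB/s HBM bandwidth (approximate)

    ratio = peak_flops / mem_bw
    print(f"~{ratio:.0f} FLOPs available per byte read")  # ~295

    # Batch-1 decode only does ~1 FLOP per weight byte (a multiply-add
    # is 2 FLOPs per 2-byte BF16 weight), so the ALUs sit mostly idle
    # unless batching raises the arithmetic intensity by ~100-300x.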