YetAnotherNick 3 days ago

For LLM inference at batch size 1, it's hard to saturate PCIe bandwidth, especially on less powerful chips, so you would get close to linear performance scaling across GPUs[1]. The obvious issue is that running anything on multiple GPUs is harder, and much software either doesn't fully support it or isn't optimized for it.

[1]: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...
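The "hard to saturate PCIe" point can be sanity-checked with rough arithmetic. In batch-1 tensor-parallel decoding, the GPUs mainly exchange small activation vectors (a couple of all-reduces per layer), so the required link bandwidth is tiny compared to PCIe. This is a back-of-envelope sketch with assumed, roughly Llama-70B-like numbers, not measurements:

```python
# Back-of-envelope: inter-GPU traffic for batch-1 tensor-parallel decoding.
# All model/throughput numbers below are illustrative assumptions.

hidden_dim = 8192         # model hidden size (assumed)
n_layers = 80             # transformer layers (assumed)
bytes_per_elem = 2        # fp16 activations
allreduces_per_layer = 2  # one after attention, one after the MLP

# Activation bytes exchanged per generated token
bytes_per_token = n_layers * allreduces_per_layer * hidden_dim * bytes_per_elem

tokens_per_sec = 20       # plausible single-stream decode speed (assumed)
needed_gb_s = bytes_per_token * tokens_per_sec / 1e9

pcie4_x16_gb_s = 32       # ~32 GB/s theoretical, PCIe 4.0 x16, per direction

print(f"{bytes_per_token / 1e6:.1f} MB per token")
print(f"need ~{needed_gb_s * 1000:.0f} MB/s vs ~{pcie4_x16_gb_s} GB/s available")
```

Under these assumptions the interconnect needs well under 0.1% of PCIe 4.0 x16 bandwidth, which is why splitting a batch-1 model across GPUs costs so little; large batches or pipeline bubbles change the picture.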