YetAnotherNick 3 days ago

For LLM inference at batch size 1, it's hard to saturate PCIe bandwidth, especially on less powerful chips, so you would get close to linear performance scaling across GPUs[1]. The obvious issue is that running anything on multiple GPUs is harder, and much software either doesn't fully support it or isn't optimized for it.

[1]: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...
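The "hard to saturate PCIe" point can be sanity-checked with rough arithmetic. In batch-1 tensor-parallel decoding, the GPUs mainly exchange small activation vectors (a couple of all-reduces per layer), so the required link bandwidth is tiny compared to PCIe. This is a back-of-envelope sketch with assumed, roughly Llama-70B-like numbers, not measurements:

```python
# Back-of-envelope: inter-GPU traffic for batch-1 tensor-parallel decoding.
# All model/throughput numbers below are illustrative assumptions.

hidden_dim = 8192         # model hidden size (assumed)
n_layers = 80             # transformer layers (assumed)
bytes_per_elem = 2        # fp16 activations
allreduces_per_layer = 2  # one after attention, one after the MLP

# Activation bytes exchanged per generated token
bytes_per_token = n_layers * allreduces_per_layer * hidden_dim * bytes_per_elem

tokens_per_sec = 20       # plausible single-stream decode speed (assumed)
needed_gb_s = bytes_per_token * tokens_per_sec / 1e9

pcie4_x16_gb_s = 32       # ~32 GB/s theoretical, PCIe 4.0 x16, per direction

print(f"{bytes_per_token / 1e6:.1f} MB per token")
print(f"need ~{needed_gb_s * 1000:.0f} MB/s vs ~{pcie4_x16_gb_s} GB/s available")
```

Under these assumptions the interconnect needs well under 0.1% of PCIe 4.0 x16 bandwidth, which is why splitting a batch-1 model across GPUs costs so little; large batches or pipeline bubbles change the picture.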