▲ | _zoltan_ 4 days ago |
Because then instead of RAM bandwidth you're now dealing with PCIe bandwidth, which is way less.
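Rough orders of magnitude (ballpark assumptions, not measured numbers):

    # Ballpark, assumed figures: on-card memory bandwidth vs. one PCIe 4.0 x16 link.
    vram_bw_gb_s = 1000    # ~1 TB/s on a recent high-end GPU (assumption)
    pcie4_x16_gb_s = 32    # ~32 GB/s per direction, theoretical peak (assumption)
    print(f"on-card bandwidth is ~{vram_bw_gb_s / pcie4_x16_gb_s:.0f}x the PCIe link")

Whether that gap actually hurts depends on how much data has to cross the link per token.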
▲ | YetAnotherNick 3 days ago |
For LLM inference at batch size 1, it's hard to saturate PCIe bandwidth, especially with less powerful chips, so you get close to linear scaling[1]. The obvious issue is that doing things across multiple GPUs is harder, and a lot of software either doesn't fully support it or isn't optimized for it. [1]: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...
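For example, sharding one model across several cards is close to a one-liner in some stacks. A minimal sketch with Hugging Face transformers + accelerate (the benchmark repo above may use a different stack, and the model name is just a placeholder):

    # Sketch: split one model's layers across all visible GPUs, then generate at batch size 1.
    # Assumes transformers and accelerate are installed and the model fits across the cards.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-13b-hf"  # example only, swap for whatever you run
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",   # accelerate places layers across the available GPUs
    )

    inputs = tok("The quick brown fox", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32)
    print(tok.decode(out[0], skip_special_tokens=True))

With a layer-wise split like this, only small activations cross PCIe at layer boundaries each token, which is why batch-1 decoding tends to stay bound by each card's VRAM bandwidth rather than the link.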
▲ | mythz 3 days ago |
It's also less power efficient, takes up more PCIe slots, and a lot of software doesn't support GPU clustering. I already have 4x 16GB GPUs, which can't run large models that exceed a single card's 16GB. I'm currently running them in separate VMs to make full use of them; I used to run them in separate Docker containers, but OOM exceptions would frequently bring down the whole server, which moving to VMs resolved.
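For anyone who can't move to VMs, a lighter-weight pattern is one worker process pinned per GPU, so an OOM kills only that process. Sketch only (worker.py is a hypothetical entry point, and this doesn't give the hard isolation of VMs):

    # Sketch: launch one independent inference worker per GPU in its own process.
    # "worker.py" is hypothetical; it would load a model on its single visible GPU.
    import os
    import subprocess

    NUM_GPUS = 4
    procs = []
    for gpu in range(NUM_GPUS):
        env = os.environ.copy()
        env["CUDA_VISIBLE_DEVICES"] = str(gpu)  # pin this worker to one card
        procs.append(subprocess.Popen(
            ["python", "worker.py", "--port", str(8000 + gpu)], env=env))

    for p in procs:
        p.wait()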