YetAnotherNick 4 days ago

> I'd much prefer paying 3x cost for 3x VRAM

Why not just buy 3 cards then? These cards don't require active cooling anyway, and you can fit 3 in a decent-sized case. You'll get 3x VRAM speed and 3x compute. And if your use case is LLM inference, it will be a lot faster than one card with 3x the VRAM.
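
Pooling the cards' VRAM is the easy part; something like this works with Hugging Face transformers + accelerate (a minimal sketch, the model name is a placeholder, and device_map="auto" gives a layer split rather than the full 3x speedup):

    # pip install torch transformers accelerate
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "some-30b-model"  # placeholder, anything that fits in the pooled VRAM
    tok = AutoTokenizer.from_pretrained(model_id)

    # device_map="auto" shards the layers across every visible GPU,
    # so the VRAM pools but each token still walks the layers serially
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", torch_dtype=torch.float16
    )

    inputs = tok("Hello", return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=32)
    print(tok.decode(out[0], skip_special_tokens=True))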

zargon 3 days ago

We will buy 4 cards if they are 48 GB or more. At a measly 16 GB, we’re just going to stick with 3090s, P40s, MI50s, etc.

> 3x VRAM speed and 3x compute

LLM scaling doesn't work that way. If you have 4 cards, you may get a 2x performance increase if you use vLLM, but you'll also need enough VRAM to run FP8. 3 cards would only run at 1x performance.
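
For reference, the vLLM setup being described is roughly this (a hedged sketch; the model name is a placeholder, and tensor parallelism wants the GPU count to evenly divide the attention heads, which is part of why 3 cards falls flat):

    # pip install vllm
    from vllm import LLM, SamplingParams

    # tensor_parallel_size=4 shards every weight matrix across the 4 cards;
    # quantization="fp8" roughly halves the weight footprint vs fp16
    llm = LLM(
        model="some-70b-model",   # placeholder
        tensor_parallel_size=4,
        quantization="fp8",
    )

    out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
    print(out[0].outputs[0].text)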

_zoltan_ 4 days ago

Because then, instead of VRAM bandwidth, you're dealing with PCIe bandwidth, which is way lower.

YetAnotherNick 3 days ago

For LLM inference at batch size 1, it's hard to saturate PCIe bandwidth, especially with less powerful chips, so you get close to linear scaling[1] (rough numbers in the sketch below). The obvious issue is that some things are harder across multiple GPUs, and a lot of software doesn't fully support it or isn't optimized for it.

[1]: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...
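
To put rough numbers on it (pure back-of-envelope, every value below is an assumed ballpark figure for a 70B-class model, 4 GPUs with tensor parallelism, and PCIe 4.0 x16):

    # ballpark assumptions, not measurements
    layers, hidden, dtype_bytes = 80, 8192, 2   # ~70B-class model, fp16 activations
    weights_gb = 140                            # fp16 weights
    vram_gbps  = 900                            # per-card VRAM bandwidth
    pcie_gbps  = 32                             # PCIe 4.0 x16, one direction
    n_gpus     = 4

    # tensor parallelism all-reduces the hidden state ~2x per layer per token
    comm_mb = 2 * layers * hidden * dtype_bytes / 1e6
    comm_ms = comm_mb / (pcie_gbps * 1e3) * 1e3

    # each GPU still streams its shard of the weights for every token
    read_ms = (weights_gb / n_gpus) / vram_gbps * 1e3

    print(f"PCIe traffic  ~{comm_mb:.1f} MB/token -> ~{comm_ms:.2f} ms")
    print(f"weight reads  ~{read_ms:.1f} ms/token per GPU (the actual bottleneck)")
    # sync latency per all-reduce is ignored here; that, not bandwidth,
    # is usually what eats into multi-GPU batch-1 scaling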

mythz 3 days ago

Also less power efficient, takes up more PCIe slots, and a lot of software doesn't support GPU clustering. I already have 4x 16 GB GPUs, which are unable to run large models exceeding 16 GB.

Currently running them in different VMs to be able to make full use of them. I used to have them running in different Docker containers, but OOM exceptions would frequently bring down the whole server; running them in VMs helped resolve that.
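
For what it's worth, the one-VM-per-GPU split ends up doing roughly the same thing as pinning each worker process to a single card via CUDA_VISIBLE_DEVICES, just with stronger isolation (a sketch; worker.py is a hypothetical single-GPU inference server):

    # sketch: one worker process per card, each seeing only "its" GPU
    import os
    import subprocess

    for gpu in range(4):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        # worker.py is hypothetical: any single-GPU inference server would do
        subprocess.Popen(
            ["python", "worker.py", "--port", str(8000 + gpu)], env=env
        )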

zargon 3 days ago

What's your application for high VRAM that doesn't leverage multiple GPUs?