oblio 12 hours ago
> And you should be able to get two cards and load half your model into each. It should be about the same speed as if a single card had 32GB. This seems super duper expensive and not really supported by the more reasonably priced Nvidia cards, though. SLI is deprecated, NVLink isn't available everywhere, etc.
Dylan16807 11 hours ago
No, no, nothing like that. The layers of an LLM run separately and sequentially, and the only thing passed between layers is a small activation tensor, so there's very little inter-layer data transfer. If you wanted to, you could put each layer on a separate GPU with no real penalty. A single request only runs on one GPU at a time, so it won't go faster than a single GPU with a big RAM upgrade, but it won't go slower either.
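A minimal PyTorch sketch of what that looks like (my own illustration, assuming two CUDA devices; the Linear layers stand in for transformer blocks):

```python
import torch
import torch.nn as nn

class TwoGpuPipeline(nn.Module):
    """Put the first half of the layers on GPU 0, the rest on GPU 1."""
    def __init__(self, layers):
        super().__init__()
        half = len(layers) // 2
        self.first = nn.Sequential(*layers[:half]).to("cuda:0")
        self.second = nn.Sequential(*layers[half:]).to("cuda:1")

    def forward(self, x):
        x = self.first(x.to("cuda:0"))
        # The only cross-GPU traffic: one activation tensor per forward pass.
        return self.second(x.to("cuda:1"))

# Stand-in for 32 transformer blocks; a real model would use attention blocks.
layers = [nn.Linear(4096, 4096) for _ in range(32)]
model = TwoGpuPipeline(layers)
out = model(torch.randn(1, 4096))
```

Each GPU holds its own weights permanently; only that single activation tensor crosses the bus, which is why plain PCIe is fine and you don't need NVLink or SLI for this.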