| ▲ | ryandrake 3 hours ago |
| Would "lots of gpus" even help for huge models? Maybe this is exposing my lack of knowledge, but don't you need to keep the whole model and context in a single GPU's VRAM? My understanding is that multiple GPUs help with scaling (they can handle N× inference requests simultaneously) but not with using larger models. If they did, I could jam another GPU in my box and double the size of model I can serve. |
|
| ▲ | Kirby64 3 hours ago | parent | next [-] |
| > Would "lots of gpus" even help for huge models? Maybe this is exposing my lack of knowledge but don't you need to keep the whole model and context in a single GPU's VRAM? |
| How do you think the large providers do inference? No single GPU has 1 TB-plus of memory on board. It's a cluster of a bunch of GPUs. |
|
| ▲ | 2ndorderthought 3 hours ago | parent | prev [-] |
| 1T-parameter model instances (Opus, GPT, etc.) are not running on a single GPU. The catch is how the cards communicate and how the model is broken up. There's a lot that goes into it, but the answer is yes: the more GPUs, the bigger the model you can run. |
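A toy sketch of the layer-wise (pipeline-parallel) flavor of "breaking the model up": each "GPU" holds only a slice of the layers, so the total model can exceed any one device's memory, and only activations cross the interconnect. The device names and the 3/3 layer split here are made up for illustration; real systems use frameworks with actual device placement.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
num_layers = 6

# Full model: 6 weight matrices. Shard layers 0-2 onto "gpu0", 3-5 onto
# "gpu1" (labels only; no real GPU API is involved in this sketch).
layers = [rng.standard_normal((dim, dim)) for _ in range(num_layers)]
shards = {"gpu0": layers[:3], "gpu1": layers[3:]}

def forward(x):
    # Activations flow gpu0 -> gpu1; only the activation vector would
    # cross the (simulated) interconnect, never the weights.
    for device in ("gpu0", "gpu1"):
        for w in shards[device]:
            x = np.maximum(x @ w, 0.0)  # linear layer + ReLU
    return x

x = rng.standard_normal(dim)
out = forward(x)

# Sanity check: running all layers on one "device" gives the same result.
ref = x
for w in layers:
    ref = np.maximum(ref @ w, 0.0)
assert np.allclose(out, ref)
```

The other common flavor, tensor parallelism, instead splits each individual weight matrix across devices, which trades more frequent communication for better load balance.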
| |
| ▲ | ryandrake 2 hours ago | parent [-] |
| Really cool. I'm very much still learning about this stuff. Sounds like this inter-GPU communication is a feature of special hardware (not consumer GPUs). |
| ▲ | punchmesan an hour ago | parent | next [-] |
| Ever hear of SLI, or its successor NVLink? They're GPU interconnects that have been available for a good long while now on high-end Nvidia GPUs (AMD's equivalent of SLI was called CrossFire). GPU interconnect speed is a big bottleneck today for GPUs in AI applications: data can't move between them fast enough. |
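A back-of-envelope illustration of why link speed is the bottleneck: time to ship one tensor between GPUs at different bandwidths. The payload size and the bandwidth figures are rough, illustrative assumptions (roughly PCIe 4.0 x16 vs. an NVLink-class link), not exact specs for any card.

```python
# Time to move a tensor across a GPU interconnect at a given bandwidth.
# All numbers are illustrative assumptions, not vendor specs.

tensor_gb = 2.0                                  # hypothetical activation payload, GB
links_gb_per_s = {"pcie4_x16": 32.0,             # approx. PCIe 4.0 x16
                  "nvlink_class": 450.0}         # approx. NVLink-class link

for name, bw in links_gb_per_s.items():
    ms = tensor_gb / bw * 1000.0
    print(f"{name}: {ms:.1f} ms per transfer")
```

At these assumed numbers the slower link takes over 60 ms per transfer, which is why cross-GPU sharding is so sensitive to the interconnect.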
| ▲ | Tostino an hour ago | parent | prev | next [-] |
| Most consumer cards included faster interconnects until a generation or two ago, when Nvidia decided they wanted to differentiate their data center hardware more and removed the interconnects that had been on the cards, in various forms, for 20-plus years. |
| ▲ | 2ndorderthought 2 hours ago | parent | prev [-] |
| Not really; there are various ways it can be done, and I think even the old 1080 Tis could do it. Keep reading about it. My interest is in small models on a single GPU, though, so I don't fuss over those details. |
|
|