singhrac 4 days ago

This is almost true but not quite - I don't think much of the (dollar) spend on enterprise GPUs (H100, B200, etc.) would transfer if there were a 128 GB consumer card. The problem is both memory bandwidth (HBM) and interconnect (NVLink), which NVIDIA definitely uses to segment consumer vs. enterprise hardware.

I think your argument is still true overall, though, since there are a lot of "gpu poors" (e.g. grad students) who write/invent in the CUDA ecosystem, and they often work in single-card settings.

Fwiw Intel did try this with Arctic Sound / Ponte Vecchio, but it was late out the door and did not really perform (see https://chipsandcheese.com/p/intels-ponte-vecchio-chiplets-g...). It seems like they took on a lot of technical risk; hopefully some of that transfers over to a future project, though Falcon Shores was cancelled. They really should have released some of those chips even at a loss, but I don't know the cost of a tape-out.

AnthonyMouse 3 days ago

NVLink matters if you want to combine a whole bunch of GPUs, e.g. when you need more VRAM than any individual GPU ships with. Many workloads don't care about that or don't have working sets that large, particularly if the individual GPU actually has a lot of VRAM. If you need 128GB and your GPUs have 40GB of VRAM each, you need a fast interconnect. If you can get an individual GPU with 128GB, you don't.

There is also work being done to make this even less relevant, because people are already interested in e.g. running a 64GB model on four 16GB cards without a fast interconnect. The simpler implementation is to put a quarter of the model on each card, split in the order the layers are used, and then get the performance equivalent of one card with 64GB of VRAM by only doing work on the card that holds that section of the model, and then moving the (much smaller) output to the next card. A more sophisticated implementation does something similar but exploits parallelism by e.g. running four batches at once, each offset by a quarter, so that all the cards stay busy. Not all workloads can be split like this, but it works for some of the important ones.
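To make the simpler split concrete, here's a minimal sketch in PyTorch, assuming an nn.Sequential model and four CUDA devices; the function names and the even layer split are illustrative, not any particular library's API:

    import torch.nn as nn

    # Illustrative only: spread the layers of an nn.Sequential over n GPUs,
    # keeping each stage's weights resident on its own card.
    def split_across_gpus(model: nn.Sequential, n_gpus: int = 4):
        layers = list(model.children())
        per_stage = (len(layers) + n_gpus - 1) // n_gpus
        return [
            nn.Sequential(*layers[i * per_stage:(i + 1) * per_stage]).to(f"cuda:{i}")
            for i in range(n_gpus)
        ]

    def forward(stages, x):
        # Only one card works at a time, so throughput matches a single card,
        # but the weights are spread across all of them. Only the (much
        # smaller) activations move between cards.
        for i, stage in enumerate(stages):
            x = stage(x.to(f"cuda:{i}"))
        return x

The pipelined variant keeps the same placement but keeps several batches in flight at once, so card 0 can start on the next batch while card 1 is still working on the previous one.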

singhrac 3 days ago

I think we might just disagree about how much of the GPU spend is on small vs. large models (inference or training). My guess is that something like 99.9% of the spend goes to models that don't fit into 128 GB (remember the KV cache takes memory too). Happy to be proven wrong!
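For a rough sense of why the KV cache matters, here's a back-of-the-envelope calculation with illustrative numbers (an assumed 70B-class model with grouped-query attention; the exact figures depend on the architecture and serving setup):

    # KV cache size: 2 (K and V) * layers * kv_heads * head_dim
    #                * seq_len * batch * bytes_per_element
    layers, kv_heads, head_dim = 80, 8, 128      # assumed model shape
    seq_len, batch, bytes_fp16 = 32_768, 8, 2    # assumed serving config

    kv_cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_fp16
    print(f"{kv_cache_bytes / 2**30:.0f} GiB")   # 80 GiB, on top of the weights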