By the way, I wonder which has more performance: a $25 000 professional GPU or a bunch of cheaper consumer GPUs costing $25 000 in total?

omneity 9 days ago | parent [-]

Consumer GPUs, in theory and by a large margin: ten RTX 5090s will eat an H100's lunch, with 6 times the bandwidth, 3x the VRAM and a relatively similar compute ratio. But your bottleneck is the interconnect, and that is intentionally crippled to keep Beowulf-style GPU clusters from eating into NVIDIA's datacenter market.
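A rough back-of-the-envelope check of those ratios, as a sketch only. The spec figures are assumptions taken from public spec sheets (RTX 5090: 32 GB, about 1792 GB/s; H100 SXM: 80 GB, about 3350 GB/s), not numbers from this thread, and the `Gpu` class and `aggregate_ratio` helper are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    vram_gb: float      # on-board memory capacity
    mem_bw_gbps: float  # local memory bandwidth in GB/s


# Assumed spec-sheet values; other H100 variants shift the ratios.
RTX_5090 = Gpu("RTX 5090", vram_gb=32, mem_bw_gbps=1792)
H100_SXM = Gpu("H100 SXM", vram_gb=80, mem_bw_gbps=3350)


def aggregate_ratio(consumer: Gpu, count: int, pro: Gpu) -> dict:
    """Sum the consumer cards' local resources and divide by the single pro card.

    This ignores the interconnect entirely, so it is an upper bound that only
    holds for workloads that shard with little inter-GPU traffic.
    """
    return {
        "vram": count * consumer.vram_gb / pro.vram_gb,
        "bandwidth": count * consumer.mem_bw_gbps / pro.mem_bw_gbps,
    }


ratios = aggregate_ratio(RTX_5090, 10, H100_SXM)
print(f"VRAM: {ratios['vram']:.1f}x, bandwidth: {ratios['bandwidth']:.1f}x")
```

Under these assumed specs the aggregate lands in the same ballpark as the figures quoted above, but it is only reachable when the cards rarely need to talk to each other, which is exactly where the interconnect limit bites.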

The last consumer GPU with NVLink was the RTX 3090. Even the workstation-grade GPUs have lost it.

https://forums.developer.nvidia.com/t/rtx-a6000-ada-no-more-...

		sigbottle 9 days ago | parent [-]
		H100s also have custom async WGMMA instructions, among other things. From what I understand, the async instructions formalize the notion of pipelining, which engineers were already using implicitly: to optimize memory accesses, you're effectively trying to overlap them in that same parallel manner.
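To make the pipelining point concrete, here is a toy timing model, assuming the overlap in question is between loading the next tile and computing on the current one. It is not WGMMA or any CUDA API; the function names and the tile timings are hypothetical, and it assumes enough buffers that loads never stall waiting for a free slot.

```python
def serial_time(n_tiles: int, load: float, compute: float) -> float:
    """Each tile is loaded, then computed, with no overlap at all."""
    return n_tiles * (load + compute)


def pipelined_time(n_tiles: int, load: float, compute: float) -> float:
    """Loads are issued back to back while earlier tiles are being computed.

    A tile's compute starts once its own load has finished AND the previous
    tile's compute has finished.
    """
    load_done = 0.0
    compute_done = 0.0
    for _ in range(n_tiles):
        load_done += load
        compute_done = max(load_done, compute_done) + compute
    return compute_done


if __name__ == "__main__":
    n, load, compute = 8, 2.0, 3.0  # hypothetical time units per tile
    print("serial:   ", serial_time(n, load, compute))
    print("pipelined:", pipelined_time(n, load, compute))
```

In this model the pipelined total works out to one load, plus the slower of the two stages for every remaining tile, plus one final compute, so the faster stage is almost entirely hidden behind the slower one.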