derefr 4 days ago

But all of the most-ridiculous hyperscale deployments, where bandwidth and latency matter most, have multiple GPUs per CPU. The CPU is responsible for splitting, packing, and scheduling models and inference workloads across its own direct-attached GPUs, presenting the network with the abstraction of a single GPU that has more (NUMA) VRAM than any single physical GPU could have.
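
(Not from the thread, just to make that concrete: a minimal PyTorch sketch of what the host-side splitting can look like, i.e. one process pinning half of a model's layers in each of two direct-attached GPUs and exposing a single forward() to callers. The class name, layer sizes, and device IDs are illustrative assumptions, not anyone's actual stack.)

    # Rough illustration only: one host process owning two direct-attached GPUs
    # and presenting them as a single model with "more VRAM than one GPU" by
    # splitting layers across devices (naive pipeline split).
    # Assumes PyTorch and at least 2 CUDA devices; names and sizes are made up.
    import torch
    import torch.nn as nn

    class SplitAcrossLocalGPUs(nn.Module):
        def __init__(self, hidden=4096, layers_per_gpu=8):
            super().__init__()
            # First half of the stack lives in GPU 0's VRAM, second half in GPU 1's.
            self.stage0 = nn.Sequential(
                *[nn.Linear(hidden, hidden) for _ in range(layers_per_gpu)]
            ).to("cuda:0")
            self.stage1 = nn.Sequential(
                *[nn.Linear(hidden, hidden) for _ in range(layers_per_gpu)]
            ).to("cuda:1")

        def forward(self, x):
            # The CPU-side scheduler moves activations between its own GPUs;
            # callers just see one model whose weights span both devices.
            x = self.stage0(x.to("cuda:0"))
            x = self.stage1(x.to("cuda:1"))
            return x.cpu()

    if torch.cuda.device_count() >= 2:
        model = SplitAcrossLocalGPUs()
        out = model(torch.randn(1, 4096))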

How do you do that, if each GPU expects to be its own backplane? One CPU daughterboard per GPU, and then the CPU daughterboards get SLIed together into one big CPU using NVLink? :P

wmf 4 days ago

GPU-as-motherboard really only makes sense for gaming PCs. Even there, SXM might be easier.