spwa4 7 hours ago

Yeah, PCIe is the bottleneck. The point is that whether the data originates from RAM, NVMe, or Optane, you cannot get it to the GPU any faster from RAM than from SSDs.
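
To put rough numbers on that, here is a quick sketch. The bandwidth figures are my own ballpark assumptions (PCIe 4.0 x16, dual-channel DDR4-3200, one fast Gen4 NVMe drive), not measurements, and `to_gpu` is just an illustrative helper:

```python
# Rough, assumed peak figures in GB/s; ballpark numbers, not measurements.
PCIE4_X16 = 31.5   # approx. usable PCIe 4.0 x16 bandwidth into the GPU
DDR4_DUAL = 51.2   # approx. dual-channel DDR4-3200 peak
NVME_GEN4 = 7.0    # approx. sequential read of one fast Gen4 NVMe drive


def to_gpu(source_gbps, link_gbps=PCIE4_X16):
    """Delivered rate is capped by the slower of the source and the GPU link."""
    return min(source_gbps, link_gbps)


from_ram = to_gpu(DDR4_DUAL)        # RAM is faster than the link, so the link wins
from_ssds = to_gpu(5 * NVME_GEN4)   # five drives together also exceed the link
print(from_ram, from_ssds)
```

Under those assumptions both sources end up pinned at the same PCIe ceiling.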

Meanwhile PCIe switches exist. So why not build:

1 CPU + memory + ...

N PCIe switches, each with 1 low-memory GPU + 6 NVMe drives (in theory 5 can saturate the GPU)

Each of those should only bother the CPU when it has produced some tokens, and each has plenty of PCIe lanes to get at its own data.

Such a setup should be able to get a 6 to 8 times speedup over the solution detailed here, and an increase in model compute should make relatively little difference in performance.
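
Back-of-the-envelope version of the above, as a sketch only. It assumes roughly 31.5 GB/s for a PCIe 4.0 x16 GPU link and roughly 7 GB/s per NVMe drive (my ballpark figures), and that streaming bandwidth is the limit and scales linearly with the number of switch + GPU nodes, which a real system may not achieve:

```python
import math

# Assumed ballpark figures in GB/s; not measurements.
GPU_LINK_GBPS = 31.5   # approx. PCIe 4.0 x16 into one GPU
DRIVE_GBPS = 7.0       # approx. one fast Gen4 NVMe drive, sequential read


def drives_to_saturate(link_gbps=GPU_LINK_GBPS, drive_gbps=DRIVE_GBPS):
    """Smallest number of drives whose combined read rate covers the GPU link."""
    return math.ceil(link_gbps / drive_gbps)


def aggregate_gbps(n_nodes, drives_per_node=6):
    """Total streaming rate across n_nodes, each capped by its own GPU link."""
    per_node = min(drives_per_node * DRIVE_GBPS, GPU_LINK_GBPS)
    return n_nodes * per_node


print(drives_to_saturate())                   # drives needed per GPU
print(aggregate_gbps(8) / aggregate_gbps(1))  # ideal scaling with 8 nodes
```

With these numbers 5 drives cover one GPU link, so the sixth drive is headroom, and the ideal aggregate bandwidth grows with the node count because each GPU gets its own link behind its own switch.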