littlestymaar 3 hours ago

It doesn't need to: during inference there's very little data exchange between one chip and the next (just a single hidden-state vector per token at each pipeline boundary).

It's completely different during training, where the backward pass and weight updates put a lot of strain on inter-chip communication; but for inference, even a PCIe 4.0 x4 link is enough to connect GPUs together without losing speed.
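A quick back-of-envelope check of the claim, as a sketch. The model size, hidden dimension, and generation speed below are illustrative assumptions (a Llama-70B-class model with hidden size 8192, fp16 activations, 50 tokens/s), not figures from the comment:

```python
# Assumptions (not from the comment): hidden size 8192, fp16, 50 tokens/s.
HIDDEN_DIM = 8192          # assumed model hidden dimension
BYTES_PER_VALUE = 2        # fp16 activation
TOKENS_PER_SECOND = 50     # assumed generation speed

# In pipeline-parallel inference, each stage forwards one hidden-state
# vector per token to the next GPU.
bytes_per_token = HIDDEN_DIM * BYTES_PER_VALUE            # 16 KiB
required_bw = bytes_per_token * TOKENS_PER_SECOND         # bytes/s

# PCIe 4.0: 16 GT/s per lane with 128b/130b encoding; x4 = 4 lanes.
pcie4_x4_bw = 16e9 / 8 * (128 / 130) * 4                  # ~7.9 GB/s

print(f"needed: {required_bw / 1e6:.2f} MB/s")
print(f"available: {pcie4_x4_bw / 1e9:.2f} GB/s")
print(f"link utilization: {required_bw / pcie4_x4_bw:.6f}")
```

Under these assumptions the per-token traffic is well under a megabyte per second, a tiny fraction of what PCIe 4.0 x4 provides, which is why the link doesn't bottleneck inference the way it would training.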