sspiff 2 days ago

There is no way to achieve a high-throughput, low-latency connection between 25 Strix Halo systems. After accounting for storage and network, there are barely any PCIe lanes left to link two of them together.

You might be able to use USB4, but I'm unsure what the latency is like for that.

0manrho 2 days ago

In general I agree with you, the IO options exposed by Strix Halo are pretty limited, but if we're getting technical, you can tunnel PCIe over USB4v2 per the spec in a way that's functionally similar to Thunderbolt 5. That gives you essentially 3 sets of native PCIe4x4 from the chipset and an additional 2 sets tunnelled over USB4v2. Not all TB5 and USB4 controllers are made equal, so in practice YMMV. Whether it's USB4v2 or TB5, you'll take a minor latency hit.

Strix Halo IO topology: https://www.techpowerup.com/cpu-specs/ryzen-ai-max-395.c3994

Framework's mainboard implements 2 of those PCIe4x4 GPP interfaces as M.2 PHYs, to which you can connect a standard PCIe AIC (like a NIC or DPU) using a passive adapter. Interestingly, it also exposes that 3rd x4 GPP as a standard x4-length PCIe CEM slot, though the system/case isn't compatible with actually installing a standard PCIe add-in card in there without getting hacky with it, especially as it's not an open-ended slot.

You absolutely could slap 1x SSD in there for local storage, and then attach up to 4x RDMA-capable NICs to a RoCE-enabled switch (or InfiniBand if you're feeling special) to build out a Strix Halo cluster (and you could do similar with Mac Studios, to be fair). You could get really extra by using a DPU/SmartNIC that allows you to boot from an NVMe-oF SAN, leveraging all 5 sets of PCIe4x4 for connectivity without any local storage, but at that point we're hitting a complexity/cost threshold that I doubt most people want to cross. And those who are willing to cross it would also be looking at other solutions that are better suited to the job and don't require as many workarounds.
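For a rough sense of what those x4 links can carry, here is a back-of-envelope sketch. It assumes "PCIe4x4" means PCIe 4.0 x4 (16 GT/s per lane, 128b/130b encoding) and ignores packet overhead, so real NIC throughput will be lower. It also treats the USB4v2-tunnelled links as if they were native, which is an optimistic assumption. The function names are made up for illustration.

```python
# Back-of-envelope PCIe link bandwidth. PCIe 4.0 signals at 16 GT/s per lane
# with 128b/130b line encoding; packet (TLP) overhead is NOT modelled here.

def pcie_link_gbps(gt_per_s: float = 16.0, lanes: int = 4) -> float:
    """Line rate of one PCIe link in Gbit/s, after 128b/130b encoding."""
    return gt_per_s * lanes * 128 / 130


def aggregate_gbps(num_links: int) -> float:
    """Upper bound for several identical PCIe 4.0 x4 links used in parallel."""
    return num_links * pcie_link_gbps()


if __name__ == "__main__":
    per_link = pcie_link_gbps()
    print(f"one PCIe 4.0 x4 link: {per_link:.1f} Gbit/s ({per_link / 8:.2f} GB/s)")
    # 4 links = one SSD plus four NICs is 5 total; 5 = the diskless variant.
    for n in (1, 4, 5):
        print(f"{n} link(s): {aggregate_gbps(n):.1f} Gbit/s upper bound")
```

On these numbers a single x4 link tops out at roughly 63 Gbit/s before protocol overhead, so a 100GbE NIC behind one of them could not run at line rate.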

Apple's solution is better for a small cluster, both in pure connectivity terms and with respect to its memory advantages, but Strix Halo is doable. However, in both cases, once you scale beyond 3 or especially 4 nodes, you rapidly enter complexity and cost territory that is better served by less restrictive nodes, unless you have some very niche reason to use either Macs (especially non-Pro) or Strix Halo specifically.

bee_rider 2 days ago

Do they need fast storage, in this application? Their OS could be on some old SATA drive or whatever. The whole goal is to get them on a fast network together; the models could be stored on some network filesystem as well, right?

pests 2 days ago

It's more than just the model weights. During inference there would be a lot of cross-talk as each node broadcasts its results and gathers up what it needs from the others for the next step.
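To see why latency matters as much as bandwidth for that cross-talk, here is a minimal sketch of a textbook ring all-gather cost model. It is an assumption that the exchange looks like an all-gather at all; the real pattern depends on how the model is split across nodes. The function name and the example numbers (4 nodes, 16 KiB per node per step, 50 Gbit/s links, 5 µs per hop) are purely illustrative, not measurements.

```python
def ring_all_gather_seconds(nodes: int, shard_bytes: int,
                            link_gbps: float, hop_latency_s: float) -> float:
    """Estimated time for one ring all-gather.

    Each node forwards one shard to its neighbour per step, and it takes
    (nodes - 1) steps until every node holds every shard.
    """
    steps = nodes - 1
    bytes_per_s = link_gbps * 1e9 / 8
    return steps * (shard_bytes / bytes_per_s + hop_latency_s)


if __name__ == "__main__":
    # Illustrative numbers only.
    total = ring_all_gather_seconds(nodes=4, shard_bytes=16 * 1024,
                                    link_gbps=50.0, hop_latency_s=5e-6)
    latency_part = (4 - 1) * 5e-6
    print(f"all-gather: {total * 1e6:.1f} us, "
          f"of which latency: {latency_part * 1e6:.1f} us")
```

With small per-step messages like these, the fixed per-hop latency is the larger share of the total, and that cost is paid on every exchange during inference.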

icedchai 2 days ago

I figured, but it's good to have confirmation.