angoragoats 16 hours ago

> Also, as the OP noted, this setup can support up to 4 Mac devices because each Mac must be connected to every other Mac!! All the more reason for Apple to invest in something like QSFP.

This isn’t any different with QSFP unless you’re suggesting that one adds a 200GbE switch to the mix, which:

* Adds thousands of dollars of cost,

* Adds 150W or more of power draw, along with the loud fan noise that comes with it,

* And perhaps most importantly adds measurable latency to a networking stack that is already higher latency than the RDMA approach used by the TB5 setup in the OP.

fenced_load 16 hours ago | parent | next [-]

Mikrotik has a switch that can do 6x200g for ~$1300 and <150W.

https://www.bhphotovideo.com/c/product/1926851-REG/mikrotik_...

throwaway2037 7 hours ago | parent | next [-]

Wow, this switch (MikroTik CRS812) is scary good for the price point. A quick Google search fails to find any online vendors with stock. I guess it is very popular! Retail price will be <= 1300 USD.

I did some digging to find the switching chip: Marvell 98DX7335

Seems confirmed here: https://cdn.mikrotik.com/web-assets/product_files/CRS812-8DS...

And here: https://cdn.mikrotik.com/web-assets/product_files/CRS812-8DS...

    > Switch chip model 98DX7335
From Marvell's specs: https://www.marvell.com/content/dam/marvell/en/public-collat...

    > Description: 32x50G / 16x100G-R2 / 8x100G-R4 / 8x200G-R4 / 4x400G-R8
    > Bandwidth: 1600Gbps
Again, those are some wild numbers if I have the correct model. Normally, Mikrotik includes switching bandwidth in their own specs, but not in this case.
cess11 5 hours ago | parent [-]

They are very popular and make quite good products, but as you noticed it can be tricky to find them in stock.

Besides stuff like this switch, they've also produced pretty cool little micro-switches you can power over PoE and run as WLAN hotspots, e.g. to put some distance between your mobile device and a network you don't really trust, or to more or less maliciously bridge a wired network through a wall when your access to the building is limited.

wtallis 15 hours ago | parent | prev | next [-]

That switch appears to have 2x 400G ports, 2x 200G ports, 8x 50G ports, and a pair of 10G ports. So unless it allows bonding together the 50G ports (which the switch silicon probably supports at some level), it's not going to get you more than four machines connected at 200+ Gbps.

angoragoats 15 hours ago | parent [-]

As with most 40+GbE ports, the 400Gbit ports can be split into 2x200Gbit ports with the use of special cables. So you can connect a total of 6 machines at 200Gbit.

wtallis 15 hours ago | parent | next [-]

Ah, good point. Though if splitter cables are an option, then it seems more likely that the 50G ports could be combined into a 200G cable. Marvell's product brief for that switch chip does say it's capable of operating as an 8x 200G or 4x 400G switch, but Mikrotik may need to do something on their end to enable that configuration.
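
Back-of-the-envelope, the port math looks like this (my own sketch in Python; the 50G grouping is speculative until Mikrotik actually exposes it):

    # Rough port-budget sketch for the CRS812 being discussed.
    # Assumption (not confirmed by Mikrotik): the 400G ports break out
    # into 2x200G, and the eight 50G ports *might* be groupable 4-to-1
    # if the firmware ever exposes the chip's 8x200G mode.

    LANE_GBPS = 50  # the Marvell 98DX7335 is built around 50G lanes

    # front panel: port speed (Gbps) -> (port count, lanes per port);
    # the two 10G ports are ignored, assuming they sit outside this lane budget
    front_panel = {400: (2, 8), 200: (2, 4), 50: (8, 1)}

    lanes_used = sum(count * lanes for count, lanes in front_panel.values())
    print(f"lanes used: {lanes_used}/32 -> {lanes_used * LANE_GBPS} of 1600 Gbps")

    native_200g = 2                                   # the two 200G ports as-is
    with_400g_breakout = native_200g + 2 * 2          # each 400G port -> 2x200G
    with_50g_grouping = with_400g_breakout + 8 // 4   # speculative: 4x50G -> 1x200G

    print("hosts at 200G, native ports only:      ", native_200g)
    print("hosts at 200G, with 400G->2x200G:      ", with_400g_breakout)  # 6
    print("hosts at 200G, if 50G ports group 4:1: ", with_50g_grouping)   # 8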

throwaway2037 6 hours ago | parent | next [-]

I'm not trolling here: Do you think that Marvell sells the chips wholesale but the vendor buys the feature set (IP/drivers/whatever)? That would allow Marvell to effectively sell the same silicon but segment the market depending on what each buyer needs. Example: one buyer might need a config that is just a bunch of 50Gb/s ports, another all 100Gb/s ports, and another a mix. (I'm thinking of blowing fuses in the manufacturing phase, similar to what AMD and Intel do.) I write this as a complete noob in switching hardware.

wtallis 43 minutes ago | parent [-]

I think if Marvell were doing that, they would have more part numbers in their catalog.

angoragoats 4 hours ago | parent | prev [-]

You’re talking about link aggregation (LACP) here, which requires specific settings on both the switch and client machine to enable, as well as multiple ports on the client machine (in your example, multiple 50Gbps ports). So while it’s likely possible to combine 50Gbps ports like you describe, that’s not what I was referring to.

wtallis 44 minutes ago | parent [-]

No, I'm not talking about LACP, I'm talking about configuring four 50Gb links on the switch to operate as a single 200Gb link as if those links were wired up to a single QSFP connector instead of four individual SFP connectors.

The switch in question has eight 50Gb ports, and the switch silicon apparently supports configurations that use all of its lanes in groups of four to provide only 200Gb ports. So it might be possible with the right (non-standard) configuration on the switch to be able to use a four-way breakout cable to combine four of the 50Gb ports from the switch into a single 200Gb connection to a client device.

sgjohnson 15 hours ago | parent | prev [-]

Breakout cables typically split to 4.

e.g. QSFP28 (100GbE) splits into 4x SFP28s (25GbE each), because QSFP28 is just 4 lanes of SFP28.

Same goes for QSFP112 (400GbE). Splits into SFP112s.

It’s OSFP that can be split in half, i.e. into QSFPs.
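
Roughly, the lane arithmetic (my own summary; worth double-checking against the actual module specs before buying cables):

    # Breakout cables split a port along its electrical lanes, so what a
    # module can fan out to follows from its lane count and per-lane rate.

    modules = {            # form factor -> (lanes, Gbps per lane)
        "SFP28":   (1, 25),
        "QSFP28":  (4, 25),    # 100GbE = 4x25G, hence 4x SFP28 breakouts
        "SFP112":  (1, 100),
        "QSFP112": (4, 100),   # 400GbE = 4x100G, hence 4x SFP112 breakouts
        "OSFP":    (8, 100),   # 8 lanes, so it can split in half into QSFPs
    }

    def breakouts(parent, child):
        """How many `child` ports one `parent` port can fan out to,
        assuming matching per-lane rates and a suitable cable."""
        p_lanes, p_rate = modules[parent]
        c_lanes, c_rate = modules[child]
        return p_lanes // c_lanes if p_rate == c_rate and p_lanes % c_lanes == 0 else 0

    print(breakouts("QSFP28", "SFP28"))     # 4
    print(breakouts("QSFP112", "SFP112"))   # 4
    print(breakouts("OSFP", "QSFP112"))     # 2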

angoragoats 15 hours ago | parent [-]

Here’s an example of the cables I was referring to that can split a single 400Gbit QSFP56-DD port to two 200Gbit ports:

https://www.fs.com/products/101806.html

But all of this is pretty much irrelevant to my original point.

angoragoats 15 hours ago | parent | prev [-]

Cool! So for marginally less in cost and power usage than the numbers I quoted, you can get 2 more machines than with the RDMA setup. And you’ve still not solved the thing that I called out as the most important drawback.

nicky_nickell 15 hours ago | parent [-]

how significant is the latency hit?

angoragoats 15 hours ago | parent [-]

The OP makes reference to this with a link to a GitHub repo that has some benchmarks. TCP over Thunderbolt compared to RDMA over Thunderbolt has roughly 7-10x higher latency, ~300us vs 30-50us. I would expect TCP over 200GbE to have similar latency to TCP over Thunderbolt.

Put another way, see the graphs in the OP where he points out that the old way of clustering performs worse the more machines you add? I’d expect that to happen with 200GbE also.

And with a switch, it would likely be even worse, since the hop to the switch adds additional latency that isn’t a factor in the TB5 setup.
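
To put rough numbers on it (my own toy model, using the latency figures from those benchmarks and a guessed per-hop penalty for the switch):

    # Toy model: a ring all-reduce takes 2*(N-1) communication steps, and
    # for small messages each step is dominated by link latency. Bandwidth
    # is ignored; this is just the latency floor as machines are added.

    def allreduce_latency_us(n_machines, per_hop_us):
        return 2 * (n_machines - 1) * per_hop_us

    scenarios = {
        "RDMA over TB5 (~40us)":            40,
        "TCP over TB5 (~300us)":            300,
        "TCP over 200GbE + switch (guess)": 305,   # assume a few extra us for the switch hop
    }

    for n in (2, 4, 6, 8):
        for name, hop in scenarios.items():
            print(f"{n} machines, {name}: {allreduce_latency_us(n, hop)} us")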

wmf 14 hours ago | parent | next [-]

You're ignoring RoCE, which would have the same or lower latency than RDMA over Thunderbolt. And I think macOS already supports RoCE.

angoragoats 4 hours ago | parent [-]

MacOS does not support RoCE.

Hikikomori 5 hours ago | parent | prev [-]

The switch probably does cut-through switching, so it starts forwarding a frame before it's even fully received.
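
For scale, the serialization delay a store-and-forward switch would add per hop (versus cut-through, which only waits for the header) is tiny at these line rates:

    # Store-and-forward must buffer the whole frame before forwarding it.
    frame_bytes = 9000            # jumbo frame; 1500 for standard MTU
    line_rate_bps = 200e9         # 200GbE

    delay_us = frame_bytes * 8 / line_rate_bps * 1e6
    print(f"{delay_us:.2f} us per hop")   # ~0.36 us (about 0.06 us at 1500 bytes)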

SoftTalker 12 hours ago | parent | prev [-]

For RDMA you'd want InfiniBand, not Ethernet.

johncolanduoni 12 hours ago | parent [-]

RDMA for new AI/HPC clusters is moving toward Ethernet (the keyword to look for is RoCE). Ethernet gear is so much cheaper that you can massively over-provision to make up for some of the disadvantages of asynchronous networking, and it lets you run jobs on hyperscalers (only Azure ever supported actual IB). Most HPC is not latency-sensitive enough to need InfiniBand's lower jitter and median latency, and vendors have mostly caught up on the hardware acceleration front.