Remix.run Logo
johncolanduoni 12 hours ago

RDMA for new AI/HPC clusters is moving toward ethernet (the keyword to look for is RoCE). Ethernet gear is so much cheaper that you can massively over-provision to make up for some of the disadvantages of asynchronous networking, and it lets your run jobs on hyperscalers (only Azure ever supported actual IB). Most HPC is not latency-sensitive enough that it needs Infiniband’s lower jitter/median, and vendors have mostly caught up on the hardware acceleration front.