dpe82 2 days ago

> The main challenge is latency since you have to do much more frequent communication.

Earlier this year I experimented with building a cluster to do tensor parallelism across large-cache CPUs (the AMD EPYC 7773X has 768 MB of L3). My thought was to keep an entire model in SRAM to take advantage of the crazy memory bandwidth between CPU cores and their cache, and use Infiniband between nodes for the scatter/gather operations.
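For concreteness, here's a minimal sketch of that layout: a column-parallel matmul where each node holds only its weight shard (sized to fit in L3) and an all-gather reassembles the output. Gloo stands in for whatever MPI/verbs transport you'd actually run over Infiniband, and the shapes and one-rank-per-node mapping are my assumptions, not details from the experiment:

    # Hedged sketch: column-parallel linear layer, one rank per EPYC node.
    import torch
    import torch.distributed as dist

    def main():
        dist.init_process_group("gloo")  # placeholder for the IB transport
        rank, world = dist.get_rank(), dist.get_world_size()

        d_model, batch = 4096, 8          # illustrative sizes
        shard = d_model // world          # assumes d_model divisible by world

        # Each node keeps only its column shard of the weight, small enough
        # to stay resident in L3 cache.
        w_shard = torch.randn(d_model, shard)
        x = torch.randn(batch, d_model)   # activations replicated everywhere

        # Local partial matmul runs entirely out of cache...
        y_local = x @ w_shard             # (batch, shard)

        # ...then one gather over the fabric reassembles the full output.
        y_parts = [torch.empty_like(y_local) for _ in range(world)]
        dist.all_gather(y_parts, y_local)
        y = torch.cat(y_parts, dim=-1)    # (batch, d_model)

        if rank == 0:
            print(y.shape)
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with torchrun (which sets the rendezvous env vars), this does one fabric round-trip per layer per node, which is exactly where the latency problem below shows up.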

Turns out the sum of inter-core latency and PCIe latency absolutely dominates. The Infiniband fabric is damn fast once you get data to it, but getting it there quickly is a struggle. CXL would help but I didn't have the budget for newer hardware. Perhaps modern Apple hardware is better for this than x86 stuff.
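A back-of-the-envelope sketch of why: every hop adds a fixed latency floor per token that no amount of fabric bandwidth buys back. All constants below are illustrative assumptions, not measurements from this cluster:

    # Hypothetical latency budget for one generated token.
    LAYERS = 80             # e.g. a 70B-class model (assumption)
    HOPS_PER_LAYER = 2      # one scatter + one gather per layer (assumption)
    CORE_TO_NIC_US = 2.0    # core -> PCIe -> HCA, each direction (assumption)
    FABRIC_US = 1.0         # Infiniband switch traversal (assumption)

    per_token_us = LAYERS * HOPS_PER_LAYER * (2 * CORE_TO_NIC_US + FABRIC_US)
    print(f"latency floor: {per_token_us / 1000:.2f} ms/token "
          f"-> at most {1e6 / per_token_us:.0f} tokens/s before any compute")

With these numbers the host-side hops account for 4 of every 5 microseconds per hop, which is the "getting data to the fabric" problem in miniature.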

wmf 2 days ago | parent

That's how Groq works. A cluster of LPUv2s would probably be faster and cheaper than an Infiniband cluster of Epycs.

dpe82 2 days ago | parent | next

Yeah, I'm familiar; I was hoping I could do something related on previous-generation, commodity(ish) hardware. It didn't work, but I learned a ton.

fooblaster 2 days ago | parent | prev

What is an LPUv2?

wmf 2 days ago | parent

The chip that Groq makes.