awnihannun 2 days ago

For a bit more context, those posts are using pipeline parallelism. For N machines, put the first L/N layers on machine 1, the next L/N layers on machine 2, and so on. With pipeline parallelism you don't get a speedup over one machine - it just buys you the ability to run larger models than you can fit on a single machine.
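
A minimal sketch of that layer assignment in plain Python (L and N are illustrative stand-ins for the layer and machine counts, not how MLX actually splits a model):

    # Pipeline parallelism: each machine gets a contiguous block of layers.
    L, N = 32, 4  # hypothetical layer count and machine count

    def layer_assignment(num_layers, num_machines):
        bounds = [i * num_layers // num_machines for i in range(num_machines + 1)]
        return [list(range(bounds[i], bounds[i + 1])) for i in range(num_machines)]

    for machine, layers in enumerate(layer_assignment(L, N)):
        print(f"machine {machine}: layers {layers[0]}..{layers[-1]}")

    # At inference time each machine runs its block and hands the hidden state
    # to the next one, so for a single request the machines mostly take turns -
    # more total memory, but no speedup.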

The release in Tahoe 26.2 will enable us to do fast tensor parallelism in MLX. Each layer of the model is sharded across all machines. With this type of parallelism you can get close to an N-times speedup for N machines. The main challenge is latency, since you have to communicate much more frequently.
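
As a rough illustration of the sharding (NumPy standing in for the real MLX kernels; all sizes are made up):

    import numpy as np

    # Tensor parallelism on a single linear layer: each machine holds a column
    # slice of the weight matrix and computes its slice of the output.
    np.random.seed(0)
    d_in, d_out, N = 8, 16, 4              # illustrative dimensions, N machines
    x = np.random.randn(1, d_in)
    W = np.random.randn(d_in, d_out)

    shards = np.split(W, N, axis=1)        # machine i holds its column slice of W
    partial = [x @ W_i for W_i in shards]  # computed concurrently, one per machine
    y = np.concatenate(partial, axis=1)    # collective ("all-gather") of the slices

    assert np.allclose(y, x @ W)
    # Every layer needs a collective like this, which is why tensor parallelism
    # is far more latency-sensitive than pipeline parallelism.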

dpe82 2 days ago | parent | next [-]

> The main challenge is latency since you have to do much more frequent communication.

Earlier this year I experimented with building a cluster to do tensor parallelism across large-cache CPUs (the AMD EPYC 7773X has 768 MB of L3). My thought was to keep an entire model in SRAM, take advantage of the crazy memory bandwidth between CPU cores and their cache, and use InfiniBand between nodes for the scatter/gather operations.

Turns out the sum of inter-core latency and PCIe latency absolutely dominates. The InfiniBand fabric is damn fast once you get data onto it, but getting it there quickly is a struggle. CXL would help, but I didn't have the budget for newer hardware. Perhaps modern Apple hardware is better for this than x86 stuff.

wmf 2 days ago | parent [-]

That's how Groq works. A cluster of LPUv2s would probably be faster and cheaper than an InfiniBand cluster of EPYCs.

dpe82 2 days ago | parent | next [-]

Yeah I'm familiar; I was hoping I could do something related on previous generation commodity(ish) hardware. It didn't work but I learned a ton.

fooblaster 2 days ago | parent | prev [-]

What is an LPUv2?

wmf 2 days ago | parent [-]

The chip that Groq makes.

aimanbenbaha 2 days ago | parent | prev | next [-]

Exo-Labs is an open source project that allows this too (pipeline parallelism, I mean, not the latter), and it's device agnostic: you can daisy-chain anything you have that has memory and the implementation will intelligently shard the model's layers across the devices. It's slow, but it scales linearly with concurrent requests.

Exo-Labs: https://github.com/exo-explore/exo

liuliu 2 days ago | parent | prev | next [-]

But that's only for prefill, right? Or is it beneficial for decoding too? (I guess you can do the KV lookup on shards; not sure how much speed-up that gives, though.)

zackangelo 2 days ago | parent | next [-]

No, you use tensor parallelism in both cases.

The way it typically works in an attention block is: smaller portions of the Q, K, and V linear layers are assigned to each node and processed independently. Attention, RoPE, norms, etc. are run on each node's output of that. Then, when the output linear layer is applied, an "all-reduce" is computed which combines the outputs of all the nodes.

EDIT: just realized it wasn't clear -- this means that each node ends up holding the portion of the KV cache specific to its KV tensor shards. This can change based on the specific style of attention (e.g., in GQA, where there are fewer KV heads than ranks, you end up having to do some replication, etc.).
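
A toy, single-process sketch of that flow (NumPy simulating the nodes; the head counts, dimensions, and the all-reduce-as-a-plain-sum are all illustrative, not the MLX implementation):

    import numpy as np

    np.random.seed(0)
    T, H, D, N = 5, 8, 16, 4                  # tokens, heads, head dim, "nodes"
    x = np.random.randn(T, H * D)
    Wq, Wk, Wv = (np.random.randn(H * D, H * D) for _ in range(3))
    Wo = np.random.randn(H * D, H * D)

    def softmax(a):
        e = np.exp(a - a.max(-1, keepdims=True))
        return e / e.sum(-1, keepdims=True)

    def attend(q, k, v, heads, d):
        out = []
        for h in range(heads):
            hs = slice(h * d, (h + 1) * d)
            att = softmax(q[:, hs] @ k[:, hs].T / np.sqrt(d))
            out.append(att @ v[:, hs])
        return np.concatenate(out, axis=1)

    # Each "node" owns H/N heads: a column slice of Wq/Wk/Wv and the matching
    # row slice of Wo. Attention (and its slice of the KV cache) stays local.
    partials = []
    for i in range(N):
        cols = slice(i * (H // N) * D, (i + 1) * (H // N) * D)
        q, k, v = x @ Wq[:, cols], x @ Wk[:, cols], x @ Wv[:, cols]
        local = attend(q, k, v, H // N, D)
        partials.append(local @ Wo[cols, :])  # row slice of the output projection

    y = sum(partials)  # the "all-reduce" that combines the nodes' outputs

    # Sanity check against the unsharded computation.
    full = attend(x @ Wq, x @ Wk, x @ Wv, H, D) @ Wo
    assert np.allclose(y, full)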

liuliu 2 days ago | parent [-]

I usually call it "head parallelism" (which is a type of tensor parallelism, but parallelized for small clusters and specific to attention). That is what you described: shard the input tensor by number of heads and send each piece to its respective Q, K, V shard. Each shard can do its Q / K / V projections, RoPE, QK norm, whatever, and attention entirely within that shard. The output projection is done in that shard too, but then an all-reduce sum among the shards is needed to get the final output projection broadcast to every participating shard, which then carries on to do whatever else itself.

What I am asking, however, is whether that will speed up decoding as linearly as it does prefill.

awnihannun 2 days ago | parent [-]

Right, my comment was mostly about decoding speed. For prefill you can get a speedup too, but there you are less latency bound.

In our benchmarks with MLX / mlx-lm it's as much as 3.5x for token generation (decoding) at batch size 1 across 4 machines. In that case you are memory-bandwidth bound, so sharding the model and KV cache 4 ways means each machine only needs to access 1/4 as much memory.
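
A back-of-envelope version of that argument (all numbers below are made up for illustration, not MLX benchmark figures):

    # Decoding at batch size 1 is roughly memory-bandwidth bound: each token
    # requires streaming the (sharded) weights and KV cache past the compute.
    model_gb  = 70.0   # hypothetical weights + KV cache, GB
    bandwidth = 400.0  # hypothetical per-machine memory bandwidth, GB/s
    comm_ms   = 2.0    # hypothetical per-token communication overhead, ms
    N         = 4      # machines

    single  = model_gb / bandwidth * 1000                   # ms per token, 1 machine
    sharded = (model_gb / N) / bandwidth * 1000 + comm_ms   # ms per token, N machines
    print(f"1 machine:  {single:.1f} ms/token")
    print(f"{N} machines: {sharded:.1f} ms/token ({single / sharded:.2f}x faster)")
    # Each machine touches only 1/N of the bytes per token, so the speedup
    # approaches N; per-token communication latency is what keeps it below N.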

liuliu 2 days ago | parent [-]

Oh! That's great to hear. Congrats! Now, I want to get the all-to-all primitives ready in s4nnc...

monster_truck 2 days ago | parent | prev [-]

Even if it weren't outright beneficial for decoding by itself, it would still allow you to connect a second machine running a smaller, more heavily quantized version of the model for speculative decoding, which can net you >4x without quality loss.
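
A stripped-down sketch of that loop (greedy acceptance only; draft_model and target_model are hypothetical stand-ins, not an actual MLX API):

    # Speculative decoding: a small draft model proposes k tokens, the large
    # target model checks them in one batched pass, and the matching prefix
    # is kept, so several tokens can be accepted per expensive forward pass.
    def draft_model(tokens, k):
        # Hypothetical cheap model: propose the next k tokens.
        return [(tokens[-1] + i + 1) % 100 for i in range(k)]

    def target_model(tokens):
        # Hypothetical expensive model: greedy next-token choice after every
        # prefix of `tokens`, as a causal model gives in one forward pass.
        return [(t + 1) % 100 for t in tokens]

    def speculative_step(tokens, k=4):
        proposal = draft_model(tokens, k)
        verified = target_model(tokens + proposal[:-1])[-k:]  # target's pick at each drafted position
        accepted = []
        for drafted, wanted in zip(proposal, verified):
            if drafted != wanted:
                accepted.append(wanted)  # take the target's token and stop
                break
            accepted.append(drafted)
        return tokens + accepted

    tokens = [1, 2, 3]
    for _ in range(3):
        tokens = speculative_step(tokens)
    print(tokens)  # same sequence the target model alone would generate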
