Aurornis an hour ago
Excellent article. The game benchmarks are fun, but the LLM improvements are where this gets really interesting for practical use. I love Apple platforms as an approachable way to run local models with a lot of RAM, but their relatively slow prompt processing speed is often overlooked.

> Here you can see the big issue with Macs: the prompt processing (aka "prefill") speed. It just gets worse and worse, the longer the prompt gets. At a 4K-token prompt, which doesn't seem very long, it takes 17 seconds for the M4 MacBook Air to parse before we even start generating a response. Meanwhile, if you strap the eGPU to it, it'll only take 150ms. It's 120x faster.

The prefill problem goes unnoticed when you're playing around with the LLM in small chats. When you start trying to use it for bigger pieces of work, the compute limit becomes a bottleneck. The time-to-first-token (TTFT) charts don't look bad until you notice that they had to be shown on a logarithmic scale because the Mac platforms were so much slower than full GPU compute.
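The scaling here can be sketched with a back-of-envelope model: prefill cost grows roughly linearly with prompt length (on the order of 2 FLOPs per model parameter per prompt token for a dense transformer), so TTFT is dominated by how many FLOP/s the hardware can actually sustain. All numbers below are illustrative assumptions, not measurements from the article:

```python
def prefill_seconds(params: float, prompt_tokens: int, tflops: float) -> float:
    """Rough prefill time: ~2 FLOPs per parameter per prompt token,
    divided by the device's sustained compute throughput (TFLOP/s)."""
    flops = 2.0 * params * prompt_tokens
    return flops / (tflops * 1e12)

# Hypothetical 8B-parameter dense model, 4K-token prompt.
PARAMS = 8e9

# Assumed sustained throughputs: ~3 TFLOP/s for an integrated laptop GPU
# vs. ~100 TFLOP/s for a discrete GPU (both made-up round numbers).
laptop = prefill_seconds(PARAMS, prompt_tokens=4096, tflops=3)
dgpu = prefill_seconds(PARAMS, prompt_tokens=4096, tflops=100)

print(f"laptop prefill: ~{laptop:.1f}s, discrete GPU prefill: ~{dgpu:.2f}s")
```

Note that doubling the prompt length doubles the estimated prefill time, which is why it "gets worse and worse, the longer the prompt gets"; token generation, by contrast, is typically memory-bandwidth-bound, which is why Macs look much better on tokens-per-second charts than on TTFT.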
superlopuh an hour ago | parent
I'm curious, and not an expert here: do you know why TTFT is so much worse on Mac? To elaborate, the article just says that this step is compute-bound, but I'm wondering whether it's really that simple, or whether it might also be less optimised in MLX?