blobbers 7 hours ago
How does llama.cpp use the GPU efficiently as opposed to MLX? Is there any way to use MLX and GPU at the same time? Or does memory become a big problem? TBH, I never understood Apple hyping these neural cores because I didn't think anyone actually uses them except maybe certain photo/video editing software. If I can generate voice at the same time as video, that would be useful.
dannyw 7 hours ago
llama.cpp uses the GPU very effectively because LLM token generation is computationally simple: it is bound almost entirely by your GPU memory bandwidth. That's essentially the baseline performance ceiling, with model-specific optimisations like MTP (multi-token prediction) potentially raising it.

The neural cores (the Neural Engine) aren't suitable for LLMs/transformers and aren't used in LLM inference. The M5 and later chips come with neural accelerators in the GPU, Apple's counterpart to Nvidia's Tensor Cores, which speed up the 'prefill' part (i.e. processing your context window) but don't do anything for token generation.

The MLX vs GGUF debate is mostly irrelevant. The GGUF pathways in llama.cpp are optimised for Apple silicon to the point of practically identical performance to MLX. MLX is just one way of using Apple GPUs; it comes with many optimisations in the box, but they're not hard to implement and they're no longer MLX-exclusive.
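
To put a rough number on the bandwidth ceiling, here is a back-of-the-envelope sketch. It assumes a dense model where every generated token streams all of the weights from memory once, and it ignores KV-cache reads and compute time. The function name, the `tokens_per_weight_pass` knob (a stand-in for the effect of something like MTP), and the example figures are all illustrative assumptions, not measurements of any particular chip or model.

```python
def decode_tokens_per_second_ceiling(
    bandwidth_gb_s: float,
    model_size_gb: float,
    tokens_per_weight_pass: float = 1.0,
) -> float:
    """Rough upper bound on token generation speed.

    Assumes a dense model: each pass over the weights reads all
    `model_size_gb` from memory, so passes per second is
    bandwidth / model size. `tokens_per_weight_pass` is a hypothetical
    knob for techniques that yield more than one token per pass.
    """
    if bandwidth_gb_s <= 0 or model_size_gb <= 0 or tokens_per_weight_pass <= 0:
        raise ValueError("all arguments must be positive")
    passes_per_second = bandwidth_gb_s / model_size_gb
    return passes_per_second * tokens_per_weight_pass


if __name__ == "__main__":
    # Illustrative numbers only: 400 GB/s of memory bandwidth, 20 GB of weights.
    baseline = decode_tokens_per_second_ceiling(400.0, 20.0)
    print(f"baseline ceiling: {baseline:.1f} tokens/s")

    # Same hardware, assuming an average of 1.5 tokens per weight pass.
    boosted = decode_tokens_per_second_ceiling(400.0, 20.0, tokens_per_weight_pass=1.5)
    print(f"with 1.5 tokens per pass: {boosted:.1f} tokens/s")
```

Real throughput lands below this bound, but it shows why a smaller quantisation or more bandwidth moves generation speed far more than extra compute does.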