slacka 4 days ago
I too found it interesting that Apple's Neural Engine doesn't work with local LLMs. It seems like Apple, AMD, and Intel are missing the AI boat by not properly supporting their NPUs in llama.cpp. Any thoughts on why this is?
bigyabai 4 days ago
NPUs are almost universally too weak for serious LLM inference. Most of the time you get better performance per watt out of GPU compute shaders, so the majority of NPUs end up as dark silicon. Keep in mind that Nvidia ships no NPU hardware because that functionality is baked into their GPU architecture. AMD, Apple, and Intel are all in this awkward NPU boat because they wanted to avoid competing with Nvidia and keep shipping simple raster designs.
numpad0 4 days ago
Perhaps due to size? AI/NN models before LLMs were orders of magnitude smaller, as evidenced by effectively all LLMs carrying "Large" in their names regardless of relative size differences.
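A rough back-of-the-envelope calculation makes the gap concrete (the parameter counts and precisions below are ballpark assumptions for illustration, not figures from the thread):

    # Approximate weight footprint: a classic pre-LLM vision model vs. a "small" LLM.
    # Parameter counts are ballpark assumptions for illustration only.
    def weight_gb(params: float, bytes_per_param: float) -> float:
        return params * bytes_per_param / 1e9

    resnet50 = 25e6   # ~25M parameters, a typical pre-LLM vision model
    llama_8b = 8e9    # 8B parameters, at the small end of current LLMs

    print(f"ResNet-50 @ fp16:  {weight_gb(resnet50, 2):.2f} GB")    # ~0.05 GB
    print(f"8B LLM    @ fp16:  {weight_gb(llama_8b, 2):.2f} GB")    # ~16 GB
    print(f"8B LLM    @ 4-bit: {weight_gb(llama_8b, 0.5):.2f} GB")  # ~4 GB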
Someone 4 days ago
My guess is that the hardware doesn't make things faster (¿yet?). If it did, they would presumably have mentioned it in https://machinelearning.apple.com/research/core-ml-on-device.... That page is updated for Sequoia and says: "This technical post details how to optimize and deploy an LLM to Apple silicon, achieving the performance required for real time use cases. In this example we use Llama-3.1-8B-Instruct, a popular mid-size LLM, and we show how using Apple's Core ML framework and the optimizations described here, this model can be run locally on a Mac with M1 Max with about ~33 tokens/s decoding speed. While this post focuses on a particular Llama model, the principles outlined here apply generally to other transformer-based LLMs of different sizes."
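For anyone curious what that Core ML path looks like from Python, here is a minimal sketch using coremltools; the .mlpackage name and the compute-unit choice are my assumptions, and the actual conversion steps (quantization, stateful KV cache) are described in Apple's post:

    # Minimal sketch: load a Core ML-converted LLM and choose compute units.
    import coremltools as ct

    model = ct.models.MLModel(
        "Llama-3.1-8B-Instruct.mlpackage",         # hypothetical converted model
        compute_units=ct.ComputeUnit.CPU_AND_GPU,  # assumption: GPU path rather than the ANE
    )

    # Inspect the converted model's actual input/output names before calling predict().
    print(model.get_spec().description)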
GeekyBear 4 days ago
There is no NPU "standard". Llama.cpp would have to target every hardware vendor's NPU individually and those NPUs tend to have breaking changes when newer generations of hardware are released. Even Nvidia GPUs often have breaking changes moving from one generation to the next. | ||||||||||||||||||||||||||
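To make the fragmentation concrete, here is a purely illustrative sketch; none of these backend labels are real llama.cpp modules, they just stand in for the separate vendor stacks (Core ML/ANE, OpenVINO's NPU plugin, AMD's Ryzen AI SDK, Qualcomm's QNN) that would each need their own code path:

    # Illustrative only: each NPU family needs its own backend and toolchain,
    # and each can break independently between hardware generations.
    VENDOR_NPU_STACKS = {
        "apple_ane": "Core ML / ANE compiler",
        "intel_npu": "OpenVINO NPU plugin",
        "amd_xdna": "Ryzen AI (XDNA) SDK",
        "qualcomm_hexagon": "QNN SDK",
    }

    def pick_backend(detected: str) -> str:
        # Fall back to the portable GPU/CPU paths that actually exist today.
        return VENDOR_NPU_STACKS.get(detected, "generic GPU/CPU backend")

    print(pick_backend("amd_xdna"))     # Ryzen AI (XDNA) SDK
    print(pick_backend("unknown_npu"))  # generic GPU/CPU backend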
svachalek 4 days ago
I think I saw something that got Ollama to run models on it, but it only works with tiny models. It seems like the Neural Engine is extremely power efficient but not fast enough to run LLMs with billions of parameters.
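For reference, this is roughly how you would ask Core ML to prefer the Neural Engine for a small model (a sketch with a hypothetical model file; Core ML silently falls back to GPU/CPU for layers the ANE can't run, which is part of why large LLMs don't end up there):

    # Sketch: request the Neural Engine for a small Core ML model.
    import coremltools as ct

    small = ct.models.MLModel(
        "TinyClassifier.mlpackage",               # hypothetical small model
        compute_units=ct.ComputeUnit.CPU_AND_NE,  # prefer CPU + Neural Engine
    )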