slacka 4 days ago

I too found it interesting that Apple's Neural Engine doesn't work with local LLMs. It seems like Apple, AMD, and Intel are missing the AI boat by not properly supporting their NPUs in llama.cpp. Any thoughts on why this is?

bigyabai 4 days ago | parent | next [-]

NPUs are almost universally too weak for serious LLM inference. Most of the time you get better performance-per-watt out of GPU compute shaders, so the majority of NPUs end up as dark silicon.

Keep in mind: Nvidia has no NPU hardware because that functionality is baked into its GPU architecture. AMD, Apple, and Intel are all in this awkward NPU boat because they wanted to avoid competition with Nvidia and continue shipping simple raster designs.

aurareturn 4 days ago | parent [-]

Apple is in this NPU boat because they are optimized for mobile first.

Nvidia does not optimize for mobile first.

AMD and Intel were forced by Microsoft to add NPUs in order to sell “AI PCs”. Turns out the kind of AI that people want to run locally can’t run on an NPU. It’s too weak, like you said.

AMD and Intel both have matmul acceleration directly in their GPUs. Only Apple does not.

bigyabai 4 days ago | parent [-]

Nvidia's approach works just fine on mobile. Devices like the Switch have complex GPGPU pipelines and don't compromise whatsoever on power efficiency.

Nonetheless, Apple's architecture on mobile doesn't have to define how they approach laptops, desktops, and datacenters. If the mobile-first approach is limiting their addressable market, then maybe Tim's obsessing over the wrong audience?

aurareturn 3 days ago | parent [-]

MacBooks benefit from mobile optimization. Apple just needs to add matmul hardware acceleration into their GPUs.

numpad0 4 days ago | parent | prev | next [-]

Perhaps due to size? AI/NN models before LLMs were orders of magnitude smaller, as evident in effectively all LLMs carrying "Large" in their names regardless of relative size differences.

Someone 4 days ago | parent | prev | next [-]

I guess that hardware doesn’t make things faster (yet?). If it did, I guess they would have mentioned it in https://machinelearning.apple.com/research/core-ml-on-device.... That page is updated for Sequoia and says:

“This technical post details how to optimize and deploy an LLM to Apple silicon, achieving the performance required for real time use cases. In this example we use Llama-3.1-8B-Instruct, a popular mid-size LLM, and we show how using Apple’s Core ML framework and the optimizations described here, this model can be run locally on a Mac with M1 Max with about ~33 tokens/s decoding speed. While this post focuses on a particular Llama model, the principles outlined here apply generally to other transformer-based LLMs of different sizes.”
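For what it's worth, Core ML does let you request the Neural Engine explicitly, so you can compare it against the GPU yourself. A minimal sketch using coremltools, assuming you already have a converted model package ("model.mlpackage" is a placeholder) and an input dict matching its spec:

```python
# Sketch: time the same converted Core ML model on the GPU vs. the Neural
# Engine. "model.mlpackage" and the input dict are placeholders.
import time
import coremltools as ct

def time_model(compute_units, inputs, path="model.mlpackage", runs=10):
    model = ct.models.MLModel(path, compute_units=compute_units)
    model.predict(inputs)  # warm-up; compiles for the requested compute unit
    start = time.perf_counter()
    for _ in range(runs):
        model.predict(inputs)
    return (time.perf_counter() - start) / runs

# inputs = {...}  # shaped to match the converted model's input spec
# gpu = time_model(ct.ComputeUnit.CPU_AND_GPU, inputs)
# ane = time_model(ct.ComputeUnit.CPU_AND_NE, inputs)
# print(f"GPU: {gpu*1000:.1f} ms/call, Neural Engine: {ane*1000:.1f} ms/call")
```

If the Neural Engine were a clear win for this class of model, a comparison like this would show it; Apple's post only discusses the GPU path.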

cma 3 days ago | parent [-]

If it uses a lot less power it could still be a win for some use cases: while on battery you might still want to run transformer-based speech-to-text, RTX Voice-style microphone denoising, or image generation/infill in photo editing programs. And in cases like RTX Voice-style denoising during multiplayer gaming, you might want the GPU left free to run the game, even if the NPU workload still costs some memory bandwidth.

GeekyBear 4 days ago | parent | prev | next [-]

There is no NPU "standard".

Llama.cpp would have to target every hardware vendor's NPU individually, and those NPUs tend to have breaking changes when newer generations of hardware are released.

Even Nvidia GPUs often have breaking changes moving from one generation to the next.

montebicyclelo 4 days ago | parent [-]

I think OP is suggesting that Apple / AMD / Intel do the work of integrating their NPUs into popular libraries like `llama.cpp`, which might make sense. My impression is that by the time the vendors support a given model on their NPUs, the model is too old and nobody cares anymore, whereas llama.cpp keeps up with the latest and greatest.
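As it stands, the only acceleration llama.cpp exposes on Apple silicon is the Metal GPU backend; there is nothing you can point at the Neural Engine. A rough illustration via the llama-cpp-python bindings (the GGUF path is a placeholder):

```python
# Sketch: llama.cpp on a Mac offloads transformer layers to the GPU via the
# Metal backend. There is no equivalent knob for the Neural Engine.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3-coder-30b-q4_k_m.gguf",  # placeholder GGUF file
    n_gpu_layers=-1,   # offload all layers to the Metal GPU backend
    n_ctx=4096,
)
out = llm("Write a function that reverses a linked list.", max_tokens=256)
print(out["choices"][0]["text"])
```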

svachalek 4 days ago | parent | prev | next [-]

I think I saw something that got Ollama to run models on it? But it only works with tiny models. It seems like the Neural Engine is extremely power efficient but not fast enough to run LLMs with billions of parameters.

reddit_clone 3 days ago | parent [-]

I am running Ollama with 'SimonPu/Qwen3-Coder:30B-Instruct_Q4_K_XL' on a M4 pro MBP with 48 GB of memory.

From Emacs/gptel, it seems pretty fast.

I have never used the proper hosted LLMs, so I don't have a direct comparison. But the above LLM answered coding questions in a handful of seconds.

The cost of memory (and disk) upgrades in apple machines is exorbitant.
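If you want an actual number rather than "pretty fast", Ollama's local REST API reports token counts and timings you can turn into tokens/s. A quick sketch against its default endpoint (model name copied from above; the prompt is arbitrary):

```python
# Rough tokens/s check against a local Ollama server (default port 11434).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "SimonPu/Qwen3-Coder:30B-Instruct_Q4_K_XL",
        "prompt": "Explain the difference between a mutex and a semaphore.",
        "stream": False,
    },
).json()

# eval_count = generated tokens, eval_duration = nanoseconds spent decoding.
tok_per_s = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{tok_per_s:.1f} tokens/s")
```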

jondwillis 4 days ago | parent | prev | next [-]

https://github.com/Anemll/Anemll
