babl-yc | 6 days ago
I've always been a bit confused about when to run models on the GPU vs the neural engine. As best I can tell, the GPU is simpler to use as a developer, especially when shipping a cross-platform app, but an optimized neural engine model can run at lower power. With the addition of NPUs to the GPU, this story gets even more confusing...
avianlyric | 6 days ago | parent
In reality you don’t have much of a choice. Most of the APIs Apple exposes for running neural nets don’t let you pick. Instead, some Apple magic in one of their frameworks decides where to host your network. At least from what I’ve read, these frameworks will usually distribute your network over all available matmul compute, starting on the Neural Engine (assuming your specific network is compatible) and spilling onto the GPU as needed. But there isn’t a trivial way to specifically target the neural engine.
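The closest you get with Core ML is `MLModelConfiguration.computeUnits`, which lets you *restrict* the pool of compute units rather than pin the model to the ANE. A rough sketch (the model path is a placeholder, and `.cpuAndNeuralEngine` needs macOS 13+ / iOS 16+):

```swift
import CoreML

// Placeholder path: substitute your own compiled .mlmodelc bundle.
let modelURL = URL(fileURLWithPath: "MyModel.mlmodelc")

let config = MLModelConfiguration()
// .all (the default): Core ML schedules across CPU, GPU, and Neural Engine.
// .cpuAndGPU: keeps work off the Neural Engine.
// .cpuAndNeuralEngine: keeps work off the GPU, but Core ML can still fall
//   back to the CPU for unsupported layers -- there is no "ANE only" option.
config.computeUnits = .cpuAndNeuralEngine

let model = try MLModel(contentsOf: modelURL, configuration: config)
```

So the best you can do is exclude the GPU and hope your layers are ANE-compatible; anything incompatible silently runs on the CPU instead.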