andy99 a day ago:
When something has a 30 TOPS NPU, what are the implications? Do NPUs like this have some common backend that ggml/llama.cpp targets? Is it proprietary and only works with some specific software? Does it have access to all the system RAM, and at what bandwidth? I know the concept has been around for a while, but I have no idea whether it actually means anything. I assume people are targeting the ones in common devices like Apple's, but what about here?
heavyset_go 12 hours ago:
Ignorant of this NPU, but in my experience, you're expected to use some cursed stack of proprietary tools/runtimes/SDKs/etc., and no, it will not play nicely with anything you want it to unless you write the support yourself.
Y_Y a day ago:
The specific NPU doesn't seem to be mentioned in TFA, but my guess is that the blessed way to deal with it is the Neon SDK: https://www.arm.com/technologies/neon

I've not found Neon to be fun or easy to use, and I frequently see devices ignoring the NPU and inferring on the CPU because it's easier. Maybe you get lucky and someone has made a backend for something specific you want, but it's not common.
moffkalast 20 hours ago:
NPUs like this tend to have one thing in common: nine times out of ten they're decorative, with no drivers or software support. Even when they do work, they're usually heavily bandwidth-bottlenecked and near useless for LLM inference; the CPU wins every time.
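As a rough sketch of why bandwidth dominates (the bandwidth and model-size numbers below are illustrative assumptions, not measurements from this board): decode throughput for an LLM is roughly the usable memory bandwidth divided by the bytes streamed per token, so a narrow path to DRAM caps tokens/sec no matter how many TOPS are printed on the box.

    # Back-of-envelope decode speed: each generated token streams
    # (roughly) the entire weight set from RAM, so
    #   tokens/sec ~= usable memory bandwidth / model size in bytes
    def decode_tokens_per_sec(model_size_gb: float, bandwidth_gb_s: float) -> float:
        return bandwidth_gb_s / model_size_gb

    model_gb = 4.0  # e.g. a ~7B model quantized to 4-bit (illustrative)
    for label, bw_gb_s in [("CPU on LPDDR4X, ~17 GB/s (assumed)", 17.0),
                           ("NPU behind a narrower fabric, ~8 GB/s (assumed)", 8.0)]:
        print(f"{label}: ~{decode_tokens_per_sec(model_gb, bw_gb_s):.1f} tok/s")

Either way you end up in single-digit tokens per second on a model that size, which is why the extra TOPS rarely buy anything for text generation.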
downrightmike 11 hours ago:
30 TOPS is about the minimum for an NPU to be almost useful in a device, but as we've seen, even Microsoft couldn't come up with anything useful to do with it in the AI laptops. That push has all but disappeared; they're promoting cloud licensing over local AI now.
cmrdporcupine a day ago:
Can't speak to this specific NPU, but these kinds of accelerators are really made for more general ML tasks like machine vision. For example, while people have made the (6 TOPS) NPU in the (similar board) RK3588 work with llama.cpp, it isn't super useful because of RAM constraints: I believe it has some sort of 32-bit memory addressing limit, so you can never give it more than 3 or 4 GB. So for LLMs, not all that useful.
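To make that concrete, here's a quick sanity check of what fits under a ~4 GB addressable window (the model sizes are ballpark params-times-bytes estimates, and the 4 GB figure is just the limit described above, not something I've verified):

    # Rough quantized weight sizes vs. a ~4 GB NPU-addressable window.
    # Sizes are approximate (params * bytes/param), ignoring KV cache
    # and runtime overhead, which only make things worse.
    ADDRESSABLE_GB = 4.0

    models = {
        "3B @ 4-bit (~0.5 B/param)": 3e9 * 0.5 / 1e9,
        "7B @ 4-bit": 7e9 * 0.5 / 1e9,
        "7B @ 8-bit (~1 B/param)": 7e9 * 1.0 / 1e9,
    }

    for name, size_gb in models.items():
        verdict = "fits" if size_gb <= ADDRESSABLE_GB else "does NOT fit"
        print(f"{name}: ~{size_gb:.1f} GB -> {verdict} in {ADDRESSABLE_GB:.0f} GB")

So you're limited to small models, and even a 4-bit 7B only squeaks in before you account for the KV cache.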
ekianjo a day ago:
It needs specific support; llama.cpp, for example, has backends for some of them, though that comes with limitations on how much RAM they can allocate. When they do work, you see flat CPU usage while the NPU does everything for inference.
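One crude way to check whether the work is actually being offloaded, assuming an inference job is already running and you have psutil installed (this just samples overall CPU load, a heuristic rather than a proper profiler):

    # Sample overall CPU usage while a model is generating. If the NPU
    # is doing the matmuls, CPU stays near idle; if inference silently
    # fell back to CPU, usage spikes across the cores.
    import psutil

    samples = [psutil.cpu_percent(interval=1) for _ in range(10)]
    avg = sum(samples) / len(samples)
    print(f"average CPU over {len(samples)} s: {avg:.0f}%")
    print("looks offloaded" if avg < 25 else "looks like CPU fallback")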