Neywiny 6 days ago

Disappointing on the NPU. I have found it's a point where industry-wide improvement is necessary. People talk tokens/sec, model sizes, what formats are supported... But I rarely see an objective accuracy comparison. I repeatedly see claims that AI models are resilient to errors and reduced precision, which is what allows 1-bit quantization and whatnot.

But at a certain point I guess it just breaks? And they need an objective "I gave these tokens, I got out those tokens". But I guess that would need an objective gold standard ground truth that's maybe hard to come by.

topspin 2 days ago | parent | next [-]

There are a couple outfits making M.2 AI accelerators. Recently I noticed this one: DeepX DX-M1M 25 TOPS (INT8) M.2 module from Radxa[1]: https://radxa.com/products/aicore/dx-m1m

If you're in the business of selling unbundled edge accelerators, you're strongly incentivized to modularize your NPU software stack for arbitrary hosts, which increases the likelihood that it actually works, and for more than one particular kernel.

If I had an embedded AI use case, this is something I'd look at hard.

jerf 2 days ago | parent | prev | next [-]

So, this is slightly off topic, but out of curiosity, what are NPUs good for right this very second? What software uses them? What would this NPU be able to run if it were in fact accessible?

This is an honest, neutral question, and it's specifically about what can concretely be done with them right now. Their theoretical use is clear to me. I'm explicitly asking only about their practical use, in the present time.

(One of the reasons I am asking is I am wondering if this is a classic case of the hardware running too far ahead of the actual needs and the result is hardware that badly mismatches the actual needs, e.g., an "NPU" that blazingly accelerates a 100 million parameter model because that was "large" when someone wrote the specs down, but is uselessly small in practice. Sometimes this sort of thing happens. However I'm still honestly interested just in what can be done with them right now.)

geerlingguy 3 days ago | parent | prev | next [-]

The even more confounding factor is that every vendor of these Cix P1 systems provides its own specific builds: Radxa, Orange Pi, Minisforum, now MetaComputing... it is painful to try to sort it out, even as someone who knows where to look.

I couldn't imagine recommending any of these boards to people who aren't already SBC tinkerers.

Havoc 3 days ago | parent | prev | next [-]

>But I rarely see an objective accuracy comparison.

There are some perplexity comparison numbers for the previous gen (Orange Pi 5) at the link below.

Bit of a mixed bag, but doesn't seem catastrophic across the board. Some models are showing minimal perplexity loss at Q8...

https://github.com/invisiofficial/rk-llama.cpp/blob/rknpu2/g...
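For anyone unfamiliar with the metric: perplexity is the exponential of the average negative log-likelihood per token, so lower is better and a quantized build can be compared against a full-precision run on the same text. A minimal sketch, assuming you already have per-token log-probabilities from both runs (the numbers and variable names below are made up for illustration, not taken from the linked results):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean(natural-log probability of each token))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probabilities for the same text, scored by a
# full-precision model and by a quantized build of the same model.
fp16_logprobs = [-1.20, -0.35, -2.10, -0.80]
q8_logprobs = [-1.22, -0.36, -2.15, -0.81]

ppl_fp16 = perplexity(fp16_logprobs)
ppl_q8 = perplexity(q8_logprobs)

# Relative perplexity increase caused by quantization (0.0 means no loss).
relative_increase = (ppl_q8 - ppl_fp16) / ppl_fp16
print(f"fp16={ppl_fp16:.3f} q8={ppl_q8:.3f} increase={relative_increase:.2%}")
```

The relative increase is the number to watch: it stays comparable across models whose absolute perplexities differ.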

coredog64 3 days ago | parent | prev | next [-]

I was also on board until he got to the NPU downsides. I don't care about use for an LLM, but I would like to see the ability to run smallish ONNX models generated from a classical ML workflow. Not only is a GPU overkill for the tasks I'm considering, but I'm also concerned that unattended GPUs out on the edge will be repurposed for something else (video games, crypto mining, or just straight up ganked).

cyanydeez 6 days ago | parent | prev [-]

Just try to find some benchmark top_k, temp, etc. parameters for llama.cpp. There's no consistent framing of any of these things. Temp should be effectively 0 so it's at least deterministic in its random probabilities.
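To illustrate what temperature does, here is a rough sketch of temperature sampling in plain Python. It is not llama.cpp's actual sampler; `sample_token` and the logits are hypothetical. At temperature 0 it falls back to picking the highest logit, which removes the randomness from the sampling step itself.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits (hypothetical helper)."""
    if temperature <= 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [1.0, 3.5, 0.2, 3.4]  # made-up scores for four tokens

# Temperature 0: the same token every time, whatever the seed.
greedy = [sample_token(logits, 0.0, random.Random(seed)) for seed in range(5)]

# Temperature 1: random, but reproducible when the seed is fixed.
run_a = [sample_token(logits, 1.0, random.Random(42)) for _ in range(5)]
run_b = [sample_token(logits, 1.0, random.Random(42)) for _ in range(5)]
```

Note that this only makes the sampling step deterministic. The logits themselves can still vary between runs or between backends.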

Neywiny 5 days ago | parent | next [-]

Right. There are countless parameters and seeds and whatnot to tweak. But theoretically, if all the inputs are the same, the outputs should be within epsilon of a known good. I wouldn't even mandate that temperature or any other parameter be a specific value, just that it's the same across runs. That way you can make sure even the pseudorandom processes are the same, so long as nothing pulls from a hardware RNG or something like that. Which seems like a reasonable thing for them to do, so idk, maybe an "insecure RNG" mode.
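Something like the following is what I have in mind: a sketch of an "within epsilon of a known good" check, assuming you can dump the logits for one decoding step from both a reference run and the accelerator. The helper names, the logits, and the epsilon value are all made up for illustration.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two logit vectors."""
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))

def within_epsilon(reference, candidate, eps):
    """True if every logit is within eps of the known-good value."""
    return max_abs_diff(reference, candidate) <= eps

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical logits for one step: known-good CPU run vs. NPU run,
# produced with identical prompt, parameters, and seed.
reference_logits = [2.31, -0.75, 0.12, 4.08]
npu_logits = [2.29, -0.74, 0.15, 4.05]

passes = within_epsilon(reference_logits, npu_logits, eps=0.05)
same_token = argmax(reference_logits) == argmax(npu_logits)
print(f"within epsilon: {passes}, same top token: {same_token}")
```

Checking both is deliberate: the logits can be within tolerance while the top token still flips when two candidates are nearly tied.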

andai 3 days ago | parent | prev [-]

>Temp should be effectively 0 so it's at least deterministic in its random probabilities.

Is this a thing? I read an article about how due to some implementation detail of GPUs, you don't actually get deterministic outputs even with temp 0.

But I don't understand that, and haven't experimented with it myself.

kingstnap 3 days ago | parent [-]

By default CUDA isn't deterministic because of thread scheduling.

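The main difference comes from the order in which reductions are rounded: floating-point addition isn't associative, so summing the same values in a different order can give a slightly different result.

It only makes a small difference, unless you have an unstable floating-point algorithm. But if you have an unstable floating-point algorithm on a GPU at low precision, you were doomed from the start.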
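You can see the effect without a GPU. The sketch below is plain Python with double-precision floats, not CUDA: it sums the same four numbers sequentially and then pairwise (roughly how a parallel tree reduction would group them) and gets different answers. The values are deliberately extreme so the effect is visible.

```python
import math

def sum_left_to_right(values):
    """Sequential accumulation, one element at a time."""
    total = 0.0
    for v in values:
        total += v
    return total

def sum_pairwise(values):
    """Tree-style reduction: sum each half, then add the two halves."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])

# Chosen so that adding 1.0 to +/-1e16 is lost to rounding.
values = [1e16, 1.0, -1e16, 1.0]

sequential = sum_left_to_right(values)
pairwise = sum_pairwise(values)
exact = math.fsum(values)  # correctly rounded sum, for reference

print(sequential, pairwise, exact)
```

With well-scaled inputs the gap is usually down in the last few bits, which is the "small difference" case. The gap here is large only because the inputs were picked to cancel badly.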