| ▲ | syntaxing 2 days ago |
| I feel like calling it a “30B” model is slightly disingenuous. It’s a 30B-A3B, so only 3B parameters are active at any given time. While still impressive, getting 8 T/s from an “A3B” is very different from getting it from a dense 30B. |
|
| ▲ | CamperBob2 2 days ago | parent | next [-] |
| Out of curiosity, I just tried Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.70bpw.gguf (the version they recommend for the Raspberry Pi) on a Blackwell GPU. It cranked out 200+ tokens per second on some private benchmark queries, and it is surprisingly sharp. It punches well above the weight class expected from 3B active parameters. You could build the bear in Spielberg's "AI" with this thing, if not the kid. |
| |
|
| ▲ | throwaway894345 2 days ago | parent | prev [-] |
| What does it mean that only 3B parameters are active at a time? Also any indication of whether this was purely CPU or if it’s using the Pi’s GPU? |
| |
| ▲ | kouteiheika 2 days ago | parent | next [-] | | > What does it mean that only 3B parameters are active at a time? In a nutshell: LLMs generate tokens one at a time. "Only 3B parameters active at a time" means that for each of those tokens only 3B parameters need to be fetched from memory, instead of all of them (30B). | | |
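| A rough back-of-the-envelope of why that matters on a memory-bandwidth-bound device like the Pi; every number below is an assumption for illustration, not a measurement: |
| |
|     # Assumed: ~2.7 bits/weight (the Q3_K_S quant mentioned upthread) and |
|     # roughly 17 GB/s of memory bandwidth for the Pi 5's LPDDR4X. |
|     BANDWIDTH_BYTES_PER_S = 17e9 |
|     BITS_PER_WEIGHT = 2.7 |
| |
|     def max_tokens_per_second(active_params): |
|         # Each generated token streams every *active* weight from RAM once. |
|         bytes_per_token = active_params * BITS_PER_WEIGHT / 8 |
|         return BANDWIDTH_BYTES_PER_S / bytes_per_token |
| |
|     print(max_tokens_per_second(3e9))    # ~17 t/s ceiling with 3B active |
|     print(max_tokens_per_second(30e9))   # ~1.7 t/s ceiling if all 30B were active |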
| ▲ | tgv 2 days ago | parent [-] | | Then I don't understand why it would matter. Or does it really mean that for each input token only 10% of the total network runs, and then another 10% for the next token, rather than running all 10 batches of 10% for every token? If so, any idea or pointer to how the selection works? | | |
| ▲ | kouteiheika 2 days ago | parent [-] | | Yes, for each token only, say, 10% of the weights are necessary, so you don't have to fetch the remaining 90% from memory, which makes inference much faster (if you're memory bound; and if you're doing single-batch inference, you're certainly memory bound). As for how the selection works: each mixture-of-experts layer in the network has essentially a small subnetwork called a "router" which looks at the input and calculates a score for each expert; then the best-scoring experts are picked and the inputs are routed only to them. |
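| A minimal sketch of that routing step (PyTorch, with toy sizes and a top-2 pick that are assumptions for illustration, not Qwen's actual configuration): |
| |
|     import torch |
| |
|     hidden, n_experts, top_k = 64, 8, 2          # toy sizes, not the real model's |
|     router = torch.nn.Linear(hidden, n_experts)  # the small "router" subnetwork |
|     experts = [torch.nn.Linear(hidden, hidden) for _ in range(n_experts)] |
| |
|     def moe_layer(x):                            # x: one token's hidden vector |
|         scores = router(x).softmax(dim=-1)       # score every expert for this token |
|         weights, idx = scores.topk(top_k)        # keep only the best-scoring experts |
|         # Only the chosen experts' weights ever need to be read from memory. |
|         return sum(w * experts[i](x) for w, i in zip(weights, idx.tolist())) |
| |
|     out = moe_layer(torch.randn(hidden)) |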
|
| |
| ▲ | numpad0 2 days ago | parent | prev [-] | | I asked Gemini about it the other day (I'm dumb and shameless). Apparently it means that the model branches into a bunch of 3B sections in the middle and joins back up at both ends, totaling 30B parameters. This means the computational footprint shrinks to (bottom "router" parts + 3B + top parts), effectively ~5B or whatever is specific to that model, which is what the "3B" implies, rather than the full 30B. MoE models still operate on a token-by-token basis, i.e. "pot/at/o" -> "12345/7654/8472". "Experts" are selected on a per-token basis, not per-iteration, so the "expert" naming might be a bit of a misnomer, or marketing. |
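| As a toy version of that accounting, with made-up numbers chosen only so the totals land near those figures (not Qwen's real layout): |
| |
|     shared      = 2e9       # embeddings, attention, routers: always active |
|     expert_size = 0.35e9    # one expert's feed-forward weights |
|     n_experts   = 80        # experts stored in the model |
|     k_active    = 8         # experts actually used per token |
| |
|     total  = shared + n_experts * expert_size   # what you store: ~30B |
|     active = shared + k_active  * expert_size   # what each token touches: ~4.8B |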
|