stingraycharles 8 hours ago:
397A17B = 397B total weights, 17B per expert?
zackangelo 8 hours ago:
17B per token. So when you're generating a single stream of text ("decoding"), 17B parameters are active. If you're decoding multiple streams, it's 17B per stream (some tokens will route to the same experts, so there is some overlap). When the model is ingesting the prompt ("prefilling"), it's looking at many tokens at once, so the number of active parameters will be larger.
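The overlap effect described above is easy to see in a toy simulation. This is a hypothetical sketch with made-up sizes (128 experts, 8 active per token -- not the real model's routing): a single decode stream touches exactly top_k experts, while many tokens in flight touch a much larger union, with overlap keeping it below n_tokens × top_k.

```python
import random

random.seed(0)
n_experts, top_k = 128, 8  # toy sizes, not the actual model's

def active_experts(n_tokens):
    """Union of experts touched when n_tokens are each routed to top_k experts."""
    touched = set()
    for _ in range(n_tokens):
        # stand-in for a learned router: pick top_k distinct experts per token
        touched |= set(random.sample(range(n_experts), top_k))
    return len(touched)

one = active_experts(1)    # single decode stream: exactly top_k experts
many = active_experts(64)  # prefill / batched decode: far more, with overlap
print(one, many)
```

With 64 tokens the union approaches all 128 experts, which is why prefill activates many more parameters than a single decode step even though each token still only uses its top_k.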
wongarsu 8 hours ago:
397B params, 17B activated at the same time. Those 17B might be split among multiple experts that are activated simultaneously.
littlestymaar 8 hours ago:
That's not how it works. Many people get confused by the "expert" naming, when in reality the key part of the original term "sparse mixture of experts" is sparse. Experts are just chunks of each layer's MLP that are only partially activated by each token, and there are thousands of "experts" in such a model (for Qwen3-30B-A3B it was 48 layers × 128 "experts" per layer, with only 8 active for each token).
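The "experts are just chunks of the layer's MLP" point can be sketched concretely. Below is a minimal, hypothetical top-k MoE layer in NumPy with toy dimensions (not any real model's sizes): a router scores all experts per token, only the top-k expert MLP chunks run, and their outputs are mixed by the softmaxed router scores.

```python
import numpy as np

# Toy, hypothetical sizes -- not a real model's dimensions.
d_model, d_ff = 16, 32
n_experts, top_k = 128, 8

rng = np.random.default_rng(0)
# Each "expert" is just a small MLP chunk: an (in, out) weight pair.
experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d_model)) * 0.02)
           for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    """Route one token vector to its top_k experts; mix their outputs."""
    logits = x @ router                    # score every expert for this token
    top = np.argsort(logits)[-top_k:]      # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the chosen k only
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)  # ReLU MLP chunk
    return out, top

x = rng.standard_normal(d_model)
y, active = moe_layer(x)
```

The sparsity is the whole trick: all 128 expert weight matrices exist in memory, but each token only multiplies through 8 of them, so compute per token scales with top_k, not n_experts.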