▲ zshn25 | 10 hours ago
What do all the numbers 6-35B-A3B mean?
▲ dunb | 10 hours ago | parent | next [-]
3.6 is the release version for Qwen. This model is a mixture of experts (MoE), so while the total model size is big (35 billion parameters), each forward pass only activates the portion of the network that's most relevant to your request (3 billion active parameters). This makes the model run faster, especially if you don't have enough VRAM for the whole thing. The performance/intelligence is said to be about the same as that of a dense model whose size is the geometric mean of the total and active parameter counts. So, this model should be roughly equivalent to a dense model with about 10.25 billion parameters.
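The geometric-mean rule of thumb mentioned above is easy to check; a minimal sketch (the function name is my own, and the rule itself is only a community heuristic, not something Qwen publishes):

```python
import math

def effective_dense_params(total_b: float, active_b: float) -> float:
    """Heuristic 'equivalent dense size' of an MoE model, in billions:
    the geometric mean of total and active parameter counts."""
    return math.sqrt(total_b * active_b)

# 35B total, 3B active -> sqrt(105) = ~10.25B "effective" parameters
print(round(effective_dense_params(35, 3), 2))  # 10.25
```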
| |||||||||||||||||||||||
▲ cshimmin | 10 hours ago | parent | prev | next [-]
The 6 is part of 3.6, the model version. 35B is the total parameter count; A3B means it's a mixture-of-experts model with only 3B parameters active in any forward pass.
| |||||||||||||||||||||||
▲ joaogui1 | 10 hours ago | parent | prev | next [-]
3.6 is the model version, 35B is the total number of parameters, and A3B means that only 3B parameters are activated, which has some implications for serving (either you shard the model, or you keep the total parameters in RAM and only load into VRAM what you need to compute the current token, which makes it slower, but at least it runs).
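That "only a few experts per token" routing can be sketched in a few lines. This is a toy illustration of top-k MoE routing, not Qwen's actual architecture; all dimensions and names here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MoE layer: 8 experts total, but only the top-2 (by router score)
# are evaluated per token. The unused experts' weights are never touched
# on this forward pass, which is why active params << total params.
n_experts, d_model, top_k = 8, 16, 2
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ router                        # one routing score per expert
    idx = np.argsort(scores)[-top_k:]          # indices of the top-k experts
    w = np.exp(scores[idx])
    w /= w.sum()                               # softmax over the chosen experts
    # Only experts[i] for i in idx need to be resident in VRAM for this token.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, idx))

y = moe_forward(rng.standard_normal(d_model))
print(y.shape)  # (16,)
```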
▲ JLO64 | 10 hours ago | parent | prev | next [-]
35B (35 billion) is the number of parameters this model has. It's a Mixture of Experts (MoE) model, so A3B means that 3B parameters are Active at any moment.
| |||||||||||||||||||||||