dunb 8 hours ago

3.6 is the release version for Qwen. This model is a mixture of experts (MoE): while the total model size is large (35 billion parameters), each forward pass activates only the portion of the network most relevant to your request (3 billion active parameters). This makes the model run faster, especially if you don't have enough VRAM to hold the whole thing.

The performance/intelligence is said to be roughly that of a dense model whose parameter count is the geometric mean of the total and active parameter counts. So this model should be equivalent to a dense model with about 10.25 billion parameters.

wongarsu 8 hours ago | parent | next [-]

And even if you have enough VRAM to fit the entire thing, the time per generated token after the first is proportional to (activated parameters)/(VRAM bandwidth), since every active weight has to be read from memory once per token.

If you have the VRAM to spare, a model with more total params but fewer activated ones can be a very worthwhile tradeoff. Of course, that's a big if.
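The bandwidth argument above can be sketched as a back-of-the-envelope calculation. This is a rough ceiling, not a benchmark: the numbers (8-bit weights, 1 TB/s of memory bandwidth) are illustrative assumptions, and real throughput is lower due to compute, KV-cache reads, and overhead.

```python
# Bandwidth-bound decode estimate: each generated token must read
# every activated parameter from memory once, so memory bandwidth
# divided by bytes-read-per-token gives an upper bound on tokens/sec.
def max_tokens_per_sec(active_params: float, bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed (ignores compute and KV cache)."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Illustrative numbers (assumptions, not measurements): 3B active
# parameters at 8-bit (1 byte each), 1000 GB/s of VRAM bandwidth.
print(max_tokens_per_sec(3e9, 1.0, 1000))  # ~333 tokens/sec ceiling

# Same bandwidth, but a dense 35B model reads all 35B weights per token:
print(max_tokens_per_sec(35e9, 1.0, 1000))  # ~28.6 tokens/sec ceiling
```

The same arithmetic shows why fewer activated parameters wins at decode time: the 35B-total/3B-active MoE reads roughly a tenth the bytes per token of a dense 35B model.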

zshn25 8 hours ago | parent | prev [-]

Sorry, how did you calculate the 10.25B?

darrenf 8 hours ago | parent [-]

> > The performance/intelligence is said to be about the same as the geometric mean of the total and active parameter counts. So, this model should be equivalent to a dense model with about 10.25 billion parameters.

> Sorry, how did you calculate the 10.25B?

The geometric mean of two numbers is the square root of their product. Square root of 105 (35*3) is ~10.25.