simonw | 6 hours ago
It's the number of active parameters for a Mixture of Experts (misleading name IMO) model. Qwen3.5-35B-A3B means that the model itself consists of 35 billion floating point numbers - very roughly 35GB of data - which are all loaded into memory at once. But on any given pass through the model weights, only 3 billion of those parameters are "active", aka have matrix arithmetic applied against them. This speeds up inference considerably because the computer has to do fewer operations for each token that is processed. It still needs the full amount of memory though, as the 3B active parameters it uses are likely different on every iteration.
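To make "active" concrete, here's a minimal top-k routing sketch in Python/NumPy. Toy sizes, nothing like the real Qwen architecture: all the expert matrices sit in memory, but each token is only multiplied through the couple of experts the router picks.

    import numpy as np

    # Hypothetical sizes for illustration only.
    D_MODEL = 64     # hidden size
    N_EXPERTS = 8    # total experts (counts toward "total" parameters)
    TOP_K = 2        # experts actually used per token ("active" parameters)

    rng = np.random.default_rng(0)

    # All expert weights are resident at once: N_EXPERTS * D_MODEL^2 params.
    expert_weights = rng.standard_normal(
        (N_EXPERTS, D_MODEL, D_MODEL)).astype(np.float32)
    router_weights = rng.standard_normal(
        (D_MODEL, N_EXPERTS)).astype(np.float32)

    def moe_forward(x: np.ndarray) -> np.ndarray:
        """One token through a top-k MoE layer: only TOP_K experts do matmuls."""
        logits = x @ router_weights            # router scores every expert
        top = np.argsort(logits)[-TOP_K:]      # pick the TOP_K best experts
        gates = np.exp(logits[top])
        gates /= gates.sum()                   # softmax over the chosen experts
        out = np.zeros_like(x)
        for gate, e in zip(gates, top):        # matmul only the chosen experts
            out += gate * (x @ expert_weights[e])
        return out

    token = rng.standard_normal(D_MODEL).astype(np.float32)
    print(moe_forward(token).shape)  # (64,) -- compute touched 2 of 8 experts

Every expert's weights had to be in memory for the router to pick freely among them, but the FLOPs per token scale with TOP_K, not N_EXPERTS.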
zozbot234 | 4 hours ago
It will benefit from the full amount of memory for sure, but AIUI if you use system memory and mmap for your experts, you can execute the model with only enough memory for the active parameters; it's just unbearably slow, since it has to swap in new experts for every token. So the more memory you have in excess of that, the more inactive but often-used experts can be kept in RAM for better performance.
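Roughly what that looks like with np.memmap (toy sizes again, and a made-up file layout, not how any real runtime stores its weights): mapping the file costs nothing up front, only the experts you actually index get paged in, and the OS page cache is what keeps the often-used ones resident.

    import numpy as np

    D_MODEL, N_EXPERTS = 64, 8
    PATH = "experts.bin"  # hypothetical weights file

    # Demo setup: persist all expert matrices to disk once.
    rng = np.random.default_rng(0)
    rng.standard_normal(
        (N_EXPERTS, D_MODEL, D_MODEL)).astype(np.float32).tofile(PATH)

    # Map the file instead of loading it: no expert is read into RAM yet.
    experts = np.memmap(PATH, dtype=np.float32, mode="r",
                        shape=(N_EXPERTS, D_MODEL, D_MODEL))

    def apply_expert(x: np.ndarray, e: int) -> np.ndarray:
        # Touching experts[e] faults in just that expert's pages. The OS
        # page cache keeps recently used experts resident while free RAM
        # lasts, so hot experts stay fast and cold ones cost a disk read.
        return x @ experts[e]

    x = rng.standard_normal(D_MODEL).astype(np.float32)
    print(apply_expert(x, 3).shape)  # (64,) -- only expert 3 was paged in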