ainch 4 days ago
Generally you train all the experts simultaneously. The benefit of MoEs is that you get cheap inference, because you only use the active experts' parameters, which constitute a small fraction of the total parameter count. For example, Deepseek R1 (which is especially sparse) only uses 1/18th of the total parameters per-query.
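To make the sparsity concrete, here's a minimal sketch of top-k expert routing (NumPy, toy sizes chosen for illustration, not Deepseek R1's actual configuration): for each token, only the selected experts' weights are touched, so the active parameter count is a fraction of the total.

```python
# Toy top-k routed MoE layer (NumPy only). Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 64, 256      # hidden size and per-expert FFN size (toy values)
n_experts, top_k = 8, 2      # route each token to 2 of 8 experts

# Per-expert feed-forward weights and the router (gating) matrix.
W_in   = rng.standard_normal((n_experts, d_model, d_ff)) * 0.02
W_out  = rng.standard_normal((n_experts, d_ff, d_model)) * 0.02
W_gate = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """x: (d_model,) hidden state for one token."""
    logits = x @ W_gate                      # router score per expert
    chosen = np.argsort(logits)[-top_k:]     # indices of the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                 # softmax over the chosen experts

    out = np.zeros(d_model)
    for w, e in zip(weights, chosen):
        # Only the chosen experts' parameters are read for this token;
        # the other experts sit idle, which is the inference saving.
        h = np.maximum(x @ W_in[e], 0.0)     # expert FFN with ReLU
        out += w * (h @ W_out[e])
    return out

y = moe_forward(rng.standard_normal(d_model))
print(y.shape)   # (64,)
# Active expert params per token here: top_k / n_experts = 1/4 of the expert weights.
```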
pama 4 days ago | parent
> only uses 1/18th of the total parameters per-query.

It only uses 1/18th of the total parameters per token. It may use a large fraction of them in a single query.
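A quick toy illustration of that per-token vs. per-query distinction (random routing and made-up counts, not Deepseek R1's real routing statistics): each token activates only a few experts, but different tokens pick different experts, so a long query can end up touching most of them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, n_tokens = 64, 4, 200

activated = set()
for _ in range(n_tokens):
    logits = rng.standard_normal(n_experts)            # stand-in for router scores
    activated.update(np.argsort(logits)[-top_k:].tolist())

print(f"per token: {top_k}/{n_experts} experts ({top_k / n_experts:.0%} of expert params)")
print(f"over the whole query: {len(activated)}/{n_experts} experts touched")
```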