dormento · 4 hours ago
"Mixture-of-experts", AKA "running several small models and activating only a few at a time". Thanks for introducing me to that concept. Fascinating. (commentary: things are really moving too fast for the layperson to keep up) | ||||||||
hasperdi · 4 hours ago
As pointed out by a sibling comment, a MoE model consists of a router and a number of experts (e.g. 8). These experts can be imagined as parts of the brain with specializations, although in reality they probably don't work exactly like that. They aren't separate models; they're components of a single large model. Typically each input gets routed to only a few experts, e.g. the top 2, leaving the others inactive, which reduces the amount of activation/processing per token. Mixtral is an example of a model designed like this. Clever people have also created converters to transform dense models into MoE models, and these days many popular models are available in MoE configurations as well.
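To make the routing concrete, here's a minimal, illustrative sketch of a top-2 MoE feed-forward layer in PyTorch. The names (MoELayer, num_experts, top_k) and the per-token loop are purely for readability and are not how any particular model implements it; real implementations batch the expert computation and usually add things like load-balancing losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparse mixture-of-experts feed-forward layer with top-k routing (illustrative)."""

    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router is just a small linear layer that scores every expert for each token.
        self.router = nn.Linear(d_model, num_experts)
        # Each "expert" is an ordinary feed-forward block; they all live inside one model.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: (num_tokens, d_model)
        scores = self.router(x)                  # (num_tokens, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen experts only
        out = torch.zeros_like(x)
        # Only the top-k experts run for each token; all other experts stay inactive.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = MoELayer()
    tokens = torch.randn(16, 512)        # 16 token embeddings
    print(layer(tokens).shape)           # torch.Size([16, 512])
```

All the expert weights are part of the same network; the savings come from only running the selected experts' matmuls per token.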
whimsicalism · 4 hours ago
That's not really a good summary of what MoEs are. You can think of them more as sublayers that get routed through (similar to how the brain only lights up certain pathways) rather than actual separate models.