| ▲ | frotaur 9 hours ago | |
Afaik the experts are not usually very interpretable, and I'd generally be surprised if at least one didn't change on every token. I don't know what happens in practice, but I do know that, at least during training, nothing is done to minimize the number of expert switches between tokens. | ||
| ▲ | etiam 9 minutes ago | |
I'd have thought there would be at least a tiny explicit penalty term for switching, to discourage messing around with the composition when no gain is expected from it. If these models are to run on hardware that can't keep everything loaded, I guess someone should examine how it works out in practice. Interpretability may be too much to ask, but I can't spontaneously see any reason why the experts can't at least be pushed to incorporate what's needed to remain the good choice for a longer segment. | ||
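To make the idea concrete, here is a rough sketch of what such a penalty could look like. Everything in it is an assumption for illustration: the function names, the choice of total-variation distance as the soft measure of "switching", and the weight `lam` are all made up, and this is not something I know any real MoE training setup to use. It takes the router's per-token probabilities over experts and (a) counts hard top-k switches between neighbouring tokens, and (b) computes a soft penalty that is 0 when neighbouring tokens route identically and 1 when their routing is disjoint.

```python
def top_k(probs, k):
    """Indices of the k largest router probabilities for one token."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def count_switches(router_probs, k=1):
    """Number of neighbouring token pairs whose top-k expert sets differ."""
    sets = [top_k(p, k) for p in router_probs]
    return sum(1 for a, b in zip(sets, sets[1:]) if a != b)


def switch_penalty(router_probs):
    """Mean total-variation distance between neighbouring routing distributions.

    Hypothetical soft stand-in for a switch count: 0.0 if every token routes
    like its predecessor, 1.0 if neighbouring tokens share no routing mass.
    """
    pairs = list(zip(router_probs, router_probs[1:]))
    if not pairs:
        return 0.0  # a single token (or none) cannot switch
    tv = [0.5 * sum(abs(x - y) for x, y in zip(p, q)) for p, q in pairs]
    return sum(tv) / len(tv)


# Three tokens, three experts: the first two tokens prefer expert 0,
# the third prefers expert 2.
probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]
hard_switches = count_switches(probs, k=1)
soft_penalty = switch_penalty(probs)

# Assumed usage: add a small multiple of the penalty to the usual loss.
lam = 0.01          # hypothetical weight, would need tuning
task_loss = 2.5     # placeholder value, not a real measurement
total_loss = task_loss + lam * soft_penalty
```

In a real framework the same quantity would be computed on the router's softmax output so gradients can flow through it; whether it actually buys longer expert runs without hurting quality is exactly the open question above.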