yorwba 3 days ago:
Mixture-of-Depths trains the model to choose a different number of layers for each token, in order to reduce inference compute. This method is more like stochastic depth / layer dropout: whether the intermediate layers are skipped for a token is random and independent of the token's value, and they only use it as a training optimization. As far as I can tell, all tokens are still processed by all layers during inference.
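
To make the contrast concrete, here's a minimal PyTorch sketch. The class names and the top-k capacity router are my own illustrative assumptions, not either paper's exact code: the point is just that stochastic depth makes one random skip/keep decision for the whole block during training and always runs the block at inference, while a MoD-style block routes tokens based on their values and keeps doing so at inference.

    import torch
    import torch.nn as nn

    class StochasticDepthBlock(nn.Module):
        # Stochastic depth / layer dropout: during training the whole
        # block is skipped at random, independent of the token values;
        # in eval mode it always runs, so at inference every token is
        # processed by every layer.
        def __init__(self, dim, skip_prob=0.1):
            super().__init__()
            self.layer = nn.Linear(dim, dim)  # stand-in for a transformer block
            self.skip_prob = skip_prob

        def forward(self, x):
            if self.training and torch.rand(()).item() < self.skip_prob:
                return x  # identity: skipped for ALL tokens at once
            return x + self.layer(x)

    class MixtureOfDepthsBlock(nn.Module):
        # MoD-style routing: a learned router scores each token and only
        # the top-k tokens per sequence go through the block. The decision
        # depends on the token's value and applies at inference too, which
        # is where the compute savings come from.
        def __init__(self, dim, capacity=0.5):
            super().__init__()
            self.layer = nn.Linear(dim, dim)
            self.router = nn.Linear(dim, 1)
            self.capacity = capacity  # fraction of tokens processed per sequence

        def forward(self, x):  # x: (batch, seq, dim)
            scores = self.router(x).squeeze(-1)      # (batch, seq)
            k = max(1, int(self.capacity * x.shape[1]))
            routed = scores.topk(k, dim=1).indices   # per-sequence token indices
            out = x.clone()
            for b in range(x.shape[0]):
                sel = routed[b]
                out[b, sel] = x[b, sel] + self.layer(x[b, sel])
            return out

Note that once model.eval() is called, StochasticDepthBlock degenerates to a plain residual layer (full depth for every token), whereas the MoD-style router keeps skipping tokens, which is why only the latter reduces inference compute.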