yorwba 3 days ago:
Mixture-of-Depths trains the model to choose a different number of layers for each token, in order to reduce inference compute. This method is more like stochastic depth / layer dropout: whether the intermediate layers are skipped for a token is random and independent of the token's value, and they only use it as a training optimization. As far as I can tell, all tokens are still processed by all layers during inference.
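
To make the contrast concrete, here's a minimal PyTorch sketch. The class names and the top-k capacity router are my own illustrative assumptions, not either paper's exact code: the point is just that stochastic depth makes one random skip/keep decision for the whole block during training and always runs the block at inference, while a MoD-style block routes tokens based on their values and keeps doing so at inference.

    import torch
    import torch.nn as nn

    class StochasticDepthBlock(nn.Module):
        # Stochastic depth / layer dropout: during training the whole
        # block is skipped at random, independent of the token values;
        # in eval mode it always runs, so at inference every token is
        # processed by every layer.
        def __init__(self, dim, skip_prob=0.1):
            super().__init__()
            self.layer = nn.Linear(dim, dim)  # stand-in for a transformer block
            self.skip_prob = skip_prob

        def forward(self, x):
            if self.training and torch.rand(()).item() < self.skip_prob:
                return x  # identity: skipped for ALL tokens at once
            return x + self.layer(x)

    class MixtureOfDepthsBlock(nn.Module):
        # MoD-style routing: a learned router scores each token and only
        # the top-k tokens per sequence go through the block. The decision
        # depends on the token's value and applies at inference too, which
        # is where the compute savings come from.
        def __init__(self, dim, capacity=0.5):
            super().__init__()
            self.layer = nn.Linear(dim, dim)
            self.router = nn.Linear(dim, 1)
            self.capacity = capacity  # fraction of tokens processed per sequence

        def forward(self, x):  # x: (batch, seq, dim)
            scores = self.router(x).squeeze(-1)      # (batch, seq)
            k = max(1, int(self.capacity * x.shape[1]))
            routed = scores.topk(k, dim=1).indices   # per-sequence token indices
            out = x.clone()
            for b in range(x.shape[0]):
                sel = routed[b]
                out[b, sel] = x[b, sel] + self.layer(x[b, sel])
            return out

Note that once model.eval() is called, StochasticDepthBlock degenerates to a plain residual layer (full depth for every token), whereas the MoD-style router keeps skipping tokens, which is why only the latter reduces inference compute.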