Remix clone Hacker News

new | show | ask | jobs Github

	▲	tgv 2 days ago
		Then I don't understand why it would matter. Or does it really mean that for each input token 10% of the total network runs, and then another 10% for the next token, rather than running each 10 batches of 10% for each token? If so, any idea or pointer to how the selection works?
	▲	kouteiheika 2 days ago \| parent [-]
		Yes, for each token only, say, 10% of the weights are necessary, so you don't have to fetch the remaining 90% from memory, which makes inference much faster (if you're memory bound; if you're doing single batch inference then you're certainly memory bound). As to how the selection works - each mixture-of-experts layer in the netwosk has essentially a small subnetwork called a "router" which looks at the input and calculates the scores for each expert; then the best scoring experts are picked and the inputs are only routed to them.