gavmor 2 days ago
How does the mixture-of-experts architecture work? Are the experts debating, or merely delegating? From what I've read, for each token or input patch, the gate computes a set of probabilities (or scores) over the experts, then selects a small subset (often the top-k) and routes that input only to those. I.e., each expert computes its own transformation of the same original input (or a shared intermediate representation), and their outputs are combined at the next layer via the gate's weights. That's post hoc combination, not B reasoning over A's reasoning.
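A toy NumPy sketch of the routing described above (all names, sizes, and the linear "experts" are made up for illustration; real MoE layers use learned MLP experts and batched routing):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 8, 4, 2  # hypothetical dimensions

# Each "expert" is just an independent linear transform here.
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
W_gate = rng.standard_normal((d, n_experts))

def moe_forward(x):
    # The gate scores all experts from the SAME input x.
    logits = x @ W_gate
    top = np.argsort(logits)[-k:]                  # pick the top-k experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                           # renormalized gate weights
    # Each selected expert transforms x independently; no expert ever
    # sees another expert's output -- outputs are just weighted and summed.
    return sum(p * (x @ experts[i]) for p, i in zip(probs, top))

y = moe_forward(rng.standard_normal(d))
print(y.shape)  # (8,)
```

Note that the combination is a weighted sum of parallel outputs, which is exactly why it's delegation rather than debate: the gate decides who works and how much their answer counts, but the experts never interact.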