gavmor 2 days ago

How does a mixture-of-experts architecture work? Are the experts debating, or merely being delegated to?

From what I've read, for each token or input patch, the gate computes a set of probabilities (or scores) over the experts, then selects a small subset (often the top-k) and routes that input only to those.

I.e., each selected expert computes its own transformation on the same original input (or a shared intermediate representation), and their outputs are then combined as a weighted sum using the gate's weights before being passed on to the next layer.

That’s post hoc combination, not B reasoning over A’s reasoning.
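
Here's a rough sketch of that routing in plain Python, in case it helps. It's my own toy version, not any particular model's code: the names (moe_layer, gate_weights, expert_weights) are made up, the experts are single linear maps rather than the MLPs real models use, and I'm assuming the gate is renormalised with a softmax over just the selected top-k scores (some implementations softmax over all experts first).

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a small list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def moe_layer(x, gate_weights, expert_weights, k=2):
    """Route one token vector x through the top-k of several toy linear experts."""
    # 1. Gate: one score per expert, computed from the token itself.
    logits = matvec(gate_weights, x)
    # 2. Keep only the k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # 3. Renormalise over the selected experts (assumption, see above).
    weights = softmax([logits[i] for i in top])
    # 4. Each selected expert sees the same x and never sees another
    #    expert's output; the results are just summed with the gate weights.
    out = [0.0] * len(expert_weights[0])
    for w, i in zip(weights, top):
        y = matvec(expert_weights[i], x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top, weights

# Toy setup: 8 experts, 4-dimensional tokens, random weights.
rng = random.Random(0)
dim, n_experts = 4, 8
gate_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
expert_weights = [
    [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
    for _ in range(n_experts)
]
x = [rng.gauss(0, 1) for _ in range(dim)]

out, chosen, weights = moe_layer(x, gate_weights, expert_weights, k=2)
print("experts used:", chosen, "gate weights:", weights)
```

Note that step 4 is the only place the experts' outputs meet, and it's a weighted sum. Nothing in the loop feeds one expert's result into another, which is why I'd call it delegation rather than debate.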