gavmor 2 days ago
How does the mixture-of-experts architecture work? Are the experts debating, or merely delegating? From what I've read, for each token or input patch, the gate computes a set of probabilities (or scores) over the experts, then selects a small subset (often the top-k) and routes that input only to those. I.e., each expert computes its own transformation of the same original input (or a shared intermediate representation), and their outputs are combined at the next layer via the gate's weights. That's post hoc combination, not B reasoning over A's reasoning.
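A toy NumPy sketch of the routing described above (all names, sizes, and the linear "experts" are made up for illustration; real MoE layers use learned MLP experts and batched routing):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 8, 4, 2  # hypothetical dimensions

# Each "expert" is just an independent linear transform here.
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
W_gate = rng.standard_normal((d, n_experts))

def moe_forward(x):
    # The gate scores all experts from the SAME input x.
    logits = x @ W_gate
    top = np.argsort(logits)[-k:]                  # pick the top-k experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                           # renormalized gate weights
    # Each selected expert transforms x independently; no expert ever
    # sees another expert's output -- outputs are just weighted and summed.
    return sum(p * (x @ experts[i]) for p, i in zip(probs, top))

y = moe_forward(rng.standard_normal(d))
print(y.shape)  # (8,)
```

Note that the combination is a weighted sum of parallel outputs, which is exactly why it's delegation rather than debate: the gate decides who works and how much their answer counts, but the experts never interact.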