Remix.run Logo
amarshall 3 hours ago

Can take the geometric mean of total and active parameters of MoE to get approximate equivalent quality to dense model params. So sqrt(35*10)≈18.7.

The trade-off of MoE is that it is worse but faster for the same total size.