htrp 4 hours ago

Why 27B vs 35B? Is MoE that much worse for coding?

amarshall an hour ago | parent | next [-]

You can take the geometric mean of an MoE's total and active parameter counts to approximate the dense model size of equivalent quality. So sqrt(35*10)≈18.7, i.e. roughly a 19B dense model.

The trade-off of MoE is that it is worse but faster for the same total size.
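
For anyone who wants to plug in other sizes, here is a minimal sketch of the geometric-mean rule of thumb above. It is only a heuristic, not a measured result; the function name `dense_equivalent` is made up for illustration, and the 35/10 figures are the ones from this comment.

```python
import math

def dense_equivalent(total_b: float, active_b: float) -> float:
    """Rough dense-equivalent size (in billions of params) for an MoE.

    Heuristic only: geometric mean of total and active parameter counts.
    """
    return math.sqrt(total_b * active_b)

# Figures from the comment above: 35B total, 10B active.
print(round(dense_equivalent(35, 10), 1))  # 18.7
```

By this estimate the 35B MoE lands below a 27B dense model, which is the gap the parent was asking about.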

electronsoup 3 hours ago | parent | prev [-]

Yeah, MoE is a little worse for the same size, but you can often run bigger MoEs at respectable speeds even with CPU RAM offload. Dense models really need to fit 100% in VRAM.