htrp 4 hours ago
Why 27b vs 35b? Is MoE that much worse for coding?
amarshall an hour ago
As a rough rule of thumb, you can take the geometric mean of an MoE's total and active parameter counts to estimate the dense model size of roughly equivalent quality. So sqrt(35*10)≈18.7, i.e. about an 18.7b dense model. The trade-off with MoE is that, at the same total size, it is lower quality but faster than a dense model.
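A minimal sketch of that rule of thumb, using the figures from the comment above (35 total, 10 active, in billions). The function name `dense_equivalent_params` is made up for illustration, and the geometric mean is only a heuristic, not a measured result:

```python
import math


def dense_equivalent_params(total_b: float, active_b: float) -> float:
    """Estimate the dense-model size (in billions of parameters) of roughly
    equivalent quality to an MoE, as the geometric mean of its total and
    active parameter counts. Heuristic only."""
    if total_b <= 0 or active_b <= 0:
        raise ValueError("parameter counts must be positive")
    return math.sqrt(total_b * active_b)


# Figures taken from the comment: 35b total, 10b active.
estimate = dense_equivalent_params(35, 10)
print(f"{estimate:.1f}b")  # prints "18.7b"
```

By this estimate, a 27b dense model would sit above the 35b MoE in quality, which matches the question being asked.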
electronsoup 3 hours ago
Yeah, MoE is a little worse for the same size, but you can often run bigger MoEs at respectable speeds even with CPU RAM offload. Dense models really need to fit 100% in VRAM.