| ▲ | swalsh a day ago | |
It's coding to coding. I couldn't care less how the model is architected; I only care how it performs in a real-world scenario. | ||
| ▲ | petu a day ago | |
If you don't care how it's architected, why do you care about size? Compare it to Q3.5 397B-A17B. Just as smaller models are a speed / cost optimization, so is MoE. G4 26B-A4B runs at 150 t/s on a 4090/5090 and 80 t/s on an M5 Max. Q3.5 35B-A3B is comparably fast. These are flash-lite/nano class models. G4 31B, despite a small increase in total parameter count, is over 5 times slower. Q3.5 27B is comparably slow. These approximate flash/mini class models (I believe the proprietary models in this class are closer in size to Q3.5 122B-A10B or Llama 4 Scout 109B-A17B). | ||
| ▲ | daemonologist a day ago | |
The implication is that there is (or should be) a major speed difference - naively you'd expect the MoE to be 10x faster and cheaper, which can be pretty relevant on real-world tasks. | ||
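For a rough sense of where a "naive" figure like that comes from, here is a minimal back-of-the-envelope sketch. It assumes per-token decode cost scales with the number of parameters touched per token (all of them for a dense model, only the active ones for an MoE) and ignores attention/KV-cache cost, routing overhead, and batching. The function name `naive_decode_speedup` is hypothetical, and the parameter counts are just the ones quoted in this thread.

```python
def naive_decode_speedup(dense_params_b: float, moe_active_params_b: float) -> float:
    """Naive speedup of an MoE over a dense model during decoding.

    Assumption (not a measurement): per-token cost is proportional to the
    parameters read per token, so the ratio is dense total / MoE active.
    """
    if dense_params_b <= 0 or moe_active_params_b <= 0:
        raise ValueError("parameter counts must be positive")
    return dense_params_b / moe_active_params_b


# Pairs taken from the model names quoted in the thread: (dense B, MoE active B).
pairs = {
    "G4 31B vs G4 26B-A4B": (31.0, 4.0),
    "Q3.5 27B vs Q3.5 35B-A3B": (27.0, 3.0),
}

for label, (dense_b, active_b) in pairs.items():
    ratio = naive_decode_speedup(dense_b, active_b)
    print(f"{label}: naive speedup ~{ratio:.2f}x")
```

Under these assumptions the naive ratios land in the high single digits, which is in the same ballpark as the "10x" guess and somewhat above the "over 5 times" gap reported upthread. Real throughput will depend on hardware, quantization, and runtime.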