moffkalast 3 hours ago
42B active params, sliding window attention. There's your tradeoff.
vlovich123 3 hours ago
Sliding window is for the draft model, not the main one. It's 42B active params because it's a sparse MoE, a common technique that keeps larger models from being bottlenecked by memory bandwidth.
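A back-of-the-envelope sketch of the bandwidth argument, assuming single-stream decoding where every active weight is read from memory once per token. Only the 42B active figure comes from the thread; the total parameter count, weight precision, and memory bandwidth below are hypothetical values picked for illustration, and the estimate ignores KV cache reads and compute limits.

```python
def decode_tokens_per_second_upper_bound(
    active_params: float,
    bytes_per_param: float,
    bandwidth_bytes_per_s: float,
) -> float:
    """Rough ceiling on decode speed when weight reads dominate.

    Assumes each generated token streams every active weight from
    memory exactly once (batch size 1, no KV cache traffic counted).
    """
    return bandwidth_bytes_per_s / (active_params * bytes_per_param)


ACTIVE_PARAMS = 42e9               # from the thread: 42B active params
HYPOTHETICAL_TOTAL_PARAMS = 400e9  # assumption: total size is not stated in the thread
BYTES_PER_PARAM = 1.0              # assumption: 8-bit weights
BANDWIDTH_BYTES_PER_S = 1e12       # assumption: 1 TB/s memory bandwidth

# Sparse MoE: only the active params are read per token.
moe_bound = decode_tokens_per_second_upper_bound(
    ACTIVE_PARAMS, BYTES_PER_PARAM, BANDWIDTH_BYTES_PER_S
)

# Dense model of the same (hypothetical) total size: all params are read per token.
dense_bound = decode_tokens_per_second_upper_bound(
    HYPOTHETICAL_TOTAL_PARAMS, BYTES_PER_PARAM, BANDWIDTH_BYTES_PER_S
)

print(f"sparse MoE ceiling: {moe_bound:.1f} tok/s")
print(f"dense ceiling:      {dense_bound:.1f} tok/s")
```

Under these assumptions the ceiling scales inversely with the number of weights touched per token, which is the whole point of keeping the active count low. With larger batches, different tokens can route to different experts, so the advantage shrinks.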
bearjaws 2 hours ago
Given how "smart" some of the 26b dense models are now, I would not be surprised to see a strong 40b MoE.