vlovich123 3 hours ago
The sliding window is for the draft model, not the main one. It's 42B active params because it's a sparse MoE, which is a common technique for larger models to avoid getting bottlenecked by memory bandwidth.
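To put a rough number on the memory-bandwidth point: during single-stream decoding, each generated token has to read roughly the active weights once, so the active parameter count sets a ceiling on tokens per second. The sketch below is back-of-the-envelope arithmetic only. The 42B figure comes from the comment above; the 4-bit weight size is an assumption based on "FP4" in the repo name, and the 1 TB/s bandwidth is a hypothetical value chosen for illustration. It ignores KV-cache reads, activations, and batching.

```python
def decode_bytes_per_token(active_params: float, bits_per_param: float) -> float:
    """Approximate bytes of weights read from memory to generate one token."""
    return active_params * bits_per_param / 8


def max_tokens_per_second(active_params: float, bits_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-bound upper limit on single-stream decode speed."""
    return bandwidth_bytes_per_s / decode_bytes_per_token(active_params, bits_per_param)


ACTIVE_PARAMS = 42e9   # active params, from the comment above
BITS_PER_PARAM = 4     # assumption: FP4 weights, as the repo name suggests
BANDWIDTH = 1e12       # hypothetical accelerator with 1 TB/s memory bandwidth

gb_per_token = decode_bytes_per_token(ACTIVE_PARAMS, BITS_PER_PARAM) / 1e9
ceiling = max_tokens_per_second(ACTIVE_PARAMS, BITS_PER_PARAM, BANDWIDTH)
print(f"~{gb_per_token:.0f} GB of weights read per token")
print(f"at most ~{ceiling:.1f} tokens/s at 1 TB/s")
```

A dense model with the same total parameter count would have to read all of its weights per token, which is why activating only a subset of experts helps decode speed.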
moffkalast 2 hours ago
According to the spec [0] it seems to be for both, though maybe the spec is wrong. 128 sounds really tiny; I wonder if they mean some kind of blocks?

[0] https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#4...
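For a sense of how small 128 would be if it is counted in tokens, here is a minimal sketch of a causal sliding-window attention mask. It is not taken from the model's code: the function name is hypothetical, and it assumes the window counts the current token plus the previous `window - 1` tokens (implementations differ on whether the current token is included). It says nothing about whether the spec actually means tokens or blocks.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Build a causal sliding-window mask.

    mask[q][k] is True when query position q may attend to key position k,
    i.e. k is not in the future and is within the last `window` positions.
    """
    return [[0 <= q - k < window for k in range(seq_len)]
            for q in range(seq_len)]


# Small case that is easy to read: each row sees at most 3 positions.
for row in sliding_window_mask(seq_len=6, window=3):
    print("".join("x" if allowed else "." for allowed in row))

# With window=128, the last token of a 1000-token context sees only 128 keys.
last_row = sliding_window_mask(seq_len=1000, window=128)[-1]
print(sum(last_row), "of", len(last_row), "positions visible")
```

Note that stacking layers still lets information travel further than one window, since each layer can pass along what the previous layer saw.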
| ||||||||