moffkalast 3 hours ago

42B active params, sliding window attention. There's your tradeoff.

vlovich123 3 hours ago | parent | next [-]

Sliding window is for the draft model, not the main one. It has 42B active params because it’s a sparse MoE, which is a common technique for larger models to avoid being bottlenecked by memory bandwidth.
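
To illustrate why a sparse MoE touches only part of its weights per token, here is a minimal sketch of top-k expert routing. The sizes (8 experts, top-2), the logits, and the name `route_top_k` are illustrative assumptions, not MiMo's actual configuration or code.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Hypothetical helper for illustration only; real routers run on tensors
    and often add load-balancing terms that are omitted here.
    """
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (subtract max for stability).
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Made-up router scores for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.2]
chosen = route_top_k(logits, k=2)

# Only the chosen experts' weights are read for this token.
active_fraction = len(chosen) / len(logits)
```

With these toy numbers only 2 of 8 experts run for the token, so roughly a quarter of the expert weights have to be read from memory for it, which is the bandwidth argument in a nutshell.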

moffkalast 2 hours ago | parent [-]

Seems to be for both according to the spec [0]; maybe it's wrong though.

128 sounds really tiny, I wonder if they mean some kind of blocks?

[0] https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#4...

E-Reverance 2 hours ago | parent [-]

No

> It uses 384 routed experts (top-8) with hybrid attention (full-attention + sliding-window 128 at 6:1 ratio) over 70 layers (1 dense + 69 MoE)

https://recipes.vllm.ai/XiaomiMiMo/MiMo-V2.5-Pro
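
For a sense of what "sliding-window 128" would mean in practice, here is a rough sketch comparing a causal sliding-window mask with a full causal mask. It assumes the 128 is a window measured in tokens and that the window includes the current token; both are assumptions, and the function names are made up for illustration.

```python
def sliding_window_mask(seq_len, window):
    """mask[q][k] is True when query position q may attend to key position k.

    Causal (k <= q) and limited to the most recent `window` positions,
    counting the current token as part of the window (an assumption).
    """
    return [[(k <= q) and (q - k < window) for k in range(seq_len)]
            for q in range(seq_len)]

def full_causal_mask(seq_len):
    """Standard causal mask: every earlier position is visible."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]

seq_len = 256
swa = sliding_window_mask(seq_len, window=128)
full = full_causal_mask(seq_len)

# How many keys the last token can see under each scheme.
visible_swa = sum(swa[-1])
visible_full = sum(full[-1])
```

Under these assumptions the last token of a 256-token sequence sees 128 keys in a sliding-window layer and all 256 in a full-attention layer, so the window layers keep a fixed-size KV cache while the full-attention layers carry the long-range context.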

bearjaws 2 hours ago | parent | prev [-]

Given how "smart" some of the 26B dense models are now, I would not be surprised to see a strong 40B MoE.