boroboro4 | 5 days ago
> even with 0.0 temperature due to MOE models routing at a batch level, and you're very unlikely to get a deterministic batch.

I don't think this is correct: MoE routing happens on a per-token basis. It can become non-deterministic and batch-dependent if you balance expert load within a batch, but that's a performance optimization (like everything else in the blog post), not how the models are trained to work.
eldenring | 5 days ago | parent
Ah, interesting, good point. So I guess expert-choice routing does leak across the batch. Now I'm not sure.
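To make the distinction in this exchange concrete, here is a minimal numpy sketch (the router weights `W`, dimensions, and function names are all made up for illustration). With token-choice routing, each token picks its top-k experts from its own logits, so the assignment is independent of what else is in the batch; with expert-choice routing, each expert picks its top tokens from the whole batch, so the assignment can change when the batch composition changes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d = 4, 8
W = rng.normal(size=(d, n_experts))  # hypothetical router weights

def token_choice(x, k=1):
    """Token-choice: each token independently selects its top-k
    experts from its own logits; other tokens are irrelevant."""
    logits = x @ W                      # (n_tokens, n_experts)
    return np.argsort(logits, axis=-1)[:, -k:]

def expert_choice(x, capacity=2):
    """Expert-choice: each expert selects its top-`capacity` tokens
    across the whole batch, so which tokens an expert takes depends
    on which other tokens are present."""
    logits = x @ W                      # (n_tokens, n_experts)
    return np.argsort(logits, axis=0)[-capacity:, :]  # (capacity, n_experts)

batch = rng.normal(size=(6, d))
token = batch[:1]

# Token-choice: token 0 routes to the same expert whether it is
# processed alone or inside the larger batch.
assert np.array_equal(token_choice(token)[0], token_choice(batch)[0])

# Expert-choice: the selection is a function of the full batch, so
# token 0's membership in an expert's top-`capacity` set can change
# as other tokens are added or removed.
picks = expert_choice(batch)
print(picks.shape)  # (capacity, n_experts)
```

This is only a sketch of the two routing schemes under the stated assumptions, not the routing code of any particular model; batched load-balancing in production servers adds further batch coupling on top of this.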