eldenring 5 days ago
Very impressive! I guess this still wouldn't fix their original example:

> For example, you might observe that asking ChatGPT the same question multiple times provides different results.

even with 0.0 temperature, because MoE models route at the batch level, and you're very unlikely to get a deterministic batch.

> Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.

The router also leaks batch-level information across sequences.
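For anyone who hasn't read the post, here's a rough sketch (mine, not from the article, and the names/shapes are made up) of what "lacking batch invariance" means: the same input row can come out bitwise-different depending on how many other rows share the batch, because different batch sizes can pick different kernels or reduction orders. Whether you actually see a nonzero difference depends on your hardware, dtype, and library versions.

```python
import torch

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

d_model, d_out = 4096, 4096
weight = torch.randn(d_model, d_out, device=device, dtype=dtype)
x = torch.randn(1, d_model, device=device, dtype=dtype)

# Same row processed alone vs. packed into a larger batch of unrelated rows.
out_alone = x @ weight
filler = torch.randn(63, d_model, device=device, dtype=dtype)
out_batched = torch.cat([x, filler]) @ weight

# Bitwise comparison of the shared row. A nonzero difference means "our"
# request's output depended on the batch it happened to land in.
diff = (out_alone[0] - out_batched[0]).abs().max()
print(f"max abs difference for the same row: {diff.item()}")
```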
boroboro4 5 days ago | parent
> even with 0.0 temperature, because MoE models route at the batch level, and you're very unlikely to get a deterministic batch.

I don't think this is correct: MoE routing happens on a per-token basis. It can become non-deterministic and batch-dependent if you try to balance expert load within a batch, but that's a performance optimization (just like everything else in the blog post), not the way the models are trained to work.
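To make that concrete, here's a rough sketch of standard top-k routing (names and shapes are assumed, not any particular model's code): each token's expert assignment is a function of that token's hidden state alone, so adding or removing other sequences from the batch doesn't change which experts a given token is sent to.

```python
import torch

torch.manual_seed(0)
num_experts, top_k, d_model = 8, 2, 64
router = torch.nn.Linear(d_model, num_experts, bias=False)

def route(hidden):
    # hidden: [num_tokens, d_model] -> per-token expert indices
    logits = router(hidden)                        # router logits per token
    _, experts = logits.topk(top_k, dim=-1)        # top-k experts per token
    return experts

token = torch.randn(1, d_model)
other_tokens = torch.randn(31, d_model)

alone = route(token)
in_batch = route(torch.cat([token, other_tokens]))[:1]
# Expected True: the routing rule itself only looks at this token's logits.
print(torch.equal(alone, in_batch))
```

Batch dependence only creeps in if you bolt on something like per-batch expert capacity limits or load balancing, which can drop or re-route tokens depending on what else landed in the batch; that's a serving/throughput optimization on top of the routing function, not the routing function itself.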