Miraste 10 hours ago

What? 35B-A3B is not nearly as smart as 27B.

stratos123 4 hours ago

One interesting thing about Qwen3 is that, going by the benchmarks, the 35B-A3B model seems to be only a bit worse than the dense 27B one. This is very different from Gemma 4, where the 26B-A4B model is much worse than the 31B on several benchmarks (e.g. Codeforces, HLE).

zozbot234 3 hours ago

> This is very different from Gemma 4, where the 26B-A4B model is much worse than the 31B on several benchmarks (e.g. Codeforces, HLE).

Wouldn't you totally expect that, since 26B-A4B is lower on both total and active params? The more sensible comparison would pit Qwen 27B against Gemma 31B, and Gemma 26B-A4B against Qwen 35B-A3B.
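
To make that concrete, here's a rough back-of-the-envelope sketch in Python; the parameter counts are just read off the model names, assuming the usual <total>B-A<active>B convention (so 35B-A3B ~= 35B total params, ~3B active per token):

    # Rough sketch: lay out total vs. active params for the four models,
    # assuming the "<total>B-A<active>B" naming convention.
    models = {
        "Qwen 27B (dense)":  (27e9, 27e9),  # dense: all params active
        "Qwen 35B-A3B":      (35e9, 3e9),
        "Gemma 31B (dense)": (31e9, 31e9),
        "Gemma 26B-A4B":     (26e9, 4e9),
    }

    for name, (total, active) in models.items():
        print(f"{name:20s} total={total/1e9:5.0f}B  "
              f"active={active/1e9:4.0f}B  ({active/total:.0%} active)")

26B-A4B trails 31B in both columns, so a benchmark gap within the Gemma family is exactly what you'd predict; matching dense against dense and MoE against MoE controls for that.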

ekianjo 10 hours ago

Yeah, the 27B feels like something completely different. If you use it on long-context tasks it performs WAY better than 35B-A3B.

Der_Einzige 9 hours ago

I've been telling analysts/investors for a long time that dense architectures aren't "worse" than sparse MoEs, and to keep anticipating the see-saw of releases between those two sub-architectures. Glad to be continually vindicated on this one.

For those who don't believe me: go take a look at the logprobs of a MoE model and a dense model, and let me know if you notice anything. Researchers sure did.
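
If you want to check for yourself, here's a minimal sketch using Hugging Face transformers that dumps the top next-token logprobs; the model ID and prompt are placeholders, swap in whichever dense and MoE checkpoints you want to compare:

    # Minimal sketch: print the top-k next-token log-probabilities for a
    # model, so you can eyeball how peaked or flat a dense vs. a MoE
    # distribution looks. The model ID below is a placeholder.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def top_logprobs(model_id, prompt, k=10):
        tok = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
        model.eval()
        with torch.no_grad():
            inputs = tok(prompt, return_tensors="pt")
            logits = model(**inputs).logits[0, -1]  # last position = next token
        logprobs = torch.log_softmax(logits, dim=-1)
        top = torch.topk(logprobs, k)
        return [(tok.decode(idx), lp.item())
                for idx, lp in zip(top.indices, top.values)]

    # Run once per checkpoint and compare the two lists side by side.
    for token, lp in top_logprobs("your-model-id-here", "The capital of France is"):
        print(f"{lp:8.3f}  {token!r}")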

zkmon 10 hours ago

Yes.