irthomasthomas | 4 days ago
I realize my SpongeBob post came off flippant, and that wasn't the intent. The SpongeBob ASCII test (picked up from Qwen's own Twitter) is explicitly a rote-memorization probe; bigger dense models usually ace it because sheer parameter count can store the sequence. With Qwen3's sparse MoE, though, the path to that memory is noisier: there are two extra stochastic draws, (a) which expert(s) fire and (b) which token gets sampled from them. Add the new gated-attention and multi-token heads and you've got a pipeline where a single routing flake or a dud expert can break vertical alignment halfway down the picture. Anyway, I think qwen3-coder was uniquely trained on this, so it's not a fair comparison. Here are some other Qwen3 models (a toy sketch of those two stochastic draws follows the list):
Model: chutes/Qwen/Qwen3-235B-A22B
Model: chutes/Qwen/Qwen3-235B-A22B-Instruct-2507
Model: chutes/Qwen/Qwen3-235B-A22B-Thinking-2507
Model: chutes/Qwen/Qwen3-Next-80B-A3B-Instruct
Model: chutes/Qwen/Qwen3-Next-80B-A3B-Thinking
Model: chutes/Qwen/Qwen3-30B-A3B-Instruct-2507