Remix clone Hacker News

new | show | ask | jobs Github

	▲	irthomasthomas 4 days ago
		I realize my SpongeBob post came off flippant, and that wasn't the intent. The Spongebob ASCII test (picked up from Qwen's own Twitter) is explicitly a rote-memorization probe; bigger dense models usually ace it because sheer parameter count can store the sequence With Qwen3's sparse-MoE, though, the path to that memory is noisier: two extra stochastic draws (a) which expert(s) fire, (b) which token gets sampled from them. Add the new gated-attention and multi-token heads and you've got a pipeline where a single routing flake or a dud expert can break vertical alignment halfway down the picture. Anyway, I think qwen3-coder was uniquely trained on this - so it's not a fair comparison. Here are some other qwen3 models: Model: chutes/Qwen/Qwen3-235B-A22B `/~\ ( * * ) ( o o o ) \ - / \ /\ / \ / \/ /\|\|\|\ /\|\|\|\|\|\ /\|\|\|\|\|\|\|\|\ ( o o o ) \ W / \___/` Model: chutes/Qwen/Qwen3-235B-A22B-Instruct-2507 `/\_/\ ( o.o ) > ^ < / \ \| \| \| \| \ / '-'-'` Model: chutes/Qwen/Qwen3-235B-A22B-Thinking-2507 `.-----------. \| [] [] \| \| \| \| __ __ \| \| \| \|\| \| \| \| \|__\|\|__\| \| \| \| '-----------'` Model: chutes/Qwen/Qwen3-Next-80B-A3B-Instruct `__ / \ / \ / \ / \ /__________\ \| o o \| \| __ \| \| \____/ \| \| \| \| ________ \| \| \ / \| \| \____/ \| \| \| \|____________\| ___________ / \ / \ \| _______ \| \| \| \| \| \| \| ___ \| \| \| \| \| \| \| \| \| \| \|___\| \| \| \| \|_______\| \| \| \| \|_______________\|` Model: chutes/Qwen/Qwen3-Next-80B-A3B-Thinking `.-. / \ \| o o\| \| > \| \| ---\| \___/` Model: chutes/Qwen/Qwen3-30B-A3B-Instruct-2507 `_________________________ / \ \| _ _ _ _ \| \| / \ / \ / \ / \ \| \| \| \| \| \| \| \| \| \| \| \| \_/ \_/ \_/ \_/ \| \| \| \| _ _ _ _ \| \| / \ / \ / \ / \ \| \| \| \| \| \| \| \| \| \| \| \| \_/ \_/ \_/ \_/ \| \| \| \| SpongeBob SquarePants \| \|_________________________\|`