slekker 4 hours ago

What about Qwen? Does it get that right?

lambda 3 hours ago | parent [-]

I've run several local models that get this right. Qwen 3.5 122B-A10B gets this right, as does Gemma 4 31B. These are local models I'm running on my laptop GPU (Strix Halo, 128 GiB of unified RAM).

And I've been using this as a regular test when changing various parameters, so I've run it several times; these models get it consistently right. It's amazing that Opus 4.7 whiffs it, given that these models are a couple of orders of magnitude smaller, at least if the rumors about Opus's size are true.

qingcharles 3 hours ago | parent [-]

Does Gemma 4 31B run full res on Strix or are you running a quantized one? How much context can you get?

lambda 2 hours ago | parent [-]

I'm running an 8-bit quant right now, mostly for speed: memory bandwidth is the limiting factor, and 8-bit quants generally lose very little compared to full res. It also saves RAM.

I'm still tweaking the settings, and I'm hitting OOM fairly often right now. It turns out the sliding-window attention context is huge, and llama.cpp wants to keep lots of context snapshots.
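For anyone fighting the same OOM issue, this is roughly the kind of launch I mean. The GGUF filename is a placeholder for whatever quant you actually have; the flags are standard llama.cpp options for capping context and shrinking the KV cache:

```shell
# Sketch of a llama-server launch for a large local model on unified memory.
#   -c                      caps the context length, which bounds KV-cache memory
#   -ngl 99                 offloads all layers to the GPU
#   --cache-type-k/v q8_0   quantizes the KV cache itself to save RAM
llama-server -m gemma-4-31b-q8_0.gguf -c 16384 -ngl 99 \
  --cache-type-k q8_0 --cache-type-v q8_0
```

Lowering `-c` until the model loads, then raising it again, is the quickest way I know to find the memory ceiling on a given box.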

qingcharles an hour ago | parent [-]

I had a whole bunch of trouble getting Gemma 4 working properly, mostly because not many people are running it yet, so there isn't much documentation on how to set it up correctly.

It is a fantastic model when it works, though! Good luck :)