Qwen 3.6 is far ahead of Gemma for most (but not all) things. I've deployed it out across a number of M5 MacBooks and it's genuinely useful for many tasks. It won't replace an Opus or current gen Sonnet sized model but it's still amazingly good for its size and probably as good as or just a bit before Sonnet 4 era. Far more reliable for tool calling, coding, agentic tasks and faster than the Gemma models especially with MTP.

▲

zozbot234 2 hours ago | parent | next [-]

Qwen 3.6 is a toy compared to DeepSeek V4 Flash or Pro. These models can now run on Apple Silicon hardware with as little as 32GB RAM for the Flash (with 2-bit quant, which is still quite capable) using SSD offloading, with just-about-reasonable performance for interactive use, and far better performance on longer contexts than Qwen (due to the more efficient KV cache/attention mechanisms in DeepSeek).

Very significant improvements may be viable for unattended inference via large-scale batches, which can reuse sparse experts and thereby mask some of the latency involved - this is quite unique to DeepSeek, again due to its efficient KV cache.

▲

greenavocado 2 hours ago | parent [-]

Qwen 3.6 27B still curb stomps Deepseek V4 in coding

▲

epolanski an hour ago | parent [-]

1. Deepseek V4 is still in preview (training is not finished)

2. Qwen is much more demanding and borderline unusable on consumer hardware because it's a dense model. The 27B parameters are active all time for each token. It's not a MoE architecture where a router activates only some of them.

3. Qwen doesn't like quantization at all.

	▲	kgeist 9 minutes ago \| parent \| next [-]
		I have to disagree with most claims. I run Qwen3.6-27b at 260k context and 40-60 tok/sec. It handles most coding problems as well as Sonnet 4.6 under OpenCode on our production tasks. (As an experiment, I run the same prompts for the same issues in parallel for Qwen 3.6 and Sonnet 4.6 and usually see little difference in performance). I see zero degradation from quantization in practice. Settings: RTX 5090, 5-bit weights (Unsloth), FP8 KV cache. Last time I tried running large MoEs on this PC, they had inferior quality at 2-3 bits compared to much smaller dense models at 5-6 bits, and were slower anyway.
	▲	trollbridge 8 minutes ago \| parent \| prev [-]
		You can run the 35B A3B model which is an MoE. Runs great on a 5090.

▲

epolanski an hour ago | parent | prev | next [-]

Qwen suffers quantization a lot, rendering it borderline unusable.

▲

Pxtl 2 hours ago | parent | prev [-]

I've got a Qwen 3.5 running on a 12GB 3060 and it's dumb as a stump but still smart enough to get some useful work done. Since it's my daily driver desktop I havent jumped to 3.6 since last time I did I quickly ran out of vram and locked the desktop environment.

But yeah, the Qwen line is pretty impressive on commodity hardware.

	▲	derefr 2 hours ago \| parent [-]
		I must be using LLMs very differently than y'all, because I can't think of a single thing I would rely on an LLM that's "dumb as a stump" to do for me. To me, LLMs are for asking research questions + exploring design spaces + pointing at codebases to investigate bugs. And those all benefit from the model being as "smart" (in terms of both fluid intelligence and burned-in knowledge) as possible. I'm guessing there exist problems where "intelligence past a certain point" doesn't matter, so these medium-sized models can match the performance of the bigger models. But what problems might those be?