Remix clone Hacker News

	▲	gcr 2 hours ago
		There are two flavors of Qwen 3.6:
		- A 27B "dense" model
		- A 35B "Mixture of Experts" (MoE) model, which activates only 3B parameters for each token.

		For your hardware, I strongly recommend `unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M`. I have a 2021 M1 Max with 32GB of unified memory; with llama-cpp's default settings, it reads (prompt processing) at ~300-500 tokens/sec and writes (generation) at ~30 tokens/sec, which is plenty fast. The 27B model reads at ~70 tok/sec and writes at ~5 tok/sec.

		The 35B MoE model technically takes slightly more memory, but it is much faster because it does about 1/9th the work per token (3B active parameters versus 27B). It's not quite as "smart", but it's comparable.
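A rough back-of-the-envelope sketch of that arithmetic, in Python. The 4.8 bits-per-weight figure for Q4_K_M is an assumption (an approximate average; real GGUF files vary), and the function names are made up for illustration. It counts weights only and ignores the KV cache and runtime overhead.

```python
# Assumed average bits per weight for a Q4_K_M quant (approximate, not exact).
ASSUMED_BITS_PER_WEIGHT = 4.8


def weight_memory_gb(total_params_billions, bits_per_weight=ASSUMED_BITS_PER_WEIGHT):
    """Approximate weight size in GB: params (billions) * bits per weight / 8 bits per byte."""
    return total_params_billions * bits_per_weight / 8


def active_ratio(active_params_billions, dense_params_billions):
    """Fraction of per-token work the MoE does relative to the dense model."""
    return active_params_billions / dense_params_billions


dense_gb = weight_memory_gb(27)  # 27B dense model
moe_gb = weight_memory_gb(35)    # 35B MoE model (all experts must be resident)
ratio = active_ratio(3, 27)      # 3B active parameters vs 27B dense

print(f"27B dense weights: ~{dense_gb:.1f} GB")
print(f"35B MoE weights:   ~{moe_gb:.1f} GB")
print(f"MoE per-token work: ~1/{round(1 / ratio)} of the dense model")
```

Under that assumption, both sets of weights fit in 32GB, the MoE needs a few GB more, and it touches roughly a ninth of the parameters per token.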
	▲	julianlam an hour ago
		May I ask why the M variant instead of XL? Obviously bigger != better, but I don't know what the differences are.
	▲	pixelesque an hour ago
		Thank you - I'll give that a go!