arxell 8 hours ago

Each has its pros and cons. Dense models of equivalent total size obviously do run slower, all else being equal. However, 35A3B is absolutely not 'a lot smarter'... in fact, if you set aside the slower inference rates, Qwen3.5 27B is arguably more intelligent and reliable. I use both regularly on a Strix Halo system. Just see the comparison table here: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF . The problem you have to acknowledge when running locally (especially for coding tasks) is that your primary bottleneck quickly becomes prompt processing (NOT token generation), and here the differences between dense and MoE are variable and usually negligible.
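A rough way to see the bottleneck point is to split a request into prefill time and decode time. The sketch below is back-of-the-envelope only: the function name, token counts, and tokens-per-second rates are placeholder assumptions, not measurements from any model or from Strix Halo hardware.

```python
def request_seconds(prompt_tokens, output_tokens, pp_rate, tg_rate):
    """Split one request into prefill and decode time.

    pp_rate: prompt-processing speed in tokens/s (hypothetical value)
    tg_rate: token-generation speed in tokens/s (hypothetical value)
    """
    prefill = prompt_tokens / pp_rate
    decode = output_tokens / tg_rate
    return prefill, decode


# Placeholder numbers for a coding-style request: long context, short answer.
prefill, decode = request_seconds(
    prompt_tokens=30_000, output_tokens=500, pp_rate=300, tg_rate=40
)
total = prefill + decode
print(f"prefill {prefill:.1f}s, decode {decode:.1f}s, "
      f"prefill share {prefill / total:.0%}")
```

With a long prompt and a short reply, the prefill term dominates even when generation is fast, so a faster `tg_rate` changes the total much less than you might expect.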

nunodonato 6 hours ago | parent | next [-]

I was hoping this would be the model to replace our Qwen3.5-27B, but the difference is marginal. Too risky, I'll pass and wait for the release of a dense version.

Mikealcl 7 hours ago | parent | prev [-]

Could you explain why prompt processing is the bottleneck, please? I've seen this behavior but I don't understand why.

zozbot234 6 hours ago | parent [-]

You should be able to save a lot on prefill by stashing KV-cache shared prefixes (since KV-cache for plain transformers is an append-only structure) to near-line bulk storage and fetching them in as needed. Not sure why local AI engines don't do this already since it's a natural extension of session save/restore and what's usually called prompt caching.
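To make the idea concrete, here is a minimal sketch of a disk-backed prefix store. Everything in it is an assumption for illustration: `PrefixKVStore`, `prefix_key`, `fake_prefill`, and the `BLOCK` size are hypothetical names, the "KV" entries are dummy tuples rather than real tensors, and no existing inference engine's API is being described.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

BLOCK = 4  # tokens per cached block (hypothetical; chosen small for the demo)


def prefix_key(tokens):
    """Hash a token prefix into a filename-safe key."""
    data = ",".join(map(str, tokens)).encode()
    return hashlib.sha256(data).hexdigest()


class PrefixKVStore:
    """Stores KV state for block-aligned token prefixes on disk."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, tokens, kv):
        """Persist kv[:end] at every block boundary; kv[i] belongs to tokens[i].

        Storing the whole prefix per boundary is wasteful; a real design
        would likely store per-block deltas instead.
        """
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            path = self.root / prefix_key(tokens[:end])
            if not path.exists():
                path.write_bytes(pickle.dumps(kv[:end]))

    def longest_prefix(self, tokens):
        """Return (cached_token_count, kv) for the longest stored prefix."""
        end = (len(tokens) // BLOCK) * BLOCK
        while end > 0:
            path = self.root / prefix_key(tokens[:end])
            if path.exists():
                return end, pickle.loads(path.read_bytes())
            end -= BLOCK
        return 0, []


def fake_prefill(tokens, start, kv):
    """Stand-in for the model: appends one dummy KV entry per new token.

    Append-only, which is the property that makes prefix reuse valid.
    """
    return kv + [("kv", t) for t in tokens[start:]]


store = PrefixKVStore(tempfile.mkdtemp())
first = [1, 2, 3, 4, 5, 6, 7, 8, 9]
store.save(first, fake_prefill(first, 0, []))

second = [1, 2, 3, 4, 5, 6, 7, 8, 42, 43]  # shares an 8-token prefix
reused, cached = store.longest_prefix(second)
kv = fake_prefill(second, reused, cached)  # only the tail is recomputed
print(f"reused {reused} of {len(second)} tokens")
```

The lookup walks back from the longest block-aligned prefix, so a new prompt only pays prefill for the tokens after the last matching block boundary.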