Running a 27B dense model on an M5 with 128GB is OK, but one can do better.

On an M5 with 128GB one can make use of the RAM and run a sparse MoE instead. For example, DeepSeek-V4-Flash will fit, served by DwarfStar (https://github.com/antirez/ds4). One will probably get roughly 2x the tokens/sec, given that DS4F's 13B activated params in the MoE are ~1/2 of the ~27B of the dense Qwen.

The 27B Qwen fits even on a cheaper 24GB card, e.g. an AMD 7900 XTX (<$1K?) or a slightly dearer NVIDIA 3090 (with CUDA). With ~900 GB/s of memory bandwidth they will likely be ~50% faster than the M5 at 600 GB/s.
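
For anyone who wants to check the back-of-envelope numbers above, here is a rough sketch. It assumes token generation is purely memory-bandwidth bound and that both models use the same bytes per parameter (same quantization), which is a simplification; `decode_speedup` is just a hypothetical helper name, and prompt processing, KV cache reads and MoE routing overhead are ignored.

```python
def decode_speedup(bw_a_gbs, active_b_a, bw_b_gbs, active_b_b):
    """Estimated decode tokens/sec of setup A relative to setup B.

    Assumption: each generated token reads every active parameter once,
    so tokens/sec scales with bandwidth / active parameter count.
    Bandwidths are in GB/s, active parameter counts in billions.
    """
    return (bw_a_gbs / active_b_a) / (bw_b_gbs / active_b_b)


# Same M5 (600 GB/s): MoE with ~13B active params vs. the 27B dense model.
moe_vs_dense = decode_speedup(600, 13, 600, 27)

# Same 27B dense model: ~900 GB/s GPU vs. the 600 GB/s M5.
gpu_vs_m5 = decode_speedup(900, 27, 600, 27)

print(f"MoE vs dense on M5: {moe_vs_dense:.2f}x")
print(f"24GB GPU vs M5:     {gpu_vs_m5:.2f}x")
```

Under those assumptions the ratios come out to about 2.1x and 1.5x, matching the "~2x" and "~50% faster" estimates; real-world numbers will depend on the quantization and the inference engine.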

▲

brandall10 3 hours ago | parent | next [-]

This is discussed in the article:

"My personal impression is that within these quantizations Qwen 3.6 27B is as good as (or maybe slightly better than) DwarfStar4. Though, I won’t be surprised if for longer context projects DS4 has an edge."

▲

drnick1 4 hours ago | parent | prev | next [-]

Works beautifully on a 3090, very usable speed. Don't expect Opus 4.8-level performance, but there are some things you just need to keep local.

▲

ljosifov 4 hours ago | parent [-]

True - they are workhorses. Not super bright, but good enough for lots of everyday tasks. I've found the sweet spot to be turning thinking off, as it adds little or no value while increasing the token count and waiting time. The last 27B I used was https://huggingface.co/Jackrong/Qwopus3.6-27B-Coder-GGUF - specifically post-trained a bit to run with thinking off. I saw today that the 35B-A3B MoE from the same HF account is out; downloading that rn to try.

▲

kroaton 2 hours ago | parent [-]

Please don't use that garbage. Just use the base Qwen models or Nex/Orinth, as those are the only properly post-trained finetunes. The Qwopus models are marketing.

▲

aand16 an hour ago | parent [-]

Can you expand on why Qwopus is not recommended and what "Nex/Orinth" brings to the table?

▲

kroaton 2 hours ago | parent | prev [-]

"DeepSeek-V4-Flash will fit" At Q2 (2-bit)? Lobotomized to death.