The 6-bit versions plus an 8-bit KV cache seem to save a good bit of memory without a significant loss of quality. The Qwen 35B is pretty fast in my testing, but MiniMax M2.7 230B is in some ways faster (it needs way fewer tokens to arrive at an answer) even though it is much larger.

SwellJoe 9 days ago

Qwen 3.6 35B-A3B with MTP at 8 bits is blazing fast, something like 50-60 tokens per second. That's plenty fast for interactive use, so I haven't tried lower bits. Unfortunately the MoE is notably dumber than the dense 27B model, at least for the case I have data on: I've been benchmarking models for security vulnerability scanning, and the 27B is notably better on hard bugs.

The dense model is almost usable, but feels really sluggish, even with MTP. I think it's about 12-15 tokens/second on the Strix Halo. That's slow enough that I'd rather pay to use a cloud model.

I might try the 6-bit version of the dense model to see how it behaves, though. Maybe it'll retain its bug hunting abilities while making it fast enough to use interactively and not take all day for benchmark runs.

	hedgehog 9 days ago
		Same chip. With a 6-bit 35B and 8-bit KV cache I see about 500 tokens/second prefill and 55 tokens/second decode at 30k into the context window. MiniMax seemed to have a somewhat lower token rate but was much, much less prone to 40k tokens of monologue before generating an answer. A pattern I like is to use a smaller model for most of the execution and then a larger model to review transcripts and output and do any fixups and tooling improvements (this is all batch jobs, so all I care about is overall throughput).
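To put rough numbers on the "lower token rate but fewer tokens" tradeoff, here is a back-of-the-envelope sketch. The 500/55 tokens/second rates, the 30k context and the 40k-token monologue come from the comment above; the second model's rates and output length are made-up placeholders, not measurements of MiniMax or anything else.

```python
def wall_clock_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Rough end-to-end time: process the prompt, then generate the output."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Fast but chatty: rates quoted in the thread, 40k tokens of reasoning.
chatty = wall_clock_seconds(30_000, 40_000, prefill_tps=500, decode_tps=55)

# Slower but terse: HYPOTHETICAL rates and output length, for illustration only.
terse = wall_clock_seconds(30_000, 5_000, prefill_tps=300, decode_tps=30)

print(f"chatty: {chatty / 60:.1f} min, terse: {terse / 60:.1f} min")
```

Under these assumptions the model with the lower token rate still finishes first, because decode time scales with how many tokens it emits. For batch jobs, time per finished task is probably the number to compare rather than tokens/second.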

milch 9 days ago

What hardware do you need to run MiniMax M2.7 230B locally?

	hedgehog 9 days ago
		Ryzen 395 is what I'm using. Anything with 128GB+ of RAM accessible to the GPU should work fine for a 4-bit version of the model (so Spark or Mac Studio should be OK too).
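The 128GB figure lines up with simple arithmetic on weight size. This is a minimal sketch that counts weights only, at the nominal bit width. It ignores the KV cache, activations and per-block quantization metadata, so real quantized files usually come out somewhat larger than these numbers.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB (10^9 bytes) at a nominal bit width."""
    return params_billion * bits_per_weight / 8

# Parameter counts and bit widths as mentioned in the thread.
for name, params, bits in [
    ("230B @ 4-bit", 230, 4),
    ("230B @ 8-bit", 230, 8),
    ("35B @ 6-bit", 35, 6),
    ("27B @ 6-bit", 27, 6),
]:
    print(f"{name}: ~{weight_gb(params, bits):.1f} GB")
```

At a nominal 4 bits the 230B model is about 115 GB of weights, which is why 128GB is roughly the floor and why there is not much room left over for context.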