Tepix 5 hours ago
Using lmstudio-community/Qwen3-Coder-Next-GGUF:Q8_0 I'm getting up to 32 tokens/s on Strix Halo, with room for 128k of context (out of the 256k that the model can manage). From very limited testing, it seems to be slightly worse than MiniMax M2.1 Q6 (a model about twice its size). I'm impressed.
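For anyone wanting to sanity-check their own numbers: LM Studio exposes an OpenAI-compatible endpoint (port 1234 by default), so a rough tokens/s figure is just a timed chat completion. The sketch below is what I mean; the model id, prompt, and port are placeholders for whatever your local server actually reports.

    # Quick-and-dirty tokens/sec check against a local OpenAI-compatible server
    # (LM Studio defaults to http://localhost:1234/v1 -- adjust for llama-server etc.)
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    start = time.time()
    resp = client.chat.completions.create(
        model="qwen3-coder-next",  # placeholder id; use whatever your server lists
        messages=[{"role": "user", "content": "Write a binary search in Rust."}],
        max_tokens=512,
    )
    elapsed = time.time() - start

    # Fall back to a rough word count if the server doesn't report token usage.
    text = resp.choices[0].message.content
    tokens = resp.usage.completion_tokens if resp.usage else len(text.split())
    print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")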
dimgl 5 hours ago
How's the Strix Halo? I'd really like to get a local inference machine so that I don't have to use quantized versions of local models.
cmrdporcupine 5 hours ago
I'm getting similar numbers on an NVIDIA Spark: around 25-30 tokens/sec output and 251 tokens/sec prompt processing, but I'm running the Q4_K_XL quant. I'll try the Q8 next, though that would leave less room for context. I tried FP8 in vLLM and it used 110GB, and my machine started to swap as soon as I hit it with a query; there was only room for 16k of context. I suspect there will be some optimizations over the next few weeks that pick up the performance on this type of machine.

I have it writing some Rust code, and while it's definitely slower than using a hosted model, it actually seems pretty competent. These are the first results I've had from a locally hosted model that I could see myself actually using, though only once the speed picks up a bit. I suspect the API providers will offer this model nice and cheap, too.
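For anyone retracing the vLLM attempt, the knobs that mattered for me were max_model_len and gpu_memory_utilization. A sketch along these lines is roughly what I mean by "only room for 16k context" (the model id and numbers below are placeholders, not a recipe for this exact checkpoint):

    # Sketch of pinning vLLM's memory so it doesn't spill into swap.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3-Coder-Next-FP8",  # placeholder model id
        max_model_len=16384,                # cap the context so the KV cache fits
        gpu_memory_utilization=0.85,        # leave headroom for the OS/desktop
        kv_cache_dtype="fp8",               # optional: shrink the KV cache further
    )

    out = llm.generate(
        ["Write a small LRU cache in Rust."],
        SamplingParams(max_tokens=512, temperature=0.2),
    )
    print(out[0].outputs[0].text)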