Remix.run Logo
lee_ars 4 hours ago

I'm currently fiddling with a DGX Spark and Qwen3.6-35B-A3B (specifically Qwen3.6-35B-A3B-NVFP4 under vLLM, with EAGLE3 speculative decoding via eagle3-dogacel-vllm), and it's pretty okay in terms of smarts. The speed is relatively usable at about 50 tok/sec with a 256k context window, and it's definitely smart enough to one-shot some basic coding tasks. I had it doing reverse engineering/disassembly of some ancient MS-DOS assembly language games from the 80s and it handled the task well and produced good outputs.

But it's also really easy to trip up. I fed it some of my Ars pieces and asked it to analyze themes and composition, and it got into a looping argument with me over how it was unable to analyze "my" writing because "the user cannot be the article author, the user is the user, the user did not write the article, the article author wrote the article." I was utterly unable to convince it that I was in fact me.

Qwen3.6-35B-A3B hums along at about 50GB of RAM used with --gpu-memory-utilization=0.42. I haven't tried Qwen3.6-27B (I'd likely grab Qwen3.6-27B-FP8, I think), but I'm curious to see if it makes much of a difference.

coder543 an hour ago | parent | next [-]

Compared to a dynamic quant like Unsloth's UD-Q4_K_XL, which keeps some important parameters in higher precision, a basic NVFP4 quant seems to do a lot more damage to the model unless it is carefully calibrated.

I would recommend using llama-server if you're just on a single Spark. You get access to dynamic quants like that more easily, the performance is not that different from vLLM most of the time these days, and it is much faster and easier to switch between models.

As far as intelligence goes, Qwen3.6-27B is much smarter than the 35B-A3B model, but that's also not the sort of thing to argue with an AI model about in the first place. Just open a new chat and try again.

Gemma-4-31B is not as good at agentic use cases as Qwen3.6-27B, but it is a fairly balanced model overall, and worth trying out too. Its MTP can nearly triple the performance of the model, where the benefits of MTP or Eagle seem more limited for Qwen3.6-27B in my testing, maybe doubling the speed.

cpburns2009 2 hours ago | parent | prev | next [-]

Looping is a common problem with the Qwen models. I've had good luck using --repeat-penalty=1.1 with llama.cpp and 27B. vLLM should have a similar option.

anon373839 2 hours ago | parent | prev | next [-]

I use Qwen 3.6 35B-A3B constantly, but I don’t see the type of behavior you mentioned. I’m using Unsloth’s Q8_K_XL quant.

rnxrx 3 hours ago | parent | prev | next [-]

There are also nvfp4 quants of Qwen 3.6 27/35 floating around. I've done benchmarks of both and the quality difference vs fp8/bf16 was barely notable. Honestly the nvfp4 capability is the most interesting feature of the Spark (at least for me).

gnerd00 38 minutes ago | parent | prev [-]

`llama-server` looping mitigations --repeat-penalty something greater than 1.0, set reasoning/thinking OFF explicitly, prefer a gguf with more than 4bit quant