porphyra 7 hours ago

You can also run Qwen 3.6 27B dense model on DGX Spark with comparable performance [1][2] for about $4000 (Asus Ascent GX10 is $3999 at various retailers).

In theory you can also get 48GB of VRAM with, say, two 3090s, but that setup takes up a lot more space and generates a lot more heat than the MacBook Pro or GB10.

[1] https://x.com/MiaAI_lab/status/2070859135399182444

[2] https://github.com/MiaAI-Lab/Qwen3.6-27B-NVFP4-vLLM

esperent 7 hours ago

> 48GB of VRAM with, say, two 3090s

So like... $2000+ just for the used GPUs? Plus I assume it's considerably more effort to get it working.

fluoridation 6 hours ago

> Plus I assume it's considerably more effort to get it working.

Nah, not really. It is a little annoying in terms of space and power, though. Not every case and motherboard can support cards that big.

lee_ars 3 hours ago

The tweet you link shows "Qwen 3.6 35b NVFP4 - 256k ctx, 110 tok/s", but I'm getting less than half that, around 50 tok/sec, on a DGX Spark with Qwen3.6-35B-A3B-NVFP4 (via vLLM) plus speculative decode w/EAGLE3. I'd be ecstatic to see 110 tok/sec, and I wish they had some more sourcing for the exact config.

edit - after actually reading the tweets (had to use xcancel) and visiting the source git repo, switching to MTP for speculative decode makes things a hell of a lot faster, and the abliterated model plus dflash makes it even faster! I'm now seeing 70-90 tok/sec for most stuff. I like!

porphyra 22 minutes ago

I think Atlas might also be slightly faster than vLLM:

https://flowtivity.ai/blog/120-tok-s-1m-context-private-ai-d...