| ▲ | How to run Qwen 3.5 locally (unsloth.ai) |
| 84 points by Curiositry 8 hours ago | 18 comments |
| |
|
| ▲ | moqizhengz 2 hours ago | parent | next [-] |
| Running Qwen3.5 9B on my ASUS RTX 5070 Ti 16GB with LM Studio gives a stable ~100 tok/s.
This outperforms the majority of online LLM services, and the actual output quality matches the benchmarks.
This model is really something: it's the first time I've had a usable model on consumer-grade hardware. |
| |
| ▲ | throwdbaaway an hour ago | parent | next [-] | | There are Qwen3.5 27B quants in the range of 4 bits per weight, which fit into 16GB of VRAM. The quality is comparable to Sonnet 4.0 from summer 2025. Inference speed is very good with ik_llama.cpp, and still decent with mainline llama.cpp. | | |
| ▲ | teaearlgraycold 13 minutes ago | parent [-] | | Qwen3.5 35B A3B is much, much faster and fits if you get a 3-bit version. How fast are you getting 27B to run? On my M3 Air w/ 24GB of memory, 27B runs at 2 tok/s but 35B A3B runs at 14-22 tok/s, which is actually usable. |
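One rough way to see why the A3B model can be so much faster than the dense 27B: token generation is usually limited by how many bytes of weights have to be read per token. The sketch below is a back-of-the-envelope ceiling, not a benchmark. It assumes that "A3B" means roughly 3B parameters are active per token, and the 100 GB/s bandwidth figure is a hypothetical placeholder to replace with your own hardware's number. Measured speeds will be lower than this bound.

```python
def max_tok_per_s(bandwidth_gb_s: float, active_params_billion: float,
                  bits_per_weight: float) -> float:
    """Upper bound on decode speed if every active weight is read once per token.

    bandwidth_gb_s: memory bandwidth in GB/s (hypothetical input, check your hardware).
    active_params_billion: parameters touched per token (all of them for a dense
        model, only the active experts for a mixture-of-experts model).
    """
    bytes_per_token_gb = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / bytes_per_token_gb


if __name__ == "__main__":
    bandwidth = 100.0  # GB/s, placeholder value
    dense = max_tok_per_s(bandwidth, 27, 4)   # dense 27B at ~4 bits per weight
    moe = max_tok_per_s(bandwidth, 3, 3)      # ~3B active at ~3 bits per weight (assumed)
    print(f"dense 27B ceiling: {dense:.1f} tok/s")
    print(f"35B A3B ceiling:   {moe:.1f} tok/s")
```

The absolute numbers matter less than the ratio: the dense model has to read roughly an order of magnitude more data per token, which points in the same direction as the gap reported above.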
| |
| ▲ | yangikan 2 hours ago | parent | prev | next [-] | | Do you point Claude Code at this? The orchestration seems to be very important. | | | |
| ▲ | lukan an hour ago | parent | prev [-] | | What exact model are you using? I have a 16GB GPU as well, but have never run a local model so far. According to the table in the article, 9B at 8-bit (13 GB) and 27B at 3-bit both seem to fit inside the memory. Or is more space required for context etc.? |
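On the "does it fit" question, a minimal sketch of the arithmetic: the quantized weights alone take roughly parameters times bits per weight divided by 8. Anything for context (KV cache) and runtime buffers comes on top of that. The function names and the 2 GB default headroom are hypothetical choices for illustration, not figures from the article.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the quantized weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the weights plus a reserved headroom fit in VRAM.

    headroom_gb is a hypothetical allowance for KV cache and buffers;
    the real requirement grows with context length.
    """
    return weight_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb


if __name__ == "__main__":
    for name, params, bits in [("9B @ 8-bit", 9, 8),
                               ("27B @ 3-bit", 27, 3),
                               ("27B @ 4-bit", 27, 4)]:
        size = weight_gb(params, bits)
        print(f"{name}: ~{size:.1f} GB of weights, "
              f"fits in 16 GB with headroom: {fits(params, bits, 16)}")
```

The weights-only estimate for 9B at 8-bit comes out near 9 GB, below the 13 GB quoted from the table, so that figure presumably covers more than the raw weights. Treat the weights-only number as a lower bound and leave room for context.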
|
|
| ▲ | vvram 3 minutes ago | parent | prev | next [-] |
| What would be the recommended optimal HW configurations/systems? |
|
| ▲ | Curiositry 2 hours ago | parent | prev | next [-] |
| Qwen3.5 9B seems to be fairly competent at OCR and text formatting cleanup when running in llama.cpp on CPU, albeit slowly. However, I have compiled it umpteen ways and still haven't gotten GPU offloading working properly (which I did have with Ollama) on an old GTX 1650 Ti with 4GB VRAM: it tries to allocate too much memory. |
| |
| ▲ | acters an hour ago | parent | next [-] | | I have a GTX 1660 Ti and the CachyOS + aur/llama.cpp-cuda package is working fine for me.
With about 5.3 GB of usable memory, I find that the 35B model is by far the most capable one, and it performs just as fast as the 4B model, which fits entirely on my GPU.
I did try the 9B model and it was surprisingly capable. However, 35B is still better in some of my own anecdotal test cases.
Very happy with the improvement. However, I notice that Qwen 3.5 is about half the speed of Qwen 3. | |
| ▲ | WhyNotHugo 30 minutes ago | parent | prev [-] | | If you’re building from source, the Vulkan backend is the easiest to build and use for GPU offloading. | | |
| ▲ | Curiositry 25 minutes ago | parent [-] | | Yes, that's what I tried first. Same issue with trying to allocate more memory than was available. |
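For partial offloading on a small card, a rough starting point for llama.cpp's `-ngl` / `--n-gpu-layers` value can be estimated from the model file size. This is only a sketch and does not diagnose the allocation issue described above. It assumes all layers are about the same size, and the layer count and the reserve for context and compute buffers are hypothetical inputs you would need to fill in for your model.

```python
import math


def layers_that_fit(model_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Estimate how many layers can be offloaded to the GPU.

    Assumes equally sized layers (a simplification) and keeps reserve_gb
    of VRAM free for KV cache and compute buffers (hypothetical default).
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = vram_gb - reserve_gb
    if usable_gb <= 0:
        return 0
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # Hypothetical example: a 5.5 GB model file with 32 layers on a 4 GB card.
    n = layers_that_fit(model_gb=5.5, n_layers=32, vram_gb=4.0, reserve_gb=1.0)
    print(f"try starting with -ngl {n} and lower it if allocation still fails")
```

If allocation still fails at the estimated value, lowering the layer count or the context size is the usual next step, since the reserve here is a guess.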
|
|
|
| ▲ | Twirrim 5 hours ago | parent | prev [-] |
| I've been finding it very practical to run the 35B-A3B model on an 8GB RTX 3050. It's pretty responsive and does a good job of the coding tasks I've thrown at it. I need to grab the freshly updated models; the older one seems to occasionally get stuck in a loop with tool use, which they suggest they've fixed. |
| |
| ▲ | fy20 2 hours ago | parent | next [-] | | I guess you are offloading to system RAM? What tokens per second do you get? I've got an old gaming laptop with an RTX 3060; it sounds like it could work well as a local inference server. | | |
| ▲ | manmal 24 minutes ago | parent [-] | | In the article, they claim up to 25 t/s for the LARGEST model with a 24GB VRAM card. You need a lot of RAM, obviously. |
| |
| ▲ | ufish235 4 hours ago | parent | prev | next [-] | | Can you give an example of some coding tasks? I had no idea local was that good. | | |
| ▲ | hooch 2 hours ago | parent [-] | | I changed into a directory recently, fired up the qwen code CLI, and gave it two prompts: "so what's this then?", to which it gave a good summary across stack and product, and then "think you can find something todo in the TODO?". While I was busy in Claude Code on another project, it neatly finished three HTML & CSS tasks that I had been procrastinating on for weeks. This was a qwen3-coder-next 35B model on an M4 Max with 64GB, which seems to be 51GB in size according to ollama. I have not yet tried the variants from TFA. | | |
| |
| ▲ | fragmede 4 hours ago | parent | prev [-] | | Which models would that be? |
|