roxolotl 2 hours ago

What method are you using to do that? I’ve been playing with llama.cpp a lot lately, trying to figure out the cleanest options for getting a solid context window on 32GB of VRAM and 64GB of system RAM.

jychang 2 hours ago | parent [-]

32GB of VRAM is more than enough for Qwen 3.5 35b.

You can just load the Q4_K_XL model like normal, and put all tensors on GPU without any -ot or --cpu-moe flags.
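For anyone following along, here is a rough sketch of what that plain "everything on GPU" launch could look like, written as a small Python helper that assembles the command line. The model filename and the 32768 context size are made-up placeholders, not values from this thread; `-m`, `-ngl` and `-c` are the usual llama.cpp options for model path, GPU layer count and context size, and `-ngl 99` is just a number larger than the layer count so every layer lands on the GPU.

```python
import shlex


def full_gpu_command(model_path: str, ctx_size: int) -> list[str]:
    """Build a llama-server argv that keeps every layer on the GPU.

    No -ot or --cpu-moe flags are added, matching the "load it like normal" case.
    """
    return [
        "llama-server",
        "-m", model_path,      # path to the GGUF file (hypothetical name below)
        "-ngl", "99",          # more than the model's layer count, so all layers go to GPU
        "-c", str(ctx_size),   # context window in tokens (assumed value below)
    ]


# Hypothetical filename and context size, for illustration only.
cmd = full_gpu_command("qwen-35b-Q4_K_XL.gguf", 32768)
print(shlex.join(cmd))
```

If that loads and leaves VRAM headroom, the context size can be raised from there before reaching for any offload flags.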

If you need a massive context for some reason, where model + KV cache won't fit in 32GB, then use -ot to move the FFN MoE expert tensors for 1-2 layers into RAM. You'll take a speed hit (those params get read from slower system RAM instead of fast VRAM), but it'll work.
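A minimal sketch of how the -ot value for that could be built, assuming the expert tensors follow the naming llama.cpp GGUF files typically use for MoE models (`blk.<layer>.ffn_gate_exps`, `ffn_up_exps`, `ffn_down_exps`). That naming, and the choice of the first layers rather than some other ones, are assumptions; check the tensor names in your own file before relying on it. The helper name is made up for illustration.

```python
import re


def moe_offload_pattern(n_layers: int) -> str:
    """Return an -ot value that sends the expert FFN tensors of the
    first n_layers blocks to CPU (system RAM).

    Assumes tensor names shaped like 'blk.<N>.ffn_<kind>_exps.weight'.
    """
    if n_layers < 1:
        raise ValueError("n_layers must be at least 1")
    layers = "|".join(str(i) for i in range(n_layers))
    return rf"blk\.({layers})\.ffn_.*_exps\.=CPU"


pattern = moe_offload_pattern(2)
print("-ot", pattern)

# Sanity check of the regex half against assumed tensor names.
regex = pattern.rsplit("=", 1)[0]
for name in ("blk.0.ffn_gate_exps.weight", "blk.1.ffn_down_exps.weight",
             "blk.2.ffn_up_exps.weight", "blk.0.attn_q.weight"):
    print(name, bool(re.search(regex, name)))
```

Start with one layer and add more only until the model plus KV cache fits, since every layer moved to RAM costs some speed.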

roxolotl an hour ago | parent [-]

Nice, ok, I’ll play with that. I’m mostly just learning what’s possible. Qwen 3.5 35b has been great without any customizations, but it’s interesting to learn what the options are.