roxolotl 2 hours ago
What method are you using to do that? I’ve been playing with llama.cpp a lot lately and trying to figure out the cleanest options for getting a solid context window on 32GB VRAM and 64GB system RAM.
jychang 2 hours ago | parent
32GB VRAM is more than enough for Qwen 3.5 35b. You can just load the Q4_K_XL model like normal and put all tensors on the GPU, without any -ot or --cpu-moe flags. If you need a massive context for some reason, where model + KV cache won't fit in 32GB, then use -ot to move the FFN MoE expert tensors for 1-2 layers into system RAM. You'll take a speed hit (from loading params out of slower RAM instead of fast VRAM), but it'll work.
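A sketch of what that -ot override could look like with llama-server. The model filename and context size here are placeholders, and the tensor-name regex assumes the usual llama.cpp GGUF naming for MoE expert FFN tensors (blk.N.ffn_up_exps, ffn_down_exps, ffn_gate_exps); check your model's actual tensor names with a tool like gguf-dump before relying on the pattern.

```shell
# Offload everything to GPU (-ngl 99), then override placement (-ot) so the
# MoE expert FFN tensors of layers 0 and 1 stay in system RAM ("CPU" buffer).
# Model path and context size (-c) are placeholders; adjust for your setup.
llama-server \
  -m Qwen3.5-35B-Q4_K_XL.gguf \
  -ngl 99 \
  -c 65536 \
  -ot 'blk\.(0|1)\.ffn_.*_exps\.=CPU'
```

The -ot (--override-tensor) argument takes a regex-to-buffer mapping, so you can widen the layer range (e.g. `blk\.(0|1|2|3)\.`) if the KV cache still doesn't fit; each extra offloaded layer trades a bit more speed for VRAM headroom.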