Remix clone Hacker News

	jychang 2 hours ago
		32 GB of VRAM is more than enough for Qwen 3.5 35B. You can just load the Q4_K_XL quant like normal and put all tensors on the GPU, without any -ot or --cpu-moe flags. If you need a massive context for some reason, where model + KV cache won't fit in 32 GB, then use -ot to move the FFN MoE expert tensors for 1-2 layers into system RAM. You'll take a speed hit (those params get read from slower RAM instead of fast VRAM), but it'll work.
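
		For anyone who hasn't written an -ot override before, here's a rough sketch of building one for the "1-2 layers" case. This assumes llama.cpp's usual GGUF tensor naming (blk.N.ffn_up_exps, blk.N.ffn_gate_exps, blk.N.ffn_down_exps) and that "CPU" is the buffer name for system RAM; check both against your own build and model. The helper name moe_offload_pattern is made up for illustration.

```python
import re


def moe_offload_pattern(layers):
    """Build a 'regex=buffer' string for -ot covering the given layer indices.

    Assumption: expert tensors are named blk.<N>.ffn_{up,gate,down}_exps.*
    and the target buffer for system RAM is called CPU.
    """
    if not layers:
        raise ValueError("need at least one layer index")
    alternatives = "|".join(str(i) for i in sorted(set(layers)))
    # The escaped dots on both sides of the index keep layer 1 from
    # also matching layers 10, 11, and so on.
    return rf"blk\.({alternatives})\.ffn_.*_exps\.=CPU"


pattern = moe_offload_pattern([0, 1])
print(pattern)  # blk\.(0|1)\.ffn_.*_exps\.=CPU
```

		You'd then pass the printed string as the -ot value, quoted so the shell leaves the parentheses and pipe alone, alongside whatever flags you already use to load the model. Only the expert tensors for those layers move; attention and everything else stays in VRAM, which is why the speed hit should be modest for just a couple of layers.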