Remix clone Hacker News

	▲	nyrikki 2 hours ago
		You can get all the Qwen 3.x models up to ~1 million tokens using YaRN with llama.cpp.[0] Personally, I am using `--no-context-shift` and feeding context back in on failure at the harness level. I have 2x1080ti + 1xTitanV running a full 262,144-token context with `-sm tensor` at 62.04 t/s, which isn't so bad. I also have a 1x3090 running unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL at 41.89 t/s, but with only 130k context. If you have a modular programming style, both work pretty well, but play with YaRN if you really need the longer context. [0]https://qwen.readthedocs.io/en/v3.0/run_locally/llama.cpp.ht...
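For anyone wanting to try the YaRN route, here is a rough sketch of the arithmetic behind the flags. It assumes a native context of 262,144 tokens (the figure quoted above) and a target of 1,048,576, and it assumes your llama.cpp build accepts `--rope-scaling`, `--rope-scale` and `--yarn-orig-ctx`; check `llama-server --help` before relying on it. The helper name `yarn_args` is made up for illustration.

```python
import math


def yarn_args(native_ctx: int, target_ctx: int) -> list[str]:
    """Build llama.cpp context flags for stretching native_ctx to target_ctx.

    Assumption: the build supports --rope-scaling yarn, --rope-scale and
    --yarn-orig-ctx. Verify against `llama-server --help`.
    """
    if target_ctx <= native_ctx:
        # No scaling needed when the target fits in the native window.
        return ["--ctx-size", str(target_ctx)]

    # YaRN scale factor: how many times the native window is stretched,
    # rounded up to a whole number.
    scale = math.ceil(target_ctx / native_ctx)
    return [
        "--ctx-size", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", str(scale),
        "--yarn-orig-ctx", str(native_ctx),
    ]


# Assumed numbers: 262,144 native tokens stretched to ~1 million.
print(" ".join(yarn_args(262_144, 1_048_576)))
# → --ctx-size 1048576 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144
```

The printed flags would be appended to an existing `llama-server` command line. A 1M-token KV cache still has to fit in memory, so a smaller target may be the practical choice.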
	▲	Vaskivo 24 minutes ago
		How can you get it to run at 41 t/s? I also have a single 3090 and even with MTP can't break 20 t/s. Here's my setup: `llama-server --port 9999 --model /MODELS/LLMs/Qwen3.6-27B-UD-Q4_K_XL.gguf --ctx-size 128000 --threads 12 --flash-attn on --device CUDA0 --jinja --gpu-layers 52 --mmproj /MODELS/LLMs/Qwen3.6-27B-mmproj-F16.gguf --cache-type-k q8_0 --cache-type-v q8_0 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0 --spec-type draft-mtp --spec-draft-n-max 2` (I'm not filling 100% of the VRAM, as I need some of it for other things.)