Remix clone Hacker News


	▲	lambda 3 hours ago
		I mostly use Vulkan instead of ROCm. Vulkan is actually a bit faster, paradoxically. I do switch between them and try both, and it's not a huge difference, but I've mostly been staying on Vulkan. The problem of re-processing context every turn is definitely something I've hit. Some of the causes have been solved upstream in llama.cpp; make sure you're up to date. But another cause with a big effect is that older Qwen models didn't support preserving thinking. This meant that whenever you had a long sequence of tool calls with interleaved thinking, as soon as you started your next turn in the chat, it would have to re-process all of that, because it dropped all of the reasoning. Qwen 3.6, however, now supports preserving thinking. This can use a bit more context, because you're not dropping the thinking every turn, but it re-uses the cache better, so you don't have to re-process a whole turn each time. In my models.ini, I have this for the Qwen3.6 models: `chat-template-kwargs = {"preserve_thinking": true}` There are still occasional issues I hit where it has to re-process, but getting up to date and enabling preserve_thinking has helped a ton.
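To illustrate why dropping the reasoning hurts cache re-use, here is a toy sketch. It is not how llama.cpp implements its cache: it only assumes that a prompt cache can re-use the longest prefix shared between what was already processed and the next prompt. Whitespace-split words stand in for tokens, and `render`, `common_prefix_len`, and the sample history are all hypothetical names made up for the example.

```python
def common_prefix_len(cached, prompt):
    """Number of leading tokens shared by the cached sequence and the new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def render(turns, preserve_thinking):
    """Flatten a history into a token list, optionally stripping 'think' segments."""
    tokens = []
    for kind, text in turns:
        if kind == "think" and not preserve_thinking:
            continue  # reasoning is dropped when the history is re-rendered
        tokens.extend(text.split())
    return tokens


# One agentic turn: tool calls interleaved with thinking (made-up content).
history = [
    ("user", "fix the failing test"),
    ("think", "I should read the file first"),
    ("tool", "read_file test.py"),
    ("think", "the assertion is wrong"),
    ("tool", "edit_file test.py"),
    ("answer", "done"),
]

# What was actually processed while generating the turn, thinking included.
cached = render(history, preserve_thinking=True)
new_user = "now run it".split()

prompt_drop = render(history, preserve_thinking=False) + new_user
prompt_keep = render(history, preserve_thinking=True) + new_user

reused_drop = common_prefix_len(cached, prompt_drop)
reused_keep = common_prefix_len(cached, prompt_keep)

print("dropped thinking:   reuse", reused_drop, "process", len(prompt_drop) - reused_drop)
print("preserved thinking: reuse", reused_keep, "process", len(prompt_keep) - reused_keep)
```

In this toy model, the prompt with dropped thinking diverges from the cache at the first reasoning segment, so everything after it has to be processed again. With thinking preserved, only the new user message is new work, at the cost of a longer prompt overall.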
	▲	ndom91 3 hours ago | parent
		+1 using llama.cpp Vulkan releases with the Qwen models - runs much better than the ROCm releases. I'll have to give the preserve_thinking a shot.