| ▲ | lambda 3 hours ago | |
I mostly use Vulkan instead of ROCm. Vulkan is actually a bit faster, paradoxically. I do switch back and forth to try them both, and it's not a huge difference, but I've mostly been staying on Vulkan. The problem of re-processing the context every turn is definitely something I've hit. Some of the causes have been fixed upstream in llama.cpp; make sure you're up to date. But another cause with a big effect is that older Qwen models didn't support preserving thinking. This means that whenever you had a long sequence of tool calls with interleaved thinking, all of that reasoning was dropped as soon as you took your next turn in the chat, so the whole sequence had to be re-processed. Qwen3.6, however, now supports preserving thinking. This can use a bit more context, because you're not dropping the thinking every turn, but it re-uses the cache better, so you don't have to re-process a whole turn each time. In my models.ini, I have this for the Qwen3.6 models:
There are still occasional issues I hit where it will have to re-process, but getting up to date and enabling preserve_thinking has helped a ton. | ||
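To illustrate why dropping the thinking hurts cache re-use, here is a toy sketch, not llama.cpp's actual implementation. It assumes the prompt cache can only re-use the longest common token prefix between the previous sequence and the new prompt, and it uses whitespace-split words as stand-in tokens. The role tags, the `render` helper, and the example messages are all hypothetical.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, text); the "think" role marks reasoning


def render(history: List[Message], preserve_thinking: bool) -> List[str]:
    """Flatten a chat into toy tokens.

    Without preserve_thinking, reasoning that precedes the latest user
    message is left out, mimicking a template that drops old thinking.
    """
    last_user = max(i for i, (role, _) in enumerate(history) if role == "user")
    tokens: List[str] = []
    for i, (role, text) in enumerate(history):
        if role == "think" and not preserve_thinking and i < last_user:
            continue
        tokens.append(f"<{role}>")
        tokens.extend(text.split())
    return tokens


def common_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading tokens shared by both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reprocessed(prev: List[Message], new: List[Message], preserve_thinking: bool) -> int:
    """Tokens of the new prompt that fall outside the cached prefix."""
    cached = render(prev, preserve_thinking)
    prompt = render(new, preserve_thinking)
    return len(prompt) - common_prefix_len(cached, prompt)


# One turn with tool calls and interleaved thinking, then a follow-up turn.
turn1: List[Message] = [
    ("user", "list files"),
    ("think", "need ls"),
    ("tool", "ls -> a.txt b.txt"),
    ("think", "now summarize"),
    ("assistant", "two files"),
]
turn2: List[Message] = turn1 + [("user", "delete a.txt")]

print("thinking dropped:  ", reprocessed(turn1, turn2, preserve_thinking=False))
print("thinking preserved:", reprocessed(turn1, turn2, preserve_thinking=True))
```

In this toy setup, dropping the thinking makes the new prompt diverge from the cached sequence right after the first user message, so almost the whole previous turn is processed again. With the thinking preserved, only the new user message falls outside the cached prefix.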
| ▲ | ndom91 3 hours ago | parent [-] | |
+1 on using the llama.cpp Vulkan releases with the Qwen models; they run much better than the ROCm releases. I'll have to give preserve_thinking a shot. | ||