lambda 3 hours ago
This is very similar to my setup. Pi in a container (I do let it have network access, just no access to creds or anything, only the one directory that I'm working on at the time and my ~/.pi directory), talking to llama.cpp in another container. I'm on a Strix Halo laptop with 128 GiB of unified memory.

I've never used the frontier models in earnest, since I don't believe in using proprietary tools for my programming, so I can't really compare. And I'm still an AI skeptic, so I'm doing more testing and kicking the tires than I am actually using it. That means I spend a lot of time trying to break various models, probe them for strengths and weaknesses, etc.

But I find that when I do try to use it for real for agentic coding, Qwen 3.6 35B-A3B is definitely the one I reach for the most often. For other chat tasks and translation, I'll frequently use Gemma 4 31B. For audio, I'll use Gemma 4 12B. I keep a bunch of other models around to try out every once in a while (Qwen 3.5 122B-A10B, Qwen 3.6 27B, Nemotron 3 Super 122B-A12B, Step 3.7 Flash and Minimax M2.7 both at somewhat more aggressive quants, and GPT-OSS 120B if I want super fast but not terribly smart), but so far Qwen 3.6 35B-A3B is really the sweet spot for coding on a setup like this.
chakspak 3 hours ago
Hopefully this isn't off-topic, but your setup sounds just like mine: Strix Halo and (I'm assuming) llama.cpp on ROCm. I'm finding that the Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. Were you able to solve this, and if so, how?