Remix.run Logo
Greenpants 4 hours ago

I have! I care about data privacy and LLMs being free. I'm using the Pi coding harness but containerized and sandboxed, to make sure it's running completely offline. On my Mac Studio with 128GB RAM (or MacBook with 36GB RAM) I'm using Qwen3.6 35b, with only 3b active parameters so that it runs really fast. I've done a complete redesign for my website's homepage and blog with Django + Wagtail. The latter is interesting, because Wagtail is a bit less well-known, so the agent, without giving it internet access, doesn't always know how to develop for Wagtail. I've used Qwen3.5 122b for when things get more complex. At 10b active parameters, it's significantly slower though.

I've noticed a few things compared to large models like Claude. For starters, you really need to know what you're asking, and be precise; it doesn't do much thinking for you. Any assumptions left open, and it'll take the easiest route to reach the goal (e.g. CSS in HTML), often not the best in terms of architecture.

It gets into loops quite often, and surprisingly often gets the edit tool call wrong, after which it will spend lots of thinking tokens and re-read files instead of retrying (despite the system prompt suggesting so).

Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture. If Opus gives a 15x speedup, local and fully offline Qwen gives a 5x speedup. Which, given that it's completely free, is still mind-boggling to me :)

lambda 3 hours ago | parent | next [-]

This is very similar to my setup. Pi in a container (I do let it have network access, just no access to creds or anything, only the one directory that I'm working on at the time and my ~/.pi directory), talking to llama.cpp in another container. I'm on a Strix Halo 128 GiB unified memory laptop.

I've never used the frontier models in earnest, I don't believe in using proprietary tools for my programming, so I can't really compare.

And I'm still a AI skeptic, so I'm doing more testing and kicking the tires than I am actually using it. That means I spend a lot of time trying to break various models, probe them for strengths and weaknesses, etc.

But I find that when I do try to use it for real for agentic coding, Qwen 3.6 35B-A3B is definitely the one I reach for the most often.

For other chat tasks and translation, I'll frequently use Gemma 4 31B.

For audio, I'll use Gemma 4 12B.

I keep a bunch of other models around to try out every once in a while (Qwen 3.5 122B-A10B, Qwen 3.6 27B, Nemotron 3 Super 122B-A12B, Step 3.7 Flash and Minimax M2.7 both at somewhat more aggressive quants, and GPT-OSS 120B if I want super fast but not terribly smart), but so far Qwen 3.6 35B-A3B is really the sweet spot for coding on a setup like this.

chakspak 3 hours ago | parent [-]

Hopefully this isn't off-topic, but your setup sounds just like mine, Strix Halo and (I'm assuming) llama.cpp on ROCm, and I'm finding that the Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?

lambda 3 hours ago | parent | next [-]

I use Vulkan mostly instead of ROCm. Vulkan is actually a bit faster, paradoxically. I do switch out and try them both out, and it's not a huge difference, but I've been mostly saying on Vulkan.

The re-processing context every turn problem is definitely something I've hit. Some of the causes have been solved upstream in llama.cpp; make sure you're up to date.

But another cause of the issue that has a big effect is that older Qwen models didn't support preserving thinking. This means that each time you have a long sequence of tool calls with interleaved thinkging, as soon as you had your next turn in the chat, it would have to re-process all of that as it would drop all of the reasoning.

Qwen 3.6, however, now supports preserving thinking. This can use a bit more context, becasue you're not dropping the thinking every turn, but it re-uses the cache better, not causing you to have to reprocess a whole turn at a time each time.

In my models.ini, I have this for the Qwen3.6 models:

  chat-template-kwargs = {"preserve_thinking": true}
There are still occasional issues I hit where it will have to re-process, but getting up to date and enabling preserve_thinking has helped a ton.
ndom91 3 hours ago | parent [-]

+1 using llama.cpp Vulkan releases with the Qwen models - runs much better than the ROCm releases.

I'll have to give the preserve_thinking a shot.

dnautics an hour ago | parent | prev | next [-]

> Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?

Isn't this the nature of how LLMs work? Or do you mean that it recalculates the entire KV cache instead of saving the old KV cache, in which case the problem is likely in your executor (llama.cpp, vllm, e.g.) configuration or capabilities?

lambda 5 minutes ago | parent [-]

So, one of the ways that this problem manifests is that most local models aren't trained on preserving the full reasoning between turns. Every turn, they skip passing the reasoning trace from previous turns to the the LLM. So if on one turn you have a long interleaved chain of reasoning and tool calls, then it responds to you, and then you give a new prompt to fix something, it has to re-process all of those tools calls now with the reasoning stripped out.

Qwen 3.6 has finally been trained both with and without preserving thinking, so you can optionally enable preserving thinking. This will use up a bit more context, but it will avoid having to do this re-processing of long agentic turns, and also the preserved thinking can avoid having to re-do some of the same reasoning over again in later turns.

Besides that, modern LLMs don't only use full attention (apparently, attention is not all you need). Full attention is very expensive to compute and store (0(n^2)). But additionally, full attention is actually bad at certain kinds of reasoning; keeping track of some value that gets replaced over the course of time, for example. So most models these days use various forms of local attention which is fixed length and gets updated as you go; sliding window attention, Mamba-2 state space models, etc.

But one advantage of attention is that you can go back and reprocess by truncating the KV cache and starting over. You can't do that with other forms of local attention; you've lost the state earlier in the sequence.

So to allow you to go back without fully recomputing the cache all over again, your engine will save snapshots of the local attention state at various times, so if you need to go back to recompute the cache, you can start from the last snapshot. However, these snapshots can get large, you can't keep too many of these, so sometimes you need to go back quite far to get to one, or they're all past the point you need to go back to and you need to start over again from the beginning.

There have been particular bugs in llama.cpp that have caused this to be triggered more often than it should; for instance, it wouldn't take snapshots before turns that included images at one point, so if you had an image heavy agentic workflow, that issue plus the lack of preserving thinking would mean you would frequently have to go back and start over from scratch.

Some of these issue have been fixed, some are addressed by preserving thinking. There are still some issues sometimes; for instance, one that's hard to fix is that the tokens generated autoregressively don't always parse the same when doing prefill. For instance, you could generate something as two tokens "pre" and "fill", but it turns out that "prefill" is also a single token so the tokenizer will use that, so when you send that back again on the next turn, it will see a divergence and have to recompute from that point. It might be possible to ignore that and use the not fully greedy tokenization that's in the cache, but I've definitely seen llama.cpp have to do some cache recomputation due to that.

LoganDark 2 hours ago | parent | prev [-]

What harness are you using? Some of them (e.g. OpenCode) mutate the system prompt every turn, and therefore can't work with a KV cache.

I've had the best luck with Pi so far, but it comes without some bells and whistles you might be used to (e.g. plan mode, subagents, MCP client support)

adyavanapalli 3 hours ago | parent | prev | next [-]

For the edit tool, you should consider implementing a hash-based approach where each line of code is hashed and referenced by it when doing replacements. You can read up on the approach here: https://blog.can.ac/2026/02/12/the-harness-problem/

I didn't do much benchmarking, but anecdotally, I found it to be making less edit errors. YMMV

electronsoup 3 hours ago | parent | prev | next [-]

> It gets into loops quite often, and surprisingly often gets the edit tool call wrong

I find that running better quantization, like Q8 tend to prevent this even though its a bit slower to run, it saves overall time with less churn

Using 3.6-27b is even slower again than 3.6-35b, but I find the accuracy really pays off

ltononro 3 hours ago | parent | prev | next [-]

What kind of coding do you do? Do you keep track of frontier models to vibe check the differences and re-evaluate constantly or are you ok with having a nerfed model forever? (not being judmental, just really wanto to know your framework here)

Greenpants 3 hours ago | parent [-]

Some of the work I do, I do for an (EU) organisation that doesn't have clear rules or guidelines on the use of AI yet. Though I have seen colleague-developers blatantly putting source code into external Claude-like models, I stay true to my principles and don't. I know for certain that everything that I run through my local, offline Pi Container Sandbox cannot leave the machine, and thus can't result in a data breach. I do this for the peace of mind.

I do (unscientifically) experiment whenever a new capable local LLM (<=130b) releases with a license that permits commercial use. As for knowing my models require more work than Opus, I don't mind still having to puzzle on getting the architecture right. In any case, it forces me to stay in the loop of what's being built, which is a good thing.

0xbadcafebee 4 hours ago | parent | prev | next [-]

The harness and the LLM parameters are pretty essential to getting better results and reducing loops. Tweak the parameters and you can mostly eliminate loops without negatively affecting performance (it's a bit complex but ask a SOTA AI to guide you and it's not hard). The harness should also react more intelligently to failures; it can do things like return additional context or hints as it tracks error rates and avg duration of calls. Pi can be easily extended, and it's suggested by the author you modify it to perform better for your use case.

dotancohen 2 hours ago | parent | prev | next [-]

  > you really need to know what you're asking, and be precise
Any chance that you could share some recent prompts to give other HNers a head start on his to approach Qwen? If you are uncomfortable posting them here, my Gmail username is the same as my HN username.

Thank you.

Greenpants an hour ago | parent [-]

I'm glad you're asking. I already started writing a blog post on how to best make use of local models. I'll share it as soon as I have a complete enough list. If anyone else reading this would like to chime in with their tips & tricks, let us know!

For the time being, off the top of my head, I'd say:

- Prompt Engineering tips & tricks apply here (like being complete in the relevant context you provide in your question, and the specific task(s) the agent should do like reasoning, modifying one file, or trying to fix a complex task all at once (not recommended)).

- If you already know which files the agent should look into, mention them to save time and potentially context.

- In my personal workflow, I write down lots of atomic TODOs needed to solve a problem. As I write it down, I'll notice assumptions I'm making, or the fact that the TODO could still be decomposed further into (atomic) subtasks.

- It's best to get a feeling yourself for how Qwen handles your repository. I noticed if I don't specify an architecture for development, it'll make quick & dirty fixes. If I don't tell it to remove debug statements, it won't. This is what was meant with "be precise" – Claude Opus might think for you and act in your best interest. Smaller Qwen models will just do what you ask them to, and no more. They have design knowledge, but you have to explicitly ask them to "activate" that part of their knowledge.

jmuguy 3 hours ago | parent | prev | next [-]

Given your knowledge on this - do you think we'll see an open source model with Opus levels of capability? IMO if/when this happens - I would 100% stop using Anthropic.

Greenpants 3 hours ago | parent | next [-]

Let me put it like this. I started with local LLMs when ChatGPT still used GPT-3.5. I was amazed how my MacBook with 8GB RAM could run openhermes2.5-mistral: a 7b parameter model that could generate short stories that sort of made sense. Incredible!

Two years later, and I'm running Qwen3.6 35b agentically to develop the start of a repository and automatically run tests to then improve on itself. I never thought we'd get here so quickly with LLMs back then.

I'm pretty sure in two years we'll have current Opus-like quality in the 30-100b parameter model range. But at that point, Opus 6.3 will reason along for us so much better still, that we'll still look at those models in awe. It's great to look ahead, but let's not forget to appreciate how effective the current local models already are :)

jmuguy 3 hours ago | parent [-]

Haha well I ask because I don't really want/need anything beyond Opus most of the time. And I'm paranoid that Anthropic is going to be forced to charge the true cost of all this before too long.

Greenpants 3 hours ago | parent [-]

The other upside of running local LLMs is that there's no cloud provider to suddenly charge more for the same, or even less, model use.

It's personal, but I prefer CapEx over OpEx for this. If you can purchase a device upfront that runs a decent local LLM, you get the peace of mind that your setup won't suddenly change over time and can only get better.

lambda 3 hours ago | parent | prev | next [-]

If you believe the benchmarks, Qwen 3.6 35B-A3B already outperforms Claude 4 Opus.

Now, there's a bit of a degree to which some of the open source models do some benchmaxxing, and bigger models with more params may always feel like they have more depth. But anyhow, right now you have something that is arguably comparable to Claude 4 Opus on your laptop. I can't really compare myself because I never used it. It looks like Claude 4 Opus is still available on OpenRouter, so you could try it out and compare yourself if you're interested.

It will likely always be the case that there are proprietary cloud models that are more powerful than what you can run on a laptop. You can just do a whole lot more with terabytes of VRAM on multi-GPU clusters than you can do on a laptop. So for folks who must have the most capable, you're probably not going to want to leave Anthropic.

But right now, the models you can run on your laptop are comparable to the cloud models that were popular when vibecoding and Claude Code first took off.

MrScruff 3 hours ago | parent [-]

You really need to take the benchmarks with a massive pinch of salt. I’ve been testing local LLMs since the original llama and there’s nothing I’ve tried that is in the same category as Opus.

lambda 3 hours ago | parent [-]

Which Opus? They certainly outperform Claude 3 Opus.

Anyhow, feel free to try them out head to head on OpenRouter. I'd love to see someone write up their results, of a modern local sized open source model vs. frontier models from ~a year ago, on something other than the standard benchmarks.

mapontosevenths 2 hours ago | parent | next [-]

There's a guy on Youtube named Bijan Bowen who tests all the models (open and frontier) on a series of one/few shot programming exercises and has been for a long while now. You can pretty much watch him compare the results for any two models you're likely to be interested in.

I'm not affiliated, I just like his style and have found it handy. I know it's not very rigorous, but it's good enough for me and I've found his examples to pretty closely match the results I see in real life.

lambda an hour ago | parent [-]

OK, it looks like he did a browser OS test with both Claude 4 Opus and Qwen 3.6 35B-A3B.

Claude 4 Opus: https://youtu.be/J7omabtqnBM?t=193

Qwen 3.6 35B A3B: https://youtu.be/gVU-DQeqkI0?t=215

Qwen 3.6 produced far more working functionality than Claude 4 Opus did.

Obviously, just one test of a single one-shot prompt of a silly toy OS, but yeah, this particular test shows Qwen 3.6 running locally dramatically outperforming Claude 4 Opus, which was a frontier model a year ago.

MrScruff 3 hours ago | parent | prev [-]

I’m normally comparing frontier open/cheap models against frontier closed source. I use deepseek/glm regularly, they’re fine and you can get real work done with them but it’s super obvious when you switch back to opus or even sonnet. A 3B active param MoE model is not comparable.

lambda an hour ago | parent [-]

Yeah. I was pointing out that local 3b active models outperform frontier models from a year ago.

Will this trend continue? Who knows. Both the frontier and local model will probably continue to get better. Which one will hit the top of the S-curve first? Hard to say, really. But what you can do right now locally is better than what you could do a year ago on the frontier, and lots of people were already using it pretty heavily a year ago.

Hoever, November is when most folks agree that the frontier models got good enough for much of their work. Local models aren't quite there yet (where by "local" I mean "can run at reasonable speed and quant on a system less that $10,000 with today's RAM and GPU prices"). The biggest open weights models are getting there, but those require something like an 8x H100 server to reasonably run.

It's likely that there will always be a gap between frontier and local if you're comparing models at the same time, you can just do a lot more with terabytes of HBM than gigabytes of DDR. But will local models get good enough to be usable for useful work? For many folks, they already are.

zozbot234 3 hours ago | parent | prev | next [-]

People can't seem to agree on what "Opus class" even means (the latest Opus is apparently pretty weak) but DeepSeek Pro, Kimi and GLM all are quite capable.

computerex 3 hours ago | parent [-]

Nothing compares to Opus when it comes to "taste" in web design in my experience. Nothing compares to opus in very difficult HPC/model inference development. I worked on this with opus: https://github.com/computerex/dlgo

OpenAI was offering 2x usage at one point and I still used opus just because it's so much more effective.

rvnx 3 hours ago | parent | prev [-]

To me totally yes, even further, if they keep their existing route, over time people will stop using Anthropic.

More and more specialized and ultra-performant chips are going to flood the consumer market. Especially once new hardware foundries will start producing (well if we don't die from WW3 in the interval).

In 10 years from now, when even basic computers will have 128 GB of memory, and phones will have super optimized tuned models, then what will be the point of Anthropic ?

Just use Gemma/Gemini/Siri or whatever.

Pornography and uncensored models is also pushing toward local models.

It's not like needs of people grows exponentially, the needs follow an asymptote instead (they are capped).

The real revolution is offline robots and self-driving cars, but LLMs are already quite maxed.

For programmers, now, what Anthropic offers is like 3% improvement on a known test (like this pelican riding a bicycle), or on questions leaked from benchmark insiders.

It's ok but not like revolutionary (Fable was better but it was unusable, easy 20 minutes per one prompt due to overthinking).

hparadiz 3 hours ago | parent | prev | next [-]

I am right there with you. Mind-boggling. It's a indistinguishable from magic technology!! I tried running some basic tasks through Qwen with Opencode on a 10 year old dual Xeon server for shits and giggles. I gave it a simple task like "use ffprobe first but convert this webm to mp4" and it was able to complete the task with zero network calls outside my network. On 10 year old hardware. It took about 3 minutes to complete the task. Now you may be saying 3 minutes? pfft. But I dare you to do it yourself. You're gonna be googling the CLI switches for at least 10 minutes and setting up your command. I had it actually optimize all the switches on the fly for me based on an initial ffprobe to see what is optimal.

bluerooibos 20 minutes ago | parent [-]

> 10 year old dual Xeon server...On 10 year old hardware.

Hold on, what are the specs of your rig? How much RAM?

I'm been considering getting an old refurbished 2018 Mac Mini with 64Gb of DDR4 RAM but everything I've read suggests this will be way slower than my 16Gb M1 Pro Macbook.

hparadiz 5 minutes ago | parent [-]

I inherited a box with dual Xeons and 256 GB of DDR4. I then ran several tests and benchmarks of the hardware with several models.

I've been meaning to write a blog post but well whatever here's the md.

https://gist.github.com/hparadiz/f3596d00a62d8ebb2dadcc46ee5...

Qwen3.5 9B performed best.

You can absolutely still use this to do some basic stuff like tell opencode to convert a video file from one format to another. But frankly you're better off getting two AMD GPUs. Say a dual 7900XT would get way better performance.

motbus3 3 hours ago | parent | prev | next [-]

Try deepseek V4 flash

nyxtom 2 hours ago | parent | prev | next [-]

Have you found that being much more spec driven helps guide it better?

amelius 3 hours ago | parent | prev | next [-]

Sounds super cool, don't get me wrong, but I suppose for most people the bar is higher than HTML/CSS.

q3k 2 hours ago | parent [-]

I love to warm up a whole rack of servers just so that some shitass buggy TUI can generate a line of bash that comments out my test runner.

We truly live in the dumbest timeline.

GardenLetter27 4 hours ago | parent | prev | next [-]

Could the harness not check for a failed tool call and pass it to a small model for correction without clogging up the main context?

lambda 3 hours ago | parent | next [-]

The thing is, to do a proper fix it would really need all of the context (maybe the tool call that failed was for an edit to a file that was last touched way at the beginning of the context), so you'd need to either keep that smaller model running doing prompt processing all the time, or have a very long wait while it does prompt processing on your whole session.

And then also, sometimes the tool call errors are because of something like a file was changed out from under it; the larger model is probably going to do a better job of figuring that out and fixing it up.

Finally, in Pi, you can always just use the /tree command to skip back to before a series of failed tool calls, with a summary if you want to let the model know what happened. The Pi /tree command is pretty powerful in managing your context

everforward an hour ago | parent [-]

An illustrative example I've seen a lot is creating Jira tickets in projects with custom fields marked as mandatory. It tries to create the ticket without the field and the tool call fails. The LLM needs access to the full context so that it can generate text to put in the "Why couldn't this meeting be an email?" field.

Greenpants 4 hours ago | parent | prev [-]

I'm actually quite sure that directly retrying the tool call would often fix the edit-call already. But these models have been trained to "think" for a while for any problem solving, so they'll presume the problem of the edit is more fundamental and spend unnecessary tokens filling up the context.

I'll experiment more with the effectiveness of AGENTS.md rules for local Pi agents. I feel like smaller (local) LLMs just lack in attentiveness to elements in the context window, like precise instructions, compared to e.g. Claude models.

yieldcrv 3 hours ago | parent | prev | next [-]

> It gets into loops quite often

matches my experience and a deal breaker

also the context window sizes are too low. I can't operate in 65,000 windows any more because even just reading the code's file structure overruns it and gets me nowhere. Definitely its own art form.

200k context windows and above for me now

I saw a paper last night that should help this a lot though

kennywinker 2 hours ago | parent | next [-]

Qwen3.6-35b handles 256k context fine if you’ve got room for it. I’m running it with 128k context with just 16gb vram.

Greenpants 2 hours ago | parent | prev [-]

I get that it's a deal breaker to some; it definitely requires patience.

In Pi, /new is my best friend and most-used command for sure. For simple tasks (I decompose complex ones anyway since I don't trust small local LLMs to do this for me), the model doesn't need much context, given that I'm proficient in my codebase myself: "I'd like Feature X. Look into files 1, 2 and 3 to make your edits."

nobody_r_knows 3 hours ago | parent | prev [-]

[dead]