This has been exactly my experience too. I've tried multiple harnesses (pi, claude code, codex) with multiple variants of qwen3.6 and gemma4 driven by both o mlx and ollama - and every single time I try to do anything meaningful I end up in a loop. On a 64GB Macbook Pro M3 Max.

I really don't know what the hell people are doing locally, and suspect a lot of the hype around running these models locally is bullshit. Sure, you can make it do something but certainly nothing useful or substantial.

▲

sleepyeldrazi 2 hours ago | parent | next [-]

I have been testing and using Qwen3.6 27B (running from my 3090) since it dropped and I genuinely think this is the first consumer hardware-grade model that can actually replace frontiers for a lot of workloads.

I ran 8 tests on a variety of open-weights models, and opus 4.7 (1mil ctx version) and the little dense model was right behind it: https://github.com/sleepyeldrazi/llm_programming_tests/tree/... Of note is that opus was the only model to push back against the spec on the hardest challenge, saying 'thats not possible', when there are links in the spec to examples of it being done.

There may be problems with the mlx versions, as i haven't had any looping in all the testing i've done, which is all my agentic and coding work the last couple of days (since it dropped). I have had tool_call misses 4 or 5 times so far, which isn't ideal but no looping. First I used it in pi-mono and later when i realized it's a serious model switched to opencode.

My setup is llama.cpp running on a 3090 in WSL, unsloth IQ4_NL with those flags: --ctx-size 128000 \ --jinja \ --temp 0.6 \ --top-p 0.95 \ --top-k 20 \ --min-p 0.0 \ --repeat-penalty 1.0 \ --presence-penalty 0.0 \ --threads 12 \ --gpu-layers 99 \ --no-warmup \ --no-mmap \ -fa on

▲

bityard 2 hours ago | parent | prev | next [-]

Hosted models are big, and there is a lot going on behind the scenes that we users have no visibility into. OpenAI, Anthropic, Google, etc do much more than just feed raw prompt tokens straight into a big 1-2TB static model and pipe the output tokens back to the web browser. The result of this is that they can do more, and end-users can get away with a lot more in terms of vague prompts and missing background.

The biggest lesson I've learned working with local models so far is: with the smaller models, you have to understand their limitations, be willing to run experiments, and fine-tune the heck out of everything. There are endless choices to be made: which model to use, which quant, thinking or not, sampling parameters, llama.cpp vs vLLM, etc. They much more fiddly for serious work than just downloading Claude Code and having it one-shot your application. But some of us enjoy fiddling so it all works out in the end.

▲

usagisushi an hour ago | parent | prev | next [-]

If the "loop" you mean is the infinite reasoning cycle ("Wait, actually... On second thought..."), you might want to try setting a reasoning budget. For llama.cpp, use `--reasoning-budget 1024 --reasoning-budget-message "Proceed to final answer."` to force the model to reach a conclusion.

I admit I sometimes get caught up in the tooling for its own sake, but I find local models useful for specific tasks like migrating configuration schemas, writing homelab scripts, or exploring financial data.

It might sound a bit paranoid, but privacy is another major driver for me. Keeping credentials and private information off cloud services is worth the extra friction.

▲

NitpickLawyer 3 hours ago | parent | prev | next [-]

> a lot of the hype around running these models locally is bullshit. Sure, you can make it do something but certainly nothing useful or substantial.

There is certainly a lot of hype around local models. Some of it is overhype, some of it is just "people finding out" and discovering what cool stuff you can do. I suspect the post is a reply to the other one a few days ago where someone from hf posted a pic with them in the plane, using a local model, and saying it's really really close to opus. That was BS.

That being said, I've been working with local LMs since before chatgpt launched. The progress we've made from the likes of gpt-j (6B) and gpt-neoX (22B) (some of the first models you could run on regular consumer hardware) is absolutely amazing. It has gone way above my expectations. We're past "we have chatgpt at home" (as it was when launched), and now it is actually usable in a lot of tasks. Nowhere near SotA, but "good enough".

I will push back a bit on the "substantial" part, and I will push a lot on "nothing useful". You can, absolutely get useful stuff out of these models. Not in a claude-code leave it to cook for 6 hours and get a working product, but with a bit of hand holding and scope reduction you can get useful stuff. When devstral came out (24B) I ran it for about a week as a "daily driver" just to see where it's at. It was ok-ish. Lots of hand holding, figured out I can't use it for planning much (looked fine at a glance, but either didn't make sense, or used outdated stuff). But with a better plan, it could handle implementation fine. I coded 2 small services that have been running in prod for ~6mo without any issues. That is useful, imo. And the current models are waaay better than devstral1.

As to substantial, eh... Your substantial can be someone else's taj mahal, and their substantial could be your toy project. It all depends. I draw the line at useful. If you can string together a couple of useful tasks, it starts to become substantial.

▲

bachmeier 2 hours ago | parent | prev | next [-]

> Sure, you can make it do something but certainly nothing useful or substantial.

It works great for me. But I like to review the code and understand what it's doing, which doesn't appear to be how people do "useful or substantial" programming these days.

▲

ryandrake 4 hours ago | parent | prev | next [-]

Same here. Every time a new local model comes out, I give it a spin with a pretty vanilla coding task ("refactor this method to take two parameters instead of one", or "fix this class of compiler warning across the ~20 file codebase") and more often than not, they get in endless loops, or fail in very unusual ways. They don't yet even approach the usefulness of SOTA models. It's obviously not a fair comparison, though. My 20GB GPU is never going to beat whatever enormous backend Google or Anthropic have.

	▲	2ndorderthought 4 hours ago \| parent [-]
		You can do this with really small models but you have to do a more legwork. I wouldn't expect most trivially small models to handle anything more than 1 file reliably. The new qwen 3.6 is different though, I have heard cases where it is behaving close to sonnet. That said I don't see why people are so scared to touch code even if it saves them 500 euro a month. Using my IDEs find across my repo and auto replacing 2 patterns is trivial to do and way faster to do by hand. I mostly use small models, it prevents a lot of the issues I've seen with large models and vibe/agentic coding medium to long term. I also write a lot of code.

▲

proxysna 4 hours ago | parent | prev | next [-]

You need to set sampling parameters for the llm. Had the same issue with Qwen3.5 when i first started. You can grab them off the model card page usually.

From Qwen3.6 page:

Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

Instruct (or non-thinking) mode: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

▲

deanc 4 hours ago | parent | next [-]

Yes, have tried all of these (as per the docs). Have you actually tried these? Because I have tried all 3 configurations with agentic coding that you mentioned and have the same result - loops.

	▲	proxysna 2 hours ago \| parent [-]
		I've used only Qwen3.5 so far for work and was, after initial struggles, successful with GPU setup, no mlx. Ngl the fact that they are using `presence_penalty: 0` and no `max_tokens` is weird after that exact setup caused me "initial struggles", but i've set up a simple docker-compose with vllm and qwen3.6 right now to test it out and it worked perfectly fine for me. Gist with the compose and example of an output. https://gist.github.com/meaty-popsicle/f883f4a118ff345b430c3...

▲

Der_Einzige 2 hours ago | parent | prev [-]

min_p author here. min_p is strictly better than top_p and top_k. The big labs don't know shit about sampling, and give absolutely nuts recommendations like this.

set min_p to like 0.3 and ignore top_p and top_k and you'll be fine.

There's better samplers now like top N sigma, top-h, P-less decoding, etc, but they're often not available in your LLM inference engine (i.e. vLLM)

▲

mft_ 3 hours ago | parent | prev | next [-]

I’m frequently surprised how little I can find online about exactly this - different harnesses for local models and how to set them up. The documentation for opencode with local models is (IMO) pretty bad - and even Claude Opus (!) struggled to get it running. And so far I’ve not found a decent alternative to Claude Desktop.

(I’ve recently discovered that you can pipe local models into Claude’s Code and Desktop, so this is on my list to try).

▲

2ndorderthought 3 hours ago | parent [-]

Qwen3.6 is brand new. But also, search engines are so plastered with AI slop that is written by tools and companies that have no interest in you using local models. Ollama makes it 1 command to run local small models, but with the newest ones there can be kinks to work out first.

/R/localllama is okay for some information but beyond that there is so much noise and very little signal. I think it's intentional.

▲

mft_ 2 hours ago | parent [-]

Thanks. I’ve been experimenting with local models for over a year now, on and off, so this isn’t just limited to the latest Qwen. Anyway, I have no problem running them, but there’s a huge difference between running something via a chat interface and running it a la Claude Code so that it can interact with the local environment and create/edit files. This is the aspect that’s difficult, in my experience.

	▲	RALaBarge 2 hours ago \| parent [-]
		It’s all about tooling, if the ai can fetch data it can do something rad with it. Use something like an ai harness to have an mcp server and other tooling to improve the harness and the tools I made this for my own learning: GitHub.com/ralabarge/beigebox

▲

2ndorderthought 4 hours ago | parent | prev | next [-]

In the article the author describes what they made. It's definitely not bullshit, but it's also not as reliable or as handsfree as the 1t models.

For people who aren't completely vibe or agent coding these models are better than say copilot or the free models appearing after a Google search. Probably better than chatgpts flagships in some ways.

I mostly use 4b to 9b models for basic inquiries and code examples from libraries I haven't used before. Many of them can solve pretty hard math problems, and these are several steps away from say qwen3.6.

I would not discount running models locally. It's the best case scenario of a future with LLMs from a human rights and ecological perspective.

▲

xienze 3 hours ago | parent | prev [-]

It's probably a combination of things:

* New models running in llama.cpp (what's under the hood of ollama et al) frequently require bug fixes.

* The GGUF models that run in llama.cpp frequently require bug fixes (Unsloth is notorious for this -- they release GGUF models about 10 minutes after official .safetensors releases).

* You're probably running a <Q8 quantization of the model, and a good chance <BF16 quantization for KV cache. This makes for compounding issues as context grows and tool calls multiply.

Local models really are great but I think a major problem are the people in groups like r/localllama who run models at absurd quantization levels in order to cram them on their underpowered hardware and convince themselves that they're running SOTA at home.

The best way to run these models is, frankly, a lot of VRAM and vLLM (which is what the people developing these models are almost certainly targeting).