Qwen3 runs locally on reasonable hardware, and is comparable to a mid-2025 Claude Sonnet (albeit possibly rather slower).

Local models are chasing the online frontier models pretty hard.

So worst case, that's the fallback (FWIW, YMMV)

edit: Qwen-3.5 MoE (and other local MoE models like it)

▲

HWR_14 3 days ago | parent | next [-]

What's "reasonable hardware"?

▲

Someone1234 3 days ago | parent | next [-]

People have tried to run Qwen3-235B-A22B-Thinking-2507 on 4x used Nvidia 3090s at $600 each, with 24 GB of VRAM each (96 GB total), and while it runs, it is too slow for production-grade use (<8 tokens/second). So we're already at $2,400 before you've purchased system memory and CPU, and it is still too slow for a "Sonnet equivalent" setup...

You can quantize it, of course, but if the idea is "as close to Sonnet as possible," then while quantized models are objectively more efficient, they sacrifice precision for it.

So the next step is to up that speed, which puts us at 4x Nvidia 5090s at $1,300 each, with 32 GB of VRAM each (128 GB total), or $5,200 before RAM/CPU/etc. All of this additional cost is to increase your tokens/second without lobotomizing the model. This still may not be enough.
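For anyone who wants to redo that arithmetic with their own prices, here is a minimal sketch. The card counts, prices, and VRAM sizes are the ones quoted above; the `GpuRig` class and its field names are hypothetical, made up for illustration. It covers GPU cost only, not RAM, CPU, or electricity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuRig:
    """A multi-GPU build described only by card count, price, and VRAM."""
    name: str
    cards: int
    price_per_card_usd: int
    vram_per_card_gb: int

    @property
    def total_cost_usd(self) -> int:
        return self.cards * self.price_per_card_usd

    @property
    def total_vram_gb(self) -> int:
        return self.cards * self.vram_per_card_gb

    @property
    def usd_per_gb(self) -> float:
        # Cost of each GB of VRAM, GPUs only.
        return self.total_cost_usd / self.total_vram_gb


# Figures as quoted in the comment above.
RIGS = [
    GpuRig("4x used 3090", cards=4, price_per_card_usd=600, vram_per_card_gb=24),
    GpuRig("4x 5090", cards=4, price_per_card_usd=1300, vram_per_card_gb=32),
]

for rig in RIGS:
    print(
        f"{rig.name}: ${rig.total_cost_usd} for {rig.total_vram_gb} GB "
        f"(${rig.usd_per_gb:.2f} per GB of VRAM)"
    )
```

Swap in current local prices before drawing conclusions; used-card prices in particular move around a lot.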

I guess my point is: You see this conversation a LOT online. "Qwen3 can be near Sonnet!" but then when asked how, instead of giving you an answer for the true "near Sonnet" model per benchmarks, they suddenly start talking about a substantially inferior Qwen3 model that is cheap to run at home (e.g. 27B/30B quantized down to Q4/Q5).

The local models absolutely DO exist that are "near Sonnet." The hardware to actually run them is the bottleneck, and it is a HUGE financial/practical bottleneck. A $10K all-in budget isn't actually insane for this class of model, and the sky really is the limit (again, to reduce quantization and/or increase tokens/second).

PS - And electricity costs are non-trivial for 4x 3090s or 4x 5090s.

▲

Kim_Bruning 3 days ago | parent | next [-]

I may have genuinely new data for you.

Qwen3.5-35B-A3B is reported to perform slightly better than the model you mentioned.

It runs fine, if not optimally, on a single 3090 even with 131072 tokens of context, and due to the hybrid attention architecture, the memory usage and compute scale rather less drastically than ctx^2. I've had friends with smaller cards still getting work out of it. Generation is at around 20 tokens/sec on that 3090 (without doing anything special yet). You'll need enough DRAM to hold the bits of the model that don't fit. Nothing to write home about, but genuinely usable in a pinch or for tasks that don't need immediate interactivity.

It's the first local model that passes my personal kimbench usability benchmark, at least. Just be aware that it is extremely verbose in thinking mode. Seems to be a Qwen thing.

(edit: On rechecking my numbers, I now realize I can possibly optimize this a lot better)

▲

Someone1234 3 days ago | parent | next [-]

With respect, this isn't "new data"; it is an anecdote. And it kind of represents exactly the problem I was talking about above:

- Qwen is near Sonnet 4.5!

- How do I run that?

- [Starts talking about something inferior that isn't near Sonnet 4.5].

It is this strange bait-and-switch discussion that happens over and over. Not least because Sonnet has a 200K context window, and most of these anecdotes aren't for anywhere near that context size.

▲

Kim_Bruning 3 days ago | parent [-]

You're not wrong; but... imho it's closer to Sonnet 4.0 [1] on my personal benchmark [2]. And I HAVE run it at just over 200K tokens of context; it works, it's just a bit slow at that size. It's not great, but ... usable to me? I used Sonnet 4.0 over API for half a year or so before, after all.

The only way to know whether your own criteria are met yet is to test it yourself with your own benchmark or what have you. And it does show a promising direction going forward: usable (to some) local models becoming efficient enough to run on consumer hardware.

[1] released mid-2025

[2] take with salt - only tests personal usability

Note that some benchmarks do show Qwen3.5-35B-A3B matching Sonnet 4.5 (released later last year), but I treat those with the same skepticism you do, clearly ;)

▲

yencabulator a day ago | parent | prev [-]

One sure would expect Qwen3.5-35B-A3B to "perform slightly better" than Qwen3-235B-A22B!

▲

zozbot234 3 days ago | parent | prev [-]

> The hardware to actually run them is the bottleneck, and it is a HUGE financial/practical bottleneck.

That's unsurprising, seeing as inference for agentic coding is extremely context- and token-intensive compared to general chat. Especially if you want it to be fast enough for a real-time response, as opposed to just running coding tasks overnight in a batch and checking the results as they arrive. Maybe we should go back to viewing "coding" as a batch task, where you submit a "job" to be queued for the big iron and wait for the results.
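A minimal sketch of that "submit a job and wait" shape, using only the Python standard library. `run_job` is a hypothetical stand-in for whatever slow local-model call you actually make; nothing here is tied to a real inference API.

```python
import queue
import threading


def run_job(prompt: str) -> str:
    """Hypothetical placeholder for a slow local-model invocation."""
    return f"result for: {prompt}"


def worker(jobs: queue.Queue, results: dict) -> None:
    """Pull jobs off the queue one at a time until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        job_id, prompt = job
        results[job_id] = run_job(prompt)


def run_batch(prompts: list[str]) -> list[str]:
    """Queue every prompt, let a single worker drain the queue, return results in order."""
    jobs: queue.Queue = queue.Queue()
    results: dict[int, str] = {}
    thread = threading.Thread(target=worker, args=(jobs, results))
    thread.start()
    for job_id, prompt in enumerate(prompts):
        jobs.put((job_id, prompt))
    jobs.put(None)  # sentinel: no more work
    thread.join()
    return [results[i] for i in range(len(prompts))]


if __name__ == "__main__":
    for line in run_batch(["refactor module A", "write tests for module B"]):
        print(line)
```

A single worker matches the "one big machine, one queue" picture; with slow hardware the point is that nobody sits waiting on an interactive prompt.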

▲

Borealid 3 days ago | parent | prev | next [-]

A machine with 128GB of unified system RAM will run reasonable-fidelity quantizations (4-bit or more).

If you ever want to answer this type of question yourself, you can look at the size of the model files. Loading a model usually uses an amount of RAM around the size it occupies on disk, plus a few gigabytes for the context window.

Qwen3.5-122B-A10B is 120GB. Quantized to 4 bits it is ~70GB. You can run a 70GB model in 80GB of VRAM or 128GB of unified normal RAM.
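That rule of thumb can be sketched as a quick calculation. Assumptions of mine, not from the model card: about 4.5 effective bits per weight for a typical 4-bit quantization (the format overhead varies), and a flat 8 GB allowance for context. Both function names are hypothetical.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(
    params_billion: float,
    bits_per_weight: float,
    memory_gb: float,
    context_overhead_gb: float = 8.0,  # assumed allowance for the context window
) -> bool:
    """True if weights plus the context allowance fit in the given memory."""
    return weights_gb(params_billion, bits_per_weight) + context_overhead_gb <= memory_gb


if __name__ == "__main__":
    # 122B parameters at an assumed ~4.5 effective bits per weight.
    print(f"{weights_gb(122, 4.5):.1f} GB of weights")
    for memory_gb in (64, 80, 128):
        print(f"fits in {memory_gb} GB: {fits(122, 4.5, memory_gb)}")
```

With those assumptions the 122B model comes out just under 70 GB of weights, in line with the ~70GB figure, and at 8 bits per weight the same formula gives 122 GB, close to the 120GB quoted. The file size on disk remains the more reliable number when you have it.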

Systems with that capability cost a few thousand USD to purchase new.

If you are willing to sacrifice some performance, you can take advantage of the model being a mixture-of-experts and use disk space to get by with less RAM/VRAM, but inference speed will suffer.

▲

fy20 3 days ago | parent | prev [-]

If you want something off the shelf get a MacBook Pro M5 (base "Pro" CPU) with 48GB RAM:

Gemma 4 31B Q6: 9tok/s, I'd say it is smarter than GPT-4o, but yeah it's slow. Good for coding.

Gemma 4 26B A4B Q4: 50tok/s. Feels faster than ChatGPT 5.4, but not as smart (as it reasons less). Good for general chatting and research.

▲

fortyseven 2 days ago | parent | prev [-]

Give Gemma4 a look, too. I've had terrific results with that and OpenCode locally.