root_axis 6 hours ago

You are greatly underestimating the hardware requirements for productive local LLMs. Research consistently shows that parameter count sets the practical ceiling on a model's reliability. Quantized models with double-digit-billion param counts will never be reliable enough to achieve results in the realm of something like Opus 4.6.

thot_experiment 2 hours ago | parent | next [-]

Flat wrong. Q6 Gemma 31b feels a lot like Opus 4.5 to me when run in a harness so it can retrieve information and ground itself. The gap is not that big for a lot of use cases. Qwen MoE is fast as fuck locally for things that are oneshottable. I have subscriptions to all the major providers right now, and since Gemma 4 and Qwen 3.6 came out I haven't hit limits a single time. I'm actually super surprised by the number of things I try with Gemma 4 with the intent of seeing how it fails, then having Claude do it, only to come away with something perfectly usable from the local model.
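For readers unfamiliar with what "a harness" means here: a loop that lets the model call a retrieval tool and feeds the result back before it answers. A minimal sketch, with a toy keyword lookup and a stubbed `ask_model` standing in for a call to a real local LLM server (the tool protocol and function names are illustrative, not any real API):

```python
# Minimal tool-use loop: the model may emit "TOOL:<query>" replies; the
# harness runs the lookup and appends the result to the transcript so the
# model can ground its final answer in retrieved text.

def lookup(query, corpus):
    """Toy retrieval: return the corpus entry whose key appears in the query."""
    for key, text in corpus.items():
        if key in query.lower():
            return text
    return "no result"

def run_harness(ask_model, corpus, question, max_turns=4):
    transcript = f"USER: {question}"
    for _ in range(max_turns):
        reply = ask_model(transcript)
        if reply.startswith("TOOL:"):
            result = lookup(reply[5:], corpus)
            transcript += f"\n{reply}\nRESULT: {result}"
        else:
            return reply  # model produced a grounded final answer
    return "gave up"
```

The point of the grandparent's claim is that even a small model can be reliable inside a loop like this, because it never has to answer from parametric memory alone.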

cbg0 an hour ago | parent | next [-]

Your n=1 might not be very relevant outside your personal use. In less contaminated benchmarks Gemma 4 is way below Sonnet 4.5, let alone Opus models: https://swe-rebench.com/

larodi an hour ago | parent | next [-]

I’m building a pipeline and testing against Gemma 4 and Gemini 3.1 Flash. Both are very good on certain tasks, and even n-way clustering works almost perfectly, almost always.

But they diverge greatly on other particular tasks, whenever the ViT tower and a priori knowledge of the world are crucial. I wish Gemma were on par, but both Google and I know it's not.

onion2k an hour ago | parent | prev [-]

You do need to ask whether Sonnet or Opus is overkill for a lot of work, though. If Gemma 4 with some human effort can achieve the same result as Sonnet, then it's arguably a lot more cost-effective, since you're paying for the person operating the model either way.

thot_experiment a few seconds ago | parent [-]

I 100% agree with your philosophy, but I wanna note that I genuinely find Gemma 4 31b to be better than Sonnet. To be clear, this makes NO sense to me, so I'm probably just high and making stuff up, or biased by a small sample size since I don't use Sonnet that often. I find that Gemma 4 makes the sort of "dumb AI" mistakes less often than Sonnet does, especially in agentic mode. I genuinely don't know how that can be true, but Sonnet feels much more like "autocomplete" while Gemma 4 feels like "some facsimile of thought".

alfiedotwtf an hour ago | parent | prev | next [-]

I’m guessing Qwen3.6 for agentic coding and Gemma4 for non-coding stuff?

thot_experiment 20 minutes ago | parent [-]

No, exactly the opposite, actually. Qwen 3.6 is too imprecise for long-running agentic tasks; it doesn't have the same ability to check itself that Gemma does in my testing. I keep Qwen MoE in VRAM by default because there are tons of tasks I trust it to oneshot and its 90 tok/sec is unparalleled, but for anything where I don't want to intervene too much it can't be trusted.

root_axis an hour ago | parent | prev [-]

Sorry but you're just seeing what you want to see. The idea that a 31b model is anywhere even in the ballpark of something like Opus 4.5 is just absurd on its face.

thot_experiment 10 minutes ago | parent [-]

False. Absolute capability is irrelevant; with the proper harness, 31b is more than adequate for a very large portion of the tasks I ask AI to do. The metric isn't how good the model is at Erdős problems, it's how reliably it can remove drudgery from my life. It just autonomously reverse-engineered a Bluetooth protocol with minimal intervention, and its ability to react to data and ground itself constantly impresses me. I do a ton of testing with these models; today I had Gemma answer a physics problem that Opus 4.7 gave up on. With a decent harness and context, the set of tasks where both are good enough is surprisingly large. The tasks I have that stump Gemma often also stump Opus 4.7.

wincy 6 hours ago | parent | prev | next [-]

Won’t these H100s drop in price in a few years? With the data center build-out, surely these will become 1/10th the price and you’ll be able to set up a local LLM as good as Opus 4.7. Even if the frontier models become more advanced and memory-hungry, you could run a current-day frontier model as needed on about the same power draw as your oven. If I could drop $10,000 to have an effectively permanent Opus 4.7 subscription today, I would.

root_axis 5 hours ago | parent | next [-]

> Won’t these H100s drop in price in a few years

Doubtful. The increase in demand is greatly outpacing supply, and all signs point to a continued acceleration in demand.

> If I could drop $10,000 to have an effectively permanent opus 4.7 subscription today, I would.

lol well obviously, but realistically that price point is going to be closer to $100k, with a perpetual $1k a month in power costs.
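For a rough sense of where a figure like $1k/month comes from, back-of-the-envelope arithmetic for an 8-GPU node running 24/7 (the wattage, overhead, and electricity price below are assumptions for illustration, not measurements):

```python
# Rough monthly power cost for an always-on 8-GPU inference node.
# All inputs are assumed round numbers, not measured values.
gpus = 8
watts_per_gpu = 700        # roughly an H100 SXM board power figure
overhead_watts = 1500      # CPU, fans, PSU losses, networking
price_per_kwh = 0.15       # USD; varies a lot by region

total_kw = (gpus * watts_per_gpu + overhead_watts) / 1000   # 7.1 kW
monthly_kwh = total_kw * 24 * 30
monthly_cost = monthly_kwh * price_per_kwh
print(round(monthly_cost))  # on the order of $770/month, before cooling
```

Add cooling and a less favorable electricity rate and the perpetual ~$1k/month figure above is in the right ballpark.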

wincy 5 hours ago | parent | next [-]

Cool, thanks for the information. I guess they drive prices down by massively parallelizing requests across, say, an 8×H100 array, so the cost is spread across many users. So if I wanted to use it for 8 hours a day in my theoretical world, it'd be too expensive. My work definitely wouldn't pay $100,000 for a server farm even if it'd give an AI to all our employees; you'd need engineers, a colocation space, basically all the problems that companies didn't like and went to AWS to avoid.

root_axis 4 hours ago | parent [-]

Well, $100k was a generous guesstimate for some time in the future when something like Opus 4.7 is old news.

If we think about the near future, something like Kimi2.6 is within the realm of Opus 4.6 today, but requires closer to $700k in hardware to run.

dyauspitr an hour ago | parent | prev | next [-]

Why? These models are going to keep drastically improving and given all the new data centers token prices will probably drop a lot in the future. Seems shortsighted given the absurd timelines these things have been improving on.

aaronblohowiak an hour ago | parent | prev [-]

taalas!!!

33MHz-i486 5 hours ago | parent | prev [-]

Opus 4.7-caliber models are trillions of params, and a single instance would likely run on multiple H200s, i.e. $100k of hardware. Not coming to your laptop anytime soon.

stubish 2 hours ago | parent | prev | next [-]

It depends on what you mean by 'productive'. The article mainly seems to be about targeting consumer-level hardware, such as the Neural Processing Unit you need for a 'Copilot PC'. Windows Recall is (was?) one such local AI application. If Microsoft gets its way and my next PC has one, I look forward to using it for 'productive' purposes such as playing games, handling natural language stuff, and leaving my GPU free for GPUing.

segmondy 6 hours ago | parent | prev | next [-]

Joke's on you. We are already running DeepSeek v4 Flash, Mimo 2.5, MiniMax 2.7, and Qwen3-397B locally on very affordable hardware. These models are in the realm of Opus 4.6. For those of us a bit crazy, we are running Kimi K2.6, GLM 5.1, and more...

root_axis 5 hours ago | parent | next [-]

I have two A100s and have been playing with local models for years. There are definitely moments where they are quite impressive, but small context sizes and unreliability become immediately obvious.

> For those of us a bit crazy, we are running KimiK2.6, GLM5.1

Yes, those can compare to Opus, but you can't run those unquantized for less than $400k in hardware.

doctorpangloss 5 hours ago | parent [-]

Two Mac Studio M3 Ultras (512GB each) and one USB cable can run all those models, maybe about $30,000 in hardware, and based on my benchmarks, those Mac Studios were twice as fast as the A100s on DeepSeek v4 Flash, which has a quantization, but not really a lossy one.

root_axis 4 hours ago | parent [-]

That cannot run KimiK2.6 or GLM5.1, i.e. the models within the ballpark of anything offered by frontier companies.

binyu 5 hours ago | parent | prev [-]

They definitely all still fall short of Opus 4.6, though. They are good but fail on extremely complex tasks, in contrast with a frontier model that will keep trying until it succeeds or exhausts the solution space.

julianlam 4 hours ago | parent [-]

Not by much, and moving goalposts makes for a bad comparison. Local open weight models are already more powerful than frontier models from only a year back.

If you believe what you read here, the gap is closing fast.

CuriouslyC 6 hours ago | parent | prev | next [-]

Parameter size gets you world knowledge and better persistence of behavior as context grows. Both of those things can be engineered around to a large degree, and the latest Qwen models show that small models can be quite smart in narrow domains and short time windows.

alfiedotwtf an hour ago | parent [-]

… maybe we should just teach models how to get their world knowledge from a local Postgres connection! Then the model can be tiny, and it can query to its little heart's desire AND run on commodity hardware TODAY!
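This is essentially retrieval-augmented generation with a database as the knowledge store. A self-contained sketch of the idea, using the stdlib's sqlite3 as a stand-in for Postgres so it runs anywhere (with real Postgres you'd use a driver like psycopg plus full-text search or a vector index instead of the naive keyword match below; the table and data are made up for illustration):

```python
import sqlite3

# World knowledge lives in a table; the harness pulls matching rows and
# prepends them to the tiny model's prompt instead of relying on its weights.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE facts (topic TEXT, body TEXT)")
db.executemany("INSERT INTO facts VALUES (?, ?)", [
    ("h100", "The NVIDIA H100 is a data-center GPU with 80GB of HBM."),
    ("bluetooth", "Bluetooth LE exposes data via GATT services."),
])

def context_for(question):
    """Naive keyword retrieval: rows whose topic string appears in the question."""
    rows = db.execute(
        "SELECT body FROM facts WHERE instr(lower(?), topic) > 0", (question,)
    ).fetchall()
    return "\n".join(body for (body,) in rows)

prompt = context_for("what kind of memory does an H100 have?")
# `prompt` would be prepended to the user question before calling the model
```

The design point: the model only needs to be smart enough to read the retrieved rows, not to memorize them, which is exactly the "tiny model on commodity hardware" trade being proposed.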

josteink 30 minutes ago | parent | prev | next [-]

> You are greatly underestimating the current hardware requirements for productive local LLMs.

Fixed that for you. Right now most models produced are based on floating point maths and probabilities, which are "expensive" to do math on.

Microsoft has researched 1-bit LLMs which can run much more efficiently, and on much cheaper hardware[1].

If this research is reproducible and reusable outside their research models, the cost of running self-hosted LLMs will drop by an order of magnitude once it hits the mainstream.

[1] https://github.com/microsoft/BitNet

byzantinegene 6 hours ago | parent | prev [-]

I would argue we don't need anything near Opus to be productive. Sonnet is plenty productive enough.

root_axis 6 hours ago | parent | next [-]

I use Opus 4.6 as an example because it's the LLM that has been widely recognized by the public as being reliably capable of doing real work across many domains. However, the same logic applies to Opus 4.5 and even previous generations. These models have huge parameter counts and large context sizes, and there's no training technique that can compensate for the lack of those qualities in small, quantized models.

JumpCrisscross 6 hours ago | parent | prev [-]

> we don't need anything near Opus to be productive. Sonnet is plenty productive enough

For niche applications, sure. For general use, I think the tendency toward the best model being used for everything will, to the model publishers' delight, continue. It's just much easier to get a feel for Opus and then do everything with it, versus switching back and forth and keeping track of how Haiku came up with novel ways to dumbfuck this Sunday evening.