woggy 8 hours ago

What's the chance of getting Opus 4.5-level models running locally in the future?

dragonwriter 8 hours ago | parent | next [-]

So, there are two aspects of that:

(1) Opus 4.5-level models that have weights and inference code available, and

(2) Opus 4.5-level models whose resource demands are such that they will run adequately on the machines that the intended sense of “local” refers to.

(1) is probable in the relatively near future: open models trail frontier models, but not by so much that this is likely to be far off.

(2) Depends on whether “local” means “in our on-prem server room” or “on each worker’s laptop”. Both will probably happen eventually, but the laptop one may be pretty far off.

SOLAR_FIELDS 8 hours ago | parent | prev | next [-]

Probably not too far off, but then you’ll probably still want the frontier model because it will be even better.

Unless we are hitting the maximum of what these things are capable of now, of course. But there’s not really much indication that this is happening.

woggy 8 hours ago | parent | next [-]

I was thinking about this the other day. If we plotted 'model ability' vs 'computational resources', what kind of relationship would we see? Is the improvement due to algorithmic advances or just more and more hardware?

chasd00 8 hours ago | parent | next [-]

I don't think adding more hardware does anything except increase serving throughput. I think most improvement gains come from specialized training (RL) after the base training is done. I suppose more GPU RAM means a larger model is feasible, so in that case more hardware could mean a better model. I get the feeling all the data centers being proposed are there either to serve the API or to create and train various specialized models from a general base one.

ryoshu 8 hours ago | parent | prev [-]

I think the harnesses are responsible for a lot of recent gains.

NitpickLawyer 8 hours ago | parent [-]

Not really. A 100-line "harness" that is basically an LLM in a loop with just a "bash" tool is way better today than the best agentic harness of last year.

Check out mini-swe-agent.
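
The whole pattern fits in a page of Python. A minimal sketch of the idea, not mini-swe-agent's actual code (the model name, prompt wording, and the <bash>/DONE markers are assumptions):

```python
# A minimal "LLM in a loop with a bash tool", in the spirit of mini-swe-agent.
# Illustrative sketch only: model name, prompts, and markers are made up.
import subprocess
from openai import OpenAI  # any OpenAI-compatible endpoint works

client = OpenAI()

SYSTEM = ("You are a coding agent. Reply with exactly one shell command, "
          "wrapped in <bash> and </bash>. When the task is finished, reply DONE.")

def extract_bash(text: str) -> str | None:
    """Pull the single <bash>...</bash> command out of the model's reply."""
    if "<bash>" not in text or "</bash>" not in text:
        return None
    return text.split("<bash>", 1)[1].split("</bash>", 1)[0].strip()

def run_agent(task: str, model: str = "gpt-4.1", max_steps: int = 30) -> None:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model=model, messages=messages).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        if "DONE" in reply:
            break
        cmd = extract_bash(reply)
        if cmd is None:
            messages.append({"role": "user",
                             "content": "Reply with one <bash>...</bash> command."})
            continue
        # The entire "tool": run the command, feed exit code + output back in.
        result = subprocess.run(cmd, shell=True, capture_output=True,
                                text=True, timeout=120)
        messages.append({"role": "user",
                         "content": f"exit={result.returncode}\n"
                                    f"{result.stdout}{result.stderr}"})

run_agent("List the Python files in this repo and count their lines.")
```

Point it at any OpenAI-compatible endpoint and that's essentially the entire "harness"; the gains over last year come from the model, not the loop.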

SOLAR_FIELDS 4 hours ago | parent [-]

Everyone is currently discovering independently that “Ralph Wigguming” is a thing.

gherkinnn 8 hours ago | parent | prev | next [-]

Opus 4.5 is at a point where it is genuinely helpful. I've got what I want and the bubble may burst for all I care. 640K of RAM ought to be enough for anybody.

dust42 8 hours ago | parent | prev [-]

I don't get all this frontier stuff. To this day, the best model for coding is DeepSeek-V3-0324. The newer models are getting worse and worse, trying to cater to an ever larger audience. Take the absolute suckage of emojis sprinkled all over the code in order to please lm-arena users. Honestly, who spends their time on lm-arena? And yet it spoils it for everybody. It is a disease.

Same goes for all these overly verbose answers. They are clogging my context window now with irrelevant crap. And being used to a model is often more important for productivity than SOTA frontier mega giga tera.

I have yet to see any frontier model that is proficient in anything but JS and React. I often get better results with a local 30B model running on llama.cpp, and the reason is that I can edit the model's answers too: I can simply kick all the extra crap out of the context and keep it focused. Impossible with SOTA and frontier.
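
That kind of context surgery is straightforward against llama.cpp's OpenAI-compatible server (llama-server), because the client owns the whole message list. A rough sketch, with the port, prompts, and trimming rule as placeholders only:

```python
# Sketch: a local llama-server (llama.cpp) behind its OpenAI-compatible API,
# with the history edited between turns. Assumes the server was started with
# something like `llama-server -m some-30b.gguf --port 8080`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")  # key unused locally

def trim(answer: str, max_lines: int = 40) -> str:
    """Keep only the part of a reply worth carrying into the next turn."""
    lines = [l for l in answer.splitlines()
             if not l.lstrip().startswith(("Sure", "Certainly", "Great"))]
    return "\n".join(lines[:max_lines])

messages = [{"role": "user", "content": "Refactor this C function: ..."}]
for step in range(3):
    reply = client.chat.completions.create(
        model="local",            # llama-server serves the loaded model regardless
        messages=messages,
    ).choices[0].message.content
    # The key difference from a hosted chat UI: store an *edited* answer,
    # so later turns only see what you actually want in the context window.
    messages.append({"role": "assistant", "content": trim(reply)})
    messages.append({"role": "user", "content": "Now the next function: ..."})
```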

teej 8 hours ago | parent | prev | next [-]

Depends on how many 3090s you have.

woggy 8 hours ago | parent [-]

How many do you need to run inference for 1 user on a model like Opus 4.5?

ronsor 8 hours ago | parent [-]

8x 3090.

Actually better make it 8x 5090. Or 8x RTX PRO 6000.
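
For a rough sense of where numbers like these come from: the weights at a given quantization plus the KV cache have to fit in aggregate VRAM. Anthropic doesn't publish Opus's size, so the parameter counts below are pure placeholders:

```python
import math

def cards_needed(params_b: float, bits_per_weight: int, kv_cache_gb: float,
                 vram_per_card_gb: float, overhead: float = 1.1) -> int:
    """Rough GPU count: (weights + KV cache) * overhead, divided by per-card VRAM."""
    weights_gb = params_b * bits_per_weight / 8   # 1B params at 8 bits ~ 1 GB
    return math.ceil((weights_gb + kv_cache_gb) * overhead / vram_per_card_gb)

# Hypothetical model sizes only -- Opus's real parameter count is not public.
for params_b in (200, 500, 1000):
    n_3090 = cards_needed(params_b, bits_per_weight=4, kv_cache_gb=40, vram_per_card_gb=24)
    n_pro = cards_needed(params_b, bits_per_weight=4, kv_cache_gb=40, vram_per_card_gb=96)
    print(f"{params_b}B params @ 4-bit: {n_3090}x 3090 or {n_pro}x RTX PRO 6000")
```

Depending on what you assume, you land anywhere between a handful of cards and a few dozen, which is why the estimates in this thread vary so much.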

adastra22 4 hours ago | parent | next [-]

48x 3090s, actually.

worldsavior 8 hours ago | parent | prev [-]

How is there enough space in this world for all these GPUs?

filoleg 8 hours ago | parent | next [-]

Just try calculating how many RTX 5090 GPUs by volume would fit in a rectangular bounding box of a small sedan car, and you will understand how.

A Honda Civic (2026) sedan has an exterior bounding box of 184.8" (L) × 70.9" (W) × 55.7" (H). The volume of that box is roughly 12,000 liters.

An RTX 5090 is 304 mm × 137 mm, with roughly 40 mm of thickness for a typical 2-slot reference/FE model. That makes for a bounding box of about 1.67 liters.

Do the math, and you will discover that a single Honda Civic is the volume equivalent of ~7,180 RTX 5090 GPUs. And that's a small sedan, significantly smaller than the average or median car on US roads.
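
Quick sanity check of that arithmetic (1 cubic inch ≈ 0.0164 liters):

```python
# Quick check of the volume math above.
CUIN_TO_L = 0.0163871          # liters per cubic inch

civic_l = 184.8 * 70.9 * 55.7 * CUIN_TO_L   # Civic exterior bounding box, in liters
gpu_l = (304 * 137 * 40) / 1_000_000        # 5090 bounding box: mm^3 -> liters
print(round(civic_l), round(gpu_l, 2), round(civic_l / gpu_l))
# prints roughly 11960, 1.67, 7180 -- i.e. ~12,000 L per car, ~1.67 L per GPU
```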

worldsavior 7 hours ago | parent [-]

What about everything around the GPU? Motherboard, etc.

Forgeties79 8 hours ago | parent | prev [-]

Milk crates and fans, baby. Party like it’s 2012.

kgwgk 8 hours ago | parent | prev | next [-]

99.99%, but then you will want Opus 42 or whatever.

lifetimerubyist 6 hours ago | parent | prev | next [-]

Never, because the AI companies are gonna buy up all the supply to make sure you can’t afford the hardware to do it.

rvz 7 hours ago | parent | prev | next [-]

Less than a decade.

greenavocado 8 hours ago | parent | prev | next [-]

GLM 4.7 is already ahead when it comes to troubleshooting a complex but common open-source library built on GLib/GObject. Opus tried but ended up thrashing, whereas GLM 4.7 is a straight shooter. I wonder if training-time censorship is kneecapping Western models.

sanex 8 hours ago | parent [-]

GLM won't tell me what happened in Tiananmen Square in 1989. Is that a different type of censorship?

heliumtera 8 hours ago | parent | prev [-]

RAM and compute are sold out for the foreseeable future, sorry. Maybe another timeline can work for you?