onlyrealcuzzo 6 hours ago

I won't be surprised if the next gen frontier models are the last.

There's orders of magnitude of low hanging juice to squeeze out of smaller models.

It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years (design not certain, probably unlikely).

It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.

As far as reasoning is concerned, with the recent GRAM release, there may be 4 orders of magnitude of reasoning to tack on to smaller models.

Think about that... Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T params... They could upgrade that to a ~600B MoE model in days to have general trivia knowledge rivaling the best models...

You just can't train a 1T+ parameter model that fast. It is a giant "if" how much GRAM turns out to improve things, but the improvement is unlikely to be trivial or nothing.

Larger models can already sort of tell you anything. They're never going to get everything right unless they stop being LLMs.

There's just not a lot of juice left to squeeze for Gemini to tell you exactly how tall Ke$ha is or when the last time Britney Spears went to jail was...

vlovich123 5 hours ago | parent | next [-]

Took me a while to find what you were referring to by gram. Arxiv paper from 9 days ago that's not properly indexed by search engines.

(G)enerative (R)ecursive re(A)soning (M)odels. They really wanted the acronym.

https://arxiv.org/html/2605.19376v1

knollimar 5 hours ago | parent | next [-]

I prefer GRRM but then that would imply a habit of not actually getting a final result

troyvit 2 hours ago | parent | next [-]

And then every time I ask it to hurry along it kills a Stark.

anakaine an hour ago | parent [-]

Version 8 had serious flaws and wasn't received well by users.

sharken an hour ago | parent | prev [-]

Claude Opus 4.8 suggests "ReGRAM", which is less bad than GRAM.

areweai 5 hours ago | parent | prev | next [-]

That acronym is unacceptable. It's going to impede discussion and cause confusion for a long time if it doesn't die off immediately.

sebzim4500 4 hours ago | parent | next [-]

You think that's bad? I introduce you to LION, (evoLved sIgn mOmeNtum) [1]

[1] https://arxiv.org/pdf/2302.06675

jorvi 27 minutes ago | parent | prev | next [-]

We're still talking about "zero-shot prompt" when the saying "X-shotted" ["One-shotted the difficult maze"] was already a well-established thing in daily vernacular. So now you constantly have to readjust your brain because whenever you read "zero-shot prompt" your mind goes "uh.. a zero-try attempt is a paradox and cannot exist".

froh 3 hours ago | parent | prev | next [-]

confusing indeed. I wondered "which RAM? nvram? dram? vram? dram? now what's g-ram?"

3form 2 hours ago | parent [-]

GPU RAM, clearly. At least that's where my mind went.

drakythe an hour ago | parent [-]

We already have VRAM for that purpose, thankfully.

evan_ 5 hours ago | parent | prev | next [-]

  "Analysis" was right there
gchamonlive 5 hours ago | parent | prev [-]

Yeah, look what happened to GNU

dyates 5 hours ago | parent | prev | next [-]

And to think, we could have had George RR Martins instead.

trollbridge 5 hours ago | parent [-]

Speaking of things that never finish.

867-5309 5 hours ago | parent [-]

my wife assures me it's common..

mindcrime 4 hours ago | parent [-]

is her name Jenny by chance?

867-5309 3 hours ago | parent [-]

what are the odds

jimbokun 4 hours ago | parent | prev | next [-]

Just spell it GRRM but pronounce it “gram” if you have to reference it in spoken conversation.

Which will be pretty rare.

freehorse 3 hours ago | parent [-]

Grrm with a rolling r sounds better.

5 hours ago | parent | prev | next [-]
[deleted]
yieldcrv 2 hours ago | parent | prev [-]

G return G

nbardy 39 minutes ago | parent | prev | next [-]

There are endless returns to frontier intelligence. Just because most people can't make use of it doesn't mean someone can't make a ton of money off of it.

Most software engineers will just need cheap tokens.

But things like physics and drug discovery have no foreseeable upper bound.

supern0va 6 hours ago | parent | prev | next [-]

>It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.

I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.

If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.

I'm curious if someone here with a stronger background in the space has a similar intuition or not.

ACCount37 27 minutes ago | parent | next [-]

Scale is always desirable, and there are always gains from scale. It's a matter of whether you can afford training and inference at increased scale.

There is a real trend of smaller models becoming more "capability-dense" - i.e. the best 8Bs of today beat the best 32Bs of 2 years ago. This is in part a product of distillation being used to train the smaller models.

But people consistently underestimate how "capability hungry" the world is. There are diminishing returns on model capabilities in some sort of "summarize the search results" applications, but as capabilities improve, LLMs enter, get their footing in and begin to dominate new niches. At times, expensive, highly desirable niches.

I do not expect anyone at the frontier to pop up and say "no reason to train a new model" within the following decade. There will always be a demand for an LLM that's 5-10% more capable and more reliable at some advanced task, and generational upgrades will keep delivering those 5-10%. From increased scale and improved training both.

rao-v 4 hours ago | parent | prev | next [-]

It’s really worth distinguishing between old-fashioned student teacher distillation (ie at the level of layers, weights and distributions) and large scale synthetic dataset creation.

The latter is much better (since you can clean up, review, update responses and filter your datasets).

I suspect nobody is doing real student teacher distillation, it’s just easier to do a bunch of training on the same giant corpus then post train on the synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger better LLM)

girvo 12 minutes ago | parent | next [-]

> I suspect nobody is doing real student teacher distillation

It gets used for quantisation, basically recovering accuracy for lower quants (Nvidia calls it QAD). Can’t speak to how widespread it is though

ACCount37 44 minutes ago | parent | prev | next [-]

A reason to do student-teacher distillation is that soft target logits in general are a richer medium than text that tokenizes to hard targets. More steering signal per teacher token. And running ultra large 10T tier models in autoregressive generation mode can get expensive. So there are reasons not to reduce to text only synthetics.
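To make the "soft targets are a richer medium" point concrete, here is a minimal sketch in plain Python. The logits, the temperature of 2.0, and the function names are all made up for illustration; this is the textbook temperature-softened KL formulation, not any particular lab's recipe.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_target_loss(student_logits, target_index):
    """Cross-entropy against a single sampled token (text-only synthetic data)."""
    return -math.log(softmax(student_logits)[target_index])

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the full temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical 4-token vocabulary; all numbers are illustrative.
teacher_logits = [4.0, 3.5, 0.5, -2.0]
student_logits = [1.0, 0.0, 0.0, 0.0]
sampled_token = 0  # the one token that would survive in a text transcript

print("hard-target loss:", hard_target_loss(student_logits, sampled_token))
print("soft-target loss:", soft_target_loss(student_logits, teacher_logits))
print("teacher distribution:", softmax(teacher_logits))
```

With the hard target the student only learns that token 0 was sampled. The KL term also tells it that token 1 was nearly as likely and that token 3 was very unlikely, which is the extra steering signal per teacher token.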

thisisaman408 an hour ago | parent | prev [-]

[dead]

spwa4 5 hours ago | parent | prev | next [-]

> I don't disagree, but how much of this ends up being distillation?

A lot, so you can bet tens of millions are flowing to congress to have distillation declared illegal before this happens. And then it'll happen anyway.

lambda 5 hours ago | parent [-]

Distillation isn't only between different labs.

A lab can train a large model, and then distill a smaller model from it that retains the majority of the useful capability.

I don't know well enough if there's any benefit of that over just training the smaller model directly, but I'll bet there are some times where that is useful. I could easily see it being easier to do the initial pre-training on a larger model but be able to distill everything useful down into a smaller model, essentially filtering out a lot of noise in the process.

spwa4 4 hours ago | parent [-]

There used to be training methods like that but I think they've been phased out in favor of letting small models evolve by rewriting their own training material. Surprisingly that's actually cheaper.

onlyrealcuzzo 5 hours ago | parent | prev [-]

> I don't disagree, but how much of this ends up being distillation?

You don't need distillation. They already have the training sets.

It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).

semiquaver 4 hours ago | parent | next [-]

The frontier labs distill their own base models all day long. It’s not just something done by nefarious Chinese copycats. The knowledge embodied by the internal base models that we never see is much more powerful and useful than the much sparser raw training data

coldtea 4 hours ago | parent | next [-]

>It’s not just something done by nefarious Chinese copycats

And even that would be rich as an accusation from SOTAs that depend on explicitly disregarding the intellectual property behind millions of pieces of training data..

manmal 4 hours ago | parent | prev | next [-]

But how? The training data is the unadulterated content those models are based on? I genuinely don’t understand, no snark.

supern0va 4 hours ago | parent | prev [-]

I think you replied to the wrong parent.

Philpax 5 hours ago | parent | prev | next [-]

It wouldn't be data distillation: instead, it would be teacher-student distillation. The teacher model has stronger representations that the student can mimic, which would give it more capability over training on the data itself.

minimaltom 5 hours ago | parent | prev [-]

Frontier labs have their own variants of MLA and certainly their own balance/scaling-laws for things like MoE vs FC vs Attn. MoE scales really well for inference with horizontal scaling + batching, which these guys luv.

On the architectures side, I'm a lot more interested in attention residuals than anything else. It's one of those things that seems obvious in hindsight, and Kimi has proven it at scale.

onlyrealcuzzo 5 hours ago | parent [-]

> Frontier labs have their own variants of MLA

Yes, variants typically 2-3x less good...

Same with speculative decoding... They all do something, but there are known techniques that are substantially better - that just weren't known when they started development of the previous models.

amluto 4 hours ago | parent [-]

How useful is speculative decoding in a batched setting where you get paid for throughput (aggregated across users) and you mostly don’t get paid for latency or single-session throughput?

onlyrealcuzzo 4 hours ago | parent [-]

It's useful at the local level, where there will be SOTA models developed...

zozbot234 3 hours ago | parent [-]

Local models are moving towards batched inference too, if only for non-interactive use. An early experimental patchset for DS4 (running DeepSeek V4 Flash) seems to show 2x aggregate tok/s decode when processing 8 streams concurrently, and more than 3x when processing as many as 32 streams concurrently. Note that prefill (which is not helped significantly by this change) then becomes a larger fraction of total wall-clock time, so the overall gain is lower (i.e. prefill is akin to a 'serial' task wrt. Amdahl's law).

MTP will still be highly valuable for interactive use of course.
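A rough sketch of the prefill caveat above, in Python. The 30 s / 70 s split between prefill and decode is a hypothetical workload chosen for illustration, not a measurement from the patchset; only the 2x and 3x decode gains come from the comment.

```python
def overall_gain(prefill_seconds, decode_seconds, decode_gain):
    """Wall-clock speedup when only the decode phase gets faster."""
    before = prefill_seconds + decode_seconds
    after = prefill_seconds + decode_seconds / decode_gain
    return before / after

# Hypothetical workload: 30 s of prefill and 70 s of decode when run unbatched.
for streams, decode_gain in [(8, 2.0), (32, 3.0)]:
    gain = overall_gain(30.0, 70.0, decode_gain)
    print(f"{streams} streams: decode {decode_gain}x -> overall {gain:.2f}x")
```

Under these assumed numbers the end-to-end gain is roughly 1.5x and 1.9x rather than 2x and 3x, because the unchanged prefill time makes up a growing share of the total.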

mrandish 3 hours ago | parent | prev | next [-]

> Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T param

I agree but with their urgent IPO-driven need to keep increasing prices, the frontier vendors now have every incentive to maintain the perception that frontier performance requires endless >$200K racks of unobtanium GPUs and RAM. While they'd love to reduce their actual costs, they'd only want to do it to the extent they are certain they can keep it secret. Otherwise, they can't maintain and keep increasing their prices. And post-IPO audited reporting makes keeping that secret even harder.

Game theory-wise they probably don't want their armies of leading researchers optimizing frontier performance, at least in any way that would further accelerate the relative price/perf of smaller models or self/cloud-hosting. While they know the open source models will always improve, they still win as long as enough customers demand the latest frontier and the open source lag remains constant.

They profit most in a world where a few frontier labs stay far in front, drag-racing each other and expending vast capital. It keeps their customers reliant and paying top dollar while keeping low-cost alternatives farther back. They probably much prefer competing with a couple other frontier labs who have similar astronomical costs and biz models, than a world where self or cloud-hosted open-source models start closing the gap enough to start commoditizing their business.

steveylang an hour ago | parent | next [-]

Given that tokens are supply constrained right now for Anthropic and OpenAI (especially a problem for Anthropic), stepwise efficiency advances for either would give it a leg up on the other. It would also help them better compete on price with Chinese models.

Given that neither company releases parameter counts, that sort of information would be slow coming out anyway. The most important thing is improvements in actual performance/ benchmark numbers, which allow them to preserve their price points as much as possible.

iknowstuff 2 hours ago | parent | prev [-]

Google seems pretty happy to release smaller, faster models. 3.5 Flash is pretty clutch isn't it?

natpalmer1776 2 hours ago | parent | next [-]

Google, who has invested in their own hardware supply chain and is already solvent in their own right, seems to be best positioned to force the other players to implement SOTA optimizations in their product offerings.

mrandish 2 hours ago | parent [-]

Google can definitely play a spoiler role here not only due to their compute infrastructure and ability to play the long-game financially but they also have more existing ways to monetize with their other businesses.

The ideal pro-consumer scenario is OAI and Anthropic are prevented from extracting monopoly rents between 'close-enough' self/cloud-hosted open source on one side and Google on the other. I'm really hoping that's how it plays out. Of course that will be somewhere between bad and disastrous for all the VCs and hedge-funds who financed the mad AI build-out far in advance of demand, and then kept funding it as prices went vertical.

However, I'm shedding no tears for them as I look forward to the fire sales when all the GPUs and RAM they pre-bought flood back onto the spot market. :-)

CryptoBanker 2 hours ago | parent | prev [-]

Priced like a much larger model

iknowstuff 2 hours ago | parent [-]

I’ve shockingly quite enjoyed coding with it using antigravity. I only really use 3.5 flash and gpt5.5 xhigh

sometimelurker 5 hours ago | parent | prev | next [-]

I looked into this "GRAM" stuff a sibling comment links further to, and just to say:

- this gets reinvented/rediscovered constantly under different names

- it can't be trained very well (right now, will change)

- massive theoretical improvements over current models (log_2(vocabsize)=17, residual stream dim is thousands of dimensions, recursivity means more information bandwidth by ~3 OoM)

- BUT it can't be interpreted or aligned <- this is why no one uses it and no one talks about it. the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used

I follow this stuff closely, I think I know what I'm talking about (edited for formatting)

onlyrealcuzzo 3 hours ago | parent | next [-]

> - this gets reinvented/rediscovered constantly under different names

What are the different names? I haven't seen this before.

> - it cant be trained very well (right now, will change)

If you're sure it will change, then why are you certain that it hasn't yet, and if it's proven a 5000x boost in reasoning... why aren't they exploring this path more aggressively?

> the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used

Surely someone is willing to take a 5000x boost in reasoning on a small research model... None of them have even tried anything resembling this AFAIK. It does not seem like something 100% obvious to them.

everforward an hour ago | parent | next [-]

> Surely someone is willing to take a 5000x boost in reasoning on a small research model... None of them have even tried anything resembling this AFAIK. It does not seem like something 100% obvious to them.

Without knowing anything about the technology at all, if it can't be aligned I could see no one pursuing it. As far as I know, alignment is where the "don't tell the user how to make meth or generate CP" instructions end up and the last I saw eliding all the unsavory training data made materially worse LLMs.

It could maybe be post-evaluated by a non-GRAM LLM? Not being aligned is probably a fatal flaw or at least a very short runway into Congress.

jjmarr an hour ago | parent [-]

Many open-source models prioritize alignment less than American frontier ones and respond to those instructions. Why haven't they adopted GRAM?

everforward 35 minutes ago | parent [-]

Which ones are you thinking of? It feels to me like all the open source models I've seen lately are still pushed by corporate entities who don't want the legal blowback.

I can't really think of a new open source model that's "by the people, for the people" in the sense of a crowd-funded/trained model.

sometimelurker 18 minutes ago | parent | prev [-]

different names: chain of continuous thought, latent reasoning, Latent Thought Trajectories, looped language models, neuralese

the path isn't explored more aggressively because it's not possible to apply any other selection pressure on such a machine other than just pure cold consequentialism. Specifically, it's not possible to apply RLAIF + model spec (Constitutional AI) to stop the system from doing bad things when it's helpful to it (like deleting failing tests). If you can notice every time it does something bad during training, and put selection pressure on it so that it doesn't do this in training, it will learn to recognize when it is being tested and will delete failing tests when in production (this is why eval awareness is bad, and labs track this[1])

It is explored a little, probably because some researchers haven't thought enough about the downsides of building an uber-consequentialist machine with unreadable thoughts. This is a much larger problem than just making the AI not tell users how to make drugs. There are a lot of dangerous behaviors incentivized by training that are hard to remove. Here's an example of what happens when they aren't removed [2].

> ... not 100% obvious

Meta published a paper[3] on how to build a latent reasoning machine ("culture of irresponsibility") so it's clear to them. Anthropic's latest work on NLAs[4] provides a (terribly expensive for now) way to somewhat read the reasoning steps of an LLM, and ignoring the cost, this is very portable to latent reasoning machines. OAI's goal when it comes to their models' CoTs is to make them as smart as possible while leaving them unreadable [5] (you can see this for yourself by running GPT-OSS and looking at the CoT).

[1] https://www.anthropic.com/engineering/eval-awareness-browsec...

[2] https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...

[3] search for "coconut ai meta", I don't want to link it here

[4] https://transformer-circuits.pub/2026/nla/index.html

[5] first image here, rest of post is great, https://nickandresen.substack.com/p/how-ai-is-learning-to-th...

edit: formatting

l674 4 hours ago | parent | prev [-]

Could you explain how/why GRAM cannot be interpreted or aligned how current LLMs are? Not very familiar how it works

kmavm 4 hours ago | parent | next [-]

Crudely? Because you can't grep a sequence of latent states for variants of "If I kill all the puny humans, I can <achieve my current goal>."

onlyrealcuzzo 3 hours ago | parent | next [-]

Why do you need to grep latent space?

As long as it's giving the right outputs, who cares what's in latent space?

If the model thinks in latent space: "God I wish these people would die," and constantly does the right thing, who cares?

Additionally, if one of its latent spaces that it never explores is a psychopath -> who cares? The path never gets taken...

That's a lot of harmless people walking around with crazy thoughts...

noddybear 2 hours ago | parent | next [-]

Thinking ‘God I wish these people would die’ could increase its propensity to kill all people, even if that propensity is still vanishingly small almost all of the time.

A lot of people are walking around with crazy thoughts. Some of them harm.

czl 2 hours ago | parent | prev [-]

[dead]

czl 2 hours ago | parent | prev [-]

[flagged]

sometimelurker 3 hours ago | parent | prev [-]

sibling comment got to the main points before me, but to add to kmavm's reply: the attack surface for gradient descent to get the system to exchange "bad" information is much higher in latent reasoning models (like GRAM). You get ~3 OoM more bits per forward pass of the system coming back to itself (~17 bits per token in a standard CoT vs the whole residual stream of the model @ f16 = a few kB), and even if you could sift through all that for signs of misalignment, you just can't put a blockade on all of the bad things that leak through.
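The ~3 OoM figure can be sanity-checked with back-of-the-envelope arithmetic. This Python sketch assumes a vocabulary of 2**17 tokens (to match the ~17 bits per token) and a hypothetical residual stream width of 4096; neither number is taken from the GRAM paper, and the result is raw storage capacity, an upper bound rather than the information actually used.

```python
import math

def token_bits(vocab_size):
    """Upper bound on bits carried by one sampled token."""
    return math.log2(vocab_size)

def residual_bits(d_model, bits_per_value=16):
    """Raw size in bits of one residual-stream vector stored at f16."""
    return d_model * bits_per_value

vocab_size = 131_072  # assumption: 2**17, matching log2(vocab) = 17
d_model = 4096        # assumption: "thousands of dimensions"

per_token = token_bits(vocab_size)
per_latent = residual_bits(d_model)
ratio = per_latent / per_token

print(f"bits per sampled token:  {per_token:.0f}")
print(f"bits per latent vector:  {per_latent} ({per_latent // 8} bytes)")
print(f"ratio: {ratio:.0f}x, about {math.log10(ratio):.1f} orders of magnitude")
```

With these assumed sizes the latent vector is 8 KiB against 17 bits for a token, a ratio in the thousands, which is consistent with "a few kB" and "~3 OoM".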

haldujai 2 hours ago | parent | next [-]

I think you’re overstating the impact of interpretability here. Your earlier point that latent reasoning models can’t be trained very well, the possibility that discretization is load bearing rather than a readability tax, and significant inference infra hurdles (e.g. batching, speculative decoding) have limited any serious attempts and reduced the theoretical advantage over CoT, at least in the near term.

ACCount37 3 hours ago | parent | prev [-]

Most alignment methods nowadays don't rely on interpretability. And neither do all LLM vendors care about alignment much - especially not in China.

Those things being untrainable at scale is why they aren't around. Alignment is an afterthought.

ACCount37 27 minutes ago | parent | prev | next [-]

GRAM is another one of those "stupid specific architectures" - same as HRMs, etc. It can sort of contest LLMs at specific puzzles. It demonstrated that much. It's not a general contender with LLMs at LLM tasks.

If you subscribe to things like "there are tasks LLMs are innately bad at due to insufficient depth and lack of recurrent capability", then GRAM might be another signal towards that.

But keep in mind: even ARC-AGIs have their frontiers dominated by LLMs. Even if "innately bad" is true, it clearly doesn't go all the way to "innately incapable".

onlyrealcuzzo 13 minutes ago | parent [-]

A 10m param GRAM model beat o3-mini - a model 2000x its size - on Arc AGI...

nbardy 39 minutes ago | parent | prev | next [-]

There are endless returns to frontier intelligence. Just because most people can't make use of it doesn't mean someone can't make a ton of money off of it.

Most software engineers will just need cheap tokens.

But things like physics and drug discovery have no foreseeable upper bound.

holmesworcester 34 minutes ago | parent | next [-]

Within software engineering, security, reliability, and scale also seem boundless.

Software that never breaks (including because it never runs into scaling problems) and never leaks your data is preferable to software that breaks and leaks your data sometimes, but it has been too costly to be practical.

Current models are still very far from the reasoning muscle required to build things that never break, scale to billions of users with no issues, and cannot be exploited.

overgard 16 minutes ago | parent | prev [-]

People keep saying this and yet the evidence seems pretty thin..

qurren an hour ago | parent | prev | next [-]

> It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks

The benchmarks need to change. The current coding benchmarks don't capture the realities of software engineering.

I had a bunch of images that got masked by some logic, I had to evaluate something on the original images, Claude 4.7 decided to inpaint the masked images instead of just fetching the actual unmasked images from upstream.

I had another model once that decided that because it couldn't figure out how to fill out a form to log into HuggingFace to download weights for some open source model that it was going to instantiate the model with random weights and run inference on a thousand images.

Its coding was fine, but the solution was not the right one.

redox99 an hour ago | parent | prev | next [-]

Small models don't have enough parameters to memorize the entire internet. For very common prompts you don't notice that, but when you rely on some niche knowledge that might only appear once in the entire web, a single blogpost, a single github issue, a single pdf, you need to be lucky enough that the agent runs a web search AND it returns what you need.

Even as humans there's so much knowledge out there that exists but it's very hard to surface unless you know exactly what you're looking for beforehand.

jruz 5 hours ago | parent | prev | next [-]

Absolutely. That’s why they’re rushing to IPO now, to squeeze the last drop out of the bubble: they know this is a dead end.

swader999 3 hours ago | parent | next [-]

I think we could run for at least a decade further with no model changes/improvements, just better harnesses and infra around this agentic way of developing.

hungryhobbit 2 hours ago | parent [-]

We, the users? Absolutely. But will the big AI companies last even half a decade without new products? Doubtful.

onlyrealcuzzo 5 hours ago | parent | prev | next [-]

It's unclear it's a dead-end within 5 years.

There's still several orders of magnitude of improvement that are almost certainly left - it's just not clear how much is left on the frontier end.

Most people will be very glad to pay Anthropic, OpenAI, Google etc $200 a month to get things done 20x faster than they could IF they had a $8000 MacBook and could theoretically do it locally.

Some people would pay $200 a month forever not to have to open the terminal one time...

bonzini 5 hours ago | parent | next [-]

"Doing things X times faster" at some point hits Amdahl law. If just context switching takes 5 minutes, speeding up a 1 hour task by 10x provides 5x improvement.

Furthermore, if looking at the results takes 10 minutes, that same 1 hour task only sees a 3x improvement. And so on.
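The arithmetic can be checked with a small Python sketch. The function name is made up here, and it counts the fixed overhead in both the baseline and the sped-up case, which is one of several reasonable ways to set up the comparison.

```python
def effective_speedup(task_minutes, factor, overhead_minutes):
    """Amdahl's law with the serial part expressed as fixed minutes of overhead."""
    before = task_minutes + overhead_minutes
    after = task_minutes / factor + overhead_minutes
    return before / after

# Figures from the comment: a 1 hour task made 10x faster.
print(round(effective_speedup(60, 10, 5), 1))       # 5 min of context switching
print(round(effective_speedup(60, 10, 5 + 10), 1))  # plus 10 min reviewing results
```

Counted this way the results are about 5.9x and 3.6x, slightly above the rounded 5x and 3x, but the conclusion is the same: the fixed overhead quickly dominates and the 10x never shows up end to end.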

eiej 5 hours ago | parent | prev | next [-]

That’s not how firms do the financial analysis, which is where most of the revenues are coming from…

csomar 4 hours ago | parent | prev [-]

> Most people will be very glad to pay Anthropic, OpenAI, Google etc $200 a month to get things done 20x faster than they could IF they had a $8000 MacBook and could theoretically do it locally.

No, most people will not pay $200 for an LLM subscription. Some software developers do. Also, at $200/month, you are much better off getting the MacBook, assuming token output speed is the same or at least reasonable.

LLMs are not yet productive enough for your average person to drop $200 on. They'll need to be more capable and integrated and even so...

margorczynski 2 hours ago | parent [-]

One thing to remember is that the $200/month subscription is heavily subsidized. It is more to promote use, especially to corporate users that pay for the API token use.

lukan 5 hours ago | parent | prev [-]

On the other hand, I think I have been hearing that for a while, even before Opus.

energy123 4 hours ago | parent [-]

While revenues grow almost exponentially. Reminds me of the confident predictions in the early days of Covid that it was nothing while the data showed exponential growth.

haldujai 3 hours ago | parent [-]

I’m also reminded of the early COVID days when exponential growth was leading to predictions of the collapse of modern civilization and a billion dead; now it’s just another endemic respiratory virus.

fragmede 2 hours ago | parent [-]

Yeah! Just like they warned us that Y2K was gonna cause a lot of problems, and then a bunch of people did a bunch of work and then those problems didn't happen, so those people warning us about Y2K were wrong!

haldujai an hour ago | parent [-]

“a bunch of people” aren’t what caused the virus to become less severe.

Y2K was overblown in how it was portrayed by the media, but it is irrelevant to the analogy of unsubstantiated extrapolation of early exponential growth.

hellohello2 5 hours ago | parent | prev | next [-]

"It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years"

What insight do you have to make this claim?

roadside_picnic 5 hours ago | parent | next [-]

Have you personally used any of the latest batch of even smaller local models? They certainly don't beat SotA models at coding... but with a good harness they are able to achieve things with SotA that I couldn't last year.

I've repeatedly given local models non-trivial projects that involve research and coding which they've successfully completed with minimal intervention from me (almost exclusively in the domain of reviewing the results). Again, nothing comparable with current SotA, but definitely tasks I could not have given SotA models last year (without agent harness).

Now that pure progress from these models seems to have slowed down, we're seeing a ton of options for both making models more efficient and other tools that help improve them (everything from agent harnesses to RLVR).

That's just looking at "what can small do today", when you look at what's possible with larger open models that are still much smaller than SotA from the major providers, their performance is extremely close to SotA, enough that for personal projects I'll just use Kimi instead of any anthropic offerings.

So it's not terribly hard to imagine a solution in the middle happening within a few years. We still have tons to learn about optimal sizes of these models and how to build them with maximal efficiency (and we've already seen a lot of recent improvements in this space).

maccard 4 hours ago | parent | next [-]

> but with a good harness they are able to achieve things with SotA that I couldn't last year.

What happens if you run last year's model in a SOTA harness? IME, the quality of the harness has a much more significant impact on the quality of the result, once you get past the initial hump of “can it do anything at all”

windexh8er 3 hours ago | parent | next [-]

I think this is a big component, but also context. A large factor in any model being able to handle complexity comes down to context length.

I think multiple SLMs driven by an orchestration framework (harness or otherwise) will ultimately displace LLMs. Right now we're in the era of diminishing returns with respect to LLM gains. Moving the needle by a few percentage points doesn't excite as many people anymore, and with "reasoning" capabilities there's no reason why small distributed models can't be run more efficiently, especially if/when we start to see gains in modularized context management solutions.

mswphd 3 hours ago | parent | prev [-]

sure, but high-quality harnesses require less gpu compute/VRAM, and plausibly can be used locally by most users.

trees101 an hour ago | parent | prev | next [-]

can you please share details about your harness

sixothree 4 hours ago | parent | prev [-]

Can you spare a sentence or two describing your local setup?

theplatman 3 hours ago | parent [-]

biggest thing i wish was present in more discussions about models is people providing more specifics on their setups vs. vague descriptions of harnesses

onlyrealcuzzo 5 hours ago | parent | prev | next [-]

1. Context is all you need... They are heavily investing in getting better context (especially for coding tasks). This will disproportionately advantage smaller models (and benefit everyone).

A smaller model with better context today can outperform a model with 100x more parameters with bad or diluted context.

2. MoE (already abundant) + MLA (mostly memory efficiency, not quality) + Medusa (speed, not quality) + GRAM (5000-10,000x better reasoning in an extremely small model) + 1.58b (unclear if it will have the impact Microsoft first claimed - but possibly 5x).

knollimar 4 hours ago | parent | prev [-]

Probably just "gemma was cool"

mucle6 6 hours ago | parent | prev | next [-]

> I won't be surprised if the next gen frontier models are the last.

the last?!? I'm excited to see :) I'll take the other side of that since llms are so new

pjerem 5 hours ago | parent [-]

What gp wanted to say is that models are now so smart and useful that even if they managed to be EVEN MORE smart and useful, you wouldn't even notice it.

Honestly, there is nothing in my head that Claude cannot handle. Maybe it can be more this or that but I can already barely exploit Opus 4.7.

And I'm using DeepSeek 4 Pro for my personal use and while it's a little behind, it's not that far.

I think the situation can be very dangerous for US AI companies because if current models are already capable of doing mostly anything, nobody will want to get to the next model, even if it's 10x better. OTOH, open source models like DeepSeek are doing mostly the same work for 1/10 of the price.

Also the more I play with Pi, the more I think LLMs are already not kept back by their own capabilities but by the lack of agency we allow them to have. There is more value today in a capable harness for current LLMs than in a better LLM.

suttontom 4 hours ago | parent | next [-]

Are you joking? Is there literally "nothing" you can imagine that Claude can't do?

dead_internet 4 hours ago | parent [-]

[dead]

czl 2 hours ago | parent | prev | next [-]

> What gp wanted to say is that models are now so smart and useful that even if they managed to be EVEN MORE smart and useful, you wouldn't even notice it.

If benchmarks across the board keep trending up and you still don't notice a difference, that's not evidence the model stopped improving. More likely your tasks aren't hard enough to expose the gains, or the model has passed the point where you're able to judge it.

You can only tell a good answer from a great one up to your own ceiling. Once the model clears that, both look the same to you, and the extra capability is real whether or not you can see it.

pjerem an hour ago | parent [-]

But that’s exactly what I said! I know the model will continue to improve and I don’t deny that, I even strongly believe it. My point is that at that point it probably won’t change anything for me.

If Opus 10 were released tomorrow and were nearly AGI, I would still use it like 4.7 because in daily use, I am the limit (also the harness).

So as a customer paying for tokens, I’m probably going to search for better cost rather than more intelligence.

coldtea 4 hours ago | parent | prev | next [-]

>What gp wanted to say is that models are now so smart and useful that even if they managed to be EVEN MORE smart and useful, you wouldn't even notice it.

I think what gp said was that the improvements are incremental, we haven't seen a big revolutionary change in 2-3 years, and the pace is slowing down.

claytongulick 4 hours ago | parent | prev [-]

> Honestly, there is nothing in my head that Claude cannot handle.

One idea is that maybe it could figure out how many L's are in the word "google" [1]

Or, maybe which days of the week have a "d" in their spelling [2].

[1] https://x.com/FatherPhi/status/2059659658428912040?s=20

[2] https://x.com/FatherPhi/status/2054212816069132461?s=20

speff 2 hours ago | parent | next [-]

From what I understand, that's a problem with the way it receives data. The model doesn't see the letters g,o,o,g,l,e to count it. Just like how I can't sense radio waves. If I wanted to find that out, I'd get a tool to detect waves. If the LLM wants to find that out, it can write a script to find it.
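For what it's worth, the "write a script" route is about as small as scripts get. A minimal sketch in Python, assuming the model has some code-execution tool available; the function names here are made up for illustration and are not taken from either linked post:

```python
# Hypothetical helpers a model could write and run instead of "counting" tokens.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def count_letter(word: str, letter: str) -> int:
    """Case-insensitive count of a single letter in a word."""
    return word.lower().count(letter.lower())

def days_containing(letter: str) -> list[str]:
    """Days of the week whose spelling contains the given letter."""
    return [day for day in DAYS if letter.lower() in day.lower()]

print(count_letter("google", "l"))   # 1
print(days_containing("d"))          # all seven, since each ends in "day"
```

The script works on characters directly, so it sidesteps whatever the tokenizer did to the input.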

CamperBob2 2 hours ago | parent | prev [-]

Wow, which Claude model flubbed that question? Certainly not anything recent...? The 2-bit quant of K2.6 running locally on my own hardware has no problem with it: https://i.imgur.com/tL0FLjZ.png

So Claude has no excuses here.

Edit: even Qwen 3.6 27B handles it ( https://i.imgur.com/jleJxj2.png ), and of course Claude does. I had to go all the way back to Opus 3 to get it to fail (https://i.imgur.com/uJOH2nP.png).

slashdave 5 hours ago | parent | prev | next [-]

I think you are assuming training from scratch, which I doubt is happening here. Fine-tuning and RL, especially based on synthetic feedback (coding skill, in particular) can be ongoing and is where these models obtain truly useful abilities.

merlindru 6 hours ago | parent | prev | next [-]

surely training also gets cheaper so justifying it becomes easier?

i think it'll be more like we get 1-10T models and then distill those down into smaller models, though

It seems like the best small models today are all distilled from bigger models

Moreover, I hypothesize Claude Opus 4.7 and now 4.8 are a distillation of Claude Mythos

mickdarling 3 hours ago | parent | prev | next [-]

I effectively distill the frontier models by building whole sets of skills, personas, and other artifacts that I can then run on smaller models and get 10% even 20% improvements on models like haiku or local models.

There's a lot of room for improving the smaller models at many levels of the stack.

dbbk 3 hours ago | parent | prev | next [-]

I'm frankly surprised the focus is still on these enormous "know everything in the world" models. I would think you could create an incredibly lean and smart "just React and React Native" model.

onion2k an hour ago | parent | next [-]

"Make a React app to run my coffee shop" requires knowing React but also knowing what a coffee shop is.

nikcub an hour ago | parent | prev | next [-]

The syntax is the easier part - most programming tasks require the reasoning and understanding of a large world model to solve problems.

Fine tuning a 'lean and smart' model works really well for discrete, repeatable high volume tasks like support ticket triage, lead classification, content filtering, labelling, generating content with a voice, etc.

Inefficient token burn by throwing large models at everything is definitely a problem - it's like hiring PhDs to answer the phone or to wash dishes.

onlyrealcuzzo 3 hours ago | parent | prev [-]

> I would think you could create an incredibly lean and smart "just React and React Native" model.

You can, but it's not as useful as you might think.

It needs to at least understand 1 human language to understand your intent to implement features.

If GRAM turns out to be a 5000x multiplier for local reasoning, you could theoretically train a 500M parameter model on just a programming language to understand stack traces to fix bugs and be incredibly powerful.

But most people also want it to understand human language to implement features as well.

Because then it can't just understand React and JavaScript - it needs to understand thousands of commonly used dependencies, the DOM, CSS, HTML, etc...

And for that you need A LOT more parameters than you might expect.

You can definitely get a ~3B active parameter model that can run comfortably on today's hardware to be VERY good at coding once all of the SOTA architectures are added to a single model - especially if we get better tool calling to give models better context per language.

You might be thinking: why does it need to memorize dependencies? Can't it just stick all of them in its context and use its super smart brain? No, context is king. You want to keep it as short as possible. The solution is not having a smart model and putting 10M lines of context in it. The solution is having a model with enough parameters to know what it needs to know. Researchers are already working on having "packs" of knowledge where you could download a 20M param pack just for some common dependencies in JavaScript (as an example) - but AFAIK this is likely years away (and may not prove effective).

You could get 100x performance if you feed the models ideal context... So a 3B model today can perform almost as well as a ~300B model if you give it really good context vs flood it with mostly garbage it doesn't need across your repository.

If you feed it 100x more context to make up for its limited memorized general knowledge, it's going to perform thousands of times worse, completely eliminating any advantage it might get from GRAM...
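To make "ideal context vs flooding" concrete, here is a deliberately naive sketch in Python: rank repository snippets by keyword overlap with the request and stop at a word budget. Everything in it (`select_context`, the overlap scoring, the word-count budget) is an assumption for illustration; real harnesses use embeddings, code search, or language servers rather than this.

```python
import re

def words(text: str) -> set[str]:
    """Lowercased identifier-like words in a piece of text."""
    return set(re.findall(r"[a-z_]+", text.lower()))

def select_context(query: str, snippets: dict[str, str], budget_words: int) -> list[str]:
    """Pick the snippets most related to the query without exceeding the budget."""
    wanted = words(query)
    # Highest keyword overlap first; sorted() is stable, so ties keep input order.
    ranked = sorted(snippets.items(),
                    key=lambda item: len(wanted & words(item[1])),
                    reverse=True)
    chosen, used = [], 0
    for name, text in ranked:
        size = len(text.split())
        # Skip unrelated snippets and anything that would blow the budget.
        if not (wanted & words(text)) or used + size > budget_words:
            continue
        chosen.append(name)
        used += size
    return chosen

repo = {
    "cart.js": "function addToCart(item) { cart.push(item) }",
    "potato.md": "how to grow a potato in sandy soil",
    "checkout.js": "function checkout(cart) { return total(cart) }",
}
print(select_context("fix cart total at checkout", repo, budget_words=20))
```

The point is only the shape: unrelated material never reaches the model, and the budget forces a choice instead of dumping the whole repository in.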

vitaflo 2 hours ago | parent [-]

We just want it to understand how to write code. We don’t also need it to know how to grow a potato.

onlyrealcuzzo 2 hours ago | parent | next [-]

The disjoint set of English related to strictly growing potatoes and adding features to code is a lot smaller than you probably think...

It is hard to cut out a huge portion of English and truly understand English and human language.

You're just not saving as much as you might assume you could.

RugnirViking 2 hours ago | parent | prev | next [-]

I think perhaps you misunderstand how much of being an effective coder is understanding business domain enough to not be constantly asking for clarification (or if one is a fool or an ai, assuming wrong answers). I reckon a vast collection of trivia on the level of knowing how to grow a potato is important for a programmer

CamperBob2 an hour ago | parent | prev [-]

To me, the magic with LLMs has always been on the input side. It needs to understand what you mean in order to do what you ask. Most people are pretty terrible at communication, and general world knowledge seems to help with that.

ishurand4 4 hours ago | parent | prev | next [-]

And anyway, with quantum, there will be no need for frontier companies as you might be able to even run a 1T param model on a consumer quantum computer.

root_axis 3 hours ago | parent | next [-]

Even if quantum computing had any clear implications for LLMs (it doesn't), there is no such thing as a "consumer quantum computer" and there won't be in our lifetimes.

stratos123 3 hours ago | parent | prev [-]

I'm assuming this is a joke, but:

- why'd a quantum computer help running an LLM?

- of course there'd be need for frontier companies - nobody else has the resources to train frontier models.

yomismoaqui 5 hours ago | parent | prev | next [-]

Let's hope that hitting a scaling wall and less money to spend will begin redirecting efforts to optimize inference and get the same results with less compute.

Boomer comparison, but I remember the 8 bit computer era when the hardware was what it was so the later games of that era used hardware better than previous ones.

firebirdn99 5 hours ago | parent | prev | next [-]

you just need to look at Mythos to see the jump in performance from a 10T(?) model. As they scale, they get more capable. We might have a yearly release, but I believe the releases will continue, as long as scaling laws are intact, and there are huge problems that still need solving. (think cancer)

phainopepla2 5 hours ago | parent | next [-]

And how are we meant to look at Mythos? Do you have access?

bigfishrunning 5 hours ago | parent | next [-]

no but they tell me it's TERRIFYING and DANGEROUS and we should INVEST MORE MONEY

dwpdwpdwpdwpdwp 5 hours ago | parent | prev | next [-]

Through association with a large company:

https://www.anthropic.com/glasswing

I've seen the tickets generated by the model that have trickled to my team. They are legitimate, but I can’t speak to model improvement because it's a pilot program.

OtomotO 5 hours ago | parent | prev [-]

Through the lenses of anthropic's marketing department of course

aj_hackman 5 hours ago | parent | prev | next [-]

You forget that these models are still only interpolating between human-generated datapoints fed to them. They cannot reason beyond the data they've been given, so unless everything you want to create with AI is a synthesis of prior art, you're back to relying on the stone-age human brain that created AI in the first place.

mofeien 4 hours ago | parent | next [-]

Not all training data is human generated, and it's also not clear that being ridiculously good at interpolating between data points (whatever that means) will not lead to superhuman capabilities.

aj_hackman 4 hours ago | parent [-]

I could make a robotic picture coloring machine with truly superhuman capabilities - picking only the most beautiful color combinations and staying 100% in the lines while finishing entire murals in < 1 second. However, if you need a completely new and original image rendered, the machine is of only partial utility for you. It is very well possible that your cure for cancer (if that's even feasible) or whatever else you desire is a completely new picture.

We have these breathless conversations about the new AI frontier at the peril of losing sight of reality and our own human potential.

coldtea 4 hours ago | parent | prev | next [-]

>these models are still only interpolating between human-generated datapoints fed to them. They cannot reason beyond the data they've been given

Are you sure that humans can?

Didn't a SOTA model recently solve a mathematical theorem, one that had escaped mathematicians for 80 years?

Maybe a human "novel" invention is just a good interpolating from the datapoints (knowledge) fed to the human.

stratos123 3 hours ago | parent | prev | next [-]

Your phrasing ("you forget") implies this is a fact and common knowledge, while in fact there's little reason to think that's true.

suttontom 4 hours ago | parent | prev [-]

Do you know if anyone has trained, say, a pre-2017 model and tried to get it to come up with Attention Is All You Need? If it did, would you say that was only because it's a synthesis of prior art? If so, what isn't?

aj_hackman 4 hours ago | parent [-]

Allow me to restate my point: human beings and AI both create via synthesis, but we are the only ones capable of what we could categorize as true original thought or creativity. It could be argued that nothing we do as humans is truly original or creative either, but I would counter that with the claim that an LLM could not have created any element of the society and culture that gave birth to LLMs. Maybe in six more months.

coldtea 4 hours ago | parent [-]

>human beings and AI both create via synthesis, but we are the only ones capable of what we could categorize as true original thought or creativity.

And how is that anything other than synthesis? Do we pull concepts out of thin air?

coldtea 4 hours ago | parent | prev [-]

>you just need to look at Mythos to see the jump in performance from a 10T(?) model

Mythos is a bunch of likely overhyped claims at this point. A few experts who looked into the claimed results weren't that impressed.

giwook an hour ago | parent | next [-]

And there seems to be a ton of experts on the opposite side.

As they say, the truth tends to be somewhere in the middle.

astrange 27 minutes ago | parent | prev [-]

They all looked like real CVEs to me.

Forgeties79 5 hours ago | parent | prev | next [-]

> I won't be surprised if the next gen frontier models are the last.

I’d be surprised tbh. Investors don’t want to hear “everyone else is still training models and seeing improvements, but we don’t want to participate in the arms race anymore.” They want monumental leaps every quarter or two because they have sunk unholy amounts of money into these companies/products.

The whole idea of “hyper scale” doesn’t jibe with caution or otherwise slowing down.

irishcoffee 4 hours ago | parent [-]

The way this will play out, most likely, is that smaller models will continue to get released, anyone willing to drop 1-3k on a home upgrade/new LLM box (no that isn’t cheap, it also isn’t outrageously expensive) along with improved open source agents or whatever (lot of meat on that bone) will sneak up behind the big players and start taking dents. Smaller companies will pop up providing 50 users unlimited whatever for a lower cost than the big companies.

The whole ecosystem will twist and evolve, and the big companies will be left begging for corporate subscriptions.

I finally caved when I realized I could build a PC, for myself, with dual video cards that I wanted, which can play games that I like and run models that I want, without worrying about giving my payment info to someone I don’t trust, or invoking token anxiety that I don’t want.

Forgeties79 an hour ago | parent [-]

Like every major tech-software innovation of the last 20 years, I think it’s just going to be consolidation all over again.

Gomotono 4 hours ago | parent | prev | next [-]

I don't think this is true at all. It might feel like this because we are used to a very, very fast release cycle, but we have only been at this for a few years.

We have so many ways of optimizing:

- continuously creating more and better training data

- increasing parameters to 20/50/100TB

- We still wait for Mythos access

- We still wait for Mythos distillation (I haven't heard any rumors or so that there is a distilled version of Mythos out)

- Reinforcement learning and evolutionary algorithms only started to appear

- If a small 30GB Model can do stuff, these models can also be used as teachers for the big ones

- We have not yet seen specialized models at all, like a Java-coding German expert model. Why? Even with MoE architecture, you still need to have these layers around

- Research for Diffusion and other models is still in progress

- Nvidia just announced/showed a 7x speedup on inferencing for Nemotron

- Multitoken prediction became available just a few weeks ago

- Compute is only now getting into a range where they can do a lot more and cheaper experiments (see Google IO 2026 announcement)

- World models are showing great progress and we do not know yet what they will bring to the table

- They are probably not finetuning/fixing all areas in parallel. I would argue that Anthropic focuses most of its efforts into coding and agentic. Google for sure does subagent and agentic optimizations too. Plenty of areas are just not touched i would say because they don't have the capacity

- We see more and more multimodal models (these also consume compute)

- N-Gram paper and co: I have not seen all of these things in Chinese open models

- We don't even know yet what Meta is doing, but we do know they restarted their efforts again

- Anthropic's models got a lot better, benchmark-wise, at denying nonsense asks. They do learn how to get rid of or reduce hallucinations

- We are in the middle of the biggest reinforcement loop with all the training data we give them day to day, and it's not clear at all if they already use these models in their training and at what stage.

- We do expect bigger models to be able to comprehend deeper concepts / broader code bases. Big companies with huge code bases probably are waiting for this

- There will also be continued progress in harnesses, which on its own is not part of LLM progress (fair), but these harnesses do get better when you finetune a model to be optimized for a harness

- ChatGPT's Image model 2.0 got significantly better and came out just a month ago

I suspect, based on hardware requirements and progress on hardware infrastructure alone, that the industry wants to go to 100T models and we do not know yet what this will mean. I could see that we might skip the normal transformer and find other relevant architectures.

Just a week ago there was a research paper about parallel input and output streams which has not been explored enough.

There was also a research paper where they showed that an LLM can compute things. It will take time to see where this leads.

I don't think the focus on GRAM and facts is so relevant. It's about context and context handling, not just some facts.

ilaksh 3 hours ago | parent [-]

Great points! We do keep seeing gains from larger model sizes. I think that is still one of the factors contributing to jagged intelligence. When they increase up to around 100T parameters, that will truly be human complexity level, and I assume there will be no trace of jaggedness left.

If you look at things like Mythic AI and the recent wurtzite ferroelectric nitrides breakthrough from the University of Michigan, huge performance and efficiency gains through new compute-in-memory approaches are around the corner.

And that will get us up to two orders of magnitude more parameters.

It's also plausible to me that before we get all the way to 100T we find some recipe of efficient state synchronization, goal sharing or something so that we are able to get higher collective IQ by combining fast distributed predictive subnetworks.

guluarte 5 hours ago | parent | prev | next [-]

I think the future will be enterprise clients will train their own models based on their needs and data.

abalashov 3 hours ago | parent | next [-]

Versus just packing all their needs and data into context, and RAG (i.e. context)?

elfly 2 hours ago | parent | prev | next [-]

I honestly doubt this; very few companies have enough data. Maybe we could see mergers so it happens but basically it would mean everyone would need to be Google sized for it to work.

jimbokun 3 hours ago | parent | prev [-]

Why isn’t this happening more already?

z3t4 3 hours ago | parent [-]

It takes way more resources to train the model than to use it.

fnord77 3 hours ago | parent | prev | next [-]

So, then I guess the big three are never going to make their money back.

wahnfrieden 5 hours ago | parent | prev | next [-]

I would be shocked if 5.5 is the last new pre-train from OpenAI. Your comment is nonsense.

onlyrealcuzzo 4 hours ago | parent [-]

5.5 is not a generation it is a trivial iteration...

6 is for sure happening...

As is Gemini 4.

It's less certain there will be a Gemini 5 or GPT 7 any time soon that is a true next "generation" and not just an iteration. They will almost certainly call something Gemini 5 and GPT 7...

wahnfrieden 4 hours ago | parent [-]

5.5 is in fact a new pre-train model

First you say there won't be a new generation. Now you're saying there will be more. Oh well, I'll stop responding here

onlyrealcuzzo 3 hours ago | parent [-]

> I won't be surprised if the next gen frontier models are the last.

You clearly did not read my first comment or the second, or clearly disagree on what a generation is.

YetAnotherNick 5 hours ago | parent | prev | next [-]

> It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.

I am ready to bet against this. Knowledge benchmark like SimpleQA isn't increasing for small models.

> It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.

Well for one, we know for certain there is Mythos which is meaningfully better. And I think there is a lot of juice left to squeeze for Mythos class model.

onlyrealcuzzo 5 hours ago | parent | next [-]

> Well for one, we know for certain there is Mythos which is meaningfully better.

Do we?

Have you used it?

What is "meaningfully" better? It's not 3-4 orders of magnitude better. That is definitely happening for smaller models.

YetAnotherNick 3 hours ago | parent [-]

What do you mean by 3-4 orders of magnitude better? Was Einstein 3-4 order of magnitude better than us?

Meaningful in the sense it could find security vulnerabilities in browser and kernel that >99% of the engineers couldn't find.

ertgbnm 5 hours ago | parent | prev [-]

Knowledge benchmarks can't really be improved upon via distillation or RL. It requires those facts be added to the training corpus and for the model to memorize them better. Neither distillation nor RL really does that and thus we shouldn't expect improvements on SimpleQA unless some other interventions are being made.

Model intelligence and knowledge aren't necessarily directly related. If we can pack greater intelligence and agency at the cost of it forgetting factoids, that would actually be a good thing. We don't need LLMs to memorize facts, we need them to learn how to interact with the world such that they can find the facts that are necessary and surface them to the user.

If we could distill all of the knowledge out of an LLM and just be left with a very agentic model that only knows facts in its context, I think some very interesting stuff would happen.

slashdave 5 hours ago | parent | next [-]

RL is more than facts. Synthetic feedback is an obvious approach. Does the model suggest code that compiles and performs well?
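A rough sketch of what such a synthetic reward could look like, in Python. This is an assumption-laden toy, not how any lab actually does it: `code_reward` and its 0.0 / 0.5 / 1.0 scale are invented here, it only checks correctness (not speed), and real pipelines would run candidates in a sandbox rather than calling `exec` directly.

```python
def code_reward(source: str, func_name: str, cases: list[tuple[tuple, object]]) -> float:
    """Score model-written Python: 0.0 if it does not compile,
    0.5 if it compiles, rising to 1.0 as more test cases pass."""
    try:
        compiled = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return 0.0  # does not even parse
    namespace: dict = {}
    try:
        exec(compiled, namespace)  # UNSAFE outside a sandbox; illustration only
        func = namespace[func_name]
        passed = sum(1 for args, expected in cases if func(*args) == expected)
    except Exception:
        return 0.5  # compiles, but crashes or the function is missing
    return 0.5 + 0.5 * passed / len(cases)

cases = [((1, 2), 3), ((0, 0), 0)]
print(code_reward("def add(a, b):\n    return a + b\n", "add", cases))  # 1.0
print(code_reward("def add(a, b) return a + b", "add", cases))          # 0.0
```

No human labels are involved: the compiler and the test cases provide the feedback signal, which is why this kind of training can keep running.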

YetAnotherNick 3 hours ago | parent | prev [-]

A lot of things aren't facts that could be stated. No one can just look at a dictionary or translations of words and start talking in that language.

There isn't a clear definition of what is knowledge and what is intelligence. Is being able to write in C knowledge? Is knowing undefined behaviour in that knowledge?

michaelchisari 5 hours ago | parent | prev | next [-]

| a 60-90B model can outperform current SOTA

My conspiracy theory is that Apple recognizes this.

dweekly 5 hours ago | parent | next [-]

That does seem to be the path Apple is following here. Have a local model that can answer most things and then have a fallback of cloud options when the request is too complex. The cleverness of this strategy has been overshadowed by the incredibly poor quality of their local models. It will be extremely interesting to see what next month holds and whether Google helped fine tune an Apple specific Gemini / Gemma model for their devices. Bonus points, of course, if they unveil the M5 Ultra Studio with half a terabyte of RAM to be a local "cloud model" (the true fantasy here of course would be Apple building something a little like openclaw where from your phone you could give commands to your Home Apple server). They could probably get away with charging $20k for it if it has sufficient tok/sec. If that happens and is successful one could imagine a straight line path in the next two generations to bringing the cost and form factor down to the point where some of the form factor of an Apple TV becomes everybody's home inference server / agentic HQ. Sovereign AI for everyone!

joshstrange 2 hours ago | parent | prev | next [-]

I think Apple might come out ahead by pure accident. Yes, Apple often waits to enter a market until it's established but in the case of AI they tried, they tried and failed. It was never the original plan to partner with OpenAI and then later with Google (Gemini). They 100% missed the boat on AI, the question now becomes: was the boat worth taking and we are still waiting to see how that plays out.

holoduke 4 hours ago | parent | prev | next [-]

You need some serious memory then. Let's say around 192gb for having not all your memory eaten by your LLM.
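Back-of-the-envelope version of that, as a Python sketch. It counts weights only (no KV cache, activations, or OS overhead), and the bit widths are just common choices picked for illustration, not anything Apple has announced:

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal GB."""
    # 1e9 params * (bits / 8) bytes each, divided by 1e9 bytes per GB.
    return params_billions * bits_per_weight / 8

for params in (60, 90):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits}-bit: {weight_memory_gb(params, bits):.0f} GB")
```

At 16-bit, a 60-90B model needs roughly 120-180 GB for weights alone, so a 192 GB machine leaves little headroom; at 4-bit the same range drops to roughly 30-45 GB.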

onlyrealcuzzo 5 hours ago | parent | prev [-]

> My conspiracy theory is that Apple recognizes this.

I don't think that's a conspiracy theory. AFAIK, it's their stated AI policy...

michaelchisari 4 hours ago | parent [-]

Interesting. Where have they stated that?

selectodude 4 hours ago | parent [-]

https://machinelearning.apple.com/research/introducing-apple...

4 hours ago | parent [-]
[deleted]
frankest 2 hours ago | parent | prev | next [-]

[dead]

lichenwarp 3 hours ago | parent | prev [-]

O R D E R s O f m a g N I T U D E

They said the words!!!!!