wg0 14 hours ago

After TurboQuant and Gemma 4, I came across the following video[0] of Gemma running on a local machine at 50 tokens/second.

That already looks like Sonnet 3.x and 4 level capability to me: the model in question (Gemma 4) sets up a whole Python project with a UI, installs Python libraries using uv, etc.

Add this Simple Self Distillation to the picture, and by 2028 I see coding-model providers becoming cheaper, with much more generous usage limits; power users would mostly be running their own models anyway.

Anyone using these models as "non-deterministic transpilers" from natural language to code (i.e. experienced engineers who can write the code themselves) would probably not be paying any AI provider.

[0] https://www.youtube.com/watch?v=-_hC-C_Drcw

spiderfarmer 14 hours ago | parent | next [-]

I always wonder how much smaller and faster models could be if they were only trained on the latest versions of the languages I use, so for me that is PHP, SQL, HTML, JS, CSS, Dutch, English, plus tool use for my OS of choice (macOS).

Right now it feels like hammering a house onto a nail instead of the other way around.

ACCount37 12 hours ago | parent | next [-]

Not very. LLMs derive a lot of their capability profile from sheer scale.

LLMs have something that's not entirely unlike the "g factor" in humans - a broad "capability base" that spans domains. The best of the best "coding LLMs" need both good "in-domain training" for coding specifically and a high "capability base". And a lot of where that "base" comes from is: model size and the scale of data and compute used in pre-training.

Reducing the model scale and pruning the training data would result in a model with a lower "base". It would also hurt in-domain performance - because capabilities generalize and transfer, and pruning C code from the training data would "unteach" the model things that also apply to code in PHP.

Thus, the pursuit of "narrow specialist LLMs" is misguided, as a rule.

Unless you have a well-defined bar that, once cleared, makes the task solved, with no risk of scope adjustment, no benefit from any future capability improvements above that bar, and enough load to justify the engineering cost of training a purpose-specific model, a "strong generalist" LLM is typically a better bet than a "narrow specialist".

In practice, this is an incredibly rare set of conditions to be met.

weitendorf 9 hours ago | parent [-]

It's more complicated than that. Small specialized LLMs are, IMO, better framed as "talking tools" than as generalized intelligence. With that in mind, it's clear why something that can, e.g., look at an image and describe things about it, or accurately predict weather and then converse about it, is valuable.

There are hardware-based limits on the size of LLM you can feasibly train and serve, which cap both the amount of information you can pack into a single model's weights and the amount of compute per second you can get out of that model at inference time.

My company has been working on this specifically, because even now most researchers don't seem to really understand that this is just as much an economics and knowledge problem (cf. Hayek) as it is an "intelligence" problem.

It is much more efficient to strategically delegate specialized tasks, or ones that require a lot of tokens but not a lot of intelligence, to models that can be served more cheaply. This is one of the things Claude Code does very well. It's also the basis for MoE and some similar architectures, where a smarter router model serves as a common base between the experts.
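The routing idea behind MoE-style delegation can be sketched in a few lines. This is a toy, pure-Python illustration with made-up names and sizes, not any production architecture: a small router scores each expert for the input, only the top-k experts actually run, and their outputs are gated together by softmax weights.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(token_features, router_weights, experts, k=2):
    # Router: one linear score per expert (dot product with the token).
    scores = [sum(w * f for w, f in zip(row, token_features))
              for row in router_weights]
    # Keep only the k best-scoring experts for this token.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    gates = softmax([scores[i] for i in top])
    # Weighted combination of the selected experts' outputs;
    # the unselected experts never run.
    out = sum(g * experts[i](token_features) for g, i in zip(gates, top))
    return out, top

# Three trivial stand-in "experts", each just a fixed function.
experts = [
    lambda f: sum(f),   # expert 0
    lambda f: max(f),   # expert 1
    lambda f: min(f),   # expert 2
]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

out, chosen = route([2.0, 1.0], router_weights, experts, k=2)
```

In a real MoE layer the experts are feed-forward subnetworks inside the transformer and routing happens per token; the economics are the same, though: compute only goes to the few experts the router selects.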

BarryMilo 13 hours ago | parent | prev | next [-]

I seem to remember that's one of the first things they tried, but the general models tended to win out. Turns out there's more to learn from all code/discussions than from just JS.

justinlivi 5 hours ago | parent [-]

From my own empirical research, generalized models acting as specialists outperform both tiny models acting as specialists and generalist models acting as generalists. It seems that if peak performance is what you're after, having a broad model act as several specialized models is the most effective approach.

Someone1234 13 hours ago | parent | prev | next [-]

Wouldn't that mean they're bad at migration tasks? I feel like for most languages, going from [old] to [current] is a fairly common, sometimes very common, usage scenario.

nareyko 13 hours ago | parent | prev [-]

[dead]

red75prime 12 hours ago | parent | prev [-]

> power users would be mostly running their own models

...with a fair amount of supervision, while frontier models would be running circles around them using project-specific memory and on-demand training (or whatever we would have by then).

3abiton 12 hours ago | parent | next [-]

Honestly, right now there's mainly stagnation in frontier-model capabilities. Most of the recent advancements are in generation speed, compression, and tool usage. The quality of the models is not improving at the same rate as before. I doubt this big gap will continue, given that open-source and especially Chinese labs keep publishing well-documented frontier papers.

darkerside 12 hours ago | parent | prev | next [-]

Those will be great for projects that look just like everybody else's. That's not a knock. We'll see plenty of new systems built by anyone who needs one.

If you're building something groundbreaking and new, the advantage will be slim to none.

littlestymaar 9 hours ago | parent | prev [-]

If what you refer to by "on-demand training" is fine-tuning, it's going to be much more efficient on a small model than on a big one.

red75prime 8 hours ago | parent [-]

LoRA can work with big models, but I meant sample-efficient RL.
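For context on why LoRA stays cheap even on big models: the update it trains is low-rank, so its parameter count grows linearly rather than quadratically with layer width. A minimal pure-Python sketch (hypothetical sizes, not any real library's API):

```python
# Toy illustration of the LoRA idea: instead of updating a full
# d_out x d_in weight matrix, train two low-rank factors
# B (d_out x r) and A (r x d_in), and add their product on top of
# the frozen weights at inference time (scaling factor omitted).

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Tiny 2x2 example: W stays frozen, only B and A would be trained.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]            # d_out x r, with r = 1
A = [[3.0, 4.0]]              # r x d_in
W_eff = add(W, matmul(B, A))  # effective weight: W + B @ A

# Parameter count at a more realistic layer size: the low-rank
# update is ~64x smaller than a full-weight update here.
d_out, d_in, r = 1024, 1024, 8
full_update_params = d_out * d_in    # 1,048,576
lora_params = d_out * r + r * d_in   # 16,384
```

That size gap is why LoRA adapters for large models can be trained and stored cheaply; it doesn't change the cost of running the base model itself.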