zurfer 4 days ago

It makes me wonder if we'll see an explosion of purpose-trained LLMs because we've hit diminishing returns on investment in pre-training, or if it just takes a couple of months to fold these advantages back into the frontier models.

Given the size of frontier models, I would assume that they can incorporate many specializations, and that the most lasting thing here is the training environment.

But there is probably already some tradeoff: GPT-3.5 was awesome at chess, and current models don't seem to be trained extensively on chess anymore.

criemen 4 days ago | parent | next [-]

> or if it takes a couple of months to fold these advantages back into the frontier models.

Right now, I believe we're seeing that the big general-purpose models outperform approximately everything else. Special-purpose models (essentially: fine-tunes) of smaller models make sense when you want to solve a specific task at lower cost/lower latency, transferring some or most of the abilities in that domain from a bigger model to a smaller one. Usually, people don't do that, because it's quite a costly process, and the frontier models develop so rapidly that you're perpetually behind them (so, in fact, you're not providing the best possible abilities).
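That big-to-small ability transfer is usually done via distillation: the student is trained to match the teacher's softened output distribution rather than just hard labels. A minimal pure-Python sketch of the soft-label loss (illustrative only, not any lab's actual training code):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Minimizing this pushes the smaller student model toward the
    teacher's full output distribution, not just its top answer.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A student that matches the teacher incurs (near) zero loss;
# a mismatched one is penalized.
teacher = [4.0, 1.0, 0.5]
aligned = distillation_loss(teacher, [4.0, 1.0, 0.5])   # ~0.0
shifted = distillation_loss(teacher, [0.5, 1.0, 4.0])   # > 0
```

The temperature value and loss direction here are assumptions for the sketch; real pipelines typically mix this term with a standard cross-entropy loss.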

If/when frontier model development speed slows down, training smaller models will make more sense.

nextos 4 days ago | parent | next [-]

The advantage of small purpose-specific models is that they might be much more robust, i.e., unlikely to generate wrong sequences for your particular domain. That is at least my experience working on this topic during 2025. And, obviously, smaller models mean you can deploy them on cheaper hardware, latency is reduced, energy consumption is lower, etc. In some domains, like robotics, these two advantages might be very compelling, but it's obviously too early to draw any long-term conclusions.

larodi 4 days ago | parent [-]

I second this. Smaller models may indeed be much better positioned for fine-tuning, for the very reason you point out - less noise to begin with.

barrell 4 days ago | parent | prev | next [-]

> If/when frontier model development speed slows down

You do not believe that this has already started? It seems to me that we’re well into a massive slowdown

criemen 2 days ago | parent | next [-]

It's hard for me to say. I don't think you know you're on the S-curve until after the fact.

On the one hand, most models are "good enough" for ChatGPT-like usage, and there it's hard to see/feel generation-to-generation improvements. On the other hand, if you look at instruction following, dealing with long context windows, or staying on track across >200 tool-call interactions, there's still plenty of improvement to be had. So, hard to say where we are.

enraged_camel 3 days ago | parent | prev [-]

Not the OP but I use AI all day every day and have noticed substantial improvements in the models over the past ~6 months. GPT-5 was a huge leap (contrary to reporting) and so was Sonnet 4.5.

barrell 3 days ago | parent | next [-]

GPT-5 was by no means a huge leap. I'd be willing to believe that you prefer it, or that you found it an improvement, despite both of those being wildly contrary to my experience (and most of the rhetoric online). But objectively speaking it was a small improvement, even going by OpenAI's marketing claims.

In practice, I upgraded everything to GPT-5 and the performance was so terrible I had to rollback the update.

embedding-shape 3 days ago | parent | prev [-]

> GPT-5 was a huge leap (contrary to reporting) and

Depends on what you compare it to. For us who were using o3/o1 Pro Mode before GPT-5, the new model isn't that huge of a leap, compared to whatever was before Pro Mode existed.

fragmede 4 days ago | parent | prev [-]

Right, the Costco problem. A small boutique, e.g. a wine store, might do better at picking a very specific wine for a specific occasion, but Costco is just so much bigger that they can make it up in volume, buying cases and cases of everything at a lower markup. So it ends up being cheaper to shop at Costco, no matter how much you want to support the local wine boutique.

semi-extrinsic 3 days ago | parent [-]

In Norway there is a state-owned monopoly on selling wine and liquor (anything above 4.75% ABV). They have 350+ physical shops, a large online shop and around $2bn annual revenue. This makes them one of the largest purchasers of wine and spirits in Europe, and they can get some very good deals.

So even though you have high taxes and a restrictive alcohol policy, the end result is shops that have high customer satisfaction because they have very competent staff, excellent selection and a surprisingly good price for quality products.

The downsides are the limited opening hours and the absence of cheap low-quality wine - the tax disproportionately impacts the low-quality stuff: almost nobody will buy shitty wine at $7 per bottle when the decent stuff costs $10, so the shitty wine just doesn't get imported. But for most of the population these are minor drawbacks.

Imustaskforhelp 4 days ago | parent | prev | next [-]

> But there is probably already some tradeoff, as GPT 3.5 was awesome at chess and current models don't seem trained extensively on chess anymore.

Wow, I am so curious - can you provide a source?

I am very interested in an LLM chess benchmark, as someone who occasionally plays chess. I have thought about creating things like this, but it would be very interesting to find the best model at chess that isn't Stockfish/Leela but a general-purpose large language model.

I also agree that there might be an explosion of purpose-trained LLMs. I had this idea a year ago, when there was Llama / before DeepSeek: what if I want to write SvelteKit, and there are models like DeepSeek that know about SvelteKit, but they are so big and bloated when I only want a SvelteKit/Svelte model? Yes, there are arguments for why we might need the whole network to get better quality, but I genuinely feel that right now the better quality is debatable, thanks to all this benchmark-maxxing. I would happily take a model trained on SvelteKit at preferably 4B-8B parameters, but even if an extremely good SOTA-ish model for SvelteKit were around 30-40B, I would be happy, since I could buy a GPU for my PC to run it, or run it on my Mac.

I think my brother, who (unlike me) actually knows what he's talking about in the AI space, said the same thing to me a few months back as well.

In fact, it's funny: a few months ago, after that talk about small LLMs, I asked him to create a website comparing benchmarks of AIs playing chess, with an option to make two LLMs play against each other while we watch, or to play against an LLM ourselves on an actual chess board on the web, and more. He said it was a good idea but that he was busy at the time. I think he later forgot about it, and I had forgotten about it too until now.

radarsat1 4 days ago | parent | next [-]

Just search for "chess LLM leaderboard" there are already several. Also check https://www.reddit.com/r/llmchess/ although admittedly it doesn't get a lot of traffic.

zurfer 3 days ago | parent | prev | next [-]

This was the article I had in mind when writing this: https://dynomight.substack.com/p/chess

Imustaskforhelp 3 days ago | parent [-]

Ohhh I think this was the same article that I also had in mind

Key memory unlocked. I had an Aha moment with this article, thanks a lot for sharing it, appreciate it.

cindyllm 4 days ago | parent | prev [-]

[dead]

deepanwadhwa 4 days ago | parent | prev | next [-]

> GPT 3.5 was awesome at chess

I don't agree with this. I did try to play chess with GPT-3.5 and it was horrible. Full of hallucinations.

zurfer 3 days ago | parent | next [-]

Yeah, I was not precise; it was `gpt-3.5-turbo-instruct`, other variants apparently weren't trained on it. https://dynomight.substack.com/p/chess

miki123211 4 days ago | parent | prev [-]

It was GPT-3 I think.

As far as I remember, it's post-training that kills chess ability for some reason (GPT-3 wasn't post-trained).

Imustaskforhelp 4 days ago | parent [-]

This is so interesting. I am curious as to why - can you (or anyone) please provide any resources or insightful comments about it? They would really help a ton here, thanks!

pixelmelt 4 days ago | parent [-]

GPT-3 was trained on completion data, so it likely saw lots of raw chess games laid out in whatever standard format moves are listed in, while 3.5 was post-trained on instruct data (talking back and forth), which would have needed to explicitly include those chess games as conversational training data for it to retain as much chess ability as it would otherwise.
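The "standard format" here is essentially PGN movetext. A completion model is prompted with raw game text in the same shape it saw during pretraining, so its most likely continuation is simply the next move. A small sketch of building such a prompt (illustrative only; the exact prompts used in the linked experiments may differ):

```python
# A completion model (like GPT-3 or gpt-3.5-turbo-instruct) continues
# raw text, so the prompt mimics PGN movetext from its pretraining data.
moves = ["e4", "e5", "Nf3", "Nc6", "Bb5"]

def pgn_prompt(moves):
    """Lay moves out as numbered PGN move pairs, in Standard
    Algebraic Notation, e.g. '1. e4 e5 2. Nf3 ...'."""
    parts = []
    for i, move in enumerate(moves):
        if i % 2 == 0:          # white's move starts a numbered pair
            parts.append(f"{i // 2 + 1}.")
        parts.append(move)
    return " ".join(parts)

print(pgn_prompt(moves))  # 1. e4 e5 2. Nf3 Nc6 3. Bb5
```

A chat model instead sees this wrapped in conversational turns ("Here is the game so far... What is your move?"), a format that raw PGN files in the pretraining corpus never appeared in.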

alephnerd 4 days ago | parent | prev | next [-]

> if we'll see an explosion of purpose trained LLMs...

Domain-specific models have been on the roadmap for most companies for years now, for both competitive (why give up your moat to OpenAI or Anthropic) and financial (why finance OpenAI's margins) reasons.

onlyrealcuzzo 4 days ago | parent | prev | next [-]

Isn't the whole point of the MoE architecture exactly this?

That you can individually train and improve smaller segments as necessary?

ainch 4 days ago | parent | next [-]

Generally you train each expert simultaneously. The benefit of MoEs is that you get cheap inference because you only use the active expert parameters, which constitute a small fraction of the total parameter count. For example Deepseek R1 (which is especially sparse) only uses 1/18th of the total parameters per-query.

pama 4 days ago | parent [-]

> only uses 1/18th of the total parameters per-query.

It only uses 1/18th of the total parameters per token. It may use a large fraction of them in a single query.

ainch 3 days ago | parent [-]

That's a good correction, thanks.

idiotsecant 4 days ago | parent | prev [-]

I think it's the exact opposite - you don't specifically train each 'expert' to be a SME at something. Each of the experts is a generalist but becomes better at portions of tasks in a distributed way. There is no 'best baker', but things evolve toward 'best applier of flour', 'best kneader', etc. I think explicitly domain-trained experts are pretty uncommon in modern schemes.

viraptor 4 days ago | parent [-]

That's not entirely correct. Most MoE models right now are fully balanced, but there is an idea of a domain-expert MoE, where training benefits from fewer expert switches. https://arxiv.org/abs/2410.07490

idiotsecant 3 days ago | parent [-]

Yes, explicitly trained experts were a thing for a little while, but not anymore. Yet another application of the Bitter Lesson.

almaight 4 days ago | parent | prev | next [-]

https://seed-tars.com/game-tars

almaight 4 days ago | parent [-]

Video games have long served as a crucial proving ground for artificial intelligence. Like the real world, they offer rich, dynamic environments with responsive, real-time settings and complex challenges that push the boundaries of AI capabilities. The history of AI in gaming is marked by landmark achievements, from mastering classic board games to achieving superhuman performance in complex strategy titles. However, the next frontier lies beyond mastering individual, known environments.

To meet this challenge, we introduce Game-TARS: a next-generation generalist game agent designed to master complex video games and interactive digital environments using human-like perception, reasoning, and action. Unlike traditional game bots or modular AI frameworks, Game-TARS integrates all core faculties—visual perception, strategic reasoning, action grounding, and long-term memory—within a single, powerful vision-language model (VLM). This unified approach enables true end-to-end autonomous gameplay, allowing the agent to learn and succeed in any game without game-specific code, scripted behaviors, or manual rules.

With Game-TARS, this work is not about achieving the highest possible score in a single game. Instead, our focus is on building a robust foundation model for both generalist game-playing and broader computer use. We aim to create an agent that can learn to operate in any interactive digital environment it encounters, following instructions just like a human.

AmbroseBierce 4 days ago | parent | prev [-]

It reminds me of a story I read somewhere: some guy, high on drugs, climbed to the top of some elevated campus floodlights, shouting things about being a moth and loving lights. The security guys tried telling him to come down, but he paid no attention, and time went on until a janitor came and shut off the lights, then turned on one of those high-powered handheld ones and pointed it at him, and the guy quickly climbed down.

So yeah, I think there are different levels of thinking. Maybe future models will have some sort of internal models once they recognize patterns at some level of thinking. I'm not that knowledgeable about the internal workings of LLMs, so maybe this is all nonsense.