zwnow 3 days ago

I also love how AI enthusiasts just ignore the issue of exhausted training data... You can't just magically create more training data. Also, synthetic training data reduces the quality of models.

aspenmartin 2 days ago

You're mixing up several concepts. Synthetic data works for coding because coding is a verifiable domain. You train via reinforcement learning to reward code-generation behavior that passes detailed specs and meets other desiderata. It's literally how things are done today and how progress gets made.
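To make that concrete, here's a minimal sketch of what a verifiable reward could look like -- assuming a pytest-style test suite and a hypothetical sandbox far more careful than this; it's an illustration, not any vendor's actual stack:

    import os
    import subprocess
    import tempfile

    def verifiable_reward(candidate_code: str, test_suite: str) -> float:
        """Reward 1.0 if the candidate passes the spec's tests, else 0.0.
        Real pipelines sandbox execution, cap resources, and usually
        blend in softer signals (style, lint, etc.) as well."""
        with tempfile.TemporaryDirectory() as tmp:
            with open(os.path.join(tmp, "solution.py"), "w") as f:
                f.write(candidate_code)
            with open(os.path.join(tmp, "test_solution.py"), "w") as f:
                f.write(test_suite)
            try:
                result = subprocess.run(
                    ["python", "-m", "pytest", "-q", "test_solution.py"],
                    cwd=tmp, capture_output=True, timeout=60,
                )
            except subprocess.TimeoutExpired:
                return 0.0  # hung code gets no reward
        return 1.0 if result.returncode == 0 else 0.0

An RL step then pushes up the probability of generations that score 1.0.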

zwnow 2 days ago

Most code out there is a legacy security nightmare, surely it's good to train on that.

dang 2 days ago

Would you please stop posting cynical, dismissive comments? From a brief scroll through https://news.ycombinator.com/comments?id=zwnow, it seems like your account has been doing nothing else, regardless of the topic that it's commenting on. This is not what HN is for, and destroys what it is for.

If you keep this up, we're going to have to ban you, not because of your views on any particular topic but because you're going entirely against the intended spirit of the site by posting this way. There's plenty of room to express your views substantively and thoughtfully, but we don't want cynical flamebait and denunciation. HN needs a good deal less of this.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.

zwnow 2 days ago

Then ban me, you loser. As I wrote, HN is full of pretentious bullshitters. But it's good that you want to ban authentic views. Way to go. If I feel like it, I'll just create a new account :-)

aspenmartin 2 days ago

dang is a saint; he wants your opinion, just not the toxic stuff that comes with it.

aspenmartin 2 days ago

But that doesn't really matter, and it shows how confused people are about how a coding agent like Claude or the OSS models is actually created -- the system can learn on its own rather than simply mimicking existing codebases, even though scraped/licensed/commissioned code traces are part of the training cycle.

Training looks like:

- Pretraining (all data, code and non-code alike, garbage included)

- Specialized pretraining (high-quality curated codebases, long-context data, synthetic data, etc.)

- Supervised fine-tuning (SFT) -- curated prompt + patch pairs and curated Q&A (Stack Overflow-style; people are often cynical that this is acquired unethically, but the major players are in fact very risk-averse and will simply license the data and ensure they have the legal rights)

- Then more SFT for tool use -- actual curated agentic and human traces that are verified to be correct or at least produce the correct output.

- Then synthetic generation / improvement loops -- generate a lot of data and keep only the generations that pass unit tests and other spec requirements (a rough sketch of this filter step follows the list), followed by RL using verifiable rewards plus possibly preference data to shape the vibes

- Then additional steps for safety, etc.
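As promised, a rough sketch of that filter step -- `model` and `verify` here are placeholder callables standing in for a whole serving and testing stack, not anyone's real API:

    from typing import Callable

    def generate_and_filter(
        model: Callable[[str], str],         # prompt -> candidate code
        verify: Callable[[str, str], bool],  # (prompt, code) -> passes tests?
        prompts: list[str],
        samples_per_prompt: int = 8,
    ) -> list[dict]:
        """Sample several candidates per prompt and keep only the ones
        that pass verification; the survivors become SFT / RL data
        for the next round."""
        dataset = []
        for prompt in prompts:
            for _ in range(samples_per_prompt):
                candidate = model(prompt)
                if verify(prompt, candidate):
                    dataset.append({"prompt": prompt, "completion": candidate})
        return dataset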

So synthetic data is not a problem; it's actually what explains the success coding models are having, why people are so focused on them, and why "we're running out of data" is just a misunderstanding of how things work. It's also why you don't see the same amount of focus on areas (creative writing, art, etc.) that don't have verifiable rewards.

The

Agent --> Synthetic data --> filtering --> new agent --> better synthetic data --> filtering --> even better agent

flywheel is what you're seeing today, so there's no reason to suspect some hard limit to this -- in principle, the data is infinite.
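Spelled out as a loop, with every name here (`generate`, `train_on`, `verify`) a stand-in for an entire training stack rather than a real call:

    def flywheel(model, prompts, verify, rounds=3):
        """Hypothetical shape of the loop: generate synthetic data,
        keep what passes verification, retrain, repeat."""
        for _ in range(rounds):
            # Agent produces synthetic data.
            candidates = [(p, model.generate(p)) for p in prompts]
            # Filtering: keep only the verified generations.
            passing = [(p, c) for p, c in candidates if verify(p, c)]
            # Train the next, hopefully better, agent on the survivors.
            model = model.train_on(passing)
        return model

The bet is that each round's filtered data is better than the last round's -- which is exactly the flywheel above.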

TeMPOraL 3 days ago

They don't ignore it, they just know it's not an actual problem.

It saddens me to see AI detractors stuck in 2022, still thinking language models just regurgitate bits of training data.

zwnow 3 days ago

Thankfully, you are wrong. I watch lots of talks on the topic from actual experts. New models are just old models with more tooling. Training data is exhausted, and it's a real issue.

TeMPOraL 2 days ago

Well, my experts disagree with your experts :). Sure, the supply of available fresh data is running out, but at the same time, there's way more data than needed -- most of it is low-quality noise anyway. New models aren't just old models with more tooling: the entire training pipeline has been evolving, as researchers and model vendors focus on making better use of the data they have and on refining the training datasets themselves.

There are more stages to LLM training than just the pre-training stage :).

GrumpyGoblin 2 days ago

Not saying it's not a problem -- I actually don't know -- but new CPUs are just old models with more improvements/tooling. Same with TVs. And cars. And clothes. Everything is. That's how improving things works. Running out of raw data doesn't mean running out of room for improvement. The data has been much the same for the last 20 years; AI isn't new, and things keep improving anyway.

zwnow 2 days ago

Well, cars and CPUs aren't expected to eventually reach AGI, and they don't eat a trillion-dollar hole in us peasants' pockets either. Sure, improvements can be made. But on a fundamental level, agents/LLMs cannot reason (even though they love to act like they can). They are parrots learning words, and these parrots won't ever invent new words once the list of words is exhausted.

puchatek 3 days ago

That's been my main argument for why LLMs might be at their zenith. But I recently started wondering whether all those codebases we expose to them might be good enough training data for the next generation. It's not high quality like accepted Stack Overflow answers, but it's working software for the most part.

jacquesm 2 days ago

If they were good enough, you could rent them to put together closed-source stuff you can hide behind a paywall -- or maybe the AI owners would also own the paywall and rent you the software instead. The second that becomes possible, it will happen.