WarmWash 3 hours ago

>What I am saying is that once the high quality training data runs out, it will drop in its capabilities pretty fast.

That's a misunderstood study that over time hardened into a confidently repeated fact. Yes, models collapse if you naively loop their output back in as training input. But no, that's not how synthetic data is actually used.
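
The collapse result itself is easy to reproduce in miniature. A toy sketch (plain NumPy, my own illustration rather than anything from the study): fit a Gaussian to samples drawn from the previous generation's fit and repeat. With finite samples, the estimated variance drifts toward zero, which is exactly the "loop output to input" failure mode.

    import numpy as np

    # Toy model-collapse loop: fit a Gaussian to samples drawn from the
    # previous generation's fit, then sample from the new fit for the
    # next generation. With finite samples, the estimated variance
    # drifts toward zero and the tails vanish.
    rng = np.random.default_rng(0)
    mu, sigma = 0.0, 1.0      # generation 0 = the "real" distribution
    n = 100                   # small sample size makes the drift visible

    for gen in range(1, 301):
        samples = rng.normal(mu, sigma, n)         # model output...
        mu, sigma = samples.mean(), samples.std()  # ...becomes training input
        if gen % 50 == 0:
            print(f"gen {gen:3d}: mu={mu:+.3f} sigma={sigma:.3f}")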

The reality is that all the major labs have been using synthetic training data for at least a year now. It turned out to be a non-issue as long as you have robust monitoring and curation in place for the generated data.
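
The curation step is the whole trick. A minimal sketch of the generate-then-filter idea (the names and the arithmetic task are hypothetical stand-ins, not any lab's actual pipeline; real systems verify with unit tests, proof checkers, human review, etc.):

    import random

    # Hypothetical generate-then-filter loop. generate_candidate() stands
    # in for sampling a (problem, solution) pair from a model; verify() is
    # the external check. The key point: the generator never grades itself.

    def generate_candidate():
        """Stand-in for sampling a (problem, solution) pair from a model."""
        a, b = random.randint(1, 99), random.randint(1, 99)
        # Simulate an imperfect generator: ~30% of "solutions" are wrong.
        wrong = random.random() < 0.3
        return (a, b), a + b + (random.randint(1, 9) if wrong else 0)

    def verify(problem, answer):
        """Ground-truth check that does not depend on the generator."""
        a, b = problem
        return answer == a + b

    candidates = [generate_candidate() for _ in range(10_000)]
    curated = [c for c in candidates if verify(*c)]
    print(f"kept {len(curated)} of {len(candidates)} candidates for training")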

qsera 3 hours ago | parent

>using synthetic training data

Yeah, look up how it's done.

It is exactly how a perpetual motion machine scam projects an appearance of working: use a generator to drive a motor and the motor to drive the generator, obscuring the fact that energy is being lost along the way.

WarmWash 2 hours ago | parent

I'm confused by the point you are trying to make, because they are using synthetic data and the models are getting stronger.

There is no "conservation of fallacy" law (bad data must preserve its level of badness), so I'm struggling to connect the dots on the analogy, unless I ignore the fact that training on synthetic data works, is being used, and the models are getting better.

qsera an hour ago | parent | next

If training on the original data alone failed to capture some of the information it contained, then data synthesized from that original data could help the model capture it, and the models could get better as a result.

But that is only because the synthetic data helped the model extract what was already there in the training data.

Once all such information has been extracted, synthetic data, or anything else derived from the original data, cannot create "new" information for training.
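
There's a formal version of this: the data processing inequality. If the synthetic data Z is produced only from the original data Y, which was itself drawn from the world X, the three form a Markov chain, and (in LaTeX):

    X \longrightarrow Y \longrightarrow Z \quad\implies\quad I(X;Z) \le I(X;Y)

No processing of Y, however clever, can add information about X that Y did not already carry.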

dgb23 an hour ago | parent | prev

Better by which metrics?