willvarfar a day ago

This is a tangent, but this neat illustration of how LLMs regurgitate their training material prompts me to voice a little prediction I've been nursing recently:

LLMs are better at generating the boilerplate of today's programming languages than they will be at tomorrow's.

This is because tomorrow's programming languages will not only be newer and lack a corpus to train the models on, but by the time a corpus is built, it will consist largely of LLM hallucinations that got checked into GitHub.

The internet that has been trawled to train the LLMs is already largely SEO spam etc., and the internet of the future will be much more so. The loop will feed into itself and become ever worse in quality.

jetrink a day ago | parent [-]

That sounds like a reasonable prediction to me if the LLM makers do nothing in response. However, I'll bet coding is the easiest area for which to generate synthetic training data. You could have an LLM generate 100k solutions to 10k programming problems in the target language and throw away the results that don't pass automated tests. Have humans grade the results that do pass the tests and use the best answers for future training. Repeat until you have a corpus of high quality code.