wiredpancake 7 days ago

How exactly do you foresee "pre-internet" data sources being the future of AI?

We already train on these encyclopedias; we've trained models on a large fraction of all published book content.

None of this will be helpful either: it will be outdated and will lack modern findings and understanding. Nor will it help me diagnose a DHCP issue on Windows Server 2019, or anything similar.

bbarnett 7 days ago | parent [-]

We're certainly not going to get accurate data via the internet, that's for sure.

Just take Python as an example. How often does the AI know whether it's looking at Python 2.7 or 3? You might think every shebang line says /usr/bin/python3, but they don't, and bare code snippets carry no version marker at all.

How many coders have read something, then realised it wasn't applicable to their version of the language? My point is, we need to train on data with known provenance, not random gibberish off the net. We need curated data, to a degree, and even Stack Overflow isn't curated enough.
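To make the version-ambiguity point concrete, here's a minimal sketch: the exact same line of code is valid in both Python 2 and Python 3 but computes different results, so a model trained on unlabeled snippets has no way to know which behavior was intended.

```python
# The same expression means different things depending on the Python version:
# in Python 2, int / int is floor division; in Python 3, it is true division.

def halve(n):
    return n / 2

print(halve(5))  # 2.5 on Python 3; 2 on Python 2
```

Nothing in the snippet itself says which version it targets; only surrounding context (or explicit labeling) can disambiguate it.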

And of course, that's the situation even with good data that simply isn't categorized well enough.

So one way forward is to create realms of trust: some data trusted more deeply, other data less so. And we need more categorization of data, and yes, that reduces model complexity and therefore some capabilities.
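A minimal sketch of what "realms of trust" could look like in a data pipeline. The tier names and weights below are invented for illustration (not from any real training setup): each document is tagged with a trust tier, and higher-trust tiers are sampled more heavily.

```python
import random

# Hypothetical trust tiers and weights -- invented for illustration only.
TRUST_WEIGHTS = {
    "curated_reference": 1.0,  # e.g. official language documentation
    "reviewed_qa": 0.6,        # e.g. moderated Q&A sites
    "raw_web": 0.2,            # uncategorized web scrapes
}

def sample_weight(doc):
    """Return the sampling weight for a document based on its trust tier."""
    return TRUST_WEIGHTS.get(doc.get("tier"), 0.0)

def weighted_sample(docs, k, rng=random):
    """Draw k documents, favoring higher-trust tiers."""
    weights = [sample_weight(d) for d in docs]
    return rng.choices(docs, weights=weights, k=k)
```

The point of the sketch is the shape of the idea, not the numbers: untagged data gets weight zero, so categorization becomes a prerequisite for inclusion rather than an afterthought.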

But we keep aiming for that complexity, without caring about where the data comes from.

And this is where I think smaller companies will come in. The big boys are focusing on brute force. We need subtlety.

ipaddr 6 days ago | parent [-]

New languages will emerge, or at least new versions of existing languages will come with codenames. What about Thunder Python or Uber Python for the next release?