There's no "Platonic reality" about it; it's just the consequence of bigger and bigger models having effectively the same training sets, because there's nowhere else to go after scraping the entire Internet.

Der_Einzige 5 days ago | parent [-]

The idea that we've scraped the "entire internet" is complete nonsense. If you're ready to actually argue against this, let's see your peer-reviewed, highly cited research from a reputable conference indicating that anything even close to the entire internet has been scraped.

At best, you've scraped a significant portion of the open internet.

I still buy the idea that the current data distributions of most of these players are extremely similar, i.e. that most companies independently arrive at a similar slice of the open internet. I don't buy that we've hit the data wall yet. The crawlers and search infrastructure at most of these companies unironically don't know where to look, and don't know how to access a significant amount of the stuff that they do crawl.

	cwmoore 5 days ago | parent [-]
		E.g., fuzzed outputs of all the source code, and every Wikipedia article autocompleted.