williamtrask (OP) 9 hours ago
the scaling laws / bitter lesson would disagree, but I tend to agree with you, with some hedging. If you get copies of the same data, it doesn't help. In a similar fashion, going from 100 TB of data scraped from the internet to 200 TB of data scraped from the internet... does it tell you much more? Unclear. But there are large categories of data which aren't represented at all in LLMs. Most of the world's data just isn't on the internet. AI for Health is perhaps the most obvious example.
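For a rough sense of why "200 TB instead of 100 TB of the same kind of scrape" buys less than it sounds like, here is a small sketch using the Chinchilla-style scaling law L(N, D) = E + A/N^alpha + B/D^beta. The constants are the approximate published fits from Hoffmann et al. (2022); the model size, token counts, and the `predicted_loss` helper are illustrative choices of mine, not anything from this thread.

    # Chinchilla-style scaling law: L(N, D) = E + A / N^alpha + B / D^beta
    # Constants below are the approximate fits reported by Hoffmann et al. (2022).
    E, A, B = 1.69, 406.4, 410.7
    ALPHA, BETA = 0.34, 0.28

    def predicted_loss(n_params: float, n_tokens: float) -> float:
        """Predicted pretraining loss for a model with n_params parameters
        trained on n_tokens tokens, under the fitted law."""
        return E + A / n_params**ALPHA + B / n_tokens**BETA

    # Hold model size fixed and double the (deduplicated) token count.
    n = 70e9  # 70B parameters, chosen for illustration
    for tokens in (1.4e12, 2.8e12):  # 1.4T vs. 2.8T tokens
        print(f"{tokens:.1e} tokens -> predicted loss {predicted_loss(n, tokens):.3f}")
    # Doubling D shaves only ~0.03 off the predicted loss here, and the law
    # implicitly assumes the new tokens are as informative as the old ones;
    # duplicates and near-duplicates don't count as "more D" in any useful sense.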
CuriouslyC 9 hours ago
More data isn't automatically better. You're trying to build the most accurate model of the "true" latent space (estimated from user preference/computational oracles) possible. More data can give you more coverage of that latent space, it can smooth out your estimate of it, and it can let you bake more knowledge in (TBH this is low value though; freshness is a problem). If the extra data doesn't cover a new part of the latent space, its value quickly goes to zero as redundancy increases. You also have to be careful that the data you add isn't baking unwanted biases into the model.
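To make the coverage-vs-redundancy point concrete, here's a toy sketch (my own framing, nothing specific to LLM training): fit a simple curve from noisy samples and compare adding 10x more points in an already-covered region against adding a handful of points in a region the model hasn't seen.

    import numpy as np

    rng = np.random.default_rng(0)
    true_fn = np.sin  # stand-in for the "true" latent function being modeled

    def fit_and_eval(train_x):
        """Fit a small polynomial to noisy samples of true_fn and return
        the mean squared error over the whole domain [0, 2*pi]."""
        train_y = true_fn(train_x) + rng.normal(0, 0.1, size=train_x.shape)
        coeffs = np.polyfit(train_x, train_y, deg=5)
        test_x = np.linspace(0, 2 * np.pi, 200)
        return np.mean((np.polyval(coeffs, test_x) - true_fn(test_x)) ** 2)

    base = rng.uniform(0, np.pi, 50)                                      # covers only half the domain
    redundant = np.concatenate([base, rng.uniform(0, np.pi, 500)])        # 10x more data, same region
    coverage = np.concatenate([base, rng.uniform(np.pi, 2 * np.pi, 50)])  # a little data, new region

    print("base     :", fit_and_eval(base))
    print("redundant:", fit_and_eval(redundant))  # still bad: extra points only smooth the covered half
    print("coverage :", fit_and_eval(coverage))   # much better: the uncovered region is now represented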
joe_the_user 9 hours ago
> the scaling laws / bitter lesson would disagree

I have to note that taking the "bitter lesson" position as a claim that more data will result in better LLMs is a wild misinterpretation (or perhaps a "telephone" version) of the original bitter lesson article, which says only that general, scalable algorithms do better than knowledge-carrying, problem-specific ones. And last I heard, it was the "scaling hypothesis" that hardly had consensus among those in the field.