Marha01 (3 hours ago):
Even with 1 TB of weights (a probable size for the largest state-of-the-art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.
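For scale, a quick back-of-envelope sketch of how many parameters 1 TB of weights would hold, assuming 2 bytes per parameter (fp16/bf16; a quantized model would pack more):

    # How many parameters fit in 1 TB of weights?
    # Assumption: 2 bytes per parameter (fp16/bf16).
    weight_bytes = 1e12                   # 1 TB, the figure above
    bytes_per_param = 2                   # fp16
    n_params = weight_bytes / bytes_per_param
    print(f"{n_params:.0e} parameters")   # 5e+11, i.e. roughly 500 billion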
jesse__ (2 hours ago):
This sounds very wrong to me. Take the C4 training dataset, for example. The uncompressed, uncleaned dataset is ~6 TB and contains an exhaustive English-language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1 TB. I could go on, but I think it's already pretty obvious that 1 TB is more than enough storage to represent a significant portion of the internet.
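Putting those sizes side by side (the ~750 GB figure for cleaned C4 is an assumption based on its commonly cited size, not something stated above):

    # Rough sizes in TB, taken from the thread; not precise measurements.
    raw_scrape_tb = 6.0    # uncleaned 2019 English web scrape behind C4
    cleaned_c4_tb = 0.75   # cleaned C4, assumed ~750 GB
    weights_tb = 1.0       # hypothetical model size from the thread
    print(f"raw scrape vs. weights: {raw_scrape_tb / weights_tb:.0f}x larger")
    print(f"cleaned corpus fits in the weights: {cleaned_c4_tb <= weights_tb}")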
kgeist (an hour ago):
A lot of the internet is duplicate data, low-quality content, SEO spam, etc. I wouldn't be surprised if 1 TB is a significant portion of the high-quality, information-dense part of the internet.
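A minimal sketch of why deduplication shrinks a crawl so much, using exact content hashing over a hypothetical `pages` list (real cleaning pipelines add near-duplicate detection such as MinHash on top of this):

    import hashlib

    def dedup(pages):
        """Keep only the first occurrence of each exact page text."""
        seen, unique = set(), []
        for text in pages:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(text)
        return unique

    # Mirrored boilerplate collapses to a single copy.
    pages = ["same SEO boilerplate"] * 3 + ["one original article"]
    print(len(pages), "->", len(dedup(pages)))  # 4 -> 2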
gmueckl (an hour ago):
This is obviously wrong. There is a bunch of knowledge embedded in those weights, and some of it can be recalled verbatim. So, by virtue of this recall alone, training is a form of lossy data compression. | ||||||||