jesse__ 2 hours ago:
This sounds very wrong to me. Take the C4 training dataset, for example. The uncompressed, uncleaned dataset is ~6TB and contains an exhaustive English-language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1TB. I could go on, but I think it's already pretty obvious that 1TB is more than enough storage to represent a significant portion of the internet.
FeepingCreature an hour ago | parent:
This would imply that the English internet is not much bigger than 20x the English Wikipedia. That seems implausible.
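The "20x" figure above is implicit, so a back-of-envelope sketch of the arithmetic may help. The ~50 GB figure for English Wikipedia's uncompressed article text is my own ballpark assumption, not stated anywhere in the thread:

```python
# Back-of-envelope check of FeepingCreature's "20x Wikipedia" claim.
# ASSUMPTION: English Wikipedia's uncompressed article text is roughly
# 50 GB -- a rough ballpark, not a figure from the thread.
wikipedia_gb = 50

# jesse__'s cleaned C4 size: "significantly less than 1TB"; take 1000 GB
# as a generous upper bound for the comparison.
cleaned_c4_gb = 1000

ratio = cleaned_c4_gb / wikipedia_gb
print(f"cleaned C4 is at most ~{ratio:.0f}x English Wikipedia")  # ~20x
```

Under that assumption the arithmetic checks out: if the cleaned common-crawl-derived corpus tops out under 1TB, it is on the order of 20 English Wikipedias, which is the ratio being called implausible.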