| ▲ | catapart 3 hours ago | |
Wow! That's a really great point of reference. I always knew text-based social media(ish) stuff should be "small", but I never had any idea whether that meant a site like HN could store its content in 1-2 TB, or a few hundred gigs, or what. To learn that it's really only tens of gigs is very surprising! | ||
| ▲ | ndriscoll 2 hours ago | parent | next [-] | |
Scraped reddit text archives (~23B items according to their corporate info page) are ~4 TB of compressed json, which includes metadata and not just the actual comment text. | ||
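For a rough sense of scale, the figures in the comment above can be turned into an average per-item size. This is only back-of-the-envelope arithmetic: it assumes 1 TB = 10^12 bytes and treats both numbers (~4 TB, ~23B items) as approximate, and the function name is made up for illustration.

```python
def average_bytes_per_item(total_bytes: int, item_count: int) -> float:
    """Average stored size per item, in bytes."""
    return total_bytes / item_count


# Figures quoted in the comment, both approximate.
# Assumption: 1 TB is taken as 10**12 bytes.
compressed_bytes = 4 * 10**12
item_count = 23 * 10**9

avg = average_bytes_per_item(compressed_bytes, item_count)
print(f"about {avg:.0f} compressed bytes per item, metadata included")
```

Under those assumptions the result is on the order of a couple hundred compressed bytes per item.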
| ▲ | osigurdson 3 hours ago | parent | prev [-] | |
I suspect the text alone would be a lot smaller. Embeddings add a lot - 4K or more regardless of the size of the text. | ||
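A quick sketch of why an embedding costs the same no matter how short the text is: its size depends only on the vector dimension and the numeric type. Reading "4K" as roughly 4 KB per item is an assumption here, as are the example dimensions and the float32 storage; the function name is hypothetical.

```python
def embedding_bytes(dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw size of one embedding vector; 4 bytes per value assumes float32."""
    return dimensions * bytes_per_value


# Assumed example dimensions, not taken from the thread.
for dims in (1024, 1536):
    print(dims, "dims ->", embedding_bytes(dims), "bytes")

# The cost is fixed per item: a 50-byte comment and a 5,000-byte comment
# both carry the same embedding overhead.
short_comment = "lol".encode("utf-8")
overhead_ratio = embedding_bytes(1024) / len(short_comment)
```

With 1024 float32 values the vector is 4096 bytes, which would match the "4K" figure if that reading is right.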