catapart 3 hours ago

Wow! That's a really great point of reference. I always knew text-based social media(ish) stuff should be "small", but I never had any idea whether that meant a site like HN could store its content in 1-2 TB, or whether it was more like a few hundred gigs. To learn that it's really only tens of gigs is very surprising!

ndriscoll 2 hours ago | parent | next [-]

Scraped Reddit text archives (~23B items according to their corporate info page) are ~4 TB of compressed JSON, which includes metadata and not just the actual comment text.
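
For a rough sense of scale, those two figures work out to roughly 174 compressed bytes per item. A quick back-of-the-envelope sketch, assuming decimal terabytes (1 TB = 10^12 bytes) and taking both numbers at face value; the function name is just for illustration:

```python
# Figures quoted above; treating "TB" as decimal terabytes is an assumption.
ITEMS = 23e9              # ~23B items
COMPRESSED_BYTES = 4e12   # ~4 TB of compressed JSON, metadata included


def avg_bytes_per_item(total_bytes: float, items: float) -> float:
    """Average compressed size of one item, in bytes."""
    return total_bytes / items


print(f"{avg_bytes_per_item(COMPRESSED_BYTES, ITEMS):.0f} bytes per item")
```

Since that average includes metadata, the comment text alone would presumably be smaller still.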

osigurdson 3 hours ago | parent | prev [-]

I suspect the text alone would be a lot smaller. Embeddings add a lot - 4 KB or more per item, regardless of the size of the text.
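
To see where a number like 4 KB can come from: an embedding is a fixed-length vector, so its size depends only on the dimension count and the numeric type, not on how long the text was. A minimal sketch, assuming 32-bit floats; the dimension counts below are illustrative assumptions, not anything stated about this dataset:

```python
def embedding_bytes(dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for one embedding vector (default: 4-byte float32 values)."""
    return dimensions * bytes_per_value


# Hypothetical dimension counts, chosen only to show the scaling.
for dims in (1024, 1536, 3072):
    print(f"{dims} dims -> {embedding_bytes(dims)} bytes")
```

Under those assumptions a 1024-dimension vector is exactly 4096 bytes, whether it represents a one-word reply or a long comment.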