Remix.run Logo
gkbrk 3 hours ago

I imagine that's mostly embeddings actually. My database has all the posts and comments from Hacker News, and the table takes up 17.68 GB uncompressed and 5.67 GB compressed.

catapart 3 hours ago | parent | next [-]

Wow! That's a really great point of reference. I always knew text-based social media(ish) stuff should be "small", but I never had any idea if that meant a site like HN could store it's content in 1-2 TB, or if it was more like a few hundred gigs or what. To learn that it's really only tens of gigs is very surprising!

ndriscoll 2 hours ago | parent | next [-]

Scraped reddit text archives (~23B items according to their corporate info page) are ~4 TB of compressed json, which includes metadata and not just the actual comment text.

osigurdson 2 hours ago | parent | prev [-]

I suspect the text alone would be a lot smaller. Embeddings add a lot - 4K or more regardless of the size of the text.

atonse 3 hours ago | parent | prev [-]

That’s crazy small. So is it fair to say that words are actually the best compression algorithm we have? You can explain complex ideas in just a few hundred words.

Yes, a picture is worth a thousand words, but imagine how much information is in those 17GB of text.

binary132 3 hours ago | parent | next [-]

I don’t think I would really consider it compression if it’s not very reversible. Whatever people “uncompress” from my words isn’t necessarily what I was imagining or thinking about when I encoded them. I guess it’s more like a symbolic shorthand for meaning which relies on the second party to build their own internal model out of their own (shared public interface, but internal implementation is relatively unique…) symbols.

tiagod 34 minutes ago | parent [-]

It is compression, but it is lossy. Just like the digital counterparts like mp3 and jpeg, in some cases the final message can contain all the information you need.

binary132 15 minutes ago | parent [-]

But what’s getting reproduced in your head when you read what I’ve written isn’t what’s in my head at all. You have your own entire context, associations, and language.

_zoltan_ 3 hours ago | parent | prev [-]

how much?