| ▲ | gkbrk 3 hours ago |
| I imagine that's mostly embeddings actually. My database has all the posts and comments from Hacker News, and the table takes up 17.68 GB uncompressed and 5.67 GB compressed. |
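For reference, the two sizes quoted above work out to roughly a 3x reduction. A minimal sketch of that arithmetic, using only the figures from the comment (the function names are illustrative, not from any particular tool):

```python
# Sizes quoted in the comment above, in GB.
UNCOMPRESSED_GB = 17.68
COMPRESSED_GB = 5.67


def compression_ratio(uncompressed: float, compressed: float) -> float:
    """Return how many times smaller the compressed data is."""
    return uncompressed / compressed


def space_saved(uncompressed: float, compressed: float) -> float:
    """Return the fraction of the original size removed by compression."""
    return 1 - compressed / uncompressed


ratio = compression_ratio(UNCOMPRESSED_GB, COMPRESSED_GB)
saved = space_saved(UNCOMPRESSED_GB, COMPRESSED_GB)
print(f"ratio: {ratio:.2f}x, space saved: {saved:.0%}")
```
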
|
| ▲ | catapart 3 hours ago | parent | next [-] |
| Wow! That's a really great point of reference. I always knew text-based social media(ish) stuff should be "small", but I never had any idea if that meant a site like HN could store its content in 1-2 TB, or if it was more like a few hundred gigs or what. To learn that it's really only tens of gigs is very surprising! |
| |
| ▲ | ndriscoll 2 hours ago | parent | next [-] | | Scraped Reddit text archives (~23B items according to their corporate info page) are ~4 TB of compressed JSON, which includes metadata and not just the actual comment text. | |
| ▲ | osigurdson 2 hours ago | parent | prev [-] | | I suspect the text alone would be a lot smaller. Embeddings add a lot - 4K or more regardless of the size of the text. |
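One way to see where a figure like "4K or more" could come from, reading it as about 4 KB per item: a dense embedding has a fixed size set by its dimension count and numeric type, independent of how long the text is. The sketch below assumes 32-bit floats, and the dimension counts are hypothetical examples, since the thread does not say which embedding model is used.

```python
def embedding_bytes(dimensions: int, bytes_per_value: int = 4) -> int:
    """Size of one dense embedding vector, assuming fixed-width values.

    bytes_per_value defaults to 4 (float32), which is an assumption;
    the thread does not state the storage format.
    """
    return dimensions * bytes_per_value


# Hypothetical dimension counts, chosen only for illustration.
for dims in (1024, 1536):
    print(f"{dims} dims -> {embedding_bytes(dims)} bytes per item")

# A short comment, by contrast, is only a few dozen bytes of UTF-8.
comment_text = "I suspect the text alone would be a lot smaller."
text_bytes = len(comment_text.encode("utf-8"))
print(f"comment text -> {text_bytes} bytes")
```

Under these assumptions the vector is much larger than the short comment it describes, which is consistent with the claim that embeddings dominate the table size.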
|
|
| ▲ | atonse 3 hours ago | parent | prev [-] |
| That’s crazy small. So is it fair to say that words are actually the best compression algorithm we have? You can explain complex ideas in just a few hundred words. Yes, a picture is worth a thousand words, but imagine how much information is in those 17GB of text. |
| |
| ▲ | binary132 3 hours ago | parent | next [-] | | I don’t think I would really consider it compression if it’s not very reversible. Whatever people “uncompress” from my words isn’t necessarily what I was imagining or thinking about when I encoded them. I guess it’s more like a symbolic shorthand for meaning which relies on the second party to build their own internal model out of their own (shared public interface, but internal implementation is relatively unique…) symbols. | | |
| ▲ | tiagod 34 minutes ago | parent [-] | | It is compression, but it is lossy. Just like digital counterparts such as MP3 and JPEG, in some cases the final message can contain all the information you need. | | |
| ▲ | binary132 15 minutes ago | parent [-] | | But what’s getting reproduced in your head when you read what I’ve written isn’t what’s in my head at all. You have your own entire context, associations, and language. |
|
| |
| ▲ | _zoltan_ 3 hours ago | parent | prev [-] | | how much? |
|