| ▲ | catapart 4 hours ago |
| Am I misunderstanding what a parquet file is, or are all of the HN posts along with the embedding metadata a total of 55GB? |
|
| ▲ | gkbrk 3 hours ago | parent | next [-] |
| I imagine that's mostly embeddings actually. My database has all the posts and comments from Hacker News, and the table takes up 17.68 GB uncompressed and 5.67 GB compressed. |
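(For scale, a minimal sketch of where numbers like that come from, using the official HN Firebase API documented at https://github.com/HackerNews/API. The endpoints are real; the sizing arithmetic is a rough back-of-envelope, not gkbrk's actual pipeline.)

    # Fetch one HN item and gauge per-item JSON size (Python, stdlib only).
    import json
    import urllib.request

    BASE = "https://hacker-news.firebaseio.com/v0"

    def fetch(path):
        with urllib.request.urlopen(f"{BASE}/{path}.json") as resp:
            return json.load(resp)

    max_id = fetch("maxitem")       # highest item id so far: tens of millions
    item = fetch(f"item/{max_id}")  # one post or comment as JSON
    size = len(json.dumps(item).encode("utf-8"))
    print(f"{max_id} items total; this one is {size} bytes as JSON")
    # At a few hundred bytes per item, tens of millions of items land in
    # the tens of gigabytes uncompressed, consistent with 17.68 GB.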
| |
| ▲ | catapart 3 hours ago | parent | next [-] |
Wow! That's a really great point of reference. I always knew text-based social media(ish) stuff should be "small", but I never had any idea whether that meant a site like HN could store its content in 1-2 TB, or whether it was more like a few hundred gigs. To learn that it's really only tens of gigs is very surprising!
| ▲ | ndriscoll 2 hours ago | parent | next [-] |
Scraped Reddit text archives (~23B items, according to their corporate info page) are ~4 TB of compressed JSON, which includes metadata and not just the actual comment text.
| ▲ | osigurdson 3 hours ago | parent | prev [-] |
I suspect the text alone would be a lot smaller. Embeddings add a lot: 4 KB or more per item, regardless of the size of the text.
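(A quick back-of-envelope on that figure. The dimension counts below are common model shapes chosen for illustration, and the item count is an assumption, not a measured number.)

    # Storage cost of dense float32 embeddings: 4 bytes per dimension.
    BYTES_PER_FLOAT32 = 4
    N_ITEMS = 40_000_000  # assumed order of magnitude for all HN items

    for dims in (384, 768, 1024, 1536):
        per_item = dims * BYTES_PER_FLOAT32  # bytes per embedding vector
        total_gb = per_item * N_ITEMS / 1e9  # across the whole corpus
        print(f"{dims:>4} dims -> {per_item} B/item, ~{total_gb:.0f} GB total")
    # A 1024-dim float32 vector is exactly 4096 bytes, the "4 KB or more"
    # figure; it quickly dwarfs the raw text unless the vectors are quantized.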
| |
| ▲ | atonse 3 hours ago | parent | prev [-] |
That’s crazy small. So is it fair to say that words are actually the best compression algorithm we have? You can explain complex ideas in just a few hundred words. Yes, a picture is worth a thousand words, but imagine how much information is in those 17 GB of text.
| ▲ | binary132 3 hours ago | parent | next [-] |
I don’t think I would really consider it compression if it’s not very reversible. Whatever people “uncompress” from my words isn’t necessarily what I was imagining or thinking about when I encoded them. I guess it’s more like a symbolic shorthand for meaning, one that relies on the second party to build their own internal model out of their own symbols (a shared public interface, but an internal implementation that is relatively unique).
| ▲ | tiagod 36 minutes ago | parent [-] |
It is compression, but it is lossy. Just like digital counterparts such as MP3 and JPEG, in some cases the final message can contain all the information you need.
| ▲ | binary132 16 minutes ago | parent [-] |
But what’s getting reproduced in your head when you read what I’ve written isn’t what’s in my head at all. You have your own entire context, associations, and language.
|
| |
| ▲ | _zoltan_ 3 hours ago | parent | prev [-] |
How much?
|
|
|
| ▲ | simlevesque 3 hours ago | parent | prev | next [-] |
You'd be surprised. I have a lot of text data, and Parquet files with Brotli compression can achieve impressive file sizes. Around 4 million web pages as Markdown come to about 1-2 GB.
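(Not necessarily the commenter's exact setup, but a minimal sketch of the technique with pyarrow; compression="brotli" is a real pyarrow option, while the file name and sample rows are made up.)

    # Write text columns to Parquet with Brotli compression (Python, pyarrow).
    import pyarrow as pa
    import pyarrow.parquet as pq

    pages = pa.table({
        "url": ["https://example.com/a", "https://example.com/b"],
        "markdown": ["# Page A\nSome text...", "# Page B\nMore text..."],
    })

    pq.write_table(pages, "pages.parquet", compression="brotli")
    print(pq.ParquetFile("pages.parquet").metadata)  # inspect compressed size

Parquet's columnar layout keeps all the markdown strings contiguous on disk, which is part of why a general-purpose compressor like Brotli does so well on them.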
|
| ▲ | verdverm 4 hours ago | parent | prev | next [-] |
Based on the table they show, that would be my inclination. I've wanted to do this for my own upvotes so I can see the kinds of things I like, or find them again more easily when relevant.
|
| ▲ | lazide 3 hours ago | parent | prev [-] |
| Compressed, pretty believable. |