catapart 4 hours ago

Am I misunderstanding what a parquet file is, or are all of the HN posts along with the embedding metadata a total of 55GB?
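
If you have the file locally, one way to check is to sum the compressed bytes per column in the Parquet metadata. A minimal pyarrow sketch (the filename here is an assumption, not the actual dump's name):

    import pyarrow.parquet as pq

    # Hypothetical filename; substitute the actual dump.
    md = pq.ParquetFile("hn_with_embeddings.parquet").metadata

    # Total compressed bytes per column across all row groups,
    # to see how much of the file is text vs. embedding vectors.
    sizes = {}
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            chunk = md.row_group(rg).column(col)
            sizes[chunk.path_in_schema] = sizes.get(chunk.path_in_schema, 0) + chunk.total_compressed_size

    for name, nbytes in sorted(sizes.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {nbytes / 1e9:.2f} GB")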

gkbrk 3 hours ago | parent | next [-]

I imagine that's mostly embeddings, actually. My database has all the posts and comments from Hacker News, and the table takes up 17.68 GB uncompressed and 5.67 GB compressed.
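
For anyone who wants to build the same thing: every item is available from the official Firebase API. A minimal sketch (sequential, with no batching, retries, or rate handling):

    import json
    import urllib.request

    BASE = "https://hacker-news.firebaseio.com/v0"

    def fetch(path):
        with urllib.request.urlopen(f"{BASE}/{path}.json") as resp:
            return json.load(resp)

    max_id = fetch("maxitem")          # highest item id assigned so far
    item = fetch(f"item/{max_id}")     # one story/comment/job as JSON
    if item is not None:
        print(max_id, item.get("type"), len(json.dumps(item)), "bytes")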

catapart 3 hours ago | parent | next [-]

Wow! That's a really great point of reference. I always knew text-based social media(ish) stuff should be "small", but I never had any idea whether that meant a site like HN could store its content in 1-2 TB, or if it was more like a few hundred gigs or what. To learn that it's really only tens of gigs is very surprising!

ndriscoll 2 hours ago | parent | next [-]

Scraped Reddit text archives (~23B items, according to their corporate info page) are ~4 TB of compressed JSON, which includes metadata and not just the actual comment text.
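
Back-of-the-envelope from those two approximate figures, that works out to roughly 175 compressed bytes per item, metadata included:

    # Both figures are approximate, as stated above.
    total_bytes = 4e12   # ~4 TB of compressed JSON
    items = 23e9         # ~23B posts and comments
    print(total_bytes / items)   # ~174 bytes per item, compressed, metadata included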

osigurdson 3 hours ago | parent | prev [-]

I suspect the text alone would be a lot smaller. Embeddings add a lot: 4 KB or more per item, regardless of the size of the text.
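
That "4K" is just the raw vector size. For example, a 1024-dimensional float32 embedding (an assumed, typical shape, not necessarily what this dataset uses) is 4 KB per item before any compression:

    import numpy as np

    # Assumed embedding shape: 1024 dims of float32 (4 bytes each).
    dims = 1024
    per_item = dims * np.dtype(np.float32).itemsize   # 4096 bytes = 4 KB
    print(per_item / 1024, "KB per item")
    print(per_item * 1_000_000 / 1e9, "GB per million items, uncompressed")

And unlike text, near-random float vectors barely compress, so they tend to dominate the file size.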

atonse 3 hours ago | parent | prev [-]

That’s crazy small. So is it fair to say that words are actually the best compression algorithm we have? You can explain complex ideas in just a few hundred words.

Yes, a picture is worth a thousand words, but imagine how much information is in those 17GB of text.

binary132 3 hours ago | parent | next [-]

I don’t think I would really consider it compression if it’s not very reversible. Whatever people “uncompress” from my words isn’t necessarily what I was imagining or thinking about when I encoded them. I guess it’s more like a symbolic shorthand for meaning which relies on the second party to build their own internal model out of their own (shared public interface, but internal implementation is relatively unique…) symbols.

tiagod 36 minutes ago | parent [-]

It is compression, but it's lossy. Just like digital counterparts such as MP3 and JPEG, in some cases the final message can contain all the information you need.

binary132 16 minutes ago | parent [-]

But what’s getting reproduced in your head when you read what I’ve written isn’t what’s in my head at all. You have your own entire context, associations, and language.

_zoltan_ 3 hours ago | parent | prev [-]

how much?

simlevesque 3 hours ago | parent | prev | next [-]

You'd be surprised. I have a lot of text data, and Parquet files with brotli compression can achieve impressive file sizes.

Around 4 million web pages as markdown is like 1-2 GB.
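
A minimal sketch of what that looks like with pyarrow (toy data; the column names are just examples):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Toy table standing in for scraped pages converted to markdown.
    pages = pa.table({
        "url": ["https://example.com/a", "https://example.com/b"],
        "markdown": ["# Page A\n\nSome text...", "# Page B\n\nMore text..."],
    })

    # Brotli tends to compress repetitive text columns very well.
    pq.write_table(pages, "pages.parquet", compression="brotli")
    print(pq.ParquetFile("pages.parquet").metadata)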

verdverm 4 hours ago | parent | prev | next [-]

Based on the table they show, that would be my inclination.

I wanted to do this for my own upvotes so I can see the kinds of things I like, or find them again more easily when relevant.

lazide 3 hours ago | parent | prev [-]

Compressed, pretty believable.