catapart 4 hours ago

Am I misunderstanding what a parquet file is, or are all of the HN posts along with the embedding metadata a total of 55GB?
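
If you have the file locally, one way to check is to sum the compressed bytes per column in the Parquet metadata. A minimal pyarrow sketch (the filename here is an assumption, not the actual dump's name):

    import pyarrow.parquet as pq

    # Hypothetical filename; substitute the actual dump.
    md = pq.ParquetFile("hn_with_embeddings.parquet").metadata

    # Total compressed bytes per column across all row groups,
    # to see how much of the file is text vs. embedding vectors.
    sizes = {}
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            chunk = md.row_group(rg).column(col)
            sizes[chunk.path_in_schema] = sizes.get(chunk.path_in_schema, 0) + chunk.total_compressed_size

    for name, nbytes in sorted(sizes.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {nbytes / 1e9:.2f} GB")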

gkbrk 3 hours ago | parent | next [-]

I imagine that's mostly embeddings, actually. My database has all the posts and comments from Hacker News, and the table takes up 17.68 GB uncompressed and 5.67 GB compressed.
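
For anyone who wants to build the same thing: every item is available from the official Firebase API. A minimal sketch (sequential, with no batching, retries, or rate handling):

    import json
    import urllib.request

    BASE = "https://hacker-news.firebaseio.com/v0"

    def fetch(path):
        with urllib.request.urlopen(f"{BASE}/{path}.json") as resp:
            return json.load(resp)

    max_id = fetch("maxitem")          # highest item id assigned so far
    item = fetch(f"item/{max_id}")     # one story/comment/job as JSON
    if item is not None:
        print(max_id, item.get("type"), len(json.dumps(item)), "bytes")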

catapart 3 hours ago | parent | next [-]

Wow! That's a really great point of reference. I always knew text-based social media(ish) stuff should be "small", but I never had any idea whether that meant a site like HN could store its content in 1-2 TB, or if it was more like a few hundred gigs or what. To learn that it's really only tens of gigs is very surprising!

ndriscoll 2 hours ago | parent | next [-]

Scraped Reddit text archives (~23B items, according to their corporate info page) are ~4 TB of compressed JSON, which includes metadata and not just the actual comment text.
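
Back-of-the-envelope from those two approximate figures, that works out to roughly 175 compressed bytes per item, metadata included:

    # Both figures are approximate, as stated above.
    total_bytes = 4e12   # ~4 TB of compressed JSON
    items = 23e9         # ~23B posts and comments
    print(total_bytes / items)   # ~174 bytes per item, compressed, metadata included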

osigurdson 3 hours ago | parent | prev [-]

I suspect the text alone would be a lot smaller. Embeddings add a lot: 4 KB or more per item, regardless of the size of the text.
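
That "4K" is just the raw vector size. For example, a 1024-dimensional float32 embedding (an assumed, typical shape, not necessarily what this dataset uses) is 4 KB per item before any compression:

    import numpy as np

    # Assumed embedding shape: 1024 dims of float32 (4 bytes each).
    dims = 1024
    per_item = dims * np.dtype(np.float32).itemsize   # 4096 bytes = 4 KB
    print(per_item / 1024, "KB per item")
    print(per_item * 1_000_000 / 1e9, "GB per million items, uncompressed")

And unlike text, near-random float vectors barely compress, so they tend to dominate the file size.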

atonse 3 hours ago | parent | prev [-]

That’s crazy small. So is it fair to say that words are actually the best compression algorithm we have? You can explain complex ideas in just a few hundred words.

Yes, a picture is worth a thousand words, but imagine how much information is in those 17GB of text.

binary132 3 hours ago | parent | next [-]

I don’t think I would really consider it compression if it’s not very reversible. Whatever people “uncompress” from my words isn’t necessarily what I was imagining or thinking about when I encoded them. I guess it’s more like a symbolic shorthand for meaning which relies on the second party to build their own internal model out of their own (shared public interface, but internal implementation is relatively unique…) symbols.

tiagod 36 minutes ago | parent [-]

It is compression, but it's lossy. Just like digital counterparts such as MP3 and JPEG, in some cases the final message can contain all the information you need.

binary132 16 minutes ago | parent [-]

But what’s getting reproduced in your head when you read what I’ve written isn’t what’s in my head at all. You have your own entire context, associations, and language.

_zoltan_ 3 hours ago | parent | prev [-]

how much?

simlevesque 3 hours ago | parent | prev | next [-]

You'd be surprised. I have a lot of text data, and Parquet files with brotli compression can achieve impressive file sizes.

Around 4 million web pages as markdown is like 1-2 GB.
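
A minimal sketch of what that looks like with pyarrow (toy data; the column names are just examples):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Toy table standing in for scraped pages converted to markdown.
    pages = pa.table({
        "url": ["https://example.com/a", "https://example.com/b"],
        "markdown": ["# Page A\n\nSome text...", "# Page B\n\nMore text..."],
    })

    # Brotli tends to compress repetitive text columns very well.
    pq.write_table(pages, "pages.parquet", compression="brotli")
    print(pq.ParquetFile("pages.parquet").metadata)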

verdverm 4 hours ago | parent | prev | next [-]

Based on the table they show, that would be my inclination.

I wanted to do this for my own upvotes so I can see the kinds of things I like, or find them again more easily when relevant.

lazide 3 hours ago | parent | prev [-]

Compressed, pretty believable.