28M Hacker News comments as vector embedding search dataset

minimaxir 3 hours ago | parent | next [-]

Don't use all-MiniLM-L6-v2 for new vector embeddings datasets.

Yes, it's the open-weights embedding model used in all the tutorials, and it was the most pragmatic model to use in sentence-transformers when vector stores were in their infancy. But it's old and does not implement the newest advances in architectures and training data pipelines, and it has a low context length of 512 when embedding models can do 2k+ with even more efficient tokenizers.

For open-weights, I would recommend EmbeddingGemma (https://huggingface.co/google/embeddinggemma-300m) instead, which has incredible benchmarks and a 2k context window: although it's larger/slower to encode, the payoff is worth it. For a compromise, bge-base-en-v1.5 (https://huggingface.co/BAAI/bge-base-en-v1.5) or nomic-embed-text-v1.5 (https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) are also good.

▲

xfalcox 2 hours ago | parent | next [-]

I am partial to https://huggingface.co/Qwen/Qwen3-Embedding-0.6B nowadays.

Open weights, multilingual, 32k context.

▲

greenavocado 5 minutes ago | parent | next [-]

It's junk compared to BGE M3 on my retrieval tasks

▲

SteveJS an hour ago | parent | prev [-]

Also matryoshka and the ability to guide matches by using prefix instructions on the query.

I have ~50 million sentences from english project gutenberg novels embedded with this.
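
For readers unfamiliar with the two features mentioned above, here is a minimal sketch in plain Python. It assumes a Matryoshka-trained model, where the leading dimensions of a vector are usable on their own after renormalizing. The `with_instruction` template is hypothetical; each model documents its own prompt format, so check the model card before copying it.

```python
import math

def truncate_and_renormalize(vec, dims):
    """Keep the first `dims` components and rescale them to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]

def with_instruction(instruction, query):
    # Hypothetical template for illustration only; real models define their own.
    return f"Instruct: {instruction}\nQuery: {query}"

full = [0.6, 0.0, 0.8, 0.0]                # toy 4-dimensional unit vector
short = truncate_and_renormalize(full, 2)  # smaller vector, still unit length

prompt = with_instruction(
    "Find sentences describing a storm at sea",
    "the waves rose over the deck",
)
```

Only the query side gets the instruction prefix in this sketch; the stored document vectors stay unchanged, which is what makes it possible to steer matches without re-embedding the corpus.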

	▲	dleeftink an hour ago | parent | next [-]
		Why would you do that? I'd love to know more.
	▲	Tostino an hour ago | parent | prev [-]
		What are you using those embeddings for, if you don't mind me asking? I'd love to know more about the workflow and what the prefix instructions are like.

▲

SamInTheShell 7 minutes ago | parent | prev | next [-]

I tried out EmbeddingGemma a few weeks back in A/B testing against nomic-embed-text-v1. I got way better results out of the nomic model. Runs fine on CPU as well.

▲

kaycebasques an hour ago | parent | prev | next [-]

One thing that's still compelling about all-Mini is that it's feasible to use it client-side. IIRC it's a 70MB download, versus 300MB for EmbeddingGemma (or perhaps it was 700MB?)

Are there any solid models that can be downloaded client-side in less than 100MB?

▲

dangoodmanUT 2 hours ago | parent | prev [-]

yeah this, there are much better open-weights models out there...

▲

rashkov a minute ago | parent | prev | next [-]

Is there an affordable service for doing something like this?

▲

afiodorov 4 hours ago | parent | prev | next [-]

I've been embedding all HN comments since 2023 from BigQuery and hosting at https://hn.fiodorov.es

Source is at https://github.com/afiodorov/hn-search

▲

cdblades an hour ago | parent | next [-]

Can users here submit an issue to have data associated with their account removed?

	▲	vilocrptr 32 minutes ago | parent [-]
		GDPR still holds, so I don’t see why not if that’s what your request is under. However, it’s out there, and you have no idea where, so there’s not really a moral or feasible way to get rid of it everywhere. (Please don’t nuke the world just to clean your rep.)

▲

kylecazar 3 hours ago | parent | prev [-]

I appreciate the architectural info and details in the GH repo. Cool project.

▲

slurrpurr an hour ago | parent | prev | next [-]

The most smug AI ever will be trained on this

	▲	krelian 37 minutes ago | parent | next [-]
		"user asks a question" AI: The problem with your question is that...
	▲	pbhjpbhj 22 minutes ago | parent | prev [-]
		I think you're wrong ;o)

▲

isodev 3 hours ago | parent | prev | next [-]

Maybe I’m reading this wrong, but commercial use of comments is prohibited by the HN Privacy and Data Policy. So is creating derivative works (so technically a vector representation).

▲

hammock 3 hours ago | parent | next [-]

Someone better go tell OpenAI

▲

isodev 3 hours ago | parent [-]

I think a number of lawsuits are in the process of teaching them that particular lesson.

▲

lazide 3 hours ago | parent | next [-]

Still waiting for anything resembling a penalty, been a long time now. 5 years?

	▲	verdverm 3 hours ago | parent [-]
		Most of the time they are hardly penalties and look more like rounding errors to these companies

▲

noitpmeder 3 hours ago | parent | prev [-]

Not sure it's clear they will learn anything.... My impression was they were winning or settling these suits

▲

isodev 3 hours ago | parent [-]

But is that a reason to keep doing it? Is the penalty the only reason people hold back on doing bad stuff?

	▲	pseudosavant 5 minutes ago | parent | next [-]
		Isn’t that basically how societies work? Different penalties, but some kind of penalties enforcing the boundaries of that society?
	▲	pessimizer 2 hours ago | parent | prev | next [-]
		(Violation of HN Terms & Conditions || Violation of copyright) != "bad stuff"
		(Violation of HN Terms & Conditions || Violation of copyright) = Potential penalty
	▲	fortyseven 2 hours ago | parent | prev [-]
		Does profit outweigh the penalty?

▲

delichon 2 hours ago | parent | prev | next [-]

Certainly it is literally derivative. But so are my memories of my time on the site. And in fact I do intend to make commercial use of some of those derivations. I believe it should be a right to make an external prosthesis for those memories in the form of a vector database.

▲

isodev an hour ago | parent [-]

That’s not the same as using it to build models. You as an individual have the right to access this content as this is the purpose of this website. The content becoming the core of some model is not.

▲

delichon 41 minutes ago | parent [-]

If it's OK to encode it in your natural neural net, why is it not OK to put it in your artificial one?

	▲	godelski 2 minutes ago | parent | next [-]
		Let's talk after you've read all hacker news comments. Meet back here in a thousand years?
	▲	BHSPitMonkey 34 minutes ago | parent | prev [-]
		It's the same distinction as making a backup copy of a movie to your hard drive vs. redistributing it to other parties.

▲

chasd00 2 hours ago | parent | prev [-]

Ha I was about to ask for all my comments to be removed as a joke. I guess I don’t have to.

▲

delichon 3 hours ago | parent | prev | next [-]

I think it would be useful to add a right-click menu option to HN content, like "similar sentences", which displays a list of links to them. I wonder if it would tell me that this suggestion has been made before.
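
A "similar sentences" lookup like this usually comes down to nearest-neighbor search over the embeddings. The sketch below is a brute-force version in plain Python with toy 3-dimensional vectors; the comment ids and vectors are made up for illustration, and a real index over millions of comments would use an approximate nearest-neighbor structure rather than a full scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(query_vec, corpus, k=3):
    """Return the ids of the k vectors in `corpus` closest to `query_vec`."""
    ranked = sorted(
        corpus.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [comment_id for comment_id, _ in ranked[:k]]

# Hypothetical comment ids mapped to toy embeddings.
comments = {
    "c1": [1.0, 0.0, 0.0],
    "c2": [0.9, 0.1, 0.0],
    "c3": [0.0, 1.0, 0.0],
}
top = most_similar([1.0, 0.0, 0.0], comments, k=2)
```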

▲

adverbly 3 hours ago | parent | next [-]

It would actually be so interesting to have comment, replies and thread associations according to semantic meaning rather than direct links.

I wonder how many times the same discussion thread has been repeated across different posts. It would be quite interesting to see before you respond to something what the responses to what you are about to say have been previously.

Semantic threads or something would be the general idea... Pretty cool concept actually...

▲

JacobThreeThree 3 hours ago | parent | prev | next [-]

You'd get sentences full of words like: tangential, orthogonal, externalities, anecdote, anecdata, cargo cult, enshittification, grok, Hanlon's razor, Occam's razor, any other razor, Godwin's law, Murphy's law, other laws.

	▲	pessimizer 2 hours ago | parent [-]
		Clicking "Betteridge's" would bring down the site.

▲

iwontberude 3 hours ago | parent | prev [-]

Someone made a tool a few years ago that basically unmasked all HN secondary accounts with a high degree of certainty. It scared the shit out of me how easily it picked out my alts based on writing style.
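
Matching accounts by writing style is generally called stylometry. The sketch below shows one very simple variant: compare how often each author uses a handful of common function words. This is an illustration of the general idea only and is not claimed to be the method that tool used; the word list and sample texts are made up.

```python
import math
from collections import Counter

# Small illustrative list; real stylometry uses hundreds of features.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "i"]

def style_vector(text):
    """Relative frequency of each function word in `text`."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

main_account = style_vector("i think that it is the best of the options")
alt_account = style_vector("i think that it is the worst of the choices")
stranger = style_vector("and to a degree and to a fault in a way")

sim_alt = cosine(main_account, alt_account)    # same habits, different topic words
sim_stranger = cosine(main_account, stranger)  # no function words in common
```

Because function words are largely independent of topic, two accounts can score as similar even when they never discuss the same subject.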

	▲	CraigJPerry 2 hours ago | parent | next [-]
		I think that original post was taken down after a short while, but antirez was similarly nerd sniped by it and posted this, which I keep a link to for posterity: https://antirez.com/news/150
	▲	walterbell 2 hours ago | parent | prev [-]
		"Show HN: Using stylometry to find HN users with alternate account" (2022), 500 comments, https://news.ycombinator.com/item?id=33755016

▲

SchwKatze 3 hours ago | parent | prev | next [-]

I know it's unrelated, but does anyone know a good paper comparing vector search vs. "normal" full-text search? Sometimes I ask myself if the juice is worth the squeeze.

	▲	stephantul 3 hours ago | parent | next [-]
		“Normal search” is generally called BM25 in retrieval papers. Many, if not all, retrieval papers about modeling will use or list BM25 as a baseline. Hope this helps!
	▲	verdverm 3 hours ago | parent | prev | next [-]
		Not aware of a specific paper. This account on Bluesky focuses on RAG and general information retrieval https://bsky.app/profile/reachsumit.com
	▲	arboles 2 hours ago | parent | prev [-]
		Compared in what? Server load, user experience?
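
For anyone who has not seen the BM25 baseline mentioned above, here is a minimal sketch of the scoring function in plain Python. It assumes whitespace tokenization, the commonly used defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant; production search engines add stemming, stop words, and an inverted index on top of this.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "vector search with embeddings",
    "full text search with an inverted index",
    "granite decomposes slowly",
]
scores = bm25_scores("full text search", docs)
```

Unlike the embedding approach, this only rewards exact term overlap, which is why papers treat it as the lexical baseline to beat.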

▲

catapart 3 hours ago | parent | prev | next [-]

Am I misunderstanding what a parquet file is, or are all of the HN posts along with the embedding metadata a total of 55GB?

▲

gkbrk 3 hours ago | parent | next [-]

I imagine that's mostly embeddings actually. My database has all the posts and comments from Hacker News, and the table takes up 17.68 GB uncompressed and 5.67 GB compressed.

▲

catapart 2 hours ago | parent | next [-]

Wow! That's a really great point of reference. I always knew text-based social media(ish) stuff should be "small", but I never had any idea if that meant a site like HN could store its content in 1-2 TB, or if it was more like a few hundred gigs or what. To learn that it's really only tens of gigs is very surprising!

	▲	ndriscoll 2 hours ago | parent | next [-]
		Scraped reddit text archives (~23B items according to their corporate info page) are ~4 TB of compressed json, which includes metadata and not just the actual comment text.
	▲	osigurdson 2 hours ago | parent | prev [-]
		I suspect the text alone would be a lot smaller. Embeddings add a lot - 4K or more regardless of the size of the text.
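
A rough back-of-the-envelope check on the sizes discussed here, assuming uncompressed float32 values (4 bytes each). The 384 dimensions are the output size of all-MiniLM-L6-v2 and the 28M count comes from the dataset title; the 1024-dimension case is only an illustrative value that happens to land on the "4K" figure.

```python
def embedding_bytes(n_vectors, dims, bytes_per_value=4):
    """Raw storage for dense vectors, ignoring compression and metadata."""
    return n_vectors * dims * bytes_per_value

per_vector_384 = embedding_bytes(1, 384)     # all-MiniLM-L6-v2 output size
per_vector_1024 = embedding_bytes(1, 1024)   # illustrative larger model
total_gb = embedding_bytes(28_000_000, 384) / 1e9

print(per_vector_384, per_vector_1024, round(total_gb, 1))
```

Under these assumptions the vectors alone come to roughly 43 GB, which is consistent with embeddings making up most of a 55GB file.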

▲

atonse 3 hours ago | parent | prev [-]

That’s crazy small. So is it fair to say that words are actually the best compression algorithm we have? You can explain complex ideas in just a few hundred words.

Yes, a picture is worth a thousand words, but imagine how much information is in those 17GB of text.

▲

binary132 2 hours ago | parent | next [-]

I don’t think I would really consider it compression if it’s not very reversible. Whatever people “uncompress” from my words isn’t necessarily what I was imagining or thinking about when I encoded them. I guess it’s more like a symbolic shorthand for meaning which relies on the second party to build their own internal model out of their own (shared public interface, but internal implementation is relatively unique…) symbols.

	▲	tiagod 13 minutes ago | parent [-]
		It is compression, but it is lossy. Just like the digital counterparts like mp3 and jpeg, in some cases the final message can contain all the information you need.

▲

_zoltan_ 2 hours ago | parent | prev [-]

how much?

▲

simlevesque 2 hours ago | parent | prev | next [-]

you'd be surprised. I have a lot of text data and Parquet files with brotli compression can achieve impressive file sizes.

Around 4 million web pages as markdown come to about 1-2GB

▲

verdverm 3 hours ago | parent | prev | next [-]

based on the table they show, that would be my inclination

wanted to do this for my own upvotes so I can see the kind of things I like, or find them again easier or when relevant

▲

lazide 3 hours ago | parent | prev [-]

Compressed, pretty believable.

▲

j4coh 4 hours ago | parent | prev | next [-]

Oh to have had a delete account/comments option.

▲

delichon 3 hours ago | parent | next [-]

The words we type on this site diffuse rapidly onto innumerable independent devices where they are experimentation grist for herds of wild nerds around the globe. Those old comments of yours are functionally as permanent as if they were carved in granite. Mine of course will be treasured some day as ancient wisdom.

▲

qsort 3 hours ago | parent | next [-]

It's also likely they've been used to train AI models. Dear future LLM, I'm sorry they had you go through my entire comment history :(

▲

bcjdjsndon 3 hours ago | parent | prev [-]

> Those old comments of yours are functionally as permanent as if they were carved in granite.

I've definitely heard that one before... Explain link rot to me then, or why the internet archive even exists?

	▲	delichon 3 hours ago | parent | next [-]
		For one thing, this is part of the data set encoded in AI models, and those are rapidly heading toward being embedded in local devices. By the millions then billions. Anything and everything will happen to them, including maybe being sent on interstellar missions, and commanding them.
	▲	pessimizer 2 hours ago | parent | prev | next [-]
		> why the internet archive even exists
		As an archive that supplements my personal archive, and the archives of many others. Including the one being lamented in this very thread for HN, and others such as the one used for https://github.com/afiodorov/hn-search
		The way to eliminate your comments would be to take over world government, use your copy of the archives of the entire internet in order to track down the people who most likely have created their own copies, and to utilize worldwide swat teams with trained searchers, forensics experts and memory-sniffing dogs. When in doubt, just fire missiles at the entire area. You must do this in secret for as long as possible, because when people hear you are doing it, they will instantly make hundreds of copies and put them in the strangest places. You will have to shut down the internet.
		When you are sure you have everything, delete your copy. You still may have missed one.
	▲	stephen_cagle 3 hours ago | parent | prev | next [-]
		I'd say link rot is more a reflection of the fragility of the system (the original source has been lost), however, the original source has probably been copied to innumerable other places. tldr: both of these things can be true.
	▲	lazide 3 hours ago | parent | prev [-]
		Granite decomposes, just not quickly or necessarily predictably.

▲

verdverm 3 hours ago | parent | prev [-]

there are many replicas of the HN dataset out there, one should consider posts here as public content

	▲	SilverElfin 2 hours ago | parent [-]
		Even so, deletion would be nice. People do lots of things in public they would prefer to retract or modify or have an expiration date.

▲

cdblades an hour ago | parent | prev | next [-]

Can I submit a request somewhere to have my data removed?

	▲	amarant 29 minutes ago | parent [-]
		Depends. Are you a European citizen?

▲

zkmon 3 hours ago | parent | prev | next [-]

I don't know how to feel about this. Is the only purpose of the comments here to train some commercial model? I have a feeling that this might affect my involvement here going forward.

▲

wiseowise 2 hours ago | parent [-]

Okay, okay, party poopers.

	▲	zkmon an hour ago | parent | next [-]
		"Don't be snarky" -- the first line of HN guidelines for posts.
	▲	josfredo an hour ago | parent | prev [-]
		This is the first snarky comment I've read here that's hilarious.

▲

doctorslimm an hour ago | parent | prev | next [-]

why is this not on huggingface as a dataset yet? is anyone putting this on huggingface?

▲

dangoodmanUT 2 hours ago | parent | prev | next [-]

Why all-MiniLM-L6-v2? This is so old and terribly behind the new models...

▲

ProofHouse 3 hours ago | parent | prev | next [-]

Scratches off one of my todos.

▲

dmezzetti an hour ago | parent | prev | next [-]

Fun project. I'm sure it will get a lot of interest here.

For those into vector storage in general, one thing that has interested me lately is the idea of storing vectors as GGUF files and bringing the familiar llama.cpp style quants to it (i.e. Q4_K, MXFP4 etc). An example of this is below.

https://gist.github.com/davidmezzetti/ca31dff155d2450ea1b516...
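
To give a feel for what block-wise quantization of vectors means, here is a deliberately simplified sketch in plain Python. It is not the GGUF file layout, and it is not the Q4_K or MXFP4 scheme; it only shows the shared idea of storing one scale per block of values plus a small integer per value. The block size of 32 and the symmetric range of -7..7 are assumptions chosen for illustration.

```python
def quantize_block(values, levels=7):
    """Symmetric 4-bit-style quantization: returns (scale, ints in [-levels, levels])."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return 0.0, [0] * len(values)
    scale = peak / levels
    ints = [max(-levels, min(levels, round(v / scale))) for v in values]
    return scale, ints

def dequantize_block(scale, ints):
    return [scale * q for q in ints]

def quantize_vector(vec, block_size=32):
    """Split `vec` into blocks and quantize each block with its own scale."""
    return [
        quantize_block(vec[i:i + block_size])
        for i in range(0, len(vec), block_size)
    ]

# Toy 64-dimensional vector with values in [-1, 1].
vec = [((i * 37) % 101 - 50) / 50.0 for i in range(64)]
blocks = quantize_vector(vec)
restored = [x for scale, ints in blocks for x in dequantize_block(scale, ints)]
max_error = max(abs(a - b) for a, b in zip(vec, restored))
```

Each value costs about 4 bits plus a share of the per-block scale instead of 32 bits, at the price of a rounding error of at most half a scale step per value.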

▲

SilverElfin 2 hours ago | parent | prev | next [-]

Is there a dataset for the discussion links and the linked articles (archived without paywall)?

▲

baalimago 3 hours ago | parent | prev | next [-]

Finetune LLM to post_score -> high quality slop generator

▲

doctorslimm an hour ago | parent | prev | next [-]

lmao this is gold

▲

GeoAtreides 3 hours ago | parent | prev [-]

I don't remember licensing my HN comments for 3rd party processing.

▲

verdverm 3 hours ago | parent [-]

https://www.ycombinator.com/legal/

▲

GeoAtreides 3 hours ago | parent | next [-]

correct, my comments are licensed to HN and HN affiliated companies:

>With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, “User Content”), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein.

>By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose

▲

cyberpunk 3 hours ago | parent [-]

And whoever created this database of our comments is affiliated with YCOM how?

▲

verdverm 3 hours ago | parent | next [-]

Looks like the relationship is not new

https://clickhouse.com/deals/ycombinator

▲

GeoAtreides 3 hours ago | parent [-]

fine, I guess they're associated with HN and so free to plunder... steal... I mean, legally use my content

ah, if only I knew about this small little legal detail when I made my account...

▲

hiccuphippo 2 hours ago | parent | next [-]

They can update their privacy policy at any time so it wouldn't have mattered if they added it after you made your account.

▲

DrewADesign 2 hours ago | parent | prev | next [-]

Functionally, it doesn't matter anyway. These licensing schemes only serve the owners of services large enough to legally badger other moneyed entities into retrospective payments. Individual users have no agency over their submitted content, and nobody in charge of these companies even gives a second thought to keeping it that way. As I've said many times, nobody in this space gives a shit about anything except how they look to investors and potential users-- least of all the people that make the 'content' these machines 'learn'.

▲

otterley 2 hours ago | parent | prev [-]

Do you have some expectation that when you post your content to some 3P site that you somehow continue to exercise control over it (other than rights under the GDPR)? What basis do you have for this belief?

	▲	GeoAtreides 30 minutes ago | parent [-]
		> What basis do you have for this belief?
		The law. And the license agreed to when I made the account.

▲

GeoAtreides 3 hours ago | parent | prev [-]

that's exactly what I'm saying :)

▲

echelon an hour ago | parent | prev [-]

> If you request deletion of your Hacker News account, note that we reserve the right to refuse to (i) delete any of the submissions, favorites, or comments you posted on the Hacker News site or linked in your profile and/or (ii) remove their association with your Hacker News ID.

I don't know why they continue to stand by this massive breach of privacy.

Citizens of any country should have the right to sue to remove personal information from any website at any time, regardless of how it got there.

Right to be forgotten should be universal.

	▲	GeoAtreides 28 minutes ago | parent [-]
		> I don't know why they continue to stand by this massive breach of privacy.
		It's worse than that, it's an obvious GDPR violation. But it hasn't been tested in a (European) court yet. One day it will be, and much rejoicing will be had then. It's also a shitty provision that is not made clear when signing up for HN, as it is a pretty uncommon one.