| ▲ | xfalcox 3 days ago |
Also worth mentioning that we use quantization extensively:

- halfvec (16-bit float) for storage
- bit (binary vectors) for indexes

That makes the storage cost and ongoing performance good enough that we could enable this in all our hosting.
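In case it helps, here is a rough numpy sketch of what those two levels of quantization amount to size-wise. The names and the random vector are just stand-ins; in Postgres/pgvector the halfvec and bit column types hold this for you:

```python
import numpy as np

emb = np.random.randn(1024).astype(np.float32)    # stand-in for a real 1024-D embedding

halfvec = emb.astype(np.float16)                  # half precision, as a halfvec column would store
bitvec = np.packbits(emb > 0)                     # 1 bit per dimension (sign only), as a bit index would store

print(emb.nbytes, halfvec.nbytes, bitvec.nbytes)  # 4096, 2048, 128 bytes
```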
| ▲ | simonw 3 days ago | parent | next [-] |
It still amazes me that the binary trick works. For anyone who hasn't seen it yet: it turns out many embedding vectors of e.g. 1024 floating point numbers can be reduced to a single bit per value that records if it's higher or lower than 0... and in this reduced form much of the embedding math still works! This means you can e.g. filter to the top 100 using extremely memory-efficient and fast bit vectors, then run a more expensive distance calculation against those top 100 with the full floating point vectors to pick the top 10.
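A minimal sketch of that over-retrieve-then-rerank pattern in numpy. This is brute force for clarity; in practice the first stage would be served by a bit-vector index (e.g. pgvector's), and the array names here are made up:

```python
import numpy as np

def search(query: np.ndarray, full_vecs: np.ndarray, bit_vecs: np.ndarray,
           shortlist: int = 100, k: int = 10) -> np.ndarray:
    """Two-stage search: cheap Hamming prefilter, then exact cosine rerank."""
    # Stage 1: Hamming distance on packed bit vectors (XOR, then count set bits).
    q_bits = np.packbits(query > 0)
    hamming = np.unpackbits(bit_vecs ^ q_bits, axis=1).sum(axis=1)
    candidates = np.argsort(hamming)[:shortlist]

    # Stage 2: full-precision cosine similarity, but only on the shortlist.
    cand = full_vecs[candidates]
    sims = cand @ query / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query))
    return candidates[np.argsort(-sims)[:k]]

docs = np.random.randn(10_000, 1024).astype(np.float32)   # full-precision vectors
bits = np.packbits(docs > 0, axis=1)                       # 128 bytes per doc instead of 4 KB
top10 = search(np.random.randn(1024).astype(np.float32), docs, bits)
```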
| ▲ | xfalcox 3 days ago | parent | next [-] |
I was taken aback when I saw basically zero recall loss on the real-world task of finding related topics, doing the same thing you described: over-capture with binary embeddings, and only use the full (or half) precision on that subset. Making the storage cost of the index 32 times smaller is what lets us offer this at scale without worrying too much about the overhead.
| ▲ | Someone 2 days ago | parent [-] |
> I was taken aback when I saw basically zero recall loss on the real-world task of finding related topics

By collapsing the values to a single bit you're lumping together stuff that was different before, so you may pull in extra candidates, but you shouldn't lose the ones you wanted; I wouldn't expect recall loss, only a possible hit to precision.

Also: even if your vectors are only 100-dimensional, there are already 2^100 different bit vectors. That's over 10^30. If your dataset isn't gigantic and its documents are even moderately dispersed in that space, the likelihood of many sharing the same bit vector isn't large.
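To put rough numbers on that counting argument (optimistically assuming the sign patterns are close to uniformly spread, which real embeddings are not):

```python
# Expected number of colliding pairs if N documents fall uniformly
# into 2^d possible bit patterns (birthday-problem approximation).
N, d = 10_000_000, 100
buckets = 2 ** d
expected_collisions = N * (N - 1) / (2 * buckets)
print(f"{expected_collisions:.1e}")   # ~3.9e-17: effectively no collisions
```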
| ▲ | barrkel 2 days ago | parent [-] |
And if dispersion isn't good, it would be worthwhile running the vectors through another model trained to disperse them.
| ▲ | tveita 3 days ago | parent | prev | next [-] |
Depending on your data you might also get better results by applying a random rotation to your vectors before quantization. https://ieeexplore.ieee.org/abstract/document/6296665/ (https://refbase.cvc.uab.cat/files/GLG2012b.pdf)
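The linked paper (ITQ) learns the rotation; even a plain random orthogonal rotation captures the basic idea of spreading variance across dimensions before taking signs. A hedged numpy sketch (the same matrix must be applied to documents and queries):

```python
import numpy as np

rng = np.random.default_rng(0)
# Random orthogonal matrix via QR decomposition of a Gaussian matrix.
R, _ = np.linalg.qr(rng.standard_normal((1024, 1024)))

def quantize(v: np.ndarray) -> np.ndarray:
    # Rotate first so no single axis dominates, then keep only the signs.
    return np.packbits((R @ v) > 0)
```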
| ▲ | FuckButtons 3 days ago | parent | prev | next [-] |
Why is this amazing? It's just a 1-bit lossy compression of the original information. If you have a vector in n-dimensional space, this is effectively just recording the sign of each component along the basis vectors.
| ▲ | simonw 3 days ago | parent [-] |
You can take 4096 bytes of information (1024 x 32-bit floats) and reduce that to 128 bytes (1024 bits, a 32x reduction in size!) and still get results that are about 95% as good. I find that cool and surprising.
| ▲ | sa-code 3 days ago | parent | next [-] |
I'm with you, it's very satisfying to see a simple technique work well. It's impressive.
| ▲ | computably 3 days ago | parent | prev [-] |
1024 bits for a hash is pretty roomy. The embedding "just" has to be well-distributed across enough of the dimensions.
| ▲ | ImPostingOnHN 3 days ago | parent [-] |
Yeah, that's what I was thinking: did we ever think 32 bits across each of the 1024 dimensions would be necessary? Maybe 32768 bits is adding unnecessary precision to what is ~1024 bits of information in the first place.
| ▲ | FuckButtons 2 days ago | parent [-] |
That's a much more interesting question. I wonder if there is a way to put a lower bound on the number of bits you could use?
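One empirical way to probe that is to sweep the bit budget and watch where recall against exact search flattens out. A rough sketch, assuming random projections before the sign step and random data standing in for real embeddings (on real embeddings the shape of the curve is what matters, not these numbers):

```python
import numpy as np

def recall_at_k(db: np.ndarray, queries: np.ndarray, n_bits: int, k: int = 10) -> float:
    """Sign-quantize a random n_bits-dim projection and compare its top-k
    against exact top-k by dot product."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((db.shape[1], n_bits))   # random projection to n_bits dims
    db_bits = (db @ proj) > 0
    q_bits = (queries @ proj) > 0

    hits = 0
    for q, qb in zip(queries, q_bits):
        exact = np.argsort(-(db @ q))[:k]                # full-precision ground truth
        approx = np.argsort((db_bits != qb).sum(axis=1))[:k]  # Hamming top-k
        hits += len(set(exact) & set(approx))
    return hits / (k * len(queries))

db = np.random.randn(20_000, 1024).astype(np.float32)
qs = np.random.randn(50, 1024).astype(np.float32)
for n in (64, 128, 256, 512, 1024):
    print(n, recall_at_k(db, qs, n))
```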
| ▲ | 3abiton 3 days ago | parent | prev [-] |
Now that you mention that, I wonder if LSH would perform better with a slightly higher memory footprint.
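Sign quantization is itself the classic random-hyperplane (SimHash) LSH with the hyperplanes fixed to the coordinate axes; bucketing with several shorter hashes is the version that trades extra memory for fewer candidates to rerank. A rough sketch with arbitrary parameters:

```python
import numpy as np
from collections import defaultdict

class HyperplaneLSH:
    """Random-hyperplane (SimHash-style) LSH: several short sign hashes per vector."""
    def __init__(self, dim: int, n_tables: int = 8, bits_per_table: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_tables, bits_per_table, dim))
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _keys(self, v: np.ndarray):
        for t in range(len(self.planes)):
            yield t, ((self.planes[t] @ v) > 0).tobytes()   # sign pattern as bucket key

    def add(self, idx: int, v: np.ndarray) -> None:
        for t, key in self._keys(v):
            self.tables[t][key].append(idx)

    def candidates(self, v: np.ndarray) -> set:
        # Union of matching buckets across tables; rerank these with full vectors.
        return {i for t, key in self._keys(v) for i in self.tables[t].get(key, ())}
```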
| ▲ | summarity 3 days ago | parent | prev | next [-] |
That's where it's at. I'm using the 1600D vectors from OpenAI models for findsight.ai, stored SuperBit-quantized. Even without fancy indexing, a full scan (1 search vector -> 5M stored vectors) takes less than 40ms. And with basic binning, it's nearly instant.
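For a sense of why a full scan over a few million binary codes can be that quick: it's essentially XOR plus popcount over one contiguous array. A rough numpy sketch using plain sign-bit codes rather than SuperBit, with sizes mirroring the 1600-D case (1600 bits = 200 bytes per vector, ~1 GB for 5M vectors):

```python
import numpy as np

# Popcount via a 256-entry lookup table (works on any numpy version).
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def full_scan(query_bits: np.ndarray, db_bits: np.ndarray, k: int = 100) -> np.ndarray:
    """Brute-force Hamming scan: XOR against every stored code, then popcount."""
    dists = POPCOUNT[db_bits ^ query_bits].sum(axis=1)
    return np.argpartition(dists, k)[:k]          # top-k candidates to rerank

db = np.random.randint(0, 256, size=(5_000_000, 200), dtype=np.uint8)  # fake bit index
q = np.random.randint(0, 256, size=200, dtype=np.uint8)
top = full_scan(q, db)
```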
| ▲ | mfrye0 3 days ago | parent | prev [-] |
I was going to say the same. We're using binary vectors in prod as well. Makes a huge difference in the indexes. This wasn't mentioned once in the article.