| ▲ | gdiamos 3 days ago |
| Their idea is that the capacity of even 4096-wide vectors limits their performance. Sparse models like BM25 have a huge dimensionality and thus don’t suffer from this limit, but they don’t capture semantics and can’t follow instructions. It seems like the holy grail is a sparse semantic model. I wonder how SPLADE would do? |
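For anyone who hasn't seen how SPLADE gets its sparsity: a minimal sketch of the encoding step, assuming the publicly available naver/splade-cocondenser-ensembledistil checkpoint and the log-saturation + max-pooling described in the SPLADE papers (the details here are illustrative, not the article's method):

    # SPLADE-style sparse encoding: one weight per vocabulary entry, mostly zeros.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    model_id = "naver/splade-cocondenser-ensembledistil"  # assumed checkpoint
    tok = AutoTokenizer.from_pretrained(model_id)
    mlm = AutoModelForMaskedLM.from_pretrained(model_id)

    def splade_encode(text: str) -> torch.Tensor:
        batch = tok(text, return_tensors="pt")
        with torch.no_grad():
            logits = mlm(**batch).logits               # (1, seq_len, vocab_size)
        # log-saturated ReLU, then max-pool over tokens -> sparse vocab-sized vector
        weights = torch.log1p(torch.relu(logits))
        weights = weights * batch["attention_mask"].unsqueeze(-1)
        return weights.max(dim=1).values.squeeze(0)    # (vocab_size,)

    vec = splade_encode("sparse semantic retrieval")
    print((vec > 0).sum().item(), "non-zero terms out of", vec.shape[0])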
|
| ▲ | tkfoss 3 days ago | parent | next [-] |
| Wouldn't the holy grail then be parallel channels for candidate generation:
- Euclidean embedding
- hyperbolic embedding
- sparse BM25 / SPLADE lexical search
- optional multi-vector signatures
↓ merge & deduplicate candidates
followed by weighted scoring, graph expansion & LLM rerank? (rough skeleton of the merge step sketched below) |
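A rough skeleton of that merge-and-dedupe stage, with the individual channels left as hypothetical callables (dense_search, hyperbolic_search, splade_search and rerank_with_llm are placeholders, not real APIs):

    from collections import defaultdict
    from typing import Callable, Iterable

    Candidate = tuple[str, float]  # (doc_id, score from one channel)

    def hybrid_candidates(query: str,
                          channels: Iterable[Callable[[str], list[Candidate]]],
                          k: int = 100) -> dict[str, list[float]]:
        """Run every channel, then merge by doc_id so duplicates collapse into one entry."""
        merged: dict[str, list[float]] = defaultdict(list)
        for search in channels:
            for doc_id, score in search(query)[:k]:
                merged[doc_id].append(score)
        return dict(merged)

    # Downstream you would weight/normalize the per-channel scores, optionally expand
    # the pool via a graph of related docs, and hand the survivors to an LLM reranker:
    #   pool = hybrid_candidates(q, [dense_search, hyperbolic_search, splade_search])
    #   final = rerank_with_llm(q, pool)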
| |
| ▲ | jdthedisciple 3 days ago | parent [-] | | That is pretty much exactly what we do for our company-internal knowledge retrieval:
- embedding search (0.4)
- lexical/keyword search (0.4)
- fuzzy search (0.2)
It might indeed achieve the best of all worlds (a sketch of that weighting is below). |
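A minimal sketch of what that 0.4 / 0.4 / 0.2 mix could look like, assuming per-channel min-max normalization before mixing (the channel names and the normalization choice are assumptions, not a description of their actual system):

    def fuse(channel_scores: dict[str, dict[str, float]],
             weights: dict[str, float]) -> dict[str, float]:
        """Weighted fusion of per-channel scores, normalized per channel first."""
        fused: dict[str, float] = {}
        for channel, scores in channel_scores.items():
            if not scores:
                continue
            lo, hi = min(scores.values()), max(scores.values())
            span = (hi - lo) or 1.0
            for doc_id, s in scores.items():
                fused[doc_id] = fused.get(doc_id, 0.0) + weights[channel] * (s - lo) / span
        return dict(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))

    ranked = fuse(
        {"embedding": {"d1": 0.91, "d2": 0.72},
         "lexical":   {"d2": 11.3, "d3": 7.8},
         "fuzzy":     {"d1": 0.6}},
        weights={"embedding": 0.4, "lexical": 0.4, "fuzzy": 0.2},
    )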
|
|
| ▲ | faxipay349 a day ago | parent | prev | next [-] |
| I just came across an evaluation of state-of-the-art SPLADE models. Yeah, they use BERT's vocabulary size as their sparse vector dimensionality and do capture semantics. As expected, they significantly outperform all dense models in this benchmark. https://github.com/frinkleko/LIMIT-Sparse-Embedding
The OpenSearch team seems to have been working on inference-free versions of these models. Similar to BM25, these models only encode documents offline. So now we have sparse, small, and efficient models that are much better than dense ones, at least on LIMIT. |
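A toy illustration of the inference-free idea, assuming the doc-side sparse vectors are precomputed offline and the query side is plain tokenization (the index layout here is made up for the example, not OpenSearch's format):

    # each doc stores a term -> weight map produced by the encoder at indexing time
    doc_index = {
        "doc1": {"vector": 0.8, "database": 1.3, "retrieval": 0.9},
        "doc2": {"keyword": 1.1, "retrieval": 0.4},
    }

    def score(query: str, doc_terms: dict[str, float]) -> float:
        # query terms carry no learned weights; all the weighting lives in the doc vectors
        return sum(doc_terms.get(tok, 0.0) for tok in query.lower().split())

    ranked = sorted(doc_index, key=lambda d: score("retrieval database", doc_index[d]), reverse=True)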
|
| ▲ | CuriouslyC 3 days ago | parent | prev [-] |
| We already have "sparse" embeddings. Google's Matryoshka embedding schema can scale embeddings from ~150 dimensions to >3k, and it's the same embedding with layers of representational meaning. Imagine decomposing an embedding along principle components, then streaming the embedding vectors in order of their eigenvalue, kind of the idea. |
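In practice, using a Matryoshka embedding at a smaller size just means keeping a prefix of the vector and re-normalizing; a small numpy sketch (the 3072 and 256 sizes are arbitrary examples, not from the thread):

    import numpy as np

    def truncate_mrl(emb: np.ndarray, k: int) -> np.ndarray:
        """Keep the first k dimensions of an MRL-trained embedding and re-normalize."""
        small = emb[:k]
        return small / np.linalg.norm(small)

    full = np.random.randn(3072)
    full /= np.linalg.norm(full)
    short = truncate_mrl(full, 256)   # still dense: the same 256 coordinates for every doc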
| |
| ▲ | miven 3 days ago | parent | next [-] | | Correct me if I'm misinterpreting something in your argument, but as I see it Matryoshka embeddings just sort the basis vectors of the output space roughly by their importance for the task, PCA-style. So when you truncate your 4096-dimensional embedding down to, say, 256 dimensions, those are the exact same 256 basis vectors doing the core job of encoding the important information for each sample, and you're back to dense retrieval on 256-dimensional vectors, just with the minor miscellaneous slack that's useful for a small fraction of queries trimmed away. True sparsity would imply keeping different important basis vectors for different documents, but MRL doesn't magically shuffle basis vectors around depending on what your document contains; were that the case, cosine similarity between the resulting document embeddings would simply make no sense as a similarity measure. | |
| ▲ | jxmorris12 3 days ago | parent | prev | next [-] | | Matryoshka embeddings are not sparse. And SPLADE can scale to tens or hundreds of thousands of dimensions. | | |
| ▲ | faxipay349 a day ago | parent | next [-] | | Yeah, the standard SPLADE model trained from BERT typically already has a vocabulary/vector size of 30,522. If the SPLADE model is based on a multilingual version of BERT, such as mBERT or XLM-R, the vocabulary size inherently expands to approximately 100,000, and the vector size with it. | |
| ▲ | CuriouslyC 3 days ago | parent | prev [-] | | If you consider the full higher-dimensional representation to be the actual latent space and you take just the first principal component, all the other components are zero. Pretty sparse. No, it's not a linked-list sparse matrix. Don't be a pedant. | | |
| ▲ | yorwba 3 days ago | parent | next [-] | | When you truncate Matryoshka embeddings, you get the storage benefits of low-dimensional vectors with the limited expressiveness of low-dimensional vectors. Usually, what people look for in sparse vectors is to combine the storage benefits of low-dimensional vectors with the expressiveness of high-dimensional vectors. For that, you need the non-zero dimensions to be different for different vectors. | |
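A toy illustration of that difference: truly sparse vectors let each document keep its own handful of non-zero dimensions out of a huge space, e.g. stored as index-to-weight maps (the sizes and values here are made up):

    def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
        """Dot product over the (usually tiny) intersection of non-zero dimensions."""
        if len(b) < len(a):
            a, b = b, a
        return sum(w * b[i] for i, w in a.items() if i in b)

    # two docs in a ~30k-dimensional space, but each stores only the dimensions it uses
    doc_a = {17: 0.9, 4021: 1.2, 29876: 0.3}
    doc_b = {17: 0.4, 512: 0.7}
    print(sparse_dot(doc_a, doc_b))   # only dimension 17 overlaps -> 0.36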
| ▲ | zwaps 3 days ago | parent | prev [-] | | No one means Matryoshka embeddings when they talk about sparse embeddings. This is not pedantic. | | |
|
| |
| ▲ | 3abiton 3 days ago | parent | prev [-] | | Doesn't PCA compress the embeddings in this case, i.e. reduce the accuracy? It's similar to quantization. | | |
| ▲ | CuriouslyC 3 days ago | parent [-] | | Component analysis doesn't fundamentally reduce information, it just rotates it into a more informative basis. People usually drop the components with the smallest eigenvalues to do dimensionality reduction, but you don't have to do that. |
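The rotation-versus-truncation distinction, sketched with numpy: projecting onto the full set of principal components is an invertible rotation, and information is only lost once the low-eigenvalue components are dropped (the 64/16 sizes are arbitrary):

    import numpy as np

    X = np.random.randn(1000, 64)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt = principal directions

    rotated = Xc @ Vt.T                                 # full rotation, still 64 dims
    assert np.allclose(rotated @ Vt, Xc)                # perfectly recoverable: no info lost

    truncated = Xc @ Vt[:16].T                          # keep only the top 16 components
    approx = truncated @ Vt[:16]                        # this reconstruction is lossy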
|
|