jxmorris12 3 days ago

Matryoshka embeddings are not sparse. And SPLADE can scale to tens or hundreds of thousands of dimensions.

faxipay349 a day ago

Yeah, the standard SPLADE model trained from BERT already has a vocabulary, and hence vector, size of 30,522. If the SPLADE model is based on a multilingual variant of BERT such as mBERT or XLM-R, the vocabulary inherently grows to roughly 120,000 or 250,000 tokens respectively, and the vector size grows with it.
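
For concreteness, here's a minimal sketch of how a SPLADE model produces those vocabulary-sized sparse vectors. The checkpoint name is just an example (naver/splade-cocondenser-ensembledistil, one of the public SPLADE checkpoints); any SPLADE-style masked-LM head is pooled the same way:

    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # Example checkpoint; any SPLADE-style MLM checkpoint works the same way.
    model_id = "naver/splade-cocondenser-ensembledistil"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)

    inputs = tokenizer("sparse retrieval with learned term weights",
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # (1, seq_len, vocab_size=30522)

    # SPLADE pooling: log-saturated ReLU, then max over the sequence,
    # masked so padding tokens contribute nothing.
    weights = torch.log1p(torch.relu(logits))
    weights = weights * inputs["attention_mask"].unsqueeze(-1)
    sparse_vec = weights.max(dim=1).values.squeeze(0)  # (vocab_size,)

    print((sparse_vec > 0).sum().item(), "active dims of", sparse_vec.numel())

Typically only a small fraction of the 30,522 dimensions come out non-zero for a short input, which is what makes these vectors cheap to store in an inverted index despite the huge nominal dimensionality.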

CuriouslyC 3 days ago

If you consider the full higher-dimensional representation the actual latent space, then truncation just keeps the leading principal components and the remaining coordinates are zero. Pretty sparse. No, it's not a linked-list sparse matrix. Don't be a pedant.

yorwba 3 days ago

When you truncate Matryoshka embeddings, you get the storage benefits of low-dimensional vectors with the limited expressiveness of low-dimensional vectors. Usually, what people look for in sparse vectors is to combine the storage benefits of low-dimensional vectors with the expressiveness of high-dimensional vectors. For that, you need the non-zero dimensions to be different for different vectors.
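
A toy numpy contrast of the two regimes, illustrative only and not tied to any particular library's API:

    import numpy as np

    DIM = 30_522  # e.g. a BERT-sized vocabulary
    rng = np.random.default_rng(0)

    # Truncated Matryoshka: every vector is dense in the *same* leading
    # 64 dimensions, so the usable space really is 64-dimensional.
    mat_a = np.zeros(DIM)
    mat_a[:64] = rng.standard_normal(64)
    mat_b = np.zeros(DIM)
    mat_b[:64] = rng.standard_normal(64)

    # Sparse embeddings: the same storage budget (64 non-zeros), but each
    # vector picks its own dimensions out of all 30,522.
    def random_sparse(k=64):
        idx = rng.choice(DIM, size=k, replace=False)
        return dict(zip(idx.tolist(), rng.random(k).tolist()))

    sp_a, sp_b = random_sparse(), random_sparse()
    print(len(set(sp_a) | set(sp_b)), "distinct dims across two sparse vectors")

Two truncated vectors always share the same 64 dimensions; two sparse vectors with the same storage cost will almost surely use close to 128 distinct ones, which is where the extra expressiveness comes from.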

zwaps 3 days ago

No one means Matryoshka embeddings when they talk about sparse embeddings. This is not pedantic.

CuriouslyC 3 days ago

No one means wolves when they talk about dogs, obviously wolves and dogs are TOTALLY different things.

cap11235 3 days ago

Why?