minimaxir 3 hours ago

Don't use all-MiniLM-L6-v2 for new vector embedding datasets.

Yes, it's the open-weights embedding model used in all the tutorials, and it was the most pragmatic choice in sentence-transformers when vector stores were in their infancy. But it's old: it doesn't incorporate the newest advances in architectures and data/training pipelines, and its context length is capped at 512 tokens when current embedding models handle 2k+ with more efficient tokenizers.

For open weights, I would recommend EmbeddingGemma (https://huggingface.co/google/embeddinggemma-300m) instead: it benchmarks very well and has a 2k context window, and although it's larger and slower to encode, the payoff is worth it. As a compromise, bge-base-en-v1.5 (https://huggingface.co/BAAI/bge-base-en-v1.5) and nomic-embed-text-v1.5 (https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) are also good.
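For anyone swapping models, a minimal sketch of the change via sentence-transformers (model name from the link above; model.similarity assumes sentence-transformers >= 3.0, and the gated Gemma weights may require accepting the license on Hugging Face first):

    # Minimal sketch: encode with EmbeddingGemma instead of all-MiniLM-L6-v2.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("google/embeddinggemma-300m")
    docs = [
        "Vector stores index embeddings for similarity search.",
        "all-MiniLM-L6-v2 truncates inputs at 512 tokens.",
    ]
    embeddings = model.encode(docs)  # one 768-dim vector per document
    print(model.similarity(embeddings, embeddings))  # pairwise cosine similarity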

xfalcox 3 hours ago | parent | next [-]

I am partial to https://huggingface.co/Qwen/Qwen3-Embedding-0.6B nowadays.

Open weights, multilingual, 32k context.

SteveJS 2 hours ago | parent | next [-]

It also supports Matryoshka embeddings, and you can guide matches by using prefix instructions on the query.

I have ~50 million sentences from English Project Gutenberg novels embedded with this.
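For the curious, a minimal sketch of both features with sentence-transformers (prompt_name="query" and truncate_dim are per the Qwen3-Embedding model card and the sentence-transformers API; the example strings are made up):

    # Minimal sketch: Matryoshka truncation plus query-side prefix instructions
    # with Qwen/Qwen3-Embedding-0.6B. Assumes sentence-transformers >= 2.7.
    from sentence_transformers import SentenceTransformer

    # Matryoshka: keep only the first 256 of the 1024 output dimensions.
    model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", truncate_dim=256)

    docs = ["He strode along the cliff path toward the lighthouse."]
    doc_emb = model.encode(docs)  # documents are encoded as-is
    # Queries get an instruction prefix that steers matching; prompt_name="query"
    # applies the model's built-in retrieval instruction.
    query_emb = model.encode(["a character walking by the sea"],
                             prompt_name="query")
    print(model.similarity(query_emb, doc_emb))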

dleeftink an hour ago | parent | next [-]

Why would you do that? I'd love to know more.

Tostino an hour ago | parent | prev [-]

What are you using those embeddings for, if you don't mind me asking? I'd love to know more about the workflow and what the prefix instructions are like.

greenavocado 28 minutes ago | parent | prev [-]

It's junk compared to BGE M3 on my retrieval tasks

kaycebasques 2 hours ago | parent | prev | next [-]

One thing that's still compelling about all-MiniLM is that it's feasible to use it client-side. IIRC it's a 70MB download, versus 300MB for EmbeddingGemma (or perhaps it was 700MB?)

Are there any solid models that can be downloaded client-side in less than 100MB?

nijaru 11 minutes ago | parent | next [-]

For something under 100 MB, this is probably the strongest option right now.

https://huggingface.co/MongoDB/mdbr-leaf-ir

intalentive 20 minutes ago | parent | prev [-]

This is the smallest model in the top 100 of HF's MTEB Leaderboard: https://huggingface.co/Mihaiii/Ivysaur

Never used it, can't vouch for it. But it's under 100 MB. The model it's based on, gte-tiny, is only 46 MB.

SamInTheShell 29 minutes ago | parent | prev | next [-]

I tried out EmbeddingGemma a few weeks back in A/B testing against nomic-embed-text-v1. I got way better results out of the nomic model. Runs fine on CPU as well.
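For what it's worth, the nomic model wants task prefixes on every input; a minimal sketch per its model card (the CPU device setting and example strings are mine):

    # Minimal sketch: nomic-embed-text-v1 on CPU, with the required task prefixes.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("nomic-ai/nomic-embed-text-v1",
                                trust_remote_code=True, device="cpu")
    doc_emb = model.encode(["search_document: The lighthouse keeper's log, 1892."])
    query_emb = model.encode(["search_query: lighthouse records"])
    print(model.similarity(query_emb, doc_emb))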

dangoodmanUT 3 hours ago | parent | prev [-]

yeah this, there are much better open-weights models out there...