CuriouslyC 14 hours ago

Don't use a vector database for code; embeddings are slow and bad for code. Code likes BM25 + trigram, which gets better results while keeping search responses snappy.
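For reference, the lexical half of that recipe is tiny — classic BM25 scoring fits in a few lines of Python. A toy sketch (documents, tokenization, and parameter values are illustrative, not a production index):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each pre-tokenized doc against query_terms with classic BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency: in how many docs does each term appear?
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        dl = len(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores

# toy "code" corpus, already tokenized
docs = [
    ["def", "parse", "config", "path"],
    ["class", "config", "parser", "read", "file"],
    ["def", "main", "run", "server"],
]
print(bm25_scores(["config", "parse"], docs))
```

Term-frequency saturation (`k1`) and length normalization (`b`) are what make this beat raw tf-idf on code, where identifiers repeat a lot; the trigram side handles partial identifier matches.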

jankovicsandras 12 hours ago | parent | next [-]

You can do hybrid search in Postgres.

Shameless plug: https://github.com/jankovicsandras/plpgsql_bm25 BM25 search implemented in PL/pgSQL ( Unlicense / Public domain )

The repo also includes plpgsql_bm25rrf.sql: a PL/pgSQL function for hybrid search (plpgsql_bm25 + pgvector) with Reciprocal Rank Fusion, plus Jupyter notebook examples.
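Reciprocal Rank Fusion itself is a one-liner per document: sum 1/(k + rank) over each result list. A minimal Python sketch of the same idea the SQL function implements (function name and the doc IDs are illustrative; k=60 is the commonly used default):

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked result lists: score(doc) = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["a", "b", "c"]   # ranked output of the lexical search
vector_hits = ["b", "d", "a"]   # ranked output of the vector search
print(rrf_fuse([bm25_hits, vector_hits]))
```

Because RRF only uses ranks, not raw scores, you never have to normalize BM25 scores against cosine distances — that's what makes it the easy choice for hybrid search.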

canadiantim 4 hours ago | parent [-]

Wow, very impressive library. Great work!

postalcoder 13 hours ago | parent | prev | next [-]

I agree. Someone here posted a drop-in for grep that added the ability to do hybrid text/vector search, but the constant need to re-index files was a drag. Moreover, vector search can add a ton of noise if the model isn't meant for code search and you're not using a re-ranker.

For all intents and purposes, running gpt-oss 20B in a while loop with access to ripgrep works pretty dang well. gpt-oss is a tool-calling god compared to everything else I've tried, and fast.
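The "while loop" here is just: ask the model, run the tool it requests, feed the output back, repeat. A minimal sketch of that shape — the real version would call gpt-oss through whatever serving API you use, so the model is stubbed out below and the message format is illustrative:

```python
import subprocess

def ripgrep(pattern, path="."):
    """Tool: search the repo with ripgrep (assumes rg is on PATH)."""
    out = subprocess.run(["rg", "--line-number", pattern, path],
                         capture_output=True, text=True)
    return out.stdout or "(no matches)"

def agent_loop(model, tools, question, max_steps=10):
    """Feed tool results back to the model until it produces an answer."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        # model returns either {"tool": name, "args": {...}} or {"answer": text}
        reply = model(messages)
        if "answer" in reply:
            return reply["answer"]
        result = tools[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": result})
    return "(gave up)"

# demo with a scripted stand-in for the model, so the loop runs without an LLM
script = iter([
    {"tool": "ripgrep", "args": {"pattern": "TODO"}},
    {"answer": "found the TODOs"},
])
stub_tools = {"ripgrep": lambda pattern, path=".": f"stub hit for {pattern}"}
print(agent_loop(lambda msgs: next(script), stub_tools, "where are the TODOs?"))
```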

rao-v 12 hours ago | parent | prev | next [-]

Anybody know of a good service / Docker image that will do BM25 + vector lookup without spinning up half a dozen microservices?

cipherself 4 hours ago | parent | next [-]

Here's a Dockerfile that will spin up postgres with pgvector and paradedb https://gist.github.com/cipherself/5260fea1e2631e9630081fb7d...

You can use pgvector for the vector lookup and paradedb for bm25.

donkeyboy 8 hours ago | parent | prev | next [-]

Elasticsearch / OpenSearch is the industry standard for this.

abujazar 8 hours ago | parent [-]

Used to be, but they're very complicated to operate compared to more modern alternatives and have gotten more and more bloated over the years. They also require a bunch of different applications for different parts of the stack to do the same basic stuff as e.g. Meilisearch, Manticore or Typesense.

cluckindan 7 hours ago | parent [-]

>very complicated to operate compared to more modern alternatives

Can you elaborate? What makes the modern alternatives easier to operate? What makes Elasticsearch complicated?

Asking because in my experience, Elasticsearch is pretty simple to operate unless you have a huge cluster with nodes operating in different modes.

porridgeraisin 7 hours ago | parent | prev | next [-]

For BM25 + trigram, SQLite FTS5 works well.
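FTS5 ships both pieces: a `bm25()` ranking function and a trigram tokenizer (the latter needs SQLite >= 3.34, which recent Pythons bundle). A small sketch with placeholder file contents:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# trigram tokenizer gives substring matching inside identifiers
conn.execute(
    "CREATE VIRTUAL TABLE code USING fts5(path, body, tokenize='trigram')"
)
conn.executemany(
    "INSERT INTO code VALUES (?, ?)",
    [
        ("parser.py", "def parse_config(path): ..."),
        ("server.py", "def run_server(port): ..."),
    ],
)
# bm25() returns smaller values for better matches, so ORDER BY ascending
rows = conn.execute(
    "SELECT path, bm25(code) FROM code WHERE code MATCH ? ORDER BY bm25(code)",
    ("config",),
).fetchall()
print(rows)
```

No server, no re-indexing daemon — the index lives in one file (or in memory), which fits the "keep code search snappy" point upthread.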

abujazar 8 hours ago | parent | prev [-]

Meilisearch

ehsanu1 12 hours ago | parent | prev | next [-]

I've gotten great results applying it to file paths + signatures. Even better if you also fuse those results with BM25.

CuriouslyC 5 hours ago | parent [-]

I like embeddings for natural language documents where your query terms are unlikely to be unique, and overall document direction is a good disambiguator.

itake 13 hours ago | parent | prev | next [-]

With AI needing more access to documentation, WDYT about using RAG for documentation retrieval?

CuriouslyC 5 hours ago | parent [-]

IME most documentation comes from the web via web search. I like agentic RAG for this case, which you can achieve easily with a Claude Code subagent.

lee1012 14 hours ago | parent | prev | next [-]

I'm finding static embedding models quite fast: lee101/gobed https://github.com/lee101/gobed runs in 1ms on GPU :) It would need to be trained for code, though; the bigger code LLM embeddings can be high quality too, so it's really a question of where the ideal point on the Pareto frontier is. You're right that it often tends to be BM25 or rg even for code, but more complex solutions are possible if high-quality search really matters.

Der_Einzige 4 hours ago | parent | prev [-]

This is true of LLMs in general, not just for code. LLMs can be told that their RAG tool uses BM25 + N-grams, and will search accordingly. Keyword search is superior to embedding-based search. The moment Google switched to BERT-based embeddings for search, everyone agreed it was going downhill. Most forms of early enshittification were simply a switch from BM25 to embedding-based search.

BM25/tf-idf and N-grams have always been extremely difficult-to-beat baselines in information retrieval. This is why embeddings still haven't led to a "ChatGPT moment" in information retrieval.