CuriouslyC 14 hours ago
Don't use a vector database for code; embeddings are slow and a poor fit for code. Code likes BM25 + trigrams, which gets better results while keeping search responses snappy.
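A minimal sketch of what a BM25 + trigram combination can look like, in plain stdlib Python. The weighting (`alpha`) and the tiny corpus are hypothetical; real setups would use something like ripgrep, Tantivy, or Postgres `pg_trgm` rather than scoring in a loop:

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Split on non-identifier characters; good enough for a sketch.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9_]+", text)]

def trigrams(text):
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

class TinyIndex:
    def __init__(self, docs, k1=1.2, b=0.75):
        self.docs = docs
        self.toks = [tokenize(d) for d in docs]
        self.n = len(docs)
        self.avgdl = sum(len(t) for t in self.toks) / self.n
        self.k1, self.b = k1, b
        self.df = Counter(term for t in self.toks for term in set(t))

    def bm25(self, query, i):
        tf, dl, score = Counter(self.toks[i]), len(self.toks[i]), 0.0
        for q in tokenize(query):
            if q not in tf:
                continue
            idf = math.log(1 + (self.n - self.df[q] + 0.5) / (self.df[q] + 0.5))
            score += idf * tf[q] * (self.k1 + 1) / (
                tf[q] + self.k1 * (1 - self.b + self.b * dl / self.avgdl))
        return score

    def trigram_sim(self, query, i):
        # Jaccard overlap of character trigrams; catches partial identifiers.
        q, d = trigrams(query), trigrams(self.docs[i])
        return len(q & d) / len(q | d) if q | d else 0.0

    def search(self, query, alpha=0.7):
        # Weighted mix of BM25 and trigram similarity; alpha is arbitrary here.
        scored = [(alpha * self.bm25(query, i)
                   + (1 - alpha) * self.trigram_sim(query, i), i)
                  for i in range(self.n)]
        return [self.docs[i] for s, i in sorted(scored, reverse=True) if s > 0]

idx = TinyIndex([
    "def parse_config(path): ...",
    "def load_user(db, user_id): ...",
    "class ConfigParser: ...",
])
print(idx.search("parse_config"))
```

The trigram channel is what keeps this useful for code: an exact-token match on `parse_config` ranks first via BM25, while `ConfigParser` still surfaces through trigram overlap even though it never matches as a whole token.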

jankovicsandras 12 hours ago
You can do hybrid search in Postgres. Shameless plug: https://github.com/jankovicsandras/plpgsql_bm25 is BM25 search implemented in PL/pgSQL (Unlicense / public domain). The repo also includes plpgsql_bm25rrf.sql, a PL/pgSQL function for hybrid search (plpgsql_bm25 + pgvector) with Reciprocal Rank Fusion, plus Jupyter notebook examples.
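Reciprocal Rank Fusion itself is only a few lines regardless of where it runs; a sketch in Python, fusing a keyword ranking with a vector ranking (k=60 is the commonly used default constant, and the doc ids are made up):

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion:
    score(d) = sum over lists of 1 / (k + rank_of_d_in_list)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc_a", "doc_b", "doc_c"]   # e.g. BM25 ranking
vector_hits = ["doc_c", "doc_a", "doc_d"]   # e.g. pgvector ranking
print(rrf([bm25_hits, vector_hits]))        # doc_a first: ranked high in both
```

Because RRF only looks at ranks, not raw scores, the two retrievers never need their scores calibrated against each other, which is why it is a popular fusion choice for hybrid search.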

postalcoder 13 hours ago
I agree. Someone here posted a drop-in replacement for grep that added hybrid text/vector search, but the constant need to re-index files was annoying and a drag. Moreover, vector search can add a ton of noise if the model isn't meant for code search and you're not using a re-ranker. For all intents and purposes, running gpt-oss 20B in a while loop with access to ripgrep works pretty dang well. gpt-oss is a tool-calling god compared to everything else I've tried, and fast.
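The "model in a while loop with ripgrep" pattern is roughly the following. This is a sketch only: `fake_model` is a deterministic stand-in for gpt-oss (a real version would call an inference API), and `grep_tool` scans an in-memory file dict instead of shelling out to `rg`:

```python
import re

# Hypothetical mini-repo standing in for a real checkout.
FILES = {
    "auth.py": "def check_password(user, pw): ...",
    "db.py": "def connect(dsn): ...",
}

def grep_tool(pattern):
    # Stand-in for shelling out to `rg`; returns (filename, line) hits.
    return [(name, line) for name, text in FILES.items()
            for line in text.splitlines() if re.search(pattern, line)]

def fake_model(history):
    # Stand-in for gpt-oss: issue one search, then answer from the result.
    if not any(role == "tool" for role, _ in history):
        return {"tool": "grep", "pattern": "password"}
    return {"answer": "Password checking lives in auth.py."}

def agent_loop(question, max_steps=5):
    history = [("user", question)]
    for _ in range(max_steps):
        action = fake_model(history)
        if "answer" in action:
            return action["answer"]
        # Run the requested tool and feed the hits back into the transcript.
        history.append(("tool", grep_tool(action["pattern"])))
    return "gave up"

print(agent_loop("where is password checking?"))
```

The whole trick is that loop: the model proposes a search, sees the hits, and either searches again or answers, so no index ever has to be kept fresh.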

rao-v 12 hours ago
Anybody know of a good service / Docker image that will do BM25 + vector lookup without spinning up half a dozen microservices?

ehsanu1 12 hours ago
I've gotten great results applying it to file paths + signatures. Even better if you also fuse those results with BM25.

itake 13 hours ago
With AI needing more access to documentation, WDYT about using RAG for documentation retrieval?

lee1012 14 hours ago
I'm finding static embedding models quite fast: lee101/gobed (https://github.com/lee101/gobed) runs in 1 ms on GPU :) It would need to be trained for code, though. The bigger code LLM embeddings can be high quality too, so it's really a question of where the ideal point on the Pareto frontier is. Often, yeah, you're right: it tends to be BM25 or rg even for code. But more complex solutions are possible if high-quality search really matters.

Der_Einzige 4 hours ago
This is true of LLMs in general, not just for code. LLMs can be told that their RAG tool is using BM25 + n-grams, and will phrase their searches accordingly. Keyword search is superior to embeddings-based search. The moment Google switched to BERT-based embeddings for search, everyone agreed it was going downhill; most early forms of enshittification were simply a matter of switching from BM25 to embeddings-based search. BM25/tf-idf and n-grams have always been extremely difficult-to-beat baselines in information retrieval. This is why embeddings still have not led to a "ChatGPT moment" in information retrieval.