▲ So you wanna build a local RAG? (blog.yakkomajuri.com)
137 points by pedriquepacheco 5 hours ago | 30 comments
▲ | simonw 4 hours ago | parent | next [-]
My advice for building something like this: don't get hung up on a need for vector databases and embeddings. Full text search or even grep/rg are a lot faster and cheaper to work with - no need to maintain a vector database index - and turn out to work really well if you put them in some kind of agentic tool loop.

The big benefit of semantic search was that it could handle fuzzy searching - returning results that mention dogs if someone searches for canines, for example. Give a good LLM a search tool and it can come up with searches like "dog OR canine" on its own - and refine those queries over multiple rounds of searches.

Plus it means you don't have to solve the chunking problem!
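As a rough sketch of what that loop can look like - assuming ripgrep is on the PATH, with `ask_llm` standing in for whatever local or hosted model you call, and a made-up JSON reply format purely for illustration:

    import json
    import subprocess

    def rg_search(pattern, path="docs/", max_lines=20):
        # Run ripgrep case-insensitively, return "file:line:text" matches.
        result = subprocess.run(
            ["rg", "--no-heading", "--line-number", "-i", "-e", pattern, path],
            capture_output=True, text=True,
        )
        lines = result.stdout.splitlines()[:max_lines]
        return "\n".join(lines) or "(no matches)"

    def answer(question, ask_llm, max_rounds=3):
        # Let the model propose regex queries, observe results, and refine.
        transcript = "Question: " + question + "\n"
        for _ in range(max_rounds):
            reply = json.loads(ask_llm(
                transcript + '\nReply with {"search": "<regex>"} or {"answer": "<text>"}.'
            ))
            if "answer" in reply:
                return reply["answer"]
            hits = rg_search(reply["search"])
            transcript += '\nResults for "' + reply["search"] + '":\n' + hits + "\n"
        return transcript  # out of rounds; return what was found

A query like "dog|canine" falls out naturally once the model sees what each round of results looks like.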
▲ | mips_avatar 4 hours ago | parent | prev | next [-]
One thing I didn't see here that might be hurting your performance is a lack of semantic chunking. It sounds like you're embedding entire docs, which kind of breaks down if the docs contain multiple concepts. A better approach for recall is to use some kind of chunking program to get semantic chunks (I like spaCy, though you have to configure it a bit). Then, once you have your chunks, you need to prepend context about how each chunk relates to the rest of your doc before you do your embedding. I have found Anthropic's approach to contextual retrieval (https://www.anthropic.com/engineering/contextual-retrieval) to be very performant in my RAG systems - you can just use gpt-oss-20b as the model for context generation. Unless I've misunderstood your post and you are already doing some form of this in your pipeline, you should see a dramatic improvement in performance once you implement this.
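A bare-bones sketch of that contextual step - `ask_llm` and `embed` are placeholders for your local generation and embedding models, not any specific library API:

    CONTEXT_PROMPT = """<document>
    {doc}
    </document>
    Here is a chunk from that document:
    <chunk>
    {chunk}
    </chunk>
    Write one short sentence situating this chunk within the overall
    document, to improve search retrieval. Answer with only the sentence."""

    def contextualize(doc, chunks, ask_llm, embed):
        records = []
        for chunk in chunks:
            # Ask the model to situate the chunk, then embed context + chunk
            # together while keeping the original chunk text for display.
            context = ask_llm(CONTEXT_PROMPT.format(doc=doc, chunk=chunk))
            records.append({"text": chunk, "embedding": embed(context + "\n" + chunk)})
        return records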
▲ | nilirl 4 hours ago | parent | prev | next [-]
Why is it implicit that semantic search will outperform lexical search? Back in 2023, when I compared semantic search to lexical search (tantivy; BM25), I found the search results to be marginally different.

Even if semantic search has slightly more recall, does the problem of context warrant this multi-component, homebrew search engine approach? By what important measure does it outperform a lexical search engine? Is the engineering time worth it?
▲ | 0xC45 43 minutes ago | parent | prev | next [-]
For an open source, local (or cloud) vector DB, I would also recommend checking out Chroma (https://trychroma.com). It also supports full text search. Disclaimer: I work on Chroma Cloud.
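Minimal local usage looks roughly like this (after `pip install chromadb`; the collection name and toy documents are just for illustration):

    import chromadb

    client = chromadb.PersistentClient(path="./chroma_db")  # local on-disk store
    collection = client.get_or_create_collection(name="docs")

    collection.add(
        ids=["doc1", "doc2"],
        documents=["Dogs are loyal companions.", "Cats are independent."],
    )

    # Embeds the query with Chroma's default local embedding function.
    results = collection.query(query_texts=["canine pets"], n_results=1)
    print(results["documents"])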
▲ | mijoharas 2 hours ago | parent | prev | next [-]
I'm interested in the embedding models suggested. I had some good results with nomic in a small embedding-based tool I built. I also heard a few good things about qwen3-embedding, though the latency wasn't great for my use case so I didn't pursue it much further.

Similarly, I used sqlite-vec and was very happy with it (if I were already using Postgres I'd have gone with that, but this was more of a CLI tool).

If the author is here: did you try any of those models? How would you compare the ones you did use?
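For anyone curious, the sqlite-vec setup is pleasantly small. A sketch after `pip install sqlite-vec` - the table name and toy 4-dim vectors stand in for real model embeddings:

    import sqlite3
    import sqlite_vec
    from sqlite_vec import serialize_float32

    db = sqlite3.connect("vectors.db")
    db.enable_load_extension(True)
    sqlite_vec.load(db)  # load the sqlite-vec extension into this connection
    db.enable_load_extension(False)

    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING vec0(embedding float[4])")
    db.execute(
        "INSERT INTO chunks(rowid, embedding) VALUES (?, ?)",
        (1, serialize_float32([0.1, 0.2, 0.3, 0.4])),
    )

    # KNN query: nearest rows by distance to the query vector.
    rows = db.execute(
        "SELECT rowid, distance FROM chunks WHERE embedding MATCH ? ORDER BY distance LIMIT 3",
        (serialize_float32([0.1, 0.2, 0.3, 0.4]),),
    ).fetchall()
    print(rows)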
▲ | johnebgd 44 minutes ago | parent | prev | next [-]
Interesting stack. I've been working on something like this with Apple-specific tech. SwiftData is not easy to work with.
▲ | urbandw311er 3 hours ago | parent | prev | next [-]
When it comes to evals for this kind of thing, is there a standard set of test data out there that one can use to benchmark against? I.e., a collection of documents, plus questions that should result in particular documents or chunks being cited as the most relevant match.
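Even without a standard dataset, the eval shape described here is easy to hand-roll: pairs of (question, expected doc ids) scored with recall@k. A sketch, where `search` is whatever retriever is under test:

    def recall_at_k(eval_set, search, k=5):
        # eval_set: list of (query, set_of_relevant_doc_ids) pairs.
        hits = 0
        for query, relevant in eval_set:
            retrieved = set(search(query, k))  # top-k doc ids from the retriever
            if retrieved & relevant:           # did any relevant doc come back?
                hits += 1
        return hits / len(eval_set)

    # Toy check with a fake retriever that always returns the same doc.
    fake = lambda query, k: ["auth-guide"]
    print(recall_at_k([("how do I reset my password?", {"auth-guide"})], fake))  # 1.0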
▲ | _joel 3 hours ago | parent | prev | next [-]
You can also get local RAG with AnythingLLM if you want minimal effort, FWIW. It's pretty much plug and play. I used it for simple testing of an idea before getting into the weeds of LangChain and agentic RAG.
▲ | dwa3592 2 hours ago | parent | prev | next [-]
If you end up using any of the frontier models, don't forget to protect private information in your prompts - https://github.com/deepanwadhwa/zink
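I haven't verified zink's actual API, so purely as a toy illustration of the idea - masking obvious PII patterns before a prompt leaves the machine:

    import re

    PATTERNS = {
        "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    }

    def redact(text):
        # Replace each PII match with a bracketed placeholder label.
        for label, pattern in PATTERNS.items():
            text = pattern.sub("[" + label + "]", text)
        return text

    print(redact("Reach me at jane@example.com or +1 555 123 4567"))
    # -> Reach me at [EMAIL] or [PHONE]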
▲ | barbazoo 4 hours ago | parent | prev | next [-]
> What that means is that when you're looking to build a fully local RAG setup, you'll need to substitute whatever SaaS providers you're using for a local option for each of those components.

Even starting with having "just" the documents and vector DB locally is a huge first step, and much more doable than going with a local LLM at the same time. I don't know anyone or any org that has the resources to run their own LLM at scale.
▲ | kbrisso 3 hours ago | parent | prev | next [-]
I built https://github.com/kbrisso/byte-vision for local RAG. It uses llama.cpp and Elasticsearch. On a laptop with an 8 GB GPU it can handle a 30K token context size and summarize a fairly large PDF.
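For reference, the llama.cpp side of a setup like this can be sketched with llama-cpp-python - the model path and sizes below are made up for the example, and byte-vision's own wiring may differ:

    from llama_cpp import Llama

    llm = Llama(
        model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # any local GGUF
        n_ctx=32768,      # room for ~30K tokens of extracted PDF text
        n_gpu_layers=-1,  # offload as many layers as fit on the 8 GB GPU
    )

    pdf_text = open("extracted.txt").read()  # text already pulled out of the PDF
    resp = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize:\n\n" + pdf_text}],
        max_tokens=512,
    )
    print(resp["choices"][0]["message"]["content"])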
▲ | dmezzetti an hour ago | parent | prev [-]
Glad to see all the interest in the local RAG space - it's something I've been pushing for a while. I just put this example together today: https://gist.github.com/davidmezzetti/d2854ed82f2d0665ec7efd...