So you wanna build a local RAG? (blog.yakkomajuri.com)
137 points by pedriquepacheco 5 hours ago | 30 comments
simonw 4 hours ago | parent | next [-]

My advice for building something like this: don't get hung up on a need for vector databases and embedding.

Full text search or even grep/rg are a lot faster and cheaper to work with - no need to maintain a vector database index - and turn out to work really well if you put them in some kind of agentic tool loop.

The big benefit of semantic search was that it could handle fuzzy searching - returning results that mention dogs if someone searches for canines, for example.

Give a good LLM a search tool and it can come up with searches like "dog OR canine" on its own - and refine those queries over multiple rounds of searches.

Plus it means you don't have to solve the chunking problem!
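
For a concrete sense of what that looks like, the whole "search tool" can be a thin wrapper around ripgrep handed to the model in a tool loop. A minimal sketch (the function name, directory, and limits are illustrative, not from Simon's setup):

    import subprocess

    def search(query: str, corpus_dir: str = "docs/") -> str:
        """Tool exposed to the LLM: grep the corpus, return matching lines with context."""
        result = subprocess.run(
            ["rg", "--ignore-case", "--context", "2", "--max-count", "5", query, corpus_dir],
            capture_output=True, text=True,
        )
        return result.stdout[:4000] or "no matches"

    # Register `search` with whatever LLM client you use; the model writes queries
    # like "dog|canine" itself and keeps calling the tool until it can answer.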

cwmoore 23 minutes ago | parent | next [-]

I recently came across a “prefer the most common synonym” problem, in Google Maps, while searching for a poolhall—even literally ‘billiards’ returned results for swimming pools and chlorine. I wonder if some more NOTs aren’t necessary… Interested in learning about RAGs, though I’m a little behind the curve.

mips_avatar 2 hours ago | parent | prev | next [-]

In my app, the best lexical search approaches completely broke my agent. For my RAG system, the LLM would on average take 2.1 lexical searches to get the results it needed. That wasn’t terrible, but it sometimes meant up to 5 searches to find what it needed, which blew up user latency. Now that I have hybrid semantic + lexical search, it only requires 1.1 searches per result.

froobius 3 hours ago | parent | prev | next [-]

Hmm it can capture more than just single words though, e.g. meaningful phrases or paragraphs that could be written in many ways.

leetrout 4 hours ago | parent | prev | next [-]

Simon have you ever given a talk or written about this sort of pragmatism? A spin on how to achieve this with Datasette is an easy thing to imagine IMO.

tra3 3 hours ago | parent | prev | next [-]

I built a simple emacs package based on this idea [0]. It works surprisingly well, but I don't know how far it scales. It's likely not as frugal from a token usage perspective.

0: https://github.com/dmitrym0/dm-gptel-simple-org-memory

pstuart 41 minutes ago | parent | prev | next [-]

Perhaps SQLite with FTS5? Or even better, getting DuckDB into the party, as its ecosystem seems ripe for this type of work.
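
A minimal FTS5 sketch, assuming the standard library sqlite3 module with FTS5 compiled in (table and column names are just placeholders):

    import sqlite3

    db = sqlite3.connect("corpus.db")
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(title, body)")
    db.execute("INSERT INTO docs VALUES (?, ?)", ("intro", "dogs are loyal canines"))
    db.commit()

    # BM25-ranked lexical query; an LLM in a tool loop can supply strings like 'dog OR canine'
    rows = db.execute(
        "SELECT title, snippet(docs, 1, '[', ']', '...', 8) "
        "FROM docs WHERE docs MATCH ? ORDER BY rank",
        ("dog OR canine",),
    ).fetchall()
    print(rows)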

enraged_camel 3 hours ago | parent | prev [-]

Yes, exactly. We have our AI feature configured to use our pre-existing TypeSense integration and it's stunningly competent at figuring out exactly what search queries to use across which collections in order to find relevant results.

busssard 3 hours ago | parent [-]

If this is coupled with powerful search engines beyond Elastic, then we are getting somewhere. Other non-monotonic engines that can find structural information are out there.

mips_avatar 4 hours ago | parent | prev | next [-]

One thing I didn’t see here that might be hurting your performance is a lack of semantic chunking. It sounds like you’re embedding entire docs, which kind of breaks down if the docs contain multiple concepts. A better approach for recall is using some kind of chunking program to get semantic chunks (I like spaCy, though you have to configure it a bit). Then, once you have your chunks, you need to append context describing how each chunk relates to the rest of your doc before you do your embedding. I have found Anthropic’s approach to contextual retrieval to be very performant in my RAG systems (https://www.anthropic.com/engineering/contextual-retrieval); you can just use gpt-oss-20b as the model for generating the context.

Unless I’ve misunderstood your post and you are doing some form of this in your pipeline you should see a dramatic improvement in performance once you implement this.
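
The gist of contextual retrieval is prepending model-generated context to each chunk before embedding it. A rough sketch, where `generate` and `embed` stand in for whatever local models you run and the prompt is a paraphrase rather than the exact one from Anthropic's post:

    CONTEXT_PROMPT = (
        "Here is a document:\n{doc}\n\n"
        "Here is a chunk from it:\n{chunk}\n\n"
        "Write one or two sentences situating this chunk within the document."
    )

    def contextual_embeddings(doc: str, chunks: list[str], generate, embed):
        vectors = []
        for chunk in chunks:
            context = generate(CONTEXT_PROMPT.format(doc=doc, chunk=chunk))
            vectors.append(embed(context + "\n" + chunk))  # index alongside the raw chunk text
        return vectors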

yakkomajuri 3 hours ago | parent [-]

hey, author (not op) here. we do do semantic chunking! I think maybe I gave the impression that we don't because of the mention of aggregating context but I tested this with questions that would require aggregating context from 15+ documents (meaning 2x that in chunks), hence the comment in the post!

mips_avatar 3 hours ago | parent [-]

Ah so you’re generating context from multiple docs for your chunks? How do you decide which docs get aggregated?

nilirl 4 hours ago | parent | prev | next [-]

Why is it implicit that semantic search will outperform lexical search?

Back in 2023 when I compared semantic search to lexical search (tantivy; BM25), I found the search results to be marginally different.

Even if semantic search has slightly more recall, does the problem of context warrant this multi-component, homebrew search engine approach?

By what important measure does it outperform a lexical search engine? Is the engineering time worth it?

kgeist an hour ago | parent | next [-]

It depends on how you test it. I recently found that the way devs test it differs radically from how users actually use it. When we first built our RAG, it showed promising results (around 90% recall on large knowledge bases). However, when the first actual users tried it, it could barely answer anything (closer to 30%). It turned out we relied on exact keywords too much when testing it: we knew the test knowledge base, so we formulated our questions in a way that helped the RAG find what we expected it to find. Real users don't know the exact terminology used in the articles. We had to rethink the whole thing. Lexical search is certainly not enough. Sure, you can run an agent on top of it, but that blows up latency - users aren't happy when they have to wait more than a couple of seconds.

mips_avatar 3 hours ago | parent | prev | next [-]

Depends on how important keyword matching vs. something more ambiguous is to your app. In Wanderfugl there are a bunch of queries where semantic search can find an important chunk that lacks a high BM25 score. The good news is you can get all the benefits of BM25 and semantic with a hybrid ranking. The answer isn’t one or the other.
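
Reciprocal rank fusion is one simple way to merge the two ranked lists (a generic sketch, not necessarily what Wanderfugl does):

    def rrf(lexical_ids, semantic_ids, k=60):
        """Merge two ranked lists of doc ids; docs ranked well by either list rise to the top."""
        scores = {}
        for ranked in (lexical_ids, semantic_ids):
            for rank, doc_id in enumerate(ranked):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    print(rrf(["a", "b", "c"], ["c", "a", "d"]))  # ['a', 'c', 'b', 'd']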

andoando 3 hours ago | parent | prev [-]

The benefit I see is you can have queries like "conversations between two scientists".

It's very dependent on use case imo

0xC45 43 minutes ago | parent | prev | next [-]

For an open source, local (or cloud) vector DB, I would also recommend checking out Chroma (https://trychroma.com). It also supports full text search. Disclaimer: I work on Chroma cloud.
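
A local setup is only a few lines; roughly like this (uses Chroma's default embedding model unless you pass your own, and the documents here are just placeholders):

    import chromadb

    client = chromadb.PersistentClient(path="./chroma")  # fully local, on-disk
    collection = client.get_or_create_collection("docs")
    collection.add(
        ids=["1", "2"],
        documents=["dogs are loyal canines", "the pool shop sells chlorine"],
    )
    results = collection.query(query_texts=["canine companions"], n_results=1)
    print(results["documents"])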

mijoharas 2 hours ago | parent | prev | next [-]

I'm interested in the embedding models suggested. I had some good results with nomic in a small embedding-based tool I built. I also heard a few good things about qwen3-embedding, though the latency wasn't great for my use case so I didn't pursue it much further.

Similarly, I used sqlite-vec, and was very happy with it. (if I were already using postgres I'd have gone with that, but this was more of a cli tool).

If the author is here, did you try any of those models? how would you compare the ones you did use?
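
For anyone who hasn't tried sqlite-vec, a minimal KNN query looks roughly like this (the 4-dimensional vectors and table name are just placeholders to keep the example short):

    import sqlite3
    import struct

    import sqlite_vec

    db = sqlite3.connect(":memory:")
    db.enable_load_extension(True)
    sqlite_vec.load(db)
    db.enable_load_extension(False)

    db.execute("CREATE VIRTUAL TABLE chunks USING vec0(embedding float[4])")
    db.execute(
        "INSERT INTO chunks(rowid, embedding) VALUES (?, ?)",
        (1, struct.pack("4f", 0.1, 0.2, 0.3, 0.4)),  # raw float32 blob
    )
    rows = db.execute(
        "SELECT rowid, distance FROM chunks "
        "WHERE embedding MATCH ? ORDER BY distance LIMIT 3",
        (struct.pack("4f", 0.1, 0.2, 0.3, 0.4),),
    ).fetchall()
    print(rows)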

johnebgd 44 minutes ago | parent | prev | next [-]

Interesting stack. I’ve been working on doing something like this with Apple-specific tech. SwiftData is not easy to work with.

urbandw311er 3 hours ago | parent | prev | next [-]

When it comes to the evals for this kind of thing, is there a standard set of test data out there that one can work with to benchmark against? ie a collection of documents with questions that should result in particular documents or chunks being cited as the most relevant match.

_joel 3 hours ago | parent | prev | next [-]

You can get local RAG with AnythingLLM if you want minimal effort too, fwiw. Pretty much plug and play. Used it for simple testing of an idea before getting into the weeds of LangChain and agentic RAG.

dwa3592 2 hours ago | parent | prev | next [-]

If you end up using any of the frontier models, don't forget to protect private information in your prompts - https://github.com/deepanwadhwa/zink

cjonas 2 hours ago | parent [-]

Doesn't seem necessary if you are using Claude via Bedrock or GPT via Azure. At that point, it's no different than sending PII through a serverless function.

barbazoo 4 hours ago | parent | prev | next [-]

> What that means is that when you're looking to build a fully local RAG setup, you'll need to substitute whatever SaaS providers you're using for a local option for each of those components.

Even starting with having "just" the documents and vector db locally is a huge first step and much more doable than going with a local LLM at the same time. I don't know anyone or any org that has the resources to run their own LLM at scale.

mips_avatar 2 hours ago | parent | next [-]

It’s also extremely viable to host your own vector db. You just need a server with enough RAM for your HNSW index.
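
For instance, with hnswlib the whole index is a handful of lines (dimensions and parameters here are illustrative):

    import hnswlib
    import numpy as np

    dim, n = 384, 10_000
    vectors = np.random.rand(n, dim).astype("float32")  # stand-in for your chunk embeddings

    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=n, ef_construction=200, M=16)
    index.add_items(vectors, np.arange(n))
    index.set_ef(50)  # query-time recall/speed trade-off

    labels, distances = index.knn_query(vectors[:1], k=5)
    print(labels)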

procaryote 3 hours ago | parent | prev [-]

Aren't there a bunch of models that run OK on consumer hardware now?

kbrisso 3 hours ago | parent | prev | next [-]

I built this for local RAG: https://github.com/kbrisso/byte-vision. It uses llama.cpp and Elasticsearch. On a laptop with an 8 GB GPU it can handle a 30K-token context and summarize a fairly large PDF.

busssard 3 hours ago | parent [-]

Elasticsearch is the true limitation of RAG systems...

kbrisso 3 hours ago | parent [-]

The vector search works great once you figure it out. I wanted to focus on writing the application and not have to rewrite a document store.

dmezzetti an hour ago | parent | prev [-]

Glad to see all the interest in the local RAG space, it's been something I've been pushing for a while.

I just put this example together today: https://gist.github.com/davidmezzetti/d2854ed82f2d0665ec7efd...