RAG Eval Comparing Vertex/Bedrock/Azure/OpenAI (github.com)

colon-md 11 hours ago:
Last week I read Karpathy's gist on building a personal wiki LLM (https://gist.github.com/karpathy/442a6bf555914893e9891c11519...) and decided to try it. The RAG pitch: take your own corpus of docs, layer an LLM over it, and get a thing that answers questions grounded in your stuff, with the wiki+RAG hybrid as the interesting architectural variant. So I started building the "traditional" retrieval architectures (pure dense, BM25, hybrid RRF, rerank) to pit against the wiki+RAG variant, which layers structure over the chunks. After a few days of code cleanup I have an eval testbench, but the wiki LLM itself is only about 50% built. I'm releasing the testbench now because I think the testbench is just as valuable as the RAG design itself.

What the repo does: runs four hosted RAG services (Vertex, Bedrock, Azure, OpenAI) against identical inputs: the same 81-doc enterprise corpus, the same 50 questions stratified across single-hop / multi-hop / contradiction / unanswerable, and the same retrieve-only scoring of 0.7×recall + 0.3×precision. Sketches of the hybrid fusion and the scoring are below.
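For the hybrid RRF baseline, I mean standard reciprocal rank fusion over the dense and BM25 rankings. A minimal sketch (the k=60 constant is the conventional default from the RRF paper, not something specific to this repo):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each ranked list contributes
    1/(k + rank) per doc; sum across lists, sort descending."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse a dense ranking with a BM25 ranking:
# fused_ids = rrf_fuse([dense_ids, bm25_ids])
```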
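And the retrieve-only scoring is roughly this (the 0.7/0.3 weights are exactly as above; the set-overlap semantics and names are illustrative):

```python
def retrieve_only_score(retrieved_ids, gold_ids, w_recall=0.7, w_precision=0.3):
    """Score one question by overlap between retrieved chunks and
    gold chunks, weighted 0.7*recall + 0.3*precision."""
    retrieved, gold = set(retrieved_ids), set(gold_ids)
    hits = len(retrieved & gold)
    recall = hits / len(gold) if gold else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return w_recall * recall + w_precision * precision
```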
Here's a surprise finding (maybe not a surprise to you): all four major RAG services hallucinate on every unanswerable question. 0/5 abstention correctness across the board. I was sort of expecting enterprise RAG providers like GCP, AWS, Azure, and OpenAI to respond "I don't know" to unanswerable questions.
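By abstention correctness I mean: on the 5 unanswerable questions, a service is correct iff it declines to answer rather than fabricating one. A toy version of the check (the marker strings are illustrative; a real judge could be stricter or LLM-based):

```python
ABSTAIN_MARKERS = ("i don't know", "cannot be answered", "not enough information")

def abstained(answer: str) -> bool:
    """Crude keyword check for an abstention-style response."""
    a = answer.lower()
    return any(m in a for m in ABSTAIN_MARKERS)

def abstention_correctness(answers):
    """Fraction of unanswerable questions where the service abstained."""
    return sum(abstained(a) for a in answers) / len(answers)
```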