RAG Eval Comparing Vertex/Bedrock/Azure/OpenAI (github.com)

colon-md 11 hours ago:
Last week I read Karpathy's gist on building a personal wiki LLM (https://gist.github.com/karpathy/442a6bf555914893e9891c11519...) and decided to try it. The RAG pitch: take your own corpus of docs, layer an LLM over it, and get a thing that answers questions grounded in your stuff, with the wiki+RAG hybrid as the interesting architectural variant. So I started building the "traditional" retrieval architectures (pure dense, BM25, hybrid RRF, rerank) to pit against the wiki+RAG variant, which layers structure over the chunks. After a few days of code cleanup I have an eval testbench, but the wiki LLM itself is only about 50% built. I'm releasing the testbench now because I think the testbench is just as valuable as the RAG design itself.

What the repo does: runs four hosted RAG services (Vertex, Bedrock, Azure, OpenAI) against identical inputs: the same 81-doc enterprise corpus, the same 50 questions stratified across single-hop / multi-hop / contradiction / unanswerable, and the same retrieve-only scoring of 0.7×recall + 0.3×precision. Sketches of the hybrid fusion and the scoring are below.
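For the hybrid RRF baseline, I mean standard reciprocal rank fusion over the dense and BM25 rankings. A minimal sketch (the k=60 constant is the conventional default from the RRF paper, not something specific to this repo):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each ranked list contributes
    1/(k + rank) per doc; sum across lists, sort descending."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse a dense ranking with a BM25 ranking:
# fused_ids = rrf_fuse([dense_ids, bm25_ids])
```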
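And the retrieve-only scoring is roughly this (the 0.7/0.3 weights are exactly as above; the set-overlap semantics and names are illustrative):

```python
def retrieve_only_score(retrieved_ids, gold_ids, w_recall=0.7, w_precision=0.3):
    """Score one question by overlap between retrieved chunks and
    gold chunks, weighted 0.7*recall + 0.3*precision."""
    retrieved, gold = set(retrieved_ids), set(gold_ids)
    hits = len(retrieved & gold)
    recall = hits / len(gold) if gold else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return w_recall * recall + w_precision * precision
```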
Here's a surprise finding (maybe not a surprise to you): all four major RAG services hallucinate on every unanswerable question. 0/5 abstention correctness across the board. I was sort of expecting enterprise RAG providers like GCP, AWS, Azure, and OpenAI to respond "I don't know" to unanswerable questions.
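By abstention correctness I mean: on the 5 unanswerable questions, a service is correct iff it declines to answer rather than fabricating one. A toy version of the check (the marker strings are illustrative; a real judge could be stricter or LLM-based):

```python
ABSTAIN_MARKERS = ("i don't know", "cannot be answered", "not enough information")

def abstained(answer: str) -> bool:
    """Crude keyword check for an abstention-style response."""
    a = answer.lower()
    return any(m in a for m in ABSTAIN_MARKERS)

def abstention_correctness(answers):
    """Fraction of unanswerable questions where the service abstained."""
    return sum(abstained(a) for a in answers) / len(answers)
```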