#1 on the leading AI memory benchmark using a smaller, cheaper model (exabase.io)
5 points by johnnymakes 6 hours ago | 1 comment
johnnymakes 6 hours ago
Hey HN. I'm Johnny, founder of Exabase. M-1 is our first-generation memory engine. We evaluated it against LongMemEval, the most comprehensive public benchmark for conversational memory retrieval: 500 questions, ~115k tokens of history, with the relevant information scattered across sessions and buried in noise.

M-1 scored 96.4% at top-50 retrieval, the highest reported score, with consistent performance across all top-k values. The most interesting part is that we did it with Gemini 3 Flash, while every other system on the leaderboard used Gemini 3 Pro. A bigger model can compensate for weaker retrieval by absorbing a larger, noisier context, at the cost of increased inference. We deliberately chose a smaller model to isolate retrieval quality from model capability. The result is Pareto-optimal: better accuracy at lower cost, which is exactly what we're solving for.

In the spirit of real production use, we also ran our answerer with a single generic prompt, stripping out the question-specific prompt language we observed in other benchmark runs. The methodology, prompt, and full results JSON are all linked in the research post. The post also discusses the evaluation ceiling we hit at this accuracy level: the benchmark itself contains errors, which create a noise floor (we reported a few upstream to the benchmark creator).

Happy to discuss the architecture, methodology, or how we think about memory retrieval differently!
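For readers unfamiliar with the metric: a "96.4% at top-50" figure means that for 96.4% of questions, the evidence needed to answer appears somewhere in the 50 retrieved items. A minimal sketch of that scoring loop is below; the names (`Doc`, `Question`, `retrieve`, the toy keyword-overlap ranker) are hypothetical illustrations, not Exabase's or LongMemEval's actual harness.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Doc:
    id: str

@dataclass
class Question:
    text: str
    relevant_ids: set  # ids of sessions containing the needed evidence

def top_k_accuracy(questions, retrieve, k=50):
    # A question is a "hit" if any relevant item appears in the top-k results.
    hits = sum(
        1 for q in questions
        if any(d.id in q.relevant_ids for d in retrieve(q.text, k=k))
    )
    return hits / len(questions)

# Toy retriever: rank a tiny corpus by naive keyword overlap with the query.
corpus = [Doc("s1"), Doc("s2"), Doc("s3")]
texts = {"s1": "moved to berlin", "s2": "likes strong coffee", "s3": "adopted a cat"}

def retrieve(query, k):
    words = set(query.split())
    ranked = sorted(corpus, key=lambda d: -len(words & set(texts[d.id].split())))
    return ranked[:k]

qs = [
    Question("when moved to berlin", {"s1"}),
    Question("adopted a cat recently", {"s3"}),
]
print(top_k_accuracy(qs, retrieve, k=1))  # both hits at k=1 -> 1.0
```

The point of sweeping k is the trade-off the post describes: a larger k makes retrieval easier but hands the answering model a bigger, noisier context, which is why consistent accuracy at small k matters for cheaper models.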