Remix clone Hacker News

	corlinp an hour ago
		That one is a bit sus to me, because the models that do the worst on Omniscience Accuracy do the best on non-hallucination. The top model for this benchmark is "MiniCPM5-1B (Non-reasoning)" which gets a whopping 99% vs 45% for Fable 5. I'd love to see a good hallucination benchmark, but this isn't one. There's no possibility that a 1B model hallucinates less than Fable 5.