Remix clone Hacker News


	▲	jameson 8 hours ago
		How should one compare benchmark results? For example, SWE-bench Pro improved ~11% compared with Opus 4.6. Should one interpret that as 4.7 being able to solve more difficult problems, or as 11% fewer hallucinations?
	▲	azeirah 8 hours ago
		There is no hallucination benchmark currently. I was researching how to predict hallucinations using the literature (Fastowski et al., 2025) (Cecere et al., 2025), and the general-ish situation is that there are ways to introspect a model's certainty by probing it from the outside, which gives you the same certainty metric you _would_ have gotten if the model had been trained as a Bayesian model, i.e., it knows what it knows and it knows what it doesn't know. This significantly improves claim-level false-positive rates (measured with the AUARC metric, which is built on abstention, i.e., having the model shut up when it is actually uncertain). This would be great to include as a metric in benchmarks, because right now a benchmark just says "it solves x% of tasks", whereas what real-world developers care about is "it solves x% of tasks reliably" AND "it creates false positives y% of the time". So the answer to your question: we don't know. It might be a cherry-picked result, it might be fewer hallucinations (better metacognition), or it might be the capability to solve more difficult problems (better intelligence). The benchmarks don't make this explicit.
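To make the abstention idea concrete, here is a minimal sketch of one common discrete way to compute an area under the accuracy-rejection curve. It is an illustration only: the function name `auarc` and its inputs are made up for this example, and the cited papers may define or estimate the metric differently.

```python
def auarc(confidences, correct):
    """Discrete area under the accuracy-rejection curve (illustrative sketch).

    confidences: one certainty score per answer (higher = more certain).
    correct:     one bool per answer, True if the answer was right.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")

    # Rank answers from most to least confident.
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)

    hits = 0
    accuracies = []
    # Keep only the top-k most confident answers (abstain on the rest)
    # and record the accuracy on the kept answers for every k.
    for kept, i in enumerate(order, start=1):
        hits += 1 if correct[i] else 0
        accuracies.append(hits / kept)

    # Average accuracy over all abstention levels.
    return sum(accuracies) / len(accuracies)
```

A model whose confidence tracks correctness scores higher than one with the same raw accuracy but uninformative confidence, which is exactly the distinction a plain "solves x%" number hides.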
	▲	HarHarVeryFunny 7 hours ago
		Benchmarks are meaningless. Try it on your own problems and see if it has improved for what you want to use it for.
	▲	zeroonetwothree 8 hours ago
		Benchmark results don’t directly translate to actual real-world improvement. So we might guess it’s somewhat better, but it’s hard to say exactly in what way.
	▲	theptip 8 hours ago
		11% further along the particular bell curve of SWE-bench. It’s not really easy to extrapolate to the real world, especially given that e.g. the Chinese models tend to train heavily on the benchmarks. But a 10% bump with the same model should equate to “feels noticeably smarter”. A more quantifiable eval would be METR’s task time: the duration of tasks that the model can complete 50% of the time on average. We’ll have to wait to see where 4.7 lands on that one.
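For a rough idea of how a "50% task time" number can be derived, here is a minimal sketch that fits a logistic curve of success against log task duration and reads off where it crosses 50%. The function name, the plain gradient-descent fit, and its default settings are assumptions for illustration; METR's actual pipeline may differ in its details.

```python
import math


def fit_time_horizon(durations_min, successes, steps=5000, lr=0.1):
    """Estimate the task duration at which predicted success is 50%.

    durations_min: task lengths in minutes (hypothetical input format).
    successes:     1 if the model completed the task, else 0.
    """
    xs = [math.log2(d) for d in durations_min]
    mean_x = sum(xs) / len(xs)
    zs = [x - mean_x for x in xs]  # centre the feature for a stabler fit

    a, b = 0.0, 0.0
    n = len(zs)
    # Plain gradient descent on the mean logistic loss for
    # P(success) = sigmoid(a + b * z).
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for z, y in zip(zs, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * z)))
            grad_a += (p - y) / n
            grad_b += (p - y) * z / n
        a -= lr * grad_a
        b -= lr * grad_b

    if b >= 0:
        raise ValueError("success does not fall with task length; no 50% point")

    # The sigmoid equals 0.5 where a + b * z = 0, i.e. z = -a / b.
    return 2 ** (mean_x - a / b)
```

Calling it with a list of task lengths and pass/fail outcomes returns a single duration in minutes, which is easier to compare across models than a raw pass rate on a fixed task set.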