| ▲ | jameson 8 hours ago | |
How should one compare benchmark results? For example, SWE-bench Pro improved ~11% compared with Opus 4.6. Should one interpret that as 4.7 being able to solve more difficult problems, or as 11% fewer hallucinations? | ||
| ▲ | azeirah 8 hours ago | parent | next [-] | |
There is no hallucination benchmark currently. I was researching how to predict hallucinations from the literature (Fastowski et al., 2025; Cecere et al., 2025), and the general situation is that there are ways to introspect a model's certainty by probing it from the outside, recovering roughly the certainty estimate you _would_ have gotten if the model had been trained as a Bayesian model, i.e. it knows what it knows and it knows what it doesn't know. This significantly improves claim-level false-positive rates (measured with the AUARC metric, i.e. abstention rates: have the model shut up when it is actually uncertain). This would be great to include as a metric in benchmarks, because right now a benchmark just says "it solves x% of tasks", whereas the real questions real-world developers care about are "it solves x% of tasks *reliably*" AND "it produces false positives y% of the time". So the answer to your question is: we don't know. It might be a cherry-picked result, it might be fewer hallucinations (better metacognition), it might be the capability to solve more difficult problems (better intelligence). The benchmarks don't make this explicit. | ||
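To make the abstention/AUARC idea concrete, here is a minimal sketch (not taken from the papers above; the confidence numbers and labels are made up) of how claim-level confidence scores and correctness labels turn into an accuracy-rejection curve and its area:

```python
import numpy as np

def auarc(confidence, correct):
    """Discrete area under the accuracy-rejection curve.

    Sort claims by confidence; at each coverage level the model answers
    only its most confident claims and abstains on the rest. A model that
    "knows what it doesn't know" rejects its wrong claims first, so
    accuracy stays high as coverage shrinks and the area approaches 1.
    """
    order = np.argsort(-np.asarray(confidence))           # most confident first
    correct = np.asarray(correct, dtype=float)[order]
    coverage = np.arange(1, len(correct) + 1)
    accuracy_at_coverage = np.cumsum(correct) / coverage  # accuracy if we keep only the top-k claims
    return accuracy_at_coverage.mean()                    # average accuracy over coverage levels

# Toy example: hypothetical probe confidences and whether each claim was actually correct.
conf  = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30]
label = [1,    1,    1,    0,    1,    0]
print(f"AUARC = {auarc(conf, label):.3f}")
```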
| ▲ | HarHarVeryFunny 7 hours ago | parent | prev | next [-] | |
Benchmarks are meaningless. Try it on your own problems and see if it has improved for what you want to use it for. | ||
| ▲ | zeroonetwothree 8 hours ago | parent | prev | next [-] | |
Benchmark results don’t directly translate to actual real-world improvement. So we might guess it’s somewhat better, but it’s hard to say exactly in what way. | ||
| ▲ | theptip 8 hours ago | parent | prev [-] | |
11% further along the particular bell curve of SWE-bench. Not really easy to extrapolate to the real world, especially given that e.g. the Chinese models tend to train heavily on the benchmarks. But a 10% bump with the same model should equate to “feels noticeably smarter”. A more quantifiable eval would be METR’s task time horizon - the duration of tasks that the model can complete 50% of the time on average; we’ll have to wait to see where 4.7 lands on this one. | ||
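For reference, METR's 50% horizon is roughly computed by fitting success/failure against the (log) human completion time of each task and reading off where the fitted curve crosses 50%. A sketch of that calculation, with entirely made-up task data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up results: how long each task takes a human (minutes) and
# whether the model solved it.
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
solved       = np.array([1, 1, 1, 1,  1,  1,  0,   1,   0,   0])

# Fit P(success) as a logistic function of log2(task length).
X = np.log2(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, solved)

# The 50% horizon is the task length where the predicted success
# probability crosses 0.5, i.e. where the logit w*x + b equals zero.
w, b = clf.coef_[0][0], clf.intercept_[0]
print(f"50% time horizon ≈ {2 ** (-b / w):.0f} minutes")
```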