vintagedave 10 hours ago
> We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions, despite our best efforts in improving on this in the initial creation of SWE-bench Verified.

Is this saying a quarter* of the questions and answers were wrong, this whole time?! If so, how was this ever, in any way, a valid measurement? And what was the process for creating this benchmark, and how did it end up with such an extraordinarily poor set of data? (There is a description later of how it was created, which seems to describe a high standard, and I struggle to understand how it aligns with the other results they discuss.)

Kudos to them for highlighting the issues, but I am left with questions.

[*] Not one in four, but one in six; thanks, commenters, for the correction. Leaving the original since, eh, my bad, and it lets the replies make sense. I feel the broad point still stands!
embedding-shape 10 hours ago
> Is this saying a quarter of the questions and answers were wrong, this whole time?!

No, they're saying 59.4% of the 27.6% subset had flawed test cases, I think.

> If so, how was this ever, in any way, a valid measurement?

Benchmarks essentially aren't, for practical purposes anyway. They don't represent your use case, and they don't represent any and all use cases; they're valid for measuring exactly what's included in the benchmark, nothing more and nothing less. I don't understand the ecosystem's obsession with public benchmarks; they hardly ever tell you anything of value. OK, Qwen 3.5 is 50% better on Benchmark X than Qwen 2.5: does that mean it'll be 50% better for what you're using it for? Very unlikely.

I've been running my own private benchmarks, with test cases I never share anywhere, for the specific problems I'm using LLMs for. Some are based on real cases where an LLM went wrong and I had to adjust the prompt, and over time I've built up a suite. Most of the time when an update to a model comes out, it moves maybe 2-3% on my own benchmarks, while the release touts a 30-40% increase or something ridiculous on public benchmarks, and we're supposed to believe the models' training data isn't contaminated...
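A private suite like the one described can be as small as a list of prompt/check pairs kept out of any public repo. A minimal sketch, where the cases and the `run_model` hook are hypothetical placeholders for whatever model you are actually evaluating:

```python
# Minimal private-benchmark harness: each case pairs a prompt with a
# predicate over the model's output. Keep the cases file private so
# it never leaks into anyone's training data.

def run_model(prompt: str) -> str:
    # Hypothetical hook: call whatever model/API you are evaluating.
    raise NotImplementedError

CASES = [
    # Built up over time from real failures, e.g.:
    ("Return only the ISO date for the first Friday after 2024-03-04.",
     lambda out: out.strip() == "2024-03-08"),
]

def score(model=run_model) -> float:
    """Fraction of private cases the model passes."""
    passed = sum(check(model(prompt)) for prompt, check in CASES)
    return passed / len(CASES)
```

Re-running `score` against each new model release gives the 2-3% style deltas the comment describes, on cases that actually matter to you.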

yorwba 7 hours ago
To be useful for identifying which model is better, benchmark scores only need to correlate with true performance, for which it's enough that the majority of tasks are scored correctly. You could have a terrible benchmark where 49% of the labels are wrong, so a model that always answers correctly gets a score of 51%; but as long as that's higher than the always-wrong model at 49%, it's still directionally correct.

Most machine-learning benchmarks have a fairly large fraction of incorrect labels, but when you just want to distinguish between different models, the time you'd need to ensure perfect scoring would usually be better spent on collecting a larger benchmark dataset, even if it ends up having more errors.
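The 49%/51% argument can be checked with a toy simulation (the numbers are the ones from the comment, everything else is illustrative): even with nearly half the labels flipped, a genuinely better model still scores higher.

```python
import random

random.seed(0)
N = 100_000
wrong_label_rate = 0.49

# Ground-truth answers, plus benchmark labels with 49% recorded incorrectly.
truth = [random.randint(0, 1) for _ in range(N)]
labels = [t if random.random() > wrong_label_rate else 1 - t for t in truth]

def score(predictions):
    """Benchmark score: agreement with the (noisy) labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / N

always_right = truth                   # model that always answers correctly
always_wrong = [1 - t for t in truth]  # model that always answers wrongly

print(score(always_right))  # ~0.51: agrees wherever the label is correct
print(score(always_wrong))  # ~0.49: agrees only on the flipped labels
```

The gap between the two scores shrinks as label noise grows, which is why the ranking stays directionally correct only while correct labels remain the majority.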
sillysaurusx 10 hours ago
ImageNet is one of the most popular datasets on the planet. It turns out a significant fraction of its images are mislabeled; in the limit, a model would have to fit the wrong answers to score above a certain percentage. The answer is "it works because ML wants to work." It's surprising how far you can get with something flawed. It's also why such huge breakthroughs are possible by noticing flaws others haven't.

motoboi 10 hours ago
It’s saying that 16% of the problems have, well, problems.
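The 16% follows from multiplying the two fractions in the quoted passage, assuming (as a lower bound) that no flaws exist outside the audited subset:

```python
# Fraction of all problems audited, and fraction of those found flawed
audited = 0.276
flawed_within_audit = 0.594

# Lower bound on flawed problems across the whole benchmark,
# assuming the unaudited 72.4% contains no flaws at all
flawed_overall = audited * flawed_within_audit
print(round(flawed_overall, 3))  # 0.164, i.e. roughly one in six
```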