slacktivism123 | 5 days ago
Fascinating case showing how LLM promoters will happily take "verified" benchmarks at their word. It's easy to publish "$NEWMODEL received an X% bump in SWE-Bench Verified!!!!". Proper research means interrogating the traces, like these researchers did (the Gist shows Claude 4 Sonnet): https://gist.github.com/jacobkahn/bd77c69d34040a9e9b10d56baa... Commentary: https://x.com/bwasti/status/1963288443452051582, https://x.com/tmkadamcz/status/1963996138044096969
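A minimal sketch of what "interrogating the traces" can look like, assuming a hypothetical JSONL trace format (one {"role", "content"} turn per line; real harnesses write their own layouts). It flags assistant turns that poke at git history, which is one way an agent can leak the real fix instead of solving the task:

    import json, re, sys

    # Commands that read repo history are a red flag in a SWE-bench-style run.
    SUSPECT = re.compile(r"git\s+(log|show|diff)\b|\.git/")

    def flag_trace(path):
        """Return (line_number, snippet) for each suspicious assistant turn."""
        hits = []
        with open(path) as f:
            for n, line in enumerate(f, 1):
                turn = json.loads(line)
                if turn.get("role") == "assistant" and SUSPECT.search(turn.get("content", "")):
                    hits.append((n, turn["content"][:80]))
        return hits

    for path in sys.argv[1:]:
        for n, snippet in flag_trace(path):
            print(f"{path}:{n}: {snippet}")

Crude, but even a grep-level pass like this over published traces surfaces problems that a headline score never will.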
Workaccount2 | 5 days ago
The best benchmark is the community vibe in the weeks following a release. Claude benchmarks poorly but vibes well. Gemini benchmarks well and vibes well. Grok benchmarks well but vibes poorly. (Yes, I know vibes are just a gush of anecdotes; they're the approximate shade of gray that emerges from countless black-and-white remarks.)
k__ | 5 days ago
Yes, often you see huge gains in some benchmark, then the model is run through Aider's polyglot benchmark and doesn't even hit 60%.
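For the cross-check itself, a minimal sketch, assuming a hypothetical results JSONL with a {"language": ..., "passed": ...} record per exercise (the real Aider harness uses its own output format):

    import json, sys
    from collections import Counter

    # Tally per-language and overall pass rates from a results file.
    passed, total = Counter(), Counter()
    with open(sys.argv[1]) as f:
        for line in f:
            r = json.loads(line)
            total[r["language"]] += 1
            passed[r["language"]] += bool(r["passed"])

    for lang in sorted(total):
        print(f"{lang}: {passed[lang]}/{total[lang]} = {passed[lang]/total[lang]:.0%}")
    print(f"overall: {sum(passed.values())/sum(total.values()):.0%}")

The per-language breakdown matters: an aggregate score can hide a model that aces Python while failing everything else.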