swyx 9 hours ago
:wave: I was on the team! AMA. Some headlines:
- 3000 rubrics on code quality. First benchmark to measure: "would this code actually get merged?"
- 20+ expert open-source maintainers created tasks on their own repos to capture their opinion & taste.
- Total of 1000+ hours of real-life software maintainer work captured in the dataset. ON TOP of that, 40+ hours of real human work to turn that real-life work into well-validated and structured tasks with rubrics (even more work to turn tasks/prompts from devin-infra-specific to pluggable coding agent).
- Results in an 81% lower false positive rate than SWE-Bench Pro.
- High quality bar: many QA stages & each task manually reviewed by Cognition researchers (examples in post).

Opus 4.8 scores 13% on FrontierCode Diamond.

One of my goals was also to datamine interesting stuff even on the easy tasks. For example, if you squint you can see the answer to "WTF Happened in late 2025" with coding models: https://x.com/swyx/status/2064081945567580323
tedsanders 9 hours ago
Very cool! So glad to see people building and sharing evals that are better than SWE bench. I'm curious - any particular reason you didn't put error bars on the graphs? Seems like it could be helpful when there are only 50 unique problems in the diamond set.
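For a rough sense of how wide those error bars could be, here is a minimal sketch using a Wilson score interval. It assumes, purely for illustration, that the 13% figure is a plain pass rate over 50 independent tasks from a single run; the actual scoring may be rubric-weighted or averaged over several runs, in which case the real interval would differ. The function name `wilson_interval` is made up for this example.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is ~95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: 13% is treated as a simple pass rate on 50 independent tasks.
low, high = wilson_interval(0.13, 50)
print(f"95% interval: {low:.1%} to {high:.1%}")
```

Under those assumptions the interval spans roughly 6% to 25%, so gaps of a few points between models on a 50-task set would be hard to tell apart from noise.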
glerk 6 hours ago
This looks really great, more thoughtful than any benchmark that I've seen until now! I'm curious whether you're only interested in scoring frontier models or whether you would accept submissions from custom harnesses? I am working on multi-model harnesses and would love to test them against your benchmark. Do you plan on releasing the tasks publicly?
typs 8 hours ago
What did you do around cross-harness testing? I don't see anything in the blog post about what harnesses were used in evaluation. SOTA benchmarks have consistently shown that frontier model performance is quite sensitive to what tools are exposed (e.g. str_replace vs. apply_patch) as the labs are RLing on their own harnesses. Did you do testing of the models in a standard setup or in their native harnesses?
fouc 2 hours ago
I'm a bit disappointed that Opus 4.6 wasn't in this because the tokenizer changed quite a bit from 4.7 onward. I was so annoyed by 4.7 that I've been forcing 4.6 ever since. I've been annoyed by 4.8 a bit too, so I haven't felt the urge to move on.
great_psy 9 hours ago
How do you measure quality at scale? Is there another model that determines whether the code adheres to the codebase's standards?