FrontierCode (cognition.ai)
136 points by streamer45 9 hours ago | 25 comments

swyx 8 hours ago
:wave: i was on the team! AMA. some headlines:

- 3000 rubrics on code quality. First benchmark to measure: "would this code actually get merged?"
- 20+ expert open-source maintainers created tasks on their own repos to capture their opinion & taste.
- total 1000+ hours of real-life software maintainer work captured in the dataset. ON TOP of that, 40+ hours of real human work to turn that real-life work into well-validated and structured tasks with rubrics (even more work to turn tasks/prompts from devin-infra-specific to pluggable coding agent)
- results in 81% lower false positive rate than SWE-Bench Pro
- High quality bar: many QA stages & each task manually reviewed by Cognition researchers (examples in post)

Opus 4.8 scores 13% on FrontierCode Diamond.

one of my goals was also to datamine interesting stuff even on the easy tasks. for example, if you squint you can see the answer to "WTF Happened in late 2025" with coding models: https://x.com/swyx/status/2064081945567580323

vessenes 7 hours ago
This looks great. Well reasoned, tons of work put into eval, thanks for building it. It strikes me as kind of wild that good evals can drive tens to hundreds of millions of dollars of compute deployment in the wild. There’s something new and collaborative and competitive about the eval / frontier model race that’s quite interesting. In this case “shorter, actually mergeable patches that open source maintainers would accept” feels like a great thing to deliver to the world.

I didn’t deep dive into good and bad patches, but I wonder if swyx or others on the team have predictions on saturation: both when it will happen and how useful it will be. That is, do you guys think this test is broad enough as written to get better behavior out of models, and if there is saturation on this test, will we see generalized better patch / coding behavior?

Topfi 6 hours ago
Great effort and a bit closer to my private evals than DeepSWE. I greatly appreciate the focus on false negatives and positives, along with simply being far more focused on actual, mergeable quality output over plain passing. I could see a lot of others adopting your list of metrics as a basis: they are very well defined and give solid coverage of everything one should want out of the code provided, not just one or two narrow targets. I will incorporate a lot of these ideas in my own tests and polish some other parts where I somewhat unintentionally already went in a roughly similar direction.

ilaksh 3 hours ago
Is there anything we can download? Did they test GLM 5.1?

nullbio 2 hours ago
This isn't a fair way to chart this: "Each model is run 5 times at every available reasoning effort. For each effort, we average the metric across the 5 trials, then report each model’s score at its best performing reasoning level."

For example, Anthropic's "medium" might involve 3x the amount of thinking and take 5x as long as OpenAI's idea of "medium". So now you've skewed all the results. It assumes that the effort settings are linear and cover equivalent ranges. You should compare apples to apples. Weight them in a way that uses total task completion time as the measure of "effort", not the arbitrary effort settings provided by the AI company. I don't care what the underlying effort level is, I care which model, running for the same amount of time, completes my task more accurately. Total token consumption would be another thing to consider, to factor out differences in tokens per second (TPS). But generally, if the goal is ultimate productivity, the main factor is which one does it faster. If cost is a concern, then token count plus speed, or token count alone, is the main factor.

The second chart paints a clearer picture though: GPT 5.5 xhigh gets 44.7% at 21k tokens, and Opus 4.8 max gets 49.9% at 75k tokens. So roughly 3.6x the tokens from Opus 4.8 resulted in an increase of 5.2 percentage points. If you were to loop GPT 5.5 xhigh over the same set of tasks about 4x, would it surpass the 49.9%? That's the real question here. And I'd wager it probably would. But the framing of this whole thing makes it sound like Opus has some massive lead. In reality though, it just loops harder and consumes more tokens. Their effort levels are not equivalent.

Now take this even further, and emulate what Anthropic is likely doing behind the scenes: running the task through multiple prompts and converging on the end result. Give GPT four generic skills that cover different aspects of the benchmark in a general way. Run it 4x to reach that same token count, using a different one of those skills for each run. Now what is the result? I'd wager it blows Opus out of the water.

The end result is this: Anthropic gives you all of the bloat in a single, slow package. GPT gives you the ability to build your own equivalent harness. I'd much rather have the freedom and flexibility to do it myself.

Once people actually focus on building strong harnesses around open-source models, we'll have models that compete at the same level as the closed labs, especially now that we have models like Nemotron 3 Ultra. But it involves a lot of clever approaches, like using small fast models to help with routing and determining which "skills" and prompts to load, using static analysis, local tools and vector databases, and using a pipeline of specialized, fast, small models to handle the various aspects of the specific task in a cooperative tree. The amount of underutilized specialized AI models out there is insane, and no one seems to be building harnesses around them. Semantic code duplication detection is one example. We don't need to be using the big model to do everything; the big model should be the orchestrator of all of the tools and little models. This is why the big labs have a lead that no one seems to be able to crack: they're not just building a model and calling it a day, they utilize all of these other approaches on top of the big model. Now that we have strong open source models, we can start building these things too.
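
For reference, here is a minimal Python sketch of the aggregation rule quoted at the top of this comment, plus the token arithmetic behind the second-chart comparison. It is only an illustration under stated assumptions: the function name `best_effort_score` and the per-trial scores are hypothetical, and the only real figures are the two (score, token) pairs cited in the comment.

```python
from statistics import mean


def best_effort_score(trials_by_effort):
    """Average each effort level's trials, then return the best (effort, mean).

    Mirrors the quoted rule: average the trials per effort level and
    report the model at its best-performing level.
    """
    averages = {effort: mean(scores) for effort, scores in trials_by_effort.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]


# Hypothetical trial scores (NOT benchmark data): 5 trials per effort level.
example_trials = {
    "low": [0.30, 0.32, 0.28, 0.31, 0.29],
    "medium": [0.40, 0.42, 0.38, 0.41, 0.39],
    "high": [0.44, 0.46, 0.45, 0.43, 0.47],
}

# Figures cited in the comment: score in percent, tokens used.
gpt_score, gpt_tokens = 44.7, 21_000
opus_score, opus_tokens = 49.9, 75_000

token_ratio = opus_tokens / gpt_tokens      # how many times more tokens
point_gain = opus_score - gpt_score         # gain in percentage points
extra_kilotokens = (opus_tokens - gpt_tokens) / 1000
points_per_extra_1k = point_gain / extra_kilotokens

if __name__ == "__main__":
    effort, score = best_effort_score(example_trials)
    print(f"best effort level: {effort} (mean {score:.3f})")
    print(f"token ratio: {token_ratio:.2f}x")
    print(f"gain: {point_gain:.1f} points, "
          f"{points_per_extra_1k:.3f} points per extra 1k tokens")
```

Picking the best effort level per model hides how much compute each level spends, which is the objection raised above. Reporting points per token (or per second) next to the headline score would make that trade-off visible.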

2001zhaozhao an hour ago
You know that it's an honest benchmark when their own model (SWE-1.6) scores terribly on it.

singpolyma3 8 hours ago
Since no one knows or can agree on what "code quality" is and we can't measure it for human output, I'm dubious about measuring it for LLMs

einpoklum 7 hours ago
> Today’s coding benchmarks have established that models can write correct code.

I wouldn't say that.

> But as AI-generated code becomes the dominant path to production

I really hope that's not the case.