firer 3 hours ago

> Open source models found the same bugs? Sure, if you tell them "here is a file which may contain a vulnerability, look for a bug in how function XYZ handles ABC"

In one of Anthropic's blog posts, they describe that that's basically what they did too. They ran the agent many times, each time specifying a different file to focus on. [1]

From my experience as a security researcher, manually finding a fishy file and siccing even Sonnet 4.5 on it yields great results for most memory corruption bugs.

No comments otherwise. I don't have a clue as to who that guy is, and I haven't watched the video yet. You might be right overall.

[1] https://red.anthropic.com/2026/mythos-preview/

Edit: looked at the open source model claims - I agree that they suck. Basically all the details are given away in the prompt, not just the file.

ryeights 2 hours ago | parent [-]

Yes, but Anthropic didn't already know the answers. In the OSS 'reproductions', they fed the model the one file that actually contains a vuln and even told it which parts of the code to focus on. This is obviously a much easier task.

If OSS models are equally up to the task, why not find novel vulnerabilities?

firer 2 hours ago | parent [-]

Yeah, totally agree now that I've looked into it more.

> If OSS models are equally up to the task, why not find novel vulnerabilities?

To be fair, in the same blog post Anthropic mentioned costs in the tens of thousands of dollars per project looked at. So it's a big ask to run a comparable experiment. Would love to see it though.