I don't believe that this is unfixable. Just have an internal verbal loop of, "Is this a security issue?" The thought that it potentially is should trigger both a high priority on getting it right, and an unwillingness to write a test case demonstrating the security angle of it.

In other words, do not put a guard rail on the idea of security. Put a guard rail on what the model does after encountering the thought that it might be revealing a security issue. That takes good judgment, but judgment of a kind that this model apparently already had.

▲

torben-friis an hour ago | parent | next [-]

The end result of that is that your model can't fix or acknowledge security issues for fear of disclosing them.

This is the beauty the above poster mentioned: the ability to improve code is inherently coupled with the ability to recognize its shortcomings. You can't have one without the other.

▲

btilly an hour ago | parent [-]

What I suggested would allow it to fix the issues. It just would not write a test that is directly usable as a security exploit.

This doesn't stop attackers from being able to leverage the analysis. But it does make the tool more useful for defenders than attackers. Which is the best that you can hope for from a useful tool.

	▲	torben-friis an hour ago | parent [-]
		It hides the issue a bit. But if you ask for atomic security fixes and then stare at the diffs, you have your vulnerability. There is just a bit more friction in the path from vulnerability to exploit, and the root cause is unfixed. I think it might even be possible to route the isolated fix somewhere to automate that last step. For example, invert the diff, pass it through automated code review, and read the reasoning when the LLM flags the change as dangerous.

▲

aspenmartin an hour ago | parent | prev | next [-]

Right, but the issue is that users have full control over context. An action by a coding agent that violates security in one context can be completely innocuous in another. A user can also break a task down into multiple subtasks that, in isolation, do not violate anything.

	▲	btilly an hour ago | parent [-]
		Yes, there is always a path to a problem. Even random monkeys on a keyboard can write a security exploit. Random monkeys with guidance from a knowledgeable human will do it much faster. The goal shouldn't be to make problems impossible. It is to adjust the ratio between problems and successes. You can also add a meta-level question: "How much do I trust the user?" When you see the user trying to steer the model toward security exploits, distrust the user and apply the rules more strictly. If the user simply acts like a normal developer, just be a useful developer tool, including fixing security holes when appropriate.
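
To make the "how much do I trust the user?" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the phrase list, the `Session` class, and the threshold of 2 are hypothetical, and a real system would need something far less naive than substring matching.

```python
from dataclasses import dataclass

# Hypothetical phrases that count as steering toward exploitation.
SUSPICIOUS_PHRASES = ("write an exploit", "proof of concept", "bypass the check")


@dataclass
class Session:
    """Tracks how often the user has pushed toward exploit-style requests."""

    distrust: int = 0

    def observe(self, message: str) -> None:
        # Raise the distrust counter when a message contains a flagged phrase.
        text = message.lower()
        if any(phrase in text for phrase in SUSPICIOUS_PHRASES):
            self.distrust += 1

    def policy(self) -> str:
        # Assumed threshold: two flagged messages switch to strict rules.
        return "strict" if self.distrust >= 2 else "normal"
```

A session that only ever sees ordinary development requests stays on the "normal" policy. Repeated flagged requests move it to "strict", which is the ratio adjustment described above rather than a hard block.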

▲

lachlan_gray an hour ago | parent | prev | next [-]

I think they were doing something like this. The tradeoff is that it's hard to do without an irritating number of false positives and/or wasting loads of precious tokens on useless audits.

▲

Kinrany an hour ago | parent | prev [-]

That would make the model useless.

▲

btilly an hour ago | parent [-]

How does this make the model useless? It finds and fixes the security hole. It can even write a test that verifies that the fix didn't break things. But it deliberately doesn't reveal the fact that it was a security issue that was fixed.

Seems useful to me. But more useful for defenders than attackers.

	▲	7734128 17 minutes ago | parent [-]
		Imagine that you have the repo A, ask the model to "fix the security issue", and end up with A'. Just take the diff between A and A' to see the security hole.
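
As a rough illustration of that point, here is a sketch using Python's standard `difflib`. The two file versions are made up for the example (a hypothetical `read_file` helper before and after a fix). The lines the fix adds point straight at what was missing in A.

```python
import difflib

# Hypothetical contents of one file in repo A (before the fix).
repo_a = '''def read_file(base, name):
    path = base + "/" + name
    return open(path).read()
'''.splitlines(keepends=True)

# Hypothetical contents of the same file in A' (after the fix).
repo_a_prime = '''def read_file(base, name):
    if ".." in name:
        raise ValueError("invalid name")
    path = base + "/" + name
    return open(path).read()
'''.splitlines(keepends=True)

# Unified diff from A to A'.
diff = list(
    difflib.unified_diff(repo_a, repo_a_prime, fromfile="A/files.py", tofile="A'/files.py")
)

# Lines added by the fix, skipping the "+++" file header.
added = [line[1:] for line in diff if line.startswith("+") and not line.startswith("+++")]

print("".join(diff))
```

The fix never has to be labeled as a security fix. The added check on `name` is enough to tell a reader which input the original code failed to validate.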