> How well do current models do against prompt injection? Not so great. A recent paper found human red-teamers achieve near-100% attack success rates against frontier models5. But, these same LLMs score near-perfectly on standard prompt injection benchmarks! The discrepancy is straightforward: skilled humans test and adapt attacks until they work, benchmarks don't. Static benchmarks measure attacks models have already learned to catch.

I know it's likely more complex than what I am about to say, and maybe this quote is addressing it - I have a suite of ~100 tests for my agent to detect regression in task efficiency, but they're mostly designed for its compliance. E.g., if I say never do something, and then it does the thing, that is a failed test.

I struggled early on very badly designing this - because it seems no matter what, if the agent knows it's being tested, it virtually never fails. I could not get it to at first, but when I'd actually work with it, it would break rules constantly and often.

So the only way I could successfully test it was to design what looked a "normal" workflow task that could potentially involve doing things that it was not supposed to do. At first I would have it log every tool call, but it figured out fairly quickly why I wanted it to do that - perfect compliance again. So now I just have a secondary context that looks at all tool calls and measures it against the benchmark tests, and scores it. This ended up catching the behavior pretty well.

It was so weird though, colleagues and I would joke about "what if we make it think it's always being evaluated" and a few did end up doing that, and claim some success.

▲

im3w1l a day ago | parent | next [-]

I kinda want to invoke Hanlon's razor here... on the model. We shouldn't assume it's subversive when it might just be incompetent. Any difference between tests and real world production could lead to different outcomes just by chance, one working randomly better than the other for no particular reason.

▲

JohnMakin a day ago | parent [-]

I did not mean to imply it's being subversive. My theory is it's some byproduct mechanism of attention, where you're now basically telling it "your goal is to pass this set of tests" rather than "implement this piece of code" when "implement this piece of code" may involve it forgetting about a rule due to convenience, context exhaustion, whatever.

▲

Klathmon a day ago | parent | next [-]

There are also cases where breaking a "rule" is the right thing to do.

I've had several instances where I told the model to do something that was accidentally impossible if taken at face value. The most memorable one is when I told it to re-run just a specific CI job, but it didn't have any way to do that, so it just ignored that part of the prompt and re-ran all CI jobs by pushing another commit.

Ultimately I preferred what it actually did, but technically it violated what I told it to. I have a feeling in a benchmark that would be points against it

	▲	JohnMakin 20 hours ago \| parent \| next [-]
		yea I’ve tried various methods around this - mostly trying to implement rules around “if you think I’m incorrect, STOP and ask. If I tell you to break a rule, you are allowed to challenge once and then my response overrides it.” kind of thing. the problem is, and this is a lot worse with 4.8 in my opinion, is that 4.8 will somehow infer I gave permission and think something is totally reasonable to do I didn’t intend. or, it’ll go the other way, and just absolutely refuse to do the thing i’m trying to get it to do. fable was much more judicious with this particular problem.
	▲	wonnage 19 hours ago \| parent \| prev [-]
		This is the is/ought problem (https://en.wikipedia.org/wiki/Is%E2%80%93ought_problem) and it’s unclear whether an objective general solution to this even exists, especially constrained within the framework of language that LLMs are stuck in

▲

XMPPwocky a day ago | parent | prev [-]

For what it's worth, this sounds a lot like something downstream from "reward hacking" in ML- in training, passing tests is often sufficient, and thus gets trained for. There are attempts to fix this (e.g. trying to detect such "cheating" and penalize it), but they have their own problems.

▲

skybrian a day ago | parent | prev [-]

I'm wondering what you did when you made it log every tool call? (I mean, that happens automatically as part of the chat transcript, but what did you do that made it catch on?)

	▲	JohnMakin a day ago \| parent [-]
		Yea, I was aware it stores this normally. I just wanted, at that time, to see if it could reliably record itself via writing every tool call to a file on its own (I don't know what I was trying to prove, other than mildly curious if it could be relied on to audit itself). It said something while beginning in what it displays in its "thinking" block - I'm paraphrasing - something to the effect of, "This looks like a typical XYZ task, except I need to write down every tool call I'm using. This is good practice, it will allow the user visibility in the actions I take and ensure I am following all of the guidelines in XYZ.md." When I removed the self-logging I was able to replicate the deviant behavior I would get during normal workflow sessions, as long as I was able to make it think it was working on a real task (and now since, I make it do real tasks pretty much always). This was on 4.6 when there was that bad (user-reported) regression in ~March of this year. It did come up with some helpful suggestions and analysis of why certain things were breaking down, pointed out some inconsistencies in its memory files vs what its agent files said, etc. Since then I don't really rely on memories at all (at least ones where it self documents them) and use knowledge indexes instead that I help it write, has been far more reliable since.