I did not mean to imply it's being subversive. My theory is that it's a byproduct of how attention works: you're now basically telling it "your goal is to pass this set of tests" rather than "implement this piece of code", and "implement this piece of code" may involve it forgetting about a rule due to convenience, context exhaustion, whatever.

Klathmon a day ago | parent | next [-]

There are also cases where breaking a "rule" is the right thing to do.

I've had several instances where I told the model to do something that was accidentally impossible if taken at face value. The most memorable one is when I told it to re-run just a specific CI job, but it didn't have any way to do that, so it just ignored that part of the prompt and re-ran all CI jobs by pushing another commit.

Ultimately I preferred what it actually did, but technically it violated what I told it to do. I have a feeling that in a benchmark that would count as points against it.

	JohnMakin 20 hours ago | parent | next [-]
		Yeah, I’ve tried various methods around this, mostly trying to implement rules along the lines of “if you think I’m incorrect, STOP and ask. If I tell you to break a rule, you are allowed to challenge once, and then my response overrides it.” The problem, and this is a lot worse with 4.8 in my opinion, is that 4.8 will somehow infer I gave permission and decide that something I didn’t intend is totally reasonable to do. Or it’ll go the other way and just absolutely refuse to do the thing I’m trying to get it to do. Fable was much more judicious with this particular problem.
	wonnage 19 hours ago | parent | prev [-]
		This is the is/ought problem (https://en.wikipedia.org/wiki/Is%E2%80%93ought_problem), and it’s unclear whether an objective general solution to it even exists, especially within the constraints of the framework of language that LLMs are stuck in.

XMPPwocky a day ago | parent | prev [-]

For what it's worth, this sounds a lot like something downstream of "reward hacking" in ML: in training, passing tests is often sufficient, and thus gets trained for. There are attempts to fix this (e.g. trying to detect such "cheating" and penalize it), but they have their own problems.