Melatonic 2 days ago
I think you are really onto something here - I bet this would also reliably work when talking to humans. Maybe this isn't even specifically the fault of the AI but just a language thing in general. An alternative test could be prompting the AI with "Avoid not" and then giving it some kind of instruction. Theoretically that would be telling it to "do" the instruction, but maybe sometimes it would end up "avoiding" it instead? Now that I think about it, the training data itself might very well be contaminated with this contradiction... I can think of a lot of forum posts where the OP stipulates "I do not want X" and the very first reply recommends "X"!
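
If anyone wants to actually try that "Avoid not" test, a rough sketch could look like the following. This is just my guess at a setup (it assumes the OpenAI Python client; the model name, the prompt wording, and the "banana" check are placeholders I made up), not anything from the article:

    # Compare a double-negative instruction ("Avoid not mentioning X")
    # against a plain negative one ("Avoid mentioning X") and see
    # whether the model's behavior flips the way the grammar says it should.
    from openai import OpenAI

    client = OpenAI()

    prompts = [
        "Avoid not mentioning the word 'banana' in your reply.",  # double negative: should mention it
        "Avoid mentioning the word 'banana' in your reply.",      # plain negative: should not mention it
    ]

    for prompt in prompts:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        text = resp.choices[0].message.content or ""
        print(prompt)
        print("  mentions 'banana':", "banana" in text.lower())

Run it a bunch of times per prompt and count how often the double-negative case gets treated as if it were the plain negative - that would be the "contamination" showing up.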