Looks like some psychology researchers got taken by the ruse as well.

Yeah, I'm confused as well. Why would the models hold any memory of red-teaming attempts, or of how the training was conducted?

I'm really curious what the point of this paper is.

	▲	nhecker 3 hours ago
		I'm genuinely ignorant of how those red-teaming attempts are incorporated into training, but I'd guess that this kind of dialogue is fed in as something like normal training data. Which is interesting to think about: it might not even be red-team dialogue from the model under training, but it could still be useful as an example or counter-example of what abusive attempts look like and how to handle them.
	▲	pixl97 3 hours ago
		Are we sure there isn't some company out there crazy enough to feed all its incoming prompts back into model training later?