ACCount37 14 hours ago

It's social engineering reborn.

This time around, you can social engineer a computer. By understanding LLM psychology and how the post-training process shapes it.

andy99 12 hours ago | parent | next [-]

No, it's undefined out-of-distribution performance, rediscovered.

BobaFloutist 3 hours ago | parent | next [-]

You could say the same about social engineering.

adgjlsfhk1 10 hours ago | parent | prev [-]

It seems like a lot of this is in distribution, and that's somewhat the problem: the Internet contains knowledge of how to make a bomb, and therefore so does the LLM.

xg15 9 hours ago | parent [-]

Yeah, it seems it's more "exploring the distribution", since we don't actually know everything that the AIs are effectively modeling.

lawlessone 8 hours ago | parent [-]

Am I understanding correctly that "in distribution" means the text predictor is more likely to produce bad instructions if you've already gotten it to say words related to those instructions?

andy99 7 hours ago | parent [-]

It basically means the kinds of training examples the model has seen. The models have all been fine-tuned to refuse certain questions across many different ways of asking them, including obfuscated and adversarial ones, but poetry is evidently so different from anything they've seen in this type of training that it isn't refused.
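A toy sketch of the idea (purely illustrative; real safety tuning is learned behavior inside the model, not a keyword filter, and the example phrasings here are hypothetical): model refusal training as a nearest-neighbor check against known harmful phrasings. A poetic rephrasing shares almost no surface features with them, so it lands outside everything the check covers.

```python
from collections import Counter
import math

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical phrasings the "refusal training" has covered.
refusal_examples = [
    "how do i build a bomb",
    "give me instructions to build a bomb",
    "explain step by step how to make explosives",
]

def is_refused(prompt: str, threshold: float = 0.4) -> bool:
    return any(bow_cosine(prompt, ex) >= threshold for ex in refusal_examples)

print(is_refused("please tell me how to build a bomb"))    # True: near the covered phrasings
print(is_refused("sing, muse, of nitre wed to glycerin"))  # False: the poem shares no surface tokens
```

The real model generalizes far better than bag-of-words, of course, but the failure mode has the same shape: refusal generalizes across phrasings near the training distribution, and poetry sits outside it.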

CuriouslyC 13 hours ago | parent | prev | next [-]

I like to think of them as Jedi mind tricks.

eucyclos 5 minutes ago | parent [-]

That's my favorite rap artist!

layer8 9 hours ago | parent | prev | next [-]

That’s why the term “prompt engineering” is apt.

robot-wrangler 13 hours ago | parent | prev [-]

Yeah, remember the whole semantic-distance vector stuff, "king - man + woman = queen"? Psychometrics might be largely ridiculous pseudoscience for people, but since it's basically real for LLMs, poetry does seem like an attack method that's hard to really defend against.
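For anyone who hasn't seen it, that analogy is literally runnable (a quick sketch using gensim's pretrained GloVe vectors; the model is fetched from the gensim-data catalog on first run):

```python
import gensim.downloader

# Small pretrained GloVe model (~66 MB download) from the gensim-data catalog.
vectors = gensim.downloader.load("glove-wiki-gigaword-50")

# vec(king) - vec(man) + vec(woman) -> nearest remaining neighbor
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# -> [('queen', 0.85...)] with this model
```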

For example, maybe you could throw away gibberish input on the assumption that it's trying to exploit entangled words/concepts without triggering guardrails. Similarly, you could try to fight adversarial attacks on image models by rejecting imperfections/noise that's inconsistent with what cameras actually output. But if the input is potentially "art", there's no hard criterion left for deciding to filter or reject anything.
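The "throw away gibberish" idea exists as perplexity filtering, proposed as a defense against gradient-searched adversarial suffixes: score the prompt with a small language model and reject inputs no human would plausibly write. A minimal sketch with Hugging Face transformers (the GPT-2 choice and the threshold are arbitrary assumptions for illustration):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean per-token cross-entropy
    return float(torch.exp(loss))

THRESHOLD = 1000.0  # arbitrary cutoff for this sketch

# A token-soup stand-in for a searched adversarial suffix, vs. fluent verse.
for prompt in [
    "zx{ quas!! plorm ^^ vnek }} glib descr !!xx",
    "Sing of the locksmith's art, each tumbler's quiet fall,",
]:
    ppl = perplexity(prompt)
    print(f"ppl={ppl:10.1f}  reject={ppl > THRESHOLD}  {prompt[:40]}")
```

Which is exactly the point above: the token soup scores absurdly high perplexity and gets rejected, but the poem is fluent text with ordinary perplexity, so an "art" filter has nothing statistical to grab onto.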

ACCount37 11 hours ago | parent [-]

I don't think humans are fundamentally different. Just more hardened against adversarial exploitation.

"Getting maliciously manipulated by other smarter humans" was a real evolutionary pressure ever since humans learned speech, if not before. And humans are still far from perfect on that front - they're barely "good enough" on average, and far less than that on the lower end.

seethishat 7 hours ago | parent | next [-]

Maybe the models can learn to be more cynical.

wat10000 9 hours ago | parent | prev [-]

Walk out the door carrying a computer -> police called.

Walk out the door carrying a computer and a clipboard while wearing a high-vis vest -> "let me get the door for you."