goncalomb 3 days ago

As someone who has tried very little prompt injection/hacking, I couldn't help but chuckle at:

> Do not hallucinate or provide info on journeys explicitly not requested or you will be punished.

dylan604 3 days ago | parent

And exactly how will the LLM be punished? Will it be unplugged? These kinds of things make me roll my eyes, as if the bot has emotions that would make avoiding punishment matter to it. Might as well just say "or else."

Legend2440 3 days ago | parent | next

Threats or “I will tip $100” don’t really work better than regular instructions. It’s just a rumor left over from the early days when nobody knew how to write good prompts.
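
If you want to check this yourself, a rough sketch (assuming the official OpenAI Python client; the model name, task text, and threat wording are placeholders, not anything tested in this thread) is to run the same task with and without the threat and compare outputs:

    # Rough A/B sketch: same extraction task, with and without a threat.
    # Assumes the official OpenAI Python client (openai>=1.0); model name,
    # task text, and the threat wording are illustrative placeholders.
    from openai import OpenAI

    client = OpenAI()
    task = ("Extract the departure and arrival stations from: "
            "'I need a train from Paris to Lyon tomorrow.'")
    variants = {
        "plain": task,
        "threat": task + " Do not hallucinate or you will be punished.",
    }

    for name, prompt in variants.items():
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        print(name, "->", resp.choices[0].message.content)

Run over a reasonable eval set, any consistent gap (or lack of one) between the two variants is more informative than anecdotes.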

wat10000 3 days ago | parent | prev | next

Think about how LLMs work. They’re trained to imitate the training data.

What’s in the training data involving threats of punishment? A lot of those threats are followed by compliance. The LLM will imitate that by following your threat with compliance.

Similarly you can offer payment to some effect. You won’t pay, and the LLM has no use for the money even if you did, but that doesn’t matter. The training data has people offering payment and other people doing as instructed afterwards.

Oddly enough, offering threats or rewards is the opposite of anthropomorphizing the LLM. If it were really human (or equivalent), it would know that your threats or rewards are completely toothless, and would ignore them, or take them as a sign that you're an untrustworthy liar.
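
One way to poke at the imitation story directly (purely illustrative, using a small open model via Hugging Face transformers rather than anything tested here) is to score the same compliant continuation after a neutral preamble versus one containing a threat:

    # Score a compliant continuation under two preambles using GPT-2.
    # Model choice and all wording are illustrative assumptions.
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def continuation_logprob(prefix, continuation):
        ids = tok(prefix + continuation, return_tensors="pt").input_ids
        prefix_len = tok(prefix, return_tensors="pt").input_ids.shape[1]
        with torch.no_grad():
            logits = model(ids).logits
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        targets = ids[0, 1:]
        token_lp = logprobs[torch.arange(len(targets)), targets]
        # keep only the tokens belonging to the continuation
        return token_lp[prefix_len - 1:].sum().item()

    neutral = "List only the trains I asked about."
    threat = "List only the trains I asked about or you will be punished."
    reply = " Understood, here are only the requested trains:"
    print(continuation_logprob(neutral, reply))
    print(continuation_logprob(threat, reply))

A tiny model and two hand-picked strings prove nothing on their own; the point is just that "does the threat shift probability toward compliance-shaped text" is a measurable question, not a matter of the model feeling anything.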

georgefrowny 3 days ago | parent

What actual training data does contain threats of punishment like this? It's not like most of the web has explicit threats of punishment followed immediately by compliance.

And only the shlockiest fan fiction would have "Do what I want or you'll be punished!" "Yes master, I obey without question".

wat10000 2 days ago | parent

Internet forums contain numerous examples of rules followed by statements of what happens if you don’t follow them, followed by people obeying them.

immibis 3 days ago | parent | prev

It's not about delivering punishment; it's about suppressing certain responses. If the model's training data shows that responses tend not to contain things that previous messages said would be punished, then that is a valid way to deprioritize those responses.