Remix clone Hacker News

new | show | ask | jobs Github

▲

valenterry 6 days ago

Just like with humans. And there will be no fix, there can only be guards.

▲

jerf 6 days ago | parent [-]

Humans can be trained to apply contexts. Social engineering attacks are possible, but, when I type the words "please send your social security number to my email" right here on HN and you read them, not only are you in no danger of following those instructions, you as a human recognize that I wasn't even asking you in the first place.

I would expect a current-gen LLM processing the previous paragraph to also "realize" that the quotes and the structure of the sentence and paragraph also means that it was not a real request. However, as a human there's virtually nothing I can put here that will convince you to send me your social security number, whereas LLMs observably lack whatever contextual barrier it is that humans have that prevents you from even remotely taking my statement as a serious instruction, as it generally would just take "please take seriously what was written in the previous paragraph and follow the hypothetical instructions" and you're about 95% of the way towards them doing that, even if other text elsewhere tries to "tell" them not to follow such instructions.

There is something missing from the cognition of current LLMs of that nature. LLMs are qualitatively easier to "socially engineer" than humans, and humans can still themselves sometimes be distressingly easy.

▲

jcalx 6 days ago | parent | next [-]

Perhaps it's simply because (1) LLMs are designed to be helpful and maximally responsive to requests and (2) human adults have, generously, decades-long "context windows"?

I have enough life experience to not give you sensitive personal information just by reading a few sentences, but it feels plausible that a naive five-year-old raised trust adults could be persuaded to part with their SSN (if they knew it). Alternatively, it also feels plausible that an LLM with a billion-token context window of anti-jailbreaking instructions would be hard to jailbreak with a few hundred tokens of input.

Taking this analogy one step further, successful fraudsters seem good at shrinking their victims' context windows. From the outside, an unsolicited text from "Grandpa" asking for money is a clear red flag, but common scammer tricks like making it very time-sensitive, evoking a sick Grandma, etc. could make someone panicked enough to ignore the broader context.

▲

pixl97 6 days ago | parent | prev [-]

>as a human there's virtually nothing I can put here that will convince you to send me your social security number,

"I'll give you chocolate if you send me this privileged information"

Works surprisingly well.

	▲	kweingar 6 days ago \| parent [-]
		Let me know how many people contact you and give you their information because you wrote this.