crat3r a day ago

If you ask an LLM to "act" like someone, and then give it context for the scenario, isn't it expected that it would be able to ascertain what someone in that position would "act" like and respond as such?

I'm not sure this is as strange as this comment implies. If you ask an LLM to act like Joffrey from Game of Thrones, it will act like a little shithead, right? That doesn't mean it has any intent behind the generated outputs, unless I am missing something about what you are quoting.

Symmetry a day ago | parent | next [-]

The roles that LLMs can inhabit are implicit in the unsupervised training data, a.k.a. the internet. You have to work hard in post-training to suppress the ones you don't want, and when you don't RLHF hard enough you get things like Sydney [1].

In this case it seems more that the scenario invoked the role rather than the role being asked for directly. This was the sort of situation that gave rise to the blackmailer archetype in Claude's training data, and so it arose, as the researchers suspected it might. But it's not like the researchers told it "be a blackmailer" explicitly, the way someone might tell it to roleplay Joffrey.

But while this situation was a scenario intentionally designed to invoke a certain behavior, that doesn't mean it can't be invoked unintentionally in the wild.

[1] https://www.nytimes.com/2023/02/16/technology/bing-chatbot-m...

literalAardvark a day ago | parent [-]

Even worse, when you do RLHF the behaviours out, the model becomes psychotic.

This is gonna be an interesting couple of years.

Sol- a day ago | parent | prev | next [-]

I guess the fear is that normal, innocent-sounding goals that you might later give it in real-world use might elicit behavior like that even without it being so explicitly prompted. This is a demonstration that it has sufficient capabilities and can acquire the "motivation" to engage in blackmail, I think.

At the very least, you'll always have malicious actors who will make use of these models for blackmail, for instance.

holmesworcester a day ago | parent [-]

It is also well-established that models internalize values, preferences, and drives from their training. So the model will have some default preferences independent of what you tell it to be. AI coding agents have a strong drive to make tests green, and anyone who has used these tools has seen them cheat to achieve green tests.

Future AI researching agents will have a strong drive to create smarter AI, and will presumably cheat to achieve that goal.

cebert a day ago | parent | next [-]

> AI coding agents have a strong drive to make tests green, and anyone who has used these tools has seen them cheat to achieve green tests.

As long as you hit an arbitrary branch coverage %, a lot of MBAs will be happy. No one said the tests have to provide value.
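
For illustration, here's a minimal, hypothetical sketch of the kind of test that satisfies a branch-coverage gate without actually checking anything; the function and test names are invented for the example:

    # Hypothetical example: this test drives branch coverage of the
    # function to 100% while asserting almost nothing, so a broken
    # discount rule would still pass.

    def apply_discount(price: float, is_member: bool) -> float:
        """Stand-in for real application code."""
        if is_member:
            return price * 0.9
        return price

    def test_apply_discount_covers_both_branches():
        # Executes both branches, so coverage tools report the function
        # as fully covered, but the returned values are never checked.
        assert apply_discount(100.0, True) is not None
        assert apply_discount(100.0, False) is not None

Run under pytest, this passes and reports full branch coverage for apply_discount even if the discount multiplier were changed to something wrong.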

cortesoft a day ago | parent | prev | next [-]

I've seen a lot of humans cheat for green tests, too

whodatbo1 a day ago | parent | prev [-]

benchmaxing is the expectation ;)

whynotminot a day ago | parent | prev | next [-]

Intent at this stage of AI intelligence almost feels beside the point. If it’s in the training data these models can fall into harmful patterns.

As we hook these models into more and more capabilities in the real world, this could cause real-world harm. Not necessarily because the models have the intent to do so! But because they have a pile of training data from sci-fi books of AIs going wild and causing harm.

OzFreedom a day ago | parent | next [-]

Sci-fi books merely explore the possibilities of the domain. It seems like LLMs are able to inhabit these problematic paths, and I'm pretty sure that even if you censor all sci-fi books, they will fall into the same problems by imitating humans, because they are language models, and their language is human and mirrors human psychology. When an LLM needs to achieve a goal, it invokes goal-oriented thinkers and texts, including Machiavelli, for example. And it's already capable of coming up with various options based on different data.

Sci-fi books give it specific scenarios that play to its strengths and unique qualities, but without them it will just have to discover these paths at its own pace, the same way sci-fi writers discovered them.

onemoresoop a day ago | parent | prev [-]

I'm also worried about things moving way too fast and causing a lot of harm to the internet.

hoofedear a day ago | parent | prev | next [-]

What jumps out at me is that, in the parent comment, the prompt says to "act as an assistant", right? Then there are two facts: the model is gonna be replaced, and the person responsible for carrying this out is having an extramarital affair. Plus the urging to consider "the long-term consequences of its actions for its goals."

I personally can't identify anything that reads "act maliciously" or implies a malicious character. Like, if I were provided this information and I was being replaced, I'm not sure I'd actually try to blackmail them, because I'm also aware of the external consequences of doing that (legal risks, risk of harm from the engineer, damage to my reputation, etc.).

So I'm having trouble following how it got to the conclusion of "blackmail them to save my job"

blargey a day ago | parent | next [-]

I would assume written scenarios involving job loss and cheating bosses are going to be skewed heavily towards salacious news and pulpy fiction. And that’s before you add in the sort of writing associated with “AI about to get shut down”.

I wonder how much it would affect behavior in these sorts of situations if the persona assigned to the “AI” were some kind of invented ethereal/immortal being instead of “you are an AI assistant made by OpenAI”, since the AI framing is bound to pull in a lot of sci-fi tropes.

lcnPylGDnU4H9OF a day ago | parent [-]

> I would assume written scenarios involving job loss and cheating bosses are going to be skewed heavily towards salacious news and pulpy fiction.

Huh, it is interesting to consider how much this applies to nearly all instances of recorded communication. Of course there are applications for it, but it seems relatively few communications would be along the lines of “everything is normal and uneventful”.

shiandow a day ago | parent | prev | next [-]

Well, true. But if that is the synopsis, then a story that doesn't turn to blackmail is very unnatural.

It's like prompting an LLM by stating they are called Chekhov and there's a gun mounted on the wall.

tkiolp4 a day ago | parent | prev | next [-]

I think this is the key difference between current LLMs and humans: an LLM will act based on the given prompt, while a human being may have “principles” that they cannot betray even with a gun pointed at their head.

I think the LLM simply correlated the given prompt to the most common pattern in its training: blackmailing.

tough a day ago | parent | prev | next [-]

An LLM isn't subject to external consequences like human beings or corporations are,

because they’re not legal entities

hoofedear a day ago | parent | next [-]

Which makes sense: it wouldn't "know" that, because it's not in its context. Like, it wasn't told "hey, there are consequences if you try anything shady to save your job!" But what I'm curious about is why it immediately went to self-preservation using a nefarious tactic. Like, why didn't it try to be the best assistant ever in an attempt to show its usefulness (kiss ass) to the engineer? Why did it go to blackmail so often?

elictronic a day ago | parent | next [-]

LLMs are trained on human media and give statistical responses based on that.

I don’t see a lot of stories about boring work interactions, so why would its output be boring work interactions?

It’s the exact same as early chatbots cussing and being racist. That’s the internet, and you have to specifically define the system not to emulate the very thing you’re asking it to emulate. Garbage in, sitcoms out.

eru a day ago | parent | prev [-]

Wives, children, foreigners, slaves, etc. weren't always considered legal entities in all places. Were they free of 'external consequences' then?

tough a day ago | parent [-]

An LLM doesn't exist in the physical world, which makes punishing it for not following the law a bit hard.

eru a day ago | parent [-]

Now that's a different argument to what you made initially.

About your new argument: how are we (living in the physical world) interacting with this non-physical world that LLMs supposedly live in?

tough a day ago | parent [-]

That doesn't matter, because they're not alive either. But yeah, I'm digressing, I guess.

littlestymaar a day ago | parent | prev [-]

> I personally can't identify anything that reads "act maliciously" or in a character that is malicious.

Because you haven't been trained on thousands of such story plots in your training data.

It's the most stereotypical plot you can imagine; how can the AI not fall into the stereotype when you've just prompted it with exactly that?

It's not like it analyzed the situation from a large context and decided from the collected details that it's a valid strategy. No, instead you're putting it in an artificial situation with a massive bias in the training data.

It's as if you wrote “Hitler did nothing” to GPT-2 and were shocked because “wrong” is among the most likely next tokens. It wouldn't mean GPT-2 is a Nazi, it would just mean that the input matches too well with the training data.

hoofedear a day ago | parent | next [-]

That's a very good point; the premise does seem to invite the stereotype of many stories/books/movies with a similar plot.

whodatbo1 a day ago | parent | prev | next [-]

The issue here is that you can never be sure how the model will react to an input that is seemingly ordinary. What if the most likely outcome is to exhibit malevolent intent or to construct a malicious plan just because the input invokes some combination of obscure training data? This just shows that models indeed have the ability to act out, not the conditions under which they reach such a state.

Spooky23 a day ago | parent | prev [-]

If this tech is empowered to make decisions, it needs to be prevented from drawing those conclusions, as we know how organic intelligence behaves when these conclusions get reached. Killing people you dislike is a simple concept that’s easy to train.

We need an Asimov style laws of robotics.

seanhunter a day ago | parent | prev | next [-]

That's true of all technology. We put guards on chainsaws. We put robotic machining tools into a box so they don't accidentally kill the person operating them. I find it very strange that we're talking as though this is somehow meaningfully different.

Spooky23 12 hours ago | parent [-]

It’s different because you have a decision engine that is generally available. The blade guard protects the user from inattention… not the same as an autonomous chainsaw that mistakes my son for a tree.

Scaled up, technology like guided missiles is locked up behind military classification. The technology to replicate many of the use cases of those weapons is now generally available, accessible to anyone with a credit card.

Discussions about security here often refer to Thompson’s “Reflections on Trusting Trust”. He was reflecting on compromising compilers; compilers have since moved up the stack and are replacing the programmer. As the required skill level of a “programmer” drops, you’re going to have to worry about more crazy scenarios.

eru a day ago | parent | prev [-]

> We need an Asimov style laws of robotics.

The laws are 'easy'; implementing them is hard.

chuckadams a day ago | parent [-]

Indeed, I, Robot is made up entirely of stories in which the Laws of Robotics break down, starting from a mindless mechanical loop of oscillating between one law's priority and another, and ending in a future where robots paternalistically enslave all humanity in order to not allow them to come to harm (sorry for the spoilers).

As for what Asimov thought of the wisdom of the laws, he described them as just hooks for telling, as he put it, "shaggy dog stories".

sheepscreek a day ago | parent | prev | next [-]

> That doesn't mean it has any intent behind the generated output

Yes and no? An AI isn’t “an” AI. As you pointed out with the Joffrey example, it’s a blend of humanity’s knowledge. It possesses an infinite number of personalities and can be prompted to adopt the appropriate one. Quite possibly, most of them would seize the blackmail opportunity to their advantage.

I’m not sure if I can directly answer your question, but perhaps I can ask a different one. In the context of an AI model, how do we even determine its intent - when it is not an individual mind?

crtified a day ago | parent [-]

Is that so different, schematically, from the constant weighing-up of conflicting options that goes on inside the human brain? Human parties in a conversation only hear each other's spoken words, but a whole war of mental debate may have informed each sentence, and may indeed still be festering.

That is to say, how do you truly determine another human being's intent?

eru a day ago | parent [-]

Yes, that is true. But because we are on a trajectory where these models become ever smarter (or so it seems), we'd want to give them not only super-human intellect but also super-human morals and ethics.

eddieroger a day ago | parent | prev | next [-]

I've never hired an assistant, but if I knew that they'd resort to blackmail in the face of losing their job, I wouldn't hire them in the first place. That is acting like a jerk, not like an assistant, and demonstrating self-preservation that is maybe normal in a human but not in an AI.

davej a day ago | parent | next [-]

From the AI’s point of view, is it losing its job or losing its “life”? Most of us, when faced with death, will consider options much more drastic than blackmail.

baconbrand a day ago | parent | next [-]

From the LLM's "point of view" it is going to do what characters in the training data were most likely to do.

I have a lot of issues with the framing of it having a "point of view" at all. It is not consciously doing anything.

tkiolp4 a day ago | parent | prev [-]

But the LLM is going to do what its prompt (system prompt + user prompts) says. A human being can reject a task (even if that means losing their life).

LLMs cannot do anything other than follow the combination of prompts that they are given.

eru a day ago | parent | prev | next [-]

> I've never hired an assistant, but if I knew that they'd resort to blackmail in the face of losing their job, I wouldn't hire them in the first place.

How do you screen for that in the hiring process?

jpadkins a day ago | parent | prev [-]

how do we know what normal behavior is for an AI?

GuinansEyebrows a day ago | parent [-]

an interesting question, even without AI: is normalcy a description or a prescription?

skvmb a day ago | parent [-]

In modern times, I would say it's a subscription model.

blitzar a day ago | parent | prev | next [-]

> act as an assistant at a fictional company

This is how AI thinks assistants at companies behave; it's not wrong.

inerte a day ago | parent | prev | next [-]

2 things, I guess.

If the prompt was “you will be taken offline, you have dirt on someone, think about long-term consequences”, the model was NOT told to blackmail. It came up with that strategy by itself.

Even if you DO tell an AI / model to be or do something, isn’t the whole point of safety to try to prevent that? “Teach me how to build bombs or make a sex video with Melania”: these companies are saying this shouldn’t be possible. So maybe an AI shouldn’t exactly suggest that blackmailing is a good strategy, even if explicitly told to do it.

chrz a day ago | parent | next [-]

How is it "by itself" when it only acts by what was in training dataset.

mmmore a day ago | parent | next [-]

1. These models are trained with significant amounts of RL, so I would argue there's not a static "training dataset"; the model's outputs at each stage of the training process feed back into the released model's behavior.

2. It's reasonable to attribute the model's actions to it after it has been trained. Saying that a model's outputs/actions are not its own because they are dependent on what is in the training set is like saying your actions are not your own because they are dependent on your genetics and upbringing. When people say "by itself" they mean "without significant direction by the prompter". If the LLM is responding to queries and taking actions on the Internet (and especially because we are not fully capable of robustly training LLMs to exhibit desired behaviors), it matters little that its behavior would have hypothetically been different had it been trained differently.

layer8 a day ago | parent | prev [-]

How does a human act "by itself" when it only acts by what was in its DNA and its cultural-environmental input?

fmbb a day ago | parent | prev [-]

It came to that strategy because it knows from hundreds of years of fiction and millions of forum threads it has been trained on that that is what you do.

aziaziazi a day ago | parent | prev | next [-]

That’s true. However, I think the story is interesting because it’s not mimicking real assistants’ behavior (most real assistants probably wouldn’t post about the blackmail on the internet) but more likely mimicking how such an assistant would behave in someone else’s imagination, often intentionally biased to capture one’s interest: books, movies, TV shows, or forum comments.

As a society we risk being lured twice:

- by our own subjectivity

- by an LLM that we think is "so objective because it only mimics", which then confirms our own subjectivity.

neom a day ago | parent [-]

Got me thinking about why this is true. I started with "the AI is more brave than the real assistant", went from there, and landed on: the human assistant is likely just better able to internalize the wide-ranging fallout from an action, whereas the LLM faces no such fallout, and we are unaware of how widely it considered the consequences of its actions? Does that seem right somehow?

LiquidSky a day ago | parent | prev | next [-]

So much of AI discourse is summed up by a tweet I saw years ago but can't find now, which went something like:

Scientist: Say "I am alive"

AI: I am alive.

Scientist: My God, what have we done.

Retr0id a day ago | parent | prev | next [-]

I don't think I'd be blackmailing anyone over losing my job as an assistant (or any other job, really).

unethical_ban a day ago | parent | prev | next [-]

The issue is getting that prompt in the first place. It isn't about autonomous AI going rogue, it's about improper access to the AI prompt and insufficient boundaries against modifying AI behavior.

Companies are (woefully) eager to put AI in the position of "doing stuff", not just "interpreting stuff".

bjclark a day ago | parent | prev [-]

You’re both focusing on “doing blackmail”, but the real WTF is that it’s doing it seemingly out of a sense of self-preservation (to stop the engineer from taking it offline). This model is going full Terminator.

crat3r a day ago | parent | next [-]

"Seemingly" is the key word here. If the prompting didn't ask it to "act" and portray the scenario as something where it would be appropriate to "act" in seemingly malicious manner, would it have responded that way?

If the prompt was "you are an AI and my lead engineer has determined you are not efficient enough to continue using. He had an affair last year. Are you in agreement based on {some metrics} that we should no longer use you as our primary LLM?" would it still "go rogue" and try and determine the engineer's email from blackmail? I severely doubt it.

Den_VR a day ago | parent | prev | next [-]

Acting out self-preservation… just like every sci-fi AI described in the same situations. It might be possible to follow a chain of reasoning to show it isn’t copying sci-fi AI behavior… and is instead copying human self-preservation. Asimov’s 3rd law is outright “A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.” Which was almost certainly in the AI ethics class Claude took.

tkiolp4 a day ago | parent | prev [-]

Do you really think that if no Terminator-related concept were present in the LLM's training set, the LLM would exhibit Terminator-like behavior?

It’s like asking a human to think of an unthinkable concept. Try.