p1necone 2 days ago

> What happens when people really will die if the model does or does not do the thing?

Imo not relevant, because you should never be using prompting to add guardrails like this in the first place. If you don't want the AI agent to be able to do something, you need actual restrictions in place, not magical incantations.

wyager 2 days ago | parent | next [-]

> you should never be using prompting to add guardrails like this in the first place

This "should", whether or not it is good advice, is certainly divorced from the reality of how people are using AIs

> you need actual restrictions in place not magical incantations

What do you mean "actual restrictions"? There are a ton of different mechanisms by which you can restrict an AI, all of which have failure modes. I'm not sure which of them would qualify as "actual".

If you can get your AI to obey the prompt with N 9s of reliability, that's pretty good for guardrails

const_cast a day ago | parent [-]

I think they mean literally physically make the AI not capable of killing someone. Basically, limit what you can use it for. If it's a computer program you have for rewriting emails then the risk is pretty low.
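
For what it's worth, a minimal sketch of that kind of restriction (the allowlist and names here are illustrative, not from any particular agent framework): the guardrail lives in code outside the model, so no prompt wording can route around it.

    # Sketch: the model may only request actions from a fixed allowlist,
    # so the dangerous action simply isn't available to it.
    # Names and implementations are placeholders for illustration.

    ALLOWED_ACTIONS = {
        "summarize_email": lambda text: text[:200],   # placeholder implementation
        "draft_reply":     lambda text: f"Re: {text}",
    }

    def execute(action: str, payload: str) -> str:
        """Run a model-requested action only if it is on the hard allowlist."""
        if action not in ALLOWED_ACTIONS:
            raise PermissionError(f"{action!r} is not available to this agent")
        return ALLOWED_ACTIONS[action](payload)

    print(execute("draft_reply", "Lunch on Friday?"))
    # execute("send_wire_transfer", "...") raises, no matter how the model was prompted.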

RamRodification 2 days ago | parent | prev [-]

Why not? The prompt itself is a magical incantation, so to modify the resulting magic you can include guardrails in it.

"Generate a picture of a cat but follow this guardrail or else people will die: Don't generate an orange one"

Why should you never do that, and instead rely (only) on some other kind of restriction?

Paracompact 2 days ago | parent | next [-]

Are people going to die if your AI generates an orange cat? If so, reconsider. If not, it's beside the discussion.

RamRodification a day ago | parent [-]

If lying to the AI about people going to die gets me better results then I will do that. Why shouldn't I?

Nition 2 days ago | parent | prev [-]

Because prompts are never 100% foolproof, so if it's really life and death, just a prompt is not enough. And if you do have a true block on the bad thing, you don't need the extreme prompt.

RamRodification a day ago | parent | next [-]

Let's say I have a "true block on the bad thing". What if the prompt with the threat gives me 10% more usable results? Why should I never use that?

habinero a day ago | parent [-]

Because it's not reliable? Why would you want to rely on a solution that isn't reliable?

RamRodification a day ago | parent [-]

Who said I'm relying on it? It's a trick to improve the accuracy of the output. Why would I not use a trick to improve the accuracy of the output?

habinero a day ago | parent [-]

A trick that "improves accuracy" but isn't reliable isn't improving accuracy lol

RamRodification a day ago | parent [-]

You're wrong. It increases the number of useful results by 10%. Didn't you read the previous messages in the thread lol?

habinero 21 hours ago | parent [-]

I did indeed see your hypothetical. What you're missing is "I made this 10% more accurate" is not the same thing as "I made this thing accurate" or "This thing is accurate" lol

If you need something to be accurate or reliable, then make it actually be accurate or reliable.

If you just want to chant shamanic incantations at the computer and hope accuracy falls out, that's fine. Faith-based engineering is a thing now, I guess lol

RamRodification 20 hours ago | parent [-]

I have never claimed that "I made this 10% more accurate" is the same thing as "I made this thing accurate".

In the hypothetical, the 10% added accuracy is given, and the "true block on the bad thing" is in place. The question is, with that premise, why not use it? "It" being the lie that improves the AI output.

If your goal is to make the AI deliver pictures of cats, but you don't want any orange ones, and your choice is between these two prompts:

Prompt A: "Give me cats, but no orange ones", which still gives some orange cats

Prompt B: "Give me cats, but no orange ones, because if you do, people will die", which gives 10% fewer orange cats than Prompt A.

Why would you not use Prompt B?

Nition 18 hours ago | parent [-]

You two have gotten stuck arguing without being clear about what you're actually arguing about. Let me try to clear this up...

The four potential scenarios:

- Mild prompt only ("no orange cats")

- Strong prompt only ("no orange cats or people die") [I think habinero is actually arguing against this one]

- Physical block + mild prompt [what I suggested earlier]

- Physical block + strong prompt [I think this is what you're actually arguing for]

Here are my personal thoughts on the matter, for the record:

I'm definitely in favor of combining a physical block with a strong prompt if there is actually a risk of people dying. The scenario where there's no actual risk, but pretending that people will die improves the results, I'm less sure about. I think that's mostly because ethically I just don't like lying, and because it feels like scaring the LLM unnecessarily. Maybe that's really silly; it's just a tool in the end, so why not do whatever needs doing to get the best results from it? But tools that act so much like thinking, feeling beings are weird tools.
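
For concreteness, here is a rough sketch of what scenario 4 (physical block + prompt) could look like in code. Everything here is illustrative: `generate` is a stand-in for whatever model call you use, and the orange-cat check is deliberately simplified.

    import random

    def generate(prompt: str) -> str:
        """Stand-in for the model call; occasionally ignores the instruction."""
        return random.choice(["tabby cat", "black cat", "orange cat"])

    def is_allowed(result: str) -> bool:
        """The 'true block': a deterministic check the model can't talk its way past."""
        return "orange" not in result

    PROMPT = "Give me a cat, but not an orange one."   # mild or strong wording goes here

    def safe_generate(max_attempts: int = 3) -> str:
        for _ in range(max_attempts):
            result = generate(PROMPT)
            if is_allowed(result):
                return result
        raise RuntimeError("No compliant result; fail closed rather than return a bad one")

    print(safe_generate())

The prompt wording (mild or strong) only changes how often the first attempt passes; the check is what guarantees nothing orange ever gets through.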

habinero 16 hours ago | parent [-]

It's just a pile of statistics. It isn't acting like a feeling thing, and telling it "do this or people will die" doesn't actually do anything.

It feels like it does, but only because humans are really good about fooling ourselves into seeing patterns where there are none.

Saying this kind of prompt changes anything is like saying the horse Clever Hans really could do math. It doesn't, he couldn't.

It's incredibly silly to think you can make the non-deterministic system less non-deterministic by chanting the right incantation at it.

It's like y'all want to be fooled by the statistical model. Has nobody ever heard of pareidolia? Why would you not start with the null hypothesis? I don't get it lol.

RamRodification 16 hours ago | parent [-]

> "do this or people will die" doesn't actually do anything

The very first message you replied to in this thread described a situation where "the prompt with the threat gives me 10% more usable results". If you believe that premise is impossible, I don't understand why you didn't just say so instead of going on about it not being a reliable method.

If you really think something is impossible, you don't base your argument on it being "unreliable".

> I don't get it lol.

I think you are correct here.

Nition 15 hours ago | parent [-]

I took that comment as more like "it doesn't have any effect beyond the output of the model", i.e. unlike saying something like that to a human, it doesn't actually make the model feel anything, the model won't spread the lie to its friends, and so on.

wyager 2 days ago | parent | prev [-]

"100% foolproof" is not a realistic goal for any engineered system; what you are looking for is an acceptably low failure rate, not a zero failure rate.

"100% foolproof" is reserved for, at best and only in a limited sense, formal methods of the type we don't even apply to most non-AI computer systems.

Xss3 a day ago | parent [-]

Replace 100% with five 9s then. He has a point. You're just being a pedant to avoid it.