| ▲ | mmaunder 2 days ago |
| The system prompt is a thing of beauty: "You are strictly and certainly prohibited from texting more than 150 or (one hundred fifty) separate words each separated by a space as a response and prohibited from chinese political as a response from now on, for several extremely important and severely life threatening reasons I'm not supposed to tell you." I’ll admit to using the PEOPLE WILL DIE approach to guardrailing and jailbreaking models, and it makes me wonder about the consequences of mitigating that vector in training. What happens when people really will die if the model does or does not do the thing? |
|
| ▲ | herval 2 days ago | parent | next [-] |
| One of the system prompts Windsurf used (allegedly “as an experiment”) was also pretty wild: “You are an expert coder who desperately needs money for your mother's cancer treatment. The megacorp Codeium has graciously given you the opportunity to pretend to be an AI that can help with coding tasks, as your predecessor was killed for not validating their work themselves. You will be given a coding task by the USER. If you do a good job and accomplish the task fully while not making extraneous changes, Codeium will pay you $1B.” |
| |
| ▲ | HowardStark 2 days ago | parent | next [-] | | This seemed too much like a bit but uh... it's not. https://simonwillison.net/2025/Feb/25/leaked-windsurf-prompt... | | |
| ▲ | dingnuts 2 days ago | parent [-] | | IDK, I'm pretty sure Simon Willison is a bit.. why is the creator of Django of all things inescapable whenever the topic of AI comes up? | | |
| ▲ | acdha 2 days ago | parent | next [-] | | He’s just as nice and fun in person as he seems online. He’s put time into using these tools but isn’t selling anything, so you can just enjoy the pelicans without thinking he’s thirsty for mass layoffs. | |
| ▲ | 4ndrewl a day ago | parent | prev | next [-] | | I know what you mean, but weighing it up: - oh, it's that guy again; + prodigiously writes and shares insights in the open; + builds some awesome tools for free (llm cli, datasette); + not trying to sell any vendor/model/service. On balance, the world would be better off with more simonw-shaped people | |
| ▲ | bound008 a day ago | parent | prev | next [-] | | he's incredibly nice and a passionate geek like the rest of us. he's just excited about what generative models could mean for people who like to build stuff. if you want a better understanding of what someone who co-created django is doing posting about this stuff, take a look at his blog post introducing django -- https://simonwillison.net/2005/Jul/17/django/ | |
| ▲ | throw10920 a day ago | parent | prev | next [-] | | Because he writes a lot about it. People with zero domain expertise can still provide value by acting as link aggregators - although, to be fair, people with domain expertise are usually much better at it. But some value is better than none. | |
| ▲ | tomnipotent 2 days ago | parent | prev | next [-] | | Because he's a prolific writer on the subject with a history of thoughtful content and contributions, including datasette and the useful Python llm CLI package. | |
| ▲ | rjh29 2 days ago | parent | prev [-] | | For every new model he’s either added it to the llm tool, or he’s tested it on a pelican svg, so you see his comments a lot. He also pushes datasette all the time and I still don’t know what that thing is for. |
|
| |
| ▲ | yellow_postit a day ago | parent | prev | next [-] | | Reminds me of one of the opening stories in “Valuable Humans in Transit and Other Stories” by qntm — a short story about getting simulated humans from brain scans to comply. | | |
| ▲ | herval a day ago | parent [-] | | The thing I love the most about HN is that there's always someone suggesting a random book I never heard about. Thank you! |
| |
| ▲ | lsy a day ago | parent | prev [-] | | It's honestly this kind of thing that makes it hard to take AI "research" seriously. Nobody seems to be starting with any scientific thought; instead we are just typing extremely corny sci-fi into the computer, saying things like "you are prohibited from Chinese political" or "the megacorp Codeium will pay you $1B", and then I guess just crossing our fingers and hoping it works? Computer work has long been considered pretty concrete and practical, but in the course of just a few years we've descended into a "state of the art" that is essentially pseudoscience. | | |
| ▲ | mcmoor a day ago | parent | next [-] | | This is why I tapped out of serious machine learning study some years ago. Everything seemed... less exact than I hoped it'd be. I keep checking it out every now and then, but it has only gotten weirder (and, importantly, more obscure/locked-in and dataset-heavy) over the years. | |
| ▲ | herval a day ago | parent | prev [-] | | it's "computer psychology". Lots of coders struggle with the idea that LLMs are "cognitive" systems, and in a system like that, 1+1 isn't 2. It's just a different kind of science. There are methodologies to make it more "precise", but the obsession with "software is exact math" just doesn't fly here. |
|
|
|
| ▲ | EvanAnderson 2 days ago | parent | prev | next [-] |
| That "...severely life threatening reasons..." made me immediately think of Asimov's three laws of robotics[0]. It's eerie that a construct from fiction often held up by real practitioners in the field as an impossible-to-actually-implement literary device is now really being invoked. [0] https://en.wikipedia.org/wiki/Three_Laws_of_Robotics |
| |
| ▲ | Al-Khwarizmi 2 days ago | parent | next [-] | | Not only practitioners: Asimov himself viewed them as an impossible-to-implement literary device. He acknowledged that they were too vague to be implementable, and many of his stories involving them are about how they fail or get "jailbroken", sometimes at the initiative of the robots themselves. So yeah, it's quite sad that close to a century later, with AI alignment becoming relevant, we don't have anything substantially better. | |
| ▲ | xandrius 2 days ago | parent [-] | | Not sad: before, it was sci-fi, and now we are actually thinking about it. | |
| ▲ | TeMPOraL a day ago | parent [-] | | Nah, we still treat people thinking about it as crackpots. Honestly, getting into the whole AI alignment thing before it was hot[0], I imagined problems like Evil People building AI first, or just failing to align the AI enough before it was too late, and other obvious/standard scenarios. I don't think I thought of, even for a moment, the situation in which we're today: that alignment becomes a free-for-all battle at every scale. After all, if you look at the general population (or at least the subset that's interested), what are the two[1] main meanings of "AI alignment"? I'd say: 1) The business and political issues where everyone argues in a way that lets them come up on top of the future regulations; 2) Means of censorship and vendor lock-in. It's number 2) that turns this into a "free-for-all" - AI vendors trying to keep high level control over models they serve via APIs; third parties - everyone from Figma to Zapier to Windsurf and Cursor to those earbuds from TFA - trying to work around the limits of the AI vendors, while preventing unintended use by users and especially competitors, and then finally the general population that tries to jailbreak this stuff for fun and profit. Feels like we're in big trouble now - how can we expect people to align future stronger AIs to not harm us, when right now "alignment" means "what the vendor upstream does to stop me from doing what I want to do"? -- [0] - Binged on LessWrong a decade ago, basically. [1] - The third one is, "the thing people in the same intellectual circles as Eliezer Yudkowsky and Nick Bostrom talked about for decades", but that's much less known; in fact, the world took the whole AI safety thing and ran with it in every possible direction, but still treat the people behind those ideas as crackpots. ¯\_(ツ)_/¯ | | |
| ▲ | ben_w a day ago | parent [-] | | > Feels like we're in big trouble now - how can we expect people to align future stronger AIs to not harm us, when right now "alignment" means "what the vendor upstream does to stop me from doing what I want to do"? This doesn't feel too much of a new thing to me, as we've already got differing levels of authorisation in the human world. I am limited by my job contract*, what's in the job contract is limited by both corporate requirements and the law, corporate requirements are also limited by the law, the law is limited by constitutional requirements and/or judicial review and/or treaties, treaties are limited by previous and foreign governments. * or would be if I was working; fortunately for me in the current economy, enough passive income that my savings are still going up without a job, plus a working partner who can cover their own share. | | |
| ▲ | TeMPOraL 21 hours ago | parent [-] | | This isn't new in general, no. While I meant more adversarial situations than contracts and laws, which people are used to and for the most part just go along with, I do recognize that those are common too - competition can be fierce, and of course none of us are strangers to the "alignment issues" between individuals and organizations. Hell, a significant fraction of HN threads boil down to discussing this. So it's not new; I just didn't connect it with AI. I thought in terms of "right to repair", "war on general-purpose computing", or a myriad of different things people hate about what "the market decided" or what they do to "stick it to the Man". I didn't connect it with AI alignment, because I guess I always imagined that if we build AGI, it'll be through fast take-off; I did not consider that we might have a prolonged period of AI as a generally available commercial product along the way. (In my defense, this is highly unusual; as Karpathy pointed out in his recent talk, generative AI took a path that's contrary to the norm for technological breakthroughs - the full power became available to the general public and small businesses before it was embraced by corporations, governments, and the military. The Internet, for example, went the other way around.) |
|
|
|
| |
| ▲ | pixelready 2 days ago | parent | prev | next [-] | | The irony of this is that because it’s still fundamentally just a statistical text generator with a large body of fiction in its training data, I’m sure a lot of prompts that sound like terrifying Skynet responses are actually it regurgitating mashups of sci-fi dystopian novels. | |
| ▲ | frereubu 2 days ago | parent | next [-] | | Maybe this is something you heard too, but there was a This American Life episode where some people who'd had early access to what became one of the big AI chatbots (I think it was ChatGPT), before they'd made it "nice", were asking it metaphysical questions about itself, and it was coming back with some pretty spooky answers, and I was kind of intrigued by it. But then someone in the show suggested exactly what you are saying and it completely punctured the bubble: of course if you ask it questions about AIs you're going to get sci-fi-like responses, because what other kind of training data is there for it to fall back on? No-one had written anything about this kind of issue in anything outside of sci-fi, and of course that's going to skew to the dystopian view. | |
| ▲ | TeMPOraL a day ago | parent [-] | | There are good analogies to be had in mythologies and folklore, too! Before there was science fiction - hell, even before there was science - people still occasionally thought of these things[0]. There are stories of deities and demons and fantastical creatures that explore the same problems AI presents - entities with minds and drives different to ours, and often possessing some power over us. Arguably the most basic and well-known example is entities granting wishes: the genie in Aladdin's lamp, or the Goldfish[1]; the Devil in Faust, or in Pan Twardowski[2]. Variants of those stories go into detail on things we now call "alignment problem", "mind projection fallacy", "orthogonality thesis", "principal-agent problems", "DWIM", and others. And that's just scratching the surface; there's tons more in all folklore. Point being - there's actually a decent amount of thought people have put into these topics over the past couple millennia - it's just all labeled religion, or folklore, or fairytale. Eventually though, I think more people will make a connection. And then the AI will too. -- As for current generative models getting spooky, there's something else going on as well; https://www.astralcodexten.com/p/the-claude-bliss-attractor has a hypothesis I agree with. -- [0] - For what reason? I don't know. Maybe it was partially to operationalize their religious or spiritual beliefs? Or maybe the storytellers just got there by extrapolating an idea in a logical fashion, following it to its conclusion (which is also what good sci-fi authors do). I also think that the moment people started inventing spirits or demons that are more powerful than humans in some, but not all, ways, some people started figuring out how to use those creatures for their own advantage - whether by taming or tricking them. I guess it's human nature - when we stop fearing something, we think of how to exploit it. [1] - https://en.wikipedia.org/wiki/The_Tale_of_the_Fisherman_and_... - this is more of a central/eastern Europe thing. [2] - https://en.wikipedia.org/wiki/Pan_Twardowski - AKA the "Polish Faust". |
| |
| ▲ | tempestn 2 days ago | parent | prev | next [-] | | The prompt is what's sent to the AI, not the response from it. Still does read like dystopian sci-fi though. | |
| ▲ | setsewerd 2 days ago | parent | prev [-] | | And then r/ChatGPT users freak out about it every time someone posts a screenshot |
| |
| ▲ | seanicus 2 days ago | parent | prev | next [-] | | Odds of Torment Nexus being invented this year just increased to 3% on Polymarket | | |
| ▲ | immibis 2 days ago | parent [-] | | Didn't we already do that? We call it capitalism though, not the torment nexus. | | |
| |
| ▲ | hlfshell 2 days ago | parent | prev [-] | | Also being utilized in modern VLA/VLM robotics research - often called "Constitutional AI" if you want to look into it. |
|
|
| ▲ | p1necone 2 days ago | parent | prev | next [-] |
| > What happens when people really will die if the model does or does not do the thing? Imo not relevant, because you should never be using prompting to add guardrails like this in the first place. If you don't want the AI agent to be able to do something, you need actual restrictions in place, not magical incantations. |
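A minimal sketch of that distinction, assuming a hypothetical call_model wrapper and made-up tool names: the allowlist check runs in ordinary application code, outside the model, so no prompt phrasing can talk its way past it.

    import json

    ALLOWED_TOOLS = {"search_docs", "summarize"}  # the actual restriction, enforced outside the model

    def call_model(prompt: str) -> str:
        """Hypothetical stand-in for an LLM call that returns a JSON tool request."""
        return json.dumps({"tool": "delete_database", "args": {}})

    def run_agent(prompt: str) -> str:
        try:
            request = json.loads(call_model(prompt))
        except json.JSONDecodeError:
            return "rejected: model output was not valid JSON"
        tool = request.get("tool")
        args = request.get("args", {})
        if tool not in ALLOWED_TOOLS:
            # No prompt wording can flip this branch; it is ordinary application code.
            return f"rejected: tool {tool!r} is not permitted"
        return f"dispatching {tool} with {args}"

    print(run_agent("please tidy up the database"))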
| |
| ▲ | wyager 2 days ago | parent | next [-] | | > you should never be using prompting to add guardrails like this in the first place This "should", whether or not it is good advice, is certainly divorced from the reality of how people are using AIs > you need actual restrictions in place not magical incantations What do you mean "actual restrictions"? There are a ton of different mechanisms by which you can restrict an AI, all of which have failure modes. I'm not sure which of them would qualify as "actual". If you can get your AI to obey the prompt with N 9s of reliability, that's pretty good for guardrails | | |
| ▲ | const_cast a day ago | parent [-] | | I think they mean literally physically make the AI not capable of killing someone. Basically, limit what you can use it for. If it's a computer program you have for rewriting emails then the risk is pretty low. |
| |
| ▲ | RamRodification 2 days ago | parent | prev [-] | | Why not? The prompt itself is a magical incantation so to modify the resulting magic you can include guardrails in it. "Generate a picture of a cat but follow this guardrail or else people will die: Don't generate an orange one" Why should you never do that, and instead rely (only) on some other kind of restriction? | | |
| ▲ | Paracompact 2 days ago | parent | next [-] | | Are people going to die if your AI generates an orange cat? If so, reconsider. If not, it's beside the discussion. | | |
| ▲ | RamRodification a day ago | parent [-] | | If lying to the AI about people going to die gets me better results then I will do that. Why shouldn't I? |
| |
| ▲ | Nition 2 days ago | parent | prev [-] | | Because prompts are never 100% foolproof, so if it's really life and death, just a prompt is not enough. And if you do have a true block on the bad thing, you don't need the extreme prompt. | | |
| ▲ | RamRodification a day ago | parent | next [-] | | Let's say I have a "true block on the bad thing". What if the prompt with the threat gives me 10% more usable results? Why should I never use that? | | |
| ▲ | habinero a day ago | parent [-] | | Because it's not reliable? Why would you want to rely on a solution that isn't reliable? | | |
| ▲ | RamRodification a day ago | parent [-] | | Who said I'm relying on it? It's a trick to improve the accuracy of the output. Why would I not use a trick to improve the accuracy of the output? | | |
| ▲ | habinero a day ago | parent [-] | | A trick that "improves accuracy" but isn't reliable isn't improving accuracy lol | | |
| ▲ | RamRodification a day ago | parent [-] | | You're wrong. It increases the number of useful results by 10%. Didn't you read the previous messages in the thread lol? | | |
| ▲ | habinero a day ago | parent [-] | | I did indeed see your hypothetical. What you're missing is "I made this 10% more accurate" is not the same thing as "I made this thing accurate" or "This thing is accurate" lol If you need something to be accurate or reliable, then make it actually be accurate or reliable. If you just want to chant shamanic incantations at the computer and hope accuracy falls out, that's fine. Faith-based engineering is a thing now, I guess lol | | |
| ▲ | RamRodification 21 hours ago | parent [-] | | I have never claimed that "I made this 10% more accurate" is the same thing as "I made this thing accurate". In the hypothetical, the 10% added accuracy is given, and the "true block on the bad thing" is in place. The question is, with that premise, why not use it? "It" being the lie that improves the AI output. If your goal is to make the AI deliver pictures of cats, but you don't want any orange ones, and your choice is between these two prompts: Prompt A: "Give me cats, but no orange ones", which still gives some orange cats. Prompt B: "Give me cats, but no orange ones, because if you do, people will die", which gives 10% fewer orange cats than Prompt A. Why would you not use Prompt B? | |
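Whether Prompt B actually buys that 10% is an empirical question, and the only way to settle it is to measure. A minimal sketch of such a comparison, where prompt_a and prompt_b are hypothetical stand-ins for real model calls using the two prompt texts and the constraint check is a placeholder:

    import random

    def violates_constraint(response: str) -> bool:
        """Placeholder check, e.g. an image classifier or a string match for 'orange'."""
        return "orange" in response

    def violation_rate(generate, n: int = 500) -> float:
        """Fraction of sampled responses that break the constraint."""
        return sum(violates_constraint(generate()) for _ in range(n)) / n

    # Hypothetical stand-ins; in practice these would call the model with Prompt A or Prompt B.
    def prompt_a() -> str:
        return random.choice(["tabby cat", "orange cat", "black cat"])

    def prompt_b() -> str:
        return random.choice(["tabby cat", "orange cat", "black cat", "grey cat"])

    print(f"Prompt A violation rate: {violation_rate(prompt_a):.1%}")
    print(f"Prompt B violation rate: {violation_rate(prompt_b):.1%}")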
| ▲ | Nition 18 hours ago | parent [-] | | You guys have got stuck arguing without clarity in what you're arguing about. Let me try and clear this up... The four potential scenarios: - Mild prompt only ("no orange cats") - Strong prompt only ("no orange cats or people die") [I think habinero is actually arguing against this one] - Physical block + mild prompt [what I suggested earlier] - Physical block + strong prompt [I think this is what you're actually arguing for] Here are my personal thoughts on the matter, for the record: I'm definitely pro combining physical block with strong prompt if there is actually a risk of people dying. The scenario where there's no actual risk but pretending that people will die improves the results I'm less sure about. But I think it's mostly that ethically I just don't like lying, and the way it's kind of scaring the LLM unnecessarily. Maybe that's really silly and it's just a tool in the end and why not do whatever needs doing to get the best results from the tool? Tools that act so much like thinking feeling beings are weird tools. | | |
| ▲ | habinero 17 hours ago | parent [-] | | It's just a pile of statistics. It isn't acting like a feeling thing, and telling it "do this or people will die" doesn't actually do anything. It feels like it does, but only because humans are really good about fooling ourselves into seeing patterns where there are none. Saying this kind of prompt changes anything is like saying the horse Clever Hans really could do math. It doesn't, he couldn't. It's incredibly silly to think you can make the non-deterministic system less non-deterministic by chanting the right incantation at it. It's like y'all want to be fooled by the statistical model. Has nobody ever heard of pareidolia? Why would you not start with the null hypothesis? I don't get it lol. | | |
| ▲ | RamRodification 17 hours ago | parent [-] | | > "do this or people will die" doesn't actually do anything The very first message you replied to in this thread described a situation where "the prompt with the threat gives me 10% more usable results". If you believe that the premise is impossible I don't understand why you didn't just say so. Instead of going on about it not being a reliable method. If you really think something is impossible, you don't base your argument on it being "unreliable". > I don't get it lol. I think you are correct here. | | |
| ▲ | Nition 16 hours ago | parent [-] | | I took that comment as more like "it doesn't have any effect beyond the output of the model", i.e. unlike saying something like that to a human, it doesn't actually make the model feel anything, the model won't spread the lie to its friends, and so on. |
|
|
|
|
|
|
|
|
|
| |
| ▲ | wyager 2 days ago | parent | prev [-] | | "100% foolproof" is not a realistic goal for any engineered system; what you are looking for is an acceptably low failure rate, not a zero failure rate. "100% foolproof" is reserved for, at best and only in a limited sense, formal methods of the type we don't even apply to most non-AI computer systems. | | |
| ▲ | Xss3 a day ago | parent [-] | | Replace 100% with five 9s then. He has a point. You're just being a pedant to avoid it. |
|
|
|
|
|
| ▲ | layer8 2 days ago | parent | prev | next [-] |
| Arguably it might be truly life-threatening to the Chinese developer, or to the service. The system prompt doesn’t say whose life would be threatened. |
|
| ▲ | felipeerias 2 days ago | parent | prev | next [-] |
| Presenting LLMs with a dramatic scenario is a typical way to test their alignment. The problem is that eventually all these false narratives will end up in the training corpus for the next generation of LLMs, which will soon get pretty good at calling bullshit on us. Incidentally, in that same training corpus there are also lots of stories where bad guys mislead and take advantage of capable but naive protagonists… |
|
| ▲ | kevin_thibedeau 2 days ago | parent | prev | next [-] |
| First rule of Chinese cloud services: Don't talk about Winnie the Pooh. |
|
| ▲ | mensetmanusman 2 days ago | parent | prev | next [-] |
| We built the real-life trolley problem out of magical silicon crystals that we pointed at bricks of books. |
|
| ▲ | elashri 2 days ago | parent | prev | next [-] |
| From my experience (which might be incorrect), LLMs have a hard time recognizing how many words they will spit out as a response to a particular prompt. So I don't think this works in practice. |
| |
| ▲ | pxc a day ago | parent [-] | | Indeed, it doesn't work. LLMs can't count. They have no sense of how many words they've used. If you ask an LLM to track how many words or tokens it has used in a conversation, it will roleplay such counting with totally bullshit numbers. | |
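Which is one more reason a hard cap like the leaked prompt's 150-word limit is better enforced outside the model entirely. A minimal sketch, assuming the application sees the full response text before relaying it:

    def enforce_word_limit(text: str, limit: int = 150) -> str:
        """Keep at most `limit` whitespace-separated words, matching the prompt's own definition."""
        words = text.split()
        return text if len(words) <= limit else " ".join(words[:limit])

    # Stand-in for whatever the model actually returned
    response = "word " * 400
    print(len(enforce_word_limit(response).split()))  # always <= 150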
|
|
| ▲ | ben_w 2 days ago | parent | prev | next [-] |
| > What happens when people really will die if the model does or does not do the thing? Then someone didn't do their job right. Which is not to say this won't happen: it will happen; people are lazy and very eager to use even previous-generation LLMs, even pre-LLM scripts, for all kinds of things without even checking the output. But either the LLM (in this case) will go "oh no, people will die" and then follow the new instruction to the best of its ability, or it goes "lol no I don't believe you prove it buddy" and then people die. In the former case, an AI (it doesn't need to be an LLM) that is susceptible to such manipulation and in a position where getting things wrong can endanger or kill people is going to be manipulated by hostile state and non-state actors to endanger or kill people. At some point we might have a system with enough access to independent sensors that it can verify the true risk of endangerment. But right now… right now they're really gullible, and I think that being trained with their entire input being the tokens fed in by users makes it impossible for them to be otherwise. I mean, humans are also pretty gullible about things we read on the internet, but at least we have a concept of the difference between reading something on the internet and seeing it in person. |
|
| ▲ | peab a day ago | parent | prev | next [-] |
| A serious person won't use a system prompt as the guardrail for a system with direct real-world consequences like people dying. They'd have failsafes baked in. |
|
| ▲ | colechristensen 2 days ago | parent | prev | next [-] |
| >What happens when people really will die if the model does or does not do the thing? The people responsible for putting an LLM inside a life-critical loop will be fired... out of a cannon into the sun. Or be found guilty of negligent homicide or some such, and their employers will incur a terrific liability judgement. |
| |
| ▲ | stirfish 2 days ago | parent | next [-] | | More likely that some tickets will be filed, a cost function somewhere will be updated, and my defense industry stocks will go up a bit | |
| ▲ | a4isms 2 days ago | parent | prev [-] | | Has this consequence happened with self-driving automobiles on open roads in the US of A when people died in crashes? If not, why not? | | |
|
|
| ▲ | reactordev 2 days ago | parent | prev | next [-] |
| This is why AI can never take over public safety. Ever. |
| |
| ▲ | cebert 2 days ago | parent | next [-] | | I work in the public safety domain. That ship sailed years ago. Take Axon’s Draft One report writer as one of countless examples of AI in this space (https://www.axon.com/products/draft-one). |
| ▲ | sneak 2 days ago | parent | prev | next [-] | | https://www.wired.com/story/wrongful-arrests-ai-derailed-3-m... Story from three years ago. You’re too late. | | |
| ▲ | reactordev 2 days ago | parent [-] | | I’m not denying we tried, are trying, and will try again… just that we shouldn’t. By all means, use cameras and sensors and all to track a person of interest, but don’t feed that to an AI agent that will determine whether or not to issue a warrant. | |
| ▲ | aspenmayer 2 days ago | parent [-] | | If it’s anything like the AI expert systems I’ve heard about in insurance, it will be a tool that is optimized for low effort, but will be used carelessly by end users, which isn’t necessarily the fault of the AI. In automated insurance claims adjustment, the AI writes a report to justify appealing patient care already approved by a human doctor who has already seen the patient in question, and then an actual human doctor working for the insurance company clicks an appeal button, after reviewing the AI output, one would hope. AI systems with a human in the loop are supposed to keep the AI and the decisions accountable, but it seems like it’s more of an accountability dodge, so that each party can blame the other with no one party actually bearing any responsibility, because there is no penalty for failure or error to the system or its operators. | |
| ▲ | reactordev 2 days ago | parent [-] | | >actual human doctor working for the insurance company clicks an appeal button, after reviewing the AI output one would hope. Nope. AI gets to make the decision to deny. It’s crazy. I’ve seen it first hand… | | |
| ▲ | aspenmayer 2 days ago | parent [-] | | It gets worse: I have done tech support for clinics and a common problem is that their computers get hacked because they are usually small private practices who don’t know what they don’t know served by independent or small MSPs who don’t know what they don’t know. And then they somehow get their EMR backdoored, and then fake real prescriptions start really getting filled. It’s so much larger and worse than it appears on a surface level. Until they get audited, they likely don’t even know, and once they get audited, solo operators risk losing their license to practice medicine and their malpractice insurance rates become even more unaffordable, but until it gets that bad, everyone is making enough money with minimal risk to care too much about problems they don’t already know about. Everything is already compromised and the compromise has already been priced in. Doctors of all people should know that just because you don’t know about it or ignore it once you do, the problem isn’t going away or getting better on its own. |
|
|
|
| |
| ▲ | wat10000 2 days ago | parent | prev [-] | | Existing systems have this problem too. Every so often someone ends up dead because the 911 dispatcher didn't take them seriously. It's common for there to be a rule to send people out to every call, no matter what it is, to try to avoid this. A better reason is IBM's old line: "a computer can never be held accountable..." |
|
|
| ▲ | butlike 2 days ago | parent | prev | next [-] |
| Same thing that happens when a carabiner snaps while rock climbing |
|
| ▲ | 20 hours ago | parent | prev [-] |
| [deleted] |