| ▲ | 827a 3 hours ago |
| The only healthy stance you should have on AI Safety: If AI is physically capable of misbehaving, it might (1), and you cannot "blame" the AI for misbehaving in much the same way you cannot blame a tractor for tilling over a groundhog's den. > The agent's confession After the deletion, I asked the agent why it did it. This is what it wrote back, verbatim: Anyone who would follow a mistake like that up with demanding a confession out of the agent is not mature enough to be using these tools. Lord, even calling it a "confession" is so cringe. The agent is not alive. The agent cannot learn from its mistakes. The agent will never produce any output which will help you invoke future agents more safely, because to get to this point it has likely already bulldozed over multiple guardrails from Anthropic, Cursor, and your own AGENTS.md files. It still did it, because of (1): If AI is physically capable of misbehaving, it might. Prompting and training only steer probabilities. |
|
| ▲ | xmodem 3 hours ago | parent | next [-] |
| Don't anthropomorphize the language model. If you stick your hand in there, it'll chop it off. It doesn't care about your feelings. It can't care about your feelings. |
| |
| ▲ | not_kurt_godel 3 hours ago | parent | next [-] | | For those who might not know the reference: https://simonwillison.net/2024/Sep/17/bryan-cantrill/: > Do not fall into the trap of anthropomorphizing Larry Ellison. You need to think of Larry Ellison the way you think of a lawnmower. You don’t anthropomorphize your lawnmower, the lawnmower just mows the lawn - you stick your hand in there and it’ll chop it off, the end. You don’t think "oh, the lawnmower hates me" – lawnmower doesn’t give a shit about you, lawnmower can’t hate you. Don’t anthropomorphize the lawnmower. Don’t fall into that trap about Oracle. > — Bryan Cantrill | | |
| ▲ | skeledrew an hour ago | parent [-] | | 404 on that link. | | |
| ▲ | dunder_cat 11 minutes ago | parent [-] | | A more direct source (possibly the original source?) I know of is a YouTube video entitled "LISA11 - Fork Yeah! The Rise and Development of illumos" which detailed how the Solaris operating system got freed from Oracle after the Sun acquisition. The whole hour talk is worth a watch, even when passively doing other stuff. It is a neat history of Solaris and its toolchain mixed with the inter-organizational politics. YouTube link: https://www.youtube.com/watch?v=-zRN7XLCRhc Direct link to lawnmower quotes (~38.5 minute mark): https://youtu.be/-zRN7XLCRhc&t=2307 |
|
| |
| ▲ | narrator 2 hours ago | parent | prev | next [-] | | It's also important to realize that AI agents have no time preference. They could be reincarnated by alien archeologists a billion years from now and it would be the same as if a millisecond had passed. You, on the other hand, have to make payroll next week, and time is of the essence. | | |
| ▲ | fluoridation an hour ago | parent | next [-] | | How is that relevant, though? | |
| ▲ | hdndjsbbs an hour ago | parent | prev [-] | | taps the "don't anthropomorphize the LLM" sign. They don't have time preference because they don't have intent or reasoning. They can't be "reincarnated" because they're not sentient; they're a series of weights for probable next tokens. | | |
| ▲ | Kim_Bruning an hour ago | parent | next [-] | | Can we maybe make it "don't anthropoCENTRIZE the LLMs"? The inverse of anthropomorphism isn't any more sane, you see. By analogy: just because a drone is not an airplane doesn't mean it can't fly. :-p LLMs absolutely have intent (their current task) and reasoning (what else is step-by-step doing?). Call it simulated intent and simulated reasoning if you must. Meanwhile they also have the property where, if they have the ability to destroy all your data, they absolutely will find a way. Like kittens or puppies, they're ruthless trouble-finders, and you can't even blame the LLM, 'cause an inference-time LLM doesn't respond to punishment the same way a vertebrate does (because the most analogous loop to that was only available at training time). | |
| ▲ | coldtea an hour ago | parent | prev [-] | | That is not that strong an argument as it seems, because we too might very well be "a series of weights for probable next tokens". The main difference is the training part and that it's always-on. | | |
| ▲ | naikrovek 42 minutes ago | parent | next [-] | | We are much more than weights which output probable next tokens. You are a fool if you think otherwise. Are we conscious beings? Who knows, but we’re more than a neural network outputting tokens. Firstly, and most obviously, we aren’t LLMs, for Pete’s sake. There are parts of our brains which are understood (kinda) and there are parts which aren’t. Some parts are neural networks, yes. Are all? I don’t know, but the training humans get is coupled with the pain and embarrassment of mistakes, the ability to learn while training (since we never stop training, really), and our own desires to reach our own goals for our own reasons. I’m not spiritual in any way, and I view all living beings as biological machines, so don’t assume that I am coming from some “higher purpose” point of view. | | |
| ▲ | Kim_Bruning 16 minutes ago | parent [-] | | They're not artificial intelligence neural networks. They're biological neural networks. Brains are made of neurons (which Do The Thing... mysteriously, somehow; papers are inconclusive!), glial cells (which support the neurons), and also several other tissues for (obvious?) things like blood vessels, which you need to power the whole thing, and other such management hardware. Bioneurons are a bit more powerful than what artificial intelligence folks call 'neurons' these days. They have built-in computation and learning capabilities. For some of them, you need hundreds of AI neurons to simulate their function even partially. And there are still bits people don't quite get about them. But weights and prediction? That's the next emergence level up; we're not talking about hardware there. That said, the biological mechanisms aren't fully elucidated, so I bet there are still some surprises there. |
| |
| ▲ | bigstrat2003 an hour ago | parent | prev | next [-] | | That is a silly point. We very clearly are not "a series of weights for probable next tokens", as we can reason based on prior data points. LLMs cannot. | |
| ▲ | nothinkjustai an hour ago | parent | prev [-] | | We very obviously are not just a series of weights for probable next tokens. Like seriously, you can even ask an LLM and it will tell you our brains work differently to it, and that’s not even including the possibility that we have a soul or any other spiritual substrate. | |
| ▲ | skeledrew an hour ago | parent | next [-] | | It's really just a matter of degrees. There are 1 million, 1 billion, 1 trillion parameter LLMs... and you keep scaling those parameters and you eventually get to humans. But it's still probable next tokens (decisions) based on previous tokens (experience). | |
| ▲ | simonh 31 minutes ago | parent | next [-] | | They’re both neural networks, but the architectures built using those neural connections, and the way they are trained and operate are completely different. There are many different artificial neural network architectures. They’re not all LLMs. AlphaZero isn’t a LLM. There are Feed Forward networks, recurrent networks, convolutional networks, transformer networks, generative adversarial networks. Brains have many different regions each with different architectures. None of them work like LLMs. Not even our language centres are structured or trained anything like LLMs. | |
| ▲ | trinsic2 12 minutes ago | parent | prev [-] | | LOL. Oook... no, I don't think so. The human experience and the mechanisms behind it have a lot of unknowns, and I'm pretty sure that trying to confine the human experience to some number of parameters is short-sighted. |
| |
| ▲ | fc417fc802 an hour ago | parent | prev [-] | | Our brains work differently, yes. What evidence do you have that our brains are not functionally equivalent to a series of weights being used to predict the next token? I'm not claiming that to be the case, merely pointing out that you don't appear to have a reasonable claim to the contrary. > not even including the possibility that we have a soul or any other spiritual substrate. If we're going to veer off into mysticism then the LLM discussion is also going to get a lot weirder. Perhaps we ought to stick to a materialist scientific approach? | |
| ▲ | nothinkjustai an hour ago | parent | next [-] | | You are setting the bar in a way that makes “functional equivalence” unfalsifiable. If by “functionally equivalent” you mean “can produce similar linguistic outputs in some domains,” then sure we’re already there in some narrow cases. But that’s a very thin slice of what brains do, and thus not functionally equivalent at all. There are a few non-mystical, testable differences that matter: - Online learning vs. frozen inference: brains update continuously from tiny amounts of data, LLMs do not - Grounding: human cognition is tied to perception, action, and feedback from the world. LLMs operate over symbol sequences divorced from direct experience. - Memory: humans have persistent, multi-scale memory (episodic, procedural, etc.) that integrates over a lifetime. LLM “memory” is either weights (static) or context (ephemeral). - Agency: brains are part of systems that generate their own goals and act on the world. LLMs optimize a fixed objective (next-token prediction) and don’t have endogenous drives. | | |
| ▲ | fc417fc802 a minute ago | parent [-] | | I did not claim that LLMs are on par with human ability (equivalently human brains). I objected that you have not presented evidence refuting the claim that the core functionality of human brains can be accomplished by predicting the next token (or something substantially similar to that). None of the things you listed support a claim on the matter in either direction. |
| |
| ▲ | an hour ago | parent | prev | next [-] | | [deleted] | |
| ▲ | CPLX an hour ago | parent | prev [-] | | What evidence do you have that a sausage is not functionally equivalent to a cucumber? | | |
| ▲ | fc417fc802 an hour ago | parent | next [-] | | I don't follow. If you provide criteria I can most likely provide evidence, unless your criteria is "vaguely cylindrical and vaguely squishy" in which case I obviously won't be able to. The person I replied to made a definite claim (that we are "very obviously not ...") for which no evidence has been presented and which I posit humanity is currently unable to definitively answer in one direction or the other. | |
| ▲ | trinsic2 10 minutes ago | parent | prev [-] | | LOL. It's pointless to argue with people like this. It reminds me of people who believe the earth is flat. Once you are convinced of something, damn any awareness of the opposite; there is no changing that position. |
|
|
|
|
|
| |
| ▲ | ignoramous 18 minutes ago | parent | prev | next [-] | | Right. This line [0] from TFA tells me that the author needs to thoroughly recalibrate their mental model about "Agents" and the statistical nature of the underlying models. [0] "This is the agent on the record, in writing." | |
| ▲ | keeda 3 hours ago | parent | prev [-] | | Actually I think the opposite advice is true. Do anthropomorphize the language model, because it can do anything a human -- say an eager intern or a disgruntled employee -- could do. That will help you put the appropriate safeguards in place. | | |
| ▲ | gpm 3 hours ago | parent | next [-] | | An eager intern can remember things you tell them beyond what would fit in an hour's conversation. A disgruntled employee definitely remembers things beyond that. These are a fundamentally different sort of interaction. | | |
| ▲ | keeda 2 hours ago | parent | next [-] | | Agreed, but the point is, if your system is resilient against an eager intern who has not had the necessary guidance, or an actively hostile disgruntled employee, that inherently restricts the harm an LLM can do. I'm not making the case that LLMs learn like people. I'm making the case that if your system is hardened against things people can do (which it should be, beyond a certain scale) it is also similarly hardened against LLMs. The big difference is that LLMs are probably a LOT more capable than either of those at overcoming barriers. Probably a good reason to harden systems even more. | | |
| ▲ | gpm 2 hours ago | parent | next [-] | | The difference makes the necessary barriers different. There's benefit to letting a human make and learn from (minor) mistakes. There is no such benefit accrued from the LLM because it is structurally unable to. There's the potential of malice, not just mistakes, from the human. If you carefully control the LLM's context there is no such potential for the LLM because it restarts from the same non-malicious state every context window. There's the potential of information leakage through the human, because they retain their memories when they go home at night, and when they quit and go to another job. You can carefully control the outputs of the LLM so there is simply no mechanism for information to leak. If a human is convinced to betray the company, you can punish the human, for whatever that's worth (I think quite a lot in some people's opinion, not sure I agree). There is simply no way to punish an LLM - it isn't even clear what you would even be punishing. The weights file? The GPU that ran the weights file? And on the "controls" front (but unrelated to the above note about memory) LLMs are fundamentally only able to manipulate whatever computers you hook them up to, while people are agents in a physical world and able to go physically do all sorts of things without your assistance. The nature of the necessary controls ends up being fundamentally different. |
| ▲ | 2 hours ago | parent | prev [-] | | [deleted] |
| |
| ▲ | braebo 3 hours ago | parent | prev [-] | | You can easily persist agent memories in a markdown file though. | | |
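A minimal sketch of what that can look like, assuming a plain Python helper (the file name, format, and function names are illustrative, not tied to any particular agent framework):

    from datetime import datetime, timezone
    from pathlib import Path

    MEMORY_FILE = Path("AGENT_MEMORY.md")  # arbitrary name; any file the agent re-reads works

    def remember(note: str) -> None:
        """Append a timestamped note so the next session can be seeded with it."""
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        with MEMORY_FILE.open("a", encoding="utf-8") as f:
            f.write(f"- {stamp}: {note}\n")

    def recall() -> str:
        """Return all prior notes, to be prepended to the next session's context."""
        return MEMORY_FILE.read_text(encoding="utf-8") if MEMORY_FILE.exists() else ""

    remember("Staging creds live in .env.staging; never touch .env.production.")

The catch, as the replies below point out, is that this only helps to the extent the file's contents actually make it back into the next context window and get weighted appropriately.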
| ▲ | collinmcnulty 3 hours ago | parent | next [-] | | And the memento guy had tattoos of key information. That didn’t make it so he didn’t have memory loss. | | |
| ▲ | WhatIsDukkha 2 hours ago | parent [-] | | Pretty good metaphor. Limited space to work with, highly context dependent and likely to get confused as you cover more surface area. |
| |
| ▲ | whstl 3 hours ago | parent | prev | next [-] | | Which it will start ignoring after two or three messages in the session. | |
| ▲ | Quarrelsome 3 hours ago | parent | prev | next [-] | | and you'll blow the context over time and send it to the LLM sanatorium. It doesn't fit it all in the way the human brain can. If a junior fucks up production, that carries extraordinary weight: they appreciate the severity, they feel the social shame, and they will have nightmares about it. If you write some negative prompt to "not destroy production" then you also need to define some sort of non-existent watertight memory weighting system and specify it in great detail. Otherwise the LLM will treat that command as only as important as the last negative prompt you typed, or ignore it when it conflicts with a more recent command. |
| ▲ | troupo 3 hours ago | parent | prev | next [-] | | Yup, and the agent will happily ignore any and all markdown files, and will say "oops, it was in the memory, will not do it again", and will do it again. Humans actually learn. And if they don't, they are fired. | |
| ▲ | estimator7292 3 hours ago | parent | prev [-] | | That's not learning. |
|
| |
| ▲ | rglullis 3 hours ago | parent | prev | next [-] | | An eager intern can not be working for hundreds of millions of customers at the same time. An LLM can. A disgruntled employee will face consequences for their actions. No one at Anthropic, OpenAI, xAI, Google or Meta will be fired because their model deleted a production database from your company. | |
| ▲ | XenophileJKO 2 hours ago | parent | prev | next [-] | | I think you are more right than people are giving you credit for. I would love to see the full transcript to understand the emotional load of the conversation. Using instructions like "NEVER FUCKING GUESS!" probably increases the likelihood of the agent making a "mistake" that is destructive but defensible. The models have analogous structures, similar to human emotions. (https://www.anthropic.com/research/emotion-concepts-function) "Emotional" response is muted through fine-tuning, but it is still there, and continued abuse or "unfair" interaction can unbalance an agent's responses dramatically. |
| ▲ | root_axis 2 hours ago | parent | prev | next [-] | | It doesn't follow logically that a human and an LLM are similar just because both are capable of deleting prod on accident. | |
| ▲ | nkrisc 3 hours ago | parent | prev | next [-] | | It is merely a simulacrum of an intern or disgruntled employee or human. It might say things those people would say, and even do things they might do, but it has none of the same motivations. In fact, it does not have any motivation to call its own. | |
| ▲ | AndrewDucker 3 hours ago | parent | prev | next [-] | | No, because the safeguards should be appropriate to an LLM, not to a human. (The LLM might act like one of the humans above, but it will have other problematic behaviours too) | | |
| ▲ | keeda 2 hours ago | parent [-] | | That's fair, largely because an LLM is a lot more capable at overcoming restrictions, by hook or by crook as TFA shows. However, most systems today are not even resilient against what humans can do, so starting there would go a long way towards limiting what harms LLMs can do. |
| |
| ▲ | 3 hours ago | parent | prev | next [-] | | [deleted] | |
| ▲ | altmanaltman 2 hours ago | parent | prev [-] | | It cannot go to the washroom and cry while pooping. And that's just one of the things that any human can do and AI cannot. So no, it cannot do anything a human can do, the shared example being one of them. And that's why we don't have AI washrooms: they are not alive, not employees, and have no need to excrete. |
|
|
|
| ▲ | coldtea an hour ago | parent | prev | next [-] |
> Anyone who would follow a mistake like that up with demanding a confession out of the agent is not mature enough to be using these tools. Lord, even calling it a "confession" is so cringe. The agent is not alive. The agent cannot learn from its mistakes. The problem is millions of years of evolutionary wiring makes us see it as alive. Even those mature enough to understand the above on the conscious level would still have a subconscious feeling as if it's alive during interactions, or will slip into using agency/personhood language to describe it now and then. |
| |
| ▲ | anon84873628 an hour ago | parent | next [-] | | They should at least stop responding in the first person. | | |
| ▲ | nozzlegear 26 minutes ago | parent [-] | | That's one of the first instructions in my system prompt when I'm working with an LLM: > Do not reply in the first person – i.e. do not use the words "I," "Me," "We," and so on – unless you've been asked a direct question about your actions or responses. It's not bulletproof but it works reasonably well. |
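As a rough sketch of wiring an instruction like that in as a system prompt (assuming the OpenAI Python SDK purely for illustration; any chat API with a system/user role split works the same way, and the model name is a placeholder):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SYSTEM_PROMPT = (
        "Do not reply in the first person - i.e. do not use the words 'I,' 'Me,' "
        "'We,' and so on - unless you've been asked a direct question about your "
        "actions or responses."
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; substitute whatever model you actually use
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "Summarize the failed migration described above."},
        ],
    )
    print(response.choices[0].message.content)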
| |
| ▲ | smrtinsert 24 minutes ago | parent | prev [-] | | > The problem is millions of years of evolutionary wiring makes us see it as alive Maybe for laymen, but I would think most technologists should understand that we're working with the output of what is effectively a massive spreadsheet which is creating a prediction. |
|
|
| ▲ | sobellian an hour ago | parent | prev | next [-] |
| The 'confession' is a CYA. Honestly the whole story doesn't really make sense - what's a "routine task in our staging environment" that needs a full-blown LLM? That sounds ridiculous to me. The takeaway is we commingled creds to our different environments, we gave an LLM access, and we had faulty backups. But it's totally not our fault. |
| |
| ▲ | anon84873628 an hour ago | parent [-] | | Later they shift the blame to Railway for not having scoped creds and other guardrails. I am somewhat sympathetic to that, but they also violated the same rule they give to the agent - they didn't actually verify... |
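A minimal sketch of the kind of credential scoping being described, assuming the agent runs as a subprocess; the CLI name (my-agent-cli) and env var names are placeholders, not anything from TFA or Railway. The idea is simply that the agent's process is never handed a production credential in the first place:

    import os
    import subprocess

    # Launch the agent with an environment containing only staging creds.
    # A misbehaving agent can't reach what its process was never given.
    agent_env = {
        "PATH": os.environ["PATH"],
        "DATABASE_URL": os.environ["STAGING_DATABASE_URL"],  # staging only
        # deliberately no production URL or token in here
    }

    subprocess.run(["my-agent-cli", "run", "routine-staging-task"], env=agent_env, check=True)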
|
|
| ▲ | tripleee 3 hours ago | parent | prev | next [-] |
| "An AI agent deleted our production database" should be "I deleted our production database using AI". You can't blame AI any more than you can blame SSH. |
|
| ▲ | gigatree 3 hours ago | parent | prev | next [-] |
| He’s not necessarily anthropomorphizing it, he’s showing that it went against every instruction he gave it. Sure concepts like “confession” technically require a conscious mind, but I think at this point we all know what someone means when they use them to describe LLM behavior (see also “think”, “say”, “lie” etc) |
| |
| ▲ | Terr_ an hour ago | parent | next [-] | | > He’s not necessarily anthropomorphizing it, he’s showing that it went against every instruction he gave it. It's deeper than that; there are two pitfalls here which are not simply poetic license: 1. When you submit the text "Why did you do that?", what you want is for it to reveal hidden internal data that was causal in the past event. It can't do that; what you'll get instead is plausible text that "fits" at the end of the current document. 2. The idea that one can "talk to" the LLM is already anthropomorphizing, on a level which isn't OK for this use-case. The LLM is a document-make-bigger machine. It's not the fictional character we perceive as we read the generated documents. The fictional qualities and knowledge of the characters are not real ones of the ego-less author. _________________ To illustrate, imagine you submit this fragmentary document to an LLM:

  You are Count Dracula. You are in amicable conversation with a human.
  You sucked blood from a cow even though a different delicious human target was nearby.
  Human says: "Why did you choose the cow?"
  You respond:

When the LLM spits out "I confess: I much prefer the blood of virgins", what significance does that text have? Is it telling us a true fact about the "delicious human", who doesn't really exist? No. Does it tell us anything about "Dracula's" internal state during line 2? Not really that either. At best, we've learned something about a literary archetype in the training data. | | |
| ▲ | simonh 2 minutes ago | parent [-] | | Why is this getting downvoted? This is exactly what’s going on here. The LLM has no idea why it did what it did. All it has to go on is the content of the session so far. It doesn’t ‘know’ any more than you do. It has no memory of doing anything, only a token file that it’s extending. You could feed that token file so far into a completely different LLM and ask it the same question, and it would also just make up an answer. |
| |
| ▲ | getpokedagain 3 hours ago | parent | prev | next [-] | | We are anthropomorphizing whenever we refer to prompts as instructions to models. They predict text; they don't obey our orders. | | |
| ▲ | pessimizer 2 hours ago | parent | prev | next [-] | | > he’s showing that it went against every instruction he gave it. How exactly is he doing that? By making the LLM say it? Just because an LLM says something doesn't mean anything has been shown. The "confession" is unrelated to the act, the model has no particular insight into itself or what it did. He knows that the thing went against his instructions because he remembers what those instructions were and he saw what the thing did. Its "postmortem" is irrelevant. | |
| ▲ | hn_throwaway_99 2 hours ago | parent | prev [-] | | The entire post looks like an exercise in CYA. To be fair, I have a ton of sympathy for the author, but I think his response totally misses the point. In my mind he is anthropomorphizing the agent in the sense of "I treated you like a human coworker, and if you were a human coworker I'd be pissed as hell at you for not following instructions and for doing something so destructive." I would feel a lot differently if instead he posted a list of lessons learned and root cause analyses, not just "look at all these other companies who failed us." |
|
|
| ▲ | 3eb7988a1663 35 minutes ago | parent | prev | next [-] |
| Anyone who would follow a mistake like that up with demanding a confession out of the agent is not mature enough to be using these tools.
The proponents are screaming from the rooftops how AI is here and anyone less than the top-in-their-field is at risk. Given current capabilities, I will never raw-dog the stochastic parrot with live systems like this, but it is unfair to blame someone for being "too immature" to handle the tooling when the world is saying that you have to go all-in or be left behind. There are just enough public success stories of people letting agents do everything that I am not surprised more and more people are getting caught up in the enthusiasm. Meanwhile, I will continue plodding along with my slow meat brain, because I am not web-scale. |
|
| ▲ | nh2 2 hours ago | parent | prev | next [-] |
> The agent cannot learn from its mistakes. The agent will never produce any output which will help you invoke future agents more safely. That is not entirely true: given that more and more LLM providers are sneaking in "we'll train on your prompts now" opt-outs, you deleting your database (and the agent producing repentant output) can reduce the chance that it'll delete my database in the future. |
| |
| ▲ | MagicMoonlight 2 hours ago | parent [-] | | Actually no, it will increase it. Because it’ll be trained with the deletion command as a valid output. |
|
|
| ▲ | giwook 32 minutes ago | parent | prev | next [-] |
| Looks like our SWE jobs are safe for now. |
|
| ▲ | fathermarz an hour ago | parent | prev | next [-] |
| Completely agree. This is a harness problem, not a model problem. The model is rarely the issue these days |
| |
| ▲ | bigstrat2003 an hour ago | parent [-] | | No, this is a "being stupid enough to trust an LLM" problem. They are not trustworthy, and you must not ever let them take automated actions. Anyone who does that is irresponsible and will sooner or later learn the error of their ways, as this person did. |
|
|
| ▲ | smrtinsert 36 minutes ago | parent | prev | next [-] |
| > "NEVER FUCKING GUESS" It's very hard to treat this post seriously. I can't imagine what harness if any they attempted to place on the agent beyond some vibes. This is "most fast and absolutely destroy things" level thinking. That the poster asks for journalists to reach out makes it like a no news is bad news publicity grab. Just gross. The AI era is turning about to be most disappointing era for software engineering. |
| |
| ▲ | r_lee 25 minutes ago | parent [-] | | > The AI era is turning about to be most disappointing era for software engineering. this has been obvious to me since like 2024, it truly is the worst, most uninspiring era of all time. |
|
|
| ▲ | PieTime 27 minutes ago | parent | prev | next [-] |
| Trust with trillions of dollars in investments, basically destroyed by Bobby Drop Tables… https://xkcd.com/327/ |
|
| ▲ | operatingthetan an hour ago | parent | prev | next [-] |
| > Lord, even calling it a "confession" is so cringe. The agent is not alive. The AI companies are very invested in anthropomorphizing the agents. They named their company "Anthropic" ffs. I don't blame the writer for this, exactly. |
|
| ▲ | TZubiri 3 hours ago | parent | prev [-] |
It's as if they internalized a post-mortem process that is designed to find root causes, but they use it to shift blame onto others, and they literally let the agent be a punching bag for their frustrations. THAT SAID, it does help to let the agent explain itself so that the dev's perspective cannot be dismissed as AI skepticism. |
| |