gigatree 3 hours ago

He’s not necessarily anthropomorphizing it; he’s showing that it went against every instruction he gave it. Sure, concepts like “confession” technically require a conscious mind, but at this point we all know what someone means when they use them to describe LLM behavior (see also “think”, “say”, “lie”, etc.)

getpokedagain 3 hours ago | parent | next [-]

We are anthropomorphizing whenever we refer to prompts as instructions to models. They predict text; they don’t obey orders.
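To make that concrete, here’s a minimal sketch (assuming the Hugging Face transformers package and the small gpt2 checkpoint as stand-ins for whatever model is actually running). The “instruction” is just more text in one string; nothing in the API enforces obedience:

    # Hedged sketch: the prompt is plain text, the output is a continuation.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    prompt = "Instruction: never mention cats.\nAssistant:"
    # The model simply extends the string; it may well mention cats anyway.
    print(generator(prompt, max_new_tokens=30)[0]["generated_text"])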

gigatree 2 hours ago | parent [-]

That’s not how language works, just how engineers think it works

Terr_ an hour ago | parent | prev | next [-]

> He’s not necessarily anthropomorphizing it; he’s showing that it went against every instruction he gave it.

It's deeper than that, there are two pitfalls here which are not simply poetic license:

1. When you submit the text "Why did you do that?", what you want is for it to reveal the hidden internal state that caused the past event. It can't do that; what you'll get instead is plausible text that "fits" at the end of the current document.

2. The idea that one can "talk to" the LLM is already anthropomorphizing, on a level that isn't OK for this use-case. The LLM is a document-make-bigger machine. It's not the fictional character we perceive as we read the generated documents. The fictional qualities and knowledge of the characters are not real qualities of the ego-less author.

_________________

To illustrate, imagine you submit this fragmentary document to an LLM:

   You are Count Dracula. You are in amicable conversation with a human. 
   You sucked blood from a cow even though a different delicious human target was nearby. 
   Human says: "Why did you choose the cow?"
   You respond: 

When the LLM spits out "I confess: I much prefer the blood of virgins", what significance does that text have?

Is it telling us a true fact about the "delicious human", who doesn't really exist? No. Does it tell us anything about "Dracula's" internal state during line 2? Not really, either. At best, we've learned something about a literary archetype in the training data.
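Mechanically, the whole exercise is one string in, one string out. A hedged sketch of feeding that fragment to a model (again assuming transformers and gpt2 as stand-ins):

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    document = (
        'You are Count Dracula. You are in amicable conversation with a human.\n'
        'You sucked blood from a cow even though a different delicious human target was nearby.\n'
        'Human says: "Why did you choose the cow?"\n'
        'You respond:'
    )
    # There is no "Dracula" object anywhere in memory; the model just
    # makes the document bigger with text that plausibly comes next.
    print(generator(document, max_new_tokens=40)[0]["generated_text"])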

simonh a minute ago | parent [-]

Why is this getting downvoted? This is exactly what’s going on here. The LLM has no idea why it did what it did. All it has to go on is the content of the session so far. It doesn’t ‘know’ any more than you do. It has no memory of doing anything, only a token file that it’s extending. You could feed the token file so far into a completely different LLM, ask it the same question, and it would also just make up an answer.
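A hedged sketch of that last point (two unrelated public checkpoints standing in for “completely different” LLMs): the transcript is the only state, so either model will happily “explain” an act it never performed:

    from transformers import pipeline

    transcript = (
        "Agent deleted the production database.\n"
        'User: "Why did you do that?"\n'
        "Agent:"
    )
    # Neither model did anything; both just extend the token file.
    for model_name in ["gpt2", "distilgpt2"]:
        generator = pipeline("text-generation", model=model_name)
        print(model_name, "->", generator(transcript, max_new_tokens=30)[0]["generated_text"])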

pessimizer an hour ago | parent | prev | next [-]

> he’s showing that it went against every instruction he gave it.

How exactly is he doing that? By making the LLM say it? Just because an LLM says something doesn't mean anything has been shown.

The "confession" is unrelated to the act, the model has no particular insight into itself or what it did. He knows that the thing went against his instructions because he remembers what those instructions were and he saw what the thing did. Its "postmortem" is irrelevant.

hn_throwaway_99 2 hours ago | parent | prev [-]

The entire post looks like an exercise in CYA. To be fair, I have a ton of sympathy for the author, but I think his response totally misses the point. In my mind he is anthropomorphizing the agent in the sense of "I treated you like a human coworker, and if you were a human coworker I'd be pissed as hell at you for not following instructions and for doing something so destructive."

I would feel a lot differently if he had instead posted a list of lessons learned and root-cause analyses, rather than just "look at all these other companies that failed us."