What is the right understanding of how LLMs work and what is the correct diagnosis?

As I said, I believe statistical physics is a very good intuitional guidance. Molecules move randomly. That does not mean a cup of water will spontaneously boil itself. Sometimes the probability of something happening is so low that even if it's not mathematically zero it does not matter because you'll never observe it in the known universe.

LLM generating each token probabilistically does not mean there's a realistic chance of generating any random stuff, where we can define "realistic" as "If we transform the whole known universe into data centers and run this model until the heat death of the universe, we will encounter it at least once."

Of course that does not mean LLMs are infallible. It fails all the time! But you can't explain it as a fundamental shortcoming of a probabilistic structure: that's not a logical argument.

Or, back to the original discussion, the fact that this one particular LLM generated a command to delete the database is not a fundamental shortcoming of LLM architecture. It's just a shortcoming of LLMs we currently have.

▲

maxbond 2 hours ago | parent [-]

I kinda feel like we're talking across purposes, so I'd like to understand what our disagreement actually is.

In distributional language modeling, it is assumed that any series of tokens may appear and we are concerned with assigning probabilities to those sequences. We don't create explicit grammars that declare some sequences valid and others invalid. Do you disagree with that? Why?

No matter how much prompting you give the agent, it does not eliminate the possibility that it will produce a dangerous output. It is always possible for the agent to produce a dangerous output. Do you disagree with that? Why?

The only defensible position is to assume that there is no output your agent cannot produce, and so to assume it will produce dangerous outputs and act accordingly. Do you disagree with that? Why?

	▲	yongjik 31 minutes ago \| parent [-]
		I think I've already explained my position, and I don't have any deeper insight than that, so I'll be only repeating myself. But to repeat one more time: when talking about probability, there's something like "not mathematically zero, but the probability is so low that we can assume that it will just never happen." And it's good that we can think that way, because we also follow the rules of statistical and quantum physics, which are inherently probabilistic. So, basically, you can say the same things about people. There's a nonzero (but extremely small) probability that I'll suddenly go mad and stab the next person. There's a nonzero (but even smaller) probability that I'll spontaneously erupt into a cloud of lethal pathogen that will destroy humanity. Yada yada. Yet, nobody builds houses under the assumption that one of the occupants would transform into a lethal cloud, and for good reason. Yes, it does sound a bit more absurd when we apply it to humans. But the underlying principle is very similar. (I think this will be my last comment here because I'm just repeating myself.)