maxbond | 2 hours ago
I kinda feel like we're talking at cross purposes, so I'd like to understand what our disagreement actually is.

In distributional language modeling, it is assumed that any sequence of tokens may appear, and we are concerned with assigning probabilities to those sequences. We don't create explicit grammars that declare some sequences valid and others invalid. Do you disagree with that? Why?

No matter how much prompting you give the agent, it does not eliminate the possibility that it will produce a dangerous output. It is always possible for the agent to produce a dangerous output. Do you disagree with that? Why?

The only defensible position is to assume that there is no output your agent cannot produce, and therefore to assume it will produce dangerous outputs and act accordingly. Do you disagree with that? Why?
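(A minimal sketch of the claim above, with made-up logits: because a softmax is strictly positive, an autoregressive LM assigns nonzero probability to every token at every step, so no token sequence has probability exactly zero.)

```python
import math

def softmax(logits):
    # Numerically stable softmax; every output is strictly positive
    # because exp(x) > 0 for all finite x.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a tiny 3-token vocabulary; the third token
# is strongly disfavored but still gets nonzero probability mass.
logits = [10.0, 2.0, -30.0]
probs = softmax(logits)
assert all(p > 0 for p in probs)

# A sequence's probability is a product of per-token probabilities,
# so it is positive (if tiny) whenever every factor is positive.
seq_prob = probs[2] ** 5  # e.g. emitting the disfavored token 5 times
assert seq_prob > 0
```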
yongjik | 29 minutes ago | parent
I think I've already explained my position, and I don't have any deeper insight than that, so I'll only be repeating myself. But to repeat one more time: when talking about probability, there's a category of "not mathematically zero, but the probability is so low that we can assume it will just never happen." And it's good that we can think that way, because we ourselves follow the rules of statistical and quantum physics, which are inherently probabilistic.

So, basically, you can say the same things about people. There's a nonzero (but extremely small) probability that I'll suddenly go mad and stab the next person. There's a nonzero (but even smaller) probability that I'll spontaneously erupt into a cloud of lethal pathogen that will destroy humanity. Yada yada. Yet nobody builds houses under the assumption that one of the occupants will transform into a lethal cloud, and for good reason.

Yes, it does sound a bit more absurd when we apply it to humans. But the underlying principle is very similar.

(I think this will be my last comment here because I'm just repeating myself.)