TuringTest 19 hours ago

> If you tell this robot to take a knife and cut onions, alignment means it isn't going to take the knife and chop up your wife

Yeah, I agree that alignment is a desirable property. The problem is that it can't really be achieved by changing the trained weights: alleviated, yes; eliminated, no.

> we can greatly reduce the probabilities they will show it

You can change the a priori probabilities, which means the undesired behaviour will rarely be seen.

The thing is, the concept then provides a false sense of security. Even if the immoral behaviours are not common, they will eventually appear if you run chains of thought long enough, or if enough people use the model, approaching it from different angles and situations.
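A back-of-the-envelope calculation shows why rarity doesn't help (assuming, for simplicity, independent interactions with a fixed per-interaction misbehaviour probability; real failures are correlated, but the scaling effect is the same):

    # Probability of at least one misaligned output across n independent
    # interactions, each failing with probability p: 1 - (1 - p)**n.
    def p_any_failure(p: float, n: int) -> float:
        return 1.0 - (1.0 - p) ** n

    # A "one in a million" failure rate becomes near-certain at scale.
    for n in (1_000, 1_000_000, 100_000_000):
        print(f"p=1e-6, n={n:>11,}: {p_any_failure(1e-6, n):.4f}")
    # p=1e-6, n=      1,000: 0.0010
    # p=1e-6, n=  1,000,000: 0.6321
    # p=1e-6, n=100,000,000: 1.0000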

It's the same as with hallucinations. The problem is not how frequent they are; the most severe problem is that their appearance is unpredictable, so the model needs constant supervision: you have to vet every single one of its generations, as none of them can be trusted by default. Under these conditions, the concept of alignment is far less helpful than expected.

pixl97 14 hours ago | parent

>then the concept provides a false sense of security. Even if the immoral behaviours are not common, they will eventually appear if you run chains of thought long enough, or if enough people use the model, approaching it from different angles and situations.

Correct, this is also why humans have a non-zero crime/murder rate.

>Under these conditions, the concept of alignment is severely less helpful than expected.

Why? What you're asking for is a machine that never breaks. If you want that, build yourself a finite state machine; just don't expect you'll ever get anything that looks like intelligence from it.

TuringTest 7 hours ago | parent

> Why? What you're asking for is a machine that never breaks.

No, I'm saying that 'alignment' is a concept that doesn't help solve the problems that will appear when the machine ultimately breaks; in fact it makes them worse, because it doesn't account for when that will happen, as there's no way to predict that moment.

Following your metaphor of criminals: you can get humans to behave lawfully through social pressure, with others watching your behaviour and influencing it. And if someone nevertheless breaks the law, the police are there to stop them from doing it again.

None of this applies to an "aligned" AI. It faces no social pressure; its behaviour depends only on its own trained weights. So you would need to create a police force for robots that monitors the AI and stops it from doing harm. And it had better be a human police force, or it will suffer the same alignment problems. Thus, alignment alone is not enough, and it's a problem if people rely on it alone to trust the AI to work ethically.
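To make the regress concrete: such a "robot police" would itself just be another program vetting every proposed action before execution. A minimal sketch of that shape (all the names here, check_action, Verdict, supervised_execute, are hypothetical, not any real API; a real monitor would be vastly more complex and, per the argument above, has alignment problems of its own):

    # Hypothetical external runtime monitor: a second system that vets
    # every proposed action before the agent is allowed to execute it.
    from dataclasses import dataclass

    @dataclass
    class Verdict:
        allowed: bool
        reason: str

    def check_action(action: str) -> Verdict:
        # Stand-in policy check for illustration only.
        banned = ("harm", "weapon")
        for word in banned:
            if word in action.lower():
                return Verdict(False, f"matched banned term: {word}")
        return Verdict(True, "no policy violation found")

    def supervised_execute(action: str) -> None:
        verdict = check_action(action)
        if verdict.allowed:
            print(f"executing: {action}")
        else:
            print(f"blocked: {action} ({verdict.reason})")

    supervised_execute("slice the onions")
    supervised_execute("wave the knife at a person to cause harm")

The point being: the checker is just more code or another model, so trusting it raises exactly the same question you started with.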