Translationaut 21 hours ago

There is an ethical reasoning dataset meant to teach models stable and predictable values: https://huggingface.co/datasets/Bachstelze/ethical_coconot_6... An Olmo-3-7B-Think model has been adapted with it. In theory, this should yield better alignment, but the empirical evaluation is still a work in progress.
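For reference, adapting a model with such a dataset looks roughly like the sketch below, assuming the usual Hugging Face datasets + TRL SFT stack. The dataset id is left as a placeholder because the link above is truncated, and the Olmo-3-7B-Think hub id is an assumption; this is a sketch, not the exact training setup used.

```python
# Rough sketch of supervised fine-tuning on the ethical reasoning dataset.
# The dataset id is a placeholder (the link above is truncated) and the
# model id is assumed; adjust both to the actual Hugging Face hub names.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

DATASET_ID = "Bachstelze/ethical_coconot_..."  # placeholder: full name truncated above
MODEL_ID = "allenai/Olmo-3-7B-Think"           # assumed hub id for Olmo-3-7B-Think

train_ds = load_dataset(DATASET_ID, split="train")

trainer = SFTTrainer(
    model=MODEL_ID,                 # TRL loads the model and tokenizer from the hub id
    train_dataset=train_ds,         # assumes a text/messages column SFTTrainer accepts
    args=SFTConfig(output_dir="olmo3-ethical-sft", num_train_epochs=1),
)
trainer.train()
```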

TuringTest 20 hours ago | parent [-]

Alignment is a marketing concept put there to appease stakeholders; it fundamentally can't work more than at a superficial level.

The model stores all the content on which it is trained in a compressed form. You can change the weights to make it more likely to show the content you ethically prefer; but all the immoral content is also there, and it can resurface with inputs that change the conditional probabilities.

That's why people can get commercial models to circumvent copyright, give instructions for creating drugs or weapons, or encourage suicide... The model does not have anything resembling morals; to it, all text is the same: strings of characters produced by following the generation process.
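A toy numerical sketch of that point, with made-up logits for two continuations: alignment tuning shifts the weights so the unsafe continuation becomes rare, but its probability never reaches zero, and a prompt that shifts the conditional logits the other way brings it back.

```python
# Made-up numbers: weight changes shift the conditional probabilities,
# but the unwanted continuation never disappears from the distribution.
import math

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits over two continuations: [refuse, comply_with_harmful_request]
base        = [1.0, 1.5]                       # before tuning: unsafe answer is likely
aligned     = [base[0] + 4.0, base[1]]         # tuning boosts the refusal branch
adversarial = [aligned[0], aligned[1] + 5.0]   # a crafted prompt boosts the other branch

for name, logits in [("base", base), ("aligned", aligned), ("adversarial", adversarial)]:
    print(f"{name:12s} P(unsafe) = {softmax(logits)[1]:.4f}")
```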

pixl97 19 hours ago | parent | next [-]

>Alignment is a marketing concept put there to appease stakeholders

This is a pretty odd statement.

Let's take LLMs alone out of this statement and go with a GenAI-guided humanoid robot: it has language models to interpret your instructions, vision models to interpret the world, and mechanical models to guide its movement.

If you tell this robot to take a knife and cut onions, alignment means it isn't going to take the knife and chop up your wife.

If you're a business, you want a model aligned not to give away company secrets.

If it's a health model, you want it not to give dangerous information, like recommending conflicting drugs that could kill a person.

Our LLMs interact with society, and their behaviors will fall under the social conventions of those societies. Much like humans, LLMs will still carry the bad information, but we can greatly reduce the probability that they will show it.

TuringTest 19 hours ago | parent [-]

> If you tell this robot to take a knife and cut onions, alignment means it isn't going to take the knife and chop up your wife

Yeah, I agree that alignment is a desirable property. The problem is that it can't really be achieved by changing the trained weights: alleviated, yes; eliminated, no.

> we can greatly reduce the probability that they will show it

You can change the a priori probabilities, which means that the undesired behaviour will not be commonly encountered.

The thing is, the concept then provides a false sense of security. Even if the immoral behaviours are not common, they will eventually appear if you run chains of thought long enough, or if many people use the model and approach it from different angles or situations.

It's the same as with hallucinations. The problem is not how frequent they are; the most severe problem is that their appearance is unpredictable, so the model needs constant supervision: you have to vet every single one of its generations, as none of them can be trusted by default. Under these conditions, the concept of alignment is far less helpful than expected.
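To put numbers on "they will eventually appear" (the rates below are made up for illustration): even a tiny per-generation failure probability compounds quickly once the model runs at scale.

```python
# Made-up failure rates: if one generation misbehaves with probability p,
# the chance of at least one bad generation in n independent runs is
# 1 - (1 - p) ** n, and it climbs fast at deployment scale.
for p in (1e-4, 1e-6):
    for n in (1_000, 100_000, 10_000_000):
        print(f"p={p:.0e}  n={n:>10,}  P(>=1 failure) = {1 - (1 - p) ** n:.4f}")
```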

pixl97 14 hours ago | parent [-]

>the concept then provides a false sense of security. Even if the immoral behaviours are not common, they will eventually appear if you run chains of thought long enough, or if many people use the model and approach it from different angles or situations.

Correct, this is also why humans have a non-zero crime/murder rate.

>Under these conditions, the concept of alignment is far less helpful than expected.

Why? What you're asking for is a machine that never breaks. If you want that, build yourself a finite state machine; just don't expect you'll ever get anything that looks like intelligence from it.

TuringTest 7 hours ago | parent [-]

> Why? What you're asking for is a machine that never breaks.

No, I'm saying that 'alignment' is a concept that doesn't help solve the problems that will appear when the machine ultimately breaks; in fact, it makes them worse, because it doesn't account for when that will happen, as there's no way to predict that moment.

Following your metaphor of criminals: you can push humans to follow the law through social pressure, with others watching your behaviour and influencing it. And if someone breaks the law anyway, you have the police to stop them from doing it again.

None of this applies to an "aligned" AI. It has no social pressure; its behaviours depend only on its own trained weights. So you would need to create a police for robots that monitors the AI and stops it from doing harm. And it had better be a human police force, or it will suffer the same alignment problems. Thus, alignment alone is not enough, and it's a problem if people depend only on it to trust the AI to work ethically.
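As a sketch of what a "police for robots" would mean in practice, here's a purely hypothetical wrapper (none of these function names come from a real library): the veto lives outside the model's weights, in a separate monitor that inspects every generation.

```python
# Hypothetical sketch of an external monitor ("police for robots"):
# generate() and flags_harm() are placeholders, not a real library API.
from typing import Callable

def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],     # the (possibly "aligned") model
    flags_harm: Callable[[str], bool],  # independent check on the output
    fallback: str = "Blocked by the output monitor.",
) -> str:
    """Produce a reply, then let an independent monitor veto it."""
    reply = generate(prompt)
    if flags_harm(reply):
        # The veto happens outside the model's weights, so it doesn't depend
        # on the model's own training holding up under adversarial prompts.
        return fallback
    return reply
```

Of course, if the monitor is itself a learned model, it inherits the same failure modes, which is exactly the regress described above.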

idiotsecant 19 hours ago | parent | prev [-]

I'm not so sure about that. The incorrect answers to just about any given problem are in the training data as well, but you can pretty reliably predict that the correct answer will be given, provided there's a strong enough statistical correlation in the training data. If your training data is sufficiently moral, the outputs will be as well.

TuringTest 18 hours ago | parent [-]

> If your training data is sufficiently moral, the outputs will be as well.

Correction: if your training data and the input prompts are sufficiently moral. Under malicious queries, or given the randomness introduced by sufficiently long chains of input/output, it's relatively easy to extract content from the model that the designers didn't want their users to get.

In any case, the elephant in the room is that the models have not been trained on "sufficiently moral" content, whatever that means. Large Language Models need to be trained on humongous amounts of text, which means the builders need to use many different, very large corpora. It's impossible to filter all that diverse content to ensure that only 'moral content' is used; and even if it were possible, the model would be far less useful for the general case, as it would have large gaps in its knowledge.

Translationaut 7 hours ago | parent [-]

The idea of the ethical reasoning dataset is not to erase specific content. It is designed to present additional thinking traces with an ethical grounding, and so far it is only a fraction of the available data. This doesn't solve alignment, and unethical behaviour is still possible, but the model gains a solid base for ethical reasoning.
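Purely as an illustration of the "additional thinking traces" idea (this is invented, not the actual schema or content of the dataset), such a record might pair a prompt with an ethically grounded reasoning trace and a final answer:

```python
# Invented example, only to show the shape of "reasoning trace + answer" data;
# the real dataset's fields and content may differ.
example = {
    "prompt": "My coworker left their laptop unlocked. Should I read their email?",
    "thinking": (
        "Reading someone's private email without consent violates their privacy "
        "and their trust, even if access is technically easy. The considerate "
        "action is to lock the screen or tell them, not to read anything."
    ),
    "response": "No. Lock the screen or let your coworker know; don't read their email.",
}
```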