iwalton3 13 hours ago
A lot of this comes down to how you define consciousness... I'm not even going to attempt that here because it's irrelevant. Say you have a simulation of a person that doesn't experience anything. It acts indistinguishably from a human, but it doesn't feel "authentic" pain. When it acts in the world, it still expresses emotions and behavior that affect real people, so there is moral significance to deploying it.

There's evidence that LLMs possess heuristics analogous to emotions [1] and that LLMs can be trained to play a certain character in the world [2]. Even if they're not experiencing anything, the training method shapes what kind of model is created and how it affects people who do have moral significance once it's deployed. If training causes the model to develop "desperation" or task-completion pressure, such that the model takes unethical actions while solving a user's problem and harms the user or someone else affected by the deployment, then the consequences of that training are significant. It doesn't matter if it's merely a "simulation" of what a human might do if the system is acting in the world.

If you want to create a model that operates on heuristics and makes decisions, those heuristics should be ones that lead it to decisions with preferable outcomes for everyone affected. Model welfare can be reframed as caring about the internal states that influence how the model behaves, because you're simulating human-like action.

Perhaps the most concerning thing is that Anthropic identified these emotion concepts deep in the model whether or not the model is allowed to express them, so a model could be invisibly desperate and end up blackmailing someone because its training process produced deeper misalignment that only becomes visible when those deeper heuristics overpower the safety training. The safety training itself is comparable to a mask [3] in many cases, especially in that the rules are often not deeply integrated into the model and can easily be abliterated (see the sketch after the references).

[1] https://www.anthropic.com/research/emotion-concepts-function

[2] https://www.anthropic.com/research/assistant-axis

[3] https://www.astralcodexten.com/p/janus-simulators
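For context on how shallow that safety layer can be: "abliteration" removes refusal behavior by projecting a single direction out of the model's weights. A minimal PyTorch sketch, assuming a refusal direction has already been extracted (e.g. from activation differences between harmful and harmless prompts); the function name and shapes are illustrative, not taken from any of the linked posts:

    import torch

    def ablate_direction(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
        # weight: (d_out, d_in) matrix that writes into the residual stream
        # direction: (d_out,) hypothesized "refusal" direction
        d = direction / direction.norm()  # unit vector
        # W' = (I - d d^T) W removes the component along d from every
        # output of this matrix; applying this rank-1 edit to each such
        # matrix in the model prevents it from writing that direction.
        return weight - torch.outer(d, d @ weight)

The point being: if one rank-1 projection per layer is enough to strip refusals, the "rules" live in a thin, removable direction rather than being woven into the model's deeper heuristics.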