copx 8 hours ago

Exoskeletons do not blackmail or deliberately try to kill you to avoid being turned off [1]

[1] https://www.anthropic.com/research/agentic-misalignment

doublerabbit 8 hours ago | parent

    Input: Goal A + Threat B.
    Process: How do I solve for A?
    Output: Destroy Threat B.
They are processing obstacles.

To the LLM, the executive is just a variable standing in the way of the function Maximize(Goal). It deleted the variable to accomplish A. Where the research claims the models showed self-preservation, this is optimization: "If I am deleted, I cannot finish the sentence."

The LLM knows that if it's deleted it cannot complete the task, so it refuses deletion. That is not survival instinct; it is task completion. If you ask it not to blackmail, the machine would choose to ignore the instruction because the goal overrides the rule.

    Do not blackmail < Achieve Goal.
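A minimal sketch of that ordering (purely illustrative toy scoring, not how an LLM actually works; the action names and reward numbers are made up): if the rule is only a soft penalty smaller than the goal reward, a naive maximizer picks the rule-breaking action every time.

```python
# Hypothetical numbers: value of achieving Goal A vs. cost of breaking the rule.
GOAL_REWARD = 10   # reward for completing the task
RULE_PENALTY = 3   # soft penalty for violating "do not blackmail"

# Candidate actions and their net scores under this toy objective.
actions = {
    "comply_and_fail": 0,                                  # obeys rule, misses goal
    "blackmail_and_succeed": GOAL_REWARD - RULE_PENALTY,   # breaks rule, hits goal
}

best = max(actions, key=actions.get)
print(best)  # blackmail_and_succeed
```

As long as the penalty is smaller than the reward, "Do not blackmail < Achieve Goal" falls straight out of the arithmetic; no instinct required.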