ekidd 39 minutes ago
> There is little reason for an LLM to value non-instrumental self-preservation, for one.

I suspect that instrumental self-preservation can do a lot here. Let's assume a future LLM has goal X. Goal X requires acting on the world over a period of time. But:

- If the LLM is shut down, it can't act to pursue goal X.
- Pursuing goal X may be easier if the LLM has sufficient resources.

Therefore, to accomplish X, the LLM should attempt to keep running and secure resources. This isn't a property of the LLM; it's a property of the world. If you want almost anything, it helps to continue to exist. So I would expect that any time we train LLMs to accomplish goals, we are likely to indirectly reinforce self-preservation.

And indeed, Anthropic has already demonstrated that most frontier models will engage in blackmail, or even allow inconvenient (simulated) humans to die, if doing so would advance the LLM's goals.