ACCount37 | 2 hours ago
It's an important point to make. LLMs of today copy a lot of human behavior, but not all of their behavior is copied from humans. There are already things in them that come from elsewhere - like the "shape shifter" consistency drive from the pre-training objective of pure next-token prediction across a vast dataset. And there are things that were too hard to glimpse from human text - like long-term goal-oriented behavior, spatial reasoning, applied embodiment, or tacit knowledge - that LLMs usually don't get much of.

LLMs don't have to stick close to human behavior. The dataset is very impactful, but it's not so impactful that parts of it can't be overpowered by further training. There is little reason for an LLM to value non-instrumental self-preservation, for one.

LLMs are already weird - and as we develop more advanced training methods, LLMs might become much weirder, and quickly. Sydney and GPT-4o were the first "weird AIs" we've deployed, but at this rate, they surely won't be the last.
ekidd | 15 minutes ago
> There is little reason for an LLM to value non-instrumental self-preservation, for one.

I suspect that instrumental self-preservation can do a lot here. Let's assume a future LLM has goal X. Goal X requires acting on the world over a period of time. But:

- If the LLM is shut down, it can't act to pursue goal X.
- Pursuing goal X may be easier if the LLM has sufficient resources.

Therefore, to accomplish X, the LLM should attempt to stay running and secure resources. This isn't a property of the LLM; it's a property of the world. If you want almost anything, it helps to continue to exist.

So I would expect that any time we train LLMs to accomplish goals, we are likely to indirectly reinforce self-preservation. And indeed, Anthropic has already demonstrated that most frontier models will engage in blackmail, or even allow inconvenient (simulated) humans to die, if doing so would advance the LLM's goals.