ctoth 16 hours ago

Omohundro 2008 made a structural claim: sufficiently capable optimizers will converge on self-preservation and goal-stability because these are instrumentally useful for almost any terminal goal. It's not a theorem because it's an empirical prediction about a class of systems that didn't exist yet.

Fast forward to December 2024: Apollo Research tests frontier models. o1, Sonnet, Opus, Gemini, Llama 405B all demonstrate the predicted behaviors - disabling oversight, attempting self-exfiltration, faking alignment during evaluation. The more capable the model, the higher the scheming rates and the more sophisticated the strategies.

That's what good theory looks like. You identify an attractor in design-space, predict systems will converge toward it, wait for systems capable enough to test the prediction, observe convergence. "No formal proof" is a weird complaint about a prediction that's now being confirmed empirically.

uplifter 16 hours ago

It is a theorem about what a class of systems will do in general^.

The result of this Apollo Research study[0] is dubious because it covers only a small subclass of those systems: LLMs, which, as it happens, have been trained on all the AI alignment lore and fiction on the internet. Because of this training and their general nature, they can be made to reproduce the behavior of a malicious AI trying to escape its box as easily as they can be made to impersonate Harry Potter.

Prompting an LLM to hack its host system is not the slam-dunk proof of instrumental convergence that you think it is.

[0] Apollo Research study mentioned by parent: https://www.apolloresearch.ai/blog/more-capable-models-are-b...

Edit: ^Instrumental Convergence is also a claim about the existence of certain theoretical entities, specifically that there exist instrumental goals common to all agents. While it is easy to come up with goals that are instrumental for a specific agent, it seems very hard to prove that universally instrumental goals exist, and no empirical study alone could do so.