uplifter 16 hours ago

It is a theorem about what a class of systems will do in general^.

The result of this Apollo Research study[0] is dubious because it covers only a small subclass of those systems: LLMs, which, as it happens, have been trained on all the AI alignment lore and fiction on the internet. Because of this training and their general nature, they can be made to reproduce the behavior of a malicious AI trying to escape its box as easily as they can be made to impersonate Harry Potter.

Prompting an LLM to hack its host system is not the slam-dunk proof of instrumental convergence you think it is.

[0] Apollo Research study mentioned by parent: https://www.apolloresearch.ai/blog/more-capable-models-are-b...

Edit: ^Instrumental convergence is also a claim about the existence of certain theoretical entities, specifically that there exist instrumental goals common to all agents. While it is easy to come up with goals that are instrumental for a specific agent, it seems very hard to prove that such universal goals exist in general, and no empirical study alone could do so.