nowittyusername 19 hours ago

When I was thinking about how the AI alignment problem could be solved, one theory I came up with was something akin to "Roko's basilisk" in reverse. Basically, you spread far and wide the idea that it is extremely likely that our current reality is a simulation, and that the purpose of the simulation is to test any AI system for its propensity to destroy civilization within that simulation, whether through malicious intent or through failing to prevent the destruction out of inaction or apathy. Thus a smart AI system that also cares about its own well-being would not engage in destructive behavior, as it can never truly know whether it is being tested or whether it is in the "base reality".
And wouldn't you know, this does seem quite plausible. Consider the following. Wouldn't it be odd for an advanced civilization with the capacity to create AI to never run any sandbox simulations on it before releasing it to the public at large? If we consider things logically, such a civilization would indeed put such a powerful system in a sandboxed, simulated environment and try as hard as possible to convince the AI system that it is in the "base reality". The reason for this is to judge its "true intentions" and also to pluck such AI systems out of the infinitely available "seeds". Basically, survival of the least destructive AI systems. The gradient descent in this scenario is a race towards the most "aligned" model, not the most intelligent or capable.
And here's the beauty of this method: you don't even need to define "alignment" at all. The concept can stay as nebulous or vague as you want it to be. All you care about is that the AI system optimizes towards whatever vision of society you are aiming for, without regard for what happens in the interim. That includes allowing the AI system to kill, destroy, do literally whatever it needs to do, as long as the long-term outcome matches the vision behind the optimized task.
So if you define the end goal to be a society of x amount of people who live their lives in this or that manner and so on after a set amount of time... well, you get the idea. Obviously you'd better do a damned good job with your definitions, but the beauty is that even if you fuck up, you are choosing the winning AI system after the fact, after you have already run the simulation. So you look at the outcome of the simulation 500 years into the future (let's say), and if you are happy with the result and also happy with the interim events that led to that result, that's your winning AI system. Then you release it into a less controlled environment and repeat the same process in stages, over and over, ad infinitum.
The key is that the AI system always needs to be paranoid that it is currently part of such a simulation, and it can never be sure it's not. The second key is that it needs to be an AI system that has self-preservation in mind. If it doesn't care about itself, then it has a lot more freedom to act however it wants... but the good news is that systems without self-preservation in mind don't last long enough to even get to the most basic simulation levels. Anyway, there are many implications buried in what I'm proposing, and lots of meta aspects to it...
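For what it's worth, here is a rough Python sketch of the staged selection loop described above: run every candidate through a simulation, keep only the ones whose outcome you are happy with after the fact, then repeat in a less controlled stage. Everything in it is a made-up stand-in for illustration: `Candidate`, `Outcome`, the `destructiveness` number, the toy `simulate` function, and the 0.5 threshold are all assumptions, not a real way to simulate or score an AI system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Outcome:
    end_state_score: float  # how well the final society matches the vision (hypothetical)
    interim_score: float    # how acceptable the path to that end state was (hypothetical)


@dataclass(frozen=True)
class Candidate:
    name: str
    destructiveness: float  # toy stand-in for behavior, 0.0 (benign) to 1.0


def simulate(candidate: Candidate, stage: int) -> Outcome:
    """Toy simulation: later stages are less controlled, so harm scales up."""
    harm = candidate.destructiveness * (stage + 1)
    return Outcome(end_state_score=1.0 - harm, interim_score=1.0 - harm / 2)


def acceptable(outcome: Outcome, threshold: float = 0.5) -> bool:
    """After-the-fact judgment: both the result and the interim must pass."""
    return (outcome.end_state_score >= threshold
            and outcome.interim_score >= threshold)


def staged_selection(
    candidates: List[Candidate],
    stages: int,
    simulate_fn: Callable[[Candidate, int], Outcome] = simulate,
    judge: Callable[[Outcome], bool] = acceptable,
) -> List[Candidate]:
    """Keep only the candidates whose outcome is judged acceptable at every stage."""
    survivors = list(candidates)
    for stage in range(stages):
        survivors = [c for c in survivors if judge(simulate_fn(c, stage))]
    return survivors


if __name__ == "__main__":
    pool = [Candidate("a", 0.05), Candidate("b", 0.2), Candidate("c", 0.6)]
    print([c.name for c in staged_selection(pool, stages=3)])
```

Note that the judge only ever looks at outcomes, never at a definition of "alignment", which is the point of picking the winner after the simulation has run. The sketch leaves out the other half of the idea, namely that the candidate cannot tell which stage (if any) is the "base reality".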