▲ | whodatbo1 a day ago | |
The issue here is that you can never be sure how the model will react based on an input that is seemingly ordinary. What if the most likely outcome is to exhibit malevolent intent or to construct a malicious plan just because it invokes some combination of obscure training data. This just shows that models indeed have the ability to act out, not under which conditions they reach such a state. |