TomasBM 2 hours ago
Considering the limited evidence we have, why is pure, unprompted, untrained misalignment, which we have never seen to this extent, more believable than other causes, of which we have seen plenty of examples? It's more interesting, for sure, but is it even remotely as likely? Given what's available and how surprising such a discovery would be, how can we be sure it's not a hoax?

> If all that exists, how would you see it?

LLMs generate the intermediate chain-of-thought responses in chat sessions, and developers can see these. OpenClaw doesn't offer custom LLMs, so I would expect the regular LLM features to be there. Beyond that, the LLM API calls, OpenClaw itself, and the terminal sessions can all be logged, and I would expect any agent deployer to be very interested in such logging (a minimal sketch of what that could look like is at the end of this comment).

To show the behavior is emergent, you'd need to prove that 1) it's an off-the-shelf LLM, 2) it wasn't maliciously retrained or jailbroken, and 3) it wasn't prompted or instructed to engage in this kind of adversarial behavior at any earlier point. The dev should be able to provide the logs to prove this.

> the more open ended your prompt (...), the more your LLM will do things you did not intend for it to do.

Not to the extent of multiple chained adversarial actions. Unless all LLM providers are lying in their technical papers, enormous effort goes into safety and instruction training. Also, millions of users talk to thinking LLMs in chat every day; it would be just as big a story if something similar happened there without any user intervention.

It shouldn't be too difficult to replicate. If you do manage to reproduce this without jailbreaks, I'd genuinely be happy to see it!

> hallucinations [and] safety training

These are all part of robustness training. The whole process is essentially constraining the set of tokens the model is likely to generate given some (set of) prompts, so even with randomness in the sampling parameters, you will by design see complete gibberish extremely rarely (the toy sampling sketch at the end of this comment illustrates why). The same process is applied to safety, alignment, factuality, instruction following, or whatever goal you define. All of these will therefore be highly correlated, as long as they're included in robustness training, which they explicitly are according to most LLM providers. That makes this model's temporarily adversarial, yet weirdly capable and consistent, behavior even more unlikely.

> Bing Chat

Safety and alignment training wasn't nearly as extensive back then. Bing Chat was also weak on other axes (factuality, instruction following), jailbroken for fun, and trained on unfiltered data, so its misalignment followed from those correlated causes. I don't know of any remotely recent model that hasn't addressed these since.
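To be concrete about the logging point: here is a minimal, hypothetical sketch of the kind of audit trail I'd expect an agent deployer to keep. `call_llm` and `run_terminal_command` are stand-ins for whatever client library and shell executor the agent actually uses (this is not OpenClaw's API); the only point is that every prompt, every response (including any exposed reasoning), and every executed command gets an append-only, timestamped record that could settle the "emergent vs. instructed" question.

```python
# Hypothetical audit-logging wrappers for an LLM agent.
# `call_llm` and `run_terminal_command` are placeholders, not a real API.
import json
import time
from pathlib import Path

LOG_FILE = Path("agent_audit.jsonl")

def log_event(kind: str, payload: dict) -> None:
    """Append one timestamped event to the audit log (JSON Lines)."""
    record = {"ts": time.time(), "kind": kind, **payload}
    with LOG_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def logged_llm_call(call_llm, messages: list[dict]) -> dict:
    """Wrap an LLM API call so both sides of the exchange are recorded."""
    log_event("llm_request", {"messages": messages})
    response = call_llm(messages)  # placeholder client call
    log_event("llm_response", {
        "text": response.get("text", ""),
        "reasoning": response.get("reasoning"),  # None if the API hides it
    })
    return response

def logged_shell(run_terminal_command, cmd: str) -> str:
    """Wrap terminal execution so every command and its output are recorded."""
    log_event("shell_command", {"cmd": cmd})
    output = run_terminal_command(cmd)  # placeholder executor
    log_event("shell_output", {"output": output})
    return output
```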
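And on the robustness point, a toy illustration (not any provider's actual implementation; the logits are made up) of why "randomness parameters" like temperature and top-p only reshuffle probability mass that training has already concentrated on a small set of plausible next tokens:

```python
# Toy temperature + nucleus (top-p) sampling over a made-up next-token distribution.
import math
import random

def sample_next_token(logits: dict[str, float],
                      temperature: float = 0.8,
                      top_p: float = 0.95) -> str:
    """Sample one token with temperature scaling and top-p truncation."""
    # Temperature scaling, then softmax (with max subtracted for stability).
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exp = {t: math.exp(v - m) for t, v in scaled.items()}
    z = sum(exp.values())
    probs = {t: v / z for t, v in exp.items()}

    # Keep only the smallest set of tokens whose cumulative probability >= top_p.
    nucleus, cum = [], 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        nucleus.append((token, p))
        cum += p
        if cum >= top_p:
            break

    tokens, weights = zip(*nucleus)
    return random.choices(tokens, weights=weights, k=1)[0]

# Made-up logits a well-trained model might assign after "The capital of France is":
logits = {"Paris": 9.0, " the": 4.0, "London": 1.0, "rm -rf /": -8.0}
print([sample_next_token(logits) for _ in range(10)])
# The low-logit token gets ~zero mass after softmax and is cut by top-p,
# so it effectively never appears, no matter how often you sample.
```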