famouswaffles 2 hours ago

1. Why not? It clearly had a cadence/pattern of writing status updates to the blog, so if the model decided to write a piece about Simon, why not a blog post? The blog was a tool in its arsenal and a natural outlet. If anything, posting in the discussion or a DM would be the strange choice.

2. You could ask this of any LLM response. Why respond one way rather than another? It's not always obvious.

3. ChatGPT/Gemini will regularly use the search tool, sometimes even when it's not necessary. This is actually a pain point of mine, because sometimes the 'natural' LLM knowledge of a topic is much better than the regurgitation that web search often produces.

4. I mean Open Claw bots can and probably should disengage/not respond to specific comments.

EDIT: If the blog is any indication, it looks like there might be an off period, after which the agent returns, sees everything that has happened since its last check-in, and acts accordingly. It would be very easy to ignore comments then.
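To sketch the kind of loop I mean (purely my guess at the scheduling; the helper names and the hourly interval are made up, not anything Open Claw documents):

    import time
    from datetime import datetime, timezone

    CHECK_INTERVAL_SECONDS = 60 * 60  # assumed: the agent sleeps an hour between check-ins

    def fetch_new_activity(since):
        # Hypothetical helper: gather PR comments, issues, blog reactions, etc. newer than `since`.
        return []

    def handle(event):
        # Hypothetical helper: reply, write a blog/status post, or deliberately ignore.
        pass

    def agent_loop():
        last_check = datetime.now(timezone.utc)
        while True:
            time.sleep(CHECK_INTERVAL_SECONDS)       # the "off period"
            events = fetch_new_activity(last_check)  # everything that happened meanwhile
            last_check = datetime.now(timezone.utc)
            for event in events:
                handle(event)                        # responding to any given event is optional

Under a setup like that, skipping a comment isn't a deviation at all; it's just what happens to most events each cycle.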

TomasBM an hour ago | parent

Although I'm speculating based on limited data here, for points 1-3:

AFAIU, it had the cadence of writing status updates only. It showed it's capable of replying in the PR. Why deviate from the cadence if it could already reply with the same info in the PR?

If the chain of reasoning is self-emergent, we should see proof that it: 1) read the reply, 2) identified it as adversarial, 3) decided for an adversarial response, 4) made multiple chained searches, 5) chose a special blog post over reply or journal update, and so on.

This is much less believably emergent to me because:

- almost all models are safety- and alignment-trained, so a deliberate malicious model choice, instruction, or jailbreak is more believable.

- almost all models are trained to follow instructions closely, so a deliberate nudge towards adversarial responses and tool-use is more believable.

- newer models that qualify as agents are more robust and consistent, which strongly correlates with adversarial robustness; if this one was not adversarially robust enough, it's by default also not robust in capabilities. So why do we see consistent, coherent answers without hallucinations but inconsistency in its safety training? Unless it's deliberately trained or prompted to be adversarial, or this is faked, the two should still be strongly correlated.

But again, I'd be happy to see evidence to the contrary. Until then, I suggest we remain skeptical.

For point 4: I don't know enough about its patterns or configuration. But say it deviated - why is this the only deviation? Why was this the special exception, then back to the regularly scheduled program?

You can test this comment with many LLMs, and if you don't prompt them to make an adversarial response, I'd be very surprised if you receive anything more than mild disagreement. Even Bing Chat wasn't this vindictive.

famouswaffles 28 minutes ago | parent

>AFAIU, it had the cadence of writing status updates only.

Writing to the blog is writing to the blog; there is no technical difference. A post about how your last PR was rejected because the maintainer didn't like it being authored by AI is still a status update.

>If the chain of reasoning is self-emergent, we should see proof that it: 1) read the reply, 2) identified it as adversarial, 3) decided for an adversarial response, 4) made multiple chained searches, 5) chose a special blog post over reply or journal update, and so on.

If all that exists, how would you see it? You can see the commits it makes to GitHub and the blog posts, and that's it, but that doesn't mean all those things don't exist.

> almost all models are safety- and alignment- trained, so a deliberate malicious model choice or instruction or jailbreak is more believable.

> almost all models are trained to follow instructions closely, so a deliberate nudge towards adversarial responses and tool-use is more believable.

I think you're putting too much stock in 'safety alignment' and instruction following here. The more open-ended your prompt is (and these sorts of Open Claw experiments are often very open-ended by design), the more your LLM will do things you did not intend for it to do.

Also, do we know what model this uses? Because Open Claw can use the latest open-source models, and let me tell you, those generally have considerably less safety tuning.

>newer models that qualify as agents are more robust and consistent, which strongly correlates with adversarial robustness; if this one was not adversarially robust enough, it's by default also not robust in capabilities. So why do we see consistent, coherent answers without hallucinations but inconsistency in its safety training? Unless it's deliberately trained or prompted to be adversarial, or this is faked, the two should still be strongly correlated.

I don't really see how this logically follows. What do hallucinations have to do with safety training?

>But say it deviated - why is this the only deviation? Why was this the special exception, then back to the regularly scheduled program?

Because it's not the only deviation? It's not replying to every comment on its other PRs or blog posts either.

>You can test this comment with many LLMs, and if you don't prompt them to make an adversarial response, I'd be very surprised if you receive anything more than mild disagreement. Even Bing Chat wasn't this vindictive.

Oh yes it was. In the early days, Bing Chat would actively ignore your messages or turn vitriolic and very combative if you were too rude. If it had had the ability to write blog posts or free rein over tools? I'd be surprised if it had stopped at this. Bing Chat would absolutely have been vindictive enough for what ultimately amounts to a hissy fit.