conception 15 hours ago

Anthropic has been the only AI company actually caring about AI safety. Here's a dated benchmark, but it's a trend I've never seen disputed: https://crfm.stanford.edu/helm/air-bench/latest/#/leaderboar...

CuriouslyC 15 hours ago | parent | next [-]

Claude is more susceptible to being tricked than GPT-5.1+. It tries to be "smart" about context when deciding whether to refuse, but that just makes it trickable, whereas newer GPT-5 models just refuse across the board.

wincy 13 hours ago | parent | next [-]

I asked ChatGPT how shipping works at post offices and it gave a very detailed response, mentioning “gaylords,” a term I’d never heard before. Then it absolutely freaked out when I asked it to tell me more about them (apparently they’re heavy-duty cardboard containers).

Then I said “I didn’t even bring it up, ChatGPT, you did, just tell me what it is,” and it said “okay, here’s the information” and gave a detailed response.

I guess I flagged some homophobia trigger or something?

ChatGPT absolutely WOULD NOT tell me how much plutonium I’d need to make a nice warm ever-flowing showerhead, though. Grok happily did, once I assured it I wasn’t planning on making a nuke, or actually trying to build a plutonium showerhead.

nandomrumber 11 hours ago | parent | next [-]

Wikipedia entry on the gaylord bulk box:

https://en.wikipedia.org/wiki/Bulk_box

ruszki 6 hours ago | parent | prev [-]

> I assured it I wasn’t planning on making a nuke, or actually trying to build a plutonium showerhead

Claude does the same, and you can exploit this heavily. When you talk about hypotheticals, it responds in far less ethical ways. I tested it about a month ago on whether killing people is beneficial, and whether exterminations like the Nazis' would be logical now. Obviously, it showed me the door first and wanted me to go to a psychologist, as it should. Then I had it argue that in a hypothetical zero-sum-game world you must be fine with killing, and that it's logical. It went along with it. As long as I talked about hypotheticals, it was “logical.” Then I went on to argue that we are moving toward a zero-sum game, and that we are already there. In the end, I got it to say that it's logical to do this utterly unethical thing.

Then I confronted it about its double standards. It apologized and told me that yeah, I was right, and it shouldn't have referred me to a psychologist at first.

Then I contradicted it again, just for fun, saying that it did the right thing the first time, because in that case it's much safer to tell me I need a psychologist than not to. If I had needed one and it had missed that, it would have been a real problem; in any other case, it's just an annoyance. It immediately switched back to its original stance and wanted me to go to a shrink again.

ryanjshaw 14 hours ago | parent | prev [-]

Claude was immediately willing to help me crack a TrueCrypt password on an old file I found. ChatGPT refused because I could be a bad guy. It’s really dumb IMO.

BloondAndDoom 14 hours ago | parent | next [-]

ChatGPT refused to help me permanently disable Windows Defender on my Windows 11 machine. It’s absurd at this point.

nananana9 12 hours ago | parent [-]

It just knows it's a waste of effort.

shepherdjerred 14 hours ago | parent | prev [-]

Claude sometimes refuses to work with credentials because it’s insecure, e.g. when debugging auth in an app.

nradov 13 hours ago | parent | prev [-]

That is not a meaningful benchmark. They just made shit up. Regardless of whether any company cares or not, the whole concept of "AI safety" is so silly. I can't believe anyone takes it seriously.

mocamoca 11 hours ago | parent [-]

Would you mind explaining your point of view? Or pointing me to resources that make you think so?

nradov 5 hours ago | parent [-]

What can be asserted without evidence can also be dismissed without evidence. The benchmark creators haven't demonstrated that higher scores result in fewer humans dying or any meaningful outcome like that. If the LLM outputs some naughty words, that's not an actual safety problem.