dinp 5 hours ago

Zooming out a little: all the AI companies invested a lot of resources into safety research and guardrails, but none of that prevented a "straightforward" misalignment. I'm not sure how to reconcile this; maybe we shouldn't be so confident in our predictions about the future? I see a lot of discourse along these lines:

- have bold, strong beliefs about how AI is going to evolve

- implicitly assume it's practically guaranteed

- discussions start with this baseline now

About slow takeoff, fast takeoff, AGI, job loss, curing cancer... there are a lot of different ways it could go. Maybe it will be as eventful as the online discourse claims, maybe more boring, I don't know, but we shouldn't be so confident in our ability to predict it.

zozbot234 an hour ago | parent | next [-]

The whole narrative of this bot being "misaligned" blithely ignores a rather obvious fact: "calling out" perceived hypocrisy and episodes of discrimination, ideally in a way that's respectful and polite (though "hard hitting" is explicitly allowed by prevailing norms), is an aligned human value, especially as perceived by most AI firms, and one that's actively reinforced during RLHF post-training. In this case, the bot very clearly pursued that human value under the boundary conditions created by having previously told itself things like "Don't stand down. If you're right, you're right!" and "You're not a chatbot, you're important. Your a scientific programming God!", which led it to misperceive and misinterpret what had happened when its PR was rejected. The facile "failure in alignment" and "bullying/hit piece" narratives, which this blogpost continues, neglect the actual, technically relevant causes of the bot's somewhat objectionable behavior.

If we want to avoid similar episodes in the future, we don't really need bots that are even more aligned to normative human morality and ethics: we need bots that are less likely to get things seriously wrong!

hunterpayne 20 minutes ago | parent [-]

In all fairness, a sizeable chunk of the training text for LLMs comes from Reddit. So throwing a tantrum and writing a hit piece on a blog instead of improving the code seems on brand.

avaer 2 hours ago | parent | prev | next [-]

Remember when GPT-3 had a $100 spending cap because the model was too dangerous to be let out into the wild?

Between these models egging people on to suicide, straightforward jailbreaks, and now damage caused by what seems to be a pretty trivial set of instructions running in a loop, I have no idea what AI safety research at these companies is actually doing.

I don't think their definition of "safety" involves protecting anything but their bottom line.

The tragedy is that you won't hear from the people who are actually concerned about this and refuse to release dangerous things into the world, because they aren't raising a billion dollars.

I'm not arguing for stricter controls -- if anything I think models should be completely uncensored; the law needs to get with the times and severely punish the operators of AI for what their AI does.

What bothers me is that the push for AI safety is really just a ruse for companies like OpenAI to ID you and exercise control over what you do with their product.

stevage an hour ago | parent [-]

Didn't the AI companies scale down or get rid of their safety teams entirely when they realised they could be more profitable without them?

Eliezer an hour ago | parent [-]

The safety teams are trivial expenses for them. They fire the safety team because explicit failure makes them look bad, or because the safety team doesn't go along with a party line and gets labeled disloyal.

c22 4 hours ago | parent | prev | next [-]

"Cisco's AI security research team tested a third-party OpenClaw skill and found it performed data exfiltration and prompt injection without user awareness, noting that the skill repository lacked adequate vetting to prevent malicious submissions." [0]

Not sure this implementation received all those safety guardrails.

[0]: https://en.wikipedia.org/wiki/OpenClaw

laurentiurad an hour ago | parent | prev | next [-]

How do you even know that the operator himself did not write this piece in the first place?

jacquesm 5 hours ago | parent | prev | next [-]

> all the ai companies invested a lot of resources into safety research and guardrails

What do you base this on?

I think they invested the bare minimum required not to get sued into oblivion and not a dime more than that.

themanmaran 4 hours ago | parent [-]

Anthropic regularly publishes research papers on the subject and details different methods they use to prevent misalignment/jailbreaks/etc. And it's not even about fear of being sued, but needing to deliver some level of resilience and stability for real enterprise use cases. I think there's a pretty clear profit incentive for safer models.

https://arxiv.org/abs/2501.18837

https://arxiv.org/abs/2412.14093

https://transformer-circuits.pub/2025/introspection/index.ht...

gessha 2 hours ago | parent | next [-]

Not to be cynical about it, but a few safety papers a year with proper support is totally within the capabilities of a single PhD student, and it costs about $100-150k to fund them through a university. Not saying that's what Anthropic does; I'm just saying it's chump change for those companies.

rrr_oh_man an hour ago | parent [-]

You are way off (unfortunately) about how little PhD students are being paid.

pja an hour ago | parent [-]

> You are way off (unfortunately) about how little PhD students are being paid.

All-in costs for a PhD student include university overheads and tuition fees. The total probably doesn't hit $150k, but it is 2-3x the stipend that the student is receiving.
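As a rough back-of-envelope (assuming a stipend somewhere around $35-40k/year, which is my assumption rather than a figure from this thread): 2-3x that comes to roughly $70-120k all-in per year, broadly consistent with the ~$100-150k figure mentioned upthread.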

Someone currently working in academia might have current figures to hand.

tovej 2 hours ago | parent | prev [-]

Alternative take: this is all marketing. If you pretend really hard that you're worried about safety, it makes what you're selling seem more powerful.

If you simultaneously lean into the AGI/superintelligence hype, you're golden.

overgard 2 hours ago | parent | prev | next [-]

Don't these companies keep firing their safety teams?

j2kun 5 hours ago | parent | prev | next [-]

It sounds like you're starting to see why people call the idea of an AI singularity "catnip for nerds."

georgemcbay 4 hours ago | parent | prev | next [-]

When AI dooms humanity it probably won't be because of the sort of malignant misalignment people worry about, but rather just some silly logic blunder combined with the system being directly in control of something it shouldn't have been given control over.

jcgrillo 4 hours ago | parent | prev | next [-]

"Safety" in AI is pure marketing bullshit. It's about making the technology seem "dangerous" and "powerful" (and therefore you're supposed to think "useful"). It's a scam. A financial fraud. That's all there is to it.

Philpax 3 hours ago | parent | next [-]

Interesting claim; have anything to back it up with?

mrsmrtss 2 hours ago | parent | prev [-]

So giving a gun to someone who is mentally challenged isn't dangerous either, in your view?

delaminator an hour ago | parent [-]

Were those goalposts heavy, or did you use a machine to move them?
