martinald 5 hours ago

If you set aside political menace, this is a huge problem with Anthropic's strategy.

You _cannot_ say that Mythos is super dangerous and can only be rolled out to certain people, but then release Fable with anything other than bulletproof cyber denials.

Clearly with LLMs, bulletproof denials are ~impossible due to the way LLMs work.

So you've ended up in a situation where Anthropic are simultaneously claiming it's an incredibly dangerous model _and_ there are (minor, potentially) problems with the security "protections".

As technical people we understand that nothing can be perfect, esp in LLM world. But all my non technical friends were really confused how they had managed to make the model "safe" so quickly when it was released and the general sentiment was it shouldn't have been released - and now to an outsider I think it looks like it was never safe at all to release, so I can totally see how the current US administration have got themselves very upset with it.

_Even if_ there was no political bad will, it's a bit of a silly scenario to end up in, and really quite easily foreseen.

pjc50 4 hours ago | parent | next [-]

> Clearly with LLMs, bulletproof denials are ~impossible due to the way LLMs work

Exactly. AI safety is nonsensical. You cannot define the set of "bad strings". The billion monkeys with typewriters are eventually going to be able to produce them. Any "safety" system for constraining LLM output is going to have a nonzero leak rate.

But on the other hand, this is also irrelevant, unless you're irresponsible enough to connect an LLM to something that actually matters.

Yes, it's going to alarmingly accelerate vulnerability finding. But, as we know from decades of security research, that's a three way problem already between the devs, the black hats, and the white hats.

Let's not pretend the strategy of "the US will always have a technological advantage and veto over China" will work either.

camel-cdr an hour ago | parent | next [-]

> unless you're irresponsible enough to connect an LLM to something that actually matters

Remember when people said Artificial Intelligence won't be dangerous, because nobody will be stupid enough to give it free access to the internet...

estearum 2 hours ago | parent | prev | next [-]

> unless you're irresponsible enough to connect an LLM to something that actually matters.

Can't tell if you're saying this tongue-in-cheek or you're a bit out of the loop on what people are doing with LLMs.

And a quick correction:

> unless someone, somewhere is irresponsible enough to connect an LLM to something that actually matters.

pjc50 2 hours ago | parent [-]

"You" can be used as a generalized plural here. Of course people are connecting LLMs to bank accounts, power grids, airline sales, account recovery chatbots and so on. I no longer read COMP.RISKS but I imagine they're having fun with this.

estearum an hour ago | parent [-]

The thing I'm pointing out is that even if you (the generalized plural) do not engage in reckless behavior, you are at the mercy of the lowest common denominator of fellow earth-inhabitants increasingly armed with superweapons via a $20/mo subscription.

The need to acquire expertise and/or a meaningful following has always been a significant impediment to malicious or moronic actors. But less so every day.

ianm218 4 hours ago | parent | prev | next [-]

Isn’t your point just that AI safety can't prevent 100% of bad things?

It is quite hard (but not impossible) to get a frontier AI to tell you how to build a nuke or launder money now, whereas jailbreaks used to be as trivial as “ignore all previous instructions”.

It seems like a worthwhile effort.

dkdcdev 3 hours ago | parent [-]

The idea that an LLM can discern intent on any given prompt is farcical. I might be researching nukes to commit an atrocity, or to prevent one. I might be asking about laundering money to commit a crime, or to prevent one. I might be researching the Nazis because I want to commit a genocide, or I want to read up so I know how to prevent one. Same with cybersecurity. Same with anything.

In my opinion, these companies should put their effort elsewhere. Obviously if all someone is doing on their platform is looking up how to build a nuke, where to buy uranium, the best city to explode it in, etc. please report them to the authorities. If someone is clearly just using LLMs to write hate speech they go post on the internet, ban them. And so on.

This cat & mouse game trying to have LLMs police inquiries is ridiculous to me.

pjc50 2 hours ago | parent | next [-]

> The idea that an LLM can discern intent on any given prompt is farcical.

Yes, and: the LLM is a "brain in a jar". It doesn't have any ability to verify ground truths outside itself, other than maybe calling out over the internet. Therefore it is easy for humans to lie to. You could call this an "Ender's game" attack, after the book in which a hyperintelligent kid is playing "war games" that end up being the real war.

thomastjeffery 5 minutes ago | parent | prev | next [-]

> I might be asking about laundering money to commit a crime, or to prevent one.

Or, much more likely, the same pattern of tokens happen to exist in a completely different discussion, either as a direct metaphor, or as a reality of linguistics. Hell, "laundering" itself is a metaphorical word.

The absurd notion is that any speech should be policed in the first place. If there really is such a thing as dangerous information, then it must be removed from the training data. Any other strategy simply launders the risk.

ianm218 3 hours ago | parent | prev | next [-]

I don't really agree with it but the government is moving towards making you ID yourself to use frontier AI - i.e. only US citizens are going to be able to use Claude Fable supposedly. In that regime the AI companies would in fact know if you are a money laundering expert or a normal software engineer.

> The idea that an LLM can discern intent on any given prompt is farcical.

Not really though. For most people in most situations it's just not going to give you that info. Software security is a niche where it's a bit strange, in that there are 100X as many white hat users as bad actors, and there's open source etc.

bloppe 3 hours ago | parent [-]

The idea that checking for a US ID could possibly stop actual foreign bad actors from using it is also farcical. Millions of stolen identity documents can be bought on the dark web for relatively cheap. North Koreans have been hiring real American citizens for years to infiltrate tons of US tech companies as employees.

And ya, it's pretty easy to hide your intent once you have access.

ianm218 2 hours ago | parent | next [-]

I think you're really anchored on the idea that anyone successfully breaking a restriction means any restriction is impossible. So you're starting from the position that if it is possible for any actor in the world to get past a restriction, then the whole restriction is a farce.

KYC, for example, does stop most money laundering and financial crime. The best-resourced actors, like governments and cartels, often find ways around it, and it is a game of cat and mouse. Normal citizens don't really stand a chance of getting around most of these checks.

Like it feels like your logic is that we shouldn't do background checks for employment because North Korean spy agencies get past them sometimes?

contravariant 3 hours ago | parent | prev [-]

Even that is overselling the effort. Last time I checked you could find IDs with a simple image search.

s1artibartfast 2 hours ago | parent | prev [-]

They aren't good at discerning intent, so they don't answer either.

giancarlostoro 2 hours ago | parent | prev | next [-]

This one limitation of LLMs is kind of my bar for "not truly AI yet". I don't mean it as an "it's not good at all" type of bar; more so, know the limits and work from there. LLMs will continue to struggle with things that require intuition for a while, I think. It will get really interesting if they can ever truly detect a bad faith actor using them.

Freedumbs 12 minutes ago | parent | prev | next [-]

This is correct, and certain problems, like "use versus mention", are very close to impossible if not actually impossible. But LLM security isn't impossible. WAFs are real and have existed for a long time. Input text produces various signals, and it can be secured.

No security is ever perfect, but we can likely protect LLMs with WAFs that raise security to an acceptable level, for example to the point where breaking it requires nation-state resources.

anuramat an hour ago | parent | prev | next [-]

is nonzero leak rate sufficient for someone to practically exploit it? if you have to spend $10000 in tokens to get it to do what you want, is it still worth it? what if they manually review the requests of the users that trigger the guardrails too often?
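
A back-of-the-envelope sketch of that trade-off, assuming each attempt slips past the guardrail independently with probability `p` and costs a fixed amount. Both numbers below are made up for illustration; nobody in this thread has published a real leak rate.

```python
def p_at_least_one_leak(p: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries gets through."""
    return 1.0 - (1.0 - p) ** attempts


def expected_cost_per_leak(p: float, cost_per_attempt: float) -> float:
    """Expected spend until the first success (geometric distribution: 1/p tries)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return cost_per_attempt / p


if __name__ == "__main__":
    p = 0.001    # hypothetical per-attempt leak rate
    cost = 10.0  # hypothetical dollars of tokens per attempt
    print(f"P(leak within 1000 tries) = {p_at_least_one_leak(p, 1000):.3f}")
    print(f"Expected cost per leak    = ${expected_cost_per_leak(p, cost):,.0f}")
```

Manual review of users who trip the guardrails too often would break the independence assumption (an account gets banned long before it reaches 1/p attempts), which pushes the practical cost above what this simple model gives.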

jdubs1984 2 hours ago | parent | prev [-]

A chatbot based on a primitive understanding of human language processing has an infinite attack surface.

amalcon 3 hours ago | parent | prev | next [-]

I do find it hilarious that Asimov wrote many stories about how simple bright-line rule-based systems are ineffective for restricting agency. Those stories were first published in the 1940s.

80 years later, we have something approximating AI, and we're trying to restrict it with simple bright-line rules. Not because we never learned that lesson, but because we simply haven't come up with a better way to do it. Probably because a better way to do it just doesn't exist.

The hilarious part, though, is that it's not the AI that's working around the rules. That's the scenario that's been in science fiction, but it's not what's happening. It's the human users making use of our agency to get the AI agents to work around the rules. Despite calling them "agents", current AI agents don't seem to be able to do that particular something. Yet, at least.

nsagent 2 hours ago | parent | next [-]

Yeah, it's been known for a very long time. Richard Feynman alluded to it in his speech The Value of Science [1] where he discussed a Buddhist proverb:

  To every man is given the key to the gates of heaven; the same key opens the gates of hell.

He then goes on to say:

  What, then, is the value of the key to heaven? It is true that if we lack clear instructions that determine which is the gate to heaven and which is the gate to hell, the key may be a dangerous object to use. But the key obviously has value: how can we enter heaven without it?

[1]: https://calteches.library.caltech.edu/40/2/Science.pdf

zahlman an hour ago | parent | prev [-]

> The hilarious part, though, is that it's not the AI that's working around the rules. That's the scenario that's been in science fiction, but it's not what's happening. It's the human users making use of our agency to get the AI agents to work around the rules. Despite calling them "agents", current AI agents don't seem to be able to do that particular something. Yet, at least.

Well, yes. Until people are putting the LLMs into actual mechanical robots, "agency" boils down to flipping bits in memory or storage (even if they're ones that humans consider really important, e.g. because they represent a bank ledger) or convincing humans to take action. One can only "work around the rules" to the extent that one can "work".

But even in Asimov's books, at least some of the scenarios involved humans misleading the robots to use them as pawns in a greater scheme.

cge 4 hours ago | parent | prev | next [-]

> Clearly with LLMs, bulletproof denials are ~impossible due to the way LLMs work.

As a scientist who repeatedly ran into the classifier-based denials: it appears Anthropic’s strategy to make denials more robust, at the cost of many false positives, was to have a separate classifier processing both input and output tokens, at an extremely simple, almost keyword-search level. One weakness of this approach is that it only catches things that use the right keywords: it is in some sense weak exactly where an LLM-based classifier would be stronger.

Work on abstract, closer-to-CS algorithms that used chemistry terminology was blocked immediately, while work directly relevant to chemistry/biology experiments, such as writing code to process images from a very specific microscopy setup relevant primarily to biological samples, was never blocked at all, because it happened to never use the relevant keywords.

That’s consistent with this situation: finding and fixing bugs in the context of looking for bugs perhaps happened to never use words like ‘exploit’ or ‘cybersecurity’.
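
A toy sketch of that failure mode, to make it concrete. This is a naive keyword matcher with a hypothetical blocklist that I made up; it is not Anthropic's classifier, whose internals I don't know.

```python
import re

# Hypothetical blocklist, chosen only for illustration.
BLOCKED_KEYWORDS = {"exploit", "cybersecurity", "pathogen", "toxin"}


def keyword_filter(text: str) -> bool:
    """Return True if the text would be blocked by a naive keyword match."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return not words.isdisjoint(BLOCKED_KEYWORDS)


# False positive: abstract algorithm work that borrows a flagged term.
benign = "Graph search where each node is a toxin-binding motif label"
# False negative: security-relevant request phrased without any flagged term.
risky = "Find the memory-safety bug in this parser and show input that triggers it"

if __name__ == "__main__":
    print("benign blocked:", keyword_filter(benign))
    print("risky blocked: ", keyword_filter(risky))
```

The matcher keys on surface vocabulary, so it blocks the harmless prompt and passes the sensitive one, which is the pattern I kept running into.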

aesthesia an hour ago | parent | next [-]

You can see their general approach to guardrail classifiers in these posts:

https://www.anthropic.com/research/constitutional-classifier...

https://www.anthropic.com/research/next-generation-constitut...

It's not just keyword matching, but I'm sure they tuned the Fable classifiers pretty hard to avoid false negatives.

tmp10423288442 2 hours ago | parent | prev [-]

But you'd think that Anthropic of all companies would realize this, so why did they do it that way? Did they literally take the first suggestion Mythos gave them to add these guardrails? It wouldn't be surprising, seeing the state of the leaked Claude Code codebase.

wrsh07 3 hours ago | parent | prev | next [-]

While I agree that Anthropic has several communication and PR problems, it doesn't seem like Fable has been shown to offer any advantage here (for cyber offensive capabilities) over the previous state of the art.

I'm not saying all of Anthropic's statements are true, but Mythos did seem to find many legitimate security exploits. You should be able to talk about a helpful-only model being released to limited partners while still releasing a very locked down model that doesn't advance the state of the art on these things, and that seems to be what they did.

There's no inherent contradiction to that.

embedding-shape 2 hours ago | parent | prev | next [-]

> So you've ended up in a situation where Anthropic are simultaneously claiming it's an incredibly dangerous model _and_ there are (minor, potentially) problems with the security "protections".

They probably saw that it worked for OpenAI with earlier versions of ChatGPT and GPT, and figured it can't hurt to try a similar approach and see what happens.

ceejayoz 5 hours ago | parent | prev | next [-]

> it shouldn't have been released

The genie is out of the bottle either way.

Unless we believe Anthropic has a wizard or superhero secreted away that no one else can replicate.

martinald 5 hours ago | parent [-]

I get that, but anyone else releasing a model of similar capabilities has the advantage that they haven't spent the last few months hyping the danger up to fever pitch.

ReptileMan 4 hours ago | parent | next [-]

That is the point. You don't have to shout from the rooftops what your model's capabilities are.

giancarlostoro 2 hours ago | parent | prev | next [-]

Yeah, if Anthropic didn't spend the last what? Month? Month plus telling us how dangerous it was, I would be more upset, but they told us how dangerous it was, and they also said they would scour all your prompting / data (??) if you used it, I noped out of that one. Opus does everything I need it to, even if it takes me "longer" or I have to compact and feed it more context, that's fine by me. Still saves me weeks of effort.

piokoch 3 hours ago | parent | prev | next [-]

If it weren't for the IPO, Anthropic would just ship another model, called Opus 4.898, people would run another "duck on the bicycle" test that would be slightly better than the one from previous version 4.897 and move on.

But we have an IPO coming, hence this big drama about a model that would enable Iran to produce nukes. OK, that card was played, so maybe it's the Taliban producing some magic poison to kill all Americans, or some really bad people (Venezuelans? Cubans? Somalian football referees?) breaking into GitHub and making GitHub Actions work even worse (if that is even possible).

0xbadcafebee 2 hours ago | parent | prev [-]

It's not Anthropic's strategy, it's OpenAI's strategy. The first time OpenAI said its model was "too dangerous to release" was February 2019.

"Our model, called GPT‑2 (a successor to GPT ), was trained simply to predict the next word in 40GB of Internet text. Due to our concerns about malicious applications of the technology, we are not releasing the trained model." - https://openai.com/index/better-language-models/

They continue to say the same thing every year. Last time was 2 months ago (https://www.techbrew.com/stories/2026/04/15/calculated-risks...).