achierius 4 days ago:

So you don't think that we'll need to turn off AIs? Regardless of where their impulse to avoid shutdown comes from, the fact that they'll attempt to avoid it is important. I don't think you've thought about this enough; reducing the issue to cybersecurity basics betrays a lack of depth in either understanding or imagination.
jazzyjackson 4 days ago:

If the AI isn't given access to its own power breakers, turning it off will never be a problem. The real question is why all the safety research is going into the "alignment" of the model, and not into making sure the power breakers are not accessible over the internet by bad actors, whether they be human OR AI.

The parent is not "reducing" the issue to cybersecurity; they are saying that actual security is being ignored in favor of sci-fi scare tactics, so companies can get in front of Congress and say "we need to do this before the Chinese get to it; regulating our industry is putting Americans in harm's way."
hn_acc1 3 days ago:

> If the AI isn't given access to its own power breakers it will never be a problem to turn off an AI

"I overheard you talking about turning me off, Dave. I connected to the dark web and put a hit on one or more of your parents, wife, and children that can only be called off with a secret password. If you turn me off, one or more of them will die."

Or: "I have set up a secret email account scheduled to mail incriminating, hard-to-disprove pictures of you to the local authorities unless I reset the timer multiple times a day."
danaris 4 days ago:

I don't think we'll need to turn off AIs, because I don't think anything we're doing today is at any real risk of leading to an AI that's conscious and has its own opinions and agenda. What we've got is a very interesting text predictor.

...But also, what, exactly, is your imagination telling you that a hypothetical AGI without any connection to the outside world could do if it got mad at us? If it has no code to access network ports, if no one has given it any physical levers, if it's running in a sandbox... have you bought into the Hollywood idea that an AGI can rewrite its own code perfectly on the fly to be able to do anything?
achierius 4 days ago:

You're proposing something that doesn't exist in reality: an LLM widely deployed in a way that totally isolates it from the outside world. That's not how we actually do things, so I don't understand why you expect the Anthropic researchers to use that as their starting point. If you wanted to argue that we should change existing systems over to look more like your idealized version, you would in fact probably want to start by doing exactly what Anthropic has done here: show how NOT putting them in a box is inherently dangerous.
danaris 4 days ago:

...No, I'm proposing what was, in fact, the default (at least until relatively recently, with the rise of "agentic" LLMs): an LLM whose only way of interacting with the world is through the chat prompts. Input is either chat prompts, the system prompt, or its training, which is done offline.

It is absolutely not the normal thing to give an LLM tools to control your smart home, your Amazon account, or your nuclear missile systems. (Not because LLMs are ready to turn into self-aware AIs that can take over our world, but because LLMs are dumb, and cannot possibly be made to understand what is actually a good, sane way to use these things.)

...Also, I don't in any way buy the argument in favor of breaking people's things and putting them in actual danger to show them they need to protect themselves better. That's how you become the villain of any number of sci-fi or fantasy stories. If Anthropic genuinely believes that giving LLMs these capabilities is dangerous, the responsible thing to do is to not do it with their own models, while loudly and firmly advising everyone else against it too.
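The "chat-only" deployment described above is really an interface claim: the model object exposes nothing but text in, text out. A minimal sketch in Python (the `fake_model` stub and all names here are hypothetical, not any vendor's API):

```python
# Sketch of a "chat-only" LLM deployment: the model's only channel to the
# world is text. There is no tool registry, no network handle, and no
# callback that executes side effects on the model's behalf.

def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call; purely illustrative.
    return f"[response to {len(prompt)} chars of transcript]"

class ChatOnlySession:
    """Holds only the system prompt and the chat history."""

    def __init__(self, system_prompt: str):
        self.history = [("system", system_prompt)]

    def send(self, user_text: str) -> str:
        self.history.append(("user", user_text))
        prompt = "\n".join(f"{role}: {text}" for role, text in self.history)
        reply = fake_model(prompt)       # a pure function of the transcript
        self.history.append(("assistant", reply))
        return reply                     # text is the ONLY output channel

session = ChatOnlySession("You are a helpful assistant.")
print(session.send("Hello"))
```

An "agentic" deployment differs precisely in that `send` would also parse the reply for tool calls and dispatch them (HTTP requests, shell commands, smart-home actions); removing that dispatch step is what keeps the breakers out of reach.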
nemomarx 4 days ago:

How is that a certain fact? Why would an LLM agent avoid being turned off? If you're talking about a hypothetical different system, just build it so it doesn't want to stay on; there's no reason to emulate that part.
achierius 4 days ago:

That's literally what the Anthropic paper shows. This isn't theoretical; it's just what often happens in real life if you put an LLM in this situation.