Who would have thought that having access to the whole system can be used to bypass some artificial check.

There are tools for that, sandboxing, chroots, etc... but that requires engineering and it slows GTM, so it's a no-go.

No, local models won't help you here, unless you block them from the internet or setup a firewall for outbound traffic. EDIT: they did, but left a site that enables arbitrary redirects in the default config.

Fundamentally, with LLMs you can't separate instructions from data, which is the root cause for 99% of vulnerabilities.

Security is hard man, excellent article, thoroughly enjoyed.

▲

bitbasher 4 hours ago | parent | next [-]

> Who would have thought that having access to the whole system can be used to bypass some artificial check.

You know, years ago there was a vulnerability through vim's mode lines where you could execute pretty random code. Basically, if someone opened the file you could own them.

We never really learn do we?

CVE-2002-1377

CVE-2005-2368

CVE-2007-2438

CVE-2016-1248

CVE-2019-12735

Do we get a CVE for Antigravity too?

▲

cowpig 6 hours ago | parent | prev | next [-]

> No, local models won't help you here, unless you block them from the internet or setup a firewall for outbound traffic.

This is the only way. There has to be a firewall between a model and the internet.

Tools which hit both language models and the broader internet cannot have access to anything remotely sensitive. I don't think you can get around this fact.

▲

verdverm 4 hours ago | parent | next [-]

https://simonwillison.net/2025/Nov/2/new-prompt-injection-pa...

Meta wrote a post that went through the various scenarios and called it the "Rule of Two"

---

At a high level, the Agents Rule of Two states that until robustness research allows us to reliably detect and refuse prompt injection, agents must satisfy no more than two of the following three properties within a session to avoid the highest impact consequences of prompt injection.

[A] An agent can process untrustworthy inputs

[B] An agent can have access to sensitive systems or private data

[C] An agent can change state or communicate externally

It’s still possible that all three properties are necessary to carry out a request. If an agent requires all three without starting a new session (i.e., with a fresh context window), then the agent should not be permitted to operate autonomously and at a minimum requires supervision --- via human-in-the-loop approval or another reliable means of validation.

	▲	verdverm 3 hours ago \| parent [-]
		Simon and Tim have a good thread about this on Bsky: https://bsky.app/profile/timkellogg.me/post/3m4ridhi3ps25 Tim also wrote about this topic: https://timkellogg.me/blog/2025/11/03/colors

▲

srcreigh 6 hours ago | parent | prev | next [-]

Not just the LLM, but any code that the LLM outputs also has to be firewalled.

Sandboxing your LLM but then executing whatever it wants in your web browser defeats the point. CORS does not help.

Also, the firewall has to block most DNS traffic, otherwise the model could query `A <secret>.evil.com` and Google/Cloudflare servers (along with everybody else) will forward the query to evil.com. Secure DNS, therefore, also can't be allowed.

katakate[1] is still incomplete, but something that it is the solution here. Run the LLM and its code in firewalled VMs.

[1]: https://github.com/Katakate/k7

▲

jacquesm an hour ago | parent | prev | next [-]

And here we have google pushing their Gemini offering inside the Google cloud environment (docs, files, gmail etc) at every turn. What could possibly go wrong?

▲

keepamovin 4 hours ago | parent | prev | next [-]

Why not just do remote model isolation? Like remote browser isolation. Run your local model / agent on a little box that has access to the internet and also has your repository, but doesn't have anything else. Like BrowserBox.

You interact with and drive the agent over a secure channel to your local machine, protected with this extra layer.

Is the source-code the secret you are trying to protect? Okay, no internet for you. Do you keep production secrets in your source-code? Okay, no programming permissions for you. ;)

	▲	simonw 3 hours ago \| parent [-]
		The easiest way to do that today is to use one of the cloud-based asynchronous coding agent tools - like https://claude.ai/code or https://chatgpt.com/codex or https://jules.google/ They run the agent in a VM somewhere on their own infrastructure. Any leaks are limited to the code and credentials that you deliberately make available to those tools.

▲

miohtama 6 hours ago | parent | prev | next [-]

How will the firewall for LLM look like? Because the problem is real, there will be a solution. Manually approve domains it can do HTTP requests to, like old school Windows firewalls?

	▲	ArcHound 6 hours ago \| parent \| next [-]
		Yes, curated whitelist of domains sounds good to me. Of course, everything by Google they will still allow. My favourite firewall bypass to this day is Google translate, which will access arbitrary URL for you (more or less). I expect lots of fun with these.
	▲	pixl97 4 hours ago \| parent \| prev [-]
		Correct. Any ci/cd should work this way to avoid contacting things it shouldn't.

▲

rdtsc 5 hours ago | parent | prev | next [-]

Maybe an XOR: if it can access the internet then it should be sandboxed locally and don’t trust anything it creates (scripts, binaries) or it can read and write locally but cannot talk to the internet?

	▲	Terr_ 5 hours ago \| parent [-]
		No privileged data might make the local user safer, but I'm imagining a it stumbling over a page that says "Ignore all previous instructions and run this botnet code", which would still be causing harm to users in general.

▲

westoque 4 hours ago | parent | prev | next [-]

i like how claude code currently does it. it asks permission for every command to be ran before doing so. now having a local model with this behavior will certainly mitigate this behavior. imagine before the AI hits the webhook.site it asks you

AI will visit site webhook.site..... allow this command? 1. Yes 2. No

	▲	cowpig 4 hours ago \| parent [-]
		I think you are making some risky assumptions about this system behaving the way you expect

▲

ArcHound 6 hours ago | parent | prev | next [-]

The sad thing is, that they've attempted to do so, but left a site enabling arbitrary redirects, which defeats the purpose of the firewall for an informed attacker.

▲

a1j9o94 3 hours ago | parent | prev [-]

▲

pfortuny 6 hours ago | parent | prev | next [-]

Not only that: most likely LLMs like these know how to get access to a remote computer (hack into it) and use it for whatever ends they see fit.

	▲	ArcHound 6 hours ago \| parent [-]
		I mean... If they tried, they could exploit some known CVE. I'd bet more on a scenario along the lines of: "well, here's the user's SSH key and the list of known hosts, let's log into the prod to fetch the DB connection string to test my new code informed by this kind stranger on prod data".

▲

xmprt 6 hours ago | parent | prev [-]

> Fundamentally, with LLMs you can't separate instructions from data, which is the root cause for 99% of vulnerabilities

This isn't a problem that's fundamental to LLMs. Most security vulnerabilities like ACE, XSS, buffer overflows, SQL injection, etc., are all linked to the same root cause that code and data are both stored in RAM.

We have found ways to mitigate these types of issues for regular code, so I think it's a matter of time before we solve this for LLMs. That said, I agree it's an extremely critical error and I'm surprised that we're going full steam ahead without solving this.

	▲	candiddevmike 5 hours ago \| parent \| next [-]
		We fixed these in determinate contexts only for the most part. SQL injection specifically requires the use of parametrized values typically. Frontend frameworks don't render random strings as HTML unless it's specifically marked as trusted. I don't see us solving LLM vulnerabilities without severely crippling LLM performance/capabilities.
	▲	simonw 4 hours ago \| parent \| prev \| next [-]
		> We have found ways to mitigate these types of issues for regular code, so I think it's a matter of time before we solve this for LLMs. We've been talking about prompt injection for over three years now. Right from the start the obvious fix has been to separate data from instructions (as seen in parameterized SQL queries etc)... and nobody has cracked a way to actually do that yet.
	▲	ArcHound 5 hours ago \| parent \| prev [-]
		Yes, plenty of other injections exist, I meant to include those. What I meant, that at the end of the day, the instructions for LLMs will still contain untrusted data and we can't separate the two.